
niti's People

Contributors

wangmaolin

niti's Issues

Questions about the computation of TiLoss

I have read your paper and was impressed by it, and I'm glad to see that the source code has been released.

While reading the source code, I noticed that the computation of TiLoss is a bit different from its description in the paper, and I do not understand some of the design choices. I would appreciate your help with the following questions.

    def forward(self, out_val, out_exp, target):
        # err_out_exp=0
        # integer cross entropy loss
        s=out_val.type(torch.int64)
        if out_exp >-7:
            # if out_exp is big enough
            # change the base in log softmax from e to 2
            # to approx integer loss
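            # (47274 / 2**15 ~= log2(e), converting base e to base 2)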
            s=s*47274//(2**15)
            if out_exp>=0:
                s=s*2**out_exp
            else:
                s=s//(2**-out_exp)

            out_max, _ = torch.max(s,dim=1)
            offset = out_max-10
            s=s-offset.view(-1,1)
            s=torch.max(s,Int8Tensor(0).type(torch.int64))
            out_grad = 2**s-1
        else:
            # if out_exp is too small, s will be all 0
            # use another approximation: e^x = 1 + x + 0.5 x^2 + o(x^2)
            out_grad = 2**(1-2*out_exp.type(torch.int64)) + \
                s*2**(1-out_exp.type(torch.int64)) + s*s

        out_sum = out_grad.sum(1,dtype=torch.int64)

        out_grad = out_grad*(2**11)//out_sum.view(-1,1)
        out_grad[torch.arange(out_val.size(0)), target] -= out_grad.sum(1,dtype=torch.int64)
        self.out_grad = StoShiftInt32(out_grad.type(torch.int32),4)

        # return self.out_grad, err_out_exp
        return self.out_grad
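
For reference, here is a minimal floating-point sketch (my own, not from the repo) of the gradient I believe this integer code approximates, namely softmax(x) minus the one-hot target:

    import torch
    import torch.nn.functional as F

    def float_reference_grad(out_val, out_exp, target):
        # dequantize: real value = integer value * 2^exponent
        x = out_val.float() * (2.0 ** out_exp.float())
        # gradient of softmax cross-entropy w.r.t. the logits
        grad = F.softmax(x, dim=1)
        grad[torch.arange(x.size(0)), target] -= 1.0
        return grad  # the integer code seems to return roughly grad * 2**11 >> 4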

I have two questions:

  1. The implementation details:
  • Variable out_grad is supposed to represent $e_i$ in the paper, but in your source code, under the condition out_exp <= -7, out_grad is $2^{1-2S_a} \cdot e_i$ (see my working after these questions). What is $2^{1-2S_a}$ supposed to mean?

  • Why do you multiply out_grad by $2^{11}$ and shift it by 4 bits at the end?

  • The equations do not match the code:

    Under the condition out_exp <= -7, the code can be read as $\frac{\frac{2^{1-2S_a} e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    Under the condition out_exp > -7, the code can be read as $\frac{\frac{e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    These differ from Equation 2 and Equation 3 in the paper, respectively. Can you explain this?

  2. The paper details:
  • In Section 3.2, there is a description of $s_w$: "Recall that the value of $s_w$ for each layer is set during initialization and remain unchanged during training." How are the initialization values of $s_w$ computed? Do they rely on prior knowledge?
  • I am confused by a sentence in Section 3.4: "The error tensor e in (1) is computed using these effectively 12-bit values and eventually rounded stochastically back to 8 bits before being used in back propagation." Can you give a more specific description of this procedure? It seems related to the second point in question 1, but I don't understand the relation between them.
  • I do not understand the design of the computation of $\hat{x}$ when out_exp is greater than -7 in Section 3.4; the reasoning is not given in the paper. Moreover, the computation of p seems to involve a magic number, 10.
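
For reference, here is my own working of the low-exponent branch, assuming the real logit is $x_i = s_i \cdot 2^{S_a}$ with $S_a$ being out_exp. Substituting into $e^{x_i} = 1 + x_i + \frac{x_i^2}{2} + o(x_i^2)$ and multiplying through by $2^{1-2S_a}$ gives

$$2^{1-2S_a} e^{x_i} \approx 2^{1-2S_a} + s_i \cdot 2^{1-S_a} + s_i^2,$$

which matches out_grad = 2**(1-2*out_exp) + s*2**(1-out_exp) + s*s in the code. My guess is that the common factor $2^{1-2S_a}$ keeps every term an integer and cancels in the normalization by out_sum, but please correct me if this reading is wrong.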

Thanks again for your time and effort!

About INT4 Convolution

Hi @wangmaolin,

I find your work very interesting.

I saw your implementation of int4_conv in the cutlass extension and found that the weight type is unsigned:

cutlass::TensorRef<ElementInputB, LayoutInputB> weight_ref(reinterpret_cast<ElementInputB *>(weight.data<uint8_t>()), LayoutInputB::packed(filter_size));

Did you test this function in real CNNs? Since weights are normally zero-centered, shouldn't some conversion be applied to the weights before using this int4_conv function in order to get correct results?
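
For example, one common zero-point trick would be to shift signed int4 weights in [-8, 7] into the unsigned range [0, 15] and subtract a compensation term afterwards. A minimal float sketch of what I mean (my own code with hypothetical names, not from the repo):

    import torch
    import torch.nn.functional as F

    def signed_conv_via_unsigned(x, w_signed, stride=1, padding=0):
        # conv(x, w) = conv(x, w + 8) - 8 * conv(x, ones), so a kernel that
        # only accepts unsigned weights can still produce signed results
        w_unsigned = (w_signed + 8).float()  # now in [0, 15]
        ones = torch.ones_like(w_unsigned)
        acc = F.conv2d(x.float(), w_unsigned, stride=stride, padding=padding)
        comp = F.conv2d(x.float(), ones, stride=stride, padding=padding)
        return acc - 8 * comp  # equals F.conv2d(x, w_signed)

Is something like this happening elsewhere in the code, or did I misread the kernel?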

Thanks in advance.

Device-side error when running the newest implementation

Hi @wangmaolin,

I followed the README to train this code and encountered this error:

/niti/pytorch/cutlass-extension/include/cutlass/arch/memory_sm75.h:193: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor, MatrixCount = 4]: block: [24,0,0], thread: [52,0,0] Assertion `0` failed.
(the same assertion repeats for threads [53,0,0] through [63,0,0])
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1603729096996/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 191, in <module>
    main()
  File "main.py", line 79, in main
    train_loss, train_prec1, train_prec5= forward(
  File "main.py", line 152, in forward
    output, output_exp = model(inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/niti/ti_net.py", line 21, in forward
    self.out, self.out_exp = self.forward_layers(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/niti/ti_torch.py", line 395, in forward
    act_out, exp_out = act_calc(temp, self.act_in_exp + self.weight_exp)
  File "/niti/ti_torch.py", line 108, in act_calc
    int32_bitwidth = RangeEstimate(int32_acc)
  File "/niti/ti_torch.py", line 374, in RangeEstimate
    if range_ ==0:
RuntimeError: CUDA error: device-side assert triggered

Could you help identify the problem? Thanks a lot.

My GPU is a V100, with CUDA 11.2, driver version 460.91.03, and Docker version 20.10.12.
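
Incidentally, the assert fires in cutlass/arch/memory_sm75.h, and as far as I know the ldmatrix instruction that ldsm wraps requires compute capability 7.5 or higher, while V100 (Volta) is sm_70. Could this simply be an architecture mismatch? A quick check (my own sketch, not from the repo):

    import torch

    # ldmatrix (used by the sm_75 cutlass kernel paths) needs compute
    # capability >= 7.5; on older GPUs those paths hit the assert(0) fallback
    major, minor = torch.cuda.get_device_capability(0)
    if (major, minor) < (7, 5):
        print(f"sm_{major}{minor}: the sm_75 kernels will not run on this GPU")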

Question regarding cutlass etc.

Hey

I wanted to run the project without a Docker container.
For that I used conda with PyTorch and all the other required packages.
Now when it comes to using the Makefile, I get errors from NVIDIA CUTLASS and probably other software as well.
Do you know which toolkit version I need for CUTLASS in order to install it in the non-containerized setup, or have you tried that?
It would also be amazing if you could point to a good starting resource (a tutorial of sorts) for NVIDIA's cuBLAS and CUTLASS libraries.
Thanks for sharing!
Greetings :)

Got cutlass error: Error Internal at: 171

Hello, I am currently trying to reproduce your training setup on a Windows PC. The development environment is as follows:

OS: Windows Pro
CUDA Toolkit: 11.0
Graphics Card: Quadro P620 (Pascal Architecture - sm_61)
CUDNN Version: 8.2.0
Python: 3.8.3
Torch: 1.7.1+cu110

[screenshot: niti_setup]

The paths for Windows are adjusted in the setup.py script for all CUDA extensions. To match the graphics card architecture, I replaced using SmArch = cutlass::arch::Sm75 with using SmArch = cutlass::arch::Sm61 in the source file niti\pytorch\cutlass-extension\cutlassconv_kernel.cu (7 occurrences).

After a successful make, I can see the installed packages in this location: C:\Users\<Username>\AppData\Local\Programs\Python\Python38\Lib\site-packages

When running make, the architecture is confirmed by the following flag at the end of the compile command: -gencode=arch=compute_61,code=sm_61

[screenshot: nvcc_cmd]

When training, I came across the "Got cutlass error: Error Internal at: 171" error.
[screenshot: vgg_train]

Could you please check the issue and propose an appropriate solution?

The model does not converge

When I run the training script after installing the CUDA extension, the model does not converge at all.

Environment:

  • Ubuntu 16.04
  • Python 3.6
  • PyTorch 1.4
  • CUDA 10.1

Training Script

DEPTH=11
CUDA_VISIBLE_DEVICES=3 \
    python ./ti_main.py --model-type int \
    --dataset cifar10 \
    --model vgg \
    --depth $DEPTH \
    --data-dir /Workspace/ConvNets/data \
    --results-dir ./results --save cifar10-int8-vgg$DEPTH \
    --epochs 150 \
    --batch-size 128 \
    -j 8 \
    --log-interval 50 \
    --weight-decay \
    --init ./cifar10_vgg"$DEPTH"_rebalance_init.pth.tar

Log

 ./train_vgg_cifar10.sh 
saving to ./results/cifar10-int8-vgg11
run arguments: Namespace(batch_size=128, data_dir='/Workspace/ConvNets/data', dataset='cifar10', depth=11, download=False, epochs=150, evaluate=None, grad_hist=False, init='./cifar10_vgg11_rebalance_init.pth.tar', log_interval=50, model='vgg', model_type='int', results_dir='./results', resume='', save='cifar10-int8-vgg11', save_all=False, seed=-1, start_epoch=0, weight_decay=True, weight_frac=False, weight_hist=False, workers=8)
random seed: None
act rounding scheme: PstoShiftInt32
err rounding scheme: PstoShiftInt32
gradient rounding scheme: PstoShiftInt32
Update WITH WEIGHT DECAY
Create integer model
initial weights from checkpoint './cifar10_vgg11_rebalance_init.pth.tar' 
training regime: [{'epoch': 0, 'gb': 2}, {'epoch': 100, 'gb': -1}]
changing gradient bitwidth: 2
int [0][0/391] Time 1.586 (1.586) Data 0.49 loss 2.303 (2.303) e -14 @1 8.594 (8.594) @5 50.000 (50.000)
int [0][50/391] Time 0.081 (0.114) Data 0.00 loss 2.303 (7311.003) e -13 @1 13.281 (9.926) @5 56.250 (49.464)
int [0][100/391] Time 0.059 (0.092) Data 0.00 loss 2.303 (9973.338) e -16 @1 7.812 (9.800) @5 51.562 (49.830)
int [0][150/391] Time 0.085 (0.087) Data 0.00 loss 5856.362 (8409.386) e 9 @1 11.719 (9.711) @5 43.750 (49.524)
int [0][200/391] Time 0.061 (0.081) Data 0.00 loss 55.840 (7820.313) e 3 @1 10.938 (9.919) @5 51.562 (49.545)
int [0][250/391] Time 0.061 (0.078) Data 0.00 loss 2.303 (8043.853) e -15 @1 7.812 (9.988) @5 49.219 (49.608)
int [0][300/391] Time 0.059 (0.077) Data 0.00 loss 2.303 (7114.594) e -15 @1 10.938 (9.995) @5 49.219 (49.738)
int [0][350/391] Time 0.075 (0.075) Data 0.00 loss 2.303 (9238.136) e -11 @1 11.719 (9.927) @5 43.750 (49.846)
best_prec1: 10.100000 ./results/cifar10-int8-vgg11
int Epoch: 0 Train Prec@1 9.978 Train Prec@5 49.838 Valid Prec@1 10.100 Valid Prec@5 50.270 

Plot file saved at: /media/HD1/jmlu/ConvNets/niti/results/cifar10-int8-vgg11/results.html
int [1][0/391] Time 0.537 (0.537) Data 0.43 loss 565760.625 (565760.625) e 13 @1 9.375 (9.375) @5 50.781 (50.781)
int [1][50/391] Time 0.063 (0.076) Data 0.00 loss 2.303 (77541.838) e -17 @1 15.625 (10.463) @5 52.344 (50.689)
int [1][100/391] Time 0.059 (0.071) Data 0.00 loss 2.303 (39676.120) e -18 @1 11.719 (10.442) @5 53.125 (50.178)
int [1][150/391] Time 0.059 (0.068) Data 0.00 loss 2.303 (26682.765) e -17 @1 13.281 (10.487) @5 55.469 (50.098)
int [1][200/391] Time 0.058 (0.066) Data 0.00 loss 2.303 (20267.944) e -18 @1 12.500 (10.195) @5 52.344 (49.697)
int [1][250/391] Time 0.063 (0.066) Data 0.00 loss 2.303 (16575.759) e -16 @1 4.688 (10.047) @5 53.906 (49.894)
int [1][300/391] Time 0.074 (0.066) Data 0.00 loss 2.303 (14167.478) e -17 @1 6.250 (10.091) @5 50.000 (50.010)
int [1][350/391] Time 0.090 (0.067) Data 0.00 loss 2.303 (12343.841) e -18 @1 11.719 (10.034) @5 58.594 (50.100)
best_prec1: 10.100000 ./results/cifar10-int8-vgg11
int Epoch: 1 Train Prec@1 9.980 Train Prec@5 50.026 Valid Prec@1 9.780 Valid Prec@5 50.150 
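
Note that the loss keeps returning to 2.303, which is exactly the cross-entropy of a uniform prediction over CIFAR-10's 10 classes (and top-1 accuracy stays at about 10%), so the model is not learning at all; the intermittent huge loss values look like overflows to me. For reference:

    import math
    # cross-entropy of a uniform prediction over 10 classes
    print(math.log(10))  # 2.302585...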

Besides, there is no definition of the function grad_calc_32b in ti_torch.py.
