
niti's People

Contributors

wangmaolin

niti's Issues

Questions about the computation of TiLoss

I have read your paper and was impressed by it, and I'm glad to see that the source code has been released.

While reading the source code, I noticed that the computation of TiLoss is a bit different from its description in the paper, and I do not understand some of the design choices. I would appreciate your help with the following questions.

    def forward(self, out_val, out_exp, target):
        # err_out_exp=0
        # integer cross entropy loss
        s=out_val.type(torch.int64)
        if out_exp >-7:
            # if out_exp is big enough
            # change the base in log softmax from e to 2
            # to approx integer loss
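            # (47274 / 2**15 ~= log2(e), converting base e to base 2)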
            s=s*47274//(2**15)
            if out_exp>=0:
                s=s*2**out_exp
            else:
                s=s//(2**-out_exp)

            out_max, _ = torch.max(s,dim=1)
            offset = out_max-10
            s=s-offset.view(-1,1)
            s=torch.max(s,Int8Tensor(0).type(torch.int64))
            out_grad = 2**s-1
        else:
            # if out_exp is too small, s will be all 0
            # use another approximation: e^x = 1 + x + 0.5 x^2 + o(x^2)
            out_grad = 2**(1-2*out_exp.type(torch.int64)) + \
                s*2**(1-out_exp.type(torch.int64)) + s*s

        out_sum = out_grad.sum(1,dtype=torch.int64)

        out_grad = out_grad*(2**11)//out_sum.view(-1,1)
        out_grad[torch.arange(out_val.size(0)), target] -= out_grad.sum(1,dtype=torch.int64)
        self.out_grad = StoShiftInt32(out_grad.type(torch.int32),4)

        # return self.out_grad, err_out_exp
        return self.out_grad
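
For reference, here is a minimal floating-point sketch (my own, not from the repo) of the gradient I believe this integer code approximates, namely softmax(x) minus the one-hot target:

    import torch
    import torch.nn.functional as F

    def float_reference_grad(out_val, out_exp, target):
        # dequantize: real value = integer value * 2^exponent
        x = out_val.float() * (2.0 ** out_exp.float())
        # gradient of softmax cross-entropy w.r.t. the logits
        grad = F.softmax(x, dim=1)
        grad[torch.arange(x.size(0)), target] -= 1.0
        return grad  # the integer code seems to return roughly grad * 2**11 >> 4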

I have two questions:

  1. The implementation details:
  • Variable out_grad is supposed to represent $e_i$ in the paper, but in your source code, under the condition out_exp <= -7, out_grad is $2^{1-2S_a} \cdot e_i$ (see my working after these questions). What is $2^{1-2S_a}$ supposed to mean?

  • Why do you multiply out_grad by $2^{11}$ and shift it by 4 bits at the end?

  • The equations do not match the code:

    Under the condition out_exp <= -7, the code can be read as $\frac{\frac{2^{1-2S_a} e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    Under the condition out_exp > -7, the code can be read as $\frac{\frac{e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    These differ from Equation 2 and Equation 3 in the paper, respectively. Can you explain this?

  2. The paper details:
  • In Section 3.2, there is a description of $s_w$: "Recall that the value of $s_w$ for each layer is set during initialization and remain unchanged during training." How are the initialization values of $s_w$ computed? Do they rely on prior knowledge?
  • I am confused by a sentence in Section 3.4: "The error tensor e in (1) is computed using these effectively 12-bit values and eventually rounded stochastically back to 8 bits before being used in back propagation." Can you give a more specific description of this procedure? It seems related to the second point in question 1, but I don't understand the relation between them.
  • I do not understand the design of the computation of $\hat{x}$ when out_exp is greater than -7 in Section 3.4; the reasoning is not given in the paper. Moreover, the computation of p seems to involve a magic number, 10.
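
For reference, here is my own working of the low-exponent branch, assuming the real logit is $x_i = s_i \cdot 2^{S_a}$ with $S_a$ being out_exp. Substituting into $e^{x_i} = 1 + x_i + \frac{x_i^2}{2} + o(x_i^2)$ and multiplying through by $2^{1-2S_a}$ gives

$$2^{1-2S_a} e^{x_i} \approx 2^{1-2S_a} + s_i \cdot 2^{1-S_a} + s_i^2,$$

which matches out_grad = 2**(1-2*out_exp) + s*2**(1-out_exp) + s*s in the code. My guess is that the common factor $2^{1-2S_a}$ keeps every term an integer and cancels in the normalization by out_sum, but please correct me if this reading is wrong.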

Thanks again for your time and effort!

About INT4 Convolution

Hi @wangmaolin,

I find your work very interesting.

I saw your implementation of int4_conv in the cutlass extension and found that the weight type is unsigned:

cutlass::TensorRef<ElementInputB, LayoutInputB> weight_ref(reinterpret_cast<ElementInputB *>(weight.data<uint8_t>()), LayoutInputB::packed(filter_size));

Did you test this function in real CNNs? Since weights are normally zero-centered, shouldn't some conversion be applied to the weights before using this int4_conv function in order to get correct results?
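
For example, one common zero-point trick would be to shift signed int4 weights in [-8, 7] into the unsigned range [0, 15] and subtract a compensation term afterwards. A minimal float sketch of what I mean (my own code with hypothetical names, not from the repo):

    import torch
    import torch.nn.functional as F

    def signed_conv_via_unsigned(x, w_signed, stride=1, padding=0):
        # conv(x, w) = conv(x, w + 8) - 8 * conv(x, ones), so a kernel that
        # only accepts unsigned weights can still produce signed results
        w_unsigned = (w_signed + 8).float()  # now in [0, 15]
        ones = torch.ones_like(w_unsigned)
        acc = F.conv2d(x.float(), w_unsigned, stride=stride, padding=padding)
        comp = F.conv2d(x.float(), ones, stride=stride, padding=padding)
        return acc - 8 * comp  # equals F.conv2d(x, w_signed)

Is something like this happening elsewhere in the code, or did I misread the kernel?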

Thanks in advance.

Device-side error when running the newest implementation

Hi @wangmaolin,

I followed the README to train this code and encountered this error:

/niti/pytorch/cutlass-extension/include/cutlass/arch/memory_sm75.h:193: void cutlass::arch::ldsm(cutlass::Array<unsigned int, MatrixCount, true> &, const void *) [with Layout = cutlass::layout::RowMajor, MatrixCount = 4]: block: [24,0,0], thread: [52,0,0] Assertion `0` failed.
(the same assertion repeats for threads [53,0,0] through [63,0,0])
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1603729096996/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 191, in <module>
    main()
  File "main.py", line 79, in main
    train_loss, train_prec1, train_prec5= forward(
  File "main.py", line 152, in forward
    output, output_exp = model(inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/niti/ti_net.py", line 21, in forward
    self.out, self.out_exp = self.forward_layers(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/niti/ti_torch.py", line 395, in forward
    act_out, exp_out = act_calc(temp, self.act_in_exp + self.weight_exp)
  File "/niti/ti_torch.py", line 108, in act_calc
    int32_bitwidth = RangeEstimate(int32_acc)
  File "/niti/ti_torch.py", line 374, in RangeEstimate
    if range_ ==0:
RuntimeError: CUDA error: device-side assert triggered

Could you help identify the problem? Thanks a lot.

My GPU is a V100, with CUDA 11.2, driver version 460.91.03, and Docker version 20.10.12.
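
Incidentally, the assert fires in cutlass/arch/memory_sm75.h, and as far as I know the ldmatrix instruction that ldsm wraps requires compute capability 7.5 or higher, while V100 (Volta) is sm_70. Could this simply be an architecture mismatch? A quick check (my own sketch, not from the repo):

    import torch

    # ldmatrix (used by the sm_75 cutlass kernel paths) needs compute
    # capability >= 7.5; on older GPUs those paths hit the assert(0) fallback
    major, minor = torch.cuda.get_device_capability(0)
    if (major, minor) < (7, 5):
        print(f"sm_{major}{minor}: the sm_75 kernels will not run on this GPU")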

Question regarding cutlass etc.

Hey

I wanted to run the project without a Docker container.
For that I used conda with PyTorch and all the other required packages.
Now when it comes to using the Makefile, I get errors from NVIDIA CUTLASS and probably other software as well.
Do you know which toolkit version I need for CUTLASS in order to install it in the non-containerized setup, or have you tried that?
It would also be amazing if you could point to a good starting resource (a tutorial of sorts) for NVIDIA's cuBLAS and CUTLASS libraries.
Thanks for sharing!
Greetings :)

Got cutlass error: Error Internal at: 171

Hello, I am currently trying to reproduce your training setup on a Windows PC. The development environment is as follows:

OS: Windows Pro
CUDA Toolkit: 11.0
Graphics Card: Quadro P620 (Pascal Architecture - sm_61)
CUDNN Version: 8.2.0
Python: 3.8.3
Torch: 1.7.1+cu110

[screenshot: niti_setup]

The paths for Windows are adjusted in the setup.py script for all CUDA extensions. To match the graphics card architecture, I replaced using SmArch = cutlass::arch::Sm75 with using SmArch = cutlass::arch::Sm61 in the source file niti\pytorch\cutlass-extension\cutlassconv_kernel.cu (7 occurrences).

After a successful make, I can see the installed packages in this location: C:\Users\<Username>\AppData\Local\Programs\Python\Python38\Lib\site-packages

When running make, the architecture is confirmed by the following flag at the end of the compile command: -gencode=arch=compute_61,code=sm_61

[screenshot: nvcc_cmd]

When training, I came across the "Got cutlass error: Error Internal at: 171" error.
[screenshot: vgg_train]

Could you please check the issue and propose an appropriate solution?

The model does not converge

When I run the training script after installing the CUDA extension, the model does not converge at all.

Environment:

  • Ubuntu 16.04
  • Python 3.6
  • PyTorch 1.4
  • CUDA 10.1

Training Script

DEPTH=11
CUDA_VISIBLE_DEVICES=3 \
    python ./ti_main.py --model-type int \
    --dataset cifar10 \
    --model vgg \
    --depth $DEPTH \
    --data-dir /Workspace/ConvNets/data \
    --results-dir ./results --save cifar10-int8-vgg$DEPTH \
    --epochs 150 \
    --batch-size 128 \
    -j 8 \
    --log-interval 50 \
    --weight-decay \
    --init ./cifar10_vgg"$DEPTH"_rebalance_init.pth.tar

Log

 ./train_vgg_cifar10.sh 
saving to ./results/cifar10-int8-vgg11
run arguments: Namespace(batch_size=128, data_dir='/Workspace/ConvNets/data', dataset='cifar10', depth=11, download=False, epochs=150, evaluate=None, grad_hist=False, init='./cifar10_vgg11_rebalance_init.pth.tar', log_interval=50, model='vgg', model_type='int', results_dir='./results', resume='', save='cifar10-int8-vgg11', save_all=False, seed=-1, start_epoch=0, weight_decay=True, weight_frac=False, weight_hist=False, workers=8)
random seed: None
act rounding scheme: PstoShiftInt32
err rounding scheme: PstoShiftInt32
gradient rounding scheme: PstoShiftInt32
Update WITH WEIGHT DECAY
Create integer model
initial weights from checkpoint './cifar10_vgg11_rebalance_init.pth.tar' 
training regime: [{'epoch': 0, 'gb': 2}, {'epoch': 100, 'gb': -1}]
changing gradient bitwidth: 2
int [0][0/391] Time 1.586 (1.586) Data 0.49 loss 2.303 (2.303) e -14 @1 8.594 (8.594) @5 50.000 (50.000)
int [0][50/391] Time 0.081 (0.114) Data 0.00 loss 2.303 (7311.003) e -13 @1 13.281 (9.926) @5 56.250 (49.464)
int [0][100/391] Time 0.059 (0.092) Data 0.00 loss 2.303 (9973.338) e -16 @1 7.812 (9.800) @5 51.562 (49.830)
int [0][150/391] Time 0.085 (0.087) Data 0.00 loss 5856.362 (8409.386) e 9 @1 11.719 (9.711) @5 43.750 (49.524)
int [0][200/391] Time 0.061 (0.081) Data 0.00 loss 55.840 (7820.313) e 3 @1 10.938 (9.919) @5 51.562 (49.545)
int [0][250/391] Time 0.061 (0.078) Data 0.00 loss 2.303 (8043.853) e -15 @1 7.812 (9.988) @5 49.219 (49.608)
int [0][300/391] Time 0.059 (0.077) Data 0.00 loss 2.303 (7114.594) e -15 @1 10.938 (9.995) @5 49.219 (49.738)
int [0][350/391] Time 0.075 (0.075) Data 0.00 loss 2.303 (9238.136) e -11 @1 11.719 (9.927) @5 43.750 (49.846)
best_prec1: 10.100000 ./results/cifar10-int8-vgg11
int Epoch: 0 Train Prec@1 9.978 Train Prec@5 49.838 Valid Prec@1 10.100 Valid Prec@5 50.270 

Plot file saved at: /media/HD1/jmlu/ConvNets/niti/results/cifar10-int8-vgg11/results.html
int [1][0/391] Time 0.537 (0.537) Data 0.43 loss 565760.625 (565760.625) e 13 @1 9.375 (9.375) @5 50.781 (50.781)
int [1][50/391] Time 0.063 (0.076) Data 0.00 loss 2.303 (77541.838) e -17 @1 15.625 (10.463) @5 52.344 (50.689)
int [1][100/391] Time 0.059 (0.071) Data 0.00 loss 2.303 (39676.120) e -18 @1 11.719 (10.442) @5 53.125 (50.178)
int [1][150/391] Time 0.059 (0.068) Data 0.00 loss 2.303 (26682.765) e -17 @1 13.281 (10.487) @5 55.469 (50.098)
int [1][200/391] Time 0.058 (0.066) Data 0.00 loss 2.303 (20267.944) e -18 @1 12.500 (10.195) @5 52.344 (49.697)
int [1][250/391] Time 0.063 (0.066) Data 0.00 loss 2.303 (16575.759) e -16 @1 4.688 (10.047) @5 53.906 (49.894)
int [1][300/391] Time 0.074 (0.066) Data 0.00 loss 2.303 (14167.478) e -17 @1 6.250 (10.091) @5 50.000 (50.010)
int [1][350/391] Time 0.090 (0.067) Data 0.00 loss 2.303 (12343.841) e -18 @1 11.719 (10.034) @5 58.594 (50.100)
best_prec1: 10.100000 ./results/cifar10-int8-vgg11
int Epoch: 1 Train Prec@1 9.980 Train Prec@5 50.026 Valid Prec@1 9.780 Valid Prec@5 50.150 
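
Note that the loss keeps returning to 2.303, which is exactly the cross-entropy of a uniform prediction over CIFAR-10's 10 classes (and top-1 accuracy stays at about 10%), so the model is not learning at all; the intermittent huge loss values look like overflows to me. For reference:

    import math
    # cross-entropy of a uniform prediction over 10 classes
    print(math.log(10))  # 2.302585...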

Besides, there is no definition of the function grad_calc_32b in ti_torch.py.
