kssteven418 / i-bert
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
Home Page: https://arxiv.org/abs/2101.01321
License: MIT License
I see that softmax and the polynomial approximation use floor, but other places use round. What is the consideration behind this?
You make reference in the paper and on Hugging Face to a TensorRT deployment, but I can't find the code.
Do you plan to share it too?
As far as I know, the NVIDIA repo only has examples for their own models (all BERT-based); it's a bit hard to try it on our own without an example.
I am trying to use I-BERT through transformers. Can I use other RoBERTa models and quantize them?
If so, what parameters can I change, and how do I convert them?
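Not an official answer, just a sketch of loading an I-BERT model through transformers, assuming the released kssteven/ibert-roberta-base checkpoint; other RoBERTa-style checkpoints would need weights that match the IBert* architecture to load cleanly:

import torch
from transformers import AutoTokenizer, IBertForSequenceClassification

# Sketch: load the released I-BERT base checkpoint; quant_mode is a config flag,
# so it can be overridden at load time (False for full-precision finetuning first,
# True later for integer-only finetuning).
tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")
model = IBertForSequenceClassification.from_pretrained(
    "kssteven/ibert-roberta-base",
    quant_mode=False,
    num_labels=2,
)

inputs = tokenizer("a quick smoke test", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 2)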
Dear Authors,
Thanks for sharing this valuable code.
I'm trying to use your code for vision transformer quantization, and I have some questions about the scaling factor.
If I want to swap in some layers (e.g. GELU -> IntGELU), I have to set the scaling factor for the input arguments.
class IntGELU(nn.Module):
    def forward(self, x, scaling_factor=None):
        if not self.quant_mode:
            return self.activation_fn(x), None
        # my attempt at obtaining the input and its scaling factor:
        x, scaling_factor = QuantAct(32, quant_mode=self.quant_mode)
Is this right? Could you please give me some advice?
Thanks in advance.
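For what it's worth, here is a minimal sketch of how a quantized input and its scaling factor are usually produced before calling IntGELU, assuming the QuantAct / IntGELU interfaces from the Hugging Face ibert quant_modules (the shapes and bit widths below are only illustrative):

import torch
from transformers.models.ibert.quant_modules import IntGELU, QuantAct

# QuantAct is instantiated once and then *called* on a tensor; its forward pass
# returns the (simulated) quantized activation together with its scaling factor.
quant_input = QuantAct(activation_bit=8, quant_mode=True)
int_gelu = IntGELU(quant_mode=True)

x = torch.randn(1, 197, 768)             # e.g. ViT patch-token activations
x_q, x_scaling_factor = quant_input(x)   # quantize the input, get its scale
y, y_scaling_factor = int_gelu(x_q, scaling_factor=x_scaling_factor)
print(y.shape, y_scaling_factor)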
Hi, thanks for the amazing contribution!
I'm trying to use IBert from Huggingface/transformers (4.4.2) in my own training pipeline, where I'm fine-tuning in quant mode with mixed precision (using PyTorch's cuda.amp module). This results in overflows in the QuantLinear layers, which causes subsequent training to break due to NaNs. I'm considering artificially clamping the weights to a smaller range to avoid this, or using a lower bit precision (from 8 down to, say, 4) while fine-tuning.
I'm wondering if you have tried this or have any suggestions about my approaches that could help me train effectively.
Thanks.
with autocast(enabled=grad_scaler.is_enabled()):
    # TRAINING CODE...
I'm unable to post any more code (proprietary stuff, sorry!), but I can provide some specifics if you need them.
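A workaround I would try (my own sketch, not an author recommendation): keep the quantized forward pass outside autocast so the simulated integer arithmetic in QuantLinear stays in fp32 instead of overflowing in fp16, and optionally clamp weights before quantization-aware fine-tuning. The model and inputs below stand in for the proprietary training-loop variables above:

import torch
from torch.cuda.amp import autocast
from transformers import IBertForSequenceClassification

model = IBertForSequenceClassification.from_pretrained(
    "kssteven/ibert-roberta-base", quant_mode=True
).cuda()
input_ids = torch.randint(0, 50265, (2, 32)).cuda()
attention_mask = torch.ones_like(input_ids)

# Run the quantized model with autocast disabled so nothing is cast down to fp16.
with autocast(enabled=False):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# Alternative idea from the question above: clamp weights into a tighter range
# before quantization-aware fine-tuning (the clamp range is purely illustrative).
with torch.no_grad():
    for name, param in model.named_parameters():
        if name.endswith(".weight"):
            param.clamp_(-1.0, 1.0)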
Dear Editor,
My first step is to do full-precision finetuning, and I set quant_mode: true. Then I carry out the integer-only finetuning. When I test the integer-only finetuned model on MRPC, the result is very bad. Could you give some guidance? (When I test an MRPC sample, the result is tensor([[0.5003, 0.4997]], grad_fn=).) Here is my config.json:
{
"_name_or_path": "/home/rram/storage/cailei/nlp_project/fine_tune/standard_ibert_weights/ibert-roberta-base",
"architectures": [
"IBertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"finetuning_task": "mrpc",
"force_dequant": "none",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "not_equivalent",
"1": "equivalent"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"equivalent": 1,
"not_equivalent": 0
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "ibert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"quant_mode": true,
"tokenizer_class": "RobertaTokenizer",
"torch_dtype": "int8",
"transformers_version": "4.12.0.dev0",
"type_vocab_size": 1,
"vocab_size": 50265
}
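For reference, here is a sketch of how quant_mode is typically toggled between the two finetuning stages via the Hugging Face config. This is my reading of the intended workflow rather than an official snippet, and the stage-2 checkpoint path is a placeholder:

from transformers import IBertConfig, IBertForSequenceClassification

# Stage 1 (assumption: full-precision finetuning is done with quant_mode disabled).
config = IBertConfig.from_pretrained(
    "kssteven/ibert-roberta-base", quant_mode=False, num_labels=2
)
model = IBertForSequenceClassification.from_pretrained(
    "kssteven/ibert-roberta-base", config=config
)
# ... finetune on MRPC in full precision and save the checkpoint ...

# Stage 2: reload that finetuned checkpoint with quant_mode enabled for
# integer-only (quantization-aware) finetuning.
config = IBertConfig.from_pretrained("mrpc-fp32-checkpoint", quant_mode=True)
model = IBertForSequenceClassification.from_pretrained(
    "mrpc-fp32-checkpoint", config=config
)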
When I try the quantization step in this code, it cannot continue to run; an error occurs: CUDA out of memory. What can I do to solve this issue in the quantization step? Are there any parameters I can change? Thank you so much!
ValueError is raised as a result of trying to cast NaN values to integers.
See PR: #30.
Excellent work!
Can the CPU be used for inference?
And how much faster is it than the baseline?
Hi, thank you for releasing this project!
I was wondering if you happen to have the pre-trained weights for the models finetuned on the different downstream tasks (QQP, MNLI.. etc). i.e. the initialisation weights for the quantisation-aware fine-tuning stage. I only say this as it would save me a lot of time and compute, and may be helpful to others too.
Thanks
Fixed x_scaling_factor in transformer_sentence_encoder.py: scaling_factor was changed to x_scaling_factor.
Or do we use the same scaling factor for all transformer encoder layers?
Regarding IntSoftmax: because we have a QuantAct here, if we don't implement the fix function, we will skip fixing the QuantAct in IntSoftmax from here.
I have recently been trying to implement I-BERT on TVM, and I found that I should add FixedPointMul and SymmetricQuantFunction operators to TVM. Do you have any existing implementation? Thanks!
As far as I understand, ibert-roberta-base adds some specific parameters to config.json, while the weights are the same as roberta-base.
Hi,
I've noticed that the QuantAct layers preceding IntLayerNorm in the IBertSelfOutput and IBertOutput modules specify a 22-bit activation width, while the QuantAct layer preceding IntLayerNorm in IBertEmbeddings specifies a 16-bit activation.
I couldn't find any mention of these bit-width choices in the paper. Could you please explain why these choices were made?
Thank you!
I'm using the Hugging Face implementation. Even though I set quant_mode=True, I see that the output of IBert is float32. Am I using the model wrong, or is this expected?
self.bert = AutoModel.from_pretrained(
    base_model, quant_mode=quant_mode, add_pooling_layer=False
)
...

def forward(
    self,
    input_ids: Tensor,
    attention_mask: Tensor,
    k: int = None,
    return_layers: List[int] = None,
    return_orig: bool = False,
):
    bert_out = self.bert(
        input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
        return_dict=True,
    )
    # the output dtype is float32!
    print(bert_out.hidden_states[0])
Hi,
It looks like, at least in the HF code, you are storing both the float32 AND the int weights, which would increase the memory footprint. Don't you want to either load one or the other, or at least have an option to quantize and send to CUDA, where you would clear the float32 (or int) version before sending to CUDA, thus lowering the memory footprint? Alternatively, you could overload 'to' (or 'cuda', or whatever method is used to move to CUDA) so that it only moves over the right parameters.
Thanks
Hello,
Great paper, kudos!
After reading it, I was wondering: is it possible to apply these quantization methods to an already-trained model from huggingface transformers, or do we need to re-train the model with I-BERT?
In the Hugging Face config, I set quant_mode = True.
The weight_integer buffer remains 0, and the result is wrong.
Moreover, the inference latency in integer mode is 20 times that of float mode.
Can you please explain the reason to me?
Compared with fp32 finetuning, inference on the dev data during training takes about 10x longer when doing integer-only finetuning.
How can I do INT8 inference and achieve the speedup described in the paper?
I want to test the accuracy and time consumption of I-BERT in INT8 inference, so I installed transformers and tried to quantize the RoBERTa-base model to generate the weights. I have set quant_mode to true and torch_dtype to int8. However, the time consumption of I-BERT in INT8 inference is similar to that of the RoBERTa-base model on my 1080 Ti. Is there any problem with my config.json or my device? Here is my config.json:
{
"_name_or_path": "./outputs/checkpoint-1150/",
"architectures": [
"IBertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"finetuning_task": "mrpc",
"force_dequant": "none",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": 0,
"1": 1
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"0": 0,
"1": 1
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "ibert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"quant_mode": true,
"tokenizer_class": "RobertaTokenizer",
"torch_dtype": "int8",
"transformers_version": "4.11.0.dev0",
"type_vocab_size": 1,
"vocab_size": 50265
}
Thanks for the great work.
It uses 32-bit integers for the activations and softmax.
However, the self-attention result cannot exceed 26 bits (8 bits x 8 bits x 10 bits for 768 channels).
I want to try 16-bit precision (quantized with 16 bits, together with the softmax and GELU algorithms).
Is there any problem with 16 bits?
If not, I would like to know how to implement this.
I've been trying to add the I-BERT quantization modules to DistilBERT and ran into this issue:
scaling_factor = torch.tensor([1 / 2 ** self.output_bit], device=exp_int.device)
Please let me know your thoughts on this. Thanks!
Nice implementation!
I was wondering if we can experiment with different quantization settings (e.g. 6bit).
Thanks!
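In case it helps, here is a rough sketch of experimenting with a different bit width, assuming the QuantAct / QuantLinear constructors from the Hugging Face ibert quant_modules expose the bit widths as arguments; whether such low-bit settings remain accurate is exactly the open question:

import torch
from transformers.models.ibert.quant_modules import QuantAct, QuantLinear

# Sketch only: 6-bit activations and weights on a single linear block.
quant_act = QuantAct(activation_bit=6, quant_mode=True)
quant_linear = QuantLinear(in_features=768, out_features=768,
                           weight_bit=6, quant_mode=True)
quant_linear.weight.data.normal_(0, 0.02)  # give the demo layer non-zero weights

x = torch.randn(2, 16, 768)
x_q, x_sf = quant_act(x)                                   # quantize activations
y, y_sf = quant_linear(x_q, prev_act_scaling_factor=x_sf)  # 6-bit weight matmul
print(y.shape)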
Is there any method to approximate GELU using a 3rd-order polynomial? How do you create the polynomial coefficients?
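i-GELU in the paper uses a second-order polynomial; if you want a cubic, one generic way (my own sketch, not the authors' method) is to least-squares fit erf(x / sqrt(2)) on a bounded interval and plug the fit into GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))):

import math
import numpy as np

# Fit a 3rd-order polynomial to erf(x / sqrt(2)) on [-3, 3]; outside that range
# erf saturates, so the polynomial output is clipped to [-1, 1].
xs = np.linspace(-3.0, 3.0, 10001)
erf_vals = np.array([math.erf(v / math.sqrt(2.0)) for v in xs])
coeffs = np.polyfit(xs, erf_vals, deg=3)

def gelu_poly(x):
    l = np.clip(np.polyval(coeffs, x), -1.0, 1.0)  # polynomial stand-in for erf
    return 0.5 * x * (1.0 + l)

print(coeffs)                                  # even-order terms are ~0 (erf is odd)
print(gelu_poly(np.array([-2.0, 0.0, 2.0])))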
Thanks for sharing the code. This is really a great paper with clear and impressive ideas. I'm sorry that I might have some misunderstanding and am a little confused about some details. In the paper, the authors claim the model is implemented with integer-only arithmetic, yet in Algorithms 1-4 as well as in Equations 1 and 2, the scaling factor is not an integer but seems to be a floating-point number. In other words, it needs to do floating-point scaling, which is also demonstrated on this line. I wonder if I misunderstood this, and how this is implemented with integer arithmetic. Thanks a lot!
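My understanding, offered as a sketch rather than an authoritative answer: at deployment the floating-point rescaling ratio can be replaced by a dyadic number b / 2^c, so the requantization itself needs only an integer multiply and a bit shift:

# Sketch (my reading, not the authors' exact implementation): approximate a
# floating-point rescaling factor with b / 2**c so only integer ops are needed.
def dyadic_rescale(x_int: int, scale_ratio: float, c: int = 16) -> int:
    b = round(scale_ratio * (1 << c))   # integer multiplier
    return (x_int * b) >> c             # integer multiply + right shift

# e.g. rescaling an int32 accumulator value by S_in / S_out = 0.0123
print(dyadic_rescale(100_000, 0.0123))  # ~1229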
The script for downloading the GLUE datasets in the README (ibert branch) is wrong.
It fails with "403 Forbidden": as W4ngatang said, they are no longer hosting the GLUE data on their server (more details in the comments at https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/). There are also some bugs in the urllib calls.
I fixed it with the correct data URLs and IO operations here: https://gist.github.com/Alexiazzf/67f8a3ba31bf489e6b4242e9c1f516a8