kssteven418 / i-bert
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
Home Page: https://arxiv.org/abs/2101.01321
License: MIT License
I see that softmax and the polynomial approximation use floor, but other places use round. What is the consideration behind this?
You make reference in the paper and on Hugging Face to a TensorRT deployment, but I can't find the code.
Do you plan to share it too?
As far as I know, the NVIDIA repo only has examples for their own models (all BERT-based); it's a bit hard to try it on our own without an example.
I am trying to use I-BERT through transformers. Can I use other RoBERTa models and quantize them?
If so, what parameters can I change, and how do I convert them?
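Not an official answer, just a sketch of loading an I-BERT model through transformers, assuming the released kssteven/ibert-roberta-base checkpoint; other RoBERTa-style checkpoints would need weights that match the IBert* architecture to load cleanly:

import torch
from transformers import AutoTokenizer, IBertForSequenceClassification

# Sketch: load the released I-BERT base checkpoint; quant_mode is a config flag,
# so it can be overridden at load time (False for full-precision finetuning first,
# True later for integer-only finetuning).
tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")
model = IBertForSequenceClassification.from_pretrained(
    "kssteven/ibert-roberta-base",
    quant_mode=False,
    num_labels=2,
)

inputs = tokenizer("a quick smoke test", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 2)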
Dear Authors,
Thanks for sharing this valuable code.
I'm trying to use your code for vision transformer quantization, and I have some questions about the scaling factor.
If I want to swap in some layers (e.g. GELU -> IntGELU), I have to set the scaling factor for the input arguments.
class IntGELU(nn.Module):
    def forward(self, x, scaling_factor=None):
        if not self.quant_mode:
            return self.activation_fn(x), None
        # my attempt at obtaining the input and its scaling factor:
        x, scaling_factor = QuantAct(32, quant_mode=self.quant_mode)
Is this right? Could you please give me some advice?
Thanks in advance.
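For what it's worth, here is a minimal sketch of how a quantized input and its scaling factor are usually produced before calling IntGELU, assuming the QuantAct / IntGELU interfaces from the Hugging Face ibert quant_modules (the shapes and bit widths below are only illustrative):

import torch
from transformers.models.ibert.quant_modules import IntGELU, QuantAct

# QuantAct is instantiated once and then *called* on a tensor; its forward pass
# returns the (simulated) quantized activation together with its scaling factor.
quant_input = QuantAct(activation_bit=8, quant_mode=True)
int_gelu = IntGELU(quant_mode=True)

x = torch.randn(1, 197, 768)             # e.g. ViT patch-token activations
x_q, x_scaling_factor = quant_input(x)   # quantize the input, get its scale
y, y_scaling_factor = int_gelu(x_q, scaling_factor=x_scaling_factor)
print(y.shape, y_scaling_factor)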
Hi, thanks for the amazing contribution!
I'm trying to use IBert from Huggingface/transformers (4.4.2) in my own training pipeline, where I'm fine-tuning in quant mode with mixed precision (using PyTorch's cuda.amp module). This results in overflows in the QuantLinear layers, which causes subsequent training to break due to NaNs. I'm considering artificially clamping the weights to a smaller range to avoid this, or using a lower bit precision (from 8 down to, say, 4) while fine-tuning.
I'm wondering if you have tried this or have any suggestions about my approaches that could help me train effectively.
Thanks.
with autocast(enabled=grad_scaler.is_enabled()):
    # TRAINING CODE...
I'm unable to post any more code (proprietary stuff, sorry!), but I can provide some specifics if you need them.
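A workaround I would try (my own sketch, not an author recommendation): keep the quantized forward pass outside autocast so the simulated integer arithmetic in QuantLinear stays in fp32 instead of overflowing in fp16, and optionally clamp weights before quantization-aware fine-tuning. The model and inputs below stand in for the proprietary training-loop variables above:

import torch
from torch.cuda.amp import autocast
from transformers import IBertForSequenceClassification

model = IBertForSequenceClassification.from_pretrained(
    "kssteven/ibert-roberta-base", quant_mode=True
).cuda()
input_ids = torch.randint(0, 50265, (2, 32)).cuda()
attention_mask = torch.ones_like(input_ids)

# Run the quantized model with autocast disabled so nothing is cast down to fp16.
with autocast(enabled=False):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# Alternative idea from the question above: clamp weights into a tighter range
# before quantization-aware fine-tuning (the clamp range is purely illustrative).
with torch.no_grad():
    for name, param in model.named_parameters():
        if name.endswith(".weight"):
            param.clamp_(-1.0, 1.0)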
Dear Editor,
My first step is to do full-precision finetuning, and I set quant_mode: true. Then I carry out the integer-only finetuning. When I test the integer-only finetuned model on MRPC, the result is very bad. Could you give some guidance? (When I test an MRPC sample, the result is tensor([[0.5003, 0.4997]], grad_fn=).) Here is my config.json:
{
"_name_or_path": "/home/rram/storage/cailei/nlp_project/fine_tune/standard_ibert_weights/ibert-roberta-base",
"architectures": [
"IBertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"finetuning_task": "mrpc",
"force_dequant": "none",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "not_equivalent",
"1": "equivalent"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"equivalent": 1,
"not_equivalent": 0
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "ibert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"quant_mode": true,
"tokenizer_class": "RobertaTokenizer",
"torch_dtype": "int8",
"transformers_version": "4.12.0.dev0",
"type_vocab_size": 1,
"vocab_size": 50265
}
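For reference, here is a sketch of how quant_mode is typically toggled between the two finetuning stages via the Hugging Face config. This is my reading of the intended workflow rather than an official snippet, and the stage-2 checkpoint path is a placeholder:

from transformers import IBertConfig, IBertForSequenceClassification

# Stage 1 (assumption: full-precision finetuning is done with quant_mode disabled).
config = IBertConfig.from_pretrained(
    "kssteven/ibert-roberta-base", quant_mode=False, num_labels=2
)
model = IBertForSequenceClassification.from_pretrained(
    "kssteven/ibert-roberta-base", config=config
)
# ... finetune on MRPC in full precision and save the checkpoint ...

# Stage 2: reload that finetuned checkpoint with quant_mode enabled for
# integer-only (quantization-aware) finetuning.
config = IBertConfig.from_pretrained("mrpc-fp32-checkpoint", quant_mode=True)
model = IBertForSequenceClassification.from_pretrained(
    "mrpc-fp32-checkpoint", config=config
)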
When I try the quantization step in this code, it cannot continue to run; an error occurs: CUDA out of memory. What can I do to solve this issue in the quantization step? Are there any parameters I can change? Thank you so much!
ValueError is raised as a result of trying to cast NaN values to integers.
See PR: #30.
Excellent work!
Can the CPU be used for inference?
And how much faster is it than the baseline?
Hi, thank you for releasing this project!
I was wondering if you happen to have the pre-trained weights for the models finetuned on the different downstream tasks (QQP, MNLI.. etc). i.e. the initialisation weights for the quantisation-aware fine-tuning stage. I only say this as it would save me a lot of time and compute, and may be helpful to others too.
Thanks
Fixed x_scaling_factor in transformer_sentence_encoder.py: scaling_factor was changed to x_scaling_factor.
Or do we use the same scaling factor for all transformer encoder layers?
Regarding IntSoftmax: because we have a QuantAct here, if we don't implement the fix function, we will skip fixing the QuantAct in IntSoftmax from here.
I have recently been trying to implement I-BERT on TVM, and I found that I should add FixedPointMul and SymmetricQuantFunction operators to TVM. Do you have any existing implementation? Thanks!
As far as I understand, ibert-roberta-base adds some specific parameters to config.json, while the weights are the same as roberta-base.
Hi,
I've noticed that the QuantAct layers preceding IntLayerNorm in the IBertSelfOutput and IBertOutput modules specify a 22-bit activation width, while the QuantAct layer preceding IntLayerNorm in IBertEmbeddings specifies a 16-bit activation.
I couldn't find any mention of these bit-width choices in the paper. Could you please explain why these choices were made?
Thank you!
I'm using the Hugging Face implementation. Even though I set quant_mode=True, I see that the output of IBert is float32. Am I using the model wrong, or is this expected?
self.bert = AutoModel.from_pretrained(
    base_model, quant_mode=quant_mode, add_pooling_layer=False
)
...

def forward(
    self,
    input_ids: Tensor,
    attention_mask: Tensor,
    k: int = None,
    return_layers: List[int] = None,
    return_orig: bool = False,
):
    bert_out = self.bert(
        input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
        return_dict=True,
    )
    # the output dtype is float32!
    print(bert_out.hidden_states[0])
Hi,
It looks like, at least in the HF code, you are storing both the float32 AND the int weights, which would increase the memory footprint. Don't you want to either load one or the other, or at least have an option to quantize and send to CUDA, where you would clear the float32 (or int) version before sending to CUDA, thus lowering the memory footprint? Alternatively, you could overload 'to' (or 'cuda', or whatever method is used to move to CUDA) so that it only moves over the right parameters.
Thanks
Hello,
Great paper, kudos!
After reading it, I was wondering: is it possible to apply these quantization methods to an already-trained model from huggingface transformers, or do we need to re-train the model with I-BERT?
In the Hugging Face config, I set quant_mode = True.
The weight_integer buffer remains 0, and the result is wrong.
Moreover, the inference latency in integer mode is 20 times that of float mode.
Can you please explain the reason to me?
Compared with fp32 finetuning, inference on the dev data during training takes about 10x longer when doing integer-only finetuning.
How can I do INT8 inference and achieve the speedup described in the paper?
I want to test the accuracy and time consumption of I-BERT in INT8 inference, so I installed transformers and tried to quantize the RoBERTa-base model to generate the weights. I have set quant_mode to true and torch_dtype to int8. However, the time consumption of I-BERT in INT8 inference is similar to that of the RoBERTa-base model on my 1080 Ti. Is there any problem with my config.json or my device? Here is my config.json:
{
"_name_or_path": "./outputs/checkpoint-1150/",
"architectures": [
"IBertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"finetuning_task": "mrpc",
"force_dequant": "none",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": 0,
"1": 1
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"0": 0,
"1": 1
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "ibert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"quant_mode": true,
"tokenizer_class": "RobertaTokenizer",
"torch_dtype": "int8",
"transformers_version": "4.11.0.dev0",
"type_vocab_size": 1,
"vocab_size": 50265
}
Thanks for the great work.
It uses 32-bit integers for the activations and softmax.
However, the self-attention result cannot exceed 26 bits (8 bits x 8 bits x 10 bits for 768 channels).
I want to try 16-bit precision (quantized with 16 bits, together with the softmax and GELU algorithms).
Is there any problem with 16 bits?
If not, I would like to know how to implement this.
I've been trying to add the I-BERT quantization modules to DistilBERT and ran into this issue:
scaling_factor = torch.tensor([1 / 2 ** self.output_bit], device=exp_int.device)
Please let me know your thoughts on this. Thanks!
Nice implementation!
I was wondering if we can experiment with different quantization settings (e.g. 6bit).
Thanks!
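In case it helps, here is a rough sketch of experimenting with a different bit width, assuming the QuantAct / QuantLinear constructors from the Hugging Face ibert quant_modules expose the bit widths as arguments; whether such low-bit settings remain accurate is exactly the open question:

import torch
from transformers.models.ibert.quant_modules import QuantAct, QuantLinear

# Sketch only: 6-bit activations and weights on a single linear block.
quant_act = QuantAct(activation_bit=6, quant_mode=True)
quant_linear = QuantLinear(in_features=768, out_features=768,
                           weight_bit=6, quant_mode=True)
quant_linear.weight.data.normal_(0, 0.02)  # give the demo layer non-zero weights

x = torch.randn(2, 16, 768)
x_q, x_sf = quant_act(x)                                   # quantize activations
y, y_sf = quant_linear(x_q, prev_act_scaling_factor=x_sf)  # 6-bit weight matmul
print(y.shape)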
Is there any method to approximate GELU using a 3rd-order polynomial? How do you create the polynomial coefficients?
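i-GELU in the paper uses a second-order polynomial; if you want a cubic, one generic way (my own sketch, not the authors' method) is to least-squares fit erf(x / sqrt(2)) on a bounded interval and plug the fit into GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))):

import math
import numpy as np

# Fit a 3rd-order polynomial to erf(x / sqrt(2)) on [-3, 3]; outside that range
# erf saturates, so the polynomial output is clipped to [-1, 1].
xs = np.linspace(-3.0, 3.0, 10001)
erf_vals = np.array([math.erf(v / math.sqrt(2.0)) for v in xs])
coeffs = np.polyfit(xs, erf_vals, deg=3)

def gelu_poly(x):
    l = np.clip(np.polyval(coeffs, x), -1.0, 1.0)  # polynomial stand-in for erf
    return 0.5 * x * (1.0 + l)

print(coeffs)                                  # even-order terms are ~0 (erf is odd)
print(gelu_poly(np.array([-2.0, 0.0, 2.0])))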
Thanks for sharing the code. This is really a great paper with clear and impressive ideas. I'm sorry that I might have some misunderstanding and am a little confused about some details. In the paper, the authors claim the model is implemented with integer-only arithmetic, yet in Algorithms 1-4 as well as in Equations 1 and 2, the scaling factor is not an integer but seems to be a floating-point number. In other words, it needs to do floating-point scaling, which is also demonstrated on this line. I wonder if I misunderstood this, and how this is implemented with integer arithmetic. Thanks a lot!
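My understanding, offered as a sketch rather than an authoritative answer: at deployment the floating-point rescaling ratio can be replaced by a dyadic number b / 2^c, so the requantization itself needs only an integer multiply and a bit shift:

# Sketch (my reading, not the authors' exact implementation): approximate a
# floating-point rescaling factor with b / 2**c so only integer ops are needed.
def dyadic_rescale(x_int: int, scale_ratio: float, c: int = 16) -> int:
    b = round(scale_ratio * (1 << c))   # integer multiplier
    return (x_int * b) >> c             # integer multiply + right shift

# e.g. rescaling an int32 accumulator value by S_in / S_out = 0.0123
print(dyadic_rescale(100_000, 0.0123))  # ~1229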
The script for downloading the GLUE datasets in the README (ibert branch) is wrong.
It fails with "403 Forbidden": as W4ngatang said, they are no longer hosting the GLUE data on their server (more details in the comments at https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/). There are also some bugs in the urllib calls.
I fixed it with the correct data URLs and IO operations here: https://gist.github.com/Alexiazzf/67f8a3ba31bf489e6b4242e9c1f516a8