i-bert's Issues

Missing deployment part on TensorRT

❓ Questions and Help

You refer in the paper and on Hugging Face to a TensorRT deployment, but I can't find the code.
Do you plan to share it as well?

As far as I know, the NVIDIA repo only has examples for their own models (all BERT-based), so it's hard to try this on our own without an example.

Quantize other RoBERTa models

I am trying to use I-BERT through transformers. Can I quantize other RoBERTa models with it?
If so, which parameters can I change, and how do I convert them?

About scaling_factor

Dear authors,
Thanks for sharing this valuable code.

I'm trying to use your code for vision transformer quantization.

I have some questions about the scaling factor in this work. If I want to swap in some layers (e.g. GELU -> IntGELU), I have to provide the scaling factor as an input argument.

For this, I suppose I can add a QuantAct in the forward function of IntGELU:

class IntGELU(nn.Module):

    def forward(self, x, scaling_factor=None):
        if not self.quant_mode:
            return self.activation_fn(x), None
        x, scaling_factor = QuantAct(32, quant_mode=self.quant_mode)
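Alternatively, I am considering creating the QuantAct once in __init__ and calling it inside forward, roughly like this. This is only a rough sketch on my side; the wrapper class name is made up, and I am assuming the constructor and call signatures of QuantAct/IntGELU from the Hugging Face quant_modules, so please correct me if they differ.

import torch.nn as nn
# assumed import path (Hugging Face port of I-BERT); adjust for the fairseq code base
from transformers.models.ibert.quant_modules import QuantAct, IntGELU

class QuantGELUBlock(nn.Module):
    """Hypothetical wrapper: re-quantize the input, then apply integer-only GELU."""

    def __init__(self, quant_mode=True):
        super().__init__()
        self.pre_act = QuantAct(32, quant_mode=quant_mode)  # 32-bit requantization before GELU
        self.act = IntGELU(quant_mode=quant_mode)

    def forward(self, x, scaling_factor=None):
        # QuantAct returns the re-quantized tensor together with its scaling factor
        x, scaling_factor = self.pre_act(x, scaling_factor)
        return self.act(x, scaling_factor)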
Is this the right approach? Could you please give me some advice?

Thanks in advance.

Training in mixed precision

❓ Questions and Help


What is your question?

Hi, thanks for the amazing contribution!
I'm trying to use IBert from Huggingface/transformers (4.4.2) in my own training pipeline, where I'm fine-tuning in quant mode with mixed precision (using PyTorch's cuda.amp module). This results in overflows in the QuantLinear layers, which causes subsequent training to break due to NaNs. I'm considering artificially clamping the weights to a smaller range to avoid this, or using a lower bit precision (from 8 down to, say, 4) while fine-tuning.

I'm wondering if you have tried this or have any suggestions about my approaches that could help me train effectively.

Thanks.

Code

with autocast(enabled=grad_scaler.is_enabled()):
    # TRAINING CODE...

I'm unable to post any more code (proprietary stuff, sorry!), but I can provide some specifics if you need them.
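For concreteness, the weight clamping I am considering would look roughly like this. It is my own sketch, not something from I-BERT: the helper name and the clamp range are made up, and I am assuming the Hugging Face QuantLinear module path.

import torch
# assumed import path for the Hugging Face I-BERT quantization modules
from transformers.models.ibert.quant_modules import QuantLinear

def clamp_quant_linear_weights(model, max_abs=0.5):
    """Clamp the fp32 weights of every QuantLinear so their quantized range stays small."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, QuantLinear):
                module.weight.clamp_(-max_abs, max_abs)

# called once per step, after grad_scaler.step(optimizer) and grad_scaler.update()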

What's your environment?

  • PyTorch Version: 1.8.0
  • OS: Ubuntu 18.04
  • CUDA/cuDNN version: 10.1/7.6.5

IBert problems with quant_mode=true

Dear authors,
My first step is to do full-precision finetuning, and I set quant_mode: true. Then I carry out the integer-only finetuning. When I test the integer-only finetuned model on MRPC, the result is very bad. Could you give me some guidance? (When I test an MRPC sample, the output is tensor([[0.5003, 0.4997]], grad_fn=).) My config.json is below.

{
  "_name_or_path": "/home/rram/storage/cailei/nlp_project/fine_tune/standard_ibert_weights/ibert-roberta-base",
  "architectures": [
    "IBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mrpc",
  "force_dequant": "none",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "not_equivalent",
    "1": "equivalent"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "equivalent": 1,
    "not_equivalent": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "ibert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "quant_mode": true,
  "tokenizer_class": "RobertaTokenizer",
  "torch_dtype": "int8",
  "transformers_version": "4.12.0.dev0",
  "type_vocab_size": 1,
  "vocab_size": 50265
}

About CUDA out of memory

When I try the quantization step in this code, it cannot continue to run; an error occurs: CUDA out of memory. What can I do to solve this issue in the quantization step? Are there any parameters I can change? Thank you so much!

Pre-trained weights for specific tasks

Hi, thank you for releasing this project!

I was wondering if you happen to have the pre-trained weights for the models finetuned on the different downstream tasks (QQP, MNLI, etc.), i.e. the initialisation weights for the quantisation-aware fine-tuning stage. I only ask because it would save me a lot of time and compute, and it may be helpful to others too.

Thanks

Bugs in the code

🐛 Bug

  1. x_scaling_factor in transformer_sentence_encoder.py:
    I think the code here should change scaling_factor to x_scaling_factor. Or do we use the same scaling factor for all transformer encoder layers?
  2. freeze_model for IntSoftmax:
    I think we should implement the fix function in IntSoftmax, because it contains a QuantAct. If we don't implement fix, freeze_model will skip fixing the QuantAct in IntSoftmax from here (see the sketch after this list).
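As a sketch, these are the two methods I would add to IntSoftmax. I am assuming its internal QuantAct is stored as self.act and that QuantAct already exposes fix()/unfix() like the other quant modules; please correct me if the attribute or method names differ.

def fix(self):
    # freeze the running activation range of the internal QuantAct
    self.act.fix()

def unfix(self):
    # re-enable activation range tracking
    self.act.unfix()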

Problem

I have recently been trying to implement I-BERT on TVM, and I found that I need to add FixedPointMul and SymmetricQuantFunction operators to TVM. Do you have any existing implementation? Thanks!
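For reference, a minimal PyTorch sketch of what I understand SymmetricQuantFunction to compute: symmetric linear quantization to signed k-bit integers. This is my own summary, not the repository's or TVM's implementation.

import torch

def symmetric_quant(x, scale, k=8):
    # map a real-valued tensor to signed k-bit integers: round(x / scale), then clamp
    n = 2 ** (k - 1) - 1
    return torch.clamp(torch.round(x / scale), -n - 1, n)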

Why use 22 bit quantized activations for some layer norms (except in Embeddings)?

Hi,
I've noticed that the QuantAct layers preceding IntLayerNorm in the IBertSelfOutput and IBertOutput modules specify a 22-bit activation width, while the QuantAct layer preceding IntLayerNorm in IBertEmbeddings specifies a 16-bit activation.

I couldn't find any mention of these bit width choices in the paper. Could you please explain why these choices have been made?

Thank you!

(huggingface) The output of IBERT is float. Am I doing wrong?

❓ Questions and Help

What is your question?

I'm using the Hugging Face implementation. Even though I set quant_mode=True, I see that the output of IBert is float32. Am I using the model wrong, or is this expected?

Code

self.bert = AutoModel.from_pretrained(
    base_model, quant_mode=quant_mode, add_pooling_layer=False
)

...


def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor,
        k: int = None,
        return_layers: List[int] = None,
        return_orig: bool = False,
    ):
        bert_out = self.bert(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True,
        )

        # the output dtype is float32!
        print(bert_out.hidden_states[0])

What's your environment?

  • PyTorch Version: 1.7.1
  • OS (e.g., Linux): Ubuntu 20
  • How you installed fairseq (pip, source): No
  • Python version: 3.8.5
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: RTX 3090

Storing both float32 and int parameters

Hi

It looks like, at least in the HF code, you are storing both the float32 and the int weights, which increases the memory footprint. Wouldn't it be better to load only one or the other, or at least have an option to quantize and clear the float32 (or int) version when sending the model to CUDA, thus lowering the memory footprint? Alternatively, you could overload to() (or cuda(), or whatever method is used to move to CUDA) so that it only moves over the relevant parameters.
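To illustrate the concern, a quick way to measure the footprint; this is a throwaway helper of my own, not part of the repository.

def footprint_mb(model):
    """Total size of parameters plus registered buffers (e.g. weight_integer), in MiB."""
    params = sum(p.numel() * p.element_size() for p in model.parameters())
    buffers = sum(b.numel() * b.element_size() for b in model.buffers())
    return (params + buffers) / 2 ** 20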

Thanks

Quantization on trained model

❓ Questions and Help

Hello,
Great paper, kudos!
After reading it, I was wondering: is it possible to use these quantization methods on an already trained model from Hugging Face transformers, or do we need to re-train the model with I-BERT?

Latency 20x with quant_mode = true

In the Hugging Face config, I set quant_mode = true.
The weight_integer buffer remains 0, and the result is wrong.
Moreover, the inference latency in integer mode is 20 times that of float mode.
Can you please explain the reason?

Cannot run int8 inference with the quantized model on my device

I want to test the accuracy and time consumption of I-BERT with int8 inference, so I installed transformers and tried to quantize the RoBERTa-base model to generate the weights. I have set quant_mode to true and torch_dtype to int8. However, the time consumption of I-BERT with int8 inference is similar to that of the RoBERTa-base model on my 1080 Ti. Is there a problem with my config.json or my device? Here is my config.json:
{
  "_name_or_path": "./outputs/checkpoint-1150/",
  "architectures": [
    "IBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mrpc",
  "force_dequant": "none",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": 0,
    "1": 1
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "0": 0,
    "1": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "ibert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "quant_mode": true,
  "tokenizer_class": "RobertaTokenizer",
  "torch_dtype": "int8",
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 1,
  "vocab_size": 50265
}

Another setting for quantization

Thanks for the great work.

The code uses 32-bit integers for the activation and softmax computations.

However, the self-attention result cannot exceed about 26 bits: an 8-bit activation times an 8-bit weight gives a 16-bit product, and accumulating over 768 channels adds roughly 10 more bits.
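As a quick sanity check on that bit count:

import math

product_bits = 8 + 8                    # 8-bit activation x 8-bit weight product
accum_bits = math.ceil(math.log2(768))  # accumulating over 768 channels adds ~10 bits
print(product_bits + accum_bits)        # -> 26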

I want to try running with 16-bit precision instead (16-bit quantization for the softmax and GELU algorithms).
Would 16 bits cause any problems?
If not, I would like to know how to implement this.

Task name references in strings are wrong

Some inconsistencies arose from previous commits, preventing fine-tuning for these tasks (SST-2 and STS-B).
It would be better to use an enumeration to prevent such issues in the future; for now, I'm providing two lightweight PRs (#27 and #28) to address this issue.

Possible bug in IntSoftmax

🐛 Bug

I've been trying to add the IBert quantization modules to DistilBERT and ran into this issue.

scaling_factor = 1 / 2 ** self.output_bit

is a float value and is returned as is. I believe this should be converted to a tensor on the appropriate device before returning, e.g. scaling_factor = torch.tensor([1 / 2 ** self.output_bit], device=exp_int.device).
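Concretely, the change I have in mind near the end of IntSoftmax.forward, as a sketch; exp_int and self.output_bit are the names already used there.

# current code assigns a plain Python float:
#     scaling_factor = 1 / 2 ** self.output_bit
# proposed sketch: build it as a tensor on the same device as exp_int, then
# return it together with the integer output as before
scaling_factor = torch.tensor([1.0 / 2 ** self.output_bit], device=exp_int.device)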

Please let me know your thoughts on this. Thanks!

How is the scaling factor S implemented with integer?

Thanks for sharing the code. This is really a great paper with clear and impressive ideas. I'm sorry if I have some misunderstanding, but I am a little confused about some details. In the paper, the authors claim the model is implemented with integer-only arithmetic, yet in Algorithms 1-4 as well as in Equations 1 and 2, the scaling factor is not an integer but seems to be a floating-point number. In other words, it needs to do floating-point scaling, which is also demonstrated on this line. I wonder if I misunderstood this and how this is implemented with integer arithmetic. Thanks a lot!
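For reference, the usual trick (and, as I read the paper, its dyadic-number approach) is to approximate the real-valued ratio of two scaling factors by a dyadic number b / 2**c, so that rescaling becomes an integer multiply followed by a bit shift. A rough sketch, in my own words and with a made-up function name:

import torch

def dyadic_rescale(x_int, scale_in, scale_out, c=31):
    # approximate the real ratio scale_in / scale_out by b / 2**c with an integer b;
    # the rescale then becomes an integer multiply plus a division by a power of two
    # (a right shift in a true integer kernel; emulated here in float64 for clarity)
    b = round(scale_in / scale_out * 2 ** c)
    return torch.floor(x_int.to(torch.float64) * b / 2 ** c)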

Arguments in run.py

🐛 Bug

The hyperparameters in run.py do not correspond to the task_speces in run.py, line 73.

Where can I find the integer-sqrt kernel?

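For reference, the integer square root that the paper describes for IntLayerNorm can be written as a Newton iteration in pure integer arithmetic. A minimal sketch follows; it is my own and not the repository's kernel.

def int_sqrt(n: int) -> int:
    """Integer square root via Newton's iteration: floor(sqrt(n)) for n >= 0."""
    if n == 0:
        return 0
    x = 2 ** ((n.bit_length() + 1) // 2)  # initial guess, guaranteed >= sqrt(n)
    while True:
        y = (x + n // x) // 2             # Newton step, integer arithmetic only
        if y >= x:
            return x
        x = y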

Wrong script of downloading GLUE datasets

🐛 Bug

The script for downloading the GLUE datasets in the README (ibert branch) is broken.

"wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py"

It fails with a "403 Forbidden" error, as W4ngatang said they are no longer hosting the GLUE data on their server (more details in the comments at https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/). There are also some bugs in the urllib calls.

🎈 Fix methods

I fixed it with the correct data URLs and I/O operations in this gist: https://gist.github.com/Alexiazzf/67f8a3ba31bf489e6b4242e9c1f516a8
