Comments (12)
Interestingly, the smaller 270M model seems to work fine in 8-bit.
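For reference, a hedged sketch of the conversion this refers to, using mlx_lm's convert() API (argument names are from the version around the time of this thread and may have changed; it also assumes the tokenizer issue discussed below has been worked around):

```python
from mlx_lm.utils import convert

# Quantize the 270M OpenELM checkpoint to 8-bit weights.
convert(
    hf_path="apple/OpenELM-270M",
    mlx_path="openelm-270m-8bit",
    quantize=True,
    q_bits=8,
)
```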
I'm just a beginner in this project, but a quick debug reveals that the test string "hello" isn't being encoded properly by the tokenizer. I'll need to check more.
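A minimal sketch of that kind of check, assuming the Llama-2 tokenizer that comes up later in this thread (the repo id is an assumption, not something from the mlx-examples code):

```python
from transformers import AutoTokenizer

# Encode the test string and decode it back to inspect what the ids map to.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tok.encode("hello")
print(ids)
print(tok.decode(ids))
```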
When I try to load this model for quantization, I get the following error:
File "/Users/awnihannun/mlx-examples/llms/mlx_lm/tokenizer_utils.py", line 327, in load_tokenizer
AutoTokenizer.from_pretrained(model_path, **tokenizer_config_extra),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/awnihannun/miniconda3/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 891, in from_pretrained
raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.011986e180a41be8d1972cba11929b1174295f8c.configuration_openelm.OpenELMConfig'> to build an AutoTokenizer.
I'm wondering how you were even able to quantize it, @Blaizzy?
Yeah, you get that error because none of the OpenELM models ship with a tokenizer, but the generate file says they use the Llama-2-7B tokenizer:
https://huggingface.co/apple/OpenELM-3B/blob/main/generate_openelm.py#L16
I tweeted at you about this.
https://x.com/prince_canuma/status/1783155293943214406?s=46
The solution I found was to hardcode the tokenizer name in tokenizer_utils.
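Roughly, the workaround looks like this (a sketch, not the actual mlx-examples patch; the exact Llama-2 repo id is an assumption based on generate_openelm.py):

```python
from transformers import AutoTokenizer

def load_openelm_tokenizer(**tokenizer_config_extra):
    # OpenELM checkpoints ship no tokenizer of their own, so skip the failing
    # AutoTokenizer lookup on OpenELMConfig and hardcode the Llama-2-7B tokenizer.
    return AutoTokenizer.from_pretrained(
        "meta-llama/Llama-2-7b-hf", **tokenizer_config_extra
    )
```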
So some findings:
- The model doesn't work in float16 either, with or without quantization
- Works in bf16 without quantization but not so well with quantization
- Works in fp32 without quantization and sort of with quantization
It doesn't seem like a quantization issue per se. The model seems very sensitive to precision.
@Blaizzy for these types of issues, sometimes there is a place in the model that is particularly sensitive and you can up/down cast around it. But if it's just a general issue across all layers then this will be very difficult to fix.
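To make the up/down-casting idea concrete, a minimal sketch (the block and tensor names are placeholders, not the actual OpenELM layers):

```python
import mlx.core as mx

def run_sensitive_block(block, x):
    # Upcast to fp32 just around the precision-sensitive block, then cast back
    # to the model's working dtype so the rest of the network is unchanged.
    orig_dtype = x.dtype
    y = block(x.astype(mx.float32))
    return y.astype(orig_dtype)
```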
The numbers get to be quite large and are overflowing the range of fp16. Once that happens all is lost. I think most models are trained with some regularization to keep values from getting so big. But I don't think this model will be very amenable to low precision inference.
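A tiny illustration of that overflow: float16 tops out around 65504, so anything past that becomes inf, while bfloat16 keeps fp32's exponent range (which lines up with the bf16-vs-fp16 findings above):

```python
import mlx.core as mx

x = mx.array([60000.0, 70000.0])
print(x.astype(mx.float16))   # the second value overflows to inf
print(x.astype(mx.bfloat16))  # still representable, just with fewer mantissa bits
```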
I think you could get it to work with 8-bit or maybe 4-bit quantization + fp32 as the accumulation type. Right now that's not a supported option in the conversion script, but it is easy to change: just set the dtype there: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L592
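A sketch of what that change amounts to (illustrative only; the function and variable names are assumptions, not the actual utils.py code):

```python
import mlx.core as mx

def cast_unquantized_weights(weights: dict) -> dict:
    # Keep the parameters that stay unquantized in float32 rather than float16,
    # so accumulation during inference happens in fp32.
    return {k: v.astype(mx.float32) for k, v in weights.items()}
```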
Otherwise I don't think we can do much for this model so I will close this issue as won't fix. Sorry for the not so helpful outcome.
@sacmehta I'm curious if these findings make sense to you? We're finding the larger OpenELM models don't reduce precision well. Usually models are trained with some L2 regularization to prevent weights from getting too large (so they quantize well). Maybe OpenELM was not trained that way / the regularization parameter is not high enough?
Could be useful to work on that for future versions so we can quantize the larger models.
Thanks for the update, this is a super helpful reference for the future!
In this case, all layers are affected. In the report they describe the technique as a "layer-wise scaling strategy":
https://arxiv.org/pdf/2404.14619
> In the report they describe the technique as a "layer-wise scaling strategy"
That's something different. That is the strategy used to determine the number of hidden units per layer.
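Roughly, a sketch of that idea (the interpolation bounds and layer count below are made up for illustration, not the paper's values):

```python
import numpy as np

num_layers = 28
# Layer-wise scaling: interpolate per-layer width parameters (e.g. the FFN
# multiplier) across depth instead of using one constant value for all layers.
ffn_multipliers = np.linspace(0.5, 4.0, num_layers)
print(ffn_multipliers.round(2))
```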
> We use a weight decay of 0.1 and gradient clipping of 1.0.
I see that in the paper, so there should be some weight decay. I don't know if the value is comparable to other models of the same size. Maybe there is something else going on.
My hypothesis is that the layer scaling is perhaps not effective for larger models, and that's what's breaking.
The only difference between the 1.1B, which works well when quantized, and the 3B is the number of layers and their scaling (ffn_multipliers, num_kv_heads, and num_q_heads).
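One way to check that difference directly (a sketch; the config attribute names are guesses at the fields in configuration_openelm.py and may not match exactly):

```python
from transformers import AutoConfig

small = AutoConfig.from_pretrained("apple/OpenELM-1_1B", trust_remote_code=True)
large = AutoConfig.from_pretrained("apple/OpenELM-3B", trust_remote_code=True)
for field in ("num_transformer_layers", "ffn_multipliers", "num_kv_heads", "num_query_heads"):
    # getattr with a default so a wrong field name prints None instead of raising
    print(field, getattr(small, field, None), getattr(large, field, None))
```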
> I don't know if the value is comparable to other models of the same size
It is; here are the values used for a model of comparable size.
Check section 4.2.