Comments (12)
Interestingly, the smaller 270M model seems to work fine in 8-bit.
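For reference, a hedged sketch of the conversion this refers to, using mlx_lm's convert() API (argument names are from the version around the time of this thread and may have changed; it also assumes the tokenizer issue discussed below has been worked around):

```python
from mlx_lm.utils import convert

# Quantize the 270M OpenELM checkpoint to 8-bit weights.
convert(
    hf_path="apple/OpenELM-270M",
    mlx_path="openelm-270m-8bit",
    quantize=True,
    q_bits=8,
)
```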
I'm just a beginner in this project, but a quick debug reveals that the test string "hello" isn't being encoded properly by the tokenizer. I'll need to check more.
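A minimal sketch of that kind of check, assuming the Llama-2 tokenizer that comes up later in this thread (the repo id is an assumption, not something from the mlx-examples code):

```python
from transformers import AutoTokenizer

# Encode the test string and decode it back to inspect what the ids map to.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tok.encode("hello")
print(ids)
print(tok.decode(ids))
```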
When I try to load this model for quantization, I get the following error:
File "/Users/awnihannun/mlx-examples/llms/mlx_lm/tokenizer_utils.py", line 327, in load_tokenizer
AutoTokenizer.from_pretrained(model_path, **tokenizer_config_extra),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/awnihannun/miniconda3/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 891, in from_pretrained
raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.011986e180a41be8d1972cba11929b1174295f8c.configuration_openelm.OpenELMConfig'> to build an AutoTokenizer.
I'm wondering how you were even able to quantize it, @Blaizzy?
Yeah, you get that error because none of the OpenELM models ship with a tokenizer, but the generate file says they use the Llama-2-7B tokenizer:
https://huggingface.co/apple/OpenELM-3B/blob/main/generate_openelm.py#L16
I tweeted at you about this.
https://x.com/prince_canuma/status/1783155293943214406?s=46
The solution I found was to hardcode the tokenizer name in tokenizer_utils.
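Roughly, the workaround looks like this (a sketch, not the actual mlx-examples patch; the exact Llama-2 repo id is an assumption based on generate_openelm.py):

```python
from transformers import AutoTokenizer

def load_openelm_tokenizer(**tokenizer_config_extra):
    # OpenELM checkpoints ship no tokenizer of their own, so skip the failing
    # AutoTokenizer lookup on OpenELMConfig and hardcode the Llama-2-7B tokenizer.
    return AutoTokenizer.from_pretrained(
        "meta-llama/Llama-2-7b-hf", **tokenizer_config_extra
    )
```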
So some findings:
- The model doesn't work in float16 either, with or without quantization
- Works in bf16 without quantization but not so well with quantization
- Works in fp32 without quantization and sort of with quantization
It doesn't seem like a quantization issue per se. The model seems very sensitive to precision.
@Blaizzy for these types of issues, sometimes there is a place in the model that is particularly sensitive and you can up/down cast around it. But if it's just a general issue across all layers then this will be very difficult to fix.
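To make the up/down-casting idea concrete, a minimal sketch (the block and tensor names are placeholders, not the actual OpenELM layers):

```python
import mlx.core as mx

def run_sensitive_block(block, x):
    # Upcast to fp32 just around the precision-sensitive block, then cast back
    # to the model's working dtype so the rest of the network is unchanged.
    orig_dtype = x.dtype
    y = block(x.astype(mx.float32))
    return y.astype(orig_dtype)
```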
The numbers get to be quite large and are overflowing the range of fp16. Once that happens all is lost. I think most models are trained with some regularization to keep values from getting so big. But I don't think this model will be very amenable to low precision inference.
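A tiny illustration of that overflow: float16 tops out around 65504, so anything past that becomes inf, while bfloat16 keeps fp32's exponent range (which lines up with the bf16-vs-fp16 findings above):

```python
import mlx.core as mx

x = mx.array([60000.0, 70000.0])
print(x.astype(mx.float16))   # the second value overflows to inf
print(x.astype(mx.bfloat16))  # still representable, just with fewer mantissa bits
```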
I think you could get it to work with 8-bit or maybe 4-bit quantization + fp32 as the accumulation type. Right now that's not a supported option in the conversion script, but it is easy to change: just set the dtype there: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L592
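A sketch of what that change amounts to (illustrative only; the function and variable names are assumptions, not the actual utils.py code):

```python
import mlx.core as mx

def cast_unquantized_weights(weights: dict) -> dict:
    # Keep the parameters that stay unquantized in float32 rather than float16,
    # so accumulation during inference happens in fp32.
    return {k: v.astype(mx.float32) for k, v in weights.items()}
```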
Otherwise I don't think we can do much for this model so I will close this issue as won't fix. Sorry for the not so helpful outcome.
@sacmehta I'm curious if these findings make sense to you? We're finding the larger OpenELM models don't reduce precision well. Usually models are trained with some L2 regularization to prevent weights from getting too large (so they quantize well). Maybe OpenELM was not trained that way / the regularization parameter is not high enough?
Could be useful to work on that for future versions so we can quantize the larger models.
Thanks for the update, this is a super helpful reference for the future!
In this case, all layers are affected. In the report they describe the technique as a "layer-wise scaling strategy":
https://arxiv.org/pdf/2404.14619
> In the report they describe the technique as a "layer-wise scaling strategy"
That's something different. That is the strategy used to determine the number of hidden units per layer.
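Roughly, a sketch of that idea (the interpolation bounds and layer count below are made up for illustration, not the paper's values):

```python
import numpy as np

num_layers = 28
# Layer-wise scaling: interpolate per-layer width parameters (e.g. the FFN
# multiplier) across depth instead of using one constant value for all layers.
ffn_multipliers = np.linspace(0.5, 4.0, num_layers)
print(ffn_multipliers.round(2))
```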
> We use a weight decay of 0.1 and gradient clipping of 1.0.
I see that in the paper, so there should be some weight decay. I don't know if the value is comparable to other models of the same size. Maybe there is something else going on.
My hypothesis is that the layer scaling is perhaps not effective for larger models, and that's what's breaking.
The only difference between the 1.1B, which works well when quantized, and the 3B is the number of layers and their scaling (ffn_multipliers, num_kv_heads, and num_q_heads).
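One way to check that difference directly (a sketch; the config attribute names are guesses at the fields in configuration_openelm.py and may not match exactly):

```python
from transformers import AutoConfig

small = AutoConfig.from_pretrained("apple/OpenELM-1_1B", trust_remote_code=True)
large = AutoConfig.from_pretrained("apple/OpenELM-3B", trust_remote_code=True)
for field in ("num_transformer_layers", "ffn_multipliers", "num_kv_heads", "num_query_heads"):
    # getattr with a default so a wrong field name prints None instead of raising
    print(field, getattr(small, field, None), getattr(large, field, None))
```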
> I don't know if the value is comparable to other models of the same size
It is; here are the values used for a model of comparable size.
Check section 4.2.