Comments (13)
Yes, I found the model. I don't yet know why it's not working, but there are certainly some differences between the GGUF quantized model and the generation code. Some examples:
- GGUF uses group size 32, 4 bits
- GGUF does not quantize the linear layer
- It's possible the sampling strategy matters (is ollama/llama.cpp using topk by default?)
- It's possible one needs a repetition penalty on the quantized model
Since I am able to get other quantized models working fine, and this model has nothing noticeably unusual in its sizes or config, my guess is that this is not a bug related to our quantization but rather a missing feature, like something from the list above.
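For reference, here is a rough sketch of how one might test the group-size difference from the first bullet. It assumes the mlx_lm Python API exposes `convert`, `load`, and `generate` with the arguments shown (worth double-checking against your installed version), and the output path is arbitrary:

```python
# Sketch: re-quantize Qwen1.5-1.8B-Chat with GGUF-matching settings
# (group size 32, 4 bits) instead of the mlx_lm defaults, then spot-check generation.
# The q_group_size/q_bits arguments are assumptions to verify against your mlx_lm version.
from mlx_lm import convert, load, generate

convert(
    "Qwen/Qwen1.5-1.8B-Chat",
    mlx_path="qwen1.5-1.8b-chat-gs32-4bit",
    quantize=True,
    q_group_size=32,  # GGUF q4_0 groups 32 weights per scale
    q_bits=4,
)

model, tokenizer = load("qwen1.5-1.8b-chat-gs32-4bit")
print(generate(model, tokenizer, prompt="hello", max_tokens=50))
```

If the smaller group size alone fixes the gibberish, that points at quantization granularity rather than the model code.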
There was an OOB read in our QMV kernel for the shapes in that QWEN model. This is fixed on main in MLX and I confirmed generating text with quantized 1.5B works quite well now :). Will be in the next release.
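Once that release is out, a quick way to re-check on your end (a sketch, assuming `mlx.core` exposes `__version__`; the repo name is the one linked later in this thread):

```python
# After upgrading mlx/mlx-lm, re-run a short generation with the 4-bit model
# to confirm the QMV out-of-bounds fix.
import mlx.core as mx
from mlx_lm import load, generate

print("mlx version:", mx.__version__)  # should be a release containing the fix

model, tokenizer = load("mlx-community/Qwen1.5-1.8B-Chat-4bit")
print(generate(model, tokenizer, prompt="hello", max_tokens=50))
```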
@awni Yes, qwen:1.8b-chat-v1.5-q4_0 is in GGUF format; here's more about the model: qwen:1.8b-chat-v1.5-q4_0.
I hadn't tried quantizing the Qwen 1.8B model before, but I haven't run into similar problems with the 4-bit quantized version of the Phi2 model; Phi2-4bit works fine.
I use mlx_lm.convert to quantize the Qwen/Qwen1.5-1.8B-Chat model using the following command:

python -m mlx_lm.convert --hf-path Qwen/Qwen1.5-1.8B-Chat \
    -q \
    --upload-repo madroid/Qwen1.5-1.8B-Chat-4bit-mlx

<frozen runpy>:128: RuntimeWarning: 'mlx_lm.convert' found in sys.modules after import of package 'mlx_lm', but prior to execution of 'mlx_lm.convert'; this may result in unpredictable behaviour
[INFO] Loading
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 78889.73it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO] Quantizing
model.safetensors: 100%|██████████| 1.48G/1.48G [01:24<00:00, 17.5MB/s]
When I use the quantized model for inference, it no longer works properly; the output is meaningless content, roughly as follows:
python mlx_qwen.py
Fetching 8 files: 100%|██████████| 8/8 [00:00<00:00, 96420.78it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: <|im_start|>user
hello<|im_end|>
<|im_start|>assistant

Hello! is 0! in Python, which stands for "halt all communication exchange on to communicate and transmit information and information. (hl):). 2!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
==========
Prompt: 185.258 tokens-per-sec
Generation: 105.760 tokens-per-sec
I used Ollama to check whether the Qwen/Qwen1.5-1.8B-Chat model itself has limitations with quantization, but found that it works well in Ollama. Here is the corresponding output:

❯ ollama run qwen:1.8b-chat-v1.5-q4_0
>>> hello
Hello! How can I help you today? Is there something specific that you would like to know or discuss? I'm here to listen and provide helpful information, so please feel free to ask me anything you'd like to know.

>>> Use Ctrl + d or /bye to exit.
>>> /bye
I don't know much about this part; how can I troubleshoot further? Is it the convert.py script that needs special compatibility support for qwen2?
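One way to narrow this down yourself (just a sketch, assuming mlx_lm's Python `load`/`generate` API; the local paths are hypothetical placeholders for your own conversions) is to convert once without `-q` and once with it, then run both on the same prompt:

```python
# Compare the unquantized and 4-bit conversions on the same prompt.
# Paths are hypothetical; point them at your own converted model directories.
from mlx_lm import load, generate

prompt = "<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n"

for path in ["./Qwen1.5-1.8B-Chat-fp16", "./Qwen1.5-1.8B-Chat-4bit"]:
    model, tokenizer = load(path)
    out = generate(model, tokenizer, prompt=prompt, max_tokens=50)
    print(f"--- {path} ---\n{out}\n")
```

If the fp16 conversion is coherent and only the 4-bit one degrades, the problem is in quantization (or the quantized kernels) rather than in the qwen2 model code used by convert.py.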
I have the same problem
python -m mlx_lm.generate \
--model ~/Documents/open/mlx-examples/lora/Qwen1.5-1.8B-Chat-4bit/ --prompt "你是谁?" --eos-token="<|endoftext|>" --trust-remote-code --max-tokens=100 --top-p=1.2 --use-default-chat-template
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
你是谁?<|im_end|>
<|im_start|>assistant
我是!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
==========
Prompt: 340.437 tokens-per-sec
Generation: 41.714 tokens-per-sec
model: https://huggingface.co/mlx-community/Qwen1.5-1.8B-Chat-4bit
code: https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm
When I use the quantized model for inference, it no longer works properly
Was this working for you before? Just wondering so we can narrow in on the issue.
Interestingly, converting the 7B works fine, so it seems like an issue with the smaller model. Small models tend to be more difficult to quantize in a stable way. I'm wondering why the GGUF one works so well though 🤔; my guess is they don't quantize everything, but I have to check.
python -m mlx_lm.convert --hf-path Qwen/Qwen1.5-7B-Chat
python -m mlx_lm.generate --model mlx_model --prompt "Write a quick sort in C++"
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Write a quick sort in C++<|im_end|>
<|im_start|>assistant
Sure! Here's a simple implementation of the Quick Sort algorithm in C++:
```cpp
#include <iostream>
using namespace std;
// Function to swap elements in an array
void swap(int* a, int* b) {
int t = *a;
*a = *b;
*b = t;
}
// Function to partition the array
int partition(int arr[], int low, int high) {
// Choose the last element as pivot
int pivot =
==========
Prompt: 59.032 tokens-per-sec
Generation: 32.800 tokens-per-sec
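If the guess above that the GGUF build doesn't quantize everything is right, one way to mimic that on the MLX side (a rough sketch, not something from this thread; it assumes `mlx.nn.quantize` accepts a `class_predicate(path, module)` callback and that `mlx_lm.load` can load the unquantized Hugging Face checkpoint) would be to skip the embedding and output head when quantizing:

```python
# Sketch: quantize only the transformer linear layers, leaving the token embedding
# and lm_head in higher precision, similar to what some GGUF q4_0 files are suspected
# to do. The class_predicate signature is an assumption to verify against your MLX version.
import mlx.nn as nn
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen1.5-1.8B-Chat")  # unquantized fp16 weights

def should_quantize(path, module):
    # Quantize Linear layers except the output head and the token embedding.
    return isinstance(module, nn.Linear) and "lm_head" not in path and "embed_tokens" not in path

nn.quantize(model, group_size=32, bits=4, class_predicate=should_quantize)
print(generate(model, tokenizer, prompt="hello", max_tokens=50))
```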
Actually, what model is this: qwen:1.8b-chat-v1.5-q4_0? Is it GGUF?
One more piece of information: I encountered the same problem when trying the Qwen1.5-72B-Chat model.
I also tried quantizing the Qwen1.5 4B model with MLX, and no tokens are generated when I run inference.
To add more information, I also tried a different model of about the same size (1.6B) separately, and it doesn't run into the problems I see with the Qwen1.5-1.8B model. Here are some specific results:
mlx-community/stablelm-2-zephyr-1_6b-4bit:
python -m mlx_lm.generate --model mlx-community/stablelm-2-zephyr-1_6b-4bit --max-tokens 150 --temp 0 --colorize --prompt "Write a quick sort in C++"
[screenshot: mlx_lm.generate output for mlx-community/stablelm-2-zephyr-1_6b-4bit]
mlx-community/stablelm-2-zephyr-1_6b:
python -m mlx_lm.generate --model mlx-community/stablelm-2-zephyr-1_6b --max-tokens 150 --temp 0 --colorize --prompt "Write a quick sort in C++"
[screenshot: mlx_lm.generate output for mlx-community/stablelm-2-zephyr-1_6b]
mlx-community/stablelm-2-zephyr-1_6b
That looks like a different model, not the qwen model?
mlx-community/stablelm-2-zephyr-1_6b
That looks like a different model, not the qwen model?
Yes, this is not a Qwen model. I'm simply providing information from before and after quantization of a similarly sized (1B or so) model; I'm not sure whether that information is helpful for the question at hand.
Looks like there may be a bug here. We are investigating and will report back.