Comments (13)
Yes, I found the model. I don't yet know why it's not working, but there are certainly some differences between the GGUF quantized model and the generation code. Some examples:
- GGUF uses group size 32, 4 bits
- GGUF does not quantize the linear layer
- It's possible the sampling strategy matters (is ollama/llama.cpp using topk by default?)
- It's possible one needs a repetition penalty on the quantized model
Since I am able to get other quantized models working fine, and this model has nothing noticeably unusual in its sizes or config, my guess is that this is not a bug related to our quantization but rather a missing feature, like something from the list above.
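For reference, here is a rough sketch of how one might test the group-size difference from the first bullet. It assumes the mlx_lm Python API exposes `convert`, `load`, and `generate` with the arguments shown (worth double-checking against your installed version), and the output path is arbitrary:

```python
# Sketch: re-quantize Qwen1.5-1.8B-Chat with GGUF-matching settings
# (group size 32, 4 bits) instead of the mlx_lm defaults, then spot-check generation.
# The q_group_size/q_bits arguments are assumptions to verify against your mlx_lm version.
from mlx_lm import convert, load, generate

convert(
    "Qwen/Qwen1.5-1.8B-Chat",
    mlx_path="qwen1.5-1.8b-chat-gs32-4bit",
    quantize=True,
    q_group_size=32,  # GGUF q4_0 groups 32 weights per scale
    q_bits=4,
)

model, tokenizer = load("qwen1.5-1.8b-chat-gs32-4bit")
print(generate(model, tokenizer, prompt="hello", max_tokens=50))
```

If the smaller group size alone fixes the gibberish, that points at quantization granularity rather than the model code.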
There was an OOB read in our QMV kernel for the shapes in that QWEN model. This is fixed on main in MLX and I confirmed generating text with quantized 1.5B works quite well now :). Will be in the next release.
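Once that release is out, a quick way to re-check on your end (a sketch, assuming `mlx.core` exposes `__version__`; the repo name is the one linked later in this thread):

```python
# After upgrading mlx/mlx-lm, re-run a short generation with the 4-bit model
# to confirm the QMV out-of-bounds fix.
import mlx.core as mx
from mlx_lm import load, generate

print("mlx version:", mx.__version__)  # should be a release containing the fix

model, tokenizer = load("mlx-community/Qwen1.5-1.8B-Chat-4bit")
print(generate(model, tokenizer, prompt="hello", max_tokens=50))
```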
@awni Yes, qwen:1.8b-chat-v1.5-q4_0 is in GGUF format; here's more about the model: qwen:1.8b-chat-v1.5-q4_0.
I hadn't tried quantizing the Qwen 1.8B model before, but I haven't run into similar problems with the 4-bit quantized version of the Phi2 model; Phi2-4bit works fine.
I use mlx_lm.convert to quantize the Qwen/Qwen1.5-1.8B-Chat model using the following command:

python -m mlx_lm.convert --hf-path Qwen/Qwen1.5-1.8B-Chat \
    -q \
    --upload-repo madroid/Qwen1.5-1.8B-Chat-4bit-mlx

<frozen runpy>:128: RuntimeWarning: 'mlx_lm.convert' found in sys.modules after import of package 'mlx_lm', but prior to execution of 'mlx_lm.convert'; this may result in unpredictable behaviour
[INFO] Loading
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 78889.73it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO] Quantizing
model.safetensors: 100%|██████████| 1.48G/1.48G [01:24<00:00, 17.5MB/s]
When I use the quantized model for inference, it no longer works properly; the output is meaningless content, roughly as follows:
python mlx_qwen.py
Fetching 8 files: 100%|██████████| 8/8 [00:00<00:00, 96420.78it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: <|im_start|>user
hello<|im_end|>
<|im_start|>assistant

Hello! is 0! in Python, which stands for "halt all communication exchange on to communicate and transmit information and information. (hl):). 2!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
==========
Prompt: 185.258 tokens-per-sec
Generation: 105.760 tokens-per-sec
I used Ollama to check whether the Qwen/Qwen1.5-1.8B-Chat model itself has limitations with quantization, but found that it works well in Ollama. Here is the corresponding output:

❯ ollama run qwen:1.8b-chat-v1.5-q4_0
>>> hello
Hello! How can I help you today? Is there something specific that you would like to know or discuss? I'm here to listen and provide helpful information, so please feel free to ask me anything you'd like to know.

>>> Use Ctrl + d or /bye to exit.
>>> /bye
I don't know much about this part; how can I troubleshoot further? Is it the convert.py script that needs special compatibility support for qwen2?
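One way to narrow this down yourself (just a sketch, assuming mlx_lm's Python `load`/`generate` API; the local paths are hypothetical placeholders for your own conversions) is to convert once without `-q` and once with it, then run both on the same prompt:

```python
# Compare the unquantized and 4-bit conversions on the same prompt.
# Paths are hypothetical; point them at your own converted model directories.
from mlx_lm import load, generate

prompt = "<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n"

for path in ["./Qwen1.5-1.8B-Chat-fp16", "./Qwen1.5-1.8B-Chat-4bit"]:
    model, tokenizer = load(path)
    out = generate(model, tokenizer, prompt=prompt, max_tokens=50)
    print(f"--- {path} ---\n{out}\n")
```

If the fp16 conversion is coherent and only the 4-bit one degrades, the problem is in quantization (or the quantized kernels) rather than in the qwen2 model code used by convert.py.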
I have the same problem
python -m mlx_lm.generate \
--model ~/Documents/open/mlx-examples/lora/Qwen1.5-1.8B-Chat-4bit/ --prompt "你是谁?" --eos-token="<|endoftext|>" --trust-remote-code --max-tokens=100 --top-p=1.2 --use-default-chat-template
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
你是谁?<|im_end|>
<|im_start|>assistant
我是!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
==========
Prompt: 340.437 tokens-per-sec
Generation: 41.714 tokens-per-sec
model: https://huggingface.co/mlx-community/Qwen1.5-1.8B-Chat-4bit
code: https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm
When I use the quantized model for inference, it no longer works properly
Was this working for you before? Just wondering so we can narrow in on the issue.
Interestingly, converting the 7B works fine, so it seems like an issue with the smaller model. Small models tend to be more difficult to quantize in a stable way. I'm wondering why the GGUF one works so well though 🤔; my guess is they don't quantize everything, but I have to check.
python -m mlx_lm.convert --hf-path Qwen/Qwen1.5-7B-Chat
python -m mlx_lm.generate --model mlx_model --prompt "Write a quick sort in C++"
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Write a quick sort in C++<|im_end|>
<|im_start|>assistant
Sure! Here's a simple implementation of the Quick Sort algorithm in C++:
```cpp
#include <iostream>
using namespace std;
// Function to swap elements in an array
void swap(int* a, int* b) {
int t = *a;
*a = *b;
*b = t;
}
// Function to partition the array
int partition(int arr[], int low, int high) {
// Choose the last element as pivot
int pivot =
==========
Prompt: 59.032 tokens-per-sec
Generation: 32.800 tokens-per-sec
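If the guess above that the GGUF build doesn't quantize everything is right, one way to mimic that on the MLX side (a rough sketch, not something from this thread; it assumes `mlx.nn.quantize` accepts a `class_predicate(path, module)` callback and that `mlx_lm.load` can load the unquantized Hugging Face checkpoint) would be to skip the embedding and output head when quantizing:

```python
# Sketch: quantize only the transformer linear layers, leaving the token embedding
# and lm_head in higher precision, similar to what some GGUF q4_0 files are suspected
# to do. The class_predicate signature is an assumption to verify against your MLX version.
import mlx.nn as nn
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen1.5-1.8B-Chat")  # unquantized fp16 weights

def should_quantize(path, module):
    # Quantize Linear layers except the output head and the token embedding.
    return isinstance(module, nn.Linear) and "lm_head" not in path and "embed_tokens" not in path

nn.quantize(model, group_size=32, bits=4, class_predicate=should_quantize)
print(generate(model, tokenizer, prompt="hello", max_tokens=50))
```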
Actually, what model is this: qwen:1.8b-chat-v1.5-q4_0? Is it GGUF?
One more piece of information: I encountered the same problem when trying the Qwen1.5-72B-Chat model.
I also tried quantizing the Qwen1.5 4B model with MLX, and no tokens are generated when I run inference.
To add more information, I also tried a different model of about the same size (1.6B) separately, and it doesn't run into the problems I see with the Qwen1.5-1.8B model. Here are some specific results:
mlx-community/stablelm-2-zephyr-1_6b-4bit:
python -m mlx_lm.generate --model mlx-community/stablelm-2-zephyr-1_6b-4bit --max-tokens 150 --temp 0 --colorize --prompt "Write a quick sort in C++"
[screenshot: mlx_lm.generate output for mlx-community/stablelm-2-zephyr-1_6b-4bit]
mlx-community/stablelm-2-zephyr-1_6b:
python -m mlx_lm.generate --model mlx-community/stablelm-2-zephyr-1_6b --max-tokens 150 --temp 0 --colorize --prompt "Write a quick sort in C++"
[screenshot: mlx_lm.generate output for mlx-community/stablelm-2-zephyr-1_6b]
mlx-community/stablelm-2-zephyr-1_6b
That looks like a different model, not the qwen model?
mlx-community/stablelm-2-zephyr-1_6b
That looks like a different model, not the qwen model?
Yes, this is not a Qwen model. I'm simply providing information from before and after quantization of a similarly sized (1B or so) model; I'm not sure whether that information is helpful for the question at hand.
Looks like there may be a bug here. We are investigating and will report back.