Comments (13)
I also want to add that after quantization and optimization the model remains the same size, even though the recipe specifies 8-bit quantization.
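For context, a back-of-the-envelope size check (my own sketch, not from sparseml; it assumes checkpoint size ≈ parameter count × bytes per parameter) shows what an 8-bit checkpoint of a 7B model would be expected to weigh, and why a quantized-but-unexported PyTorch checkpoint can stay the same size: the weights are typically still stored as floats alongside quantization parameters until export folds them into integers.

```python
# Back-of-the-envelope checkpoint sizes (assumption: size ~= params * bytes/param).
params = 7_000_000_000  # e.g. Zephyr-7B

print(f"float32: ~{params * 4 / 1e9:.0f} GB")  # ~28 GB
print(f"float16: ~{params * 2 / 1e9:.0f} GB")  # ~14 GB
print(f"int8:    ~{params * 1 / 1e9:.0f} GB")  # ~7 GB

# If the post-quantization checkpoint still matches the float16 number above,
# the weights are likely still stored as floats (with quantization metadata),
# and the size drop only materializes after the ONNX export.
```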
Hey @meomeomeome
Regarding your export issue, please use the following entrypoint for export:
`sparseml.export --task text-generation --model_path obcq_deployment`
Regarding the model size issue, could you paste an artifact here that illustrates the comparison? Perhaps some stdout from `du -sh *` or `tree`?
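If collecting shell output is inconvenient, a small Python stand-in for `du -sh *` (my own helper, not part of sparseml) works just as well for the comparison:

```python
import os

def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in GiB (rough `du -sh` equivalent)."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 ** 3

# Print the size of each directory in the current working directory.
for entry in sorted(os.listdir(".")):
    if os.path.isdir(entry):
        print(f"{entry}: {dir_size_gb(entry):.2f} GB")
```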
In your instructions at https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq, model preparation is done with this command:
`python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py HuggingFaceH4/zephyr-7b-beta open_platypus --recipe recipe.yaml --precision float16 --save True`
That is, we load the model in float16 format. Next comes the conversion script:
`python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment`
which is unable to perform half-precision operations on the CPU.
I studied the Python files `obcq/export.py` and `src/sparseml/pytorch/utils/exporter.py` from the library; there is an explicit loading of models onto the CPU.
Does your suggestion, `sparseml.export --task text-generation --model_path obcq_deployment`, solve the problem of exporting a model to ONNX in float16? Which library files are used for this, and how is the incompatibility of CPU operations with float16 resolved?
Judging by the model size, and from my experience with TinyLlama, the final size reduction only happens after the complete conversion to ONNX in the deployment folder.
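For what it's worth, the float16-on-CPU failure is reproducible outside of sparseml. Here is a minimal sketch (assuming a PyTorch build whose CPU kernels lack Half support for matmul-family ops, which matches "unable to perform half operations on the CPU" above); casting back to float32 is the usual workaround before a CPU-side export:

```python
import torch

linear = torch.nn.Linear(8, 8).half()  # float16 weights, on CPU
x = torch.randn(1, 8, dtype=torch.float16)

try:
    linear(x)  # on affected builds: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
except RuntimeError as err:
    print(err)

# Casting the module (and its inputs) back to float32 sidesteps the missing kernel.
linear.float()
print(linear(x.float()).dtype)  # torch.float32
```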
`sparseml.export --task text-generation --model_path obcq_deployment` doesn't accept a `--model_path` option. Also, I can't get to the end of the ONNX conversion: the process is killed at 83 GB of memory, even though the model's .bin files are only 15 GB.
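To put a number on the kill, one option is to watch the export's peak RSS from a small wrapper; here is a sketch using psutil (the exact `sparseml.export` invocation is an assumption, adjust it to your setup):

```python
import subprocess
import time

import psutil

# Launch the export as a child process (command assumed from this thread).
proc = subprocess.Popen(
    ["sparseml.export", "obcq_deployment", "--task", "text-generation"]
)
watcher = psutil.Process(proc.pid)

peak = 0
while proc.poll() is None:
    try:
        # Sum RSS of the export process and any workers it spawns.
        rss = watcher.memory_info().rss
        for child in watcher.children(recursive=True):
            rss += child.memory_info().rss
        peak = max(peak, rss)
    except psutil.NoSuchProcess:
        break
    time.sleep(0.5)

print(f"Peak RSS during export: ~{peak / 1024 ** 3:.1f} GB")
```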
Let me take a look; I'll come back to you shortly.
Hey @meomeomeome
A short update from my side. I tried to recreate your problem locally:
- I generated your `obcq_deployment` directory.
- I exported the model using `sparseml.export obcq_deployment --trust_remote_code --sequence_length 64 --task text-generation`. I confirm that the export takes a prohibitively large amount of CPU memory. However, by specifying the `--sequence_length {int}` argument you can potentially reduce your peak memory consumption. Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model. This is a big issue and something that we are currently working on.
- I was also able to reproduce the export error in `obcq/export.py` (`RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'`). I'm not sure why you are seeing this with that pathway. While we are looking into the issues, please note that this pathway will over time be deprecated in favor of `sparseml.export`.
> Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model

What do you mean? Will this slow down the export process, or will the model lose quality after exporting to ONNX? P.S. The base model's context window is 4096.
When running in our deepsparse pipeline, you will not be able to generate more than e.g. `64 - num_tokens(prompt)` tokens in a single inference. This will, however, reduce peak memory consumption as well as accelerate the export process.
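To make that budget concrete, here is a quick sketch (it assumes the exported deployment directory contains the tokenizer files; the path is taken from this thread):

```python
from transformers import AutoTokenizer

sequence_length = 64  # the value baked in at export time
tokenizer = AutoTokenizer.from_pretrained("obcq_deployment")

prompt = "How to make banana bread?"
prompt_tokens = len(tokenizer(prompt)["input_ids"])

# Remaining generation budget for a single inference at this sequence_length.
print(f"Max new tokens: {sequence_length - prompt_tokens}")
```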
> When running in our deepsparse pipeline, you will not be able to generate more than e.g. `64 - num_tokens(prompt)` tokens in a single inference.

Does this apply only to the pipeline, or to other inference methods as well?
```python
import psutil
import time

# Note: `model` is assumed to be a deepsparse TextGeneration pipeline created
# earlier, e.g.:
#   from deepsparse import TextGeneration
#   model = TextGeneration(model="<path to exported deployment dir>")

# Gather memory and CPU information
memory_usage = psutil.virtual_memory()
print(f"Total Memory: {memory_usage.total / (1024 ** 3)} GB")
print(f"Memory Used: {memory_usage.used / (1024 ** 3)} GB")

cpu_frequency = psutil.cpu_freq(percpu=True)
cpu_count = psutil.cpu_count(logical=False)
cpu_logical_count = psutil.cpu_count(logical=True)
cpu_model = None
with open("/proc/cpuinfo", "r") as f:
    for line in f:
        if "model name" in line:
            cpu_model = line.strip().split(":")[1].strip()
            break

# Print the hardware information
print(f"CPU Model: {cpu_model}")
print(f"Physical Cores: {cpu_count}")
print(f"Logical Cores (including hyperthreading): {cpu_logical_count}")
for i, freq in enumerate(cpu_frequency):
    print(f"Core {i}: {freq.current / 1000:.2f} GHz")
print(f"Total CPU Frequency: {psutil.cpu_freq().current / 1000:.2f} GHz")

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Measure before inference
memory_before = psutil.virtual_memory().used
start_time = time.time()
output = model(formatted_prompt, max_new_tokens=500).generations[0].text
end_time = time.time()

# Measure after inference
memory_after = psutil.virtual_memory().used
print(f"Inference Time: {end_time - start_time} seconds")
print(f"Memory Used During Inference: {(memory_after - memory_before) / (1024 ** 2)} MB")
```
Result and speed: the model is loaded via `from deepsparse import TextGeneration`; TinyLlama takes 1.19 GB in memory (converted with sequence_length 128) -- 19 seconds!!
```
Total Memory: 50.993690490722656 GB
Memory Used: 3.8117218017578125 GB
CPU Model: Intel(R) Xeon(R) CPU @ 2.20GHz
Physical Cores: 4
Logical Cores (including hyperthreading): 8
Core 0: 2.20 GHz
Core 1: 2.20 GHz
Core 2: 2.20 GHz
Core 3: 2.20 GHz
Core 4: 2.20 GHz
Core 5: 2.20 GHz
Core 6: 2.20 GHz
Core 7: 2.20 GHz
Total CPU Frequency: 2.20 GHz
Inference Time: 19.88378143310547 seconds
Memory Used During Inference: 2.390625 MB

Banana bread is a delicious and nutty bread that is easy to make. Here is a recipe for banana bread:
Ingredients:
1 1/2 cups flour
1/2 cup sugar
1/2 cup baking powder
1/2 cup whole milk
1/4 cup oil
1/4 cup eggs
1/4 cup raisins
1/4 cup raisin bread crumbs
1/4 cup pecans
Salt
Sugar
Bread
Flour
Water
Butter
Oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
[... the ingredient-list loop above repeats verbatim until the token limit ...]
```
And I have two questions:
1. Does the sequence_length limit apply only to the pipeline, or to other inference methods as well?
2. Which method gives the fastest inference? (I'm interested in loading the model from my disk and from memory.)
I do not understand the two questions; could you rephrase them, please?
I imagine that if you run the exported post-OBCQ ONNX model in the deepsparse pipeline (as you do above), setting a small sequence_length at export may break some models. This is because the sequence_length set during export influences the size of the positional embeddings available to the exported model. As a result, you may get unexpected errors. I see that you are getting satisfying results for your model, so maybe that is not the case for this particular network.
@mgoin Could you take a look? Is my hypothesis more or less correct?
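A toy illustration of this hypothesis (a simplification, not the actual exported graph): a positional-embedding table sized by the export-time sequence_length simply cannot index past it.

```python
import torch

sequence_length = 64
# Stand-in for a positional-embedding table frozen at export time.
pos_emb = torch.nn.Embedding(sequence_length, 16)

pos_emb(torch.arange(sequence_length))        # positions 0..63: fine
try:
    pos_emb(torch.tensor([sequence_length]))  # position 64: out of range
except IndexError as err:
    print(err)                                # "index out of range in self"
```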
As I understand it, no one knows how to solve the export problem without limiting the context window (`--sequence_length 64`). If you leave it the same as in the base model, exporting a model whose original size is 15 GB consumes the entire 83 GB of memory.
Does anyone know a methodology for solving this, e.g. via batch size or by processing the model in parts?
@meomeomeome
This is a known issue: exporting requires a lot of memory, depending on the sequence_length. We'll be noting it as a known issue in the pending 1.7 product release.
Hello @meomeomeome
A heads-up that 1.7 recently went out. We hope this addresses the issue you faced.
Thank you! Jeannie / Neural Magic