Comments (13)
I also want to add that after quantization and optimization the model remains the same size, even though the recipe specifies 8-bit quantization.
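For context, a back-of-the-envelope size check (my own sketch, not from sparseml; it assumes checkpoint size ≈ parameter count × bytes per parameter) shows what an 8-bit checkpoint of a 7B model would be expected to weigh, and why a quantized-but-unexported PyTorch checkpoint can stay the same size: the weights are typically still stored as floats alongside quantization parameters until export folds them into integers.

```python
# Back-of-the-envelope checkpoint sizes (assumption: size ~= params * bytes/param).
params = 7_000_000_000  # e.g. Zephyr-7B

print(f"float32: ~{params * 4 / 1e9:.0f} GB")  # ~28 GB
print(f"float16: ~{params * 2 / 1e9:.0f} GB")  # ~14 GB
print(f"int8:    ~{params * 1 / 1e9:.0f} GB")  # ~7 GB

# If the post-quantization checkpoint still matches the float16 number above,
# the weights are likely still stored as floats (with quantization metadata),
# and the size drop only materializes after the ONNX export.
```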
Hey @meomeomeome
Regarding your export issue, please use the following entrypoint for export:
`sparseml.export --task text-generation --model_path obcq_deployment`
Regarding the model size issue, could you paste an artifact here that illustrates the comparison? Perhaps some stdout from `du -sh *` or `tree`?
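If collecting shell output is inconvenient, a small Python stand-in for `du -sh *` (my own helper, not part of sparseml) works just as well for the comparison:

```python
import os

def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in GiB (rough `du -sh` equivalent)."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 ** 3

# Print the size of each directory in the current working directory.
for entry in sorted(os.listdir(".")):
    if os.path.isdir(entry):
        print(f"{entry}: {dir_size_gb(entry):.2f} GB")
```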
In your instructions at https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq, model preparation is done with this command:
`python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py HuggingFaceH4/zephyr-7b-beta open_platypus --recipe recipe.yaml --precision float16 --save True`
That is, we load the model in float16 format. Next comes the conversion script:
`python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment`
which is unable to perform half-precision operations on the CPU.
I studied the Python files `obcq/export.py` and `src/sparseml/pytorch/utils/exporter.py` from the library; there is an explicit loading of models onto the CPU.
Does your suggestion, `sparseml.export --task text-generation --model_path obcq_deployment`, solve the problem of exporting a model to ONNX in float16? Which library files are used for this, and how is the incompatibility of CPU operations with float16 resolved?
Judging by the model size, and from my experience with TinyLlama, the final size reduction only happens after the complete conversion to ONNX in the deployment folder.
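For what it's worth, the float16-on-CPU failure is reproducible outside of sparseml. Here is a minimal sketch (assuming a PyTorch build whose CPU kernels lack Half support for matmul-family ops, which matches "unable to perform half operations on the CPU" above); casting back to float32 is the usual workaround before a CPU-side export:

```python
import torch

linear = torch.nn.Linear(8, 8).half()  # float16 weights, on CPU
x = torch.randn(1, 8, dtype=torch.float16)

try:
    linear(x)  # on affected builds: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
except RuntimeError as err:
    print(err)

# Casting the module (and its inputs) back to float32 sidesteps the missing kernel.
linear.float()
print(linear(x.float()).dtype)  # torch.float32
```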
`sparseml.export --task text-generation --model_path obcq_deployment` doesn't accept a `--model_path` option. Also, I can't get to the end of the ONNX conversion: the process is killed at 83 GB of memory, even though the model's .bin files are only 15 GB.
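To put a number on the kill, one option is to watch the export's peak RSS from a small wrapper; here is a sketch using psutil (the exact `sparseml.export` invocation is an assumption, adjust it to your setup):

```python
import subprocess
import time

import psutil

# Launch the export as a child process (command assumed from this thread).
proc = subprocess.Popen(
    ["sparseml.export", "obcq_deployment", "--task", "text-generation"]
)
watcher = psutil.Process(proc.pid)

peak = 0
while proc.poll() is None:
    try:
        # Sum RSS of the export process and any workers it spawns.
        rss = watcher.memory_info().rss
        for child in watcher.children(recursive=True):
            rss += child.memory_info().rss
        peak = max(peak, rss)
    except psutil.NoSuchProcess:
        break
    time.sleep(0.5)

print(f"Peak RSS during export: ~{peak / 1024 ** 3:.1f} GB")
```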
Let me take a look; I'll come back to you shortly.
Hey @meomeomeome
A short update from my side. I tried to recreate your problem locally:
- I generated your `obcq_deployment` directory.
- I exported the model using `sparseml.export obcq_deployment --trust_remote_code --sequence_length 64 --task text-generation`. I confirm that the export takes a prohibitively large amount of CPU memory. However, by specifying the `--sequence_length {int}` argument you can potentially reduce your peak memory consumption. Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model. This is a big issue and something that we are currently working on.
- I was also able to reproduce the export error in `obcq/export.py` (`RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'`). I'm not sure why you are seeing this with that pathway. While we are looking into the issues, please note that this pathway will over time be deprecated in favor of `sparseml.export`.
> Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model

What do you mean? Will this slow down the export process, or will the model lose quality after exporting to ONNX? P.S. The base model's context window is 4096.
When running in our deepsparse pipeline, you will not be able to generate more than e.g. `64 - num_tokens(prompt)` tokens in a single inference. This will, however, reduce peak memory consumption as well as accelerate the export process.
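To make that budget concrete, here is a quick sketch (it assumes the exported deployment directory contains the tokenizer files; the path is taken from this thread):

```python
from transformers import AutoTokenizer

sequence_length = 64  # the value baked in at export time
tokenizer = AutoTokenizer.from_pretrained("obcq_deployment")

prompt = "How to make banana bread?"
prompt_tokens = len(tokenizer(prompt)["input_ids"])

# Remaining generation budget for a single inference at this sequence_length.
print(f"Max new tokens: {sequence_length - prompt_tokens}")
```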
> When running in our deepsparse pipeline, you will not be able to generate more than e.g. `64 - num_tokens(prompt)` tokens in a single inference.

Does this apply only to the pipeline, or to other inference methods as well?
```python
import psutil
import time

# Note: `model` is assumed to be a deepsparse TextGeneration pipeline created
# earlier, e.g.:
#   from deepsparse import TextGeneration
#   model = TextGeneration(model="<path to exported deployment dir>")

# Gather memory and CPU information
memory_usage = psutil.virtual_memory()
print(f"Total Memory: {memory_usage.total / (1024 ** 3)} GB")
print(f"Memory Used: {memory_usage.used / (1024 ** 3)} GB")

cpu_frequency = psutil.cpu_freq(percpu=True)
cpu_count = psutil.cpu_count(logical=False)
cpu_logical_count = psutil.cpu_count(logical=True)
cpu_model = None
with open("/proc/cpuinfo", "r") as f:
    for line in f:
        if "model name" in line:
            cpu_model = line.strip().split(":")[1].strip()
            break

# Print the hardware information
print(f"CPU Model: {cpu_model}")
print(f"Physical Cores: {cpu_count}")
print(f"Logical Cores (including hyperthreading): {cpu_logical_count}")
for i, freq in enumerate(cpu_frequency):
    print(f"Core {i}: {freq.current / 1000:.2f} GHz")
print(f"Total CPU Frequency: {psutil.cpu_freq().current / 1000:.2f} GHz")

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Measure before inference
memory_before = psutil.virtual_memory().used
start_time = time.time()
output = model(formatted_prompt, max_new_tokens=500).generations[0].text
end_time = time.time()

# Measure after inference
memory_after = psutil.virtual_memory().used
print(f"Inference Time: {end_time - start_time} seconds")
print(f"Memory Used During Inference: {(memory_after - memory_before) / (1024 ** 2)} MB")
```
Result and speed: the model is loaded via `from deepsparse import TextGeneration`; TinyLlama takes 1.19 GB in memory (converted with sequence_length 128) -- 19 seconds!!
```
Total Memory: 50.993690490722656 GB
Memory Used: 3.8117218017578125 GB
CPU Model: Intel(R) Xeon(R) CPU @ 2.20GHz
Physical Cores: 4
Logical Cores (including hyperthreading): 8
Core 0: 2.20 GHz
Core 1: 2.20 GHz
Core 2: 2.20 GHz
Core 3: 2.20 GHz
Core 4: 2.20 GHz
Core 5: 2.20 GHz
Core 6: 2.20 GHz
Core 7: 2.20 GHz
Total CPU Frequency: 2.20 GHz
Inference Time: 19.88378143310547 seconds
Memory Used During Inference: 2.390625 MB

Banana bread is a delicious and nutty bread that is easy to make. Here is a recipe for banana bread:
Ingredients:
1 1/2 cups flour
1/2 cup sugar
1/2 cup baking powder
1/2 cup whole milk
1/4 cup oil
1/4 cup eggs
1/4 cup raisins
1/4 cup raisin bread crumbs
1/4 cup pecans
Salt
Sugar
Bread
Flour
Water
Butter
Oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
[... the ingredient-list loop above repeats verbatim until the token limit ...]
```
And I have two questions:
1. Does the sequence_length limit apply only to the pipeline, or to other inference methods as well?
2. Which method gives the fastest inference? (I'm interested in loading the model from my disk and from memory.)
I do not understand the two questions; could you rephrase them, please?
I imagine that if you run the exported post-OBCQ ONNX model in the deepsparse pipeline (as you do above), setting a small sequence_length at export may break some models. This is because the sequence_length set during export influences the size of the positional embeddings available to the exported model. As a result, you may get unexpected errors. I see that you are getting satisfying results for your model, so maybe that is not the case for this particular network.
@mgoin Could you take a look? Is my hypothesis more or less correct?
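A toy illustration of this hypothesis (a simplification, not the actual exported graph): a positional-embedding table sized by the export-time sequence_length simply cannot index past it.

```python
import torch

sequence_length = 64
# Stand-in for a positional-embedding table frozen at export time.
pos_emb = torch.nn.Embedding(sequence_length, 16)

pos_emb(torch.arange(sequence_length))        # positions 0..63: fine
try:
    pos_emb(torch.tensor([sequence_length]))  # position 64: out of range
except IndexError as err:
    print(err)                                # "index out of range in self"
```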
As I understand it, no one knows how to solve the export problem without limiting the context window (`--sequence_length 64`). If you leave it the same as in the base model, exporting a model whose original size is 15 GB consumes the entire 83 GB of memory.
Does anyone know a methodology for solving this, e.g. via batch size or by processing the model in parts?
@meomeomeome
This is a known issue: exporting requires a lot of memory, depending on the sequence_length. We'll be noting it as a known issue in the pending 1.7 product release.
Hello @meomeomeome
A heads-up that 1.7 recently went out. We hope this addresses the issue you faced.
Thank you! Jeannie / Neural Magic