Comments (6)
Thank you @awni for the suggestion; I will try the library you pointed to and keep you posted!
from mlx-examples.
Hmm, maybe good to try with mlx_lm: https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md
Our original llama implementation was mostly tested with llama v1 7B so there could be some config that is off for the larger llama 2 models.
Hi @awni ,
Sorry for the wait. I tried running llama-2-70b-chat today (with weights downloaded from https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) with the mlx_lm library, and it worked. However, the latency is quite concerning; please refer to the screenshot attached below. For comparison, with llama.cpp I am getting 11.87 t/s for prompt evaluation and 4.78 t/s for generation. Given the strengths of mlx, I would expect it to be at least comparable in speed to llama.cpp?
(Screenshot attached: "Screenshot 2024-02-14 at 6 07 59 PM", showing the mlx_lm latency output.)
I will continue to investigate the bottleneck, just following up with you and documenting the results.
Thanks again for your time and assistance!
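For reference, throughput figures like the ones quoted above come straight from token counts and wall-clock time. A minimal sketch of that arithmetic (the token count and duration below are hypothetical placeholders, not measurements from this run):

```python
# Compute tokens-per-second throughput from a token count and a duration.
# The example numbers are hypothetical, chosen only so the result lands
# near the 11.87 t/s prompt-eval figure quoted for llama.cpp.

def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Throughput in tokens/s, guarding against a zero or negative duration."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return num_tokens / seconds

# E.g. 512 prompt tokens evaluated in 43.13 s -> ~11.87 t/s.
prompt_tps = tokens_per_second(512, 43.13)
print(f"prompt eval: {prompt_tps:.2f} t/s")
```

The same calculation applies to generation, using generated tokens and generation time.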
- Just to be clear, did you try the 7B or the 70B model?
- What machine are you running on? If it does not have a lot of RAM, you may need to use a quantized model to avoid swapping. Usually we are slower than llama.cpp, but not by nearly that much.
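A back-of-the-envelope estimate shows why quantization matters for a 70B model even on a large machine. This is a rough sketch: the parameter count and bits-per-weight are approximations, and activations, KV cache, and framework overhead are not counted.

```python
# Rough memory-footprint estimate for the weights of a ~70B-parameter model.
# 70e9 parameters is an approximation of Llama-2-70B; only weight storage
# is counted, not activations, KV cache, or framework overhead.

PARAMS = 70e9

def weight_gib(bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1024**3

fp16_gib = weight_gib(16)   # ~130 GiB: fits in 192 GB, but not by much
q4_gib = weight_gib(4.5)    # ~37 GiB assuming 4-bit weights plus scales
print(f"fp16: {fp16_gib:.0f} GiB, 4-bit: {q4_gib:.0f} GiB")
```

So fp16 weights alone consume roughly two-thirds of a 192 GB machine, which is why a quantized checkpoint is the usual recommendation when memory pressure is suspected.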
Ah, so sorry for the confusion, @awni: I accidentally pasted the link to the 7B HF model instead of the 70B one.
To answer your questions:
- I tried the 70b-chat model
- As mentioned in the first comment, I am using an M2 Ultra with 60 GPU cores and 192GB of memory. I believe this should be plenty to fit the original fp16 model? From my experiment, I do not see the process using swap memory. Please refer to the screenshots attached below.
This is my machine's swap memory usage before I started the process (by the way, I am pretty confused about why the machine is using swap when there is plenty of physical memory available):
This is how I am running the 70b-chat model and the observed memory consumption (according to htop)
This is when the process finishes, which shows the latency metrics as well as memory usage in normal times
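Instead of screenshots, swap usage on macOS can also be captured programmatically from `sysctl vm.swapusage`. A sketch that parses that output into numbers (the sample string below is illustrative, not from this machine):

```python
import re

def parse_swapusage(line: str) -> dict:
    """Parse `sysctl vm.swapusage` output into a dict of MB floats."""
    fields = re.findall(r"(\w+)\s*=\s*([\d.]+)M", line)
    return {name: float(value) for name, value in fields}

# Illustrative sample; run `sysctl vm.swapusage` to get real values.
sample = "vm.swapusage: total = 2048.00M  used = 1024.00M  free = 1024.00M"
usage = parse_swapusage(sample)
print(usage)  # {'total': 2048.0, 'used': 1024.0, 'free': 1024.0}
```

Sampling this before and during generation makes it easy to log whether `used` actually grows while the model runs.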
Thank you again for your time, and I will keep on digging!
Hi @awni ,
For your reference, I have recorded a video that shows the output from vmstat when running the model.
https://drive.google.com/file/d/1ze65_YsETJ5pS7Mz_YvVH4stMDNhkdI3/view?usp=sharing
From there, you can see that:
- page-ins spiked briefly when the process first started, but, if I am correct, this is just the model being loaded into memory
- after that, page in/out and swap in/out were mostly near 0
I believe the observations above indicate that swapping isn't an issue here.
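The same observation can be made without a screen recording by diffing two `vm_stat` snapshots taken before and after the run. A sketch of the parsing and diffing (the snapshot strings below are hypothetical, not taken from this machine):

```python
import re

def parse_vm_stat(text: str) -> dict:
    """Extract named integer counters from macOS `vm_stat` output."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r'"?([A-Za-z- ]+)"?:\s+(\d+)\.', line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2))
    return counters

# Hypothetical snapshots captured before and after loading the model.
before = "Pageins:                 1000.\nPageouts:                  10."
after = "Pageins:                 5000.\nPageouts:                  12."

before_c = parse_vm_stat(before)
after_c = parse_vm_stat(after)
delta = {k: after_c[k] - v for k, v in before_c.items()}
print(delta)  # {'Pageins': 4000, 'Pageouts': 2}
```

A large `Pageins` delta at startup with near-zero `Pageouts` afterwards matches the interpretation above: the spike is the model load, not ongoing swapping.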
Thank you again for your time.