Comments (6)
Thank you @awni for the suggestion; I will try the library you pointed to and keep you posted!
from mlx-examples.
Hmm, maybe good to try with mlx_lm: https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md
Our original llama implementation was mostly tested with llama v1 7B so there could be some config that is off for the larger llama 2 models.
Hi @awni ,
Sorry for the wait. I tried running llama-2-70b-chat today (with weights downloaded from https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) with the mlx_lm library, and it worked. However, the latency is quite concerning; please refer to the screenshot attached below. For comparison, with llama.cpp I am getting 11.87 t/s for prompt evaluation and 4.78 t/s for generation. Given the strengths of mlx, I would expect it to be at least comparable in speed to llama.cpp?
(Screenshot attached: "Screenshot 2024-02-14 at 6 07 59 PM", showing the mlx_lm latency output.)
I will continue to investigate the bottleneck, just following up with you and documenting the results.
Thanks again for your time and assistance!
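For reference, throughput figures like the ones quoted above come straight from token counts and wall-clock time. A minimal sketch of that arithmetic (the token count and duration below are hypothetical placeholders, not measurements from this run):

```python
# Compute tokens-per-second throughput from a token count and a duration.
# The example numbers are hypothetical, chosen only so the result lands
# near the 11.87 t/s prompt-eval figure quoted for llama.cpp.

def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Throughput in tokens/s, guarding against a zero or negative duration."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return num_tokens / seconds

# E.g. 512 prompt tokens evaluated in 43.13 s -> ~11.87 t/s.
prompt_tps = tokens_per_second(512, 43.13)
print(f"prompt eval: {prompt_tps:.2f} t/s")
```

The same calculation applies to generation, using generated tokens and generation time.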
- Just to be clear, did you try the 7B or the 70B model?
- What machine are you running on? If it does not have a lot of RAM, you may need to use a quantized model to avoid swapping. Usually we are slower than llama.cpp, but not by nearly that much.
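A back-of-the-envelope estimate shows why quantization matters for a 70B model even on a large machine. This is a rough sketch: the parameter count and bits-per-weight are approximations, and activations, KV cache, and framework overhead are not counted.

```python
# Rough memory-footprint estimate for the weights of a ~70B-parameter model.
# 70e9 parameters is an approximation of Llama-2-70B; only weight storage
# is counted, not activations, KV cache, or framework overhead.

PARAMS = 70e9

def weight_gib(bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1024**3

fp16_gib = weight_gib(16)   # ~130 GiB: fits in 192 GB, but not by much
q4_gib = weight_gib(4.5)    # ~37 GiB assuming 4-bit weights plus scales
print(f"fp16: {fp16_gib:.0f} GiB, 4-bit: {q4_gib:.0f} GiB")
```

So fp16 weights alone consume roughly two-thirds of a 192 GB machine, which is why a quantized checkpoint is the usual recommendation when memory pressure is suspected.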
Ah, so sorry for the confusion, @awni: I accidentally pasted the link to the 7B HF model instead of the 70B one.
To answer your questions:
- I tried the 70b-chat model
- As mentioned in the first comment, I am using an M2 Ultra with 60 GPU cores and 192GB of memory. I believe this should be plenty to fit the original fp16 model? From my experiment, I do not see the process using swap memory. Please refer to the screenshots attached below.
This is my machine's swap memory usage before I started the process (by the way, I am pretty confused about why the machine is using swap when there is plenty of physical memory available):
This is how I am running the 70b-chat model and the observed memory consumption (according to htop)
This is when the process finishes, which shows the latency metrics as well as memory usage in normal times
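Instead of screenshots, swap usage on macOS can also be captured programmatically from `sysctl vm.swapusage`. A sketch that parses that output into numbers (the sample string below is illustrative, not from this machine):

```python
import re

def parse_swapusage(line: str) -> dict:
    """Parse `sysctl vm.swapusage` output into a dict of MB floats."""
    fields = re.findall(r"(\w+)\s*=\s*([\d.]+)M", line)
    return {name: float(value) for name, value in fields}

# Illustrative sample; run `sysctl vm.swapusage` to get real values.
sample = "vm.swapusage: total = 2048.00M  used = 1024.00M  free = 1024.00M"
usage = parse_swapusage(sample)
print(usage)  # {'total': 2048.0, 'used': 1024.0, 'free': 1024.0}
```

Sampling this before and during generation makes it easy to log whether `used` actually grows while the model runs.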
Thank you again for your time, and I will keep on digging!
Hi @awni ,
For your reference, I have recorded a video that shows the output from vmstat when running the model.
https://drive.google.com/file/d/1ze65_YsETJ5pS7Mz_YvVH4stMDNhkdI3/view?usp=sharing
From there, you can see that:
- page-ins spiked briefly when the process first started, but, if I am correct, this is just the model being loaded into memory
- after that, page in/out and swap in/out were mostly near 0
I believe the observations above indicate that swapping isn't an issue here.
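The same observation can be made without a screen recording by diffing two `vm_stat` snapshots taken before and after the run. A sketch of the parsing and diffing (the snapshot strings below are hypothetical, not taken from this machine):

```python
import re

def parse_vm_stat(text: str) -> dict:
    """Extract named integer counters from macOS `vm_stat` output."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r'"?([A-Za-z- ]+)"?:\s+(\d+)\.', line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2))
    return counters

# Hypothetical snapshots captured before and after loading the model.
before = "Pageins:                 1000.\nPageouts:                  10."
after = "Pageins:                 5000.\nPageouts:                  12."

before_c = parse_vm_stat(before)
after_c = parse_vm_stat(after)
delta = {k: after_c[k] - v for k, v in before_c.items()}
print(delta)  # {'Pageins': 4000, 'Pageouts': 2}
```

A large `Pageins` delta at startup with near-zero `Pageouts` afterwards matches the interpretation above: the spike is the model load, not ongoing swapping.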
Thank you again for your time.