Comments (6)

muchi674 commented on July 22, 2024

Thank you @awni for the suggestion. I will try the suggested library out and keep you posted!

awni commented on July 22, 2024

Hmm, maybe good to try with mlx_lm: https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md

Our original llama implementation was mostly tested with llama v1 7B so there could be some config that is off for the larger llama 2 models.
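
For reference, a minimal sketch of generating text through mlx_lm's Python API as documented in that README (the model id and prompt here are illustrative, not from the thread):

```python
from mlx_lm import load, generate

# Load by Hugging Face repo id or local path; mlx_lm converts the
# weights to MLX format on the fly if needed.
model, tokenizer = load("meta-llama/Llama-2-70b-chat-hf")

# verbose=True prints prompt and generation tokens-per-second, which is
# the latency metric discussed below in this thread.
response = generate(model, tokenizer, prompt="Hello", max_tokens=100,
                    verbose=True)
print(response)
```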

muchi674 commented on July 22, 2024

> Hmm, maybe good to try with mlx_lm: https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md
>
> Our original llama implementation was mostly tested with llama v1 7B so there could be some config that is off for the larger llama 2 models.

Hi @awni,

Sorry for the wait. I tried running llama-2-70b-chat today (with weights downloaded from https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) with the mlx_lm library, and it worked. Nevertheless, the latency is quite concerning; please refer to the screenshot attached below. In comparison, with llama.cpp I am getting 11.87 t/s for prompt evaluation and 4.78 t/s for generation. Given the strength of mlx, shouldn't it at least be comparable in speed to llama.cpp?

[Screenshot: mlx_lm latency metrics, 2024-02-14 6:07 PM]

I will continue to investigate the bottleneck, just following up with you and documenting the results.

Thanks again for your time and assistance!

awni commented on July 22, 2024
  • Just to be clear, did you try the 7B or the 70B model?
  • What machine are you running on? If it does not have a lot of RAM, you may need to use a quantized model to avoid swapping. Usually we are slower than llama.cpp, but not by nearly that much.
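
A hedged sketch of what that quantization step might look like with mlx_lm's convert API (parameter names as I understand the library at the time; the output path is illustrative):

```python
from mlx_lm import convert

# Convert the HF checkpoint to MLX format and quantize it (4-bit by
# default), shrinking the 70B model's working set far below its fp16 size.
convert("meta-llama/Llama-2-70b-chat-hf",
        mlx_path="llama-2-70b-chat-4bit",  # illustrative output directory
        quantize=True)
```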

muchi674 commented on July 22, 2024
> • Just to be clear, did you try the 7B or the 70B model?
> • What machine are you running on? If it does not have a lot of RAM, you may need to use a quantized model to avoid swapping. Usually we are slower than llama.cpp, but not by nearly that much.

Ah, so sorry for the confusion, @awni. I accidentally pasted the link to the 7B HF model instead of the 70B one.

To answer your questions:

  • I tried the 70B-chat model
  • As mentioned in the first comment, I am using an M2 Ultra with 60 GPU cores and 192 GB of memory. I believe this should be plenty to fit the original fp16 model (see the rough estimate below). In my experiments I do not see the process using swap memory; please refer to the screenshots attached below
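
As a back-of-envelope sanity check (my own arithmetic, not a figure from the thread): 70B parameters at 2 bytes each in fp16 is roughly 130 GiB of weights, which leaves headroom inside 192 GB even before the KV cache:

```python
# Rough fp16 weight footprint of a 70B-parameter model (weights only;
# the KV cache and activations add more on top of this).
params = 70e9
bytes_per_param = 2                      # fp16
print(params * bytes_per_param / 2**30)  # ~130 GiB, under 192 GB unified memory
```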

This is my machine's swap usage before I started the process (by the way, I am pretty confused about why the machine uses swap when there is plenty of physical memory available):
[Screenshot: swap usage before starting the process, 2024-02-14 10:36 PM]

This is how I am running the 70B-chat model and the observed memory consumption (according to htop):
[Screenshot: run command and memory consumption in htop, 2024-02-14 10:34 PM]

This is after the process finishes, showing the latency metrics as well as memory usage at rest:
[Screenshot: final latency metrics and idle memory usage, 2024-02-14 10:35 PM]

Thank you again for your time, and I will keep on digging!

muchi674 commented on July 22, 2024

Hi @awni ,

For your reference, I have recorded a video that shows the output from vmstat when running the model.
https://drive.google.com/file/d/1ze65_YsETJ5pS7Mz_YvVH4stMDNhkdI3/view?usp=sharing

From there, you can see that:

  • page-ins spiked for a brief period when the process first started; if I am correct, this is just the model being loaded into memory
  • after that, page-ins/outs and swap-ins/outs were mostly near 0

I believe the observations above indicate that swapping isn't an issue here.
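
For anyone reproducing this check without recording a video, a minimal sketch (my own suggestion, not from the thread; on macOS the relevant built-in commands are vm_stat and sysctl vm.swapusage):

```python
import subprocess

# Snapshot macOS swap and paging counters. `sysctl vm.swapusage` reports
# total/used/free swap; `vm_stat` reports cumulative pageins/pageouts and
# swapins/swapouts. Run once before and once after generation, then compare.
for cmd in (["sysctl", "vm.swapusage"], ["vm_stat"]):
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```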

Thank you again for your time.
