Git Product home page Git Product logo

Comments (9)

awni avatar awni commented on June 29, 2024 4

Seems to be related to gradient checkpointing but that's all I know so far.. turning off gradient checkpointing and the peak memory is constant for a fixed sequence length as expected.

from mlx-examples.

mzbac avatar mzbac commented on June 29, 2024 4

@mzbac, a bit off topic and please let me know if I'm breaking some rules so I'll delete the comment.

But can you share an example of your dataset used for finetuning llama3 instruct? I am a bit confued by its template.

Thanks

@Satyam7166-tech You can check the data here at https://huggingface.co/datasets/mzbac/function-calling-llama-3-format-v1.1/viewer and feel free to create a discussion there if you have any questions about the dataset.

from mlx-examples.

mzbac avatar mzbac commented on June 29, 2024 4

@awni I can confirm that the issue is related to gradient checkpointing. With gradient checkpointing disabled, I trained the model for 8000 iterations and the peak memory is consistent and works as expected.

from mlx-examples.

awni avatar awni commented on June 29, 2024 3

Just FYI, the leak should be fixed now with gradient checkpointing enabled.

from mlx-examples.

mzbac avatar mzbac commented on June 29, 2024 2

Seems to be related to gradient checkpointing but that's all I know so far.. turning off gradient checkpointing and the peak memory is constant for a fixed sequence length as expected.

After disabling gradient checkpointing, the peak memory has stabilized. I will continue training for a few thousand iterations to see if everything is running as expected. Thanks @awni 🚀

from mlx-examples.

ivanfioravanti avatar ivanfioravanti commented on June 29, 2024 1

Thanks for the fix!

from mlx-examples.

awni avatar awni commented on June 29, 2024

Very strange .. seems like there is a leak of some sort.

from mlx-examples.

mzbac avatar mzbac commented on June 29, 2024

and the crash logs:

Iter 7320: Train loss 0.216, Learning Rate 1.000e-06, It/sec 0.116, Tokens/sec 40.214, Trained Tokens 3893259, Peak mem 342.467 GB
libc++abi: terminating due to uncaught exception of type std::runtime_error: [malloc_or_wait] Unable to allocate 554319936 bytes.
zsh: abort      mlx_lm.lora --config lora_config.yaml

from mlx-examples.

Satyam7166-tech avatar Satyam7166-tech commented on June 29, 2024

@mzbac, a bit off topic and please let me know if I'm breaking some rules so I'll delete the comment.

But can you share an example of your dataset used for finetuning llama3 instruct? I am a bit confued by its template.

Thanks

from mlx-examples.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.