
Comments (6)

davidjoffe commented on June 20, 2024

Thank you!

I opened an issue for this on mlx (ml-explore/mlx#63) and created this pull request: ml-explore/mlx#64. I hope it's accepted, though of course it's up to the mlx maintainers.


angeloskath commented on June 20, 2024

May I say, awesome work :-)

We need to start adding more 8GB Macs to our testing pool. There really wasn't a particular reason for the 1.5. It just gets significantly slower at that point, and we wanted to avoid freezing the system.

I would encourage you to at least open an issue, or maybe a PR, at the main repo. I can't tell you that it will be the first one merged, but we'll certainly test the implications of increasing this limit, and if there are none for most use cases, then merge it.

Thanks for investigating! Feel free to close the issue and link to it from the PR.


angeloskath commented on June 20, 2024

Hm, unfortunately that means it ran out of memory. We haven't tested it on a Mac mini with 8GB of RAM. I am using it on my M2 Air, but it has 24GB. Can you try removing CFG by setting --cfg 0?


davidjoffe commented on June 20, 2024

Hm, thanks ... adding --cfg 0 unfortunately makes no difference.

Even if it's running out of memory, it seems odd: why wouldn't it just use the system pagefile/swap for such a relatively small amount, ~134MB? In 20+ years of dev I've never seen malloc() fail from high memory load; it normally just allocates and uses swap. These 8GB Macs regularly swap like crazy but keep going, and judging from Activity Monitor I often use much, MUCH more memory on these Macs than this appears to be using.
I've closed everything else as well.

Or does this have something to do with the unified memory architecture? Or memory fragmentation?

It runs smoothly through the default 50 steps; it only fails at the end. I'm watching memory load as this runs, and it's in the green most of the way, using at most 2GB of swap until it completes the 50 steps. With only 2GB of swap in use, that doesn't seem like such an excessive memory load that a malloc of just ~134MB should fail; that's very light memory load:

(foo) (base) david@Davids-Mac-mini stable_diffusion % python3 txt2image.py "A cat on a sandwich" --cfg 0 --n_images 1 --n_rows 1
/Users/david/mlx/foo/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [02:37<00:00,  3.14s/it]
  0%|                                                                                                           | 0/1 [00:00<?, ?it/s]libc++abi: terminating due to uncaught exception of type std::runtime_error: [malloc_or_wait] Unable to allocate 134217728 bytes.
zsh: abort      python3 txt2image.py "A cat on a sandwich" --cfg 0 --n_images 1 --n_rows 1


davidjoffe commented on June 20, 2024

I did some more tests and ran it again with --steps 2: the maximum memory load the system hits is about 7.5GB, and the entire system is using ONLY ~500MB of swap this time (that's nothing). I don't see why a malloc of just ~134MB should fail under those conditions, or why it shouldn't just use swap.

I upgraded to Sonoma; same thing.

Edit 2: FWIW, the system report looks like this (below):

Edit 3: Looking at the source, it looks like this internally calls/uses metal::allocator?

Edit 4: Is it possible the line 'block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}' is where the issue stems from? If I have time later I may try rebuilding with that line changed and see if it helps (a sketch of the mechanism this would imply follows the crash report below).


Crashed Thread:        2

Application Specific Information:
abort() called


Thread 0::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	       0x18fc680ac __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x18fca55fc _pthread_cond_wait + 1228
2   libc++.1.dylib                	       0x18fbd04dc std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   libc++.1.dylib                	       0x18fbd0fec std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock<std::__1::mutex>&) + 56
4   libc++.1.dylib                	       0x18fbd1090 std::__1::__assoc_sub_state::wait() + 56
5   libmlx.dylib                  	       0x101d0e0e8 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 2756
...

Thread 2 Crashed:
0   libsystem_kernel.dylib        	       0x18fc6d11c __pthread_kill + 8
1   libsystem_pthread.dylib       	       0x18fca4cc0 pthread_kill + 288
2   libsystem_c.dylib             	       0x18fbb4a40 abort + 180
3   libc++abi.dylib               	       0x18fc5c6d8 abort_message + 132
4   libc++abi.dylib               	       0x18fc4c7ac demangling_terminate_handler() + 320
5   libobjc.A.dylib               	       0x18f8f78a4 _objc_terminate() + 160
6   libc++abi.dylib               	       0x18fc5ba9c std::__terminate(void (*)()) + 16
7   libc++abi.dylib               	       0x18fc5ea48 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
8   libc++abi.dylib               	       0x18fc5e9f4 __cxa_throw + 140
9   libmlx.dylib                  	       0x101cc0c30 mlx::core::allocator::malloc_or_wait(unsigned long) + 344
10  libmlx.dylib                  	       0x1023a9d8c mlx::core::(anonymous namespace)::binary_op(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 380
11  libmlx.dylib                  	       0x1023a9bc4 mlx::core::Add::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 56
12  libmlx.dylib                  	       0x1023a854c std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2>, void ()>::operator()() + 148
13  libmlx.dylib                  	       0x101d0bf14 mlx::core::scheduler::StreamThread::thread_fn() + 500
14  libmlx.dylib                  	       0x101d0c0d0 void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 72
15  libsystem_pthread.dylib       	       0x18fca5034 _pthread_start + 1
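
To illustrate the mechanism Edit 4 above suspects, here is a minimal sketch, not the actual MLX source (the class, the ~5 GiB working-set figure, and the usage numbers are assumptions), of how a hard block limit tied to recommendedMaxWorkingSetSize() can reject a ~128 MiB (134217728-byte) request no matter how much swap the OS has free:

```cpp
// Simplified illustration only -- not the actual MLX source; the class and
// the numbers below are assumptions modelled on the stack trace and on the
// block_limit_ line quoted in Edit 4. The point: if the GPU allocator
// enforces a hard byte budget derived from recommendedMaxWorkingSetSize()
// and throws once it is exceeded, a ~128 MiB request can fail even while
// the OS still has swap to spare, because swap never enters the check.
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

class ToyMetalAllocator {
 public:
  // Analogous to block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()).
  explicit ToyMetalAllocator(std::size_t block_limit) : block_limit_(block_limit) {}

  // Bookkeeping only (no real memory is touched): refuse the request by
  // throwing once the running total would exceed the hard limit, rather
  // than letting the OS page something out.
  void malloc_or_wait(std::size_t size) {
    if (allocated_ + size > block_limit_) {
      throw std::runtime_error("[malloc_or_wait] Unable to allocate " +
                               std::to_string(size) + " bytes.");
    }
    allocated_ += size;
  }

 private:
  std::size_t block_limit_;
  std::size_t allocated_ = 0;
};

int main() {
  const std::size_t GiB = 1ull << 30;
  const std::size_t MiB = 1ull << 20;

  // Hypothetical figure: assume recommendedMaxWorkingSetSize() reports ~5 GiB
  // on an 8GB machine, so the budget is 1.5 * 5 GiB = 7.5 GiB.
  ToyMetalAllocator alloc(static_cast<std::size_t>(1.5 * 5 * GiB));

  alloc.malloc_or_wait(7 * GiB + 400 * MiB);  // earlier steps consume most of the budget
  try {
    alloc.malloc_or_wait(128 * MiB);          // this small request pushes past the limit
  } catch (const std::runtime_error& e) {
    std::cerr << e.what() << '\n';  // "[malloc_or_wait] Unable to allocate 134217728 bytes."
  }
}
```

Under a scheme like this, the pagefile never enters the check at all, which would match the observation that swap usage was light at the moment the allocation failed.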


davidjoffe commented on June 20, 2024

New information, and good news: I found the cause. I forked mlx itself, changed one line of code, rebuilt from my source fork, and now it works 😊

I changed this line of code in MetalAllocator::MetalAllocator() in mlx/backend/metal/allocator.cpp to a much higher limit; the 1.5 factor seems maybe a bit conservative for low-RAM Macs:

block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}
https://github.com/davidjoffe/mlx/blob/main/mlx/backend/metal/allocator.cpp

There is indeed a relatively big spike of memory usage as it finishes the steps, but it's not world-ending; I'd rather have 'something that works' even if it spikes my swap. I suppose it's debatable how best to handle that in the long run as a general solution for all users: just warn, or maybe give more options to control how much memory to use or how to handle low memory (one hypothetical sketch of that last option follows below).
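
As one purely hypothetical illustration of the "more options" idea (the environment variable name and helper below are invented, not anything that exists in mlx):

```cpp
// Hypothetical sketch only -- the environment variable and helper below are
// invented for illustration and are not part of MLX. One way "more options"
// could look: let users scale the working-set multiplier instead of
// hard-coding 1.5, keeping the current value as the default.
#include <cstdlib>
#include <exception>
#include <iostream>
#include <string>

// Returns the multiplier to apply to recommendedMaxWorkingSetSize().
// Falls back to the existing hard-coded 1.5 when the (invented)
// MLX_METAL_BLOCK_LIMIT_FACTOR variable is unset or unparsable.
double block_limit_factor() {
  constexpr double kDefault = 1.5;  // current behaviour
  const char* env = std::getenv("MLX_METAL_BLOCK_LIMIT_FACTOR");
  if (env == nullptr) return kDefault;
  try {
    const double factor = std::stod(env);
    return factor > 0.0 ? factor : kDefault;
  } catch (const std::exception&) {
    return kDefault;
  }
}

int main() {
  // The constructor quoted above would then read something like:
  //   block_limit_(block_limit_factor() * device_->recommendedMaxWorkingSetSize()) {}
  std::cout << "block limit factor: " << block_limit_factor() << '\n';
}
```

Something along these lines would keep the current default for machines with plenty of RAM while letting 8GB users opt in to a higher ceiling without rebuilding from source.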

That means, though, that the issue is not really in mlx-examples but in mlx itself.

Should I try submitting a PR for mlx? I'm not sure whether someone had a 'very good reason' for the particular choice of 1.5 as a hard factor here.

