Comments (6)
Thank you!
I opened an issue for this on mlx: ml-explore/mlx#63 and created this Pull Request: ml-explore/mlx#64 - hope it's accepted, though of course it's up to the mlx maintainers.
from mlx-examples.
May I say, awesome work :-)
We need to start adding more 8GB Macs to our testing pool. There really wasn't a particular reason for the 1.5. Things just get significantly slower past that point, and we wanted to avoid freezing the system.
I'd encourage you to at least open an issue, or maybe a PR, on the main repo. I can't promise it will be the first one merged, but we'll certainly test the implications of increasing this limit and, if there are none for most use cases, merge it.
Thanks for investigating! Feel free to close the issue and link to it from the PR.
Hm, unfortunately that means it ran out of memory. We haven't tested it on a Mac mini with 8GB of RAM. I'm using it on my M2 Air, but that has 24GB. Can you try removing CFG by setting --cfg 0?
Hm, thanks .. adding --cfg 0 unfortunately makes no difference.
Even if it's running out of memory, it seems odd: why wouldn't it just use the system pagefile/swap for such a relatively small amount like ~134MB? In 20+ years of development I've never seen malloc() fail under high memory load; it normally just allocates and uses swap. These 8GB Macs regularly swap like crazy but keep going, and judging by Activity Monitor I often use much, MUCH more memory on them than this appears to be using.
I've closed everything else also.
Or does this have something to do with the unified memory architecture? Or memory fragmentation?
It runs smoothly through the default 50 steps - it just fails at the end. I'm watching memory load as this runs and it's in the green most of the way, using at most 2GB of swap until it completes the 50 steps. With only 2GB of swap in use, it doesn't seem like such an excessive memory load that a malloc of just ~134MB should fail; that's very, very light memory pressure:
(foo) (base) david@Davids-Mac-mini stable_diffusion % python3 txt2image.py "A cat on a sandwich" --cfg 0 --n_images 1 --n_rows 1
/Users/david/mlx/foo/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
warnings.warn(
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [02:37<00:00, 3.14s/it]
0%| | 0/1 [00:00<?, ?it/s]libc++abi: terminating due to uncaught exception of type std::runtime_error: [malloc_or_wait] Unable to allocate 134217728 bytes.
zsh: abort python3 txt2image.py "A cat on a sandwich" --cfg 0 --n_images 1 --n_rows 1
I did some more tests and ran it again with --steps 2. The maximum memory load the system hits is about 7.5GB, and the entire system is using only ~500MB of swap this time (that's nothing). I don't see why a malloc of just ~134MB should fail under those conditions, or why it shouldn't just use swap.
I upgraded to Sonoma - same thing.
Edit 2: FWIW, the crash report looks like this:
Edit 3: Looking at the source, it looks like this internally calls/uses metal::allocator?
Edit 4: Is it possible the line 'block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}' is where the issue stems from? If I have time later I may try rebuilding with that line changed to see if it helps.
Crashed Thread: 2
Application Specific Information:
abort() called
Thread 0:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x18fc680ac __psynch_cvwait + 8
1 libsystem_pthread.dylib 0x18fca55fc _pthread_cond_wait + 1228
2 libc++.1.dylib 0x18fbd04dc std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3 libc++.1.dylib 0x18fbd0fec std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock<std::__1::mutex>&) + 56
4 libc++.1.dylib 0x18fbd1090 std::__1::__assoc_sub_state::wait() + 56
5 libmlx.dylib 0x101d0e0e8 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 2756
...
Thread 2 Crashed:
0 libsystem_kernel.dylib 0x18fc6d11c __pthread_kill + 8
1 libsystem_pthread.dylib 0x18fca4cc0 pthread_kill + 288
2 libsystem_c.dylib 0x18fbb4a40 abort + 180
3 libc++abi.dylib 0x18fc5c6d8 abort_message + 132
4 libc++abi.dylib 0x18fc4c7ac demangling_terminate_handler() + 320
5 libobjc.A.dylib 0x18f8f78a4 _objc_terminate() + 160
6 libc++abi.dylib 0x18fc5ba9c std::__terminate(void (*)()) + 16
7 libc++abi.dylib 0x18fc5ea48 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
8 libc++abi.dylib 0x18fc5e9f4 __cxa_throw + 140
9 libmlx.dylib 0x101cc0c30 mlx::core::allocator::malloc_or_wait(unsigned long) + 344
10 libmlx.dylib 0x1023a9d8c mlx::core::(anonymous namespace)::binary_op(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 380
11 libmlx.dylib 0x1023a9bc4 mlx::core::Add::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 56
12 libmlx.dylib 0x1023a854c std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2>, void ()>::operator()() + 148
13 libmlx.dylib 0x101d0bf14 mlx::core::scheduler::StreamThread::thread_fn() + 500
14 libmlx.dylib 0x101d0c0d0 void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 72
15 libsystem_pthread.dylib 0x18fca5034 _pthread_start + 1
New information, and good news, as I found the cause: I forked mlx itself, changed one line of code, rebuilt from my fork, and now it works 😊
I changed this line of code in MetalAllocator::MetalAllocator() in mlx/backend/metal/allocator.cpp to use a much higher limit - the 1.5 factor seems maybe a bit conservative for low-RAM Macs:
block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}
https://github.com/davidjoffe/mlx/blob/main/mlx/backend/metal/allocator.cpp
There is indeed a relatively big spike in memory usage as it finishes the steps, but nothing world-ending. I'd rather have something that works, even if it spikes my swap. I suppose it's debatable how best to handle this in the long run as a general solution for all users - just warn, or give more options to control how much memory to use or how to handle low memory.
That means the issue is probably not in mlx-examples but really in mlx itself.
Should I try submitting a PR for mlx? I'm not sure whether someone had a very good reason for the particular choice of 1.5 as a hard factor here.