Comments (11)
I'm not opposed to supporting bfloat16, but I would not want to make it the default:
- float16 is still considerably faster since it has native hardware support. The benchmarks in #633 don't tell the whole story: I see 42 TPS with float16 vs 32 with bfloat16.
- The precision loss from bfloat16 -> float16 is nothing compared to float16 -> 4-bit quantization, so the impact of using bfloat16 is minor from a precision standpoint. Although maybe you have some case in mind that would be easier in bfloat16?
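For anyone who wants to check the speed gap on their own machine, here is a minimal micro-benchmark sketch (the shapes and iteration count are arbitrary assumptions, not the setup behind the numbers above):

```python
import time
import mlx.core as mx

def bench(dtype, n=4096, iters=20):
    a = mx.random.normal((n, n)).astype(dtype)
    b = mx.random.normal((n, n)).astype(dtype)
    mx.eval(a, b)  # materialize inputs before timing
    start = time.perf_counter()
    for _ in range(iters):
        # mlx is lazy; mx.eval forces the matmul to actually run
        mx.eval(a @ b)
    return (time.perf_counter() - start) / iters

print("float16: ", bench(mx.float16))
print("bfloat16:", bench(mx.bfloat16))
```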
I tried building mlx from the latest master branch, but I didn't notice the significant performance improvement stated in #663. I was thinking that might be because we quantized the model in float16. If that's not the case, I am happy to close this issue.
Not sure I follow your comment. But just in case: #663 really only sped up bfloat16 quantized models, of which there are not very many. I believe this one is one of the few examples. There should be no change for float16 and float32 quantized models.
Okay, I will try quantizing a bfloat16 model to see the performance difference. Sorry, my previous comment was saying that I tried converting models with the current convert.py but didn't see much improvement, and I noticed that we are doing float16 quantization. So I raised this issue to see if we can enable bfloat16 quantization.
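A sketch of what that conversion could look like with the mlx_lm Python API, assuming convert() exposes a dtype argument (check your mlx_lm version; an older convert.py may hard-code float16, in which case this needs an upstream change):

```python
from mlx_lm import convert

# Hedged sketch: model name and output path are just examples.
convert(
    hf_path="mistralai/Mistral-7B-Instruct-v0.2",
    mlx_path="mlx_model_bf16_4bit",
    quantize=True,
    q_bits=4,
    dtype="bfloat16",  # keep bfloat16 weights going into 4-bit quantization
)
```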
It does look like our bfloat16 scans are commented out... not sure why that is.
Oh I see, that is the cumsum from the top-k sampling. That will be an issue for bfloat16. One workaround, until we figure out why there is no scan for bfloat16 (and possibly implement it), is to cast the logits to float32 before the sampling step if they are bfloat16.
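A minimal sketch of that workaround (an illustrative top-p sampler, not the actual mlx_lm sampling code; the function name and defaults are made up):

```python
import mlx.core as mx

def sample_top_p(logits: mx.array, top_p: float = 0.9, temperature: float = 1.0) -> mx.array:
    # Workaround: the Metal scan kernel behind cumsum is not built for
    # bfloat16, so widen the logits to float32 before sampling.
    if logits.dtype == mx.bfloat16:
        logits = logits.astype(mx.float32)
    probs = mx.softmax(logits / temperature, axis=-1)
    # Sort probabilities in descending order and take the running sum;
    # this cumsum is the scan that fails on bfloat16.
    sorted_idx = mx.argsort(-probs, axis=-1)
    sorted_probs = mx.take_along_axis(probs, sorted_idx, axis=-1)
    cumulative = mx.cumsum(sorted_probs, axis=-1)
    # Keep the smallest set of tokens whose mass covers top_p, then sample.
    inside_nucleus = (cumulative - sorted_probs) < top_p
    masked = mx.where(inside_nucleus, sorted_probs, 0.0)
    choice = mx.random.categorical(mx.log(masked))
    return mx.take_along_axis(sorted_idx, choice[..., None], axis=-1)
```

Casting one vocab-sized logits vector up to float32 per step should be cheap relative to the rest of decoding, so the workaround shouldn't cost measurable throughput.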
@awni Somehow I am getting an error for a bfloat16 quantized model (M2 Ultra) with mlx's master build:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [metal::Device] Unable to load kernel contiguous_scan_inclusive_sum_bfloat16_bfloat16
Can you share the command you ran? I didn't realize there was a scan in the LLM code...
python -m mlx_lm.server --model <path_to_bfloat16_quant_model>
It also happened with mlx-community/Mistral-7B-Instruct-v0.2-4bit-mlx. I double-checked that it only happens with mlx_lm.server; mlx_lm.generate works fine.
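For what it's worth, the failure can probably be reproduced without the server at all, since the kernel name in the error points at a bfloat16 cumsum (a hedged guess based on the error message, not a confirmed diagnosis):

```python
import mlx.core as mx

x = mx.array([0.1, 0.2, 0.3], dtype=mx.bfloat16)

# Expected to hit the same missing Metal kernel
# (contiguous_scan_inclusive_sum_bfloat16_bfloat16); the server crash
# above was a hard terminate, so this may abort rather than raise.
try:
    mx.eval(mx.cumsum(x))
except RuntimeError as e:
    print("bfloat16 cumsum failed:", e)

# The float32 cast workaround:
mx.eval(mx.cumsum(x.astype(mx.float32)))
```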
@angeloskath or @jagrit06, do you know why bfloat16 is not supported in the scans? Is that because there are no SIMD reductions for bfloat16?
@angeloskath might have the best insight for scans in particular, but I think it came down to the exclusive and inclusive reduction primitives for the scans not being there for bfloat16. I'm not sure if we ended up making the same workarounds available for it as we did for the other SIMD reductions.