
mlx's People

Contributors

a1eaiactaest, abeleinin, angeloskath, arn4, atomicvar, avikantsrivastava, awni, chunyang-wen, dastrobu, davidkoski, dc-dc-dc, eltociear, gboduljak, hazemessamm, jagrit06, jbochi, jdeschena, jmousseau, jyun1998, m0saan, madrob, manisharadwad, nicolov, noahfarr, nripeshn, president810, rifur13, vj-krish, zachschillaci27, zcbenz


mlx's Issues

What is the Expected Inference Performance

I am running the Llama/Mistral inference examples on my M1 Pro with 16GB of memory and getting around 80 sec/token.

  • Does the framework support FP16?
  • GPU usage seems low, do I need to do something to use the Metal GPU?
  • mx.default_device reports Device(gpu, 0)
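A quick sanity check I've been using (assuming mx.set_default_device and astype behave as documented) to confirm that float16 arrays actually run on the Metal GPU:

import mlx.core as mx

mx.set_default_device(mx.gpu)                            # force ops onto the Metal GPU
w = mx.random.normal((4096, 4096)).astype(mx.float16)    # fp16 is a supported dtype
x = mx.random.normal((4096, 4096)).astype(mx.float16)
y = w @ x
mx.eval(y)                                               # force evaluation so the GPU does the work
print(y.dtype, mx.default_device())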

Upsample support

I would like to request the addition of native support for the Upsample operation in MLX. Currently, the absence of Upsample functionality limits the flexibility of certain tasks that require resizing or upsampling of data (like UNet definition).

In the meantime, does anyone have alternative methods for emulating the upsample functionality?

Thank you!
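In case it helps others in the meantime, here is a minimal sketch of nearest-neighbor upsampling built only from ops that already exist (expand_dims, broadcast_to, reshape); it assumes NHWC layout and an integer scale factor:

import mlx.core as mx

def upsample_nearest(x, scale=2):
    # x: (B, H, W, C) -> (B, H*scale, W*scale, C) by repeating each pixel
    B, H, W, C = x.shape
    x = mx.expand_dims(mx.expand_dims(x, 2), 4)           # (B, H, 1, W, 1, C)
    x = mx.broadcast_to(x, (B, H, scale, W, scale, C))    # repeat along the new axes
    return x.reshape(B, H * scale, W * scale, C)

x = mx.arange(4).reshape(1, 2, 2, 1)
print(mx.squeeze(upsample_nearest(x, 2)))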

random.uniform with dtype float16 returns all 0s

random.uniform returns all 0s when running on either the CPU or GPU and dtype is set to float16.

To reproduce

import mlx.core as mx
mx.random.uniform(shape=[2,2], dtype=mx.float16)

array([[0, 0],
       [0, 0]], dtype=float16)
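As a stopgap (not a fix), sampling in float32 and casting down seems to behave as expected; a minimal sketch:

import mlx.core as mx

# Workaround: sample in the default float32, then cast to float16
x = mx.random.uniform(shape=[2, 2]).astype(mx.float16)
print(x)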

API for Golang

I would like to request a Go API for utilizing this library.

Docker: xcrun: not found

When building within a Docker container with OpenBLAS installed, I'm of course getting xcrun: not found:

Obtaining file:///app/mlx
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: mlx
  Building editable for mlx (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building editable for mlx (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [172 lines of output]
      running editable_wheel
      creating /tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info
      writing /tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info/PKG-INFO
      writing dependency_links to /tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info/dependency_links.txt
      writing requirements to /tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info/requires.txt
      writing top-level names to /tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info/top_level.txt
      writing manifest file '/tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info/SOURCES.txt'
      reading manifest file '/tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      adding license file 'LICENSE'
      writing manifest file '/tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx.egg-info/SOURCES.txt'
      creating '/tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx-0.0.4.dev2023127+2e126ae.dist-info'
      creating /tmp/pip-wheel-z5lwk8ns/.tmp-a96b8jhe/mlx-0.0.4.dev2023127+2e126ae.dist-info/WHEEL
      running build_py
      running build_ext
      -- The CXX compiler identification is GNU 12.2.0
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Metal not found. Unable to build GPU
      -- Accelerate not found, using default backend.
      -- Looking for sgemm_
      -- Looking for sgemm_ - not found
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Looking for sgemm_
      -- Looking for sgemm_ - found
      -- Found BLAS: /usr/lib/aarch64-linux-gnu/libopenblas.so
      -- /usr/lib/aarch64-linux-gnu/libopenblas.so
      -- /usr/include/aarch64-linux-gnu
      -- Building Python bindings.
      -- Found Python: /usr/local/bin/python3.9 (found version "3.9.18") found components: Interpreter Development Development.Module Development.Embed
      -- Performing Test HAS_FLTO
      -- Performing Test HAS_FLTO - Success
      -- Found pybind11: /usr/local/include (found version "2.11.1")
      -- Configuring done (1.1s)
      -- Generating done (0.0s)
      -- Build files have been written to: /tmp/tmpoi00n5vt.build-temp/mlx.core
      [  1%] Building arange.air
      [  2%] Building indexing.air
      [  4%] Building unary.air
      [  5%] Building sort.air
      /bin/sh: 1: xcrun: not found
      [  7%] Building softmax.air
      /bin/sh: 1: xcrun: not found
      /bin/sh: 1: xcrun: not found
      [  8%] Building scan.air
      /bin/sh: 1: xcrun: not found
      gmake[2]: *** [mlx/backend/metal/kernels/CMakeFiles/mlx-metallib.dir/build.make:97: mlx/backend/metal/kernels/arange.air] Error 127
      /bin/sh: 1: xcrun: not found

For comparison, the llama.cpp Makefile builds with Metal support without needing xcrun.
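A possible workaround (untested here, and assuming the MLX_BUILD_METAL CMake option exists and the CMAKE_ARGS environment variable is honored by the pip build, which I have not verified) would be to disable the Metal backend explicitly for a CPU-only Linux build:

env CMAKE_ARGS="-DMLX_BUILD_METAL=OFF" pip install .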

Onnx/Torch model transpilation / Parsing to MLX. (And Supporting onnx weights)

Everything I do is on the new Mac chips, so I'm very excited. I use C++/CMake/JUCE and ONNX/TorchScript, and I'm trying to squeeze every last bit of performance out of real-time inference running locally. I've tried swapping the ONNX backend for CoreML, but didn't notice an improvement.

In short, I've tried a few ways to increase inference speed using ONNX and Torch, and I'm very intrigued by mlx.

This is the question: is there any plan to support parsing of a graph, such as from ONNX? This would be an incredibly useful feature.

I've had a brief look at the weight-export script, and it looks like you're parsing and renaming some layer names and exporting to a format that can be loaded in mlx. That script is for Torch weights; how about one for ONNX files?

I think it would be amazing if you supported parsing of graphs. I appreciate there is already an example with 200 lines of code, the llama inference code. It could also be helpful (potentially) to show the original for comparison: if there's a Torch equivalent, show the major conceptual changes/differences in setting up the model.

(You probably think I'm an idiot.) The reason for showing differences is that if you highlight them, e.g. remove torch calls, swap xname for yname, then we can visualise the changes needed and potentially write a manual model parser.

And please forgive my naivety: is there anything that would currently prevent a model parser?

For context, I've built very simple neural net libraries for MLPs in C++ without any matrix libraries (Eigen etc.), and I have a persistent obsession with understanding the lower levels. I would be happy to invest some of my time to build an ONNX parser if you could confirm the framework is feature-rich enough. Or should the first tool be a model checker, to see whether a model can be converted, i.e. whether its ops are supported?
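On the model-checker idea, here is a rough sketch of what I had in mind, using the onnx Python package to list which op types a given model needs (the SUPPORTED set is a placeholder to be filled in with whatever MLX actually covers, and the path is hypothetical):

import onnx

# Placeholder: the set of ONNX op types we believe have MLX equivalents
SUPPORTED = {"MatMul", "Add", "Relu", "Reshape", "Transpose", "Softmax"}

def check_model(path):
    model = onnx.load(path)
    ops = {node.op_type for node in model.graph.node}
    missing = sorted(ops - SUPPORTED)
    if missing:
        print("Not convertible yet, missing ops:", ", ".join(missing))
    else:
        print("All ops in this graph look convertible.")

check_model("model.onnx")  # hypothetical path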

mlx segfault when running unit tests on M1 Pro

Getting a segfault when running the unit tests on an Apple M1 Pro with macOS 13.6.2.

$ pip install numpy
Collecting numpy
  Downloading numpy-1.26.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 3.0 MB/s eta 0:00:00
Downloading numpy-1.26.2-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.7/13.7 MB 9.6 MB/s eta 0:00:00
Installing collected packages: numpy
Successfully installed numpy-1.26.2
$ env CMAKE_BUILD_PARALLEL_LEVEL="" pip install .
Processing ml/mlx
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mlx
  Building wheel for mlx (pyproject.toml) ... done
  Created wheel for mlx: filename=mlx-0.0.3.dev2023126+8c96b9a-cp312-cp312-macosx_13_0_arm64.whl size=9421096 sha256=6aaf46747055c8825696a7892f42f43d959842b44446079e28b4933dc9ae2da3
  Stored in directory: /private/var/folders/nm/nyjkgrfn3fg207v5z5rz1w8m0000gp/T/pip-ephem-wheel-cache-93ng7che/wheels/1a/d0/9d/cbc077676fa323205d1fc73c17c324df59cac918a9352a52e5
Successfully built mlx
Installing collected packages: mlx
  Attempting uninstall: mlx
    Found existing installation: mlx 0.0.3.dev2023126+8c96b9a
    Uninstalling mlx-0.0.3.dev2023126+8c96b9a:
      Successfully uninstalled mlx-0.0.3.dev2023126+8c96b9a
Successfully installed mlx-0.0.3.dev2023126+8c96b9a
$ python -m unittest discover python/tests
.............................................ssss..........[1]    72687 segmentation fault  python -m unittest discover python/tests

[Feature Request]: Add Function `NaiveSyncBatchNorm` `1d, 2d, 3d`

🚀 The feature, motivation and pitch

This feature request proposes adding NaiveSyncBatchNorm1d to the library. NaiveSyncBatchNorm1d is an extension of the existing nn.BatchNorm1d module, designed to support synchronization across multiple devices, either locally or globally.

The motivation behind this feature request is to enhance capabilities in distributed deep learning scenarios; the proposed NaiveSyncBatchNorm1d module would fill this gap.

I propose the addition of the NaiveSyncBatchNorm1d module. The module provides the following features:

  • Synchronization of batch normalization statistics (mean and variance) across multiple devices.
  • Support for both global synchronization and local synchronization based on user requirements.

Alternatives

While there are alternative ways to implement synchronized batch normalization, NaiveSyncBatchNorm1d offers a simple and efficient solution.

Additional context

Here's an example of how NaiveSyncBatchNorm1d can be used in PyTorch:

sync_bn = NaiveSyncBatchNorm1d(num_sync_devices=4, global_sync=False, num_features=64)
output = sync_bn(input_tensor)

Source: https://github.com/facebookresearch/pytorchvideo/blob/64e5a17ccefcd6b93ad331d1a9c2a130f179ff44/pytorchvideo/layers/batch_norm.py#L10C44-L10C44

I would like to create a PR for the same.
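To make the proposal concrete, here is a rough sketch of the core synchronization step in plain mlx-style Python; all_reduce_mean is a stand-in for whatever collective MLX eventually exposes (it does not exist today):

import mlx.core as mx

def naive_sync_batch_norm_stats(x, all_reduce_mean, eps=1e-5):
    # x: (N, C) local batch on this device
    mean = mx.mean(x, axis=0)              # local per-channel mean
    meansqr = mx.mean(x * x, axis=0)       # local per-channel mean of squares
    # Average the two moments across devices (hypothetical collective)
    mean = all_reduce_mean(mean)
    meansqr = all_reduce_mean(meansqr)
    var = meansqr - mean * mean            # global variance from the synced moments
    return (x - mean) / mx.sqrt(var + eps)

x = mx.random.normal((8, 64))
y = naive_sync_batch_norm_stats(x, all_reduce_mean=lambda t: t)  # identity = single device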

Swift Frontend for iOS / iPadOS

Creating a meta issue for clarity / visibility so others can upvote / the team can prioritize.

It would be extremely useful to have a Swift ml-explore SDK that can:

  • Be used in iOS / iPadOS applications
  • Run using A-series or M-series hardware acceleration (GPU + ANE).

One might not even need to code the model in Python mlx in the first place: popular open source architectures could be written directly in Swift and used by many developers.

Quantisation Support

Hello there, great work!

I was wondering whether models with int8, int5, or int4 quant formats can be used with this package. Could you please create an example if possible?

`conv3d` operation missing

Thanks for the nice framework! However, the medical imaging community is missing 3d operations, such as conv3d.

[feature request] Plans for linear solvers in MLX?

It would be awesome if linear algebra operations could be supported directly in MLX, for example the equivalent of PyTorch's linalg.solve, which PyTorch currently supports only on CPU but not on MPS.

Thank you.
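Until native support lands, a sketch of the obvious stopgap is to round-trip through NumPy on the CPU (this assumes converting between mx.array and np.array works as it does for me today):

import numpy as np
import mlx.core as mx

def solve(A, b):
    # Fall back to NumPy's LAPACK-backed solver, then return an mlx array
    x = np.linalg.solve(np.array(A), np.array(b))
    return mx.array(x)

A = mx.array([[3.0, 1.0], [1.0, 2.0]])
b = mx.array([9.0, 8.0])
print(solve(A, b))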

Segfaults when running examples using GPU inside a VM

When running mlx-examples/mnist on a macOS 14.1.1 VM via Parallels 19.1.1:

user@Users-Virtual-Machine mnist % python main.py --gpu
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[AppleParavirtDevice newArgumentEncoderWithLayout:]: unrecognized selector sent to instance 0x149029e00'
*** First throw call stack:
(
	0   CoreFoundation                      0x000000018a682800 __exceptionPreprocess + 176
	1   libobjc.A.dylib                     0x000000018a179eb4 objc_exception_throw + 60
	2   CoreFoundation                      0x000000018a7343bc -[NSObject(NSObject) __retain_OA] + 0
	3   AppleParavirtGPUMetalIOGPUFamily    0x0000000101fe1118 doUncompressedBlit + 11160
	4   Metal                               0x0000000194729184 -[_MTLDevice newArgumentEncoderWithArguments:structType:] + 136
	5   libmlx.dylib                        0x00000001113f660c _ZN3mlx4core6Gather8eval_gpuERKNSt3__16vectorINS0_5arrayENS2_9allocatorIS4_EEEERS4_ + 1380
	6   libmlx.dylib                        0x00000001113fc54c _ZNSt3__110__function6__funcIZN3mlx4core5metal9make_taskERNS3_5arrayENS_6vectorINS_13shared_futureIvEENS_9allocatorIS9_EEEENS_10shared_ptrINS_7promiseIvEEEEbE3$_2NSA_ISH_EEFvvEEclEv + 148
	7   libmlx.dylib                        0x0000000110d5ff14 _ZN3mlx4core9scheduler12StreamThread9thread_fnEv + 500
	8   libmlx.dylib                        0x0000000110d600d0 _ZNSt3__114__thread_proxyB7v160006INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN3mlx4core9scheduler12StreamThreadEFvvEPSA_EEEEEPvSF_ + 72
	9   libsystem_pthread.dylib             0x000000018a531034 _pthread_start + 136
	10  libsystem_pthread.dylib             0x000000018a52be3c thread_start + 8
)
libc++abi: terminating due to uncaught exception of type NSException
zsh: abort      python main.py --gpu

I've had similar problems trying to use MPS inside a VM. Are there any plans to support the use of Metal inside VMs?

ANE support

The top-level README mentions that current device support is limited to CPU and GPU; is ANE support in the works?

malloc error "Unable to allocate" on 8GB RAM Mac

I kept encountering the below error while trying the stable diffusion sample in mlx-examples on an 8GB M2 Mac Mini. After some investigation (detailed here: ml-explore/mlx-examples#21), I found that changing one line of code in MetalAllocator::MetalAllocator() in mlx/backend/metal/allocator.cpp to a much higher limit seems to fix the problem (this 1.5 factor seems a bit conservative for low-RAM Macs):

block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}
https://github.com/davidjoffe/mlx/blob/main/mlx/backend/metal/allocator.cpp

I made a fork with this change, and built from source to test.
I'd like to submit a Pull Request. This change should help low-RAM Macs like 8GB machines, though effectively it just allows swap to be used instead of failing outright. That is arguably better than failing, but in the long run this behavior may need further refinement, such as giving users more control over whether and how it happens, or at least a warning.

(foo) david@Davids-Mac-mini stable_diffusion % python txt2image.py "A photo of an astronaut riding a horse on Mars." --n_images 1 --n_rows 1
/Users/david/mlx/foo/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
100%|
 [00:00<?, ?it/s]libc++abi: terminating due to uncaught exception of type std::runtime_error: [malloc_or_wait] Unable to allocate 134217728 bytes.
zsh: abort      python txt2image.py "A photo of an astronaut riding a horse on Mars."  1  1
(foo) david@Davids-Mac-mini stable_diffusion % /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Add INT4 quantized GEMV metal kernel

For running on-device, it would be hugely beneficial if mlx could support quantized kernels. I am the author of AutoAWQ, a framework that allows users to quantize all the new large language models with minimal loss in performance.

When developing AutoAWQ, I always have to SSH into a machine with a CUDA device since it is not adapted to Metal kernels. If mlx adds support for a quantized GEMV kernel, that could change, and it would mean users could run inference on-device.

  • Support GEMV
  • Support for zero point
  • Support for grouped channel-wise scales
  • Unpacking weights and running matrix multiplication over a certain number of groups

Reference to CUDA kernel: https://github.com/casper-hansen/AutoAWQ/blob/main/awq_cuda/quantization/gemv_cuda.cu
Other Metal kernels that are related to AWQ: https://github.com/mit-han-lab/TinyChatEngine/tree/main/kernels/metal
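To spell out the layout side of the request, here is a rough NumPy sketch of what the kernel would have to do (unpack two int4 values per byte, apply per-group scales and zero points, then a matrix-vector product); the packing convention here is an assumption, not AutoAWQ's exact format:

import numpy as np

def int4_gemv(x, packed_w, scales, zeros, group_size=128):
    # packed_w: (out, in // 2) uint8, two 4-bit weights per byte
    lo = (packed_w & 0x0F).astype(np.int32)
    hi = (packed_w >> 4).astype(np.int32)
    w = np.empty((packed_w.shape[0], packed_w.shape[1] * 2), dtype=np.int32)
    w[:, 0::2], w[:, 1::2] = lo, hi
    # Per-group dequantization: scales/zeros have shape (out, in // group_size)
    g = np.repeat(np.arange(w.shape[1] // group_size), group_size)
    w = (w - zeros[:, g]) * scales[:, g]
    return w @ x   # the GEMV itself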

Support for pooling layers for CNNs, e.g. MaxPool2d

Thanks to the mlx team for creating and sharing mlx.

I have managed to get a small CIFAR-10 image classification CNN up and running rather quickly in mlx (inspired by the PyTorch CIFAR-10 tutorial). I have found that pooling layers (e.g. MaxPool2D) are not available yet. I hope that they will be available in the next release(s).

Code is here: https://github.com/menzHSE/mlx-cifar-10-cnn
Heavily borrows from the mnist example in mlx (https://github.com/ml-explore/mlx-examples/tree/main/mnist)
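In the meantime, a minimal sketch of a non-overlapping max pool built from reshape plus a max reduction (it assumes NHWC input, H and W divisible by the kernel size, and that mx.max accepts a list of axes):

import mlx.core as mx

def max_pool2d(x, k=2):
    # x: (B, H, W, C) with stride == kernel size
    B, H, W, C = x.shape
    x = x.reshape(B, H // k, k, W // k, k, C)
    return mx.max(x, axis=[2, 4])      # reduce over the two kernel-window axes

x = mx.random.uniform(shape=[1, 4, 4, 3])
print(max_pool2d(x).shape)             # expect (1, 2, 2, 3)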

Swift bindings?

Are Swift bindings to MLX in the roadmap/within scope?

I attempted to wrap up the built library + metallib file into a macOS .framework bundle, giving it a correct .modulemap and importing it into a Swift target with C++ interop enabled, but soon ran into this roadblock:

[screenshot of the Swift/C++ interop compiler error]

From Swift's documentation here:

Swift currently cannot import C++ modules introduced in the C++20 language standard.

I understand that C++20 is the blocker here, but I'm wondering if somehow the headers could be made backwards compatible with a standard version that the Swift compiler can understand.

Can't install the project from sources

Running either env CMAKE_BUILD_PARALLEL_LEVEL="" pip install . or env CMAKE_BUILD_PARALLEL_LEVEL="" pip install -e . will result in the following error:

error: can't copy '/var/folders/b8/6mjky64x0kn0v0s2l_4_pwm00000gn/T/tmpwy1qmvjs.build-lib/mlx/core.cpython-310-darwin.so': doesn't exist or not a regular file
      [end of output]

Understanding the Relationship between Stream and Device in Computational Graphs

Hello there, great work!

Please tell me about the relationship between Stream and Device.

I understood Stream to be a so-called computation graph through which data flows; is that correct? And I thought the Stream keyword in the arguments of each function specifies which device the computation graph should be executed on; is that correct?
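For what it's worth, here is my current mental model in code (assuming mx.new_stream and the stream keyword behave as described in the docs): a Stream is a queue of operations bound to one device, and passing stream= chooses where that single op runs, rather than describing the whole graph:

import mlx.core as mx

a = mx.random.uniform(shape=[1024, 1024])
b = mx.random.uniform(shape=[1024, 1024])

s_gpu = mx.new_stream(mx.gpu)      # a stream is tied to one device
s_cpu = mx.new_stream(mx.cpu)

c = mx.matmul(a, b, stream=s_gpu)  # this op is queued on the GPU stream
d = mx.exp(a, stream=s_cpu)        # this op runs on a CPU stream
mx.eval(c, d)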

[Feature Request] Groups added to Conv2d

🚀 What is the purpose of this issue?

According to the documentation, these are the parameters available for Conv2d.

  • in_channels (int) – The number of input channels.
  • out_channels (int) – The number of output channels.
  • kernel_size (int or tuple) – The size of the convolution filters.
  • stride (int or tuple, optional) – The size of the stride when applying the filter. Default: 0.
  • padding (int or tuple, optional) – How many positions to 0-pad the input with. Default: 0.
  • bias (bool, optional) – If True add a learnable bias to the output. Default: True

However, the dilation and groups parameters, as defined in the torch implementation of Conv2d, would be desirable for implementing well-known architectures, e.g. ASPP. The following is the description of these two parameters according to torch:

  • dilation controls the spacing between the kernel points; also known as the à trous algorithm. It is harder to describe, but this link has a nice visualization of what dilation does.

  • groups control the connections between inputs and outputs. in_channels and out_channels must both be divisible by groups. For example, at groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, and both subsequently concatenated.

Thank you for your attention 👍
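Until groups is supported natively, here is a sketch of the groups behavior described above, emulated by slicing channels, running separate nn.Conv2d layers, and concatenating the outputs (NHWC layout assumed):

import mlx.core as mx
import mlx.nn as nn

class GroupedConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, groups=2):
        super().__init__()
        assert in_channels % groups == 0 and out_channels % groups == 0
        self.groups = groups
        self.convs = [
            nn.Conv2d(in_channels // groups, out_channels // groups, kernel_size)
            for _ in range(groups)
        ]

    def __call__(self, x):
        # Each group sees only its slice of the input channels
        cs = x.shape[-1] // self.groups
        outs = [conv(x[:, :, :, i * cs:(i + 1) * cs]) for i, conv in enumerate(self.convs)]
        return mx.concatenate(outs, axis=-1)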

How does mx.pad work?

I am trying to use mx.pad with the optional constant array containing the padding values to be inserted, but it seems that only the first element of the constant array is utilized whatever I try. Am I interpreting the description of mlx.core.pad() the wrong way?

import numpy as np
import mlx.core as mx

nx, ny = 10, 10  # Set grid dimensions

x = mx.array(np.linspace(0, 1, nx))
y = mx.array(np.linspace(0, 1, ny))

T = mx.zeros([nx, ny])

Tconst = mx.array(np.linspace(9, 0, 10))

print(Tconst)

T = mx.pad(T, ((1, 1), (1, 1)), Tconst)

print(T)

array([9, 8, 7, ..., 2, 1, 0], dtype=float32)
array([[9, 9, 9, ..., 9, 9, 9],
[9, 0, 0, ..., 0, 0, 9],
[9, 0, 0, ..., 0, 0, 9],
...,
[9, 0, 0, ..., 0, 0, 9],
[9, 0, 0, ..., 0, 0, 9],
[9, 9, 9, ..., 9, 9, 9]], dtype=float32)

Process finished with exit code 0

Examples don't exist, link forwards to wrong place

See end of this tutorial:

https://ml-explore.github.io/mlx/build/html/examples/llama-inference.html

The link after "The full example code is available in" directs to the wrong place (https://ml-explore.github.io/mlx/build/html/examples/code).

It should direct here instead: https://github.com/ml-explore/mlx/tree/main/examples

There are also no examples of the transformer LLM/llama inference in either the Python or C++ folders: https://github.com/ml-explore/mlx/tree/main/examples/python

Could you please add examples for both Python and C++?

arm_neon.h:28:2: error: "NEON intrinsics not available with the soft-float ABI.

When I build on my MacBook, the error is:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/15.0.0/include/arm_neon.h:28:2: error: "NEON intrinsics not available with the soft-float ABI. Please use -mfloat-abi=softfp or -mfloat-abi=hard"
#error "NEON intrinsics not available with the soft-float ABI. Please use -mfloat-abi=softfp or -mfloat-abi=hard"
^
/Volumes/long_yw/mlx/mlx/backend/accelerate/softmax.cpp:59:8: error: unknown type name 'float16x8_t'; did you mean 'float16_t'?
inline float16x8_t neon_fast_exp(float16x8_t x) {
^~~~~~~~~~~
float16_t
/Volumes/long_yw/mlx/mlx/types/half_types.h:16:29: note: 'float16_t' declared here
typedef struct _MLX_Float16 float16_t;
How can I avoid this issue?

Gather vmap is NYI, but `Slice` also is `TODO implement`

Thanks for this library! It is very useful so far.

I am trying to use vmap with a function where argument 1 is the indices into argument 2 (effectively a boolean mask to sum over). When I do, I get the runtime error Gather vmap is NYI, please change slices instead. However, when I look at the Slice::vmap function in primitives.cpp, I see that it is just a placeholder with // TODO implement:
https://github.com/ml-explore/mlx/blob/v0.0.4/mlx/primitives.cpp#L1907

Document 100+ pages summarization and generation

Hi,

Can you write an example of how to train a model on 100 documents of 100+ pages each, and then ask the system to summarize them or generate a new document from new data, or train it on correct/false labels for checking new ones? Second, how to train on PDF or image documents and label errors for checking new documents? A simple API for these purposes would be great.

Extend activation functions

Hi everyone,

Proposal:

I would like to propose the addition of several other activation functions to the framework:

  • LeakyReLU
  • PReLU
  • ReLU6
  • Tanh
  • Softplus
  • Mish
    etc.

I am willing to contribute these or others.
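These are tiny to write on top of existing ops; a sketch of two of them, just to show the shape of the contribution rather than the final API:

import mlx.core as mx

def leaky_relu(x, negative_slope=0.01):
    # max(slope*x, x): passes positives through, scales negatives
    return mx.maximum(negative_slope * x, x)

def softplus(x):
    # log(1 + exp(x)); logaddexp keeps it numerically stable
    return mx.logaddexp(x, mx.array(0.0))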

Pretrained CV Models available?

Is there any way to import pretrained SOTA CV models (e.g. MobileNet) instead of creating toy CV models? I didn't find examples. Thanks for any response

More losses!

Hi, I thought it could be nice if there were a few more losses ready to go in mlx. To start with I was thinking:

  • L1_loss (L1 loss)
  • mse_loss (mean squared error)

Maybe also:

  • nll_loss (negative log likelihood loss)
  • kl_div_loss (KL-divergence loss)

I think this should be pretty straightforward to do? Simply a case of adding them to python/mlx/nn/losses.py.

Any thoughts? And I am happy to work on this issue.
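A sketch of the first two (the mean reduction is just my assumption about a sensible default):

import mlx.core as mx

def l1_loss(predictions, targets):
    return mx.mean(mx.abs(predictions - targets))

def mse_loss(predictions, targets):
    return mx.mean(mx.square(predictions - targets))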

Type annotations

Is there any aversion to including type annotations in the Python implementation?

While the Python interpreter doesn't directly check or enforce type annotations, having well-typed libraries can have a non-trivial positive impact on developer experience.

https://typing.readthedocs.io/en/latest/

Releases Roadmap?

Hello, is there a roadmap for future releases of MLX? I see there are a number of requests regarding features like missing operations (e.g. pooling/upsampling), quantized inference, profiling tools, a Swift API, and so on. Is there anywhere public these are prioritized?

I've been playing with MLX and have been enjoying using it for toy models, and I am currently attempting to implement StripedHyena. However, for anything more serious, I find it hard to justify using MLX over other options: PyTorch and JAX support MPS, are usable on other platforms, utilize Nvidia GPUs for larger jobs, and have a substantially larger community. MLX also doesn't have any conversion support to ONNX (or ironically, CoreML).

I'm probably speaking for many people, but a roadmap would help immensely for determining how much time to invest in this framework, what kind of work is best suited for it (e.g. I presume pre-training a foundation model is a non-starter), and what the level of commitment from Apple AI/ML will be.

error: "NEON intrinsics not available with the soft-float ABI. Please use -mfloat-abi=softfp or -mfloat-abi=hard"

When building the mlx project I'm getting an error regarding: NEON intrinsics not available with the soft-float ABI. Please use -mfloat-abi=softfp or -mfloat-abi=hard.

ChatGPT suggests updating the Makefile with the following, but this doesn't work:

CFLAGS += -mfloat-abi=softfp 

Updating CMakeLists.txt (in the project root) didn't help either:

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mfloat-abi=softfp")

Steps to reproduce

git clone git@github.com:ml-explore/mlx.git mlx && cd mlx

mkdir -p build && cd build
cmake .. && make -j

throws

/Library/Developer/CommandLineTools/usr/lib/clang/15.0.0/include/arm_neon.h:28:2: error: "NEON intrinsics not available with the soft-float ABI. Please use -mfloat-abi=softfp or -mfloat-abi=hard"
#error "NEON intrinsics not available with the soft-float ABI. Please use -mfloat-abi=softfp or -mfloat-abi=hard"
 ^
/Users/stephan/projects/mlx/mlx/backend/accelerate/softmax.cpp:59:8: error: unknown type name 'float16x8_t'; did you mean 'float16_t'?
inline float16x8_t neon_fast_exp(float16x8_t x) {
       ^~~~~~~~~~~
       float16_t
/Users/stephan/projects/mlx/mlx/types/half_types.h:16:29: note: 'float16_t' declared here
typedef struct _MLX_Float16 float16_t;
                            ^
/Users/stephan/projects/mlx/mlx/backend/accelerate/softmax.cpp:59:34: error: unknown type name 'float16x8_t'; did you mean 'float16_t'?
inline float16x8_t neon_fast_exp(float16x8_t x) {
                                 ^~~~~~~~~~~
                                 float16_t
...

Apple M1 Max
I'm on Sonoma 14.1.1 (23B81)
cmake version 3.27.9
Python 3.10.5

Extend mlx with Additional Optimizers

Issue Type: Enhancement

Proposal:
I would like to propose an enhancement to the framework by adding a variety of optimizers. While the current implementation includes Stochastic Gradient Descent (SGD) and Adam optimizers, the addition of more optimization algorithms could greatly benefit users with diverse requirements.

Potential Optimizers to Include:

  1. Adagrad:
  2. RMSprop:
  3. Adadelta:
  4. Nadam:
  5. AdamW
    etc ...

I would appreciate your guidance on the next steps for contributing to this enhancement. Whether it's providing additional details, discussing the proposed optimizers, or collaborating on the implementation, I am eager to contribute to the growth of this fantastic framework.
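To make the discussion concrete, here is a rough sketch of the Adagrad update rule written directly against mx arrays (not against the library's Optimizer base class, whose exact interface I would follow in a real PR):

import mlx.core as mx

def adagrad_update(param, grad, accum, lr=0.01, eps=1e-8):
    # Accumulate squared gradients, then scale the step per parameter
    accum = accum + mx.square(grad)
    param = param - lr * grad / (mx.sqrt(accum) + eps)
    return param, accum

w = mx.zeros([3])
g = mx.array([0.1, -0.2, 0.3])
state = mx.zeros_like(w)
w, state = adagrad_update(w, g, state)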

[RFC] Implement the Python array API standard

The Python array API standard standardises common functionality across Python array/tensor libraries. NumPy, PyTorch and CuPy are planning to have full implementations, and Dask and JAX also have implementations in progress. You could implement this in your main namespace or a separate namespace.

Why should you do this? As well as making it easier for users to convert existing NumPy/PyTorch/CuPy code to MLX, there is potential for interoperability with other libraries. For example, from the NumPy ecosystem, SciPy and scikit-learn have partial experimental support for arrays which comply with the standard.

If you are interested in this, the consortium would love to hear feedback over at https://github.com/data-apis/consortium-feedback/. Some potential pain points, such as missing float64 support, have already been discussed very briefly in data-apis/array-api#719.

LLM Fine-tuning example

It is amazing that mlx supports efficient training with Metal.
While we have the LLM inference examples with Llama and Mistral, could you share an example or advice on how to fine-tune an LLM (or a quantized model) using MLX?

Issue encountered in solving 2D Heat Equation with needing mx.eval to avoid segmentation fault

I have implemented a simple solution of the 2D Heat Conduction Equation with 2 Neumann and 2 Dirichlet BCs. I have the code implemented both using PyTorch and the MLX framework and I am testing the relative performance on an M2 Ultra with 128GB memory.

The MLX code is included below. So far, performance in various tests (on the same machine) shows the MLX version to be somewhere between 2x and 10x faster depending on the problem size.

However, I have an issue that I need to understand. Depending on the problem size, I need to include the line

if step % 15000 == 0: mx.eval(T)

to avoid a segmentation fault. I imagine this has to do with lazy evaluation and arrays accumulating in a buffer? My issue is that I currently have to determine empirically, each time, how often I need to call mx.eval. Is there a more programmatic and elegant way to automatically issue mx.eval at the right frequency based on the problem size?

Here is the complete code below. Thank you for all your help @awni !

# Solving the 2D Heat Conduction Equation with 2 Neumann and 2 Dirichlet BCs
import numpy as np
import matplotlib.pyplot as plt
import time
import mlx.core as mx

# Convergence tolerance to stop early (currently disabled)
#convergence_tolerance = 1e-8

# Grid size and material properties setup
nx, ny = 5000, 5000  # Set grid dimensions
k = 1.0              # Thermal conductivity

# Time-stepping parameters
desired_dt = 0.01  # Desired time step
max_steps = 10000 # Maximum number of time steps

# Creating a linearly spaced grid
x = mx.array(np.linspace(0,1,nx))
y = mx.array(np.linspace(0,1,ny))
dx = x[1] - x[0]   # Grid spacing in x direction
dy = y[1] - y[0]   # Grid spacing in y direction

# Function to calculate the maximum stable time step for the explicit Euler method
def calculate_max_stable_dt(alpha, dx, dy):
    return (1 / (2 * alpha)) * (1 / (1/dx**2 + 1/dy**2))

# Material properties for stability calculation
rho = 1.0  # Density
cp = 1.0   # Specific heat capacity
alpha = k / (rho * cp)  # Thermal diffusivity

# Compute maximum stable time step
dt_max = calculate_max_stable_dt(alpha, dx, dy)
dt = min(dt_max, desired_dt)  # Use the smaller of the desired or maximum stable time step

# Initializing the temperature field on the GPU
T = mx.zeros([nx, ny])
T_old = mx.zeros_like(T)

# Applying Dirichlet boundary conditions
T[:, 0] = 0.0   # Set left boundary temperature
T[:, -1] = 1.0  # Set right boundary temperature

# Time-stepping loop for the heat equation

start_time = time.time()  # Capture start time
for step in range(max_steps):
    T_old = mx.broadcast_to(T,shape=T.shape)

    # Update interior points using finite difference method
    # Pad the interior points for broadcasting
    T =  mx.pad(mx.pad( (T_old[1:-1,1:-1] + dt * k * (
        (T_old[2:, 1:-1] - 2 * T_old[1:-1, 1:-1] + T_old[:-2, 1:-1]) / dx**2 +
        (T_old[1:-1, 2:] - 2 * T_old[1:-1, 1:-1] + T_old[1:-1, :-2]) / dy**2
    )), ((0,0),(0,1)),1),((0,0),(1,0)), 0)

    # Update Neumann boundaries (zero-flux) at top and bottom
    T = mx.concatenate([mx.expand_dims(T[0, :], (0)), T, mx.expand_dims(T[-1, :], (-0))], axis=0)

    if step % 15000 == 0:
        mx.eval(T)

end_time = time.time()  # Capture end time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

#  Visualizing the temperature field using matplotlib
plt.imshow(T, cmap='hot', interpolation='nearest')
plt.colorbar()  # Add a color bar to indicate temperature scales
plt.show()

Building from source requires Xcode (not only Xcode command-line tools)

In case anyone runs into the following error when installing from source:

xcrun: error: unable to find utility "metal", not a developer tool or in PATH

It's solvable with instructions from here: gfx-rs/gfx#2309

I had previously only installed Xcode command-line tools (xcode-select --install), but was running into the above error. Installing full Xcode and running:

sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer

Allowed me to install from source (env CMAKE_BUILD_PARALLEL_LEVEL="" pip install -e .)

Python 3.12 Support

I've noticed that the current mlx builds support Python versions 3.8 to 3.11. As Python 3.12 is gaining traction, I'm initiating a compatibility check by building mlx from source on Python 3.12. Python 3.7 support is declared, but I don't see a build for it in the pip package here: https://pypi.org/project/mlx/#files

Python 3.11 (cp311)
Python 3.10 (cp310)
Python 3.9 (cp39)
Python 3.8 (cp38)

My goal is to identify any compatibility issues and report back with detailed findings. Depending on the results, I'm also willing to contribute fixes or updates needed to ensure Python 3.12 support.

I'm going to try building it from source on Python 3.12 and will report back my findings.

'pip install mlx' does not seem to work correctly

It seems like pip install does not install the correct package in my environment:

>> pip install mlx
Collecting mlx
  Using cached mlx-0.0.0-py3-none-any.whl.metadata (505 bytes)
Using cached mlx-0.0.0-py3-none-any.whl (2.1 kB)
Installing collected packages: mlx
Successfully installed mlx-0.0.0

If you trace the installed package it gives

>> cd ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/mlx
>> ls
__init__.py __pycache__
>> cat __init__.py
print("HELLO WORLD!")

Environment:
M1 Mac Air + Miniconda
