MLX
Last reviewed
May 1, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 ยท 4,389 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 1, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 ยท 4,389 words
Add missing citations, update stale details, or suggest a clearer explanation.
MLX is an open-source array and machine-learning framework developed by Apple Inc.'s machine-learning research team and designed specifically for Apple Silicon (the M1, M2, M3, M4, and M5 series of system-on-chip processors). It was first released on GitHub on December 5, 2023, under the permissive MIT license. MLX provides a familiar NumPy-like Python API for arrays and a PyTorch-like API for neural networks, while taking direct advantage of the unified-memory architecture of Apple Silicon so that tensors can be shared between CPU and GPU without explicit data copies. The project lives at github.com/ml-explore/mlx and is one of the very few major ML frameworks built from scratch for a single hardware family.
MLX was conceived as a research-friendly framework that closes the gap between CPU and GPU programming on the Mac. On most other platforms, an ML developer has to think about which device a tensor lives on and explicitly move it across an interconnect such as PCIe. On Apple Silicon, CPU and GPU share the same physical memory, so the framework can avoid those copies entirely. MLX bakes that assumption into its core, which is what makes it noticeably different from cross-platform frameworks like PyTorch with the Metal Performance Shaders backend or JAX with its experimental Metal backend.
Since the initial release, MLX has grown into a small ecosystem of related projects: the core mlx array library, the mlx-lm package for large language model inference and fine-tuning, the mlx-data library for data loading, the mlx-swift package that exposes the API to Swift and iOS, and the mlx-examples repository that hosts reference implementations of models such as Llama, Mistral, Mixtral, Stable Diffusion, Whisper, and many others. By 2026 the GitHub repository for MLX had passed roughly 25,000 stars, and the MLX Community organization on Hugging Face hosted thousands of pre-converted model weights.
MLX appeared during a period when running large language models locally on a Mac was becoming popular but was mostly limited to projects like llama.cpp that targeted the CPU plus a thin Metal layer. Apple's ML research division wanted a framework that researchers inside and outside the company could actually train and prototype with on consumer hardware, not just deploy to. The result was MLX, released alongside the mlx-examples repository so that the framework would launch with working reference code rather than as a bare runtime.
| Date | Event |
|---|---|
| December 5, 2023 | Initial open-source release of MLX and mlx-examples on GitHub by Apple's machine-learning research team; announced by Awni Hannun on X |
| Late December 2023 | Quick wave of community ports: Llama 2, Mistral 7B, Stable Diffusion, and Whisper inference examples appear in mlx-examples |
| January 2024 | Optimized inference paths for Llama 2, Mistral, and Qwen released; LoRA fine-tuning example added |
| February 20, 2024 | Release of MLX Swift, a Swift API to MLX with neural-network and optimizer packages plus Mistral 7B and MNIST examples |
| 2024 (spring and summer) | Quantization support (4-bit and 8-bit), faster Metal kernels, and an mx.compile tracing compiler are added |
| May 2024 | Apple ships the M4 in the iPad Pro; MLX runs on M4 immediately because it targets the Apple Silicon Metal stack rather than a specific chip generation |
| June 2024 | At WWDC 2024 Apple announces Apple Intelligence and the Foundation Models framework; Apple highlights MLX as the recommended framework for fine-tuning and experimenting with open models on Mac |
| Late 2024 | Distributed training and inference across multiple Macs over Thunderbolt or Ethernet land in MLX |
| 2024 to 2025 | The mlx-lm package matures into the standard CLI for running LLMs on Mac; thousands of pre-quantized models appear under the mlx-community organization on Hugging Face |
| 2025 | Apple ships the M5 with dedicated neural accelerators in the GPU; MLX 0.2x and 0.3x releases add Metal 4 tensor-operation support and large speedups for time-to-first-token |
| Mid-2025 | A CUDA back-end for MLX is introduced (pip install "mlx[cuda]"), letting code developed on Apple Silicon also run on Nvidia GPUs |
| Late 2025 | JACCL distributed back-end uses RDMA over Thunderbolt 5 for low-latency multi-Mac clusters; demonstrations include trillion-parameter models running across four Mac Studios |
| April 2026 | MLX 0.31.2 documented as the current release; project crosses about 25,900 stars on GitHub |
MLX was created by a small team inside Apple's machine-learning research division. The four engineers credited with equal contribution on the initial release are Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert.
Awni Hannun acted as the public face of the project for most of its first three years. Before joining Apple, he worked at Facebook AI Research and earned his PhD at Stanford University, where he co-authored the Deep Speech speech-recognition papers in Andrew Ng's group. He announced the original release on X on December 5, 2023, with the line "MLX is an efficient machine learning framework specifically designed for Apple silicon (i.e. your laptop!)" and continued posting roadmap updates, demos, and benchmarks for the next several years. Hannun left Apple in February 2026 but, in his farewell post, named the people continuing to maintain MLX (Angelos Katharopoulos, Cheng Zhao, Jagrit Digani, and others) and said the project would keep moving.
Ronan Collobert is a longtime researcher in deep learning who previously worked at Facebook AI Research and on the Torch and Flashlight frameworks. Angelos Katharopoulos is known for work on linear-attention transformers and contributes heavily to MLX's compiler and distributed back-ends. Jagrit Digani has focused on Metal kernels and operator coverage. Outside contributions come from a wide community of Mac-using ML developers, with the GitHub project showing thousands of merged pull requests across its first three years.
MLX is built around a small set of design decisions that, taken together, distinguish it from PyTorch, JAX, and TensorFlow. The most consequential is unified memory: there is no .to("cuda") and no separate host and device tensor, because on Apple Silicon the CPU, GPU, and Neural Engine all address the same physical RAM. Operations specify the device they should run on, but the data does not move.
| Design decision | Description |
|---|---|
| Unified memory | Arrays live in shared memory and are visible to CPU and GPU without copies. Specifying a device chooses where the kernel runs, not where the data lives. |
| Lazy evaluation | Operations record nodes in a computation graph; nothing executes until a value is actually needed (for example via mx.eval or by printing or saving an array). This is similar in spirit to JAX. |
| Composable function transformations | mlx.grad, mlx.value_and_grad, mlx.vmap, mlx.compile, mlx.jvp, and mlx.vjp can be stacked, so a user can take the gradient of a vmapped, JIT-compiled function. |
| Familiar Python API | mlx.core mirrors NumPy in array creation, slicing, broadcasting, and reductions; mlx.nn mirrors the PyTorch nn.Module style of subclassing. |
| Multi-language bindings | Python, C++, C, and Swift APIs all wrap the same C++ core, with the C layer acting as a stable bridge. |
| Dynamic graphs with optional compilation | Code is eager by default; mx.compile traces and fuses operations to remove Python overhead and improve performance. |
| Native Metal kernels | Many operators ship hand-written Metal shaders rather than relying on a vendor BLAS, which lets MLX exploit details of the Apple GPU like SIMD groups and tile memory. |
| Multi-backend | CPU and Apple GPU are the primary back-ends; later releases added a CUDA back-end for Nvidia hardware and a JACCL distributed back-end. |
Lazy evaluation matters because it lets MLX fuse operators and skip work whose results are never read. It also makes function transformations cleaner, since the framework already has a graph to differentiate, vectorize, or compile. The trade-off is that the user has to remember to call mx.eval (or perform any operation that materializes a value) when they want a side effect like printing or timing.
The Python package mlx.nn defines familiar building blocks: Linear, Embedding, Conv1d, Conv2d, Conv3d, LSTM, GRU, Transformer, MultiHeadAttention, RMSNorm, LayerNorm, RoPE, and the usual activation functions. The mlx.optimizers package provides SGD, Adam, AdamW, Lion, Muon, and Adafactor, plus learning-rate schedules such as cosine decay, exponential decay, and linear schedules. Anyone who has used PyTorch can read MLX neural-network code without much trouble.
The headline features that come up most often in tutorials and benchmarks are listed below.
| Feature | What it does |
|---|---|
| Lazy execution | Defers computation until a result is needed, enabling fusion and skipping unused work |
| Function transformations | grad, value_and_grad, jit (compile), vmap, and jvp/vjp for forward and reverse mode autodiff |
| Module-style neural networks | mlx.nn.Module subclasses with __call__, parameter pytrees, and easy state-dict style serialization |
| Optimizers | SGD, Adam, AdamW, Lion, Muon, Adafactor, with weight decay and gradient clipping helpers |
| Quantization | 4-bit and 8-bit weight quantization for inference, with mx.quantize and mx.dequantize primitives and a CLI in mlx-lm |
| LoRA and QLoRA fine-tuning | mlx-lm.lora trains low-rank adapters on top of a frozen base model; adapters can be fused back for deployment |
| Distributed training and inference | mx.distributed with all_reduce, all_gather, send, and recv over MPI, Ring, or the JACCL back-end on Thunderbolt 5 |
| Hugging Face integration | mlx_lm.convert downloads a Hugging Face model, quantizes it, and saves it locally or uploads to the mlx-community organization |
| Streaming generation and prompt caching | Token-by-token generation with reusable KV caches and rotating caches for long contexts |
| Multi-language bindings | Python, C++, C, and Swift; a Go community port also exists |
One practical detail that surprises new users is that MLX does not use the Apple Neural Engine (Apple Neural Engine, or ANE) directly. The ANE is exposed through Core ML, and MLX runs on the CPU and GPU instead. From the M5 onward, however, the GPU itself contains "neural accelerators" that provide matrix-multiplication primitives, and MLX uses those through Metal 4's tensor operations.
mlx-lm is the package most users actually install. It is a high-level Python library and command-line interface for running, quantizing, and fine-tuning large language models on Apple Silicon. The default install pulls Llama-3.2-3B-Instruct in a 4-bit quantized form, so a user can type pip install mlx-lm followed by mlx_lm.chat and get a working chatbot in a few minutes on any modern Mac.
The package supports thousands of open-weight models on the Hugging Face Hub. The most commonly used families include Llama 1, Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 4, Mistral 7B, Mixtral 8x7B and 8x22B, Qwen 1.5, Qwen2, Qwen2.5, Qwen3, Phi-2 and Phi-3, Gemma and Gemma 2, DeepSeek (including DeepSeek R1 and V3 in distributed mode), Yi, Stable LM, and Falcon. Vision-language models are covered by the related community package mlx-vlm.
The mlx-lm CLI exposes the main workflows directly:
mlx_lm.generate for one-shot text generation with a promptmlx_lm.chat for an interactive REPL-style chat sessionmlx_lm.server for a local OpenAI-compatible HTTP servermlx_lm.convert for downloading a Hugging Face model and quantizing it to 2-bit, 3-bit, 4-bit, 6-bit, or 8-bitmlx_lm.lora for low-rank adapter fine-tuning, including QLoRA on top of a quantized base modelmlx_lm.fuse for merging a trained LoRA adapter back into the base weightsmlx_lm.upload for pushing a converted model to the Hugging Face HubUnder the hood, mlx-lm uses streaming generation with a key-value cache, supports prompt caching to make repeated prefixes free, and offers a rotating KV cache to keep memory bounded for very long contexts. Distributed inference is available via mx.distributed, which is how Awni Hannun and others have demonstrated trillion-parameter models like DeepSeek V3 and Kimi K2 Thinking running across clusters of Mac Studios.
mlx-data is a separate library for building data-loading and preprocessing pipelines. It is framework-agnostic, meaning that although it ships from the ml-explore organization it can be used with PyTorch or JAX in addition to MLX. The library focuses on streaming, sharding, and on-the-fly transformations such as resizing images, tokenizing text, and decoding audio. It is written in C++ with Python bindings and is designed to keep up with the throughput that Apple Silicon GPUs can reach, especially for image and audio workloads.
mlx-examples is a separate Apple-maintained GitHub repository that ships reference implementations rather than reusable APIs. It is the place most people go first to see how a model is meant to be expressed in MLX.
Notable examples include:
| Domain | Example |
|---|---|
| Language | Transformer training, Llama and Mistral inference, Mixtral 8x7B mixture of experts, T5, BERT |
| Vision | ResNet on CIFAR-10, CVAE on MNIST, FLUX and Stable Diffusion (including SDXL) image generation |
| Audio | OpenAI Whisper for speech recognition, Meta EnCodec for audio compression, Meta MusicGen for music generation |
| Video | Wan2.1 for text-to-video and image-to-video |
| Multimodal | CLIP, LLaVA, Segment Anything (SAM) |
| Other | Graph Convolutional Networks, Real NVP normalizing flows, parameter-efficient fine-tuning with LoRA and QLoRA |
These examples are a major reason MLX got traction quickly: a developer who wanted to run Stable Diffusion or Llama on a MacBook could clone the repo and have a working pipeline within an hour of the framework's release.
MLX is fast on Apple Silicon, particularly for inference. The exact numbers depend on chip, model, batch size, and quantization, but a few patterns show up consistently across community benchmarks.
For 7B-class language models in 4-bit quantization, throughput on the GPU runs roughly in the tens of tokens per second on M2 Max and M3 Max systems and climbs into the hundreds on M3 Ultra and M4 Max systems. Independent benchmarks have measured Llama 2 7B in Q4 at around 26 tokens per second on an M3 MacBook Pro and about 115 tokens per second on an M3 Ultra. On the M5, Apple's own research blog reports that time-to-first-token improved by 3.3x to 4.06x compared with M4 across tested models, and subsequent token generation by 1.19x to 1.27x, mostly because of the M5's higher memory bandwidth (153 GB/s versus 120 GB/s) and the new Metal 4 tensor operations.
In head-to-head comparisons, MLX tends to beat PyTorch with the MPS backend for inference and many single-array operations, while PyTorch can still be faster for some training workloads. One frequently cited Towards Data Science benchmark found PyTorch MPS at roughly 10 to 14 seconds per training epoch for a small CNN, with the MLX version of the same model at roughly 21 to 27 seconds, so the framework has not entirely closed the gap on every workload. For inference and serving, MLX is generally the fastest option among Apple Silicon native frameworks. Community benchmarks against Ollama (which itself has added MLX as a back-end) regularly show MLX coming in 30 to 50 percent faster.
Nvidia GPUs are still faster in absolute terms for large models that fit in their VRAM, but MLX's energy efficiency on Apple hardware is competitive: per-iteration energy on Apple Silicon is in the same range as an RTX 4090 and well below an A6000 in some benchmarks.
MLX sits in a fairly crowded landscape, but its niche of "first-class Apple Silicon framework that looks like NumPy and PyTorch" is well defined. The table below compares it with the frameworks it is most often weighed against.
| Framework | Organization | Focus | Eager or lazy | Multi-platform | Hugging Face support | License | First release |
|---|---|---|---|---|---|---|---|
| MLX | Apple ML Research | Research and on-device development on Apple Silicon | Eager with optional compile and lazy graph capture | Apple Silicon primary; CUDA back-end added later | Yes, via mlx-lm and the mlx-community organization | MIT | December 2023 |
| PyTorch + MPS | PyTorch Foundation (Linux Foundation) | General research and production deep learning | Eager with torch.compile | Cross-platform; MPS back-end since PyTorch 1.12 (2022) | Yes, via Transformers | BSD-3 | 2016 (PyTorch); 2022 (MPS) |
| JAX with Metal | Google plus Apple plugin | Functional, transformation-first numerical computing | Lazy, traced | Cross-platform; Metal back-end is experimental | Limited (some Flax models) | Apache 2.0 | 2018; Metal plugin in 2024 |
| Core ML | Apple | On-device deployment and inference | Static graphs (.mlpackage) | Apple platforms only | Indirect, via converters | Proprietary (free SDK) | 2017 |
| llama.cpp | Open source community (ggml-org) | Quantized LLM inference in plain C++ | Static, hand-written | Cross-platform (CPU, Metal, CUDA, ROCm, Vulkan) | Yes, via GGUF format | MIT | 2023 |
| Apple Foundation Models framework | Apple | Third-party access to Apple's on-device LLM (Apple Intelligence) | Closed-source runtime | Apple platforms only | No | Proprietary | 2024 |
| TinyGrad | Tiny Corp | Minimal research framework, multiple back-ends | Lazy | Cross-platform (CPU, CUDA, Metal, others) | Some | MIT | 2020 |
A quick summary of how to think about each:
MLX has settled into a few clearly identified use cases over its first years:
The main reasons people pick MLX over alternatives:
mlx-community organization, which had close to 5,000 pre-converted models by 2026The usual caveats:
MLX has been adopted broadly within the Mac-using ML community. The GitHub repository passed roughly 25,000 stars by 2026, the mlx-community Hugging Face organization had thousands of converted model checkpoints with thousands of members, and major tools like Ollama added MLX as an optional back-end. Apple itself uses MLX internally for some of the work on its on-device foundation models, and Apple's WWDC 2024 and 2025 sessions explicitly recommend MLX for developers who want to fine-tune or experiment with open models on Mac.
Outside Apple, MLX shows up in academic work on private LLMs (for example, multi-node mixture-of-experts research using Apple Silicon clusters) and in many indie developer tools that ship on-device AI features. Demonstrations of trillion-parameter models running across four Mac Studios with MLX's distributed back-end have generated coverage in the trade press and pushed Apple's small-cluster story as a low-cost alternative to Nvidia DGX boxes for some workloads.
In February 2026, Awni Hannun, the public face of MLX, announced that he was leaving Apple. Coverage at the time noted that he was one of roughly a dozen senior AI researchers who had departed Apple in the previous year. Hannun named the engineers continuing to maintain MLX, and the project has continued to ship releases since.
Apple has several machine-learning frameworks, and they serve different jobs.
| Framework | Primary purpose |
|---|---|
| Core ML | Deployment and inference of trained models inside iOS, macOS, watchOS, and visionOS apps; uses CPU, GPU, and Apple Neural Engine |
| Create ML | A graphical and Swift API for training simple models from labeled data without writing low-level code |
| MLX | Research and on-device development framework, the most PyTorch-like of Apple's offerings, used for training and fine-tuning |
| Apple Foundation Models framework | Third-party access (in Swift) to Apple's own on-device LLM that powers Apple Intelligence |
| Metal Performance Shaders Graph (MPSGraph) | Lower-level graph compiler for Metal that PyTorch's MPS back-end and other frameworks build on |
In practice, a developer might prototype a model in MLX, convert it to Core ML for shipping in a consumer app, and call Apple's Foundation Models framework when they just need general-purpose text from the system LLM rather than their own custom model.
MLX is released under the MIT license, which permits commercial use, modification, distribution, and private use without requiring derivative works to be open-sourced. The project is maintained on GitHub under the ml-explore organization, which is owned by Apple. Releases are cut directly from the main branch, and contributions follow a standard pull-request workflow. The mlx-examples, mlx-lm, mlx-data, mlx-swift, and mlx-c repositories use the same MIT license and the same governance model.
By 2026 MLX has settled in as the standard Mac-native ML framework for local LLM development and on-device fine-tuning. The Hugging Face community treats it as a first-class target, alongside the PyTorch and GGUF formats. Apple continues to invest in the project: distributed training and inference across multiple Macs has matured, quantization libraries have grown to include 2-bit and 3-bit variants, the Swift bindings are good enough to ship inside iOS and macOS apps, and Metal 4's tensor operations let MLX exploit the new neural accelerators in the M5 GPU. The CUDA back-end added in 2025 also widened MLX's appeal: the same code can prototype on a MacBook and then run on Nvidia hardware in a cloud environment.
The departure of Awni Hannun in early 2026 left an open question about Apple's long-term commitment to MLX, but the project has continued to ship and the rest of the team has remained in place. For most practical purposes, if you want to run or train a model on a Mac in 2026, MLX is the framework people reach for first.