Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
Install MLX on Apple Silicon for Local Llama
Python venv, mlx-lm, and a tiny model smoke test.
Key takeaways
- Python venv, mlx-lm, and a tiny model smoke test.
- Parent pillar: /blog/mlx-apple-silicon-local-ai
10+ years in Digital Marketing & SEO
Yes, you can run Llama, Qwen, Gemma, and other open-weight models natively on an M-series Mac, and MLX is the fastest way to do it. The short version: create a Python virtual environment, pip install mlx-lm, then run a one-line generate command against a tiny model to confirm the whole chain works before you download anything big. The entire smoke test takes about five minutes on a clean machine.
MLX is Apple's open-source array framework built specifically for the unified-memory architecture in Apple Silicon, and mlx-lm is the companion package that loads and runs language models on top of it. Because MLX is designed for the M-series chips from the ground up, it usually squeezes more tokens per second out of the same hardware than a generic GGUF runner does. Let me walk you through the clean install.
What is MLX and why use it instead of Ollama on a Mac?
MLX is a NumPy-like array framework from Apple that runs computation on the GPU and Neural Engine of M1/M2/M3/M4 chips while sharing one pool of memory between CPU and GPU. That unified-memory design is the whole point: on a Mac there's no separate VRAM to copy weights into, so a model loads straight into the same RAM your apps use.
Here's the honest comparison so you pick the right tool:
| Tool | Best for | Format | Speed on Apple Silicon | Setup effort |
|---|---|---|---|---|
MLX (mlx-lm) |
Max throughput, Python scripting, fine-tuning | MLX (safetensors) | Fastest, Apple-native | Medium (venv + pip) |
| Ollama | One-command convenience, API server | GGUF | Good | Lowest |
| LM Studio | GUI users, model browsing | GGUF + MLX | Good (it can use MLX too) | Low (app install) |
| llama.cpp | Cross-platform, fine control | GGUF | Good | Higher (build) |
If you want the absolute fastest inference on an M-series Mac and you're comfortable in a terminal, use MLX. If you just want a model running with one command and an OpenAI-compatible API, Ollama is the easier path. For a fuller breakdown of the three big runners, see LM Studio vs Ollama vs llama.cpp.
What do I need before installing MLX?
Three things, and the requirements are refreshingly simple:
- An Apple Silicon Mac. MLX runs on M1, M2, M3, and M4 (any variant). It does not run on Intel Macs — if you're on Intel, use llama.cpp or Ollama instead.
- Python 3.9 or newer. macOS ships with a usable Python 3, but I strongly prefer Homebrew's (
brew install python) so you're not touching the system interpreter. - Recent macOS. Keep it reasonably current (Sonoma or newer is a safe bet). Older releases technically work but you'll hit edge cases.
Check what you've got:
# Confirm you're on Apple Silicon (should print "arm64")
uname -m
# Confirm Python is 3.9+
python3 --version
If uname -m prints x86_64, you're on Intel and MLX isn't for you. If it prints arm64, you're good.
How do I install MLX with a Python venv?
Always install into a virtual environment. A venv is an isolated Python sandbox that keeps mlx-lm and its dependencies from colliding with anything else on your system. Skipping this is the number-one cause of "it worked yesterday" breakage.
# Make a project folder and a venv inside it
mkdir ~/mlx-local && cd ~/mlx-local
python3 -m venv .venv
# Activate it (you'll do this every new terminal session)
source .venv/bin/activate
# Install mlx-lm — this pulls in mlx itself as a dependency
pip install --upgrade pip
pip install mlx-lm
That's it. You don't pip install mlx separately; mlx-lm brings the framework along. When the venv is active you'll see (.venv) at the front of your prompt. To leave it later, type deactivate.
Quick sanity check that the framework imported cleanly:
python3 -c "import mlx.core as mx; print(mx.array([1,2,3]).sum())"
If that prints array(6, dtype=int32), MLX is alive and talking to your GPU.
How do I run a tiny model smoke test?
This is the step everyone skips and then regrets. Before you pull a 14B model, prove the pipeline end-to-end with something small. MLX-format models live on Hugging Face under the mlx-community org, already converted and quantized, so you just point at one and go. mlx-lm downloads it automatically on first run.
mlx_lm.generate \
--model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
--prompt "In one sentence, what is unified memory?" \
--max-tokens 80
A sub-1B model at 4-bit is a few hundred megabytes, downloads fast, and runs instantly even on a base M1. If you get a coherent sentence back, your entire MLX stack works. Now you can scale up with confidence instead of debugging a 9GB download.
Want it interactive instead of one-shot? Use the chat REPL:
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
And if you want an OpenAI-compatible endpoint so your existing scripts and apps (Open WebUI, scripts hitting /v1/chat/completions) can talk to it:
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080
That mirrors the pattern in Ollama's OpenAI-compatible API, so anything you've already wired up keeps working.
Which model and quant should I pick?
The model size you can run is bounded by your Mac's total RAM, because weights and your other apps share the same pool. A rough rule: a 4-bit model needs a little over half a gigabyte of RAM per billion parameters, plus headroom for context. So a 7B–8B at 4-bit wants roughly 5–6 GB resident, and a 14B wants somewhere in the low double digits — verify on your own machine, since context length and macOS overhead move the number around. The same VRAM math from the how much VRAM for Llama 3 8B guide applies; on a Mac, "VRAM" is just your RAM.
Quantization is the lever that makes big models fit. It shrinks the weights by storing them at lower precision — 4-bit drastically cuts memory at a small quality cost, while 8-bit keeps near-full quality for roughly double the footprint. MLX names quants directly in the repo (-4bit, -8bit), unlike GGUF's Q4_K_M / Q8_0 labels, but the tradeoff is the same one I cover in Q4 vs Q8 quality tradeoffs.
Use this to choose:
- If you have a base 8GB Mac then stick to 4-bit models at 3B–8B (Qwen2.5-3B, Llama-3.2-3B, Gemma-2-2B). Close other heavy apps.
- If you have 16–24GB then 8B at 4-bit runs comfortably, and you can reach for a 14B at 4-bit (Qwen2.5-14B) with room for a decent context window.
- If you have 32–64GB then 4-bit 32B models and even some 70B-class quants come into play; this is where Apple Silicon really shines versus a single consumer GPU.
- If quality matters more than memory then bump from
-4bitto-8biton a smaller model rather than running a bigger model at aggressive quantization.
How do I convert a Hugging Face model to MLX myself?
If a model isn't already in the mlx-community org, you can convert and quantize it yourself in one command:
mlx_lm.convert \
--hf-path mistralai/Mistral-7B-Instruct-v0.3 \
--mlx-path ./mistral-7b-mlx-4bit \
-q
The -q flag quantizes to 4-bit during conversion. Drop it for full precision, or pass --q-bits 8 for 8-bit. The output folder is a standard MLX model directory you can load with --model ./mistral-7b-mlx-4bit. This is how you get newer Mistral, DeepSeek, Phi, or GLM releases running before someone uploads a pre-converted version.
Troubleshooting: common MLX install problems
command not found: mlx_lm.generate— your venv isn't active. Runsource .venv/bin/activateagain.Library not loadedor Metal errors — you're likely on an Intel Mac or an x86 Python build under Rosetta. Confirmuname -msaysarm64and that your Python is the arm64 build.- Out of memory / system crawls to a halt — the model plus context exceeded available RAM. Drop to a smaller model or a lower quant, and close memory-hungry apps (browsers, Docker, IDEs).
- Slow first run — that's the model downloading, not inference. Subsequent runs load from the local Hugging Face cache and are fast.
If you'd rather avoid the terminal entirely, LM Studio now ships an MLX backend, so you get the same Apple-native speed through a GUI. But for scripting, batching, and fine-tuning, native mlx-lm is where you want to be.
Bottom line
MLX is the fastest, most Mac-native way to run open-weight LLMs on Apple Silicon, and the install is genuinely a five-minute job: make a venv, pip install mlx-lm, then smoke-test a 0.5B model before scaling up. Match your model size to your total RAM, lean on 4-bit quants to fit bigger models, and convert anything from Hugging Face yourself when a pre-made MLX version doesn't exist. For the full picture — fine-tuning, performance tuning, and where MLX fits in a local stack — head back to the cornerstone guide at MLX on Apple Silicon for local AI.
Frequently asked questions
See /blog/mlx-apple-silicon-local-ai for the full cornerstone guide.
Affiliate Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.