Which pillar does this cluster support?

See /blog/mlx-apple-silicon-local-ai for the full cornerstone guide.

Install MLX on Apple Silicon for Local Llama | WikiWayne

Yes, you can run Llama, Qwen, Gemma, and other open-weight models natively on an M-series Mac, and MLX is the fastest way to do it. The short version: create a Python virtual environment, pip install mlx-lm, then run a one-line generate command against a tiny model to confirm the whole chain works before you download anything big. The entire smoke test takes about five minutes on a clean machine.

MLX is Apple's open-source array framework built specifically for the unified-memory architecture in Apple Silicon, and mlx-lm is the companion package that loads and runs language models on top of it. Because MLX is designed for the M-series chips from the ground up, it usually squeezes more tokens per second out of the same hardware than a generic GGUF runner does. Let me walk you through the clean install.

What is MLX and why use it instead of Ollama on a Mac?

MLX is a NumPy-like array framework from Apple that runs computation on the GPU and Neural Engine of M1/M2/M3/M4 chips while sharing one pool of memory between CPU and GPU. That unified-memory design is the whole point: on a Mac there's no separate VRAM to copy weights into, so a model loads straight into the same RAM your apps use.

Here's the honest comparison so you pick the right tool:

Tool	Best for	Format	Speed on Apple Silicon	Setup effort
MLX (`mlx-lm`)	Max throughput, Python scripting, fine-tuning	MLX (safetensors)	Fastest, Apple-native	Medium (venv + pip)
Ollama	One-command convenience, API server	GGUF	Good	Lowest
LM Studio	GUI users, model browsing	GGUF + MLX	Good (it can use MLX too)	Low (app install)
llama.cpp	Cross-platform, fine control	GGUF	Good	Higher (build)

If you want the absolute fastest inference on an M-series Mac and you're comfortable in a terminal, use MLX. If you just want a model running with one command and an OpenAI-compatible API, Ollama is the easier path. For a fuller breakdown of the three big runners, see LM Studio vs Ollama vs llama.cpp.

What do I need before installing MLX?

Three things, and the requirements are refreshingly simple:

An Apple Silicon Mac. MLX runs on M1, M2, M3, and M4 (any variant). It does not run on Intel Macs — if you're on Intel, use llama.cpp or Ollama instead.
Python 3.9 or newer. macOS ships with a usable Python 3, but I strongly prefer Homebrew's (brew install python) so you're not touching the system interpreter.
Recent macOS. Keep it reasonably current (Sonoma or newer is a safe bet). Older releases technically work but you'll hit edge cases.

Check what you've got:

# Confirm you're on Apple Silicon (should print "arm64")
uname -m

# Confirm Python is 3.9+
python3 --version

If uname -m prints x86_64, you're on Intel and MLX isn't for you. If it prints arm64, you're good.

How do I install MLX with a Python venv?

Always install into a virtual environment. A venv is an isolated Python sandbox that keeps mlx-lm and its dependencies from colliding with anything else on your system. Skipping this is the number-one cause of "it worked yesterday" breakage.

# Make a project folder and a venv inside it
mkdir ~/mlx-local && cd ~/mlx-local
python3 -m venv .venv

# Activate it (you'll do this every new terminal session)
source .venv/bin/activate

# Install mlx-lm — this pulls in mlx itself as a dependency
pip install --upgrade pip
pip install mlx-lm

That's it. You don't pip install mlx separately; mlx-lm brings the framework along. When the venv is active you'll see (.venv) at the front of your prompt. To leave it later, type deactivate.

Quick sanity check that the framework imported cleanly:

python3 -c "import mlx.core as mx; print(mx.array([1,2,3]).sum())"

If that prints array(6, dtype=int32), MLX is alive and talking to your GPU.

How do I run a tiny model smoke test?

This is the step everyone skips and then regrets. Before you pull a 14B model, prove the pipeline end-to-end with something small. MLX-format models live on Hugging Face under the mlx-community org, already converted and quantized, so you just point at one and go. mlx-lm downloads it automatically on first run.

mlx_lm.generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --prompt "In one sentence, what is unified memory?" \
  --max-tokens 80

A sub-1B model at 4-bit is a few hundred megabytes, downloads fast, and runs instantly even on a base M1. If you get a coherent sentence back, your entire MLX stack works. Now you can scale up with confidence instead of debugging a 9GB download.

Want it interactive instead of one-shot? Use the chat REPL:

mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit

And if you want an OpenAI-compatible endpoint so your existing scripts and apps (Open WebUI, scripts hitting /v1/chat/completions) can talk to it:

mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

That mirrors the pattern in Ollama's OpenAI-compatible API, so anything you've already wired up keeps working.

Which model and quant should I pick?

The model size you can run is bounded by your Mac's total RAM, because weights and your other apps share the same pool. A rough rule: a 4-bit model needs a little over half a gigabyte of RAM per billion parameters, plus headroom for context. So a 7B–8B at 4-bit wants roughly 5–6 GB resident, and a 14B wants somewhere in the low double digits — verify on your own machine, since context length and macOS overhead move the number around. The same VRAM math from the how much VRAM for Llama 3 8B guide applies; on a Mac, "VRAM" is just your RAM.

Quantization is the lever that makes big models fit. It shrinks the weights by storing them at lower precision — 4-bit drastically cuts memory at a small quality cost, while 8-bit keeps near-full quality for roughly double the footprint. MLX names quants directly in the repo (-4bit, -8bit), unlike GGUF's Q4_K_M / Q8_0 labels, but the tradeoff is the same one I cover in Q4 vs Q8 quality tradeoffs.

Use this to choose:

If you have a base 8GB Mac then stick to 4-bit models at 3B–8B (Qwen2.5-3B, Llama-3.2-3B, Gemma-2-2B). Close other heavy apps.
If you have 16–24GB then 8B at 4-bit runs comfortably, and you can reach for a 14B at 4-bit (Qwen2.5-14B) with room for a decent context window.
If you have 32–64GB then 4-bit 32B models and even some 70B-class quants come into play; this is where Apple Silicon really shines versus a single consumer GPU.
If quality matters more than memory then bump from -4bit to -8bit on a smaller model rather than running a bigger model at aggressive quantization.

How do I convert a Hugging Face model to MLX myself?

If a model isn't already in the mlx-community org, you can convert and quantize it yourself in one command:

mlx_lm.convert \
  --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  --mlx-path ./mistral-7b-mlx-4bit \
  -q

The -q flag quantizes to 4-bit during conversion. Drop it for full precision, or pass --q-bits 8 for 8-bit. The output folder is a standard MLX model directory you can load with --model ./mistral-7b-mlx-4bit. This is how you get newer Mistral, DeepSeek, Phi, or GLM releases running before someone uploads a pre-converted version.

Troubleshooting: common MLX install problems

command not found: mlx_lm.generate — your venv isn't active. Run source .venv/bin/activate again.
Library not loaded or Metal errors — you're likely on an Intel Mac or an x86 Python build under Rosetta. Confirm uname -m says arm64 and that your Python is the arm64 build.
Out of memory / system crawls to a halt — the model plus context exceeded available RAM. Drop to a smaller model or a lower quant, and close memory-hungry apps (browsers, Docker, IDEs).
Slow first run — that's the model downloading, not inference. Subsequent runs load from the local Hugging Face cache and are fast.

If you'd rather avoid the terminal entirely, LM Studio now ships an MLX backend, so you get the same Apple-native speed through a GUI. But for scripting, batching, and fine-tuning, native mlx-lm is where you want to be.

Bottom line

MLX is the fastest, most Mac-native way to run open-weight LLMs on Apple Silicon, and the install is genuinely a five-minute job: make a venv, pip install mlx-lm, then smoke-test a 0.5B model before scaling up. Match your model size to your total RAM, lean on 4-bit quants to fit bigger models, and convert anything from Hugging Face yourself when a pre-made MLX version doesn't exist. For the full picture — fine-tuning, performance tuning, and where MLX fits in a local stack — head back to the cornerstone guide at MLX on Apple Silicon for local AI.

What is MLX and why use it instead of Ollama on a Mac?

Here's the honest comparison so you pick the right tool:

Tool	Best for	Format	Speed on Apple Silicon	Setup effort
MLX (`mlx-lm`)	Max throughput, Python scripting, fine-tuning	MLX (safetensors)	Fastest, Apple-native	Medium (venv + pip)
Ollama	One-command convenience, API server	GGUF	Good	Lowest
LM Studio	GUI users, model browsing	GGUF + MLX	Good (it can use MLX too)	Low (app install)
llama.cpp	Cross-platform, fine control	GGUF	Good	Higher (build)

What do I need before installing MLX?

Three things, and the requirements are refreshingly simple:

An Apple Silicon Mac. MLX runs on M1, M2, M3, and M4 (any variant). It does not run on Intel Macs — if you're on Intel, use llama.cpp or Ollama instead.
Python 3.9 or newer. macOS ships with a usable Python 3, but I strongly prefer Homebrew's (brew install python) so you're not touching the system interpreter.
Recent macOS. Keep it reasonably current (Sonoma or newer is a safe bet). Older releases technically work but you'll hit edge cases.

Check what you've got:

# Confirm you're on Apple Silicon (should print "arm64")
uname -m

# Confirm Python is 3.9+
python3 --version

If uname -m prints x86_64, you're on Intel and MLX isn't for you. If it prints arm64, you're good.

How do I install MLX with a Python venv?

# Make a project folder and a venv inside it
mkdir ~/mlx-local && cd ~/mlx-local
python3 -m venv .venv

# Activate it (you'll do this every new terminal session)
source .venv/bin/activate

# Install mlx-lm — this pulls in mlx itself as a dependency
pip install --upgrade pip
pip install mlx-lm

That's it. You don't pip install mlx separately; mlx-lm brings the framework along. When the venv is active you'll see (.venv) at the front of your prompt. To leave it later, type deactivate.

Quick sanity check that the framework imported cleanly:

python3 -c "import mlx.core as mx; print(mx.array([1,2,3]).sum())"

If that prints array(6, dtype=int32), MLX is alive and talking to your GPU.

How do I run a tiny model smoke test?

mlx_lm.generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --prompt "In one sentence, what is unified memory?" \
  --max-tokens 80

Want it interactive instead of one-shot? Use the chat REPL:

mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit

And if you want an OpenAI-compatible endpoint so your existing scripts and apps (Open WebUI, scripts hitting /v1/chat/completions) can talk to it:

mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

That mirrors the pattern in Ollama's OpenAI-compatible API, so anything you've already wired up keeps working.

Which model and quant should I pick?

Use this to choose:

If you have a base 8GB Mac then stick to 4-bit models at 3B–8B (Qwen2.5-3B, Llama-3.2-3B, Gemma-2-2B). Close other heavy apps.
If you have 16–24GB then 8B at 4-bit runs comfortably, and you can reach for a 14B at 4-bit (Qwen2.5-14B) with room for a decent context window.
If you have 32–64GB then 4-bit 32B models and even some 70B-class quants come into play; this is where Apple Silicon really shines versus a single consumer GPU.
If quality matters more than memory then bump from -4bit to -8bit on a smaller model rather than running a bigger model at aggressive quantization.

How do I convert a Hugging Face model to MLX myself?

If a model isn't already in the mlx-community org, you can convert and quantize it yourself in one command:

mlx_lm.convert \
  --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  --mlx-path ./mistral-7b-mlx-4bit \
  -q

Troubleshooting: common MLX install problems

command not found: mlx_lm.generate — your venv isn't active. Run source .venv/bin/activate again.
Library not loaded or Metal errors — you're likely on an Intel Mac or an x86 Python build under Rosetta. Confirm uname -m says arm64 and that your Python is the arm64 build.
Out of memory / system crawls to a halt — the model plus context exceeded available RAM. Drop to a smaller model or a lower quant, and close memory-hungry apps (browsers, Docker, IDEs).
Slow first run — that's the model downloading, not inference. Subsequent runs load from the local Hugging Face cache and are fast.

Install MLX on Apple Silicon for Local Llama

Key takeaways

What is MLX and why use it instead of Ollama on a Mac?

What do I need before installing MLX?

How do I install MLX with a Python venv?

How do I run a tiny model smoke test?

Which model and quant should I pick?

How do I convert a Hugging Face model to MLX myself?

Troubleshooting: common MLX install problems

Bottom line

Frequently asked questions

Related Articles

MLX on Apple Silicon for Local AI

Best Used GPUs for Local AI on a Budget (2026)

Your First ComfyUI Workflow for Local SDXL

Install MLX on Apple Silicon for Local Llama

Key takeaways

What is MLX and why use it instead of Ollama on a Mac?

What do I need before installing MLX?

How do I install MLX with a Python venv?

How do I run a tiny model smoke test?

Which model and quant should I pick?

How do I convert a Hugging Face model to MLX myself?

Troubleshooting: common MLX install problems

Bottom line

Frequently asked questions

Related Articles

MLX on Apple Silicon for Local AI

Best Used GPUs for Local AI on a Budget (2026)

Your First ComfyUI Workflow for Local SDXL