Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
Install Ollama on Windows, Mac, and Linux (2026)
Step-by-step Ollama install paths for the three major desktop OS families.
Installing Ollama takes about two minutes on any of the three desktop OS families: on macOS and Windows you download the official app installer and run it, and on Linux you pipe the official script with curl -fsSL https://ollama.com/install.sh | sh. Ollama is a local LLM runner that wraps llama.cpp behind a one-command interface, so once it's installed you pull an open-weight model and start chatting offline. Below are the exact steps for each OS, plus how to verify the install, point it at your GPU, and fix the handful of things that actually go wrong.
What is Ollama, in one sentence?
Ollama is a free, open-source desktop runner that downloads, manages, and serves open-weight GGUF models (Qwen, Llama, Gemma, DeepSeek, Mistral, Phi) through a single CLI and a built-in OpenAI-compatible API on localhost:11434. Think of it as the "just works" layer on top of llama.cpp — you trade some manual control for the ability to go from zero to a running model in two commands.
Which install path do I use for my OS?
Here's the quick decision before you touch a terminal:
| OS | Install method | GPU acceleration | Notes |
|---|---|---|---|
| macOS (Apple Silicon) | .dmg app from ollama.com | Metal (automatic) | M1/M2/M3/M4 all use unified memory; no setup |
| macOS (Intel) | .dmg app | CPU only | Works, but slow; older Intel Macs lack a usable GPU path |
| Windows 10/11 | .exe installer | NVIDIA CUDA / AMD ROCm (automatic) | Native app, no WSL required |
| Linux (Ubuntu/Debian/Fedora/Arch) | install.sh script | NVIDIA CUDA / AMD ROCm | Script auto-detects your GPU drivers |
If you're on Apple Silicon, you have the easiest ride of anyone — unified memory means the "VRAM" is just your system RAM. If you're on a desktop with a dedicated card, your VRAM is the hard ceiling, and it's worth reading how much VRAM you need for Llama 3 8B before you pull anything large.
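The table above boils down to a branch on the OS name. Here is a minimal sketch, assuming the input is whatever `uname -s` prints. The function name `ollama_install_method` is hypothetical (it is not part of Ollama), and the Windows branch only matters if you run it from Git Bash or a similar shell.

```shell
#!/bin/sh
# Map a kernel name (as printed by `uname -s`) to the install route from the table.
# ollama_install_method is a hypothetical helper, not part of Ollama.
ollama_install_method() {
  case "$1" in
    Darwin)               echo "dmg" ;;
    Linux)                echo "install.sh" ;;
    MINGW*|MSYS*|CYGWIN*) echo "exe" ;;
    *)                    echo "unknown" ;;
  esac
}

# Print the route for the machine this runs on.
ollama_install_method "$(uname -s)"
```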
How do I install Ollama on macOS?
- Go to ollama.com and download the macOS .dmg.
- Open the .dmg, drag Ollama into Applications, and launch it once so it installs the ollama command-line tool.
- Open Terminal and confirm it's alive:
ollama --version
ollama run qwen3:8b
That second command pulls a small Qwen model (a few GB) and drops you into a chat prompt. On an M-series Mac with 16GB or more of unified memory, an 8B model at Q4 runs comfortably. With 8GB, stick to 3B–4B models like gemma3:4b or qwen3:4b.
One macOS-specific tip: the menu-bar app needs to be running for the background server to stay up. If ollama run says it can't connect, launch the app from Applications first.
How do I install Ollama on Windows?
- Download the Windows .exe from ollama.com.
- Run the installer. It sets up the Ollama service and adds ollama to your PATH automatically: no WSL, no Docker, no admin gymnastics.
- Open PowerShell or Command Prompt and test:
ollama --version
ollama run llama3.1:8b
Ollama's Windows build auto-detects NVIDIA (CUDA) and recent AMD (ROCm) GPUs. If you have an RTX or RX card, it'll offload layers to the GPU without you configuring anything. To confirm the GPU is actually being used, run a model and check Task Manager's GPU "Dedicated memory" graph — it should jump when the model loads. If it doesn't, your driver is likely stale; update to the latest NVIDIA Studio/Game Ready or AMD Adrenalin driver and restart.
If you're still GPU-shopping, my NVIDIA vs AMD for local LLMs in 2026 breakdown covers which cards give the least driver pain.
How do I install Ollama on Linux?
The one-liner handles Ubuntu, Debian, Fedora, and Arch:
curl -fsSL https://ollama.com/install.sh | sh
The script detects your distro, installs the binary to /usr/local/bin, sets up a systemd service, and — importantly — detects NVIDIA or AMD GPU drivers and pulls the matching acceleration libraries. After it finishes:
# Confirm the service is running
systemctl status ollama
# Pull and run a model
ollama run deepseek-r1:8b
If you'd rather not pipe a script straight into your shell (a reasonable instinct), you can download the standalone tarball from the GitHub releases page and extract it manually — same binary, you just wire up the service yourself.
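There is also a middle path between piping the script and wiring up the tarball by hand: save the script, read it, then run it. The sketch below assumes `curl` is installed. `fetch_for_review` is a hypothetical helper name, and it only downloads the script; it never executes it.

```shell
#!/bin/sh
# Download a script to disk for review instead of piping it into a shell.
# fetch_for_review is a hypothetical helper; nothing here executes the download.
fetch_for_review() {
  url=$1
  dest=$2
  curl -fsSL "$url" -o "$dest" || return 1
  chmod a-x "$dest"   # make sure the saved copy is not executable yet
  echo "saved $(wc -l < "$dest" | tr -d ' ') lines to $dest"
}

# Usage (run by hand):
#   fetch_for_review https://ollama.com/install.sh ./ollama-install.sh
#   less ./ollama-install.sh      # read it first
#   sh ./ollama-install.sh        # run it once you are satisfied
```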
GPU note for Linux: Ollama needs a working vendor driver, meaning the proprietary NVIDIA driver for CUDA or the AMD driver stack for ROCm. If ollama run falls back to CPU, run nvidia-smi (or rocminfo) to confirm the driver sees your card before blaming Ollama. CPU-only inference works fine for small models; see my notes on the CPU-only privacy tradeoff if you're running headless.
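The driver check can be scripted as a first triage step. This is a rough sketch, not an official diagnostic: `detect_gpu_tool` is a hypothetical helper that only looks for the vendor tools on your PATH, so "none" means "tool not found", not "no GPU in the machine".

```shell
#!/bin/sh
# Report which vendor GPU tool is on PATH; detect_gpu_tool is a hypothetical helper.
detect_gpu_tool() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia"
  elif command -v rocminfo >/dev/null 2>&1; then
    echo "amd"
  else
    echo "none"
  fi
}

case "$(detect_gpu_tool)" in
  nvidia) nvidia-smi -L || echo "nvidia-smi is installed but the driver reports no GPU" ;;
  amd)    rocminfo      || echo "rocminfo is installed but reports no usable agent" ;;
  none)   echo "no nvidia-smi or rocminfo on PATH; expect CPU inference" ;;
esac
```

Once a model is loaded, `ollama ps` should show in its PROCESSOR column whether the model landed on the GPU or the CPU.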
How do I verify Ollama is working?
After any install, three checks confirm everything's wired up:
# 1. Binary exists
ollama --version
# 2. Server is reachable
curl http://localhost:11434/api/tags
# 3. A model actually runs
ollama run gemma3:4b "Say hi in five words"
If the curl returns JSON (even an empty model list), the server is live. /api/tags belongs to Ollama's native API; the same server also exposes OpenAI-compatible routes under /v1, which is what most of your other tools will hit. More on that in my Ollama OpenAI-compatible API guide.
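Right after an install or a reboot the server can take a moment to come up, so a single curl may fail even on a healthy setup. A small polling sketch, assuming the default address localhost:11434. `wait_for_ollama` is a hypothetical helper, and the base URL and attempt count are parameters you can change.

```shell
#!/bin/sh
# Poll the tags endpoint until the server answers or the attempts run out.
# wait_for_ollama is a hypothetical helper, not an Ollama command.
wait_for_ollama() {
  base=${1:-http://localhost:11434}
  tries=${2:-10}
  i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS --max-time 2 "$base/api/tags" >/dev/null 2>&1; then
      echo "server is up at $base"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$tries" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  echo "no answer from $base (attempts: $tries)" >&2
  return 1
}

# Usage:
#   wait_for_ollama && ollama run gemma3:4b "Say hi in five words"
```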
How do I pick the right quant so the model fits?
Ollama models are GGUF files, and most default tags ship at Q4_K_M, a 4-bit quantization that cuts the weights to roughly 30% of their FP16 size while keeping quality high enough that you won't notice on everyday tasks. Q8 nearly doubles that memory cost for a small quality bump that mostly matters for code and math.
Rough sizing for picking a tag:
- If you have 8GB VRAM/RAM → run 3B–7B models at Q4_K_M (e.g. qwen3:4b; llama3.1:8b is borderline).
- If you have 12–16GB → 8B–14B at Q4_K_M is the sweet spot.
- If you have 24GB+ → 14B–32B at Q4, or smaller models at Q8 for cleaner output.
- If you only have CPU → stay at or below 7B and expect single-digit tokens/sec.
These are ballpark ranges — your real throughput depends on your exact card, memory bandwidth, and context length, so benchmark on your own stack. For the full tradeoff, see Q4 vs Q8 quant quality and the broader quantization explainer.
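The arithmetic behind those ranges is short: weight size is roughly parameters times bits per weight, divided by 8. The sketch below assumes about 4.85 bits per weight for Q4_K_M (an approximation that varies by model) and 8.5 for Q8_0. `estimate_weights_gb` is a hypothetical helper, and it covers the weights only, not the KV cache or runtime overhead.

```shell
#!/bin/sh
# Estimate weight size in GB: parameters (in billions) x bits per weight / 8.
# estimate_weights_gb is a hypothetical helper; it ignores KV cache and overhead.
estimate_weights_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.2f\n", p * bpw / 8 }'
}

estimate_weights_gb 8 16      # 8B at FP16
estimate_weights_gb 8 8.5     # 8B at Q8_0
estimate_weights_gb 8 4.85    # 8B at Q4_K_M (assumed bits per weight)
```

An 8B model comes out near 16 GB at FP16, 8.5 GB at Q8_0, and just under 5 GB at Q4_K_M, which is why an 8B Q4 model is borderline on an 8GB card once context is added.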
Can I import my own GGUF or change where models are stored?
Yes to both. If you already downloaded a GGUF from Hugging Face, point Ollama at it with a tiny Modelfile:
# Modelfile
FROM ./my-model-Q4_K_M.gguf
ollama create my-model -f Modelfile
ollama run my-model
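A Modelfile can also carry runtime settings, which is handy if you want an imported model to always start with a smaller context window. A minimal sketch, assuming the `PARAMETER num_ctx` directive and a placeholder GGUF path. `write_modelfile` is a hypothetical helper that only writes the file; it does not call Ollama.

```shell
#!/bin/sh
# Write a Modelfile that imports a local GGUF and pins a context length.
# write_modelfile is a hypothetical helper; the GGUF path is a placeholder.
write_modelfile() {
  gguf=$1
  num_ctx=$2
  out=${3:-Modelfile}
  cat > "$out" <<EOF
FROM $gguf
PARAMETER num_ctx $num_ctx
EOF
}

# Usage (needs Ollama installed and the GGUF present):
#   write_modelfile ./my-model-Q4_K_M.gguf 4096
#   ollama create my-model -f Modelfile
#   ollama run my-model
```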
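To move the model cache off your system drive (models add up fast), set the storage path in the environment the server is launched from, then restart it. An export in your shell only reaches a server you start from that same shell with ollama serve: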
# macOS / Linux
export OLLAMA_MODELS=/mnt/bigdisk/ollama
# Windows (PowerShell, then restart Ollama)
setx OLLAMA_MODELS "D:\ollama-models"
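On Linux installs where Ollama runs as a systemd service, the variable has to live in the service's environment, not in your login shell. A sketch of writing a drop-in file for that, assuming the unit is named ollama.service. `write_env_dropin` is a hypothetical helper, and the target directory is a parameter so you can preview the output somewhere harmless first.

```shell
#!/bin/sh
# Write a systemd drop-in that sets one environment variable for a service.
# write_env_dropin is a hypothetical helper, not part of Ollama or systemd.
write_env_dropin() {
  dir=$1
  key=$2
  value=$3
  mkdir -p "$dir" || return 1
  printf '[Service]\nEnvironment="%s=%s"\n' "$key" "$value" > "$dir/override.conf"
}

# Usage on a systemd host (as root), assuming the unit is ollama.service:
#   write_env_dropin /etc/systemd/system/ollama.service.d OLLAMA_MODELS /mnt/bigdisk/ollama
#   systemctl daemon-reload
#   systemctl restart ollama
```

The new directory also has to be readable and writable by the account the service runs as, so check its ownership before restarting.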
What usually breaks, and how do I fix it?
A few recurring gotchas:
- "connection refused" on port 11434: the background server isn't running. On macOS/Windows, launch the app; on Linux, run systemctl start ollama.
- Model loads on CPU despite having a GPU: stale GPU driver, or the model is bigger than your VRAM and Ollama silently spilled to RAM. Update drivers; pick a smaller quant.
- Want to reach Ollama from another device: bind it to your network with OLLAMA_HOST=0.0.0.0:11434 before starting, then point a UI at it. This is exactly how the homelab Docker stack with Ollama and Open WebUI works.
- Out-of-memory crash mid-generation: your context window is too long for the remaining VRAM. Lower it, or drop to a smaller model.
If you want a graphical chat interface instead of the terminal, Open WebUI connects to Ollama in a couple of minutes and gives you a clean ChatGPT-style front end.
Bottom line
Ollama is the fastest on-ramp to running open-weight models locally: a .dmg on Mac, an .exe on Windows, and a one-line script on Linux, with GPU acceleration auto-detected on all three. Start with a small Q4_K_M model that fits your memory, verify the localhost:11434 API responds, and scale up from there. When you outgrow Ollama's defaults and want manual control over layers, context, and build flags, that's your cue to look at llama.cpp — but for 90% of local-AI work, this install is all you need. Head back to the run open-weight models locally in 2026 pillar for the full picture.
Frequently asked questions
See /blog/run-open-weight-models-locally-2026 for the full cornerstone guide.
