Install Ollama on Windows, Mac, and Linux (2026) | WikiWayne

Installing Ollama takes about two minutes on any of the three desktop OS families. On macOS and Windows you download the official app installer and run it; on Linux you pipe the official script into your shell with curl -fsSL https://ollama.com/install.sh | sh. Ollama is a local LLM runner that wraps llama.cpp behind a one-command interface, so once it's installed you pull an open-weight model and start chatting offline. Below are the exact steps for each OS, plus how to verify the install, point it at your GPU, and fix the handful of things that actually go wrong.

What is Ollama, in one sentence?

Ollama is a free, open-source desktop runner that downloads, manages, and serves open-weight GGUF models (Qwen, Llama, Gemma, DeepSeek, Mistral, Phi) through a single CLI and a built-in HTTP API on localhost:11434, which includes OpenAI-compatible endpoints under /v1. Think of it as the "just works" layer on top of llama.cpp: you trade some manual control for the ability to go from zero to a running model in two commands.

Which install path do I use for my OS?

Here's the quick decision before you touch a terminal:

OS	Install method	GPU acceleration	Notes
macOS (Apple Silicon)	`.dmg` app from ollama.com	Metal (automatic)	M1/M2/M3/M4 all use unified memory; no setup
macOS (Intel)	`.dmg` app	CPU only	Works, but slow — older Intel Macs lack a usable GPU path
Windows 10/11	`.exe` installer	NVIDIA CUDA / AMD ROCm (automatic)	Native app, no WSL required
Linux (Ubuntu/Debian/Fedora/Arch)	`install.sh` script	NVIDIA CUDA / AMD ROCm	Script auto-detects your GPU drivers

If you're on Apple Silicon, you have the easiest ride of anyone — unified memory means the "VRAM" is just your system RAM. If you're on a desktop with a dedicated card, your VRAM is the hard ceiling, and it's worth reading how much VRAM you need for Llama 3 8B before you pull anything large.

How do I install Ollama on macOS?

1. Go to ollama.com and download the macOS .dmg.
2. Open the .dmg, drag Ollama into Applications, and launch it once so it installs the ollama command-line tool.
3. Open Terminal and confirm it's alive:

ollama --version
ollama run qwen3:8b

That second command pulls a small Qwen model (a few GB) and drops you into a chat prompt. On an M-series Mac with 16GB or more of unified memory, an 8B model at Q4 runs comfortably. With 8GB, stick to 3B–4B models like gemma3:4b or qwen3:4b.

One macOS-specific tip: the menu-bar app needs to be running for the background server to stay up. If ollama run says it can't connect, launch the app from Applications first.

How do I install Ollama on Windows?

1. Download the Windows .exe from ollama.com.
2. Run the installer. It sets up the Ollama service and adds ollama to your PATH automatically: no WSL, no Docker, no admin gymnastics.
3. Open PowerShell or Command Prompt and test:

ollama --version
ollama run llama3.1:8b

Ollama's Windows build auto-detects NVIDIA (CUDA) and recent AMD (ROCm) GPUs. If you have an RTX or RX card, it'll offload layers to the GPU without you configuring anything. To confirm the GPU is actually being used, run a model and check Task Manager's GPU "Dedicated memory" graph — it should jump when the model loads. If it doesn't, your driver is likely stale; update to the latest NVIDIA Studio/Game Ready or AMD Adrenalin driver and restart.

If you're still GPU-shopping, my NVIDIA vs AMD for local LLMs in 2026 breakdown covers which cards give the least driver pain.

How do I install Ollama on Linux?

The one-liner handles Ubuntu, Debian, Fedora, and Arch:

curl -fsSL https://ollama.com/install.sh | sh

The script detects your distro, installs the binary to /usr/local/bin, sets up a systemd service, and — importantly — detects NVIDIA or AMD GPU drivers and pulls the matching acceleration libraries. After it finishes:

# Confirm the service is running
systemctl status ollama

# Pull and run a model
ollama run deepseek-r1:8b

If you'd rather not pipe a script straight into your shell (a reasonable instinct), you can download the standalone tarball from the GitHub releases page and extract it manually — same binary, you just wire up the service yourself.
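A middle path between piping blindly and wiring everything up by hand is to download the script to a file, read it, and only then run it. The sketch below assumes a hypothetical helper name (fetch_installer) and a default file name of my choosing; the only thing it takes from the article is the install.sh URL.

```shell
# Download the installer to a file instead of piping it into sh.
# fetch_installer and the default file name are illustrative, not part of Ollama.
fetch_installer() (
  dest="${1:-./ollama-install.sh}"

  # -f: fail on HTTP errors, -sS: quiet but still report errors, -L: follow redirects
  curl -fsSL https://ollama.com/install.sh -o "$dest" || exit 1

  # Basic sanity check: a shell script should start with a shebang line.
  head -n 1 "$dest" | grep -q '^#!' || {
    echo "unexpected content in $dest" >&2
    exit 2
  }

  echo "$dest"
)

# Typical use:
#   script="$(fetch_installer)"   # download
#   less "$script"                # read what it will do
#   sh "$script"                  # run it once you are satisfied
```

The shebang check only catches obvious failures such as an HTML error page; it is no substitute for actually reading the script.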

GPU note for Linux: Ollama needs the proprietary NVIDIA driver plus the CUDA runtime, or the ROCm stack for AMD. If ollama run falls back to CPU, run nvidia-smi (or rocminfo) to confirm the driver sees your card before blaming Ollama. CPU-only inference works fine for small models — see my notes on the CPU-only privacy tradeoff if you're running headless.
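Once a model is loaded, ollama ps reports where it is running in its PROCESSOR column (for example "100% GPU", or a CPU/GPU split when the model spilled out of VRAM). The sketch below wraps that in a small check. The function name gpu_check is hypothetical, and the parsing assumes the column text contains "CPU" whenever any part of the model runs on the CPU, which may change between Ollama versions.

```shell
# Report whether every loaded model is running entirely on the GPU.
# Assumption: `ollama ps` prints a header row, then one row per loaded model,
# and mentions "CPU" in a row only when some layers run on the CPU.
gpu_check() (
  ps_out="$(ollama ps 2>/dev/null)" || { echo "could not query the Ollama server"; exit 2; }

  # Drop the header row; what remains is one line per loaded model.
  rows="$(printf '%s\n' "$ps_out" | sed 1d)"
  [ -n "$rows" ] || { echo "no model is loaded; run one first"; exit 3; }

  if printf '%s\n' "$rows" | grep -q 'CPU'; then
    echo "at least one model is using the CPU"
    exit 1
  fi
  echo "all loaded models are fully on GPU"
)

# Typical use: start a model in one terminal (ollama run deepseek-r1:8b),
# then call gpu_check in another while the model is still loaded.
```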

How do I verify Ollama is working?

After any install, three checks confirm everything's wired up:

# 1. Binary exists
ollama --version

# 2. Server is reachable
curl http://localhost:11434/api/tags

# 3. A model actually runs
ollama run gemma3:4b "Say hi in five words"

If the curl returns JSON (even an empty model list), the server is live. Note that /api/tags belongs to Ollama's native API; the OpenAI-compatible endpoints your other tools will hit are served by the same process on the same port, under /v1. More on that in my Ollama OpenAI-compatible API guide.
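The three checks can be chained so the first failure tells you which layer is broken. This is a minimal sketch: verify_ollama and the OLLAMA_URL override are names I made up for illustration, and the exit codes (1, 2, 3) are arbitrary.

```shell
# Run the three checks in order and stop at the first one that fails.
# OLLAMA_URL is a hypothetical override; the default matches the address above.
verify_ollama() (
  base="${OLLAMA_URL:-http://localhost:11434}"
  model="${1:-gemma3:4b}"

  # 1. Binary exists
  ollama --version >/dev/null 2>&1 || { echo "FAIL: ollama is not on PATH"; exit 1; }
  echo "ok: binary found"

  # 2. Server is reachable (give up after 5 seconds)
  curl -fsS --max-time 5 "$base/api/tags" >/dev/null 2>&1 || { echo "FAIL: no server at $base"; exit 2; }
  echo "ok: server reachable at $base"

  # 3. A model actually runs (this pulls the model on first use)
  ollama run "$model" "Say hi in five words" >/dev/null 2>&1 || { echo "FAIL: $model did not run"; exit 3; }
  echo "ok: $model ran"
)

# Typical use:
#   verify_ollama              # checks with gemma3:4b
#   verify_ollama qwen3:4b     # checks with another tag
```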

How do I pick the right quant so the model fits?

Ollama models are GGUF files, and most default tags ship at Q4_K_M, a 4-bit quantization (close to 5 bits per weight once scaling metadata is counted) that cuts the model's memory footprint to roughly 30% of the FP16 size while keeping quality high enough that you won't notice on everyday tasks. Q8 roughly doubles the memory cost of Q4 for a small quality bump that mostly matters for code and math.
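The arithmetic behind those ratios is simple enough to script: weight size in GB is roughly parameters (in billions) times bits per weight, divided by 8. The sketch below uses a hypothetical helper, weights_gb. The bits-per-weight figures in the usage comments (about 4.8 for Q4_K_M, 8.5 for Q8_0, 16 for FP16) are approximations I am assuming, and the result covers weights only, not the context cache or runtime overhead.

```shell
# Estimate the size of a model's weights in decimal gigabytes.
#   $1 = parameter count in billions (e.g. 8 for an 8B model)
#   $2 = average bits per weight for the chosen quantization
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Typical use, with assumed approximate bits-per-weight values:
#   weights_gb 8 4.8    # 8B at Q4_K_M -> 4.8
#   weights_gb 8 8.5    # 8B at Q8_0   -> 8.5
#   weights_gb 8 16     # 8B at FP16   -> 16.0
```

Leave headroom on top of the estimate: the context window and the runtime itself also need memory, and how much depends on the context length you run.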

Rough sizing for picking a tag:

If you have 8GB VRAM/RAM → run 3B–7B models at Q4_K_M (e.g. qwen3:4b; llama3.1:8b is borderline).
If you have 12–16GB → 8B–14B at Q4_K_M is the sweet spot.
If you have 24GB+ → 14B–32B at Q4, or smaller models at Q8 for cleaner output.
If you only have CPU → stay at or below 7B and expect single-digit tokens/sec.

These are ballpark ranges — your real throughput depends on your exact card, memory bandwidth, and context length, so benchmark on your own stack. For the full tradeoff, see Q4 vs Q8 quant quality and the broader quantization explainer.
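If you want the sizing list in script form, for example to print a hint on a machine you are setting up, it maps onto a few comparisons. This is a sketch only: suggest_models is a hypothetical name, and since the list leaves gaps (9–11GB, 17–23GB), I am assuming each gap rounds down to the tier below it.

```shell
# Map available memory (whole GB of VRAM or unified memory) to the article's
# ballpark tiers. Pass 0 for a CPU-only machine.
# Assumption: memory sizes that fall between tiers use the lower tier.
suggest_models() {
  mem_gb="$1"
  if [ "$mem_gb" -eq 0 ]; then
    echo "7B or smaller, CPU only"
  elif [ "$mem_gb" -lt 8 ]; then
    echo "no tier listed for under 8GB" >&2
    return 1
  elif [ "$mem_gb" -lt 12 ]; then
    echo "3B-7B at Q4_K_M"
  elif [ "$mem_gb" -lt 24 ]; then
    echo "8B-14B at Q4_K_M"
  else
    echo "14B-32B at Q4, or smaller models at Q8"
  fi
}

# Typical use:
#   suggest_models 16    # -> 8B-14B at Q4_K_M
#   suggest_models 0     # -> 7B or smaller, CPU only
```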

Can I import my own GGUF or change where models are stored?

Yes to both. If you already downloaded a GGUF from Hugging Face, point Ollama at it with a tiny Modelfile:

# Modelfile
FROM ./my-model-Q4_K_M.gguf

ollama create my-model -f Modelfile
ollama run my-model

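To move the model cache off your system drive (models add up fast), set OLLAMA_MODELS before the server starts. An export only affects a server launched from that same shell with ollama serve; the macOS app and the Linux systemd service don't read your shell environment, so they need the variable set through launchctl setenv or a systemd override instead: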

# macOS / Linux
export OLLAMA_MODELS=/mnt/bigdisk/ollama

# Windows (PowerShell, then restart Ollama)
setx OLLAMA_MODELS "D:\ollama-models"
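On Linux installs that run Ollama as a systemd service, an export in your shell does not reach the service, so the variable has to live in a systemd drop-in. The sketch below generates that drop-in. The helper name ollama_dropin is hypothetical; the override path is the one systemctl edit ollama.service would create.

```shell
# Print a systemd drop-in that sets one environment variable for the service.
#   $1 = variable name (e.g. OLLAMA_MODELS), $2 = value
ollama_dropin() {
  printf '[Service]\nEnvironment="%s=%s"\n' "$1" "$2"
}

# Typical use (needs root; the ollama service user must be able to read and
# write the target directory):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   ollama_dropin OLLAMA_MODELS /mnt/bigdisk/ollama \
#     | sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo chown -R ollama:ollama /mnt/bigdisk/ollama
#   sudo systemctl daemon-reload
#   sudo systemctl restart ollama
```

The same drop-in approach works for other server settings, such as OLLAMA_HOST.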

What usually breaks, and how do I fix it?

A few recurring gotchas:

"connection refused" on port 11434: the background server isn't running. On macOS/Windows, launch the app; on Linux, run sudo systemctl start ollama.
Model loads on CPU despite having a GPU: either the GPU driver is stale, or the model is bigger than your VRAM and Ollama silently spilled to RAM. Update drivers, or pick a smaller model or quant.
Want to reach Ollama from another device: bind it to your network by setting OLLAMA_HOST=0.0.0.0:11434 before the server starts, then point a UI at it. This is exactly how the homelab Docker stack with Ollama and Open WebUI works.
Out-of-memory crash mid-generation: your context window is too long for the remaining VRAM. Lower the context length (the num_ctx parameter), or drop to a smaller model.
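One way to lower the context window persistently is to derive a variant of the model with a smaller num_ctx baked in through a Modelfile. The sketch below only generates the Modelfile text. The helper name ctx_modelfile, the 4096 value, and the llama3.1-4k tag are illustrative choices, not recommendations.

```shell
# Print a Modelfile that reuses an existing model with a fixed context length.
#   $1 = base model tag, $2 = context length in tokens
ctx_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# Typical use:
#   ctx_modelfile llama3.1:8b 4096 > Modelfile
#   ollama create llama3.1-4k -f Modelfile
#   ollama run llama3.1-4k
```

The derived model reuses the base model's weights, so the new tag takes very little extra disk space.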

If you want a graphical chat interface instead of the terminal, Open WebUI connects to Ollama in a couple of minutes and gives you a clean ChatGPT-style front end.

Bottom line

Ollama is the fastest on-ramp to running open-weight models locally: a .dmg on Mac, an .exe on Windows, and a one-line script on Linux, with GPU acceleration auto-detected on supported hardware (Intel Macs stay CPU-only). Start with a small Q4_K_M model that fits your memory, verify the localhost:11434 API responds, and scale up from there. When you outgrow Ollama's defaults and want manual control over layers, context, and build flags, that's your cue to look at llama.cpp. For most local-AI work, though, this install is all you need. Head back to the run open-weight models locally in 2026 pillar for the full picture.

Related: what is gguf local llm format
