Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
Homelab Docker Stack: Ollama + Open WebUI
Compose services for local chat without cloud relay.
Key takeaways
- Compose services for local chat without cloud relay.
- Parent pillar: /blog/self-hosting-guide-beginners-2026
A homelab Docker stack for local AI runs two services that talk to each other on your own machine: Ollama as the model runner (the OpenAI-style API that actually loads and serves the model) and Open WebUI as the browser chat front-end. Wire them together with one docker-compose.yml, point Open WebUI at Ollama over the internal Docker network, and you get a private ChatGPT-style interface where nothing leaves your box. No cloud relay, no API keys, no telemetry — just open-weight models like Qwen, Llama, Gemma, and Mistral running on hardware you control.
This is the cluster guide for the broader self-hosting guide for beginners pillar. Read that for the full sizing tables; here we get the two-container stack running.
What is the Ollama + Open WebUI stack?
It's a two-part split that mirrors how cloud AI works, except both halves live on your network:
- Ollama: an open-source model runner built on llama.cpp. It downloads GGUF model weights, manages VRAM, and exposes a local HTTP API on port 11434. Think of it as the engine.
- Open WebUI: a self-hosted web interface (formerly "Ollama WebUI"). It gives you chat threads, model switching, RAG document upload, and multi-user accounts. Think of it as the dashboard.
They're separate containers on purpose. You can swap the front-end, run multiple front-ends against one Ollama, or expose the Ollama API to other apps on your LAN. If you're still deciding between runners, my LM Studio vs Ollama vs llama.cpp comparison breaks down when each one wins.
Why use Docker instead of installing Ollama directly?
You don't need Docker — a native install works fine for one person on one machine. Docker earns its keep when you want a reproducible, restartable stack. Define the whole thing once, version it in git, and docker compose up rebuilds it identically on any box.
| Approach | Best for | Tradeoff |
|---|---|---|
| Native install | Single user, simplest setup | Manual updates, no isolation, harder to reproduce |
| Docker Compose | Homelab, multi-user, repeatable | GPU passthrough config, slight overhead |
| Docker + reverse proxy | LAN/remote access with TLS | More moving parts to maintain |
For a homelab where you already run other containers, Docker is the obvious choice. The one wrinkle is GPU access — covered below.
What hardware do I need to run this?
The stack itself is light; the model is what eats resources. Ballpark guidance (verify on your own hardware — these vary by quant and context length):
- 8B-class models (Llama 3.1 8B, Qwen3 8B) at Q4_K_M: roughly 5-7 GB of VRAM or unified memory. Comfortable on a 12 GB GPU or a 16 GB Mac.
- 14B models at Q4_K_M: roughly 9-12 GB. Wants a 16 GB+ card.
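- 20-30B MoE models (Qwen3-30B-A3B, GPT-OSS-20B): more memory, because every expert still has to be loaded. Only a few billion parameters are active per token, though, so they generate faster than a dense model of the same total size.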
If you're sizing VRAM, my VRAM requirements guide and the how much VRAM for Llama 3 8B breakdown have the real math. Quick rule: GGUF model file size + 1-2 GB overhead ≈ VRAM needed to fit it fully on the GPU.
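The quick rule above is simple enough to sketch as a helper. This is a minimal illustration, not a sizing tool: the function names and the 1.5 GB default overhead (the midpoint of the 1-2 GB range) are assumptions, and the 4.9 GB file size is a hypothetical example rather than a measured value.

```python
def estimate_vram_gb(gguf_file_gb: float, overhead_gb: float = 1.5) -> float:
    """Rule of thumb from the text: GGUF file size plus 1-2 GB of overhead."""
    if gguf_file_gb <= 0 or overhead_gb < 0:
        raise ValueError("file size must be positive and overhead non-negative")
    return gguf_file_gb + overhead_gb


def fits_on_gpu(gguf_file_gb: float, vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if the estimate fits entirely in the given amount of VRAM."""
    return estimate_vram_gb(gguf_file_gb, overhead_gb) <= vram_gb


# Hypothetical example: a 4.9 GB model file on a 12 GB card.
print(f"estimated need: {estimate_vram_gb(4.9):.1f} GB")
print("fits on a 12 GB card:", fits_on_gpu(4.9, 12))
```

Long context windows grow the overhead beyond this estimate, so treat the result as a floor and check real usage with nvidia-smi once the model is loaded.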
How do I write the docker-compose.yml?
Here's a working two-service stack. Drop this in a folder as docker-compose.yml:
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    # GPU section: see notes below

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:
```
The key line is OLLAMA_BASE_URL=http://ollama:11434. Because both containers share the default Compose network, Open WebUI reaches Ollama by its service name (ollama), not localhost. That internal hostname is the whole trick — get it wrong and the front-end can't see any models. If you hit that, my Open WebUI + Ollama connection guide walks through the fix.
Bring it up:
```shell
docker compose up -d
docker compose logs -f
```
Open WebUI lands at http://localhost:3000. The first account you create becomes the admin.
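To confirm that Ollama itself is answering on the published port, you can ask its HTTP API which models it has. The sketch below assumes the stack is reachable at http://localhost:11434 and uses Ollama's /api/tags endpoint; the helper names are made up for illustration, and the sample payload is hand-written, not real output.

```python
import json
from urllib.request import urlopen


def parse_model_names(tags_json: str) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    payload = json.loads(tags_json)
    return [model["name"] for model in payload.get("models", [])]


def list_models(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> list[str]:
    """Query a running Ollama instance (needs the stack to be up)."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return parse_model_names(response.read().decode("utf-8"))


# Offline demonstration of the parsing step with a hand-written sample.
sample = '{"models": [{"name": "qwen3:8b"}, {"name": "llama3.1:8b"}]}'
print(parse_model_names(sample))
```

Call list_models() once the containers are running. An empty list is normal before any model has been pulled; a connection error means the port mapping or the container is the problem, not Open WebUI.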
How do I enable GPU acceleration in the container?
Without GPU passthrough, Ollama runs CPU-only — it works, just slowly. To use an NVIDIA card, install the NVIDIA Container Toolkit on the host, then add a deploy block to the ollama service:
```yaml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Verify the toolkit first:
```shell
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
If that prints your GPU, you're set. Decision points:
- If you have an NVIDIA GPU → use the ollama/ollama:latest image with the deploy block above. This is the smoothest path.
- If you have an AMD GPU → use the ollama/ollama:rocm image and pass /dev/kfd and /dev/dri devices. ROCm support is real but pickier; see my NVIDIA vs AMD for local LLM rundown.
- If you're on Apple Silicon → Docker can't pass through the Mac's GPU. Run Ollama natively on macOS (it uses Metal) and only containerize Open WebUI, pointing it at http://host.docker.internal:11434.
That Apple caveat trips people up constantly. Docker Desktop on a Mac runs in a Linux VM that has no Metal access, so a containerized Ollama falls back to CPU. Native Ollama plus a containerized WebUI is the right split on Apple Silicon.
How do I pull and run a model?
With the stack up, pull a model into the Ollama container:
```shell
docker exec -it ollama ollama pull qwen3:8b
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama list
```
Then refresh Open WebUI — the new models appear in the model dropdown. Start small. A Q4_K_M 8B model is the sweet spot for a first run: good quality, modest memory, fast enough to feel responsive. Scale up only after it works. My pull your first open-weight model in 5 minutes guide covers picking a starter model.
A note on quant tags: q4_K_M is the 4-bit quant most people should default to. It cuts the model to a little over a quarter of the FP16 size with minimal quality loss. q8_0 is near-lossless but needs nearly twice the memory of q4_K_M. The full tradeoff is in Q4 vs Q8 quant quality.
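The size relationship between quant levels is plain arithmetic: parameters times bits per weight. The sketch below uses nominal bit widths (16, 8, 4) as a simplifying assumption; real GGUF quants store extra scale data, so actual files come out somewhat larger than these figures. The function name is illustrative.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


# Nominal sizes for an 8B-parameter model.
for label, bits in (("FP16", 16), ("8-bit (nominal)", 8), ("4-bit (nominal)", 4)):
    print(f"{label}: {weights_size_gb(8, bits):.1f} GB")
```

For an 8B model this gives 16, 8, and 4 GB, which is where the "quarter of FP16" shorthand for 4-bit quants comes from.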
How do I confirm nothing is going to the cloud?
This is the entire point of running local, so verify it. The "no cloud relay" claim holds because both containers talk over the internal Docker bridge network and the model inference happens inside Ollama on your hardware. To prove it:
- Pull a model, then disconnect the machine from the internet (unplug ethernet / turn off Wi-Fi).
- Open http://localhost:3000 and start a chat.
- It still works. If responses come back offline, no remote API is involved.
A few settings to lock down for a genuinely private stack:
- In Open WebUI admin settings, disable OpenAI API connections if you only want local models (otherwise it can route to OpenAI when keys are set).
- Set OLLAMA_KEEP_ALIVE if you want models to stay loaded; unrelated to privacy but saves reload time.
- Don't expose port 3000 or 11434 to the public internet without auth and TLS. For a hardening pass, see my local LLM keep-data-off-cloud checklist.
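One detail behind the port advice: a Compose mapping written without a host IP, such as "11434:11434", is published on all host interfaces, while "127.0.0.1:11434:11434" is reachable only from the host itself. The sketch below checks short-form mapping strings for that difference. The helper names are illustrative, and it only handles the short syntax with single ports.

```python
import ipaddress
from typing import Optional


def published_host_ip(mapping: str) -> Optional[str]:
    """Return the host IP from a short-form port mapping, or None if absent."""
    spec = mapping.split("/", 1)[0]      # drop an optional "/tcp" or "/udp" suffix
    if spec.startswith("["):             # bracketed IPv6 host, e.g. "[::1]:3000:8080"
        return spec[1:spec.index("]")]
    parts = spec.split(":")
    return parts[0] if len(parts) == 3 else None


def is_loopback_only(mapping: str) -> bool:
    """True only when the mapping binds explicitly to a loopback address."""
    ip = published_host_ip(mapping)
    return ip is not None and ipaddress.ip_address(ip).is_loopback


for mapping in ("11434:11434", "127.0.0.1:11434:11434", "0.0.0.0:3000:8080"):
    status = "loopback only" if is_loopback_only(mapping) else "not loopback-only"
    print(f"{mapping} -> {status}")
```

Binding to 127.0.0.1 also blocks other machines on your LAN, so it suits a single-machine setup; for shared homelab access, keep the wider binding and put auth and TLS in front of it.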
How do I update and maintain the stack?
Updates are a three-command ritual. Named volumes (ollama, open-webui) preserve your models and chat history across image upgrades:
```shell
docker compose pull
docker compose up -d
docker image prune -f
```
The ollama volume holds downloaded GGUF weights — those can be tens of gigabytes, so don't delete it casually. The open-webui volume holds your accounts, chats, and uploaded RAG documents. Back both up before any risky change.
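One common way to back up a named volume is to mount it read-only into a throwaway container and tar its contents to a host folder. The sketch below only builds and prints that command so you can review it before running it. The alpine image, the /tmp/backups path, and the function name are assumptions for illustration. Compose usually prefixes volume names with the project name, so check docker volume ls for the real name first.

```python
import shlex


def volume_backup_command(volume: str, backup_dir: str, archive_name: str) -> list[str]:
    """Build a docker run command that archives a named volume into backup_dir."""
    return [
        "docker", "run", "--rm",
        "-v", f"{volume}:/data:ro",          # source volume, mounted read-only
        "-v", f"{backup_dir}:/backup",       # host folder that receives the archive
        "alpine",                            # assumed helper image with tar available
        "tar", "czf", f"/backup/{archive_name}", "-C", "/data", ".",
    ]


# Hypothetical volume name and destination; adjust to your setup.
cmd = volume_backup_command("ollama", "/tmp/backups", "ollama.tar.gz")
print(shlex.join(cmd))
```

Stopping the stack before archiving the open-webui volume avoids copying its database mid-write.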
If you outgrow Ollama's defaults — you want raw llama.cpp flags, custom GPU layer offload, or a different server — that's a sign to graduate. I cover the migration in llama.cpp vs Ollama: when to switch, and tuning offload in GPU offload layers explained.
Bottom line
Two containers, one Compose file, and you've got a private AI chat stack: Ollama serves open-weight models from your GPU, Open WebUI gives you the browser interface, and the internal Docker network keeps everything off the cloud. Start with a Q4_K_M 8B model, get GPU passthrough working (or run Ollama native on Apple Silicon), and verify privacy by pulling the network cable mid-chat. From there, scale models up or swap runners as your hardware allows. For the full self-hosting picture, head back to the self-hosting guide for beginners pillar.
Frequently asked questions
See /blog/self-hosting-guide-beginners-2026 for the full cornerstone guide.
