See /blog/self-hosting-guide-beginners-2026 for the full cornerstone guide.

Homelab Docker Stack: Ollama + Open WebUI | WikiWayne

A homelab Docker stack for local AI runs two services that talk to each other on your own machine: Ollama as the model runner (the OpenAI-style API that actually loads and serves the model) and Open WebUI as the browser chat front-end. Wire them together with one docker-compose.yml, point Open WebUI at Ollama over the internal Docker network, and you get a private ChatGPT-style interface where nothing leaves your box. No cloud relay, no API keys, no telemetry — just open-weight models like Qwen, Llama, Gemma, and Mistral running on hardware you control.

This is the cluster guide for the broader self-hosting guide for beginners pillar. Read that for the full sizing tables; here we get the two-container stack running.

What is the Ollama + Open WebUI stack?

It's a two-part split that mirrors how cloud AI works, except both halves live on your network:

Ollama — an open-source model runner built on llama.cpp. It downloads GGUF model weights, manages VRAM, and exposes a local HTTP API on port 11434. Think of it as the engine.
Open WebUI — a self-hosted web interface (formerly "Ollama WebUI"). It gives you chat threads, model switching, RAG document upload, and multi-user accounts. Think of it as the dashboard.

They're separate containers on purpose. You can swap the front-end, run multiple front-ends against one Ollama, or expose the Ollama API to other apps on your LAN. If you're still deciding between runners, my LM Studio vs Ollama vs llama.cpp comparison breaks down when each one wins.

Why use Docker instead of installing Ollama directly?

You don't need Docker — a native install works fine for one person on one machine. Docker earns its keep when you want a reproducible, restartable stack. Define the whole thing once, version it in git, and docker compose up rebuilds it identically on any box.

Approach	Best for	Tradeoff
Native install	Single user, simplest setup	Manual updates, no isolation, harder to reproduce
Docker Compose	Homelab, multi-user, repeatable	GPU passthrough config, slight overhead
Docker + reverse proxy	LAN/remote access with TLS	More moving parts to maintain

For a homelab where you already run other containers, Docker is the obvious choice. The one wrinkle is GPU access — covered below.

What hardware do I need to run this?

The stack itself is light; the model is what eats resources. Ballpark guidance (verify on your own hardware — these vary by quant and context length):

8B-class models (Llama 3.1 8B, Qwen3 8B) at Q4_K_M: roughly 5-7 GB of VRAM or unified memory. Comfortable on a 12 GB GPU or a 16 GB Mac.
14B models at Q4_K_M: roughly 9-12 GB. Wants a 16 GB+ card.
20-30B MoE models (GPT-OSS-20B, Qwen3-30B-A3B): more memory than the 14B tier, because every expert has to be loaded, but mixture-of-experts means they punch above their active-parameter weight.

If you're sizing VRAM, my VRAM requirements guide and the how much VRAM for Llama 3 8B breakdown have the real math. Quick rule: GGUF model file size + 1-2 GB overhead ≈ VRAM needed to fit it fully on the GPU.
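As a worked instance of that quick rule, here is a minimal shell sketch. The helper name estimate_vram_gb and the 5 GB example file are illustrative assumptions, not measurements; substitute the size of the GGUF file you actually pulled.

```shell
#!/usr/bin/env bash
# Rule of thumb from the text: GGUF file size + 1-2 GB overhead ~ VRAM needed.
# estimate_vram_gb is a hypothetical helper; it takes a whole number of gigabytes.
estimate_vram_gb() {
  local file_gb=$1
  echo "$((file_gb + 1))-$((file_gb + 2)) GB"
}

# Example input (assumed): a 5 GB Q4_K_M file.
estimate_vram_gb 5   # prints "6-7 GB"
```

Context length moves the overhead, so treat the result as a starting point and confirm against real usage on your card.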

How do I write the docker-compose.yml?

Here's a working two-service stack. Drop this in a folder as docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    # GPU section — see notes below

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:

The key line is OLLAMA_BASE_URL=http://ollama:11434. Because both containers share the default Compose network, Open WebUI reaches Ollama by its service name (ollama), not localhost. That internal hostname is the whole trick — get it wrong and the front-end can't see any models. If you hit that, my Open WebUI + Ollama connection guide walks through the fix.
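One way to check that wiring is to ask Ollama for its model list from both sides. This is a sketch rather than an official procedure: the function names are hypothetical, the host check assumes curl is installed and port 11434 is published as in the Compose file, and the in-container check assumes the Open WebUI image ships curl.

```shell
#!/usr/bin/env bash
# /api/tags is Ollama's endpoint for listing locally available models.

# From the host, through the published port.
check_ollama_from_host() {
  curl -fsS http://localhost:11434/api/tags
}

# From inside the Open WebUI container, using the Compose service name.
# Assumption: curl exists in that image; if not, docker exec reports it missing.
check_ollama_from_webui() {
  docker exec open-webui curl -fsS http://ollama:11434/api/tags
}
```

If the host check returns JSON but the container check cannot connect, the likely culprit is the hostname or the network, not Ollama itself.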

Bring it up:

docker compose up -d
docker compose logs -f

Open WebUI lands at http://localhost:3000. The first account you create becomes the admin.

How do I enable GPU acceleration in the container?

Without GPU passthrough, Ollama runs CPU-only — it works, just slowly. To use an NVIDIA card, install the NVIDIA Container Toolkit on the host, then add a deploy block to the ollama service:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Verify the toolkit first:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If that prints your GPU, you're set. Decision points:

If you have an NVIDIA GPU → use the ollama/ollama:latest image with the deploy block above. This is the smoothest path.
If you have an AMD GPU → use the ollama/ollama:rocm image and pass /dev/kfd and /dev/dri devices. ROCm support is real but pickier; see my NVIDIA vs AMD for local LLM rundown.
If you're on Apple Silicon → Docker can't pass through the Mac's GPU. Run Ollama natively on macOS (it uses Metal) and only containerize Open WebUI, pointing it at http://host.docker.internal:11434.

That Apple caveat trips people up constantly. Docker Desktop on a Mac runs in a Linux VM that has no Metal access, so a containerized Ollama falls back to CPU. Native Ollama plus a containerized WebUI is the right split on Apple Silicon.
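For the Apple Silicon split, a Compose file with only the Open WebUI service might look like the one this sketch writes out. The function name and the compose.mac.yml file name are hypothetical; the service definition is the one from the stack above with the ollama service and depends_on removed and the base URL pointed at the host.

```shell
#!/usr/bin/env bash
# Writes a Compose file that runs only Open WebUI and points it at native Ollama.
# write_webui_only_compose and the default file name are hypothetical.
write_webui_only_compose() {
  local target=${1:-compose.mac.yml}
  cat > "$target" <<'EOF'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    volumes:
      - open-webui:/app/backend/data

volumes:
  open-webui:
EOF
}
```

After calling write_webui_only_compose, bring the front-end up with docker compose -f compose.mac.yml up -d while Ollama runs natively on macOS.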

How do I pull and run a model?

With the stack up, pull a model into the Ollama container:

docker exec -it ollama ollama pull qwen3:8b
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama list

Then refresh Open WebUI — the new models appear in the model dropdown. Start small. A Q4_K_M 8B model is the sweet spot for a first run: good quality, modest memory, fast enough to feel responsive. Scale up only after it works. My pull your first open-weight model in 5 minutes guide covers picking a starter model.
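To confirm a model loads before touching the browser, a quick smoke test along these lines can help. The function name and the prompt text are hypothetical; it assumes the container is named ollama as in the Compose file.

```shell
#!/usr/bin/env bash
# Send one prompt without opening the web UI, then show what is loaded.
smoke_test_model() {
  local model=${1:-qwen3:8b}
  docker exec ollama ollama run "$model" "Reply with one short sentence."
  # ollama ps lists loaded models; its PROCESSOR column reports the CPU/GPU split.
  docker exec ollama ollama ps
}
```

Call it as smoke_test_model llama3.1:8b to test a different model. If the PROCESSOR column shows CPU on a machine with a GPU, revisit the GPU passthrough setup.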

A note on quant tags: q4_K_M is the 4-bit quant most people should default to. It cuts the model to roughly a quarter of the FP16 size with minimal quality loss. q8_0 is near-lossless but needs about twice the memory of q4_K_M. The full tradeoff is in Q4 vs Q8 quant quality.
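The size ratios can be sanity-checked with back-of-envelope arithmetic. This sketch assumes 2 bytes per weight at FP16, about 1 byte at q8_0, and about half a byte at 4-bit; the helper name is hypothetical, and real GGUF files come out somewhat larger than these idealized figures.

```shell
#!/usr/bin/env bash
# Idealized weight sizes from the parameter count (in billions).
# Assumes 2 bytes/weight at FP16, ~1 byte at q8_0, ~0.5 byte at 4-bit.
# Real q4_K_M files are larger because some tensors are kept at higher precision.
quant_sizes_gb() {
  local params_b=$1
  local fp16=$((params_b * 2))
  echo "fp16=${fp16}GB q8_0=$((fp16 / 2))GB q4=$((fp16 / 4))GB"
}

quant_sizes_gb 8   # prints "fp16=16GB q8_0=8GB q4=4GB"
```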

How do I confirm nothing is going to the cloud?

This is the entire point of running local, so verify it. The "no cloud relay" claim holds because both containers talk over the internal Docker bridge network and the model inference happens inside Ollama on your hardware. To prove it:

Pull a model, then disconnect the machine from the internet (unplug ethernet / turn off Wi-Fi).
Open http://localhost:3000 and start a chat.
It still works. If responses come back offline, no remote API is involved.
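The same offline check can be scripted against Ollama's API instead of the browser. This is a sketch assuming curl on the host and port 11434 published as in the Compose file; the function name and prompt are hypothetical.

```shell
#!/usr/bin/env bash
# One-shot generation request straight at Ollama's API, bypassing the web UI.
# "stream": false asks for a single JSON response instead of a token stream.
offline_generate() {
  local model=${1:-qwen3:8b}
  curl -fsS http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"Say hello.\", \"stream\": false}"
}
```

Run offline_generate with the network disconnected; a JSON reply containing generated text means inference happened locally.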

A few settings to lock down for a genuinely private stack:

In Open WebUI admin settings, disable OpenAI API connections if you only want local models (otherwise it can route to OpenAI when keys are set).
Set OLLAMA_KEEP_ALIVE if you want models to stay loaded; unrelated to privacy but saves reload time.
Don't expose port 3000 or 11434 to the public internet without auth and TLS. For a hardening pass, see my local LLM keep-data-off-cloud checklist.
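To see how widely the two ports are currently published, something like the following can be run from the folder that holds docker-compose.yml. The function name is hypothetical; the service names and container ports are the ones from the Compose file above.

```shell
#!/usr/bin/env bash
# Show which host address each published port is bound to.
show_published_ports() {
  docker compose port ollama 11434
  docker compose port open-webui 8080
}
# An answer of 0.0.0.0:<port> means every interface on the host.
# Writing a mapping as "127.0.0.1:11434:11434" in the Compose file limits it to loopback.
```

Binding the Ollama port to loopback still lets Open WebUI reach it, because the two containers talk over the internal Compose network rather than the published port.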

How do I update and maintain the stack?

Updates are a three-command ritual. Named volumes (ollama, open-webui) preserve your models and chat history across image upgrades:

docker compose pull
docker compose up -d
docker image prune -f

The ollama volume holds downloaded GGUF weights — those can be tens of gigabytes, so don't delete it casually. The open-webui volume holds your accounts, chats, and uploaded RAG documents. Back both up before any risky change.
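One common way to back up a named volume is to tar it from a throwaway container. The sketch below assumes the alpine image is available; backup_volume is a hypothetical helper, and the volume name you pass must be the real one from docker volume ls, since Compose usually prefixes volume names with the project (folder) name.

```shell
#!/usr/bin/env bash
# Archive a named volume to <dest_dir>/<volume>.tar.gz using a temporary container.
# The volume is mounted read-only so the backup cannot modify it.
backup_volume() {
  local volume=$1
  local dest_dir=${2:-$PWD}
  docker run --rm \
    -v "$volume":/data:ro \
    -v "$dest_dir":/backup \
    alpine tar czf "/backup/$volume.tar.gz" -C /data .
}
```

Stopping the stack first with docker compose stop avoids archiving files mid-write, which matters most for the open-webui volume.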

If you outgrow Ollama's defaults — you want raw llama.cpp flags, custom GPU layer offload, or a different server — that's a sign to graduate. I cover the migration in llama.cpp vs Ollama: when to switch, and tuning offload in GPU offload layers explained.

Bottom line

Two containers, one Compose file, and you've got a private AI chat stack: Ollama serves open-weight models from your GPU, Open WebUI gives you the browser interface, and the internal Docker network keeps everything off the cloud. Start with a Q4_K_M 8B model, get GPU passthrough working (or run Ollama native on Apple Silicon), and verify privacy by pulling the network cable mid-chat. From there, scale models up or swap runners as your hardware allows. For the full self-hosting picture, head back to the self-hosting guide for beginners pillar.
