WikiWayne

As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.

Homelab Docker Stack: Ollama + Open WebUI

Published: June 13, 2026

Compose services for local chat without cloud relay.

Wayne Lowry

10+ years in Digital Marketing & SEO

A homelab Docker stack for local AI runs two services that talk to each other on your own machine: Ollama as the model runner (it loads the model and serves it over a local HTTP API, including OpenAI-compatible endpoints) and Open WebUI as the browser chat front-end. Wire them together with one docker-compose.yml, point Open WebUI at Ollama over the internal Docker network, and you get a private ChatGPT-style interface where nothing leaves your box. No cloud relay, no API keys, no telemetry: just open-weight models like Qwen, Llama, Gemma, and Mistral running on hardware you control.

This guide is one part of the broader Self-Hosting for Beginners guide. Read that for the full sizing tables; here we get the two-container stack running.

What is the Ollama + Open WebUI stack?

It's a two-part split that mirrors how cloud AI works, except both halves live on your network:

  • Ollama — an open-source model runner built on llama.cpp. It downloads GGUF model weights, manages VRAM, and exposes a local HTTP API on port 11434. Think of it as the engine.
  • Open WebUI — a self-hosted web interface (formerly "Ollama WebUI"). It gives you chat threads, model switching, RAG document upload, and multi-user accounts. Think of it as the dashboard.

They're separate containers on purpose. You can swap the front-end, run multiple front-ends against one Ollama, or expose the Ollama API to other apps on your LAN. If you're still deciding between runners, my LM Studio vs Ollama vs llama.cpp comparison breaks down when each one wins.
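As a rough sketch of what "expose the Ollama API to other apps" can look like, the snippet below builds a request body and posts it to Ollama's /api/generate endpoint with curl, and lists downloaded models via /api/tags. The helper names (ollama_payload, ollama_generate, ollama_models), the OLLAMA_URL variable, and the example model tag are placeholders chosen for illustration, and the payload helper assumes the model name and prompt contain no double quotes or backslashes.

```shell
# Base URL of the Ollama API; 11434 is the port used throughout this guide.
# OLLAMA_URL is an illustrative variable name, not something Ollama itself reads.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build a minimal JSON body for /api/generate.
# Assumes $1 (model) and $2 (prompt) contain no double quotes or backslashes.
ollama_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# Send one prompt and print the raw JSON reply.
ollama_generate() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(ollama_payload "$1" "$2")"
}

# List the models Ollama has already downloaded.
ollama_models() {
  curl -s "$OLLAMA_URL/api/tags"
}
```

Once the stack is running and a model has been pulled, something like ollama_generate qwen3:8b "Say hello in five words" should return a JSON reply. From another machine on the LAN, set OLLAMA_URL to the Docker host's address instead of localhost.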

Why use Docker instead of installing Ollama directly?

You don't need Docker — a native install works fine for one person on one machine. Docker earns its keep when you want a reproducible, restartable stack. Define the whole thing once, version it in git, and docker compose up rebuilds it identically on any box.

Approach               | Best for                        | Tradeoff
Native install         | Single user, simplest setup     | Manual updates, no isolation, harder to reproduce
Docker Compose         | Homelab, multi-user, repeatable | GPU passthrough config, slight overhead
Docker + reverse proxy | LAN/remote access with TLS      | More moving parts to maintain

For a homelab where you already run other containers, Docker is the obvious choice. The one wrinkle is GPU access — covered below.

What hardware do I need to run this?

The stack itself is light; the model is what eats resources. Ballpark guidance (verify on your own hardware — these vary by quant and context length):

  • 8B-class models (Llama 3.1 8B, Qwen3 8B) at Q4_K_M: roughly 5-7 GB of VRAM or unified memory. Comfortable on a 12 GB GPU or a 16 GB Mac.
  • 14B models at Q4_K_M: roughly 9-12 GB. Wants a 16 GB+ card.
  • 20B-30B MoE models (Qwen3-30B-A3B, GPT-OSS-20B): more memory again, because every expert has to be loaded, but only a few billion parameters are active per token, so they run faster than a dense model of the same total size.

If you're sizing VRAM, my VRAM requirements guide and the how much VRAM for Llama 3 8B breakdown have the real math. Quick rule: GGUF model file size + 1-2 GB overhead ≈ VRAM needed to fit it fully on the GPU.
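The quick rule above is simple enough to script. This is a minimal sketch, not a precise calculator: the function names are made up for illustration, the 5.0 GB file size is a hypothetical example, and the fit check uses the pessimistic 2 GB end of the overhead range.

```shell
# Estimate VRAM for a GGUF file: file size plus 1-2 GB of overhead.
# $1 is the model file size in GB (decimals allowed). Prints a low-high range.
estimate_vram_gb() {
  awk -v size="$1" 'BEGIN { printf "%.1f-%.1f GB\n", size + 1, size + 2 }'
}

# Succeeds (exit 0) if file size + 2 GB fits within the given VRAM.
# $1 is the file size in GB, $2 is the card's VRAM in GB.
fits_in_vram() {
  awk -v size="$1" -v vram="$2" 'BEGIN { exit !(size + 2 <= vram) }'
}

estimate_vram_gb 5.0   # prints: 6.0-7.0 GB
if fits_in_vram 5.0 8; then
  echo "should fit fully on an 8 GB card"
else
  echo "may not fit fully on an 8 GB card"
fi
```

Longer context windows add to the overhead, so treat the upper bound as a floor rather than a guarantee.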

How do I write the docker-compose.yml?

Here's a working two-service stack. Drop this in a folder as docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    # GPU section — see notes below

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:

The key line is OLLAMA_BASE_URL=http://ollama:11434. Because both containers share the default Compose network, Open WebUI reaches Ollama by its service name (ollama), not localhost. That internal hostname is the whole trick — get it wrong and the front-end can't see any models. If you hit that, my Open WebUI + Ollama connection guide walks through the fix.
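To make the service-name point concrete, here is a small sanity check for the URL. It is only a sketch: check_base_url is a hypothetical helper, and it catches just the most common mistake, a URL that points at localhost, which inside the Open WebUI container means the container itself.

```shell
# Return 0 if the URL is plausible from inside the Open WebUI container,
# 1 if it points at the container's own loopback address.
check_base_url() {
  case "$1" in
    http://localhost*|http://127.0.0.1*)
      echo "bad: $1 resolves to the Open WebUI container itself"
      return 1
      ;;
    *)
      echo "ok: $1"
      return 0
      ;;
  esac
}

check_base_url "http://ollama:11434"             # prints: ok: http://ollama:11434
check_base_url "http://localhost:11434" || true  # prints the "bad:" message
```

To check a running stack, you could feed it the live value with docker compose exec -T open-webui printenv OLLAMA_BASE_URL, assuming the image ships printenv.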

Bring it up:

docker compose up -d
docker compose logs -f

Open WebUI lands at http://localhost:3000. The first account you create becomes the admin.

How do I enable GPU acceleration in the container?

Without GPU passthrough, Ollama runs CPU-only — it works, just slowly. To use an NVIDIA card, install the NVIDIA Container Toolkit on the host, then add a deploy block to the ollama service:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
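If you would rather not edit the main file, one option is to keep the GPU settings in a Compose override file. The sketch below writes one. The file name docker-compose.gpu.yml and the helper name write_gpu_override are my own placeholders, and it assumes the service is named ollama as in the Compose file above.

```shell
# Write a Compose override that adds NVIDIA GPU access to the ollama service.
# $1 is the output path (defaults to a hypothetical docker-compose.gpu.yml).
write_gpu_override() {
  cat > "${1:-docker-compose.gpu.yml}" <<'EOF'
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
}
```

After calling write_gpu_override, start the stack with docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d so Compose merges both files. On a machine without an NVIDIA card, leave the second -f off and the same base file runs CPU-only.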

Verify the toolkit first:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If that prints your GPU, you're set. Decision points:

  • If you have an NVIDIA GPU → use the ollama/ollama:latest image with the deploy block above. This is the smoothest path.
  • If you have an AMD GPU → use the ollama/ollama:rocm image and pass /dev/kfd and /dev/dri devices. ROCm support is real but pickier; see my NVIDIA vs AMD for local LLM rundown.
  • If you're on Apple Silicon → Docker can't pass through the Mac's GPU. Run Ollama natively on macOS (it uses Metal) and only containerize Open WebUI, pointing it at http://host.docker.internal:11434.

That Apple caveat trips people up constantly. Docker Desktop on a Mac runs in a Linux VM that has no Metal access, so a containerized Ollama falls back to CPU. Native Ollama plus a containerized WebUI is the right split on Apple Silicon.
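For the Apple Silicon split, the Compose file shrinks to a single service. The sketch below writes that reduced file. The file name docker-compose.mac.yml and the helper name write_mac_compose are placeholders of mine, and it assumes Ollama is already running natively on the Mac on its default port.

```shell
# Write a WebUI-only Compose file that talks to Ollama running natively on macOS.
# $1 is the output path (defaults to a hypothetical docker-compose.mac.yml).
write_mac_compose() {
  cat > "${1:-docker-compose.mac.yml}" <<'EOF'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    volumes:
      - open-webui:/app/backend/data

volumes:
  open-webui:
EOF
}
```

Bring it up with docker compose -f docker-compose.mac.yml up -d. The only differences from the two-service file are the missing ollama service and the host.docker.internal URL.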

How do I pull and run a model?

With the stack up, pull a model into the Ollama container:

docker exec -it ollama ollama pull qwen3:8b
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama list

Then refresh Open WebUI; the new models appear in the model dropdown. Start small. A Q4_K_M 8B model is the sweet spot for a first run: good quality, modest memory, fast enough to feel responsive. Scale up only after it works. My "Pull your first open-weight model in 5 minutes" guide covers picking a starter model.

A note on quant tags: q4_K_M is the 4-bit quant most people should default to. It cuts the model to roughly a quarter to a third of the FP16 size with minimal quality loss. q8_0 is near-lossless but needs close to twice the memory of the 4-bit version. The full tradeoff is in "Q4 vs Q8 quant quality".
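The size relationship between quants is plain arithmetic: parameters times bits per weight, divided by eight. The sketch below uses nominal bit widths (16, 8, 4), which is a simplification on my part. Real GGUF quants store extra scaling data, so actual files come out somewhat larger than these figures. The function name is illustrative.

```shell
# Approximate weight size in GB: parameters (in billions) x bits per weight / 8.
# Nominal bit widths only; real quantized files are somewhat larger.
approx_weights_gb() {
  awk -v params_b="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

echo "8B at FP16:  $(approx_weights_gb 8 16) GB"   # 16.0 GB
echo "8B at 8-bit: $(approx_weights_gb 8 8) GB"    # 8.0 GB
echo "8B at 4-bit: $(approx_weights_gb 8 4) GB"    # 4.0 GB
```

Each halving of the bit width halves the weights, which is why dropping from 8-bit to 4-bit is often the difference between a model fitting on a card or not.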

How do I confirm nothing is going to the cloud?

This is the entire point of running local, so verify it. The "no cloud relay" claim holds because both containers talk over the internal Docker bridge network and the model inference happens inside Ollama on your hardware. To prove it:

  1. Pull a model, then disconnect the machine from the internet (unplug ethernet / turn off Wi-Fi).
  2. Open http://localhost:3000 and start a chat.
  3. Send a message. If a response still comes back while you're offline, inference isn't going through a remote API.

A few settings to lock down for a genuinely private stack:

  • In Open WebUI admin settings, disable OpenAI API connections if you only want local models (otherwise it can route to OpenAI when keys are set).
  • Set OLLAMA_KEEP_ALIVE if you want models to stay loaded; unrelated to privacy but saves reload time.
  • Don't expose port 3000 or 11434 to the public internet without auth and TLS. For a hardening pass, see my local LLM keep-data-off-cloud checklist.
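On the port-exposure point: a Compose ports entry written as "HOST:CONTAINER" publishes on every network interface of the host, while prefixing it with 127.0.0.1 restricts it to the host itself. The sketch below classifies an entry so you can see which case you are in. It is a simplified illustration: the function name is hypothetical and it ignores IPv6 and port-range syntax.

```shell
# Classify a Compose "ports" entry by the host interface it binds to.
classify_port_mapping() {
  case "$1" in
    127.0.0.1:*:*) echo "loopback only" ;;
    0.0.0.0:*:*)   echo "all interfaces" ;;
    *:*:*)         echo "specific address" ;;
    *:*)           echo "all interfaces" ;;
    *)             echo "unrecognized" ;;
  esac
}

classify_port_mapping "11434:11434"            # prints: all interfaces
classify_port_mapping "127.0.0.1:11434:11434"  # prints: loopback only
```

If only Open WebUI needs to reach Ollama, the loopback form (or dropping the ollama ports entry entirely) keeps the API off the LAN, since the two containers already talk over the internal Compose network.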

How do I update and maintain the stack?

Updates are a three-command ritual. Named volumes (ollama, open-webui) preserve your models and chat history across image upgrades:

docker compose pull
docker compose up -d
docker image prune -f

The ollama volume holds downloaded GGUF weights — those can be tens of gigabytes, so don't delete it casually. The open-webui volume holds your accounts, chats, and uploaded RAG documents. Back both up before any risky change.
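One common way to back up a named volume is to tar it from a throwaway container. The sketch below does that with the alpine image. The function names are mine, and note one assumption to verify: Compose usually prefixes volume names with the project (folder) name, so check docker volume ls and pass the name exactly as it appears there.

```shell
# Build a dated archive name for a volume, e.g. ollama-2026-06-13.tgz.
# $1 is the volume name, $2 an optional date (defaults to today).
backup_name() {
  printf '%s-%s.tgz\n' "$1" "${2:-$(date +%Y-%m-%d)}"
}

# Archive a named volume into the current directory using a temporary
# Alpine container. The volume is mounted read-only.
backup_volume() {
  docker run --rm \
    -v "$1":/data:ro \
    -v "$PWD":/backup \
    alpine tar czf "/backup/$(backup_name "$1")" -C /data .
}
```

Running docker compose stop first and then backup_volume for each volume gives a more consistent snapshot, since nothing is writing to the data while it is archived.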

If you outgrow Ollama's defaults (you want raw llama.cpp flags, custom GPU layer offload, or a different server), that's a sign to graduate. I cover the migration in "llama.cpp vs Ollama: when to switch", and tuning offload in "GPU offload layers explained".

Bottom line

Two containers, one Compose file, and you've got a private AI chat stack: Ollama serves open-weight models from your GPU, Open WebUI gives you the browser interface, and the internal Docker network keeps everything off the cloud. Start with a Q4_K_M 8B model, get GPU passthrough working (or run Ollama native on Apple Silicon), and verify privacy by pulling the network cable mid-chat. From there, scale models up or swap runners as your hardware allows. For the full self-hosting picture, head back to the self-hosting guide for beginners pillar.

Frequently asked questions

See /blog/self-hosting-guide-beginners-2026 for the full cornerstone guide.
