Is this page updated when runners change?

Yes. Cornerstone posts bump updatedAt when Ollama, LM Studio, or llama.cpp ship breaking changes; see the refresh log in Content Ideas.

Do I need a GPU?

A GPU helps for 7B+ models at interactive speed. CPU-only inference is supported for privacy experiments with smaller quants.

Open WebUI for Local AI | WikiWayne

Open WebUI for Local AI

Open WebUI is a self-hosted, browser-based chat interface that puts a polished, ChatGPT-style front end on top of your local models, talking to backends like Ollama or any OpenAI-compatible API. If you've been living in a terminal running ollama run, this is the upgrade that makes local AI usable by your whole household: persistent chat history, multiple users, document RAG, and model switching, all running on your own hardware with nothing leaving the box. I run it as the default door to my local stack, and it's the piece that turned "Wayne's nerd project" into something my family actually opens.

What is Open WebUI, in one sentence?

Open WebUI is an open-source, self-hostable web app (check the repository's LICENSE for current terms, which have changed across releases) that gives local LLM backends a multi-user chat UI with history, role-based admin, document/RAG ingestion, and model management, typically paired with Ollama but compatible with any OpenAI-style endpoint.

It does not run models itself. It's the cockpit; Ollama, llama.cpp, or LM Studio is the engine. That separation is the whole point: you can swap the backend without touching the interface your users see.

Why use Open WebUI instead of the terminal or LM Studio's built-in chat?

Because a single-user desktop chat window doesn't scale past one person at one machine. Open WebUI gives you things a terminal never will:

Multi-user accounts with an admin panel, so your partner and kids get their own logins and history.
Persistent, searchable chat history stored in a local SQLite (or Postgres) database, not lost when you close a window.
Document RAG built in: drop in PDFs or text, and the model can answer over them.
Model switching in a dropdown across every backend you've connected.
Access from any device on your LAN, phone included, since it's just a web page.

If you only ever chat solo on one laptop, LM Studio's built-in UI is fine and simpler. The moment you want other people, other devices, or RAG, Open WebUI earns its keep. See LM Studio vs Ollama vs llama.cpp for picking the engine underneath.

How do I install Open WebUI? (Docker quickstart)

The cleanest path is Docker. This assumes Ollama is already running on the host (if not, start with Install Ollama).

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000, create the first account (it becomes the admin), and you're in. The -v open-webui:/app/backend/data volume is what keeps your users, chats, and settings across container updates, so don't skip it.

If you'd rather bundle Ollama and Open WebUI together in one container:

docker run -d \
  -p 3000:8080 \
  --gpus=all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama

No Docker? You can run it with pip install open-webui && open-webui serve, but on Apple Silicon and mixed homelabs I find Docker far less fussy for updates. For a full reproducible homelab, the Docker stack guide wires Ollama and Open WebUI together with compose.

How do I connect Open WebUI to my models?

By default the container looks for Ollama at http://host.docker.internal:11434. If Ollama is on the same host, that usually just works. To point at any OpenAI-compatible endpoint (llama.cpp's server, vLLM, LM Studio's server, a remote box), go to Admin Panel → Settings → Connections and add the base URL plus an API key (use a dummy key like sk-local for local servers that ignore it).

A llama.cpp server, for example, exposes an OpenAI-style API you can drop straight in:

llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 -c 8192

Then add http://host.docker.internal:8080/v1 as an OpenAI connection. For the Ollama-specific wiring and common gotchas (firewall, OLLAMA_HOST, the host.docker.internal trick on Linux), I wrote a dedicated Open WebUI + Ollama connection guide and the broader OpenAI-compatible API reference.
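Before adding a connection, it can help to confirm the endpoint answers at all. The sketch below assumes the server implements the OpenAI-style GET /v1/models route (llama-server does, as do most compatible servers) and that curl is installed. models_url and check_endpoint are hypothetical helper names, not part of any tool.

```shell
# Build the model-listing URL from an OpenAI-style base URL,
# tolerating a trailing slash on the base.
models_url() {
  printf '%s/models\n' "${1%/}"
}

# Ask the server which models it serves. The key defaults to the dummy
# value sk-local, which servers that ignore keys will accept.
check_endpoint() {
  curl -fsS -H "Authorization: Bearer ${2:-sk-local}" "$(models_url "$1")"
}

# Example, run on the host against the llama-server command above:
#   check_endpoint http://localhost:8080/v1
```

Run it on the host with localhost in place of host.docker.internal, since that name is how the container reaches the host. A JSON list of model IDs means the base URL is right; a connection error points at the firewall or bind address rather than at Open WebUI.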

Which backend should I run behind Open WebUI?

Quick decision list:

If you want the easiest setup and model pulls → Ollama. One command pulls a GGUF and registers it; Open WebUI sees it instantly.
If you want maximum control over flags, context, and quant → llama.cpp's llama-server. More knobs, more speed tuning, slightly more setup. (build guide)
If you're on Apple Silicon and want native Metal speed → MLX as a server, or Ollama, which already uses Metal. (MLX setup)
If you already curate models in a desktop app → LM Studio's local server, then point Open WebUI at it.

Backend	Setup effort	Best for	Connect via
Ollama	Lowest	Households, fast start	Native Ollama connection
llama.cpp server	Medium	Tuning, raw speed	OpenAI `/v1` endpoint
LM Studio server	Low	Desktop-curated libraries	OpenAI `/v1` endpoint
MLX server	Medium	Apple Silicon natives	OpenAI `/v1` endpoint

You can connect several at once and switch per chat. That's how I run a fast small model for quick questions and a bigger one for code, side by side.

What hardware and model size do I actually need?

Open WebUI itself is lightweight: it's a web app, not a model, so it'll run on a Raspberry Pi or a tiny VM. The hardware question is really about the backend model you're serving.

Rough sizing for a 7B-8B model, which is the sweet spot for a household assistant:

A Q4_K_M quant of a 7B-8B model is roughly 4 to 5 GB on disk, and you generally want a few GB of VRAM headroom beyond the file size for the KV cache and context.
An 8GB GPU comfortably runs 7B-8B at Q4 with room for a usable context window. A 12-16GB card opens up 13B-14B and longer contexts.
CPU-only works for these sizes, just slower; it's fine for occasional privacy-first chats.

Don't take my ballparks as gospel: VRAM use shifts with context length, quant, and runner. Verify on your own stack, and read how much VRAM for Llama 3 8B and the VRAM requirements guide before you commit to a model. On quant tradeoffs, Q4 vs Q8 covers when the bigger file is worth it.
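To see why context length moves the number so much, the KV cache can be estimated with plain arithmetic. This is a back-of-envelope sketch, assuming a 16-bit cache and the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128). kv_cache_mib is a hypothetical helper, and real runners add compute buffers on top and may quantize the cache.

```shell
# Estimate KV cache size in MiB.
# Args: layers kv_heads head_dim bytes_per_value context_tokens
kv_cache_mib() {
  # The leading 2 covers keys and values, both cached for every layer.
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1024 / 1024 ))
}

# Llama 3 8B shape with a 16-bit (2-byte) cache:
kv_cache_mib 32 8 128 2 8192    # prints 1024
kv_cache_mib 32 8 128 2 32768   # prints 4096
```

Under these assumptions the cache costs about 1 GiB at an 8K context and about 4 GiB at 32K, on top of the model file itself, which is why a long context can push an 8GB card over the edge.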

Rule of thumb I use: start with a small Q4_K_M quant, confirm it loads and responds at interactive speed on your GPU, then scale up model size or context. Going big first is the fastest way to hit out-of-memory crashes.

How does RAG (document chat) work in Open WebUI?

RAG, retrieval-augmented generation, means the model answers using chunks pulled from your own documents instead of just its training data. In Open WebUI you upload files into the Documents/Knowledge section, it embeds and indexes them locally, and you reference a document in chat with # to ground answers in it.

A few practitioner notes:

The default embedding model is small and runs locally; for better retrieval on technical PDFs you can swap in a stronger embedding model in Admin → Settings → Documents.
Keep documents focused. A 400-page manual dumped in whole retrieves worse than the three relevant chapters.
Everything (embeddings, vector store, chats) stays on your machine. That's the privacy win. Pair it with the keep-data-off-cloud checklist if privacy is the whole reason you're here.

Is Open WebUI private and safe to expose?

On your LAN, with no port forwarding, it's as private as the host machine: chats and documents sit in the local data volume, and inference happens on your hardware. Safe defaults to keep it that way:

Don't expose port 3000 to the open internet without a reverse proxy + HTTPS + auth in front (Caddy or Traefik with basic auth or an OAuth provider).
Turn off open signups in Admin settings so strangers can't self-register if it's ever reachable.
Keep the data volume backed up; it holds everyone's history.
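One way to take that backup is to archive the named volume from a throwaway container. This is a minimal sketch, assuming the volume is named open-webui as in the quickstart and that pulling the alpine image is acceptable. backup_name and backup_volume are hypothetical helper names.

```shell
# Name the archive after today's date, e.g. open-webui-2025-01-31.tgz
backup_name() {
  printf 'open-webui-%s.tgz\n' "$(date +%Y-%m-%d)"
}

# Archive the volume into the current directory using a temporary
# Alpine container. The volume is mounted read-only.
backup_volume() {
  docker run --rm \
    -v open-webui:/data:ro \
    -v "$PWD":/backup \
    alpine tar czf "/backup/$(backup_name)" -C /data .
}

# Run on the Docker host:
#   backup_volume
```

Stopping the container first (docker stop open-webui) avoids copying the SQLite database mid-write. The archive lands in whatever directory you run the function from.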

For a fully offline posture, run it on a box with no outbound internet and confirm the backend model is local-only.

Bottom line

Open WebUI is the front end that makes a local-AI stack feel like a real product instead of a science experiment: multi-user chat, persistent history, document RAG, and model switching, all self-hosted with nothing phoning home. Run it in Docker, point it at Ollama (or any OpenAI-compatible backend), start with a small Q4_K_M 7B-8B quant, verify VRAM on your own GPU, and scale from there. Get the backend right with the connection guide and the homelab Docker stack, and you've got a private ChatGPT for the whole house that costs nothing per token.
