Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
Ollama OpenAI-Compatible API for Local Apps
Point agents and UIs at `http://localhost:11434/v1`.
Point any app that speaks the OpenAI API at `http://localhost:11434/v1` and Ollama answers much like a drop-in replacement for OpenAI's cloud: same request shape and same response shape for the endpoints it supports, but running an open-weight model on your own hardware. You swap the `base_url`, hand it any string as the API key, and set the model to whatever you've pulled (say `llama3.1` or `qwen2.5`). That's the whole trick: many "ChatGPT-only" tools work locally once those three settings change, provided they let you override the base URL.
What is Ollama's OpenAI-compatible API?
It's a compatibility layer Ollama exposes at /v1 that mimics OpenAI's REST endpoints — /chat/completions, /completions, /embeddings, and /models. Your code thinks it's talking to OpenAI; it's actually hitting a local model. This matters because the OpenAI client format became the de facto standard, so agents, IDEs, and chat UIs almost universally support a custom base URL. Point that knob at Ollama and the cloud dependency vanishes.
For the full Ollama-vs-the-field picture, the parent pillar is Ollama vs LM Studio 2026.
How do I point an app at http://localhost:11434/v1?
First, make sure Ollama is running and you've pulled a model:
```shell
ollama pull llama3.1:8b
ollama serve   # usually already running as a background service
```
Then in any OpenAI-compatible client, set three things:
- Base URL: `http://localhost:11434/v1`
- API key: any non-empty string (`ollama`, `sk-local`, whatever; it's ignored but often required by the client)
- Model: the exact tag you pulled, e.g. `llama3.1:8b`
Quick smoke test with plain curl:
```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say hi in five words."}]
  }'
```
If you get JSON back with a choices array, you're done — every OpenAI-shaped tool on your machine can now talk to a local model.
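The same smoke test can be sketched in Python with only the standard library. This is a minimal sketch rather than anything Ollama ships: the helper names `build_chat_request`, `extract_reply`, and `smoke_test` are made up for illustration, and it assumes the default address and the `llama3.1:8b` tag used in the curl command.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default local address used in this article


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST the curl command sends, without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of an OpenAI-shaped response body."""
    return body["choices"][0]["message"]["content"]


def smoke_test(model: str = "llama3.1:8b") -> str:
    """Send one prompt to the local server (hypothetical helper).

    Raises urllib.error.URLError if nothing is listening on BASE_URL.
    """
    request = build_chat_request(model, "Say hi in five words.")
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_reply(json.load(response))
```

Calling `smoke_test()` with Ollama running should return the model's reply as a string; a connection error usually means the server is not up or is listening on a different port.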
How do I use the official OpenAI Python/JS SDK with Ollama?
You use the real OpenAI SDK and just override base_url. No special library.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

resp = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(resp.choices[0].message.content)
```
Node/TypeScript is the same idea:
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const resp = await client.chat.completions.create({
  model: "gemma2:9b",
  messages: [{ role: "user", content: "Hello from local!" }],
});
console.log(resp.choices[0].message.content);
```
Streaming (stream: true) works too, returning the same server-sent-event chunks the OpenAI SDK already knows how to parse.
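To make the streaming format concrete, here is a rough sketch of what the SDK does with those chunks. The SDK already handles this for you; the function name `iter_stream_text` and the sample lines are illustrative, and the sample assumes the usual OpenAI chunk layout (`choices[].delta.content`, ending with a `[DONE]` sentinel).

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent-event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample of what a short streamed reply might look like on the wire.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

In real code you would pass `stream=True` to `client.chat.completions.create(...)` and iterate over the returned chunks instead of parsing lines by hand.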
Which apps can I point at the local endpoint?
Basically anything with a "custom OpenAI base URL" or "OpenAI-compatible" field. Here's where the setting usually lives:
| App / tool | Where to set base URL | Notes |
|---|---|---|
| Open WebUI | Settings → Connections → OpenAI API | The cleanest local ChatGPT-style UI; see the Open WebUI connection guide |
| Continue (VS Code) | `config.json` provider block | Set `provider: "openai"`, `apiBase` to the `/v1` URL |
| Aider | `OPENAI_API_BASE` env var + `--model openai/<tag>` | Prefix the model with `openai/` |
| LangChain / LlamaIndex | `ChatOpenAI(base_url=...)` | Drop-in; tools and structured output mostly work |
| Cursor / IDE chat | "Override OpenAI base URL" in settings | Some clients require an https reverse proxy |
| Any curl/HTTP script | The URL string itself | Zero dependencies |
If your app only accepts an https:// URL, run a small reverse proxy (Caddy or nginx) in front of Ollama — some clients hard-reject plain http.
Ollama /v1 vs the native Ollama API — which should I use?
Ollama actually ships two HTTP surfaces, and picking the right one saves headaches.
| | OpenAI-compatible `/v1` | Native `/api` |
|---|---|---|
| Endpoint | `localhost:11434/v1/chat/completions` | `localhost:11434/api/chat` |
| Best for | Reusing existing OpenAI-built apps/SDKs | New code written for Ollama directly |
| Request shape | OpenAI `messages` format | Ollama JSON (`prompt` or `messages`) |
| Model management | Read-only-ish | `pull`, `create`, `show`, `ps`, etc. |
| Extra knobs | Limited to OpenAI params | Full Ollama options (`num_ctx`, `keep_alive`) |
If you're wiring up a tool that already speaks OpenAI, use /v1. If you're writing fresh code and want full control over context length and model lifecycle, use the native /api. You can mix both against the same running server.
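As a rough illustration of the difference, the sketch below builds the request body for each surface side by side. The function names and the default values (`8192` for `num_ctx`, `"10m"` for `keep_alive`) are assumptions chosen for the example, not recommendations; check the Ollama API docs for the options your version accepts.

```python
OPENAI_CHAT_URL = "http://localhost:11434/v1/chat/completions"
NATIVE_CHAT_URL = "http://localhost:11434/api/chat"


def openai_style_payload(model: str, prompt: str) -> dict:
    """Body for the OpenAI-compatible endpoint: standard OpenAI fields only."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def native_payload(model: str, prompt: str, num_ctx: int = 8192, keep_alive: str = "10m") -> dict:
    """Body for the native endpoint, which also takes Ollama-specific knobs."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window size (example value)
        "keep_alive": keep_alive,         # how long the model stays loaded (example value)
    }
```

Both bodies carry the same `messages` list; the native one simply has room for the context and lifecycle settings that the OpenAI format has no field for.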
What are the common gotchas?
A few things bite people repeatedly:
- `model not found`: the `model` field must match a pulled tag exactly. Run `ollama list` and copy the name verbatim, including the `:8b` part.
- Context window surprises: the OpenAI layer doesn't expose `num_ctx`. If you need a big context, set it via a Modelfile or the native API, or your long prompts get silently truncated.
- Embeddings model mismatch: use a real embedding model (e.g. `nomic-embed-text`) for `/v1/embeddings`, not a chat model.
- Tool/function calling: supported, but reliability depends on the model. Tool-calling-tuned models (recent Qwen, Llama 3.1+) behave; tiny models hallucinate arguments.
- Connection refused from Docker: inside a container, `localhost` is the container, not your host. Use `http://host.docker.internal:11434/v1` (or the host's LAN IP). The homelab Docker stack guide covers this end to end.
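The first gotcha is easy to guard against in code. Below is a minimal sketch that checks a requested tag against the body returned by `GET /v1/models`; the helper name `resolve_model` is hypothetical, and the sketch assumes the OpenAI-style list shape (`{"data": [{"id": ...}]}`).

```python
def resolve_model(requested: str, models_response: dict) -> str:
    """Return the tag if it is installed, else raise with near-miss suggestions."""
    available = [entry["id"] for entry in models_response.get("data", [])]
    if requested in available:
        return requested
    # Suggest installed tags that share the same base name (text before the colon).
    base = requested.split(":")[0]
    hints = [name for name in available if name.split(":")[0] == base]
    suffix = f" Did you mean: {', '.join(hints)}?" if hints else ""
    raise ValueError(f"model '{requested}' is not in the pulled list.{suffix}")


# Hand-written sample in the shape /v1/models is expected to return.
sample_models = {"object": "list", "data": [{"id": "llama3.1:8b"}, {"id": "qwen2.5:7b"}]}
print(resolve_model("qwen2.5:7b", sample_models))
```

Running a check like this once at startup turns a vague runtime error into a message that names the tag you probably meant.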
What hardware do I need to run this well?
The endpoint is free; the model isn't. Throughput tracks the model size, quantization, and your VRAM. A quick rule of thumb for picking a quant: a Q4_K_M GGUF of an 8B model needs roughly 5–6 GB of weights plus a couple GB of context overhead, so it fits comfortably on an 8–12 GB card or a 16 GB Apple Silicon machine. Bump to Q8 and you roughly double the weight footprint for a small quality gain — verify the exact numbers on your own stack, since context length and batch size move the figure.
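The arithmetic behind that rule of thumb can be sketched as parameters times bits per weight, divided by eight. The bits-per-weight values below are approximate averages assumed for illustration (real GGUF files vary by architecture), and the 2 GB overhead simply stands in for "a couple GB of context".

```python
# Approximate average bits per weight; assumed ballpark figures, not exact.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB: billions of params * bits per weight / 8."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


def total_gb(params_billion: float, quant: str, overhead_gb: float = 2.0) -> float:
    """Weights plus a flat allowance for context and runtime buffers."""
    return weights_gb(params_billion, quant) + overhead_gb


for quant in ("Q4_K_M", "Q8_0"):
    print(f"8B at {quant}: ~{weights_gb(8, quant):.1f} GB weights, "
          f"~{total_gb(8, quant):.1f} GB with overhead")
```

With these assumed figures an 8B model comes out a little under 5 GB of weights at `Q4_K_M` and about 8.5 GB at `Q8_0`, which is in the same ballpark as the numbers above; treat it as a sizing estimate and check the actual file size of the model you pull.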
If you're sizing a box, here's the quick decision list:
- If you have 8–12 GB VRAM → run 7B–8B models at `Q4_K_M`; great for agents and chat. See how much VRAM for Llama 3 8B.
- If you have 16–24 GB → 12B–14B comfortably, or 8B at `Q8` for cleaner output.
- If you're on Apple Silicon with 32 GB+ unified memory → you can reach 27B–32B class models; consider MLX for max speed via MLX on Apple Silicon.
- If you're CPU-only → it works through the same endpoint, just slowly; weigh the CPU-only privacy tradeoff.
For the quant quality question specifically, Q4 vs Q8 tradeoffs breaks down when the smaller quant actually costs you.
How do I check it's actually using my GPU?
Run ollama ps while a request is in flight — it shows the loaded model and whether it's on GPU or CPU. If it says 100% CPU on a machine with a capable GPU, your offload layers aren't set or the model didn't fit. The GPU offload layers explainer walks through forcing more layers onto the card. On Apple Silicon, Activity Monitor's GPU history is the quick tell.
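If you would rather check from a script, the native API has a matching `GET /api/ps` endpoint. The sketch below works out the GPU share from a response body; it assumes each entry carries `name`, `size`, and `size_vram` fields as I understand the current response format, so verify against your Ollama version. The helper names are hypothetical.

```python
def gpu_share(entry: dict) -> float:
    """Fraction of a loaded model that sits in VRAM (0.0 = all CPU, 1.0 = all GPU)."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(entry.get("size_vram", 0) / size, 1.0)


def describe_running(ps_response: dict) -> dict:
    """Map each loaded model name to a short 'NN% GPU' label."""
    return {
        entry["name"]: f"{gpu_share(entry):.0%} GPU"
        for entry in ps_response.get("models", [])
    }


# Hand-written sample shaped like an /api/ps response (sizes in bytes, made up).
sample_ps = {
    "models": [
        {"name": "llama3.1:8b", "size": 6_000_000_000, "size_vram": 6_000_000_000},
        {"name": "qwen2.5:7b", "size": 5_000_000_000, "size_vram": 2_500_000_000},
    ]
}
print(describe_running(sample_ps))
```

A share well below 100% on a machine with a capable GPU points to the same causes described above: the model did not fit, or layers are being kept on the CPU.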
Bottom line
The OpenAI-compatible endpoint is the single best reason to keep Ollama in your stack: point any agent, IDE plugin, or chat UI at http://localhost:11434/v1, hand it a throwaway API key, and an open-weight model answers in OpenAI's own format — no code rewrite, no cloud, no per-token bill. Pull a small Q4_K_M model first, confirm it loads on your GPU with ollama ps, then scale up. For the full runner comparison and when to reach for something else, head back to the pillar: Ollama vs LM Studio 2026.
Frequently asked questions
See /blog/ollama-vs-lm-studio-2026 for the full cornerstone guide.
