Ollama OpenAI-Compatible API for Local Apps | WikiWayne

Point any app that speaks the OpenAI API at http://localhost:11434/v1 and Ollama answers as a drop-in replacement for OpenAI's cloud: same request shape, same response shape, but running an open-weight model on your own hardware. You swap the base_url, pass any non-empty string as the API key, and set the model to whatever you've pulled (say llama3.1 or qwen2.5). That's the whole trick: most "ChatGPT-only" tools that let you override the base URL work locally with three settings changed.

What is Ollama's OpenAI-compatible API?

It's a compatibility layer Ollama exposes at /v1 that mimics OpenAI's REST endpoints — /chat/completions, /completions, /embeddings, and /models. Your code thinks it's talking to OpenAI; it's actually hitting a local model. This matters because the OpenAI client format became the de facto standard, so agents, IDEs, and chat UIs almost universally support a custom base URL. Point that knob at Ollama and the cloud dependency vanishes.

For the full Ollama-vs-the-field picture, the parent pillar is Ollama vs LM Studio 2026.

How do I point an app at `http://localhost:11434/v1`?

First, make sure Ollama is running and you've pulled a model:

ollama serve   # skip if Ollama already runs as a background service; otherwise run it in its own terminal
ollama pull llama3.1:8b

Then in any OpenAI-compatible client, set three things:

Base URL: http://localhost:11434/v1
API key: any non-empty string (ollama, sk-local, whatever — it's ignored but often required by the client)
Model: the exact tag you pulled, e.g. llama3.1:8b

Quick smoke test with plain curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say hi in five words."}]
  }'

If you get JSON back with a choices array, you're done — every OpenAI-shaped tool on your machine can now talk to a local model.
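
The same smoke test can be run from Python with only the standard library, which helps on machines without curl. This is a minimal sketch, assuming Ollama is listening on the default port and llama3.1:8b has been pulled; the helper names build_chat_request, extract_reply, and smoke_test are illustrative and not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default Ollama address used in this article


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (hypothetical helper)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload: dict) -> str:
    """Pull the assistant text out of an OpenAI-shaped response."""
    return payload["choices"][0]["message"]["content"]


def smoke_test(model: str = "llama3.1:8b") -> str:
    """Send one prompt to the local server; needs Ollama running with the model pulled."""
    req = build_chat_request(model, "Say hi in five words.")
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

Nothing here touches the network until smoke_test() is called; against a live server it should return the model's reply as a plain string.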

How do I use the official OpenAI Python/JS SDK with Ollama?

You use the real OpenAI SDK and just override base_url. No special library.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

resp = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(resp.choices[0].message.content)

Node/TypeScript is the same idea:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const resp = await client.chat.completions.create({
  model: "gemma2:9b",
  messages: [{ role: "user", content: "Hello from local!" }],
});
console.log(resp.choices[0].message.content);

Streaming (stream: true) works too, returning the same server-sent-event chunks the OpenAI SDK already knows how to parse.
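
For a sense of what the SDK is parsing when streaming is on, here is a rough sketch of reading OpenAI-style server-sent-event lines by hand. The function name iter_content_deltas is hypothetical, and the sample lines are hand-written to match the OpenAI streaming format rather than captured from a real Ollama response.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events (illustrative, not real server output).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Hel"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))  # prints "Hello"
```

In practice the SDK does this for you; the sketch is only useful when you are debugging a raw HTTP client or a proxy that mangles the stream.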

Which apps can I point at the local endpoint?

Basically anything with a "custom OpenAI base URL" or "OpenAI-compatible" field. Here's where the setting usually lives:

App / tool	Where to set base URL	Notes
Open WebUI	Settings → Connections → OpenAI API	The cleanest local ChatGPT-style UI; see the Open WebUI connection guide
Continue (VS Code)	`config.json` provider block	Set `provider: "openai"`, `apiBase` to the `/v1` URL
Aider	`OPENAI_API_BASE` env var + `--model openai/<tag>`	Prefix the model with `openai/`
LangChain / LlamaIndex	`ChatOpenAI(base_url=...)`	Drop-in; tools and structured output mostly work
Cursor / IDE chat	"Override OpenAI base URL" in settings	Some clients require an `https` reverse proxy
Any curl/HTTP script	The URL string itself	Zero dependencies

If your app only accepts an https:// URL, run a small reverse proxy (Caddy or nginx) in front of Ollama — some clients hard-reject plain http.

Ollama `/v1` vs the native Ollama API — which should I use?

Ollama actually ships two HTTP surfaces, and picking the right one saves headaches.

Aspect	OpenAI-compatible `/v1`	Native `/api`
Endpoint	`localhost:11434/v1/chat/completions`	`localhost:11434/api/chat`
Best for	Reusing existing OpenAI-built apps/SDKs	New code written for Ollama directly
Request shape	OpenAI `messages` format	Ollama JSON (`prompt` or `messages`)
Model management	Read-only (list models via `/v1/models`)	`pull`, `create`, `show`, `ps`, etc.
Extra knobs	Limited to OpenAI params	Full Ollama options (`num_ctx`, `keep_alive`)

If you're wiring up a tool that already speaks OpenAI, use /v1. If you're writing fresh code and want full control over context length and model lifecycle, use the native /api. You can mix both against the same running server.
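
To make the difference concrete, the sketch below builds the same prompt as a request for each surface. It only constructs URLs and JSON bodies, so it runs without a server. The function names are hypothetical, and the num_ctx and keep_alive values are illustrative defaults chosen for this example, not recommendations.

```python
import json
from typing import Dict, Tuple

HOST = "http://localhost:11434"  # default Ollama address used in this article


def openai_style(model: str, prompt: str) -> Tuple[str, Dict]:
    """Request for the OpenAI-compatible surface: OpenAI fields only."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return f"{HOST}/v1/chat/completions", body


def native_style(model: str, prompt: str, num_ctx: int = 8192, keep_alive: str = "10m") -> Tuple[str, Dict]:
    """Request for the native surface: same messages plus Ollama-specific knobs."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context length, set per request
        "keep_alive": keep_alive,         # how long the model stays loaded afterwards
    }
    return f"{HOST}/api/chat", body


for url, body in (openai_style("llama3.1:8b", "Hello"), native_style("llama3.1:8b", "Hello")):
    print(url)
    print(json.dumps(body, indent=2))
```

The messages array is identical in both; the native body simply carries the extra options that the OpenAI-shaped request has no field for.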

What are the common gotchas?

A few things bite people repeatedly:

model not found: the model field must match a pulled tag exactly. Run ollama list and copy the name verbatim, including the :8b part. A bare name such as llama3.1 resolves to the :latest tag, which you may not have pulled.
Context window surprises: the OpenAI layer doesn't expose num_ctx. If you need a big context, set it via a Modelfile or the native API; otherwise long prompts get silently truncated.
Embeddings model mismatch: use a real embedding model (e.g. nomic-embed-text) for /v1/embeddings, not a chat model.
Tool/function calling: supported, but reliability depends on the model. Tool-calling-tuned models (recent Qwen, Llama 3.1+) behave; tiny models hallucinate arguments.
Connection refused from Docker: inside a container, localhost is the container, not your host. Use http://host.docker.internal:11434/v1 (or the host's LAN IP). On Linux, host.docker.internal typically needs --add-host=host.docker.internal:host-gateway, and Ollama has to listen beyond loopback (for example with OLLAMA_HOST=0.0.0.0). The homelab Docker stack guide covers this end to end.
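
The first gotcha is easy to catch before any chat request goes out. The sketch below checks a requested model name against an OpenAI-shaped /v1/models listing. The helper names are hypothetical, the sample payload is hand-written, and the bare-name rule (no tag means :latest) is stated as an assumption about Ollama's naming.

```python
from typing import List


def pulled_tags(models_payload: dict) -> List[str]:
    """Extract model ids from an OpenAI-shaped /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_model(requested: str, available: List[str]) -> str:
    """Return the tag to send, or raise with near-miss suggestions."""
    # Assumption: a bare name such as "llama3.1" is treated as "llama3.1:latest".
    wanted = requested if ":" in requested else f"{requested}:latest"
    if wanted in available:
        return wanted
    family = requested.split(":")[0]
    hints = [tag for tag in available if tag.split(":")[0] == family]
    raise ValueError(f"{requested!r} is not pulled; similar tags: {hints or 'none'}")


# Hand-written sample listing (illustrative, not real server output).
sample = {
    "object": "list",
    "data": [
        {"id": "llama3.1:8b", "object": "model"},
        {"id": "nomic-embed-text:latest", "object": "model"},
    ],
}
tags = pulled_tags(sample)
print(check_model("llama3.1:8b", tags))
try:
    check_model("llama3.1", tags)
except ValueError as err:
    print(err)
```

In a real client you would fetch the listing from the /v1/models endpoint once at startup and fail fast with the suggestion, rather than surfacing a model-not-found error halfway through a session.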

What hardware do I need to run this well?

The endpoint is free; the model isn't. Throughput tracks the model size, quantization, and your VRAM. A quick rule of thumb for picking a quant: a Q4_K_M GGUF of an 8B model needs roughly 5 GB for weights plus a couple of GB for context overhead, so it fits comfortably on an 8–12 GB card or a 16 GB Apple Silicon machine. Bump to Q8 and the weights grow to roughly 8.5 GB, about 1.7 times the footprint, for a small quality gain. Verify the exact numbers on your own stack, since context length and batch size move the figure.
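
The rule of thumb above is just parameters times bits per weight. Here is a back-of-the-envelope sketch; the bits-per-weight figures are rough approximations assumed for illustration (they are not taken from Ollama's documentation), and the 2 GB overhead default mirrors the "couple of GB" mentioned above.

```python
# Approximate effective bits per weight for common GGUF quants.
# Assumed, rounded values for illustration only.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus a flat context/runtime overhead fit in VRAM."""
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


for quant in ("Q4_K_M", "Q8_0"):
    size = weights_gb(8.0, quant)
    print(f"8B at {quant}: ~{size:.1f} GB weights, fits in 8 GB card: {fits(8.0, quant, 8.0)}")
```

Treat the result as a first filter only. Long contexts grow the KV cache well past a flat 2 GB, so measure on your own hardware before buying anything.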

If you're sizing a box, here's the quick decision list:

If you have 8–12 GB VRAM → run 7B–8B models at Q4_K_M; great for agents and chat. See how much VRAM for Llama 3 8B.
If you have 16–24 GB → 12B–14B comfortably, or 8B at Q8 for cleaner output.
If you're on Apple Silicon with 32 GB+ unified memory → you can reach 27B–32B class models; consider MLX for max speed via MLX on Apple Silicon.
If you're CPU-only → it works through the same endpoint, just slowly; weigh the CPU-only privacy tradeoff.

For the quant quality question specifically, Q4 vs Q8 tradeoffs breaks down when the smaller quant actually costs you.

How do I check it's actually using my GPU?

Run ollama ps while the model is loaded (by default it stays resident for about five minutes after the last request). The PROCESSOR column shows whether the model is on GPU, on CPU, or split between the two. If it says 100% CPU on a machine with a capable GPU, your offload layers aren't set or the model didn't fit. The GPU offload layers explainer walks through forcing more layers onto the card. On Apple Silicon, Activity Monitor's GPU history is the quick tell.
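
The same check can be scripted against the native API's running-models listing. The sketch below is a rough illustration that assumes each entry carries a total size and a size_vram field, which is my understanding of the native /api/ps response; confirm the field names against your Ollama version. The helper names and the sample payload are made up for this example.

```python
from typing import List


def gpu_share(model_entry: dict) -> float:
    """Fraction of the loaded model resident in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def describe(ps_payload: dict) -> List[str]:
    """One summary line per loaded model."""
    lines = []
    for entry in ps_payload.get("models", []):
        pct = round(gpu_share(entry) * 100)
        lines.append(f"{entry['name']}: {pct}% GPU")
    return lines


# Made-up sample: one model fully offloaded, one mostly on CPU.
sample = {
    "models": [
        {"name": "llama3.1:8b", "size": 6_000_000_000, "size_vram": 6_000_000_000},
        {"name": "qwen2.5:7b", "size": 8_000_000_000, "size_vram": 2_000_000_000},
    ]
}
for line in describe(sample):
    print(line)
```

A share well below 100% usually means the model spilled out of VRAM, which is the cue to pick a smaller quant or a smaller context.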

Bottom line

The OpenAI-compatible endpoint is the single best reason to keep Ollama in your stack: point any agent, IDE plugin, or chat UI at http://localhost:11434/v1, hand it a throwaway API key, and an open-weight model answers in OpenAI's own format — no code rewrite, no cloud, no per-token bill. Pull a small Q4_K_M model first, confirm it loads on your GPU with ollama ps, then scale up. For the full runner comparison and when to reach for something else, head back to the pillar: Ollama vs LM Studio 2026.
