Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
llama.cpp Complete Guide
llama.cpp Complete Guide is a cornerstone page for the WikiWayne local-AI cluster.
Key takeaways
- Start with a small GGUF quant and verify VRAM on your own GPU before scaling model size.
- Use linked cluster posts for install steps and runner-specific commands.
llama.cpp is the open-source C/C++ inference engine that actually runs most local LLMs on your machine. It loads GGUF model files, runs them on CPU or GPU (CUDA, Metal, Vulkan, ROCm), and sits under the hood of Ollama, LM Studio, and KoboldCpp. If you want maximum control, minimum overhead, and a runner that works on everything from a Raspberry Pi to a 4090, this is the one to learn.
I've been running open-weight models (Qwen, Llama, Gemma, DeepSeek, Mistral, GLM, Phi) on llama.cpp across an M-series Mac and a couple of consumer NVIDIA cards for a while now. This is the guide I wish I'd had on day one: what it is, when to use it over the friendlier wrappers, how to build it, and the exact commands to get a model talking.
What is llama.cpp, in one sentence?
llama.cpp is a lightweight inference runtime that executes quantized GGUF models efficiently on consumer hardware, with optional GPU acceleration and a built-in OpenAI-compatible server.
It's the layer most people use without knowing it. When you type ollama run, the model is usually running on llama.cpp code that Ollama bundles. When LM Studio shows a slider for GPU layers, that's llama.cpp's --n-gpu-layers. Learning the engine directly means the abstractions stop being magic.
When should I use llama.cpp instead of Ollama or LM Studio?
Use the raw engine when you need control the wrappers hide. Here's my decision list:
- If you want a one-line install and a model in five minutes → use Ollama or LM Studio, not raw llama.cpp. See install Ollama on Windows, Mac, Linux.
- If you need custom sampling, exotic flags, or a specific quant Ollama doesn't ship → use llama.cpp directly.
- If you're deploying a reproducible server image (Docker, no desktop GUI) → llama.cpp's llama-server is the cleanest option.
- If you're on an embedded board or weird hardware → llama.cpp compiles where nothing else does.
- If you want the absolute latest model architecture support on day one → the engine gets it before the wrappers do.
For the full side-by-side, I wrote LM Studio vs Ollama vs llama.cpp and when to switch from Ollama to llama.cpp.
llama.cpp vs the wrappers at a glance
| Factor | llama.cpp (raw) | Ollama | LM Studio |
|---|---|---|---|
| Setup effort | Build or grab a binary | One installer | One installer + GUI |
| Control over flags | Total | Limited | Moderate (sliders) |
| GGUF from any source | Yes | Mostly via Modelfile | Yes |
| Built-in chat UI | Basic web UI (llama-server) | No (CLI) | Yes |
| OpenAI-compatible API | Yes (llama-server) | Yes | Yes |
| Best for | Tinkerers, servers | Quick daily driver | Beginners, GUI fans |
What is GGUF, and why does llama.cpp need it?
GGUF (GPT-Generated Unified Format) is the single-file model format llama.cpp loads, bundling the weights, tokenizer, and metadata together so the runtime knows exactly how to run the model.
You can't feed raw Hugging Face safetensors to llama.cpp; they have to be converted to GGUF first (the repo ships a convert_hf_to_gguf.py script). In practice you'll just download pre-converted GGUFs: most popular open-weight models have them on Hugging Face within hours of release. If you want the deeper dive, I cover it in what is GGUF.
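A quick sanity check I find handy after a download: GGUF files begin with the four ASCII bytes GGUF, so a truncated download or an HTML error page saved under a .gguf name is easy to spot. The is_gguf helper below is my own illustrative name, not something llama.cpp ships, and it only checks the magic bytes, not whether the rest of the file is intact.

```shell
#!/bin/sh
# is_gguf <path>: succeed if the file starts with the GGUF magic bytes.
# Hypothetical helper for illustration; it does not validate the whole file.
is_gguf() {
  [ -f "$1" ] || return 1
  # Read the first 4 bytes as hex so binary data never hits a shell variable.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  # 47 47 55 46 is "GGUF" in ASCII.
  [ "$magic" = "47475546" ]
}
```

Call it as is_gguf ./qwen2.5-7b-instruct-q4_k_m.gguf && echo "looks like GGUF" before pointing llama-cli at the file.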
What quantization should I pick?
Quantization shrinks model weights from 16-bit floats down to 4, 5, 6, or 8 bits so the model fits in less memory, trading a little quality for a lot of savings. GGUF files carry a suffix like Q4_K_M or Q8_0 that tells you the scheme.
My rule of thumb after a lot of A/B testing:
- Q4_K_M is the default sweet spot. Roughly half the size of Q8 with quality that's hard to distinguish in normal use. Start here.
- Q5_K_M / Q6_K if you have VRAM to spare and want a touch more fidelity.
- Q8_0 when you want near-original quality and the model still fits. Diminishing returns above this.
- Q3 / Q2 only when you're desperate to squeeze a bigger model onto small hardware. Expect noticeable degradation.
I go deeper on the tradeoffs in Q4 vs Q8 quant quality and quantization explained. The short version: pick the largest quant that fits comfortably in your VRAM with room for context.
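To put rough numbers on "the largest quant that fits", here is a minimal sketch that estimates file size as parameters times bits per weight, divided by 8. The bits-per-weight values are my approximations (real files mix tensor types, so actual sizes drift a little), the function names are made up for this example, and the 7,600 million parameter count is an assumed stand-in for a "7B" model.

```shell
#!/bin/sh
# Approximate bits per weight, scaled by 100 so integer arithmetic is enough.
# These figures are assumptions for illustration, not exact file layouts.
bpw_x100() {
  case "$1" in
    Q4_K_M) echo 485 ;;
    Q5_K_M) echo 570 ;;
    Q6_K)   echo 656 ;;
    Q8_0)   echo 850 ;;
    *) echo "unknown quant: $1" >&2; return 1 ;;
  esac
}

# estimate_gguf_mib <params_in_millions> <quant>
# Prints the estimated weight size in MiB.
estimate_gguf_mib() {
  bpw=$(bpw_x100 "$2") || return 1
  bytes=$(( $1 * 1000000 * bpw / 100 / 8 ))
  echo $(( bytes / 1048576 ))
}

# Example: an assumed 7.6B-parameter model at two quant levels.
echo "Q4_K_M: $(estimate_gguf_mib 7600 Q4_K_M) MiB"
echo "Q8_0:   $(estimate_gguf_mib 7600 Q8_0) MiB"
```

Treat the output as a first filter only; the file size listed on the model's download page is the number to trust.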
How do I install llama.cpp?
You have three paths depending on how much you want to fight your compiler.
Path 1: Homebrew (easiest, Mac/Linux). This pulls a prebuilt binary with Metal on Apple Silicon.
brew install llama.cpp
llama-cli --version
Path 2: Prebuilt release binaries. Grab the latest from the GitHub releases page; there are builds for macOS (Metal), Windows (CPU/CUDA/Vulkan), and Linux. Unzip and run. No compiler needed.
Path 3: Build from source for the newest features or a specific GPU backend. On Apple Silicon, Metal is on by default:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
For an NVIDIA card you want the CUDA backend, which is fiddly enough that I gave it its own walkthrough: llama.cpp CUDA build quickstart for Linux. The short version:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
AMD users build with -DGGML_HIP=ON for ROCm (older checkouts called this flag -DGGML_HIPBLAS=ON); the cross-platform -DGGML_VULKAN=ON works on both vendors and is the easiest GPU path on Windows. If you're still choosing hardware, NVIDIA vs AMD for local LLMs and the best GPU for local AI will save you money.
How do I run a model and chat with it?
Download a GGUF, then point llama-cli at it. As of recent builds you can even pull straight from Hugging Face with -hf:
# Pull + run a model directly from Hugging Face
llama-cli -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M -p "Explain GGUF in two sentences."
Or run a file you already downloaded, interactively:
llama-cli -m ./qwen2.5-7b-instruct-q4_k_m.gguf \
-ngl 99 \
-c 8192 \
--color -cnv
What those flags mean:
- -ngl 99 / --n-gpu-layers: how many model layers to offload to the GPU. Set it high to push everything onto the GPU; lower it if you run out of VRAM. This is the single most important knob; I broke it down in GPU offload layers explained.
- -c 8192: context window in tokens. Bigger context eats more memory.
- -cnv: conversation/chat mode.
How do I run llama.cpp as an API server?
llama-server spins up a local HTTP endpoint that speaks the OpenAI chat-completions format, so any tool expecting OpenAI works against your local model:
llama-server -m ./qwen2.5-7b-instruct-q4_k_m.gguf -ngl 99 -c 8192 --port 8080
Then hit it like any OpenAI endpoint:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hi"}]}'
It also serves a basic web UI at http://localhost:8080. For a nicer front end, wire it to Open WebUI. The pattern is identical to Ollama's OpenAI-compatible API, which is handy if you're swapping runners behind the same app.
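One practical wrinkle when scripting against the server: a large model takes a while to load, and requests sent too early fail. llama-server exposes a /health endpoint that only returns success once the model is ready, so a small polling loop covers the gap. The wait_for_server name, the default retry count, and the pause are my own choices for this sketch.

```shell
#!/bin/sh
# wait_for_server <base_url> [max_tries] [pause_seconds]
# Polls <base_url>/health until curl reports success or the tries run out.
# Hypothetical helper; the defaults (30 tries, 1 second apart) are arbitrary.
wait_for_server() {
  url=$1
  tries=${2:-30}
  pause=${3:-1}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: quiet, -f: treat HTTP errors (such as 503 while loading) as failure
    if curl -sf "$url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$pause"
  done
  return 1
}
```

In a startup script, run wait_for_server http://localhost:8080 after launching llama-server in the background and only send the first chat request once it returns successfully.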
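There's also a Docker image if you want a reproducible deploy. The plain server tag is a CPU build, so for an NVIDIA GPU use the server-cuda tag and pass --gpus all (this needs the NVIDIA Container Toolkit on the host):
docker run --gpus all -p 8080:8080 -v "$(pwd)/models:/models" ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/qwen2.5-7b-instruct-q4_k_m.gguf -ngl 99 --host 0.0.0.0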
How much VRAM or RAM do I actually need?
Rough math: a model's memory footprint is approximately the GGUF file size plus overhead for the KV cache (which grows with context length). A 7B model at Q4_K_M lands somewhere around 4-5 GB of weights, so an 8 GB GPU handles it comfortably with normal context. Bump context to 16K-32K or jump to a 13B-14B model and you'll want 12 GB or more.
These are ballparks, not promises: KV-cache size, the specific quant, and your context length all move the number. Always check actual usage on your own stack with nvidia-smi, Activity Monitor, or radeontop while the model is loaded. For the methodology, see how much VRAM for Llama 3 8B and the broader VRAM requirements guide.
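For the KV-cache part of that math, the usual back-of-envelope formula is 2 (keys and values) × layers × KV heads × head dimension × context tokens × bytes per element. Here is a minimal sketch of it; the function name is mine, and the example architecture values (28 layers, 4 KV heads, head dimension 128) are assumptions for a Qwen2.5-7B-class model, so read the real ones from your model's metadata.

```shell
#!/bin/sh
# kv_cache_mib <n_layers> <n_kv_heads> <head_dim> <ctx_tokens> [bytes_per_element]
# Prints the estimated KV-cache size in MiB.
# bytes_per_element defaults to 2, i.e. a 16-bit cache.
kv_cache_mib() {
  bytes_per_elem=${5:-2}
  # The leading 2 accounts for storing both keys and values.
  echo $(( 2 * $1 * $2 * $3 * $4 * bytes_per_elem / 1048576 ))
}

# Assumed architecture values, at 8K and 32K context.
echo "8K context:  $(kv_cache_mib 28 4 128 8192) MiB"
echo "32K context: $(kv_cache_mib 28 4 128 32768) MiB"
```

The estimate scales linearly with context, which is why long contexts get expensive quickly. It ignores compute buffers and other runtime overhead, so real usage will come in higher.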
If a model doesn't fully fit, llama.cpp will split it: GPU layers run fast, the rest fall back to CPU/RAM and slow down. That's the whole point of -ngl: tune it so as many layers as possible live on the GPU without an out-of-memory crash.
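If you'd rather start from a guess than from trial and error, a crude heuristic is to assume every layer is the same size and see how many fit in free VRAM after reserving headroom for the KV cache and buffers. This is a sketch under those assumptions, not how llama.cpp itself decides anything; suggest_ngl and the reserve figure are hypothetical.

```shell
#!/bin/sh
# suggest_ngl <free_vram_mib> <model_file_mib> <n_layers> <reserve_mib>
# Prints a starting value for -ngl, assuming equally sized layers.
suggest_ngl() {
  usable=$(( $1 - $4 ))
  if [ "$usable" -le 0 ]; then
    echo 0
    return 0
  fi
  fit=$(( usable * $3 / $2 ))
  if [ "$fit" -ge "$3" ]; then
    # Everything fits: use the same "offload it all" value as the commands above.
    echo 99
  else
    echo "$fit"
  fi
}

# Example with assumed numbers: 4 GiB free, ~4.4 GB model, 28 layers, 1 GiB reserved.
echo "suggested -ngl: $(suggest_ngl 4096 4394 28 1024)"
```

On NVIDIA, nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits prints free memory in MiB for the first argument. Whatever the sketch suggests, confirm it by loading the model and watching real usage, then nudge the value up or down.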
Can I run llama.cpp without a GPU?
Yes. CPU-only inference is fully supported and is one of llama.cpp's superpowers; it's why the project runs on phones and single-board computers. Small quants of 1B-4B models (Gemma, Phi, Qwen small) are genuinely usable on a modern CPU; 7B is tolerable, and anything bigger gets slow.
For privacy-first setups where keeping data off the cloud matters more than speed, CPU-only is a legitimate choice; I weigh it in the CPU-only privacy tradeoff. Just temper expectations on the low end, as the Raspberry Pi 5 limits make clear.
Bottom line
llama.cpp is the engine, not the dashboard. Learn it and the rest of the local-AI stack stops being a black box. Start with a Q4_K_M GGUF of a 7B model, set -ngl as high as your VRAM allows, and verify actual memory use on your own hardware before you scale up. If you just want a model running today, reach for Ollama; if you want the control, the speed, and a server you can ship anywhere, this is the tool worth mastering.
Frequently asked questions
Yes. Cornerstone posts bump updatedAt when Ollama, LM Studio, or llama.cpp ship breaking changes; see the refresh log in Content Ideas.
A GPU helps for 7B+ models at interactive speed. CPU-only inference is supported for privacy experiments with smaller quants.
