Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
CPU-Only Local LLM Privacy Tradeoffs
Slower tokens, stronger air-gap story.
Yes, you can run open-weight LLMs on CPU only, with no GPU at all, and the privacy story is arguably simpler than on a GPU rig: fewer moving parts, no proprietary GPU driver stack to trust, and a box that's trivial to fully air-gap. The tradeoff is speed. Expect single-digit to low-double-digit tokens per second on a small quantized model, and a real wait on anything 13B and up. If your threat model values "nothing leaves this machine, ever" over raw throughput, CPU-only is a legitimate, even ideal, choice.
What does "CPU-only local LLM" actually mean?
A CPU-only local LLM is a language model that runs its inference entirely on your processor and system RAM, with no GPU or VRAM involved. The model weights live in regular DDR4/DDR5 memory, and the math happens on your CPU cores using vectorized instructions (AVX2, AVX-512, or ARM NEON on Apple Silicon).
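To see which of those instruction sets a Linux box actually reports, here is a minimal sketch. The helper name simd_flags is made up for illustration, and it assumes a Linux-style /proc/cpuinfo (macOS has no such file). On 64-bit ARM Linux, NEON shows up under the name asimd.

```shell
# Hypothetical helper: list the SIMD extensions named in a cpuinfo-style file.
# Assumes Linux, where /proc/cpuinfo has a "flags" (x86) or "Features" (ARM) line.
simd_flags() {
  cpuinfo="${1:-/proc/cpuinfo}"
  # -o prints only the match, -w matches whole words, sort -u collapses per-core repeats
  grep -o -w -E 'avx2|avx512f|neon|asimd' "$cpuinfo" | sort -u
}

# Print what this machine reports (nothing if no flag is listed or the file is missing)
simd_flags 2>/dev/null || true
```

An empty result on an x86 machine would suggest a CPU without AVX2, where CPU inference will likely be noticeably slower.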
The whole pipeline here is GGUF plus a CPU-capable runner. GGUF is the single-file quantized model format used by llama.cpp and its descendants (Ollama, LM Studio, KoboldCpp), and it was designed from day one to run well on CPU. No CUDA, no ROCm, no driver stack — just a binary and a .gguf file.
Why would I run an LLM on CPU instead of a GPU?
Three honest reasons, and one of them is the whole point of this article:
- Privacy and air-gapping. A CPU box has no proprietary GPU driver, no vendor telemetry layer, and no closed-source CUDA runtime in the trust chain. It's a smaller attack surface and a far easier thing to certify as "this never touches the network." More on this below.
- Hardware you already own. Any modern laptop or mini-PC with 16-32GB RAM can run a 7-8B model right now. No $400-$1,500 GPU purchase. If you're still GPU-shopping, see best used GPU for local AI on a budget.
- Power, heat, and silence. A CPU inference box sips watts compared to a 250W+ GPU. Great for an always-on homelab node.
The cost is throughput. CPUs are memory-bandwidth-bound for LLM inference, and that's the bottleneck no amount of clever code fully erases.
How much slower is CPU inference, really?
Ballpark, and you should verify on your own stack: a Q4 7-8B model on a modern 8-core desktop or an M-series Mac lands somewhere in the ~5-15 tokens/sec range for generation. A Ryzen with fast dual-channel DDR5 sits at the higher end; an older laptop on DDR4 sits lower. Move up to a 13-14B model and you're often in the ~2-6 tokens/sec zone. 30B+ on pure CPU is technically possible but genuinely painful for interactive use.
For comparison, even a modest consumer GPU will do many tens of tokens/sec on the same model. So this is a real difference, not a rounding error. The mental model: CPU is fine for chat, drafting, summarization, and batch jobs you don't watch in real time. It's frustrating for long agentic loops or anything where you're staring at the cursor.
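To check the ballpark figures above on your own hardware, one option is Ollama's --verbose flag, which prints timing stats after each reply. The sketch below is a hypothetical helper (eval_rate is my name, not an Ollama command) and assumes the stats include a line shaped like "eval rate: 12.34 tokens/s".

```shell
# Hypothetical helper: pull the generation speed out of `ollama run --verbose` output.
# Assumes the stats block contains a line of the form "eval rate:   12.34 tokens/s".
eval_rate() {
  # Anchoring on ^eval skips the separate "prompt eval rate:" line
  awk '/^eval rate:/ { print $3 }'
}

# Usage (needs Ollama installed and the model pulled, so it is left commented out):
# ollama run qwen2.5:7b-instruct-q4_K_M "Explain GGUF in one paragraph." --verbose 2>&1 | eval_rate
```

Run it a few times and ignore the first result, since that one includes loading the model from disk.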
CPU inference is bound by RAM bandwidth, not core count past a point. Eight fast cores on DDR5 often beats sixteen slower cores on DDR4. Apple Silicon punches above its weight here precisely because of its unified-memory bandwidth.
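A back-of-envelope way to see why bandwidth dominates: generating one token streams roughly the whole set of weights through the CPU once, so tokens/sec cannot exceed memory bandwidth divided by model size. The sketch below works that out. The function names and the example figures (dual-channel DDR5-5600, a ~4.7GB model file) are illustrative assumptions, and real throughput lands below this ceiling.

```shell
# Peak theoretical bandwidth in GB/s: channels x 8 bytes per transfer x MT/s / 1000
peak_bandwidth_gbs() {   # args: channels, transfer rate in MT/s
  awk -v ch="$1" -v mts="$2" 'BEGIN { printf "%.1f\n", ch * 8 * mts / 1000 }'
}

# Rough ceiling on generation speed: each token reads about the full model once
max_tokens_per_sec() {   # args: bandwidth in GB/s, model size in GB
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

peak_bandwidth_gbs 2 5600      # dual-channel DDR5-5600: 89.6
peak_bandwidth_gbs 2 3200      # dual-channel DDR4-3200: 51.2
max_tokens_per_sec 89.6 4.7    # ceiling for a ~4.7GB model on the DDR5 box: 19.1
```

Extra cores do not change either number, which is why the faster memory tends to win.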
The privacy tradeoff: what do you actually gain?
This is the core question. Here's the honest ledger.
| Dimension | CPU-only box | GPU local rig | Cloud API |
|---|---|---|---|
| Data leaves machine | Never (easy to verify) | Never (with care) | Always |
| GPU driver/runtime in trust chain | None | CUDA / ROCm | N/A |
| Air-gap difficulty | Trivial | Moderate | Impossible |
| Attack surface | Smallest | Larger (driver stack) | Provider-controlled |
| Tokens/sec | Slowest | Fast | Fastest |
| Hardware cost | Lowest (reuse existing) | $$$ GPU | $0 upfront, metered |
The privacy win isn't that GPU-local somehow leaks data — a properly configured Ollama or llama.cpp box doesn't phone home either. The win is verifiability and simplicity. With CPU-only you can pull the network cable, run lsof/netstat, and convince yourself nothing is talking to anything. There's no closed-source GPU driver blob in the trust chain. For regulated work, sensitive notes, or anything where "I can prove this is offline" matters more than speed, that's a meaningful upgrade. If air-gapping is your goal, walk through my keep-data-off-cloud checklist.
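The lsof check mentioned above can be narrowed to the part that matters: connections whose remote end is not the machine itself. This is a rough sketch, not an audit tool. The filter name non_loopback_peers is hypothetical, and it assumes `lsof -nP -i` style output where a connection is printed as local->remote.

```shell
# Hypothetical filter: keep only connections whose remote end is not loopback.
# Expects `lsof -nP -i` style lines, where a connection shows as local->remote.
non_loopback_peers() {
  awk '/->/ {
    split($0, ends, "->")          # ends[2] starts with the remote address
    if (ends[2] !~ /^(127\.|\[::1\]|localhost)/) print
  }'
}

# Usage (left commented out because it needs lsof and may need sudo to see every process):
# lsof -nP -i | non_loopback_peers
```

Empty output while the model is answering means no process holds a connection to another host. Listening sockets have no "->" and are skipped, so check those separately if the box stays on a network.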
How do I run a model on CPU only? (copy-paste)
The fastest path is Ollama, which auto-detects no GPU and falls back to CPU cleanly. Install it (Windows/Mac/Linux guide here), then:
# Pull a small, CPU-friendly open-weight model and chat
ollama run qwen2.5:7b-instruct-q4_K_M
To force CPU even on a machine that has a GPU (useful for testing the air-gap path), set the GPU layer count to zero with the num_gpu parameter from inside the chat session:
# Force everything onto CPU: zero GPU offload layers
ollama run gemma2:9b
>>> /set parameter num_gpu 0
Prefer raw llama.cpp for maximum control? Build it CPU-only (no CUDA flags) and run:
# Pure CPU run with llama.cpp; -ngl 0 keeps all layers on CPU
# -t sets thread count — match it to your physical cores, not threads
./llama-cli -m qwen2.5-7b-instruct-q4_k_m.gguf \
-ngl 0 -t 8 -c 4096 \
-p "Summarize this in three bullets:"
LM Studio works too: flip the GPU offload slider to 0 in the model load settings and it runs entirely on CPU. If you're weighing the three, I broke them down in LM Studio vs Ollama vs llama.cpp. Zero GPU offload is doing the heavy lifting in all of these: -ngl 0 in llama.cpp, num_gpu 0 in Ollama, the slider at 0 in LM Studio. It's the same offload concept covered in GPU offload layers explained, just dialed to zero.
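If you want a CPU-only variant that stays that way across sessions, one approach is to bake the setting into an Ollama Modelfile. Treat this as a sketch under assumptions: the helper name and the gemma2-cpu model name are made up, and it assumes your Ollama version honours PARAMETER num_gpu in a Modelfile, which is worth confirming against its docs.

```shell
# Hypothetical helper: write a Modelfile that pins a base model to zero GPU layers.
# Assumes Ollama accepts "PARAMETER num_gpu 0" in a Modelfile.
write_cpu_modelfile() {   # args: base model tag, output path
  cat > "$2" <<EOF
FROM $1
PARAMETER num_gpu 0
EOF
}

# Usage (needs Ollama installed, so left commented out); "gemma2-cpu" is just an example name:
# write_cpu_modelfile gemma2:9b ./Modelfile.cpu
# ollama create gemma2-cpu -f ./Modelfile.cpu
# ollama run gemma2-cpu
```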
How much RAM do I need for CPU inference?
System RAM replaces VRAM here, and the math is the same. A rough rule for a Q4_K_M quant: model footprint ≈ billions of params × ~0.6-0.7GB, plus 1-2GB for context and overhead.
| Model size (Q4_K_M) | Approx. RAM needed | Comfortable system RAM |
|---|---|---|
| 3-4B | ~3-4GB | 8GB |
| 7-8B | ~5-7GB | 16GB |
| 13-14B | ~9-11GB | 16-32GB |
| 30-34B | ~20-24GB | 32-64GB |
Leave headroom for your OS and apps — don't load a model that fills RAM to the brim or you'll hit swap, and swapping during inference is brutal. For the full breakdown see how much VRAM for Llama 3 8B (the RAM logic carries over directly).
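The rule of thumb in this section is easy to script. The sketch below applies it and, on Linux, reads total RAM for comparison. Both function names are illustrative, and the /proc/meminfo part is Linux-only.

```shell
# Rule of thumb for Q4_K_M: params (billions) x 0.6-0.7GB, plus 1-2GB for context and overhead
estimate_ram_gb() {   # arg: parameter count in billions
  awk -v p="$1" 'BEGIN { printf "%.1f-%.1f\n", p * 0.6 + 1, p * 0.7 + 2 }'
}

# Total system RAM in GB; MemTotal in /proc/meminfo is in kB (Linux only)
total_ram_gb() {
  awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

estimate_ram_gb 8     # low-high estimate in GB for an 8B model: 5.8-7.6
```

If the high end of the estimate is close to what total_ram_gb reports, drop to a smaller model or quant instead of flirting with swap.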
Which quantization should I use on CPU?
Quantization is the compression that shrinks model weights from 16-bit floats down to ~4-8 bits so they fit in less memory and move faster through your bandwidth-limited CPU. On CPU specifically, smaller quants help twice: less memory used, and less data to push across the RAM bus.
- If you want the best speed/quality balance, then use Q4_K_M. It's the default sweet spot for a reason.
- If you have RAM to spare and want maximum fidelity, then use Q5_K_M or Q8_0 — but expect a noticeable speed hit on CPU since there's more data to move.
- If you're memory-starved or on a tiny box, then use Q4_K_S or even a 3-bit quant and accept some quality loss.
I'd avoid going below Q4 unless you have to; the quality cliff gets steep. The full tradeoff is in Q4 vs Q8 quant quality and the broader quantization explainer.
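The memory side of that tradeoff is plain arithmetic: weight size is parameters times bits per weight, divided by 8. A small sketch, with weights_gb as an illustrative name. It covers weights only and ignores context and runtime overhead.

```shell
# Size of the weights alone: billions of params x bits per weight / 8 = GB
weights_gb() {   # args: params in billions, bits per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 8 16    # 16-bit floats: 16.0
weights_gb 8 8     # 8-bit:          8.0
weights_gb 8 4     # 4-bit:          4.0
```

Real quant formats store some extra scaling data per block, so a "4-bit" file comes out a bit larger than the nominal figure. That is consistent with the 0.6-0.7GB per billion parameters rule of thumb, which works out to roughly 4.8-5.6 bits per weight.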
What can I realistically run? A decision list
- If you're on a Raspberry Pi 5 or similar SBC, then stick to 1-3B models and keep expectations low — see Raspberry Pi small-LLM limits.
- If you have a typical 16GB laptop, then run a 7-8B Q4 model (Qwen2.5, Llama 3.1 8B, Mistral 7B, Gemma 2 9B) and you'll have a usable assistant.
- If you have a 32GB+ desktop with DDR5, then 13-14B Q4 is comfortable for chat and drafting.
- If you need agentic speed or long context, then CPU-only is the wrong tool — add a GPU, even a used one.
Pro tips for getting the most out of CPU inference
- Match threads to physical cores. Setting -t higher than your core count usually hurts. On an 8-core/16-thread CPU, try -t 8 first.
- Keep context modest. A 4096-token context is plenty for most chat; huge contexts eat RAM and slow prompt processing badly on CPU.
- Use a small model for agents, a bigger one for quality. On CPU you'll feel every token, so right-size aggressively.
- Apple Silicon users: Ollama and llama.cpp already use the CPU/unified memory well, but for the fastest Mac path look at MLX on Apple Silicon — it leverages the GPU cores in the SoC, which blurs the "CPU-only" line but is worth knowing about.
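For the thread-count tip, a small helper can look up the physical core count so you don't have to guess. This is a sketch: physical_cores is a made-up name, it relies on lscpu (Linux) or sysctl hw.physicalcpu (macOS), and it falls back to the logical CPU count when neither is available.

```shell
# Hypothetical helper: count physical cores so -t can be set to match
physical_cores() {
  n=""
  if command -v lscpu >/dev/null 2>&1; then
    # Linux: each unique (core, socket) pair is one physical core
    n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
  elif command -v sysctl >/dev/null 2>&1; then
    # macOS exposes the physical core count directly
    n=$(sysctl -n hw.physicalcpu 2>/dev/null)
  fi
  # If neither gave a usable number, fall back to the logical CPU count
  case "$n" in
    ''|0|*[!0-9]*) n=$(getconf _NPROCESSORS_ONLN) ;;
  esac
  echo "$n"
}

physical_cores
# e.g. ./llama-cli -m qwen2.5-7b-instruct-q4_k_m.gguf -ngl 0 -t "$(physical_cores)" -p "..."
```

Treat the result as a starting point and benchmark one or two values on either side of it.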
Bottom line
CPU-only local LLMs trade tokens per second for the cleanest privacy and air-gap story you can get: no proprietary driver stack, a tiny attack surface, and a box you can provably take offline. Run a 7-8B model at Q4_K_M on 16GB of RAM with Ollama or llama.cpp (-ngl 0), keep context tight, match threads to physical cores, and accept ~5-15 tokens/sec. If that speed works for your use case, you've got a private assistant that never has to touch the cloud. When you need it faster, that's your signal to add a GPU — start with the pillar guide to running open-weight models locally in 2026.
Frequently asked questions
See /blog/run-open-weight-models-locally-2026 for the full cornerstone guide.
