Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
Quantization Explained for Local AI
Quantization Explained for Local AI is a cornerstone page for the WikiWayne local-AI cluster.
Key takeaways
- Start with a small GGUF quant and verify VRAM on your own GPU before scaling model size.
- Use linked cluster posts for install steps and runner-specific commands.
10+ years in Digital Marketing & SEO
Quantization shrinks the numerical precision of a model's weights so a model that would never fit on your GPU suddenly does. A 7B model in full 16-bit precision wants roughly 14 GB just for weights; quantized to 4-bit it drops to around 4-5 GB and runs comfortably on a mid-range card or an M-series Mac. For most local setups, a Q4_K_M GGUF is the sweet spot, and you step up to Q8 only when a task is clearly suffering.
What is quantization in plain terms?
Quantization is the process of storing each model weight using fewer bits than the original training precision. Models are usually trained in 16-bit (FP16 or BF16), where every weight is a floating-point number with a fair amount of precision. Quantization rounds those numbers onto a coarser grid (8 bits, 5 bits, 4 bits, sometimes lower) so each weight takes up less space in memory and on disk.
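To make the "coarser grid" idea concrete, here is a minimal sketch of symmetric block quantization in plain Python. It is a simplified illustration, not the scheme GGUF K-quants actually use (those add per-block minimums and nested scales); the function names and sample weights are made up for the example.

```python
def quantize_block(weights, bits):
    """Round a block of floats onto a signed integer grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    codes, scale = quantize_block(weights, bits)
    restored = dequantize_block(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weights, chosen only to show the trend.
block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.88, 0.02]
for bits in (8, 4, 2):
    print(bits, "bits -> max rounding error", round(max_error(block, bits), 4))
```

Each step down in bit width leaves fewer grid points to round to, so the worst-case error grows. That is the fidelity loss the rest of this guide is about.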
The payoff is brutal and simple: fewer bits per weight means fewer gigabytes of VRAM, which means bigger models on smaller hardware, and usually faster inference because you're shuffling less data around. The cost is a small loss of fidelity. Done right, that loss is barely noticeable. Done too aggressively, the model gets dumber, more repetitive, and starts making mistakes it wouldn't make at full precision.
I run open-weight models like Qwen, Llama, Gemma, DeepSeek, Mistral, and Phi on Apple Silicon and consumer NVIDIA cards every day, and quantization is the single biggest lever between "this model won't load" and "this model is now my daily driver."
Why does quantization matter for running models locally?
Because VRAM is the wall everyone hits first. Cloud providers throw 80 GB datacenter GPUs at the problem; you have an 8 GB, 12 GB, 16 GB, or 24 GB card, or a Mac with shared memory. Quantization is how you fit a useful model inside that budget.
Here's the rough VRAM math for the weights alone. Multiply the parameter count by bytes-per-weight:
- FP16 (16-bit): ~2 bytes/param → a 7B model ≈ 14 GB
- Q8 (8-bit): ~1 byte/param → 7B ≈ 7-8 GB
- Q4 (4-bit): ~0.5-0.6 bytes/param once per-block scales are included → 7B ≈ 4-5 GB
Then add headroom for the KV cache (grows with context length) and runner overhead — budget another 1-3 GB depending on context window. So a 7B at Q4_K_M realistically wants ~6 GB total, while the same model at FP16 wants 16 GB+. That's the difference between running on a laptop and not running at all. For a deeper walkthrough, see my VRAM requirements guide and the focused how much VRAM for Llama 3 8B breakdown.
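The arithmetic above can be sketched as a few small functions. Treat this as a rough estimator, not a measurement. The KV-cache formula (two tensors per layer, one entry per token) is the standard back-of-envelope version, and the layer and head counts in the example are an assumed Llama-3-8B-style shape, not numbers from this article.

```python
GB = 1e9  # decimal gigabytes, matching the rough figures in the text


def weight_gb(params_billion, bits_per_weight):
    """Weights only: parameter count times bytes per weight."""
    return params_billion * 1e9 * (bits_per_weight / 8) / GB


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys plus values for every layer and token, 16-bit entries by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / GB


def total_gb(params_billion, bits_per_weight, kv_gb, overhead_gb=1.0):
    """Weights + KV cache + a flat allowance for runner overhead (a guess)."""
    return weight_gb(params_billion, bits_per_weight) + kv_gb + overhead_gb


# Assumed shape: 32 layers, 8 KV heads, head dimension 128, 4096-token context.
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
print("FP16 weights:", weight_gb(7, 16), "GB")
print("Q4_K_M weights:", round(weight_gb(7, 4.8), 1), "GB")
print("Q4_K_M total:", round(total_gb(7, 4.8, kv), 1), "GB")
```

Doubling the context length doubles the KV-cache term, which is why long-context sessions need more headroom than the weights alone suggest.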
What is GGUF and how does it relate to quantization?
GGUF is the file format used by llama.cpp (and every tool built on it — Ollama, LM Studio, KoboldCpp) that packages a quantized model plus its metadata into a single file you can download and run. When you grab a model from Hugging Face for local use, you're almost always grabbing a GGUF at a specific quant level.
The quant level is baked into the filename. qwen2.5-7b-instruct-q4_k_m.gguf tells you the model, size, and that it's a 4-bit K-quant, medium variant. You don't quantize anything yourself in the normal workflow: for popular models the community has usually already published the common quants, and you just pick the one that fits. If you want the full format rundown, I wrote a dedicated piece on what GGUF is.
What do Q4_K_M, Q5, Q8 and the other labels mean?
The naming looks cryptic but decodes cleanly:
- The number (Q4, Q5, Q6, Q8) is the approximate bits per weight. Lower = smaller and faster, but lower quality.
- The K means K-quant, a smarter block-wise scheme that allocates precision unevenly across the model. K-quants beat the old "legacy" quants at the same size.
- The suffix (_S, _M, _L) is small/medium/large within that level: _M keeps a few sensitive layers at higher precision than _S.
So Q4_K_M = 4-bit K-quant, medium. It's the community default because it's the best quality-per-gigabyte trade most people will find. Q8_0 is near-lossless and the one I reach for when quality matters more than space.
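The decoding rules above are regular enough to sketch as a small parser. This is an illustrative helper, not part of any runner: `parse_quant` and its output fields are hypothetical names, and it only covers the K-quant and legacy labels discussed here (it deliberately ignores other families such as IQ quants).

```python
import re

# Matches K-quants (Q4_K_M, Q6_K) and legacy quants (Q8_0, Q4_1),
# either on their own or embedded in a GGUF filename.
QUANT_RE = re.compile(
    r"(?<![a-z0-9])q(\d)_(?:k(?:_([sml]))?|([01]))(?![a-z0-9])",
    re.IGNORECASE,
)

VARIANTS = {"s": "small", "m": "medium", "l": "large"}


def parse_quant(name):
    """Describe the quant label found in a label or filename, or return None."""
    match = QUANT_RE.search(name)
    if match is None:
        return None
    bits, variant, legacy = match.groups()
    return {
        "label": match.group(0).upper(),
        "nominal_bits": int(bits),
        "k_quant": legacy is None,  # legacy labels end in _0 or _1
        "variant": VARIANTS[variant.lower()] if variant else None,
    }


print(parse_quant("qwen2.5-7b-instruct-q4_k_m.gguf"))
print(parse_quant("Q8_0"))
```

The nominal bit count is only the number in the label. The effective bits per weight are slightly higher because each block also stores its scale.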
Q4 vs Q8 vs FP16: which quantization should I pick?
Here's how I think about the common quant levels for a typical 7-8B model. Treat the VRAM and quality columns as ballpark guidance — always verify on your own stack, because runner overhead and context length move the numbers.
| Quant | Bits/weight | ~VRAM (7-8B) | Quality | Best for |
|---|---|---|---|---|
| Q2_K | ~2.6 | ~3-4 GB | Noticeably degraded | Desperation / tiny VRAM only |
| Q3_K_M | ~3.4 | ~4-5 GB | Acceptable, some drift | Squeezing onto 6 GB cards |
| Q4_K_M | ~4.8 | ~5-6 GB | Very good — recommended | Daily driver, general use |
| Q5_K_M | ~5.7 | ~6-7 GB | Excellent | Headroom to spare, chat + light code |
| Q6_K | ~6.6 | ~7-8 GB | Near-lossless | Code, reasoning, you have the VRAM |
| Q8_0 | ~8.5 | ~8-9 GB | Essentially lossless | Max quality, benchmarking, distill source |
| FP16 | 16 | ~14-16 GB | Reference | Fine-tuning, you have a big GPU |
The honest truth: between Q4_K_M and Q8 the quality gap on everyday chat and summarization is small enough that most people won't notice in a blind test. The gap widens on code generation, multi-step reasoning, and long-context tasks where small errors compound. I dig into exactly where that line falls in Q4 vs Q8 quality tradeoffs.
How do I choose a quant for my hardware? (decision list)
Use this as a quick "if X then Y":
- If you have 6-8 GB VRAM → run a 7-8B at Q4_K_M, or a 3-4B model at Q5/Q6 if you want more quality headroom.
- If you have 12 GB VRAM → 7-8B at Q5_K_M or Q6_K is comfortable; a 14B fits at Q4_K_M.
- If you have 16 GB VRAM → 14B at Q4_K_M/Q5_K_M, or a 7B at Q8 for near-lossless quality.
- If you have 24 GB VRAM → 32B-class models at Q4_K_M, or 14B at Q8.
- If you're on Apple Silicon → unified memory is your VRAM; an M-series with 16 GB handles 7-8B Q4/Q5 well, 32 GB+ opens up larger models. Look at MLX-format quants too via my MLX on Apple Silicon guide.
- If the model output feels dumb or repetitive → step up one quant level before you blame the model itself.
- If you're doing code or math → bias toward Q6_K or Q8; those tasks punish low quants hardest.
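The decision list can be approximated with a small picker that reuses the ballpark bits-per-weight figures from the comparison table. It is a sketch under stated assumptions: `pick_quant` is a hypothetical helper, the 2 GB headroom default is a guess for KV cache plus runner overhead, and real usage should still be verified on your own hardware.

```python
# Approximate bits per weight, copied from the comparison table in this guide.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.4, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0,
}


def pick_quant(params_billion, vram_gb, headroom_gb=2.0):
    """Return the highest-precision quant whose weights plus headroom fit, or None."""
    budget_gb = vram_gb - headroom_gb
    best = None
    for label, bits in BITS_PER_WEIGHT.items():  # ordered low to high precision
        weights_gb = params_billion * bits / 8
        if weights_gb <= budget_gb:
            best = label
    return best


print(pick_quant(32, 24))  # 32B-class model on a 24 GB card
print(pick_quant(8, 6))    # 8B model on a 6 GB card
```

The result is a ceiling, not a recommendation. Starting at Q4_K_M and stepping up only when a task needs it remains the cheaper default.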
When weights don't fully fit in VRAM, the runner can offload some layers to system RAM — slower but workable. That's a separate lever worth understanding; see GPU offload layers explained.
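As a rough illustration of partial offload, the sketch below estimates how many layers fit on the GPU. It assumes every layer is the same size and reserves a fixed amount of VRAM for the KV cache and runner. Both are simplifications, and `layers_on_gpu` is a hypothetical helper, not something a runner provides.

```python
def layers_on_gpu(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb // per_layer_gb))


# An 8 GB model file with 32 layers on a 6 GB card: only part of it fits.
print(layers_on_gpu(model_gb=8.0, n_layers=32, vram_gb=6.0))
```

The estimate is a starting value for a layer-count setting such as llama.cpp's -ngl. In practice you nudge it down until the model loads without running out of memory.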
How do I actually download and run a quantized model?
The easiest path is Ollama, which picks a sensible default quant for you and pulls a GGUF behind the scenes:
```bash
# Ollama grabs a Q4_K_M by default for most models
ollama run qwen2.5:7b

# Want a specific quant? Tag it explicitly
ollama run qwen2.5:7b-instruct-q8_0
```
With llama.cpp you point directly at a GGUF file and control everything, including how many layers go on the GPU:
```bash
# -ngl 99 requests up to 99 layers on the GPU (every layer of a 7B model);
# lower it if the model does not fit in VRAM
./llama-cli -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 -c 4096 -p "Explain quantization in one paragraph."
```
In LM Studio, the model search shows every available quant per model with an estimated memory footprint and a green/yellow/red fit indicator for your machine — the friendliest way to see the trade-off visually. My LM Studio download walkthrough covers it step by step, and if you're still picking a runner, LM Studio vs Ollama vs llama.cpp lays out which one fits which workflow.
Does quantization actually make models worse?
A little, and it depends entirely on how far you push it. Quantization increases perplexity — a measure of how surprised the model is by correct text, where lower is better. Down to Q5/Q6 the perplexity bump is tiny. At Q4 it's small but real. Below Q3 it gets ugly fast, and Q2 should be a last resort when nothing else fits.
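For readers who want the definition behind that word: perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually occurred. A minimal sketch follows. The log-probabilities are made-up inputs, not measurements from any model or quant.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability of the true tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical example: a model that gives every correct token probability 0.5
# is as uncertain as a fair coin flip at each step.
coin_flip = [math.log(0.5)] * 4
print(perplexity(coin_flip))
```

A quantized model that assigns slightly lower probability to the right tokens gets a slightly higher perplexity on the same text, which is how the quality cost of each quant level is usually compared.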
What this looks like in practice:
- Summarization, chat, simple Q&A tolerate aggressive quantization well — Q4 is plenty.
- Code generation wants more precision; subtle token errors break syntax. Lean Q6/Q8.
- Long-context and multi-step reasoning degrade faster because small errors compound across the chain.
My rule: start at Q4_K_M, and only spend more VRAM on a higher quant once a specific task shows you it needs it. Don't pre-optimize for quality you can't perceive.
Do I need a GPU to run quantized models?
A GPU helps a lot for 7B+ models at interactive speed, but it isn't strictly required. CPU-only inference works fine for smaller quants and smaller models, especially for privacy experiments where speed matters less than keeping data off the cloud. Expect noticeably slower token generation on CPU. I cover the trade-offs in CPU-only local LLM, and if you're shopping for hardware, best GPU for local AI 2026 is the place to start.
Bottom line
Quantization is the trick that makes local AI practical on hardware you already own — it trades a sliver of quality for a mountain of VRAM savings. Grab a Q4_K_M GGUF, confirm it fits with the rough math (params × bytes-per-weight, plus a couple GB for cache and overhead), and verify real usage on your own GPU before you reach for anything bigger. Step up to Q5, Q6, or Q8 only when a task tells you it needs the extra precision. Start small, measure, then scale.
Frequently asked questions
Yes. Cornerstone posts bump updatedAt when Ollama, LM Studio, or llama.cpp ship breaking changes; see the refresh log in Content Ideas.
A GPU helps for 7B+ models at interactive speed. CPU-only inference is supported for privacy experiments with smaller quants.
