Is this page updated when runners change?

Yes. Cornerstone posts bump updatedAt when Ollama, LM Studio, or llama.cpp ship breaking changes; see the refresh log in Content Ideas.

Do I need a GPU?

A GPU helps for 7B+ models at interactive speed. CPU-only inference works for privacy experiments with smaller quants.

Quantization Explained for Local AI | WikiWayne

Quantization Explained for Local AI

Quantization shrinks the numerical precision of a model's weights so a model that would never fit on your GPU suddenly does. A 7B model in full 16-bit precision wants roughly 14 GB just for weights; quantized to 4-bit it drops to around 4-5 GB and runs comfortably on a mid-range card or an M-series Mac. For most local setups, a Q4_K_M GGUF is the sweet spot, and you step up to Q8 only when a task is clearly suffering.

What is quantization in plain terms?

Quantization is the process of storing each model weight using fewer bits than the original training precision. Models are usually trained in 16-bit (FP16 or BF16), where every weight is a 16-bit floating-point number. Quantization rounds those numbers onto a coarser grid (8 bits, 5 bits, 4 bits, sometimes lower) so each weight takes up less space in memory and on disk.

The payoff is brutal and simple: fewer bits per weight means fewer gigabytes of VRAM, which means bigger models on smaller hardware, and usually faster inference, because token generation is largely limited by how fast the weights can be read from memory. The cost is a small loss of fidelity. Done right, that loss is barely noticeable. Done too aggressively, the model gets dumber, more repetitive, and starts making mistakes it wouldn't make at full precision.

I run open-weight models like Qwen, Llama, Gemma, DeepSeek, Mistral, and Phi on Apple Silicon and consumer NVIDIA cards every day, and quantization is the single biggest lever between "this model won't load" and "this model is now my daily driver."

Why does quantization matter for running models locally?

Because VRAM is the wall everyone hits first. Cloud providers throw 80 GB datacenter GPUs at the problem; you have an 8 GB, 12 GB, 16 GB, or 24 GB card, or a Mac with shared memory. Quantization is how you fit a useful model inside that budget.

Here's the rough VRAM math for the weights alone. Multiply the parameter count by bytes-per-weight:

FP16 (16-bit): ~2 bytes/param → a 7B model ≈ 14 GB
Q8 (8-bit): ~1 byte/param → 7B ≈ 7-8 GB
Q4 (4-bit): ~0.5 bytes/param → 7B ≈ 4-5 GB

Then add headroom for the KV cache (grows with context length) and runner overhead — budget another 1-3 GB depending on context window. So a 7B at Q4_K_M realistically wants ~6 GB total, while the same model at FP16 wants 16 GB+. That's the difference between running on a laptop and not running at all. For a deeper walkthrough, see my VRAM requirements guide and the focused how much VRAM for Llama 3 8B breakdown.
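If you'd rather script that arithmetic than do it in your head, here's a minimal sketch. The function name estimate_vram_gb and the flat overhead figure are my own placeholders, not something any runner exposes, and the result is only as good as the ballpark numbers you feed it.

```shell
#!/bin/sh
# Rough VRAM estimate: weights (params x bits / 8) plus a flat allowance
# for KV cache and runner overhead. Every input is a ballpark figure.
estimate_vram_gb() {
  params_b=$1      # parameter count in billions, e.g. 7
  bits=$2          # bits per weight, e.g. 16, 8, 4.8
  overhead_gb=$3   # assumed KV cache + runner overhead in GB, e.g. 1.5
  awk -v p="$params_b" -v b="$bits" -v o="$overhead_gb" \
    'BEGIN { printf "%.1f\n", p * b / 8 + o }'
}

estimate_vram_gb 7 16 2     # FP16 with 2 GB headroom   -> 16.0
estimate_vram_gb 7 4.8 1.5  # Q4_K_M with 1.5 GB headroom -> 5.7
```

Treat the output as a first filter only. Long context windows can push the KV cache well past the flat allowance assumed here.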

What is GGUF and how does it relate to quantization?

GGUF is the file format used by llama.cpp (and every tool built on it — Ollama, LM Studio, KoboldCpp) that packages a quantized model plus its metadata into a single file you can download and run. When you grab a model from Hugging Face for local use, you're almost always grabbing a GGUF at a specific quant level.

The quant level is baked into the filename. qwen2.5-7b-instruct-q4_k_m.gguf tells you the model, size, and that it's a 4-bit K-quant, medium variant. You don't quantize anything yourself in the normal workflow — the community already published every common quant, and you just pick the one that fits. If you want the full format rundown, I wrote a dedicated piece on what GGUF is.
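When you have a folder full of downloads, it can help to pull the quant tag out of each filename. This is a small sketch that assumes the "<model>-<quant>.gguf" pattern from the example above. Plenty of uploads use dots or other separators instead, so quant_from_filename is a hypothetical helper you'd adapt, not a standard tool.

```shell
#!/bin/sh
# Extract the quant tag from a GGUF filename, assuming the tag is the
# last dash-separated field before the .gguf extension.
quant_from_filename() {
  name=${1##*/}        # strip any directory part
  name=${name%.gguf}   # strip the extension
  # Keep only the text after the last dash and upper-case it.
  printf '%s\n' "${name##*-}" | tr '[:lower:]' '[:upper:]'
}

quant_from_filename qwen2.5-7b-instruct-q4_k_m.gguf   # -> Q4_K_M
```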

What do Q4_K_M, Q5, Q8 and the other labels mean?

The naming looks cryptic but decodes cleanly:

The number (Q4, Q5, Q6, Q8) is the approximate bits per weight. Lower = smaller and faster, but lower quality.
The K means K-quant, a smarter block-wise scheme that allocates precision unevenly across the model. K-quants beat the old "legacy" quants at the same size.
The suffix (_S, _M, _L) is small/medium/large within that level — _M keeps a few sensitive layers at higher precision than _S.

So Q4_K_M = 4-bit K-quant, medium. It's the community default because it's the best quality-per-gigabyte trade most people will find. Q8_0 is near-lossless and the one I reach for when quality matters more than space.
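The three decoding rules above are simple enough to express as a few pattern matches. The sketch below is illustrative only: describe_quant is a made-up helper, it only understands labels that start with Q plus a digit, and it lumps everything without a K (such as Q8_0) under "legacy quant".

```shell
#!/bin/sh
# Describe a quant label such as Q4_K_M using the number / K / suffix rules.
describe_quant() {
  label=$1
  case $label in
    Q[0-9]*) ;;                                   # supported shape
    *) echo "unrecognized label: $label"; return 1 ;;
  esac
  rest=${label#Q}      # "4_K_M"
  bits=${rest%%_*}     # "4"
  case $label in
    *_K*) scheme="K-quant" ;;
    *)    scheme="legacy quant" ;;
  esac
  case $label in
    *_S) size=", small" ;;
    *_M) size=", medium" ;;
    *_L) size=", large" ;;
    *)   size="" ;;
  esac
  printf '%s-bit %s%s\n' "$bits" "$scheme" "$size"
}

describe_quant Q4_K_M   # -> 4-bit K-quant, medium
describe_quant Q8_0     # -> 8-bit legacy quant
```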

Q4 vs Q8 vs FP16: which quantization should I pick?

Here's how I think about the common quant levels for a typical 7-8B model. Treat the VRAM and quality columns as ballpark guidance — always verify on your own stack, because runner overhead and context length move the numbers.

Quant	Bits/weight	~VRAM (7-8B)	Quality	Best for
Q2_K	~2.6	~3-4 GB	Noticeably degraded	Desperation / tiny VRAM only
Q3_K_M	~3.4	~4-5 GB	Acceptable, some drift	Squeezing onto 6 GB cards
Q4_K_M	~4.8	~5-6 GB	Very good — recommended	Daily driver, general use
Q5_K_M	~5.7	~6-7 GB	Excellent	Headroom to spare, chat + light code
Q6_K	~6.6	~7-8 GB	Near-lossless	Code, reasoning, you have the VRAM
Q8_0	~8.5	~8-9 GB	Essentially lossless	Max quality, benchmarking, distill source
FP16	16	~14-16 GB	Reference	Fine-tuning, you have a big GPU

The honest truth: between Q4_K_M and Q8 the quality gap on everyday chat and summarization is small enough that most people won't notice in a blind test. The gap widens on code generation, multi-step reasoning, and long-context tasks where small errors compound. I dig into exactly where that line falls in Q4 vs Q8 quality tradeoffs.

How do I choose a quant for my hardware? (decision list)

Use this as a quick "if X then Y":

If you have 6-8 GB VRAM → run a 7-8B at Q4_K_M, or a 3-4B model at Q5/Q6 if you want more quality headroom.
If you have 12 GB VRAM → 7-8B at Q5_K_M or Q6_K is comfortable; a 14B fits at Q4_K_M.
If you have 16 GB VRAM → 14B at Q4_K_M/Q5_K_M, or a 7B at Q8 for near-lossless quality.
If you have 24 GB VRAM → 32B-class models at Q4_K_M, or 14B at Q8.
If you're on Apple Silicon → unified memory is your VRAM; an M-series with 16 GB handles 7-8B Q4/Q5 well, 32 GB+ opens up larger models. Look at MLX-format quants too via my MLX on Apple Silicon guide.
If the model output feels dumb or repetitive → step up one quant level before you blame the model itself.
If you're doing code or math → bias toward Q6_K or Q8; those tasks punish low quants hardest.
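The VRAM tiers in that list can be turned into a quick lookup. This is a sketch, not a rule engine: recommend_quant is a hypothetical name, it takes whole gigabytes only, and for budgets that fall between the listed tiers (10 GB, say) it falls back to the next tier down, which is my assumption rather than anything stated above.

```shell
#!/bin/sh
# Map a VRAM budget (whole GB) to the starting points from the list above.
recommend_quant() {
  vram_gb=$1
  if [ "$vram_gb" -ge 24 ]; then
    echo "32B-class at Q4_K_M, or 14B at Q8"
  elif [ "$vram_gb" -ge 16 ]; then
    echo "14B at Q4_K_M/Q5_K_M, or 7B at Q8"
  elif [ "$vram_gb" -ge 12 ]; then
    echo "7-8B at Q5_K_M or Q6_K, or 14B at Q4_K_M"
  elif [ "$vram_gb" -ge 6 ]; then
    echo "7-8B at Q4_K_M, or 3-4B at Q5/Q6"
  else
    echo "below the budgets covered by the list"
  fi
}

recommend_quant 12
```

On an NVIDIA card you can read the total memory with nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits, which reports MiB, and round down to whole GB before passing it in.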

When weights don't fully fit in VRAM, the runner can offload some layers to system RAM — slower but workable. That's a separate lever worth understanding; see GPU offload layers explained.

How do I actually download and run a quantized model?

The easiest path is Ollama, which picks a sensible default quant for you and pulls a GGUF behind the scenes:

# Ollama grabs a Q4_K_M by default for most models
ollama run qwen2.5:7b

# Want a specific quant? Tag it explicitly
ollama run qwen2.5:7b-instruct-q8_0

With llama.cpp you point directly at a GGUF file and control everything, including how many layers go on the GPU:

# -ngl 99 requests up to 99 GPU layers, which covers every layer of a 7B model; lower it if you run out of VRAM
./llama-cli -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 -c 4096 -p "Explain quantization in one paragraph."
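If the model doesn't fully fit, you need a smaller -ngl value. Here's a rough way to pick a starting point. It assumes every layer is the same size, which is only approximately true, and estimate_ngl plus all the numbers in the example are placeholders rather than measurements. Expect to adjust the result up or down after watching real memory use.

```shell
#!/bin/sh
# Estimate a starting -ngl value for partial GPU offload.
# Assumes the file size is spread evenly across the layers.
estimate_ngl() {
  model_gb=$1   # size of the GGUF file in GB
  n_layers=$2   # number of layers in the model
  free_gb=$3    # VRAM you are willing to spend on weights, in GB
  awk -v m="$model_gb" -v n="$n_layers" -v f="$free_gb" 'BEGIN {
    per_layer = m / n              # approximate GB per layer
    fit = int(f / per_layer)       # whole layers that fit in the budget
    if (fit > n) fit = n           # never ask for more layers than exist
    print fit
  }'
}

# Hypothetical 4 GB model with 32 layers and 2 GB of VRAM for weights
estimate_ngl 4 32 2   # -> 16
```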

In LM Studio, the model search shows every available quant per model with an estimated memory footprint and a green/yellow/red fit indicator for your machine — the friendliest way to see the trade-off visually. My LM Studio download walkthrough covers it step by step, and if you're still picking a runner, LM Studio vs Ollama vs llama.cpp lays out which one fits which workflow.

Does quantization actually make models worse?

A little, and it depends entirely on how far you push it. Quantization increases perplexity — a measure of how surprised the model is by correct text, where lower is better. Down to Q5/Q6 the perplexity bump is tiny. At Q4 it's small but real. Below Q3 it gets ugly fast, and Q2 should be a last resort when nothing else fits.
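For reference, this is the usual textbook definition of perplexity over a test text of N tokens, where p_θ is the probability the model assigns to each token given the ones before it. The symbols are my notation, not tied to any particular runner's output:

```latex
\mathrm{PPL}(x_1,\dots,x_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
```

A quantized model that assigns slightly lower probability to the correct tokens gets a slightly higher number, which is why the comparison is always made against the full-precision model on the same text.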

What this looks like in practice:

Summarization, chat, simple Q&A tolerate aggressive quantization well — Q4 is plenty.
Code generation wants more precision; subtle token errors break syntax. Lean Q6/Q8.
Long-context and multi-step reasoning degrade faster because small errors compound across the chain.

My rule: start at Q4_K_M, and only spend more VRAM on a higher quant once a specific task shows you it needs it. Don't pre-optimize for quality you can't perceive.

Do I need a GPU to run quantized models?

A GPU helps a lot for 7B+ models at interactive speed, but it isn't strictly required. CPU-only inference works fine for smaller quants and smaller models, especially for privacy experiments where speed matters less than keeping data off the cloud. Expect noticeably slower token generation on CPU. I cover the trade-offs in CPU-only local LLM, and if you're shopping for hardware, best GPU for local AI 2026 is the place to start.

Bottom line

Quantization is the trick that makes local AI practical on hardware you already own — it trades a sliver of quality for a mountain of VRAM savings. Grab a Q4_K_M GGUF, confirm it fits with the rough math (params × bytes-per-weight, plus a couple GB for cache and overhead), and verify real usage on your own GPU before you reach for anything bigger. Step up to Q5, Q6, or Q8 only when a task tells you it needs the extra precision. Start small, measure, then scale.
