See /blog/quantization-explained-local-ai for the full cornerstone guide.

Q4 vs Q8 Quant Quality Tradeoffs | WikiWayne

For most people running open-weight models locally, Q4 (specifically Q4_K_M) is the right default — it cuts a model's size roughly in half versus Q8 while keeping the vast majority of its quality, so you can fit a bigger, smarter model in the same VRAM. Spend the extra gigabytes on Q8 only when you're running a small model (under ~3B), doing precision-sensitive work like code or structured JSON, or you already have memory to burn. Below I'll show exactly where the line falls and how to decide on your own hardware.

What do Q4 and Q8 actually mean?

Quantization is the process of storing a model's weights at lower numerical precision to shrink its memory footprint. A model trained in 16-bit (FP16/BF16) gets compressed down to a handful of bits per weight, and the number in the name tells you roughly how many.

Q8 stores weights at about 8 bits each — near-lossless, basically indistinguishable from the full-precision model in practice.
Q4 stores weights at about 4 bits each — half the size of Q8, with a small, usually-hard-to-notice quality hit.

In the GGUF world (the file format used by llama.cpp, Ollama, LM Studio, and KoboldCpp), you'll almost never see a plain "Q4." You'll see Q4_K_M, Q4_K_S, Q5_K_M, Q6_K, Q8_0, and friends. The _K means k-quants: a smarter scheme that keeps the most sensitive tensors at higher precision and squeezes the rest harder. The _M / _S suffix is medium vs small: the medium variant keeps more tensors at higher precision, so it is slightly larger and slightly more accurate. Q4_K_M is the community default for a reason: it's the sweet spot on the size-vs-quality curve. If you want the full mental model of how this all works, start with the pillar: quantization explained for local AI.
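To make the bit counts concrete, here is a minimal sketch of symmetric block quantization in Python. It is a simplification, not llama.cpp's actual k-quant code: the function names, the Gaussian test weights, and the seed are assumptions for illustration. The block of 32 weights sharing one 16-bit scale mirrors the Q8_0 layout, which is where the ~8.5 bits per weight figure comes from.

```python
import random

# Q8_0 layout: 32 weights at 8 bits each plus one 16-bit scale per block.
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32

def quantize_block(weights, bits):
    """Round-trip one block through symmetric integer quantization (simplified)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0        # one shared scale per block
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]           # dequantized values

def rms_error(original, restored):
    """Root-mean-square difference between original and restored weights."""
    return (sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)) ** 0.5

random.seed(0)                                      # hypothetical test data
block = [random.gauss(0.0, 0.02) for _ in range(32)]

err8 = rms_error(block, quantize_block(block, 8))
err4 = rms_error(block, quantize_block(block, 4))
print(f"Q8_0 bits per weight: {Q8_0_BITS_PER_WEIGHT}")
print(f"8-bit RMS error: {err8:.6f}")
print(f"4-bit RMS error: {err4:.6f}")
```

The 4-bit round trip has only 15 levels to work with against 255 for 8-bit, so its error is visibly larger. Real k-quants narrow that gap with per-sub-block scales and mixed precision across tensors, which this sketch does not model.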

Q4 vs Q8: which should I use?

Here's the honest comparison. Numbers below are realistic ballpark ranges for a typical ~8B model in GGUF — verify the exact figures on your own stack, because they shift with model size, context length, and runner.

Factor	Q4_K_M	Q8_0
Bits per weight (approx)	~4.8	~8.5
File size, 8B model	~4.5–5 GB	~8.5 GB
File size, 70B model	~40–43 GB	~75 GB
Quality vs FP16	Very close; tiny degradation	Effectively identical
Tokens/sec (same GPU)	Faster (less memory bandwidth)	Slightly slower
Best for	Default everyday use, fitting bigger models	Small models, code, strict JSON, max fidelity
VRAM friendliness	Excellent	Demanding

The key insight that trips people up: a Q4 of a bigger model usually beats a Q8 of a smaller one. If your memory budget can hold either a Q8 of a 13B (roughly 14 GB) or a Q4 of a 30B (roughly 17 to 18 GB), the Q4 30B usually wins on real-world quality. Parameter count buys you more than precision does, right up until the quant gets aggressive (Q3 and below), where quality starts to degrade noticeably.
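The file sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below shows that calculation in Python. The function name is hypothetical, and the 4.5 to 4.9 range for Q4_K_M is an assumption chosen to bracket the ballpark figures above, so treat the output as an estimate rather than exact file sizes.

```python
# Approximate bits per weight as (low, high). The Q4_K_M range is an assumption
# that brackets the ballpark figures in the table; Q8_0 uses ~8.5.
BITS_PER_WEIGHT = {"Q4_K_M": (4.5, 4.9), "Q8_0": (8.5, 8.5)}

def estimate_size_gb(params_billion, quant):
    """Return a (low, high) GGUF file-size estimate in GB."""
    low_bpw, high_bpw = BITS_PER_WEIGHT[quant]
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    return (params_billion * low_bpw / 8, params_billion * high_bpw / 8)

cases = [(8, "Q4_K_M"), (8, "Q8_0"), (13, "Q8_0"),
         (30, "Q4_K_M"), (70, "Q4_K_M"), (70, "Q8_0")]
for params, quant in cases:
    low, high = estimate_size_gb(params, quant)
    print(f"{params}B {quant}: {low:.1f}-{high:.1f} GB")
```

Running the same function on the models you are choosing between is a quick way to see which size-and-quant combinations land in the same memory budget before downloading anything.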

When is Q8 actually worth the extra gigabytes?

Q8 earns its disk space in specific situations, not as a blanket "more is better" choice:

Small models (under ~3B). Phi, small Qwen, small Gemma, and the like have less redundancy to absorb quantization error. A 1.5B or 3B model at Q4 can get noticeably dumber; at Q8 it holds up. For tiny models, just run Q8 — the file is small anyway.
Code generation and tool calling. A single wrong token can break compilation, and a single malformed JSON object can break a tool call. The marginal precision of Q8 reduces those rare slips.
Strict structured output. Grammars, function-calling schemas, and anything where format correctness is non-negotiable benefit from the extra fidelity.
You already have the memory. If a 7B Q8 fits comfortably in your VRAM with room for context, there's no reason to downgrade to Q4.
Long, multi-turn reasoning chains. Small quantization errors can compound over a very long generation; Q8 is more stable here.

If none of those apply, Q4_K_M is the move.

When is Q4 the smarter choice?

You want the biggest model that fits. This is the most common reason. Q4 lets you jump a model tier — run a 30B instead of a 13B, or a 70B instead of a 34B.
You're tight on VRAM or running CPU-only. Less data to move means faster inference, especially when memory bandwidth is your bottleneck. See GPU offload layers explained for squeezing partial-GPU setups.
General chat, summarization, brainstorming, RAG. These are forgiving workloads where Q4's tiny quality dip is invisible.
Apple Silicon with unified memory. Q4 leaves more headroom for context and other apps sharing that pool.

How much memory do I actually need for each?

Rough rule of thumb for VRAM (or unified memory): GGUF file size plus 1–3 GB for the KV cache and overhead, scaling up with context length. So an 8B Q4_K_M (~5 GB file) wants roughly 6–8 GB of memory in total; the same model at Q8 (~8.5 GB) wants closer to 10–12 GB. Long contexts (32K+) push those numbers higher. Do the math against your card before you download; I walk through it in how much VRAM for Llama 3 8B and the broader VRAM requirements guide. Always confirm real usage with nvidia-smi (NVIDIA) or Activity Monitor (Mac) under load, since the back-of-envelope math tends to undercount.
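The rule of thumb can be sketched in a few lines of Python. The KV cache term uses the standard formula (keys and values for every layer, KV head, and token), while the model shape (32 layers, 8 KV heads, head dimension 128, as in a Llama-3-8B-style model), the 16-bit cache, and the flat 1 GB overhead are assumptions. The function names are hypothetical, and GB and GiB are treated loosely because the difference is within the noise of this estimate.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of the key/value cache in GiB for a given context length."""
    # Two tensors (keys and values) per layer, one vector per KV head per token.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30

def estimate_memory_gb(file_size_gb, kv_gib, overhead_gb=1.0):
    """File size plus KV cache plus a flat allowance for runtime buffers (assumed)."""
    return file_size_gb + kv_gib + overhead_gb

# Assumed shape of a Llama-3-8B-style model: 32 layers, 8 KV heads, head dim 128.
for ctx in (8192, 32768):
    kv = kv_cache_gib(32, 8, 128, ctx)
    q4 = estimate_memory_gb(5.0, kv)   # ~5 GB Q4_K_M file
    q8 = estimate_memory_gb(8.5, kv)   # ~8.5 GB Q8_0 file
    print(f"ctx={ctx}: KV cache {kv:.1f} GiB, Q4_K_M ~{q4:.1f} GB, Q8_0 ~{q8:.1f} GB")
```

Under these assumptions the cache costs 1 GiB at 8K context and 4 GiB at 32K, which is why long contexts push the totals past the 6–8 GB and 10–12 GB ranges. Runners that quantize the KV cache or use a different attention layout will land elsewhere, so check measured usage.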

How do I pull and test each quant?

The fastest way to settle the debate is to run both and feel the difference on your own prompts. Here's how across the common runners.

Ollama — pull a specific quant by tag:

# Q4_K_M (the default for most Ollama models)
ollama pull qwen2.5:7b

# Explicit Q8 variant when the model offers it
ollama pull qwen2.5:7b-instruct-q8_0

# Compare side by side
ollama run qwen2.5:7b "Write a Python function to merge two sorted lists."
ollama run qwen2.5:7b-instruct-q8_0 "Write a Python function to merge two sorted lists."

llama.cpp — download the GGUF you want from Hugging Face and point at it:

# Grab a specific quant file (use the exact filename from the repo)
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF \
  Qwen2.5-7B-Instruct-Q4_K_M.gguf --local-dir ./models

# Run it
./llama-cli -m ./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -p "Explain quantization in two sentences." -ngl 99

Swap Q4_K_M for Q8_0 in the filename to fetch the higher-precision build. New to building llama.cpp? The CUDA quickstart and the complete llama.cpp guide cover setup.

LM Studio — search the model in the discover tab, and the download panel lists every quant with its file size and a green/yellow/red "fits your hardware" indicator. Pick Q4_K_M for the default, Q8_0 for max fidelity. Step-by-step in downloading models in LM Studio.

How do I tell if a quant is hurting quality?

Don't trust vibes alone — run a quick A/B with prompts that stress the model:

Reasoning: a multi-step word problem or logic puzzle.
Code: a function with edge cases, then actually run the output.
Format: ask for strict JSON and check it parses.
Recall: a question about something niche where hallucination shows up.

Run each prompt 3–5 times at a low temperature on both quants. If Q4 and Q8 give equivalent answers, you've got your proof that Q4 is fine for that workload — keep the gigabytes. If Q4 stumbles on code or format while Q8 holds, you've found a case where the upgrade pays off.
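The "Format" check is the easiest one to automate. Below is a minimal Python sketch that sends the same prompt to two quants through Ollama's local HTTP API and reports how often each reply parses as strict JSON. It assumes an Ollama server on the default port with both models already pulled. The helper names are hypothetical, and the strict check will also fail replies that wrap valid JSON in extra text, so read the raw output before drawing conclusions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def ask_ollama(model, prompt, temperature=0.1):
    """Send one non-streaming prompt to a local Ollama server and return the text."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]

def parses_as_json(text):
    """True if the whole reply is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def json_pass_rate(replies):
    """Fraction of replies that parse as JSON."""
    return sum(parses_as_json(r) for r in replies) / len(replies)

def compare_quants(models, prompt, runs=5, ask=ask_ollama):
    """Run the same prompt several times per model and report JSON pass rates."""
    return {m: json_pass_rate([ask(m, prompt) for _ in range(runs)]) for m in models}
```

A call such as compare_quants(["qwen2.5:7b", "qwen2.5:7b-instruct-q8_0"], "Return a JSON object with keys name and age for a fictional person. Output JSON only.") returns a pass rate per model. The ask parameter exists so the network call can be swapped for a stub when testing the harness itself.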

Quick decision list

If you have plenty of VRAM and want max fidelity → Q8_0.
If you want the best quality that fits your hardware → biggest model you can run at Q4_K_M.
If you're running a model under ~3B → Q8_0; the file's small and Q4 hurts small models.
If you're doing code or strict structured output → Q8_0 (or at least Q6_K).
If you're on CPU-only or a tight GPU → Q4_K_M, every time.
If you're unsure → Q4_K_M. It's the default for a reason.
If Q4 feels too aggressive but Q8 won't fit → try Q5_K_M or Q6_K as the middle ground.
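For anyone who prefers the list as logic, here is one way to encode it in Python. The list itself does not say which rule wins when several apply, so the ordering below (memory limits first, then model size, then task sensitivity) is an assumption, as are the function and parameter names.

```python
def pick_quant(params_billion, precision_sensitive=False, q8_fits=False, tight_memory=False):
    """Suggest a GGUF quant from the decision list; the rule order is an assumption."""
    if tight_memory:
        return "Q4_K_M"                        # CPU-only or a tight GPU
    if params_billion < 3:
        return "Q8_0"                          # small models lose more at Q4
    if precision_sensitive:
        return "Q8_0" if q8_fits else "Q6_K"   # code or strict structured output
    if q8_fits:
        return "Q8_0"                          # memory to spare, max fidelity
    return "Q4_K_M"                            # the default when unsure

print(pick_quant(1.5))                          # small model
print(pick_quant(8, precision_sensitive=True))  # code work, Q8 does not fit
print(pick_quant(30))                           # biggest model that fits
```

Treat the return value as a starting point for the A/B test rather than a verdict, since the right answer still depends on how the quants behave on your own prompts.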

Bottom line

Q4_K_M is the workhorse — half the size of Q8, almost all the quality, and it lets you run bigger, smarter models on the same hardware. Reach for Q8 only when the model is small, the task is precision-sensitive, or you've got memory to spare. Best move: download both for one model you care about, run your own prompts through them, and let the results decide. For the full picture on how quantization works under the hood, head back to the pillar: quantization explained for local AI.
