What Is GGUF? The Local LLM File Format Explained | WikiWayne

GGUF (GPT-Generated Unified Format) is the single-file format that packs a model's tensors and all its metadata together so llama.cpp-based runners can load it fast and run it on regular hardware. If you've ever downloaded a .gguf file from Hugging Face and dropped it into Ollama, LM Studio, or KoboldCpp, you've already used it. It's the de facto standard for running open-weight models locally on CPU, Apple Silicon, and consumer GPUs.

I run Qwen, Llama, Gemma, DeepSeek, and Mistral on a Mac and a couple of NVIDIA boxes every day, and GGUF is the format that holds 90% of that together. Here's everything you actually need to know to pick the right file and run it.

What does GGUF actually stand for and what is it?

GGUF stands for GPT-Generated Unified Format. It's a binary container that stores model weights (usually quantized) plus a key-value metadata block (architecture, tokenizer, context length, chat template, RoPE settings) in one file the runner can memory-map directly.

The "unified" part is the whole point. Older formats made the runner guess half the configuration. GGUF bakes everything the loader needs into the file itself, so a single .gguf is self-describing. You don't ship a config.json, a tokenizer.json, and three other files alongside it — it's all in there.
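If you want to see the self-describing part for yourself, here's a minimal sketch that reads the fixed-size header at the start of a file. It assumes the little-endian layout used by GGUF versions 2 and 3 (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata count); the function names are mine, not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes v2/v3, little-endian)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    # uint32 version, uint64 tensor count, uint64 metadata key-value count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}


def read_gguf_header_from_file(path: str) -> dict:
    """Read only the first 24 bytes of a file on disk and parse them."""
    with open(path, "rb") as f:
        return read_gguf_header(f.read(24))
```

Pointing read_gguf_header_from_file at any .gguf you have on disk should report the format version plus how many tensors and metadata keys follow. The key-value block itself comes right after these 24 bytes and needs a fuller parser.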

Why did GGUF replace GGML?

GGUF replaced GGML in August 2023 because GGML was too rigid: adding a new model architecture or tokenizer often broke old files, and metadata lived outside the weights. GGUF fixed that with an extensible key-value header, so new architectures slot in without breaking existing files.

In practice this means a GGUF you downloaded last year still loads in today's llama.cpp, and a brand-new architecture (say a fresh GLM or Qwen release) works the moment the runner adds support — no format change required. If you still see .bin GGML files floating around, treat them as dead. Everything modern is GGUF.

What is inside a GGUF file?

Three things live in every GGUF file:

Tensors — the actual model weights, usually quantized to shrink them down.
Metadata — architecture name, context length, RoPE scaling, embedding dimensions, and the chat/prompt template.
Tokenizer — the full vocabulary and merge rules, so you don't need a separate tokenizer file.

That bundled chat template matters more than people expect. It's why ollama run or LM Studio can format your messages correctly without you hand-writing <|im_start|> tags. When a model gives weird, rambling output, a broken or missing template baked into the GGUF is a common culprit.
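To make the template's job concrete, here's a rough sketch of what a ChatML-style template (the family that uses those <|im_start|> tags) does to a list of messages. Real templates are stored as a Jinja string in the file's metadata and vary by model, so treat format_chatml as a hypothetical stand-in, not what your runner literally executes.

```python
def format_chatml(messages: list[dict]) -> str:
    """Render messages in ChatML style and leave the assistant turn open."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end tags with the role on the first line.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    # The trailing open tag tells the model it is the assistant's turn to write.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Explain GGUF in one sentence."},
])
print(prompt)
```

If the template in a file is missing or wrong, the model sees raw text without these turn markers, which is one way you end up with the rambling output described above.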

How does quantization work in GGUF?

Quantization stores each weight at lower precision (4-bit, 5-bit, 8-bit) instead of 16-bit, cutting file size and memory roughly in proportion to the bit count while trading away a little accuracy. Most GGUF files you'll download use K-quants: block-based schemes labeled like Q4_K_M that quantize most tensors aggressively but keep the sensitive ones (such as some attention and feed-forward tensors) at higher precision.
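The reason a "4-bit" quant costs a bit more than 4 bits per weight is that every block also stores its own scale data. Here's a small sketch of that arithmetic. The block sizes are my reading of ggml's block structs, so treat them as assumptions and check the llama.cpp source if you need exact numbers.

```python
# (bytes per block, weights per block); assumed from ggml's block layouts
BLOCK_LAYOUT = {
    "Q2_K": (84, 256),
    "Q3_K": (110, 256),
    "Q4_K": (144, 256),
    "Q5_K": (176, 256),
    "Q6_K": (210, 256),
    "Q8_0": (34, 32),
}


def bits_per_weight(quant_type: str) -> float:
    """Effective bits per weight, including per-block scale overhead."""
    block_bytes, weights_per_block = BLOCK_LAYOUT[quant_type]
    return block_bytes * 8 / weights_per_block


for name in BLOCK_LAYOUT:
    print(f"{name}: {bits_per_weight(name):.4f} bits per weight")
```

Under those assumptions Q4_K works out to 4.5 bits per weight and Q8_0 to 8.5. A mixed file such as Q4_K_M lands somewhat above its base type because some tensors are stored at a higher-precision type.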

Reading a quant name like Q4_K_M:

Q4 — roughly 4 bits per weight
_K — K-quant method (smarter per-block scaling than the old "legacy" quants)
_M — size tier within that level: S (small), M (medium), L (large)

So Q4_K_M is "4-bit, K-quant, medium" — and it's the default I reach for nine times out of ten. I go deeper on the precision tradeoff in Q4 vs Q8: quant quality tradeoffs and the broader quantization explained guide.
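If you'd rather decode quant names in a script, the naming rules above fit in a small parser. This is only a sketch for the name styles used in this article (Q4_K_M, Q6_K, Q8_0); parse_quant_name is a hypothetical helper, and it deliberately rejects anything else, such as the IQ-style names.

```python
import re

# Q<bits>_<method>[_<tier>], e.g. Q4_K_M, Q6_K, Q8_0
_QUANT_RE = re.compile(r"^Q(\d)_(K|0|1)(?:_(S|M|L))?$")
_TIERS = {"S": "small", "M": "medium", "L": "large"}


def parse_quant_name(name: str) -> dict:
    """Split a quant label into nominal bits, method, and size tier."""
    match = _QUANT_RE.match(name.upper())
    if match is None:
        raise ValueError(f"unrecognized quant name: {name!r}")
    bits, method, tier = match.groups()
    return {
        "bits": int(bits),
        "method": "k-quant" if method == "K" else "legacy",
        "tier": _TIERS.get(tier),  # None when the name has no tier suffix
    }


print(parse_quant_name("Q4_K_M"))
```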

Which GGUF quant should I download?

Here's the cheat sheet I actually use. Sizes assume a ~7-8B model — scale up proportionally for bigger ones, and always verify on your own stack.

Quant	Bits (approx)	Rough size (7-8B)	Quality	Use it when
`Q2_K`	~2.6	~3 GB	Noticeably degraded	Only if you're desperate for space
`Q3_K_M`	~3.4	~3.5-4 GB	Usable, some loss	Tight VRAM, simple tasks
`Q4_K_M`	~4.5	~4.5-5 GB	Great balance	Default pick for most people
`Q5_K_M`	~5.5	~5.5-6 GB	Very close to full	A bit more headroom, want quality
`Q6_K`	~6.6	~6.5-7 GB	Near-lossless	Quality matters, VRAM allows
`Q8_0`	~8.5	~8 GB	Essentially full	Max fidelity, you have the memory
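To scale the table to other model sizes, a rough estimate is parameters times bits per weight, divided by 8. The sketch below reuses the approximate bits column from the table. It ignores metadata and mixed-precision tensors, so expect real files to differ by a few hundred MB.

```python
# Approximate bits per weight, copied from the table above
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.4,
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def estimate_file_gb(n_params: float, quant: str) -> float:
    """Rough file size in GB (10^9 bytes) for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"8B at {quant}: ~{estimate_file_gb(8e9, quant):.1f} GB")
```

The same function gives a first guess for a 30B-class model before you go looking for the actual file sizes on the download page.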

Decision list:

If you have ≤8 GB VRAM (or unified memory) → start with Q4_K_M on a 7-8B model.
If you have 12-16 GB → run Q5_K_M/Q6_K, or jump to a bigger model at Q4_K_M.
If you have 24 GB+ → Q8_0 on mid-size models, or a 30B-class model at Q4_K_M.
If output quality feels off → bump the quant up one tier before blaming the model.
If you're CPU-only → smaller quants run faster; see the CPU-only privacy tradeoff.
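The memory rules in that list are easy to turn into a helper if you're scripting downloads. This is a sketch only: the list leaves gaps between 8 and 12 GB and between 16 and 24 GB, and I've assumed each gap falls back to the tier below it.

```python
def suggest_quant(memory_gb: float) -> str:
    """Map available VRAM (or unified memory) to a starting suggestion."""
    if memory_gb >= 24:
        return "Q8_0 on a mid-size model, or a 30B-class model at Q4_K_M"
    if memory_gb >= 12:
        return "Q5_K_M or Q6_K on a 7-8B model, or a bigger model at Q4_K_M"
    # Assumption: anything under 12 GB starts at the default pick.
    return "Q4_K_M on a 7-8B model"


for gb in (8, 16, 24):
    print(f"{gb} GB -> {suggest_quant(gb)}")
```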

For the full memory math, I keep how much VRAM for Llama 3 8B and the VRAM requirements guide bookmarked.

How much VRAM does a GGUF model need?

The quick rule: VRAM needed ≈ file size on disk + context overhead. A 4.7 GB Q4_K_M file wants roughly 5-6 GB of memory to run comfortably at a normal context length, with more on top for big context windows and the KV cache.

A back-of-envelope formula I use:

VRAM ≈ (model file size) + (KV cache) + ~1 GB runtime overhead

The KV cache grows with context length and model size — at a long 32K context it can add several GB on its own. If a model doesn't fully fit, GGUF runners let you split layers between GPU and CPU (GPU offload), which I cover in GPU offload layers explained. Partial offload is slower but it's the difference between running a model and not running it at all.
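Here's the back-of-envelope formula as a sketch, with the KV cache term spelled out. It assumes a standard transformer that caches one key and one value vector per layer per token at 16-bit precision; the example shapes (32 layers, 8 KV heads, head size 128) are what I'd expect for a Llama-3-8B-class model, so check your model's metadata before trusting the numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: keys plus values for every layer and token."""
    # The leading 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def estimate_vram_gb(file_gb: float, kv_bytes: int, overhead_gb: float = 1.0) -> float:
    """VRAM ~ model file size + KV cache + runtime overhead, in GB."""
    return file_gb + kv_bytes / 1e9 + overhead_gb


kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache at 8K context: {kv / 1e9:.2f} GB")
print(f"Estimated total: {estimate_vram_gb(4.7, kv):.1f} GB")
```

With those assumed shapes the cache is 1 GiB at 8K context and 4 GiB at 32K, which is where the "several GB" at long context comes from. Runners that quantize the KV cache will come in lower.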

How do I run a GGUF file?

Any llama.cpp-based runner takes GGUF directly. Pick your tool:

Ollama — easiest path. Pull a prebuilt model:

ollama run qwen2.5:7b

Or import a .gguf you downloaded yourself with a tiny Modelfile:

printf 'FROM ./qwen2.5-7b-instruct-q4_k_m.gguf\n' > Modelfile
ollama create my-qwen -f Modelfile
ollama run my-qwen
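Once the model is created, you can also call it from code through Ollama's local HTTP API. This is a minimal sketch that assumes Ollama is running on its default port (11434) and that you kept the my-qwen name from above; it uses only the standard library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model: str, prompt: str,
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("my-qwen", "Explain GGUF in one sentence.") should return the reply as a string, provided the server is up and the model name exists.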

llama.cpp — the engine everything else is built on:

./llama-cli -m qwen2.5-7b-instruct-q4_k_m.gguf -p "Explain GGUF in one sentence." -ngl 99

That -ngl 99 asks llama.cpp to put up to 99 layers on the GPU, which covers every layer of most models. If the model doesn't fit in VRAM, lower the number for partial offload. The full llama.cpp complete guide walks through building it.
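For a first guess at an -ngl value on tight VRAM, you can divide the memory you have left by an average per-layer size. This sketch makes two loud assumptions: every layer is the same size (it ignores the embedding and output tensors), and reserve_gb is a made-up allowance for the KV cache and runtime overhead. Use the result as a starting point and adjust from there.

```python
import math


def layers_that_fit(file_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Rough count of layers to offload, assuming equal-sized layers."""
    per_layer_gb = file_gb / n_layers
    budget_gb = vram_gb - reserve_gb  # what is left for weights
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


ngl = layers_that_fit(file_gb=4.7, n_layers=32, vram_gb=4.0)
print(f"try -ngl {ngl}")
```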

LM Studio — GUI with a built-in model browser; search a model, click the GGUF quant you want, and it downloads and loads it. Walkthrough in LM Studio download models step by step.

Hugging Face → llama.cpp in one command:

./llama-cli -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M -p "hi"

Not sure which runner fits you? I broke it down in LM Studio vs Ollama vs llama.cpp.

GGUF vs other model formats: how does it compare?

Format	Best for	Quantized?	Runs on
GGUF	Local CPU/GPU inference	Yes (K-quants)	llama.cpp, Ollama, LM Studio, KoboldCpp
Safetensors	Full-precision weights, fine-tuning	No (FP16/BF16 typically)	Transformers, vLLM, training pipelines
GGML	Legacy llama.cpp (dead)	Yes	Old llama.cpp only
MLX	Apple Silicon native	Yes	MLX framework on Mac

The short version: safetensors is what you grab to fine-tune or run at full precision on a big GPU with vLLM. GGUF is what you grab to actually run a model on normal hardware with minimal fuss. On a Mac, MLX is a strong alternative that often squeezes out a bit more speed from Apple's GPU — I compare them in MLX on Apple Silicon. For most people on mixed hardware, GGUF is the universal answer.

Where do I download GGUF files safely?

Hugging Face is the main hub. Look for quantizers with a track record — bartowski, TheBloke (older), unsloth, and official model orgs all publish reliable GGUFs. A repo usually offers every quant tier; download just the one .gguf you need, not the whole folder.
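If you script your downloads, picking the one file you want out of a repo listing is a simple filter. The sketch below is a hypothetical helper that works on a plain list of filenames (the names in the example are illustrative, not a real listing) and refuses to guess when the match is ambiguous, for example with models split across several files.

```python
def pick_gguf(filenames: list[str], quant: str) -> str:
    """Return the single .gguf filename that contains the requested quant."""
    wanted = quant.lower()
    matches = [
        name for name in filenames
        if name.lower().endswith(".gguf") and wanted in name.lower()
    ]
    if len(matches) != 1:
        raise ValueError(f"expected one match for {quant!r}, found {len(matches)}")
    return matches[0]


# Illustrative listing only; real repos name their files differently.
listing = [
    "README.md",
    "example-7b-instruct-Q4_K_M.gguf",
    "example-7b-instruct-Q4_K_S.gguf",
    "example-7b-instruct-Q8_0.gguf",
]
print(pick_gguf(listing, "Q4_K_M"))
```

Pass the returned filename to whatever download tool you use, so you fetch one quant instead of the whole folder.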

Two things to check before you pull:

Match the quant to your hardware using the table above — don't grab Q8_0 for an 8 GB card.
Confirm the architecture is supported by your runner's llama.cpp version. Brand-new models sometimes need a runner update before they'll load.

If you want models managed for you instead of hand-picking files, Ollama's library and the pull-first open-weight model in 5 minutes walkthrough get you running without touching Hugging Face at all.

Bottom line

GGUF is the format that makes local LLMs practical: one self-describing file with weights, metadata, and tokenizer, quantized so it fits on hardware you already own. Grab a Q4_K_M quant of a 7-8B model, point Ollama or LM Studio at it, and you're running in minutes — bump to Q5_K_M or Q8_0 when you've got the memory and want more quality. For the bigger picture on getting open models running end to end, head back to the pillar: run open-weight models locally in 2026.

