Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
VRAM Requirements for Local LLMs
VRAM Requirements for Local LLMs is a cornerstone page for the WikiWayne local-AI cluster.
Key takeaways
- VRAM Requirements for Local LLMs is a cornerstone page for the WikiWayne local-AI cluster.
- Start with a small GGUF quant and verify VRAM on your own GPU before scaling model size.
- Use linked cluster posts for install steps and runner-specific commands.
10+ years in Digital Marketing & SEO
VRAM Requirements for Local LLMs
VRAM is the single hardware spec that decides which local LLMs you can run fast, because the model's weights have to physically fit in your GPU's memory to run at full speed. As a rough rule, you need roughly the model's parameter count in billions, times the bytes-per-weight of your quant, plus a couple of GB for context and overhead. An 8B model at Q4 wants about 5-6 GB of VRAM; a 70B at Q4 wants 40 GB or more. Below I'll show you the math so you can size any model on your own card instead of guessing.
What is VRAM and why does it matter for local LLMs?
VRAM (video RAM) is the dedicated memory on your GPU where model weights, the KV cache, and activations live during inference. When the whole model fits in VRAM, every layer runs on the GPU and you get interactive speed. When it doesn't fit, your runner spills some layers to system RAM and the CPU, and tokens-per-second falls off a cliff because the CPU is dramatically slower at this math and the PCIe bus becomes a bottleneck.
On Apple Silicon there's no separate "VRAM" number — the GPU shares unified memory with the system, so your usable budget is roughly total RAM minus what macOS and your apps are using. That's why a 64 GB M-series Mac can punch above a 16 GB discrete NVIDIA card for big models: the memory is one big shared pool.
How do I calculate VRAM requirements for a model?
The quick formula most practitioners use:
VRAM (GB) ≈ (parameters_in_billions × bytes_per_weight) + context_overhead
Bytes-per-weight depends on quantization. Quantization is compressing model weights from 16-bit floats down to fewer bits per number, trading a little accuracy for a lot less memory. Here's the cheat sheet:
| Quant | Approx bits/weight | Bytes/weight | Quality |
|---|---|---|---|
| FP16 / BF16 | 16 | ~2.0 | Reference (full) |
| Q8_0 | 8 | ~1.0 | Near-lossless |
| Q6_K | 6 | ~0.8 | Excellent |
| Q5_K_M | 5 | ~0.65 | Very good |
| Q4_K_M | ~4.5 | ~0.55 | The sweet spot |
| Q3_K_M | ~3.5 | ~0.45 | Noticeably softer |
So a 7B model at Q4_K_M is roughly 7 × 0.55 ≈ 3.9 GB of weights. Add 1-2 GB for the KV cache and runtime overhead at a normal context length, and you land around 5-6 GB in practice. That overhead grows with context: an 8K window is cheap, but push to 32K or 128K and the KV cache alone can eat several extra GB. If you want the full reasoning on quant levels, I broke it down in Quantization explained and the Q4 vs Q8 quality tradeoff.
One more reality check: the GGUF file size on disk is a great proxy for the weight portion of VRAM. GGUF is the single-file model format used by llama.cpp, Ollama, LM Studio, and KoboldCpp. If the file is 4.2 GB, budget roughly that plus context overhead. Treat all of these as ballpark numbers and verify on your own stack — your context length, batch size, and runner version all move the figure.
How much VRAM do I need per model size?
These are realistic bands for full-GPU loading at a modest context (4K-8K). Verify against the actual GGUF you download.
| Model size | Q4_K_M | Q8_0 | Comfortable GPU |
|---|---|---|---|
| 3B (Phi, small Qwen, Gemma) | ~2-3 GB | ~3-4 GB | 6 GB+ |
| 7-8B (Llama, Qwen, Mistral) | ~5-6 GB | ~9-10 GB | 8-12 GB |
| 12-14B (Phi, Qwen, Gemma) | ~8-10 GB | ~14-16 GB | 12-16 GB |
| 24-32B (Qwen, Gemma, Mistral Small) | ~18-22 GB | ~34-36 GB | 24 GB+ |
| 70B (Llama) | ~40-45 GB | ~75 GB | 48 GB+ or heavy offload |
| MoE 100B+ (DeepSeek, GLM, Qwen) | varies wildly | — | 64 GB unified / multi-GPU |
A few notes from running these myself. Mixture-of-experts models like DeepSeek and some GLM/Qwen variants have a huge total parameter count but only activate a fraction per token — they still need to fit (or mostly fit) in memory, so budget by file size, not by "active" params. And for a deeper single-model walkthrough, see How much VRAM for Llama 3 8B.
What happens if a model doesn't fit in VRAM?
It still runs — just slower. Runners split the model: some transformer layers stay on the GPU, the rest run on CPU using system RAM. This is GPU offload, and you control it with the number of layers sent to the GPU (the -ngl / n_gpu_layers setting). More layers on GPU means more speed, right up until you run out of VRAM and the driver either errors or silently pages, which tanks performance.
The performance hit isn't linear. Keeping 90% of layers on GPU might cost you 20-30% speed; dropping to 50/50 can cut throughput by half or worse. I dug into tuning this in GPU offload layers explained. If you have no GPU at all, smaller quants on CPU are usable for privacy work but slow — I covered the CPU-only privacy tradeoff separately.
How do I check VRAM usage on my own machine?
Don't trust my ballparks — measure. Pick the line for your hardware:
# NVIDIA: live VRAM usage, refresh every second
nvidia-smi -l 1
# AMD on Linux
rocm-smi --showmeminfo vram
# Apple Silicon: unified memory pressure
sudo powermetrics --samplers gpu_power -i 1000
# or just watch memory in Activity Monitor
To see exactly how a model loads, run it and read the log. With llama.cpp's server or CLI, the startup output tells you how many layers landed on GPU vs CPU:
# Try to put all layers on GPU; watch the offload report
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf -ngl 99 -c 8192
If you see something like offloaded 28/29 layers to GPU, you're nearly fully resident and golden. If it says offloaded 12/29, you're VRAM-bound and should drop to a smaller quant or shorter context. With Ollama, just pull and run, then check usage:
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "hello"
ollama ps # shows the model and how it's split CPU/GPU
That ollama ps line showing 100% GPU vs 48%/52% CPU/GPU is the fastest sanity check there is. New to the toolchain? Start with Install Ollama on Windows, Mac, Linux or compare the runners in LM Studio vs Ollama vs llama.cpp.
Which model should I run for my GPU? (decision list)
Match your VRAM to a model and quant, then size up only after confirming it fully loads:
- If you have 6-8 GB (e.g. RTX 3060, RTX 4060, 8 GB cards): run 7-8B at Q4_K_M, or 3B at Q8. This is the comfortable sweet spot for chat and coding assist.
- If you have 10-12 GB (RTX 3080, 4070): run 8B at Q6/Q8, or step up to 12-14B at Q4_K_M.
- If you have 16 GB (RTX 4060 Ti 16GB, 4070 Ti Super): 14B at Q6 sits nicely, or a tight 24B at Q4 with short context.
- If you have 24 GB (RTX 3090, 4090): 24-32B at Q4_K_M runs great; this is the best price/capability tier on used cards.
- If you have 32 GB+ unified or dual GPUs: 70B at Q4 with offload, or large MoE models.
- If you're on Apple Silicon: take your total RAM, subtract ~6-8 GB for the OS, and that's your model budget — and look at MLX, which is tuned for the hardware. See MLX on Apple Silicon.
Shopping for hardware around these numbers? I keep Best GPU for local AI 2026 and the best used GPU on a budget current. The 24 GB used 3090 remains the value king for a reason: it's the smallest card that runs the 24-32B class fully resident.
Does context length change my VRAM needs?
Yes, and people forget this. The KV cache stores attention keys and values for every token in your context window, and it scales with context length. A model that loads in 6 GB at 4K context might need 8-9 GB at 32K, and considerably more at 128K. If you're hitting out-of-memory errors only on long prompts, that's the KV cache, not the weights.
Two levers help: lower your context window (-c in llama.cpp, the context slider in LM Studio) to what you actually need, or enable KV cache quantization if your runner supports it, which shrinks the cache at a small quality cost. Don't allocate a 128K window you'll never use — it's free VRAM you're throwing away.
Bottom line
Size with the formula — params in billions times bytes-per-weight, plus a couple GB for context — then verify on your own card with nvidia-smi, ollama ps, or the llama.cpp offload log. Q4_K_M is the practical sweet spot: an 8B fits in 6 GB, a 32B in 24 GB, a 70B wants 40 GB+. When a model won't fit, drop the quant or trim context before you reach for CPU offload, because partial offload trades speed for size every time. Start small, confirm it loads fully, then scale up — that's the whole game.
Frequently asked questions
Yes. Cornerstone posts bump updatedAt when Ollama, LM Studio, or llama.cpp ship breaking changes; see the refresh log in Content Ideas.
A GPU helps for 7B+ models at interactive speed. CPU-only inference is supported for privacy experiments with smaller quants.
Affiliate Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
