As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.

GPU Offload Layers Explained for Local LLMs

Published: June 13, 2026

What `-ngl` / GPU layer sliders actually do.

Wayne Lowry

10+ years in Digital Marketing & SEO

GPU offload layers control how much of a model lives in your GPU's VRAM versus your system RAM. The -ngl flag in llama.cpp (and the "GPU Offload" slider in LM Studio) tells the runner how many of the model's transformer layers to load onto the GPU; everything else runs on the CPU. Set it to cover every layer and, if the whole model fits in VRAM, you get full-speed inference. Set it too high for your card and you get an out-of-memory crash. Get it right and a model you thought was too big runs fine.

This is the single most impactful knob for local LLM speed, and almost nobody explains what it actually does. Let me fix that. For the broader picture on sizing models to your hardware, this article sits under the VRAM requirements for local LLMs guide.

What does -ngl actually mean?

-ngl is short for --n-gpu-layers: the number of model layers offloaded to the GPU. A transformer model like Llama 3.1 8B or Qwen2.5 14B is built from a stack of repeated layers (32, 40, 48, whatever the architecture says). Each layer is a chunk of weights. When you load a GGUF file, the runner decides where each chunk goes:

  • On the GPU (fast, but limited by VRAM)
  • On the CPU/system RAM (slow, but you usually have a lot more of it)

-ngl 0 means "everything on CPU." -ngl 999 (or any number bigger than the layer count) means "put every layer on the GPU." Anything in between is a split — some layers on GPU, the rest on CPU. That split is the whole game.

One detail people miss: the total that -ngl is measured against is the model's repeating blocks plus, in modern llama.cpp, one extra for the output layer. So when the log reports "all 33 layers" offloaded for a 32-block model, that extra one is the output head. Don't overthink it: pass a number equal to or higher than the total and the runner tries to put every layer on the GPU.
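To make the split concrete, here is a minimal sketch of how a given -ngl value maps to GPU and CPU layers. The split_layers helper is hypothetical (it is not part of llama.cpp), and it assumes the total is the block count plus one output layer, as in the 33-of-32 example above.

```shell
# split_layers BLOCKS NGL
# Hypothetical helper: prints "<gpu_layers> <cpu_layers>" for a model with
# BLOCKS repeating blocks, assuming one extra offloadable output layer.
split_layers() {
  sl_total=$(( $1 + 1 ))   # repeating blocks + output layer
  sl_gpu=$2
  # Any value above the total just means "everything on the GPU"
  if [ "$sl_gpu" -gt "$sl_total" ]; then
    sl_gpu=$sl_total
  fi
  echo "$sl_gpu $(( sl_total - sl_gpu ))"
}

split_layers 32 0     # 0 33  (everything on CPU)
split_layers 32 20    # 20 13 (partial split)
split_layers 32 999   # 33 0  (full offload)
```

The clamp is the reason -ngl 999 is a safe shorthand: you never need to know the exact layer count to ask for full offload.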

Why do GPU offload layers matter so much for speed?

Because GPU memory bandwidth is roughly an order of magnitude higher than system RAM bandwidth, and LLM inference is bandwidth-bound. Every token you generate requires reading all of the active model weights. If those weights live in VRAM, the GPU rips through them. If even a few layers live in system RAM, every token has to wait on the slow path for those layers.

The brutal part: the slowest component dominates. If 90% of your model is on the GPU and 10% is on the CPU, you don't get 90% of GPU speed — you get something much worse, because each token still stalls on the CPU layers and the PCIe shuffle between them. Partial offload is always slower than it "feels like it should be."

That's why the goal is almost always: fit the entire model in VRAM. Full offload is the cliff edge between "fast and usable" and "technically works but painful."
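A back-of-the-envelope sketch shows why the slow path dominates. It assumes generation is purely bandwidth-bound, so time per token is the GPU-resident bytes divided by GPU bandwidth plus the CPU-resident bytes divided by RAM bandwidth. The est_tps helper and the 500 GB/s and 50 GB/s figures are made-up round numbers for illustration, not measurements of any real card, and the result is an optimistic ceiling because it ignores compute and transfer overhead.

```shell
# est_tps MODEL_GB GPU_FRACTION GPU_GBPS CPU_GBPS
# Hypothetical estimator: tokens/sec if every token reads all weights once.
est_tps() {
  awk -v size="$1" -v frac="$2" -v gbw="$3" -v cbw="$4" 'BEGIN {
    # seconds per token = time on the fast path + time on the slow path
    t = (size * frac) / gbw + (size * (1 - frac)) / cbw
    printf "%.1f\n", 1 / t
  }'
}

# 9 GB of weights, assumed 500 GB/s VRAM and 50 GB/s system RAM
est_tps 9 1.0 500 50   # 55.6  (full offload)
est_tps 9 0.9 500 50   # 29.2  (90% on GPU)
est_tps 9 0.5 500 50   # 10.1  (half on GPU)
est_tps 9 0   500 50   # 5.6   (CPU only)
```

Under these assumptions, leaving just 10% of the weights in system RAM costs almost half the throughput, which matches the "much worse than 90%" experience.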

How do I know how many layers to offload?

Start by checking how many layers your model has, then do the VRAM math. Here's the practical loop I use.

Load with full offload first and watch the logs:

# llama.cpp: try to offload everything
./llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf \
  -ngl 999 -c 4096 -p "Hello"

llama.cpp prints lines like offloaded 49/49 layers to GPU along with memory figures (per-device buffer sizes in recent builds, a VRAM used line in older ones). If it loads without an out-of-memory error, you're done: full offload, full speed.
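If you'd rather not eyeball the log, a small filter can pull the numbers out. This is a sketch: parse_offload is a hypothetical helper, and it assumes the log contains the "offloaded N/M layers to GPU" wording quoted above. The prefix before that phrase varies between builds, so the pattern ignores it.

```shell
# parse_offload: reads a llama.cpp load log on stdin and prints
# "<offloaded> <total>" from the last matching line.
parse_offload() {
  sed -n 's|.*offloaded \([0-9][0-9]*\)/\([0-9][0-9]*\) layers to GPU.*|\1 \2|p' | tail -n 1
}

# Sample line (the "load_tensors:" prefix is an assumption about current builds)
printf '%s\n' 'load_tensors: offloaded 40/49 layers to GPU' | parse_offload   # 40 49

# With a real run, the log goes to stderr, so merge it into the pipe:
# ./llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf -ngl 999 -c 4096 -p "Hello" 2>&1 | parse_offload
```

If the two numbers match, every layer landed on the GPU.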

If it crashes with CUDA out of memory (or Metal/ROCm equivalents), back off. Find the layer count in the load logs (e.g. n_layer = 48) and step down:

# Offload most layers, keep a few on CPU
./llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf \
  -ngl 40 -c 4096 -p "Hello"

Lower -ngl until it loads cleanly. Each layer you pull off the GPU frees a bit of VRAM but costs speed, so use the highest number that fits.
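The step-down loop can be scripted. The sketch below assumes a failed load makes the runner exit with a non-zero status, which is worth confirming on your build. find_max_ngl and try_load are hypothetical helper names; the -n 1 and stdin redirect are only there to make each attempt exit quickly.

```shell
# find_max_ngl START STEP LOADER
# LOADER is a command or function that takes one argument (the -ngl value)
# and exits 0 if the model loads. Prints the first value that works,
# stepping down from START, or 0 if nothing fits.
find_max_ngl() {
  fm_ngl=$1
  fm_step=$2
  fm_loader=$3
  while [ "$fm_ngl" -gt 0 ]; do
    if "$fm_loader" "$fm_ngl"; then
      echo "$fm_ngl"
      return 0
    fi
    fm_ngl=$(( fm_ngl - fm_step ))
  done
  echo 0
}

# Example loader wrapping the llama-cli command above (assumed to exit
# non-zero on out-of-memory; generates a single token and discards output)
try_load() {
  ./llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf \
    -ngl "$1" -c 4096 -n 1 -p "Hello" </dev/null >/dev/null 2>&1
}

# Usage: start at the full layer count and step down 4 at a time
# find_max_ngl 49 4 try_load
```

A coarse step gets you close fast. Once it finds a value that loads, you can try the few values just above it by hand to claw back the last layer or two.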

In Ollama, you usually don't touch this — it auto-detects and offloads as many layers as fit. If you want to force it, set num_gpu in the modelfile or API:

ollama run qwen2.5:14b --verbose
# Or pin layers via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "Hello",
  "options": { "num_gpu": 40 }
}'
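For the modelfile route, a sketch looks like this. The variant name qwen2.5-14b-gpu40 is a made-up example, and the value 40 just mirrors the API call above; pick whatever fits your card.

```shell
# Write a Modelfile that pins the number of GPU layers
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_gpu 40
EOF

# Build a named variant if the ollama CLI is installed
# (qwen2.5-14b-gpu40 is a hypothetical name)
if command -v ollama >/dev/null 2>&1; then
  ollama create qwen2.5-14b-gpu40 -f Modelfile
fi
```

After that, ollama run qwen2.5-14b-gpu40 uses the pinned layer count every time, without passing options on each request.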

In LM Studio, it's the "GPU Offload" slider on the model load screen — drag it to max and watch the estimated VRAM. Same concept, friendlier UI.

How do I estimate VRAM before I even load the model?

Rough rule: VRAM needed ≈ (model file size on disk) + (KV cache for your context) + a little overhead. The quant determines the file size, so it dominates.

| Model size | Quant | Approx. file size | Fits in... |
| --- | --- | --- | --- |
| 8B | Q4_K_M | ~4.5–5 GB | 8 GB GPU (full offload, short context) |
| 8B | Q8_0 | ~8–8.5 GB | 12 GB GPU comfortably |
| 14B | Q4_K_M | ~8.5–9 GB | 12 GB GPU (full offload) |
| 14B | Q8_0 | ~15–16 GB | 16–24 GB GPU |
| 32B | Q4_K_M | ~19–20 GB | 24 GB GPU (tight, short context) |
| 70B | Q4_K_M | ~40–42 GB | 48 GB+, or split across cards |

Treat these as ballpark figures: exact sizes vary by architecture, vocabulary size, and quant variant, so verify against the actual GGUF size on your disk. The KV cache grows with context length; a long 16K–32K context can add several GB on top of the weights, which is often what tips a "just barely fits" setup into a crash. If you bump -c (context) and suddenly hit OOM, that's your KV cache eating the headroom. For the full breakdown, see how much VRAM for Llama 3 8B and the deeper VRAM requirements guide.
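The rough rule can be turned into arithmetic. This sketch uses the standard KV cache size formula (keys plus values, per layer, per KV head, per position) and assumes an unquantized 16-bit cache at 2 bytes per element. The helper names and the 500 MiB overhead are illustrative assumptions; the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) comes from the published model config.

```shell
# kv_cache_mib N_LAYERS N_KV_HEADS HEAD_DIM CONTEXT BYTES_PER_ELEM
# Size in MiB: 2 (K and V) * layers * kv_heads * head_dim * context * bytes
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# est_vram_mib FILE_MIB KV_MIB OVERHEAD_MIB
# Weights + KV cache + a flat overhead allowance
est_vram_mib() {
  echo $(( $1 + $2 + $3 ))
}

# Llama 3.1 8B with a 16-bit KV cache
kv_cache_mib 32 8 128 8192 2     # 1024 (1 GiB at 8K context)
kv_cache_mib 32 8 128 32768 2    # 4096 (4 GiB at 32K context)

# ~4700 MiB Q4_K_M file + 8K cache + assumed 500 MiB overhead
est_vram_mib 4700 1024 500       # 6224
```

The jump from 1 GiB to 4 GiB between 8K and 32K context is exactly the kind of headroom loss that turns a clean load into an OOM on an 8 GB card.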

What if the model doesn't fully fit in VRAM?

You've got four moves, roughly in order of preference:

  • If you're one quant level too big, drop the quant. Going from Q8 to Q4_K_M nearly halves the file size with modest quality loss. This is usually the best trade — see Q4 vs Q8 quant quality tradeoffs.
  • If a smaller model would do the job, use it. A fully-offloaded 8B beats a half-offloaded 14B on speed almost every time.
  • If you can shorten context, do it. Dropping -c from 16K to 4K frees KV-cache VRAM and may let all layers fit.
  • If none of that works, accept a partial split. Offload as many layers as fit and let the rest run on CPU. It'll be slow but functional — fine for batch jobs, painful for chat.

A decision shortcut I tell people:

  • If the model fits in VRAM → -ngl 999, full offload, done.
  • If it's 10–20% too big → drop one quant level, then full offload.
  • If it's way too big → smaller model or partial offload and lower expectations.
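The shortcut above can be written as a tiny function. This is one possible reading, not a rule from any tool: offload_advice is a hypothetical helper, and it treats anything up to 20% over your VRAM as the "drop one quant level" case.

```shell
# offload_advice NEED_MIB VRAM_MIB
# NEED_MIB is the estimated total (weights + KV cache + overhead).
offload_advice() {
  oa_need=$1
  oa_vram=$2
  if [ "$oa_need" -le "$oa_vram" ]; then
    echo "full offload: -ngl 999"
  elif [ $(( oa_need * 100 )) -le $(( oa_vram * 120 )) ]; then
    # up to 20% too big
    echo "drop one quant level, then full offload"
  else
    echo "smaller model or partial offload"
  fi
}

offload_advice 6224 8192     # full offload: -ngl 999
offload_advice 9000 8192     # drop one quant level, then full offload
offload_advice 20000 8192    # smaller model or partial offload
```

The thresholds are judgment calls, so adjust the 120 if your experience with a given model family says otherwise.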

Is GPU offload different on Apple Silicon?

Yes, in a good way. Macs use unified memory: the CPU and GPU share the same RAM pool, so there's no separate VRAM to fill. On a Mac, -ngl 999 offloads to the GPU, but the "VRAM" is just your system RAM (minus the share the OS keeps for itself). A 32GB M-series machine can run models a 16GB discrete GPU can't, because the GPU can address most of that shared pool instead of a fixed 16GB.

If you're on Apple Silicon, you generally still pass full offload in llama.cpp, but the native fast path is MLX, which is built for unified memory from the ground up. See MLX on Apple Silicon for local Llama. The offload-layer mental model still applies; you just rarely have to ration it the way NVIDIA owners do.

Quick reference: offload settings by tool

| Tool | Where you set it | Full-offload value |
| --- | --- | --- |
| llama.cpp | -ngl / --n-gpu-layers flag | -ngl 999 |
| Ollama | num_gpu option (auto by default) | leave default, or "num_gpu": 999 |
| LM Studio | "GPU Offload" slider | drag to max |
| koboldcpp | --gpulayers flag | --gpulayers 999 |

If you're deciding which of these runners to standardize on, I compared them in LM Studio vs Ollama vs llama.cpp.

Bottom line

GPU offload layers are just a dial that splits the model between fast VRAM and slow system RAM. Aim for full offload, with the whole model in VRAM, and you get the speed your card is capable of. When it won't fit, drop a quant level or step down to a smaller model before you settle for a partial split, because a half-offloaded model gives up most of its speed. Load with -ngl 999, read the logs, and back off only if it crashes. Verify the real file sizes and VRAM use on your own stack, since every model and context length shifts the numbers a bit.
