GPU Offload Layers Explained for Local LLMs | WikiWayne

GPU offload layers control how much of a model lives in your GPU's VRAM versus your system RAM. The -ngl flag in llama.cpp (and the "GPU Offload" slider in LM Studio) tells the runner how many of the model's transformer layers to load onto the GPU; everything else runs on the CPU. Set it high enough to cover every layer and, if the whole model fits in VRAM, you get full-speed inference. Set it too high for your card and you get an out-of-memory crash. Get it right and a model you thought was too big runs fine.

This is the single most impactful knob for local LLM speed, and almost nobody explains what it actually does. Let me fix that. For the broader picture on sizing models to your hardware, this article sits under the VRAM requirements for local LLMs guide.

What does `-ngl` actually mean?

-ngl is short for --n-gpu-layers: the number of model layers offloaded to the GPU. A transformer model like Llama 3.1 8B or Qwen2.5 14B is built from a stack of repeated layers (32, 40, 48, whatever the architecture says). Each layer is a chunk of weights. When you load a GGUF file, the runner decides where each chunk goes:

On the GPU (fast, but limited by VRAM)
On the CPU/system RAM (slow, but you usually have a lot more of it)

-ngl 0 means "everything on CPU." -ngl 999 (or any number bigger than the layer count) means "put every layer on the GPU." Anything in between is a split — some layers on GPU, the rest on CPU. That split is the whole game.

One detail people miss: the layer count in -ngl covers the model's repeating blocks plus, in modern llama.cpp, the output layer. So when you offload "all 33 layers" of a 32-block model, that extra one is the output head. Don't overthink it: just pass a number equal to or higher than the total and the runner offloads every layer.
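To make the counting concrete, here is a minimal sketch of how an -ngl value maps to a GPU/CPU split. It assumes, as described above, that the total is the number of repeating blocks plus one output layer. `split_layers` is a hypothetical helper for illustration, not llama.cpp's actual code.

```python
def split_layers(n_blocks: int, ngl: int) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu) for a given -ngl value.

    Assumption: total offloadable layers = repeating blocks + 1 output layer.
    """
    total = n_blocks + 1
    # Clamp: anything above the total just means "every layer".
    on_gpu = max(0, min(ngl, total))
    return on_gpu, total - on_gpu


print(split_layers(32, 0))    # everything on CPU
print(split_layers(32, 999))  # everything on GPU
print(split_layers(32, 20))   # partial split
```

The clamp is why a value like 999 is safe to pass: it can never request more layers than the model has.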

Why do GPU offload layers matter so much for speed?

Because GPU memory bandwidth is roughly an order of magnitude higher than system RAM bandwidth, and LLM token generation is bandwidth-bound. Every token you generate requires reading all of the model's active weights. If those weights live in VRAM, the GPU rips through them. If even a few layers live in system RAM, every token has to wait on the slow path for those layers.

The brutal part: the slowest component dominates. If 90% of your model is on the GPU and 10% is on the CPU, you don't get 90% of GPU speed — you get something much worse, because each token still stalls on the CPU layers and the PCIe shuffle between them. Partial offload is always slower than it "feels like it should be."
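A back-of-the-envelope model shows why the split hurts more than the percentages suggest. This is a simplified sketch, not a benchmark: it assumes each token streams every weight exactly once, and it ignores compute time, PCIe transfers, and KV-cache reads. The bandwidth figures (500 GB/s for VRAM, 50 GB/s for system RAM) and the 9 GB model size are illustrative assumptions, and `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(model_gb: float, gpu_fraction: float,
                      gpu_bw_gbps: float, cpu_bw_gbps: float) -> float:
    """Upper-bound decode speed when weights are split between GPU and CPU.

    Time per token is the time to stream the GPU-resident weights plus the
    time to stream the CPU-resident weights, since layers run one after another.
    """
    gpu_gb = model_gb * gpu_fraction
    cpu_gb = model_gb - gpu_gb
    seconds_per_token = gpu_gb / gpu_bw_gbps + cpu_gb / cpu_bw_gbps
    return 1.0 / seconds_per_token


# Illustrative numbers only: 9 GB of weights, 500 GB/s VRAM, 50 GB/s system RAM.
full = tokens_per_second(9.0, 1.0, 500.0, 50.0)
split = tokens_per_second(9.0, 0.9, 500.0, 50.0)
print(f"full offload: {full:.1f} tok/s")
print(f"90/10 split:  {split:.1f} tok/s ({split / full:.0%} of full)")
```

Under these assumptions, leaving 10% of the weights in system RAM costs roughly half the throughput, because that 10% takes longer to read than the other 90% combined.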

That's why the goal is almost always: fit the entire model in VRAM. Full offload is the cliff edge between "fast and usable" and "technically works but painful."

How do I know how many layers to offload?

Start by checking how many layers your model has, then do the VRAM math. Here's the practical loop I use.

Load with full offload first and watch the logs:

# llama.cpp: try to offload everything
./llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf \
  -ngl 999 -c 4096 -p "Hello"

llama.cpp prints lines like offloaded 49/49 layers to GPU, along with buffer-size lines showing how much memory landed on the GPU. If it loads without an out-of-memory error, you're done: full offload, full speed.

If it crashes with CUDA out of memory (or Metal/ROCm equivalents), back off. Find the layer count in the load logs (e.g. n_layer = 48) and step down:

# Offload most layers, keep a few on CPU
./llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf \
  -ngl 40 -c 4096 -p "Hello"

Lower -ngl until it loads cleanly. Each layer you pull off the GPU frees a bit of VRAM but costs speed, so use the highest number that fits.
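If you would rather start from an educated guess than step down blindly, the arithmetic can be sketched like this. It assumes every layer costs the same amount of memory, that the KV cache is spread evenly across layers, and that a flat overhead covers compute buffers and the runtime. `max_gpu_layers` and its default overhead of 1 GB are hypothetical, so treat the result as a first value to try, then confirm it by loading the model.

```python
import math


def max_gpu_layers(weights_gb: float, kv_cache_gb: float, n_layers: int,
                   free_vram_gb: float, overhead_gb: float = 1.0) -> int:
    """Estimate the highest -ngl value likely to fit in free VRAM.

    Assumptions: uniform per-layer cost, KV cache split evenly across
    layers, and a fixed overhead reserved before any layer is placed.
    """
    per_layer_gb = (weights_gb + kv_cache_gb) / n_layers
    budget_gb = free_vram_gb - overhead_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# Illustrative: ~9 GB of weights, ~0.75 GB of KV cache, 49 layers.
print(max_gpu_layers(9.0, 0.75, 49, free_vram_gb=8.0))
print(max_gpu_layers(9.0, 0.75, 49, free_vram_gb=12.0))
```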

In Ollama, you usually don't touch this — it auto-detects and offloads as many layers as fit. If you want to force it, set num_gpu in the modelfile or API:

ollama run qwen2.5:14b --verbose
# Or pin layers via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "Hello",
  "options": { "num_gpu": 40 }
}'
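The same API call can be made from a script. Here is a minimal sketch using only the Python standard library. It reuses the endpoint and the num_gpu option from the curl example above; the helper names are hypothetical, and `"stream": false` is added so the server returns a single JSON object rather than a stream of chunks.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str, num_gpu: int) -> urllib.request.Request:
    """Build a POST to Ollama's /api/generate that pins the GPU layer count."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of streamed chunks
        "options": {"num_gpu": num_gpu},
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_gpu: int) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt, num_gpu)) as resp:
        return json.loads(resp.read())["response"]


# Example call, left commented out because it needs a local Ollama server:
# print(generate("qwen2.5:14b", "Hello", 40))
```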

In LM Studio, it's the "GPU Offload" slider on the model load screen — drag it to max and watch the estimated VRAM. Same concept, friendlier UI.

How do I estimate VRAM before I even load the model?

Rough rule: VRAM needed ≈ (model file size on disk) + (KV cache for your context) + a little overhead. The weights are usually the largest term, and the quant sets their size.

Model size	Quant	Approx. file size	Fits in...
8B	Q4_K_M	~4.5–5 GB	8 GB GPU (full offload, short context)
8B	Q8_0	~8–8.5 GB	12 GB GPU comfortably
14B	Q4_K_M	~8.5–9 GB	12 GB GPU (full offload)
14B	Q8_0	~15–16 GB	16–24 GB GPU
32B	Q4_K_M	~19–20 GB	24 GB GPU (tight, short context)
70B	Q4_K_M	~40–42 GB	48 GB+, or split across cards

Treat these as ballpark — exact sizes vary by tokenizer, vocab, and quant variant, so verify against the actual GGUF size on your disk. The KV cache grows with context length; a long 16K–32K context can add several GB on top of the weights, which is often what tips a "just barely fits" setup into a crash. If you bump -c (context) and suddenly hit OOM, that's your KV cache eating the headroom. For the full breakdown, see how much VRAM for Llama 3 8B and the deeper VRAM requirements guide.
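To see how quickly the KV cache grows, here is a rough sketch of the standard sizing arithmetic: two tensors (K and V) per layer, each storing n_kv_heads × head_dim values per token. It assumes an unquantized fp16 cache (2 bytes per value) and a single sequence. The 4.6 GiB file size and 1 GiB overhead are illustrative assumptions, the helper names are hypothetical, and results are in GiB, which is close enough to the GB figures above for a ballpark.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GiB for one sequence.

    bytes_per_value=2 assumes an fp16 cache; a quantized cache is smaller.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 1024**3


def estimate_vram_gib(file_size_gib: float, kv_gib: float,
                      overhead_gib: float = 1.0) -> float:
    """Rough rule from the text: weights + KV cache + a little overhead."""
    return file_size_gib + kv_gib + overhead_gib


# Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
for ctx in (4096, 16384, 32768):
    kv = kv_cache_gib(32, 8, 128, ctx)
    total = estimate_vram_gib(4.6, kv)
    print(f"ctx={ctx:>6}: KV ~ {kv:.1f} GiB, total ~ {total:.1f} GiB")
```

With this shape the cache costs 128 KiB per token, so going from a 4K to a 32K context takes it from 0.5 GiB to 4 GiB, which is the kind of jump that pushes a Q4_K_M 8B past an 8 GB card.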

What if the model doesn't fully fit in VRAM?

You've got four moves, roughly in order of preference:

If you're one quant level too big, drop the quant. Going from Q8 to Q4_K_M nearly halves the file size with modest quality loss. This is usually the best trade — see Q4 vs Q8 quant quality tradeoffs.
If a smaller model would do the job, use it. A fully-offloaded 8B beats a half-offloaded 14B on speed almost every time.
If you can shorten context, do it. Dropping -c from 16K to 4K frees KV-cache VRAM and may let all layers fit.
If none of that works, accept a partial split. Offload as many layers as fit and let the rest run on CPU. It'll be slow but functional — fine for batch jobs, painful for chat.

A decision shortcut I tell people:

If model fits in VRAM → -ngl 999, full offload, done.
If it's 10–20% too big → drop one quant level, then full offload.
If it's way too big → smaller model or partial offload and lower expectations.
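The shortcut is simple enough to write down as a function. This is a sketch under one stated assumption: "10–20% too big" is measured against the card's VRAM, and anything up to 20% over budget is treated as one quant level away. `offload_advice` is a hypothetical helper, and `required_gb` is your own estimate of weights plus KV cache plus overhead.

```python
def offload_advice(required_gb: float, vram_gb: float) -> str:
    """Map the decision shortcut to code.

    Assumption: overshoot is measured relative to available VRAM, and
    anything up to 20% over is treated as "drop one quant level".
    """
    if required_gb <= vram_gb:
        return "full offload: -ngl 999"
    overshoot = (required_gb - vram_gb) / vram_gb
    if overshoot <= 0.20:
        return "drop one quant level, then full offload"
    return "smaller model, or partial offload with lower expectations"


print(offload_advice(10.0, 12.0))
print(offload_advice(13.8, 12.0))
print(offload_advice(20.0, 12.0))
```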

Is GPU offload different on Apple Silicon?

Yes, in a good way. Macs use unified memory, where the CPU and GPU share the same RAM pool, so there's no separate VRAM to fill. On a Mac, -ngl 999 offloads to the GPU, but the "VRAM" is just your system RAM, minus a share the OS keeps for itself. A 32GB M-series machine can run models a 16GB discrete GPU can't, because the GPU can address most of that shared pool and nothing has to be copied across PCIe.

If you're on Apple Silicon, you generally still pass full offload in llama.cpp, but the native fast path is MLX, which is built for unified memory from the ground up. See MLX on Apple Silicon for local Llama. The offload-layer mental model still applies; you just rarely have to ration it the way NVIDIA owners do.

Quick reference: offload settings by tool

Tool	Where you set it	Full-offload value
llama.cpp	`-ngl` / `--n-gpu-layers` flag	`-ngl 999`
Ollama	`num_gpu` option (auto by default)	leave default, or `"num_gpu": 999`
LM Studio	"GPU Offload" slider	drag to max
koboldcpp	`--gpulayers` flag	`--gpulayers 999`

If you're deciding which of these runners to standardize on, I compared them in LM Studio vs Ollama vs llama.cpp.

Bottom line

GPU offload layers are just a dial that splits the model between fast VRAM and slow system RAM. Aim to fit the whole model in VRAM — full offload — and you get the speed your card is capable of. When it won't fit, drop a quant level or step down to a smaller model before you settle for a partial split, because a half-offloaded model gives up most of its speed for the convenience. Load with -ngl 999, read the logs, and back off only if it crashes. Verify the real file sizes and VRAM use on your own stack, since every model and context length shifts the numbers a bit.

