WikiWayne

As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.

llama.cpp CUDA Build Quickstart on Linux

Published: June 13, 2026

Compile with GPU backends for NVIDIA cards.

Key takeaways

  • Compile with GPU backends for NVIDIA cards.
  • Parent pillar: /blog/llama-cpp-complete-guide

Part of

llama.cpp Complete Guide

Cornerstone guide in the WikiWayne local-AI cluster.

8 min read
local-ai, cluster
Wayne Lowry

10+ years in Digital Marketing & SEO

To build llama.cpp with CUDA on Linux, install the NVIDIA driver and CUDA Toolkit, then configure the project with cmake -B build -DGGML_CUDA=ON and compile with cmake --build build --config Release -j. That single flag is the whole trick — GGML_CUDA=ON swaps in the CUDA backend so your NVIDIA GPU does the matrix math instead of your CPU. Below is the copy-paste path I actually use on a fresh Ubuntu box, plus the gotchas that eat an afternoon if you skip them.

What does "building llama.cpp with CUDA" actually mean?

llama.cpp is the open-source C/C++ inference engine that runs GGUF models on almost any hardware. By default it compiles CPU-only. Building it "with CUDA" means compiling the GGML CUDA backend so the engine offloads transformer layers onto an NVIDIA GPU — which is the difference between single-digit tokens/sec on CPU and a genuinely usable chat experience.

If you just want to run models and never touch a compiler, honestly use Ollama or LM Studio — they ship prebuilt CUDA binaries. See my LM Studio vs Ollama vs llama.cpp comparison for when each makes sense. You build from source when you want the bleeding-edge commit, a custom GPU arch, or the raw llama-server / llama-cli binaries with no wrapper.

What do I need before I start?

You need three things lined up, in this order:

  • A working NVIDIA driver — nvidia-smi must print your GPU and a driver version.
  • The CUDA Toolkit (nvcc, the compiler) — the driver alone is not enough to build.
  • Build tools — git, cmake (3.18+), and a C/C++ compiler (build-essential).

Quick term check: the driver lets your OS talk to the GPU; the CUDA Toolkit gives you nvcc, the compiler that turns GGML's CUDA kernels into GPU code. You can run CUDA apps with just the driver, but you cannot compile them.

# Confirm the driver sees your card
nvidia-smi

# Confirm the compiler exists (this is the one people forget)
nvcc --version

If nvidia-smi works but nvcc says "command not found," you have the driver but not the toolkit. That's the single most common reason a CUDA build fails.
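
The checklist above can be rolled into one preflight pass. This is a minimal sketch, not part of llama.cpp: missing_tools is a hypothetical helper name, and gcc/g++ stand in for whatever C/C++ compiler build-essential gives you.

```shell
# Preflight sketch: report which build prerequisites are missing from PATH.
# missing_tools is a hypothetical helper name, not a llama.cpp command.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools nvidia-smi nvcc git cmake gcc g++)
if [ -z "$missing" ]; then
  echo "All prerequisites found."
else
  echo "Missing from PATH:"
  echo "$missing"
fi
```

If nvcc is the only name it prints, you are in the driver-without-toolkit situation described above.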

How do I install the CUDA Toolkit on Ubuntu/Debian?

Grab build tools and the toolkit. The distro package is the least painful route for most people:

sudo apt update
sudo apt install -y build-essential git cmake
sudo apt install -y nvidia-cuda-toolkit

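If apt ships an older CUDA than you need (for example, too old for a brand-new card), install the toolkit from NVIDIA's official repo instead (search "CUDA Toolkit downloads" for your exact distro). The distro package normally puts nvcc in /usr/bin, so it is already on your PATH. NVIDIA's packages install under /usr/local/cuda, so in that case add it to your PATH yourself: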

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version

How do I build llama.cpp with CUDA? (the actual commands)

Clone, configure with the CUDA flag, build. This is the whole quickstart:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure the CUDA build
cmake -B build -DGGML_CUDA=ON

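# Compile. -j "$(nproc)" runs one job per core; with Makefiles a bare -j
# puts no limit on parallel jobs and can exhaust RAM. This takes a few minutes.
cmake --build build --config Release -j "$(nproc)"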

When it finishes, your binaries land in build/bin/. The two you care about are llama-cli (one-shot / interactive prompts) and llama-server (an OpenAI-compatible HTTP server).
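
A quick way to confirm the build produced what you expect is to check for the two binaries before doing anything else. This is a rough sketch: missing_binaries is a hypothetical helper, and the --list-devices flag is an assumption about your checkout (older builds may not have it).

```shell
# Sketch: list expected binaries that are absent from a build output directory.
# missing_binaries is a hypothetical helper; build/bin is the path named above.
missing_binaries() {
  dir=$1
  shift
  for name in "$@"; do
    [ -x "$dir/$name" ] || printf '%s\n' "$name"
  done
}

# Prints nothing when both binaries are present.
missing_binaries build/bin llama-cli llama-server

# Assumption: your checkout supports --list-devices, which should print the
# devices the compiled backends can see.
if [ -x build/bin/llama-cli ]; then
  ./build/bin/llama-cli --list-devices
fi
```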

Note: the flag changed names a while back. Old guides say LLAMA_CUBLAS=1 and old Makefile builds used make LLAMA_CUDA=1. The current CMake flag is -DGGML_CUDA=ON. If a tutorial uses the old name, it predates the rename.

Should I set the GPU architecture flag?

Optional but worth it. By default the build targets a broad set of architectures, which is slower to compile. If you build only for your card, compilation is faster and the binary is leaner:

# Example: RTX 30-series = 86, RTX 40-series = 89, RTX 50-series = 120
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89

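Look up your card's compute capability if you're unsure. A binary built only for the wrong arch usually fails at run time with a CUDA error such as "no kernel image is available for execution on the device" instead of falling back gracefully.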
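
If you would rather not look the number up by hand, the driver can usually report it. A minimal sketch, assuming your nvidia-smi is recent enough to support the compute_cap query field; cap_to_arch is a hypothetical helper that just strips the dot.

```shell
# Sketch: turn a compute capability such as "8.9" into the CMake value "89".
# cap_to_arch is a hypothetical helper name.
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '. '
}

# Assumption: this driver's nvidia-smi supports the compute_cap query field.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
  # Print the configure command for the first GPU instead of running it.
  echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
fi
```

It only looks at the first GPU, so on a mixed-card box check each card's value yourself.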

How do I confirm the GPU is actually being used?

Pull a small GGUF first, then watch nvidia-smi while it runs. Start small — a 7B/8B Q4_K_M quant fits comfortably on 8GB+ cards:

# Run a model with all layers on the GPU
./build/bin/llama-cli \
  -hf bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q4_K_M \
  -ngl 99 -p "Explain GGUF in one sentence."

The flag that matters is -ngl (n-gpu-layers): how many transformer layers to offload to the GPU. -ngl 99 means "all of them." If you set -ngl 0, everything runs on CPU even though you built with CUDA — so a CUDA build that "isn't using the GPU" is almost always a missing -ngl. I go deep on this in GPU offload layers explained.

In a second terminal, run watch -n 1 nvidia-smi; it should show the llama process and rising VRAM usage. If VRAM moves, you're golden.
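
For a number you can compare before and after loading the model, nvidia-smi can print used memory as plain CSV. A small sketch: sum_mib is a hypothetical helper that adds up the per-GPU values.

```shell
# Sketch: sum used VRAM (MiB) across GPUs from nvidia-smi's CSV output.
# sum_mib is a hypothetical helper; it reads one integer per line on stdin.
sum_mib() {
  awk '{ total += $1 } END { print total + 0 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | sum_mib)
  echo "VRAM in use now: ${used} MiB. Load the model, then run this again."
fi
```

A jump of roughly the model's file size between the two readings suggests the layers really landed on the GPU.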

How much VRAM do I need, and which quant?

VRAM math is simple: model size on disk ≈ VRAM for the weights, plus headroom for the KV cache (context). A rough rule for the file itself:

Model size   Q4_K_M (≈4.5 bit)   Q8_0 (≈8 bit)   Comfortable GPU
7B–8B        ~4.5–5 GB           ~8 GB           8 GB
13B–14B      ~8–9 GB             ~14 GB          12 GB
32B–34B      ~19–22 GB           ~35 GB          24 GB
70B          ~40–45 GB           ~70+ GB         2× 24 GB

These are ballpark file sizes — verify against the actual GGUF on Hugging Face and leave 1–2 GB of headroom for context. Q4_K_M is the sweet-spot quant (small, barely-perceptible quality loss); Q8_0 is near-lossless but double the size. I break the tradeoff down in Q4 vs Q8 quant quality and the VRAM requirements guide.
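
The arithmetic behind those ballpark numbers is parameters (in billions) times bits per weight, divided by 8. A rough sketch: est_weights_gb is a hypothetical helper, and it ignores file metadata and the KV cache, so it lands a little under real file sizes.

```shell
# Rough estimate: weights in GB ~= parameters (billions) * bits per weight / 8.
# est_weights_gb is a hypothetical helper; it ignores metadata and KV cache.
est_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

est_weights_gb 8 4.5    # 8B at ~4.5 bits  -> prints 4.5
est_weights_gb 70 4.5   # 70B at ~4.5 bits -> prints 39.4
```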

If the model doesn't fully fit, you don't have to give up — lower -ngl so only some layers go to the GPU and the rest stay on CPU. It's slower, but it runs.
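
To pick a starting -ngl for a model that doesn't fit, you can scale the layer count by how much of the file your VRAM budget covers. This is a rough sketch only: ngl_for_budget is a hypothetical helper, the inputs are made-up example values, and it assumes every layer is the same size, which is not exactly true.

```shell
# Rough sketch: how many layers might fit in a VRAM budget, assuming every
# layer is the same size (an approximation, so treat the result as a start).
# Arguments: file size GB, layer count, VRAM GB, headroom GB.
ngl_for_budget() {
  awk -v f="$1" -v n="$2" -v v="$3" -v h="$4" 'BEGIN {
    fit = int(n * (v - h) / f)
    if (fit < 0) fit = 0
    if (fit > n) fit = n
    print fit
  }'
}

# Example values: 20 GB file, 64 layers, 12 GB card, 2 GB headroom.
ngl_for_budget 20 64 12 2   # prints 32
```

Start from that number and nudge -ngl down if you still hit out-of-memory at load.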

Which path should I pick? (decision list)

  • If you just want to chat with models and never compile → use Ollama or LM Studio with their prebuilt CUDA binaries, not source.
  • If you want the newest llama.cpp features or a specific commit → build from source with -DGGML_CUDA=ON, as above.
  • If nvcc is missing → you skipped the CUDA Toolkit; install it before cmake.
  • If the build succeeds but the GPU sits idle → you forgot -ngl 99 at runtime.
  • If you're on AMD instead of NVIDIA → build with -DGGML_HIP=ON (ROCm) instead of CUDA (older guides use the previous name, -DGGML_HIPBLAS=ON); see NVIDIA vs AMD for local LLMs.
  • If you want zero compiler headaches and Docker is fine → run the official CUDA image instead (next section).

Is there a Docker shortcut that skips compiling?

Yes. If you have the NVIDIA Container Toolkit installed, the official CUDA image gives you a GPU-ready llama-server with no local build:

docker run --gpus all -p 8080:8080 \
  -v ~/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/your-model.gguf -ngl 99 --host 0.0.0.0

This is my preferred route on a headless homelab box — it isolates the CUDA mess inside the container. If you're building a stack like that, my Docker homelab guide pairs llama.cpp's OpenAI-compatible server with a web UI nicely.
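
Large models take a while to load, so it helps to wait for the server before pointing a client at it. A minimal sketch, assuming the container above is publishing port 8080 and that curl is installed; wait_for_health is a hypothetical helper that polls llama-server's /health endpoint.

```shell
# Sketch: poll llama-server's /health endpoint until it answers or we give up.
# wait_for_health is a hypothetical helper.
# Arguments: URL, attempts (default 30), delay in seconds (default 2).
wait_for_health() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors, such as the server still loading
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage, assuming the container above is running:
# wait_for_health http://localhost:8080/health && echo "server is up"
```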

Why does my CUDA build fail? (common errors)

  • nvcc: command not found → CUDA Toolkit not installed or not on PATH. Fix the export PATH step above.
  • Unsupported gpu architecture → wrong CMAKE_CUDA_ARCHITECTURES, or your CUDA version is too old for a new card. Update the toolkit.
  • Driver/CUDA version mismatch → nvidia-smi shows a CUDA version ceiling; your toolkit must be at or below it. Update the driver if needed.
  • Out-of-memory at load → quant too big for your VRAM. Drop to a smaller quant or lower -ngl.
  • Build is glacially slow → set CMAKE_CUDA_ARCHITECTURES to just your card so nvcc isn't compiling for every GPU ever made.

For the bigger picture — flags, sampling, server config, and the full feature set — read the parent pillar, the llama.cpp complete guide. And if you find yourself reaching for Ollama's convenience after this, llama.cpp vs Ollama: when to switch covers exactly that decision.

Bottom line

Building llama.cpp with CUDA on Linux is two real steps: install the CUDA Toolkit (so nvcc exists), then cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j. The two things that trip everyone up are a missing toolkit at build time and a missing -ngl 99 at run time — fix those and your NVIDIA card does the heavy lifting. Start with a small Q4_K_M quant, confirm VRAM moves in nvidia-smi, then scale up. If compiling isn't your idea of fun, the Docker server-cuda image gets you the same GPU acceleration with none of the build dance.

Frequently asked questions

See /blog/llama-cpp-complete-guide for the full cornerstone guide.
