WikiWayne

As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.

LM Studio: Download Models Step by Step

Published: June 13, 2026

Use the LM Studio catalog without guessing quant labels.

Wayne Lowry

10+ years in Digital Marketing & SEO

Downloading models in LM Studio is mostly point-and-click: open the Discover tab, search a model name, and pick a quant from the dropdown. The only real trick is reading the quant labels so you don't grab a file that won't fit in your memory. This guide walks the whole flow start to finish, decodes the GGUF and quant naming, and gives you a sizing rule so you stop guessing.

If you're still deciding between tools before you commit, my LM Studio vs Ollama vs llama.cpp comparison is the parent guide this one feeds into. Come back here once you've settled on LM Studio.

What is LM Studio and what format does it download?

LM Studio is a free desktop app (Windows, macOS, Linux) that gives you a GUI catalog for finding, downloading, and chatting with open-weight LLMs locally. Under the hood it runs GGUF files through a bundled llama.cpp engine, plus an MLX engine on Apple Silicon for native Metal acceleration.

GGUF (GPT-Generated Unified Format) is the single-file container format that packs a model's weights, tokenizer, and metadata together so a runner can load it without any extra config. Unless you pick an MLX build on Apple Silicon, every model you download in LM Studio is a GGUF, and the quant label in the filename is the one thing you actually have to understand. More on the format in my GGUF explainer.

How do I download a model in LM Studio, step by step?

Here's the full flow, assuming a fresh install.

  1. Open the Discover tab. It's the magnifying-glass icon in the left sidebar (older builds call it the Search tab).
  2. Search a model name. Type something concrete like Qwen3 8B, Llama 3.1 8B Instruct, Gemma 3 12B, or Mistral Nemo. LM Studio queries Hugging Face and lists matching repos.
  3. Pick the repo, then expand the quant list. Each result has a dropdown of available quantizations. This is where people freeze up, so I'll decode the labels in the next section.
  4. Watch the fit indicator. LM Studio tags each quant with a rough "will this fit your machine" estimate based on your detected RAM/VRAM. Green-ish means it should load; a warning means you're cutting it close or over.
  5. Click Download. Files land in ~/.lmstudio/models/ (or your configured models folder) in a publisher/model/ folder structure.
  6. Load it from the Chat tab. Hit the model selector at the top, pick what you just downloaded, and adjust the GPU offload slider before loading.
  7. Test with one prompt. Confirm it responds and check tokens/sec in the bottom bar.

That's it. The download itself is trivial; everything that goes wrong is a quant-and-memory mismatch, which the rest of this guide is about.
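
To double-check step 5, a short script can list what actually landed on disk. This is a minimal sketch that assumes the default ~/.lmstudio/models/ path mentioned above; the function name list_gguf_models is my own and not part of LM Studio, and you should point it at your configured folder if you changed it.

```python
from pathlib import Path


def list_gguf_models(models_dir: Path) -> list[tuple[str, int]]:
    """Return (path relative to models_dir, size in bytes) for each .gguf file."""
    results = []
    for path in sorted(models_dir.rglob("*.gguf")):
        relative = path.relative_to(models_dir).as_posix()
        results.append((relative, path.stat().st_size))
    return results


if __name__ == "__main__":
    # Assumed default location; change this if you configured another folder.
    default_dir = Path.home() / ".lmstudio" / "models"
    if default_dir.is_dir():
        for name, size_bytes in list_gguf_models(default_dir):
            print(f"{size_bytes / 1e9:6.2f} GB  {name}")
    else:
        print(f"No models folder found at {default_dir}")
```

Each line it prints should follow the publisher/model/file layout, with a size you can compare against your available memory.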

How do I read LM Studio's quant labels without guessing?

Quantization is the process of storing model weights at lower numeric precision (4-bit, 5-bit, 8-bit instead of 16-bit) to shrink the file and cut memory use, at a small cost to quality. The label tells you exactly which scheme a file uses.

Here's how to parse a typical filename like Qwen3-8B-Q4_K_M.gguf:

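  • The number (Q4, Q5, Q8) is the nominal bits per weight. The effective figure runs a little higher because each block of weights also stores scaling data, which is why Q4_K_M lands around 4.5 in the cheat sheet below. Lower = smaller file, less memory, slightly lower quality.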
  • _K means a "K-quant," the modern mixed-precision method that's better than the old legacy quants at the same size. You want K-quants.
  • The suffix (_S, _M, _L) is Small / Medium / Large within that bit level — a finer trade between size and quality.
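
The three parsing rules above can be sketched in code. This is an illustrative helper, not anything LM Studio ships: parse_quant and its regex are my own, and it only covers the Q-style labels described here (IQ labels such as IQ4_XS are deliberately left out and return None).

```python
import re
from typing import Optional

# Matches a trailing label such as Q4_K_M, Q6_K, or Q8_0 just before ".gguf".
QUANT_PATTERN = re.compile(r"[-_.](Q(\d+)_(K(?:_([SML]))?|\d))\.gguf$")

SIZE_NAMES = {"S": "Small", "M": "Medium", "L": "Large"}


def parse_quant(filename: str) -> Optional[dict]:
    """Split a GGUF filename's quant label into its parts, or return None."""
    match = QUANT_PATTERN.search(filename)
    if match is None:
        return None
    label, bits, scheme, size = match.groups()
    return {
        "label": label,                       # e.g. "Q4_K_M"
        "nominal_bits": int(bits),            # the number after Q
        "k_quant": scheme.startswith("K"),    # True for K-quants
        "size_variant": SIZE_NAMES.get(size), # Small / Medium / Large, if present
    }


print(parse_quant("Qwen3-8B-Q4_K_M.gguf"))
```

For the example filename, the helper reports a 4-bit K-quant with the Medium variant, which is the same reading you would do by eye.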

Quant cheat sheet for the labels you'll actually see:

Quant label        Bits/weight   Quality               When I reach for it
Q8_0               8             Near-lossless         Plenty of memory, want max quality
Q6_K               ~6.5          Excellent             Sweet spot when you have headroom
Q5_K_M             ~5.5          Very good             Balanced default if Q4 feels lossy
Q4_K_M             ~4.5          Good                  My default: best size/quality balance
Q4_K_S             ~4            Decent                Squeezing into tight memory
Q3_K_M             ~3.5          Noticeably degraded   Last resort to make it fit at all
IQ (e.g. IQ4_XS)   varies        Good per-byte         Tightest fits; uses imatrix, slightly slower

If you only remember one thing: Q4_K_M is the right default for almost everyone. Start there, and only move up to Q5/Q6/Q8 if you have memory to spare or down to Q3/IQ if it won't load. I go deeper on the trade-off in Q4 vs Q8: the quant quality tradeoff and the broader quantization explainer.

How much memory does a given quant need?

Quick mental math: the weights need about as much RAM/VRAM as the file takes on disk, and the context (the KV cache) adds overhead on top of that. So a 5 GB GGUF wants about 5 GB of memory for weights, plus roughly 0.5-2 GB for context depending on how long you set it.
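
Here is that mental math as a small sketch. The 0.5-2 GB context allowance is the rough range from the paragraph above, not a measured figure, and the function names and the "tight" verdict are my own labels.

```python
def memory_range_gb(file_size_gb, context_overhead_gb=(0.5, 2.0)):
    """Return (low, high) memory estimates: file size plus context overhead."""
    low_extra, high_extra = context_overhead_gb
    return file_size_gb + low_extra, file_size_gb + high_extra


def fit_verdict(file_size_gb, available_gb):
    """Classify a file as 'fits', 'tight', or 'too big' for the given memory."""
    low, high = memory_range_gb(file_size_gb)
    if available_gb >= high:
        return "fits"      # room for weights plus a long context
    if available_gb >= low:
        return "tight"     # loads only with a short context
    return "too big"


print(memory_range_gb(5.0))   # the 5 GB example from the text
print(fit_verdict(5.0, 8.0))
```

A "tight" result is the cue to shorten the context or drop one quant level before you download.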

Ballpark download sizes by model size at Q4_K_M (verify the exact bytes in LM Studio's dropdown; these drift by model):

Model size   ~Q4_K_M file   ~Memory to run comfortably
3-4B         ~2-3 GB        6 GB+
7-8B         ~4-5 GB        8 GB+
12-14B       ~7-9 GB        12-16 GB
27-32B       ~16-20 GB      24-32 GB
70B          ~40 GB         48 GB+ (or heavy CPU offload)

These are rough. Treat LM Studio's own fit estimate as a starting hint and confirm on your own stack — actual usage shifts with context length and engine. For the full math, see how much VRAM for Llama 3 8B and the general VRAM requirements guide.
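
The file sizes in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. A rough sketch, using the approximate bits-per-weight values from the quant cheat sheet earlier in this guide (real files differ somewhat, so treat the output as a ballpark; the function name is my own):

```python
# Approximate bits per weight, copied from this guide's quant cheat sheet.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.0,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "Q4_K_S": 4.0,
    "Q3_K_M": 3.5,
}


def approx_file_size_gb(params_billions, quant="Q4_K_M"):
    """Estimate GGUF size in GB: billions of params * bits per weight / 8."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


for size in (8, 32, 70):
    print(f"{size}B at Q4_K_M: about {approx_file_size_gb(size):.1f} GB")
```

An 8B model at about 4.5 bits per weight works out to about 4.5 GB, in line with the 7-8B row above.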

Which model and quant should I pick? (decision list)

Match the choice to your hardware:

  • If you have 8 GB RAM/VRAM → grab a 7-8B model (Qwen3 8B, Llama 3.1 8B) at Q4_K_M. Keep context modest.
  • If you have 16 GB → run an 8B at Q5_K_M/Q6_K, or step up to a 12-14B (Gemma 3 12B, Phi-4) at Q4_K_M.
  • If you have 24 GB (e.g. a 3090/4090 or M-series with unified memory) → a 27-32B model (Qwen3 32B, Gemma 3 27B) at Q4_K_M is a great daily driver.
  • If you have 32 GB+ of unified memory on Apple Silicon → prefer the MLX version of the model when LM Studio offers it; it's tuned for Metal and often runs a touch faster than the GGUF.
  • If you're on a small or CPU-only box → stay at 3-4B and accept slower tokens/sec. See my CPU-only privacy tradeoff piece.
  • If it won't fit at Q4_K_M → drop to Q4_K_S, then an IQ4_XS, then a smaller model. Don't go below Q3 unless you have no other option.
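
The decision list above can be written as a lookup. This sketch makes two assumptions of its own: it treats each memory tier as a lower bound, and it maps anything under 8 GB to the 3-4B advice for small boxes. The function names are hypothetical.

```python
# Fallback order from the last bullet: step down before switching models.
FALLBACK_ORDER = ["Q4_K_M", "Q4_K_S", "IQ4_XS"]


def recommend(memory_gb, apple_silicon=False):
    """Map available RAM/VRAM to the model size class suggested in this guide."""
    if memory_gb >= 24:
        model_size = "27-32B"
    elif memory_gb >= 16:
        model_size = "12-14B"
    elif memory_gb >= 8:
        model_size = "7-8B"
    else:
        model_size = "3-4B"  # assumption: under 8 GB counts as a small box
    return {
        "model_size": model_size,
        "quant": "Q4_K_M",  # the guide's default starting point
        "prefer_mlx": apple_silicon and memory_gb >= 32,
    }


def next_quant_down(current):
    """Return the next smaller quant, or None when a smaller model is the move."""
    index = FALLBACK_ORDER.index(current)
    if index + 1 < len(FALLBACK_ORDER):
        return FALLBACK_ORDER[index + 1]
    return None


print(recommend(16))
print(next_quant_down("Q4_K_M"))
```

On a 16 GB machine this suggests a 12-14B model at Q4_K_M; the alternative from the list, an 8B at Q5_K_M or Q6_K, is a quality-over-size choice the sketch does not model.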

How do I import a GGUF I already downloaded?

If you already pulled a GGUF elsewhere (Hugging Face directly, or shared from another machine), you don't have to re-download. Drop the file into your models directory using the publisher/model folder convention:

# macOS / Linux default models path (a pattern, not a command):
#   ~/.lmstudio/models/<publisher>/<model-name>/<file>.gguf

# Example: create the publisher/model folder, then copy the file into it
mkdir -p ~/.lmstudio/models/lmstudio-community/Qwen3-8B-GGUF
cp ~/Downloads/Qwen3-8B-Q4_K_M.gguf \
   ~/.lmstudio/models/lmstudio-community/Qwen3-8B-GGUF/

Restart LM Studio (or refresh My Models) and it'll appear in the Chat selector. The same GGUF works in Ollama and llama.cpp too, so you're not locked in — that portability is the whole point of the format.

Why is my model slow or failing to load?

A few quick fixes for the common ones:

  • Loads then crashes / out of memory → the quant is too big for your memory. Pick a smaller quant, or offload fewer layers to the GPU if VRAM is what's running out.
  • Painfully slow → too many layers are running on CPU. Bump the GPU offload slider up in the load settings; see GPU offload layers explained.
  • Won't fit even at Q4 → you're memory-bound. Smaller model, or look at the best budget GPU options.
  • Context errors on long prompts → lower the context length slider; long context eats memory fast on top of the weights.

Bottom line

Downloading models in LM Studio is easy once the quant labels stop being a mystery. Search the Discover tab, default to Q4_K_M, check that the file size fits your memory with a little headroom, and only deviate up or down when you have a reason. On Apple Silicon, grab the MLX build when it's offered. Start small, confirm it loads and responds, then scale up — and when you're ready to compare runners head-to-head, head back to the pillar guide.
