Choosing a model

Picking the right GGUF for the job is mostly about four axes: size, quant, architecture and modality. This page is a short guide to each.

The four axes

flowchart LR
    A[Pick a model] --> B[Size]
    A --> C[Quant]
    A --> D[Architecture]
    A --> E[Modality]

| Axis | What you choose | Trade-off |
| --- | --- | --- |
| Size | 0.5B / 1B / 3B / 7B / 13B / 70B parameters. | Larger = smarter but slower and more memory. |
| Quant | F16 / Q8_0 / Q6_K / Q5_K_M / Q4_K_M / Q3_K_M / Q2_K. | Smaller quant = less memory, slightly less accurate. |
| Architecture | Llama 3, Qwen 2.5, Gemma 3, Mistral, Phi-3, … | Each has a different chat template, tool format, and license. |
| Modality | Text, vision, audio, multimodal. | Vision needs mtmd; audio needs a matching projector. |

A size cheat-sheet

| Size | Best for | Memory (Q4_K_M) | Speed on a 4090 |
| --- | --- | --- | --- |
| 0.5B | Demos, REPLs, smoke tests. | ~400 MB | ~120 tok/s |
| 1B | Simple assistants, classification. | ~800 MB | ~90 tok/s |
| 3B | Single-user chatbots. | ~2 GB | ~50 tok/s |
| 7B | General-purpose assistants. | ~4 GB | ~30 tok/s |
| 13B | Higher-quality assistants. | ~8 GB | ~20 tok/s |
| 70B | Frontier-quality. | ~40 GB | ~6 tok/s |

These numbers apply to generative models, not to the embedding models used for retrieval. Embedding models are much smaller, usually 0.1–0.5 GB.
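The memory column can be approximated from the parameter count and the quant's bits per weight. The sketch below is a rough estimate only: it assumes the weights dominate memory and ignores the KV cache and runtime buffers, so real usage will be somewhat higher. The function names are illustrative and not part of any library API.

```rust
/// Rough size of the quantised weights in bytes:
/// parameters × bits per weight / 8.
/// Ignores the KV cache and runtime buffers (an assumption of this sketch).
fn estimate_weight_bytes(params: u64, bits_per_weight: f64) -> u64 {
    (params as f64 * bits_per_weight / 8.0).round() as u64
}

/// Convert bytes to decimal gigabytes for display.
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}

fn main() {
    // 4.8 bits per weight is the Q4_K_M figure from the quant table on this page.
    let sizes: [(&str, u64); 3] = [
        ("0.5B", 500_000_000),
        ("7B", 7_000_000_000),
        ("70B", 70_000_000_000),
    ];
    for (name, params) in sizes {
        let gb = to_gb(estimate_weight_bytes(params, 4.8));
        println!("{name}: ~{gb:.1} GB at Q4_K_M");
    }
}
```

For a 7B model this gives roughly 4.2 GB, in line with the ~4 GB figure in the table. Small models tend to land above the estimate because some tensors are kept at higher precision.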

Picking a quant

| Quant | Bits per weight | Quality loss | When to use it |
| --- | --- | --- | --- |
| F16 | 16 | None. | Reference. Almost never shipped. |
| Q8_0 | 8.5 | Negligible. | When you have the VRAM. |
| Q6_K | 6.5 | Tiny. | Mid-budget. |
| Q5_K_M | 5.7 | Small. | Good default. |
| Q4_K_M | 4.8 | Noticeable on long contexts. | The most common default. |
| Q3_K_M | 3.9 | Visible on reasoning tasks. | When you need to save 1–2 GB. |
| Q2_K | 3.4 | Significant. | Only for very tight memory budgets. |

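The K quantisations (the names containing _K) are a newer family of formats. They group the weights into "super-blocks" with their own quantised scales, and the _S / _M / _L variants keep the more sensitive tensors at a higher precision than the rest. They generally produce better quality than the older non-K quants at the same bitrate.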
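One way to turn the quant table into a decision is to walk it from highest to lowest quality and take the first entry whose estimated weight size fits the memory budget. The following is a minimal sketch under that assumption; the bits-per-weight values are approximate (Q8_0 is counted as 8.5 to include its per-block scale), and `pick_quant` is a hypothetical helper, not a library function.

```rust
/// (name, approximate bits per weight), ordered from highest to lowest quality.
const QUANTS: [(&str, f64); 7] = [
    ("F16", 16.0),
    ("Q8_0", 8.5),
    ("Q6_K", 6.5),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
    ("Q2_K", 3.4),
];

/// Return the highest-quality quant whose estimated weight size
/// (params × bits per weight / 8) fits within `budget_bytes`,
/// or None if even the smallest quant does not fit.
/// The estimate ignores the KV cache, so leave headroom in the budget.
fn pick_quant(params: u64, budget_bytes: u64) -> Option<&'static str> {
    QUANTS
        .iter()
        .find(|(_, bpw)| params as f64 * bpw / 8.0 <= budget_bytes as f64)
        .map(|(name, _)| *name)
}

fn main() {
    // A 7B model with about 5 GB available for weights.
    match pick_quant(7_000_000_000, 5_000_000_000) {
        Some(q) => println!("use {q}"),
        None => println!("no quant fits; pick a smaller model"),
    }
}
```

When nothing fits, dropping to a smaller model is usually a better trade than forcing the lowest quant.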

Picking an architecture

| Architecture | License | When to use it |
| --- | --- | --- |
| Llama 3 / 3.1 / 3.2 / 3.3 | Llama 3 community license. | General-purpose. Wide tooling support. |
| Qwen 2 / 2.5 | Apache 2.0. | Strong multilingual and tool-calling. |
| Gemma 2 / 3 | Gemma license. | Quality-per-parameter leader on the small end. |
| Mistral / Mixtral | Apache 2.0. | Strong instruct and tool calling. |
| Phi-3 / Phi-3.5 | MIT. | Small but capable; great for phones. |
| DeepSeek-V2 / V2.5 | DeepSeek license. | Strong coding and reasoning. |
| Command R / R+ | CC-BY-NC. | RAG-tuned; long context. |

For most users, the choice comes down to:

  • License compatibility with your distribution channel.
  • Tool calling — Qwen 2.5, Llama 3, Mistral and DeepSeek are the strongest.
  • Multilingual — Qwen 2.5 and DeepSeek are the strongest.
  • Quality at small sizes — Gemma 2 and Phi-3 are the strongest.
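The checklist above can be read as a filter: start from the licenses your distribution channel allows, then narrow by the capabilities you need. The sketch below encodes only what this page states; a `false` flag means "not listed among the strongest here", not "unsupported". The `Arch` struct and `shortlist` function are hypothetical names for illustration.

```rust
/// One architecture from this page, reduced to the fields we filter on.
#[derive(Debug, Clone, Copy)]
struct Arch {
    name: &'static str,
    license: &'static str,
    tool_calling: bool, // listed among the strongest for tool calling
    multilingual: bool, // listed among the strongest for multilingual use
}

const ARCHS: [Arch; 6] = [
    Arch { name: "Llama 3", license: "Llama 3 community license", tool_calling: true, multilingual: false },
    Arch { name: "Qwen 2.5", license: "Apache 2.0", tool_calling: true, multilingual: true },
    Arch { name: "Gemma 2", license: "Gemma license", tool_calling: false, multilingual: false },
    Arch { name: "Mistral", license: "Apache 2.0", tool_calling: true, multilingual: false },
    Arch { name: "Phi-3", license: "MIT", tool_calling: false, multilingual: false },
    Arch { name: "DeepSeek", license: "DeepSeek license", tool_calling: true, multilingual: true },
];

/// Names of the architectures whose license is allowed and that meet
/// the requested capabilities, in the order listed above.
fn shortlist(
    allowed_licenses: &[&str],
    need_tools: bool,
    need_multilingual: bool,
) -> Vec<&'static str> {
    ARCHS
        .iter()
        .filter(|a| allowed_licenses.contains(&a.license))
        .filter(|a| !need_tools || a.tool_calling)
        .filter(|a| !need_multilingual || a.multilingual)
        .map(|a| a.name)
        .collect()
}

fn main() {
    // Permissive licenses only, tool calling required.
    let picks = shortlist(&["Apache 2.0", "MIT"], true, false);
    println!("{picks:?}");
}
```

Always confirm the license of the specific checkpoint you ship, since terms can differ between model sizes and releases within one family.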

Picking a modality

| Modality | Cargo feature | Projector needed? | When to use it |
| --- | --- | --- | --- |
| Text only | none | No. | Most chatbots, RAG, agents. |
| Vision | mtmd | Yes (mmproj-*.gguf). | Image Q&A, document extraction. |
| Audio | mtmd | Yes. | Speech-to-text, audio Q&A. |
| Multimodal | mtmd | Yes. | Combined inputs. |

The vision projector must match the text model. Gemma 4 and LFM2.5-VL ship separate vision projectors; Qwen 2.5-VL has a single multimodal GGUF.
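Before loading a vision model it can help to check that a projector file is actually present next to the text model. The sketch below only matches the mmproj-*.gguf naming pattern mentioned above; it cannot verify that the projector belongs to the text model. `find_projector` is a hypothetical helper, and the text-model file name in `main` is made up for illustration.

```rust
/// Return the first file name that follows the `mmproj-*.gguf` pattern.
/// This is a name check only; it does not prove the projector matches the model.
fn find_projector<'a>(files: &[&'a str]) -> Option<&'a str> {
    files
        .iter()
        .copied()
        .find(|f| f.starts_with("mmproj-") && f.ends_with(".gguf"))
}

fn main() {
    // Hypothetical directory listing; the first name is illustrative only.
    let files = [
        "gemma-4-E4B-it-Q4_K_M.gguf",
        "mmproj-gemma-4-E4B-it-BF16.gguf",
    ];
    match find_projector(&files) {
        Some(p) => println!("projector: {p}"),
        None => println!("no projector found; image input will not work"),
    }
}
```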

| Use case | Model |
| --- | --- |
| Demos and CI | Qwen2.5-0.5B-Instruct-GGUF (Q4_K_M, ~400 MB). |
| Single-user chatbot | Qwen2.5-7B-Instruct-GGUF (Q4_K_M, ~4 GB). |
| Frontier assistant | Llama-3.3-70B-Instruct-GGUF (Q4_K_M, ~40 GB). |
| Embeddings | bge-small-en-v1.5-gguf (~30 MB). |
| Reranker | bge-reranker-base-Q4_K_M-GGUF (~600 MB). |
| Vision | gemma-4-E4B-it-GGUF + mmproj-gemma-4-E4B-it-BF16.gguf. |
| Mobile phone | Qwen2.5-0.5B-Instruct-GGUF + MobilePreset::Balanced. |

Where to next?