Performance tuning¶
This recipe walks through the four levers you have over inference performance: the backend, the offload strategy, the batch size and the sampler chain. We measure first, then optimise.
The four levers¶
flowchart TD
A[Tokens per second] --> B[Backend]
A --> C[Offload]
A --> D[Batch size]
A --> E[Sampler chain]
| Lever | Effect on tokens/sec | When to tune it |
|---|---|---|
| Backend | 5–10× (CPU → GPU). | When the model is too slow on CPU. |
| Offload | 2–4× per crossed layer. | When the model doesn't fit in VRAM. |
| Batch size | 1.2–2× for batched servers. | When you serve many requests in parallel. |
| Sampler chain | 1.05–1.5× (greedy is fastest). | When you have a tight latency budget. |
Step 1: measure first¶
Don't optimise before measuring. The high-level Completion struct
carries the timings you need:
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(2048))?;
let resp = llama.create_completion("The capital of France is", 200)?;
println!("tokens/sec = {}", 200.0 / resp.timings.total_sec);
The CompletionTimings struct has:
| Field | Meaning |
|---|---|
prompt_n |
Number of prompt tokens. |
prompt_sec |
Wall time for the prefill. |
predicted_n |
Number of generated tokens. |
predicted_sec |
Wall time for the decode loop. |
total_sec |
Total wall time. |
For a fair comparison, run the same prompt at the same
max_tokens across the configurations you want to test.
Step 2: pick a backend¶
| Target | Cargo features | Typical tokens/sec (7B Q4_K_M) |
|---|---|---|
| CPU only (M-series Mac) | ["openmp"] |
~10 tok/s. |
| Apple Metal | ["metal", "openmp"] |
~25 tok/s. |
| NVIDIA RTX 4090 | ["cuda", "openmp"] |
~80 tok/s. |
| AMD MI300X | ["rocm", "openmp"] |
~100 tok/s. |
| Vulkan (NVIDIA) | ["vulkan", "openmp"] |
~70 tok/s. |
Numbers depend on quant, context length, prompt length and the exact model. Use them as a sanity check, not a guarantee.
Step 3: tune n_gpu_layers¶
n_gpu_layers is the single biggest knob for hybrid
CPU/GPU workloads. The rule of thumb:
| Model size | 8 GB GPU | 16 GB GPU | 24 GB GPU |
|---|---|---|---|
| 7B Q4_K_M | 99 layers | 99 layers | 99 layers |
| 13B Q4_K_M | 99 layers | 99 layers | 99 layers |
| 70B Q4_K_M | 10–15 layers | 20–25 layers | 35–40 layers |
To find the exact value for your model, increase n_gpu_layers
until you see "out of memory" errors, then back off by 5.
Step 4: tune n_threads¶
CPU threads are most useful when (a) the model is too big to fit in VRAM and the tail runs on the CPU, or (b) the prompt ingestion phase is dominated by tokenisation (rare on small prompts).
A reasonable starting point is the number of physical cores (not the total core count). On Apple Silicon, the number of performance cores is a better target.
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(
LlamaParams::new("model.gguf")
.with_n_threads(num_cpus::get_physical()),
)?;
A few rules of thumb:
- Don't oversubscribe. More threads than physical cores causes context-switch overhead.
- On laptops, half the physical cores to avoid thermal throttling.
- The
n_threads_batchsetting (separate thread count for batches) is rarely needed; leave it at the default.
Step 5: pick the right sampler chain¶
The sampler chain has a small per-token overhead, but the quality of the generated text also matters. A few chains to start with:
| Use case | Chain |
|---|---|
| Benchmarks | temp(0.0) (greedy). |
| Code | temp(0.0) (greedy) or temp(0.2) + top_p(0.95). |
| General chat | temp(0.7) + top_p(0.9) + min_p(0.05) + penalties(64, 1.1, 0, 0). |
| Creative writing | temp(0.9) + top_p(0.95) + min_p(0.05) + penalties(128, 1.1, 0.1, 0.1). |
| Structured output | temp(0.7) + top_p(0.9) + grammar. |
greedy is the fastest sampler. mirostat_v2 is roughly 1.5×
slower than top_p + temp for the same quality.
Step 6: speculative decoding¶
On agreement-heavy workloads, speculative decoding can give a 2–3× speedup. The recipe:
- Start with
PromptLookupDecoding::new(2, 8)and a small main model. Measure acceptance rate. - If acceptance is > 60 %, you're getting a free speedup.
- If acceptance is < 30 %, switch to a small draft model or skip speculative decoding.
See the speculative decoding guide for the full menu.
A worked example¶
Suppose a 7B Q4_K_M model runs at 10 tok/s on a MacBook Pro M2. You have an RTX 4090 and want to push it to 80 tok/s.
- Add the CUDA backend:
- Offload all layers:
- Measure:
- Try increasing
n_threadsfrom 4 to 8: - Try
temp(0.0)(greedy) instead of the default chain:
The actual numbers depend on the exact hardware and the model; the process is the same: measure, change one variable, measure again.
Common pitfalls¶
| Pitfall | Symptom | Fix |
|---|---|---|
| Building without the GPU feature | supports_gpu_offload() is false. |
Add the right backend to the Cargo features. |
n_gpu_layers = 0 |
Everything runs on the CPU. | Set it to the number of layers in the model. |
n_threads > physical cores |
Context-switch overhead. | Cap at physical cores. |
| Greedy sampler on a creative task | Repetitive, bland output. | Add temperature and top-p. |
| Speculative decoding on creative text | Acceptance < 30 %, negative speedup. | Use a different draft or skip speculative decoding. |
Where to next?¶
- Backends & GPU offload — the runtime configuration.
- Sampling strategies — the full sampler menu.
- Speculative decoding — when agreement is high.