speculative — Prompt-lookup draft decoding¶
Demonstrates [PromptLookupDecoding]: scan the prompt for the last
n tokens and emit what followed them as a draft. No extra model
required.
Run¶
Downloads Qwen2.5-0.5B-Instruct-GGUF (~400 MB).
What it does¶
use llama_crab::speculative::{DraftModel, PromptLookupDecoding};
use llama_crab::{Llama, LlamaParams};
let llama = Llama::load(LlamaParams::new("models/qwen2.5-0.5b-instruct-q4_k_m.gguf")
.with_n_ctx(1024))?;
let prompt = "Rust is fast and memory safe. Rust is fast";
let prompt_tokens = llama.model().tokenize(prompt, true, true)?;
let draft = PromptLookupDecoding::new(3, 8);
let drafted = draft.draft(&prompt_tokens, 8);
let drafted_text = llama.model().detokenize(&drafted, false)?;
println!("drafted text> {}", drafted_text.trim());
The prompt contains a repetition ("Rust is fast" appears twice),
so the draft picks up "and memory safe" and emits it as the
candidate next tokens.
Expected output¶
prompt> Rust is fast and memory safe. Rust is fast
drafted token ids> [Token(...), Token(...), ...]
drafted text> and memory safe
Tuning knobs¶
| Knob | Description | Typical range |
|---|---|---|
max_ngram_size |
How many trailing tokens form the lookup key. | 2–4 |
num_pred_tokens |
How many tokens to emit when a match is found. | 4–16 |
Larger max_ngram_size finds more matches but is more sensitive
to small edits. Larger num_pred_tokens reduces the verification
overhead per accepted token, but a wrong draft is more expensive
to recover from.
When this helps¶
- Repetitive prompts — code, lists, RAG that quotes the context.
- FIM infill — the body of a function appears earlier in the file.
- Long templated prompts — the system prompt repeats.
For open-ended creative writing, acceptance drops and the overhead can exceed the savings. Measure before adopting.
Custom draft models¶
The DraftModel trait lets you plug in any token-proposing
strategy — a smaller model, a regex automaton, a finite-state
machine, a trie of common phrases:
use llama_crab::speculative::DraftModel;
use llama_crab::token::LlamaToken;
struct AlwaysHello;
impl DraftModel for AlwaysHello {
fn draft(&self, _input: &[LlamaToken], n: usize) -> Vec<LlamaToken> {
// Replace with: sample n tokens from your smaller model.
Vec::new()
}
}
Then drive the speculative step with the [speculative_decode]
free function.
When acceptance is too low¶
A few rules of thumb:
| Symptom | Likely cause | Fix |
|---|---|---|
| Acceptance < 30 % | The prompt is not repetitive enough. | Try a different draft model (a small instruct GGUF). |
| Acceptance > 80 % but speedup is small | The draft step is too slow. | Use a smaller draft, or PromptLookupDecoding. |
| Speedup is negative | The main model is already small. | Speculative decoding rarely helps sub-1B models. |
Full source¶
examples/speculative/src/main.rs.
Where to next?¶
- Speculative decoding guide — the full reference, including custom draft models.
- Performance tuning recipe — measure throughput with and without speculative decoding.