Caching & session state¶
llama-crab exposes two complementary persistence mechanisms:
- Prompt / KV cache — reuse the computed key-value tensors for a previously seen prompt prefix. Saves the prefill pass.
- Session state — serialize the entire KV state (and optionally the sampler / optimizer state) of a sequence to bytes.
Both are manual in v0.1.x. The high-level create_completion
helper clears KV sequence 0 before each call, so it always starts at
position 0 and does not consult the prompt cache automatically. If
you need to use these mechanisms, drive the lower-level API
directly — or use the worker-thread pattern that the
server ships with.
Prompt cache¶
Prompt-cache storage lives in the [cache] module. Both
implementations follow the [Cache] trait, which stores and looks
up opaque session bytes for token prefixes.
| Implementation | Cargo feature | Backing store |
|---|---|---|
[RamCache] |
(always) | In-process BTreeMap. |
[DiskCache] |
disk-cache |
sled database on disk. |
Using RamCache¶
use llama_crab::cache::{Cache, CacheEntry, RamCache};
let cache = RamCache::new();
let tokens: Vec<llama_crab::token::LlamaToken> = Vec::new();
// After computing the KV state for `tokens` once:
cache.store(&tokens, CacheEntry {
state: vec![],
n_past: tokens.len() as i32,
});
// On the next call with a longer sequence that starts with `tokens`:
if let Some(hit) = cache.lookup(&tokens) {
println!("cache hit at position {}", hit.n_past);
}
The cache is keyed on the exact token sequence and returns the
longest matching prefix. If you ask for [A, B, C, D] and the
cache has [A, B], the lookup returns the entry for [A, B] and
you can use it to skip the prefill of the first two tokens.
Using DiskCache¶
For persistence across process restarts, enable the disk-cache
feature and switch to DiskCache::open(path):
The disk cache is safe to share across Llama instances in the same
process; only the last writer wins on key collisions.
When the prompt cache helps¶
The cache pays off in three regimes:
- Manual chat loops — each turn prepends the full history; if you restore cached state yourself, a prefix hit can skip re-evaluating previous turns.
- RAG — embed the same document chunk multiple times across queries.
- Templated prompts — system prompt + few-shot examples are repeated across queries.
The cache does not help when:
- Every call uses a fresh prompt with no common prefix.
- You stay on the high-level
create_completionpath without manually restoring cached state. - The model is small enough that prefill is not the bottleneck anyway.
Session state¶
The llama_state_get_data and llama_state_set_data functions
serialise the full KV state (and the model's learned sampling
state, if any) of a sequence to a byte buffer. This is what
RamCache and DiskCache store under the hood.
The high-level [Llama] orchestrator does not yet wrap these calls
behind a typed API; you can drive them through llama.context() if
you need byte-exact session snapshots — for example, to suspend and
resume a long agent session.
// Pseudocode — see llama-crab-sys for the full surface.
let bytes = unsafe { llama.context().session_save(/*seq_id*/ 0)? };
std::fs::write("session.bin", bytes)?;
// Later, in a new process:
let bytes = std::fs::read("session.bin")?;
unsafe { llama.context().session_load(/*seq_id*/ 0, &bytes)? };
Manual KV cache reuse¶
If you don't want to persist state across processes, you can still benefit from the KV cache by managing the sequence yourself:
use llama_crab::batch::LlamaBatch;
use llama_crab::token::LlamaToken;
// 1. Load a long system prompt.
let system_tokens = llama.model().tokenize(SYSTEM_PROMPT, true, true)?;
let mut batch = LlamaBatch::new(system_tokens.len(), 1);
batch.add_sequence(&system_tokens, 0, false);
batch.prepare();
llama.context().decode(&batch)?;
let mut n_past = system_tokens.len() as i32;
// 2. Each turn, decode only the new user message.
loop {
let user_tokens = llama.model().tokenize(/* prompt */, false, true)?;
let mut batch = LlamaBatch::new(user_tokens.len(), 1);
batch.add_sequence(&user_tokens, 0, false);
batch.prepare();
llama.context().decode(&batch)?;
n_past += user_tokens.len() as i32;
// 3. Sample, append, loop.
// ...
}
The KV cache is not cleared between turns, so turn N+1 only
pays the prefill cost of the new user message. This is the pattern
used by the stateful_chat example.
Caching pitfalls¶
| Pitfall | What goes wrong | Fix |
|---|---|---|
Two processes share a DiskCache |
Concurrent writers corrupt the database. | Scope the cache to one process or add a file lock. |
| Cache key uses raw text | A typo causes a miss. | Tokenise the prompt first, key on the token ids. |
| Cache survives a model upgrade | Old KV state is incompatible with the new model. | Include the GGUF hash in the cache key, or wipe on upgrade. |
| Cache grows unboundedly | Out-of-memory crash. | Periodically evict old entries or use DiskCache. |
Where to next?¶
- Stateful chat — multi-turn chat with a growing history that doesn't replay the entire context.
- Server — the reference implementation of the worker-thread pattern.
- Caching example — a runnable program that uses manual KV cache reuse.