Core concepts¶
This section explains the why behind the llama-crab API. Read it
once before diving into a specific feature, and refer back to it
when a method signature doesn't do what you expected.
-
The big picture: the relationship between
LlamaBackend,LlamaModel,LlamaContext,LlamaBatch,LlamaSamplerandLlama. A diagram of the data flow inside a single forward pass. -
When does the backend come up and tear down? Who owns the model? What is the safe way to share a model across threads? How do you free a context when you no longer need it?
-
The
LlamaErrorenum, how to map it to user-facing errors, and the patterns the safe API uses to surface "model not found", "context too large", "backend not initialised" and friends.
Mental model¶
flowchart TB
subgraph Backend
BE[LlamaBackend]
end
subgraph Model
M[LlamaModel]
end
subgraph Context
C[LlamaContext]
KV[(KV cache)]
C --- KV
end
BE -->|loads weights| M
M -->|creates| C
C -->|decodes| T[LlamaBatch]
T -->|produces logits| S[LlamaSampler]
S -->|samples| TOK[Token id]
TOK -->|appends to context| C
A request to "complete a prompt" walks this loop once per generated token:
- The model weights are loaded from a GGUF file into a
LlamaModel. - A
LlamaContextis created on top of the model, allocating the KV cache that will hold attention keys and values. - The prompt is tokenised into a
Vec<LlamaToken>, wrapped in aLlamaBatchand submitted to the context withdecode. - The logits produced by the forward pass are handed to a
LlamaSampler, which selects the next token. - The selected token is appended to the context (re-using the KV cache) and the loop continues until the sampler emits EOS or a stop condition is met.
- The selected tokens are detokenised back into text.
The high-level [Llama] orchestrator hides steps 3–5 behind a single
method call. The lower-level types in [llama_crab::context],
[llama_crab::batch] and [llama_crab::sampling] give you full
control over each step when you need it.
The three layers of the API¶
llama-crab is intentionally a three-layer crate:
| Layer | Crate / module | When to use it |
|---|---|---|
| High-level orchestrator | [Llama] |
95 % of applications. The ergonomic, safe facade. |
| Mid-level building blocks | [LlamaModel], [LlamaContext], [LlamaBatch], [LlamaSampler], [MtmdContext] |
When you need manual control over batching, sessions, multimodal evaluation, sampling chains, etc. |
| Raw FFI | [llama_crab_sys] |
Only when you need a llama.cpp symbol that the safe layer does not expose. The high-level crate re-exports the safe wrappers you need. |
Most of this guide works at the mid-level — the building blocks — because they explain what's happening under the hood of the high-level helpers. Production code can stay at the high level unless you hit a wall.
A note on safety¶
llama-crab is built on top of C++ that allocates, frees and shares
memory through raw pointers. The unsafe boundary is intentionally
narrow:
llama_crab_syscontains all theunsafeFFI declarations and the fewunsafefunctions needed to wrap them.- The safe crate on top is annotated with
#![deny(unsafe_op_in_unsafe_fn)]and a strict clippy configuration, so the public API isunsafe-free in 100 % of the documented call sites. - A few low-level escape hatches (raw context handles, raw
*mut llama_context,chunks.eval, the FFI re-exports) are exposed behindunsafe fnso the compiler can verify that you accept responsibility for the invariants.
The rule of thumb: if you don't see an unsafe block in your code,
the safe layer has your back.
Where to next?¶
- Architecture — the data flow, the responsibilities of each type, and how to read the safe API as a thin layer on top of the FFI.
- Lifecycle — when the backend lives, when the model lives, and how to share a model between threads.
- Error handling — the
LlamaErrorenum and the patterns for converting library errors into application errors.