Architecture
This page walks through the data flow of a single completion request
in llama-crab, from the moment you call Llama::load until the
last generated token lands in a String. It's the mental model you
need before you read the lower-level API, or when you hit a wall that
the high-level helpers don't paper over.
Big picture
flowchart TB
A[Client code] -->|Llama::load| B[LlamaBackend]
A -->|create_completion| C[Llama orchestrator]
B -->|loads libggml| D[GGML backend]
B -->|keeps| E[LlamaModel]
E -->|weights + tokenizer + metadata| F[GGUF file]
C -->|requests| G[LlamaContext]
G -->|allocates| H[KV cache]
G -->|drives| I[forward pass]
I -->|logits| J[LlamaSampler]
J -->|next token| C
C -->|feed forward| G
C -->|detokenise| K[Output text]
The high-level [Llama] orchestrator owns the model, the context and
the default sampler. It exposes a handful of methods that hide the
loop illustrated above behind a single function call:
let mut llama = Llama::load(LlamaParams::new("model.gguf"))?;
let resp = llama.create_completion("Hello", 32)?;
println!("{}", resp.text);
When you need finer-grained control, every step of the loop is exposed through a typed API. The remainder of this page describes each step.
Step 1: Initialise the backend
Before any llama.cpp call, the native library has to be initialised. This sets up the global GGML state, registers the backends compiled into the binary, and configures the thread pools.
use llama_crab::LlamaBackend;
// NumaStrategy must be in scope too; its module path is not shown on this page.
// Implicit, called by Llama::load:
let _backend = LlamaBackend::init()?;
// Explicit, if you drive the lower-level API directly:
let _backend = LlamaBackend::init_numa(NumaStrategy::Distribute)?;
- LlamaBackend::init(): initialises the default backend.
- LlamaBackend::init_numa(strategy): same, but with an explicit NUMA placement strategy (Distribute, Isolate, Numactl).
- The returned guard owns the backend; dropping it tears the underlying state down. As long as a LlamaModel or LlamaContext is alive, the backend must be alive too.
The guard is Send + Sync so it can sit in a OnceLock or an
Arc<LlamaBackend> in a multi-threaded binary.
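The sharing pattern mentioned above can be sketched with the standard library alone. ToyBackend and INIT_CALLS below are hypothetical stand-ins, not llama-crab types; the point is only that a OnceLock runs the initialiser once, however many threads ask for the guard.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Hypothetical stand-in for the backend guard; not the llama-crab type.
struct ToyBackend;

/// Counts how often the (pretend) expensive initialisation actually ran.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

impl ToyBackend {
    fn init() -> ToyBackend {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        ToyBackend
    }
}

/// Process-wide slot for the guard, filled on first use.
static BACKEND: OnceLock<ToyBackend> = OnceLock::new();

/// Returns the shared guard, initialising it on the first call only.
fn backend() -> &'static ToyBackend {
    BACKEND.get_or_init(ToyBackend::init)
}

/// Spawns `n_threads` workers that all request the guard, then reports
/// how many times the initialiser ran in total.
fn init_calls_after_sharing(n_threads: usize) -> usize {
    let workers: Vec<_> = (0..n_threads)
        .map(|_| {
            thread::spawn(|| {
                let _guard: &'static ToyBackend = backend();
            })
        })
        .collect();
    for worker in workers {
        worker.join().expect("worker thread panicked");
    }
    INIT_CALLS.load(Ordering::SeqCst)
}
```

One trade-off of this design: a value stored in a static is never dropped, so the teardown only happens when the process exits. If you need a deterministic teardown, an Arc whose last clone is dropped after every model and context is the alternative.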
Step 2: Load the model
The model holds the weights, the tokenizer and the metadata stored
in the GGUF container. Loading is the single most expensive step
in any llama-crab program: expect anywhere from 0.5 s (for a 0.5 B
Q4 quant) to 30 s (for a 70 B Q4 quant on a cold disk).
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(
    LlamaParams::new("model.gguf")
        .with_n_ctx(2048)
        .with_n_gpu_layers(99),
)?;
Behind the scenes the orchestrator:
- Calls LlamaBackend::init() (or reuses an existing guard).
- Memory-maps the GGUF file (when supported) and parses the metadata.
- Creates a LlamaModel, which allocates the weight tensors and loads them onto the active backend (CPU, GPU, or a mix).
- Creates a default LlamaContext with the requested context size.
The Llama struct is essentially LlamaModel + LlamaContext +
default state, so you rarely have to touch the model and context
directly when you stay on the high-level path.
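The requested context size matters because the context allocates a KV cache that grows linearly with it. The function below is a back-of-the-envelope estimate for a standard transformer, not something llama-crab exposes; the shape numbers in the usage note are illustrative assumptions rather than values read from a GGUF file.

```rust
/// Rough KV cache size in bytes for a standard transformer:
/// one key vector and one value vector (hence the factor 2)
/// per layer, per cached position, per KV head.
fn kv_cache_bytes(
    n_layers: u64,
    n_ctx: u64,
    n_kv_heads: u64,
    head_dim: u64,
    bytes_per_elem: u64,
) -> u64 {
    2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
}
```

For a model with 32 layers, 32 KV heads and a head dimension of 128, an n_ctx of 2048 with 2-byte (f16) cache entries gives kv_cache_bytes(32, 2048, 32, 128, 2) = 1 GiB. Doubling n_ctx doubles that figure, which is worth keeping in mind before raising with_n_ctx.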
Step 3: Tokenise the prompt
Tokenisation converts a UTF-8 string into the integer ids the model
operates on. The tokeniser is part of the GGUF file (or, with the
hf-tokenizer feature, can be loaded from a separate tokenizer.json).
let prompt = "The capital of France is";
let tokens = llama.model().tokenize(prompt, /*add_bos*/ true, /*special*/ true)?;
The high-level helpers tokenise for you:
let resp = llama.create_completion(prompt, 32)?;
// → internally: tokenize → decode → sample loop → detokenize
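To make the string-to-ids step concrete, here is a toy tokeniser over a hand-written vocabulary. It is only a sketch: real GGUF tokenisers use BPE or SentencePiece-style algorithms, and VOCAB, BOS and the greedy longest-match rule below are assumptions made up for this example.

```rust
/// Toy vocabulary: the index in the slice is the token id.
/// Id 0 plays the role of the beginning-of-sequence marker.
const VOCAB: &[&str] = &["<s>", "The", " capital", " of", " France", " is", " ", "a", "e"];
const BOS: u32 = 0;

/// Greedy longest-match tokeniser over VOCAB.
/// Returns None when some part of the input is not covered by the vocabulary.
fn tokenize(text: &str, add_bos: bool) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    if add_bos {
        ids.push(BOS);
    }
    let mut rest = text;
    while !rest.is_empty() {
        // Pick the longest vocabulary entry that prefixes the remaining text.
        let (id, piece) = VOCAB
            .iter()
            .enumerate()
            .skip(1) // never match the BOS marker against plain text
            .filter(|(_, p)| rest.starts_with(**p))
            .max_by_key(|(_, p)| p.len())?;
        ids.push(id as u32);
        rest = &rest[piece.len()..];
    }
    Some(ids)
}
```

With add_bos set, the prompt "The capital of France is" becomes [0, 1, 2, 3, 4, 5]; without it the leading 0 is dropped. Note how the leading space belongs to the following word, which is typical of the vocabularies these models ship with.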
Step 4: Forward pass (decode)
The tokenised prompt is wrapped in a LlamaBatch and submitted to
the context with decode. This runs the model over the batch and
returns the logits of the last token.
use llama_crab::batch::LlamaBatch;
let mut batch = LlamaBatch::new(tokens.len(), 1);
batch.add_sequence(&tokens, 0, /*logits_all*/ false);
batch.prepare();
llama.context().decode(&batch)?;
The KV cache stored in the context is updated in place, so the next call only needs to feed the newly generated token.
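The effect of the cache on the amount of work per call can be sketched with a toy context that only does the bookkeeping. ToyContext is a hypothetical type for illustration; it runs no model and is not the LlamaContext API.

```rust
/// Toy stand-in for a context: it tracks how many positions are cached
/// and how many token positions each decode call had to process.
struct ToyContext {
    n_ctx: usize,         // size of the context window
    n_past: usize,        // positions already held in the KV cache
    work_log: Vec<usize>, // tokens processed by each decode call
}

impl ToyContext {
    fn new(n_ctx: usize) -> Self {
        ToyContext { n_ctx, n_past: 0, work_log: Vec::new() }
    }

    /// Appends `tokens` to the cache. Fails when the window would overflow.
    fn decode(&mut self, tokens: &[u32]) -> Result<(), String> {
        if self.n_past + tokens.len() > self.n_ctx {
            return Err(format!(
                "context full: {} cached + {} new > {}",
                self.n_past,
                tokens.len(),
                self.n_ctx
            ));
        }
        self.n_past += tokens.len();
        self.work_log.push(tokens.len());
        Ok(())
    }
}

/// Decodes a 5-token prompt in one batch, then three single tokens.
fn demo_work_log() -> Vec<usize> {
    let mut ctx = ToyContext::new(16);
    ctx.decode(&[1, 2, 3, 4, 5]).unwrap(); // prompt: whole batch at once
    for tok in [6, 7, 8] {
        ctx.decode(&[tok]).unwrap(); // generation: one new token per call
    }
    ctx.work_log
}
```

The work log for this run is [5, 1, 1, 1]: only the prompt pays for more than one position, and every later call processes a single new token against the cached ones.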
Step 5: Sample the next token
The logits are passed to a LlamaSampler, which implements a
particular decoding strategy. The default sampler chain is
greedy, but you can compose a custom chain with SamplerChain:
use llama_crab::sampling::{LlamaSampler, SamplerChain};
let mut sampler = SamplerChain::new()
    .temp(0.8)
    .top_p(0.95, 1)
    .min_p(0.05, 1)
    .penalties(64, 1.1, 0.0, 0.0)
    .build();
// -1 follows the llama.cpp convention: sample from the logits of the last decoded token.
let next_token = unsafe { sampler.sample(llama.context().raw_handle(), -1) };
sampler.accept(next_token);
See the Sampling strategies guide for the full menu of samplers and recommended chains.
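The arithmetic behind two of the links in that chain, temperature and top-p, is standard and can be sketched without llama-crab. The functions below are an illustration under simplifying assumptions: the order of the two steps is chosen for readability, and the random draw is replaced by a caller-supplied number u in [0, 1) so the example stays deterministic and dependency-free.

```rust
/// Softmax over logits divided by `temp`. A lower temperature sharpens
/// the distribution, a higher one flattens it.
fn softmax_with_temperature(logits: &[f32], temp: f32) -> Vec<f32> {
    // Subtract the maximum first for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temp).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Nucleus (top-p) filter: keeps the smallest set of most probable tokens
/// whose cumulative probability reaches `p`, and renormalises it.
/// Returns (token id, probability) pairs, most probable first.
fn top_p(probs: &[f32], p: f32) -> Vec<(usize, f32)> {
    let mut ranked: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut kept = Vec::new();
    let mut cum = 0.0_f32;
    for (id, prob) in ranked {
        kept.push((id, prob));
        cum += prob;
        if cum >= p {
            break;
        }
    }
    let total: f32 = kept.iter().map(|&(_, q)| q).sum();
    kept.into_iter().map(|(id, q)| (id, q / total)).collect()
}

/// Inverse-CDF draw: walks the kept set until the cumulative
/// probability exceeds `u`, where `u` is uniform in [0, 1).
fn pick(kept: &[(usize, f32)], u: f32) -> usize {
    let mut cum = 0.0_f32;
    for &(id, q) in kept {
        cum += q;
        if u < cum {
            return id;
        }
    }
    // Rounding can leave `cum` just below 1.0; fall back to the last candidate.
    kept.last().map(|&(id, _)| id).expect("kept set must not be empty")
}
```

For logits [2.0, 1.0, 0.0] at temperature 1.0, the probabilities are roughly 0.67, 0.24 and 0.09, so a top-p of 0.9 keeps the first two tokens and discards the third before the draw.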
Step 6: Append and continue
The selected token is fed back into the context as a new batch of size 1. The KV cache is reused, so this forward pass is the cheapest one in the loop.
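use llama_crab::batch::LlamaBatch;
// n_past is not defined in this snippet: it is the running count of tokens
// already decoded, starting at the prompt length (tokens.len()).
let single = LlamaBatch::one(next_token, n_past, 0, true);
llama.context().decode(&single)?;
n_past += 1;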
Steps 5 and 6 repeat until the sampler emits the EOS token, a stop sequence is matched, or the maximum token count is reached.
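The three exit conditions can be sketched as a plain loop around a stand-in model. Everything here is hypothetical: ToyStopReason is not the crate's StopReason, the model is a closure from the token history to the next id, and the stop sequence is matched on token ids for simplicity, whereas a real implementation would more likely match on decoded text.

```rust
/// Why the loop ended. Variant names are made up for this sketch.
#[derive(Debug, PartialEq)]
enum ToyStopReason {
    Eos,
    StopSequence,
    MaxTokens,
}

/// Runs the sample/append loop. `next_token` stands in for
/// "decode + sample": it maps the token history to the next token id.
fn generate(
    mut next_token: impl FnMut(&[u32]) -> u32,
    prompt: &[u32],
    eos: u32,
    stop: &[u32],
    max_tokens: usize,
) -> (Vec<u32>, ToyStopReason) {
    let mut history = prompt.to_vec();
    let mut generated = Vec::new();
    while generated.len() < max_tokens {
        let tok = next_token(&history);
        if tok == eos {
            // The EOS token itself is not part of the output.
            return (generated, ToyStopReason::Eos);
        }
        history.push(tok);
        generated.push(tok);
        if !stop.is_empty() && generated.ends_with(stop) {
            return (generated, ToyStopReason::StopSequence);
        }
    }
    (generated, ToyStopReason::MaxTokens)
}
```

With a toy model that always emits "last token + 1" and the prompt [1, 2], a limit of three tokens yields [3, 4, 5] and MaxTokens, while setting eos to 5 ends the run after [3, 4] with Eos.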
Step 7: Detokenise
The selected token ids are mapped back to text with the model's tokeniser.
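One detail is worth a sketch: with byte-level vocabularies, a single multi-byte UTF-8 character can be split across two tokens, so the bytes have to be joined before they are decoded as text. The piece table and both functions below are hypothetical and only illustrate that point; they are not the llama-crab detokenisation API.

```rust
/// Joins the raw bytes of all tokens first, then decodes once.
/// `pieces[id]` holds the bytes of token `id`; an unknown id panics.
fn detokenize(pieces: &[&[u8]], tokens: &[usize]) -> String {
    let mut bytes = Vec::new();
    for &t in tokens {
        bytes.extend_from_slice(pieces[t]);
    }
    String::from_utf8_lossy(&bytes).into_owned()
}

/// Decodes every token on its own. Shown only for contrast: a character
/// split across tokens turns into replacement characters.
fn detokenize_per_token(pieces: &[&[u8]], tokens: &[usize]) -> String {
    tokens
        .iter()
        .map(|&t| String::from_utf8_lossy(pieces[t]).into_owned())
        .collect()
}
```

With the piece table [b"caf", &[0xC3], &[0xA9], b"!"], where the two middle entries are the two bytes of "é", the first function returns "café!" for the ids [0, 1, 2, 3], while the per-token variant produces two U+FFFD replacement characters in place of the "é". The same concern applies when streaming tokens to a terminal: incomplete byte sequences need to be buffered until the character is whole.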
The high-level Completion struct combines the generated tokens, the
text, the per-token log probabilities and the timings:
pub struct Completion {
    pub text: String,
    pub tokens: Vec<LlamaToken>,
    pub logprobs: Option<CompletionLogprobs>,
    pub timings: CompletionTimings,
    pub stop_reason: StopReason,
}
Where the multimodal stack fits in
The mtmd feature adds a parallel pipeline for vision and audio
inputs. The text model stays the same; the MtmdContext and
MtmdBitmap types encode images (or audio) into the same token
stream used by the rest of the API.
flowchart LR
IMG[Image bytes] -->|MtmdBitmap::from_file| BM[MtmdBitmap]
TXT[Prompt text] -->|MtmdInputText| IT[Input]
BM -->|tokenise| MTMD[MtmdContext]
IT -->|tokenise| MTMD
MTMD -->|chunks| EVAL[chunks.eval]
EVAL -->|extends KV cache| CTX[LlamaContext]
CTX -->|normal sampling| OUT[Generated text]
See the Multimodal guide for the end-to-end flow.
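The idea of text and media sharing one stream can be sketched as position bookkeeping. ToyChunk and chunk_offsets are hypothetical and assume the simplest case, where an encoded image occupies a fixed number of consecutive positions in the KV cache; they do not mirror the MtmdContext API.

```rust
/// Hypothetical chunk of a mixed prompt: either text token ids or an
/// encoded image that takes up `n_positions` slots in the KV cache.
enum ToyChunk {
    Text(Vec<u32>),
    Image { n_positions: usize },
}

impl ToyChunk {
    /// Number of KV cache positions the chunk consumes.
    fn len(&self) -> usize {
        match self {
            ToyChunk::Text(tokens) => tokens.len(),
            ToyChunk::Image { n_positions } => *n_positions,
        }
    }
}

/// Lays the chunks out one after another starting at `n_past`.
/// Returns the start position of each chunk and the final `n_past`.
fn chunk_offsets(chunks: &[ToyChunk], mut n_past: usize) -> (Vec<usize>, usize) {
    let mut starts = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        starts.push(n_past);
        n_past += chunk.len();
    }
    (starts, n_past)
}
```

A prompt made of three text tokens, an image worth four positions and two more text tokens starts its chunks at positions 0, 3 and 7 and leaves n_past at 9, which is where normal sampling then continues.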
Where to next?
- Lifecycle — when the model and context come up and tear down, and how to share them across threads.
- Sampling strategies — every available sampler and how to chain them.
- Backends & GPU offload — what the active backend does to the data flow above.