mtmd — Raw mtmd.h API¶
A lower-level counterpart to the vision example:
loads a text model and an mmproj projector, embeds an image and a
prompt into a chunk list, then drives the decode loop by hand. Use
this when you need direct access to MtmdBitmap / MtmdInputText /
chunks.eval.
The higher-level vision example does the same thing
in fewer lines through the MtmdContext helper.
Run¶
Downloads the Gemma 4 text GGUF + its mmproj projector (~5 GB total).
What it does¶
use llama_crab::batch::LlamaBatch;
use llama_crab::multimodal::{default_media_marker, MtmdBitmap, MtmdContext, MtmdInputText};
use llama_crab::sampling::LlamaSampler;
use llama_crab::token::LlamaToken;
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(4096))?;
let mtmd = MtmdContext::init_from_file("mmproj.gguf", llama.model())?;
let bitmap = MtmdBitmap::from_file("image.png")?;
let marker = default_media_marker();
let prompt = format!("{marker}\nDescribe this image in one short sentence.");
let chunks = mtmd.tokenize(MtmdInputText::new(&prompt), &[&bitmap])?;
let ctx_ptr = llama.context().raw_handle();
let mut n_past = unsafe {
chunks.eval(&mtmd, ctx_ptr, 0, 0, llama.context().n_batch() as i32, true)?
};
// Standard decode loop with a greedy sampler.
let mut sampler = LlamaSampler::greedy()?;
let eos = llama.model().token_eos();
let mut out = String::new();
for i in 0..96 {
let idx = if i == 0 { -1 } else { 0 };
let tok: LlamaToken = unsafe { sampler.sample(ctx_ptr, idx) };
sampler.accept(tok);
if tok == eos { break; }
out.push_str(&llama.model().detokenize(&[tok], false)?);
let single = LlamaBatch::one(tok, n_past + i as i32, 0, true);
llama.context().decode(&single)?;
}
println!("{out}");
Walk-through¶
default_media_marker()¶
The marker is the special string the projector expects at the position of each image. The marker is model-specific:
- Gemma 4 uses
<start_of_image>. - LFM2.5-VL uses
<|image|>.
default_media_marker() returns whatever the loaded MtmdContext
expects, so you don't need to hard-code it.
MtmdInputText¶
The text part of the prompt. It can include the marker inline (as above) or rely on the projector to insert the marker automatically.
chunks.eval¶
Evaluates the multimodal chunks into the KV cache:
let new_n_past = unsafe {
chunks.eval(&mtmd, ctx_ptr, 0, 0, llama.context().n_batch() as i32, true)?
};
&mtmd— the active context.ctx_ptr— the raw*mut llama_contextfromllama.context().raw_handle().seq_id— usually0.n_past— usually0(start a fresh position) or the current position if you want to extend an existing context.n_batch— the logical batch size; usellama.context().n_batch().logits_last—trueso the next sample reads the right position.
The function returns the new n_past, which is where the sampler
should start.
The decode loop¶
The rest of the example is a standard sampler loop:
flowchart LR
S[sample] --> A[accept]
A --> D{more tokens?}
D -- yes --> S
D -- no --> OUT[print out]
LlamaSampler::greedy() is the fastest and the most reproducible
sampler. For higher-quality output, build a SamplerChain.
Tested models¶
| Model | Status |
|---|---|
lmstudio-community/gemma-4-E4B-it-GGUF |
✅ |
unsloth/LFM2.5-VL-1.6B-GGUF |
✅ |
Qwen2.5-VL |
Compatible via mtmd. |
Llama-3.2-Vision |
Compatible via mtmd. |
The same flow is exercised by the integration tests in
llama-crab/tests/gemma4_vision.rs and
llama-crab/tests/lfm_vl_vision.rs.
Full source¶
Where to next?¶
- Multimodal guide — the data flow and the chunk-evaluation API.
- Vision (high level) — if you don't need the raw loop.
- Server with vision — the HTTP path.