Examples

The repository ships with 13 self-contained example crates in examples/, one per public feature. Each is a standalone Cargo crate you can copy from.

Quickstart
The smallest end-to-end program: load, tokenize, complete, chat, FIM. ~80 lines, fully annotated.

Plain completion
One-shot text completion with a custom sampler chain.

Streaming
Token-by-token output through the high-level callback API.

Multi-turn chat
A two-turn chat using a built-in template.

Stateful REPL
Interactive multi-turn REPL with /clear, /save, EOF handling.

Vision (mtmd)
Multimodal image + text with the high-level MtmdContext API.

Raw mtmd.h API
Lower-level mtmd.h API: bitmap → chunks → eval.

Embeddings
Embedding extraction with L2 normalisation.

Semantic search
BGE-small + cosine ranking over a small corpus.

Reranker
Bi-encoder reranker demo.

Tool calling
ToolDefinition + five ToolParser formats.

Structured output
JSON-Schema → GBNF → constrained JSON output.

Speculative decoding
PromptLookupDecoding draft decoding.
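The embeddings, semantic search and reranker examples all rest on the same arithmetic: L2-normalise the vectors, then rank by cosine similarity. The sketch below shows that arithmetic on toy vectors using only the standard library. It does not call llama-crab, and the function names `l2_normalize`, `dot` and `rank_by_cosine` are hypothetical helpers chosen for illustration, not part of the crate.

```rust
/// Scale a vector to unit L2 length; an all-zero vector is returned unchanged.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; equals cosine similarity when both inputs are unit length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Return corpus indices ordered from most to least similar to the query.
fn rank_by_cosine(query: &[f32], corpus: &[Vec<f32>]) -> Vec<usize> {
    let q = l2_normalize(query);
    let mut scored: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .map(|(i, d)| (i, dot(&q, &l2_normalize(d))))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}

fn main() {
    // Toy 2-D vectors standing in for real model embeddings.
    let corpus = vec![vec![1.0, 0.0], vec![0.0, 2.0], vec![3.0, 3.0]];
    let order = rank_by_cosine(&[2.0, 1.8], &corpus);
    println!("{:?}", order); // prints [2, 0, 1]
}
```

Normalising once up front means the ranking step needs only a dot product per document, which is why embedding pipelines usually store unit-length vectors.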

One-command runner

Every example is wrapped by examples/run.sh, which downloads the right model on first run and is idempotent afterwards:

./examples/run.sh quickstart            # ~400 MB — text only, smallest demo
./examples/run.sh chat                  # same model — interactive REPL
./examples/run.sh stateful_chat         # multi-turn REPL with /clear, /save
./examples/run.sh embeddings            # ~30 MB — BGE-small embedding
./examples/run.sh embedding_search      # BGE-small + cosine ranking
./examples/run.sh reranker              # bi-encoder scoring
./examples/run.sh vision gemma4         # ~5 GB — vision + text chat
./examples/run.sh vision lfm-vl         # ~1 GB — smaller vision model
./examples/run.sh mtmd gemma4           # raw mtmd.h API
./examples/run.sh tools                 # function calling
./examples/run.sh structured            # JSON-schema grammar
./examples/run.sh speculative           # prompt-lookup draft decoding

Without arguments, the script lists every available example.
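How run.sh implements its download step is not shown on this page. As a rough illustration of the "download once, then do nothing" behaviour, here is a minimal Rust sketch of one common pattern: skip the fetch when the target file exists, and otherwise write to a temporary name before renaming. The function `ensure_model`, the `.part` suffix and the fake fetch closure are all assumptions made for this example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return `dir/file_name`, calling `fetch` only when the file is missing.
fn ensure_model<F>(dir: &Path, file_name: &str, fetch: F) -> io::Result<PathBuf>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let target = dir.join(file_name);
    if target.is_file() {
        return Ok(target); // already cached: nothing to do
    }
    fs::create_dir_all(dir)?;
    // Write to a temporary name first so an interrupted download
    // is never mistaken for a complete model file.
    let partial = dir.join(format!("{file_name}.part"));
    fetch(&partial)?;
    fs::rename(&partial, &target)?;
    Ok(target)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("ensure_model_demo");
    // Stand-in for a real download: writes a few bytes.
    let fake_fetch = |dest: &Path| fs::write(dest, b"GGUF");
    let first = ensure_model(&dir, "tiny.gguf", fake_fetch)?;
    let second = ensure_model(&dir, "tiny.gguf", fake_fetch)?; // no fetch this time
    assert_eq!(first, second);
    println!("model at {}", first.display());
    Ok(())
}
```

Passing the fetch step in as a closure keeps the caching logic testable without network access.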

Full table

| Example | Model | Size | What it shows |
|---|---|---|---|
| `quickstart` | `Qwen2.5-0.5B-Instruct-GGUF` | ~400 MB | Load → tokenize → complete → chat → FIM |
| `simple` | any text GGUF | varies | Plain text completion |
| `streaming` | same as `quickstart` | ~400 MB | High-level token-by-token output |
| `chat` | instruct GGUF | varies | One-shot chat with a builtin template |
| `stateful_chat` | same as `quickstart` | ~400 MB | REPL with growing history, `/clear`, `/save` |
| `vision` | Gemma 4 or LFM2.5-VL + mmproj | ~1–5 GB | High-level `MtmdContext` vision chat |
| `mtmd` | Gemma 4 + mmproj | ~5 GB | Raw `mtmd.h` API: bitmap → chunks → eval |
| `embeddings` | `bge-small-en-v1.5-gguf` | ~30 MB | Embedding extraction + L2 norm |
| `embedding_search` | `bge-small-en-v1.5-gguf` | ~30 MB | Semantic search with cosine ranking |
| `reranker` | embedding GGUF | varies | Bi-encoder ranking by cosine similarity |
| `tools` | tool-aware instruct GGUF | varies | `ToolDefinition` + 5 `ToolParser` formats |
| `structured` | any text GGUF | varies | `json_schema_grammar()` + JSON parsing |
| `speculative` | any text GGUF | varies | `prompt-lookup` n-gram draft |
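For readers unfamiliar with the n-gram draft mentioned in the `speculative` row, the general prompt-lookup idea can be sketched in a few lines: take the last n tokens of the context, find an earlier occurrence of that n-gram, and propose the tokens that followed it as the draft. This is a generic illustration of the technique, not the crate's PromptLookupDecoding implementation. The function name `prompt_lookup_draft` and its parameters are hypothetical.

```rust
/// Propose up to `max_draft` tokens by finding the most recent earlier
/// occurrence of the last `ngram` tokens and copying what followed it.
fn prompt_lookup_draft(tokens: &[u32], ngram: usize, max_draft: usize) -> Vec<u32> {
    if ngram == 0 || tokens.len() <= ngram {
        return Vec::new();
    }
    let suffix_start = tokens.len() - ngram;
    let suffix = &tokens[suffix_start..];
    // Scan candidate start positions right to left, excluding the suffix itself.
    for start in (0..suffix_start).rev() {
        if &tokens[start..start + ngram] == suffix {
            let from = start + ngram;
            let to = (from + max_draft).min(tokens.len());
            return tokens[from..to].to_vec();
        }
    }
    Vec::new() // no earlier match: nothing to draft
}

fn main() {
    // The context ends in [7, 8], which also appears at the start.
    let ctx: Vec<u32> = vec![7, 8, 9, 10, 11, 7, 8];
    let draft = prompt_lookup_draft(&ctx, 2, 3);
    println!("{:?}", draft); // prints [9, 10, 11]
}
```

In a real speculative decoder the drafted tokens are only candidates: the target model still verifies them and keeps the accepted prefix.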

Passing a different model

Every example accepts the GGUF path as the first positional argument:

cargo run --release --bin run_quickstart -- models/llama-3.2-1b-instruct-q4_k_m.gguf

Vision examples take three positional arguments, in this order: <text.gguf> <mmproj.gguf> <image>.
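If you are writing your own example and want the same calling convention, the positional arguments can be read with the standard library alone. The sketch below is one possible way to do it, not the code the shipped examples use. The helper names `model_path` and `vision_args`, the `VisionArgs` struct and the fallback path are assumptions made for illustration.

```rust
use std::path::PathBuf;

/// Positional arguments for a vision example, in the order given above.
#[derive(Debug, PartialEq)]
struct VisionArgs {
    text_model: PathBuf,
    mmproj: PathBuf,
    image: PathBuf,
}

/// Text examples: first positional argument, or a fallback path.
fn model_path(mut args: impl Iterator<Item = String>, default: &str) -> PathBuf {
    args.next()
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(default))
}

/// Vision examples: three leading positional arguments, all required.
fn vision_args(mut args: impl Iterator<Item = String>) -> Result<VisionArgs, String> {
    let mut next = |name: &str| {
        args.next()
            .map(PathBuf::from)
            .ok_or_else(|| format!("missing argument: <{name}>"))
    };
    Ok(VisionArgs {
        text_model: next("text.gguf")?,
        mmproj: next("mmproj.gguf")?,
        image: next("image")?,
    })
}

fn main() {
    // skip(1) drops the program name.
    let path = model_path(std::env::args().skip(1), "models/your.gguf");
    println!("using model {}", path.display());

    // Hard-coded demo input for the three-argument form.
    let demo = vec!["text.gguf", "mmproj.gguf", "cat.png"];
    println!("{:?}", vision_args(demo.into_iter().map(String::from)));
}
```

Taking an iterator rather than reading `std::env::args()` inside the helpers keeps them easy to unit-test.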

Adding a new example

The boilerplate for a new example crate is ~15 lines:

examples/my_example/Cargo.toml

[package]
name = "my_example"
version.workspace = true
edition.workspace = true
rust-version.workspace = true
publish = false

[[bin]]
name = "run_my_example"
path = "src/main.rs"

[dependencies]
llama-crab = { path = "../../llama-crab", version = "0.1.0" }
anyhow = "1"

examples/my_example/src/main.rs

use anyhow::Result;
use llama_crab::{Llama, LlamaParams};

fn main() -> Result<()> {
    // Load the model from a GGUF file on disk.
    let mut llama = Llama::load(LlamaParams::new("models/your.gguf"))?;
    // Run a short completion for the prompt and print the generated text.
    let resp = llama.create_completion("Hello!", 32)?;
    println!("{}", resp.text);
    Ok(())
}

Then add examples/my_example to the members = [...] list in the root Cargo.toml, and add a row for it to the table on this page.

Where to next?

Quickstart — the smallest end-to-end program.
Streaming — the most common request from app developers.
Vision (mtmd) — if you want to feed images to a model.
Chatbot recipe — when a single example isn't enough and you need to wire a full agent.