`streaming` — High-level token streaming¶

Use Llama::create_completion_stream when you want synchronous token-by-token output while still receiving the final Completion. The callback receives text chunks as they become available and returns StreamControl::Continue or StreamControl::Stop.

Run¶

./examples/run.sh streaming

The script downloads the same ~400 MB Qwen2.5 0.5B model that the quickstart example uses.

What it does¶

use std::io::{self, Write};

use llama_crab::{CompletionOptions, Llama, LlamaParams, StreamControl};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(512))?;
    let prompt = "Write one short sentence about Rust.";
    let mut stdout = io::stdout().lock();

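    // The callback cannot return a Result, so the first write error is stored here
    // and propagated after the stream returns.
    let mut write_error: Option<io::Error> = None;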
    let completion = llama.create_completion_stream(
        prompt,
        CompletionOptions::new(64).with_stop_sequence("\n\n"),
        |chunk| {
            if let Err(err) = write!(stdout, "{}", chunk.text).and_then(|_| stdout.flush()) {
                write_error = Some(err);
                return StreamControl::Stop;
            }
            StreamControl::Continue
        },
    )?;

    if let Some(err) = write_error {
        return Err(err.into());
    }
    writeln!(stdout)?;
    Ok(())
}

Capturing I/O errors¶

The callback cannot return a Result, so capture I/O errors and return StreamControl::Stop; after the stream returns, propagate the captured error:

let mut write_error: Option<io::Error> = None;

let completion = llama.create_completion_stream(
    "Write one short sentence about Rust.",
    CompletionOptions::new(64),
    |chunk| {
        if let Err(err) = write!(stdout, "{}", chunk.text).and_then(|_| stdout.flush()) {
            write_error = Some(err);
            return StreamControl::Stop;
        }
        StreamControl::Continue
    },
)?;

if let Some(err) = write_error {
    return Err(err.into());
}
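The capture-and-propagate pattern can be tried without loading a model. The sketch below is only an illustration: `StreamControl`, `stream_chunks`, `FailingWriter`, and `stream_to` are hypothetical stand-ins defined locally, not the crate's API. It feeds fixed chunks to a callback and writes them to a sink that fails partway through, so the stop-on-error path is actually exercised.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the crate's `StreamControl`.
#[derive(Debug, PartialEq)]
enum StreamControl {
    Continue,
    Stop,
}

/// Hypothetical stand-in for the streaming call: feeds chunks to the
/// callback until they run out or the callback asks to stop.
/// Returns the number of chunks delivered to the callback.
fn stream_chunks<F>(chunks: &[&str], mut on_chunk: F) -> usize
where
    F: FnMut(&str) -> StreamControl,
{
    let mut delivered = 0;
    for &chunk in chunks {
        delivered += 1;
        if on_chunk(chunk) == StreamControl::Stop {
            break;
        }
    }
    delivered
}

/// Test sink that accepts `budget` bytes in total and then fails.
struct FailingWriter {
    budget: usize,
    written: Vec<u8>,
}

impl Write for FailingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if buf.len() > self.budget {
            return Err(io::Error::new(io::ErrorKind::BrokenPipe, "sink closed"));
        }
        self.budget -= buf.len();
        self.written.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Streams `chunks` into `out`, capturing the first write error
/// instead of returning it from the callback.
fn stream_to<W: Write>(chunks: &[&str], out: &mut W) -> (usize, io::Result<()>) {
    let mut write_error: Option<io::Error> = None;
    let delivered = stream_chunks(chunks, |text| {
        if let Err(err) = write!(out, "{}", text).and_then(|_| out.flush()) {
            write_error = Some(err);
            return StreamControl::Stop;
        }
        StreamControl::Continue
    });
    // Propagate the captured error only after the stream has returned.
    let result = match write_error {
        Some(err) => Err(err),
        None => Ok(()),
    };
    (delivered, result)
}

fn main() {
    let chunks = ["Rust ", "is ", "fast."];
    let mut sink = FailingWriter { budget: 6, written: Vec::new() };
    let (delivered, result) = stream_to(&chunks, &mut sink);
    let written = String::from_utf8_lossy(&sink.written);
    println!("delivered {delivered} chunks, wrote {written:?}, result: {result:?}");
}
```

Keeping only the first error is a deliberate choice here: once the callback returns `StreamControl::Stop`, no further writes are attempted, so there is nothing later to overwrite it.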

For quick demos where stdout errors are not important, the callback can ignore them:

llama.create_completion_stream(
    "Write one short sentence about Rust.",
    CompletionOptions::new(64),
    |chunk| {
        let _ = write!(stdout, "{}", chunk.text);
        let _ = stdout.flush();
        StreamControl::Continue
    },
)?;

Stopping the stream early¶

Returning StreamControl::Stop from the callback halts the generation loop. The Completion returned by the call is still populated with whatever was generated so far:

let mut stopped = false;
let completion = llama.create_completion_stream(
    "List 10 colors, one per line:",
    CompletionOptions::new(256),
    |chunk| {
        print!("{}", chunk.text);
        if chunk.text.contains("done") {
            stopped = true;
            return StreamControl::Stop;
        }
        StreamControl::Continue
    },
)?;
println!("\nstopped: {stopped}");
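One caveat with checking `chunk.text` alone is that a marker can be split across two chunks, in which case a per-chunk `contains` never matches. A possible workaround is to accumulate the text seen so far and search that instead. The sketch below uses locally defined stand-ins (`StreamControl`, `stream_chunks`, `stream_until` are hypothetical, not the crate's API) and assumes the chunk that triggered the stop is still part of the returned text.

```rust
/// Hypothetical stand-in for the crate's `StreamControl`.
#[derive(Debug, PartialEq)]
enum StreamControl {
    Continue,
    Stop,
}

/// Hypothetical stand-in for the streaming call. It returns everything
/// generated up to the point where the callback asked to stop.
fn stream_chunks<F>(chunks: &[&str], mut on_chunk: F) -> String
where
    F: FnMut(&str) -> StreamControl,
{
    let mut generated = String::new();
    for &chunk in chunks {
        generated.push_str(chunk);
        if on_chunk(chunk) == StreamControl::Stop {
            break;
        }
    }
    generated
}

/// Streams until `marker` appears anywhere in the accumulated text,
/// even when the marker is split across chunk boundaries.
fn stream_until(chunks: &[&str], marker: &str) -> (String, bool) {
    let mut seen = String::new();
    let mut stopped = false;
    let generated = stream_chunks(chunks, |text| {
        seen.push_str(text);
        if seen.contains(marker) {
            stopped = true;
            return StreamControl::Stop;
        }
        StreamControl::Continue
    });
    (generated, stopped)
}

fn main() {
    // "done" arrives as "do" + "ne", so no single chunk contains it.
    let chunks = ["red\n", "blue\n", "do", "ne", "\ngreen"];
    let (generated, stopped) = stream_until(&chunks, "done");
    println!("generated: {generated:?}, stopped: {stopped}");
}
```

Searching the whole accumulated string on every chunk is quadratic in the worst case. For long outputs, searching only the last `marker.len()` bytes plus the new chunk would be enough.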

How it works internally¶

sequenceDiagram
    participant App
    participant Llama
    participant Sampler

    App->>Llama: create_completion_stream(prompt, opts, callback)
    Llama->>Llama: tokenize(prompt)
    Llama->>Sampler: sample
    Sampler-->>Llama: token
    Llama->>Llama: detokenize(token)
    Llama->>App: callback(chunk)
    App-->>Llama: StreamControl::Continue
    Llama->>Sampler: accept(token)
    Note over Llama,Sampler: repeat until EOS, a stop sequence, or StreamControl::Stop
    Llama-->>App: Completion

The streaming helper uses the same high-level completion path as create_completion: it clears sequence 0 before each call and does not enable automatic prompt-cache reuse between calls. For custom sampling, batching, or manual KV/session reuse, use the lower-level context, batch and sampler APIs directly.
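The loop in the diagram can be approximated in a few lines. This is a rough sketch under stated assumptions, not the library's implementation: `ScriptedSampler`, `generate`, `FinishReason`, and the toy vocabulary are hypothetical, the token limit is assumed to be a simple loop bound, and stop-sequence handling is left out for brevity. It follows the order shown above: sample, detokenize, invoke the callback, then accept the token only if the callback returned `Continue`.

```rust
/// Hypothetical stand-in for the crate's `StreamControl`.
#[derive(Debug, PartialEq)]
enum StreamControl {
    Continue,
    Stop,
}

/// Why the loop ended (illustrative only).
#[derive(Debug, PartialEq)]
enum FinishReason {
    Eos,
    Callback,
    MaxTokens,
}

/// Token id reserved for end-of-sequence in this toy vocabulary.
const EOS: usize = 0;

/// Toy sampler that replays a fixed script of token ids.
struct ScriptedSampler {
    script: Vec<usize>,
    pos: usize,
    accepted: Vec<usize>,
}

impl ScriptedSampler {
    fn new(script: Vec<usize>) -> Self {
        Self { script, pos: 0, accepted: Vec::new() }
    }

    /// Returns the next scripted token, or EOS once the script runs out.
    fn sample(&mut self) -> usize {
        let token = self.script.get(self.pos).copied().unwrap_or(EOS);
        self.pos += 1;
        token
    }

    /// Records a token as accepted, mirroring `accept(token)` in the diagram.
    fn accept(&mut self, token: usize) {
        self.accepted.push(token);
    }
}

/// Runs the sample / detokenize / callback / accept loop.
fn generate<F>(
    vocab: &[&str],
    sampler: &mut ScriptedSampler,
    max_tokens: usize,
    mut callback: F,
) -> (String, FinishReason)
where
    F: FnMut(&str) -> StreamControl,
{
    let mut text = String::new();
    for _ in 0..max_tokens {
        let token = sampler.sample();
        if token == EOS {
            return (text, FinishReason::Eos);
        }
        let piece = vocab[token]; // "detokenize" via table lookup
        text.push_str(piece);
        if callback(piece) == StreamControl::Stop {
            return (text, FinishReason::Callback);
        }
        sampler.accept(token);
    }
    (text, FinishReason::MaxTokens)
}

fn main() {
    let vocab = ["<eos>", "Rust", " is", " fast"];
    let mut sampler = ScriptedSampler::new(vec![1, 2, 3, 0]);
    let (text, reason) = generate(&vocab, &mut sampler, 16, |piece| {
        print!("{piece}");
        StreamControl::Continue
    });
    println!("\ntext: {text:?}, reason: {reason:?}, accepted: {:?}", sampler.accepted);
}
```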

Streaming + log probabilities¶

If you enable log probabilities on the options with with_logprobs, each chunk carries the per-token log probabilities:

CompletionOptions::new(64)
    .with_logprobs(true, 5)

The chunk.logprobs field is Some(...) on every chunk, including the partial one. Use the field to display alternatives in a UI or to compute a confidence score.
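One simple way to turn log probabilities into a confidence score is the geometric mean of the token probabilities, which is the exponential of the mean log probability. The sketch below assumes the values are natural-log probabilities and have already been collected into a plain `&[f32]`. The exact shape of `chunk.logprobs` is not shown on this page, so extracting the values from it is left out, and `confidence` is a hypothetical helper.

```rust
/// Geometric-mean token probability, computed from natural-log
/// probabilities. Returns `None` for an empty slice.
fn confidence(token_logprobs: &[f32]) -> Option<f32> {
    if token_logprobs.is_empty() {
        return None;
    }
    let mean = token_logprobs.iter().sum::<f32>() / token_logprobs.len() as f32;
    // exp(mean of logs) is the geometric mean of the probabilities.
    Some(mean.exp())
}

fn main() {
    // Illustrative values only: three tokens with probabilities 0.9, 0.5, 0.8.
    let logprobs = [0.9f32.ln(), 0.5f32.ln(), 0.8f32.ln()];
    match confidence(&logprobs) {
        Some(score) => println!("confidence: {score:.3}"),
        None => println!("no tokens"),
    }
}
```

The geometric mean penalizes a single very unlikely token more than an arithmetic mean of probabilities would, which is often what a confidence display needs.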

Streaming + tools¶

Streaming works with the chat pipeline and tool calling. The chunk schema matches the OpenAI SSE format. See the server streaming guide for the exact chunk order.

Full source¶

examples/streaming/src/main.rs.

Where to next?¶

Stateful chat — multi-turn REPL.
Sampling strategies guide — custom sampler chains.
Server streaming — the same flow over HTTP.
