streaming — High-level token streaming¶
Use Llama::create_completion_stream when you want synchronous
token-by-token output while still receiving the final Completion.
The callback receives text chunks as they become available and
returns StreamControl::Continue or StreamControl::Stop.
Run¶
Running the example downloads the same ~400 MB Qwen2.5 0.5B model as
the quickstart example.
What it does¶
use std::io::{self, Write};

use llama_crab::{CompletionOptions, Llama, LlamaParams, StreamControl};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(512))?;
    let prompt = "Write one short sentence about Rust.";
    let mut stdout = io::stdout().lock();
    // The callback cannot return a Result, so the first write error is parked here.
    let mut write_error: Option<io::Error> = None;

    let completion = llama.create_completion_stream(
        prompt,
        CompletionOptions::new(64).with_stop_sequence("\n\n"),
        |chunk| {
            // Print each chunk as it arrives; stop generating if stdout fails.
            if let Err(err) = write!(stdout, "{}", chunk.text).and_then(|_| stdout.flush()) {
                write_error = Some(err);
                return StreamControl::Stop;
            }
            StreamControl::Continue
        },
    )?;

    // Propagate a write error captured inside the callback.
    if let Some(err) = write_error {
        return Err(err.into());
    }
    writeln!(stdout)?;
    Ok(())
}
Capturing I/O errors¶
The callback cannot return a Result, so capture I/O errors and
return StreamControl::Stop; after the stream returns, propagate
the captured error:
let mut write_error: Option<io::Error> = None;
let completion = llama.create_completion_stream(
    "Write one short sentence about Rust.",
    CompletionOptions::new(64),
    |chunk| {
        if let Err(err) = write!(stdout, "{}", chunk.text).and_then(|_| stdout.flush()) {
            write_error = Some(err);
            return StreamControl::Stop;
        }
        StreamControl::Continue
    },
)?;
if let Some(err) = write_error {
    return Err(err.into());
}
For quick demos where stdout errors are not important, the callback can ignore them:
llama.create_completion_stream(
    "Write one short sentence about Rust.",
    CompletionOptions::new(64),
    |chunk| {
        let _ = write!(stdout, "{}", chunk.text);
        let _ = stdout.flush();
        StreamControl::Continue
    },
)?;
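The capture-and-stop pattern from this section can be exercised without loading a model. The sketch below is self-contained and makes several assumptions: `StreamControl` and `run_stream` are hypothetical stand-ins defined locally (they are not the llama_crab items), and `FailingWriter` is a test sink that rejects any write that would exceed its capacity. It shows that the first write error ends the stream and is still available afterwards.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the library's control enum.
#[derive(Debug, PartialEq, Clone, Copy)]
enum StreamControl {
    Continue,
    Stop,
}

/// Hypothetical stand-in for a streaming call: hands each chunk to the
/// callback and ends as soon as the callback returns `Stop`.
/// Returns how many chunks were delivered.
fn run_stream<F>(chunks: &[&str], mut on_chunk: F) -> usize
where
    F: FnMut(&str) -> StreamControl,
{
    let mut delivered = 0;
    for chunk in chunks {
        delivered += 1;
        if on_chunk(chunk) == StreamControl::Stop {
            break;
        }
    }
    delivered
}

/// Test sink that fails once a write would exceed `capacity` bytes.
struct FailingWriter {
    capacity: usize,
    written: String,
}

impl Write for FailingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.written.len() + buf.len() > self.capacity {
            return Err(io::Error::new(io::ErrorKind::BrokenPipe, "sink closed"));
        }
        self.written.push_str(&String::from_utf8_lossy(buf));
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Streams `chunks` into `out`, capturing the first write error
/// instead of returning it from the callback.
fn stream_to<W: Write>(chunks: &[&str], out: &mut W) -> (usize, Option<io::Error>) {
    let mut write_error = None;
    let delivered = run_stream(chunks, |text| {
        if let Err(err) = write!(out, "{}", text).and_then(|_| out.flush()) {
            write_error = Some(err);
            return StreamControl::Stop;
        }
        StreamControl::Continue
    });
    (delivered, write_error)
}

fn main() {
    let mut sink = FailingWriter { capacity: 8, written: String::new() };
    let (delivered, err) = stream_to(&["Rust ", "is ", "fast", "!"], &mut sink);
    println!("delivered {delivered} chunks, wrote {:?}, error: {err:?}", sink.written);
}
```

Writing the helper against any `W: Write` rather than a locked stdout is a design choice made here for testability; the same closure body works with `io::stdout().lock()`.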
Stopping the stream early¶
Returning StreamControl::Stop from the callback halts the
generation loop. The Completion returned by the call is still
populated with whatever was generated so far:
let mut stopped = false;
let completion = llama.create_completion_stream(
    "List 10 colors, one per line:",
    CompletionOptions::new(256),
    |chunk| {
        print!("{}", chunk.text);
        if chunk.text.contains("done") {
            stopped = true;
            return StreamControl::Stop;
        }
        StreamControl::Continue
    },
)?;
println!("\nstopped: {stopped}");
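A per-chunk check such as `chunk.text.contains("done")` only sees one chunk at a time, so it may miss a marker that happens to be split across two chunks. One way around this, sketched below, is to keep a short rolling tail of the text seen so far and search that instead. `MarkerDetector` is a hypothetical helper written for this illustration, not part of llama_crab; it assumes only that the callback receives text fragments in order.

```rust
/// Hypothetical helper: detects `marker` even when it spans chunk boundaries.
struct MarkerDetector {
    marker: String,
    /// The most recent bytes of streamed text (at most a chunk plus
    /// `marker.len() - 1` bytes, rounded out to a char boundary).
    tail: String,
}

impl MarkerDetector {
    fn new(marker: &str) -> Self {
        Self {
            marker: marker.to_string(),
            tail: String::new(),
        }
    }

    /// Feeds one chunk and returns true once the marker has appeared.
    fn push(&mut self, chunk: &str) -> bool {
        self.tail.push_str(chunk);
        if self.tail.contains(self.marker.as_str()) {
            return true;
        }
        // Keep just enough text to complete a marker that started in this
        // chunk: the last `marker.len() - 1` bytes.
        let keep = self.marker.len().saturating_sub(1);
        let mut cut = self.tail.len().saturating_sub(keep);
        // Never split a multi-byte character; back up to a boundary.
        while !self.tail.is_char_boundary(cut) {
            cut -= 1;
        }
        self.tail.drain(..cut);
        false
    }
}

fn main() {
    let mut detector = MarkerDetector::new("done");
    for chunk in ["red\n", "blue\n d", "one", "\n"] {
        if detector.push(chunk) {
            println!("marker seen after chunk {chunk:?}");
            break;
        }
    }
}
```

Inside a streaming callback, the closure would call `detector.push(&chunk.text)` and return `StreamControl::Stop` when it yields `true`; that wiring is left out here so the sketch runs without a model.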
How it works internally¶
sequenceDiagram
    participant App
    participant Llama
    participant Sampler
    App->>Llama: create_completion_stream(prompt, opts, callback)
    Llama->>Llama: tokenize(prompt)
    Llama->>Sampler: sample
    Sampler-->>Llama: token
    Llama->>Llama: detokenize(token)
    Llama->>App: callback(chunk)
    App-->>Llama: StreamControl::Continue
    Llama->>Sampler: accept(token)
    Note over Llama,Sampler: repeat until EOS, a stop sequence, or StreamControl::Stop
    Llama-->>App: Completion
The streaming helper uses the same high-level completion path as
create_completion: it clears sequence 0 before each call and does
not enable automatic prompt-cache reuse between calls. For custom
sampling, batching, or manual KV/session reuse, use the lower-level
context, batch and sampler APIs directly.
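The loop in the diagram can be modelled in a few lines. This is an illustrative sketch only, not the llama_crab implementation: `generate_stream`, `Completion`, `Finish`, and `StreamControl` are simplified stand-ins defined here, the "sampler" is a scripted list of token strings in which `None` plays the role of EOS, stop-sequence handling is omitted, and keeping the chunk on which the callback stopped is an assumption.

```rust
/// Simplified stand-in for the library's control enum.
#[derive(Debug, PartialEq, Clone, Copy)]
enum StreamControl {
    Continue,
    Stop,
}

/// Why the modelled loop ended (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Finish {
    EndOfSequence,
    TokenLimit,
    StoppedByCallback,
}

/// Simplified stand-in for the returned completion.
#[derive(Debug)]
struct Completion {
    text: String,
    finish: Finish,
}

/// Models the loop: take the next scripted token, hand its text to the
/// callback, keep it, and repeat until EOS, the token limit, or `Stop`.
fn generate_stream<F>(script: &[Option<&str>], max_tokens: usize, mut callback: F) -> Completion
where
    F: FnMut(&str) -> StreamControl,
{
    let mut text = String::new();
    for (produced, token) in script.iter().enumerate() {
        if produced >= max_tokens {
            return Completion { text, finish: Finish::TokenLimit };
        }
        let piece = match token {
            Some(piece) => *piece,
            None => return Completion { text, finish: Finish::EndOfSequence },
        };
        let control = callback(piece);
        // Assumption: the chunk is kept even when the callback asks to stop.
        text.push_str(piece);
        if control == StreamControl::Stop {
            return Completion { text, finish: Finish::StoppedByCallback };
        }
    }
    // The script ran out without an explicit EOS marker.
    Completion { text, finish: Finish::EndOfSequence }
}

fn main() {
    let script = [Some("red"), Some("\n"), Some("green"), Some("\n"), None];
    let mut seen = 0;
    let completion = generate_stream(&script, 16, |_chunk| {
        seen += 1;
        if seen == 3 {
            StreamControl::Stop
        } else {
            StreamControl::Continue
        }
    });
    println!("{:?} after {} chunks: {:?}", completion.finish, seen, completion.text);
}
```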
Streaming + log probabilities¶
If you set logprobs = true on the options, each chunk carries the
per-token log probabilities.
The chunk.logprobs field is Some(...) on every chunk, including
the partial one. Use the field to display alternatives in a UI or
to compute a confidence score.
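One common way to turn per-token log probabilities into a single confidence score is the geometric mean of the token probabilities, `exp(mean(logprob))`. The sketch below assumes the values are natural logarithms and have already been collected into a slice of `f32`; how they are read out of `chunk.logprobs` depends on that field's actual type, which is not shown here. `sequence_confidence` is a hypothetical helper, not a library function.

```rust
/// Geometric-mean token probability: exp(mean(logprobs)).
/// Returns `None` for an empty slice, where the mean is undefined.
fn sequence_confidence(logprobs: &[f32]) -> Option<f32> {
    if logprobs.is_empty() {
        return None;
    }
    let mean = logprobs.iter().sum::<f32>() / logprobs.len() as f32;
    Some(mean.exp())
}

fn main() {
    // Illustrative values only; real values would come from the stream.
    let collected = [-0.05_f32, -0.20, -1.60];
    if let Some(score) = sequence_confidence(&collected) {
        println!("confidence: {score:.3}");
    }
}
```

Averaging in log space keeps the score independent of sequence length, whereas a plain product of probabilities shrinks toward zero as more tokens are generated.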
Streaming + tools¶
Streaming works with the chat pipeline and tool calling. The chunk schema matches the OpenAI SSE format. See the server streaming guide for the exact chunk order.
Full source¶
examples/streaming/src/main.rs.
Where to next?¶
- Stateful chat — multi-turn REPL.
- Sampling strategies guide — custom sampler chains.
- Server streaming — the same flow over HTTP.