Text completion¶
The simplest llama-crab workflow: feed the model a prompt string,
get a text continuation back. This page documents the
create_completion family of methods, the stop-sequence and
log-probability knobs, streaming, best-of-N, and FIM (fill-in-the-
middle) for code.
The basic call¶
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(2048))?;
let resp = llama.create_completion("Once upon a time", 64)?;
println!("{}", resp.text);
The [Completion] struct returned by the high-level call carries:
| Field | Type | Description |
|---|---|---|
| `text` | `String` | The generated text, with the prompt stripped. |
| `tokens` | `Vec<LlamaToken>` | The generated token ids. |
| `logprobs` | `Option<CompletionLogprobs>` | Per-token log probabilities, when requested. |
| `timings` | `CompletionTimings` | Prompt ingestion, generation, and total wall time. |
| `stop_reason` | `StopReason` | Why generation stopped (`Stop`, `Length`, `TokensLimit`, `Canceled`). |
Customising the call¶
create_completion is a thin wrapper over
create_completion_with_options, which takes a [CompletionOptions]
builder. Use it to expose the rest of the sampler chain, the stop
sequences, the log-prob settings and the best-of-N knob.
use llama_crab::{CompletionOptions, Llama, LlamaParams};
let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(2048))?;
let resp = llama.create_completion_with_options(
    "The capital of France is",
    CompletionOptions::new(32) // generate at most 32 tokens
        .with_temperature(0.7)
        .with_top_p(0.95, 1) // nucleus threshold, then min_keep
        .with_top_k(40)
        .with_stop_sequence("\n\n")
        .with_logprobs(true, 5) // top 5 log probabilities per position
        .with_echo(false),
)?;
CompletionOptions reference¶
| Method | Default | Description |
|---|---|---|
| `new(max_tokens)` | – | Sets the maximum number of tokens to generate. |
| `with_temperature(t)` | `0.8` | Temperature; `0.0` selects greedy decoding. |
| `with_top_k(k)` | `40` | Restrict to the top K tokens. |
| `with_top_p(p, min_keep)` | `0.95`, `1` | Nucleus sampling. |
| `with_min_p(p, min_keep)` | `0.05`, `1` | Min-P sampling. |
| `with_typical_p(p, min_keep)` | `1.0`, `1` | Locally-typical sampling. |
| `with_tfs_z(z)` | `1.0` | Tail-free sampling. |
| `with_repeat_penalty(p)` | `1.0` | Repetition penalty. |
| `with_frequency_penalty(p)` | `0.0` | Frequency penalty. |
| `with_presence_penalty(p)` | `0.0` | Presence penalty. |
| `with_penalty_last_n(n)` | `64` | Number of most recent tokens considered when applying penalties. |
| `with_mirostat(...)` | `0` | Mirostat mode (0 = off, 1, 2). |
| `with_seed(seed)` | random | RNG seed. |
| `with_stop_sequence(s)` | – | Adds a single stop sequence. |
| `with_stop_sequences([s])` | – | Adds multiple stop sequences. |
| `with_logit_bias(biases)` | `{}` | Manual logit bias. |
| `with_logprobs(enable, k)` | `false`, `0` | Per-token log probabilities, with the top `k` alternatives per position. |
| `with_echo(echo)` | `false` | Echo the prompt back as part of the response. |
| `with_suffix(suffix)` | – | Suffix appended after the prompt (for FIM). |
| `with_best_of(n)` | choice count | Number of internal candidates to generate and rank. |
| `with_grammar(text)` | – | GBNF grammar (requires the `common` feature). |
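To make the interaction of temperature, top-K and top-P concrete, here is a minimal, library-independent sketch. It does not use llama-crab and is not the crate's sampler; the function names and the order of the filters (temperature, then top-K, then top-P) are assumptions chosen for illustration.

```rust
/// Rescales the probabilities in place so that they sum to 1.
fn renormalise(candidates: &mut [(usize, f32)]) {
    let total: f32 = candidates.iter().map(|&(_, p)| p).sum();
    if total > 0.0 {
        for (_, p) in candidates.iter_mut() {
            *p /= total;
        }
    }
}

/// Illustrative only: applies temperature, top-K and top-P to raw logits and
/// returns the surviving (token index, probability) pairs, most likely first.
fn filter_candidates(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    if logits.is_empty() {
        return Vec::new();
    }
    // A temperature of 0.0 is treated as greedy decoding: keep only the argmax.
    if temperature <= 0.0 {
        let best = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .unwrap_or(0);
        return vec![(best, 1.0)];
    }

    // Temperature scaling followed by a numerically stable softmax.
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|e| e / sum).enumerate().collect();

    // Top-K: keep the K most likely tokens (at least one).
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(top_k.max(1));
    renormalise(&mut probs);

    // Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cumulative = 0.0;
    let mut keep = 0;
    for &(_, p) in &probs {
        cumulative += p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    probs.truncate(keep);
    renormalise(&mut probs);
    probs
}
```

A sampler would then draw one token from the returned distribution. Lower temperatures and smaller `top_k` or `top_p` values shrink the candidate set and make output more deterministic.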
Streaming¶
For real-time UIs, use create_completion_stream:
use std::io::{self, Write};
use llama_crab::{CompletionOptions, Llama, LlamaParams, StreamControl};
let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(512))?;
let prompt = "Write one short sentence about Rust.";
let mut stdout = io::stdout().lock();
let mut write_error: Option<io::Error> = None;
let completion = llama.create_completion_stream(
    prompt,
    CompletionOptions::new(64).with_stop_sequence("\n\n"),
    |chunk| {
        // Print each chunk as it arrives; on failure, remember the error and stop.
        if let Err(err) = write!(stdout, "{}", chunk.text).and_then(|_| stdout.flush()) {
            write_error = Some(err);
            return StreamControl::Stop;
        }
        StreamControl::Continue
    },
)?;
// Propagate any I/O error captured inside the callback.
if let Some(err) = write_error {
    return Err(err.into());
}
The callback cannot return a Result, so capture I/O errors and
return StreamControl::Stop; after the stream returns, propagate
the captured error.
See the streaming example for a self-contained program.
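The capture-and-propagate pattern can be tried without a model. The sketch below is a stand-in under stated assumptions: `StreamControl` here is a local enum that mirrors the crate's type, and `drive_stream` is a hypothetical driver that plays the role of create_completion_stream by feeding fixed chunks to the callback.

```rust
use std::io::{self, Write};

/// Local stand-in for the crate's `StreamControl` (assumption, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StreamControl {
    Continue,
    Stop,
}

/// Hypothetical driver: hands each chunk to the callback until it asks to stop.
/// Returns the number of chunks that were delivered.
fn drive_stream<F>(chunks: &[&str], mut on_chunk: F) -> usize
where
    F: FnMut(&str) -> StreamControl,
{
    let mut delivered = 0;
    for chunk in chunks {
        delivered += 1;
        if on_chunk(chunk) == StreamControl::Stop {
            break;
        }
    }
    delivered
}

/// Streams chunks into `out`. The callback cannot return a `Result`, so the
/// first I/O error is stored in a local variable and returned afterwards.
fn stream_to_writer<W: Write>(chunks: &[&str], out: &mut W) -> io::Result<usize> {
    let mut write_error: Option<io::Error> = None;
    let delivered = drive_stream(chunks, |chunk| {
        if let Err(err) = out.write_all(chunk.as_bytes()).and_then(|_| out.flush()) {
            write_error = Some(err);
            return StreamControl::Stop;
        }
        StreamControl::Continue
    });
    match write_error {
        Some(err) => Err(err),
        None => Ok(delivered),
    }
}
```

Because the closure only borrows `write_error` for the duration of the call, the variable can be inspected as soon as the driver returns.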
FIM (fill-in-the-middle) for code¶
Code-completion models expect a prompt in which special tokens mark the
prefix, the suffix, and the gap to fill (a
prefix<SUFFIX_FILL>middle<SUFFIX>-style prompt). llama-crab exposes
complete_infill for that:
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(1024))?;
let prefix = "fn main() {\n println!(\"";
let suffix = "\");\n}";
let resp = llama.complete_infill(prefix, suffix)?;
println!("{}", resp.text);
The complete_infill helper renders a BuiltinTemplate::CodeFim
template around the prefix and suffix, then runs the normal
generation loop. Make sure the model you're using has been trained
on a FIM task — the GGUF metadata usually declares it.
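As a rough picture of what rendering a FIM template involves, the sketch below assembles a prompt in prefix-suffix-middle order. The sentinel strings are placeholders, not the tokens of any particular model or of BuiltinTemplate::CodeFim, and `FimTokens` and `render_fim_prompt` are hypothetical names.

```rust
/// Sentinel strings that delimit the parts of a FIM prompt.
/// The values used below are placeholders; real models define their own tokens.
struct FimTokens<'a> {
    prefix: &'a str,
    suffix: &'a str,
    middle: &'a str,
}

/// Builds a prompt in prefix-suffix-middle order. The model is expected to
/// generate the missing middle after the final sentinel.
fn render_fim_prompt(tokens: &FimTokens, prefix: &str, suffix: &str) -> String {
    format!(
        "{}{}{}{}{}",
        tokens.prefix, prefix, tokens.suffix, suffix, tokens.middle
    )
}
```

With placeholder tokens `<PRE>`, `<SUF>` and `<MID>`, the prefix `fn main() {` and the suffix `}` render to `<PRE>fn main() {<SUF>}<MID>`.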
Best-of-N¶
with_best_of(n) generates n internal completions, scores them
by average log probability, and returns only the highest-scoring
ones, as many as the public choice count. Use it to trade compute for quality:
use llama_crab::{CompletionOptions, Llama, LlamaParams};
let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(1024))?;
let resp = llama.create_completion_with_options(
    "def fibonacci(n):",
    CompletionOptions::new(64).with_best_of(4),
)?;
The returned Completion has the same shape as a single completion;
the extra candidates are internal.
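The ranking step can be sketched in a few lines. This is an illustration of scoring by average log probability, not the crate's internal code; `Candidate`, `average_logprob` and `pick_best` are hypothetical names.

```rust
/// One internally generated completion and the log probability of each of
/// its tokens (hypothetical type, for illustration only).
struct Candidate {
    text: String,
    token_logprobs: Vec<f32>,
}

/// Mean log probability per token. Averaging, rather than summing, avoids
/// favouring short completions simply because they have fewer terms.
fn average_logprob(candidate: &Candidate) -> f32 {
    if candidate.token_logprobs.is_empty() {
        return f32::NEG_INFINITY;
    }
    let sum: f32 = candidate.token_logprobs.iter().sum();
    sum / candidate.token_logprobs.len() as f32
}

/// Returns the candidate with the highest average log probability,
/// or `None` when there are no candidates.
fn pick_best(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .max_by(|a, b| average_logprob(a).total_cmp(&average_logprob(b)))
}
```

Each extra candidate costs a full generation pass, so a best-of value of 4 roughly quadruples the generation time.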
Log probabilities¶
with_logprobs(true, k) populates Completion.logprobs with the
top k token log probabilities at every position, plus the
selected token:
pub struct CompletionLogprobs {
    pub tokens: Vec<LlamaToken>,
    pub text_offset: Vec<usize>,
    pub token_logprobs: Vec<Option<f32>>,
    pub top_logprobs: Vec<Vec<TopLogprob>>,
    pub top_logprobs_idx: Vec<usize>,
}
Use it for uncertainty estimates, perplexity scoring, or to implement a custom UI that shows alternatives.
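For perplexity scoring, a minimal sketch over the `token_logprobs` field might look like this. It assumes the values are natural-log probabilities and that positions without a value (`None`) should simply be skipped; `perplexity` is a hypothetical helper, not part of the crate.

```rust
/// Perplexity is exp(-mean log probability) over the scored tokens.
/// Returns `None` when no position carries a log probability.
fn perplexity(token_logprobs: &[Option<f32>]) -> Option<f32> {
    // Drop positions that have no log probability attached.
    let known: Vec<f32> = token_logprobs.iter().flatten().copied().collect();
    if known.is_empty() {
        return None;
    }
    let mean = known.iter().sum::<f32>() / known.len() as f32;
    Some((-mean).exp())
}
```

Lower values mean the model found the text more predictable; a perplexity of 2.0 corresponds to an average token probability of one half.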
Stop sequences¶
Stop sequences are matched after a token is decoded, on the detokenised chunk. Add as many as you want; the first one to match ends the generation. They are case-sensitive and whitespace-significant.
CompletionOptions::new(64)
    .with_stop_sequence("\n\n")
    .with_stop_sequence("</answer>")
    .with_stop_sequence("User:")
By default, generation stops as soon as any one of them matches.
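The matching rule can be illustrated on plain strings. The sketch below is an assumption about how such matching could work, not the crate's implementation: it finds the earliest occurrence of any stop sequence in the detokenised text and cuts the text there. `truncate_at_stop` is a hypothetical helper.

```rust
/// Returns the text up to (and excluding) the earliest stop sequence, plus a
/// flag saying whether any stop sequence matched. Matching is exact:
/// case-sensitive and whitespace-significant.
fn truncate_at_stop<'a>(text: &'a str, stops: &[&str]) -> (&'a str, bool) {
    let earliest = stops
        .iter()
        .filter(|s| !s.is_empty()) // an empty pattern would match at index 0
        .filter_map(|s| text.find(*s))
        .min();
    match earliest {
        Some(pos) => (&text[..pos], true),
        None => (text, false),
    }
}
```

For the text `"Paris.\n\nUser: next"` and the stop sequences `"User:"` and `"\n\n"`, the blank line occurs first, so the result is `"Paris."`.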
Echoing the prompt¶
with_echo(true) returns the prompt as part of the generated text.
Useful for debugging, less useful in production.
Performance tips¶
- Reuse the model. Each Llama::load takes on the order of seconds. Load once, call many times.
- Tune n_threads to physical cores. A machine that exposes 16 logical cores through hyperthreading should use 8–12 threads, not 16.
- Use with_temperature(0.0) for benchmarks. It picks the greedy sampler and gives the most reproducible timings.
- Offload to the GPU when the model fits in VRAM. A with_n_gpu_layers(99) call is usually 5–10× faster than CPU.
- Set with_seed to a fixed value for unit tests. The default RNG seed is random per call, which makes test assertions flaky.
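For the thread-count tip, a starting value can be derived from the standard library. This is a rough heuristic under the assumption of two hardware threads per physical core when SMT is on; the standard library reports logical parallelism only, so whether SMT is enabled has to be supplied by the caller. Both function names are hypothetical.

```rust
use std::thread;

/// Number of logical cores reported by the OS, falling back to 1.
fn detect_logical_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Conservative thread count: one thread per physical core, assuming two
/// hardware threads per core when SMT (hyperthreading) is enabled.
fn suggested_threads(logical_cores: usize, smt_enabled: bool) -> usize {
    let physical = if smt_enabled {
        logical_cores / 2
    } else {
        logical_cores
    };
    physical.max(1)
}
```

Treat the result as a lower bound and benchmark upward from there; the best value depends on the model size and on memory bandwidth.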
Where to next?¶
- Sampling strategies — for full control over the sampler chain.
- Streaming example — a self-contained program that streams tokens to stdout.
- Speculative decoding — when you need more tokens per second than the model can natively produce.