Skip to content

Text completion

The simplest llama-crab workflow: feed the model a prompt string, get a text continuation back. This page documents the create_completion family of methods, the stop-sequence and log-probability knobs, streaming, best-of-N, and FIM (fill-in-the- middle) for code.

The basic call

use llama_crab::{Llama, LlamaParams};

let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(2048))?;
let resp = llama.create_completion("Once upon a time", 64)?;
println!("{}", resp.text);

The [Completion] struct returned by the high-level call carries:

Field Type Description
text String The generated text, with the prompt stripped.
tokens Vec<LlamaToken> The generated token ids.
logprobs Option<CompletionLogprobs> Per-token log probabilities, when requested.
timings CompletionTimings Prompt ingestion, generation, and total wall time.
stop_reason StopReason Why generation stopped (Stop, Length, TokensLimit, Canceled).

Customising the call

create_completion is a thin wrapper over create_completion_with_options, which takes a [CompletionOptions] builder. Use it to expose the rest of the sampler chain, the stop sequences, the log-prob settings and the best-of-N knob.

use llama_crab::{CompletionOptions, Llama, LlamaParams};

let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(2048))?;

let resp = llama.create_completion_with_options(
    "The capital of France is",
    CompletionOptions::new(32)
        .with_temperature(0.7)
        .with_top_p(0.95, 1)
        .with_top_k(40)
        .with_stop_sequence("\n\n")
        .with_logprobs(true, 5)
        .with_echo(false),
)?;

CompletionOptions reference

Method Default Description
new(max_tokens) Sets the maximum number of tokens to generate.
with_temperature(t) 0.8 Temperature; 0.0 selects greedy decoding.
with_top_k(k) 40 Restrict to the top K tokens.
with_top_p(p, min_keep) 0.95, 1 Nucleus sampling.
with_min_p(p, min_keep) 0.05, 1 Min-P sampling.
with_typical_p(p, min_keep) 1.0, 1 Locally-typical sampling.
with_tfs_z(z) 1.0 Tail-free sampling.
with_repeat_penalty(p) 1.0 Repetition penalty.
with_frequency_penalty(p) 0.0 Frequency penalty.
with_presence_penalty(p) 0.0 Presence penalty.
with_penalty_last_n(n) 64 Tokens to consider when applying penalties.
with_mirostat(...) 0 Mirostat mode (0 = off, 1, 2).
with_seed(seed) random RNG seed.
with_stop_sequence(s) Adds a single stop sequence.
with_stop_sequences([s]) Adds multiple stop sequences.
with_logit_bias(biases) {} Manual logit bias.
with_logprobs(enable, k) false, 0 Per-token log probabilities.
with_echo(echo) false Echo the prompt back as part of the response.
with_suffix(suffix) Suffix appended after the prompt (for FIM).
with_best_of(n) n Number of internal candidates for n.
with_grammar(text) GBNF grammar (requires the common feature).

Streaming

For real-time UIs, use create_completion_stream:

use std::io::{self, Write};
use llama_crab::{CompletionOptions, Llama, LlamaParams, StreamControl};

let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(512))?;
let prompt = "Write one short sentence about Rust.";
let mut stdout = io::stdout().lock();
let mut write_error: Option<io::Error> = None;

let completion = llama.create_completion_stream(
    prompt,
    CompletionOptions::new(64).with_stop_sequence("\n\n"),
    |chunk| {
        if let Err(err) = write!(stdout, "{}", chunk.text).and_then(|_| stdout.flush()) {
            write_error = Some(err);
            return StreamControl::Stop;
        }
        StreamControl::Continue
    },
)?;

if let Some(err) = write_error {
    return Err(err.into());
}

The callback cannot return a Result, so capture I/O errors and return StreamControl::Stop; after the stream returns, propagate the captured error.

See the streaming example for a self-contained program.

FIM (fill-in-the-middle) for code

Code-completion models expect a prefix<SUFFIX_FILL>middle<SUFFIX>-style prompt. llama-crab exposes complete_infill for that:

use llama_crab::{Llama, LlamaParams};

let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(1024))?;

let prefix = "fn main() {\n    println!(\"";
let suffix = "\");\n}";
let resp = llama.complete_infill(prefix, suffix)?;
println!("{}", resp.text);

The complete_infill helper renders a BuiltinTemplate::CodeFim template around the prefix and suffix, then runs the normal generation loop. Make sure the model you're using has been trained on a FIM task — the GGUF metadata usually declares it.

Best-of-N

with_best_of(n) generates n internal completions, scores them by average log probability, and returns the top n (the public choice count). Use it to trade compute for quality:

use llama_crab::{CompletionOptions, Llama, LlamaParams};

let mut llama = Llama::load(LlamaParams::new("model.gguf").with_n_ctx(1024))?;
let resp = llama.create_completion_with_options(
    "def fibonacci(n):",
    CompletionOptions::new(64).with_best_of(4),
)?;

The returned Completion has the same shape as a single completion; the extra candidates are internal.

Log probabilities

with_logprobs(true, k) populates Completion.logprobs with the top k token log probabilities at every position, plus the selected token:

pub struct CompletionLogprobs {
    pub tokens:           Vec<LlamaToken>,
    pub text_offset:      Vec<usize>,
    pub token_logprobs:   Vec<Option<f32>>,
    pub top_logprobs:     Vec<Vec<TopLogprob>>,
    pub top_logprobs_idx: Vec<usize>,
}

Use it for uncertainty estimates, perplexity scoring, or to implement a custom UI that shows alternatives.

Stop sequences

Stop sequences are matched after a token is decoded, on the detokenised chunk. Add as many as you want; the first one to match ends the generation. They are case-sensitive and whitespace-significant.

CompletionOptions::new(64)
    .with_stop_sequence("\n\n")
    .with_stop_sequence("</answer>")
    .with_stop_sequence("User:")

The default behaviour matches against any of them.

Echoing the prompt

with_echo(true) returns the prompt as part of the generated text. Useful for debugging, less useful in production.

Performance tips

  • Reuse the model. Each Llama::load is O(seconds). Load once, call many times.
  • Tune n_threads to physical cores. A 16-core Mac with hyperthreading should use 8–12 threads, not 16.
  • Use temp = 0.0 for benchmarks. It picks the greedy sampler and gives the most reproducible timings.
  • Offload to the GPU when the model fits in VRAM. A with_n_gpu_layers(99) call is usually 5–10× faster than CPU.
  • Set with_seed to a fixed value for unit tests. The default RNG seed is random per call, which makes test assertions flaky.

Where to next?