Features

This section documents every public feature of the llama-crab API. Each page goes end-to-end: the why, the how, the pitfalls, and a complete runnable example.

Text completion

Plain prompt → text. Stop sequences, log probabilities, streaming, best-of-N, and FIM (fill-in-the-middle) for code.

Chat & tool calling

Role-based messages, 14 built-in Jinja2 templates, a Jinja2 subset renderer, the tool-call parser for ChatML / Mistral / Llama 3 / Functionary, and incremental parsing.

Multimodal (vision + audio)

Pair a text GGUF with an mmproj projector, decode local images into MtmdBitmap, evaluate multimodal chunks, and continue generation with the normal sampler chain.

Embeddings & reranking

Mean / CLS / Last pooling, L2-normalised vectors, semantic search, and the cross-encoder Llama::rerank helper.

JSON-Schema & GBNF grammars

Convert a JSON Schema 2020-12 document into a GBNF grammar and use the grammar sampler to force the model to emit only valid output. The most reliable way to get structured data out of a model.

Speculative decoding

The PromptLookupDecoding draft model (no extra weights) and the DraftModel trait for plugging in your own draft. The speculative_decode free function drives the verify step.

Stateful chat

Multi-turn chat with a growing history. Template auto-detection, history trimming strategies, and session persistence.
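
To make "history trimming" concrete, here is a minimal sketch of one possible strategy: keep a leading system message and as many of the most recent turns as fit in a token budget. Everything in it is hypothetical and for illustration only. `Role`, `Message`, `approx_tokens`, and `trim_history` are not llama-crab types or functions, and the word-count "tokenizer" is a crude stand-in for a real one. The Stateful chat page documents the strategies the library actually ships.

```rust
// Hypothetical types for illustration; these are NOT the llama-crab API.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    content: String,
}

/// Assumption: a crude stand-in for a tokenizer that counts
/// whitespace-separated words. A real budget would use model tokens.
fn approx_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Keeps a leading system message (if present) plus the newest messages
/// whose combined approximate token count fits within `budget`.
fn trim_history(history: &[Message], budget: usize) -> Vec<Message> {
    // Split off a leading system message so it is always retained.
    let (system, rest) = match history.first() {
        Some(m) if m.role == Role::System => (Some(m.clone()), &history[1..]),
        _ => (None, history),
    };

    let mut used = system.as_ref().map_or(0, |m| approx_tokens(&m.content));
    let mut kept: Vec<Message> = Vec::new();

    // Walk from newest to oldest and stop at the first message that
    // would exceed the budget.
    for msg in rest.iter().rev() {
        let cost = approx_tokens(&msg.content);
        if used + cost > budget {
            break;
        }
        used += cost;
        kept.push(msg.clone());
    }
    kept.reverse(); // restore chronological order

    let mut out = Vec::with_capacity(kept.len() + 1);
    out.extend(system);
    out.extend(kept);
    out
}
```

Calling `trim_history(&history, budget)` before each turn keeps the prompt bounded. Note that this simple version can leave an assistant reply as the first non-system message; a stricter variant would drop messages in user/assistant pairs.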

Picking the right feature for the job

flowchart TD
    Q{What are you building?}
    Q -->|Code completion| A[Text completion]
    Q -->|Chatbot / agent| B[Chat & tool calling]
    Q -->|Image / audio Q&A| C[Multimodal]
    Q -->|Search / clustering| D[Embeddings & reranking]
    Q -->|Extracting structured data| E[Grammars]
    Q -->|Long-form generation| F[Speculative decoding]
    Q -->|Persistent assistant| G[Stateful chat]

If you're not sure where to start, the quickstart example walks through plain completion, chat, FIM, and embeddings in a single ~80-line program.