Skip to content

Features

This section documents every public feature of the llama-crab API. Each page goes end-to-end: the why, the how, the pitfalls, and a complete runnable example.

  • Text completion

    Plain prompt → text. Stop sequences, log probabilities, streaming, best-of-N, and FIM (fill-in-the-middle) for code.

  • Chat & tool calling

    Role-based messages, 14 built-in Jinja2 templates, a Jinja2 subset renderer, the tool-call parser for ChatML / Mistral / Llama 3 / Functionary, and incremental parsing.

  • Multimodal (vision + audio)

    Pair a text GGUF with an mmproj projector, decode local images into MtmdBitmap, evaluate multimodal chunks, and continue generation with the normal sampler chain.

  • Embeddings & reranking

    Mean / CLS / Last pooling, L2-normalised vectors, semantic search, and the cross-encoder Llama::rerank helper.

  • JSON-Schema & GBNF grammars

    Convert a JSON Schema 2020-12 document into a GBNF grammar and use the grammar sampler to force the model to emit only valid output. The most reliable way to get structured data out of a model.

  • Speculative decoding

    The PromptLookupDecoding draft model (no extra weights) and the DraftModel trait for plugging in your own draft. The speculative_decode free function drives the verify step.

  • Stateful chat

    Multi-turn chat with a growing history. Template auto-detection, history trimming strategies, and session persistence.

Picking the right feature for the job

flowchart TD
    Q{What are you building?}
    Q -->|Code completion| A[Text completion]
    Q -->|Chatbot / agent| B[Chat & tool calling]
    Q -->|Image / audio Q&A| C[Multimodal]
    Q -->|Search / clustering| D[Embeddings & reranking]
    Q -->|Extracting structured data| E[Grammars]
    Q -->|Long-form generation| F[Speculative decoding]
    Q -->|Persistent assistant| G[Stateful chat]

If you're not sure where to start, the quickstart example walks through plain completion, chat, FIM and embeddings in a single ~80-line program.