Skip to content

llama-crab

Safe, ergonomic and complete Rust bindings for llama.cpp.

Crates.io docs.rs MSRV: 1.88 License: MIT llama.cpp pinned


What is llama-crab?

llama-crab is a Rust crate (actually a workspace of two crates) that gives you a 100 % safe Rust API over llama.cpp. You can load any GGUF model, run text and chat completions, compute embeddings, constrain generation with a GBNF grammar, drive vision- language models through mtmd, or expose everything over HTTP — all without touching a single unsafe block at the application level.

  • Get started in 5 minutes

    Load a model and generate a completion with a handful of lines.

    Installation Your first program

  • Run on any hardware

    CPU, Metal, CUDA, Vulkan, ROCm, OpenCL and KleidiAI — pick your backend at build time and offload as many layers as fit in VRAM.

    Backends & GPU offload

  • Ship to phones and tablets

    release-size and release-perf profiles, OpenCL + KleidiAI for Android, Metal for iOS, and MobilePreset for sensible defaults.

    Mobile distribution

  • Vision & audio

    Pair a text GGUF with an mmproj projector and feed images or audio into the same context.

    Multimodal

  • Embeddings & reranking

    Extract vectors with configurable pooling, run semantic search, or use a cross-encoder for higher-quality ranking.

    Embeddings

  • HTTP server out of the box

    llama-crab-server exposes the high-level API over an OpenAI-compatible HTTP interface with SSE streaming.

    Server

A taste of the API

use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(
        LlamaParams::new("models/qwen2.5-0.5b-instruct-q4_k_m.gguf")
            .with_n_ctx(2048)
            .with_n_gpu_layers(99),
    )?;

    let response = llama.create_completion("The capital of France is", 32)?;
    println!("{}", response.text);
    Ok(())
}
use llama_crab::chat::BuiltinTemplate;
use llama_crab::high_level::chat_completion::{create_chat_completion_with, ChatMessage};
use llama_crab::{Llama, LlamaParams, Role};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(
        LlamaParams::new("models/instruct.gguf").with_n_ctx(4096),
    )?;

    let messages = vec![
        ChatMessage::new(Role::System, "You are a concise assistant."),
        ChatMessage::new(Role::User, "Explain Rust ownership in one paragraph."),
    ];

    let response = create_chat_completion_with(
        &mut llama,
        &messages,
        BuiltinTemplate::ChatMl,
        &[],
        128,
    )?;

    println!("{}", response.content);
    Ok(())
}
use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(
        LlamaParams::new("models/bge-small-en-v1.5-q4_k_m.gguf")
            .with_n_ctx(512)
            .with_embeddings(true),
    )?;

    let embedding = llama.embed("Rust is memory-safe.", true)?;
    println!("dim = {}", embedding.len());
    Ok(())
}

Why llama-crab?

llama-crab is designed for applications that need direct access to llama.cpp without giving up Rust's safety, packaging, or deployment discipline.

  • Safe by default

    The high-level API exposes no unsafe surface. FFI boundaries live behind typed wrappers, and raw access stays opt-in for the cases that truly need it.

  • Complete feature surface

    Sampling, chat formats, vision pipelines, JSON-Schema grammars, speculative decoding, embeddings, reranking, and KV cache flows are available from safe Rust APIs.

  • Reproducible builds

    llama.cpp is pinned to a known commit, the build is explicit about enabled backends, and CI keeps the supported CPU / CUDA / Vulkan / Metal / ROCm combinations visible.

  • Performance first

    Layer offload, flash attention, mobile presets, sampling chains, speculative decoding, and tool-call parsers are exposed without requiring application code to own custom kernels.

Crates in this workspace

Crate Purpose When to use it
llama-crab 100 % safe Rust API: model loading, sampling, chat, embeddings, server glue. Most applications. This is the crate you depend on.
llama-crab-sys Raw FFI generated via bindgen over wrapper.h + CMake. When you need direct access to llama.cpp symbols that the safe crate does not (yet) wrap.
llama-crab-server HTTP binary built on top of llama-crab. When you want an OpenAI-compatible endpoint without writing one.

License

llama-crab is distributed under the MIT License. See LICENSE-MIT for the full text.


Where to next?