What is llama-crab?
llama-crab is a Rust crate (actually a workspace of three crates) that
gives you a 100 % safe Rust API over llama.cpp.
You can load any GGUF model, run text and chat completions, compute
embeddings, constrain generation with a GBNF grammar, drive vision-
language models through mtmd, or expose everything over HTTP — all
without touching a single unsafe block at the application level.
- Get started in 5 minutes

  Load a model and generate a completion with a handful of lines.

- Run on any hardware

  CPU, Metal, CUDA, Vulkan, ROCm, OpenCL, and KleidiAI: pick your backend at build time and offload as many layers as fit in VRAM.

- Ship to phones and tablets

  `release-size` and `release-perf` profiles, OpenCL + KleidiAI for Android, Metal for iOS, and `MobilePreset` for sensible defaults.

- Vision & audio

  Pair a text GGUF with an `mmproj` projector and feed images or audio into the same context.

- Embeddings & reranking

  Extract vectors with configurable pooling, run semantic search, or use a cross-encoder for higher-quality ranking.

- HTTP server out of the box

  `llama-crab-server` exposes the high-level API over an OpenAI-compatible HTTP interface with SSE streaming.
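OpenAI-compatible streaming endpoints deliver Server-Sent Events: each chunk arrives on a `data:` line carrying JSON, and the stream ends with `data: [DONE]`. The following is a minimal client-side sketch of that line handling, using only the standard library. It assumes `llama-crab-server` follows this convention; `SseEvent`, `parse_sse_line`, and the abridged sample payloads are hypothetical names and data, and JSON decoding is left to the crate of your choice.

```rust
/// One parsed line of an SSE stream.
#[derive(Debug, PartialEq)]
enum SseEvent<'a> {
    /// A `data:` line carrying a payload (JSON in the OpenAI convention).
    Data(&'a str),
    /// The `data: [DONE]` sentinel that closes the stream.
    Done,
    /// Blank separators, comments, and fields this sketch does not handle.
    Ignored,
}

/// Classify a single line of an SSE response body.
fn parse_sse_line(line: &str) -> SseEvent<'_> {
    // Drop any trailing line terminator left on the line.
    let line = line.trim_end_matches(|c| c == '\r' || c == '\n');
    match line.strip_prefix("data:") {
        Some(rest) => {
            // SSE allows one optional space after the colon.
            let payload = rest.strip_prefix(' ').unwrap_or(rest);
            if payload == "[DONE]" {
                SseEvent::Done
            } else {
                SseEvent::Data(payload)
            }
        }
        None => SseEvent::Ignored,
    }
}

fn main() {
    // Hypothetical, abridged stream; real chunks carry more fields.
    let lines = [
        r#"data: {"choices":[{"delta":{"content":"Hel"}}]}"#,
        "",
        r#"data: {"choices":[{"delta":{"content":"lo"}}]}"#,
        "",
        "data: [DONE]",
    ];
    for line in lines {
        match parse_sse_line(line) {
            SseEvent::Data(json) => println!("chunk: {json}"),
            SseEvent::Done => break,
            SseEvent::Ignored => {}
        }
    }
}
```

Keeping the parser free of I/O means the same function can sit behind any HTTP client that yields the response body line by line.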
A taste of the API
```rust
use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a GGUF model with a 2048-token context and request GPU offload
    // for up to 99 layers.
    let mut llama = Llama::load(
        LlamaParams::new("models/qwen2.5-0.5b-instruct-q4_k_m.gguf")
            .with_n_ctx(2048)
            .with_n_gpu_layers(99),
    )?;

    // Complete the prompt and print the generated text.
    let response = llama.create_completion("The capital of France is", 32)?;
    println!("{}", response.text);
    Ok(())
}
```
```rust
use llama_crab::chat::BuiltinTemplate;
use llama_crab::high_level::chat_completion::{create_chat_completion_with, ChatMessage};
use llama_crab::{Llama, LlamaParams, Role};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load an instruction-tuned model with a 4096-token context.
    let mut llama = Llama::load(
        LlamaParams::new("models/instruct.gguf").with_n_ctx(4096),
    )?;

    // Build the conversation: a system prompt followed by one user turn.
    let messages = vec![
        ChatMessage::new(Role::System, "You are a concise assistant."),
        ChatMessage::new(Role::User, "Explain Rust ownership in one paragraph."),
    ];

    // Render the messages with the built-in ChatML template and generate a reply.
    let response = create_chat_completion_with(
        &mut llama,
        &messages,
        BuiltinTemplate::ChatMl,
        &[],
        128,
    )?;
    println!("{}", response.content);
    Ok(())
}
```
```rust
use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load an embedding model with embedding mode enabled.
    let mut llama = Llama::load(
        LlamaParams::new("models/bge-small-en-v1.5-q4_k_m.gguf")
            .with_n_ctx(512)
            .with_embeddings(true),
    )?;

    // Embed one sentence and print the length of the resulting vector.
    let embedding = llama.embed("Rust is memory-safe.", true)?;
    println!("dim = {}", embedding.len());
    Ok(())
}
```
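Semantic search over embedding vectors usually reduces to ranking documents by cosine similarity against a query vector. The sketch below is plain Rust with no llama-crab calls. It assumes embeddings are available as `f32` slices (the exact return type of `embed` is not shown on this page), and `cosine_similarity` and `rank` are hypothetical helper names.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Indices of `docs`, ordered from most to least similar to `query`.
fn rank(query: &[f32], docs: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort by descending score; total_cmp gives a total order on floats.
    scored.sort_by(|l, r| r.1.total_cmp(&l.1));
    scored.into_iter().map(|(i, _)| i).collect()
}

fn main() {
    // Toy 3-dimensional vectors stand in for real model embeddings.
    let query = vec![1.0, 0.0, 0.0];
    let docs = vec![
        vec![0.0, 1.0, 0.0],
        vec![0.9, 0.1, 0.0],
        vec![-1.0, 0.0, 0.0],
    ];
    println!("{:?}", rank(&query, &docs)); // prints [1, 0, 2]
}
```

If the vectors are already L2-normalized, the dot product alone gives the same ordering and the two norm computations can be skipped.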
Why llama-crab?
llama-crab is designed for applications that need direct access to
llama.cpp without giving up Rust's safety, packaging, or deployment
discipline.
- Safe by default

  The high-level API exposes no `unsafe` surface. FFI boundaries live behind typed wrappers, and raw access stays opt-in for the cases that truly need it.

- Complete feature surface

  Sampling, chat formats, vision pipelines, JSON-Schema grammars, speculative decoding, embeddings, reranking, and KV cache flows are available from safe Rust APIs.

- Reproducible builds

  `llama.cpp` is pinned to a known commit, the build is explicit about enabled backends, and CI keeps the supported CPU / CUDA / Vulkan / Metal / ROCm combinations visible.

- Performance first

  Layer offload, flash attention, mobile presets, sampling chains, speculative decoding, and tool-call parsers are exposed without requiring application code to own custom kernels.
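As a rough illustration of the layer-offload trade-off, the sketch below estimates how many layers fit in a given VRAM budget. It is not part of llama-crab: `layers_that_fit` is a hypothetical helper and the sizes are made-up inputs. It assumes weights are spread evenly across layers and ignores the KV cache and scratch buffers, so treat the result as an upper bound to tune down from.

```rust
/// Estimate how many layers fit in `vram_budget_bytes`, assuming the
/// model's weights are split evenly across `n_layers`.
fn layers_that_fit(model_bytes: u64, n_layers: u32, vram_budget_bytes: u64) -> u32 {
    if n_layers == 0 || model_bytes == 0 {
        return 0;
    }
    // Ceiling division so the per-layer size is never underestimated.
    let per_layer = (model_bytes + u64::from(n_layers) - 1) / u64::from(n_layers);
    let fit = vram_budget_bytes / per_layer;
    // Never report more layers than the model has.
    fit.min(u64::from(n_layers)) as u32
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    // Made-up example: a 4 GiB model with 32 layers and 3 GiB of free VRAM.
    let n = layers_that_fit(4 * GIB, 32, 3 * GIB);
    println!("offload about {n} layers"); // prints "offload about 24 layers"
}
```

A value derived this way could seed the argument to `with_n_gpu_layers`, then be lowered if loading fails or the context size grows.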
Crates in this workspace
| Crate | Purpose | When to use it |
|---|---|---|
| `llama-crab` | 100 % safe Rust API: model loading, sampling, chat, embeddings, server glue. | Most applications. This is the crate you depend on. |
| `llama-crab-sys` | Raw FFI generated via `bindgen` over `wrapper.h` + CMake. | When you need direct access to `llama.cpp` symbols that the safe crate does not (yet) wrap. |
| `llama-crab-server` | HTTP binary built on top of `llama-crab`. | When you want an OpenAI-compatible endpoint without writing one. |
License
llama-crab is distributed under the MIT License. See `LICENSE-MIT` for the full text.
Where to next?
- Install the crate and verify your toolchain.
- Walk through the architecture overview to understand the major building blocks.
- Skim the examples index and copy the one closest to what you want to build.
