Embeddings & reranking¶
llama-crab exposes the embedding pipeline of llama.cpp through a
single high-level helper, [Llama::embed], plus pooling and
normalisation knobs on LlamaContextParams. This page walks through
enabling embeddings, the four pooling modes, semantic search, and
the cross-encoder Llama::rerank helper.
Enabling embeddings¶
Load the model with with_embeddings(true). By default the context
uses mean pooling; pick a different strategy with
with_pooling_type:
use llama_crab::context::params::PoolingType;
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(
LlamaParams::new("bge-small-en-v1.5-q4_k_m.gguf")
.with_n_ctx(512)
.with_embeddings(true)
.with_pooling_type(PoolingType::Cls),
)?;
| Pooling | When to use it |
|---|---|
None |
Token-level embeddings. No pooling — the output is a matrix. |
Mean |
Default. Robust for general sentence similarity. |
Cls |
BGE / GTE / E5 — uses the first token (BOS) as the summary. |
Last |
Uses the last non-pad token. |
Rank |
Cross-encoder rerankers. Produces a single logit per pair. |
Computing one embedding¶
embed(..., true) returns an L2-normalised vector, so the dot
product of two vectors equals their cosine similarity. The function
returns Result<Vec<f32>, LlamaError>.
Embedding options¶
The embed helper takes optional configuration through
EmbedOptions:
use llama_crab::embed::EmbedOptions;
let v = llama.embed_with_options(
"Rust is memory-safe.",
EmbedOptions::new()
.with_normalize(true)
.with_start_token(false) // skip the BOS token
)?;
Batch embeddings¶
For multi-document workloads, prefer embed_texts (or
embed_texts_with_options):
use llama_crab::Llama;
let texts = vec![
"Rust is memory-safe.",
"Python is a dynamic language.",
"The Eiffel Tower is in Paris.",
];
let embeddings = llama.embed_texts(&texts, true)?; // Vec<Vec<f32>>
The batch call amortises model load cost, but each text is still evaluated independently. Use the lower-level batch and sequence APIs when you need higher throughput.
Semantic search¶
Embed a query and a corpus, then rank by cosine similarity:
let corpus = [
"Rust is memory-safe.",
"Paris is the capital of France.",
"Bananas are yellow fruit.",
];
let query = "safe systems programming language";
let q = llama.embed(query, true)?;
let mut scored: Vec<(usize, f32)> = corpus.iter().enumerate()
.map(|(i, doc)| {
let v = llama.embed(doc, true).unwrap();
let sim: f32 = q.iter().zip(v.iter()).map(|(a, b)| a * b).sum();
(i, sim)
})
.collect();
scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
for (i, sim) in &scored {
println!("{sim:.3} {}", corpus[*i]);
}
The full example lives in
examples/embedding_search/.
Reranking¶
Rerankers (a.k.a. cross-encoders) score (query, document) pairs
jointly rather than from independent embeddings. They give
better rankings at the cost of one model pass per pair.
llama-crab includes Llama::rerank(query, documents) for
cross-encoder rank models. Load the model with embeddings enabled
and PoolingType::Rank, then pass the query and documents:
use llama_crab::context::params::PoolingType;
use llama_crab::{Llama, LlamaParams};
let mut llama = Llama::load(
LlamaParams::new("bge-reranker-base-q4_k_m.gguf")
.with_n_ctx(512)
.with_embeddings(true)
.with_pooling_type(PoolingType::Rank),
)?;
let scores = llama.rerank("safe systems programming", &[
"Rust prevents many memory bugs.",
"Paris is the capital of France.",
])?;
The helper currently encodes each (query, document) pair
independently. Use the lower-level batch and sequence APIs when you
need higher throughput.
Bi-encoder vs cross-encoder¶
| Method | Latency | Quality | When to use it |
|---|---|---|---|
| Bi-encoder (cosine on independent embeddings) | Cheap — encode each text once, then dot product. | Good for "is this in the ballpark" retrieval. | First-stage retrieval over thousands of documents. |
Cross-encoder (rerank on (query, doc) pairs) |
Expensive — one model pass per pair. | Much better at fine-grained relevance. | Second-stage reranking over the top K candidates. |
A typical pipeline uses both: a fast bi-encoder retrieves 100 candidates, then a cross-encoder reranks them.
Building a vector index¶
llama-crab is unopinionated about the storage layer. The simplest
"index" is a Vec<(String, Vec<f32>)> of (text, embedding) pairs
held in memory. For larger corpora, pair the embeddings with one of:
hnsw— Rust-native HNSW.qdrant-client— Qdrant vector DB.pgvector— Postgres with vector support.
The important invariant is that the index stores L2-normalised vectors and the query is also normalised — then the dot product equals cosine similarity and you can use a single index type for both.
Common pitfalls¶
| Pitfall | Symptom | Fix |
|---|---|---|
| Wrong pooling type | Similarity is NaN or close to zero across all pairs. |
BGE / GTE / E5 expect Cls; sentence-transformers-style models prefer Mean. |
Forgot with_embeddings(true) |
embed panics with "embedding mode is not enabled". |
Add .with_embeddings(true) to the params. |
| Comparing unnormalised vectors | Similarity scores look wrong. | Pass normalize = true to embed. |
Cross-encoder loaded with Mean pooling |
rerank returns garbage scores. |
Use PoolingType::Rank for cross-encoders. |
| Embedding model is too small for the language | Similarity scores look like noise. | Pick a model trained on the language you want to embed. |
Where to next?¶
- Embeddings example — a 30-line program that prints one embedding.
- Semantic search example — cosine ranking over a small corpus.
- Reranker example — a bi-encoder reranker demo.
- RAG recipe — combining embeddings, a vector store and a chat model in a single pipeline.