Building a RAG pipeline¶

Retrieval-Augmented Generation (RAG) is the pattern of retrieving relevant documents for a query, then generating an answer that quotes them. This recipe walks through the full pipeline: embed → store → retrieve → re-rank → answer.

The pipeline¶

flowchart LR
    Q[Query] --> QE[Embed]
    QE --> QR[Top-K retrieval]
    DB[(Vector index)] --> QR
    QR --> RR[Re-rank top-K]
    RR --> PROMPT[Build prompt with quotes]
    P[Query + retrieved docs] --> M[Model]
    M --> A[Answer]
    PROMPT --> M

The five steps:

Embed the query with an embedding model.
Retrieve the top K most similar documents from a vector index.
Re-rank the top K with a cross-encoder for higher precision.
Build a prompt that includes the original query and the retrieved documents.
Generate an answer with the chat model.

Step 1: embed¶

use llama_crab::context::params::PoolingType;
use llama_crab::{Llama, LlamaParams};

let mut embedder = Llama::load(
    LlamaParams::new("bge-small-en-v1.5-q4_k_m.gguf")
        .with_n_ctx(512)
        .with_embeddings(true)
        .with_pooling_type(PoolingType::Cls),
)?;

let query_embedding: Vec<f32> = embedder.embed("What is Rust?", true)?;

The true argument normalises the vector; with normalised vectors, the dot product equals cosine similarity.

Step 2: index¶

The index can be in-memory (for small corpora), a Rust-native HNSW library, or a vector database. The minimum API the index needs to expose is:

trait VectorIndex {
    fn insert(&mut self, id: &str, vec: &[f32]);
    fn search(&self, query: &[f32], k: usize) -> Vec<(String, f32)>;
}

For a production deployment, use Qdrant, pgvector, or Weaviate.

Step 3: retrieve¶

let candidates: Vec<(String, f32)> = index.search(&query_embedding, 20);

A typical first-stage K is 20–100. The re-ranker then narrows this to the top 3–5.

Step 4: re-rank¶

The cross-encoder Llama::rerank is the right tool for this. Load it with PoolingType::Rank:

use llama_crab::context::params::PoolingType;
use llama_crab::{Llama, LlamaParams};

let mut reranker = Llama::load(
    LlamaParams::new("bge-reranker-base-q4_k_m.gguf")
        .with_n_ctx(512)
        .with_embeddings(true)
        .with_pooling_type(PoolingType::Rank),
)?;

let documents: Vec<&str> = candidates.iter().map(|(doc, _)| doc.as_str()).collect();
let scores = reranker.rerank("What is Rust?", &documents)?;

// Sort and take the top 3.
let mut reranked: Vec<_> = candidates.iter().zip(scores).collect();
reranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
let top: Vec<_> = reranked.into_iter().take(3).collect();

Step 5: build the prompt¶

The prompt format depends on the chat template. A common pattern:

use llama_crab::chat::{BuiltinTemplate, ChatMessage, render_builtin};
use llama_crab::Role;

let mut messages = vec![ChatMessage::new(
    Role::System,
    "You are a helpful assistant. Use the provided context to answer the user's question. \
     If the answer is not in the context, say you don't know.",
)];

// Add the retrieved documents as context.
let context = top.iter()
    .map(|(doc, _)| doc.as_str())
    .collect::<Vec<_>>()
    .join("\n\n");
messages.push(ChatMessage::new(Role::System, format!("Context:\n{context}")));

// Add the user question.
messages.push(ChatMessage::new(Role::User, "What is Rust?"));

// Render with a known template.
let prompt = render_builtin(BuiltinTemplate::ChatMl, &messages, &[], true);

Step 6: generate¶

use llama_crab::high_level::chat_completion::create_chat_completion_with;
use llama_crab::chat::BuiltinTemplate;
use llama_crab::chat::ChatMessage;

let mut chat = Llama::load(LlamaParams::new("qwen2.5-7b-instruct-q4_k_m.gguf").with_n_ctx(4096))?;
let response = create_chat_completion_with(
    &mut chat, &messages, BuiltinTemplate::ChatMl, &[], 256,
)?;
println!("{}", response.content);

Putting it all together¶

sequenceDiagram
    participant U as User
    participant App
    participant E as Embedder
    participant IDX as Vector index
    participant RR as Reranker
    participant C as Chat model

    U->>App: "What is Rust?"
    App->>E: embed(query)
    E-->>App: query_embedding
    App->>IDX: search(query_embedding, k=20)
    IDX-->>App: top 20 candidates
    App->>RR: rerank(query, candidates)
    RR-->>App: scored candidates
    App->>App: top 3
    App->>C: chat_completion(query + context)
    C-->>App: answer
    App-->>U: answer

Performance considerations¶

Index size — for a 1 M document corpus, a Qdrant instance on a 16 GB RAM box can hold ~1 M 384-dim vectors.
Embedding throughput — embed_texts batches the inference, amortising the model load cost. For 1 M documents, expect ~1 hour on a single A100.
Re-ranking throughput — rerank is one model pass per pair. For 20 candidates, expect ~50 ms on a 4090.
End-to-end latency — typically 200–500 ms for the retrieval pipeline plus the chat generation time.

Common pitfalls¶

Pitfall	What goes wrong	Fix
Wrong pooling type	Similarity is `NaN` or close to zero.	Use `Cls` for BGE / GTE / E5.
Index stores unnormalised vectors	Dot product ≠ cosine similarity.	Normalise at insert time.
Re-ranker loaded with `Mean` pooling	`rerank` returns garbage scores.	Use `Rank` pooling.
Context too long for the chat model	The model truncates or errors.	Pick the top 3–5 candidates, not 20.
Model quotes but doesn't synthesise	The answer is just a paste.	Add a system prompt that asks the model to synthesise.

Where to next?¶

Embeddings & reranking guide — the full reference.
Embeddings example — a runnable program.
Semantic search example — cosine ranking over a small corpus.
Reranker example — bi-encoder ranking demo.