Recipes
This section collects end-to-end recipes for the most common tasks. Each recipe is a small, complete program with a one-line summary of what it does, the full source, and a discussion of the trade-offs.
- Performance tuning
  Measure tokens per second, find the bottleneck, and tune n_threads, n_gpu_layers, batch size, and the sampler chain to maximise throughput on your hardware.
- Choosing a model
  Quant size vs. accuracy vs. speed vs. memory. A short guide to picking the right GGUF for the job.
- Chatbot
  From the 80-line REPL to a deployable agent: state machines, tool calls, history trimming, and session persistence.
- RAG
  Embed → store → retrieve → re-rank → answer. The full end-to-end pattern.
Reading order
The recipes are independent — pick the one that matches your
current task. If you are new to llama-crab, the
Performance tuning page is a good starting point
because it teaches you how to measure before you optimise.
flowchart LR
A[Quickstart] --> B[Performance tuning]
A --> C[Choosing a model]
A --> D[Chatbot]
A --> E[RAG]
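
To make "measure before you optimise" concrete, here is a minimal sketch of a tokens-per-second measurement. It assumes you are calling the library from Rust and deliberately uses no llama-crab API: `measure` and `tokens_per_second` are hypothetical helper names, and the closure is a stand-in for whatever generation call you actually run.

```rust
use std::time::{Duration, Instant};

/// Throughput in tokens per second; returns 0.0 for a zero-length interval
/// so the caller never divides by zero.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    tokens as f64 / secs
}

/// Runs `generate` once and returns the number of tokens it reports
/// together with the wall-clock time the call took.
/// `generate` is a placeholder for a real generation loop.
fn measure<F: FnMut() -> usize>(mut generate: F) -> (usize, Duration) {
    let start = Instant::now();
    let tokens = generate();
    (tokens, start.elapsed())
}

fn main() {
    // Stand-in workload: pretend 128 tokens took roughly 50 ms to produce.
    let (tokens, elapsed) = measure(|| {
        std::thread::sleep(Duration::from_millis(50));
        128
    });
    println!(
        "{} tokens in {:?} ({:.1} tok/s)",
        tokens,
        elapsed,
        tokens_per_second(tokens, elapsed)
    );
}
```

Taking the same measurement before and after each change to a setting such as n_threads or n_gpu_layers shows which changes actually help on your hardware.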