Guides
Each page in this section goes deeper than the Getting Started guide on a single topic. Each one explains the what and the why, walks through a representative code path, and links to the relevant runnable example.
- Backends: Pick a build-time backend (CPU, Metal, CUDA, Vulkan, ROCm, OpenCL, KleidiAI), offload as many layers as fit in VRAM, and use the LlamaBackend capability probes to detect what's available at runtime.
- Mobile: The release-perf and release-size profiles, the iOS and Android build flags, the MobilePreset defaults, and the caveats around OpenCL + ICD loaders + the NDK.
- Sampling: Every sampler llama.cpp exposes (greedy, top-k, top-p, min-p, typical, mirostat, dry, penalties, XTC, grammar…), how to chain them with SamplerChain, and recommended starting points.
- Caching: The in-process RamCache, the sled-backed DiskCache, and the manual llama_state_get_data / llama_state_set_data APIs. When the prompt cache helps (and when it doesn't).
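
To give a feel for what chaining samplers means before opening the Sampling guide, here is a minimal, self-contained sketch. It uses only the Rust standard library. The names `Stage`, `Candidate`, `apply_chain`, and `pick_greedy` are hypothetical and are not the crate's SamplerChain API. The sketch assumes a chain is an ordered list of filters, each of which narrows the candidate set produced by the previous one.

```rust
// Illustrative only: these types are hypothetical and are NOT the crate's
// SamplerChain API. They show the general idea of ordered sampler stages.

/// One filtering stage in a toy sampler chain.
#[derive(Debug, Clone, Copy)]
pub enum Stage {
    /// Keep only the `k` most probable candidates.
    TopK(usize),
    /// Keep the shortest most-probable prefix whose cumulative mass reaches `p`.
    TopP(f32),
}

/// A candidate token id paired with its probability.
pub type Candidate = (u32, f32);

/// Rescale probabilities so they sum to 1.
fn renormalize(cands: &mut [Candidate]) {
    let total: f32 = cands.iter().map(|c| c.1).sum();
    if total > 0.0 {
        for c in cands.iter_mut() {
            c.1 /= total;
        }
    }
}

/// Apply each stage in order; every stage sees the output of the previous one.
pub fn apply_chain(mut cands: Vec<Candidate>, stages: &[Stage]) -> Vec<Candidate> {
    // Sort once, most probable first; both stages preserve this order.
    cands.sort_by(|a, b| b.1.total_cmp(&a.1));
    renormalize(&mut cands);
    for stage in stages {
        match *stage {
            Stage::TopK(k) => cands.truncate(k.max(1)),
            Stage::TopP(p) => {
                let mut cumulative = 0.0;
                let mut keep = 0;
                for c in &cands {
                    cumulative += c.1;
                    keep += 1;
                    if cumulative >= p {
                        break;
                    }
                }
                // Always keep at least one candidate.
                cands.truncate(keep.max(1));
            }
        }
        // Later stages work on a proper distribution again.
        renormalize(&mut cands);
    }
    cands
}

/// Greedy selection: pick the most probable surviving candidate.
pub fn pick_greedy(cands: &[Candidate]) -> Option<u32> {
    cands.first().map(|c| c.0)
}
```

Because each stage renormalizes before the next one runs, the order of stages matters: `[Stage::TopK(3), Stage::TopP(0.8)]` can keep a different set of candidates than the same two stages reversed.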
Reading order
There's no strict order — every guide is self-contained. The most common paths through them are:
flowchart TD
A[Getting Started] --> B{What do you need?}
B -->|Performance on a specific GPU| C[Backends]
B -->|Ship to iOS / Android| D[Mobile]
B -->|Improve generation quality| E[Sampling]
B -->|Multi-turn chat with growing history| F[Caching]
If you're unsure which guide is relevant, start with the Features index: it links to the right guide for each feature, and most guides reference one or two of the runnable examples.