Skip to content

Guides

The pages in this section go deeper than the Getting Started guide on a single topic. They each explain the what and the why, walk through a representative code path, and link out to the relevant runnable example.

  • Backends & GPU offload

    Pick a build-time backend (CPU, Metal, CUDA, Vulkan, ROCm, OpenCL, KleidiAI), offload as many layers as fit in VRAM, and use the LlamaBackend capability probes to detect what's available at runtime.

  • Mobile distribution

    The release-perf and release-size profiles, the iOS and Android build flags, the MobilePreset defaults, and the caveats around OpenCL + ICD loaders + the NDK.

  • Sampling strategies

    Every sampler llama.cpp exposes (greedy, top-k, top-p, min-p, typical, mirostat, dry, penalties, XTC, grammar…), how to chain them with SamplerChain, and recommended starting points.

  • Caching & session state

    The in-process RamCache, the sled-backed DiskCache, and the manual llama_state_get_data / llama_state_set_data APIs. When the prompt cache helps (and when it doesn't).

Reading order

There's no strict order — every guide is self-contained. The most common paths through them are:

flowchart TD
    A[Getting Started] --> B{What do you need?}
    B -->|Performance on a specific GPU| C[Backends]
    B -->|Ship to iOS / Android| D[Mobile]
    B -->|Improve generation quality| E[Sampling]
    B -->|Multi-turn chat with growing history| F[Caching]

If you're unsure which guide is relevant, the Features index is a great starting point — it links to the right guide for each feature, and most guides reference one or two of the runnable examples.