Mobile distribution¶
llama-crab does not ship prebuilt mobile artifacts — you build the
Rust crate for your target and bundle the produced library or binary
with your app. This page covers the release profiles, the iOS and
Android recipes, the OpenCL + NDK integration, and the runtime
presets.
Release profiles¶
The workspace defines two release profiles for mobile packaging:
| Profile | Use case | Trade-offs |
|---|---|---|
release-perf |
Maximum runtime performance. | Larger binary, longer link time, thin LTO. |
release-size |
Smaller artifact with fat LTO, symbol stripping, panic = "abort". |
Best for store-distributed apps. |
iOS¶
Use the metal feature for Apple GPU offload:
cargo build --profile release-perf \
--target aarch64-apple-ios \
--no-default-features --features metal
For smaller CPU-only artifacts, disable default features and use
release-size:
cargo build --profile release-size \
--target aarch64-apple-ios-sim \
--no-default-features --features openmp
Keep model files outside the binary and load them from app storage —
shipping a 4 GB GGUF inside the .ipa is rarely a good idea.
Android CPU¶
For CPU-first Android builds, start with OpenMP and KleidiAI:
cargo build --profile release-size \
--target aarch64-linux-android \
--no-default-features --features openmp,kleidiai
static-stdcxx and shared-stdcxx select the Android C++ runtime.
They are mutually exclusive. If neither feature is set, the
build keeps the historical c++_static default.
Android OpenCL¶
For Adreno-oriented OpenCL builds, install OpenCL headers and an ICD loader for the target NDK. Then build with OpenCL enabled:
cargo build --profile release-perf \
--target aarch64-linux-android \
--no-default-features --features opencl,shared-stdcxx
The build forwards these environment variables to CMake when set:
| Variable | Purpose |
|---|---|
OpenCL_LIBRARY |
Path to the OpenCL library. |
OPENCL_HEADERS_DIR |
Path to OpenCL headers. |
OPENCL_ICD_LOADER_HEADERS_DIR |
Path used when building an ICD loader. |
OpenCL and KleidiAI require target SDKs and device validation. The
default CI only checks Cargo feature wiring with dynamic-link; it
does not prove that a particular Android SDK, driver, or device
can run the backend.
Runtime presets¶
The high-level API exposes [MobilePreset], a compact set of
defaults tuned for the most common mobile scenarios:
| Preset | When to use it |
|---|---|
LowRam |
Old phones, Android Go, watches. 1–2 GB of free RAM. |
Balanced |
Modern phones, 4 GB+ of free RAM. |
GpuMax |
Devices with a fast GPU (Adreno 7xx+, Apple A-series). |
use llama_crab::{Llama, LlamaParams, MobilePreset};
let mut llama = Llama::load(
LlamaParams::new("model.gguf")
.with_mobile_preset(MobilePreset::Balanced)
.with_n_ctx(2048),
)?;
Call explicit setters after with_mobile_preset when you need to
override an individual value:
use llama_crab::{Llama, LlamaParams, MobilePreset};
let mut llama = Llama::load(
LlamaParams::new("model.gguf")
.with_mobile_preset(MobilePreset::LowRam)
.with_n_ctx(1024) // override the preset's n_ctx
.with_n_threads(2), // override the preset's thread count
)?;
The server exposes the same presets through --mobile-preset low-ram,
--mobile-preset balanced, and --mobile-preset gpu-max.
Packaging checklist¶
A short checklist for shipping a llama-crab binary inside a mobile
app:
- Pick the right target triple (
aarch64-apple-ios,aarch64-linux-android, …). - Pick a release profile (
release-perffor power users,release-sizefor the App Store). - Bundle the GGUF as a downloadable asset, not a baked-in resource.
- Add a runtime "download model" UI — models are big.
- Pre-warm the model on a background thread so the first user prompt doesn't pay the load cost.
- Watch memory; on Android, trigger a GC after model load.
- On iOS, declare the Privacy manifest for any data the app sends to the model (the inference itself stays on-device).
Common pitfalls¶
| Pitfall | Fix |
|---|---|
Linker error: cannot find -lomp |
Enable the openmp feature and link against the OpenMP runtime available on the target. |
Linker error: cannot find -lOpenCL |
Install OpenCL headers and an ICD loader into the NDK sysroot, or use the OPENCL_* CMake variables. |
| App rejected by the App Store for "excessive binary size" | Use release-size and the openmp / opencl features only. |
| Crashes during model load on Android Go | Drop the model size or use MobilePreset::LowRam with a smaller n_ctx. |
| First token takes 5+ seconds | Pre-warm the model on a background thread at app start. |
Where to next?¶
- Backends & GPU offload — pick the right backend for the device.
- Cargo features — the full set of feature flags.
- Server — when you want a separate process to host the model.