Running the server¶
The server binary lives in the llama-crab-server crate. It boots in
a few seconds (depending on the model size) and exposes an
OpenAI-compatible HTTP API.
The basic command¶
cargo run -p llama-crab-server --release -- \
--model models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
--host 127.0.0.1 \
--port 8080
The binary prints a banner on stderr when it is ready:
llama-crab-server listening on http://127.0.0.1:8080
model : llama-crab
routes: /health, /v1/models, /v1/completions, /v1/chat/completions,
/v1/embeddings, /v1/rerank, /extras/tokenize,
/extras/tokenize/count, /extras/detokenize
ctrl+c to stop
RUST_LOG=debug (or any other level) overrides the default info
filter.
Command-line flags¶
| Flag | Default | Purpose |
|---|---|---|
--model <PATH> |
required | GGUF model path. |
--model-name <NAME> |
llama-crab |
id returned by /v1/models. |
--host <HOST> |
127.0.0.1 |
Bind host. Use 0.0.0.0 to listen on all interfaces. |
--port <PORT> |
8080 |
Bind port. |
--n-ctx <N> |
2048 |
Context size in tokens. |
--n-batch <N> |
512 |
Logical batch size. |
--n-threads <N> |
0 (auto) |
CPU thread count; 0 keeps the default. |
--n-gpu-layers <N> |
0 |
Number of transformer layers to offload. |
--mobile-preset <PRESET> |
– | One of low-ram, balanced, gpu-max. |
--embeddings |
false |
Enable embedding mode. |
--reranking |
false |
Enable rerank endpoints. |
--pooling <TYPE> |
unspecified |
none, mean, cls, last, rank, unspecified. |
--mmproj <PATH> |
– | Multimodal projector path. |
Environment variables¶
Every flag has an LLAMA_CRAB_* environment variable equivalent,
useful in containerised environments and systemd units:
| Flag | Environment variable | Purpose |
|---|---|---|
--model |
LLAMA_CRAB_MODEL |
GGUF model path. |
--model-name |
LLAMA_CRAB_MODEL_NAME |
id returned by /v1/models. |
--n-ctx |
LLAMA_CRAB_N_CTX |
Context size. |
--n-batch |
LLAMA_CRAB_N_BATCH |
Logical batch size. |
--n-threads |
LLAMA_CRAB_N_THREADS |
CPU thread count; 0 keeps the default. |
--n-gpu-layers |
LLAMA_CRAB_N_GPU_LAYERS |
GPU offload layers. |
--mobile-preset |
LLAMA_CRAB_MOBILE_PRESET |
Mobile defaults: low-ram, balanced, gpu-max. |
--embeddings |
LLAMA_CRAB_EMBEDDINGS |
Enable embedding mode. |
--reranking |
LLAMA_CRAB_RERANKING |
Enable rerank endpoints. |
--pooling |
LLAMA_CRAB_POOLING |
Pooling type. |
--mmproj |
LLAMA_CRAB_MMPROJ |
Multimodal projector path. |
When --mobile-preset is set, the other flags only override the
preset's defaults if they are explicitly provided. Without a preset,
the server keeps its previous defaults.
Common recipes¶
Embeddings server¶
cargo run -p llama-crab-server --release -- \
--model models/bge-small-en-v1.5-q4_k_m.gguf \
--embeddings
Reranking server¶
cargo run -p llama-crab-server --release -- \
--model models/bge-reranker-base-q4_k_m.gguf \
--reranking \
--pooling rank
Multimodal chat server¶
Build the server with the mtmd feature and provide the projector:
cargo run -p llama-crab-server --release --features mtmd -- \
--model models/LFM2.5-VL-1.6B-Q4_K_M.gguf \
--mmproj models/LFM2.5-VL-1.6B-mmproj-BF16.gguf
Production-style unit¶
A systemd unit file for a small chat server:
/etc/systemd/system/llama-crab.service
[Unit]
Description=llama-crab-server
After=network.target
[Service]
Type=simple
User=llama
Environment="LLAMA_CRAB_MODEL=/var/lib/llama-crab/qwen2.5-7b-instruct-q4_k_m.gguf"
Environment="LLAMA_CRAB_N_GPU_LAYERS=99"
Environment="LLAMA_CRAB_N_CTX=4096"
Environment="RUST_LOG=info"
ExecStart=/usr/local/bin/llama-crab-server --host 0.0.0.0 --port 8080
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Docker¶
A minimal Dockerfile:
Dockerfile
FROM rust:1.88-bookworm AS builder
WORKDIR /app
RUN apt-get update && apt-get install -y cmake build-essential
COPY . .
RUN cargo build --release -p llama-crab-server
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y libgomp1 ca-certificates
COPY --from=builder /app/target/release/llama-crab-server /usr/local/bin/
COPY models/ /var/lib/llama-crab/
ENV LLAMA_CRAB_MODEL=/var/lib/llama-crab/qwen2.5-7b-instruct-q4_k_m.gguf
EXPOSE 8080
CMD ["llama-crab-server", "--host", "0.0.0.0", "--port", "8080"]
Quick health check¶
Where to next?¶
- API reference — every route, with
curlexamples. - Streaming — Server-Sent Events contract.
- Structured output — the
response_format,grammarandjson_schemafields.