Server¶
llama-crab-server is a thin HTTP binary built on top of the safe
llama-crab API. It keeps inference inside the Rust binding and uses
a worker thread that owns the model and context.
The server exposes an OpenAI-compatible surface (/v1/chat/completions,
/v1/completions, /v1/embeddings, /v1/rerank, /v1/models) plus
a few llama-crab-specific extensions (/extras/tokenize,
/extras/detokenize). It is the easiest way to drop a model behind
an HTTP endpoint without writing one yourself.
-
The
cargo runcommand, the command-line flags, theLLAMA_CRAB_*environment variables, and the available presets. -
Every route, the request shape, the response shape, and the status codes. Includes a worked
curlexample for each. -
Server-Sent Events for
stream: truerequests. The exact chunk order for chat and text completions. -
The
grammar,json_schemaandresponse_formatrequest fields, and the GBNF pipeline they go through.
Runtime shape¶
The server keeps model inference on a dedicated worker thread and
sends requests to it through channels. Llama owns a native model
and context and is intentionally not shared freely across threads.
flowchart LR
HTTP[HTTP router] -->|channel| W[Worker thread]
W -->|owns| L[Llama]
L -->|owns| M[Model]
L -->|owns| C[Context]
W -->|channel| HTTP
HTTP -->|SSE| Client[Client]
The included binary uses this layout:
- One worker owns one
Llamainstance and processes requests sequentially. - The HTTP router validates requests and forwards inference jobs to the worker.
- Streaming routes forward decoded chunks back to the HTTP task over a channel.
You can run several server processes or extend the crate with several workers when you need parallel throughput.
Why a separate binary?¶
The server is intentionally a thin wrapper:
- Configuration — CLI flags and env vars.
- HTTP routing — request parsing, response formatting.
- Worker lifecycle — startup, shutdown, error handling.
- Streaming transport — Server-Sent Events.
- Errors — converts
LlamaErrorto OpenAI-style HTTP status codes.
Inference behavior remains in llama-crab so the CLI, library, and
server users exercise the same implementation. If you fork the
server for a custom integration, the only logic that needs to be
kept in sync is the HTTP layer — the model code stays the same.
Where to next?¶
- Running the server — the boot command and the command-line flags.
- API reference — every route and its parameters.
- Streaming — the SSE contract.