API reference
The server exposes an OpenAI-compatible HTTP API. This page
documents every route, the request shape, the response shape, and
the status codes. Worked curl examples are included for most
routes.
Routes

| HTTP route | Method | Rust entry point |
|---|---|---|
| /health | GET | Readiness probe. |
| /v1/models | GET | Configured model name. |
| /v1/completions | POST | Llama::create_completion_with_options. |
| /v1/chat/completions | POST | Llama::create_chat_completion_stream_with. |
| /v1/embeddings | POST | Llama::embed_texts. |
| /v1/rerank | POST | Llama::rerank. |
| /v1/reranking | POST | Alias for /v1/rerank. |
| /rerank | POST | Alias for /v1/rerank. |
| /reranking | POST | Alias for /v1/rerank. |
| /extras/tokenize | POST | LlamaModel::tokenize. |
| /extras/tokenize/count | POST | LlamaModel::tokenize. |
| /extras/detokenize | POST | LlamaModel::detokenize. |

Set "stream": true on completion or chat requests to receive
Server-Sent Events. Text completion chunks carry choices[].text;
chat chunks carry choices[].delta.role and choices[].delta.content.
Normal streams finish with data: [DONE].
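
The stream described above can be read with a small line-oriented parser. The following is a minimal Python sketch, assuming each event arrives as a single data: line holding one JSON chunk and that the request asked for a single choice. The helper name collect_stream_text and the sample lines are illustrative and were written by hand, not captured from the server.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate generated text from SSE lines until `data: [DONE]`."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # normal end of stream
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if "text" in choice:
                # Text completion chunk: choices[].text
                pieces.append(choice["text"] or "")
            else:
                # Chat chunk: choices[].delta.content (absent on role-only chunks)
                pieces.append(choice.get("delta", {}).get("content") or "")
    return "".join(pieces)


# Hand-written sample chunks (an assumption about the wire format).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
print(collect_stream_text(sample))
```

With n greater than 1 the chunks of different choices would be interleaved, so a real client would likely need to group pieces by choice index.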
GET /health
Readiness probe. Returns 200 OK once the model is loaded.
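
A client can poll this route before sending work. The sketch below is one possible approach in Python, assuming the default address used in the curl examples on this page; the function names probe_health and wait_until_ready are hypothetical. Any result other than 200 is treated as "not ready yet".

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str = "http://127.0.0.1:8080") -> int:
    """Return the HTTP status of GET /health, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx answer, e.g. while the model is loading
    except OSError:
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_ready(probe: Callable[[], int], attempts: int = 30, delay: float = 1.0) -> bool:
    """Call `probe` until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)
    return False
```

Typical use would be wait_until_ready(probe_health) before the first completion request. The probe is passed in as a callable so the loop can be exercised without a running server.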
GET /v1/models
Returns the model configured for this process:

    {
      "object": "list",
      "data": [
        {
          "id": "local-model",
          "object": "model",
          "owned_by": "me",
          "permissions": []
        }
      ]
    }

POST /v1/completions
Plain text completion.
Request fields

| Field | Default | Description |
|---|---|---|
| prompt | required | A string or an array of strings. |
| max_tokens | 16 | Maximum number of tokens to generate. |
| min_tokens | 0 | Minimum number of tokens to generate. |
| temperature | 0.8 | 0.0 selects greedy decoding. |
| top_k | 40 | Top-K sampling. |
| top_p | 0.95 | Top-P sampling. |
| tfs_z | 1.0 | Tail-free sampling. |
| min_p | 0.05 | Min-P sampling. |
| typical_p | 1.0 | Locally-typical sampling. |
| min_keep | 1 | Minimum tokens to keep after filtering. |
| repeat_penalty | 1.0 | Repetition penalty. |
| frequency_penalty | 0.0 | Frequency penalty. |
| presence_penalty | 0.0 | Presence penalty. |
| penalty_last_n | 64 | Tokens to consider for penalties. |
| mirostat_mode | 0 | Mirostat mode (0, 1, 2). |
| mirostat_tau | 5.0 | Mirostat target perplexity. |
| mirostat_eta | 0.1 | Mirostat learning rate. |
| seed | random | RNG seed. |
| logit_bias | {} | Token id → additive logit bias. |
| logit_bias_type | input_ids | input_ids or tokens. |
| grammar | – | Raw GBNF grammar. |
| json_schema | – | JSON Schema (converted to GBNF). |
| response_format | – | text, json_object, or json_schema. |
| grammar_root | root | Root rule of the GBNF grammar. |
| stop | [] | String or list of strings. |
| stream | false | Server-Sent Events. |
| echo | false | Echo the prompt in the response. |
| suffix | – | Suffix appended after the prompt. |
| best_of | n | Number of internal candidates for n. |
| logprobs | false | Per-token log probabilities. |
| top_logprobs | 0 | Top-K logprobs per token. |
| n | 1 | Number of choices to return. |
| model | ignored | Included for OpenAI client compatibility. |
| user | ignored | Included for OpenAI client compatibility. |

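
Because an unknown field is rejected with 400 Bad Request (see Status codes), it can help to check field names on the client before sending. The following Python sketch builds a request body from the fields in the table above and posts it with the standard library. The helper names build_completion_body and post_completion are hypothetical, the base URL is the one used in the curl examples on this page, and the shape of the returned JSON is not assumed here.

```python
import json
import urllib.request

# Field names copied from the request fields table.
COMPLETION_FIELDS = {
    "prompt", "max_tokens", "min_tokens", "temperature", "top_k", "top_p",
    "tfs_z", "min_p", "typical_p", "min_keep", "repeat_penalty",
    "frequency_penalty", "presence_penalty", "penalty_last_n",
    "mirostat_mode", "mirostat_tau", "mirostat_eta", "seed", "logit_bias",
    "logit_bias_type", "grammar", "json_schema", "response_format",
    "grammar_root", "stop", "stream", "echo", "suffix", "best_of",
    "logprobs", "top_logprobs", "n", "model", "user",
}


def build_completion_body(prompt, **options) -> bytes:
    """Serialize a /v1/completions request, refusing misspelled field names."""
    unknown = set(options) - COMPLETION_FIELDS
    if unknown:
        raise ValueError(f"unknown completion fields: {sorted(unknown)}")
    return json.dumps({"prompt": prompt, **options}).encode("utf-8")


def post_completion(body: bytes, base_url: str = "http://127.0.0.1:8080") -> dict:
    """POST the body to /v1/completions and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, post_completion(build_completion_body("Once upon a time", max_tokens=32, temperature=0.0)) would request a short greedy completion, while a typo such as max_token=32 fails locally instead of producing a 400.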
POST /v1/chat/completions
Multi-turn chat completion.

    curl http://127.0.0.1:8080/v1/chat/completions \
      -H 'content-type: application/json' \
      -d '{
        "messages": [
          { "role": "user", "content": "Explain Rust ownership briefly." }
        ],
        "max_tokens": 64,
        "template": "chatml"
      }'

Chat with tools

    curl http://127.0.0.1:8080/v1/chat/completions \
      -H 'content-type: application/json' \
      -d '{
        "messages": [
          { "role": "user", "content": "Weather in Tokyo?" }
        ],
        "template": "chatml",
        "max_tokens": 96,
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
              "type": "object",
              "properties": { "city": { "type": "string" } },
              "required": ["city"]
            }
          }
        }],
        "tool_choice": {
          "type": "function",
          "function": { "name": "get_weather" }
        }
      }'

Multimodal chat
image_url.url must be a local path or file:// URL. The server
must be built with --features mtmd and started with --mmproj:

    curl http://127.0.0.1:8080/v1/chat/completions \
      -H 'content-type: application/json' \
      -d '{
        "messages": [{
          "role": "user",
          "content": [
            { "type": "text", "text": "Describe this image in one sentence." },
            { "type": "image_url", "image_url": { "url": "tests/fixtures/test_image.png" } }
          ]
        }],
        "max_tokens": 64,
        "template": "chatml"
      }'

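
When the client and server do not share a working directory, a relative path may not resolve the way the caller expects. One way around this is to send an absolute file:// URL. This is a Python sketch only: the helper name image_message is hypothetical, and it assumes the client and the server see the same filesystem.

```python
from pathlib import Path


def image_message(prompt: str, image_path: str) -> dict:
    """Build a user message with one text part and one local image part."""
    # resolve() makes the path absolute; as_uri() turns it into file://...
    url = Path(image_path).resolve().as_uri()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": url}},
        ],
    }


message = image_message(
    "Describe this image in one sentence.",
    "tests/fixtures/test_image.png",
)
print(message["content"][1]["image_url"]["url"])
```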
Chat with prior tool calls

    {
      "messages": [
        { "role": "user", "content": "Weather in Tokyo?" },
        {
          "role": "assistant",
          "tool_calls": [
            {
              "id": "call_weather",
              "type": "function",
              "function": {
                "name": "get_weather",
                "arguments": "{\"city\":\"Tokyo\"}"
              }
            }
          ]
        },
        {
          "role": "tool",
          "tool_call_id": "call_weather",
          "content": "{\"temperature\": 22}"
        }
      ],
      "template": "chatml"
    }

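
Note that in the request above both function.arguments and the tool message content are JSON-encoded strings, not nested objects. A minimal Python sketch of building that history follows; the helper name tool_round_trip is hypothetical, and the values are the ones from the example above.

```python
import json


def tool_round_trip(call_id: str, name: str, arguments: dict, result: dict) -> list:
    """Return the assistant tool-call message and the matching tool reply."""
    assistant = {
        "role": "assistant",
        "tool_calls": [{
            "id": call_id,
            "type": "function",
            "function": {
                "name": name,
                # Arguments travel as a JSON string, so encode them explicitly.
                "arguments": json.dumps(arguments, separators=(",", ":")),
            },
        }],
    }
    tool = {
        "role": "tool",
        "tool_call_id": call_id,  # must match the id of the call it answers
        "content": json.dumps(result),
    }
    return [assistant, tool]


messages = [{"role": "user", "content": "Weather in Tokyo?"}]
messages += tool_round_trip(
    "call_weather", "get_weather", {"city": "Tokyo"}, {"temperature": 22}
)
request = {"messages": messages, "template": "chatml"}
print(json.dumps(request, indent=2))
```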
Chat request fields
Chat requests accept the same generation fields as text completion, plus:

| Field | Default | Description |
|---|---|---|
| messages | required | List of {role, content} messages. |
| template | plain | plain, chatml, llama3, mistral, gemma, … |
| tools | [] | List of function definitions. |
| tool_choice | auto | none, auto, or a specific function. |
| function_call | – | Legacy OpenAI parameter. |
| top_logprobs | 0 | Top-K logprobs per token. |
| logprobs | false | Per-token log probabilities. |

Chat content may be a string, null, or an array of content
parts. Text parts are concatenated in order. image_url parts are
evaluated with mtmd when the server is built with the mtmd
feature and started with --mmproj. audio_url and video_url
parse for request compatibility but are not yet evaluated by the
server generation path.
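
To illustrate the three accepted content shapes, here is a small Python sketch that reduces a content value to its text. It is a client-side illustration, not the server's code: the function name text_of is hypothetical, and joining text parts without a separator and mapping null to an empty string are assumptions.

```python
from typing import Union


def text_of(content: Union[str, list, None]) -> str:
    """Reduce a chat `content` value to plain text."""
    if content is None:
        return ""  # assumption: null content contributes no text
    if isinstance(content, str):
        return content
    # Array of parts: keep text parts in order, skip image/audio/video parts.
    return "".join(
        part.get("text", "")
        for part in content
        if part.get("type") == "text"
    )


parts = [
    {"type": "text", "text": "Describe "},
    {"type": "image_url", "image_url": {"url": "tests/fixtures/test_image.png"}},
    {"type": "text", "text": "this image."},
]
print(text_of(parts))
```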
POST /v1/embeddings

    curl http://127.0.0.1:8080/v1/embeddings \
      -H 'content-type: application/json' \
      -d '{
        "input": ["Rust is memory-safe.", "Paris is in France."],
        "normalize": true
      }'

Set encoding_format to base64 to return each embedding as a
single base64 string containing little-endian f32 bytes:

    curl http://127.0.0.1:8080/v1/embeddings \
      -H 'content-type: application/json' \
      -d '{
        "input": "Rust",
        "encoding_format": "base64"
      }'

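
A base64 embedding can be turned back into floats with the standard library. The sketch below decodes one such string in Python; the function name decode_embedding is hypothetical, and the sample value is packed locally for a round-trip check rather than taken from a server response.

```python
import base64
import struct


def decode_embedding(b64: str) -> list:
    """Decode a base64 string of little-endian f32 values into a list of floats."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" selects little-endian, "f" is a 4-byte IEEE 754 float.
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with values that are exactly representable as f32.
sample = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode("ascii")
print(decode_embedding(sample))  # [0.5, -1.0, 2.0]
```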
POST /v1/rerank

    curl http://127.0.0.1:8080/v1/rerank \
      -H 'content-type: application/json' \
      -d '{
        "query": "safe systems programming language",
        "documents": [
          "Rust is a memory-safe systems programming language.",
          "Paris is the capital city of France.",
          "Bananas are yellow fruit."
        ],
        "top_n": 2
      }'

POST /extras/tokenize and /extras/tokenize/count

    curl http://127.0.0.1:8080/extras/tokenize \
      -H 'content-type: application/json' \
      -d '{"input": "How many tokens in this query?"}'

    curl http://127.0.0.1:8080/extras/tokenize/count \
      -H 'content-type: application/json' \
      -d '{"input": "How many tokens in this query?"}'

POST /extras/detokenize

    curl http://127.0.0.1:8080/extras/detokenize \
      -H 'content-type: application/json' \
      -d '{"tokens": [1, 2, 3]}'

Status codes

| Code | When |
|---|---|
| 200 OK | Success. |
| 400 Bad Request | Malformed JSON, unknown field, invalid template, schema that fails to compile. |
| 404 Not Found | Unknown route. |
| 422 Unprocessable Entity | The model rejected the request (e.g. tool_choice names an unknown function). |
| 500 Internal Server Error | An internal error (rare; usually means the model is not loaded). |
| 503 Service Unavailable | The model is still loading. |

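
A client might map these codes to a small set of actions. The following Python sketch shows one possible policy; the function name classify_status and the action labels are hypothetical, and treating only 503 as retryable is a design choice for this example rather than documented server guidance.

```python
def classify_status(code: int) -> str:
    """Map an HTTP status from the table above to a client-side action."""
    if code == 200:
        return "ok"
    if code == 503:
        return "retry"  # model still loading: wait and try again
    if code in (400, 404, 422):
        return "fix_request"  # the request must change before resending
    return "server_error"  # 500 and anything not listed in the table


for code in (200, 400, 404, 422, 500, 503):
    print(code, classify_status(code))
```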
Where to next?
- Streaming: the Server-Sent Events contract.
- Structured output: response_format, grammar and json_schema.
- Running the server: boot flags and presets.