
API reference

The server exposes an OpenAI-compatible HTTP API. This page documents every route, the request shape, the response shape, and the status codes. Worked curl examples are included for most routes.

Routes

| HTTP route | Method | Rust entry point or behavior |
| --- | --- | --- |
| /health | GET | Readiness probe. |
| /v1/models | GET | Configured model name. |
| /v1/completions | POST | Llama::create_completion_with_options. |
| /v1/chat/completions | POST | Llama::create_chat_completion_stream_with. |
| /v1/embeddings | POST | Llama::embed_texts. |
| /v1/rerank | POST | Llama::rerank. |
| /v1/reranking | POST | Alias for /v1/rerank. |
| /rerank | POST | Alias for /v1/rerank. |
| /reranking | POST | Alias for /v1/rerank. |
| /extras/tokenize | POST | LlamaModel::tokenize. |
| /extras/tokenize/count | POST | LlamaModel::tokenize. |
| /extras/detokenize | POST | LlamaModel::detokenize. |

Set "stream": true on completion or chat requests to receive Server-Sent Events. Text completion chunks carry choices[].text; chat chunks carry choices[].delta.role and choices[].delta.content. Normal streams finish with data: [DONE].
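A minimal client-side sketch of consuming such a stream, assuming each event arrives as a single `data: <json>` line, as in the OpenAI streaming format. The helper name collect_stream_text is hypothetical, and any chunk field beyond those named above is an assumption.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate generated text from SSE lines until the [DONE] sentinel."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and any non-data SSE lines.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if "text" in choice:
                # Text completion chunk: choices[].text
                pieces.append(choice["text"] or "")
            else:
                # Chat chunk: choices[].delta.content (absent on role-only chunks)
                pieces.append(choice.get("delta", {}).get("content") or "")
    return "".join(pieces)
```

The sketch merges the text of all choices into one string, so it is only meaningful for requests with a single choice (n = 1).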

GET /health

Readiness probe. Returns 200 OK once the model is loaded.

curl http://127.0.0.1:8080/health
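A client can poll this route before sending work. The sketch below uses only the Python standard library and treats any result other than 200, including a refused connection, as "not ready". The function names, retry count, and delay are illustrative assumptions, not part of the server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str = "http://127.0.0.1:8080") -> int:
    """Return the HTTP status of GET /health, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a non-2xx status.
        return err.code
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def wait_until_ready(
    probe: Callable[[], int] = probe_health,
    attempts: int = 30,
    delay: float = 1.0,
) -> bool:
    """Poll until the probe reports 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)
    return False
```

Passing the probe as an argument keeps the retry loop testable without a running server.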

GET /v1/models

Returns the model configured for this process:

{
  "object": "list",
  "data": [
    {
      "id": "local-model",
      "object": "model",
      "owned_by": "me",
      "permissions": []
    }
  ]
}
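
A short sketch of reading the model id out of this response on the client side; the helper name model_ids is hypothetical.

```python
import json
from typing import List


def model_ids(body: str) -> List[str]:
    """Return the id of every entry in a /v1/models response body."""
    listing = json.loads(body)
    return [
        entry["id"]
        for entry in listing.get("data", [])
        if entry.get("object") == "model"
    ]
```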

POST /v1/completions

Plain text completion.

curl http://127.0.0.1:8080/v1/completions \
  -H 'content-type: application/json' \
  -d '{
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.7,
    "top_p": 0.9,
    "echo": false,
    "logit_bias": {"42": -100.0}
  }'

The prompt field also accepts an array of strings:

curl http://127.0.0.1:8080/v1/completions \
  -H 'content-type: application/json' \
  -d '{
    "prompt": [
      "The capital of France is",
      "The capital of Japan is"
    ],
    "max_tokens": 8
  }'

Request fields

| Field | Default | Description |
| --- | --- | --- |
| prompt | required | A string or an array of strings. |
| max_tokens | 16 | Maximum number of tokens to generate. |
| min_tokens | 0 | Minimum number of tokens to generate. |
| temperature | 0.8 | 0.0 selects greedy decoding. |
| top_k | 40 | Top-K sampling. |
| top_p | 0.95 | Top-P sampling. |
| tfs_z | 1.0 | Tail-free sampling. |
| min_p | 0.05 | Min-P sampling. |
| typical_p | 1.0 | Locally-typical sampling. |
| min_keep | 1 | Minimum tokens to keep after filtering. |
| repeat_penalty | 1.0 | Repetition penalty. |
| frequency_penalty | 0.0 | Frequency penalty. |
| presence_penalty | 0.0 | Presence penalty. |
| penalty_last_n | 64 | Tokens to consider for penalties. |
| mirostat_mode | 0 | Mirostat mode (0, 1, 2). |
| mirostat_tau | 5.0 | Mirostat target perplexity. |
| mirostat_eta | 0.1 | Mirostat learning rate. |
| seed | random | RNG seed. |
| logit_bias | {} | Token id → additive logit bias. |
| logit_bias_type | input_ids | input_ids or tokens. |
| grammar | | Raw GBNF grammar. |
| json_schema | | JSON Schema (converted to GBNF). |
| response_format | | text, json_object, or json_schema. |
| grammar_root | root | Root rule of the GBNF grammar. |
| stop | [] | String or list of strings. |
| stream | false | Server-Sent Events. |
| echo | false | Echo the prompt in the response. |
| suffix | | Suffix appended after the prompt. |
| best_of | n | Number of internal candidates for n. |
| logprobs | false | Per-token log probabilities. |
| top_logprobs | 0 | Top-K logprobs per token. |
| n | 1 | Number of choices to return. |
| model | ignored | Included for OpenAI client compatibility. |
| user | ignored | Included for OpenAI client compatibility. |
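
As an illustration of how these fields combine into a request body, the sketch below builds the JSON for a completion request. The function name and the choice of which fields to expose are assumptions made for the example. Only fields that differ from the server defaults are sent, and logit_bias token ids are written as string keys because JSON object keys are always strings.

```python
import json
from typing import Dict, List, Optional, Union


def build_completion_body(
    prompt: Union[str, List[str]],
    max_tokens: int = 16,
    seed: Optional[int] = None,
    logit_bias: Optional[Dict[int, float]] = None,
) -> str:
    """Serialize a /v1/completions request body (hypothetical helper)."""
    is_text = isinstance(prompt, str)
    is_text_list = isinstance(prompt, list) and all(isinstance(p, str) for p in prompt)
    if not (is_text or is_text_list):
        raise TypeError("prompt must be a string or a list of strings")

    body: dict = {"prompt": prompt, "max_tokens": max_tokens}
    if seed is not None:
        # Omitting seed leaves the server default (random) in place.
        body["seed"] = seed
    if logit_bias:
        # Token id -> additive bias; ids become string keys in JSON.
        body["logit_bias"] = {str(tok): float(bias) for tok, bias in logit_bias.items()}
    return json.dumps(body)
```

The resulting string can be passed to curl with -d or sent with any HTTP client using a content-type of application/json.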

POST /v1/chat/completions

Multi-turn chat completion.

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "messages": [
      { "role": "user", "content": "Explain Rust ownership briefly." }
    ],
    "max_tokens": 64,
    "template": "chatml"
  }'

Chat with tools

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "messages": [
      { "role": "user", "content": "Weather in Tokyo?" }
    ],
    "template": "chatml",
    "max_tokens": 96,
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }],
    "tool_choice": {
      "type": "function",
      "function": { "name": "get_weather" }
    }
  }'

Multimodal chat

image_url.url must be a local path or file:// URL. The server must be built with --features mtmd and started with --mmproj:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image in one sentence." },
        { "type": "image_url", "image_url": { "url": "tests/fixtures/test_image.png" } }
      ]
    }],
    "max_tokens": 64,
    "template": "chatml"
  }'

Chat with prior tool calls

{
  "messages": [
    { "role": "user", "content": "Weather in Tokyo?" },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_weather",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"city\":\"Tokyo\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_weather",
      "content": "{\"temperature\": 22}"
    }
  ],
  "template": "chatml"
}
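
The tool messages in a request like this are typically produced by running each requested function locally. A minimal sketch, assuming the assistant message has the shape shown above; the helper name answer_tool_calls and the registry of local functions are hypothetical.

```python
import json
from typing import Callable, Dict, List


def answer_tool_calls(
    assistant_message: dict,
    registry: Dict[str, Callable[..., object]],
) -> List[dict]:
    """Run each requested function and build the matching tool messages."""
    replies = []
    for call in assistant_message.get("tool_calls", []):
        function = call["function"]
        # arguments is a JSON-encoded string, not a nested object.
        arguments = json.loads(function["arguments"])
        result = registry[function["name"]](**arguments)
        replies.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],  # must echo the id of the call
                "content": json.dumps(result),
            }
        )
    return replies
```

Appending the returned messages after the assistant message yields a messages array of the form shown above.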

Chat request fields

Chat requests accept the same generation fields as text completion, plus:

| Field | Default | Description |
| --- | --- | --- |
| messages | required | List of {role, content} messages. |
| template | plain | plain, chatml, llama3, mistral, gemma, … |
| tools | [] | List of function definitions. |
| tool_choice | auto | none, auto, or a specific function. |
| function_call | | Legacy OpenAI parameter. |
| top_logprobs | 0 | Top-K logprobs per token. |
| logprobs | false | Per-token log probabilities. |

Chat content may be a string, null, or an array of content parts. Text parts are concatenated in order. image_url parts are evaluated with mtmd when the server is built with the mtmd feature and started with --mmproj. audio_url and video_url parts are parsed for request compatibility but are not yet evaluated during generation.
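
A client-side sketch of building such content, using a plain string when there is only text and an array of parts when images are attached. The helper names are hypothetical. The scheme check mirrors the rule that image_url.url must be a local path or file:// URL, and it assumes POSIX-style paths (a Windows path such as C:\img.png would be rejected by this simple check).

```python
from typing import Iterable
from urllib.parse import urlparse


def image_part(url: str) -> dict:
    """Build an image_url content part, rejecting remote URLs."""
    scheme = urlparse(url).scheme
    if scheme not in ("", "file"):
        raise ValueError("image_url.url must be a local path or file:// URL")
    return {"type": "image_url", "image_url": {"url": url}}


def user_message(text: str, image_urls: Iterable[str] = ()) -> dict:
    """Build a user message with string content or an array of content parts."""
    images = [image_part(url) for url in image_urls]
    if not images:
        return {"role": "user", "content": text}
    return {"role": "user", "content": [{"type": "text", "text": text}] + images}
```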

POST /v1/embeddings

curl http://127.0.0.1:8080/v1/embeddings \
  -H 'content-type: application/json' \
  -d '{
    "input": ["Rust is memory-safe.", "Paris is in France."],
    "normalize": true
  }'

Set encoding_format to base64 to return each embedding as a single base64 string containing little-endian f32 bytes:

curl http://127.0.0.1:8080/v1/embeddings \
  -H 'content-type: application/json' \
  -d '{
    "input": "Rust",
    "encoding_format": "base64"
  }'
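
Decoding such a string on the client takes two steps: base64-decode to bytes, then unpack the bytes as little-endian 32-bit floats. A minimal sketch using the Python standard library; the helper name decode_embedding is hypothetical, and where the string sits inside the response JSON is not shown here.

```python
import base64
import struct
from typing import List


def decode_embedding(encoded: str) -> List[float]:
    """Decode a base64 string of little-endian f32 bytes into floats."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("embedding byte length is not a multiple of 4")
    count = len(raw) // 4
    # "<" forces little-endian regardless of the host byte order.
    return list(struct.unpack("<%df" % count, raw))
```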

POST /v1/rerank

curl http://127.0.0.1:8080/v1/rerank \
  -H 'content-type: application/json' \
  -d '{
    "query": "safe systems programming language",
    "documents": [
      "Rust is a memory-safe systems programming language.",
      "Paris is the capital city of France.",
      "Bananas are yellow fruit."
    ],
    "top_n": 2
  }'

POST /extras/tokenize and /extras/tokenize/count

curl http://127.0.0.1:8080/extras/tokenize \
  -H 'content-type: application/json' \
  -d '{"input": "How many tokens in this query?"}'

curl http://127.0.0.1:8080/extras/tokenize/count \
  -H 'content-type: application/json' \
  -d '{"input": "How many tokens in this query?"}'

POST /extras/detokenize

curl http://127.0.0.1:8080/extras/detokenize \
  -H 'content-type: application/json' \
  -d '{"tokens": [1, 2, 3]}'

Status codes

| Code | When |
| --- | --- |
| 200 OK | Success. |
| 400 Bad Request | Malformed JSON, unknown field, invalid template, schema that fails to compile. |
| 404 Not Found | Unknown route. |
| 422 Unprocessable Entity | The model rejected the request (e.g. tool_choice names an unknown function). |
| 500 Internal Server Error | An internal error (rare; usually means the model is not loaded). |
| 503 Service Unavailable | The model is still loading. |
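
One way a client might act on these codes is sketched below. The category names and the retry policy are assumptions made for illustration; the server does not prescribe them.

```python
def classify_status(code: int) -> str:
    """Map a status code from the table above to a suggested client action."""
    if code == 200:
        return "ok"
    if code == 503:
        # The model is still loading, so retrying after a delay is reasonable.
        return "retry"
    if code in (400, 404, 422):
        # The request itself is at fault; resending it unchanged will not help.
        return "fix_request"
    if code == 500:
        return "server_error"
    return "unexpected"
```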

Where to next?