Skip to content

JSON-Schema & GBNF grammars

Constrained decoding is the most reliable way to get a model to emit structured output. llama-crab ships with a pure-Rust JSON-Schema → GBNF converter that supports a useful subset of JSON Schema 2020-12, and a GBNF grammar sampler that constrains the model's logits to match the grammar at every step.

How it works

flowchart LR
    A[JSON Schema] -->|json_schema_grammar| B[GBNF string]
    B --> C[LlamaSampler::grammar]
    D[Logits] --> C
    C -->|mask out invalid tokens| E[Sampled token]
    E --> F[Always valid output]

The grammar sampler runs after every other sampler in the chain. It looks at the current context (the tokens generated so far) and the GBNF grammar, computes the set of tokens that would keep the output valid, and masks all other tokens' logits to -inf. The next sampler in the chain then picks from the masked distribution.

The result: the model literally cannot emit a token that would break the grammar. The output is guaranteed to be valid against the schema, regardless of the model size or the prompt.

Quickstart

use llama_crab::high_level::completion::json_schema_grammar;
use serde_json::json;

let schema = json!({
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "age":  { "type": "integer" }
    },
    "required": ["name", "age"]
});
let grammar = json_schema_grammar(&schema).unwrap();

The function returns a String containing a valid GBNF grammar. Pass it to the grammar sampler:

use llama_crab::sampling::LlamaSampler;
use llama_crab::high_level::completion::CompletionOptions;
use llama_crab::{Llama, LlamaParams};

let mut llama = Llama::load(LlamaParams::new("model.gguf"))?;
let grammar = unsafe { LlamaSampler::grammar(llama.model(), &grammar_text, "root")? };
let greedy = LlamaSampler::greedy();
let mut sampler = LlamaSampler::chain(vec![grammar, greedy], false)?;

let completion = llama.create_completion_with_sampler(
    "Return one object: ",
    CompletionOptions::new(64),
    &mut sampler,
)?;

LlamaSampler::grammar is gated by the common Cargo feature. The full example is in examples/structured/.

Supported JSON-Schema features

The converter understands a useful subset of JSON Schema 2020-12:

Feature Status
type: object with properties, required, additionalProperties
type: array with items, prefixItems, minItems, maxItems
type: string with minLength, maxLength, pattern
type: integer / number with minimum, maximum, exclusiveMinimum, exclusiveMaximum
type: boolean, null
enum (string, integer, boolean, null)
const
format: date-time, email, uri, uuid
oneOf, anyOf, allOf
$ref (local, #/definitions/...)
definitions, $defs
Conditional keywords (if, then, else) Partial
Recursive schemas Partial (single-level $ref only)

If a feature you need is missing, open an issue with the schema snippet. The converter is designed to grow with the use cases the community hits.

A worked example

Suppose you want the model to emit a list of "people", each with a name, age and email. The schema is:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name":  { "type": "string" },
      "age":   { "type": "integer", "minimum": 0 },
      "email": { "type": "string", "format": "email" }
    },
    "required": ["name", "age"]
  },
  "minItems": 1,
  "maxItems": 5
}

The GBNF grammar the converter produces is roughly:

root   ::= arr
arr    ::= "[" item (", " item)* "]"
item   ::= "{" pair (", " pair)* "}"
pair   ::= string ":" (number|string)
string ::= "\"" char+ "\""
number ::= [0-9]+
char   ::= [^"\\] | "\\" ["\\nrt]

When the model generates, the grammar sampler only allows tokens that keep the partial output on a path to a valid root rule. The output is always parseable JSON that matches the schema.

Performance

Grammars have a small per-token overhead — the grammar sampler evaluates the grammar against the partial output every step. In practice the cost is dominated by the model forward pass, not the sampler, so total time-to-completion is usually comparable to unconstrained generation. The grammar is also tighter than what a hand-written GBNF would be, because the converter optimises for the schema structure.

Custom grammars

For full control, build a GBNF string by hand and pass it directly to the grammar sampler (gated by the common feature):

let grammar_text = r#"
root   ::= "answer=" answer
answer ::= "yes" | "no"
"#;
let grammar = unsafe { LlamaSampler::grammar(llama.model(), grammar_text, "root")? };

GBNF is a small, BNF-like grammar language. The llama.cpp GBNF spec covers the full syntax.

When to use grammars vs few-shot

Approach Reliability Flexibility Cost
Grammar-constrained decoding 100 % valid output. Output is locked to the grammar. Small per-token overhead.
Few-shot prompting 80–95 % valid output (model-dependent). Anything the model can express. None.
JSON-mode + parser High (most models emit valid JSON when asked). The schema has to be hinted in the prompt. None, plus a post-hoc parser.

The grammar sampler is the right choice when:

  • The schema is fixed and known in advance.
  • Downstream code expects well-typed output (no fallback parser).
  • The cost of an invalid output is high (e.g. a database insert).

Common pitfalls

Pitfall What goes wrong Fix
Schema with no type keyword Converter falls back to "any value", which is unconstrained. Add type: object (or whatever the root is).
Recursive schema with deep nesting Converter truncates recursion at one level. Flatten the schema or use anyOf with a fixed depth.
Grammar sampler runs before another sampler The second sampler picks an invalid token. Always put the grammar sampler last in the chain.
LlamaSampler::grammar returns None The common feature is not enabled. Add features = ["common"] to the dependency.
Model ignores the grammar The model is too small or the prompt is bad. Increase model size; verify the prompt mentions the expected output.

Where to next?