# Structured output

> Source: https://aiwiki.ai/wiki/structured_output
> Updated: 2026-06-23
> Categories: Artificial Intelligence, Developer Tools, Large Language Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Structured output** is a set of techniques and API features that constrain a [large language model](/wiki/large_language_model) (LLM) to emit responses that exactly conform to a predefined format or schema, such as JSON, XML, or a custom grammar, instead of free-form text. The most reliable implementations use constrained decoding, which masks invalid tokens at each generation step so the model is mathematically prevented from producing output that violates the schema. When [OpenAI](/wiki/openai) introduced its Structured Outputs feature on August 6, 2024, the gpt-4o-2024-08-06 model scored a perfect 100% on evaluations of complex JSON schema following, compared with under 40% for the previous model without the feature.[3]

Structured output has become one of the most important features in the LLM ecosystem because it bridges the gap between natural language generation and programmatic consumption. Without it, developers must write brittle parsing logic, implement retry mechanisms, and handle malformed responses. With structured output, the model itself is constrained to produce valid, schema-compliant data every time. OpenAI describes the goal directly: the feature ensures that "model-generated outputs will exactly match JSON Schemas provided by developers."[3]

## How does structured output work?

Structured output implementations rely on several underlying techniques that operate at different levels of the generation process. The dominant approach, constrained decoding, intervenes during token generation; lighter-weight approaches such as JSON mode and prompt instructions only shape or check the output after the fact. The sections below trace how the field moved from prompt-based requests to engine-level enforcement.

## When did structured output features ship?

The need for structured output emerged almost immediately after LLMs became accessible through APIs. Early approaches relied on careful [prompt engineering](/wiki/prompt_engineering) with instructions like "respond only in JSON format," but these were unreliable.[3] Models would frequently include explanatory text alongside JSON, produce invalid syntax, or omit required fields. The evolution from prompt-based approaches to engine-level enforcement represents one of the most significant practical advances in LLM tooling.

### Function calling (June 2023)

OpenAI introduced function calling on June 13, 2023, alongside the release of [GPT-3.5](/wiki/gpt-3.5) Turbo and [GPT-4](/wiki/gpt-4) model snapshots (gpt-3.5-turbo-0613 and gpt-4-0613).[2] [Function calling](/wiki/function_calling) allowed developers to describe functions with JSON Schema definitions, and the model would generate a JSON object containing arguments to call those functions. These models had been fine-tuned to both detect when a function needs to be called and to respond with JSON that adheres to the function signature.[2] While originally designed to let models invoke external tools, developers quickly recognized function calling as a general-purpose mechanism for extracting structured data. By defining a "function" that represented their desired output schema, they could get reliably structured responses without the function ever being executed.

### JSON mode (November 2023)

At its first DevDay conference on November 6, 2023, OpenAI announced GPT-4 Turbo and introduced JSON mode.[1] By setting `response_format: { type: "json_object" }` in the API request, developers could ensure the model would respond with syntactically valid JSON.[1] However, JSON mode had a significant limitation: it guaranteed valid JSON syntax but did not enforce conformance to any particular schema.[4] The model might return valid JSON that was missing required fields, had unexpected property names, or used incorrect data types. Developers still had to include prompt instructions describing the desired JSON structure and implement validation logic for the response.

### Structured Outputs (August 2024)

On August 6, 2024, OpenAI released Structured Outputs, a major upgrade that guarantees 100% schema compliance.[3] By setting `response_format: { type: "json_schema", json_schema: { ... } }`, developers could supply a full [JSON Schema](/wiki/json_schema) definition, and the model would produce output that exactly matched the specified structure. On evaluations of complex JSON schema following, the new gpt-4o-2024-08-06 model with Structured Outputs scored a perfect 100%, compared to fewer than 40% for the previous model without Structured Outputs.[3]

OpenAI achieved this through a two-pronged approach. First, they trained the model (gpt-4o-2024-08-06) to better understand complex schemas; even after this training the model reached only about 93% on their internal benchmark, which OpenAI judged insufficient for production reliability.[3] Second, they implemented constrained decoding at the inference level, using a context-free grammar (CFG) derived from the JSON Schema to dynamically restrict which tokens the model could produce at each generation step.[3] The schema is preprocessed into a cached data structure so that token masking can be performed efficiently, with minimal latency overhead during generation.[3]

Structured Outputs is available in two forms: as a `response_format` parameter for direct JSON responses, and as `strict: true` on function/tool definitions for structured function calling arguments.[3] Both forms are supported on gpt-4o, gpt-4o-mini, and all models from gpt-4-0613 onward.[3][4]

### Expansion across OpenAI models and APIs (2024-2025)

OpenAI subsequently extended Structured Outputs across its model lineup and APIs. The [o1](/wiki/o1) reasoning model gained the feature when it reached the API on December 17, 2024, alongside function calling and developer messages, and OpenAI reported that the o1-2024-12-17 snapshot significantly outperformed gpt-4o on its internal function calling and Structured Outputs evaluations.[17] [o3-mini](/wiki/o3_mini), released on January 31, 2025, became the company's first small reasoning model to support the feature.[18] The Responses API, launched on March 11, 2025 as the recommended interface for agent building,[19] retained Structured Outputs but renamed the request parameter from `response_format` to `text.format`.[4]

With the August 7, 2025 launch of [GPT-5](/wiki/gpt-5), OpenAI introduced custom tools, which let the model send tool calls as plaintext instead of JSON while optionally constraining that text with a developer-supplied context-free grammar written in Lark or regular expression syntax.[20] The grammar enforcement for custom tools is implemented with llguidance, the open-source Rust engine that also powers the Guidance library.[21]

### Anthropic Claude

[Anthropic](/wiki/anthropic) introduced tool use (function calling) for the [Claude](/wiki/claude) 3 model family in 2024, providing structured JSON output through tool definitions. On November 14, 2025, Anthropic announced Structured Outputs in public beta for [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5) and Opus 4.1.[22] This feature compiles JSON schemas into a grammar that constrains token generation during inference, rather than relying on prompting alone.[5] As Anthropic's documentation puts it, "the model literally cannot produce tokens that would violate your schema."[5] The feature is now generally available on the Claude API and [Amazon Bedrock](/wiki/amazon_bedrock) for Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5.[5] Anthropic offers two complementary capabilities: JSON outputs via `output_config.format` for controlling the response format, and strict tool use via `strict: true` for guaranteeing tool parameter schema validation.[5] When combined, Claude can call tools with guaranteed-valid parameters and return structured JSON responses in a single interaction.

Anthropic's release notes date the rollout in stages: the public beta opened on November 14, 2025 behind the beta header `structured-outputs-2025-11-13`, support for Claude Haiku 4.5 followed on December 4, 2025, and the feature reached general availability on the Claude API on January 29, 2026 for Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5, bringing expanded schema support, improved grammar compilation latency, and removal of the beta header requirement.[22] The general availability release also replaced the earlier `output_format` parameter with `output_config.format`.[22] As of mid-2026, Anthropic documents structured outputs as generally available on the Claude API and Google [Vertex AI](/wiki/vertex_ai) for newer models up through Claude Opus 4.8, and in beta on Microsoft Foundry.[5] Supported JSON Schema features include composition through `anyOf`, `allOf`, and `$ref`, plus common string formats such as `date-time`, `email`, and `uuid`, while recursive schemas and numeric bounds such as `minimum` and `maximum` are not supported.[5]

### Google Gemini

Google introduced controlled generation for [Gemini](/wiki/gemini) 1.5 Pro at Google I/O 2024.[7] The feature, sometimes called "JSON mode" or "controlled generation," uses a `response_schema` parameter built on the OpenAPI 3.0 schema definition.[6] In the Gemini API, the concept of "response schema" acts as a template, dictating the elements, data types, and overall structure of the model's output.[7] Google later expanded the capability to Gemini 1.5 Flash and added enum support in September 2024.[7] The feature adds minimal latency to API calls, even on the first request.[7] Starting with Gemini 2.5 models, the API preserves key ordering from the schema in generated output.[23] Gemini supports two output formats: `application/json` for structured JSON and `text/x.enum` for constrained enum responses.[6]

On November 5, 2025, Google announced native JSON Schema support in the Gemini API through a `responseJsonSchema` parameter, available across all actively supported Gemini models.[23] The change added keywords that the OpenAPI-based `responseSchema` could not express, including `anyOf` for unions, `$ref` for recursive schemas, `minimum` and `maximum` numeric bounds, `additionalProperties`, and `prefixItems` for tuple-like arrays, and it allowed Pydantic and Zod schema definitions to be passed to the API without conversion.[23] Google also extended implicit property ordering, which preserves the schema's key order in generated output, to all Gemini 2.5 models and the OpenAI compatibility layer.[23]

### Other providers

Structured output has become a standard feature across the LLM API market. [Mistral AI](/wiki/mistral_ai)'s API provides custom structured outputs that force generation to follow a user-supplied JSON schema through a `response_format` parameter of type `json_schema` with a `strict` flag, alongside a schemaless JSON mode.[24] [xAI](/wiki/xai)'s Grok API uses the same `response_format` pattern, guarantees that responses match the supplied schema when supported schema features are used, and accepts schemas defined with Pydantic or Zod.[25]

## What are the core techniques behind structured output?

Structured output implementations rely on several underlying techniques that operate at different levels of the generation process.

### Constrained decoding

Constrained decoding (also called guided decoding or constrained sampling) is the foundational technique behind most structured output implementations. It operates by modifying the token probability distribution at each generation step to prevent the model from producing tokens that would violate the target structure.

The process works as follows:

1. Before generation begins, the target schema or grammar is compiled into an internal representation (such as a finite-state machine or pushdown automaton).
2. At each decoding step, the inference engine determines which tokens are valid continuations based on the current state of the automaton and the tokens generated so far.
3. Invalid tokens have their logit values set to negative infinity, effectively reducing their sampling probability to zero. This is known as logit masking or token masking.[15]
4. The model then samples from the remaining valid tokens using its standard sampling strategy ([temperature](/wiki/temperature_sampling), top-p, etc.).

Because the schema can be preprocessed and cached, constrained decoding adds minimal latency overhead during generation.[3] The technique is deterministic in its structural enforcement: if the grammar says a comma must come next, only tokens containing commas will have nonzero probability.

### Finite-state machines and regular expressions

Many structured output systems compile constraints into [finite-state machines](/wiki/finite_state_machine) (FSMs). Regular expressions, which are mathematically equivalent to FSMs, provide a convenient way for developers to specify output patterns.[26] At each generation step, the FSM tracks its current state and determines which tokens would lead to valid transitions. This approach is fast and memory-efficient, but FSMs cannot express all possible structural constraints. For example, they cannot enforce matched bracket nesting in JSON, because tracking nesting depth requires a stack that FSMs do not have.

The Outlines library, for example, uses the interegular library to translate regular expressions that represent JSON Schema patterns into FSMs. Each state in the FSM corresponds to a position in the regular expression, and the set of valid next tokens is determined by which tokens would cause a valid state transition.[26]

### Context-free grammars

Context-free grammars (CFGs) are more expressive than regular expressions and FSMs. A CFG can represent nested and recursive structures, making it suitable for enforcing JSON schema compliance, programming language syntax, and other hierarchical formats. OpenAI's Structured Outputs implementation uses a CFG-based approach for this reason.[3] CFG-based constrained decoding uses a pushdown automaton (essentially an FSM augmented with a stack) to track the current parsing state.[10] However, CFG-based constrained decoding is computationally more expensive than FSM-based methods because the parser must maintain and update the stack at each token.[15]

### Compressed finite-state machines

[SGLang](/wiki/sglang) introduced compressed finite-state machines as an optimization for constrained decoding. The system analyzes constraints and builds a compressed FSM that merges multi-token paths into single-step transitions whenever possible.[27] For instance, if the grammar requires the literal string `"name":` at a certain point, the compressed FSM can emit all those characters in a single step rather than generating them token by token. This allows the engine to skip 30 to 50% of generation steps for highly structured outputs, significantly improving throughput. In benchmarks against JSON decoding pipelines built on Guidance and Outlines, the SGLang team reported that this jump-forward decoding technique reduced latency by up to 2x and improved throughput by up to 2.5x, with a re-tokenization step during jumps adding roughly 4% computational overhead.[27]

## Open-source tools and libraries

Several open-source libraries enable structured output generation for locally hosted models, often providing capabilities comparable to or exceeding those of proprietary API features.

### llama.cpp GBNF grammars

[llama.cpp](/wiki/llama_cpp), the popular C/C++ inference engine for running LLMs locally, supports grammar-constrained generation through GBNF ([GGML](/wiki/ggml) BNF). GBNF is a format based on Backus-Naur Form with modern regex-like extensions for character ranges and repetition.[8] Developers write grammar rules that define valid output sequences, and the inference engine restricts token selection to only those tokens that satisfy the grammar at each step.

For example, a simple GBNF grammar for a JSON object with name and age fields might look like:

```
root   ::= "{" ws "\"name\"" ws ":" ws string "," ws "\"age\"" ws ":" ws number "}"
string ::= "\"" [a-zA-Z ]+ "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```

llama.cpp also includes a `json-schema-to-grammar.py` utility that converts JSON Schema definitions into GBNF grammars automatically.[8] The Python bindings (`llama-cpp-python`) expose this through a `LlamaGrammar` class, available since version 0.1.78. Grammar-constrained generation lets developers use smaller, cheaper local models while still getting reliable structured output.

### Outlines

Outlines is a Python library developed by dottxt for structured generation from LLMs. Created by Brandon Willard and Remi Louf, it uses the mathematical equivalence between regular expressions and finite-state machines to constrain generation at the token level.[26] Rather than validating output after generation and retrying on failure, Outlines prevents invalid tokens from being generated in the first place. The authors demonstrated that this generation-time validation approach comes at virtually no additional cost during inference.[26]

Outlines supports multiple constraint types: JSON schemas (via [Pydantic](/wiki/pydantic) models), regular expressions, context-free grammars, and choice constraints.[9] It translates JSON Schemas into regular expressions using the interegular library, then converts those into FSMs for token-level enforcement. The library constructs an Index object by combining a Vocabulary and regular expression to efficiently map tokens from a given vocabulary to state transitions in the finite-state automaton. The core algorithms have been ported to Rust in the `outlines-core` package, developed in collaboration with Hugging Face, for better performance.[28] Outlines is compatible with models running on [Hugging Face](/wiki/hugging_face) Transformers, [vLLM](/wiki/vllm), [Ollama](/wiki/ollama), and other backends.[9] Outlines v1.0.0, released on June 18, 2025, rewrote the public API around a unified `Generator` interface, narrowed the library's focus to constrained generation, and delegated model loading and sampling control to the underlying inference libraries.[29]

### XGrammar

XGrammar is a high-performance structured generation engine developed by the MLC team that supports JSON schemas, regular expressions, and arbitrary context-free grammars.[10] It achieves near-zero overhead in JSON generation by computing token masks in under 40 microseconds per token. On benchmarks, XGrammar outperforms previous solutions by up to 3.5x on JSON schema workloads and more than 10x on CFG workloads.[16] In end-to-end evaluation, it outperforms existing LLM engines up to 14x in JSON-schema generation and up to 80x in CFG-guided generation.[16] As of late 2024, XGrammar has been adopted as the default structured generation backend in major inference frameworks including vLLM, SGLang, [TensorRT](/wiki/tensorrt)-LLM, and MLC-LLM.[15][16] It supports multiple hardware platforms including NVIDIA GPUs, AMD GPUs, Apple Silicon, and TPUs.

### Guidance

Guidance is an open-source Python library from Microsoft that provides a domain-specific language for controlling LLM output. It allows developers to interleave generation, prompting, and logical control into a single continuous flow that matches how the language model actually processes text.[11] Unlike conventional prompting techniques that rely on post-hoc parsing, Guidance enforces constraints by steering the model token by token in the inference layer. It supports select (choosing from a set of options), regular expressions, and context-free grammars.[11] Its companion library, llguidance, implements constrained decoding with performance on the order of 50 microseconds of CPU time per token.[30] Microsoft reports that Guidance provides guaranteed output formats with up to 50% faster inference speeds compared to prompt chaining approaches, because the entire program executes in a single API call. Guidance works with both locally hosted open-source models ([LLaMA](/wiki/llama), [GPT-2](/wiki/gpt-2), Alpaca) and closed API models ([ChatGPT](/wiki/chatgpt), GPT-4).[11]

### LM Format Enforcer

LM Format Enforcer, created by Noam Gat, takes a character-level approach to structured output. It combines a character-level parser with a tokenizer prefix tree to allow only tokens that contain character sequences leading to potentially valid output.[12] At each step, the character-level parser limits the choice of characters that can be added based on the current constraints. For example, when generating JSON, an opening bracket `{` must be followed by a whitespace, a closing bracket, or a quotation mark to start a property name. The library supports JSON Schema, JSON mode (schemaless), and regular expression formats.[12] It works with [Transformers](/wiki/transformers_library), [LlamaIndex](/wiki/llamaindex), [LangChain](/wiki/langchain), llama.cpp, vLLM, [Haystack](/wiki/haystack), and NVIDIA TensorRT-LLM.[12] Because it operates solely on output logits without modifying the generation loop, it supports advanced generation methods like beam search and batching.[12]

## Inference engine support

Modern LLM inference engines provide built-in structured output capabilities, making constrained decoding accessible without requiring additional libraries.

### vLLM

vLLM supports structured output through multiple backends, including XGrammar (default), Outlines, LM Format Enforcer, and Guidance.[13] By default, vLLM uses auto mode to choose the best backend based on the request.[13] It provides an OpenAI-compatible API that accepts `guided_json`, `guided_regex`, `guided_choice`, and `guided_grammar` parameters.[15] For the Python API, constraints are specified under `SamplingParams.guided_decoding`.[15] Structured output performance in vLLM V1 is dramatically faster than in V0; in V0, even a single constrained request could degrade system-wide performance, while V1 introduces minimal overhead through backend optimizations and smarter architecture.[15]

### SGLang

SGLang uses compressed finite-state machines for efficient constrained decoding.[27] It supports regular expression constraints through a `regex` argument and JSON schema constraints.[27] The system can leverage XGrammar and llguidance as grammar backends. For highly structured outputs, the compression technique can skip 30 to 50% of generation steps. SGLang also provides features like RadixAttention for prefix caching, which further optimizes repeated structured generation requests with similar schemas.[27]

## Schema definition tools

Developers typically define their desired output structures using schema definition libraries rather than writing raw JSON Schema by hand. These libraries provide type safety, validation, and automatic schema generation.

### Pydantic

Pydantic is the dominant schema definition library in the Python ecosystem for structured output. Developers define Python classes with typed fields, and Pydantic automatically generates JSON Schema definitions from the class structure. Pydantic also provides runtime validation: after the model generates output, Pydantic can parse and validate the JSON to confirm it matches the schema, providing detailed error messages when validation fails. Most structured output libraries in Python (including Instructor, Outlines, and the OpenAI Python SDK) accept Pydantic models as schema definitions.[9][14]

Example:

```python
from pydantic import BaseModel
from typing import Optional

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class UserInfo(BaseModel):
    name: str
    age: int
    email: str
    phone: Optional[str] = None
    address: Address
    interests: list[str]
```

This model definition automatically generates a JSON Schema with nested object support, optional fields, array types, and string/integer constraints. The same class is used for both defining the schema sent to the LLM and validating the response.

### Zod

[Zod](/wiki/zod) serves the same purpose in the TypeScript and JavaScript ecosystem. It provides TypeScript-first schema validation with static type inference, meaning a single Zod schema definition produces both the runtime validator and the compile-time TypeScript type. This eliminates the common problem of type definitions and validation logic drifting apart. The OpenAI Node.js SDK, [Vercel](/wiki/vercel) AI SDK, and other TypeScript LLM libraries accept Zod schemas for defining structured output formats. When a Zod schema is provided, the response includes a `parsed` field containing the validated and typed result.[4]

## High-level libraries

### Instructor

Instructor is the most widely used Python library for extracting structured data from LLMs. Created by Jason Liu and open-sourced in June 2023 (originally as "OpenAI Function Call and Pydantic Integration Module"), it has grown to over 3 million monthly downloads and 11,000+ GitHub stars.[14] The OpenAI team cited Instructor as inspiration for their Structured Outputs feature.[3] Instructor wraps LLM API calls with Pydantic model validation, automatic retries via the Tenacity library, and streaming support for partial responses.[14]

Instructor works by accepting a `response_model` parameter (a Pydantic class) that defines the desired output structure.[14] It translates the Pydantic model into the appropriate API format (function calling, tool use, or JSON Schema) depending on the provider. If the model's output fails validation, Instructor automatically retries with the validation error included in the prompt, giving the model context about what went wrong.[14] The library supports over 15 providers, including OpenAI, Anthropic, Google, Ollama, and [DeepSeek](/wiki/deepseek).[14] A TypeScript version, developed in collaboration with Dimitri Kennedy (creator of Island AI), is also available.

### LangChain

[LangChain](/wiki/langchain) provides structured output through its `with_structured_output()` method, which accepts Pydantic models, TypedDict classes, or raw JSON Schema definitions.[31] Under the hood, LangChain selects the best approach for each provider: tool calling for providers that support it, JSON mode where available, or prompt-based extraction as a fallback.[31] This abstraction allows developers to switch between providers without changing their schema definitions.

## How do structured output features compare across providers?

The following table compares structured output capabilities across major LLM providers as of early 2026.

| Feature | OpenAI | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| JSON mode | Yes (November 2023) | Yes | Yes |
| JSON Schema enforcement | Yes (August 2024) | Yes (November 2025 beta, now GA) | Yes (Google I/O 2024) |
| Function/tool calling | Yes (June 2023) | Yes (2024) | Yes |
| Strict mode for tools | `strict: true` | `strict: true` | Automatic with function calling |
| Schema definition format | JSON Schema | JSON Schema | OpenAPI 3.0 Schema |
| Constrained decoding | CFG-based | Grammar-based | Yes |
| Enum support | Yes | Yes | Yes |
| Nested objects | Yes | Yes | Yes |
| Recursive schemas | Yes (with `$ref`) | Limited | Limited |
| Streaming support | Yes | Yes | Yes |
| Key ordering preserved | No guarantee | No guarantee | Yes (Gemini 2.5+) |

## How does function calling relate to structured output?

Function calling (or tool calling) was the first widely adopted method for obtaining structured output from LLMs.[2] Although designed to let models invoke external functions, developers discovered that defining a function with the desired output schema and setting the model to always call that function effectively forced structured responses. The function itself never needed to be executed.

This pattern works because function calling requires the model to produce a JSON object matching the function's parameter schema. OpenAI formalized this pattern with Structured Outputs in August 2024, allowing developers to set `strict: true` on function definitions to guarantee that the generated arguments exactly match the JSON Schema.[3] This feature is available on all OpenAI models that support function calling, including gpt-4o, gpt-4o-mini, and all models from gpt-4-0613 onward.[3]

Anthropic and Google offer similar capabilities through their respective tool use APIs, where tool input schemas serve the same purpose as function parameter schemas.[5]

## What is structured output used for?

Structured output enables a wide range of production applications that would be difficult or impossible with free-form text generation.

### API response generation

Applications that use LLMs as backend services need responses in consistent, machine-readable formats. Structured output ensures every API response from the LLM matches the expected schema, eliminating the need for parsing heuristics and error recovery. This is particularly important for microservice architectures where the LLM component must interoperate with other services that expect specific data contracts.

### Data extraction

Extracting structured information from unstructured text is one of the most common use cases. Given a block of text (an email, a legal document, a medical record), an LLM with structured output can reliably extract specific fields such as names, dates, amounts, and categories into a predefined schema. This replaces manual data entry and rule-based extraction systems that are brittle and expensive to maintain.

### Tool use and agentic workflows

[AI agents](/wiki/ai_agents) rely on structured output to generate valid tool calls. When an agent decides to search the web, query a database, or call an API, it must produce a precisely formatted request. Structured output guarantees that tool call arguments are valid JSON matching the tool's parameter schema, preventing runtime errors in multi-step workflows. Without structured output, agent systems would need extensive error handling for malformed tool calls.

### Classification and labeling

Structured output simplifies classification tasks by constraining the model's response to a predefined set of categories. Using enum constraints or choice parameters, the model must select from valid options rather than generating free-form labels that might not match expected values. This is useful for sentiment analysis, content moderation, intent detection, and similar categorization tasks.

### Code generation

Grammar-constrained generation can enforce syntactic validity in generated code. By defining a context-free grammar for a target programming language, the model is prevented from producing syntactically invalid code, though semantic correctness still depends on the model's capabilities.

### Report and document generation

LLMs can generate structured reports with consistent sections, headings, and data tables when constrained by a schema that defines the report format. This is particularly useful in domains like finance, healthcare, and compliance where report formats must follow strict templates.

## How reliable is structured output?

While constrained decoding provides strong structural guarantees, production systems should still implement validation layers.

### Structural vs. semantic correctness

Structured output guarantees that the generated data is syntactically valid and matches the specified schema. It does not guarantee that the content is factually correct, logically consistent, or semantically meaningful. A model might produce perfectly formatted JSON with a `temperature` field set to an unrealistic value, or an `email` field containing a syntactically valid but nonexistent address. This distinction between structural and semantic correctness is fundamental to understanding the limits of constrained decoding.

### Validation strategies

Production systems typically combine constrained decoding with additional validation:

| Validation layer | What it checks | Example tools |
|---|---|---|
| Schema validation | JSON structure matches schema | Pydantic, Zod, JSON Schema validators |
| Type checking | Field values are correct types | Pydantic type coercion, Zod parsing |
| Range/constraint validation | Values fall within acceptable bounds | Pydantic field validators, custom checks |
| Semantic validation | Content makes logical sense | Application-specific business logic |
| Cross-field validation | Related fields are consistent | Pydantic model validators |
| External validation | Data matches real-world sources | Database lookups, API verification |

### Retry mechanisms

Even with constrained decoding, edge cases can occur (for example, the model may hit a maximum token limit before completing the JSON structure). Libraries like Instructor implement automatic retry logic: if the output fails validation, the error message is fed back to the model along with the original prompt, giving it a chance to correct the response.[14] Configuring a maximum number of retries (typically 2 to 3) prevents infinite loops while giving the model sufficient opportunity to self-correct.

### Performance considerations

Constrained decoding can have subtle effects on output quality. By restricting the token space at each step, the model may occasionally produce less fluent or less accurate content than it would in unconstrained mode. This tradeoff is generally acceptable for data extraction and structured tasks but may matter for tasks that require creative text within a structured format. [Benchmarks](/wiki/benchmarks) have shown that for well-designed schemas, the quality impact is negligible, and the reliability gains far outweigh any minor quality differences.

### Does constrained decoding hurt model quality? Research (2024-2026)

Whether constrained decoding helps or hurts model capability has become an active research question. A 2024 study by researchers at Appier and National Taiwan University, "Let Me Speak Freely?", presented at the EMNLP 2024 industry track, reported a significant decline in LLM reasoning performance under format restrictions and found that stricter format constraints generally led to greater degradation on reasoning tasks.[32] Researchers at dottxt, the company behind Outlines, published a rebuttal titled "Say What You Mean" arguing that the reported gaps stemmed from prompt design and unequal comparisons rather than from structure itself, and presented re-run experiments in which structured generation matched or outperformed unstructured generation.[33]

Theoretical work has also examined how token masking interacts with the model's probability distribution. Grammar-aligned decoding, presented at NeurIPS 2024, showed that standard grammar-constrained decoding can distort the model's distribution, yielding outputs that are grammatical but appear with likelihoods that are not proportional to those the model itself assigns, and introduced the ASAp sampling algorithm, which provably converges toward the model's distribution conditioned on the grammar constraint.[34] In January 2025, researchers from EPFL and Microsoft released JSONSchemaBench, a benchmark of roughly 10,000 real-world JSON schemas paired with the official JSON Schema Test Suite, used to compare six structured output frameworks (Guidance, Outlines, llama.cpp, XGrammar, and the OpenAI and Gemini APIs) on efficiency, schema coverage, and output quality.[35]

## Limitations and challenges

Despite significant advances, structured output still faces several challenges.

**Schema complexity:** Very large or deeply nested schemas can increase latency during constrained decoding because the grammar or FSM becomes more complex to evaluate at each token. OpenAI's implementation mitigates this through preprocessing and caching, but the first request with a new schema incurs additional latency for compilation.[3]

**Unsupported schema features:** Most providers support a subset of JSON Schema rather than the full specification. For example, OpenAI's Structured Outputs require all fields to be marked as required (optional fields must use nullable types with union syntax), and some providers do not support recursive references or certain conditional keywords like `if`/`then`/`else`.[4]

**[Token](/wiki/token) limit truncation:** If the model reaches its maximum output token limit before completing the structured output, the result will be truncated and invalid. Monitoring the `finish_reason` field (which will indicate `length` rather than `stop`) helps detect this condition.[4] Developers should set generous `max_tokens` values when requesting structured output for complex schemas.

**Quality degradation with aggressive constraints:** Overly restrictive grammars that leave the model with very few valid tokens at each step can reduce the quality of generated content. Designing schemas that allow the model sufficient expressive freedom within the structure produces better results.

**Latency on first request:** For providers and engines that compile schemas into grammars at request time, the first request with a new schema may incur higher latency. Subsequent requests benefit from caching. XGrammar and similar engines partially mitigate this by pre-computing reusable components of the grammar.[10]

## References

1. OpenAI. "New models and developer products announced at DevDay." November 6, 2023. https://openai.com/index/new-models-and-developer-products-announced-at-devday/
2. OpenAI. "Function calling and other API updates." June 13, 2023. https://openai.com/index/function-calling-and-other-api-updates/
3. OpenAI. "Introducing Structured Outputs in the API." August 6, 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/
4. OpenAI. "Structured model outputs." [OpenAI API](/wiki/openai_api) Documentation. https://platform.openai.com/docs/guides/structured-outputs
5. Anthropic. "Structured outputs." Claude API Documentation. https://platform.claude.com/docs/en/build-with-claude/structured-outputs
6. Google. "Structured outputs." Gemini API Documentation. https://ai.google.dev/gemini-api/docs/structured-output
7. Google Developers Blog. "Mastering Controlled Generation with Gemini 1.5: Schema Adherence for Developers." 2024. https://developers.googleblog.com/en/mastering-controlled-generation-with-gemini-15-schema-adherence/
8. ggml-org. "llama.cpp grammars." GitHub. https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
9. dottxt-ai. "Outlines: Structured Outputs." GitHub. https://github.com/dottxt-ai/outlines
10. mlc-ai. "XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models." November 2024. https://arxiv.org/abs/2411.15100
11. guidance-ai. "Guidance: A guidance language for controlling large language models." GitHub. https://github.com/guidance-ai/guidance
12. Noam Gat. "LM Format Enforcer." GitHub. https://github.com/noamgat/lm-format-enforcer
13. vLLM. "Structured Outputs." vLLM Documentation. https://docs.vllm.ai/en/latest/features/structured_outputs/
14. 567-labs. "Instructor: structured outputs for LLMs." GitHub. https://github.com/567-labs/instructor
15. vLLM Blog. "Structured Decoding in vLLM: a gentle introduction." January 2025. https://blog.vllm.ai/2025/01/14/struct-decode-intro.html
16. MLC Blog. "Achieving Efficient, Flexible, and Portable Structured Generation with XGrammar." November 2024. https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar
17. OpenAI. "OpenAI o1 and new tools for developers." December 17, 2024. https://openai.com/index/o1-and-new-tools-for-developers/
18. OpenAI. "OpenAI o3-mini." January 31, 2025. https://openai.com/index/openai-o3-mini/
19. OpenAI. "New tools for building agents." March 11, 2025. https://openai.com/index/new-tools-for-building-agents/
20. OpenAI. "Introducing GPT-5 for developers." August 7, 2025. https://openai.com/index/introducing-gpt-5-for-developers/
21. guidance-ai. "llguidance: Super-fast Structured Outputs." GitHub. https://github.com/guidance-ai/llguidance
22. Anthropic. "Claude API release notes." Claude Platform Documentation. https://platform.claude.com/docs/en/release-notes/api
23. Google Developers Blog. "Native JSON Schema support in the Gemini API." November 5, 2025. https://developers.googleblog.com/en/native-json-schema-support-in-the-gemini-api/
24. Mistral AI. "Custom Structured Outputs." Mistral AI Documentation. https://docs.mistral.ai/capabilities/structured-output/custom_structured_output/
25. xAI. "Structured Outputs." xAI Documentation. https://docs.x.ai/docs/guides/structured-outputs
26. Brandon T. Willard and Remi Louf. "Efficient Guided Generation for Large Language Models." July 2023. https://arxiv.org/abs/2307.09702
27. Lianmin Zheng et al. "SGLang: Efficient Execution of Structured Language Model Programs." 2023. https://arxiv.org/abs/2312.07104
28. dottxt-ai. "outlines-core." GitHub. https://github.com/dottxt-ai/outlines-core
29. dottxt-ai. "Outlines v1.0.0 release notes." GitHub. https://github.com/dottxt-ai/outlines/releases
30. guidance-ai. "llguidance performance." GitHub. https://github.com/guidance-ai/llguidance
31. LangChain. "How to return structured data from a model." LangChain Documentation. https://python.langchain.com/docs/how_to/structured_output/
32. Zhi Rui Tam et al. "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models." EMNLP 2024 Industry Track. https://arxiv.org/abs/2408.02442
33. dottxt-ai. "Say What You Mean: A Response to 'Let Me Speak Freely'." 2024. https://blog.dottxt.co/say-what-you-mean.html
34. Kanghee Park et al. "Grammar-Aligned Decoding." NeurIPS 2024. https://arxiv.org/abs/2405.21047
35. Saibo Geng et al. "JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models." January 2025. https://arxiv.org/abs/2501.10868
