Structured output refers to the set of techniques and features that constrain the output of a large language model (LLM) to follow a specific format or schema. Instead of producing free-form text, an LLM with structured output guarantees that its response conforms to a predefined structure such as JSON, XML, or a custom grammar. This capability is critical for production applications where downstream systems need to parse, validate, and act on model outputs reliably.
Structured output has become one of the most important features in the LLM ecosystem because it bridges the gap between natural language generation and programmatic consumption. Without it, developers must write brittle parsing logic, implement retry mechanisms, and handle malformed responses. With structured output, the model itself is constrained to produce valid, schema-compliant data every time.
The need for structured output emerged almost immediately after LLMs became accessible through APIs. Early approaches relied on careful prompt engineering with instructions like "respond only in JSON format," but these were unreliable. Models would frequently include explanatory text alongside JSON, produce invalid syntax, or omit required fields. The evolution from prompt-based approaches to engine-level enforcement represents one of the most significant practical advances in LLM tooling.
OpenAI introduced function calling on June 13, 2023, alongside the release of GPT-3.5 Turbo and GPT-4 model snapshots (gpt-3.5-turbo-0613 and gpt-4-0613). Function calling allowed developers to describe functions with JSON Schema definitions, and the model would generate a JSON object containing arguments to call those functions. These models had been fine-tuned both to detect when a function needed to be called and to respond with JSON adhering to the function signature. While originally designed to let models invoke external tools, developers quickly recognized function calling as a general-purpose mechanism for extracting structured data. By defining a "function" that represented their desired output schema, they could get reliably structured responses without the function ever being executed.
At its first DevDay conference on November 6, 2023, OpenAI announced GPT-4 Turbo and introduced JSON mode. By setting response_format: { type: "json_object" } in the API request, developers could ensure the model would respond with syntactically valid JSON. However, JSON mode had a significant limitation: it guaranteed valid JSON syntax but did not enforce conformance to any particular schema. The model might return valid JSON that was missing required fields, had unexpected property names, or used incorrect data types. Developers still had to include prompt instructions describing the desired JSON structure and implement validation logic for the response.
On August 6, 2024, OpenAI released Structured Outputs, a major upgrade that guarantees 100% schema compliance. By setting response_format: { type: "json_schema", json_schema: { ... } }, developers could supply a full JSON Schema definition, and the model would produce output that exactly matched the specified structure. On evaluations of complex JSON schema following, the new gpt-4o-2024-08-06 model with Structured Outputs scored a perfect 100%, compared to fewer than 40% for the previous model without Structured Outputs.
OpenAI achieved this through a two-pronged approach. First, they trained the model (gpt-4o-2024-08-06) to better understand complex schemas. Second, they implemented constrained decoding at the inference level, using a context-free grammar (CFG) derived from the JSON Schema to dynamically restrict which tokens the model could produce at each generation step. The schema is preprocessed into a cached data structure so that token masking can be performed efficiently, with minimal latency overhead during generation.
Structured Outputs is available in two forms: as a response_format parameter for direct JSON responses, and as strict: true on function/tool definitions for structured function calling arguments. Both forms are supported on gpt-4o, gpt-4o-mini, and all models from gpt-4-0613 onward.
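To make the request shape concrete, here is an illustrative Structured Outputs request body expressed as a plain Python dictionary. The extraction schema ("user_info") is a made-up example; the response_format layout follows OpenAI's documented parameter.

```python
import json

# Illustrative request body for OpenAI's Structured Outputs.
# The "user_info" schema is a hypothetical example.
request_body = {
    "model": "gpt-4o-2024-08-06",
    "messages": [
        {"role": "user", "content": "Extract: Jane Doe, age 34."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
}

print(json.dumps(request_body, indent=2))
```

Note that Structured Outputs requires every property to appear in `required` and `additionalProperties` to be set to `false`; optional fields are expressed with nullable union types instead.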
Anthropic introduced tool use (function calling) for the Claude 3 model family in 2024, providing structured JSON output through tool definitions. On November 14, 2025, Anthropic announced Structured Outputs in public beta for Claude Sonnet 4.5 and Opus 4.1. This feature compiles JSON schemas into a grammar that constrains token generation during inference, rather than relying on prompting alone. The feature is now generally available on the Claude API and Amazon Bedrock for Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5. Anthropic offers two complementary capabilities: JSON outputs via output_config.format for controlling the response format, and strict tool use via strict: true for guaranteeing tool parameter schema validation. When combined, Claude can call tools with guaranteed-valid parameters and return structured JSON responses in a single interaction.
Google introduced controlled generation for Gemini 1.5 Pro at Google I/O 2024. The feature, sometimes called "JSON mode" or "controlled generation," uses a response_schema parameter built on the OpenAPI 3.0 schema definition. In the Gemini API, the concept of "response schema" acts as a template, dictating the elements, data types, and overall structure of the model's output. Google later expanded the capability to Gemini 1.5 Flash and added enum support in September 2024. The feature adds minimal latency to API calls, even on the first request. Starting with Gemini 2.5 models, the API preserves key ordering from the schema in generated output. Gemini supports two output formats: application/json for structured JSON and text/x.enum for constrained enum responses.
Structured output implementations rely on several underlying techniques that operate at different levels of the generation process.
Constrained decoding (also called guided decoding or constrained sampling) is the foundational technique behind most structured output implementations. It operates by modifying the token probability distribution at each generation step to prevent the model from producing tokens that would violate the target structure.
The process works as follows:
1. Before generation begins, the schema or constraint is compiled into a grammar, regular expression, or finite-state machine.
2. At each generation step, the current parse state determines which tokens would keep the output structurally valid.
3. The logits of all invalid tokens are masked out (set to negative infinity), so those tokens receive zero probability.
4. The model samples from the remaining valid tokens, and the parse state advances accordingly.
5. The loop repeats until the structure is complete.
Because the schema can be preprocessed and cached, constrained decoding adds minimal latency overhead during generation. The technique is deterministic in its structural enforcement: if the grammar says a comma must come next, only tokens containing commas will have nonzero probability.
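The masking step can be sketched in plain Python. This is a toy vocabulary of four "tokens"; real engines apply the same idea to logit tensors over the full vocabulary.

```python
import math

def mask_logits(logits, allowed_tokens):
    """Set the logit of every disallowed token to -inf, then renormalize
    with a softmax so only grammar-valid tokens can be sampled."""
    masked = {tok: (logit if tok in allowed_tokens else -math.inf)
              for tok, logit in logits.items()}
    # Softmax over the masked logits (subtract the max for stability).
    m = max(v for v in masked.values() if v != -math.inf)
    exps = {tok: (math.exp(v - m) if v != -math.inf else 0.0)
            for tok, v in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy example: the grammar state says a comma or closing brace must come next.
logits = {",": 1.2, "}": 0.8, "hello": 3.5, "42": 2.0}
probs = mask_logits(logits, allowed_tokens={",", "}"})
# "hello" and "42" now have probability 0; "," and "}" share all the mass.
```

Even though "hello" had the highest raw logit, it is unreachable after masking, which is exactly the deterministic structural enforcement described above.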
Many structured output systems compile constraints into finite-state machines (FSMs). Regular expressions, which are equivalent in expressive power to FSMs, provide a convenient way for developers to specify output patterns. At each generation step, the FSM tracks its current state and determines which tokens would lead to valid transitions. This approach is fast and memory-efficient, but FSMs cannot express all possible structural constraints. For example, they cannot enforce arbitrarily deep matched-bracket nesting in JSON, because tracking unbounded nesting depth requires a stack that FSMs do not have.
The Outlines library, for example, uses the interegular library to translate regular expressions that represent JSON Schema patterns into FSMs. Each state in the FSM corresponds to a position in the regular expression, and the set of valid next tokens is determined by which tokens would cause a valid state transition.
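The idea can be illustrated with a hand-built automaton for the pattern `[0-9]+`. This is a toy sketch; Outlines derives such automata from the full schema-derived regular expression automatically.

```python
# States: 0 = start (no digit seen yet), 1 = accepting (one or more digits).
# Transition function for the regular expression [0-9]+.
def step(state, char):
    if char.isdigit():
        return 1          # any digit moves to (or stays in) the accepting state
    return None           # any other character is invalid

def token_is_valid(state, token):
    """A multi-character token is valid if every one of its characters
    produces a legal state transition from the current state."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return False
    return True

# Toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = ["12", "7", "ab", "3x", "004"]
allowed = [tok for tok in vocab if token_is_valid(0, tok)]
print(allowed)  # ['12', '7', '004']
```

In a real engine this per-token walk is precomputed into an index mapping each FSM state to its set of valid tokens, so generation-time lookup is a constant-time table access.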
Context-free grammars (CFGs) are more expressive than regular expressions and FSMs. A CFG can represent nested and recursive structures, making it suitable for enforcing JSON schema compliance, programming language syntax, and other hierarchical formats. OpenAI's Structured Outputs implementation uses a CFG-based approach for this reason. CFG-based constrained decoding uses a pushdown automaton (essentially an FSM augmented with a stack) to track the current parsing state. However, CFG-based constrained decoding is computationally more expensive than FSM-based methods because the parser must maintain and update the stack at each token.
SGLang introduced compressed finite-state machines as an optimization for constrained decoding. The system analyzes constraints and builds a compressed FSM that merges multi-token paths into single-step transitions whenever possible. For instance, if the grammar requires the literal string "name": at a certain point, the compressed FSM can emit all those characters in a single step rather than generating them token by token. This allows the engine to skip 30 to 50% of generation steps for highly structured outputs, significantly improving throughput.
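The jump-ahead idea can be sketched as follows. This is a simplified model in which the grammar is a nested mapping of allowed next characters; a real compressed FSM merges these paths at grammar-compilation time, not at run time.

```python
def forced_continuation(allowed_next):
    """While the grammar permits exactly one character, emit it immediately
    instead of spending a full decoding step on it."""
    out = []
    while len(allowed_next) == 1:
        ch, following = next(iter(allowed_next.items()))
        out.append(ch)
        allowed_next = following
    return "".join(out)

def chain(literal, then):
    """Build a linear grammar fragment that forces `literal`, then `then`."""
    node = then
    for ch in reversed(literal):
        node = {ch: node}
    return node

# After '{', suppose the grammar forces the literal '"name":' before the
# value, at which point generation becomes a free choice again.
free_choice = {'"': {}, '0': {}, '1': {}}   # several options: not forced
grammar = chain('"name":', free_choice)

emitted = forced_continuation(grammar)
print(emitted)  # "name":
```

The seven characters of `"name":` are emitted in one step here, which is the source of the 30 to 50% step savings on highly structured outputs.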
Several open-source libraries enable structured output generation for locally hosted models, often providing capabilities comparable to or exceeding those of proprietary API features.
llama.cpp, the popular C/C++ inference engine for running LLMs locally, supports grammar-constrained generation through GBNF (GGML BNF). GBNF is a format based on Backus-Naur Form with modern regex-like extensions for character ranges and repetition. Developers write grammar rules that define valid output sequences, and the inference engine restricts token selection to only those tokens that satisfy the grammar at each step.
For example, a simple GBNF grammar for a JSON object with name and age fields might look like:
```
root   ::= "{" ws "\"name\"" ws ":" ws string "," ws "\"age\"" ws ":" ws number "}"
string ::= "\"" [a-zA-Z ]+ "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```
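Because this particular grammar contains no recursion, it describes a regular language, so we can sanity-check which strings it accepts with an equivalent Python regular expression. This is a testing convenience, not how llama.cpp enforces GBNF.

```python
import re

# Regex equivalent of the GBNF grammar above:
# ws = [ \t\n]*, string = quoted letters/spaces, number = one or more digits.
WS = r'[ \t\n]*'
pattern = re.compile(
    r'\{' + WS + r'"name"' + WS + r':' + WS +
    r'"[a-zA-Z ]+"' + r',' + WS +
    r'"age"' + WS + r':' + WS + r'[0-9]+' + r'\}'
)

# Accepted: whitespace is allowed where the grammar places `ws`.
assert pattern.fullmatch('{"name": "Ada Lovelace", "age": 36}')
# Rejected: the grammar allows no whitespace between the string and the comma.
assert pattern.fullmatch('{ "name" : "Bob" , "age" : 7 }') is None
```

Writing out the equivalent regex also makes the grammar's quirks visible, such as the missing `ws` before the comma and closing brace.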
llama.cpp also includes a json-schema-to-grammar.py utility that converts JSON Schema definitions into GBNF grammars automatically. The Python bindings (llama-cpp-python) expose this through a LlamaGrammar class, available since version 0.1.78. Grammar-constrained generation lets developers use smaller, cheaper local models while still getting reliable structured output.
Outlines is a Python library developed by dottxt for structured generation from LLMs. Created by Brandon Willard and Remi Louf, it uses the mathematical equivalence between regular expressions and finite-state machines to constrain generation at the token level. Rather than validating output after generation and retrying on failure, Outlines prevents invalid tokens from being generated in the first place. The authors demonstrated that this generation-time validation approach comes at virtually no additional cost during inference.
Outlines supports multiple constraint types: JSON schemas (via Pydantic models), regular expressions, context-free grammars, and choice constraints. It translates JSON Schemas into regular expressions using the interegular library, then converts those into FSMs for token-level enforcement. The library constructs an Index object by combining a Vocabulary and regular expression to efficiently map tokens from a given vocabulary to state transitions in the finite-state automaton. The core algorithms have been ported to Rust in the outlines-core package, developed in collaboration with Hugging Face, for better performance. Outlines is compatible with models running on Hugging Face Transformers, vLLM, Ollama, and other backends.
XGrammar is a high-performance structured generation engine developed by the MLC team that supports JSON schemas, regular expressions, and arbitrary context-free grammars. It achieves near-zero overhead in JSON generation by computing token masks in under 40 microseconds per token. On benchmarks, XGrammar outperforms previous solutions by up to 3.5x on JSON schema workloads and more than 10x on CFG workloads. In end-to-end evaluation, it outperforms existing LLM engines up to 14x in JSON-schema generation and up to 80x in CFG-guided generation. As of late 2024, XGrammar has been adopted as the default structured generation backend in major inference frameworks including vLLM, SGLang, TensorRT-LLM, and MLC-LLM. It supports multiple hardware platforms including NVIDIA GPUs, AMD GPUs, Apple Silicon, and TPUs.
Guidance is an open-source Python library from Microsoft that provides a domain-specific language for controlling LLM output. It allows developers to interleave generation, prompting, and logical control into a single continuous flow that matches how the language model actually processes text. Unlike conventional prompting techniques that rely on post-hoc parsing, Guidance enforces constraints by steering the model token by token in the inference layer. It supports select (choosing from a set of options), regular expressions, and context-free grammars. Its companion library, llguidance, implements constrained decoding with performance on the order of 50 microseconds of CPU time per token. Microsoft reports that Guidance provides guaranteed output formats with up to 50% faster inference speeds compared to prompt chaining approaches, because the entire program executes in a single API call. Guidance works with both locally hosted open-source models (LLaMA, GPT-2, Alpaca) and closed API models (ChatGPT, GPT-4).
LM Format Enforcer, created by Noam Gat, takes a character-level approach to structured output. It combines a character-level parser with a tokenizer prefix tree to allow only tokens that contain character sequences leading to potentially valid output. At each step, the character-level parser limits the choice of characters that can be added based on the current constraints. For example, when generating JSON, an opening brace { must be followed by whitespace, a closing brace, or a quotation mark that starts a property name. The library supports JSON Schema, JSON mode (schemaless), and regular expression formats. It works with Transformers, LlamaIndex, LangChain, llama.cpp, vLLM, Haystack, and NVIDIA TensorRT-LLM. Because it operates solely on output logits without modifying the generation loop, it supports advanced generation methods like beam search and batching.
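The brace-opening rule just described can be sketched as a token filter. This toy covers only that single parser state; the real library drives a full character-level JSON parser and a tokenizer prefix tree.

```python
def allowed_after_open_brace(ch):
    # Immediately after '{', JSON permits whitespace, a closing brace,
    # or a quotation mark that starts a property name.
    return ch in ' \t\n}"'

def token_allowed(token, char_predicate):
    """A token passes if its first character satisfies the current rule.
    (The real enforcer checks every character against an evolving parser
    state; one character suffices for this illustration.)"""
    return bool(token) and char_predicate(token[0])

# Toy vocabulary of candidate tokens at this step.
vocab = ['"name', '}', '  ', 'hello', '42']
allowed = [t for t in vocab if token_allowed(t, allowed_after_open_brace)]
print(allowed)  # ['"name', '}', '  ']
```

The prefix tree makes this filtering cheap in practice: instead of scanning every token, the enforcer walks the trie and prunes whole subtrees whose first characters are already invalid.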
Modern LLM inference engines provide built-in structured output capabilities, making constrained decoding accessible without requiring additional libraries.
vLLM supports structured output through multiple backends, including XGrammar (default), Outlines, LM Format Enforcer, and Guidance. By default, vLLM uses auto mode to choose the best backend based on the request. It provides an OpenAI-compatible API that accepts guided_json, guided_regex, guided_choice, and guided_grammar parameters. For the Python API, constraints are specified under SamplingParams.guided_decoding. Structured output performance in vLLM V1 is dramatically faster than in V0; in V0, even a single constrained request could degrade system-wide performance, while V1 introduces minimal overhead through backend optimizations and smarter architecture.
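As a sketch of the OpenAI-compatible form, a request body using guided_json might look like the following plain dictionary. The model name and extraction schema are placeholders.

```python
# Illustrative request body for vLLM's OpenAI-compatible server using
# guided_json (model name and schema are hypothetical placeholders).
request_body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Extract: Berlin, population 3.7 million."}
    ],
    "guided_json": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "population_millions": {"type": "number"},
        },
        "required": ["city", "population_millions"],
    },
}
```

The sibling parameters guided_regex, guided_choice, and guided_grammar take a regular expression, a list of allowed strings, and a grammar definition, respectively, in place of the schema.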
SGLang uses compressed finite-state machines for efficient constrained decoding. It supports regular expression constraints through a regex argument and JSON schema constraints. The system can leverage XGrammar and llguidance as grammar backends. For highly structured outputs, the compression technique can skip 30 to 50% of generation steps. SGLang also provides features like RadixAttention for prefix caching, which further optimizes repeated structured generation requests with similar schemas.
Developers typically define their desired output structures using schema definition libraries rather than writing raw JSON Schema by hand. These libraries provide type safety, validation, and automatic schema generation.
Pydantic is the dominant schema definition library in the Python ecosystem for structured output. Developers define Python classes with typed fields, and Pydantic automatically generates JSON Schema definitions from the class structure. Pydantic also provides runtime validation: after the model generates output, Pydantic can parse and validate the JSON to confirm it matches the schema, providing detailed error messages when validation fails. Most structured output libraries in Python (including Instructor, Outlines, and the OpenAI Python SDK) accept Pydantic models as schema definitions.
Example:
```python
from pydantic import BaseModel
from typing import Optional


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class UserInfo(BaseModel):
    name: str
    age: int
    email: str
    phone: Optional[str] = None
    address: Address
    interests: list[str]
```
This model definition automatically generates a JSON Schema with nested object support, optional fields, array types, and string/integer constraints. The same class is used for both defining the schema sent to the LLM and validating the response.
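For illustration, the schema Pydantic generates for the Address model looks roughly like the following. Exact titles, key order, and metadata vary across Pydantic versions, so treat this as a sketch rather than verbatim output.

```python
# Approximately what Address.model_json_schema() produces in Pydantic v2
# (titles and key order vary by version; this dict is illustrative).
address_schema = {
    "title": "Address",
    "type": "object",
    "properties": {
        "street": {"title": "Street", "type": "string"},
        "city": {"title": "City", "type": "string"},
        "state": {"title": "State", "type": "string"},
        "zip_code": {"title": "Zip Code", "type": "string"},
    },
    "required": ["street", "city", "state", "zip_code"],
}

# Every property is required because none of the fields has a default.
assert set(address_schema["required"]) == set(address_schema["properties"])
```

In the UserInfo schema, the Address field appears as a `$ref` to this definition and the optional phone field is omitted from `required`, which is why providers that demand all-required schemas push developers toward nullable union types instead of defaults.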
Zod serves the same purpose in the TypeScript and JavaScript ecosystem. It provides TypeScript-first schema validation with static type inference, meaning a single Zod schema definition produces both the runtime validator and the compile-time TypeScript type. This eliminates the common problem of type definitions and validation logic drifting apart. The OpenAI Node.js SDK, Vercel AI SDK, and other TypeScript LLM libraries accept Zod schemas for defining structured output formats. When a Zod schema is provided, the response includes a parsed field containing the validated and typed result.
Instructor is the most widely used Python library for extracting structured data from LLMs. Created by Jason Liu and open-sourced in June 2023 (originally as "OpenAI Function Call and Pydantic Integration Module"), it has grown to over 3 million monthly downloads and 11,000+ GitHub stars. The OpenAI team cited Instructor as inspiration for their Structured Outputs feature. Instructor wraps LLM API calls with Pydantic model validation, automatic retries via the Tenacity library, and streaming support for partial responses.
Instructor works by accepting a response_model parameter (a Pydantic class) that defines the desired output structure. It translates the Pydantic model into the appropriate API format (function calling, tool use, or JSON Schema) depending on the provider. If the model's output fails validation, Instructor automatically retries with the validation error included in the prompt, giving the model context about what went wrong. The library supports over 15 providers, including OpenAI, Anthropic, Google, Ollama, and DeepSeek. A TypeScript version, developed in collaboration with Dimitri Kennedy (creator of Island AI), is also available.
LangChain provides structured output through its with_structured_output() method, which accepts Pydantic models, TypedDict classes, or raw JSON Schema definitions. Under the hood, LangChain selects the best approach for each provider: tool calling for providers that support it, JSON mode where available, or prompt-based extraction as a fallback. This abstraction allows developers to switch between providers without changing their schema definitions.
The following table compares structured output capabilities across major LLM providers as of early 2026.
| Feature | OpenAI | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| JSON mode | Yes (November 2023) | Yes | Yes |
| JSON Schema enforcement | Yes (August 2024) | Yes (November 2025 beta, now GA) | Yes (Google I/O 2024) |
| Function/tool calling | Yes (June 2023) | Yes (2024) | Yes |
| Strict mode for tools | strict: true | strict: true | Automatic with function calling |
| Schema definition format | JSON Schema | JSON Schema | OpenAPI 3.0 Schema |
| Constrained decoding | CFG-based | Grammar-based | Yes |
| Enum support | Yes | Yes | Yes |
| Nested objects | Yes | Yes | Yes |
| Recursive schemas | Yes (with $ref) | Limited | Limited |
| Streaming support | Yes | Yes | Yes |
| Key ordering preserved | No guarantee | No guarantee | Yes (Gemini 2.5+) |
Function calling (or tool calling) was the first widely adopted method for obtaining structured output from LLMs. Although designed to let models invoke external functions, developers discovered that defining a function with the desired output schema and setting the model to always call that function effectively forced structured responses. The function itself never needed to be executed.
This pattern works because function calling requires the model to produce a JSON object matching the function's parameter schema. OpenAI formalized this pattern with Structured Outputs in August 2024, allowing developers to set strict: true on function definitions to guarantee that the generated arguments exactly match the JSON Schema. This feature is available on all OpenAI models that support function calling, including gpt-4o, gpt-4o-mini, and all models from gpt-4-0613 onward.
Anthropic and Google offer similar capabilities through their respective tool use APIs, where tool input schemas serve the same purpose as function parameter schemas.
Structured output enables a wide range of production applications that would be difficult or impossible with free-form text generation.
Applications that use LLMs as backend services need responses in consistent, machine-readable formats. Structured output ensures every API response from the LLM matches the expected schema, eliminating the need for parsing heuristics and error recovery. This is particularly important for microservice architectures where the LLM component must interoperate with other services that expect specific data contracts.
Extracting structured information from unstructured text is one of the most common use cases. Given a block of text (an email, a legal document, a medical record), an LLM with structured output can reliably extract specific fields such as names, dates, amounts, and categories into a predefined schema. This replaces manual data entry and rule-based extraction systems that are brittle and expensive to maintain.
AI agents rely on structured output to generate valid tool calls. When an agent decides to search the web, query a database, or call an API, it must produce a precisely formatted request. Structured output guarantees that tool call arguments are valid JSON matching the tool's parameter schema, preventing runtime errors in multi-step workflows. Without structured output, agent systems would need extensive error handling for malformed tool calls.
Structured output simplifies classification tasks by constraining the model's response to a predefined set of categories. Using enum constraints or choice parameters, the model must select from valid options rather than generating free-form labels that might not match expected values. This is useful for sentiment analysis, content moderation, intent detection, and similar categorization tasks.
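A minimal enum-constrained schema for sentiment classification might look like this. The schema is illustrative; the same shape can be supplied as a response_format schema or as a tool parameter definition.

```python
# Illustrative enum-constrained schema for a sentiment classifier.
sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"],
        }
    },
    "required": ["sentiment"],
    "additionalProperties": False,
}

# With this schema enforced, the model cannot emit "mostly good" or any
# other free-form label: only the three enum values are reachable.
```

Gemini's text/x.enum output format serves the same purpose for bare enum responses, without wrapping the label in a JSON object.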
Grammar-constrained generation can enforce syntactic validity in generated code. By defining a context-free grammar for a target programming language, the model is prevented from producing syntactically invalid code, though semantic correctness still depends on the model's capabilities.
LLMs can generate structured reports with consistent sections, headings, and data tables when constrained by a schema that defines the report format. This is particularly useful in domains like finance, healthcare, and compliance where report formats must follow strict templates.
While constrained decoding provides strong structural guarantees, production systems should still implement validation layers.
Structured output guarantees that the generated data is syntactically valid and matches the specified schema. It does not guarantee that the content is factually correct, logically consistent, or semantically meaningful. A model might produce perfectly formatted JSON with a temperature field set to an unrealistic value, or an email field containing a syntactically valid but nonexistent address. This distinction between structural and semantic correctness is fundamental to understanding the limits of constrained decoding.
Production systems typically combine constrained decoding with additional validation:
| Validation layer | What it checks | Example tools |
|---|---|---|
| Schema validation | JSON structure matches schema | Pydantic, Zod, JSON Schema validators |
| Type checking | Field values are correct types | Pydantic type coercion, Zod parsing |
| Range/constraint validation | Values fall within acceptable bounds | Pydantic field validators, custom checks |
| Semantic validation | Content makes logical sense | Application-specific business logic |
| Cross-field validation | Related fields are consistent | Pydantic model validators |
| External validation | Data matches real-world sources | Database lookups, API verification |
Even with constrained decoding, edge cases can occur (for example, the model may hit a maximum token limit before completing the JSON structure). Libraries like Instructor implement automatic retry logic: if the output fails validation, the error message is fed back to the model along with the original prompt, giving it a chance to correct the response. Configuring a maximum number of retries (typically 2 to 3) prevents infinite loops while giving the model sufficient opportunity to self-correct.
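The retry pattern can be sketched in a provider-agnostic way. This is a simplified stand-in for what libraries like Instructor do, not their actual implementation; `generate` is any callable that accepts feedback and returns raw model output.

```python
import json

def generate_with_retries(generate, required_keys, max_retries=3):
    """Call `generate(feedback)` until the output parses as JSON and
    contains every required key, feeding the error back on each retry."""
    feedback = None
    for _ in range(max_retries):
        raw = generate(feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as err:
            feedback = f"Invalid JSON: {err}"
            continue
        missing = [k for k in required_keys if k not in data]
        if not missing:
            return data
        feedback = f"Missing required fields: {missing}"
    raise ValueError("validation failed after retries")

# Stub model: fails once (truncated JSON), then succeeds.
attempts = iter(['{"name": "Ada"', '{"name": "Ada", "age": 36}'])
result = generate_with_retries(lambda fb: next(attempts), ["name", "age"])
print(result)  # {'name': 'Ada', 'age': 36}
```

In a real system, `generate` would append the feedback string to the conversation before re-calling the model, so the model sees exactly what went wrong on the previous attempt.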
Constrained decoding can have subtle effects on output quality. By restricting the token space at each step, the model may occasionally produce less fluent or less accurate content than it would in unconstrained mode. This tradeoff is generally acceptable for data extraction and structured tasks but may matter for tasks that require creative text within a structured format. Benchmarks have shown that for well-designed schemas, the quality impact is negligible, and the reliability gains far outweigh any minor quality differences.
Despite significant advances, structured output still faces several challenges.
Schema complexity: Very large or deeply nested schemas can increase latency during constrained decoding because the grammar or FSM becomes more complex to evaluate at each token. OpenAI's implementation mitigates this through preprocessing and caching, but the first request with a new schema incurs additional latency for compilation.
Unsupported schema features: Most providers support a subset of JSON Schema rather than the full specification. For example, OpenAI's Structured Outputs require all fields to be marked as required (optional fields must use nullable types with union syntax), and some providers do not support recursive references or certain conditional keywords like if/then/else.
Token limit truncation: If the model reaches its maximum output token limit before completing the structured output, the result will be truncated and invalid. Monitoring the finish_reason field (which will indicate length rather than stop) helps detect this condition. Developers should set generous max_tokens values when requesting structured output for complex schemas.
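A minimal guard for this condition might look like the following. The response dictionary mimics the shape of a chat-completion result; field names may differ across providers.

```python
import json

def parse_structured_response(response):
    """Detect token-limit truncation before attempting to parse.
    `response` mimics the shape of a chat-completion API result."""
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        raise ValueError("output truncated at max_tokens; retry with a higher limit")
    return json.loads(choice["message"]["content"])

# Simulated truncated response: the JSON was cut off mid-string.
truncated = {"choices": [{"finish_reason": "length",
                          "message": {"content": '{"name": "Ada'}}]}
try:
    parse_structured_response(truncated)
except ValueError as err:
    print(err)
```

Checking finish_reason first yields a clear, actionable error instead of an opaque JSON parse failure.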
Quality degradation with aggressive constraints: Overly restrictive grammars that leave the model with very few valid tokens at each step can reduce the quality of generated content. Designing schemas that allow the model sufficient expressive freedom within the structure produces better results.
Latency on first request: For providers and engines that compile schemas into grammars at request time, the first request with a new schema may incur higher latency. Subsequent requests benefit from caching. XGrammar and similar engines partially mitigate this by pre-computing reusable components of the grammar.