Guidance (library)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,019 words
Guidance is an open-source Python library, originally developed at Microsoft Research, for building structured, multi-step programs that drive large language models. It lets a developer interleave ordinary Python control flow with model generation, and applies token-level constraints (regular expressions, context-free grammars, JSON schemas, choice sets) so that the output of an LLM is forced to conform to a desired structure. The project was created by Scott Lundberg (better known for SHAP) together with Marco Tulio Ribeiro and other Microsoft researchers, first appearing in 2022 as a Handlebars-style domain-specific language and re-released in late 2023 as an embedded Python API. Since January 2025, all of Guidance's grammar processing has been delegated to a Rust library called llguidance, also developed inside Microsoft Research, which computes valid-next-token masks in roughly 50 microseconds per step on a 128k-vocabulary tokenizer.[^1][^2][^3] Guidance is distributed under the MIT License from the guidance-ai GitHub organization and supports backends including the Hugging Face Transformers library, llama.cpp, ONNX Runtime GenAI, Azure AI, OpenAI, and other commercial APIs.[^4][^5]
| Field | Value |
|---|---|
| Original author | Scott Lundberg |
| Co-authors / maintainers | Marco Tulio Ribeiro, Harsha Nori, Richard Edgar, Michał Moskal, Hudson Cooper, Loc Huynh |
| Initial release | 0.0.1, November 11, 2022 (PyPI)[^4] |
| First Python rewrite | 0.1.0, November 14, 2023[^4][^6] |
| llguidance integration | 0.2.0, January 7, 2025[^7] |
| Latest stable release | 0.3.x series (2025 to 2026)[^4][^5] |
| Repository | github.com/guidance-ai/guidance |
| Companion grammar engine | github.com/guidance-ai/llguidance (Rust) |
| License | MIT |
| Language | Python (library API), Rust (grammar engine llguidance) |
| Supported backends | Hugging Face Transformers, llama.cpp, ONNX Runtime GenAI, Azure AI, OpenAI, Anthropic and Gemini via litellm, experimental SGLang[^5][^8] |
| Star count (GitHub) | More than 19,000 as of 2025[^9] |
Guidance is most often described as a "programming paradigm for steering language models". The official tagline on the Microsoft Research project page promises "100% guaranteed output structure, with 30 to 50% reduction in latency and costs" relative to plain prompting.[^9] The library is built around a model object that is treated as an immutable value: writing lm += "some text" or lm += gen("name", max_tokens=20) returns a new model whose state reflects the appended text or generated tokens, allowing programs to be composed and reused with the same predictability as ordinary functional code.[^6]
Guidance began inside Microsoft Research around 2022, where Scott Lundberg was a senior researcher and Marco Tulio Ribeiro was a principal researcher. Lundberg had previously created SHAP, a widely used framework for model interpretability based on Shapley values; Ribeiro is known for the LIME interpretability method and for the CheckList behavioral-testing methodology for NLP. The first PyPI release of guidance is dated November 11, 2022.[^4] Early versions exposed a templating language inspired by Handlebars, in which one would write something like {{#select 'option'}}A{{or}}B{{/select}} inside an otherwise normal prompt string. The interpreter walked the template token by token, calling the underlying model only when generation was required, and applied logits masking when the template constrained the legal next tokens.[^10]
The library was publicly announced on May 18, 2023 in a write-up covered by The Register, which described it as a domain-specific language resembling Handlebars, with linear code execution aligned to token order, controllable temperature, pattern matching constraints, and the ability to guarantee valid JSON output. In the same article Lundberg said, "with Guidance we can both accelerate inference speed and ensure that generated JSON is always valid". The same coverage cited 2x faster character generation on an NVIDIA RTX A6000 with a LLaMA-7B backend and improved accuracy on a BigBench task (76.01% vs. 63.04%) under guided execution.[^10]
In the summer of 2023 the project paused releases while the team rewrote the library. On November 14, 2023 Lundberg posted a discussion titled "Guidance reborn" and tagged version 0.1.0, which dropped the Handlebars-like DSL in favor of plain Python. He wrote that "all guidance programs are now pure Python programs. No more worrying about a distinction between 'user code' in Python and 'template code'".[^6] The new design centered on three ideas: (1) every program is a Python function operating on an immutable model object; (2) the surface syntax is a superset of regular expressions and context-free grammars so that grammars can be built up incrementally; and (3) state is carried explicitly inside the model object, which makes a guidance computation as composable as a pure function.[^6] This redesign is the version that most users encounter today and that the rest of the article describes.
The repository originally lived at github.com/microsoft/guidance but was moved to a community-maintained organization called guidance-ai. Issue trackers from May 2023 onward redirect from microsoft/guidance to guidance-ai/guidance, and the contact email listed on the PyPI page is maintainers@guidance-ai.org.[^11][^4] The original maintainers (Lundberg, Nori, Ribeiro, Edgar) remained on the project and a Microsoft contact alias (guidanceai@microsoft.com) appears in the repository README, but the codebase is no longer hosted under Microsoft's GitHub organization. Lundberg has since moved from Microsoft to Google DeepMind, where his research continues to focus on language models and explainability, while Nori and several other co-maintainers remain at Microsoft.[^12][^9]
The most significant evolution after the Python rewrite was the introduction of llguidance, a Rust library that took over the responsibility of computing the set of allowed next tokens for any given grammar. The 0.2.0 release announcement, posted on January 7, 2025 by Harsha Nori, said that "Guidance's core grammar processing has been fully migrated to the llguidance Rust library", that the new engine is "state of the art across frameworks", and that the release fixed "some key, subtle bugs in the earlier processing engine". The 0.2.0 changelog also expanded JSON schema coverage (handling oneOf, required, boolean schemas, numeric ranges, and broader allOf support), overhauled the in-line Jupyter visualizations, and made parser advancement run concurrently with the model's forward pass.[^7]
The 0.3.x series continued to broaden backend support and tighten performance. Version 0.3.0, released on September 9, 2025, added Groq and Mistral APIs via litellm, an experimental SGLang backend, OpenAI-style tool functions defined by JSON schemas, and support for new frontier models. Version 0.3.1, released in early 2026, added an onnxruntime-genai backend and Python 3.14 compatibility, dropped Python 3.9 support, and introduced an inference-time monitor that performs semantic verification of generated text. Version 0.3.2 (March 2026) was a maintenance release that updated llguidance to 1.6.1 and added URI-format JSON string support.[^5] In parallel, the library acquired prototype multimodal support: pull request #1020, opened in September 2024, prototyped a TransformersPhi3VisionEngine and introduced append_image, append_audio_bytes, and append_video_bytes methods on the model class, with a placeholder convention (<|_{modality.name}:{id}|>) for embedding non-text blobs inside the prompt string. That specific PR was closed in May 2025 in favor of a replacement implementation by Hudson Cooper, but multimodal models such as Microsoft's Phi-3 Vision are now supported through the Transformers backend.[^13]
A guidance program starts by constructing a model object that wraps a backend. Programs add to that object using the += operator. For example, lm = models.LlamaCpp("./model.gguf") constructs a model, and lm = lm + "The capital of France is " + gen("city", max_tokens=4) produces a new model whose state is the prompt concatenated with up to four generated tokens, captured under the variable city and accessible as lm["city"].[^6] Because the model is immutable, branching and backtracking, common in chain-of-thought style programs, are straightforward: the developer keeps multiple model objects in scope and continues whichever branch they need.
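The immutable-value behaviour described above can be illustrated without any model at all. The following is a minimal sketch, not Guidance's implementation: ToyModel and ToyGen are hypothetical names invented for this example, and generation is replaced by canned text so the snippet runs with the standard library only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGen:
    """Hypothetical stand-in for a generation request; returns canned text."""
    name: str
    canned: str


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical immutable model state: prompt text plus named captures."""
    text: str = ""
    captures: tuple = ()

    def __add__(self, other):
        # Appending never mutates self; it always builds a fresh ToyModel.
        if isinstance(other, str):
            return ToyModel(self.text + other, self.captures)
        if isinstance(other, ToyGen):
            new_captures = self.captures + ((other.name, other.canned),)
            return ToyModel(self.text + other.canned, new_captures)
        return NotImplemented

    def __getitem__(self, name):
        # Look up a captured variable by name, as in lm["city"].
        return dict(self.captures)[name]


base = ToyModel()
base += "The capital of France is "     # += rebinds the name to a new object
paris = base + ToyGen("city", "Paris")  # first branch
lyon = base + ToyGen("city", "Lyon")    # second branch from the same state
```

Because base is never modified, both branches start from the same prompt, which is the property that makes branching and backtracking cheap to express.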
The two most frequently used primitives are gen() and select(). gen() generates text into a named variable subject to optional constraints (regex pattern, stop strings, max tokens, temperature). select() forces the model to pick one of a closed list of options; for instance, lm + "The answer is " + select(["yes", "no", "maybe"], name="answer") forces a valid choice, so the captured value can never fall outside the listed options.[^9][^14] In addition, the library exposes:
- Regular-expression constraints: gen(regex=r"\d{3}-\d{4}") forces the output to match a phone-number-like pattern.
- Context-free grammars: grammars can be built from @guidance decorated functions or Lark-format grammars, and the resulting grammar is enforced at every step of decoding.[^2][^9]
- JSON schemas: the json() function accepts either a Python dict schema or a Pydantic model and forces the model to emit a JSON document that satisfies it.[^7]

When a constraint is in effect, Guidance computes a token mask before each sampling step, telling the backend which tokens of the model's vocabulary are legal continuations. With a local backend (Transformers, llama.cpp, ONNX Runtime GenAI) the mask is applied directly to the logits, so the constraint is enforced exactly. With remote APIs that expose only token-level logit biases (or none at all), the enforcement is partial: some constraints can be expressed via the API's own structured-output endpoint, others fall back to retry-on-failure. The advantage of local enforcement is that constraint satisfaction is provable rather than statistical, and the library exploits structural constraints to "fast-forward" through tokens whose value is already implied by the grammar (for instance, the opening {"name": of a JSON object).[^9][^2]
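The masking step can be sketched for the simplest constraint, a closed set of options. This is a toy illustration under stated assumptions, not the library's code: the vocabulary, the fixed scores standing in for model logits, and the function names allowed_token_ids and constrained_choice are all invented for the example.

```python
def allowed_token_ids(vocab, options, emitted):
    """Return ids of tokens that keep the output a prefix of some option."""
    allowed = []
    for token_id, token in enumerate(vocab):
        candidate = emitted + token
        if any(option.startswith(candidate) for option in options):
            allowed.append(token_id)
    return allowed


def constrained_choice(vocab, options, scores):
    """Greedy decoding restricted to the mask until an option is spelled out."""
    emitted = ""
    while emitted not in options:
        allowed = allowed_token_ids(vocab, options, emitted)
        # Only masked-in tokens compete; everything else is ignored.
        best = max(allowed, key=lambda token_id: scores[token_id])
        emitted += vocab[best]
    return emitted


# Hypothetical vocabulary and fixed scores (a stand-in for real logits).
VOCAB = ["y", "es", "yes", "n", "o", "no", "may", "be", "maybe", " the", "42"]
OPTIONS = ["yes", "no", "maybe"]
SCORES = [0.1, 0.2, 0.05, 0.3, 0.1, 0.2, 0.9, 0.4, 0.3, 2.0, 5.0]

unconstrained = VOCAB[SCORES.index(max(SCORES))]
answer = constrained_choice(VOCAB, OPTIONS, SCORES)
```

With these scores the unconstrained argmax is "42", while the masked loop emits "may" and then "be". After "may" the mask contains a single token, which is the situation in which a real engine can fast-forward without consulting the model.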
A subtle issue in constrained decoding is the boundary between a fixed prompt and the next token to be generated. Greedy tokenizers split the prompt into tokens that may not align cleanly with the constraint; the most likely next token might actually start a few characters earlier than the prompt's end. Guidance addresses this with "token healing", which backs the generation pointer up by one or more tokens and constrains the first generated token to share a prefix with the truncated tail of the prompt. The result is that the program behaves as if the prompt were continued character by character, rather than token by token, which empirically improves generation quality on prefix-sensitive tokenizers.[^14]
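A small sketch may make the boundary problem concrete. It assumes a hypothetical six-entry vocabulary and a greedy longest-match tokenizer, and it backs up by exactly one token; real token healing may back up further and works on the model's actual tokenizer. The function names greedy_tokenize and heal are invented for this example.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that fits."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Assumes the vocabulary can cover the text; otherwise max() raises.
        match = max((t for t in vocab if text.startswith(t, pos)), key=len)
        tokens.append(match)
        pos += len(match)
    return tokens


def heal(prompt, vocab):
    """Drop the last prompt token and list tokens allowed to replace it.

    Returns (kept_prompt, removed_tail, candidate_tokens). The first generated
    token must start with the removed tail, so the visible text is unchanged.
    Assumes a non-empty prompt.
    """
    tokens = greedy_tokenize(prompt, vocab)
    tail = tokens[-1]
    kept = "".join(tokens[:-1])
    candidates = [t for t in vocab if t.startswith(tail)]
    return kept, tail, candidates


# Hypothetical vocabulary in which "://" is a single token.
VOCAB = ["http", ":", "://", "s", "/", "example"]
kept, tail, candidates = heal("http:", VOCAB)
```

Here the prompt "http:" ends in the token ":", which would prevent the model from ever choosing the single token "://". After healing, generation resumes from "http" and both ":" and "://" are legal first tokens.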
Programs in Guidance are typically written as ordinary Python functions decorated with @guidance(stateless=True). A stateless function returns a grammar fragment rather than executing immediately, so it can be composed inside larger grammars. For instance, one can write a stateless function that yields a JSON object whose keys come from a fixed list and whose values are recursively generated; calling that function from inside another guidance program splices the grammar into the outer program. This makes grammars first-class values that can be inspected, tested against mock models, or shipped to a different backend without re-execution.[^9]
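The idea of grammar fragments as composable values can be sketched with plain regular expressions. This is an analogy rather than Guidance's API: the helpers literal, choice, sequence, key_value, and flat_object are hypothetical, and because they emit regex strings they cannot express the recursion that a real context-free grammar allows.

```python
import re


def literal(text):
    """Fragment matching exactly this text."""
    return re.escape(text)


def choice(options):
    """Fragment matching any one of the given strings."""
    return "(?:" + "|".join(re.escape(o) for o in options) + ")"


def sequence(*parts):
    """Fragment matching the parts one after another."""
    return "".join(parts)


def key_value(key, value_fragment):
    """Fragment for a JSON member with a fixed key."""
    return sequence(literal(f'"{key}": '), value_fragment)


def flat_object(keys, value_fragment):
    """Fragment for a JSON object whose keys come from a fixed list."""
    members = [key_value(k, value_fragment) for k in keys]
    return sequence(literal("{"), literal(", ").join(members), literal("}"))


# Fragments are ordinary values: build once, reuse inside a larger fragment.
quoted_answer = sequence(literal('"'), choice(["yes", "no"]), literal('"'))
pattern = flat_object(["name", "kind"], quoted_answer)

accepted = re.fullmatch(pattern, '{"name": "yes", "kind": "no"}') is not None
rejected = re.fullmatch(pattern, '{"name": "maybe", "kind": "no"}') is None
```

Nothing is generated while the fragments are being built, so the resulting pattern can be inspected or tested on its own before it is attached to a model.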
Computing a token mask from a context-free grammar at every decoding step is, in principle, expensive: for a 128k-token vocabulary, a naive implementation would parse every candidate token against the grammar before each sampling step. Guidance's initial Python parser was correct but slow enough to dominate end-to-end generation time for non-trivial schemas. The Microsoft Research team (Michał Moskal, Harsha Nori, Hudson Cooper, Loc Huynh) wrote llguidance to fix this. According to the team's technical write-up, the library was developed between 2023 and 2025 and is now used not only by Guidance but also by llama.cpp, vLLM, SGLang, Chromium, mistral.rs, and Microsoft's onnxruntime-genai.[^3][^2]
llguidance is implemented in Rust (about 86% of the code base, with thin Python and C bindings). It splits the constraint-checking job into two layers:[^3][^2]
- A lexer, built on the derivre library, that performs Brzozowski-derivative lazy DFA construction with negligible startup cost. The lexer can validate fast paths such as "we are currently inside a JSON string and any non-quote character is legal" without invoking the parser.
- A parser that runs on top of the lexer and checks the resulting lexemes against the context-free grammar.

To compute a mask, the engine traverses a prefix trie of the tokenizer's vocabulary. Each edge of the trie corresponds to appending a byte; the engine attempts to extend the current parse state with that byte and prunes whole subtrees when the parser rejects the prefix. A further "slicer" optimization precomputes masks for common regex slices (for example, the body of a JSON string) so that for the dense parts of the mask the trie traversal is skipped entirely. The team reports average mask computation around 50 microseconds per token on a 128k-token tokenizer and roughly 2 milliseconds of startup overhead, with 10 to 1000 times speedups over earlier libraries.[^2][^3]
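The trie traversal with subtree pruning can be sketched in a few lines. This is an illustrative toy, not llguidance's Rust implementation: it works on characters rather than bytes, it re-checks the whole prefix with a predicate instead of advancing an incremental parser state, and the names build_trie, compute_mask, and is_viable_prefix are invented for the example.

```python
def build_trie(vocab):
    """Build a character trie; each node records the tokens that end there."""
    root = {"children": {}, "token_ids": []}
    for token_id, token in enumerate(vocab):
        node = root
        for ch in token:
            node = node["children"].setdefault(
                ch, {"children": {}, "token_ids": []}
            )
        node["token_ids"].append(token_id)
    return root


def compute_mask(trie, vocab_size, is_viable_prefix):
    """Mark every token whose full text the grammar could still accept."""
    mask = [False] * vocab_size
    stack = [(trie, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            extended = prefix + ch
            if not is_viable_prefix(extended):
                continue  # prune: no token below this edge can be legal
            for token_id in child["token_ids"]:
                mask[token_id] = True
            stack.append((child, extended))
    return mask


# Hypothetical vocabulary; the "grammar" here accepts only runs of digits.
VOCAB = ["1", "12", "123", "1a", "a1", "abc", "9", " 9"]
mask = compute_mask(build_trie(VOCAB), len(VOCAB), str.isdigit)
```

The tokens "a1" and "abc" share the rejected first character "a", so the whole subtree is skipped after a single check. That shared-prefix saving is what makes the traversal cheaper than testing every token independently.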
To evaluate llguidance and its peers, the team published JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models, by Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori. The paper was first posted to arXiv on January 18, 2025 (with a final v3 dated February 27, 2025). It compares Guidance, Outlines, llamacpp, XGrammar, OpenAI structured outputs, and Gemini on 10,000 real-world JSON schemas, evaluating efficiency, coverage of constraint types, and output quality.[^15]
By 2025, llguidance had been integrated into several inference stacks beyond Guidance itself. OpenAI's structured outputs feature uses it as of May 2025, and llama.cpp, vLLM, SGLang, Chromium, mistral.rs, and Microsoft's onnxruntime-genai all expose it as a grammar backend.[^3][^2] In Guidance itself, the upgrade in 0.2.0 was transparent at the API level but produced large performance gains on grammar-heavy programs.[^7]
Guidance is intentionally backend-agnostic. The same Python program can target a local model running through Hugging Face Transformers or llama.cpp, a model served through Microsoft's onnxruntime-genai, an Azure AI deployment, OpenAI's API, or a Google DeepMind Gemini endpoint via litellm.[^4][^5] In practice the level of constraint support depends on the backend:
| Backend | Logits access | Full grammar enforcement | Notes |
|---|---|---|---|
| Hugging Face Transformers | Yes | Yes | Best-supported local path; used for vision language model integration via TransformersPhi3VisionEngine and successors.[^13] |
| llama.cpp | Yes | Yes | Native llguidance integration; popular for quantized local inference.[^3][^4] |
| ONNX Runtime GenAI | Yes | Yes | Added in 0.3.x for optimized local inference on a broader range of hardware.[^5] |
| Azure AI | Partial | Partial | Constraints honored when the deployment exposes compatible structured-output options.[^4] |
| OpenAI | API-level structured outputs | Partial | OpenAI's structured outputs feature itself depends on llguidance, but logits-level enforcement is not directly exposed via the public API.[^3] |
| Anthropic / Gemini (via litellm) | No | No (best effort) | Used for tool-calling style programs; constraints fall back to retry on parse failure.[^5] |
| SGLang | Yes | Yes | Experimental backend added in 0.3.0; SGLang itself integrates llguidance internally.[^5] |
The recommended path for users who need strict guarantees is one of the local backends, where llguidance masks the logits directly. Remote APIs are useful when the developer is willing to accept partial enforcement or to combine Guidance with the provider's native structured-output mode.
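The retry-on-failure fallback mentioned for remote backends can be sketched as a validate-and-retry loop. This is a generic pattern under stated assumptions, not Guidance's code: call_model is a hypothetical callable standing in for a remote API, and validation is reduced to checking that the reply is a JSON object with the required keys.

```python
import json


def generate_with_retry(call_model, required_keys, max_attempts=3):
    """Call the model until its reply parses as JSON with the required keys.

    Returns (parsed_reply, attempts_used); raises ValueError if every attempt
    fails. Unlike logit masking, this detects violations only after the fact.
    """
    for attempt in range(1, max_attempts + 1):
        raw = call_model(attempt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: pay for another call
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            return parsed, attempt
    raise ValueError(f"no valid reply after {max_attempts} attempts")


# Hypothetical canned replies standing in for a remote endpoint.
REPLIES = [
    "Sure! Here is the JSON you asked for",
    '{"tool": "search"}',
    '{"tool": "search", "args": {"q": "guidance"}}',
]
reply, attempts = generate_with_retry(
    lambda attempt: REPLIES[attempt - 1], {"tool", "args"}
)
```

In this run two calls are wasted before a valid reply arrives, which illustrates the cost that direct masking on a local backend avoids.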
Guidance is commonly used for problems where prompting alone is unreliable or expensive:
- Classification with free-form reasoning: the final answer is constrained (select over a closed answer set), so the model can produce arbitrary chain-of-thought rationale but cannot hallucinate the final classification.[^14]
- Agent and tool-use loops: developers chain gen() and select() calls inside a Python control flow so that, for instance, a planning step decides the next tool use from a closed set, an argument-generation step emits a strict JSON payload, and the result is fed back into another step, all within one process and with end-to-end constraint guarantees.[^9][^6]
- Multimodal extraction: attaching images or audio with the append_image and append_audio_bytes methods and then constraining the model's output to a structured format.[^13]

Guidance sits in a small but active ecosystem of "LLM programming" or "structured generation" libraries. The libraries differ in what they optimize for: some focus on strict constrained decoding at the logits level, others on schema-driven validation, others on declarative prompt optimization.
| Library | First public release | Constraint mechanism | Optimization focus | Local backend support |
|---|---|---|---|---|
| Guidance | 2022 (Microsoft / guidance-ai) | Token masks from regex, CFG, JSON schema via llguidance[^1][^2] | Interleave Python control with token-level constraints | llama.cpp, Transformers, ONNX Runtime GenAI, SGLang[^5] |
| LMQL | 2022 (ETH Zurich) | Logit masks via SQL-like LMP language | Declarative query language for LLMs | Limited; reported issues with batching and parallelism[^16] |
| Outlines | 2023 (.txt / Normal Computing) | Finite-state machine compiled from regex / Pydantic / CFG | Pure-Python constrained decoding | Transformers, llama.cpp, vLLM |
| Instructor | 2023 (Jason Liu) | Schema-driven retries via Pydantic; uses provider structured-output endpoints | Pydantic-first ergonomics on top of existing APIs | None directly; depends on backend's structured-output mode[^16] |
| DSPy | 2023 (Stanford NLP) | Prompt optimization (signatures, modules, optimizers) rather than logits constraints | Compiling prompts and few-shots automatically | Backend-agnostic |
In practice the choice depends on where the developer wants the constraint to live. Guidance and LMQL push the constraint down to the decoding step, so the constraint is provably satisfied at every token and the backend can fast-forward through structurally implied tokens. Outlines takes a similar approach but exposes a more Pythonic, schema-first API. Instructor takes the opposite approach, layering Pydantic validation and retries on top of an arbitrary provider. DSPy is at a higher level still: rather than constrain tokens, it optimizes the prompts and few-shot demonstrations that surround the program.[^16] Guidance and DSPy are sometimes used together, with DSPy authoring the prompts and Guidance enforcing the structure of the resulting completions.
Guidance's design has several weaknesses that the maintainers and outside reviewers acknowledge.
llguidance with vLLM or SGLang directly.[^16][^7][^5]
Guidance was one of the earliest libraries to popularize the idea that an LLM application is a program rather than a prompt, and that the language model's output should be controlled at the token level rather than coaxed via natural-language instructions. By providing both a Pythonic surface (the immutable model object) and a Rust-based grammar engine (llguidance) it occupies an unusual position: the same code that is convenient enough for notebook exploration also produces grammar-level guarantees suitable for production. The library's grammar engine has been adopted by major inference stacks including OpenAI's own structured outputs, llama.cpp, vLLM, SGLang, and onnxruntime-genai, so a substantial fraction of all "structured output" calls made through commercial and open-source LLM stacks now run through llguidance even when the user is not aware of it.[^3][^2]
Beyond engineering impact, Guidance influenced the broader conversation about how to do reliable prompting. The motivations articulated in the project (predictable structure, lower latency through token fast-forwarding, deterministic choices via select, no expensive retries or fine-tuning) reappear in adjacent ecosystems like DSPy and in commercial structured-output features at major model providers.[^9][^16]
llguidance natively.
llguidance for guided decoding.
llguidance as a grammar backend.