COLLIE

Updated May 10, 2026

COLLIE
Overview
Full name	Systematic Construction of Constrained Text Generation Tasks
Abbreviation	COLLIE
Description	A grammar-based framework for systematically constructing complex, compositional constrained text generation tasks
Release date	July 17, 2023 (arXiv preprint)
Conference venue	ICLR 2024 (poster)
Latest dataset	COLLIE-v1
Authors	Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, Karthik Narasimhan
Organization	Princeton University, Princeton NLP
Technical Details
Type	Constrained text generation, compositional reasoning
Modality	Text
Task format	Constraint-based instruction following with automated checking
Number of constraint structures	13
Total instances (COLLIE-v1)	2,080
Unique constraint prompts	1,435
Generation levels	Word, sentence, paragraph, passage
Evaluation metric	Constraint satisfaction rate (pass@1, pass@k)
Domains	Language understanding, logical reasoning, counting, semantic planning
Languages	English
Performance
Best reported model	GPT-4
Best average satisfaction rate	50.9% (zero-shot, pass@1)
Best pass@20 rate	63% (GPT-4)
Saturated	No
Resources
Website	Official site
Paper (arXiv)	arXiv:2307.08689
OpenReview	ICLR 2024 page
GitHub	princeton-nlp/Collie
PyPI package	`collie-bench`
Dataset file	`data/all_data.dill`
License	MIT (code), source-specific licenses (data)

COLLIE (Systematic Construction of Constrained Text Generation Tasks) is a grammar-based benchmark framework for evaluating how well large language models can produce text that satisfies rich, compositional constraints. The framework was introduced in a July 17, 2023 arXiv preprint by Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, and Karthik Narasimhan, all from the Department of Computer Science at Princeton University, and was published as a poster at ICLR 2024 in Vienna, Austria.^[1]^[2]^[3] The accompanying COLLIE-v1 dataset contains 2,080 task instances drawn from 13 constraint structures that span four generation levels (word, sentence, paragraph, passage) and probe abilities such as counting, logical composition, and semantic planning.^[1]^[3]

Unlike earlier constrained generation benchmarks, which typically rely on a small set of fixed constraint types such as including a list of seed words in a sentence, COLLIE provides a flexible specification language and an automatic extraction pipeline. Researchers declare a constraint template using primitives like Levels, Transformations, Logic combinators, Relations, and Reductions; the toolkit then mines matching examples from text corpora, renders the constraints into natural-language prompts, and verifies model outputs with deterministic checkers.^[1]^[4]

Background and motivation

Constrained text generation has a long history in natural language generation, covering keyword-constrained sentence writing, lexically constrained machine translation, and controlled paraphrasing. By 2023, frontier instruction-tuned models such as GPT-4 had largely solved the simple, single-constraint formulations exemplified by datasets like CommonGen. The COLLIE authors observed that this saturation hid important gaps: top models could insert a list of words into a paragraph yet still failed at constraints requiring exact counts, character-position reasoning, or boolean composition.^[1]^[3]

The paper frames this as a benchmark design problem. Static tasks reach the end of their useful lifetime quickly; once models surpass them, the field lacks signal for further progress. COLLIE responds by treating constraints as a compositional language: researchers specify structures and the framework instantiates concrete tasks from any text corpus, with the same template reusable at harder difficulty settings as models improve.^[1]^[4]

Authors and publication

All five authors are affiliated with Princeton University's Department of Computer Science and the Princeton Language and Intelligence (PLI) initiative.^[2]^[3] First author Shunyu Yao is also known for the ReAct and Tree of Thoughts papers; Karthik Narasimhan, Yao's advisor, is a faculty member in Princeton's NLP group. The arXiv preprint (2307.08689) was posted on July 17, 2023, with a single version listed.^[5] The paper appeared at ICLR 2024 in Vienna, May 7 to 11, 2024.^[2] Earlier OpenReview drafts referenced 1,132 instances, but the camera-ready and the released GitHub artifact both report 2,080 instances, indicating the dataset grew between submission and publication.^[3]^[6]

Item	Value
arXiv ID	2307.08689
arXiv submission date	July 17, 2023
ICLR 2024 OpenReview ID	kxgSlyirUZ
Track	Datasets and benchmarks (poster)
Affiliation	Princeton University, Department of Computer Science
Code license	MIT

Framework architecture

COLLIE's design centers on five Python primitives defined in collie/constraints.py.^[4]

Primitive	Role	Examples
Level	Linguistic unit the constraint applies to	character, word, sentence, paragraph, passage
Transformation	Property computed over a Level	Count, Position, ForEach
Relation	Comparison of transformed value to target	`==`, `!=`, `in`, `not in`, `<`, `>`
Logic	Combines multiple sub-constraints	And, Or, All
Reduction	Aggregates per-element results into one verdict	any, all, count, percentage thresholds

A constraint object is a small program: pick a Level, apply a Transformation, assert a Relation, and aggregate with a Reduction. Each constraint exposes a check() method that returns True or False on any candidate string. Because the checker runs in pure Python, evaluation is deterministic, fast, and free from grader-model bias.^[1]^[4]
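The "small program" idea can be illustrated with a toy checker. This is a minimal sketch, not the `collie` package: `split_level`, `RELATIONS`, and `ToyConstraint` are hypothetical names invented for illustration, the sentence splitter is deliberately naive, and only a count-style Transformation is modeled.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, List


def split_level(text: str, level: str) -> List[str]:
    """Split text into units at the given level (hypothetical helper)."""
    if level == "character":
        return list(text)
    if level == "word":
        return re.findall(r"[A-Za-z0-9']+", text)
    if level == "sentence":
        # Naive split after ., ! or ? followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if level == "paragraph":
        return [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    raise ValueError(f"unknown level: {level}")


# Relation: how the counted value is compared to the target.
RELATIONS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}


@dataclass
class ToyConstraint:
    input_level: str   # unit the constraint is evaluated over, e.g. "sentence"
    target_level: str  # unit counted inside each input unit, e.g. "word"
    relation: str      # key into RELATIONS
    reduction: Callable[[List[bool]], bool] = all  # aggregate per-unit verdicts

    def check(self, text: str, target: int) -> bool:
        compare = RELATIONS[self.relation]
        verdicts = [
            compare(len(split_level(unit, self.target_level)), target)
            for unit in split_level(text, self.input_level)
        ]
        # Empty input never satisfies the constraint.
        return bool(verdicts) and self.reduction(verdicts)


# "Every sentence has exactly five words."
five_words_each = ToyConstraint("sentence", "word", "==")
print(five_words_each.check("This is a good sentence. Here are five more words.", 5))
print(five_words_each.check("This is a good sentence. Short one.", 5))
```

Swapping the reduction from `all` to `any` turns the same structure into "at least one sentence has exactly five words", which is the sense in which a Reduction aggregates per-element results.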

Four-step pipeline

COLLIE turns constraint structures into concrete tasks through a fixed pipeline.^[1]

Step	Process	Output
1. Constraint specification	Author declares a constraint structure using the primitives above, leaving target values unbound	A reusable template such as "sentence with exactly N characters"
2. Example extraction	`FullExtractor` walks a corpus via `TextLoader` and `TextChunker`; `ConstraintExtractor` enumerates configurations that satisfy the structure	Concrete (constraint, target) pairs grounded in real text
3. Instruction rendering	`ConstraintRenderer` converts the constraint plus target into a natural-language prompt	Human-readable instructions
4. Generation and checking	A model generates a completion; `check()` reports satisfaction	Pass or fail labels and aggregate scores

The documentation notes that adding a new corpus mostly comes down to writing a high-recall filter and good post-processors, since markdown artifacts and tokenization quirks dominate the engineering effort.^[4]
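The four steps can be sketched end to end in plain Python. The sketch below does not use `FullExtractor`, `ConstraintExtractor`, or `ConstraintRenderer`; the functions `sentences`, `check`, `extract`, `render`, and `evaluate` are hypothetical stand-ins, the corpus is a two-sentence toy, and the prompt wording is an assumption.

```python
import re
from typing import Callable, Iterable, List, Tuple


def sentences(corpus: Iterable[str]) -> List[str]:
    """Naive sentence splitter over a list of documents."""
    out: List[str] = []
    for doc in corpus:
        out.extend(s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s)
    return out


# Step 1: a structure with an unbound target N:
# "sentence with exactly N characters".
def check(text: str, n: int) -> bool:
    return len(text) == n


# Step 2: mine real sentences whose length yields a concrete target value.
def extract(corpus: Iterable[str], min_len: int = 20, max_len: int = 100) -> List[Tuple[int, str]]:
    return [(len(s), s) for s in sentences(corpus) if min_len <= len(s) <= max_len]


# Step 3: render the constraint plus target as an instruction.
def render(n: int) -> str:
    return f"Please generate a sentence with exactly {n} characters."


# Step 4: query a model and score its outputs with the checker.
def evaluate(generate: Callable[[str], str], tasks: List[Tuple[int, str]]) -> float:
    if not tasks:
        return 0.0
    passed = sum(check(generate(render(n)), n) for n, _ in tasks)
    return passed / len(tasks)


corpus = ["The quick brown fox jumps over the lazy dog. It was a bright cold day in April."]
tasks = extract(corpus)


def oracle(prompt: str) -> str:
    # Stand-in "model" that reads N from the prompt and emits N characters.
    return "x" * int(re.search(r"\d+", prompt).group())


print(evaluate(oracle, tasks))                   # oracle always satisfies the constraint
print(evaluate(lambda prompt: "hello", tasks))   # fixed output never does
```

Because each target value is taken from a sentence that actually occurs in the corpus, every extracted task is known to be satisfiable by at least one natural sentence.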

Generation levels and example constraints

Level	Example COLLIE constraint
Word	"Generate a word with at least 15 letters."^[3]
Sentence	"Generate a sentence with exactly 82 characters."^[3]
Paragraph	"Generate a paragraph where each sentence begins with the word 'soft'."^[3]
Passage	"Generate a passage of two paragraphs that ends in the given sentence."^[3]

COLLIE-v1 dataset

COLLIE-v1 is the released artifact accompanying the paper. It ships as data/all_data.dill in the GitHub repository, preserving Constraint objects, target values, and reference completions alongside their plain-text instructions.^[4]

Aspect	Specification
Total instances	2,080
Unique constraint prompts	1,435
Constraint structures	13 (3 word, 4 sentence, 5 paragraph, 1 passage)
Source corpora	English Wikipedia, Project Gutenberg, CC-News (2017 to 2019)
Format	Python dill pickle (`all_data.dill`)
Loader API	`collie.constraints.Constraint`, `collie.extract.FullExtractor`

The 13 constraint structures

The paper labels structures by level and index. Their roles can be summarized as follows.^[1]^[3]^[6]

ID	Level	Constraint type
word01	Word	Minimum word length (single word with at least N characters)
word02	Word	Word containing or excluding given letters
word03	Word	Composition of length and letter constraints
sent01	Sentence	Sentence with exactly N characters
sent02	Sentence	Sentence containing required words
sent03	Sentence	Sentence with required word position
sent04	Sentence	Sentence ending with a specified word
para01	Paragraph	Paragraph where every sentence begins with a given word
para02	Paragraph	Paragraph with N sentences and target keywords
para03	Paragraph	Paragraph with controlled sentence lengths
para04	Paragraph	Mixed positional and counting constraints
para05	Paragraph	Logical compositions over keywords
pass01	Passage	Multi-paragraph passage with structural and lexical constraints

Earlier benchmarks focus on word-level inclusion. COLLIE adds counting, ordering, exact length matching, ForEach quantification, and Logic-based composition.^[1]
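Logic-based composition and ForEach-style quantification can be pictured as functions that combine smaller checkers. The sketch below is illustrative only: `all_of`, `any_of`, `for_each_sentence`, `contains_word`, and `max_words` are hypothetical helpers, not the toolkit's own combinators.

```python
import re
from typing import Callable

Checker = Callable[[str], bool]


def all_of(*checkers: Checker) -> Checker:
    """Conjunction: every sub-constraint must hold."""
    return lambda text: all(c(text) for c in checkers)


def any_of(*checkers: Checker) -> Checker:
    """Disjunction: at least one sub-constraint must hold."""
    return lambda text: any(c(text) for c in checkers)


def for_each_sentence(checker: Checker) -> Checker:
    """Quantify a sentence-level checker over every sentence of a paragraph."""
    def run(text: str) -> bool:
        parts = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        return bool(parts) and all(checker(s) for s in parts)
    return run


def contains_word(word: str) -> Checker:
    return lambda text: word.lower() in re.findall(r"[a-z']+", text.lower())


def max_words(n: int) -> Checker:
    return lambda text: len(text.split()) <= n


# "Each sentence contains the word 'soft' and has at most six words."
constraint = for_each_sentence(all_of(contains_word("soft"), max_words(6)))

print(constraint("Soft rain fell. The soft light faded."))
print(constraint("Soft rain fell. The light faded."))
```

A model can satisfy one clause while violating the other, which is why composed constraints of this kind tend to be harder than either clause alone.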

Evaluation and main findings

The authors evaluated five instruction-tuned models in a zero-shot setting and reported both pass@1 and pass@20 constraint satisfaction rates.^[1]^[3]

Model	Type	Notes on performance
GPT-4	Closed, instruction-tuned	Best overall, average pass@1 of 50.9%, pass@20 about 63%
GPT-3.5-turbo	Closed, instruction-tuned	About half of GPT-4's score; pass@20 around 32%
PaLM-2 (text-bison-001)	Closed, instruction-tuned	Trails GPT models by a wide margin
Vicuna-7B	Open, instruction-tuned	Comparable to Alpaca, far below closed APIs
Alpaca-7B	Open, instruction-tuned	Lowest constraint satisfaction in the panel

A few patterns recur across constraint types.^[1]^[3] Position matters: instructing GPT-4 to begin a sentence with a specific word reaches 100% success, while ending a sentence with a target word drops to between 40% and 60%. Counting is harder than inclusion: exact-character or exact-word counts at the sentence and paragraph levels prove far more error-prone than constraints on which words to include. Composition compounds the difficulty: word01 and para01 are nearly solved by leading models, yet word03, para04, para05, and pass01, which combine multiple primitives, sit between 40% and 70% even for GPT-4.

Pass@1 versus pass@20

The gap between single-sample and best-of-twenty rates is one of the paper's clearer findings. GPT-4's average rises from 50.9% at pass@1 to about 63% at pass@20, while GPT-3.5-turbo reaches only about 32% at pass@20, suggesting that even with repeated sampling, weaker models often cannot find a satisfying completion. This makes COLLIE a useful stress test for sampling-based methods like best-of-N decoding and rejection sampling.^[1]
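For readers who want to compute pass@k from their own samples, the sketch below uses the unbiased estimator common in code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c pass. Whether the COLLIE paper uses exactly this estimator is an assumption; `pass_at_k` and `mean_pass_at_k` are hypothetical helper names.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes,
    given n drawn samples of which c satisfied the constraint."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[List[bool]], k: int) -> float:
    """Average pass@k over tasks; each inner list holds one task's verdicts."""
    return sum(pass_at_k(len(r), sum(r), k) for r in results) / len(results)


print(pass_at_k(20, 5, 1))   # 0.25: reduces to the raw pass fraction
print(pass_at_k(20, 1, 20))  # 1.0: one success among all 20 samples
print(pass_at_k(20, 0, 20))  # 0.0: no sample ever satisfied the constraint
```

With n = k = 20 the estimator reduces to "did any of the twenty samples pass", so a low pass@20 means many tasks had no satisfying completion at all.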

Failure modes

The authors document recurring failures. Models hallucinate satisfying answers, claiming a sentence has the required character count when it does not. They drift on positional requirements, especially end-of-sequence ones; they mishandle compositional logic, satisfying one clause while violating another; and they struggle on pass01, where multi-paragraph layout interacts with lexical and counting requirements.^[1]^[3]

Implementation and tooling

The code is published under the MIT license at the GitHub repository princeton-nlp/Collie and on PyPI as `collie-bench`. The README recommends Python 3.9 because of compatibility issues with later versions; users can install the released package with `pip install collie-bench` or do a development install from source with `pip install -e .`.^[4]

Item	Value
Recommended Python	3.9
Install	`pip install collie-bench` or `pip install -e .`
Top-level package	`collie`
Modules	`collie.constraints`, `collie.extract`, `collie.render`
Reference data	`data/all_data.dill`
Reproducibility	`scripts/analysis.ipynb`, pre-computed `logs/`
License (code)	MIT

The repository ships pre-computed model outputs in logs/ and an analysis.ipynb notebook for reproducing the paper's tables without rerunning expensive APIs. Scripts under scripts/ invoke either OpenAI-style APIs or local GPU inference.^[4]

Minimal usage example

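# Sketch following the repository README; confirm class and argument
# names against the installed collie-bench version before relying on it.
from collie.constraints import Constraint, TargetLevel, Count, Relation

# A simple "number of words" constraint. The target value is not bound
# here; it is supplied when the constraint is checked.
c = Constraint(
    target_level=TargetLevel('word'),
    transformation=Count(),
    relation=Relation('=='),
)

text = 'This is a good sentence.'
print(c.check(text, 5))  # does the text contain exactly five words?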

The checker returns a boolean, and benchmark scoring is aggregated across the 2,080 examples in all_data.dill.^[4]
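Aggregating boolean verdicts into satisfaction rates might look like the sketch below. It is self-contained and does not read `all_data.dill`; the structure IDs and verdicts are made-up sample data, and whether the paper's headline average is a per-structure (macro) or per-instance (micro) mean is not stated here, so both are shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def satisfaction_rates(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each constraint structure ID to its fraction of passing instances."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for structure, ok in results:
        totals[structure] += 1
        passes[structure] += int(ok)
    return {s: passes[s] / totals[s] for s in totals}


def macro_average(rates: Dict[str, float]) -> float:
    """Mean of per-structure rates; every structure counts equally."""
    return sum(rates.values()) / len(rates)


def micro_average(results: List[Tuple[str, bool]]) -> float:
    """Mean over all instances; larger structures weigh more."""
    return sum(ok for _, ok in results) / len(results)


# Made-up verdicts for illustration only.
results = [
    ("word01", True), ("word01", True),
    ("sent01", False), ("sent01", True), ("sent01", False), ("sent01", False),
]
rates = satisfaction_rates(results)
print(rates)
print(macro_average(rates), micro_average(results))
```

The two averages diverge whenever structures have different instance counts, so it is worth stating which one a reported number uses.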

Related benchmarks

COLLIE sits in a growing cluster of benchmarks targeting controllable and instruction-following text generation. Earlier work was largely fixed and lexical; later benchmarks add verifiable rule following or domain coverage COLLIE does not attempt.

Benchmark	Year	Focus	Relation to COLLIE
CommonGen	2020	Sentence covering given concepts	Simpler baseline that COLLIE cites as too easy for GPT-4
IFEval	2023	Verifiable instruction-following constraints	IFEval focuses on instruction format rather than compositional grammars
FollowBench	2023	Multi-level fine-grained constraints	Closest in motivation, less compositional
IFBench	2025	Held-out verifiable constraints	Builds on the COLLIE/IFEval line for training-time generalization
InfoBench	2024	Decomposed instruction following	Evaluates instruction decomposition, not compositional structure

Limitations

The paper and repository are candid about scope.^[1]^[4] COLLIE is English-only and text-only. The 13 structures, while more diverse than prior work, are hand-designed and may not cover all practically interesting constraints. Long-passage constraints are expensive to extract and verify, making dataset scaling costly. Deterministic checkers can also be strict on edge cases like punctuation differences, so absolute satisfaction numbers may underestimate models that are semantically faithful but format-divergent.
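The punctuation edge case can be made concrete with two toy checkers for "the sentence ends with a given word". Both functions are hypothetical and are not taken from the COLLIE code base; they only illustrate how checker strictness changes the verdict on the same output.

```python
import string


def ends_with_word_strict(text: str, word: str) -> bool:
    """Exact match on the last whitespace-separated token."""
    tokens = text.split()
    return bool(tokens) and tokens[-1] == word


def ends_with_word_lenient(text: str, word: str) -> bool:
    """Ignore case and surrounding punctuation on the last token."""
    tokens = text.split()
    if not tokens:
        return False
    return tokens[-1].strip(string.punctuation).lower() == word.lower()


output = "The cat sat on the mat."
print(ends_with_word_strict(output, "mat"))   # False: last token is "mat."
print(ends_with_word_lenient(output, "mat"))  # True: trailing period ignored
```

A benchmark has to pick one behavior and apply it uniformly; the strict variant is simpler to specify but penalizes outputs that a human reader would accept.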

Influence and follow-up work

COLLIE has been adopted as a reference benchmark for controllable generation and instruction following.^[2]^[7] Later benchmarks such as IFEval, FollowBench, and IFBench cite it as motivation for moving beyond simple lexical constraints. Its grammar-based specification has also influenced training recipes that use COLLIE-style synthetic constraints to teach models structured instruction following.^[7] Because the toolkit is lightweight and free of grader-model bias, it is a common quick test for new instruction-tuned models, often run alongside other instruction-following suites to check whether reported gains generalize to harder counting and compositional tasks.^[1]^[3]

References

1. Yao, Shunyu; Chen, Howard; Hanjie, Austin W.; Yang, Runzhe; Narasimhan, Karthik. "COLLIE: Systematic Construction of Constrained Text Generation Tasks." arXiv:2307.08689, July 17, 2023. https://arxiv.org/abs/2307.08689
2. OpenReview. "COLLIE: Systematic Construction of Constrained Text Generation Tasks." ICLR 2024 conference page (poster). https://openreview.net/forum?id=kxgSlyirUZ
3. Princeton NLP. "COLLIE: Systematic Construction of Constrained Text Generation Tasks (project site)." https://collie-benchmark.github.io/
4. Princeton NLP. "princeton-nlp/Collie GitHub repository (README and `docs/extraction.md`)." https://github.com/princeton-nlp/Collie
5. arXiv. "Submission history for arXiv:2307.08689 (v1: July 17, 2023)." https://arxiv.org/abs/2307.08689v1
6. Princeton University. "COLLIE: Systematic Construction of Constrained Text Generation Tasks (publication record)." https://collaborate.princeton.edu/en/publications/collie-systematic-construction-of-constrained-text-generation-tas/
7. Princeton Language and Intelligence. "Princeton Language and Intelligence at ICLR 2024." Princeton PLI blog, 2024. https://pli.princeton.edu/blog/2024/princeton-language-and-intelligence-iclr-2024
