COLLIE
Last reviewed
May 10, 2026
Sources
7 citations
Review status
Source-backed
Revision
v2 · 2,383 words
| COLLIE | |
|---|---|
| Overview | |
| Full name | Systematic Construction of Constrained Text Generation Tasks |
| Abbreviation | COLLIE |
| Description | A grammar-based framework for systematically constructing complex, compositional constrained text generation tasks |
| Release date | July 17, 2023 (arXiv preprint) |
| Conference venue | ICLR 2024 (poster) |
| Latest dataset | COLLIE-v1 |
| Authors | Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, Karthik Narasimhan |
| Organization | Princeton University, Princeton NLP |
| Technical Details | |
| Type | Constrained text generation, compositional reasoning |
| Modality | Text |
| Task format | Constraint-based instruction following with automated checking |
| Number of constraint structures | 13 |
| Total instances (COLLIE-v1) | 2,080 |
| Unique constraint prompts | 1,435 |
| Generation levels | Word, sentence, paragraph, passage |
| Evaluation metric | Constraint satisfaction rate (pass@1, pass@k) |
| Domains | Language understanding, logical reasoning, counting, semantic planning |
| Languages | English |
| Performance | |
| Best reported model | GPT-4 |
| Best average satisfaction rate | 50.9% (zero-shot, pass@1) |
| Best pass@20 rate | 63% (GPT-4) |
| Saturated | No |
| Resources | |
| Website | Official site |
| Paper (arXiv) | arXiv:2307.08689 |
| OpenReview | ICLR 2024 page |
| GitHub | princeton-nlp/Collie |
| PyPI package | collie-bench |
| Dataset file | data/all_data.dill |
| License | MIT (code), source-specific licenses (data) |
COLLIE (Systematic Construction of Constrained Text Generation Tasks) is a grammar-based benchmark framework for evaluating how well large language models can produce text that satisfies rich, compositional constraints. The framework was introduced in a July 17, 2023 arXiv preprint by Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, and Karthik Narasimhan, all from the Department of Computer Science at Princeton University, and was published as a poster at ICLR 2024 in Vienna, Austria.[1][2][3] The accompanying COLLIE-v1 dataset contains 2,080 task instances drawn from 13 constraint structures that span four generation levels (word, sentence, paragraph, passage) and probe abilities such as counting, logical composition, and semantic planning.[1][3]
Unlike earlier constrained generation benchmarks, which typically rely on a small set of fixed constraint types such as including a list of seed words in a sentence, COLLIE provides a flexible specification language and an automatic extraction pipeline. Researchers declare a constraint template using primitives like Levels, Transformations, Logic combinators, Relations, and Reductions; the toolkit then mines matching examples from text corpora, renders the constraints into natural-language prompts, and verifies model outputs with deterministic checkers.[1][4]
Constrained text generation has a long history in natural language generation, covering keyword-constrained sentence writing, lexically constrained machine translation, and controlled paraphrasing. By 2023, frontier instruction-tuned models such as GPT-4 had largely solved the simple, single-constraint formulations exemplified by datasets like CommonGen. The COLLIE authors observed that this saturation hid important gaps: top models could insert a list of words into a paragraph yet still failed at constraints requiring exact counts, character-position reasoning, or boolean composition.[1][3]
The paper frames this as a benchmark design problem. Static tasks exhaust their useful lifetime quickly; once models surpass them, the field loses its signal for further progress. COLLIE responds by treating constraints as a compositional language: researchers specify structures and the framework instantiates concrete tasks from any text corpus, with the same template reusable at harder difficulty settings as models improve.[1][4]
All five authors are affiliated with Princeton University's Department of Computer Science and the Princeton Language and Intelligence (PLI) initiative.[2][3] First author Shunyu Yao is also known for the ReAct and Tree of Thoughts papers; Karthik Narasimhan, Yao's advisor, leads Princeton's NLP group. The arXiv preprint (2307.08689) was posted on July 17, 2023, with a single version listed.[5] The paper appeared at ICLR 2024 in Vienna, May 7 to 11, 2024.[2] Earlier OpenReview drafts referenced 1,132 instances, but the camera-ready and the released GitHub artifact both report 2,080 instances, indicating the dataset grew between submission and publication.[3][6]
| Item | Value |
|---|---|
| arXiv ID | 2307.08689 |
| arXiv submission date | July 17, 2023 |
| ICLR 2024 OpenReview ID | kxgSlyirUZ |
| Track | Datasets and benchmarks (poster) |
| Affiliation | Princeton University, Department of Computer Science |
| Code license | MIT |
COLLIE's design centers on five Python primitives defined in collie/constraints.py.[4]
| Primitive | Role | Examples |
|---|---|---|
| Level | Linguistic unit the constraint applies to | character, word, sentence, paragraph, passage |
| Transformation | Property computed over a Level | Count, Position, ForEach |
| Relation | Comparison of transformed value to target | ==, !=, in, not in, <, > |
| Logic | Combines multiple sub-constraints | And, Or, All |
| Reduction | Aggregates per-element results into one verdict | any, all, count, percentage thresholds |
A constraint object is a small program: pick a Level, apply a Transformation, assert a Relation, and aggregate with a Reduction. Each constraint exposes a check() method that returns True or False on any candidate string. Because the checker runs in pure Python, evaluation is deterministic, fast, and free from grader-model bias.[1][4]
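The "small program" idea can be illustrated with a self-contained toy model. This is a minimal sketch, not the COLLIE API: `ToyConstraint`, `split_words`, and `split_sentences` are hypothetical names invented for illustration, and the sentence splitter is deliberately naive.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical stand-ins for the article's primitives; NOT the COLLIE API.
def split_words(text: str) -> List[str]:
    return re.findall(r"[A-Za-z']+", text)


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


@dataclass
class ToyConstraint:
    level: Callable[[str], List[str]]        # Level: split text into units
    transformation: Callable[[str], int]     # Transformation: value per unit
    relation: Callable[[int, int], bool]     # Relation: compare value to target
    reduction: Callable[[List[bool]], bool]  # Reduction: aggregate the verdicts

    def check(self, text: str, target: int) -> bool:
        units = self.level(text)
        verdicts = [self.relation(self.transformation(u), target) for u in units]
        # Text with no units at all never satisfies the constraint.
        return bool(units) and self.reduction(verdicts)


# "Every sentence has exactly N characters."
exact_chars = ToyConstraint(split_sentences, len, operator.eq, all)
# "Some word has at least N letters."
long_word = ToyConstraint(split_words, len, operator.ge, any)
```

Swapping any one of the four fields produces a different constraint without touching the checking logic, which is the property the grammar relies on.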
COLLIE turns constraint structures into concrete tasks through a fixed pipeline.[1]
| Step | Process | Output |
|---|---|---|
| 1. Constraint specification | Author declares a constraint structure using the primitives above, leaving target values unbound | A reusable template such as "sentence with exactly N characters" |
| 2. Example extraction | FullExtractor walks a corpus via TextLoader and TextChunker; ConstraintExtractor enumerates configurations that satisfy the structure | Concrete (constraint, target) pairs grounded in real text |
| 3. Instruction rendering | ConstraintRenderer converts the constraint plus target into a natural-language prompt | Human-readable instructions |
| 4. Generation and checking | A model generates a completion; check() reports satisfaction | Pass or fail labels and aggregate scores |
The documentation notes that adding a new corpus mostly comes down to writing a high-recall filter and good post-processors, since markdown artifacts and tokenization quirks dominate the engineering effort.[4]
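The four pipeline steps can be sketched end to end on a tiny in-memory corpus. This is an illustrative approximation under stated assumptions: the helper names (`chunk_sentences`, `extract_targets`, `render`, `check`) are hypothetical and do not correspond to the toolkit's FullExtractor or ConstraintRenderer classes.

```python
import re
from typing import List, Tuple


def chunk_sentences(corpus: str) -> List[str]:
    # Stand-in for corpus loading and chunking (naive sentence split).
    return [s for s in re.split(r"(?<=[.!?])\s+", corpus.strip()) if s]


def extract_targets(sentences: List[str], min_chars: int = 20) -> List[Tuple[int, str]]:
    # Step 2: read target values off real text, so each task has a known solution.
    return [(len(s), s) for s in sentences if len(s) >= min_chars]


def render(target: int) -> str:
    # Step 3: turn the bound template into a natural-language instruction.
    return f"Generate a sentence with exactly {target} characters."


def check(candidate: str, target: int) -> bool:
    # Step 4: deterministic verification of a model completion.
    return len(candidate) == target


corpus = "The harbor was quiet at dawn. Gulls wheeled over the empty pier. No."
# Each task is (instruction, target value, reference completion from the corpus).
tasks = [(render(t), t, ref) for t, ref in extract_targets(chunk_sentences(corpus))]
```

Because targets come from sentences that already exist in the corpus, every generated instruction is guaranteed to be satisfiable, and the reference text doubles as a known-good completion.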
| Level | Example COLLIE constraint |
|---|---|
| Word | "Generate a word with at least 15 letters."[3] |
| Sentence | "Generate a sentence with exactly 82 characters."[3] |
| Paragraph | "Generate a paragraph where each sentence begins with the word 'soft'."[3] |
| Passage | "Generate a passage of two paragraphs that ends in the given sentence."[3] |
COLLIE-v1 is the released artifact accompanying the paper. It ships as data/all_data.dill in the GitHub repository, preserving Constraint objects, target values, and reference completions alongside their plain-text instructions.[4]
| Aspect | Specification |
|---|---|
| Total instances | 2,080 |
| Unique constraint prompts | 1,435 |
| Constraint structures | 13 (3 word, 4 sentence, 5 paragraph, 1 passage) |
| Source corpora | English Wikipedia, Project Gutenberg, CC-News (2017 to 2019) |
| Format | Python dill pickle (all_data.dill) |
| Loader API | collie.constraints.Constraint, collie.extract.FullExtractor |
The paper labels structures by level and index. Their roles can be summarized as follows.[1][3][6]
| ID | Level | Constraint type |
|---|---|---|
| word01 | Word | Minimum word length (single word with at least N characters) |
| word02 | Word | Word containing or excluding given letters |
| word03 | Word | Composition of length and letter constraints |
| sent01 | Sentence | Sentence with exactly N characters |
| sent02 | Sentence | Sentence containing required words |
| sent03 | Sentence | Sentence with required word position |
| sent04 | Sentence | Sentence ending with a specified word |
| para01 | Paragraph | Paragraph where every sentence begins with a given word |
| para02 | Paragraph | Paragraph with N sentences and target keywords |
| para03 | Paragraph | Paragraph with controlled sentence lengths |
| para04 | Paragraph | Mixed positional and counting constraints |
| para05 | Paragraph | Logical compositions over keywords |
| pass01 | Passage | Multi-paragraph passage with structural and lexical constraints |
Earlier benchmarks focus on word-level inclusion. COLLIE adds counting, ordering, exact length matching, ForEach quantification, and Logic-based composition.[1]
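Logic-based composition can be pictured as combining simple boolean checkers into one verdict. The sketch below is a toy illustration, assuming plain Python callables; `And`, `Or`, and the three checker factories are hypothetical helpers and are not imported from the collie package.

```python
from typing import Callable

Checker = Callable[[str], bool]

PUNCT = ".,;:!?\"'"


def And(*checks: Checker) -> Checker:
    # Passes only if every sub-constraint passes.
    return lambda text: all(c(text) for c in checks)


def Or(*checks: Checker) -> Checker:
    # Passes if at least one sub-constraint passes.
    return lambda text: any(c(text) for c in checks)


def word_count_is(n: int) -> Checker:
    return lambda text: len(text.split()) == n


def contains_word(word: str) -> Checker:
    return lambda text: word.lower() in (w.strip(PUNCT).lower() for w in text.split())


def ends_with_word(word: str) -> Checker:
    def check(text: str) -> bool:
        words = text.split()
        return bool(words) and words[-1].strip(PUNCT).lower() == word.lower()
    return check


# "Six words, mentioning 'river' or 'stream', and ending with 'home'."
composed = And(
    word_count_is(6),
    Or(contains_word("river"), contains_word("stream")),
    ends_with_word("home"),
)
```

A candidate must satisfy the count, the inclusion, and the position clause at once, which is why composed constraints fail more often than any single clause on its own.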
The authors evaluated five instruction-tuned models in a zero-shot setting and reported both pass@1 and pass@20 constraint satisfaction rates.[1][3]
| Model | Type | Notes on performance |
|---|---|---|
| GPT-4 | Closed, instruction-tuned | Best overall, average pass@1 of 50.9%, pass@20 of about 63% |
| GPT-3.5-turbo | Closed, instruction-tuned | About half of GPT-4's score; pass@20 around 32% |
| PaLM-2 (text-bison-001) | Closed, instruction-tuned | Trails GPT models by a wide margin |
| Vicuna-7B | Open, instruction-tuned | Comparable to Alpaca, far below closed APIs |
| Alpaca-7B | Open, instruction-tuned | Lowest constraint satisfaction in the panel |
A few patterns recur across constraint types.[1][3] Position matters: instructing GPT-4 to begin a sentence with a specific word reaches 100% success, while ending a sentence with a target word drops to between 40% and 60%. Counting is harder than inclusion: exact-character or exact-word counts at the sentence and paragraph levels prove far more error-prone than constraints on which words to include. Composition compounds the difficulty: word01, sent04, and para01 are nearly solved by leading models, yet word03, para04, para05, and pass01, which combine multiple primitives, sit between 40% and 70% even for GPT-4.
The gap between single-sample and best-of-twenty rates is one of the paper's clearer findings. GPT-4 climbs from an average pass@1 of 50.9% to a pass@20 of about 63%, while GPT-3.5 reaches only about 32% at pass@20, suggesting that even with repeated sampling, weaker models often cannot find a satisfying completion. This makes COLLIE a useful stress test for sampling-based methods like best-of-N decoding and rejection sampling.[1]
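The relationship between pass@1 and pass@k can be made concrete with a small scoring sketch. It assumes pass@k means the fraction of prompts for which at least one of the first k sampled completions satisfies the constraint; the paper's exact estimator may differ, and the `results` data below is invented for illustration.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of prompts with at least one passing sample among the first k."""
    if not results:
        raise ValueError("no results to score")
    hits = sum(any(samples[:k]) for samples in results.values())
    return hits / len(results)


# Hypothetical checker verdicts: one list of per-sample outcomes per prompt.
results = {
    "prompt_a": [True, False, False, False],
    "prompt_b": [False, False, True, False],
    "prompt_c": [False, False, False, False],
    "prompt_d": [False, True, True, False],
}

single_sample = pass_at_k(results, 1)
best_of_four = pass_at_k(results, 4)
```

Under this definition the score can only rise with k, and a prompt like `prompt_c`, which no sample solves, caps the achievable rate no matter how many completions are drawn.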
The authors document recurring failures. Models hallucinate satisfying answers, claiming a sentence has the required character count when it does not. They drift on positional requirements, especially end-of-sequence ones; they mishandle compositional logic, satisfying one clause while violating another; and they struggle on pass01, where multi-paragraph layout interacts with lexical and counting requirements.[1][3]
The code is published under the MIT license at the GitHub repository princeton-nlp/Collie and on PyPI as collie-bench. The README recommends Python 3.9 because of compatibility issues with later versions; users can install via pip install collie-bench or a development install with pip install -e . from source.[4]
| Item | Value |
|---|---|
| Recommended Python | 3.9 |
| Install | pip install collie-bench or pip install -e . |
| Top-level package | collie |
| Modules | collie.constraints, collie.extract, collie.render |
| Reference data | data/all_data.dill |
| Reproducibility | scripts/analysis.ipynb, pre-computed logs/ |
| License (code) | MIT |
The repository ships pre-computed model outputs in logs/ and an analysis.ipynb notebook for reproducing the paper's tables without rerunning expensive APIs. Scripts under scripts/ invoke either OpenAI-style APIs or local GPU inference.[4]
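```python
# Illustrative sketch: the names below follow this article and may not match
# the released package; check collie/constraints.py before relying on them.
from collie.constraints import (
    Constraint, Level, Transformation, Relation, Reduction
)

# "Generate a sentence with exactly 82 characters."
c = Constraint(
    level=Level.SENTENCE,
    transformation=Transformation.COUNT,
    relation=Relation.EQ,
    target=82,
    reduction=Reduction.ALL,
)

# This candidate is exactly 82 characters long.
candidate = "The quick brown fox jumps over the lazy dog while the sun sets behind a tall hill."
assert len(candidate) == 82
assert c.check(candidate)
```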
The checker returns a boolean, and benchmark scoring is aggregated across the 2,080 examples in all_data.dill.[4]
COLLIE sits in a growing cluster of benchmarks targeting controllable and instruction-following text generation. Earlier work was largely fixed and lexical; later benchmarks add verifiable rule following or domain coverage COLLIE does not attempt.
| Benchmark | Year | Focus | Relation to COLLIE |
|---|---|---|---|
| CommonGen | 2020 | Sentence covering given concepts | Simpler baseline that COLLIE cites as too easy for GPT-4 |
| IFEval | 2023 | Verifiable instruction-following constraints | IFEval focuses on instruction format rather than compositional grammars |
| FollowBench | 2023 | Multi-level fine-grained constraints | Closest in motivation, less compositional |
| IFBench | 2025 | Held-out verifiable constraints | Builds on the COLLIE/IFEval line for training-time generalization |
| InfoBench | 2024 | Decomposed instruction following | Evaluates instruction decomposition, not compositional structure |
The paper and repository are candid about scope.[1][4] COLLIE is English-only and text-only. The 13 structures, while more diverse than prior work, are hand-designed and may not cover all practically interesting constraints. Long-passage constraints are expensive to extract and verify, making dataset scaling costly. Deterministic checkers can also be strict on edge cases like punctuation differences, so absolute satisfaction numbers may underestimate models that are semantically faithful but format-divergent.
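The punctuation edge case can be illustrated with two versions of an "ends with the word" check. This is a hypothetical sketch, not a description of how COLLIE's checkers are implemented; both function names are invented for the example.

```python
import string


def ends_with_strict(text: str, word: str) -> bool:
    # Exact suffix match: trailing punctuation or quotes cause a failure.
    return text.endswith(word)


def ends_with_normalized(text: str, word: str) -> bool:
    # Strip surrounding punctuation and ignore case before comparing.
    tokens = text.split()
    return bool(tokens) and tokens[-1].strip(string.punctuation).lower() == word.lower()


candidate = 'She whispered, "soft."'
strict_verdict = ends_with_strict(candidate, "soft")
lenient_verdict = ends_with_normalized(candidate, "soft")
```

The same completion fails the strict check and passes the normalized one, so the choice of normalization directly moves the reported satisfaction rate.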
COLLIE has been adopted as a reference benchmark for controllable generation and instruction following.[2][7] Later benchmarks such as IFEval, FollowBench, and IFBench cite it as motivation for moving beyond simple lexical constraints. Its grammar-based specification has also influenced training recipes that use COLLIE-style synthetic constraints to teach models structured instruction following.[7] Because the toolkit is lightweight and free of grader-model bias, it is a common quick test for new instruction-tuned models, often run alongside other instruction-following suites to check whether reported gains generalize to harder counting and compositional tasks.[1][3]