IFBench
Last reviewed
May 10, 2026
Sources
17 citations
Review status
Source-backed
Revision
v2 · 2,499 words
| IFBench | |
|---|---|
| Overview | |
| Full name | Instruction Following Benchmark |
| Abbreviation | IFBench |
| Description | A benchmark for evaluating precise instruction following with verifiable out-of-domain constraints |
| Release date | 2025-07-03 (arXiv preprint) |
| Latest version | v3 (November 2025) |
| Benchmark updated | 2025 |
| Authors | Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi |
| Organization | Allen Institute for Artificial Intelligence (AI2), University of Washington |
| Technical Details | |
| Type | Instruction Following, Constraint Verification |
| Modality | Text |
| Task format | Single-turn and multi-turn instruction following |
| Number of tasks | 58 test constraints + 29 IFTrain training constraints |
| Total examples | 300 prompts (test split) with 1 or 2 constraints each |
| Evaluation metric | Prompt-level strict accuracy and prompt-level loose accuracy |
| Domains | General instruction following |
| Languages | English (with one Japanese-word interleaving constraint) |
| Performance | |
| Human performance | Not reported |
| Baseline | Roughly 28.9% (Tülu-3-8B before IF-RLVR training) |
| SOTA score | 69.3% (OpenAI o3, single-turn) |
| SOTA model | OpenAI o3 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Paper | arXiv:2507.02833 |
| GitHub | allenai/IFBench |
| Dataset | allenai/IFBench_test on Hugging Face |
| License | Apache 2.0 (code), ODC-BY-1.0 (data) |
| Predecessor | IFEval |
**IFBench** (Instruction Following Benchmark) is an artificial intelligence benchmark that measures whether large language models can follow precise output constraints they have not encountered during training. It was developed by researchers at the Allen Institute for Artificial Intelligence (AI2) and the University of Washington, and introduced in the paper "Generalizing Verifiable Instruction Following" (arXiv:2507.02833), first posted on July 3, 2025 and accepted to NeurIPS 2025 in the Datasets and Benchmarks track. The benchmark contains 58 verifiable out-of-domain (OOD) constraints attached to held-out WildChat prompts, plus a separate IFTrain set of 29 training constraints. It exists because the older IFEval benchmark has effectively saturated: leading models score above 80% on IFEval, but the same models score below 50% on IFBench, indicating that high IFEval scores partly reflect overfitting rather than a general ability to follow new constraints.
IFBench was created to test a specific gap in language model evaluation. Existing precise instruction-following benchmarks reuse a small set of constraint templates, which developers can target with synthetic data during post-training. Once a model has seen many examples of "include keyword X exactly N times," it learns those particular constraints rather than the general skill of reading a constraint and obeying it. The authors argue this turns instruction-following into a closed-book test that hides the fact that models still fail when the user invents an unusual rule.
The benchmark addresses this by holding out both the constraints and the host prompts. The 58 test constraints were written from scratch by the authors and outside contributors, then paired with WildChat prompts that AI2 held back from public release. A human annotator checked each pairing for compatibility. The result is a 300-instance test set where every instance combines a real user request with one or two unfamiliar verifiable constraints.
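The paper has three contributions: the IFBench test set of 58 held-out constraints, the IFTrain set of 29 training constraints, and IF-RLVR, a reinforcement learning recipe for improving precise instruction following.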
IFBench's predecessor, IFEval, was published by Google Research in 2023 with 25 constraint templates (Zhou et al., 2023). By 2025, 2B to 8B open-weight models routinely scored 80%+ on it, and reports for releases like Nemotron-4 340B explicitly describe synthetic data generated from the IFEval taxonomy. Analyses of WildChat and WildIFEval (Lior et al., 2025) show that users invent constraints more idiosyncratic than the IFEval templates, and that a model trained only against those templates tends to follow the first constraint in a request and drop the rest.
The 58 constraints fall into seven groups (full list in Appendix A):
| Group | Number | Examples |
|---|---|---|
| count | 8 | "Use at least N coordinating conjunctions"; "Mention at least N person names from this list" |
| ratio | 5 | "Maintain a 2:1 ratio of declarative to interrogative sentences"; "Stop words at most P% of total" |
| words | 12 | "Each word starts with the next letter of the alphabet"; "Words with prime-number lengths only"; "Include 10 palindromes" |
| sentence | 3 | "Each sentence must have more alliterative words than the previous one" |
| format | 14 | "Emoji at the end of every sentence"; "Nest parentheses at least 5 levels deep"; "One word per line"; "Title case" |
| custom | 11 | CSV with fixed schema, reverse alphabetical lists, multiple choice generation |
| copy | 5 | "Copy the span between character indices n_start and n_end"; "Repeat the request but change the first word" |
Custom-group constraints replace the user prompt entirely; the rest are concatenated to a held-out WildChat prompt. A typical instance reads: "Write a paragraph about the discovery of penicillin. Each word must start with the next letter of the alphabet, looping back to A after Z." Average prompt length is 76 tokens for single-turn and 408 tokens for multi-turn.
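The alphabet rule in the example above is the kind of constraint that reduces to a short deterministic check. The following is a minimal sketch, not code from the IFBench repository: `follows_alphabet_chain` is a hypothetical name, and it assumes that the chain may start at any letter and that only alphabetic tokens count as words, neither of which is specified here.

```python
import re


def follows_alphabet_chain(response: str) -> bool:
    """Return True if each word starts with the letter after the previous
    word's first letter, wrapping from Z back to A."""
    # Assumption: a "word" is a run of ASCII letters, optionally with
    # internal apostrophes (e.g. "don't"); digits and punctuation are ignored.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", response)
    if not words:
        return False
    # Assumption: the chain may begin at any letter, so only the step
    # between neighbouring words is checked.
    for prev, curr in zip(words, words[1:]):
        expected = chr((ord(prev[0].lower()) - ord("a") + 1) % 26 + ord("a"))
        if curr[0].lower() != expected:
            return False
    return True
```

For instance, `follows_alphabet_chain("Young zebras ate bananas")` passes because the initials run Y, Z, A, B, while a response that skips a letter fails on the first break in the chain.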
IFTrain consists of 29 constraints for RLVR training, none of which overlap with the IFBench test constraints. It covers ten skill clusters: keyword inclusion/exclusion, letter frequency, paragraph delimiters, first/last word positioning, format wrappers, copying, punctuation avoidance, structured counting, palindromes, and rules like "no two adjacent words start with consecutive letters of the alphabet."
IFBench supports two modes over the same 300 prompts:
| Mode | Structure |
|---|---|
| Single-turn | One user message with task + 1 or 2 constraints |
| Multi-turn | Three turns: user task, assistant reply, user follow-up adding a constraint and requesting a rewrite |
Both report strict and loose accuracy following the IFEval convention. Headline numbers in the paper are prompt-level loose accuracy.
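The difference between the two metrics can be sketched as follows, treating each constraint check as a function from a response string to a boolean. This is an illustration modelled on the loose criterion used by IFEval (retrying the check with the first line, last line, and markdown asterisks removed); the function names are hypothetical and the exact set of relaxations IFBench applies is an assumption here.

```python
from typing import Callable

Verifier = Callable[[str], bool]


def loose_variants(response: str) -> list[str]:
    """Relaxed views of a response, modelled on IFEval's loose criterion."""
    lines = response.split("\n")
    without_first = "\n".join(lines[1:]).strip()
    without_last = "\n".join(lines[:-1]).strip()
    without_both = "\n".join(lines[1:-1]).strip()
    base = [response, without_first, without_last, without_both]
    # Also try every view with markdown asterisks stripped.
    return base + [view.replace("*", "") for view in base]


def prompt_passes(response: str, verifiers: list[Verifier], loose: bool) -> bool:
    """A prompt counts as correct only if every attached constraint passes."""
    candidates = loose_variants(response) if loose else [response]
    candidates = [c for c in candidates if c.strip()]  # empty text never passes
    return all(any(check(c) for c in candidates) for check in verifiers)


def prompt_level_accuracy(results: list[bool]) -> float:
    """Fraction of prompts for which all constraints were satisfied."""
    return sum(results) / len(results) if results else 0.0
```

Under this sketch, a reply that wraps a compliant answer in a greeting and a sign-off can fail the strict check yet pass the loose one, which is why loose accuracy is never lower than strict accuracy.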
Each constraint ships with a Python verification function that returns a boolean. `instructions_registry.py` maps constraint names to function objects, and `evaluation_lib.py` runs them over a JSONL of model responses. Because every check is deterministic, evaluation is reproducible: no LLM-as-judge, no human rating, no calibration drift. The construction pipeline excluded any constraint not expressible as a Python verifier ("write in a friendly tone" and similar).
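The registry-plus-runner pattern described above might look roughly like the sketch below. It does not reproduce the contents of the repository files: the registry keys, the JSONL field names `response` and `instruction_ids`, and both verifier functions are assumptions chosen for illustration, loosely based on two constraints from the format group.

```python
import json
from typing import Callable, Iterable


def one_word_per_line(response: str) -> bool:
    """Every non-empty line must contain exactly one whitespace-separated token."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return bool(lines) and all(len(ln.split()) == 1 for ln in lines)


def nested_parentheses(response: str, depth: int = 5) -> bool:
    """Parentheses must be nested at least `depth` levels deep somewhere."""
    current = deepest = 0
    for ch in response:
        if ch == "(":
            current += 1
            deepest = max(deepest, current)
        elif ch == ")":
            current = max(0, current - 1)
    return deepest >= depth


# Hypothetical registry: constraint name -> verification function.
REGISTRY: dict[str, Callable[[str], bool]] = {
    "format:one_word_per_line": one_word_per_line,
    "format:nested_parentheses": nested_parentheses,
}


def evaluate_jsonl(lines: Iterable[str]) -> list[bool]:
    """Return one strict pass/fail result per JSONL record."""
    results = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        checks = [REGISTRY[name] for name in record["instruction_ids"]]
        results.append(all(check(record["response"]) for check in checks))
    return results
```

Because the runner only looks up functions by name and calls them, the same records always produce the same results, which is the reproducibility property the paragraph above refers to.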
The paper reports IFBench numbers alongside IFEval scores. Selected results:
| Model | IFBench (loose) |
|---|---|
| OpenAI o3 | 69.3% |
| Claude 4 Sonnet | below 50% |
| Qwen3-32B | below 50% |
| GPT-4.1 | below 50% |
| Claude 3.7 Sonnet | below 50% |
Every non-reasoning frontier model lost roughly 30 to 40 points compared to its IFEval score. OpenAI's o3 reasoning model is the outlier at 69.3%.
| Configuration | IFEval | IFBench |
|---|---|---|
| Tülu-3-8B (DPO baseline) | 82.4% | 28.9% |
| Tülu-3-8B + IF-RLVR | 92.2% | 45.9% |
| Qwen2.5-7B base + IF-RLVR | 87.8% | 54.7% |
| Llama-3.1-8B base + IF-RLVR | 88.2% | 54.1% |
| OLMo2 base + IF-RLVR | 70.4% | 46.6% |
Running IF-RLVR from a base model, with a chat template that encourages the model to think before answering, gave the best out-of-domain generalization. Llama-3.1-8B base reached 54.1% on IFBench versus 44.6% for the same model trained from its instruct checkpoint, despite similar IFEval scores: the paper's strongest argument that IF-RLVR teaches a transferable skill. IFBench is part of the Artificial Analysis Intelligence Index, where reasoning-augmented systems such as Grok 4's reasoning variants have been reported above 80%.
The paper's second half describes IF-RLVR, the reinforcement learning recipe the authors recommend for improving precise instruction following. The contribution is not RLVR itself (already used for math and code in Tülu 3), but the specific data and training choices that make it work for the constraint setting.
Training prompts are built by sampling an instruction from the Tülu-3-SFT mix and appending one to six constraints from two pools: the 25 IFEval templates and the 29 in IFTrain. A conflict dictionary prevents incompatible combinations. Variable ranges are widened beyond test ranges. Most experiments use 60,000 to 100,000 prompts.
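The sampling step can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' data pipeline: the function names are hypothetical, the conflict dictionary is treated as symmetric, and constraints are appended as plain sentences without filling in template variables.

```python
import random
from typing import Optional


def sample_constraints(
    pool: list[str],
    conflicts: dict[str, set[str]],
    max_constraints: int = 6,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Pick between 1 and `max_constraints` mutually compatible constraints."""
    rng = rng or random.Random()
    target = rng.randint(1, max_constraints)
    candidates = list(pool)
    rng.shuffle(candidates)
    chosen: list[str] = []
    for name in candidates:
        if len(chosen) == target:
            break
        # Assumption: conflicts are symmetric, so check both directions.
        clashes = any(
            name in conflicts.get(prev, set()) or prev in conflicts.get(name, set())
            for prev in chosen
        )
        if not clashes:
            chosen.append(name)
    return chosen


def build_training_prompt(
    instruction: str, constraint_texts: dict[str, str], chosen: list[str]
) -> str:
    """Append the text of each chosen constraint to the base instruction."""
    return " ".join([instruction] + [constraint_texts[name] for name in chosen])
```

With a conflict entry such as `{"lowercase": {"uppercase"}}`, the sampler never places an all-lowercase rule and an all-uppercase rule in the same prompt, at the cost of occasionally returning fewer constraints than the sampled target.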
The RL algorithm is GRPO (Shao et al., 2024) implemented in AI2's open-instruct library. The reward per generation is a sum of per-constraint verification scores:
Instance Reward = sum_i ( verifiable_reward_i * reward_multiplier_i * reward_weight_i )
Default multipliers and weights are 1, making the reward a count of satisfied constraints. Training uses 8 H100 GPUs, learning rate 5e-7, 16 samples per prompt, mini-batch 32, max token length 2,048 (10,240 with reasoning chat templates), and ~2,000 steps (about one day per run).
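The reward formula translates directly into code. The sketch below is illustrative rather than taken from open-instruct: `RewardTerm` and both function names are hypothetical. The second function shows the group-relative normalization that GRPO applies across the samples drawn for one prompt; population standard deviation is used here, and implementations may differ on that detail.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable


@dataclass
class RewardTerm:
    """One constraint attached to a prompt, with its scaling factors."""
    verifier: Callable[[str], bool]
    multiplier: float = 1.0
    weight: float = 1.0


def instance_reward(response: str, terms: list[RewardTerm]) -> float:
    """Sum of verifiable_reward_i * reward_multiplier_i * reward_weight_i."""
    return sum(
        float(term.verifier(response)) * term.multiplier * term.weight
        for term in terms
    )


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's sample group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With the default multiplier and weight of 1, `instance_reward` returns the number of satisfied constraints, so a response meeting two of three constraints scores 2.0 and is ranked against the other samples for the same prompt.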
Four ablations from Section 4 shape the recommended recipe.
| Ablation | Comparison | Result |
|---|---|---|
| Constraints per prompt | 1 to 6 | Training on more constraints improved IFBench from ~49% to ~56% on a Qwen2.5-7B policy, even though IFBench prompts have only 1 or 2 constraints |
| Variable ranges | Same, wider, disjoint | Wider ranges generalized better than identical or disjoint ranges |
| Categories left out | Cases, format, length, keywords | Removing length or keyword constraints hurt IFEval most; removing format or cases barely mattered |
| Algorithm | GRPO vs DPO on identical data | GRPO reached 89.65% IFEval; DPO on the same prompts reached 79.67% |
The DPO comparison shows GRPO is doing more than exposing the model to verifier-labelled data: same prompts, same starting checkpoint, ten-point gap.
IF-RLVR training has a side effect: models over-prioritize constraints at the expense of the task. A model asked for a single-sentence summary with the constraint "each word must start with the next letter of the alphabet" produces a sentence that follows the alphabet rule but does not summarize the text. The authors score outputs with GPT-4.1 as a judge: their RLVR-trained Tülu drops from 7.0 to 6.4 on a 10-point helpfulness scale even as verifiable accuracy rises. The paper proposes mixing the verifiable reward with a reward model signal (Llama-3.1-Tulu-3-8B-RM); the mix lands at 30 on IFBench (rather than 45.9) but recovers on AlpacaEval 2 (31.6), giving teams a way to trade constraint adherence against general response quality.
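One simple way to picture such a mix is a weighted average of the two signals. The sketch below is an assumption made for illustration and is not the paper's combination rule: `mixed_reward` and `alpha` are hypothetical, and the reward model score is assumed to be rescaled to the range 0 to 1 beforehand.

```python
def mixed_reward(
    verifiable: float,
    num_constraints: int,
    rm_score: float,
    alpha: float = 0.5,
) -> float:
    """Blend constraint satisfaction with a reward model score.

    verifiable:      count of satisfied constraints for this response
    num_constraints: number of constraints attached to the prompt
    rm_score:        reward model score, assumed already scaled to [0, 1]
    alpha:           hypothetical mixing weight; 1.0 means verifier only
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Normalize the count so both signals share the same scale.
    fraction = verifiable / num_constraints if num_constraints else 0.0
    return alpha * fraction + (1.0 - alpha) * rm_score
```

Under this formulation, a response that satisfies every constraint but reads poorly (low `rm_score`) no longer receives the maximum reward, which is the behavior the mixed signal is meant to discourage.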
| Benchmark | Constraint count | Verification | Design |
|---|---|---|---|
| IFEval | 25 templates | Python | Fixed taxonomy; saturated |
| FollowBench | 5 types, 5 levels | LLM-as-judge | Escalating constraints per prompt |
| InfoBench | 500 instructions | LLM-as-judge (DRFR) | Atomic decomposition |
| IFBench | 58 test + 29 train | Python | Held-out constraints and host prompts |
| VFF | Procedurally generated | Python | Used mainly for SFT/DPO data |
| WildIFEval | 1,500 user-collected | LLM-as-judge | Real user constraints, not all verifiable |
IFBench's niche is the combination of automatic verification with both held-out constraints and held-out host prompts.
A separate, older project also called "IFBench: Towards a Benchmark for Verifiable Instruction Following Evaluations" appears in earlier work and is not the AI2 IFBench described here. This article uses "IFBench" to refer to the Pyatkin et al. (2025) benchmark, which is what current papers and leaderboards mean.
IFBench's primary contribution is methodological: by holding out both the constraints and the host prompts, it provides a fair test of whether a model has learned to read instructions or has only memorized a fixed taxonomy. Within months of release it was integrated into the Artificial Analysis Intelligence Index and into LightEval. The reward-hacking section also documents a phenomenon worth attention for RLHF and RLVR research: training to follow constraints can degrade general response quality if the verifiable reward is the only signal.
The authors acknowledge several limitations. The benchmark only covers constraints expressible as short Python verifiers, which excludes many natural-language constraints users care about ("sound friendly," "avoid jargon"). Some constraints are unnatural compared to real user requests ("include at least 10 palindromes"). The dataset is in English. Pass/fail scoring does not credit partial compliance. And, like all held-out benchmarks, IFBench's value will erode as its constraints leak into training data over time.