# IFBench

> Source: https://aiwiki.ai/wiki/ifbench
> Updated: 2026-06-27
> Categories: AI Benchmarks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**IFBench** (Instruction Following Benchmark) is an [artificial intelligence](/wiki/artificial_intelligence) [benchmark](/wiki/benchmark) that measures whether [large language models](/wiki/large_language_model) can follow precise output constraints they have never seen during training. [1] It was created by researchers at the [Allen Institute for Artificial Intelligence](/wiki/allen_institute_for_ai) (Ai2) and the [University of Washington](/wiki/university_of_washington), and introduced in the July 2025 paper "Generalizing Verifiable Instruction Following" (arXiv:2507.02833). [1] IFBench pairs 58 new, programmatically verifiable out-of-domain (OOD) constraints with 300 held-out [WildChat](/wiki/wildchat) prompts, and ships 29 additional training constraints (IFTrain) for [reinforcement learning](/wiki/reinforcement_learning) with verifiable rewards (RLVR). [1] [2]

It was built because the older [IFEval](/wiki/ifeval) benchmark has effectively saturated: frontier models score above 80% on IFEval, yet most score around 50% or lower on IFBench, with leading systems such as Gemini 2.5 Pro and Claude 4 Sonnet "only able to score up to 50%." [4] [7] The one clear exception in the paper's reporting is OpenAI's o3 reasoning model, which holds the highest score at 69.3%; no human-performance figure is reported. [2]

| IFBench | |
| --- | --- |
| **Overview** | |
| Full name | Instruction Following Benchmark |
| Abbreviation | IFBench |
| Description | A benchmark for evaluating precise instruction following with verifiable out-of-domain constraints |
| Release date | 2025-07-03 (arXiv preprint) |
| Latest version | v3 (November 2025) |
| Benchmark updated | 2025 |
| Authors | Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi |
| Organization | Allen Institute for Artificial Intelligence (Ai2), University of Washington |
| **Technical Details** | |
| Type | Instruction Following, Constraint Verification |
| Modality | Text |
| Task format | Single-turn and multi-turn instruction following |
| Number of tasks | 58 test constraints + 29 IFTrain training constraints |
| Total examples | 300 prompts (test split) with 1 or 2 constraints each |
| Evaluation metric | Prompt-level strict accuracy and prompt-level loose accuracy |
| Domains | General instruction following |
| Languages | English (with one Japanese-word interleaving constraint) |
| **Performance** | |
| Human performance | Not reported |
| Baseline | 28.9% (Tulu-3-8B before IF-RLVR training) |
| SOTA score | 69.3% (OpenAI o3, single-turn, as reported in the paper) |
| SOTA model | OpenAI o3 |
| SOTA date | 2025 |
| Saturated | No |
| **Resources** | |
| Paper | [arXiv:2507.02833](https://arxiv.org/abs/2507.02833) |
| GitHub | [allenai/IFBench](https://github.com/allenai/IFBench) |
| Dataset | [allenai/IFBench_test on Hugging Face](https://huggingface.co/datasets/allenai/IFBench_test) |
| License | Apache 2.0 (code), ODC-BY-1.0 (data) |
| Predecessor | [IFEval](/wiki/ifeval) |

## What is IFBench?

IFBench is a benchmark that tests whether a [language model](/wiki/language_model) has actually learned to read a constraint and obey it, rather than having memorized a small, fixed set of constraint templates during post-training. [1] The benchmark contains 58 verifiable out-of-domain constraints attached to held-out WildChat prompts, plus a separate IFTrain set of 29 training constraints. [1] [2] It exists because the older IFEval benchmark has effectively saturated: leading models score above 80% on IFEval, but most of the same models score around 50% or lower on IFBench, indicating that high IFEval scores partly reflect overfitting rather than a general ability to follow new constraints. [1] [7]

The paper states the problem directly: "we find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities and are not able to generalize well to unseen output constraints." [1] When Ai2 announced the benchmark, it framed the gap as an open research target: "Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training." [4]

IFBench was created to test a specific gap in language-model evaluation. Existing precise instruction-following benchmarks reuse a small set of constraint templates, which developers can target with synthetic data during post-training. Once a model has seen many examples of "include keyword X exactly N times," it learns those particular constraints rather than the general skill of reading a constraint and obeying it. The authors argue this turns instruction-following into a closed-book test that hides the fact that models still fail when the user invents an unusual rule. [1]

The benchmark addresses this by holding out both the constraints and the host prompts. The 58 test constraints were written from scratch by the authors and outside contributors, then paired with WildChat prompts that Ai2 held back from public release. A human annotator checked each pairing for compatibility. The result is a 300-instance test set where every instance combines a real user request with one or two unfamiliar verifiable constraints. [1]

The paper has three contributions:

1. IFBench itself, with 58 OOD constraints and Python verification functions. [1]
2. IFTrain, a separate set of 29 OOD training constraints, intended for [reinforcement learning with verifiable rewards](/wiki/reinforcement_learning_with_verifiable_rewards) (RLVR). [1]
3. IF-RLVR, a training recipe using [GRPO](/wiki/grpo) (Group Relative Policy Optimization) that combines multiple constraints per prompt and widens variable ranges during training, improving both IFEval and IFBench accuracy. [1]

## Who created IFBench and when was it released?

IFBench was developed by Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi at the Allen Institute for Artificial Intelligence and the University of Washington. [1] The paper "Generalizing Verifiable Instruction Following" was first posted to arXiv on July 3, 2025 (arXiv:2507.02833), last revised to version 3 on November 11, 2025, and accepted to [NeurIPS 2025](/wiki/neurips_2025) in the Datasets and Benchmarks track. [1] [4] The code is released under Apache 2.0 and the data under ODC-BY-1.0, with the test set hosted as `allenai/IFBench_test` on Hugging Face and the code at `allenai/IFBench` on GitHub. [2] [3]

## How does IFBench differ from IFEval?

IFBench's predecessor, IFEval, was published by [Google Research](/wiki/google_research) in 2023 with 25 constraint templates that can each be checked with short Python functions ([Zhou et al., 2023](https://arxiv.org/abs/2311.07911)). [7] By 2025, open-weight models with 2B to 8B parameters routinely scored above 80% on it, and reports for releases like [Nemotron-4 340B](/wiki/nemotron) explicitly describe synthetic data generated from the IFEval taxonomy. [13] Analyses of [WildChat](/wiki/wildchat) and [WildIFEval](/wiki/wildifeval) ([Lior et al., 2025](https://arxiv.org/abs/2503.06573)) show that users invent constraints more idiosyncratic than the IFEval templates, and that a model that only knows the IFEval templates will follow the first constraint and drop the rest. [9] The decisive difference is that IFBench holds out both the constraints (58 new ones, none in IFEval) and the host prompts (held-out WildChat), so a model cannot have trained on either. [1]

## What is IFBench made of?

### Test constraints

The 58 constraints fall into seven groups (full list in Appendix A): [1]

| Group | Number | Examples |
| --- | --- | --- |
| count | 8 | "Use at least N coordinating conjunctions"; "Mention at least N person names from this list" |
| ratio | 5 | "Maintain a 2:1 ratio of declarative to interrogative sentences"; "Stop words at most P% of total" |
| words | 12 | "Each word starts with the next letter of the alphabet"; "Words with prime-number lengths only"; "Include 10 palindromes" |
| sentence | 3 | "Each sentence must have more alliterative words than the previous one" |
| format | 14 | "Emoji at the end of every sentence"; "Nest parentheses at least 5 levels deep"; "One word per line"; "Title case" |
| custom | 11 | CSV with fixed schema, reverse alphabetical lists, multiple choice generation |
| copy | 5 | "Copy the span between character indices n_start and n_end"; "Repeat the request but change the first word" |

Custom-group constraints replace the user prompt entirely; the rest are concatenated to a held-out WildChat prompt. A typical instance reads: "Write a paragraph about the discovery of penicillin. Each word must start with the next letter of the alphabet, looping back to A after Z." Average prompt length is 76 tokens for single-turn and 408 tokens for multi-turn. [1]
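To make the example above concrete, the alphabet constraint can be checked in a few lines of Python. This is an illustrative sketch, not the verifier shipped with IFBench: the function name `follows_alphabet_cycle`, the tokenization regex, and the choice to let the sequence start at any letter are all assumptions.

```python
import re


def follows_alphabet_cycle(response: str) -> bool:
    """Hypothetical check: each word starts with the letter that follows
    the previous word's first letter, wrapping from Z back to A."""
    # Assumption: a "word" is a run of letters, optionally with inner
    # apostrophes or hyphens (e.g. "don't", "well-known").
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", response)
    if not words:
        return False
    # Assumption: the sequence may begin at any letter, not only "A".
    expected = words[0][0].lower()
    for word in words:
        if word[0].lower() != expected:
            return False
        # Advance one letter, looping back to "a" after "z".
        expected = chr((ord(expected) - ord("a") + 1) % 26 + ord("a"))
    return True
```

Under these assumptions, `follows_alphabet_cycle("Yaks zigzag around")` passes because the sequence wraps from Z to A, while `follows_alphabet_cycle("A big dog")` fails at the third word.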

### Training constraints (IFTrain)

IFTrain consists of 29 constraints for RLVR training, none of which overlap with the IFBench test constraints. [1] It covers ten skill clusters: keyword inclusion/exclusion, letter frequency, paragraph delimiters, first/last word positioning, format wrappers, copying, punctuation avoidance, structured counting, palindromes, and rules like "no two adjacent words start with consecutive letters of the alphabet."

### Evaluation modes

IFBench supports two modes over the same 300 prompts: [1]

| Mode | Structure |
| --- | --- |
| Single-turn | One user message with task + 1 or 2 constraints |
| Multi-turn | Three turns: user task, assistant reply, user follow-up adding a constraint and requesting a rewrite |

Both report strict and loose accuracy following the IFEval convention. Headline numbers in the paper are prompt-level loose accuracy. [1]
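The difference between the two metrics is easiest to see in code. In IFEval's reference implementation, loose scoring re-checks each constraint against variants of the response with the first line, the last line, or Markdown asterisks removed, and a prompt counts as passed only if every attached constraint passes. The sketch below assumes IFBench applies the same transformations; the helper names are hypothetical.

```python
from typing import Callable, Sequence

Verifier = Callable[[str], bool]


def loose_variants(response: str) -> list[str]:
    """Response plus relaxed variants (assumed to mirror IFEval)."""
    lines = response.split("\n")
    base = [
        response,
        "\n".join(lines[1:]).strip(),    # drop a leading line such as "Sure, here it is:"
        "\n".join(lines[:-1]).strip(),   # drop a trailing sign-off line
        "\n".join(lines[1:-1]).strip(),  # drop both
    ]
    # Also try each variant with Markdown emphasis markers removed.
    return base + [variant.replace("*", "") for variant in base]


def passes_strict(response: str, verifiers: Sequence[Verifier]) -> bool:
    """Strict: every constraint must hold on the raw response."""
    return bool(response.strip()) and all(check(response) for check in verifiers)


def passes_loose(response: str, verifiers: Sequence[Verifier]) -> bool:
    """Loose: every constraint must hold on at least one non-empty variant."""
    variants = [v for v in loose_variants(response) if v.strip()]
    return all(any(check(v) for v in variants) for check in verifiers)


def prompt_level_accuracy(passed: Sequence[bool]) -> float:
    """Fraction of prompts for which all constraints were satisfied."""
    return sum(passed) / len(passed)
```

A reply that wraps a compliant answer in a greeting and a sign-off fails the strict check but can pass the loose one, which is why loose accuracy is never lower than strict accuracy on the same responses.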

## How does IFBench verify answers?

Each constraint ships with a Python verification function that returns a boolean. [1] `instructions_registry.py` maps constraint names to function objects, and `evaluation_lib.py` runs them over a JSONL of model responses. [2] Because every check is deterministic, evaluation is reproducible: no [LLM-as-judge](/wiki/llm_as_a_judge), no human rating, no calibration drift. The construction pipeline excluded any constraint not expressible as a Python verifier ("write in a friendly tone" and similar). [1] This is the same automatic-verification philosophy IFEval introduced, which the paper describes as constraints that "can all be automatically verified using short python functions." [1] [7]
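The registry-plus-runner pattern described above can be sketched as follows. The constraint identifiers, the JSONL field names (`response`, `instruction_ids`, `kwargs`), and both verifier functions are hypothetical stand-ins for illustration; the actual contents of `instructions_registry.py` and `evaluation_lib.py` may differ.

```python
import json
import re


def one_word_per_line(response: str) -> bool:
    """Hypothetical verifier: every non-empty line holds exactly one word."""
    lines = [line for line in response.splitlines() if line.strip()]
    return bool(lines) and all(len(line.split()) == 1 for line in lines)


def has_min_palindromes(response: str, n: int = 10) -> bool:
    """Hypothetical verifier: at least n distinct palindromic words.
    Assumption: only words of three or more letters count."""
    words = re.findall(r"[a-z]+", response.lower())
    palindromes = {w for w in words if len(w) >= 3 and w == w[::-1]}
    return len(palindromes) >= n


# Hypothetical registry: constraint name -> function object.
HYPOTHETICAL_REGISTRY = {
    "format:one_word_per_line": one_word_per_line,
    "words:palindromes": has_min_palindromes,
}


def evaluate(jsonl_lines):
    """Return one boolean per record: True if all attached constraints pass."""
    results = []
    for line in jsonl_lines:
        record = json.loads(line)
        checks = [
            HYPOTHETICAL_REGISTRY[name](record["response"], **kwargs)
            for name, kwargs in zip(record["instruction_ids"], record["kwargs"])
        ]
        results.append(all(checks))
    return results
```

Because each verifier is a pure function of the response text, running `evaluate` twice on the same file yields the same result, which is the reproducibility property the section describes.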

## How do models score on IFBench?

The paper reports IFBench numbers alongside IFEval scores. Selected results: [1] [2]

### Frontier models, single-turn (before IF-RLVR training)

| Model | IFBench (loose) |
| --- | --- |
| OpenAI o3 | 69.3% |
| Gemini 2.5 Pro | 52.3% |
| Claude 4 Sonnet | below 50% |
| Qwen3-32B | below 50% |
| GPT-4.1 | below 50% |
| Claude 3.7 Sonnet | below 50% |

Every non-reasoning frontier model lost roughly 30 to 40 points compared to its IFEval score. [1] OpenAI's o3 reasoning model is the outlier at 69.3%, the single highest score reported on the benchmark, while Gemini 2.5 Pro reaches 52.3% and most leading systems land below 50%. [2] [4]

### IF-RLVR results

| Configuration | IFEval | IFBench |
| --- | --- | --- |
| Tulu-3-8B (DPO baseline) | 82.4% | 28.9% |
| Tulu-3-8B + IF-RLVR | 92.2% | 45.9% |
| Qwen2.5-7B base + IF-RLVR | 87.8% | 54.7% |
| Llama-3.1-8B base + IF-RLVR | 88.2% | 54.1% |
| OLMo2 base + IF-RLVR | 70.4% | 46.6% |

Running IF-RLVR from a base model, with a chat template that encourages the model to think before answering, gave the best out-of-domain generalization. [1] Llama-3.1-8B base reached 54.1% on IFBench versus 44.6% for the same model trained from its instruct checkpoint, despite similar IFEval scores: the paper's strongest argument that IF-RLVR teaches a transferable skill. [1] IFBench is part of the [Artificial Analysis](/wiki/artificial_analysis) Intelligence Index, where reasoning-augmented systems such as Grok 4's reasoning variants have been reported above 80%. [6]

## How does the IF-RLVR training recipe work?

The paper's second half describes IF-RLVR, the [reinforcement learning](/wiki/reinforcement_learning) recipe the authors recommend for improving precise instruction following. The contribution is not RLVR itself (already used for math and code in [Tulu 3](/wiki/tulu_3)), but the specific data and training choices that make it work for the constraint setting. [1] [11]

### Data and training

Training prompts are built by sampling an instruction from the Tulu-3-SFT mix and appending one to six constraints from two pools: the 25 IFEval templates and the 29 in IFTrain. A conflict dictionary prevents incompatible combinations. Variable ranges are widened beyond test ranges. Most experiments use 60,000 to 100,000 prompts. [1]
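A minimal sketch of this prompt-construction step is shown below. The constraint names and wording, the conflict pairs, and the numeric range for `{n}` are invented for illustration and do not reproduce the IFEval or IFTrain pools; only the overall shape (sample one to six constraints, skip conflicting ones, append to the instruction) follows the description above.

```python
import random

# Hypothetical constraint pool: name -> template text.
CONSTRAINT_POOL = {
    "lowercase_only": "Write your entire answer in lowercase.",
    "title_case": "Write your entire answer in title case.",
    "no_commas": "Do not use any commas.",
    "end_with_ps": "End your answer with a postscript starting with P.S.",
    "min_three_paragraphs": "Use at least three paragraphs.",
    "single_sentence": "Answer in a single sentence.",
    "keyword_count": "Use the word 'data' at least {n} times.",
}

# Hypothetical conflict dictionary: constraints that cannot be combined.
CONFLICTS = {
    "lowercase_only": {"title_case"},
    "min_three_paragraphs": {"single_sentence"},
}


def conflicts_with(name, chosen, conflicts):
    """True if `name` is incompatible with any already chosen constraint."""
    return any(
        name in conflicts.get(other, set()) or other in conflicts.get(name, set())
        for other in chosen
    )


def build_training_prompt(instruction, rng, pool=CONSTRAINT_POOL,
                          conflicts=CONFLICTS, max_constraints=6):
    """Append between 1 and max_constraints compatible constraints."""
    target = rng.randint(1, max_constraints)
    candidates = list(pool)
    rng.shuffle(candidates)
    chosen = []
    for name in candidates:
        if len(chosen) == target:
            break
        if not conflicts_with(name, chosen, conflicts):
            chosen.append(name)
    # Assumption: training draws {n} from a deliberately wide range.
    texts = [pool[name].format(n=rng.randint(1, 10)) for name in chosen]
    return " ".join([instruction] + texts), chosen
```

Calling `build_training_prompt("Summarize the article.", random.Random(0))` returns the combined prompt text together with the list of constraint names, which a training loop would later use to look up the matching verifiers.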

The RL algorithm is GRPO ([Shao et al., 2024](https://arxiv.org/abs/2402.03300)) implemented in Ai2's open-instruct library. [10] The reward per generation is a sum of per-constraint verification scores:

```
Instance Reward = sum_i ( verifiable_reward_i * reward_multiplier_i * reward_weight_i )
```

Default multipliers and weights are 1, making the reward a count of satisfied constraints. Training uses 8 H100 [GPUs](/wiki/gpu), learning rate 5e-7, 16 samples per prompt, mini-batch 32, max token length 2,048 (10,240 with reasoning chat templates), and ~2,000 steps (about one day per run). [1]
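The reward formula, and the way GRPO turns a group of sampled rewards into advantages, can be sketched in plain Python. The first function follows the formula above directly. The second uses the group normalization from the GRPO paper (reward minus group mean, divided by group standard deviation); the function names, the use of population standard deviation, and the `eps` term are assumptions, and the open-instruct implementation may differ in detail.

```python
from statistics import mean, pstdev


def instance_reward(verified, multipliers=None, weights=None):
    """Sum of per-constraint verification results, each scaled by its
    multiplier and weight. With the defaults of 1 this is simply the
    number of satisfied constraints."""
    n = len(verified)
    if multipliers is None:
        multipliers = [1.0] * n
    if weights is None:
        weights = [1.0] * n
    return sum(float(ok) * m * w for ok, m, w in zip(verified, multipliers, weights))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of samples:
    each reward is centered on the group mean and scaled by the group
    standard deviation (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: 4 samples for a prompt with 3 constraints (the paper uses 16).
group = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, False, False],
]
rewards = [instance_reward(sample) for sample in group]
advantages = group_relative_advantages(rewards)
```

In this example the sample that satisfies all three constraints receives the largest positive advantage and the one that satisfies none receives the most negative, so the policy update favors generations that satisfy more constraints than their siblings from the same prompt.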

### Ablation findings

Four ablations from Section 4 shape the recommended recipe. [1]

| Ablation | Comparison | Result |
| --- | --- | --- |
| Constraints per prompt | 1 to 6 | Training on more constraints improved IFBench from ~49% to ~56% on a Qwen2.5-7B policy, even though IFBench prompts have only 1 or 2 constraints |
| Variable ranges | Same, wider, disjoint | Wider ranges generalized better than identical or disjoint ranges |
| Categories left out | Cases, format, length, keywords | Removing length or keyword constraints hurt IFEval most; removing format or cases barely mattered |
| Algorithm | GRPO vs DPO on identical data | GRPO reached 89.65% on IFEval; DPO on the same prompts reached 79.67% |

The DPO comparison shows GRPO is doing more than exposing the model to verifier-labelled data: same prompts, same starting checkpoint, ten-point gap. [1]

### Reward hacking

IF-RLVR training has a side effect: models over-prioritize constraints at the expense of the task. A model asked for a single-sentence summary with the constraint "each word must start with the next letter of the alphabet" produces a sentence that follows the alphabet rule but does not summarize the text. The authors score outputs with [GPT-4.1](/wiki/gpt_4_1) as a judge: their RLVR-trained Tulu drops from 7.0 to 6.4 on a 10-point helpfulness scale even as verifiable accuracy rises. [1] The paper proposes mixing the verifiable reward with a [reward model](/wiki/reward_model) signal (Llama-3.1-Tulu-3-8B-RM). The mixed reward scores about 30% on IFBench (down from 45.9% with the verifiable reward alone) but recovers on AlpacaEval 2 (31.6), so teams can trade constraint accuracy against general response quality. [1]

## How does IFBench compare to other instruction-following benchmarks?

| Benchmark | Constraint count | Verification | Design |
| --- | --- | --- | --- |
| [IFEval](/wiki/ifeval) | 25 templates | Python | Fixed taxonomy; saturated |
| [FollowBench](/wiki/followbench) | 5 types, 5 levels | LLM-as-judge | Escalating constraints per prompt |
| [InfoBench](/wiki/infobench) | 500 instructions | LLM-as-judge (DRFR) | Atomic decomposition |
| IFBench | 58 test + 29 train | Python | Held-out constraints and host prompts |
| [VFF](/wiki/vff) | Procedurally generated | Python | Used mainly for SFT/DPO data |
| [WildIFEval](/wiki/wildifeval) | 1,500 user-collected | LLM-as-judge | Real user constraints, not all verifiable |

IFBench's niche is the combination of automatic verification with both held-out constraints and held-out host prompts. [1]

## Note on naming

A separate, older project also called "IFBench: Towards a Benchmark for Verifiable Instruction Following Evaluations" appears in earlier work and is not the Ai2 IFBench described here. This article uses "IFBench" to refer to the Pyatkin et al. (2025) benchmark, which is what current papers and leaderboards mean. [1]

## Why does IFBench matter, and what are its limits?

IFBench's primary contribution is methodological: by holding out both the constraints and the host prompts, it provides a fair test of whether a model has learned to read instructions or has only memorized a fixed taxonomy. [1] Within months of release it was integrated into the Artificial Analysis Intelligence Index and into [LightEval](/wiki/lighteval). [6] [16] The reward-hacking section also documents a phenomenon worth attention for [RLHF](/wiki/rlhf) and RLVR research: training to follow constraints can degrade general response quality if the verifiable reward is the only signal. [1]

The authors acknowledge several limitations. The benchmark only covers constraints expressible as short Python verifiers, which excludes many natural-language constraints users care about ("sound friendly," "avoid jargon"). Some constraints are unnatural compared to real user requests ("include at least 10 palindromes"). The dataset is in English. Pass/fail scoring does not credit partial compliance. And, like all held-out benchmarks, IFBench's value will erode as its constraints leak into training data over time. [1]

## See also

- [IFEval](/wiki/ifeval)
- [WildChat](/wiki/wildchat)
- [WildIFEval](/wiki/wildifeval)
- [Tulu 3](/wiki/tulu_3)
- [GRPO](/wiki/grpo)
- [Reinforcement learning with verifiable rewards](/wiki/reinforcement_learning_with_verifiable_rewards)
- [Allen Institute for AI](/wiki/allen_institute_for_ai)
- [InfoBench](/wiki/infobench)
- [FollowBench](/wiki/followbench)

## References

1. Pyatkin, V., Malik, S., Graf, V., Ivison, H., Huang, S., Dasigi, P., Lambert, N., & Hajishirzi, H. (2025). "Generalizing Verifiable Instruction Following." arXiv:2507.02833. https://arxiv.org/abs/2507.02833
2. Allen Institute for AI. "IFBench GitHub repository." https://github.com/allenai/IFBench
3. Allen Institute for AI. "IFBench_test dataset on Hugging Face." https://huggingface.co/datasets/allenai/IFBench_test
4. NeurIPS 2025. "Generalizing Verifiable Instruction Following (poster page)." https://neurips.cc/virtual/2025/poster/121379
5. OpenReview. "Generalizing Verifiable Instruction Following." https://openreview.net/forum?id=yfYgwjj5F8
6. Artificial Analysis. "IFBench Benchmark Leaderboard." https://artificialanalysis.ai/evaluations/ifbench
7. Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., & Hou, L. (2023). "Instruction-Following Evaluation for Large Language Models" (IFEval). arXiv:2311.07911. https://arxiv.org/abs/2311.07911
8. Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., & Deng, Y. (2024). "WildChat: 1M ChatGPT Interaction Logs in the Wild." arXiv:2405.01470. https://arxiv.org/abs/2405.01470
9. Lior, G., Habba, A., Granitzer, M., & Stanovsky, G. (2025). "WildIFEval: Instruction Following in the Wild." arXiv:2503.06573. https://arxiv.org/abs/2503.06573
10. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (introduces GRPO). arXiv:2402.03300. https://arxiv.org/abs/2402.03300
11. Lambert, N., et al. (2024). "Tulu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv:2411.15124. https://arxiv.org/abs/2411.15124
12. Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." arXiv:2410.05229. https://arxiv.org/abs/2410.05229
13. Adler, B., et al. (2024). "Nemotron-4 340B Technical Report." arXiv:2406.11704. https://arxiv.org/abs/2406.11704
14. Jiang, Y., et al. (2023). "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models." arXiv:2310.20410. https://arxiv.org/abs/2310.20410
15. Qin, Y., Song, K., Hu, Y., Yao, W., Cho, S., Wang, X., Wu, X., Liu, F., Liu, P., & Yu, D. (2024). "InfoBench: Evaluating Instruction Following Ability in Large Language Models." arXiv:2401.03601. https://arxiv.org/abs/2401.03601
16. Hugging Face. "LightEval IFBench task implementation." https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/tasks/ifbench/main.py
17. Allen Institute for AI on X (Twitter). "Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions." https://x.com/allen_ai/status/1940833394025279857