IFBench

**

IFBench
Overview
Full name	Instruction Following Benchmark
Abbreviation	IFBench
Description	A benchmark for evaluating precise instruction following with verifiable out-of-domain constraints
Release date	2025-07-03 (arXiv preprint)
Latest version	v3 (November 2025)
Benchmark updated	2025
Authors	Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi
Organization	Allen Institute for Artificial Intelligence (AI2), University of Washington
Technical Details
Type	Instruction Following, Constraint Verification
Modality	Text
Task format	Single-turn and multi-turn instruction following
Number of tasks	58 test constraints + 29 IFTrain training constraints
Total examples	300 prompts (test split) with 1 or 2 constraints each
Evaluation metric	Prompt-level strict accuracy and prompt-level loose accuracy
Domains	General instruction following
Languages	English (with one Japanese-word interleaving constraint)
Performance
Human performance	Not reported
Baseline	Roughly 28.9% (Tülu-3-8B before IF-RLVR training)
SOTA score	69.3% (OpenAI o3, single-turn)
SOTA model	OpenAI o3
SOTA date	2025
Saturated	No
Resources
Paper	arXiv:2507.02833
GitHub	allenai/IFBench
Dataset	allenai/IFBench_test on Hugging Face
License	Apache 2.0 (code), ODC-BY-1.0 (data)
Predecessor	IFEval

IFBench** (Instruction Following Benchmark) is an artificial intelligence benchmark that measures whether large language models can follow precise output constraints they have not encountered during training. It was developed by researchers at the Allen Institute for Artificial Intelligence (AI2) and the University of Washington, and introduced in the paper "Generalizing Verifiable Instruction Following" (arXiv:2507.02833), first posted on July 3, 2025 and accepted to NeurIPS 2025 in the Datasets and Benchmarks track. The benchmark contains 58 verifiable out-of-domain (OOD) constraints attached to held-out WildChat prompts, plus a separate IFTrain set of 29 training constraints. It exists because the older IFEval benchmark has effectively saturated: leading models score above 80% on IFEval, but the same models score below 50% on IFBench, indicating that high IFEval scores partly reflect overfitting rather than a general ability to follow new constraints.

Overview

IFBench was created to test a specific gap in language model evaluation. Existing precise instruction-following benchmarks reuse a small set of constraint templates, which developers can target with synthetic data during post-training. Once a model has seen many examples of "include keyword X exactly N times," it learns those particular constraints rather than the general skill of reading a constraint and obeying it. The authors argue this turns instruction-following into a closed-book test that hides the fact that models still fail when the user invents an unusual rule.

The benchmark addresses this by holding out both the constraints and the host prompts. The 58 test constraints were written from scratch by the authors and outside contributors, then paired with WildChat prompts that AI2 held back from public release. A human annotator checked each pairing for compatibility. The result is a 300-instance test set where every instance combines a real user request with one or two unfamiliar verifiable constraints.

The paper has three contributions:

IFBench itself, with 58 OOD constraints and Python verification functions.
IFTrain, a separate set of 29 OOD training constraints, intended for reinforcement learning with verifiable rewards (RLVR).
IF-RLVR, a training recipe using GRPO (Group Relative Policy Optimization) that combines multiple constraints per prompt and widens variable ranges during training, improving both IFEval and IFBench accuracy.

Background and motivation

IFBench's predecessor, IFEval, was published by Google Research in 2023 with 25 constraint templates (Zhou et al., 2023). By 2025, 2B to 8B open-weight models routinely scored 80%+ on it, and reports for releases like Nemotron-4 340B explicitly describe synthetic data generated from the IFEval taxonomy. Analyses of WildChat and WildIFEval (Lior et al., 2025) show users invent constraints more idiosyncratic than the IFEval templates, and a model that only knows IFEval will follow the first one and drop the rest.

Benchmark composition

Test constraints

The 58 constraints fall into seven groups (full list in Appendix A):

Group	Number	Examples
count	8	"Use at least N coordinating conjunctions"; "Mention at least N person names from this list"
ratio	5	"Maintain a 2:1 ratio of declarative to interrogative sentences"; "Stop words at most P% of total"
words	12	"Each word starts with the next letter of the alphabet"; "Words with prime-number lengths only"; "Include 10 palindromes"
sentence	3	"Each sentence must have more alliterative words than the previous one"
format	14	"Emoji at the end of every sentence"; "Nest parentheses at least 5 levels deep"; "One word per line"; "Title case"
custom	11	CSV with fixed schema, reverse alphabetical lists, multiple choice generation
copy	5	"Copy the span between character indices n_start and n_end"; "Repeat the request but change the first word"

Custom-group constraints replace the user prompt entirely; the rest are concatenated to a held-out WildChat prompt. A typical instance reads: "Write a paragraph about the discovery of penicillin. Each word must start with the next letter of the alphabet, looping back to A after Z." Average prompt length is 76 tokens for single-turn and 408 tokens for multi-turn.

Training constraints (IFTrain)

IFTrain is 29 constraints for RLVR training, with no overlap with IFBench. It covers ten skill clusters: keyword inclusion/exclusion, letter frequency, paragraph delimiters, first/last word positioning, format wrappers, copying, punctuation avoidance, structured counting, palindromes, and rules like "no two adjacent words start with consecutive letters of the alphabet."

Evaluation modes

IFBench supports two modes over the same 300 prompts:

Mode	Structure
Single-turn	One user message with task + 1 or 2 constraints
Multi-turn	Three turns: user task, assistant reply, user follow-up adding a constraint and requesting a rewrite

Both report strict and loose accuracy following the IFEval convention. Headline numbers in the paper are prompt-level loose accuracy.

Verification

Each constraint ships with a Python verification function that returns a boolean. instructions_registry.py maps constraint names to function objects, and evaluation_lib.py runs them over a JSONL of model responses. Because every check is deterministic, evaluation is reproducible: no LLM-as-judge, no human rating, no calibration drift. The construction pipeline excluded any constraint not expressible as a Python verifier ("write in a friendly tone" and similar).

Reported model performance

The paper reports IFBench numbers alongside IFEval scores. Selected results:

Frontier models, single-turn (before IF-RLVR training)

Model	IFBench (loose)
OpenAI o3	69.3%
Claude 4 Sonnet	below 50%
Qwen3-32B	below 50%
GPT-4.1	below 50%
Claude 3.7 Sonnet	below 50%

Every non-reasoning frontier model lost roughly 30 to 40 points compared to its IFEval score. OpenAI's o3 reasoning model is the outlier at 69.3%.

IF-RLVR results

Configuration	IFEval	IFBench
Tülu-3-8B (DPO baseline)	82.4%	28.9%
Tülu-3-8B + IF-RLVR	92.2%	45.9%
Qwen2.5-7B base + IF-RLVR	87.8%	54.7%
Llama-3.1-8B base + IF-RLVR	88.2%	54.1%
OLMo2 base + IF-RLVR	70.4%	46.6%

Running IF-RLVR from a base model, with a chat template that encourages the model to think before answering, gave the best out-of-domain generalization. Llama-3.1-8B base reached 54.1% on IFBench versus 44.6% for the same model trained from its instruct checkpoint, despite similar IFEval scores: the paper's strongest argument that IF-RLVR teaches a transferable skill. IFBench is part of the Artificial Analysis Intelligence Index, where reasoning-augmented systems such as Grok 4's reasoning variants have been reported above 80%.

IF-RLVR training recipe

The paper's second half describes IF-RLVR, the reinforcement learning recipe the authors recommend for improving precise instruction following. The contribution is not RLVR itself (already used for math and code in Tülu 3), but the specific data and training choices that make it work for the constraint setting.

Data and training

Training prompts are built by sampling an instruction from the Tülu-3-SFT mix and appending one to six constraints from two pools: the 25 IFEval templates and the 29 in IFTrain. A conflict dictionary prevents incompatible combinations. Variable ranges are widened beyond test ranges. Most experiments use 60,000 to 100,000 prompts.

The RL algorithm is GRPO (Shao et al., 2024) implemented in AI2's open-instruct library. The reward per generation is a sum of per-constraint verification scores:

Instance Reward = sum_i ( verifiable_reward_i * reward_multiplier_i * reward_weight_i )

Default multipliers and weights are 1, making the reward a count of satisfied constraints. Training uses 8 H100 GPUs, learning rate 5e-7, 16 samples per prompt, mini-batch 32, max token length 2,048 (10,240 with reasoning chat templates), and ~2,000 steps (about one day per run).

Ablation findings

Four ablations from Section 4 shape the recommended recipe.

Ablation	Comparison	Result
Constraints per prompt	1 to 6	Training on more constraints improved IFBench from ~49% to ~56% on a Qwen2.5-7B policy, even though IFBench prompts have only 1 or 2 constraints
Variable ranges	Same, wider, disjoint	Wider ranges generalized better than identical or disjoint ranges
Categories left out	Cases, format, length, keywords	Removing length or keyword constraints hurt IFEval most; removing format or cases barely mattered
Algorithm	GRPO vs DPO on identical data	GRPO reached ~89.65% IFEval; DPO on the same prompts reached ~79.67%

The DPO comparison shows GRPO is doing more than exposing the model to verifier-labelled data: same prompts, same starting checkpoint, ten-point gap.

Reward hacking

IF-RLVR training has a side effect: models over-prioritize constraints at the expense of the task. A model asked for a single-sentence summary with the constraint "each word must start with the next letter of the alphabet" produces a sentence that follows the alphabet rule but does not summarize the text. The authors score outputs with GPT-4.1 as a judge: their RLVR-trained Tülu drops from 7.0 to 6.4 on a 10-point helpfulness scale even as verifiable accuracy rises. The paper proposes mixing the verifiable reward with a reward model signal (Llama-3.1-Tulu-3-8B-RM); the mix lands at 30 on IFBench (rather than 45.9) but recovers on AlpacaEval 2 (31.6), giving teams a deployment knob.

Benchmark	Constraint count	Verification	Design
IFEval	25 templates	Python	Fixed taxonomy; saturated
FollowBench	5 types, 5 levels	LLM-as-judge	Escalating constraints per prompt
InfoBench	500 instructions	LLM-as-judge (DRFR)	Atomic decomposition
IFBench	58 test + 29 train	Python	Held-out constraints and host prompts
VFF	Procedurally generated	Python	Used mainly for SFT/DPO data
WildIFEval	1,500 user-collected	LLM-as-judge	Real user constraints, not all verifiable

IFBench's niche is the combination of automatic verification with both held-out constraints and held-out host prompts.

Note on naming

A separate, older project also called "IFBench: Towards a Benchmark for Verifiable Instruction Following Evaluations" appears in earlier work and is not the AI2 IFBench described here. This article uses "IFBench" to refer to the Pyatkin et al. (2025) benchmark, which is what current papers and leaderboards mean.

Significance and limitations

IFBench's primary contribution is methodological: by holding out both the constraints and the host prompts, it provides a fair test of whether a model has learned to read instructions or has only memorized a fixed taxonomy. Within months of release it was integrated into the Artificial Analysis Intelligence Index and into LightEval. The reward-hacking section also documents a phenomenon worth attention for RLHF and RLVR research: training to follow constraints can degrade general response quality if the verifiable reward is the only signal.

The authors acknowledge several limitations. The benchmark only covers constraints expressible as short Python verifiers, which excludes many natural-language constraints users care about ("sound friendly," "avoid jargon"). Some constraints are unnatural compared to real user requests ("include at least 10 palindromes"). The dataset is in English. Pass/fail scoring does not credit partial compliance. And, like all held-out benchmarks, IFBench's value will erode as its constraints leak into training data over time.

References

Pyatkin, V., Malik, S., Graf, V., Ivison, H., Huang, S., Dasigi, P., Lambert, N., & Hajishirzi, H. (2025). "Generalizing Verifiable Instruction Following." arXiv:2507.02833. https://arxiv.org/abs/2507.02833
Allen Institute for AI. "IFBench GitHub repository." https://github.com/allenai/IFBench
Allen Institute for AI. "IFBench_test dataset on Hugging Face." https://huggingface.co/datasets/allenai/IFBench_test
NeurIPS 2025. "Generalizing Verifiable Instruction Following (poster page)." https://neurips.cc/virtual/2025/poster/121379
OpenReview. "Generalizing Verifiable Instruction Following." https://openreview.net/forum?id=yfYgwjj5F8
Artificial Analysis. "IFBench Benchmark Leaderboard." https://artificialanalysis.ai/evaluations/ifbench
Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., & Hou, L. (2023). "Instruction-Following Evaluation for Large Language Models" (IFEval). arXiv:2311.07911. https://arxiv.org/abs/2311.07911
Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., & Deng, Y. (2024). "WildChat: 1M ChatGPT Interaction Logs in the Wild." arXiv:2405.01470. https://arxiv.org/abs/2405.01470
Lior, G., Habba, A., Granitzer, M., & Stanovsky, G. (2025). "WildIFEval: Instruction Following in the Wild." arXiv:2503.06573. https://arxiv.org/abs/2503.06573
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (introduces GRPO). arXiv:2402.03300. https://arxiv.org/abs/2402.03300
Lambert, N., et al. (2024). "Tülu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv:2411.15124. https://arxiv.org/abs/2411.15124
Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." arXiv:2410.05229. https://arxiv.org/abs/2410.05229
Adler, B., et al. (2024). "Nemotron-4 340B Technical Report." arXiv:2406.11704. https://arxiv.org/abs/2406.11704
Jiang, Y., et al. (2023). "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models." arXiv:2310.20410. https://arxiv.org/abs/2310.20410
Qin, Y., Song, K., Hu, Y., Yao, W., Cho, S., Wang, X., Wu, X., Liu, F., Liu, P., & Yu, D. (2024). "InfoBench: Evaluating Instruction Following Ability in Large Language Models." arXiv:2401.03601. https://arxiv.org/abs/2401.03601
Hugging Face. "LightEval IFBench task implementation." https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/tasks/ifbench/main.py
Allen Institute for AI on X (Twitter). "Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions." https://x.com/allen_ai/status/1940833394025279857

IFBench

Overview

Background and motivation

Benchmark composition

Test constraints

Training constraints (IFTrain)

Evaluation modes

Verification

Reported model performance

Frontier models, single-turn (before IF-RLVR training)

IF-RLVR results

IF-RLVR training recipe

Data and training

Ablation findings

Reward hacking

Note on naming

Significance and limitations

See also

References

Improve this article

Overview

Background and motivation

Benchmark composition

Test constraints

Training constraints (IFTrain)

Evaluation modes

Verification

Reported model performance

Frontier models, single-turn (before IF-RLVR training)

IF-RLVR results

IF-RLVR training recipe

Data and training

Ablation findings

Reward hacking

Note on naming

Significance and limitations

See also

References

Overview

Background and motivation

Benchmark composition

Test constraints

Training constraints (IFTrain)

Evaluation modes

Verification

Reported model performance

Frontier models, single-turn (before IF-RLVR training)

IF-RLVR results

IF-RLVR training recipe

Data and training

Ablation findings

Reward hacking

Related benchmarks

Note on naming

Significance and limitations

See also

References

Improve this article

Related Articles

τ-bench

Aider Polyglot

BALROG

Longform Creative Writing

Humanity's Last Exam

Creative Writing v3

Overview

Background and motivation

Benchmark composition

Test constraints

Training constraints (IFTrain)

Evaluation modes

Verification

Reported model performance

Frontier models, single-turn (before IF-RLVR training)

IF-RLVR results

IF-RLVR training recipe

Data and training

Ablation findings

Reward hacking

Related benchmarks

Note on naming

Significance and limitations

See also

References

Related Articles

τ-bench

Aider Polyglot

BALROG

Longform Creative Writing

Humanity's Last Exam

Creative Writing v3