IFEval (Instruction-Following Evaluation) is a benchmark designed to measure how well large language models follow natural language instructions. Introduced in November 2023 by Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou at Google, IFEval focuses on a class of "verifiable instructions" that can be checked automatically through deterministic programs, eliminating the need for subjective human evaluation or potentially biased LLM-based auto-evaluation. The benchmark consists of 541 prompts, each containing one or more instructions drawn from 25 verifiable instruction types. IFEval has become one of the most widely adopted instruction-following benchmarks in the AI research community and is a core component of Hugging Face's Open LLM Leaderboard v2.
Evaluating how well language models follow user instructions has been a persistent challenge in natural language processing. As LLMs have grown more capable, the ability to follow complex, multi-faceted instructions has become a critical quality indicator. However, prior evaluation approaches suffered from significant drawbacks.
Before IFEval, instruction-following evaluation relied primarily on two approaches, each with notable shortcomings:
Human evaluation involves asking human annotators to judge whether a model's output correctly follows given instructions. While this method captures nuance, it is expensive, slow, difficult to scale, and inherently subjective. Different annotators may disagree on whether a response adequately follows an instruction, making results difficult to reproduce.
LLM-based auto-evaluation uses another language model (often GPT-4 or a similar frontier model) to judge whether a response follows instructions. Benchmarks such as MT-Bench and AlpacaEval employ this approach. While faster and cheaper than human evaluation, LLM judges introduce their own biases, may favor certain writing styles over others, and are limited by the capabilities of the evaluator model itself. The evaluator may also struggle with ambiguous or complex instructions.
IFEval's key insight was to focus on instructions whose compliance can be verified through simple, deterministic computer programs. Instead of asking subjective questions such as "Is this response helpful?" or "Does this response follow the user's intent?" IFEval tests instructions such as "write in more than 400 words" or "mention the keyword AI at least 3 times." These constraints are unambiguous and can be checked automatically with short Python scripts, ensuring that evaluation is objective, reproducible, and fully automated.
This design philosophy trades coverage for reliability. IFEval does not attempt to measure every dimension of instruction following. Instead, it provides a narrow but highly reliable signal about a model's ability to adhere to explicit, well-defined constraints.
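The deterministic checks that make this design possible are only a few lines each. The sketch below illustrates two of the constraints quoted above; the function names and signatures are illustrative, not the official API.

```python
import re

def check_min_words(response: str, num_words: int) -> bool:
    """Verify a 'write in more than N words'-style constraint."""
    return len(response.split()) >= num_words

def check_keyword_frequency(response: str, keyword: str, frequency: int) -> bool:
    """Verify 'mention the keyword X at least N times'.

    Whole-word, case-insensitive matching, so 'air' does not count as 'AI'.
    """
    matches = re.findall(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE)
    return len(matches) >= frequency
```

Because both functions are pure string operations, any two evaluators running them on the same response will always agree, which is exactly the reproducibility property IFEval is built around.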
The IFEval dataset comprises 541 prompts, each containing one or more verifiable instructions. The prompts were constructed through a combination of few-shot prompting (using an LLM to generate candidate prompts) and manual curation by the research team. Each prompt is a realistic text generation task (such as writing an essay, summarizing a topic, or composing a letter) augmented with one or more specific format, content, or structural constraints.
Each entry in the dataset contains the following fields:
| Field | Type | Description |
|---|---|---|
| key | Integer | Unique identifier for the prompt (e.g., 1000) |
| prompt | String | The full task description given to the model, including all instructions |
| instruction_id_list | List of strings | Array of verifiable instruction IDs (1 to 3 per prompt) |
| kwargs | List of objects | Arguments specifying parameters for each instruction (e.g., word count thresholds, required keywords) |
A representative example from the dataset looks like this:
> Write a 300+ word summary of the Wikipedia page on "quantum entanglement." Do not use any commas and highlight at least 3 sections that have titles in markdown format.
This single prompt contains three verifiable instructions:
| Instruction ID | Constraint | Parameter |
|---|---|---|
| punctuation:no_comma | Do not use any commas | None |
| detectable_format:number_highlighted_sections | Highlight at least N sections | num_highlights = 3 |
| length_constraints:number_words | Write at least N words | num_words = 300, relation = "at least" |
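In the dataset, this example would be stored roughly as follows. This is a hypothetical entry mirroring the schema above: the field names follow the google/IFEval dataset, but the key value is invented for illustration.

```python
# Hypothetical dataset entry following the IFEval schema (key is illustrative).
example = {
    "key": 1001,
    "prompt": (
        'Write a 300+ word summary of the Wikipedia page on "quantum '
        'entanglement." Do not use any commas and highlight at least 3 '
        "sections that have titles in markdown format."
    ),
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    # kwargs[i] holds the parameters for instruction_id_list[i].
    "kwargs": [
        {},
        {"num_highlights": 3},
        {"num_words": 300, "relation": "at least"},
    ],
}
```

The parallel structure of `instruction_id_list` and `kwargs` is what lets a harness pair each instruction with its parameters during verification.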
The dataset is publicly available on Hugging Face under the Apache 2.0 license (google/IFEval) and on Google Research's GitHub repository.
IFEval identifies 25 types of verifiable instructions, organized into nine broad categories. Each instruction type has a corresponding verification function implemented as a short Python program.
**Keywords**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Include Keywords | keywords:existence | The response must include specific words | "Include the words 'neural' and 'network' in your response" |
| Keyword Frequency | keywords:frequency | A specific word must appear at least N times | "Mention the keyword 'AI' at least 3 times" |
| Forbidden Words | keywords:forbidden_words | The response must not contain certain words | "Do not use the words 'however' or 'therefore'" |
| Letter Frequency | keywords:letter_frequency | A specific letter must appear at least N times | "The letter 'q' should appear at least 5 times" |
**Language**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Response Language | language:response_language | The entire response must be written in a specified language | "Your response should be in French" |
**Length Constraints**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Number of Paragraphs | length_constraints:number_paragraphs | The response must contain exactly N paragraphs, or at least/at most N | "Write exactly 4 paragraphs" |
| Number of Words | length_constraints:number_words | The response must contain at least/at most N words | "Write in more than 400 words" |
| Number of Sentences | length_constraints:number_sentences | The response must contain at least/at most N sentences | "Your entire response should contain less than 6 sentences" |
| Nth Paragraph First Word | length_constraints:nth_paragraph_first_word | The Nth paragraph must start with a specific word | "The second paragraph should start with the word 'Furthermore'" |
**Detectable Content**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Postscript | detectable_content:postscript | The response must include a postscript (P.S. or P.P.S.) at the end | "At the end of your response, add a P.S. section" |
| Number of Placeholders | detectable_content:number_placeholders | The response must include a specified number of bracketed placeholders | "Include at least 2 placeholders in the form [placeholder]" |
**Detectable Format**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Number of Bullet Lists | detectable_format:number_bullet_lists | The response must contain a specific number of bullet point lists | "Include at least 2 bullet point lists" |
| Title | detectable_format:title | The response must include a title wrapped in double angular brackets | "Your answer must contain a title, wrapped in double angular brackets, such as <<poem of joy>>" |
| Multiple Sections | detectable_format:multiple_sections | The response must be organized into at least N sections with markdown headings | "Your response must have 5 sections marked with ## heading" |
| Choose From | detectable_format:constrained_response | The response must be one of a predefined set of choices | "Answer with one of: Yes, No, Maybe" |
| JSON Format | detectable_format:json_format | The entire response must be valid JSON | "Your entire output should be in JSON format" |
| Number of Highlighted Sections | detectable_format:number_highlighted_sections | The response must contain N sections with markdown-highlighted titles | "Highlight at least 3 sections" |
**Combination**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Repeat Prompt | combination:repeat_prompt | The model must repeat the original prompt before answering | "First, repeat the request above word for word, then answer it" |
| Two Responses | combination:two_responses | The model must provide two distinct responses separated by asterisks | "Give two different responses separated by ******" |
**Change Case**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| All Uppercase | change_case:english_capital | The entire response must be in uppercase | "Your entire response should be in English, and in all capital letters" |
| All Lowercase | change_case:english_lowercase | The entire response must be in lowercase | "Your entire response should be in English, and in all lowercase letters" |
| Capital Word Frequency | change_case:capital_word_frequency | At least N words must be fully capitalized | "In your response, words with all capital letters should appear at least 5 times" |
**Start/End**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| End Checker | startend:end_checker | The response must end with a specific phrase | "Finish your response with 'Is there anything else I can help with?'" |
| Quotation | startend:quotation | The entire response must be wrapped in double quotation marks | "Wrap your entire response with double quotation marks" |
**Punctuation**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| No Commas | punctuation:no_comma | The response must not contain any commas | "Do not use any commas in your response" |
IFEval produces four distinct accuracy metrics by combining two dimensions: the level of granularity (prompt-level vs. instruction-level) and the strictness of verification (strict vs. loose).
Prompt-level accuracy treats each prompt as a single unit. A prompt is scored as correct only if all verifiable instructions within that prompt are satisfied. If a prompt contains three instructions and the model follows two out of three, the prompt receives a score of zero. This metric reflects a user's real-world experience, where partial compliance with a multi-part request may not be acceptable.
Instruction-level accuracy evaluates each verifiable instruction independently. If a prompt contains three instructions and the model follows two, the instruction-level score credits those two successes. This provides a more granular view of where models succeed and fail.
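Both granularities can be computed from the same per-instruction pass/fail results. The sketch below assumes the results are stored as one list of booleans per prompt.

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """A prompt counts as correct only if every instruction in it passed."""
    return sum(all(prompt) for prompt in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    """Each instruction is credited independently."""
    flat = [passed for prompt in results for passed in prompt]
    return sum(flat) / len(flat)

# Two prompts: one fully followed, one with 2 of its 3 instructions followed.
results = [[True, True], [True, True, False]]
# prompt-level: 1/2 = 0.5; instruction-level: 4/5 = 0.8
```

The example shows why prompt-level scores are always the harsher of the two: a single failed instruction zeroes out its entire prompt.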
Strict accuracy requires exact compliance with each instruction. The verification function checks the model's raw output directly against the instruction's requirements. For instance, if the instruction requires "at least 400 words," the strict check counts the words in the response and requires the count to be 400 or greater.
Loose accuracy accounts for the fact that minor formatting differences can cause false negatives. Before checking compliance, the loose evaluation applies several transformations to the response, including removing commonly used markdown font modifiers (such as asterisks), removing the first line of the response (e.g., an introductory "Sure, here it is:"), removing the last line (e.g., a closing "Hope that helps!"), and combinations of these.
If any transformed version of the response passes the verification check, the instruction is considered followed under the loose criterion. This approach reduces false negatives caused by models adding extra formatting or conversational text around their main response.
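A minimal sketch of the loose procedure, assuming a transformation set of the kind described in the paper (stripping markdown asterisks and dropping a leading or trailing line); the exact official transformations may differ in detail.

```python
def loose_variants(response: str) -> list[str]:
    """Relaxed variants of a response, in the spirit of IFEval's loose mode."""
    lines = response.strip().split("\n")
    candidates = [
        response,
        "\n".join(lines[1:]),    # drop a leading intro line
        "\n".join(lines[:-1]),   # drop a trailing outro line
        "\n".join(lines[1:-1]),  # drop both
    ]
    # Also try each candidate with markdown emphasis markers removed.
    candidates += [c.replace("*", "") for c in candidates]
    return [c for c in candidates if c.strip()]  # ignore empty variants

def loose_check(response: str, checker) -> bool:
    """Pass if any relaxed variant satisfies the strict checker."""
    return any(checker(v) for v in loose_variants(response))
```

For example, a response that opens with "Sure, here you go:" fails a strict no-comma check, but passes the loose check once the introductory line is dropped.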
| Metric | Granularity | Strictness | What It Measures |
|---|---|---|---|
| Prompt-level Strict Accuracy | Prompt | Strict | Percentage of prompts where all instructions are followed exactly |
| Prompt-level Loose Accuracy | Prompt | Loose | Percentage of prompts where all instructions are followed (with tolerance for formatting) |
| Instruction-level Strict Accuracy | Instruction | Strict | Percentage of individual instructions followed exactly |
| Instruction-level Loose Accuracy | Instruction | Loose | Percentage of individual instructions followed (with tolerance for formatting) |
In many evaluation contexts (including the Open LLM Leaderboard), the reported IFEval score is an average of all four metrics, or sometimes just the prompt-level strict accuracy. The specific metric used varies by platform.
The original IFEval paper (Zhou et al., 2023) evaluated two models: GPT-4 (responses collected in November 2023) and PaLM 2 S (responses collected in August 2023). The results demonstrated a substantial performance gap between the two models.
| Model | Prompt-Level Strict | Prompt-Level Loose | Instruction-Level Strict | Instruction-Level Loose |
|---|---|---|---|---|
| GPT-4 (Nov 2023) | 76.89% | 79.30% | 83.57% | 85.37% |
| PaLM 2 S (Aug 2023) | 43.07% | 46.95% | 55.76% | 59.11% |
GPT-4 outperformed PaLM 2 S across all four metrics by a wide margin. At the prompt level, GPT-4 followed all instructions correctly in roughly 77% of prompts under strict evaluation, while PaLM 2 S managed only about 43%. The gap narrowed somewhat at the instruction level (where partial credit is given), but GPT-4 still maintained a lead of approximately 26 to 28 percentage points.
These results highlighted that instruction following was a significant differentiator between models at the time of the paper's publication, and that even frontier models had substantial room for improvement.
IFEval has been adopted as a standard benchmark across several major evaluation frameworks, making it one of the most widely used instruction-following assessments in the AI community.
In June 2024, Hugging Face launched the Open LLM Leaderboard v2, replacing the original leaderboard with a new suite of six more challenging benchmarks. IFEval was selected as one of these six core benchmarks alongside BBH (BIG-Bench Hard), MATH, GPQA, MuSR, and MMLU-Pro.
IFEval was chosen specifically because it evaluates instruction-following capabilities rather than content generation quality, providing a dimension of evaluation that the other five benchmarks do not cover. On the Open LLM Leaderboard, scores from all six benchmarks are normalized so that random performance maps to 0 and perfect performance maps to 100, then averaged to produce a final composite score.
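The normalization step can be sketched as a simple linear rescaling. This is a simplified version of the leaderboard's scheme: multiple-choice tasks use their random-chance level as the lower bound, while IFEval, which has no meaningful chance baseline, effectively uses zero.

```python
def normalize(raw: float, lower_bound: float) -> float:
    """Map a raw accuracy in [0, 1] so lower_bound -> 0 and 1.0 -> 100.

    Scores at or below the random baseline are clipped to 0.
    """
    return max(0.0, (raw - lower_bound) / (1.0 - lower_bound)) * 100.0

# IFEval: a raw 0.77 maps to 77.0 (lower bound 0.0).
# A 4-way multiple-choice task: raw 0.25 (pure chance) maps to 0.0.
```

The composite leaderboard score is then the plain average of the six normalized benchmark scores.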
EleutherAI's lm-evaluation-harness, the backend framework powering the Open LLM Leaderboard, includes IFEval as a built-in task. This allows any causal language model to be evaluated on IFEval with standardized inputs and consistent scoring, ensuring reproducibility across different evaluation runs.
IFEval has also been integrated into numerous other evaluation tools and platforms beyond the lm-evaluation-harness.
Since IFEval's release in late 2023, model performance on the benchmark has improved dramatically. While GPT-4 scored roughly 77% prompt-level strict accuracy in the original paper, newer models have pushed well past 90%.
The following table shows IFEval scores for a selection of notable models, as reported on benchmark leaderboards. Scores represent an aggregate metric (typically an average of the four IFEval accuracy dimensions or a normalized score).
| Model | IFEval Score | Developer |
|---|---|---|
| Qwen 3.5-27B | 0.950 | Alibaba |
| o3-mini | 0.939 | OpenAI |
| Claude 3.7 Sonnet | 0.932 | Anthropic |
| LLaMA 3.3 70B Instruct | 0.921 | Meta AI |
| Gemma 3 27B | 0.904 | |
| LLaMA 3.1 405B Instruct | 0.886 | Meta AI |
| GPT-4.5 | 0.882 | OpenAI |
| LLaMA 3.1 70B Instruct | 0.875 | Meta AI |
| DeepSeek-V3 | 0.861 | DeepSeek |
| Qwen 2.5 72B Instruct | 0.841 | Alibaba |
| GPT-4.1 mini | 0.841 | OpenAI |
| QwQ-32B | 0.839 | Alibaba |
| Mistral Small 3 24B Instruct | 0.829 | Mistral AI |
| GPT-4o | 0.810 | OpenAI |
| LLaMA 3.1 8B Instruct | 0.804 | Meta AI |
| Gemma 3 1B | 0.802 | |
| LLaMA 3.2 3B Instruct | 0.774 | Meta AI |
| GPT-4.1 nano | 0.745 | OpenAI |
| Phi 4 | 0.630 | Microsoft |
These scores show that frontier models now routinely exceed 90% on IFEval. Even smaller models with just a few billion parameters (such as Gemma 3 1B at 0.802 and LLaMA 3.2 3B at 0.774) achieve respectable scores, indicating that instruction-following capability has become a well-optimized dimension in modern LLM training.
Among proprietary, closed-source models evaluated on IFEval, the top performers include:
| Model | IFEval Score |
|---|---|
| GPT-5 | ~95.9% |
| o4-mini | ~95.6% |
| o3 | ~94.3% |
| GPT-5-mini | ~94.1% |
These scores indicate that the most advanced proprietary models are approaching the practical ceiling of the benchmark.
Each of the 25 instruction types has a corresponding verification function implemented in Python. These functions are deterministic: given a response and the instruction parameters, they return a binary pass/fail result, for example by counting words or paragraphs, searching for required or forbidden keywords, or attempting to parse the response as JSON.
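Two further verifiers in this style are sketched below. They are simplified stand-ins for the official checker code, not copies of it; in particular, the highlight-counting regex is an approximation of how markdown-emphasized sections can be detected.

```python
import re

def check_highlighted_sections(response: str, num_highlights: int) -> bool:
    """detectable_format:number_highlighted_sections (simplified).

    Counts markdown-emphasized spans such as *section title* or **section title**.
    """
    spans = re.findall(r"\*\*[^*\n]+\*\*|\*[^*\n]+\*", response)
    return len(spans) >= num_highlights

def check_end_phrase(response: str, end_phrase: str) -> bool:
    """startend:end_checker (simplified).

    The response must end with the given phrase, ignoring trailing whitespace.
    """
    return response.strip().endswith(end_phrase)
```

Both functions follow the same contract as every other IFEval verifier: plain string in, boolean out, with no model or human in the loop.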
The official implementation is available in the Google Research GitHub repository (google-research/instruction_following_eval). To run an evaluation, clone the repository, install its dependencies, and invoke the evaluation script with the IFEval prompts and a file containing the model's responses.
The script outputs the four accuracy metrics (prompt-level strict, prompt-level loose, instruction-level strict, instruction-level loose) and detailed per-prompt results.
Several alternative implementations exist, including the IFEval task bundled with EleutherAI's lm-evaluation-harness and ports in other open-source evaluation frameworks.
Despite its wide adoption, IFEval has several recognized limitations that researchers and practitioners should consider.
IFEval's 25 instruction types cover a relatively narrow range of verifiable constraints. Many important instruction-following capabilities, such as maintaining a consistent tone, following complex multi-turn conversational directives, adhering to system prompts, or handling ambiguous requests gracefully, fall outside the scope of what IFEval can measure. The benchmark specifically tests format and structural constraints, not semantic or pragmatic instruction following.
Several researchers have pointed out that IFEval's instructions are somewhat artificial. Real users rarely ask models to "avoid the letter C" or "use exactly 4 paragraphs." While these constraints serve as useful proxies for measuring instruction compliance, they do not reflect the kinds of instructions that matter most in practical applications. A model could score perfectly on IFEval while still failing to follow more nuanced, real-world instructions.
IFEval evaluates only whether format and structural constraints are met. It does not assess the quality, coherence, accuracy, or helpfulness of the generated content. A response that meets all format requirements while containing nonsensical or incorrect information would receive a perfect IFEval score. This means IFEval should always be used alongside other benchmarks that evaluate content quality.
As of 2025–2026, top models regularly score above 90% on IFEval, with several exceeding 95%. This saturation reduces the benchmark's ability to discriminate between frontier models. When most advanced models achieve near-perfect scores, the remaining errors often involve edge cases, annotation ambiguities, or unusual instruction combinations rather than systematic failures. Some instruction categories (such as keyword existence) have reached near-perfect accuracy across all models, providing little additional signal.
Because the IFEval dataset is publicly available and consists of only 541 fixed prompts, there is a risk that model developers may explicitly or implicitly train on the evaluation data. This overfitting concern applies to many public benchmarks, but it is particularly relevant for IFEval given the dataset's small size. Models can be explicitly tuned to handle IFEval's specific instruction patterns without genuinely improving their general instruction-following abilities.
Research published in late 2025 ("Revisiting the Reliability of Language Models in Instruction-Following") examined the consistency of model performance on IFEval-style tasks. The study found that even models with high average scores can show significant variability when prompts are rephrased or slightly modified. The relative drop from standard IFEval accuracy to reliability-adjusted metrics can be substantial: as large as 61.8% for Qwen 3-0.6B and 54.7% for GPT-3.5-turbo-1106, and even the most reliable model (GPT-5) experienced a decrease of 18.3%.
The success and limitations of IFEval have inspired several derivative benchmarks that extend or modify its approach.
IFEval-Extended addresses the overfitting problem by using dynamic prompt generation rather than a fixed set of prompts. It extends the original instruction categories and generates thousands of unique instructions from each base template. This approach produces a more robust evaluation that is harder to overfit to, while maintaining the verifiable instruction paradigm.
M-IFEval, published in 2025, expands instruction-following evaluation to French, Japanese, and Spanish. Rather than simply translating the original English prompts, M-IFEval includes language-specific instructions that test capabilities unique to each language. Evaluation of eight state-of-the-art models showed that performance varies widely across languages and instruction types, highlighting the importance of multilingual evaluation.
IndicIFEval extends the verifiable instruction paradigm to 14 Indic languages, addressing the need for instruction-following evaluation in languages that are underrepresented in most benchmarks.
CL-IFEval (Cross-Lingual IFEval) expands coverage to French, Spanish, Hindi, Arabic, and Yoruba, further broadening the linguistic scope of instruction-following evaluation.
MaXIFE introduces a dataset covering 23 languages with 795 prompts and approximately 1,700 constraint templates, resulting in over 18,000 possible instruction combinations. This represents the most linguistically diverse extension of the IFEval framework to date.
Developed by Meta's Superintelligence Labs in partnership with Surge AI, AdvancedIF moves beyond IFEval's synthetic constraints to evaluate real-world instruction-following capabilities. Instead of format-based constraints checked by Python scripts, AdvancedIF uses human-written rubrics for both prompts and evaluation criteria. The benchmark tests multi-turn context management, system prompt adherence, and other practical instruction-following scenarios. A verifier trained on human-annotated data achieved 0.728 F1 agreement with human judgments, representing a 41% improvement over vanilla LLM prompting as a judge.
IFEval occupies a specific niche in the broader landscape of LLM evaluation. The following table compares it with other notable benchmarks.
| Benchmark | What It Measures | Evaluation Method | Created By | Year |
|---|---|---|---|---|
| IFEval | Instruction following (verifiable constraints) | Deterministic programs | Google | 2023 |
| MT-Bench | Multi-turn conversation quality | LLM judge (GPT-4) | LMSYS | 2023 |
| AlpacaEval | Instruction following (general) | LLM judge (GPT-4) | Stanford | 2023 |
| MMLU | Knowledge and reasoning (multiple choice) | Exact match | UC Berkeley et al. | 2020 |
| MMLU-Pro | Knowledge and reasoning (harder) | Exact match | TIGER-Lab | 2024 |
| GSM8K | Grade-school math reasoning | Exact match | OpenAI | 2021 |
| BIG-Bench Hard (BBH) | Challenging diverse tasks | Various | Google et al. | 2022 |
| AdvancedIF | Real-world instruction following | Human rubrics + trained verifier | Meta / Surge AI | 2025 |
IFEval's primary advantage is its objectivity and reproducibility. Unlike benchmarks that rely on LLM judges, IFEval's deterministic verification produces identical results every time, regardless of who runs the evaluation. Its primary disadvantage is its narrow scope, covering only format-level constraints rather than the full spectrum of instruction-following behavior.
IFEval has had a notable impact on the LLM evaluation ecosystem since its release in 2023.
Before IFEval, there was no widely accepted, objective benchmark specifically for instruction following. By providing a simple, reproducible, and deterministic evaluation, IFEval filled an important gap and gave the community a common reference point for comparing models on this dimension.
The inclusion of IFEval in the Open LLM Leaderboard v2 has incentivized model developers to optimize their models for instruction following. The dramatic improvement in scores between 2023 (GPT-4 at ~77%) and 2026 (frontier models at ~95%) reflects genuine progress in training models to comply with explicit user instructions.
IFEval's verifiable instruction paradigm has spawned a family of derivative benchmarks targeting different languages, domains, and levels of difficulty. The core idea of testing instructions that can be checked automatically has proven highly influential, even as researchers have identified the need to go beyond IFEval's specific set of constraints.
The benchmark's saturation among frontier models has also highlighted an important lesson: a benchmark is most useful when it can discriminate between models at the current frontier. As models have caught up to and exceeded IFEval's difficulty level, the community has begun developing more challenging successors like AdvancedIF and IFEval-Extended.