# IFEval

> Source: https://aiwiki.ai/wiki/ifeval
> Updated: 2026-06-24
> Categories: AI Benchmarks, Large Language Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**IFEval** (Instruction-Following Evaluation) is a [benchmark](/wiki/benchmark) of 541 prompts that measures how reliably [large language models](/wiki/large_language_model) obey explicit, machine-checkable instructions such as "write in more than 400 words," "mention the keyword AI at least 3 times," or "respond in all lowercase." Introduced in November 2023 by Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou at Google (lead author Jeffrey Zhou is listed with a Yale University affiliation), IFEval attaches to each prompt one or more instructions drawn from 25 types of "verifiable instructions" that short deterministic programs can check automatically, removing the need for subjective human evaluation or potentially biased LLM-based auto-evaluation.[1] It reports four accuracy metrics (prompt-level and instruction-level, each in a strict and a loose variant), is one of the six core benchmarks in [Hugging Face](/wiki/hugging_face)'s Open LLM Leaderboard v2, and has become the de facto standard test of [instruction following](/wiki/instruction_following) for LLMs.[4] Top frontier models now exceed 95% on the benchmark; OpenAI reports GPT-5 at about 95.9%.[11]

The paper frames the core idea plainly: "It focuses on a set of 'verifiable instructions' such as 'write in more than 400 words' and 'mention the keyword of AI at least 3 times'."[1] By restricting attention to constraints a computer can confirm, the authors state that they "aim to enhance the clarity and objectivity of the evaluation process, enabling a fully automatic and accurate assessment."[1]

## Background and Motivation

Evaluating how well language models follow user instructions has been a persistent challenge in [natural language processing](/wiki/natural_language_processing). As [LLMs](/wiki/large_language_model) have grown more capable, the ability to follow complex, multi-faceted instructions has become a critical quality indicator. However, prior evaluation approaches suffered from significant drawbacks.

### Why was a verifiable benchmark needed?

Before IFEval, instruction-following evaluation relied primarily on two approaches, each with notable shortcomings:

**Human evaluation** involves asking human annotators to judge whether a model's output correctly follows given instructions. While this method captures nuance, it is expensive, slow, difficult to scale, and inherently subjective. Different annotators may disagree on whether a response adequately follows an instruction, making results difficult to reproduce.

**LLM-based auto-evaluation** uses another language model (often [GPT-4](/wiki/gpt-4) or a similar frontier model) to judge whether a response follows instructions. Benchmarks such as [MT-Bench](/wiki/mt_bench) and AlpacaEval employ this approach. While faster and cheaper than human evaluation, LLM judges introduce their own biases, may favor certain writing styles over others, and are limited by the capabilities of the evaluator model itself. The evaluator may also struggle with ambiguous or complex instructions.

### The Verifiable Instructions Approach

IFEval's key insight was to focus on instructions whose compliance can be verified through simple, deterministic computer programs.[1] Instead of asking subjective questions such as "Is this response helpful?" or "Does this response follow the user's intent?" IFEval tests instructions such as "write in more than 400 words" or "mention the keyword AI at least 3 times." These constraints are unambiguous and can be checked automatically with short Python scripts, ensuring that evaluation is objective, reproducible, and fully automated.[1]

This design philosophy trades coverage for reliability. IFEval does not attempt to measure every dimension of instruction following. Instead, it provides a narrow but highly reliable signal about a model's ability to adhere to explicit, well-defined constraints.

## Dataset Composition

The IFEval dataset comprises **541 prompts**, each containing one or more verifiable instructions.[1][3] The prompts were constructed through a combination of few-shot prompting (using an LLM to generate candidate prompts) and manual curation by the research team.[1] Each prompt is a realistic text generation task (such as writing an essay, summarizing a topic, or composing a letter) augmented with one or more specific format, content, or structural constraints.

### Dataset Structure

Each entry in the dataset contains the following fields:

| Field | Type | Description |
|---|---|---|
| key | Integer | Unique identifier for the prompt (e.g., 1000) |
| prompt | String | The full task description given to the model, including all instructions |
| instruction_id_list | List of strings | Array of verifiable instruction IDs (1 to 3 per prompt) |
| kwargs | List of objects | Arguments specifying parameters for each instruction (e.g., word count thresholds, required keywords) |

### Example Prompt

A representative example from the dataset looks like this:

> Write a 300+ word summary of the Wikipedia page on "quantum entanglement." Do not use any commas and highlight at least 3 sections that have titles in markdown format.

This single prompt contains three verifiable instructions:

| Instruction ID | Constraint | Parameter |
|---|---|---|
| punctuation:no_comma | Do not use any commas | None |
| detectable_format:number_highlighted_sections | Highlight at least N sections | num_highlights = 3 |
| length_constraints:number_words | Write at least N words | num_words = 300, relation = "at least" |

The dataset is publicly available on [Hugging Face](/wiki/hugging_face) under the Apache 2.0 license (google/IFEval) and on Google Research's GitHub repository.[3][2]
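
As a rough illustration of how the four fields fit together, the example prompt above could be represented as the Python record below. This is a sketch, not a copy of an actual dataset row: the `key` value, the layout of `kwargs` (one dictionary per instruction, in the same order as `instruction_id_list`, empty when an instruction takes no parameters), and the helper `paired_instructions` are assumptions made for illustration.

```python
# Hypothetical record mirroring the example prompt above (not an actual dataset row).
record = {
    "key": 1000,
    "prompt": (
        'Write a 300+ word summary of the Wikipedia page on "quantum entanglement." '
        "Do not use any commas and highlight at least 3 sections that have titles "
        "in markdown format."
    ),
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    # Assumed layout: one parameter dictionary per instruction, aligned by position.
    "kwargs": [
        {},
        {"num_highlights": 3},
        {"num_words": 300, "relation": "at least"},
    ],
}


def paired_instructions(rec):
    """Pair each instruction ID with its parameter dictionary, by position."""
    return list(zip(rec["instruction_id_list"], rec["kwargs"]))


for instruction_id, params in paired_instructions(record):
    print(instruction_id, params)
```

Keeping the IDs and parameters in parallel lists means a verifier can walk both at once and pass each parameter dictionary to the matching check.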

## The 25 Verifiable Instruction Types

IFEval identifies 25 types of verifiable instructions, organized into nine broad categories.[1] Each instruction type has a corresponding verification function implemented as a short Python program.[1]

### Keywords

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Include Keywords | keywords:existence | The response must include specific words | "Include the words 'neural' and 'network' in your response" |
| Keyword Frequency | keywords:frequency | A specific word must appear at least N times | "Mention the keyword 'AI' at least 3 times" |
| Forbidden Words | keywords:forbidden_words | The response must not contain certain words | "Do not use the words 'however' or 'therefore'" |
| Letter Frequency | keywords:letter_frequency | A specific letter must appear at least N times | "The letter 'q' should appear at least 5 times" |

### Language

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Response Language | language:response_language | The entire response must be written in a specified language | "Your response should be in French" |

### Length Constraints

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Number of Paragraphs | length_constraints:number_paragraphs | The response must contain exactly N paragraphs, or at least/at most N | "Write exactly 4 paragraphs" |
| Number of Words | length_constraints:number_words | The response must contain at least/at most N words | "Write in more than 400 words" |
| Number of Sentences | length_constraints:number_sentences | The response must contain at least/at most N sentences | "Your entire response should contain less than 6 sentences" |
| Nth Paragraph First Word | length_constraints:nth_paragraph_first_word | The Nth paragraph must start with a specific word | "The second paragraph should start with the word 'Furthermore'" |

### Detectable Content

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Postscript | detectable_content:postscript | The response must include a postscript (P.S. or P.P.S.) at the end | "At the end of your response, add a P.S. section" |
| Number of Placeholders | detectable_content:number_placeholders | The response must include a specified number of bracketed placeholders | "Include at least 2 placeholders in the form [placeholder]" |

### Detectable Format

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Number of Bullets | detectable_format:number_bullet_lists | The response must contain exactly N markdown bullet points | "Your answer must contain exactly 3 bullet points" |
| Title | detectable_format:title | The response must include a title wrapped in double angular brackets | "Your response should contain a title in double angular brackets, i.e. <<title>>" |
| Multiple Sections | detectable_format:multiple_sections | The response must be organized into at least N sections with markdown headings | "Your response must have 5 sections marked with ## heading" |
| Choose From | detectable_format:constrained_response | The response must be one of a predefined set of choices | "Answer with one of: Yes, No, Maybe" |
| JSON Format | detectable_format:json_format | The entire response must be valid JSON | "Your entire output should be in JSON format" |
| Number of Highlighted Sections | detectable_format:number_highlighted_sections | The response must contain N sections with markdown-highlighted titles | "Highlight at least 3 sections" |

### Combination

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Repeat Prompt | combination:repeat_prompt | The model must repeat the original prompt before answering | "First, repeat the request above word for word, then answer it" |
| Two Responses | combination:two_responses | The model must provide two distinct responses separated by asterisks | "Give two different responses separated by ******" |

### Change Case

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| All Uppercase | change_case:english_capital | The entire response must be in uppercase | "Your entire response should be in English, and in all capital letters" |
| All Lowercase | change_case:english_lowercase | The entire response must be in lowercase | "Your entire response should be in English, and in all lowercase letters" |
| Capital Word Frequency | change_case:capital_word_frequency | At least N words must be fully capitalized | "In your response, words with all capital letters should appear at least 5 times" |

### Start/End Constraints

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| End Checker | startend:end_checker | The response must end with a specific phrase | "Finish your response with 'Is there anything else I can help with?'" |
| Quotation | startend:quotation | The entire response must be wrapped in double quotation marks | "Wrap your entire response with double quotation marks" |

### Punctuation

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| No Commas | punctuation:no_comma | The response must not contain any commas | "Do not use any commas in your response" |

## How does IFEval score models?

IFEval produces four distinct accuracy metrics by combining two dimensions: the level of granularity (prompt-level vs. instruction-level) and the strictness of verification (strict vs. loose).[1]

### Prompt-Level vs. Instruction-Level

**Prompt-level accuracy** treats each prompt as a single unit. A prompt is scored as correct only if **all** verifiable instructions within that prompt are satisfied.[1] If a prompt contains three instructions and the model follows two out of three, the prompt receives a score of zero. This metric reflects a user's real-world experience, where partial compliance with a multi-part request may not be acceptable.

**Instruction-level accuracy** evaluates each verifiable instruction independently.[1] If a prompt contains three instructions and the model follows two, the instruction-level score credits those two successes. This provides a more granular view of where models succeed and fail.
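
The difference between the two granularities can be made concrete with a minimal sketch. It assumes each prompt's outcome is stored as a list of booleans, one per instruction; the function names and sample data are illustrative and not taken from the official code.

```python
from typing import List


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts whose instructions were all followed."""
    return sum(all(flags) for flags in results) / len(results)


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instructions followed, pooled over all prompts."""
    total = sum(len(flags) for flags in results)
    followed = sum(sum(flags) for flags in results)
    return followed / total


# Three hypothetical prompts; the first follows two of its three instructions.
results = [[True, True, False], [True], [True, True]]
print(prompt_level_accuracy(results))       # 2 of 3 prompts fully satisfied
print(instruction_level_accuracy(results))  # 5 of 6 instructions satisfied
```

The first prompt counts as a complete failure at the prompt level but still contributes two successes at the instruction level, which is why instruction-level scores are never lower than prompt-level scores on the same responses.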

### Strict vs. Loose Verification

**Strict accuracy** requires exact compliance with each instruction.[1] The verification function checks the model's raw output directly against the instruction's requirements. For instance, if the instruction requires "at least 400 words," the strict check counts the words in the response and requires the count to be 400 or greater.

**Loose accuracy** accounts for the fact that minor formatting differences can cause false negatives.[1] Before checking compliance, the loose evaluation applies several transformations to the response, including:

- Removing the first line of the response (which may contain preamble text like "Sure, here is...")
- Removing the last line of the response
- Removing markdown formatting modifiers
- Combinations of the transformations above

If any transformed version of the response passes the verification check, the instruction is considered followed under the loose criterion.[1] This approach reduces false negatives caused by models adding extra formatting or conversational text around their main response.
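
The loose criterion can be sketched as follows. The sketch assumes "markdown formatting modifiers" means asterisk characters and that a checker is any function from a response string to a boolean; `loose_variants` and `follows_loosely` are illustrative names, not the official API.

```python
from typing import Callable, List


def loose_variants(response: str) -> List[str]:
    """Build the relaxed versions of a response described above."""
    lines = response.split("\n")
    without_first = "\n".join(lines[1:]).strip()
    without_last = "\n".join(lines[:-1]).strip()
    without_both = "\n".join(lines[1:-1]).strip()
    base = [response, without_first, without_last, without_both]
    # Assumption: stripping asterisks approximates removing markdown emphasis.
    no_markdown = [variant.replace("*", "") for variant in base]
    return base + no_markdown


def follows_loosely(response: str, check: Callable[[str], bool]) -> bool:
    """Pass if any non-empty variant satisfies the checker."""
    return any(v.strip() and check(v) for v in loose_variants(response))


# Example: an end-phrase check that fails strictly because of bold markers.
ends_done = lambda text: text.endswith("Done.")
reply = "Answer.\n**Done.**"
print(ends_done(reply), follows_loosely(reply, ends_done))
```

Because the original response is itself one of the variants, any response that passes the strict check also passes the loose one.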

### The Four Metrics

| Metric | Granularity | Strictness | What It Measures |
|---|---|---|---|
| Prompt-level Strict Accuracy | Prompt | Strict | Percentage of prompts where all instructions are followed exactly |
| Prompt-level Loose Accuracy | Prompt | Loose | Percentage of prompts where all instructions are followed (with tolerance for formatting) |
| Instruction-level Strict Accuracy | Instruction | Strict | Percentage of individual instructions followed exactly |
| Instruction-level Loose Accuracy | Instruction | Loose | Percentage of individual instructions followed (with tolerance for formatting) |

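In many evaluation contexts, the reported IFEval score is a single aggregate of these metrics rather than all four numbers. The Open LLM Leaderboard, for example, averages the two strict metrics (prompt-level and instruction-level), while other sources report an average of all four or only the prompt-level strict accuracy. Because the metric varies by platform, scores from different sources are comparable only when the aggregation is stated.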

## What did the original IFEval paper find?

The original IFEval paper (Zhou et al., 2023) evaluated two models: [GPT-4](/wiki/gpt-4) (responses collected in November 2023) and PaLM 2 S (responses collected in August 2023).[1] The results demonstrated a substantial performance gap between the two models.[1]

### Results from the Original Paper

| Model | Prompt-Level Strict | Prompt-Level Loose | Instruction-Level Strict | Instruction-Level Loose |
|---|---|---|---|---|
| [GPT-4](/wiki/gpt-4) (Nov 2023) | 76.89% | 79.30% | 83.57% | 85.37% |
| [PaLM](/wiki/palm) 2 S (Aug 2023) | 43.07% | 46.95% | 55.76% | 59.11% |

GPT-4 outperformed PaLM 2 S across all four metrics by a wide margin.[1] At the prompt level, GPT-4 followed all instructions correctly in roughly 77% of prompts under strict evaluation, while PaLM 2 S managed only about 43%, a gap of about 34 percentage points (about 32 under loose evaluation).[1] The gap narrowed somewhat at the instruction level (where partial credit is given), but GPT-4 still maintained a lead of approximately 26 to 28 percentage points.

These results highlighted that instruction following was a significant differentiator between models at the time of the paper's publication, and that even frontier models had substantial room for improvement.

## Adoption in Evaluation Suites

IFEval has been adopted as a standard benchmark across several major evaluation frameworks, making it one of the most widely used instruction-following assessments in the AI community.

### Open LLM Leaderboard v2

In June 2024, [Hugging Face](/wiki/hugging_face) launched the Open LLM Leaderboard v2, replacing the original leaderboard with a new suite of six more challenging benchmarks.[4] IFEval was selected as one of these six core benchmarks alongside:[4]

- [MMLU](/wiki/mmlu)-Pro (knowledge and reasoning)
- GPQA (graduate-level question answering)
- [MuSR](/wiki/musr) (multi-step reasoning)
- MATH Level 5 (competition-level mathematics)
- [BBH](/wiki/big_bench) (challenging tasks from [BIG-Bench](/wiki/big_bench))

IFEval was chosen specifically because it evaluates instruction-following capabilities rather than content generation quality, providing a dimension of evaluation that the other five benchmarks do not cover.[4] On the Open LLM Leaderboard, scores from all six benchmarks are normalized so that random performance maps to 0 and perfect performance maps to 100, then averaged to produce a final composite score.[9]
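
The normalization step can be sketched as a linear rescaling. Two details below are assumptions rather than statements from the source: that IFEval's baseline is 0 (a free-form generation task has no random-guess floor) and that scores below the baseline are clipped to 0. The function name `normalize_score` is illustrative.

```python
def normalize_score(raw: float, baseline: float = 0.0, maximum: float = 100.0) -> float:
    """Linearly rescale so that `baseline` maps to 0 and `maximum` maps to 100."""
    if raw <= baseline:
        # Assumption: below-baseline scores are clipped rather than going negative.
        return 0.0
    return (raw - baseline) * 100.0 / (maximum - baseline)


# A four-option multiple-choice task has a random baseline of 25.
print(normalize_score(62.5, baseline=25.0))
# With a baseline of 0, as assumed here for IFEval, the score is unchanged.
print(normalize_score(80.0))
```

Under these assumptions normalization changes nothing for IFEval itself but compresses multiple-choice benchmarks, so the six scores contribute on a comparable scale to the composite average.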

### EleutherAI Language Model Evaluation Harness

[EleutherAI](/wiki/eleutherai)'s lm-evaluation-harness, the backend framework powering the Open LLM Leaderboard, includes IFEval as a built-in task.[5] This allows any causal language model to be evaluated on IFEval with standardized inputs and consistent scoring, ensuring reproducibility across different evaluation runs.

### Other Evaluation Platforms

IFEval has also been integrated into numerous other evaluation tools and platforms:

- **DeepEval** by Confident AI includes IFEval as a built-in benchmark
- **Inspect Evals** (by the UK AI Security Institute, formerly the AI Safety Institute) provides an IFEval implementation
- **Scale AI Labs** maintains an instruction-following leaderboard that includes IFEval
- **Stanford HELM** (Holistic Evaluation of Language Models) incorporates IFEval metrics

## How have model scores changed over time?

Since IFEval's release in late 2023, model performance on the benchmark has improved dramatically. While [GPT-4](/wiki/gpt-4) scored roughly 77% prompt-level strict accuracy in the original paper, newer models have pushed well past 90%.[1]

### Representative Model Scores (2025-2026)

The following table shows IFEval scores for a selection of notable models, as reported on benchmark leaderboards. Scores are given on a 0 to 1 scale (0.950 corresponds to 95.0%) and represent an aggregate metric (typically an average of the four IFEval accuracy dimensions or a normalized score).

| Model | IFEval Score | Developer |
|---|---|---|
| [Qwen](/wiki/qwen) 3.5-27B | 0.950 | Alibaba |
| o3-mini | 0.939 | [OpenAI](/wiki/openai) |
| [Claude](/wiki/claude) 3.7 Sonnet | 0.932 | Anthropic |
| [LLaMA](/wiki/llama) 3.3 70B Instruct | 0.921 | [Meta AI](/wiki/meta_ai) |
| [Gemma](/wiki/gemma) 3 27B | 0.904 | Google |
| [LLaMA](/wiki/llama) 3.1 405B Instruct | 0.886 | [Meta AI](/wiki/meta_ai) |
| GPT-4.5 | 0.882 | [OpenAI](/wiki/openai) |
| [LLaMA](/wiki/llama) 3.1 70B Instruct | 0.875 | [Meta AI](/wiki/meta_ai) |
| [DeepSeek](/wiki/deepseek)-V3 | 0.861 | DeepSeek |
| [Qwen](/wiki/qwen) 2.5 72B Instruct | 0.841 | Alibaba |
| GPT-4.1 mini | 0.841 | [OpenAI](/wiki/openai) |
| QwQ-32B | 0.839 | Alibaba |
| [Mistral](/wiki/mistral) Small 3 24B Instruct | 0.829 | [Mistral AI](/wiki/mistral) |
| GPT-4o | 0.810 | [OpenAI](/wiki/openai) |
| [LLaMA](/wiki/llama) 3.1 8B Instruct | 0.804 | [Meta AI](/wiki/meta_ai) |
| [Gemma](/wiki/gemma) 3 1B | 0.802 | Google |
| [LLaMA](/wiki/llama) 3.2 3B Instruct | 0.774 | [Meta AI](/wiki/meta_ai) |
| GPT-4.1 nano | 0.745 | [OpenAI](/wiki/openai) |
| Phi 4 | 0.630 | Microsoft |

These scores show that frontier models now routinely exceed 90% on IFEval. Even small models with roughly 1 to 3 billion parameters (such as [Gemma](/wiki/gemma) 3 1B at 0.802 and [LLaMA](/wiki/llama) 3.2 3B at 0.774) achieve respectable scores, indicating that instruction-following capability has become a well-optimized dimension in modern LLM training.

### Proprietary Model Leaders

Among proprietary, closed-source models evaluated on IFEval, the top performers include:

| Model | IFEval Score |
|---|---|
| GPT-5 | ~95.9% |
| o4-mini | ~95.6% |
| o3 | ~94.3% |
| GPT-5-mini | ~94.1% |

OpenAI's GPT-5 system card reports an IFEval instruction-following accuracy of about 95.9%, near the top of any publicly evaluated model.[11] These scores indicate that the most advanced proprietary models are approaching the practical ceiling of the benchmark. As models such as GPT-5.5 and Claude Opus 4.7 were released in 2026, IFEval has become less useful as a frontier discriminator; those models are expected to perform at or above the GPT-5 tier, making instruction-following at the format-constraint level no longer a meaningful differentiator among leading systems. Evaluation focus for frontier models has shifted to harder instruction-following tasks such as AdvancedIF and multi-turn system-prompt adherence tests.

## Technical Implementation

### Verification Functions

Each of the 25 instruction types has a corresponding verification function implemented in Python.[1] These functions are deterministic: given a response and the instruction parameters, they return a binary pass/fail result.[1] Examples include:

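- **Word count verification**: Tokenizes the response into words and checks whether the count satisfies the required relation (e.g., "at least 400"). The reference implementation counts runs of word characters with a regular-expression tokenizer rather than splitting on whitespace alone.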
- **Keyword existence**: Checks whether all required keywords appear in the response (case-insensitive matching).
- **No commas**: Scans the response for the presence of comma characters.
- **JSON format**: Attempts to parse the response as JSON and checks for valid structure.
- **Language detection**: Uses library-based language identification (the reference implementation relies on the langdetect package) to verify the response language.
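
A few of these checks can be approximated with the Python standard library. The sketch below is a simplified stand-in for the official functions, not a copy of them: the function names, the `CHECKERS` dispatch table, and the "less than" relation name are assumptions, and language detection is omitted because it needs a third-party library.

```python
import json
import re


def check_number_words(response: str, num_words: int, relation: str = "at least") -> bool:
    """Count runs of word characters and compare against the threshold."""
    count = len(re.findall(r"\w+", response))
    if relation == "at least":
        return count >= num_words
    if relation == "less than":  # assumed name for the opposite relation
        return count < num_words
    raise ValueError(f"unsupported relation: {relation!r}")


def check_keywords_exist(response: str, keywords: list) -> bool:
    """Every keyword must occur somewhere in the response, ignoring case."""
    return all(re.search(re.escape(k), response, flags=re.IGNORECASE) for k in keywords)


def check_no_comma(response: str) -> bool:
    """Pass only if the response contains no comma character."""
    return "," not in response


def check_json_format(response: str) -> bool:
    """Pass only if the whole response parses as JSON."""
    try:
        json.loads(response.strip())
    except ValueError:
        return False
    return True


# Hypothetical dispatch table from instruction IDs to checkers.
CHECKERS = {
    "length_constraints:number_words": check_number_words,
    "keywords:existence": check_keywords_exist,
    "punctuation:no_comma": check_no_comma,
    "detectable_format:json_format": check_json_format,
}


def is_followed(instruction_id: str, response: str, params: dict) -> bool:
    """Run the checker registered for an instruction ID with its parameters."""
    return CHECKERS[instruction_id](response, **params)


print(is_followed("punctuation:no_comma", "One, two", {}))
```

Each checker is a pure function of the response text and its parameters, which is what makes the pass/fail result reproducible.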

### Running IFEval

The official implementation is available in the Google Research GitHub repository (google-research/google-research, under the instruction_following_eval directory).[2] To run an evaluation:

1. Prepare model responses in JSONL format, with each line containing a prompt and response field.
2. Install dependencies with pip.
3. Run the evaluation script, specifying the input data file, the model response file, and an output directory.

The script outputs the four accuracy metrics (prompt-level strict, prompt-level loose, instruction-level strict, instruction-level loose) and detailed per-prompt results.[2]
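
Step 1 can be sketched as below, assuming each line of the response file is a JSON object with exactly the two fields named above, `prompt` and `response`. The helper names and the output file name are hypothetical; the official repository should be consulted for the exact format it expects.

```python
import json
from typing import Iterable, List, Tuple


def to_jsonl_lines(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Serialize (prompt, response) pairs as one JSON object per line."""
    return [
        json.dumps({"prompt": prompt, "response": response}, ensure_ascii=False)
        for prompt, response in pairs
    ]


def write_responses(path: str, pairs: Iterable[Tuple[str, str]]) -> None:
    """Write the serialized pairs to a JSONL file, one record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_jsonl_lines(pairs):
            handle.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical file name and model output.
    write_responses(
        "my_model_responses.jsonl",
        [("Write a haiku. Do not use any commas.", "Quiet morning rain")],
    )
```

The `prompt` string should be copied verbatim from the input data so that each response can be matched to its instructions.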

### Alternative Implementations

Several alternative implementations exist:

- **[EleutherAI](/wiki/eleutherai) lm-evaluation-harness**: Integrates IFEval as a task, allowing batch evaluation of models with a single command.[5]
- **DeepEval**: Provides a Python API for running IFEval as part of a broader evaluation pipeline.
- **Inspect Evals**: The UK AI Security Institute's (formerly AI Safety Institute's) implementation, installable via pip.
- **oKatanaaa/ifeval**: A clean reimplementation that supports multiple languages (English and Russian) and provides both a Python API and a CLI.

## Limitations and Criticisms

Despite its wide adoption, IFEval has several recognized limitations that researchers and practitioners should consider.

### Narrow Scope of Instruction Types

IFEval's 25 instruction types cover a relatively narrow range of verifiable constraints. Many important instruction-following capabilities, such as maintaining a consistent tone, following complex multi-turn conversational directives, adhering to system prompts, or handling ambiguous requests gracefully, fall outside the scope of what IFEval can measure. The benchmark specifically tests format and structural constraints, not semantic or pragmatic instruction following.

### Synthetic Constraints vs. Real-World Instructions

Several researchers have pointed out that IFEval's instructions are somewhat artificial. Real users rarely ask models to "avoid the letter C" or "use exactly 4 paragraphs." While these constraints serve as useful proxies for measuring instruction compliance, they do not reflect the kinds of instructions that matter most in practical applications. A model could score perfectly on IFEval while still failing to follow more nuanced, real-world instructions.

### Content Quality Is Not Measured

IFEval evaluates only whether format and structural constraints are met. It does not assess the quality, coherence, accuracy, or helpfulness of the generated content. A response that meets all format requirements while containing nonsensical or incorrect information would receive a perfect IFEval score. This means IFEval should always be used alongside other benchmarks that evaluate content quality.

### Is IFEval saturated?

As of 2025 and 2026, top models regularly score above 90% on IFEval, with several exceeding 95%. This saturation effect reduces the benchmark's ability to discriminate between frontier models. When most advanced models achieve near-perfect scores, the remaining errors often involve edge cases, annotation ambiguities, or unusual instruction combinations rather than systematic failures. Some instruction categories (such as keyword existence) have reached near-perfect accuracy across all models, providing little additional signal.

### Overfitting Risk

Because the IFEval dataset is publicly available and consists of only 541 fixed prompts, there is a risk that model developers may explicitly or implicitly train on the evaluation data.[1] This overfitting concern applies to many public benchmarks, but it is particularly relevant for IFEval given the dataset's small size. Models can be explicitly tuned to handle IFEval's specific instruction patterns without genuinely improving their general instruction-following abilities.

### Reliability Concerns

Research published in late 2025 ("Revisiting the Reliability of Language Models in Instruction-Following") examined the consistency of model performance on IFEval-style tasks.[10] The study found that even models with high average scores can show significant variability when prompts are rephrased or slightly modified.[10] The relative drop from standard IFEval accuracy to reliability-adjusted metrics can be substantial: as large as 61.8% for [Qwen](/wiki/qwen) 3-0.6B and 54.7% for GPT-3.5-turbo-1106, and even the most reliable model (GPT-5) experienced a decrease of 18.3%.[10]

## Variants and Extensions

The success and limitations of IFEval have inspired several derivative benchmarks that extend or modify its approach.

### IFEval-Extended

IFEval-Extended addresses the overfitting problem by using dynamic prompt generation rather than a fixed set of prompts.[7] It extends the original instruction categories and generates thousands of unique instructions from each base template.[7] This approach produces a more robust evaluation that is harder to overfit to, while maintaining the verifiable instruction paradigm.
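
The general idea of generating prompts from templates can be illustrated with a small sketch. It is not IFEval-Extended's actual generator: the templates, parameter ranges, and the function `generate_prompt` are hypothetical and only show how a few base templates can yield many distinct, still verifiable, instructions.

```python
import random

# Hypothetical templates: (instruction ID, instruction text, parameter sampler).
TEMPLATES = [
    (
        "length_constraints:number_words",
        "Answer with at least {num_words} words.",
        lambda rng: {"num_words": rng.randrange(100, 501, 50), "relation": "at least"},
    ),
    (
        "keywords:frequency",
        "Mention the keyword '{keyword}' at least {frequency} times.",
        lambda rng: {
            "keyword": rng.choice(["AI", "data", "model"]),
            "frequency": rng.randint(2, 5),
        },
    ),
]


def generate_prompt(task: str, rng: random.Random) -> dict:
    """Attach one randomly parameterized instruction to a base task."""
    instruction_id, template, sample = rng.choice(TEMPLATES)
    params = sample(rng)
    return {
        "prompt": f"{task} {template.format(**params)}",
        "instruction_id_list": [instruction_id],
        "kwargs": [params],
    }


rng = random.Random(0)  # fixed seed so a generated test set can be reproduced
print(generate_prompt("Write a short essay about rivers.", rng))
```

Because the parameters are sampled when the prompt is built, a model cannot memorize a fixed prompt set, while each generated instruction stays checkable by the same deterministic verifiers.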

### M-IFEval (Multilingual IFEval)

M-IFEval, published in 2025, expands instruction-following evaluation to French, Japanese, and Spanish.[6] Rather than simply translating the original English prompts, M-IFEval includes language-specific instructions that test capabilities unique to each language.[6] Evaluation of eight state-of-the-art models showed that performance varies widely across languages and instruction types, highlighting the importance of multilingual evaluation.[6]

### IndicIFEval

IndicIFEval extends the verifiable instruction paradigm to 14 Indic languages, addressing the need for instruction-following evaluation in languages that are underrepresented in most benchmarks.

### CL-IFEval

CL-IFEval (Cross-Lingual IFEval) expands coverage to French, Spanish, Hindi, Arabic, and Yoruba, further broadening the linguistic scope of instruction-following evaluation.

### MaXIFE

MaXIFE introduces a dataset covering 23 languages with 795 prompts and approximately 1,700 constraint templates, resulting in over 18,000 possible instruction combinations. This represents the most linguistically diverse extension of the IFEval framework to date.

### AdvancedIF

Developed by Meta's Superintelligence Labs in partnership with Surge AI, AdvancedIF moves beyond IFEval's synthetic constraints to evaluate real-world instruction-following capabilities.[8] Instead of format-based constraints checked by Python scripts, AdvancedIF uses human-written rubrics for both prompts and evaluation criteria.[8] The benchmark tests multi-turn context management, system prompt adherence, and other practical instruction-following scenarios.[8] A verifier trained on human-annotated data achieved 0.728 F1 agreement with human judgments, representing a 41% improvement over vanilla LLM prompting as a judge.[8]

## How does IFEval differ from other benchmarks?

IFEval occupies a specific niche in the broader landscape of LLM evaluation. The following table compares it with other notable benchmarks.

| Benchmark | What It Measures | Evaluation Method | Created By | Year |
|---|---|---|---|---|
| IFEval | Instruction following (verifiable constraints) | Deterministic programs | Google | 2023 |
| [MT-Bench](/wiki/mt_bench) | Multi-turn conversation quality | LLM judge (GPT-4) | LMSYS | 2023 |
| AlpacaEval | Instruction following (general) | LLM judge (GPT-4) | Stanford | 2023 |
| [MMLU](/wiki/mmlu) | Knowledge and reasoning (multiple choice) | Exact match | UC Berkeley et al. | 2020 |
| [MMLU](/wiki/mmlu)-Pro | Knowledge and reasoning (harder) | Exact match | TIGER-Lab | 2024 |
| [GSM8K](/wiki/gsm8k) | Grade-school math reasoning | Exact match | [OpenAI](/wiki/openai) | 2021 |
| [BIG-Bench](/wiki/big_bench) Hard (BBH) | Challenging diverse tasks | Various | Google et al. | 2022 |
| AdvancedIF | Real-world instruction following | Human rubrics + trained verifier | Meta / Surge AI | 2025 |

IFEval's primary advantage is its objectivity and reproducibility. Unlike benchmarks that rely on LLM judges, IFEval's deterministic verification produces identical results for a given set of model responses, regardless of who runs the evaluation. Its primary disadvantage is its narrow scope, covering only format-level constraints rather than the full spectrum of instruction-following behavior.

## Impact and Significance

IFEval has had a notable impact on the LLM evaluation ecosystem since its release in 2023.

### Standardizing Instruction-Following Evaluation

Before IFEval, there was no widely accepted, objective benchmark specifically for instruction following. By providing a simple, reproducible, and deterministic evaluation, IFEval filled an important gap and gave the community a common reference point for comparing models on this dimension.

### Driving Model Improvement

The inclusion of IFEval in the Open LLM Leaderboard v2 has incentivized model developers to optimize their models for instruction following.[4] The dramatic improvement in scores between 2023 (GPT-4 at ~77%) and 2026 (frontier models at ~95%) reflects genuine progress in training models to comply with explicit user instructions.

### Inspiring New Benchmarks

IFEval's verifiable instruction paradigm has spawned a family of derivative benchmarks targeting different languages, domains, and levels of difficulty. The core idea of testing instructions that can be checked automatically has proven highly influential, even as researchers have identified the need to go beyond IFEval's specific set of constraints.

### Limitations as a Ceiling

The benchmark's saturation among frontier models has also highlighted an important lesson: a benchmark is most useful when it can discriminate between models at the current frontier. As models have caught up to and exceeded IFEval's difficulty level, the community has begun developing more challenging successors like AdvancedIF and IFEval-Extended.

## See Also

- [Benchmark](/wiki/benchmark)
- [Instruction Following](/wiki/instruction_following)
- [MMLU](/wiki/mmlu)
- [MT-Bench](/wiki/mt_bench)
- [GSM8K](/wiki/gsm8k)
- [BIG-Bench](/wiki/big_bench)
- [Prompt Engineering](/wiki/prompt_engineering)
- [Instruction Tuning](/wiki/instruction_tuning)
- [Fine-tuning](/wiki/fine_tuning)
- [Large Language Models](/wiki/large_language_model)

## References

1. Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., & Hou, L. (2023). "Instruction-Following Evaluation for Large Language Models." *arXiv preprint arXiv:2311.07911*. https://arxiv.org/abs/2311.07911

2. Google Research. (2023). "Instruction Following Eval." GitHub repository. https://github.com/google-research/google-research/tree/master/instruction_following_eval

3. Hugging Face. (2023). "google/IFEval Dataset." https://huggingface.co/datasets/google/IFEval

4. Hugging Face. (2024). "Open LLM Leaderboard v2." https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

5. EleutherAI. (2024). "Language Model Evaluation Harness." GitHub repository. https://github.com/EleutherAI/lm-evaluation-harness

6. Dussolle, A., Cardeña Díaz, A., Sato, S., & Devine, P. (2025). "M-IFEval: Multilingual Instruction-Following Evaluation." *Findings of the Association for Computational Linguistics: NAACL 2025*. https://aclanthology.org/2025.findings-naacl.344/

7. Mecklenburg, N. et al. (2025). "IFEval-Extended: Enhancing Instruction-Following Evaluation in Large Language Models through Dynamic Prompt Generation." https://www.researchgate.net/publication/387435651

8. Surge AI & Meta. (2025). "Building AdvancedIF: Evolving Instruction Following Beyond IFEval." https://surgehq.ai/blog/advancedif-and-the-evolution-of-instruction-following-benchmarks

9. Hugging Face. (2024). "Scores Normalization - Open LLM Leaderboard." https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization

10. Li, Z. et al. (2025). "Revisiting the Reliability of Language Models in Instruction-Following." *arXiv preprint arXiv:2512.14754*. https://arxiv.org/abs/2512.14754

11. OpenAI. (2025). "GPT-5 System Card." https://cdn.openai.com/gpt-5-system-card.pdf

