IFEval (Instruction-Following Evaluation) is a benchmark designed to measure how well large language models follow natural language instructions. Introduced in November 2023 by Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou at Google, IFEval focuses on a class of "verifiable instructions" that can be checked automatically through deterministic programs, eliminating the need for subjective human evaluation or potentially biased LLM-based auto-evaluation. The benchmark consists of 541 prompts, each containing one or more instructions drawn from 25 verifiable instruction types. IFEval has become one of the most widely adopted instruction-following benchmarks in the AI research community and is a core component of Hugging Face's Open LLM Leaderboard v2.
Evaluating how well language models follow user instructions has been a persistent challenge in natural language processing. As LLMs have grown more capable, the ability to follow complex, multi-faceted instructions has become a critical quality indicator. However, prior evaluation approaches suffered from significant drawbacks.
Before IFEval, instruction-following evaluation relied primarily on two approaches, each with notable shortcomings:
Human evaluation involves asking human annotators to judge whether a model's output correctly follows given instructions. While this method captures nuance, it is expensive, slow, difficult to scale, and inherently subjective. Different annotators may disagree on whether a response adequately follows an instruction, making results difficult to reproduce.
LLM-based auto-evaluation uses another language model (often GPT-4 or a similar frontier model) to judge whether a response follows instructions. Benchmarks such as MT-Bench and AlpacaEval employ this approach. While faster and cheaper than human evaluation, LLM judges introduce their own biases, may favor certain writing styles over others, and are limited by the capabilities of the evaluator model itself. The evaluator may also struggle with ambiguous or complex instructions.
IFEval's key insight was to focus on instructions whose compliance can be verified through simple, deterministic computer programs. Instead of asking subjective questions such as "Is this response helpful?" or "Does this response follow the user's intent?" IFEval tests instructions such as "write in more than 400 words" or "mention the keyword AI at least 3 times." These constraints are unambiguous and can be checked automatically with short Python scripts, ensuring that evaluation is objective, reproducible, and fully automated.
This design philosophy trades coverage for reliability. IFEval does not attempt to measure every dimension of instruction following. Instead, it provides a narrow but highly reliable signal about a model's ability to adhere to explicit, well-defined constraints.
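The deterministic checks that make this design possible are only a few lines each. The sketch below illustrates two of the constraints quoted above; the function names and signatures are illustrative, not the official API.

```python
import re

def check_min_words(response: str, num_words: int) -> bool:
    """Verify a 'write in more than N words'-style constraint."""
    return len(response.split()) >= num_words

def check_keyword_frequency(response: str, keyword: str, frequency: int) -> bool:
    """Verify 'mention the keyword X at least N times'.

    Whole-word, case-insensitive matching, so 'air' does not count as 'AI'.
    """
    matches = re.findall(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE)
    return len(matches) >= frequency
```

Because both functions are pure string operations, any two evaluators running them on the same response will always agree, which is exactly the reproducibility property IFEval is built around.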
The IFEval dataset comprises 541 prompts, each containing one or more verifiable instructions. The prompts were constructed through a combination of few-shot prompting (using an LLM to generate candidate prompts) and manual curation by the research team. Each prompt is a realistic text generation task (such as writing an essay, summarizing a topic, or composing a letter) augmented with one or more specific format, content, or structural constraints.
Each entry in the dataset contains the following fields:
| Field | Type | Description |
|---|---|---|
| key | Integer | Unique identifier for the prompt (e.g., 1000) |
| prompt | String | The full task description given to the model, including all instructions |
| instruction_id_list | List of strings | Array of verifiable instruction IDs (1 to 3 per prompt) |
| kwargs | List of objects | Arguments specifying parameters for each instruction (e.g., word count thresholds, required keywords) |
A representative example from the dataset looks like this:
> Write a 300+ word summary of the Wikipedia page on "quantum entanglement." Do not use any commas and highlight at least 3 sections that have titles in markdown format.
This single prompt contains three verifiable instructions:
| Instruction ID | Constraint | Parameter |
|---|---|---|
| punctuation:no_comma | Do not use any commas | None |
| detectable_format:number_highlighted_sections | Highlight at least N sections | num_highlights = 3 |
| length_constraints:number_words | Write at least N words | num_words = 300, relation = "at least" |
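In the dataset, this example would be stored roughly as follows. This is a hypothetical entry mirroring the schema above: the field names follow the google/IFEval dataset, but the key value is invented for illustration.

```python
# Hypothetical dataset entry following the IFEval schema (key is illustrative).
example = {
    "key": 1001,
    "prompt": (
        'Write a 300+ word summary of the Wikipedia page on "quantum '
        'entanglement." Do not use any commas and highlight at least 3 '
        "sections that have titles in markdown format."
    ),
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    # kwargs[i] holds the parameters for instruction_id_list[i].
    "kwargs": [
        {},
        {"num_highlights": 3},
        {"num_words": 300, "relation": "at least"},
    ],
}
```

The parallel structure of `instruction_id_list` and `kwargs` is what lets a harness pair each instruction with its parameters during verification.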
The dataset is publicly available on Hugging Face under the Apache 2.0 license (google/IFEval) and on Google Research's GitHub repository.
IFEval identifies 25 types of verifiable instructions, organized into nine broad categories. Each instruction type has a corresponding verification function implemented as a short Python program.
**Keywords**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Include Keywords | keywords:existence | The response must include specific words | "Include the words 'neural' and 'network' in your response" |
| Keyword Frequency | keywords:frequency | A specific word must appear at least N times | "Mention the keyword 'AI' at least 3 times" |
| Forbidden Words | keywords:forbidden_words | The response must not contain certain words | "Do not use the words 'however' or 'therefore'" |
| Letter Frequency | keywords:letter_frequency | A specific letter must appear at least N times | "The letter 'q' should appear at least 5 times" |
**Language**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Response Language | language:response_language | The entire response must be written in a specified language | "Your response should be in French" |
**Length Constraints**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Number of Paragraphs | length_constraints:number_paragraphs | The response must contain exactly N paragraphs, or at least/at most N | "Write exactly 4 paragraphs" |
| Number of Words | length_constraints:number_words | The response must contain at least/at most N words | "Write in more than 400 words" |
| Number of Sentences | length_constraints:number_sentences | The response must contain at least/at most N sentences | "Your entire response should contain less than 6 sentences" |
| Nth Paragraph First Word | length_constraints:nth_paragraph_first_word | The Nth paragraph must start with a specific word | "The second paragraph should start with the word 'Furthermore'" |
**Detectable Content**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Postscript | detectable_content:postscript | The response must include a postscript (P.S. or P.P.S.) at the end | "At the end of your response, add a P.S. section" |
| Number of Placeholders | detectable_content:number_placeholders | The response must include a specified number of bracketed placeholders | "Include at least 2 placeholders in the form [placeholder]" |
**Detectable Format**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Number of Bullet Lists | detectable_format:number_bullet_lists | The response must contain a specific number of bullet point lists | "Include at least 2 bullet point lists" |
| Title | detectable_format:title | The response must include a title wrapped in double angular brackets | "Your answer must contain a title, wrapped in double angular brackets, such as <<poem of joy>>" |
| Multiple Sections | detectable_format:multiple_sections | The response must be organized into at least N sections with markdown headings | "Your response must have 5 sections marked with ## heading" |
| Choose From | detectable_format:constrained_response | The response must be one of a predefined set of choices | "Answer with one of: Yes, No, Maybe" |
| JSON Format | detectable_format:json_format | The entire response must be valid JSON | "Your entire output should be in JSON format" |
| Number of Highlighted Sections | detectable_format:number_highlighted_sections | The response must contain N sections with markdown-highlighted titles | "Highlight at least 3 sections" |
**Combination**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| Repeat Prompt | combination:repeat_prompt | The model must repeat the original prompt before answering | "First, repeat the request above word for word, then answer it" |
| Two Responses | combination:two_responses | The model must provide two distinct responses separated by asterisks | "Give two different responses separated by ******" |
**Change Case**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| All Uppercase | change_case:english_capital | The entire response must be in uppercase | "Your entire response should be in English, and in all capital letters" |
| All Lowercase | change_case:english_lowercase | The entire response must be in lowercase | "Your entire response should be in English, and in all lowercase letters" |
| Capital Word Frequency | change_case:capital_word_frequency | At least N words must be fully capitalized | "In your response, words with all capital letters should appear at least 5 times" |
**Start/End**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| End Checker | startend:end_checker | The response must end with a specific phrase | "Finish your response with 'Is there anything else I can help with?'" |
| Quotation | startend:quotation | The entire response must be wrapped in double quotation marks | "Wrap your entire response with double quotation marks" |
**Punctuation**

| Instruction Type | ID | Description | Example |
|---|---|---|---|
| No Commas | punctuation:no_comma | The response must not contain any commas | "Do not use any commas in your response" |
IFEval produces four distinct accuracy metrics by combining two dimensions: the level of granularity (prompt-level vs. instruction-level) and the strictness of verification (strict vs. loose).
Prompt-level accuracy treats each prompt as a single unit. A prompt is scored as correct only if all verifiable instructions within that prompt are satisfied. If a prompt contains three instructions and the model follows two out of three, the prompt receives a score of zero. This metric reflects a user's real-world experience, where partial compliance with a multi-part request may not be acceptable.
Instruction-level accuracy evaluates each verifiable instruction independently. If a prompt contains three instructions and the model follows two, the instruction-level score credits those two successes. This provides a more granular view of where models succeed and fail.
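Both granularities can be computed from the same per-instruction pass/fail results. The sketch below assumes the results are stored as one list of booleans per prompt.

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """A prompt counts as correct only if every instruction in it passed."""
    return sum(all(prompt) for prompt in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    """Each instruction is credited independently."""
    flat = [passed for prompt in results for passed in prompt]
    return sum(flat) / len(flat)

# Two prompts: one fully followed, one with 2 of its 3 instructions followed.
results = [[True, True], [True, True, False]]
# prompt-level: 1/2 = 0.5; instruction-level: 4/5 = 0.8
```

The example shows why prompt-level scores are always the harsher of the two: a single failed instruction zeroes out its entire prompt.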
Strict accuracy requires exact compliance with each instruction. The verification function checks the model's raw output directly against the instruction's requirements. For instance, if the instruction requires "at least 400 words," the strict check counts the words in the response and requires the count to be 400 or greater.
Loose accuracy accounts for the fact that minor formatting differences can cause false negatives. Before checking compliance, the loose evaluation applies several transformations to the response, including removing commonly used markdown font modifiers (such as asterisks), removing the first line of the response (e.g., an introductory "Sure, here it is:"), removing the last line (e.g., a closing "Hope that helps!"), and combinations of these.
If any transformed version of the response passes the verification check, the instruction is considered followed under the loose criterion. This approach reduces false negatives caused by models adding extra formatting or conversational text around their main response.
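A minimal sketch of the loose procedure, assuming a transformation set of the kind described in the paper (stripping markdown asterisks and dropping a leading or trailing line); the exact official transformations may differ in detail.

```python
def loose_variants(response: str) -> list[str]:
    """Relaxed variants of a response, in the spirit of IFEval's loose mode."""
    lines = response.strip().split("\n")
    candidates = [
        response,
        "\n".join(lines[1:]),    # drop a leading intro line
        "\n".join(lines[:-1]),   # drop a trailing outro line
        "\n".join(lines[1:-1]),  # drop both
    ]
    # Also try each candidate with markdown emphasis markers removed.
    candidates += [c.replace("*", "") for c in candidates]
    return [c for c in candidates if c.strip()]  # ignore empty variants

def loose_check(response: str, checker) -> bool:
    """Pass if any relaxed variant satisfies the strict checker."""
    return any(checker(v) for v in loose_variants(response))
```

For example, a response that opens with "Sure, here you go:" fails a strict no-comma check, but passes the loose check once the introductory line is dropped.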
| Metric | Granularity | Strictness | What It Measures |
|---|---|---|---|
| Prompt-level Strict Accuracy | Prompt | Strict | Percentage of prompts where all instructions are followed exactly |
| Prompt-level Loose Accuracy | Prompt | Loose | Percentage of prompts where all instructions are followed (with tolerance for formatting) |
| Instruction-level Strict Accuracy | Instruction | Strict | Percentage of individual instructions followed exactly |
| Instruction-level Loose Accuracy | Instruction | Loose | Percentage of individual instructions followed (with tolerance for formatting) |
In many evaluation contexts (including the Open LLM Leaderboard), the reported IFEval score is an average of all four metrics, or sometimes just the prompt-level strict accuracy. The specific metric used varies by platform.
The original IFEval paper (Zhou et al., 2023) evaluated two models: GPT-4 (responses collected in November 2023) and PaLM 2 S (responses collected in August 2023). The results demonstrated a substantial performance gap between the two models.
| Model | Prompt-Level Strict | Prompt-Level Loose | Instruction-Level Strict | Instruction-Level Loose |
|---|---|---|---|---|
| GPT-4 (Nov 2023) | 76.89% | 79.30% | 83.57% | 85.37% |
| PaLM 2 S (Aug 2023) | 43.07% | 46.95% | 55.76% | 59.11% |
GPT-4 outperformed PaLM 2 S across all four metrics by a wide margin. At the prompt level, GPT-4 followed all instructions correctly in roughly 77% of prompts under strict evaluation, while PaLM 2 S managed only about 43%. The gap narrowed somewhat at the instruction level (where partial credit is given), but GPT-4 still maintained a lead of approximately 26 to 28 percentage points.
These results highlighted that instruction following was a significant differentiator between models at the time of the paper's publication, and that even frontier models had substantial room for improvement.
IFEval has been adopted as a standard benchmark across several major evaluation frameworks, making it one of the most widely used instruction-following assessments in the AI community.
In June 2024, Hugging Face launched the Open LLM Leaderboard v2, replacing the original leaderboard with a new suite of six more challenging benchmarks. IFEval was selected as one of these six core benchmarks alongside BBH (BIG-Bench Hard), MATH, GPQA, MuSR, and MMLU-Pro.
IFEval was chosen specifically because it evaluates instruction-following capabilities rather than content generation quality, providing a dimension of evaluation that the other five benchmarks do not cover. On the Open LLM Leaderboard, scores from all six benchmarks are normalized so that random performance maps to 0 and perfect performance maps to 100, then averaged to produce a final composite score.
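The normalization step can be sketched as a simple linear rescaling. This is a simplified version of the leaderboard's scheme: multiple-choice tasks use their random-chance level as the lower bound, while IFEval, which has no meaningful chance baseline, effectively uses zero.

```python
def normalize(raw: float, lower_bound: float) -> float:
    """Map a raw accuracy in [0, 1] so lower_bound -> 0 and 1.0 -> 100.

    Scores at or below the random baseline are clipped to 0.
    """
    return max(0.0, (raw - lower_bound) / (1.0 - lower_bound)) * 100.0

# IFEval: a raw 0.77 maps to 77.0 (lower bound 0.0).
# A 4-way multiple-choice task: raw 0.25 (pure chance) maps to 0.0.
```

The composite leaderboard score is then the plain average of the six normalized benchmark scores.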
EleutherAI's lm-evaluation-harness, the backend framework powering the Open LLM Leaderboard, includes IFEval as a built-in task. This allows any causal language model to be evaluated on IFEval with standardized inputs and consistent scoring, ensuring reproducibility across different evaluation runs.
IFEval has also been integrated into numerous other evaluation tools and platforms beyond the lm-evaluation-harness.
Since IFEval's release in late 2023, model performance on the benchmark has improved dramatically. While GPT-4 scored roughly 77% prompt-level strict accuracy in the original paper, newer models have pushed well past 90%.
The following table shows IFEval scores for a selection of notable models, as reported on benchmark leaderboards. Scores represent an aggregate metric (typically an average of the four IFEval accuracy dimensions or a normalized score).
| Model | IFEval Score | Developer |
|---|---|---|
| Qwen 3.5-27B | 0.950 | Alibaba |
| o3-mini | 0.939 | OpenAI |
| Claude 3.7 Sonnet | 0.932 | Anthropic |
| LLaMA 3.3 70B Instruct | 0.921 | Meta AI |
| Gemma 3 27B | 0.904 | |
| LLaMA 3.1 405B Instruct | 0.886 | Meta AI |
| GPT-4.5 | 0.882 | OpenAI |
| LLaMA 3.1 70B Instruct | 0.875 | Meta AI |
| DeepSeek-V3 | 0.861 | DeepSeek |
| Qwen 2.5 72B Instruct | 0.841 | Alibaba |
| GPT-4.1 mini | 0.841 | OpenAI |
| QwQ-32B | 0.839 | Alibaba |
| Mistral Small 3 24B Instruct | 0.829 | Mistral AI |
| GPT-4o | 0.810 | OpenAI |
| LLaMA 3.1 8B Instruct | 0.804 | Meta AI |
| Gemma 3 1B | 0.802 | |
| LLaMA 3.2 3B Instruct | 0.774 | Meta AI |
| GPT-4.1 nano | 0.745 | OpenAI |
| Phi 4 | 0.630 | Microsoft |
These scores show that frontier models now routinely exceed 90% on IFEval. Even smaller models with just a few billion parameters (such as Gemma 3 1B at 0.802 and LLaMA 3.2 3B at 0.774) achieve respectable scores, indicating that instruction-following capability has become a well-optimized dimension in modern LLM training.
Among proprietary, closed-source models evaluated on IFEval, the top performers include:
| Model | IFEval Score |
|---|---|
| GPT-5 | ~95.9% |
| o4-mini | ~95.6% |
| o3 | ~94.3% |
| GPT-5-mini | ~94.1% |
These scores indicate that the most advanced proprietary models are approaching the practical ceiling of the benchmark.
Each of the 25 instruction types has a corresponding verification function implemented in Python. These functions are deterministic: given a response and the instruction parameters, they return a binary pass/fail result, for example by counting words or paragraphs, searching for required or forbidden keywords, or attempting to parse the response as JSON.
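Two further verifiers in this style are sketched below. They are simplified stand-ins for the official checker code, not copies of it; in particular, the highlight-counting regex is an approximation of how markdown-emphasized sections can be detected.

```python
import re

def check_highlighted_sections(response: str, num_highlights: int) -> bool:
    """detectable_format:number_highlighted_sections (simplified).

    Counts markdown-emphasized spans such as *section title* or **section title**.
    """
    spans = re.findall(r"\*\*[^*\n]+\*\*|\*[^*\n]+\*", response)
    return len(spans) >= num_highlights

def check_end_phrase(response: str, end_phrase: str) -> bool:
    """startend:end_checker (simplified).

    The response must end with the given phrase, ignoring trailing whitespace.
    """
    return response.strip().endswith(end_phrase)
```

Both functions follow the same contract as every other IFEval verifier: plain string in, boolean out, with no model or human in the loop.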
The official implementation is available in the Google Research GitHub repository (google-research/instruction_following_eval). To run an evaluation, clone the repository, install its dependencies, and invoke the evaluation script with the IFEval prompts and a file containing the model's responses.
The script outputs the four accuracy metrics (prompt-level strict, prompt-level loose, instruction-level strict, instruction-level loose) and detailed per-prompt results.
Several alternative implementations exist, including the IFEval task bundled with EleutherAI's lm-evaluation-harness and ports in other open-source evaluation frameworks.
Despite its wide adoption, IFEval has several recognized limitations that researchers and practitioners should consider.
IFEval's 25 instruction types cover a relatively narrow range of verifiable constraints. Many important instruction-following capabilities, such as maintaining a consistent tone, following complex multi-turn conversational directives, adhering to system prompts, or handling ambiguous requests gracefully, fall outside the scope of what IFEval can measure. The benchmark specifically tests format and structural constraints, not semantic or pragmatic instruction following.
Several researchers have pointed out that IFEval's instructions are somewhat artificial. Real users rarely ask models to "avoid the letter C" or "use exactly 4 paragraphs." While these constraints serve as useful proxies for measuring instruction compliance, they do not reflect the kinds of instructions that matter most in practical applications. A model could score perfectly on IFEval while still failing to follow more nuanced, real-world instructions.
IFEval evaluates only whether format and structural constraints are met. It does not assess the quality, coherence, accuracy, or helpfulness of the generated content. A response that meets all format requirements while containing nonsensical or incorrect information would receive a perfect IFEval score. This means IFEval should always be used alongside other benchmarks that evaluate content quality.
As of 2025–2026, top models regularly score above 90% on IFEval, with several exceeding 95%. This saturation reduces the benchmark's ability to discriminate between frontier models. When most advanced models achieve near-perfect scores, the remaining errors often involve edge cases, annotation ambiguities, or unusual instruction combinations rather than systematic failures. Some instruction categories (such as keyword existence) have reached near-perfect accuracy across all models, providing little additional signal.
Because the IFEval dataset is publicly available and consists of only 541 fixed prompts, there is a risk that model developers may explicitly or implicitly train on the evaluation data. This overfitting concern applies to many public benchmarks, but it is particularly relevant for IFEval given the dataset's small size. Models can be explicitly tuned to handle IFEval's specific instruction patterns without genuinely improving their general instruction-following abilities.
Research published in late 2025 ("Revisiting the Reliability of Language Models in Instruction-Following") examined the consistency of model performance on IFEval-style tasks. The study found that even models with high average scores can show significant variability when prompts are rephrased or slightly modified. The relative drop from standard IFEval accuracy to reliability-adjusted metrics can be substantial: as large as 61.8% for Qwen 3-0.6B and 54.7% for GPT-3.5-turbo-1106, and even the most reliable model (GPT-5) experienced a decrease of 18.3%.
The success and limitations of IFEval have inspired several derivative benchmarks that extend or modify its approach.
IFEval-Extended addresses the overfitting problem by using dynamic prompt generation rather than a fixed set of prompts. It extends the original instruction categories and generates thousands of unique instructions from each base template. This approach produces a more robust evaluation that is harder to overfit to, while maintaining the verifiable instruction paradigm.
M-IFEval, published in 2025, expands instruction-following evaluation to French, Japanese, and Spanish. Rather than simply translating the original English prompts, M-IFEval includes language-specific instructions that test capabilities unique to each language. Evaluation of eight state-of-the-art models showed that performance varies widely across languages and instruction types, highlighting the importance of multilingual evaluation.
IndicIFEval extends the verifiable instruction paradigm to 14 Indic languages, addressing the need for instruction-following evaluation in languages that are underrepresented in most benchmarks.
CL-IFEval (Cross-Lingual IFEval) expands coverage to French, Spanish, Hindi, Arabic, and Yoruba, further broadening the linguistic scope of instruction-following evaluation.
MaXIFE introduces a dataset covering 23 languages with 795 prompts and approximately 1,700 constraint templates, resulting in over 18,000 possible instruction combinations. This represents the most linguistically diverse extension of the IFEval framework to date.
Developed by Meta's Superintelligence Labs in partnership with Surge AI, AdvancedIF moves beyond IFEval's synthetic constraints to evaluate real-world instruction-following capabilities. Instead of format-based constraints checked by Python scripts, AdvancedIF uses human-written rubrics for both prompts and evaluation criteria. The benchmark tests multi-turn context management, system prompt adherence, and other practical instruction-following scenarios. A verifier trained on human-annotated data achieved 0.728 F1 agreement with human judgments, representing a 41% improvement over vanilla LLM prompting as a judge.
IFEval occupies a specific niche in the broader landscape of LLM evaluation. The following table compares it with other notable benchmarks.
| Benchmark | What It Measures | Evaluation Method | Created By | Year |
|---|---|---|---|---|
| IFEval | Instruction following (verifiable constraints) | Deterministic programs | Google | 2023 |
| MT-Bench | Multi-turn conversation quality | LLM judge (GPT-4) | LMSYS | 2023 |
| AlpacaEval | Instruction following (general) | LLM judge (GPT-4) | Stanford | 2023 |
| MMLU | Knowledge and reasoning (multiple choice) | Exact match | UC Berkeley et al. | 2020 |
| MMLU-Pro | Knowledge and reasoning (harder) | Exact match | TIGER-Lab | 2024 |
| GSM8K | Grade-school math reasoning | Exact match | OpenAI | 2021 |
| BIG-Bench Hard (BBH) | Challenging diverse tasks | Various | Google et al. | 2022 |
| AdvancedIF | Real-world instruction following | Human rubrics + trained verifier | Meta / Surge AI | 2025 |
IFEval's primary advantage is its objectivity and reproducibility. Unlike benchmarks that rely on LLM judges, IFEval's deterministic verification produces identical results every time, regardless of who runs the evaluation. Its primary disadvantage is its narrow scope, covering only format-level constraints rather than the full spectrum of instruction-following behavior.
IFEval has had a notable impact on the LLM evaluation ecosystem since its release in 2023.
Before IFEval, there was no widely accepted, objective benchmark specifically for instruction following. By providing a simple, reproducible, and deterministic evaluation, IFEval filled an important gap and gave the community a common reference point for comparing models on this dimension.
The inclusion of IFEval in the Open LLM Leaderboard v2 has incentivized model developers to optimize their models for instruction following. The dramatic improvement in scores between 2023 (GPT-4 at ~77%) and 2026 (frontier models at ~95%) reflects genuine progress in training models to comply with explicit user instructions.
IFEval's verifiable instruction paradigm has spawned a family of derivative benchmarks targeting different languages, domains, and levels of difficulty. The core idea of testing instructions that can be checked automatically has proven highly influential, even as researchers have identified the need to go beyond IFEval's specific set of constraints.
The benchmark's saturation among frontier models has also highlighted an important lesson: a benchmark is most useful when it can discriminate between models at the current frontier. As models have caught up to and exceeded IFEval's difficulty level, the community has begun developing more challenging successors like AdvancedIF and IFEval-Extended.