Aider Polyglot

Overview
Full name	Aider Polyglot Coding Benchmark
Abbreviation	Aider Polyglot
Description	A challenging multi-language code generation and editing benchmark testing LLMs on 225 difficult Exercism coding exercises across six programming languages
Release date	2024-12-21
Latest version	1.0
Benchmark updated	2025-08-13
Authors	Paul Gauthier
Organization	Aider AI
Technical Details
Type	Code Generation, Code Editing
Modality	Text, Code
Task format	Code completion and file editing
Number of tasks	225
Total examples	225
Evaluation metric	Percent Correct, Edit Format Accuracy, Cost
Domains	Software Engineering, Programming
Languages	C++, Go, Java, JavaScript, Python, Rust
Performance
Human performance	Not reported
Baseline	3.6% (GPT-4o-mini)
SOTA score	93.3% (agent system)
SOTA model	Refact.ai Agent (Claude 3.7 Sonnet, with thinking)
Resources
Website	Official leaderboard
GitHub	polyglot-benchmark repo
Dataset	Download
Predecessor	Aider Code Editing Benchmark

Aider Polyglot is a coding benchmark that evaluates large language models on their ability to write and edit code across six programming languages. Released on December 21, 2024, by Aider creator Paul Gauthier, the benchmark presents 225 carefully selected problems drawn from Exercism, an open-source platform for coding practice. The exercises span C++, Go, Java, JavaScript, Python, and Rust, chosen specifically because the hardest problems in those six languages resisted easy solution by frontier models at the time of the benchmark's construction. The benchmark is maintained at aider.chat/docs/leaderboards and updated whenever new results become available.

Background

The Aider tool

Aider is an open-source AI pair programming tool that runs in the terminal and integrates with a local git repository. It allows developers to describe coding tasks in plain language, and then the tool works with an LLM to write or modify the source files, commit the changes, and track revision history. Because Aider's primary job is to edit existing files rather than generate standalone snippets, evaluating it requires a benchmark that measures code editing ability alongside raw coding ability.

The original Aider benchmark

Before the Polyglot benchmark existed, Aider used a Python-only benchmark built from Exercism's Python exercise catalog. That benchmark contained 133 exercises. By late 2024 it had saturated: Claude 3.5 Sonnet scored 84.2% by solving 112 of the 133 exercises, leaving only 21 unsolved. With scores bunched near the top, the benchmark could no longer distinguish between frontier models. A model could jump from 80% to 85% without that jump revealing anything meaningful about its practical coding capability.

Why a new benchmark was needed

Paul Gauthier designed Aider Polyglot to recalibrate the scale. The stated goal was to produce a benchmark where today's top models would occupy a wide range of scores, roughly between 5% and 50%, giving enough spread to track progress clearly and leaving headroom for future improvement. Making the benchmark multi-language was an intentional design choice: a model that can write competent Python may struggle with Rust's ownership semantics or Go's concurrency patterns, so a polyglot benchmark tests a broader slice of coding ability than any single-language dataset.

Methodology

Problem source: Exercism

All 225 problems come from Exercism, a free, open-source platform that hosts hand-crafted exercises in dozens of programming languages. Each exercise on Exercism ships with a starter file, a complete test suite, and a prose description of the task. Because Exercism exercises are designed for human learners, they tend to have clear problem statements and reliable test suites, two properties that make automated evaluation straightforward.

The benchmark draws problems from six of Exercism's language tracks: C++, Go, Java, JavaScript, Python, and Rust. These were chosen because they represent a range of paradigms, from Go's statically typed concurrent style to Python's dynamic duck typing, from Java's object-oriented structure to Rust's compile-time memory safety guarantees.

Problem selection: filtering for difficulty

Exercism hosts 697 exercises across the six chosen languages. Rather than using all of them, Gauthier applied an empirical filter to select only the hardest subset:

Seven of the strongest coding models at the time were each asked to attempt all 697 problems.
Problems solved by four or more of the seven models were excluded as too easy.
Problems solved by three or fewer of the seven models were retained, including those that none of the models solved.

This process yielded the final set of 225 problems. The language breakdown in the final set is: JavaScript (49 problems), Java (47), Go (39), Python (34), Rust (30), and C++ (26).
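The retention rule (keep a problem only if three or fewer of the seven reference models solved it) can be sketched in a few lines. This is an illustrative sketch, not the script used to build the benchmark: the function name `select_hard_problems` and the problem names and tallies are hypothetical.

```python
from typing import Dict, List


def select_hard_problems(solved_by: Dict[str, int], max_solvers: int = 3) -> List[str]:
    """Return the problems solved by at most `max_solvers` reference models."""
    return sorted(name for name, count in solved_by.items() if count <= max_solvers)


# Hypothetical tallies: problem name -> how many of the 7 reference models solved it.
solved_by = {
    "rust/forth": 2,
    "python/two-fer": 7,
    "java/zipper": 3,
    "cpp/hello": 5,
    "go/bowling": 1,
}

hard = select_hard_problems(solved_by)
print(hard)  # ['go/bowling', 'java/zipper', 'rust/forth']
```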

Two-attempt rule

Every model gets two attempts on each problem. On the first attempt the model receives the problem description and the starter code. It then produces a set of file edits. Those edits are applied to the source files and the exercise's test suite is executed automatically.

If all tests pass on the first attempt, the problem is scored as solved. If any tests fail, the model is shown the test failure output and given a second attempt. The score reported in the leaderboard is the percentage of problems for which the model's solution passed all tests within two attempts. This two-attempt structure tests not just a model's ability to write correct code from scratch, but also its ability to read error messages, diagnose what went wrong, and produce a corrected edit.
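A minimal sketch of the two-attempt loop follows. The callables `solve` and `run_tests` are hypothetical stand-ins for the model call and the test runner; they are assumptions made for illustration and do not reflect the actual harness API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, for illustration only:
#   solve(problem, feedback) -> candidate solution text
#   run_tests(problem, candidate) -> (all_passed, failure_output)
Solve = Callable[[str, Optional[str]], str]
RunTests = Callable[[str, str], Tuple[bool, str]]


def solved_within_attempts(problem: str, solve: Solve, run_tests: RunTests,
                           max_attempts: int = 2) -> bool:
    """Return True if any attempt passes every test; feed failures back in between."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = solve(problem, feedback)
        passed, output = run_tests(problem, candidate)
        if passed:
            return True
        feedback = output  # test failure output is shown on the next attempt
    return False


# Toy stand-ins: the "model" only gets it right after seeing test output.
def toy_solve(problem: str, feedback: Optional[str]) -> str:
    return "return a + b" if feedback else "return a - b"


def toy_run_tests(problem: str, candidate: str) -> Tuple[bool, str]:
    ok = candidate == "return a + b"
    return ok, "" if ok else "expected 3, got -1"


print(solved_within_attempts("add", toy_solve, toy_run_tests))  # True
```

With `max_attempts=1` the same toy model fails, which is the difference the second attempt is designed to capture.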

Edit format requirement

Aider does not ask models to write complete files from scratch. Instead, it asks them to produce structured edit instructions, search-and-replace blocks or diff-style patches, that the tool then applies to the existing source files. The benchmark measures edit format accuracy as a secondary metric: what percentage of the time did the model produce a syntactically valid edit that could be parsed and applied?

Edit format accuracy matters in practice because a model that writes a correct solution but mangles the diff syntax produces no usable change. A model with 100% edit format accuracy and 60% correct solutions can be more dependable inside an editing tool than one that scores 65% correct at 90% format accuracy, because the latter regularly produces edits that cannot be applied.

Additional metrics

Beyond percent correct and edit format accuracy, the leaderboard also records the total cost of running each model through all 225 problems. This cost figure includes both prompt tokens (the problem description, starter code, and, on the second attempt, the test failure output) and completion tokens (the model's edit instructions). Cost data lets developers compare models on efficiency: a model that scores 60% for $13 is a very different proposition from one that scores 62% for $187.
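As a rough illustration of how such a cost figure adds up, the sketch below prices prompt and completion tokens separately. The token counts and per-million-token prices are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any leaderboard entry.

```python
def run_cost(prompt_tokens: int, completion_tokens: int,
             prompt_price_per_million: float,
             completion_price_per_million: float) -> float:
    """Total cost in dollars for one full benchmark run."""
    total = (prompt_tokens * prompt_price_per_million
             + completion_tokens * completion_price_per_million)
    return total / 1_000_000


# Hypothetical run: 2M prompt tokens at $3 per million, 0.5M completion tokens at $15 per million.
cost = run_cost(2_000_000, 500_000, 3.00, 15.00)
print(f"${cost:.2f}")  # $13.50
```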

The benchmark also tracks: malformed response counts, syntax errors in generated code, context window exhaustion events, and test timeouts.

Edit formats and their impact

What edit formats are

When Aider sends a coding task to an LLM, it includes instructions for how the model should structure its response. These instructions describe an "edit format," a convention for expressing which lines of which files should change and how. Different edit formats suit different models. Aider automatically picks the format expected to work best for each model, but the benchmark records which format was used so comparisons are apples-to-apples.

Whole

The simplest format: the model returns the entire updated file. There is no diff or search/replace block, just the complete new contents of the file inside a fenced code block with the file path above it. This format is reliable because it does not require the model to produce structurally valid patch syntax. The cost is that the model must regenerate the entire file even if only a few lines change, which wastes tokens and increases cost.

Diff (search/replace)

The diff format uses search-and-replace blocks. The model specifies the exact text to find in the file and the text to replace it with. Each block is fenced with markers that Aider parses. This format is more token-efficient than whole because the model only needs to emit the changed sections. Research from the Aider team and an independent academic study ("Robust Learning of Diverse Code Edits," 2025) found that search-replace is the most effective format overall for large capable models, because each replacement stands on its own: an error in one block does not invalidate the rest of the response.
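The sketch below shows one way search/replace blocks could be parsed and applied. The marker strings follow Aider's SEARCH/REPLACE convention, but the parser is a simplified illustration, not Aider's implementation; the function names and the rule that the search text must match exactly once are assumptions. Blocks are applied independently, so a block that fails to match is reported without discarding the others.

```python
from typing import List, Tuple

SEARCH, DIVIDER, REPLACE = "<<<<<<< SEARCH", "=======", ">>>>>>> REPLACE"


def parse_blocks(response: str) -> List[Tuple[str, str]]:
    """Collect (search_text, replace_text) pairs from a model response."""
    blocks, search, replace, state = [], [], [], None
    for line in response.splitlines():
        if line == SEARCH:
            search, replace, state = [], [], "search"
        elif line == DIVIDER and state == "search":
            state = "replace"
        elif line == REPLACE and state == "replace":
            blocks.append(("\n".join(search), "\n".join(replace)))
            state = None
        elif state == "search":
            search.append(line)
        elif state == "replace":
            replace.append(line)
    return blocks


def apply_blocks(source: str, blocks: List[Tuple[str, str]]) -> Tuple[str, List[int]]:
    """Apply each block on its own; return the new text and indexes of failed blocks."""
    failed = []
    for index, (search, replace) in enumerate(blocks):
        if source.count(search) == 1:  # assumption: require exactly one match
            source = source.replace(search, replace)
        else:
            failed.append(index)
    return source, failed


source = "def greet():\n    return 'hi'\n"
response = "\n".join([
    "greet.py",
    SEARCH, "    return 'hi'", DIVIDER, "    return 'hello'", REPLACE,
    SEARCH, "    return 'missing'", DIVIDER, "    return 'unused'", REPLACE,
])
updated, failed = apply_blocks(source, parse_blocks(response))
print(failed)  # [1]
```

Here the first block is applied and the second is reported as failed, leaving the successful edit in place.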

Udiff (unified diff)

The udiff format is based on the standard unified diff output produced by git diff. Aider modified the standard format by omitting line numbers from hunk headers, since requiring accurate line numbers introduces brittle failure modes when the model miscounts. The unified diff approach was originally developed to address a specific problem with GPT-4 Turbo: that model tended toward "lazy coding," inserting placeholder comments like "...add logic here..." rather than writing the full implementation. The familiar unified diff syntax, with its + and - line prefixes, had extensive representation in the model's training data, and prompting the model to produce diffs in that style reduced the lazy-coding rate by roughly a factor of three. On Aider's refactoring benchmark, which was built to provoke lazy coding, switching GPT-4 Turbo from search/replace blocks to unified diffs improved its score from about 20% to 61%.
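To illustrate why hunks can be applied without line numbers, the sketch below locates a hunk by its content: the context and removed lines form the text to find, and the context and added lines form its replacement. This is a simplified illustration under stated assumptions (a hypothetical `apply_hunk` helper, one hunk, exact matching), not Aider's actual udiff handling.

```python
from typing import List


def apply_hunk(source: str, hunk_lines: List[str]) -> str:
    """Apply one unified-diff hunk that carries no line numbers."""
    before, after = [], []
    for line in hunk_lines:
        tag, text = line[:1], line[1:]
        if tag == " ":      # context line: present before and after
            before.append(text)
            after.append(text)
        elif tag == "-":    # removed line: present only before
            before.append(text)
        elif tag == "+":    # added line: present only after
            after.append(text)
    old, new = "\n".join(before), "\n".join(after)
    if source.count(old) != 1:  # assumption: the hunk must match exactly once
        raise ValueError("hunk does not match the source exactly once")
    return source.replace(old, new)


source = "def total(xs):\n    # TODO\n    pass\n"
hunk = [
    " def total(xs):",
    "-    # TODO",
    "-    pass",
    "+    return sum(xs)",
]
print(apply_hunk(source, hunk))
```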

Diff-fenced

A variant of the diff format that places the file path inside the fenced code block rather than above it. This was introduced specifically for Gemini models, which frequently failed to comply with the standard fencing approach used in the regular diff format. Gemini models often succeed at search-and-replace code editing but had formatting habits that caused parse failures with the standard convention.

Editor-diff and editor-whole

These two formats are streamlined versions of diff and whole intended for use in architect mode. In architect mode, the editing task is split between two models: an architect and an editor. The editor receives a narrower prompt focused purely on writing syntactically correct edits rather than on problem-solving. Because the editor's role is mechanical rather than creative, a simpler prompt produces better compliance. Editor-diff and editor-whole use the same underlying syntax as diff and whole but with an abbreviated system prompt.

Architect mode

How it works

Architect mode splits the coding task between two model calls. The architect model is shown the problem description and starter code and asked to reason about the solution and describe the changes needed in plain language. It does not produce code edits directly. Its output is a natural-language description of what should change and why.

That description is then passed to an editor model, which is given only the architect's plan and the original starter code. The editor's job is purely mechanical: translate the plan into syntactically correct file edits using editor-diff or editor-whole format. Because the editor prompt removes the cognitive load of problem-solving, it tends to produce cleaner, better-formatted edits.

The motivation is a division of cognitive labor. Strong reasoning models such as o1 or DeepSeek R1 are good at thinking through complex problems but expensive and sometimes unreliable at producing precisely formatted diffs. Cheaper, instruction-following models are highly reliable at generating correctly formatted edits but less powerful at reasoning. Combining them exploits the strengths of each.
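The two-call structure can be sketched as a small pipeline. The `architect` and `editor` callables are hypothetical stand-ins for the two model calls; the point of the sketch is only that the editor receives the plan and the starter code, never the original problem statement.

```python
from typing import Callable

# Hypothetical signatures, for illustration only:
#   architect(problem, starter_code) -> plain-language plan
#   editor(plan, starter_code) -> edited source text
Architect = Callable[[str, str], str]
Editor = Callable[[str, str], str]


def architect_then_edit(problem: str, starter_code: str,
                        architect: Architect, editor: Editor) -> str:
    """Run the reasoning step, then hand only its plan to the editing step."""
    plan = architect(problem, starter_code)
    return editor(plan, starter_code)


# Toy stand-ins for the two models.
def toy_architect(problem: str, starter_code: str) -> str:
    return "Replace `pass` in add with `return a + b`."


def toy_editor(plan: str, starter_code: str) -> str:
    return starter_code.replace("pass", "return a + b") if "return a + b" in plan else starter_code


starter = "def add(a, b):\n    pass\n"
print(architect_then_edit("Implement add.", starter, toy_architect, toy_editor))
```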

The R1+Sonnet SOTA (January 2025)

On January 24, 2025, Paul Gauthier reported that using DeepSeek R1 as the architect and Claude 3.5 Sonnet as the editor set a new state-of-the-art score on the benchmark: 64.0%. The previous record was o1 at 61.7%.

The cost difference was striking. Running o1 solo cost $186.50 for the full 225-problem run. The R1+Sonnet combination cost $13.29, about 14 times less, while scoring higher. This result demonstrated that architect mode could outperform a more expensive solo model, and that the division of labor between reasoning and formatting is practically valuable.

The same experiment showed that pairing o1 as architect with Sonnet as editor did not improve on o1 solo, suggesting the benefit of the approach depended on which model acted as the architect.

Configuration	Score	Edit format accuracy	Cost
R1 + Sonnet (architect)	64.0%	100.0%	$13.29
o1 solo (high)	61.7%	91.5%	$186.50
R1 solo	56.9%	96.9%	$5.42
Claude 3.5 Sonnet solo	51.6%	99.6%	$14.41

o3+GPT-4.1 architect (April 2025)

In April 2025, Paul Gauthier reported that using o3 (high) as architect and GPT-4.1 as editor produced a score of 83%, a new SOTA at the time. This result also reduced costs substantially compared to running o3 solo.

Refact.ai agent results (2025)

Refact.ai, an AI coding assistant company, adapted the benchmark methodology with an agentic approach that goes beyond Aider's two-attempt structure. Their agent uses up to 30 steps per problem, autonomously executes tests, reads failure output, revises code, and re-tests in a loop. In April 2025, Refact.ai reported their agent powered by Claude 3.7 Sonnet achieved 92.9% without extended thinking and 93.3% with thinking enabled, the highest scores reported on the benchmark as of mid-2025.

This result is not directly comparable to the standard single-model entries on the leaderboard because the agent can make many more corrective passes per problem than the two allowed in the standard protocol. It illustrates the upper bound of what iterative self-correction can achieve on the benchmark's problems.

Scoring system

Primary metric: percent correct

The main number reported for each model is the percentage of the 225 problems for which the model's output, after applying edits and running the test suite, passed all tests within two attempts. A problem passes if and only if every test case in the exercise's test suite passes. There is no partial credit for solving some tests but not others.
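The all-or-nothing rule can be expressed directly. In this sketch, `results` maps each problem to the pass/fail outcome of every test in its suite; the function name and the data are hypothetical.

```python
from typing import Dict, List


def percent_correct(results: Dict[str, List[bool]]) -> float:
    """Percentage of problems whose every test passed; no partial credit."""
    solved = sum(1 for tests in results.values() if tests and all(tests))
    return 100.0 * solved / len(results)


# Hypothetical outcomes for four problems.
results = {
    "problem-a": [True, True, True],
    "problem-b": [True, False, True],   # one failing test: counts as unsolved
    "problem-c": [True],
    "problem-d": [False, False],
}
print(percent_correct(results))  # 50.0
```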

Edit format accuracy

This secondary metric records what fraction of the model's responses were parseable and applicable. A response that produces syntactically broken diff blocks, incomplete fences, or otherwise malformed edit instructions counts as an edit format failure even if the underlying reasoning was correct. On the standard leaderboard, most frontier models achieve edit format accuracy above 90%. Some models, particularly those from smaller providers or open-source checkpoints, score lower here, which limits their effective performance regardless of how good their code reasoning is.

Cost

The leaderboard reports total cost in US dollars for completing all 225 problems. This figure is the sum of API charges for all prompt and completion tokens across both attempts for all problems, at the API rates current when the run was conducted. Researchers and developers use this to assess whether a model's score justifies its cost. DeepSeek V3.2 Chat, for example, scored 70.2% while costing $0.88 for the full run, making it a far more cost-efficient choice than o3-pro, which scored 84.9% at $146.32.

What the score does not measure

The benchmark measures whether edited code passes pre-existing test suites. It does not evaluate:

Whether the model wrote idiomatic code in the target language
Whether the solution would perform well on large inputs
Whether the code is readable or maintainable
Whether the model can work across multiple files in the same change
Debugging ability on pre-existing broken code
Code review or refactoring tasks

Leaderboard history

Initial release: December 2024

At launch on December 21, 2024, o1 scored at the top of the leaderboard with 61.7%, confirming the benchmark's intended calibration. The original top ten were:

Model	Score	Edit format accuracy
o1-2024-12-17 (high)	61.7%	91.5%
Claude 3.5 Sonnet (2024-10-22)	45.3%	100.0%
Gemini Exp 1206	38.2%	98.2%
o1-mini	32.9%	96.9%
Claude 3.5 Haiku	28.0%	91.1%
Gemini 2.0 Flash Exp	22.2%	100.0%
DeepSeek Chat V2.5	17.8%	92.9%
GPT-4o (2024-11-20)	15.1%	96.0%
Qwen2.5-Coder-32B	8.0%	71.6%
GPT-4o-mini	3.6%	100.0%

Early 2025: thinking models arrive

Through early 2025, the leaderboard updated rapidly as new models were released. DeepSeek R1 entered at 56.9%, and o3-mini scored 53.8% at medium compute and 60.4% at high. Claude 3.7 Sonnet without extended thinking scored 60.4%, matching o3-mini high while costing slightly less. Claude 3.7 Sonnet with 32,000 thinking tokens scored 64.9%, at that point the highest single-model score. GPT-4.5 Preview scored 44.9% at a very high cost of $183.18 for the run.

The R1+Sonnet architect combination scored 64.0% in January 2025, briefly setting the SOTA.

Mid-2025: reasoning models push above 80%

By mid-2025, leaderboard performance had jumped sharply as dedicated reasoning models matured. o3 reached 81.3%, and o3-pro at high compute reached 84.9% at a cost of $146.32. Gemini 2.5 Pro with 32k thinking tokens hit 83.1%.

Claude 4 Opus (with 32k thinking tokens) scored 72% when evaluated by Paul Gauthier in May 2025. Claude 4 Sonnet scored 61% under the same conditions. Gauthier noted that Claude 4 Sonnet appeared to underperform Claude 3.7 Sonnet on this benchmark.

DeepSeek V3.2 Chat scored 70.2% at an exceptionally low cost of $0.88 for the full run, making it the most cost-efficient competitive model on the leaderboard.

Late 2025: GPT-5 and Claude Opus 4.5 lead

GPT-5 from OpenAI became the top-scoring single-model entry on the official leaderboard, with 88.0% at high compute for $29.08 and 86.7% at medium compute. Anthropic separately reported that Claude Opus 4.5 achieved 89.4% on the benchmark; that figure is self-reported rather than a verified leaderboard entry. Behind GPT-5, the o3-pro (high) score of 84.9% remained the next-highest verified result on the official aider.chat leaderboard.

Full leaderboard snapshot (as of August 2025)

Rank	Model	Score	Cost	Organization
1	GPT-5 (high)	88.0%	$29.08	OpenAI
2	GPT-5 (medium)	86.7%	$17.69	OpenAI
3	o3-pro (high)	84.9%	$146.32	OpenAI
4	Gemini 2.5 Pro (32k think)	83.1%	$49.88	Google DeepMind
5	GPT-5 (low)	81.3%	$10.37	OpenAI
6	o3 (high)	81.3%	$21.23	OpenAI
7	Grok-4 (high)	79.6%	$59.62	xAI
8	Gemini 2.5 Pro (default think)	79.1%	$19.29	Google DeepMind
9	Claude 4 Opus (32k think)	72.0%	N/A	Anthropic
10	DeepSeek V3.2 Chat	70.2%	$0.88	DeepSeek
11	Claude 3.7 Sonnet (32k think)	64.9%	$36.83	Anthropic
12	R1 + Claude 3.5 Sonnet (architect)	64.0%	$13.29	Multiple
13	o1 (high)	61.7%	$186.50	OpenAI
14	Claude 4 Sonnet (32k think)	61.0%	N/A	Anthropic
15	Claude 3.7 Sonnet (no think)	60.4%	$17.72	Anthropic
16	o3-mini (high)	60.4%	$18.16	OpenAI
17	DeepSeek R1	56.9%	$5.42	DeepSeek
18	o3-mini (medium)	53.8%	$8.86	OpenAI
19	Claude 3.5 Sonnet (2024-10-22)	45.3%	$3.12	Anthropic
20	GPT-4.5 Preview	44.9%	$183.18	OpenAI
21	Gemini Exp 1206	38.2%	N/A	Google DeepMind
22	o1-mini	32.9%	$18.58	OpenAI
23	Claude 3.5 Haiku	28.0%	$6.06	Anthropic
24	GPT-4o (2024-08-06)	23.1%	$7.03	OpenAI
25	DeepSeek Chat V2.5	17.8%	$0.51	DeepSeek
26	GPT-4o-mini	3.6%	$0.32	OpenAI

Note: Cost figures reflect full runs of all 225 problems at API prices current when each model was tested. Agent systems (Refact.ai) that use more than two attempts per problem are listed separately from the standard leaderboard.

Value-for-cost analysis

Cost data from the leaderboard makes it possible to rank models not just by accuracy but by performance per dollar. At initial leaderboard construction, DeepSeek Chat V3 offered the highest value: 48.4% correct for $0.34, giving a value score many times higher than any competing model. Later, DeepSeek V3.2 at 70.2% for $0.88 maintained DeepSeek's position as the cost-efficiency leader for competitive-tier performance.
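One simple way to quantify value is percentage points per dollar of run cost. This definition is an assumption made for illustration, since the leaderboard does not publish a value score; the scores and costs below are the figures quoted in this article.

```python
from typing import List, Tuple

# (model, percent correct, cost of a full run in USD), as quoted in this article.
entries: List[Tuple[str, float, float]] = [
    ("GPT-5 (high)", 88.0, 29.08),
    ("o3-pro (high)", 84.9, 146.32),
    ("DeepSeek V3.2", 70.2, 0.88),
    ("DeepSeek Chat V3", 48.4, 0.34),
]


def points_per_dollar(score: float, cost: float) -> float:
    """Assumed value measure: percentage points earned per dollar spent."""
    return score / cost


ranked = sorted(entries, key=lambda e: points_per_dollar(e[1], e[2]), reverse=True)
for name, score, cost in ranked:
    print(f"{name}: {points_per_dollar(score, cost):.1f} points per dollar")
```

Under this measure the two DeepSeek entries rank first and second, while the highest-scoring models rank last, which is the tradeoff the cost column is meant to expose.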

For teams that need maximum accuracy and cost is secondary, GPT-5 (high) or o3-pro represent the top tier. For teams that want strong performance with a budget under $1, DeepSeek V3 variants have consistently led.

Comparison with other benchmarks

SWE-bench Verified

SWE-bench presents models with real GitHub issues from Python open-source repositories. A successful solve requires the model to understand an existing codebase, identify the root cause of a bug, make targeted edits across potentially many files, and pass a test suite that was written by the original project maintainers.

Dimension	Aider Polyglot	SWE-bench Verified
Languages	6 (C++, Go, Java, JS, Python, Rust)	Python only
Problem type	Self-contained implementation exercises	Real-world bug fixes in existing repos
Files per problem	Typically 1	Potentially many
Test suites	Written by Exercism contributors	Written by original project maintainers
Contamination risk	Moderate (Exercism is publicly indexed)	Higher (popular repos likely in training data)
Problem count	225	~500 verified
Task format	Implement from scratch; edit starter code	Locate bug; patch across files
Cost per run	$0.32 to over $180 depending on model	Higher due to long repo context

The two benchmarks measure related but distinct skills. SWE-bench tests multi-file debugging and navigating unfamiliar codebases. Aider Polyglot tests clean-room implementation and edit-format compliance. A model can score high on one and lower on the other. In practice, models that excel at reasoning tend to do well on both, but the correlation is imperfect.

One criticism noted by Refact.ai and others is that SWE-bench's Python-only scope and its heavy use of popular open-source projects create meaningful contamination risk: many of those repositories and issues were likely present in training data. The Polyglot benchmark's Exercism exercises are also publicly available, but the specific 225 problems chosen are less likely to have appeared in specialized coding fine-tuning datasets.

LiveCodeBench

LiveCodeBench collects new problems continuously from competitive programming contests on LeetCode, AtCoder, and Codeforces. Because problems are tagged with their release date, evaluations can be restricted to problems released after a model's training cutoff, making contamination nearly impossible. The benchmark tests four scenarios: code generation, self-repair, code execution, and test output prediction.
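The date-based filter can be sketched as follows. The problem records and the cutoff date are hypothetical, and the sketch only illustrates the idea of evaluating on problems released after a model's training cutoff; it is not LiveCodeBench's own tooling.

```python
from datetime import date
from typing import Dict, List

# Hypothetical problem records tagged with release dates.
problems: List[Dict] = [
    {"id": "p1", "released": date(2024, 3, 1)},
    {"id": "p2", "released": date(2024, 9, 15)},
    {"id": "p3", "released": date(2025, 1, 20)},
]


def unseen_problems(problems: List[Dict], training_cutoff: date) -> List[str]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p["id"] for p in problems if p["released"] > training_cutoff]


print(unseen_problems(problems, date(2024, 6, 30)))  # ['p2', 'p3']
```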

Dimension	Aider Polyglot	LiveCodeBench
Problem source	Exercism exercises	Competitive programming contests
Languages	6	Primarily Python, C++, Java
Problem type	Software engineering exercises	Algorithmic competition problems
Contamination control	Static set, moderate risk	Rolling release, very low risk
Measures editing	Yes	No (generation only in main track)
Cost to run	Low to moderate	Moderate
Problem count	225	1055+ (v6)

LiveCodeBench is widely considered the most contamination-resistant coding signal among regularly used benchmarks, because it continuously introduces unseen problems. However, competitive programming problems emphasize algorithmic reasoning in ways that can diverge from typical software engineering work. Aider Polyglot's Exercism problems are closer to the kind of implementation tasks a developer might face in practice, even if Exercism's exercises are not drawn from real production codebases.

HumanEval

HumanEval is a Python-only benchmark of 164 hand-written function completion problems, released by OpenAI in 2021. It is widely considered saturated: most frontier models score above 90%. Aider Polyglot's multi-language, harder-filtered design is a direct response to the same kind of saturation that eventually affected HumanEval.

The original Aider code editing benchmark

The predecessor benchmark used 133 Python Exercism exercises. Its top score at the time of Polyglot's launch was 84.2%, and as noted above, scores were bunched near the top. Aider Polyglot expanded the scope from one language to six and filtered for difficulty, pushing the initial top score down to 61.7% and restoring the headroom needed to separate frontier models.

Dimension	Aider Code Editing Benchmark	Aider Polyglot
Languages	Python only	6 languages
Problem count	133	225
Top score at launch	84.2%	61.7%
Difficulty filter	None	Solved by 3 or fewer of 7 reference models
Release	2023	December 21, 2024

Industry impact

Adoption by model developers

Since its launch, Aider Polyglot has been cited in official announcements from Anthropic, OpenAI, Google DeepMind, and DeepSeek as evidence of coding ability. It has become one of a small set of coding benchmarks that major labs test against before releasing new models. Paul Gauthier typically updates the leaderboard within days of a major model release, and the benchmark run results often appear in lab blog posts and social media announcements alongside SWE-bench and LiveCodeBench scores.

Role in competitive model evaluation

The benchmark is useful to the research community because it combines three properties rarely found together in a single benchmark: multi-language coverage, a realistic edit-format requirement, and a cost metric. This means it rewards models that can both reason about code and reliably produce correctly structured output, which matters for practical tool integration.

The benchmark has also become a reference point for companies building coding agents. Refact.ai's blog post documenting their 92.9% agent result used the benchmark to argue their agent's superiority over both Aider itself and standard model-only approaches. The benchmark's public accessibility (the full problem set is on GitHub, and the harness is open source) makes it reproducible for any team that wants to verify results or benchmark their own systems.

Influence on edit format research

Aider's work on edit formats, documented in part through benchmark results, influenced subsequent academic research on code editing representations. The 2025 paper "Robust Learning of Diverse Code Edits" (arXiv:2503.03656) benchmarked multiple edit formats and confirmed that search-replace representations outperform unified diffs and structured formats for most capable models, consistent with the patterns Gauthier observed in Aider benchmark results.

Cost transparency

The benchmark's inclusion of per-run cost is unusual among coding benchmarks and has proven practically useful. The appearance of DeepSeek's V3 models on the leaderboard, with competitive scores and sub-dollar run costs, drew significant developer attention to those models in early-to-mid 2025. The cost column effectively acts as an efficiency frontier, making it easy to identify which models offer the best tradeoff between capability and expense.

Limitations

Single-file tasks

Each Polyglot problem requires changes to a single source file. Real software engineering work frequently involves changes across many files simultaneously: updating an interface, modifying multiple callers, adjusting tests and documentation. The benchmark does not evaluate this multi-file coordination ability, which is a meaningful part of what coding agents need to do.

Exercism-specific style

All problems come from one platform with consistent formatting and style conventions. A model that has seen many Exercism exercises in training may benefit from recognizing that style, making the benchmark less of a pure test of general coding ability. The filter for hard problems reduces but does not eliminate this concern.

Static problem set

Unlike LiveCodeBench, the 225 Polyglot problems are fixed. As models are further trained with Exercism data, or as benchmark results enter the public record and become training data themselves, contamination risk grows over time. The benchmark's designers acknowledged this, intending the current calibration to remain useful for at least several years given the difficulty of the selected problems.

Two-attempt cap

The two-attempt structure is a practical compromise between evaluation thoroughness and cost, but it limits what the benchmark measures. In real development, a programmer iterates many more times, reading error messages, searching documentation, and revising code incrementally. The benchmark rewards models that can reason correctly on the first or second pass but does not measure sustained iterative refinement.

Language distribution

The 225 problems are not evenly distributed across the six languages. JavaScript and Java together make up nearly 43% of the set, while C++ accounts for only 11.6%. A model that is unusually strong at JavaScript and Java but weak at Rust will benefit more from this distribution than one with uniform multilingual ability.
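The effect of the uneven distribution can be checked with a short calculation. The per-language problem counts come from the benchmark; the per-language pass rates for the two models are hypothetical and chosen only to show how weighting by problem count can reorder models.

```python
counts = {"JavaScript": 49, "Java": 47, "Go": 39, "Python": 34, "Rust": 30, "C++": 26}
total = sum(counts.values())  # 225

# Share of the benchmark contributed by each language, in percent.
shares = {lang: 100 * n / total for lang, n in counts.items()}
print(round(shares["JavaScript"] + shares["Java"], 1))  # 42.7
print(round(shares["C++"], 1))                          # 11.6


def benchmark_score(pass_rates: dict) -> float:
    """Overall score: pass rates weighted by each language's problem count."""
    return sum(counts[lang] * pass_rates[lang] for lang in counts) / total


def unweighted_mean(pass_rates: dict) -> float:
    """Plain average of the six per-language pass rates."""
    return sum(pass_rates.values()) / len(pass_rates)


# Hypothetical models: A is strong in JavaScript and Java only, B is uniform.
model_a = {"JavaScript": 0.9, "Java": 0.9, "Go": 0.5, "Python": 0.5, "Rust": 0.5, "C++": 0.5}
model_b = {lang: 0.65 for lang in counts}

print(benchmark_score(model_a) > benchmark_score(model_b))  # True
print(unweighted_mean(model_a) < unweighted_mean(model_b))  # True
```

Model A wins on the benchmark (about 67% versus 65%) even though its unweighted per-language average is lower (about 63% versus 65%).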

No open-ended task evaluation

All problems have a predefined test suite, so the benchmark evaluates whether a model can produce code that passes specific pre-written tests. It does not evaluate whether a model can write good tests, design APIs, write readable code, handle ambiguous specifications, or produce well-documented implementations.

References

Paul Gauthier. "o1 tops aider's new polyglot leaderboard." Aider Blog, December 21, 2024. https://aider.chat/2024/12/21/polyglot.html
Paul Gauthier. "R1+Sonnet set SOTA on aider's polyglot benchmark." Aider Blog, January 24, 2025. https://aider.chat/2025/01/24/r1-sonnet.html
Paul Gauthier. "Aider LLM Leaderboards." aider.chat. https://aider.chat/docs/leaderboards/
Refact.ai. "Refact.ai Agent + Claude 3.7 Sonnet tops Aider's polyglot benchmark." Refact.ai Blog, 2025. https://refact.ai/blog/2025/refact-ai-agent-claude-3-7-sonnet-ranked-1-aider-polyglot/
Aider-AI. "Aider Polyglot Benchmark Repository." GitHub. https://github.com/Aider-AI/polyglot-benchmark
Paul Gauthier. "Unified diffs make GPT-4 Turbo 3X less lazy." aider.chat. https://aider.chat/docs/unified-diffs.html
Paul Gauthier. "Edit formats." aider.chat documentation. https://aider.chat/docs/more/edit-formats.html
"Robust Learning of Diverse Code Edits." arXiv:2503.03656, March 2025. https://arxiv.org/pdf/2503.03656
Paul Gauthier. "Qwen3 benchmark results." Aider Blog, May 8, 2025. https://aider.chat/2025/05/08/qwen3.html
LLM Stats. "Aider-Polyglot Benchmark Leaderboard." https://llm-stats.com/benchmarks/aider-polyglot
Epoch AI. "Aider Polyglot Benchmark." https://epoch.ai/benchmarks/aider-polyglot
LiveCodeBench. "Holistic and Contamination Free Evaluation of Large Language Models for Code." arXiv:2403.07974. https://arxiv.org/abs/2403.07974
Paul Gauthier. Twitter/X post on Claude 4 Opus scores. May 2025. https://x.com/paulgauthier/status/1926773685597172151
