Aider Polyglot
Last reviewed
May 7, 2026
Sources
13 citations
Review status
Source-backed
Revision
v2 · 5,052 words
| Aider Polyglot | |
|---|---|
| Overview | |
| Full name | Aider Polyglot Coding Benchmark |
| Abbreviation | Aider Polyglot |
| Description | A challenging multi-language code generation and editing benchmark testing LLMs on 225 difficult Exercism coding exercises across six programming languages |
| Release date | 2024-12-21 |
| Latest version | 1.0 |
| Benchmark updated | 2025-08-13 |
| Authors | Paul Gauthier |
| Organization | Aider AI |
| Technical Details | |
| Type | Code Generation, Code Editing |
| Modality | Text, Code |
| Task format | Code completion and file editing |
| Number of tasks | 225 |
| Total examples | 225 |
| Evaluation metric | Percent Correct, Edit Format Accuracy, Cost |
| Domains | Software Engineering, Programming |
| Languages | C++, Go, Java, JavaScript, Python, Rust |
| Performance | |
| Human performance | Not reported |
| Baseline | 3.6% (GPT-4o-mini) |
| SOTA score | 93.3% (agent system) |
| SOTA model | Refact.ai Agent (Claude 3.7 Sonnet, with thinking) |
| Resources | |
| Website | Official leaderboard |
| GitHub | polyglot-benchmark repo |
| Dataset | Download |
| Predecessor | Aider Code Editing Benchmark |
Aider Polyglot is a coding benchmark that evaluates large language models on their ability to write and edit code across six programming languages. Released on December 21, 2024, by Aider creator Paul Gauthier, the benchmark presents 225 carefully selected problems drawn from Exercism, an open-source platform for coding practice. The exercises span C++, Go, Java, JavaScript, Python, and Rust, chosen specifically because the hardest problems in those six languages resisted easy solution by frontier models at the time of the benchmark's construction. The benchmark is maintained at aider.chat/docs/leaderboards and updated whenever new results become available.
Aider is an open-source AI pair programming tool that runs in the terminal and integrates with a local git repository. It allows developers to describe coding tasks in plain language, and then the tool works with an LLM to write or modify the source files, commit the changes, and track revision history. Because Aider's primary job is to edit existing files rather than generate standalone snippets, evaluating it requires a benchmark that measures code editing ability alongside raw coding ability.
Before the Polyglot benchmark existed, Aider used a Python-only benchmark built from Exercism's Python exercise catalog. That benchmark contained 133 exercises. By late 2024 it had saturated: Claude 3.5 Sonnet scored 84.2% by solving 112 of the 133 exercises, leaving only 21 unsolved. With scores bunched near the top, the benchmark could no longer distinguish between frontier models. A model could jump from 80% to 85% without that jump revealing anything meaningful about its practical coding capability.
Paul Gauthier designed Aider Polyglot to recalibrate the scale. The stated goal was to produce a benchmark where today's top models would occupy a wide range of scores, roughly between 5% and 50%, giving enough spread to track progress clearly and leaving headroom for future improvement. Making the benchmark multi-language was an intentional design choice: a model that can write competent Python may struggle with Rust's ownership semantics or Go's concurrency patterns, so a polyglot benchmark tests a broader slice of coding ability than any single-language dataset.
All 225 problems come from Exercism, a free, open-source platform that hosts hand-crafted exercises in dozens of programming languages. Each exercise on Exercism ships with a starter file, a complete test suite, and a prose description of the task. Because Exercism exercises are designed for human learners, they tend to have clear problem statements and reliable test suites, two properties that make automated evaluation straightforward.
The benchmark draws problems from six of Exercism's language tracks: C++, Go, Java, JavaScript, Python, and Rust. These were chosen because they represent a range of paradigms, from Go's statically typed concurrent style to Python's dynamic duck typing, from Java's object-oriented structure to Rust's compile-time memory safety guarantees.
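Exercism hosts 697 exercises across the six chosen languages. Rather than using all of them, Gauthier applied an empirical filter to select only the hardest subset: each problem was attempted by seven reference models, and only those solved by three or fewer of them were kept.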
This process yielded the final set of 225 problems. The language breakdown in the final set is: JavaScript (49 problems), Java (47), Go (39), Python (34), Rust (30), and C++ (26).
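The selection step can be sketched in a few lines of Python. This is a minimal illustration, assuming the filter is "keep a problem if three or fewer of seven reference models solved it" (the threshold given in the comparison table later in this article). The exercise names and solve counts are invented, and `select_hard` is a hypothetical helper, not part of Aider's harness.

```python
from collections import Counter

MAX_SOLVERS = 3  # keep a problem only if at most this many reference models solved it


def select_hard(solve_counts, max_solvers=MAX_SOLVERS):
    """Return the (language, exercise) keys solved by at most `max_solvers` models."""
    return sorted(key for key, solved_by in solve_counts.items() if solved_by <= max_solvers)


# Hypothetical data: (language, exercise) -> number of reference models (out of 7) that solved it.
solve_counts = {
    ("python", "exercise-a"): 7,
    ("python", "exercise-b"): 2,
    ("rust", "exercise-c"): 0,
    ("go", "exercise-d"): 4,
    ("java", "exercise-e"): 3,
}

hard = select_hard(solve_counts)
per_language = Counter(language for language, _ in hard)
print(hard)
print(dict(per_language))
```

Because the filter is applied per problem rather than per language, the surviving set does not have to be balanced across languages, which is consistent with the uneven breakdown above.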
Every model gets two attempts on each problem. On the first attempt the model receives the problem description and the starter code. It then produces a set of file edits. Those edits are applied to the source files and the exercise's test suite is executed automatically.
If all tests pass on the first attempt, the problem is scored as solved. If any tests fail, the model is shown the test failure output and given a second attempt. The score reported in the leaderboard is the percentage of problems for which the model's solution passed all tests within two attempts. This two-attempt structure tests not just a model's ability to write correct code from scratch, but also its ability to read error messages, diagnose what went wrong, and produce a corrected edit.
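The two-attempt protocol can be sketched as a short loop. This is an illustration under stated assumptions rather than Aider's harness: `solve` and `run_tests` are hypothetical stand-ins for the model call and the exercise's test suite, and the toy model below is invented to show the retry path.

```python
def run_problem(solve, run_tests, max_attempts=2):
    """Return True if the model's edits pass every test within `max_attempts` tries.

    `solve(feedback)` and `run_tests(code)` are hypothetical callables, not Aider functions.
    """
    feedback = None  # the first attempt sees only the description and starter code
    for _ in range(max_attempts):
        code = solve(feedback)
        passed, output = run_tests(code)
        if passed:
            return True
        feedback = output  # the next attempt also sees the failing test output
    return False


def percent_correct(results):
    """Share of problems solved, as a percentage rounded to one decimal place."""
    return round(100 * sum(results) / len(results), 1)


# Toy "model" that only produces the right code after seeing an error message.
def toy_solve(feedback):
    return "return a + b" if feedback else "return a - b"


def toy_tests(code):
    ok = code == "return a + b"
    return ok, ("" if ok else "AssertionError: add(1, 2) != 3")


solved_in_two = run_problem(toy_solve, toy_tests)
solved_in_one = run_problem(toy_solve, toy_tests, max_attempts=1)
print(solved_in_two, solved_in_one)
print(percent_correct([True, False, True, True]))
```

The toy model fails when limited to one attempt and succeeds with two, which is the capability the second attempt is meant to capture.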
Aider does not ask models to write complete files from scratch. Instead, it asks them to produce structured edit instructions, search-and-replace blocks or diff-style patches, that the tool then applies to the existing source files. The benchmark measures edit format accuracy as a secondary metric: what percentage of the time did the model produce a syntactically valid edit that could be parsed and applied?
Edit format accuracy matters in practice because a model that writes a correct solution but mangles the diff syntax produces no usable change. A model with 100% edit format accuracy and 60% correct solutions can be more useful in day-to-day work than one that solves 65% of problems but reaches only 90% edit format accuracy, because the second model regularly produces edits that cannot be applied.
Beyond percent correct and edit format accuracy, the leaderboard also records the total cost of running each model through all 225 problems. This cost figure includes both prompt tokens (the problem description, starter code, and, on the second attempt, the test failure output) and completion tokens (the model's edit instructions). Cost data lets developers compare models on efficiency: a model that scores 60% for $13 is a very different proposition from one that scores 62% for $187.
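As a rough sketch of how such a cost figure is composed, the function below multiplies token counts by per-million-token prices. The token totals and prices are made-up values for illustration only, and `run_cost` is a hypothetical helper rather than anything from Aider's code.

```python
def run_cost(prompt_tokens, completion_tokens, price_in_per_million, price_out_per_million):
    """Total API cost in dollars for one benchmark run."""
    prompt_cost = prompt_tokens / 1_000_000 * price_in_per_million
    completion_cost = completion_tokens / 1_000_000 * price_out_per_million
    return prompt_cost + completion_cost


# Hypothetical run: 2M prompt tokens at $3 per million, 0.5M completion tokens at $15 per million.
cost = run_cost(2_000_000, 500_000, 3.00, 15.00)
print(f"${cost:.2f}")
```

Completion tokens are usually priced higher than prompt tokens, so models that emit long reasoning traces or whole-file rewrites tend to cost more per run even when the prompts are identical.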
The benchmark also tracks: malformed response counts, syntax errors in generated code, context window exhaustion events, and test timeouts.
When Aider sends a coding task to an LLM, it includes instructions for how the model should structure its response. These instructions describe an "edit format," a convention for expressing which lines of which files should change and how. Different edit formats suit different models. Aider automatically picks the format expected to work best for each model, but the benchmark records which format was used so comparisons are apples-to-apples.
The whole format is the simplest: the model returns the entire updated file. There is no diff or search/replace block, just the complete new contents of the file inside a fenced code block with the file path above it. This format is reliable because it does not require the model to produce structurally valid patch syntax. The drawback is that the model must regenerate the entire file even if only a few lines change, which wastes tokens and raises cost.
The diff format uses search-and-replace blocks. The model specifies the exact text to find in the file and the text to replace it with. Each block is fenced with markers that Aider parses. This format is more token-efficient than whole because the model only needs to emit the changed sections. Research from the Aider team and an independent academic study ("Robust Learning of Diverse Code Edits," 2025) found that search-replace is the most effective format overall for large capable models, because each replacement stands on its own: an error in one block does not invalidate the rest of the response.
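The sketch below shows the idea of parsing and applying search-and-replace blocks, including the property that one bad block does not spoil the others. It is a simplified illustration that assumes exact, unique text matches; Aider's real parser handles file paths and fences and may be more forgiving. `parse_blocks` and `apply_blocks` are hypothetical names.

```python
SEARCH, DIVIDER, REPLACE = "<<<<<<< SEARCH", "=======", ">>>>>>> REPLACE"


def parse_blocks(response):
    """Extract (search, replace) text pairs from a model response."""
    blocks, search, replace, state = [], [], [], None
    for line in response.splitlines():
        if line == SEARCH:
            search, replace, state = [], [], "search"
        elif line == DIVIDER and state == "search":
            state = "replace"
        elif line == REPLACE and state == "replace":
            blocks.append(("\n".join(search), "\n".join(replace)))
            state = None
        elif state == "search":
            search.append(line)
        elif state == "replace":
            replace.append(line)
    return blocks


def apply_blocks(source, blocks):
    """Apply each block independently; return the new text and how many blocks failed."""
    failed = 0
    for search, replace in blocks:
        if search and source.count(search) == 1:
            source = source.replace(search, replace)
        else:
            failed += 1  # missing or ambiguous match: skip this block, keep the others
    return source, failed


source = "def add(a, b):\n    return a - b\n\ndef double(x):\n    return x * 2\n"
response = "\n".join([
    "<<<<<<< SEARCH",
    "    return a - b",
    "=======",
    "    return a + b",
    ">>>>>>> REPLACE",
    "<<<<<<< SEARCH",
    "    return x * 3",  # does not occur in the file, so this block fails
    "=======",
    "    return x + x",
    ">>>>>>> REPLACE",
])

patched, failed = apply_blocks(source, parse_blocks(response))
print(patched)
print("failed blocks:", failed)
```

In this example the first block is applied and the second is rejected, so the response still yields a usable partial edit.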
The udiff format is based on the standard unified diff output produced by git diff. Aider modified the standard format by omitting line numbers from hunk headers, since requiring accurate line numbers introduces brittle failure modes when the model miscounts. The unified diff approach was originally developed to address a specific problem with GPT-4 Turbo: that model tended toward "lazy coding," inserting placeholder comments like "...add logic here..." rather than writing the full implementation. The familiar unified diff syntax, with its + and - line prefixes, had extensive representation in the model's training data, and prompting the model to produce diffs in that style reduced the lazy-coding rate by roughly a factor of three. On Aider's refactoring benchmark, which was built to provoke and measure lazy coding, switching GPT-4 Turbo from search/replace blocks to unified diffs improved its score from about 20% to 61%.
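One way to apply a hunk that carries no line numbers is to locate its "before" lines (context plus removed lines) in the file and swap in its "after" lines (context plus added lines). The sketch below illustrates that idea under the assumption of an exact, unique match; it is not Aider's implementation, and `apply_hunk` is a hypothetical helper.

```python
def apply_hunk(source_lines, hunk_lines):
    """Apply one unified-diff hunk without relying on line numbers."""
    # Context (' ') and removed ('-') lines describe the text that must already exist.
    before = [line[1:] for line in hunk_lines if line[:1] in (" ", "-")]
    # Context (' ') and added ('+') lines describe the text that should replace it.
    after = [line[1:] for line in hunk_lines if line[:1] in (" ", "+")]
    n = len(before)
    matches = [
        i for i in range(len(source_lines) - n + 1)
        if source_lines[i:i + n] == before
    ]
    if len(matches) != 1:
        raise ValueError(f"hunk matched {len(matches)} locations, expected exactly 1")
    start = matches[0]
    return source_lines[:start] + after + source_lines[start + n:]


source_lines = [
    "def greet(name):",
    "    # ...add logic here...",
    "    pass",
]
hunk_lines = [
    "@@ ... @@",  # header without line numbers; ignored by the filters above
    " def greet(name):",
    "-    # ...add logic here...",
    "-    pass",
    '+    return f"Hello, {name}!"',
]

print("\n".join(apply_hunk(source_lines, hunk_lines)))
```

Matching on content rather than position means a model that miscounts lines can still produce an applicable edit, at the cost of failing when the context is ambiguous.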
The diff-fenced format is a variant of the diff format that places the file path inside the fenced code block rather than above it. It was introduced specifically for Gemini models, which frequently failed to comply with the standard fencing used in the regular diff format. Gemini models are often capable of search-and-replace code editing, but their formatting habits caused parse failures with the standard convention.
The editor-diff and editor-whole formats are streamlined versions of diff and whole intended for use in architect mode. In architect mode, the editing task is split between two models: an architect and an editor. The editor receives a narrower prompt focused purely on writing syntactically correct edits rather than on problem-solving. Because the editor's role is mechanical rather than creative, a simpler prompt produces better compliance. The two formats use the same underlying syntax as diff and whole but with an abbreviated system prompt.
Architect mode splits the coding task between two model calls. The architect model is shown the problem description and starter code and asked to reason about the solution and describe the changes needed in plain language. It does not produce code edits directly. Its output is a natural-language description of what should change and why.
That description is then passed to an editor model, which is given only the architect's plan and the original starter code. The editor's job is purely mechanical: translate the plan into syntactically correct file edits using editor-diff or editor-whole format. Because the editor prompt removes the cognitive load of problem-solving, it tends to produce cleaner, better-formatted edits.
The motivation is a division of cognitive labor. Strong reasoning models such as o1 or DeepSeek R1 are good at thinking through complex problems but expensive and sometimes unreliable at producing precisely formatted diffs. Cheaper, instruction-following models are highly reliable at generating correctly formatted edits but less powerful at reasoning. Combining them exploits the strengths of each.
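The two-stage flow can be sketched as a pair of function calls. This is a schematic illustration only: `architect` and `editor` are hypothetical callables standing in for two separate model calls, and the toy versions below are hard-coded to show what each stage receives and returns.

```python
def architect_then_edit(problem, starter_code, architect, editor):
    """Run the architect/editor pipeline and return the edited code."""
    plan = architect(problem, starter_code)  # natural-language plan, no file edits
    return editor(plan, starter_code)        # formatted edits based only on plan + code


# Toy stand-ins for the two model calls (hypothetical, for illustration).
def toy_architect(problem, starter_code):
    return "In add(), replace the subtraction with an addition."


def toy_editor(plan, starter_code):
    # A whole-file style answer: return the complete updated file.
    return starter_code.replace("a - b", "a + b")


starter = "def add(a, b):\n    return a - b\n"
edited = architect_then_edit("Make add() return the sum.", starter, toy_architect, toy_editor)
print(edited)
```

Note that the editor never sees the original problem statement in this sketch, only the plan and the starter code, which mirrors the narrower prompt described above.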
On January 24, 2025, Paul Gauthier reported that using DeepSeek R1 as the architect and Claude 3.5 Sonnet as the editor set a new state-of-the-art score on the benchmark: 64.0%. The previous record was o1 at 61.7%.
The cost difference was striking. Running o1 solo cost $186.50 for the full 225-problem run. The R1+Sonnet combination cost $13.29, about 14 times less, while scoring higher. This result demonstrated that architect mode could outperform a more expensive solo model, and that the division of labor between reasoning and formatting is practically valuable.
The same experiment showed that pairing o1 as architect with Sonnet as editor did not improve on o1 solo, suggesting the benefit of the approach depended on which model acted as the architect.
| Configuration | Score | Edit format accuracy | Cost |
|---|---|---|---|
| R1 + Sonnet (architect) | 64.0% | 100.0% | $13.29 |
| o1 solo (high) | 61.7% | 91.5% | $186.50 |
| R1 solo | 56.9% | 96.9% | $5.42 |
| Claude 3.5 Sonnet solo | 51.6% | 99.6% | $14.41 |
In April 2025, Paul Gauthier reported that using o3 (high) as architect and GPT-4.1 as editor produced a score of 83%, a new SOTA at the time. This result also reduced costs substantially compared to running o3 solo.
Refact.ai, an AI coding assistant company, adapted the benchmark methodology with an agentic approach that goes beyond Aider's two-attempt structure. Their agent uses up to 30 steps per problem, autonomously executes tests, reads failure output, revises code, and re-tests in a loop. In April 2025, Refact.ai reported their agent powered by Claude 3.7 Sonnet achieved 92.9% without extended thinking and 93.3% with thinking enabled, the highest scores reported on the benchmark as of mid-2025.
This result is not directly comparable to the standard single-model entries on the leaderboard because the agent can make many more corrective passes per problem than the two allowed in the standard protocol. It illustrates the upper bound of what iterative self-correction can achieve on the benchmark's problems.
The main number reported for each model is the percentage of the 225 problems for which the model's output, after applying edits and running the test suite, passed all tests within two attempts. A problem passes if and only if every test case in the exercise's test suite passes. There is no partial credit for solving some tests but not others.
This secondary metric records what fraction of the model's responses were parseable and applicable. A response that produces syntactically broken diff blocks, incomplete fences, or otherwise malformed edit instructions counts as an edit format failure even if the underlying reasoning was correct. On the standard leaderboard, most frontier models achieve edit format accuracy above 90%. Some models, particularly those from smaller providers or open-source checkpoints, score lower here, which limits their effective performance regardless of how good their code reasoning is.
The leaderboard reports total cost in US dollars for completing all 225 problems. This figure is the sum of API charges for all prompt and completion tokens across both attempts for all problems, at the API rates current when the run was conducted. Researchers and developers use this to assess whether a model's score justifies its cost. DeepSeek V3.2 Chat, for example, scored 70.2% at a cost of $0.88 for the full run, making it far more cost-efficient than o3-pro, which scored 84.9% at $146.32.
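The benchmark measures whether edited code passes pre-existing test suites. It does not evaluate code readability, test writing, API design, handling of ambiguous specifications, or documentation quality.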
At launch on December 21, 2024, o1 scored at the top of the leaderboard with 61.7%, confirming the benchmark's intended calibration. The original top ten were:
| Model | Score | Edit format accuracy |
|---|---|---|
| o1-2024-12-17 (high) | 61.7% | 91.5% |
| Claude 3.5 Sonnet (2024-10-22) | 45.3% | 100.0% |
| Gemini Exp 1206 | 38.2% | 98.2% |
| o1-mini | 32.9% | 96.9% |
| Claude 3.5 Haiku | 28.0% | 91.1% |
| Gemini 2.0 Flash Exp | 22.2% | 100.0% |
| DeepSeek Chat V2.5 | 17.8% | 92.9% |
| GPT-4o (2024-11-20) | 15.1% | 96.0% |
| Qwen2.5-Coder-32B | 8.0% | 71.6% |
| GPT-4o-mini | 3.6% | 100.0% |
Through early 2025, the leaderboard updated rapidly as new models were released. DeepSeek R1 entered at 56.9%, and o3-mini scored 53.8% at medium compute and 60.4% at high. Claude 3.7 Sonnet without extended thinking scored 60.4%, matching o3-mini high while costing slightly less. Claude 3.7 Sonnet with 32,000 thinking tokens scored 64.9%, at that point the highest single-model score. GPT-4.5 Preview scored 44.9% at a very high cost of $183.18 for the run.
The R1+Sonnet architect combination scored 64.0% in January 2025, briefly setting the SOTA.
From spring 2025 onward, leaderboard scores jumped as dedicated reasoning models matured. o3 reached 81.3%, and o3-pro at high compute reached 84.9% at a cost of $146.32. Gemini 2.5 Pro with 32k thinking tokens hit 83.1%.
Claude 4 Opus (with 32k thinking tokens) scored 72% when evaluated by Paul Gauthier in May 2025. Claude 4 Sonnet scored 61% under the same conditions. Gauthier noted that Claude 4 Sonnet appeared to underperform Claude 3.7 Sonnet on this benchmark.
DeepSeek V3.2 Chat scored 70.2% at an exceptionally low cost of $0.88 for the full run, making it the most cost-efficient competitive model on the leaderboard.
GPT-5 from OpenAI became the top-scoring single-model entry, with 88.0% at high compute and $29.08 cost, and 86.7% at medium compute. Claude Opus 4.5 was reported by Anthropic as achieving 89.4% on the benchmark, leading the leaderboard by its own reporting. The o3-pro (high) score of 84.9% remained the next verified result on the official aider.chat leaderboard.
| Rank | Model | Score | Cost | Organization |
|---|---|---|---|---|
| 1 | GPT-5 (high) | 88.0% | $29.08 | OpenAI |
| 2 | GPT-5 (medium) | 86.7% | $17.69 | OpenAI |
| 3 | o3-pro (high) | 84.9% | $146.32 | OpenAI |
| 4 | Gemini 2.5 Pro (32k think) | 83.1% | $49.88 | Google DeepMind |
| 5 | GPT-5 (low) | 81.3% | $10.37 | OpenAI |
| 6 | o3 (high) | 81.3% | $21.23 | OpenAI |
| 7 | Grok-4 (high) | 79.6% | $59.62 | xAI |
| 8 | Gemini 2.5 Pro (default think) | 79.1% | $19.29 | Google DeepMind |
| 9 | Claude 4 Opus (32k think) | 72.0% | N/A | Anthropic |
| 10 | DeepSeek V3.2 Chat | 70.2% | $0.88 | DeepSeek |
| 11 | Claude 3.7 Sonnet (32k think) | 64.9% | $36.83 | Anthropic |
| 12 | R1 + Claude 3.5 Sonnet (architect) | 64.0% | $13.29 | Multiple |
| 13 | o1 (high) | 61.7% | $186.50 | OpenAI |
| 14 | Claude 4 Sonnet (32k think) | 61.0% | N/A | Anthropic |
| 15 | Claude 3.7 Sonnet (no think) | 60.4% | $17.72 | Anthropic |
| 16 | o3-mini (high) | 60.4% | $18.16 | OpenAI |
| 17 | DeepSeek R1 | 56.9% | $5.42 | DeepSeek |
| 18 | o3-mini (medium) | 53.8% | $8.86 | OpenAI |
| 19 | Claude 3.5 Sonnet (2024-10-22) | 45.3% | $3.12 | Anthropic |
| 20 | GPT-4.5 Preview | 44.9% | $183.18 | OpenAI |
| 21 | Gemini Exp 1206 | 38.2% | N/A | Google DeepMind |
| 22 | o1-mini | 32.9% | $18.58 | OpenAI |
| 23 | Claude 3.5 Haiku | 28.0% | $6.06 | Anthropic |
| 24 | GPT-4o (2024-08-06) | 23.1% | $7.03 | OpenAI |
| 25 | DeepSeek Chat V2.5 | 17.8% | $0.51 | DeepSeek |
| 26 | GPT-4o-mini | 3.6% | $0.32 | OpenAI |
Note: Cost figures reflect full runs of all 225 problems at API prices current when each model was tested. Agent systems (Refact.ai) that use more than two attempts per problem are listed separately from the standard leaderboard.
Cost data from the leaderboard makes it possible to rank models not just by accuracy but by performance per dollar. At initial leaderboard construction, DeepSeek Chat V3 offered the highest value: 48.4% correct for $0.34, giving a value score many times higher than any competing model. Later, DeepSeek V3.2 at 70.2% for $0.88 maintained DeepSeek's position as the cost-efficiency leader for competitive-tier performance.
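A performance-per-dollar ranking can be reproduced from the scores and costs quoted in this article. The sketch below uses "percentage points per dollar" as one possible definition of value (the article does not define its value score, so this metric is an assumption) and also identifies which models are not beaten on both score and cost by any other model in the sample.

```python
# (model, percent correct, cost in dollars), taken from figures quoted in this article.
results = [
    ("GPT-5 (high)", 88.0, 29.08),
    ("o3-pro (high)", 84.9, 146.32),
    ("DeepSeek V3.2 Chat", 70.2, 0.88),
    ("DeepSeek R1", 56.9, 5.42),
    ("GPT-4.5 Preview", 44.9, 183.18),
]


def rank_by_value(results):
    """Sort model names by percentage points per dollar, best first."""
    return [name for name, score, cost in sorted(results, key=lambda r: r[1] / r[2], reverse=True)]


def pareto_frontier(results):
    """Models for which no other model is at least as accurate and at least as cheap."""
    frontier = []
    for name, score, cost in results:
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for _, s, c in results
        )
        if not dominated:
            frontier.append(name)
    return frontier


by_value = rank_by_value(results)
frontier = pareto_frontier(results)
print(by_value)
print(frontier)
```

The two views answer different questions: the value ranking favors very cheap models, while the frontier keeps any model that is the cheapest way to reach its level of accuracy.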
For teams that need maximum accuracy and treat cost as secondary, GPT-5 (high) and o3-pro represent the top tier. For teams that want strong performance on a budget under $1 per run, DeepSeek V3 variants have consistently led.
SWE-bench presents models with real GitHub issues from Python open-source repositories. A successful solve requires the model to understand an existing codebase, identify the root cause of a bug, make targeted edits across potentially many files, and pass a test suite that was written by the original project maintainers.
| Dimension | Aider Polyglot | SWE-bench Verified |
|---|---|---|
| Languages | 6 (C++, Go, Java, JS, Python, Rust) | Python only |
| Problem type | Self-contained implementation exercises | Real-world bug fixes in existing repos |
| Files per problem | Typically 1 | Potentially many |
| Test suites | Written by Exercism contributors | Written by original project maintainers |
| Contamination risk | Moderate (Exercism is publicly indexed) | Higher (popular repos likely in training data) |
| Problem count | 225 | ~500 verified |
| Task format | Implement from scratch; edit starter code | Locate bug; patch across files |
| Cost per run | $0.32 to roughly $187 depending on model | Higher due to long repo context |
The two benchmarks measure related but distinct skills. SWE-bench tests multi-file debugging and navigating unfamiliar codebases. Aider Polyglot tests clean-room implementation and edit-format compliance. A model can score high on one and lower on the other. In practice, models that excel at reasoning tend to do well on both, but the correlation is imperfect.
One criticism noted by Refact.ai and others is that SWE-bench's Python-only scope and its heavy use of popular open-source projects create meaningful contamination risk: many of those repositories and issues were likely present in training data. The Polyglot benchmark's Exercism exercises are also publicly available, but the specific 225 problems chosen are less likely to have appeared in specialized coding fine-tuning datasets.
LiveCodeBench collects new problems continuously from competitive programming contests on LeetCode, AtCoder, and Codeforces. Because problems are tagged with their release date, evaluations can be restricted to problems released after a model's training cutoff, making contamination nearly impossible. The benchmark tests four scenarios: code generation, self-repair, code execution, and test output prediction.
| Dimension | Aider Polyglot | LiveCodeBench |
|---|---|---|
| Problem source | Exercism exercises | Competitive programming contests |
| Languages | 6 | Primarily Python, C++, Java |
| Problem type | Software engineering exercises | Algorithmic competition problems |
| Contamination control | Static set, moderate risk | Rolling release, very low risk |
| Measures editing | Yes | No (generation only in main track) |
| Cost to run | Low to moderate | Moderate |
| Problem count | 225 | 1055+ (v6) |
LiveCodeBench is widely considered the most contamination-resistant coding signal among regularly used benchmarks, because it continuously introduces unseen problems. However, competitive programming problems emphasize algorithmic reasoning in ways that can diverge from typical software engineering work. Aider Polyglot's Exercism problems are closer to the kind of implementation tasks a developer might face in practice, even if Exercism's exercises are not drawn from real production codebases.
HumanEval is a Python-only benchmark of 164 hand-written function completion problems, released by OpenAI in 2021. It is widely considered saturated: most frontier models score above 90%. Aider Polyglot's multi-language, harder-filtered design is a direct response to the same kind of saturation that eventually affected HumanEval.
The predecessor benchmark used 133 Python Exercism exercises. Its top score at the time of Polyglot's launch was 84.2%, and as noted above, scores were bunched near the top. Aider Polyglot expanded the scope from one language to six and filtered for difficulty, pushing the initial top score down to 61.7% and restoring headroom for future models.
| Dimension | Aider Code Editing Benchmark | Aider Polyglot |
|---|---|---|
| Languages | Python only | 6 languages |
| Problem count | 133 | 225 |
| Top score at launch | 84.2% | 61.7% |
| Difficulty filter | None | Solved by 3 or fewer of 7 reference models |
| Release | 2023 | December 21, 2024 |
Since its launch, Aider Polyglot has been cited in official announcements from Anthropic, OpenAI, Google DeepMind, and DeepSeek as evidence of coding ability. It has become one of a small set of coding benchmarks that major labs test against before releasing new models. Paul Gauthier typically updates the leaderboard within days of a major model release, and the benchmark run results often appear in lab blog posts and social media announcements alongside SWE-bench and LiveCodeBench scores.
The benchmark is useful to the research community because it combines three properties that few other benchmarks offer together: multi-language coverage, a realistic edit-format requirement, and a cost metric. As a result, it rewards models that can both reason about code and reliably produce correctly structured output, which matters for practical tool integration.
The benchmark has also become a reference point for companies building coding agents. Refact.ai's blog post documenting their 92.9% agent result used the benchmark to argue their agent's superiority over both Aider itself and standard model-only approaches. The benchmark's public accessibility (the full problem set is on GitHub, and the harness is open source) makes it reproducible for any team that wants to verify results or benchmark their own systems.
Aider's work on edit formats, documented in part through benchmark results, influenced subsequent academic research on code editing representations. The 2025 paper "Robust Learning of Diverse Code Edits" (arXiv:2503.03656) benchmarked multiple edit formats and confirmed that search-replace representations outperform unified diffs and structured formats for most capable models, consistent with the patterns Gauthier observed in Aider benchmark results.
The benchmark's inclusion of per-run cost is unusual among coding benchmarks and has proven practically useful. DeepSeek's V3 models' appearance on the leaderboard with competitive scores and sub-dollar run costs drove significant developer attention toward those models in early-to-mid 2025. The cost column essentially acts as an efficiency frontier chart, making it easy to identify which models offer the best tradeoff between capability and expense.
Each Polyglot problem typically requires changes to a single source file. Real software engineering work frequently involves changes across many files simultaneously: updating an interface, modifying multiple callers, adjusting tests and documentation. The benchmark does not evaluate this multi-file coordination ability, which is a meaningful part of what coding agents need to do.
All problems come from one platform with consistent formatting and style conventions. A model that has seen many Exercism exercises in training may benefit from recognizing that style, making the benchmark less of a pure test of general coding ability. The filter for hard problems reduces but does not eliminate this concern.
Unlike LiveCodeBench, the 225 Polyglot problems are fixed. As models are further trained with Exercism data, or as benchmark results enter the public record and become training data themselves, contamination risk grows over time. The benchmark's designers acknowledged this, intending the current calibration to remain useful for at least several years given the difficulty of the selected problems.
The two-attempt structure is a practical compromise between evaluation thoroughness and cost, but it limits what the benchmark measures. In real development, a programmer iterates many more times, reading error messages, searching documentation, and revising code incrementally. The benchmark rewards models that can reason correctly on the first or second pass but does not measure sustained iterative refinement.
The 225 problems are not evenly distributed across the six languages. JavaScript and Java together make up nearly 43% of the set, while C++ accounts for only 11.6%. A model that is unusually strong at JavaScript and Java but weak at Rust will benefit more from this distribution than one with uniform multilingual ability.
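The shares quoted above follow directly from the per-language problem counts. The sketch below recomputes them and then compares a count-weighted score with a simple per-language average for an invented model; the per-language pass rates are hypothetical and chosen only to show the effect of the uneven distribution.

```python
# Problem counts per language, as listed earlier in this article.
counts = {"JavaScript": 49, "Java": 47, "Go": 39, "Python": 34, "Rust": 30, "C++": 26}
total = sum(counts.values())

shares = {lang: round(100 * n / total, 1) for lang, n in counts.items()}
js_java_share = round(100 * (counts["JavaScript"] + counts["Java"]) / total, 1)
print(total, shares, js_java_share)

# Hypothetical model: strong on JavaScript and Java, weaker elsewhere.
pass_rates = {"JavaScript": 0.8, "Java": 0.8, "Go": 0.4, "Python": 0.4, "Rust": 0.4, "C++": 0.4}

# Benchmark-style score: every problem counts equally, so larger tracks weigh more.
weighted = round(100 * sum(counts[lang] * pass_rates[lang] for lang in counts) / total, 1)
# Alternative view: every language counts equally.
unweighted = round(100 * sum(pass_rates.values()) / len(pass_rates), 1)
print(weighted, unweighted)
```

For this invented model the headline score comes out several points above the per-language average, which is the kind of skew the uneven distribution can introduce.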
All problems have a predefined test suite, so the benchmark evaluates whether a model can produce code that passes specific pre-written tests. It does not evaluate whether a model can write good tests, design APIs, write readable code, handle ambiguous specifications, or produce well-documented implementations.