| Aider Polyglot | |
|---|---|
| Overview | |
| Full name | Aider Polyglot Coding Benchmark |
| Abbreviation | Aider Polyglot |
| Description | A challenging multi-language code generation benchmark testing LLMs on 225 difficult Exercism coding exercises across six programming languages |
| Release date | 2024-12-21 |
| Latest version | 1.0 |
| Benchmark updated | 2025-08-13 |
| Authors | Paul Gauthier |
| Organization | Aider AI |
| Technical Details | |
| Type | Code Generation, Code Editing |
| Modality | Text, Code |
| Task format | Code completion and editing |
| Number of tasks | 225 |
| Total examples | 225 |
| Evaluation metric | Percent Correct, Edit Format Accuracy |
| Domains | Software Engineering, Programming |
| Languages | C++, Go, Java, JavaScript, Python, Rust |
| Performance | |
| Human performance | Not reported |
| Baseline | 3.6% (GPT-4o-mini) |
| SOTA score | 84.9% (standard model), 93.3% (agent system) |
| SOTA model | o3-pro (high) / Refact.ai Agent |
| SOTA date | 2025-08 |
| Saturated | No |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | Aider Code Editing Benchmark |
Aider Polyglot is a challenging code generation benchmark that evaluates large language models' ability to solve complex programming problems across six major programming languages. Released on December 21, 2024, by Aider AI creator Paul Gauthier, the benchmark consists of 225 carefully selected Exercism coding exercises designed to test models' capabilities in C++, Go, Java, JavaScript, Python, and Rust. It represents a significant evolution from the original Python-only Aider benchmark, providing better differentiation between frontier models through increased difficulty and language diversity.
Aider Polyglot was created to address the saturation of existing code generation benchmarks, where top models were achieving 80%+ scores, making meaningful comparisons difficult. The benchmark specifically tests whether AI can write new code that integrates seamlessly into existing codebases and successfully apply changes to source files without human intervention.
The benchmark was designed with several key objectives: substantially higher difficulty than the saturating Python-only suite, coverage of multiple mainstream languages, and enough headroom to differentiate future frontier models.

The 225 problems were selected from the 697 available Exercism exercises through empirical testing (a schematic of the filtering step follows this list):

1. Seven top coding models attempted all 697 problems
2. Problems solved by three or fewer of those models were kept
3. The final set balanced difficulty across the six languages
4. Sufficient headroom was ensured for future model improvements
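A minimal sketch of that filtering step, assuming a hypothetical solve matrix (the actual selection tooling was not published in this form):

```python
from typing import Dict, List, Set

def select_hard_exercises(
    solved_by: Dict[str, Set[str]],  # exercise -> set of models that solved it
    max_solvers: int = 3,
) -> List[str]:
    """Keep exercises that three or fewer of the reference models solved."""
    return [ex for ex, models in solved_by.items() if len(models) <= max_solvers]

# Example: starting from 697 candidates, only the hardest survive
hard = select_hard_exercises({
    "anagram": {"m1", "m2", "m3", "m4", "m5"},  # too easy: 5 solvers, dropped
    "zebra-puzzle": {"m1"},                      # hard: kept
})
assert hard == ["zebra-puzzle"]
```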
The benchmark's 225 problems are distributed across six programming languages:
| Language | Number of Problems | Percentage | Paradigm |
|---|---|---|---|
| JavaScript | 49 | 21.8% | Multi-paradigm, Dynamic |
| Java | 47 | 20.9% | Object-oriented, Static |
| Go | 39 | 17.3% | Concurrent, Static |
| Python | 34 | 15.1% | Multi-paradigm, Dynamic |
| Rust | 30 | 13.3% | Systems, Memory-safe |
| C++ | 26 | 11.6% | Multi-paradigm, Systems |
| Total | 225 | 100% | Various |
| Metric | Description | Significance |
|---|---|---|
| Percent Correct | Percentage of problems solved correctly | Primary performance indicator |
| Edit Format Accuracy | Percentage using correct diff format | Implementation quality measure |
| Cost | Average cost per problem attempt | Efficiency metric |
| Pass Rate | Problems passing all test cases | Functional correctness |
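As a rough illustration of how these headline numbers are aggregated from per-problem results (a sketch; the field names are assumptions, not the harness's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    solved: bool        # all unit tests passed
    well_formed: bool   # edits parsed in the required diff format
    cost_usd: float     # API spend for the attempt

def summarize(attempts: list[Attempt]) -> dict:
    """Aggregate per-problem attempts into the leaderboard metrics."""
    n = len(attempts)
    return {
        "percent_correct": 100 * sum(a.solved for a in attempts) / n,
        "edit_format_accuracy": 100 * sum(a.well_formed for a in attempts) / n,
        "avg_cost_usd": sum(a.cost_usd for a in attempts) / n,
    }
```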
The benchmark requires models to read each problem's instructions and starter code, emit their changes in the required edit format, and produce implementations that pass the full test suite without human intervention, as illustrated below.
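For illustration, aider's "diff" edit format has the model emit SEARCH/REPLACE blocks that name the file being changed; a reply counts toward edit format accuracy only if these blocks parse. A hypothetical edit (exercise and solution invented for illustration) might look like:

```
exercises/python/anagram/anagram.py
<<<<<<< SEARCH
def find_anagrams(word, candidates):
    pass
=======
def find_anagrams(word, candidates):
    target = sorted(word.lower())
    return [c for c in candidates
            if c.lower() != word.lower() and sorted(c.lower()) == target]
>>>>>>> REPLACE
```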
Problems in Aider Polyglot typically involve completing a non-trivial starter implementation so that a comprehensive unit-test suite passes; they are drawn from the hardest exercises in the Exercism catalogue.
| Rank | Model | Percent Correct | Cost | Organization |
|---|---|---|---|---|
| 1 | o3-pro (high) | 84.9% | $146.32 | OpenAI |
| 2 | gemini-2.5-pro-preview (32k think) | 83.1% | $49.88 | Google DeepMind |
| 3 | o3 (high) | 81.3% | $21.23 | OpenAI |
| 4 | Grok-4 (high) | 79.6% | $59.62 | xAI |
| 5 | gemini-2.5-pro-preview (default think) | 79.1% | $19.29 | Google DeepMind |
| 6 | o1 (high) | 61.7% | $74.66 | OpenAI |
| 7 | Claude 3.5 Sonnet | 45.3% | $3.12 | Anthropic |
| 8 | Gemini Experimental | 38.2% | - | Google DeepMind |
| 9 | GPT-4o | ~25% | - | OpenAI |
| 10 | GPT-4o-mini | 3.6% | $0.14 | OpenAI |
In addition to standard model evaluations, agent systems have achieved higher scores:
| Agent System | Base Model | Score | Date | Notes |
|---|---|---|---|---|
| Refact.ai Agent | Claude 3.7 Sonnet | 92.9% | April 2025 | 30 steps, enforced test execution |
| Refact.ai Agent (Thinking) | Claude 3.7 Sonnet | 93.3% | April 2025 | With thinking mode enabled |
| Aider | Various | 60.4% | 2024 | Original agent baseline |
Note: Agent systems use iterative approaches with multiple attempts and self-correction, achieving higher scores than single-pass model evaluations.
Detailed per-language scores aren't consistently published, but overall results cluster into rough capability tiers:
| Category | Score Range | Examples | Characteristics |
|---|---|---|---|
| Frontier Reasoning | 75-85% | o3-pro, gemini-2.5-pro | Advanced reasoning, high compute |
| Agent Systems | 85-93% | Refact.ai Agent | Iterative, self-correcting |
| Top Tier | 45-75% | Claude 3.5, o1 | Strong general capability |
| Mid Tier | 15-45% | GPT-4o, older models | Good but limited on complex tasks |
| Entry Level | <15% | GPT-4o-mini, open models | Basic capability, frequent failures |
| Aspect | Original Aider | Aider Polyglot |
|---|---|---|
| Languages | Python only | 6 languages |
| Problems | 133 (all Exercism Python) | 225 (hardest from 697) |
| Top Score | 84.2% (saturating) | 84.9% (room for growth) |
| Difficulty | Moderate | High |
| Release | 2023 | December 21, 2024 |
The benchmark employs a fixed problem structure and a fully automated evaluation pipeline.

Each problem includes (a condensed example follows):

1. Starter code: an initial implementation skeleton
2. Test suite: comprehensive unit tests
3. Instructions: problem description and requirements
4. Expected output: reference solution behavior
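A condensed example of that structure for a hypothetical Python exercise (starter skeleton and one unit test shown in a single snippet for brevity; in the benchmark they live in separate files):

```python
import unittest

# --- Starter code the model receives (e.g. rational_numbers.py) ---
class Rational:
    def __init__(self, numer: int, denom: int):
        # Skeleton only: the model must implement reduction to lowest terms
        raise NotImplementedError

# --- Test suite the finished solution must pass (normally a separate file) ---
class RationalTest(unittest.TestCase):
    def test_reduces_to_lowest_terms(self):
        r = Rational(2, 4)
        self.assertEqual((r.numer, r.denom), (1, 2))

if __name__ == "__main__":
    unittest.main()  # fails against the skeleton, passes once implemented
```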
| Step | Action | Validation |
|---|---|---|
| 1 | Model receives problem description and starter code | Input formatting check |
| 2 | Model generates edit instructions | Diff format validation |
| 3 | Edits applied to source files | Syntax verification |
| 4 | Test suite executed | Functional correctness |
| 5 | Results recorded | Performance metrics |
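The table above corresponds roughly to a loop like the following (a schematic sketch; `parse_edits` and `apply_edits` are placeholders, not the actual harness API):

```python
import subprocess
from pathlib import Path

def parse_edits(reply: str):
    """Placeholder: extract SEARCH/REPLACE blocks; return None if malformed."""
    return reply if "<<<<<<< SEARCH" in reply else None

def apply_edits(workdir: Path, edits) -> None:
    """Placeholder: write the parsed edits into the files under workdir."""

def run_tests(workdir: Path) -> bool:
    """Step 4: run the exercise's test suite; the exit code is pass/fail."""
    return subprocess.run(["python", "-m", "pytest"], cwd=workdir).returncode == 0

def evaluate(model, problems) -> list[dict]:
    """Schematic harness loop mirroring steps 1-5 of the table above."""
    results = []
    for p in problems:
        reply = model.solve(p.instructions, p.starter_code)  # steps 1-2
        edits = parse_edits(reply)                           # step 2: format check
        if edits is None:                                    # malformed edits never run
            results.append({"solved": False, "well_formed": False})
            continue
        apply_edits(p.workdir, edits)                        # step 3: apply edits
        results.append({"solved": run_tests(p.workdir),      # step 4: run tests
                        "well_formed": True})                # step 5: record result
    return results
```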
The benchmark provides insights into models' reasoning on hard algorithmic problems, their generalization across programming languages, and the reliability of fully automated code editing. It nonetheless has several limitations:
| Limitation | Description | Impact |
|---|---|---|
| Limited Languages | Only 6 languages covered | Misses domain-specific languages |
| Exercism Focus | All problems from one source | Potential style bias |
| Static Dataset | Fixed 225 problems | Risk of overfitting |
| Edit Format | Specific diff requirement | May not match all workflows |
| No Debugging | Only generation tested | Misses fix/refactor capabilities |