Category: 2024 Benchmarks
11 articles
AIME 2024
AI Benchmarks, Mathematical Reasoning Benchmarks
Aider Polyglot
AI Benchmarks, Code Generation Benchmarks, Multi-language Benchmarks
BALROG
AI Benchmarks, Agentic AI Benchmarks, Game-Based AI Evaluation
CharXiv
AI Benchmarks, Chart Understanding, Multimodal Benchmarks
GeoBench
2023 Benchmarks, AI Benchmarks, Earth Observation Benchmarks
IFBench
AI Benchmarks, Constraint Satisfaction Benchmarks, Instruction Following Benchmarks
Longform Creative Writing
AI Benchmarks, Creative Writing Benchmarks, Long-form Generation Benchmarks
MMMLU
AI Benchmarks, Knowledge Benchmarks, Multilingual Benchmarks
SciCode
AI Benchmarks, Code Generation Benchmarks, Domain-Specific Benchmarks
WebDev Arena
AI Benchmarks, Coding Benchmarks, Community-Driven Benchmarks
τ-bench
AI Benchmarks, Agent Benchmarks, Multi-turn Interaction Benchmarks