AI Benchmarks
95 articles
AA-LCR
2025 Benchmarks, Document Understanding Benchmarks, Knowledge Work Benchmarks
AIME 2024
2024 Benchmarks, Mathematical Reasoning Benchmarks
AIME 2025
2025 Benchmarks, Mathematical Reasoning Benchmarks, Olympiad Mathematics
ARC-AGI
AGI, AI Evaluations
ARC-AGI 1
AGI, Reasoning
ARC-AGI 2
2025 in artificial intelligence, Artificial Intelligence, Cognitive Science
ARC-AGI 3
2026 Benchmarks, Game-Based AI Evaluation, General Intelligence Benchmarks
AdvBench
AI Safety, Adversarial Attacks, Large Language Models
AgentBench
AI Agents, Large Language Models
Aider Polyglot
2024 Benchmarks, Code Generation Benchmarks, Multi-language Benchmarks
AlpacaEval
Large Language Models, Natural Language Processing
BALROG
2024 Benchmarks, Agentic AI Benchmarks, Game-Based AI Evaluation
BBQ (Bias Benchmark for QA)
AI Safety, Bias and Fairness, Natural Language Processing
BIG-Bench Hard
Machine Learning, Natural Language Processing
Benchmarks
Artificial Intelligence
Berkeley Function Calling Leaderboard
Function Calling, Large Language Models, Tool Use
BoolQ
Natural Language Processing, Reading Comprehension
BrowseComp
2025 Benchmarks, Information Retrieval Benchmarks, OpenAI
CLIP Score
Computer Vision, Image Generation, Multimodal AI
COLLIE
2023 Benchmarks, Compositional Reasoning Benchmarks, Constrained Generation
CRUXEval
Code Generation, Machine Learning, Natural Language Processing
CharXiv
2024 Benchmarks, Chart Understanding, Multimodal Benchmarks
Chatbot Arena
Large Language Models
CodeContests
Code Generation, Competitive Programming, Machine Learning
CommonsenseQA
Commonsense Reasoning, Natural Language Processing
Creative Writing v3
2025 Benchmarks, Creative Writing Benchmarks, Language Model Benchmarks
DCLM (DataComp for Language Models)
AI Datasets, Natural Language Processing, Open Source AI
Deep Research Bench
2025 Benchmarks, Information Retrieval Benchmarks, Multi-step Task Benchmarks
DeepResearch Bench
2025 Benchmarks, Academic AI Evaluation, Multilingual Benchmarks
Dynabench
2020 Establishments, Adversarial Evaluation, Dynamic Benchmarks
EQ-Bench 3
2025 Benchmarks, Emotional Intelligence Benchmarks, Language Model Benchmarks
ERQA
2025 Benchmarks, Embodied AI, Google DeepMind
EgoSchema
Computer Vision, Multimodal AI, Video Understanding
FLORES-200
Machine Translation, Multilingual AI, Natural Language Processing
Factorio Learning Environment
2025 Benchmarks, Game-Based AI Evaluation, Long-term Planning Benchmarks
Frechet Inception Distance
Computer Vision, Generative AI, Image Generation
FrontierMath
Artificial Intelligence, Mathematics
GDPval
AI Economics, AI Evaluations
GSO
2025 Benchmarks, Code Optimization Benchmarks, Multi-language Benchmarks
GeoBench
2023 Benchmarks, 2024 Benchmarks, Earth Observation Benchmarks
HaluEval
AI Safety, Hallucination, Large Language Models
HarmBench
AI Safety, Large Language Models, Red Teaming
HealthBench
2025 Benchmarks, Healthcare AI, Medical Benchmarks
HealthBench Hard
2025 Benchmarks, Challenging Benchmarks, Clinical AI Evaluation
HellaSwag
Commonsense Reasoning, Natural Language Processing
Humanity's Last Exam
AI Evaluations, AI Safety
Humanity's Last Exam
Frontier benchmarks, Reasoning evaluation
IFBench
2024 Benchmarks, Constraint Satisfaction Benchmarks, Instruction Following Benchmarks
IFEval
Large Language Models, Natural Language Processing
InfiniteBench
Large Language Models, Long Context, Natural Language Processing
Iris dataset
Datasets, Machine Learning, Statistics
JailbreakBench
AI Safety, Large Language Models, Red Teaming
LAMBADA
Language Modeling, Natural Language Processing
LegalBench
Large Language Models, Legal AI, Natural Language Processing
LibriSpeech
Audio AI, Natural Language Processing, Speech Recognition
LiveBench
Machine Learning, Natural Language Processing
LiveCodeBench
Code Generation, Machine Learning
LongBench
Large Language Models, Long Context, Natural Language Processing
Longform Creative Writing
2024 Benchmarks, Creative Writing Benchmarks, Long-form Generation Benchmarks
MACHIAVELLI (benchmark)
AI Alignment, AI Ethics, AI Safety