AI Benchmarks
222 articles
AA-LCR
Natural Language Processing
AGIEval
Model Evaluation
AIME (American Invitational Mathematics Examination)
Mathematics
AIME 2024
AIME 2025
ARC-AGI
Artificial Intelligence
ARC-AGI 1
Artificial Intelligence, Reasoning Models
ARC-AGI 3
ARC-AGI-2
2025 in artificial intelligence, Artificial Intelligence, Machine Learning
AdvBench
AI Safety, Large Language Models
Agent benchmark reward hacking
AI Agents, AI Safety
Agent evaluation
AI Agents, Model Evaluation
AgentBench
AI Agents, Large Language Models
AgentDojo
AI Agents, AI Safety
AgentHarm
AI Agents, AI Safety
Aider Polyglot
AlpacaEval
Large Language Models, Natural Language Processing
Arena-Hard
Model Evaluation
Artificial Analysis
Developer Tools, Large Language Models
BABILong
Large Language Models, Model Evaluation
BALROG
BBQ (Bias Benchmark for QA)
AI Ethics, AI Safety, Natural Language Processing
BELEBELE
Computer Vision
BIG-Bench
Large Language Models, Machine Learning
BIG-Bench Extra Hard
Google DeepMind, Reasoning Models
BIG-Bench Hard
Machine Learning, Natural Language Processing
BLINK
Computer Vision
Benchmark (AI)
Model Evaluation
Benchmarks
Artificial Intelligence
Berkeley Function Calling Leaderboard
Large Language Models
BigCodeBench
AI Code Generation
BoolQ
Natural Language Processing
BountyBench
AI Agents
BrowseComp
OpenAI
BrowserGym
AI Agents
CIFAR-10
Computer Vision, Data & Datasets
CLIP Score
Computer Vision, Image Generation, Multimodal AI
COLLIE
CRMArena / CRMArena-Pro
AI Code Generation
CRUXEval
AI Code Generation, Machine Learning, Natural Language Processing
CharXiv
Data & Datasets
Chatbot Arena
Large Language Models
ChemBench
Model Evaluation
CodeContests
AI Code Generation, Machine Learning
CommonsenseQA
Natural Language Processing
Creative Writing v3
Cybench
AI Safety, Model Evaluation
DCLM (DataComp for Language Models)
Data & Datasets, Natural Language Processing, Open Source AI
DROP (Discrete Reasoning Over Paragraphs)
Machine Learning, Natural Language Processing
Deep Research Bench
DeepResearch Bench
Dynabench
EQ-Bench 3
ERQA
Embodied AI, Google DeepMind, Multimodal AI
EgoSchema
Computer Vision, Multimodal AI
Elo rating system (AI model ranking)
Machine Learning, Model Evaluation, Statistics
EnigmaEval
Model Evaluation
FACTS Grounding
Large Language Models, Model Evaluation
FActScore
AI Safety
FLORES-200
Natural Language Processing