2025 Benchmarks
16 articles
AA-LCR
AI Benchmarks, Document Understanding Benchmarks, Knowledge Work Benchmarks
AIME 2025
AI Benchmarks, Mathematical Reasoning Benchmarks, Olympiad Mathematics
BrowseComp
AI Benchmarks, Information Retrieval Benchmarks, OpenAI
Creative Writing v3
AI Benchmarks, Creative Writing Benchmarks, Language Model Benchmarks
Deep Research Bench
AI Benchmarks, Information Retrieval Benchmarks, Multi-step Task Benchmarks
DeepResearch Bench
AI Benchmarks, Academic AI Evaluation, Multilingual Benchmarks
EQ-Bench 3
AI Benchmarks, Emotional Intelligence Benchmarks, Language Model Benchmarks
ERQA
AI Benchmarks, Embodied AI, Google DeepMind
Factorio Learning Environment
AI Benchmarks, Game-Based AI Evaluation, Long-term Planning Benchmarks
GSO
AI Benchmarks, Code Optimization Benchmarks, Multi-language Benchmarks
HealthBench
AI Benchmarks, Healthcare AI, Medical Benchmarks
HealthBench Hard
AI Benchmarks, Challenging Benchmarks, Clinical AI Evaluation
Tau2-bench
AI Benchmarks, Agent Evaluation, Conversational AI Benchmarks
Video-MMMU
AI Benchmarks, Educational AI, EvolvingLMMs-Lab
Vimgolf
AI Benchmarks, Agent Benchmarks, Benchmarks
WeirdML
AI Benchmarks, Code Generation Benchmarks, Machine Learning Benchmarks