Benchmarks
31 articles
Agent evaluation
AI Agents, Evaluation
Artificial Analysis
Developer Tools, Large Language Models
BIG-Bench
Large Language Models, Machine Learning
BrowserGym
AI Agents, Web Agents
CIFAR-10
Computer Vision, Datasets
DROP (Discrete Reasoning Over Paragraphs)
Machine Learning, Natural Language Processing, Reading Comprehension
Fox (benchmark)
Computer Vision, Document Understanding
GAIA benchmark
AI evaluation, Agentic AI
GLUE benchmark
Machine Learning, Natural Language Processing
GPQA
Reasoning
GSM8K
Large Language Models, Machine Learning, Reasoning
HumanEval
Code Generation
LLM Benchmarks Timeline
Aggregate pages, Timelines
MATH
AI Evaluation, Mathematical Reasoning
MATH (benchmark)
Large Language Models, Machine Learning, Reasoning
MBPP
Code Generation, Large Language Models, Machine Learning
MMLU
LLM Evaluation
MMLU-Pro
Large Language Models, Machine Learning
MathVista
Multimodal AI
Mind2Web
AI Agents, Web Agents
Paper2Video
AI Research, Articles with short description, Multimodal AI
SQuAD
Natural Language Processing, Question Answering
SWE-bench
AI Agents, Code Generation
SWE-bench Verified
Coding
SWE-bench Verified
AI Evaluation, Code Generation
SuperGLUE
Datasets, Natural Language Processing
Tokens per second
AI Hardware, Large Language Models
Vimgolf
2025 Benchmarks, AI Benchmarks, Agent Benchmarks
Visual Question Answering Models
Computer Vision, Multimodal, Tasks
WebArena
AI Agents, Evaluation
WinoGrande
Commonsense Reasoning, Natural Language Processing