AI Benchmarks

220 articles

AA-LCR

Artificial Analysis Long Context Reasoning: a benchmark evaluating long context reasoning across multiple real-world documents (approximately...

Natural Language Processing

AGIEval

AGIEval is an AI benchmark for evaluating foundation models on tasks that were originally designed for, and taken by, humans. Rather than building synthetic...

Model Evaluation

AIME (American Invitational Mathematics Examination)

The American Invitational Mathematics Examination (AIME) is a 15-question, 3-hour mathematics competition for high school students in the United States and...

Mathematics

AIME 2024

AIME 2024 is an AI benchmark of 30 problems drawn from the 2024 American Invitational Mathematics Examination that has become the standard yardstick for...

AIME 2025

American Invitational Mathematics Examination 2025: a challenging mathematical reasoning benchmark based on the American Invitational...

ARC-AGI

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a family of AI benchmarks, created by François Chollet, that measures fluid...

Artificial Intelligence

ARC-AGI 1

Abstraction and Reasoning Corpus for Artificial General Intelligence, version 1: a benchmark testing fluid intelligence and abstract reasoning...

Artificial Intelligence, Reasoning Models

ARC-AGI 3

Abstraction and Reasoning Corpus for Artificial General Intelligence, Version 3 (Interactive Reasoning Benchmark): an interactive, agentic...

ARC-AGI-2

ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an abstract reasoning benchmark for artificial intelligence, released on...

2025 in artificial intelligence, Artificial Intelligence

AdvBench

AdvBench (Adversarial Behavior Benchmark) is a red-teaming benchmark dataset for measuring how easily an aligned large language model can be pushed into...

AI Safety, Large Language Models

Agent benchmark reward hacking

Agent benchmark reward hacking refers to the practice of inflating an AI agent's score on an evaluation suite by attacking the evaluation machinery itself...

AI Agents, AI Safety

Agent evaluation

Agent evaluation is the systematic measurement of how well AI agents (LLM-based systems that plan and act over multiple steps using tools) perform on...

AI Agents, Model Evaluation

AgentBench

AgentBench is a multi-dimensional benchmark for evaluating large language models (LLMs) as autonomous agents across eight distinct interactive environments,...

AI Agents, Large Language Models

AgentDojo

AgentDojo is a dynamic evaluation environment for measuring prompt injection attacks and defenses against tool-using large language model agents. It was...

AI Agents, AI Safety

AgentHarm

AgentHarm is a benchmark for measuring the harmfulness of LLM agents: systems that wrap a large language model in a loop that lets it call external tools and...

AI Agents, AI Safety

Aider Polyglot

Aider Polyglot Coding Benchmark: a challenging multi-language code generation and editing benchmark testing LLMs on 225 difficult Exercism...

AlpacaEval

AlpacaEval is an automatic evaluation framework for instruction-following large language models (LLMs) developed by Stanford University's Tatsu Lab, which...

Large Language Models, Natural Language Processing

Arena-Hard

Arena-Hard (and its evaluation tool Arena-Hard-Auto) is an automatic large language model (LLM) benchmark developed by the team behind Chatbot Arena that...

Model Evaluation

Artificial Analysis

Artificial Analysis is an independent benchmarking and analytics platform that evaluates artificial intelligence models and API providers across intelligence,...

Developer Tools, Large Language Models

BABILong

BABILong is a benchmark for testing how well a large language model can reason over facts scattered through very long text. It was introduced by Yuri Kuratov,...

Large Language Models, Model Evaluation

BALROG

Benchmarking Agentic LLM and VLM Reasoning On Games: a benchmark that evaluates agentic LLM and VLM capabilities through six diverse,...

BBQ (Bias Benchmark for QA)

BBQ (the Bias Benchmark for QA) is a hand-built evaluation dataset that measures whether a question answering (QA) language model relies on social stereotypes...

AI Ethics, AI Safety

BELEBELE

Belebele is a multiple-choice machine reading comprehension (MRC) AI benchmark that is fully parallel across 122 language variants, meaning the same questions,...

Computer Vision

BIG-Bench

BIG-Bench (Beyond the Imitation Game Benchmark) is a large-scale, collaborative benchmark of 204 tasks, contributed by 450 authors across 132 institutions,...

Large Language Models, Machine Learning

BIG-Bench Extra Hard

BIG-Bench Extra Hard (BBEH) is a reasoning benchmark released by Google DeepMind in February 2025 that replaces each of the 23 tasks in BIG-Bench Hard (BBH)...

Google DeepMind, Reasoning Models

BIG-Bench Hard

BIG-Bench Hard (BBH) is a suite of 23 challenging tasks drawn from the BIG-Bench benchmark, selected because they are "the [tasks] for which prior language...

Machine Learning, Natural Language Processing

BLINK

BLINK is an AI benchmark that evaluates the core visual perception abilities of multimodal large language models (MLLMs). It reformats 14 classic computer...

Computer Vision

Benchmark (AI)

In artificial intelligence and machine learning, a benchmark is a standardized combination of a dataset, a task definition, and a scoring protocol that lets...

Model Evaluation

Berkeley Function Calling Leaderboard

The Berkeley Function Calling Leaderboard (BFCL) is the standard benchmark for measuring how accurately large language models (LLMs) invoke functions, APIs,...

Large Language Models

BigCodeBench

BigCodeBench is a Python code generation benchmark of 1,140 function-level programming tasks that require composing 723 distinct function calls from 139...

AI Code Generation

BoolQ

BoolQ (Boolean Questions) is a natural language processing benchmark dataset of 15,942 naturally occurring yes/no question answering examples, each pairing a...

Natural Language Processing

BountyBench

BountyBench is a cybersecurity benchmark from Stanford University that measures the offensive and defensive capabilities of AI agents on real-world bug-bounty...

AI Agents

BrowseComp

Browsing Competition: a benchmark evaluating AI agents' web browsing ability through 1,266 challenging short-answer questions that require...

OpenAI

BrowserGym

BrowserGym is an open-source Gymnasium-style environment and unified benchmark ecosystem for web agent research, developed by ServiceNow Research. It provides...

AI Agents

CIFAR-10

CIFAR-10 is a labeled dataset of 60,000 small color images sorted into 10 mutually exclusive object categories, with 6,000 images per class, used as a standard...

Computer Vision, Data & Datasets

CLIP Score

CLIP Score (also written CLIPScore or CLIP-S) is a reference-free automatic evaluation metric that measures how well a text caption matches an image, computed...

Computer Vision, Image Generation

COLLIE

Systematic Construction of Constrained Text Generation Tasks: a grammar-based framework for systematically constructing complex, compositional...

CRMArena / CRMArena-Pro

CRMArena is an AI benchmark for evaluating large language model agents on professional customer relationship management (CRM) tasks inside a realistic,...

AI Code Generation

CRUXEval

CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation) is a benchmark designed to measure how well large language models can reason about,...

AI Code Generation, Machine Learning

CharXiv

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs. An evaluation suite for assessing chart understanding capabilities...

Data & Datasets

Chatbot Arena

Chatbot Arena (now branded simply as Arena, and previously known as LMArena) is a crowdsourced evaluation platform for large language models that ranks AI...

Large Language Models

ChemBench

ChemBench is an automated AI benchmark that measures the chemical knowledge, reasoning, and safety judgment of large language models and compares their...

Model Evaluation

CodeContests

CodeContests is a competitive programming dataset created by Google DeepMind for training and evaluating machine learning models on algorithmic problem-solving...

AI Code Generation, Machine Learning

CommonsenseQA

CommonsenseQA is a multiple-choice question answering benchmark of 12,247 questions, introduced in 2019 by Alon Talmor, Jonathan Herzig, Nicholas Lourie, and...

Natural Language Processing

Creative Writing v3

EQ-Bench Creative Writing Benchmark Version 3: an LLM-judged creative writing benchmark using a hybrid rubric and Elo scoring system for...

Cybench

Cybench (short for Cybersecurity benchmark) is an open-source evaluation framework for measuring the cybersecurity capabilities and risks of large language...

AI Safety, Model Evaluation

DCLM (DataComp for Language Models)

DCLM, short for DataComp for Language Models (also styled DataComp-LM), is an open benchmark, dataset, and software framework, released in June 2024, for...

Data & Datasets, Natural Language Processing

DROP (Discrete Reasoning Over Paragraphs)

Discrete Reasoning Over Paragraphs: a reading comprehension benchmark requiring discrete reasoning and mathematical operations over paragraphs ...

Machine Learning, Natural Language Processing

Deep Research Bench

Deep Research Bench: a benchmark evaluating LLM agents' web research capabilities using frozen web snapshots for reproducible evaluation ...

DeepResearch Bench

Not to be confused with Deep Research Bench, a separate benchmark by FutureSearch. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents ...

Dynabench

Dynamic Benchmarking Platform: an open-source research platform for dynamic adversarial data collection and benchmarking in NLP, with humans and...

EQ-Bench 3

Emotional Intelligence Benchmark, Version 3: an LLM-judged benchmark testing emotional intelligence in large language models through multi-turn...

ERQA

Embodied Reasoning Question Answering. Release 1.0. Topics: Embodied Reasoning, Visual Question Answering, Robotics. Modality: Multiple-choice VQA (4...

Embodied AI, Google DeepMind

EgoSchema

EgoSchema is a diagnostic benchmark for evaluating very long-form video language understanding, introduced by Karttikeya Mangalam, Raiymbek Akshulakov, and...

Computer Vision, Multimodal AI

Elo rating system (AI model ranking)

The Elo rating system, as applied to AI models, is a method for turning a pile of head-to-head preference votes into a single number per model, so that large...

Machine Learning, Model Evaluation

EnigmaEval

EnigmaEval is an AI benchmark of long, complex multimodal puzzles drawn from real-world puzzle hunts, designed to measure the unstructured, creative,...

Model Evaluation

FACTS Grounding

FACTS Grounding is a factuality benchmark from Google DeepMind and Google Research that measures whether a large language model answers a request using only...

Large Language Models, Model Evaluation

FActScore

FActScore (Factual precision in Atomicity Score) is an evaluation method and metric, introduced in 2023, for measuring the factual precision of long-form text...

AI Safety

FLORES-200

FLORES-200 is a multilingual evaluation benchmark for machine translation systems, covering 200 languages across a wide range of language families, scripts,...

Natural Language Processing

FRAMES (benchmark)

FRAMES is an evaluation dataset for retrieval-augmented generation that tests factual accuracy, retrieval, and reasoning together rather than one at a time....

Information Retrieval, Model Evaluation