AI Benchmarks

222 articles

AA-LCR

Natural Language Processing

AGIEval

Model Evaluation

AIME (American Invitational Mathematics Examination)

Mathematics

AIME 2024

AIME 2025

ARC-AGI

Artificial Intelligence

ARC-AGI 1

Artificial Intelligence, Reasoning Models

ARC-AGI 3

ARC-AGI-2

2025 in artificial intelligence, Artificial Intelligence, Machine Learning

AdvBench

AI Safety, Large Language Models

Agent benchmark reward hacking

AI Agents, AI Safety

Agent evaluation

AI Agents, Model Evaluation

AgentBench

AI Agents, Large Language Models

AgentDojo

AI Agents, AI Safety

AgentHarm

AI Agents, AI Safety

Aider Polyglot

AlpacaEval

Large Language Models, Natural Language Processing

Arena-Hard

Model Evaluation

Artificial Analysis

Developer Tools, Large Language Models

BABILong

Large Language Models, Model Evaluation

BALROG

BBQ (Bias Benchmark for QA)

AI Ethics, AI Safety, Natural Language Processing

BELEBELE

Computer Vision

BIG-Bench

Large Language Models, Machine Learning

BIG-Bench Extra Hard

Google DeepMind, Reasoning Models

BIG-Bench Hard

Machine Learning, Natural Language Processing

BLINK

Computer Vision

Benchmark (AI)

Model Evaluation

Benchmarks

Artificial Intelligence

Berkeley Function Calling Leaderboard

Large Language Models

BigCodeBench

AI Code Generation

BoolQ

Natural Language Processing

BountyBench

AI Agents

BrowseComp

OpenAI

BrowserGym

AI Agents

CIFAR-10

Computer Vision, Data & Datasets

CLIP Score

Computer Vision, Image Generation, Multimodal AI

COLLIE

CRMArena / CRMArena-Pro

AI Code Generation

CRUXEval

AI Code Generation, Machine Learning, Natural Language Processing

CharXiv

Data & Datasets

Chatbot Arena

Large Language Models

ChemBench

Model Evaluation

CodeContests

AI Code Generation, Machine Learning

CommonsenseQA

Natural Language Processing

Creative Writing v3

Cybench

AI Safety, Model Evaluation

DCLM (DataComp for Language Models)

Data & Datasets, Natural Language Processing, Open Source AI

DROP (Discrete Reasoning Over Paragraphs)

Machine Learning, Natural Language Processing

Deep Research Bench

DeepResearch Bench

Dynabench

EQ-Bench 3

ERQA

Embodied AI, Google DeepMind, Multimodal AI

EgoSchema

Computer Vision, Multimodal AI

Elo rating system (AI model ranking)

Machine Learning, Model Evaluation, Statistics

EnigmaEval

Model Evaluation

FACTS Grounding

Large Language Models, Model Evaluation

FActScore

AI Safety

FLORES-200

Natural Language Processing