Model Evaluation

121 articles

Showing 1-60 of 121 articles

AGIEval

AGIEval is an AI benchmark for evaluating foundation models on tasks that were originally designed for, and taken by, humans. Rather than building synthetic...

AI Benchmarks

ARC Evals

ARC Evals was the evaluations team incubated inside the Alignment Research Center (ARC) between 2022 and 2023, and the direct predecessor of METR (Model...

AI Safety, Research Organizations

ARC-AGI-2

ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) is an abstract reasoning benchmark for artificial intelligence, released on...

2025 in artificial intelligence, AI Benchmarks

AUC (Area Under the ROC Curve)

AUC (Area Under the ROC Curve) is a classifier evaluation metric equal to the probability that a model ranks a randomly chosen positive instance higher than a...

Machine Learning

AUC-ROC

See also: Machine learning terms. AUC (Area Under the Curve), most often the area under the ROC curve (AUC-ROC), is a threshold-independent evaluation metric...

Machine Learning, Statistics

Accuracy

See also: machine learning terms, confusion matrix, precision, recall, F1 score. Accuracy is a classification metric that measures the fraction of predictions a...

Machine Learning

Agent evaluation

Agent evaluation is the systematic measurement of how well AI agents (LLM-based systems that plan and act over multiple steps using tools) perform on...

AI Agents, AI Benchmarks

Area under the curve

Area under the curve (AUC) is a single scalar metric that summarizes the performance of a binary classifier or diagnostic test across all possible decision...

Machine Learning, Statistics

Arena-Hard

Arena-Hard (and its evaluation tool Arena-Hard-Auto) is an automatic large language model (LLM) benchmark developed by the team behind Chatbot Arena that...

AI Benchmarks

Arize Phoenix

Arize Phoenix is an open-source AI observability and evaluation platform developed by Arize AI for tracing, evaluating, and debugging large language model...

AI Companies, Developer Tools

Average Precision

See also: precision, recall, F1 score, confusion matrix, AUC, precision-recall curve. Average precision (AP) is an evaluation metric that summarizes the...

Computer Vision, Information Retrieval

BABILong

BABILong is a benchmark for testing how well a large language model can reason over facts scattered through very long text. It was introduced by Yuri Kuratov,...

AI Benchmarks, Large Language Models

BERTScore

BERTScore is an automatic, reference-based metric for evaluating text generation that scores a candidate sentence against one or more references by comparing...

Machine Learning, Natural Language Processing

BLEU (Bilingual Evaluation Understudy)

BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric that scores the quality of machine translation output by measuring how many word...

Machine Learning, Natural Language Processing

Baseline

In machine learning, a baseline is a simple reference model or method used as a point of comparison to judge whether a more complex model actually adds value....

Machine Learning

Benchmark (AI)

In artificial intelligence and machine learning, a benchmark is a standardized combination of a dataset, a task definition, and a scoring protocol that lets...

AI Benchmarks

Best AI Models for Reasoning and Math

As of July 2026, the strongest general reasoning models are Anthropic's Claude Opus 4.8 and Claude Fable 5, OpenAI's GPT-5.5, and Google's Gemini 3.1 Pro and...

Large Language Models, Reasoning Models

CIDEr

CIDEr (Consensus-based Image Description Evaluation) is an automatic evaluation metric for image captioning that scores a machine-generated caption by how...

Computer Vision, Machine Learning

Calibration Layer

A calibration layer is a post-prediction adjustment appended to a trained machine learning model that rescales its raw output scores or predicted probabilities...

Deep Learning, Machine Learning

ChemBench

ChemBench is an automated AI benchmark that measures the chemical knowledge, reasoning, and safety judgment of large language models and compares their...

AI Benchmarks

Classification Threshold

A classification threshold (also called a decision threshold or cut-off point) is a numeric value used to convert the continuous probability output of a...

Machine Learning

Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classification model by tabulating its predicted class labels against the actual class...

Machine Learning

Cross-Validation

Cross-validation is a statistical resampling technique used in machine learning to estimate how accurately a predictive model will generalize to data it was...

Machine Learning

Cybench

Cybench (short for Cybersecurity benchmark) is an open-source evaluation framework for measuring the cybersecurity capabilities and risks of large language...

AI Benchmarks, AI Safety

Decision Threshold

A decision threshold (also called a classification threshold or cutoff point) is a value used to convert the continuous probability output of a machine...

Machine Learning

Elo rating system (AI model ranking)

The Elo rating system, as applied to AI models, is a method for turning a pile of head-to-head preference votes into a single number per model, so that large...

AI Benchmarks, Machine Learning

EnigmaEval

EnigmaEval is an AI benchmark of long, complex multimodal puzzles drawn from real-world puzzle hunts, designed to measure the unstructured, creative,...

AI Benchmarks

Epoch AI

Epoch AI is a nonprofit research organization, founded in 2022 and directed by Jaime Sevilla, that studies the trajectory of artificial intelligence through...

AI Companies

Expected calibration error

Expected calibration error (ECE) is a metric that measures how well a classifier's predicted confidence matches its observed accuracy. A model is well...

Machine Learning, Statistics

F1 score

The F1 score (also written as F1-score, F-score, or F-measure) is the harmonic mean of precision and recall, calculated as 2 × (precision × recall) / (precision + recall), and it ranges from 0 (worst) to 1...

Machine Learning, Statistics

FACTS Grounding

FACTS Grounding is a factuality benchmark from Google DeepMind and Google Research that measures whether a large language model answers a request using only...

AI Benchmarks, Large Language Models

FRAMES (benchmark)

FRAMES is an evaluation dataset for retrieval-augmented generation that tests factual accuracy, retrieval, and reasoning together rather than one at a time....

AI Benchmarks, Information Retrieval

Fairness Metric

A fairness metric is a quantitative, mathematical measure used to evaluate whether a machine learning model's predictions or decisions treat different...

AI Ethics, Machine Learning

False Negative Rate

The false negative rate (FNR), also known as the miss rate, is the proportion of actual positive instances that a model or test incorrectly classifies as...

Machine Learning, Statistics

False Positive Rate (FPR)

The false positive rate (FPR) is the proportion of actual negative cases that a test, model, or decision process incorrectly classifies as positive, defined as...

Machine Learning, Statistics

False negative

A false negative (FN), also called a Type II error or a miss, is an instance whose true label is positive but that a classification model or test predicts as...

Machine Learning, Statistics

False positive

A false positive (FP), also called a Type I error or a false alarm, is an instance whose true label is negative but whose predicted label is positive: the...

Machine Learning, Statistics

Feature Importances

Feature importances are numeric scores that quantify how much each input feature contributes to the predictions of a machine learning model. The three dominant...

Interpretability, Machine Learning

FormulaOne

FormulaOne is an AI benchmark designed to measure the depth of algorithmic reasoning in frontier large language models, introduced in the July 2025 paper...

AI Benchmarks

Future of Life Institute AI Safety Index

The Future of Life Institute AI Safety Index (often shortened to the AI Safety Index or FLI AI Safety Index) is a periodic "report card" published by the...

AI Companies

Generalization

See also: Machine learning terms, Bias-variance tradeoff. Generalization in machine learning is the ability of a trained model to perform accurately on new,...

Deep Learning, Machine Learning

Generalization Curve

A generalization curve (also called a learning curve) is a plot that visualizes how a machine learning model's performance on training data and unseen data...

Machine Learning

Global-MMLU

Global-MMLU is a multilingual evaluation benchmark that extends the MMLU question-answering dataset across 42 languages, with designated subsets labeled...

AI Benchmarks, Natural Language Processing

HELM (Holistic Evaluation of Language Models)

HELM (Holistic Evaluation of Language Models) is an open-source benchmark framework created by the Center for Research on Foundation Models (CRFM) at Stanford...

AI Benchmarks

HELMET

HELMET (How to Evaluate Long-context Models Effectively and Thoroughly) is a benchmark for evaluating long-context language models introduced by researchers at...

AI Benchmarks

HalluLens

HalluLens is a large language model hallucination benchmark introduced by researchers at Meta AI's Fundamental AI Research (FAIR) lab, together with...

AI Benchmarks, Meta AI

Harness (AI)

This article is about software that wraps and evaluates AI models. For the 2015 paper "Explaining and Harnessing Adversarial Examples" and other uses of the...

AI Agents, AI Benchmarks

Helicone

Helicone is an open-source LLM observability platform and AI gateway founded in 2023 by Justin Torre, Cole Gottdank, Barak Oshri, and Scott...

AI Companies, Developer Tools

Inter-rater agreement

See also: Machine learning terms. Inter-rater agreement is the degree of consensus among two or more independent raters when they label or score the same set of...

Data & Datasets, Statistics

Interpretability

Interpretability in artificial intelligence is the degree to which a human can understand the cause of a decision a machine learning model makes, by inspecting...

AI Ethics, Machine Learning

IoU

See also: Machine learning terms. Intersection over Union (IoU), also known as the Jaccard index or Jaccard similarity coefficient, is the standard overlap...

Computer Vision

KernelBench

KernelBench is an AI benchmark and open-source evaluation environment that measures how well large language models can write fast and correct GPU kernels. The...

AI Benchmarks

LAB-Bench

LAB-Bench (the Language Agent Biology Benchmark) is an AI benchmark of more than 2,400 multiple-choice questions built to measure how well language models and...

AI Benchmarks

LLM API Pricing Comparison

As of July 2026, the cheapest credible LLM API depends on the capability tier you need, so here is the direct answer. For frontier-class intelligence, the...

AI Tools & Products, Large Language Models

LLM Benchmark Comparison (Leaderboard Overview)

As of July 2026, no single model wins every LLM benchmark, but Anthropic's Claude Fable 5 tops the most: it currently leads MMLU-Pro (91.5%), SWE-bench...

AI Benchmarks, Large Language Models

LLM Context Window Comparison

As of July 2026, the largest context window ever announced belongs to Magic's LTM-2-mini at 100,000,000 tokens (100M), but it is a research prototype that has...

AI Models, Large Language Models

LLM-as-a-judge

LLM-as-a-judge is the practice of using a strong large language model to evaluate the outputs of other models, or of itself, in place of a human annotator....

AI Benchmarks, Large Language Models

LM Evaluation Harness

LM Evaluation Harness (the Language Model Evaluation Harness, often abbreviated lm-eval or written lm-evaluation-harness) is an open-source software framework...

AI Benchmarks, Open Source AI

LMArena

LMArena is a crowdsourced artificial intelligence evaluation platform and company that ranks large language models by having anonymous users vote on which of...

AI Benchmarks, AI Companies

LangSmith

LangSmith is a commercial observability, evaluation, and deployment platform for large language model (LLM) applications and AI agents, developed and operated...

AI Companies, AI Infrastructure