Category: Model Evaluation

120 articles

AGIEval

AI Benchmarks

ARC Evals

AI Safety, Research Organizations

ARC-AGI-2

2025 in artificial intelligence, AI Benchmarks, Artificial Intelligence

AUC (Area Under the ROC Curve)

Machine Learning

AUC-ROC

Machine Learning, Statistics

Accuracy

Machine Learning

Agent evaluation

AI Agents, AI Benchmarks

Area under the curve

Machine Learning, Statistics

Arena-Hard

AI Benchmarks

Arize Phoenix

AI Companies, Developer Tools, MLOps

Average Precision

Computer Vision, Information Retrieval, Machine Learning

BABILong

AI Benchmarks, Large Language Models

BERTScore

Machine Learning, Natural Language Processing

BLEU (Bilingual Evaluation Understudy)

Machine Learning, Natural Language Processing

Baseline

Machine Learning

Benchmark (AI)

AI Benchmarks

CIDEr

Computer Vision, Machine Learning

Calibration Layer

Deep Learning, Machine Learning, Neural Networks

ChemBench

AI Benchmarks

Classification Threshold

Machine Learning

Confusion Matrix

Machine Learning

Cross-Validation

Machine Learning

Cybench

AI Benchmarks, AI Safety

Decision Threshold

Machine Learning

Elo rating system (AI model ranking)

AI Benchmarks, Machine Learning, Statistics

EnigmaEval

AI Benchmarks

Epoch AI

AI Companies

Expected calibration error

Machine Learning, Statistics

F1 score

Machine Learning, Statistics

FACTS Grounding

AI Benchmarks, Large Language Models

FRAMES (benchmark)

AI Benchmarks, Information Retrieval

Fairness Metric

AI Ethics, Machine Learning

False Negative (FN)

Machine Learning

False Negative Rate

Machine Learning, Statistics

False Positive (FP)

Machine Learning

False Positive Rate (FPR)

Machine Learning, Statistics

False negative

Machine Learning, Statistics

False positive

Machine Learning, Statistics

Feature Importances

Interpretability, Machine Learning

FormulaOne

AI Benchmarks

Future of Life Institute AI Safety Index

AI Companies

Generalization

Deep Learning, Machine Learning

Generalization Curve

Machine Learning

Global-MMLU

AI Benchmarks, Natural Language Processing

HELM (Holistic Evaluation of Language Models)

AI Benchmarks

HELMET

AI Benchmarks

HalluLens

AI Benchmarks, Meta AI

Helicone

AI Companies, Developer Tools, MLOps

Inter-rater agreement

Data & Datasets, Statistics

Interpretability

AI Ethics, Machine Learning

IoU

Computer Vision

KernelBench

AI Benchmarks

LAB-Bench

AI Benchmarks

LLM-as-a-judge

AI Benchmarks, Large Language Models

LMArena

AI Benchmarks, AI Companies

LangSmith

AI Companies, AI Infrastructure, Developer Tools

Langfuse

AI Companies, Developer Tools, MLOps

LongBench v2

AI Benchmarks, Large Language Models

Loss Curve

Deep Learning, Machine Learning, Training & Optimization