Model Evaluation
120 articles
AGIEval
AI Benchmarks
ARC Evals
AI Safety, Research Organizations
ARC-AGI-2
2025 in artificial intelligence, AI Benchmarks, Artificial Intelligence
AUC (Area Under the ROC Curve)
Machine Learning
AUC-ROC
Machine Learning, Statistics
Accuracy
Machine Learning
Agent evaluation
AI Agents, AI Benchmarks
Area under the curve
Machine Learning, Statistics
Arena-Hard
AI Benchmarks
Arize Phoenix
AI Companies, Developer Tools, MLOps
Average Precision
Computer Vision, Information Retrieval, Machine Learning
BABILong
AI Benchmarks, Large Language Models
BERTScore
Machine Learning, Natural Language Processing
BLEU (Bilingual Evaluation Understudy)
Natural Language Processing
BLEU (Bilingual Evaluation Understudy)
Machine Learning, Natural Language Processing
Baseline
Machine Learning
Benchmark (AI)
AI Benchmarks
CIDEr
Computer Vision, Machine Learning
Calibration Layer
Deep Learning, Machine Learning, Neural Networks
ChemBench
AI Benchmarks
Classification Threshold
Machine Learning
Confusion Matrix
Machine Learning
Cross-Validation
Machine Learning
Cybench
AI Benchmarks, AI Safety
Decision Threshold
Machine Learning
Elo rating system (AI model ranking)
AI Benchmarks, Machine Learning, Statistics
EnigmaEval
AI Benchmarks
Epoch AI
AI Companies
Expected calibration error
Machine Learning, Statistics
F1 score
Machine Learning, Statistics
FACTS Grounding
AI Benchmarks, Large Language Models
FRAMES (benchmark)
AI Benchmarks, Information Retrieval
Fairness Metric
AI Ethics, Machine Learning
False Negative (FN)
Machine Learning
False Negative Rate
Machine Learning, Statistics
False Positive (FP)
Machine Learning
False Positive Rate (FPR)
Machine Learning, Statistics
False negative
Machine Learning, Statistics
False positive
Machine Learning, Statistics
Feature Importances
Interpretability, Machine Learning
FormulaOne
AI Benchmarks
Future of Life Institute AI Safety Index
AI Companies
Generalization
Deep Learning, Machine Learning
Generalization Curve
Machine Learning
Global-MMLU
AI Benchmarks, Natural Language Processing
HELM (Holistic Evaluation of Language Models)
AI Benchmarks
HELMET
AI Benchmarks
HalluLens
AI Benchmarks, Meta AI
Helicone
AI Companies, Developer Tools, MLOps
Inter-rater agreement
Data & Datasets, Statistics
Interpretability
AI Ethics, Machine Learning
IoU
Computer Vision
KernelBench
AI Benchmarks
LAB-Bench
AI Benchmarks
LLM-as-a-judge
AI Benchmarks, Large Language Models
LMArena
AI Benchmarks, AI Companies
LangSmith
AI Companies, AI Infrastructure, Developer Tools
Langfuse
AI Companies, Developer Tools, MLOps
LongBench v2
AI Benchmarks, Large Language Models
Loss Curve
Deep Learning, Machine Learning, Training & Optimization