AI Wiki
Category: Model Evaluation

49 articles

AUC (Area Under the ROC Curve)

Classification, Machine Learning

Accuracy

Classification, Machine Learning

BABILong

AI Benchmarks, Large Language Models

BERTScore

Machine Learning, Natural Language Processing

Baseline

Machine Learning

CIDEr

Computer Vision, Machine Learning

Classification Threshold

Classification, Machine Learning

Confusion Matrix

Classification, Machine Learning

Cross-Validation

Machine Learning

Decision Threshold

Classification, Machine Learning

Elo rating system (AI model ranking)

AI Benchmarks, Machine Learning, Statistics

Expected calibration error

Machine Learning, Statistics

FACTS Grounding

AI Benchmarks, Large Language Models

FRAMES (benchmark)

AI Benchmarks, Information Retrieval

Fairness Metric

AI Fairness, Ethics, Machine Learning

False Negative (FN)

Classification, Machine Learning

False Negative Rate

Classification, Machine Learning, Statistics

False Positive (FP)

Classification, Machine Learning

False Positive Rate (FPR)

Classification, Machine Learning, Statistics

Feature Importances

Interpretability, Machine Learning

Generalization

Deep Learning, Machine Learning

Generalization Curve

Learning Theory, Machine Learning

Interpretability

AI Ethics, Machine Learning

LLM-as-a-judge

AI Benchmarks, Large Language Models

LongBench v2

AI Benchmarks, Large Language Models

Loss Curve

Deep Learning, Machine Learning, Training

METEOR (metric)

Machine Learning, Natural Language Processing

Mean Absolute Error (MAE)

Machine Learning, Statistics

Mean Squared Error (MSE)

Machine Learning, Statistics

Model Capacity

Machine Learning

MuSR

AI Benchmarks, Reasoning

NoLiMa

AI Benchmarks, Large Language Models

Overfitting

Deep Learning, Machine Learning

Pass@k

AI Benchmarks, AI Code Generation, Machine Learning

Precision

Classification, Machine Learning

Prediction Bias

Machine Learning

Process reward model (PRM)

AI Safety, Machine Learning, Reinforcement Learning

ProcessBench

AI Benchmarks, Reasoning

Recall (metric)

Classification, Machine Learning

RewardBench

AI Benchmarks, Reinforcement Learning

Spider 2.0

AI Benchmarks, AI Code Generation

StrongREJECT

AI Benchmarks, AI Safety

Task-completion time horizon (METR)

AI Benchmarks, AI Safety

Terminal-Bench

AI Agents, AI Benchmarks, AI Code Generation

Test Set

Machine Learning

Validation Set

Machine Learning

WebVoyager

AI Agents, AI Benchmarks

Word error rate

Machine Learning, Natural Language Processing, Speech Recognition

chrF

Machine Learning, Natural Language Processing