Model Evaluation
49 articles
AUC (Area Under the ROC Curve)
Classification, Machine Learning
Accuracy
Classification, Machine Learning
BABILong
AI Benchmarks, Large Language Models
BERTScore
Machine Learning, Natural Language Processing
Baseline
Machine Learning
CIDEr
Computer Vision, Machine Learning
Classification Threshold
Classification, Machine Learning
Confusion Matrix
Classification, Machine Learning
Cross-Validation
Machine Learning
Decision Threshold
Classification, Machine Learning
Elo rating system (AI model ranking)
AI Benchmarks, Machine Learning, Statistics
Expected calibration error
Machine Learning, Statistics
FACTS Grounding
AI Benchmarks, Large Language Models
FRAMES (benchmark)
AI Benchmarks, Information Retrieval
Fairness Metric
AI Fairness, Ethics, Machine Learning
False Negative (FN)
Classification, Machine Learning
False Negative Rate
Classification, Machine Learning, Statistics
False Positive (FP)
Classification, Machine Learning
False Positive Rate (FPR)
Classification, Machine Learning, Statistics
Feature Importances
Interpretability, Machine Learning
Generalization
Deep Learning, Machine Learning
Generalization Curve
Learning Theory, Machine Learning
Interpretability
AI Ethics, Machine Learning
LLM-as-a-judge
AI Benchmarks, Large Language Models
LongBench v2
AI Benchmarks, Large Language Models
Loss Curve
Deep Learning, Machine Learning, Training
METEOR (metric)
Machine Learning, Natural Language Processing
Mean Absolute Error (MAE)
Machine Learning, Statistics
Mean Squared Error (MSE)
Machine Learning, Statistics
Model Capacity
Machine Learning
MuSR
AI Benchmarks, Reasoning
NoLiMa
AI Benchmarks, Large Language Models
Overfitting
Deep Learning, Machine Learning
Pass@k
AI Benchmarks, AI Code Generation, Machine Learning
Precision
Classification, Machine Learning
Prediction Bias
Machine Learning
Process reward model (PRM)
AI Safety, Machine Learning, Reinforcement Learning
ProcessBench
AI Benchmarks, Reasoning
Recall (metric)
Classification, Machine Learning
RewardBench
AI Benchmarks, Reinforcement Learning
Spider 2.0
AI Benchmarks, AI Code Generation
StrongREJECT
AI Benchmarks, AI Safety
Task-completion time horizon (METR)
AI Benchmarks, AI Safety
Terminal-Bench
AI Agents, AI Benchmarks, AI Code Generation
Test Set
Machine Learning
Validation Set
Machine Learning
WebVoyager
AI Agents, AI Benchmarks
Word error rate
Machine Learning, Natural Language Processing, Speech Recognition
chrF
Machine Learning, Natural Language Processing