43 articles
2025_Benchmarks, Document_Understanding_Benchmarks, Knowledge_Work_Benchmarks
2024_Benchmarks, Mathematical_Reasoning_Benchmarks, Pages_with_reference_errors
2025_Benchmarks, Mathematical_Reasoning_Benchmarks, Olympiad_Mathematics
Artificial_intelligence
2019_Benchmarks, Abstract_Reasoning_Benchmarks, General_Intelligence_Benchmarks
2025_in_artificial_intelligence, Artificial_Intelligence, Cognitive_Science
2026_Benchmarks, Game-Based_AI_Evaluation, General_Intelligence_Benchmarks
2024_Benchmarks, Code_Generation_Benchmarks, Multi-language_Benchmarks
2024_Benchmarks, Agentic_AI_Benchmarks, Game-Based_AI_Evaluation
Artificial_intelligence
2024_Benchmarks, Information_Retrieval_Benchmarks, OpenAI
2023_Benchmarks, Compositional_Reasoning_Benchmarks, Constrained_Generation
2024_Benchmarks, Chart_Understanding, Multimodal_Benchmarks
Large_language_models
2025_Benchmarks, Creative_Writing_Benchmarks, Language_Model_Benchmarks
2025_Benchmarks, Information_Retrieval_Benchmarks, Multi-step_Task_Benchmarks
2025_Benchmarks, Academic_AI_Evaluation, Multilingual_Benchmarks
2020_Establishments, Adversarial_Evaluation, Dynamic_Benchmarks
2025_Benchmarks, Emotional_Intelligence_Benchmarks, Language_Model_Benchmarks
2025_Benchmarks, Embodied_AI, Google_DeepMind
2025_Benchmarks, Game-Based_AI_Evaluation, Long-term_Planning_Benchmarks
Large_language_models
2025_Benchmarks, Code_Optimization_Benchmarks, Multi-language_Benchmarks
2023_Benchmarks, 2024_Benchmarks, Earth_Observation_Benchmarks
2025_Benchmarks, Healthcare_AI, Medical_Benchmarks
2025_Benchmarks, Challenging_Benchmarks, Clinical_AI_Evaluation
Large_language_models
2024_Benchmarks, Constraint_Satisfaction_Benchmarks, Instruction_Following_Benchmarks
2024_Benchmarks, Creative_Writing_Benchmarks, Long-form_Generation_Benchmarks
2021_Benchmarks, Competition_Mathematics, Mathematical_Reasoning_Benchmarks
2021_Benchmarks, Competition_Mathematics, Mathematical_Reasoning_Benchmarks
2022_Establishments, AI_Risk_Assessment, AI_Safety_Organizations
2023_Benchmarks, Knowledge_Benchmarks, Multilingual_Benchmarks
2025_in_artificial_intelligence, Mathematical_Reasoning, Mathematics_Competitions
Large_language_models
2024_Benchmarks, Code_Generation_Benchmarks, OpenAI
2024_Benchmarks, Code_Generation_Benchmarks, Domain-Specific_Benchmarks
2024_in_artificial_intelligence, Common_Sense, Natural_Language_Processing
2025_Benchmarks, Agent_Evaluation, Conversational_AI_Benchmarks
2025_Benchmarks, Educational_AI, EvolvingLMMs-Lab
2025_Benchmarks, Agent_Benchmarks, Benchmarks
2024_Benchmarks, Coding_Benchmarks, Community-Driven_Benchmarks
2024_Benchmarks, Code_Generation_Benchmarks, Machine_Learning_Benchmarks