Property:Description
From AI Wiki
Type: text
Usage: 60 pages
Improper assignments: 5
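For reference, pages assign a value to this property with standard Semantic MediaWiki annotation syntax. A minimal sketch in wikitext (the description text here is illustrative, not taken from any actual page):

 <!-- In-text annotation: stores the value and renders it in place -->
 [[Description::A benchmark for abstract reasoning in language models]]

 <!-- Silent annotation via the #set parser function: stores the value without rendering it -->
 {{#set: Description=A benchmark for abstract reasoning in language models }}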
Showing 20 pages using this property.
WebGPT🤖: ChatGPT with unbiased access to the Web; it can build products using no-code playgrounds and use APIs. Powered by Web Requests.
GPT Public Directory: A directory assistant for finding and registering GPTs, with 11,000+ GPTs available.
ARC-AGI 2: A benchmark for measuring general intelligence through abstract reasoning and pattern recognition tasks
Humanity's Last Exam: Multi-modal AI benchmark testing frontier knowledge across 100+ academic subjects, designed to be the final robust academic test for large language models
MMLU-Pro: A more robust and challenging multi-task language understanding benchmark with 10-choice questions
Aider Polyglot: A challenging multi-language code generation benchmark testing LLMs on 225 difficult Exercism coding exercises across six programming languages
AIME 2024: A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning
AIME 2025: A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2025 problems, testing olympiad-level mathematical reasoning with complex multi-step problem solving
GPQA Diamond: A challenging subset of graduate-level, Google-proof science questions testing PhD-level knowledge in biology, physics, and chemistry
MMMLU: Multilingual evaluation frameworks based on the Massive Multitask Language Understanding benchmark, including translations and adaptations for 26+ languages
Tau-bench: A benchmark for evaluating AI agents' ability to complete complex tasks through realistic tool-agent-user interactions in real-world domains
AA-LCR: A benchmark evaluating long context reasoning across multiple real-world documents (~100k tokens)
Creative Writing v3: An LLM-judged creative writing benchmark using hybrid rubric and Elo scoring for enhanced discrimination
EQ-Bench 3: An LLM-judged benchmark testing emotional intelligence through challenging role-plays and analysis tasks
IFBench: A benchmark for evaluating precise instruction following with verifiable out-of-domain constraints
LiveCodeBench: A holistic and contamination-free evaluation benchmark for code LLMs with continuous updates
Longform Creative Writing: An LLM-judged benchmark evaluating extended narrative generation across 8 chapters
MMMU: A massive multi-discipline multimodal benchmark evaluating expert-level understanding and reasoning across college-level subjects
SciCode: A research coding benchmark curated by scientists for realistic scientific problem-solving
Terminal-Bench: A benchmark for evaluating AI agents' ability to complete real-world, end-to-end tasks in terminal environments
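A listing like the one above can be reproduced with a Semantic MediaWiki inline query. A minimal sketch, assuming the default result format (the limit value is illustrative):

 {{#ask: [[Description::+]]
  |?Description
  |limit=20
 }}

This selects every page that has any Description value and prints that value alongside the page name.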
Showing 5 related entities.
LiveBench
MathArena
MMLU
SimpleBench
SWE-bench