Dynabench

Updated May 10, 2026

Dynabench
Overview
Full name	Dynamic Benchmarking Platform
Abbreviation	Dynabench
Description	An open-source research platform for dynamic adversarial data collection and benchmarking in NLP, with humans and models in the loop
Release date	September 24, 2020
Latest version	2.0 (under MLCommons stewardship)
Authors	Douwe Kiela, Max Bartolo, Yixin Nie, Adina Williams, Christopher Potts, Mohit Bansal, Robin Jia, Pontus Stenetorp, and 11 others
Organization	Originally Facebook AI Research (FAIR / Meta AI), transferred to MLCommons in August 2022
Technical Details
Type	Dynamic benchmarking, adversarial evaluation, human-in-the-loop dataset creation
Modality	Text, natural language
Task format	Web-based annotation; humans craft examples that fool a target model but remain solvable for other humans
Number of initial tasks	4 core NLP tasks
Total examples (NAACL 2021 paper)	267,930 across the 4 tasks
Evaluation metric	Validated Model Error Rate (vMER), Dynascore (introduced via Dynaboard, 2021)
Domains	Natural language inference, question answering, sentiment analysis, hate speech detection
Languages	English (primarily)
Resources
Website	Official website
Paper	Kiela et al., NAACL 2021
GitHub	mlcommons/dynabench
Related workshop	DADC (Dynamic Adversarial Data Collection), NAACL 2022
License	MIT

Dynabench is an open-source artificial intelligence benchmarking platform that runs in a web browser and supports human-and-model-in-the-loop dataset creation, where annotators try to write examples that fool a target machine learning model but that other humans still classify correctly. It was launched on September 24, 2020 by Facebook AI Research (now Meta AI) together with academic collaborators at UNC Chapel Hill, University College London (UCL), and Stanford ^[1]^[2]. Stewardship of the platform moved to MLCommons in August 2022, where the Dynabench Working Group continues to develop it as a community resource for data-centric AI and dynamic benchmarking ^[3].

The project is best known for the NAACL 2021 paper "Dynabench: Rethinking Benchmarking in NLP" by Douwe Kiela and 18 co-authors, which laid out the case for moving away from static NLP benchmarks such as GLUE and SuperGLUE and introduced four initial round-based tasks built on the platform ^[4]^[5].

Background and motivation

Dynabench was a direct response to a phenomenon that had become hard to ignore by 2019: benchmark saturation. Models had taken roughly 18 years to reach human-level performance on MNIST and about 6 years on ImageNet, but only about a year to surpass non-expert human scores on GLUE ^[1]. SuperGLUE, designed in 2019 specifically to be harder, was already approaching saturation by the time Dynabench shipped. Static test sets were also being shown to contain spurious patterns and annotation artifacts that models could exploit without any deep understanding of the task.

The Dynabench team argued that this is partly an artifact of how datasets are made. If you collect examples once and freeze them, the resulting benchmark is only ever as hard as the models that existed when you collected it. As models get better, the benchmark gets easier in relative terms, even if the underlying task in the real world has not gotten any easier. The proposed fix was to collect new examples continuously, with the current best model in the loop, so that the test set keeps pace with the field.

Core idea: humans and models in the loop

The central mechanic is straightforward. A human annotator opens the Dynabench web interface, picks a task, and is asked to write an input that the target model will get wrong. The model responds in real time, telling the annotator whether it was fooled. If it was, the example goes into a queue for validation by other humans. If the validators agree on the intended label and the model still gets it wrong, the example is added to the dataset for that round ^[4].

When enough such examples accumulate, the team retrains the target model on the new data plus older rounds, and a new round begins with the stronger model in the loop. The result is a dataset that gets progressively harder for models while staying solvable for people, and a benchmark designed to resist saturation in a way that a static one cannot.
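The acceptance rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's code: the `Submission` class, `majority_label`, `collect_round`, and the keyword-based `toy_model` are all hypothetical names, and a strict majority vote is assumed as the way validators "agree".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Submission:
    text: str
    intended_label: str          # label the annotator meant the example to carry
    validator_labels: List[str]  # labels given independently by other humans


def majority_label(labels: List[str]) -> Optional[str]:
    """Return the label picked by a strict majority of validators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def collect_round(
    submissions: List[Submission], target_model: Callable[[str], str]
) -> List[Submission]:
    """Keep examples that fool the model AND that validators confirm."""
    accepted = []
    for sub in submissions:
        if target_model(sub.text) == sub.intended_label:
            continue  # model was right, so the example is not adversarial
        if majority_label(sub.validator_labels) == sub.intended_label:
            accepted.append(sub)  # model fooled, humans agree with annotator
    return accepted


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a target model: a naive keyword rule."""
    return "negative" if "not" in text.lower().split() else "positive"


submissions = [
    Submission("I could not put this book down", "positive",
               ["positive", "positive", "positive"]),
    Submission("This was not good", "negative",
               ["negative", "negative", "negative"]),
    Submission("Great, another delay", "negative",
               ["negative", "positive", "positive"]),
]

accepted = collect_round(submissions, toy_model)
print([s.text for s in accepted])  # ['I could not put this book down']
```

The second submission is dropped because the model answers correctly, and the third because the validators do not confirm the annotator's label, so only the first reaches the round's dataset.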

Kiela et al. summarize this with the metric Validated Model Error Rate (vMER): the share of submitted examples that fool the model and survive human validation. Across the four initial tasks, vMER ranges from about 33% on NLI to roughly 44% on hate speech, even after multiple rounds of model improvement ^[4].
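As a small worked example, vMER as defined in the sentence above is a simple ratio. The function name and the counts used below are illustrative assumptions, not figures from the paper.

```python
def validated_model_error_rate(num_validated_fooling: int, num_submitted: int) -> float:
    """Share of submitted examples that fooled the model and passed validation."""
    if num_submitted <= 0:
        raise ValueError("num_submitted must be positive")
    if not 0 <= num_validated_fooling <= num_submitted:
        raise ValueError("validated count must lie between 0 and num_submitted")
    return num_validated_fooling / num_submitted


# Hypothetical counts: 200 submissions, 50 of which fooled the model
# and were confirmed by human validators.
vmer = validated_model_error_rate(50, 200)
print(f"{vmer:.2%}")  # 25.00%
```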

The four initial tasks

The NAACL 2021 paper reported on four tasks, each with a designated academic owner from one of the partner institutions ^[4]:

Task	Total examples	Rounds	Validated model error rate	Lead institution
Natural Language Inference (NLI)	170,294	4	33.24%	UNC Chapel Hill / FAIR
Question Answering (QA)	36,406	2	33.74%	UCL
Sentiment Analysis	19,975	3	35.00%	Stanford
Hate Speech Detection	41,255	4	43.90%	Alan Turing Institute / Simon Fraser
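
The per-task counts in the table can be checked against the 267,930 total given in the overview. The short script below only re-adds the figures quoted in this article; the variable names are illustrative.

```python
# Example counts per task, copied from the table above.
task_examples = {
    "NLI": 170_294,
    "QA": 36_406,
    "Sentiment": 19_975,
    "Hate speech": 41_255,
}

total = sum(task_examples.values())
assert total == 267_930  # matches the total reported for the four tasks

# Share of the combined data contributed by each task.
shares = {task: count / total for task, count in task_examples.items()}
for task, share in shares.items():
    print(f"{task}: {share:.1%}")
```

NLI alone accounts for roughly 64% of the examples, so any aggregate statistic over the four tasks is dominated by the NLI data.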

Natural language inference

The NLI task is built on top of ANLI (Adversarial NLI), a dataset Yixin Nie and colleagues introduced at ACL 2020. ANLI was already a three-round, model-in-the-loop dataset (R1 with a single BERT-style model, R2 with a stronger ensemble, R3 with even harder contexts), totaling 169,265 examples ^[6]. Dynabench inherited those three rounds and added a fourth round with new contexts and stronger models, so the platform now hosts the canonical successor to SNLI and MultiNLI for adversarial NLI work. A widely cited finding is that early GPT-3 variants performed close to chance on ANLI even though they did well on older NLI benchmarks ^[4]^[6].

Question answering

The QA track grew out of "Beat the AI," a 2020 paper by Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp at UCL ^[7]. They had crowd workers write SQuAD-style questions over Wikipedia passages while a reading comprehension model tried to answer them. Round 1 used three different models in the loop (BiDAF, BERT-Large, and RoBERTa-Large), with 10,000 training, 1,000 validation, and 1,000 test examples per model, for 30,000 training, 3,000 validation, and 3,000 test examples in total. The Dynabench QA task uses this data as round 1 and continues collecting harder questions in subsequent rounds.

Sentiment analysis

The sentiment task was led by Christopher Potts and the Stanford team. Rather than treating sentiment as a solved binary problem, they used the platform to probe the long tail: sarcasm, mixed sentiment, context-dependence, and demographic shifts in language. Annotators write naturalistic, prompt-based sentences and try to elicit incorrect predictions from a model trained on Amazon reviews and similar resources ^[4].

Hate speech detection

The hate speech task was led by Bertie Vidgen (then at the Alan Turing Institute) and Zeerak Waseem at Simon Fraser. The seed data drew from the corpus indexed at hatespeechdata.com, which aggregates roughly 470,000 hateful or abusive statements from prior research, and the round-by-round annotation focused on cases where models confuse hate speech with mere counter-speech, sarcasm, or in-group reclamation ^[4]. The 43.9% vMER on this task is the highest of the four, which the authors read as evidence that hate speech is much further from solved than a single-number F1 on a static benchmark would suggest.

Authors and affiliations

The NAACL 2021 paper had 19 authors from a mix of industry and academic labs, which makes it useful to list them in one place ^[4]:

Author	Affiliation
Douwe Kiela (lead)	Facebook AI Research
Max Bartolo	UCL
Yixin Nie	UNC Chapel Hill
Divyansh Kaushik	Carnegie Mellon University
Atticus Geiger	Stanford
Zhengxuan Wu	Stanford
Bertie Vidgen	Alan Turing Institute
Grusha Prasad	Johns Hopkins University
Amanpreet Singh	Facebook AI Research
Pratik Ringshia	Facebook AI Research
Zhiyi Ma	Facebook AI Research
Tristan Thrush	Facebook AI Research
Sebastian Riedel	Facebook AI Research / UCL
Zeerak Waseem	Simon Fraser University
Pontus Stenetorp	UCL
Robin Jia	Facebook AI Research
Mohit Bansal	UNC Chapel Hill
Christopher Potts	Stanford
Adina Williams	Facebook AI Research

Round-based structure

Dynabench organizes data collection into progressive rounds. Each round fixes a target model (or ensemble), collects adversarial examples until a predetermined budget is reached, validates them, and then retrains a stronger model that becomes the target for the next round.

Round	Typical target model	Annotator success at fooling	Notes
Round 1	Single BERT-Large class model	~40 to 50%	Easiest to fool, used to seed the dataset
Round 2	RoBERTa-Large or ensemble	~25 to 35%	Includes round 1 data in training
Round 3	Adversarially trained ensemble	~15 to 25%	Harder contexts, longer passages
Round 4 (NLI, hate speech)	Stronger ensembles, sometimes mixture-of-experts	Below 15%	Used for held-out evaluation

These ranges are reported by the Dynabench paper for ANLI and the hate speech task; QA and sentiment ran fewer rounds during the initial release ^[4].
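
The round structure described in this section amounts to a loop that alternates collection and retraining. The sketch below is a toy under stated assumptions: `run_rounds`, `collect`, `retrain`, and the four-item `POOL` are hypothetical, the "model" simply memorizes its training data, and the validation step is assumed to have already happened inside `collect`.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (text, gold label)
Model = Callable[[str], str]     # maps text to a predicted label


def run_rounds(
    initial_model: Model,
    collect: Callable[[Model], List[Example]],
    retrain: Callable[[List[Example]], Model],
    num_rounds: int,
) -> Tuple[Model, List[List[Example]]]:
    """Alternate adversarial collection and retraining for num_rounds rounds."""
    model = initial_model
    rounds: List[List[Example]] = []
    for _ in range(num_rounds):
        rounds.append(collect(model))            # examples that fool this round's target
        all_data = [ex for r in rounds for ex in r]
        model = retrain(all_data)                # new data plus all older rounds
    return model, rounds


# Toy stand-ins (assumptions for illustration only).
POOL: List[Example] = [("a", "neg"), ("b", "neg"), ("c", "pos"), ("d", "neg")]


def collect(model: Model, budget: int = 2) -> List[Example]:
    """Return up to `budget` pool items that the current model gets wrong."""
    fooled = [(text, label) for text, label in POOL if model(text) != label]
    return fooled[:budget]


def retrain(data: List[Example]) -> Model:
    """'Train' by memorizing the data; unseen inputs default to 'pos'."""
    memory = dict(data)
    return lambda text: memory.get(text, "pos")


final_model, rounds = run_rounds(lambda text: "pos", collect, retrain, num_rounds=3)
print([len(r) for r in rounds])  # [2, 1, 0]
```

In this toy setting the number of fooling examples per round shrinks as the model absorbs earlier rounds, which mirrors the falling annotator success rates in the table above.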

Dynaboard and Dynascore

A companion paper, "Dynaboard: An Evaluation-as-a-Service Platform for Holistic Next-Generation Benchmarking," appeared at NeurIPS 2021 and added a server-side evaluation backend on top of Dynabench ^[8]. Dynaboard lets researchers upload model containers rather than test-set predictions, which prevents test-set contamination and allows the platform to measure things you cannot measure from a leaderboard CSV.

Dynaboard introduced Dynascore, a single-number metric that aggregates five axes:

Axis	What it measures
Accuracy	Standard task metric, for example F1 or exact match
Compute	Inference speed on a fixed evaluation cloud
Memory	Average memory footprint in gigabytes
Robustness	Performance on perturbed inputs (typos, paraphrases)
Fairness	Disparities across demographic or topical slices

The weights on these axes are user-configurable, so a researcher who cares mostly about accuracy and a deployment engineer who cares mostly about latency can rank submissions by their own priorities on the same backend ^[8].
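
To make the effect of user-configurable weights concrete, here is a simplified sketch of weighted multi-axis ranking. It is not the Dynascore formula: the Dynaboard paper defines its own procedure for putting the axes on a common scale, whereas this example substitutes plain min-max normalization followed by a weighted average. The function name, the two models, and all metric values are hypothetical.

```python
from typing import Dict

# Direction of each axis; memory is the only one where lower is better here.
HIGHER_IS_BETTER = {
    "accuracy": True,
    "compute": True,      # treated as throughput, e.g. examples per second
    "memory": False,      # gigabytes
    "robustness": True,
    "fairness": True,
}


def weighted_scores(
    models: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> Dict[str, float]:
    """Min-max normalize each axis across models, then take a weighted average."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    scores = {name: 0.0 for name in models}
    for axis, weight in weights.items():
        values = [metrics[axis] for metrics in models.values()]
        lo, hi = min(values), max(values)
        for name, metrics in models.items():
            norm = 0.5 if hi == lo else (metrics[axis] - lo) / (hi - lo)
            if not HIGHER_IS_BETTER[axis]:
                norm = 1.0 - norm
            scores[name] += (weight / total_weight) * norm
    return scores


# Hypothetical submissions with made-up metric values.
models = {
    "big":   {"accuracy": 0.90, "compute": 10.0, "memory": 8.0,
              "robustness": 0.80, "fairness": 0.85},
    "small": {"accuracy": 0.85, "compute": 50.0, "memory": 2.0,
              "robustness": 0.75, "fairness": 0.85},
}

accuracy_first = {"accuracy": 4, "compute": 1, "memory": 1, "robustness": 1, "fairness": 1}
latency_first = {"accuracy": 1, "compute": 4, "memory": 1, "robustness": 1, "fairness": 1}

researcher = weighted_scores(models, accuracy_first)
engineer = weighted_scores(models, latency_first)
print(max(researcher, key=researcher.get))  # big
print(max(engineer, key=engineer.get))      # small
```

The same two submissions swap places when the weight moves from accuracy to compute, which is the behavior the configurable weights are meant to allow.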

Move to MLCommons

On August 23, 2022, MLCommons announced that it was adopting Dynabench and forming a Dynabench Working Group to continue developing the platform ^[3]. The transfer happened for two related reasons. First, FAIR was reorganizing under Meta AI and several of the original maintainers, including Douwe Kiela, were moving on (Kiela later co-founded Contextual AI). Second, MLCommons wanted a software-side counterpart to the MLPerf benchmarks it had built for ML hardware, and Dynabench was the most fully developed dynamic benchmarking platform available.

Under MLCommons, the platform received a 2.0 rewrite that cleaned up the codebase, improved the annotator interface, and broadened the kinds of benchmarks the system can host beyond the original four tasks. The repository now lives at github.com/mlcommons/dynabench under an MIT license ^[9].

MLCommons also began hosting other community efforts on top of Dynabench. By mid-2023, the platform was running at least five communities ^[10]:

Community	Focus
DADC	Original adversarial NLP tasks (NLI, QA, sentiment, hate speech)
DataPerf	Data-centric benchmarks for dataset quality and selection
FLORES	Low-resource machine translation evaluation
BabyLM	Sample-efficient pretraining on developmentally plausible corpora
Open community tasks	Custom tasks created by researchers via the 2.0 platform

The BabyLM collaboration was announced on July 11, 2023, and used Dynabench to host the leaderboard for the BabyLM Challenge, where researchers train small language models on data budgets comparable to what a child hears in early life ^[10].

DADC workshop

The First Workshop on Dynamic Adversarial Data Collection (DADC) was co-located with NAACL 2022 in Seattle on July 14, 2022 ^[11]. It featured a shared task on extractive question answering and a panel on the future of data collection moderated by Adina Williams, with panelists Anna Rogers, Jordan Boyd-Graber, Sam Bowman, Sherry Tongshuang Wu, Lora Aroyo, Douwe Kiela, and Swabha Swayamdipta. Proceedings are in the ACL Anthology, and the workshop helped formalize DADC as a research subarea rather than a single platform feature.

Why Dynabench matters

It is easy in 2026 to forget how unsettled NLP benchmarking felt in 2019 and 2020. Models were posting human-level scores on test sets that had been considered hard a year or two earlier, but anyone using those models in production knew the scores were misleading. Dynabench is one of the more honest responses to that mismatch. In effect, it argued that the problem is not the lack of a harder benchmark but that any static benchmark goes stale as soon as a new model class arrives.

Three contributions stand out in retrospect:

A working artifact, not just a position paper. The platform actually shipped, accumulated more than 500,000 examples and 1,800-plus registered users, and produced datasets such as ANLI that are still cited as standard adversarial NLI evaluations ^[3]^[4].
Dynamic data collection as a research method. The DADC workshop and follow-on work treated "how should humans and models cooperate to produce hard examples" as a research question in its own right, separate from any single dataset ^[11].
Holistic evaluation through Dynaboard. Forcing submissions to be runnable model containers and scoring them on accuracy plus compute, memory, robustness, and fairness anticipated later evaluation efforts such as HELM, BIG-bench, and the more recent push toward containerized leaderboards in the LLM evaluation space ^[8].

There are also real limitations. Most of the original tasks are English-only. Adversarial annotation is more expensive and slower than scraping. The vMER metric is sensitive to the choice of target model, so a paper that improves on, say, ANLI R3 is not always demonstrating real-world progress. And there is an open theoretical question, sometimes called the "adversarial treadmill" problem, about whether dynamic benchmarks systematically push models toward surface robustness without measuring deeper capabilities. A 2023 ICLR paper by Shirali, Abebe, and Hardt offered a formal model of dynamic benchmarks that begins to address this ^[12].

Comparison with static benchmarks

Aspect	Static benchmarks (GLUE, SuperGLUE, SQuAD)	Dynabench
Data collection	One-time, frozen	Continuous, round-based
Difficulty	Fixed at creation	Adapts to current best models
Saturation	Common within 1 to 3 years	By design, hard to fully saturate
Annotation artifacts	Often exploitable	Reduced by adversarial filter
Real-world relevance	Decays over time	Maintained by ongoing collection
Cost	Low after initial collection	High, requires ongoing annotation
Reproducibility	Easy, just download the test set	Harder, depends on a live platform

Limitations and open challenges

The Dynabench team has been unusually candid about the platform's weak spots. In the NAACL 2021 paper and later DADC writeups, they flag the following ^[4]^[11]:

Annotator effects. Crowd workers vary widely in how creatively they probe a model. A small group of skilled annotators can produce most of the hard examples, which complicates fair comparison across rounds.
Model-specific overfitting. Examples that fool one target model may not fool a different architecture, so adversarial datasets can encode the target model's idiosyncrasies as much as the underlying task difficulty.
Cost. Round-based collection at the scale of ANLI runs into hundreds of thousands of dollars in crowd-worker payments per task.
Coverage. The four initial tasks are all in English, all in text, and mostly in domains where Wikipedia or product reviews provide easy seed material.
Saturation in practice. Even adversarial datasets like ANLI eventually become easier as larger models close the human-machine gap. GPT-4-class models now score well above chance on ANLI R1 to R3, which suggests that the dynamic loop has to keep running, not that it has reached an endpoint.

References

1. Meta AI Blog. "Introducing Dynabench: Rethinking the way we benchmark AI." September 24, 2020. https://ai.meta.com/blog/dynabench-rethinking-ai-benchmarking/
2. Synced Review. "Facebook's Dynabench Radically Rethinks AI Benchmarking." September 30, 2020. https://syncedreview.com/2020/09/30/facebooks-dynabench-radically-rethinks-ai-benchmarking/
3. MLCommons. "MLCommons Adopts the Dynabench Platform." August 23, 2022. https://mlcommons.org/2022/08/mlcommons-adopts-the-dynabench-platform/
4. Kiela, Douwe, et al. "Dynabench: Rethinking Benchmarking in NLP." Proceedings of NAACL-HLT 2021, pages 4110-4124. https://aclanthology.org/2021.naacl-main.324/
5. arXiv preprint of Kiela et al. 2021. https://arxiv.org/abs/2104.14337
6. Nie, Yixin, et al. "Adversarial NLI: A New Benchmark for Natural Language Understanding." ACL 2020. https://aclanthology.org/2020.acl-main.441/
7. Bartolo, Max, et al. "Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension." Transactions of the Association for Computational Linguistics, 2020. https://adversarialqa.github.io/
8. Ma, Zhiyi, et al. "Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking." NeurIPS 2021. https://arxiv.org/abs/2106.06052
9. MLCommons / Dynabench GitHub repository. https://github.com/mlcommons/dynabench
10. MLCommons. "Dynabench and BabyLM Join Forces." July 11, 2023. https://mlcommons.org/2023/07/dynabench-and-babylm-join-forces/
11. Proceedings of the First Workshop on Dynamic Adversarial Data Collection (DADC), NAACL 2022. https://aclanthology.org/volumes/2022.dadc-1/
12. Shirali, A., Abebe, R., and Hardt, M. "A Theory of Dynamic Benchmarks." ICLR 2023. https://openreview.net/pdf?id=i8L9qoeZOS
