Dynabench
Last reviewed
May 10, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 · 2,984 words
| Dynabench | |
|---|---|
| Overview | |
| Full name | Dynamic Benchmarking Platform |
| Abbreviation | Dynabench |
| Description | An open-source research platform for dynamic adversarial data collection and benchmarking in NLP, with humans and models in the loop |
| Release date | September 24, 2020 |
| Latest version | 2.0 (under MLCommons stewardship) |
| Authors | Douwe Kiela, Max Bartolo, Yixin Nie, Adina Williams, Christopher Potts, Mohit Bansal, Robin Jia, Pontus Stenetorp, and 11 others |
| Organization | Originally Facebook AI Research (FAIR / Meta AI), transferred to MLCommons in August 2022 |
| Technical Details | |
| Type | Dynamic benchmarking, adversarial evaluation, human-in-the-loop dataset creation |
| Modality | Text, natural language |
| Task format | Web-based annotation; humans craft examples that fool a target model but remain solvable for other humans |
| Number of initial tasks | 4 core NLP tasks |
| Total examples (NAACL 2021 paper) | 267,930 across the 4 tasks |
| Evaluation metric | Validated Model Error Rate (vMER), Dynascore (introduced via Dynaboard, 2021) |
| Domains | Natural language inference, question answering, sentiment analysis, hate speech detection |
| Languages | English (primarily) |
| Resources | |
| Website | Official website |
| Paper | Kiela et al., NAACL 2021 |
| GitHub | mlcommons/dynabench |
| Related workshop | DADC (Dynamic Adversarial Data Collection), NAACL 2022 |
| License | MIT |
Dynabench is an open-source artificial intelligence benchmarking platform that runs in a web browser and supports human-and-model-in-the-loop dataset creation, where annotators try to write examples that fool a target machine learning model but that other humans still classify correctly. It was launched on September 24, 2020 by Facebook AI Research (now Meta AI) together with academic collaborators at UNC Chapel Hill, University College London (UCL), and Stanford [1][2]. Stewardship of the platform moved to MLCommons in August 2022, where the Dynabench Working Group continues to develop it as a community resource for data-centric AI and dynamic benchmarking [3].
The project is best known for the NAACL 2021 paper "Dynabench: Rethinking Benchmarking in NLP" by Douwe Kiela and 18 co-authors, which laid out the case for moving away from static NLP benchmarks such as GLUE and SuperGLUE and introduced four initial round-based tasks built on the platform [4][5].
Dynabench was a direct response to a phenomenon that had become hard to ignore by 2019: benchmark saturation. Models had taken roughly 18 years to reach human-level performance on MNIST and about 6 years on ImageNet, but only about a year to surpass non-expert human scores on GLUE [1]. SuperGLUE, designed in 2019 specifically to be harder, was already approaching saturation by the time Dynabench shipped. Static test sets were also being shown to contain spurious patterns and annotation artifacts that models could exploit without any deep understanding of the task.
The Dynabench team argued that this is partly an artifact of how datasets are made. If you collect examples once and freeze them, the resulting benchmark is only ever as hard as the models that existed when you collected it. As models get better, the benchmark gets easier in relative terms, even if the underlying task in the real world has not gotten any easier. The proposed fix was to collect new examples continuously, with the current best model in the loop, so that the test set keeps pace with the field.
The central mechanic is straightforward. A human annotator opens the Dynabench web interface, picks a task, and is asked to write an input that the target model will get wrong. The model responds in real time, telling the annotator whether it was fooled. If it was, the example goes into a queue for validation by other humans. If the validators agree on the intended label and the model still gets it wrong, the example is added to the dataset for that round [4].
When enough such examples accumulate, the team retrains the target model on the new data plus older rounds, and a new round begins with the stronger model in the loop. The result is a dataset that gets progressively harder for models while staying solvable for people, and a benchmark that is designed to resist saturation in a way a static one cannot.
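The acceptance rule for a single round can be sketched in a few lines of Python. This is a minimal illustration, not the platform's code: the `Submission` class, the function names, and the `min_agree=2` validator threshold are all hypothetical, and the three submissions are toy data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Submission:
    text: str
    intended_label: str                     # label the annotator meant
    model_label: str                        # what the target model predicted
    validator_labels: List[str] = field(default_factory=list)


def model_fooled(sub: Submission) -> bool:
    """The model is fooled when its prediction differs from the intended label."""
    return sub.model_label != sub.intended_label


def passes_validation(sub: Submission, min_agree: int = 2) -> bool:
    """Hypothetical rule: at least `min_agree` validators confirm the intended label."""
    agree = sum(label == sub.intended_label for label in sub.validator_labels)
    return agree >= min_agree


def collect_round(submissions: List[Submission]) -> List[Submission]:
    """Keep only examples that fooled the model and survived human validation."""
    return [s for s in submissions if model_fooled(s) and passes_validation(s)]


# Toy submissions, invented for illustration only.
submissions = [
    # Model fooled, validators agree with the annotator: kept.
    Submission("toy example A", "entailment", "neutral", ["entailment", "entailment"]),
    # Model was right: never reaches validation.
    Submission("toy example B", "neutral", "neutral"),
    # Model fooled, but validators disagree with the annotator: dropped.
    Submission("toy example C", "contradiction", "entailment", ["neutral", "contradiction"]),
]

round_data = collect_round(submissions)
print([s.text for s in round_data])
```

Only the first submission clears both filters, which mirrors the two conditions described above: the model must be wrong, and other humans must confirm the intended label.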
Kiela et al. summarize this with the metric Validated Model Error Rate (vMER): the share of submitted examples that fool the model and survive human validation. Across the four initial tasks, vMER ranges from about 33% on NLI to roughly 44% on hate speech, even after multiple rounds of model improvement [4].
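Read as a ratio, vMER is the number of examples that both fooled the model and survived validation, divided by the number of examples submitted. A minimal sketch under that reading follows; the function name and the counts in the usage line are illustrative, not figures from the paper.

```python
def validated_model_error_rate(n_submitted: int, n_fooled_and_validated: int) -> float:
    """Share of submitted examples that fooled the model and survived validation."""
    if n_submitted <= 0:
        raise ValueError("n_submitted must be positive")
    if not 0 <= n_fooled_and_validated <= n_submitted:
        raise ValueError("validated count must lie between 0 and n_submitted")
    return n_fooled_and_validated / n_submitted


# Toy counts: 70 of 200 submissions fooled the model and were validated.
toy_vmer = validated_model_error_rate(200, 70)
print(f"{toy_vmer:.2%}")
```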
The NAACL 2021 paper reported on four tasks, each with a designated academic owner from one of the partner institutions [4]:
| Task | Total examples | Rounds | Validated model error rate | Lead institution |
|---|---|---|---|---|
| Natural Language Inference (NLI) | 170,294 | 4 | 33.24% | UNC Chapel Hill / FAIR |
| Question Answering (QA) | 36,406 | 2 | 33.74% | UCL |
| Sentiment Analysis | 19,975 | 3 | 35.00% | Stanford |
| Hate Speech Detection | 41,255 | 4 | 43.90% | Alan Turing Institute / Simon Fraser |
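As a quick consistency check, the per-task counts in the table can be summed and compared with the 267,930 total given in the infobox. The snippet below only restates numbers already in this article; the dictionary name is arbitrary.

```python
# Per-task example counts as listed in the table above.
task_examples = {
    "NLI": 170_294,
    "QA": 36_406,
    "Sentiment": 19_975,
    "Hate speech": 41_255,
}

total = sum(task_examples.values())
assert total == 267_930  # matches the infobox total

# Share of the total contributed by each task.
for task, count in task_examples.items():
    print(f"{task}: {count / total:.1%}")
```

NLI accounts for well over half of all examples, which is worth keeping in mind when reading any figure aggregated across the four tasks.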
The NLI task is built on top of ANLI (Adversarial NLI), a dataset Yixin Nie and colleagues introduced at ACL 2020. ANLI was already a three-round, model-in-the-loop dataset (R1 with a single BERT-style model, R2 with a stronger ensemble, R3 with even harder contexts), totaling 169,265 examples [6]. Dynabench inherited those three rounds and added a fourth round with new contexts and stronger models, so the platform now hosts the canonical successor to SNLI and MultiNLI for adversarial NLI work. A widely cited finding is that early GPT-3 variants performed close to chance on ANLI even though they did well on older NLI benchmarks [4][6].
The QA track grew out of "Beat the AI," a 2020 paper by Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp at UCL [7]. They had crowd workers write SQuAD-style questions over Wikipedia passages while a reading comprehension model tried to answer them. Round 1 used three different models in the loop (BiDAF, BERT-Large, and RoBERTa-Large), with 10,000 training, 1,000 validation, and 1,000 test examples per model, for 30,000 training, 3,000 validation, and 3,000 test examples in total. The Dynabench QA task uses this data as round 1 and continues collecting harder questions in subsequent rounds.
The sentiment task was led by Christopher Potts and the Stanford team. Rather than treating sentiment as a solved binary problem, they used the platform to probe the long tail: sarcasm, mixed sentiment, context-dependence, and demographic shifts in language. Annotators write naturalistic, prompt-based sentences and try to elicit incorrect predictions from a model trained on Amazon reviews and similar resources [4].
The hate speech task was led by Bertie Vidgen (then at the Alan Turing Institute) and Zeerak Waseem at Simon Fraser. The seed data drew from the corpus indexed at hatespeechdata.com, which aggregates roughly 470,000 hateful or abusive statements from prior research, and the round-by-round annotation focused on cases where models confuse hate speech with mere counter-speech, sarcasm, or in-group reclamation [4]. The 43.9% vMER on this task is the highest of the four, which the authors read as evidence that hate speech is much further from solved than a single-number F1 on a static benchmark would suggest.
The NAACL 2021 paper had 19 authors from a mix of industry and academic labs, which makes it useful to list them in one place [4]:
| Author | Affiliation |
|---|---|
| Douwe Kiela (lead) | Facebook AI Research |
| Max Bartolo | UCL |
| Yixin Nie | UNC Chapel Hill |
| Divyansh Kaushik | Carnegie Mellon University |
| Atticus Geiger | Stanford |
| Zhengxuan Wu | Stanford |
| Bertie Vidgen | Alan Turing Institute |
| Grusha Prasad | Johns Hopkins University |
| Amanpreet Singh | Facebook AI Research |
| Pratik Ringshia | Facebook AI Research |
| Zhiyi Ma | Facebook AI Research |
| Tristan Thrush | Facebook AI Research |
| Sebastian Riedel | Facebook AI Research / UCL |
| Zeerak Waseem | Simon Fraser University |
| Pontus Stenetorp | UCL |
| Robin Jia | Facebook AI Research |
| Mohit Bansal | UNC Chapel Hill |
| Christopher Potts | Stanford |
| Adina Williams | Facebook AI Research |
Dynabench organizes data collection into progressive rounds. Each round fixes a target model (or ensemble), collects adversarial examples until a predetermined budget is reached, validates them, and then retrains a stronger model that becomes the target for the next round.
| Round | Typical target model | Annotator success at fooling | Notes |
|---|---|---|---|
| Round 1 | Single BERT-Large class model | ~40 to 50% | Easiest to fool, used to seed the dataset |
| Round 2 | RoBERTa-Large or ensemble | ~25 to 35% | Includes round 1 data in training |
| Round 3 | Adversarially trained ensemble | ~15 to 25% | Harder contexts, longer passages |
| Round 4 (NLI, hate speech) | Stronger ensembles, sometimes mixture-of-experts | Below 15% | Used for held-out evaluation |
These ranges are reported by the Dynabench paper for ANLI and the hate speech task; QA and sentiment ran fewer rounds during the initial release [4].
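The round structure (fix a target, collect until a budget is reached, retrain on everything gathered so far) can be sketched as a loop. Everything here is a toy stand-in: `run_rounds`, the memorizing `toy_train` function, and the four-item pool are hypothetical, and gold labels take the place of human validation.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]        # (text, gold label)
Model = Callable[[str], str]     # maps text to a predicted label


def run_rounds(
    initial_model: Model,
    candidate_pool: Sequence[Example],
    train: Callable[[List[Example]], Model],
    n_rounds: int,
    budget: int,
) -> List[List[Example]]:
    """Collect up to `budget` model-fooling examples per round, then retrain."""
    model = initial_model
    accumulated: List[Example] = []
    per_round: List[List[Example]] = []
    for _ in range(n_rounds):
        round_examples: List[Example] = []
        for text, gold in candidate_pool:
            if len(round_examples) >= budget:
                break                              # round budget reached
            if (text, gold) in accumulated:
                continue                           # already collected earlier
            if model(text) != gold:                # current target is fooled
                round_examples.append((text, gold))
        accumulated.extend(round_examples)
        per_round.append(round_examples)
        model = train(accumulated)                 # stronger target for next round
    return per_round


def toy_train(data: List[Example]) -> Model:
    """Toy 'training': memorize seen examples, otherwise predict 'positive'."""
    memory = dict(data)
    return lambda text: memory.get(text, "positive")


pool = [("a", "negative"), ("b", "positive"), ("c", "negative"), ("d", "negative")]
rounds = run_rounds(lambda text: "positive", pool, toy_train, n_rounds=3, budget=2)
print([len(r) for r in rounds])
```

With this toy setup each round yields fewer fooling examples than the last, because the retrained target has absorbed the earlier rounds. That is the qualitative pattern the table describes, not a reproduction of its numbers.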
A companion paper, "Dynaboard: An Evaluation-as-a-Service Platform for Holistic Next-Generation Benchmarking," appeared at NeurIPS 2021 and added a server-side evaluation backend on top of Dynabench [8]. Dynaboard lets researchers upload model containers rather than test-set predictions, which prevents test-set contamination and allows the platform to measure things you cannot measure from a leaderboard CSV.
Dynaboard introduced Dynascore, a single-number metric that aggregates five axes:
| Axis | What it measures |
|---|---|
| Accuracy | Standard task metric, for example F1 or exact match |
| Compute | Inference speed on a fixed evaluation cloud |
| Memory | Average memory footprint in gigabytes |
| Robustness | Performance on perturbed inputs (typos, paraphrases) |
| Fairness | Disparities across demographic or topical slices |
The weights on these axes are user-configurable, so a researcher who cares mostly about accuracy and a deployment engineer who cares mostly about latency can rank submissions by their own priorities on the same backend [8].
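The effect of user-configurable weights can be illustrated with a plain weighted average. This is a simplification and not the published Dynascore formula: it assumes every axis has already been normalized to a 0-to-1 "higher is better" score, which the Dynaboard paper handles with its own procedure. The model scores and weight profiles below are invented for illustration.

```python
from typing import Dict


def weighted_score(axis_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of pre-normalized axis scores (simplified stand-in)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    missing = set(weights) - set(axis_scores)
    if missing:
        raise KeyError(f"missing axis scores: {sorted(missing)}")
    return sum(weights[a] * axis_scores[a] for a in weights) / total_weight


# Toy, pre-normalized scores for two hypothetical submissions.
model_a = {"accuracy": 0.95, "compute": 0.4, "memory": 0.5, "robustness": 0.8, "fairness": 0.8}
model_b = {"accuracy": 0.75, "compute": 0.9, "memory": 0.9, "robustness": 0.7, "fairness": 0.8}

# Two hypothetical weight profiles over the same five axes.
researcher = {"accuracy": 8, "compute": 1, "memory": 1, "robustness": 1, "fairness": 1}
engineer = {"accuracy": 1, "compute": 4, "memory": 4, "robustness": 1, "fairness": 1}

for name, weights in [("researcher", researcher), ("engineer", engineer)]:
    a, b = weighted_score(model_a, weights), weighted_score(model_b, weights)
    print(name, "prefers", "A" if a > b else "B")
```

The same two submissions rank differently under the two weight profiles, which is the behavior the configurable weights are meant to allow.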
On August 23, 2022, MLCommons announced that it was adopting Dynabench and forming a Dynabench Working Group to continue developing the platform [3]. The transfer happened for two related reasons. First, FAIR was reorganizing under Meta AI and several of the original maintainers, including Douwe Kiela, were moving on (Kiela later co-founded Contextual AI). Second, MLCommons wanted a software-side counterpart to the MLPerf benchmarks it had built for ML hardware, and Dynabench was the most fully developed dynamic benchmarking platform available.
Under MLCommons, the platform received a 2.0 rewrite that cleaned up the codebase, improved the annotator interface, and broadened the kinds of benchmarks the system can host beyond the original four tasks. The repository now lives at github.com/mlcommons/dynabench under an MIT license [9].
MLCommons also began hosting other community efforts on top of Dynabench. By mid-2023, the platform was running at least five communities [10]:
| Community | Focus |
|---|---|
| DADC | Original adversarial NLP tasks (NLI, QA, sentiment, hate speech) |
| DataPerf | Data-centric benchmarks for dataset quality and selection |
| FLORES | Low-resource machine translation evaluation |
| BabyLM | Sample-efficient pretraining on developmentally plausible corpora |
| Open community tasks | Custom tasks created by researchers via the 2.0 platform |
The BabyLM collaboration was announced on July 11, 2023, and used Dynabench to host the leaderboard for the BabyLM Challenge, where researchers train small language models on data budgets comparable to what a child hears in early life [10].
The First Workshop on Dynamic Adversarial Data Collection (DADC) was co-located with NAACL 2022 in Seattle on July 14, 2022 [11]. It featured a shared task on extractive question answering and a panel on the future of data collection moderated by Adina Williams, with panelists Anna Rogers, Jordan Boyd-Graber, Sam Bowman, Sherry Tongshuang Wu, Lora Aroyo, Douwe Kiela, and Swabha Swayamdipta. Proceedings are in the ACL Anthology, and the workshop helped formalize DADC as a research subarea rather than a single platform feature.
It is easy in 2026 to forget how unsettled NLP benchmarking felt in 2019 and 2020. Models were posting human-level scores on test sets that had been considered hard a year or two earlier, but anyone using those models in production knew the scores were misleading. Dynabench is one of the more honest responses to that mismatch. It said, in effect: the problem is not that we need a stickier benchmark; the problem is that any static benchmark will go stale as soon as a new model class arrives.
Three contributions stand out in retrospect:
There are also honest limitations. Most of the original tasks are English-only. Adversarial annotation is expensive and slower than scraping. The vMER metric is sensitive to the choice of target model, so a paper that improves on, say, ANLI R3 is not always demonstrating real-world progress. And there is an open theoretical question, sometimes called the "adversarial treadmill" problem, about whether dynamic benchmarks systematically push models toward surface robustness without measuring deeper capabilities. A 2023 ICLR paper by Shirali, Abebe, and Hardt offered a formal model of dynamic benchmarks that begins to address this [12].
| Aspect | Static benchmarks (GLUE, SuperGLUE, SQuAD) | Dynabench |
|---|---|---|
| Data collection | One-time, frozen | Continuous, round-based |
| Difficulty | Fixed at creation | Adapts to current best models |
| Saturation | Common within 1 to 3 years | By design, hard to fully saturate |
| Annotation artifacts | Often exploitable | Reduced by adversarial filter |
| Real-world relevance | Decays over time | Maintained by ongoing collection |
| Cost | Low after initial collection | High, requires ongoing annotation |
| Reproducibility | Easy, just download the test set | Harder, depends on a live platform |
The Dynabench team has been unusually candid about the platform's weak spots. In the NAACL 2021 paper and later DADC writeups, they flag the following [4][11]: