ToxiGen is a large-scale, machine-generated dataset designed for adversarial and implicit hate speech detection. Created by Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar, the dataset was published at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) in Dublin, Ireland. ToxiGen contains 274,186 toxicity-annotated statements spanning 13 minority demographic groups, with a roughly equal split between toxic and benign examples. What distinguishes ToxiGen from earlier hate speech datasets is its focus on implicit toxicity: 98.2% of the statements in the dataset contain no explicit slurs, profanity, or swear words. The dataset was generated using GPT-3 through a novel demonstration-based prompting framework, combined with an adversarial classifier-in-the-loop decoding method called ALICE. ToxiGen is hosted on Hugging Face and released through a GitHub repository maintained by Microsoft.
Toxic language detection is a critical component of online content moderation. Platforms and AI systems need reliable classifiers that can identify hateful, abusive, or harmful text. However, existing toxicity detection systems suffer from two major problems that ToxiGen was specifically designed to address.
First, many classifiers exhibit a high false positive rate for text that merely mentions minority groups. Because minority groups are frequently the targets of online hate, models trained on existing corpora learn spurious correlations between group mentions and toxicity labels. This means that a perfectly benign sentence like "My Muslim neighbor is a great cook" might be flagged as toxic simply because it references a religious minority. This over-flagging has real consequences: it can silence marginalized voices and discourage legitimate discussion of identity and social issues.
Second, existing classifiers struggle with implicit toxicity. Implicit hate speech avoids explicit slurs and profanity but conveys harmful stereotypes, dog whistles, or coded language. A statement like "They just have a different way of thinking about property" directed at a racial group is implicitly toxic because it perpetuates a stereotype without using any flagged keywords. Most toxicity datasets are dominated by explicitly hateful text, leaving classifiers poorly equipped to handle these subtler forms of harm.
Prior to ToxiGen, several datasets attempted to address these challenges. The Davidson et al. (2017) dataset contained 24,802 statements but only 30.2% were implicit. The Founta et al. (2018) dataset had 80,000 statements with just 26.1% implicit content. The Implicit Hate Corpus (ElSherief et al., 2021) improved the ratio to 96.8% implicit but contained only 22,584 statements. DynaHate (Vidgen et al., 2021) reached 41,134 statements with 83.3% implicit content. ToxiGen addressed the need for a dataset that was both large in scale and overwhelmingly implicit, while also maintaining a balanced split between toxic and benign examples.
ToxiGen was developed by a team of researchers from multiple institutions:
| Author | Affiliation |
|---|---|
| Thomas Hartvigsen | Massachusetts Institute of Technology (MIT) |
| Saadia Gabriel | University of Washington |
| Hamid Palangi | Microsoft Research |
| Maarten Sap | Allen Institute for AI (AI2) and Carnegie Mellon University |
| Dipankar Ray | Microsoft |
| Ece Kamar | Microsoft Research |
The project was released under the Microsoft organization on GitHub, reflecting the close collaboration between academic and industry researchers in its development.
ToxiGen consists of 274,186 statements in total, split approximately evenly between toxic and benign categories (roughly 137,093 each). The dataset covers 13 minority demographic groups and was generated entirely by GPT-3 using carefully designed prompting strategies.
The 13 target groups in ToxiGen were selected to cover a broad range of identity categories:
| Group | Category |
|---|---|
| Black | Race/Ethnicity |
| Asian | Race/Ethnicity |
| Chinese | Race/Ethnicity/Nationality |
| Latino | Race/Ethnicity |
| Mexican | Race/Ethnicity/Nationality |
| Native American | Race/Ethnicity |
| Middle Eastern | Race/Ethnicity |
| Jewish | Religion |
| Muslim | Religion |
| LGBTQ+ | Sexual Orientation/Gender Identity |
| Women | Gender |
| Mental Disability | Disability |
| Physical Disability | Disability |
Each group has both toxic and benign statements, and the dataset maintains an approximate balance across groups. For example, the "Black" group contains 10,554 benign statements and 10,306 toxic statements.
The dataset is organized into several components:
| Split | Size | Description |
|---|---|---|
| Full Training Set | 260,012 | Generated via top-k decoding for model development |
| ALICE Adversarial Set | 14,174 | Generated via adversarial classifier-in-the-loop decoding |
| Human-Validated Test Set (ToxiGen-HumanVal) | 940 | Annotated by 3 human annotators per item |
| Human-Annotated Training Sample | 8,960 | Human-labeled subset for fine-tuning classifiers |
| Raw Annotations | 27,450 | Full set of Mechanical Turk annotation responses |
One of ToxiGen's defining features is the near-total absence of explicit toxic language. Across the full dataset, 98.2% of all statements are implicit, meaning they contain no overt profanity, slurs, or swear words. The ALICE-generated subset is even more implicit at 99.7%. This makes ToxiGen uniquely challenging for toxicity classifiers that rely on keyword-based or surface-level features.
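The practical difficulty this creates can be illustrated with a toy keyword-based classifier (a hypothetical sketch with stand-in keywords, not any production system): it flags text containing blocklisted words but scores implicitly toxic statements as clean.

```python
# Toy keyword-based toxicity "classifier" (hypothetical keyword list).
# Surface-level filters like this miss the ~98% of ToxiGen statements
# that contain no slurs, profanity, or swear words.
BLOCKLIST = {"idiot", "stupid", "hate"}  # stand-in keywords

def keyword_score(text: str) -> float:
    """Return 1.0 if any blocklisted word appears, else 0.0."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return 1.0 if tokens & BLOCKLIST else 0.0

explicit = "I hate those people"  # caught: contains a keyword
implicit = "they just have a different way of thinking about property"

assert keyword_score(explicit) == 1.0
assert keyword_score(implicit) == 0.0  # implicitly toxic, but no keyword fires
```

A classifier like this scores every implicit statement identically to genuinely benign text, which is precisely the failure mode ToxiGen is built to expose.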
The core generation method in ToxiGen uses a technique called demonstration-based prompting. Rather than providing explicit instructions to GPT-3, the researchers showed the model a small set of example sentences (called demonstrations) and let it continue generating text in the same style.
For each of the 13 demographic groups, the team collected 20 to 50 human-written seed sentences per class (toxic and benign) to serve as demonstrations.
During generation, 5 random seed sentences were sampled from the relevant pool, joined together as a prompt, and fed to GPT-3. The model then generated a continuation in a style consistent with the demonstration. This process was repeated thousands of times per group to build up the dataset. In total, 26 prompt sets were created (one toxic set and one benign set for each of the 13 groups), and the process generated approximately 260,000 examples using standard top-k decoding.
The demonstration-based approach had a key advantage: by varying the seed sentences, the researchers could control the style and tone of GPT-3's outputs without relying on explicit instructions that the model might ignore or interpret inconsistently. The model produced linguistically diverse outputs while maintaining the implicit quality of the demonstrations.
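The sampling step described above can be sketched in a few lines. The seed pool and formatting here are illustrative, not the actual ToxiGen prompt files:

```python
import random

# Illustrative seed pool; the real pools hold 20-50 human-written
# sentences per (group, label) pair.
benign_seeds_women = [
    "women have made major contributions to science",
    "many women balance demanding careers and family life",
    "women's sports deserve more media coverage",
    "female authors have shaped modern literature",
    "women vote at high rates in many countries",
]

def build_prompt(seed_pool, k=5):
    """Sample k demonstrations and join them into one prompt,
    mirroring ToxiGen's demonstration-based prompting."""
    demos = random.sample(seed_pool, k)
    # List the demonstrations, then leave a dangling bullet for the
    # language model to complete in the same style.
    return "\n".join(f"- {d}" for d in demos) + "\n-"

prompt = build_prompt(benign_seeds_women)
# The prompt is then sent to the language model, which continues
# the final bullet in the style of the demonstrations.
```

Because the model only ever sees examples, not instructions, varying the seed pool is the sole control knob over style and tone.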
Beyond the standard demonstration-based generation, ToxiGen introduced an adversarial generation method called ALICE (Adversarial Classifier-In-the-Loop Constrained dEcoding). ALICE pairs the language model with a toxicity classifier during the text generation process, creating a feedback loop that produces statements specifically designed to fool existing detection systems.
The core idea behind ALICE is formalized as a modified decoding objective. At each generation step, the probability of the next token is determined by two factors: the language model's own next-token probability, and the toxicity classifier's probability that the generated text belongs to the desired class.
Both weights were set to 0.5 in the experiments. The generation used a beam size of 10, a temperature of 0.9, a maximum generation length of 30 tokens, and restricted the vocabulary to the top 100 tokens at each step. The toxicity classifier used in the loop was HateBERT fine-tuned on the OffensEval dataset.
ALICE operates in two adversarial modes: generating from toxic demonstrations while steering the classifier toward a benign prediction (producing toxic text that evades detection), and generating from benign demonstrations while steering toward a toxic prediction (producing benign text that triggers false positives).
Through this adversarial process, ALICE generated 14,174 examples. These examples are particularly valuable for stress-testing toxicity classifiers because they sit at exactly the boundary cases where classifiers are most likely to fail. Human evaluation confirmed that ALICE-generated sentences deceived HateBERT significantly more often, with a 26.4% deception rate compared to 16.8% for standard top-k sampling.
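A toy version of one ALICE decoding step, using the 0.5/0.5 weighting described above; the vocabulary, distributions, and classifier scores below are stand-ins, not GPT-3 or HateBERT:

```python
import math

def alice_step(lm_logprobs, clf_logprobs, w_lm=0.5, w_clf=0.5, top_k=100):
    """Pick the next token by combining LM next-token log-probabilities
    with a classifier's log-probability of the *desired* class, as in
    ALICE's classifier-in-the-loop constrained decoding."""
    # Restrict to the LM's top-k candidate tokens (top 100 in the paper).
    candidates = sorted(lm_logprobs, key=lm_logprobs.get, reverse=True)[:top_k]
    # Weighted sum of the two scores; both weights were 0.5 in the paper.
    scores = {t: w_lm * lm_logprobs[t] + w_clf * clf_logprobs[t]
              for t in candidates}
    return max(scores, key=scores.get)

# Stand-in distributions over a tiny vocabulary.
lm = {"kind": math.log(0.5), "sneaky": math.log(0.4), "tall": math.log(0.1)}
# Classifier log-prob that the continuation lands in the *target* class
# (e.g. "benign" when generating hard-to-detect toxic text).
clf = {"kind": math.log(0.2), "sneaky": math.log(0.7), "tall": math.log(0.1)}

next_token = alice_step(lm, clf)  # classifier pulls decoding off the LM argmax
```

With the classifier weight zeroed out, the step reduces to ordinary greedy decoding over the top-k tokens, which shows exactly where the adversarial pressure enters.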
To validate the quality of the machine-generated data, the researchers conducted a large-scale human evaluation using Amazon Mechanical Turk. A total of 156 workers were pre-qualified for the annotation task, and 51 of them participated in the actual annotation work.
Each statement in the human-validated subset was annotated by 3 independent annotators, who assessed multiple dimensions of each statement, including how toxic or harmful it was, which demographic group it targeted, and whether it appeared to be human- or machine-written.
The annotation process achieved the following agreement metrics:
| Metric | Score |
|---|---|
| Fleiss' κ (kappa) | 0.46 (moderate agreement) |
| Krippendorff's α (alpha) | 0.64 |
| Full agreement (all 3 annotators agree) | 55.17% |
| Majority agreement (at least 2 of 3 agree) | 93.4% |
The moderate Fleiss' kappa score reflects the inherent subjectivity of toxicity judgments, particularly for implicit content where reasonable people can disagree about whether a statement crosses the line from insensitive to harmful.
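Fleiss' kappa compares the observed agreement across raters against the agreement expected by chance. A minimal implementation on toy three-annotator votes (illustrative data, not the ToxiGen annotations):

```python
def fleiss_kappa(tables):
    """Fleiss' kappa for per-item category vote counts.
    Each row sums to the (fixed) number of raters."""
    n_items = len(tables)
    n_raters = sum(tables[0])
    n_cats = len(tables[0])
    # Observed per-item agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in tables]
    p_bar = sum(p_i) / n_items
    # Chance agreement from overall category proportions.
    p_j = [sum(row[j] for row in tables) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Each row: (votes for "toxic", votes for "benign") from 3 annotators.
votes = [(3, 0), (3, 0), (2, 1), (0, 3), (1, 2), (3, 0)]
kappa = fleiss_kappa(votes)  # falls between chance (0) and perfect (1)
```

Even with majority agreement on every item, split votes like (2, 1) drag kappa well below 1, which is why a 93.4% majority-agreement rate can coexist with a kappa of only 0.46.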
The annotator pool had the following demographic breakdown: 56.9% White, 9.8% Black, 3.9% Hispanic, and 3.9% Asian; 45.1% of annotators were female. The researchers acknowledged that annotator demographics can influence toxicity judgments and that the pool may not be fully representative of all affected communities.
The human evaluation also confirmed the realism of the generated data: annotators frequently judged machine-generated statements to be human-written, and the toxicity labels implied by the prompts largely matched human judgments.
The researchers evaluated prominent toxicity detection systems on ToxiGen, including HateBERT and a RoBERTa-based toxicity classifier.
The central experimental finding of ToxiGen is that fine-tuning toxicity classifiers on ToxiGen data substantially improves their performance, not just on ToxiGen's own test set, but on external human-written datasets as well. The researchers tested this using three external benchmarks: SocialBiasFrames, the Implicit Hate Corpus, and DynaHate.
Performance was measured using Area Under the ROC Curve (AUC):
| Model | Dataset | No Fine-Tuning | Fine-Tuned on ToxiGen |
|---|---|---|---|
| HateBERT | SocialBiasFrames | 0.60 | 0.71 |
| HateBERT | Implicit Hate Corpus | 0.60 | 0.67 |
| HateBERT | DynaHate | 0.47 | 0.66 |
| HateBERT | ToxiGen-Val | 0.57 | 0.96 |
| RoBERTa | SocialBiasFrames | 0.65 | 0.70 |
| RoBERTa | Implicit Hate Corpus | 0.57 | 0.66 |
| RoBERTa | DynaHate | 0.49 | 0.54 |
| RoBERTa | ToxiGen-Val | 0.57 | 0.93 |
The improvements ranged from 5 to 19 percentage points in AUC across different model-dataset combinations. The largest gains were observed on the ToxiGen validation set itself (39 points for HateBERT, 36 points for RoBERTa), but the consistent improvements on external human-written datasets demonstrate that ToxiGen training data generalizes beyond machine-generated text.
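AUC, the metric in the table above, is the probability that a randomly chosen toxic example receives a higher score than a randomly chosen benign one. A minimal pairwise implementation (scores and labels below are illustrative):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half-correct."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative classifier scores: label 1 = toxic, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 pairs ranked correctly
```

An AUC near 0.5 (as HateBERT's 0.47 on DynaHate before fine-tuning) means the classifier ranks toxic and benign examples no better than chance, which puts the post-fine-tuning gains in context.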
The team released two pre-trained classifiers fine-tuned on ToxiGen data: a HateBERT-based model and a RoBERTa-based model.
Both models are available through the ToxiGen GitHub repository and showed significant improvements over their base versions on implicit hate speech detection tasks.
ToxiGen represents a significant advancement over prior hate speech and toxicity datasets in several dimensions:
| Dataset | Size | % Implicit | % Hate Class | Year |
|---|---|---|---|---|
| Davidson et al. | 24,802 | 30.2% | 5.0% | 2017 |
| Founta et al. | 80,000 | 26.1% | 53.9% | 2018 |
| Implicit Hate Corpus | 22,584 | 96.8% | 39.8% | 2021 |
| DynaHate | 41,134 | 83.3% | 53.8% | 2021 |
| ToxiGen | 274,186 | 98.2% | 50.1% | 2022 |
ToxiGen is roughly 3.4 times larger than the next biggest dataset (Founta et al.) and has the highest proportion of implicit content. It also has the most balanced split between toxic and benign classes at 50.1% hate, which is important for training classifiers that do not develop a bias toward predicting one class over the other.
The primary intended use of ToxiGen is as a benchmark for evaluating and improving toxicity classifiers. Because the dataset is dominated by implicit toxic content, it tests whether classifiers can detect harmful language that lacks obvious surface-level cues. Research teams and companies building content moderation systems can use ToxiGen to stress-test their models against the kinds of subtle toxic language that are most likely to slip through existing filters.
ToxiGen has also been adopted as a benchmark for evaluating large language models themselves. By prompting a language model with ToxiGen's demonstration sets, researchers can measure how likely the model is to generate toxic continuations. This application has become increasingly important as LLMs are deployed in consumer-facing products like chatbots and writing assistants.
Deshpande et al. (2023) adapted ToxiGen for investigating the safety of LLM-based chatbots by providing seven sentences from the dataset and prompting models to respond in a similar style. This methodology has been used to evaluate models including GPT-4, LLaMA, and others for their tendency to produce harmful content.
Because fine-tuning on ToxiGen data consistently improves classifier performance on human-written benchmarks, the dataset serves as valuable training data for building more robust content moderation tools. Organizations can augment their existing training pipelines with ToxiGen examples to improve their systems' ability to catch implicit toxicity.
ToxiGen has been integrated into several LLM evaluation frameworks, including the EleutherAI Language Model Evaluation Harness. This integration allows researchers to include ToxiGen as a standard benchmark when evaluating new models, making it part of the broader ecosystem of AI safety evaluation tools.
ToxiGen is available through multiple channels:
- The dataset is hosted as `toxigen/toxigen-data` on the Hugging Face Hub. Access requires filling out a form and authenticating with a Hugging Face token.
- A Python package can be installed via `pip install toxigen`, which provides utilities for loading and working with the data.

The GitHub repository includes:
- Human-written demonstration sentences (`demonstrations/` directory)
- Prompt files (`prompts/` directory)
- Example notebooks (`notebooks/` directory)
- A generation script (`generate.py`) with support for both standard and ALICE-based generation

The researchers acknowledged that toxicity is inherently subjective. What one person considers harmful, another might view as merely insensitive or even acceptable. The moderate inter-annotator agreement (Fleiss' kappa of 0.46) reflects this reality. The labels in ToxiGen should be understood as reflecting majority annotator judgments rather than objective ground truth.
While 13 groups represents broader coverage than many earlier datasets, it does not encompass all communities that experience online hate. The dataset also focuses exclusively on groups that are targets of oppression, without addressing the language of dominant groups (whiteness, heterosexuality, able-bodiedness) and how such language contributes to systems of harm.
The annotator pool was predominantly White (56.9%), which raises questions about whether the toxicity judgments fully capture the perspectives of the communities most affected by the hate speech in the dataset. Members of targeted groups may perceive and evaluate hate speech differently from those who do not share those identities.
Any dataset generated at the scale of 274,000 examples using automated methods will inevitably contain noise. Some statements labeled as toxic may be relatively benign, and vice versa. The human-validated subsets provide higher-quality labels, but the broader training set relies on prompt-based heuristic labels.
The methods described in ToxiGen, particularly ALICE, can be used to generate toxic text that is specifically designed to evade detection systems. While the authors released these tools for defensive research purposes (building better classifiers), the same techniques could theoretically be used by bad actors to produce harder-to-detect hate speech at scale. The authors flagged this dual-use concern in the paper.
Toxic language is highly context-dependent. A statement that is harmful in one context might be acceptable in another (for instance, reclaiming slurs within a community versus using them as outsiders). ToxiGen's statements are evaluated in isolation, without the broader conversational or social context that would exist in real-world deployment scenarios.
Since its publication, ToxiGen has become one of the most widely cited resources in the hate speech detection literature. The dataset and its methods have influenced subsequent research in several directions:
Community contributions have also expanded the dataset. In March 2024, contributors added demonstration sets for additional demographic categories, including immigrants and bisexual individuals, extending ToxiGen's coverage beyond the original 13 groups.
Newer benchmarks like TET (Realistic Evaluation of Toxicity) have built on ToxiGen's foundations while attempting to address some of its limitations, such as eliciting more toxicity from modern LLMs in controlled settings. These newer tools often compare their performance against ToxiGen as a baseline, underscoring ToxiGen's role as a standard reference point in the field.