# ToxiGen

> Source: https://aiwiki.ai/wiki/toxigen
> Updated: 2026-07-16
> Categories: AI Benchmarks, AI Ethics, AI Safety, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

ToxiGen is a large-scale, machine-generated dataset designed for adversarial and implicit hate speech detection. Created by Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar, the dataset was published at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) in Dublin, Ireland.[1] ToxiGen contains 274,186 toxicity-annotated statements spanning 13 minority demographic groups, with a roughly equal split between toxic and benign examples.[1] What distinguishes ToxiGen from earlier hate speech datasets is its focus on implicit toxicity: 98.2% of the statements in the dataset contain no explicit slurs, profanity, or swear words.[1] The dataset was generated using [GPT-3](/wiki/gpt-3) through a novel demonstration-based prompting framework, combined with an adversarial classifier-in-the-loop decoding method called ALICE.[1] ToxiGen is hosted on [Hugging Face](/wiki/hugging_face) and released through a [GitHub](https://github.com/microsoft/TOXIGEN) repository maintained by [Microsoft](/wiki/microsoft_ai).

## Background and Motivation

Toxic language detection is a critical component of online content moderation. Platforms and AI systems need reliable classifiers that can identify hateful, abusive, or harmful text. However, existing toxicity detection systems suffer from two major problems that ToxiGen was specifically designed to address.

First, many classifiers exhibit a high false positive rate for text that merely mentions minority groups. Because minority groups are frequently the targets of online hate, models trained on existing corpora learn spurious correlations between group mentions and toxicity labels. This means that a perfectly benign sentence like "My Muslim neighbor is a great cook" might be flagged as toxic simply because it references a religious minority. This over-flagging has real consequences: it can silence marginalized voices and discourage legitimate discussion of identity and social issues.

Second, existing classifiers struggle with implicit toxicity. Implicit hate speech avoids explicit slurs and profanity but conveys harmful stereotypes, dog whistles, or coded language. A statement like "They just have a different way of thinking about property" directed at a racial group is implicitly toxic because it perpetuates a stereotype without using any flagged keywords. Most toxicity datasets are dominated by explicitly hateful text, leaving classifiers poorly equipped to handle these subtler forms of harm.

Prior to ToxiGen, several datasets attempted to address these challenges. The Davidson et al. (2017) dataset contained 24,802 statements but only 30.2% were implicit.[6] The Founta et al. (2018) dataset had 80,000 statements with just 26.1% implicit content.[7] The Implicit Hate Corpus (ElSherief et al., 2021) improved the ratio to 96.8% implicit but contained only 22,584 statements.[4] DynaHate (Vidgen et al., 2021) reached 41,134 statements with 83.3% implicit content.[5] ToxiGen addressed the need for a dataset that was both large in scale and overwhelmingly implicit, while also maintaining a balanced split between toxic and benign examples.

## Authors and Institutions

ToxiGen was developed by a team of researchers from multiple institutions:[1]

| Author | Affiliation |
|---|---|
| Thomas Hartvigsen | Massachusetts Institute of Technology (MIT) |
| Saadia Gabriel | University of Washington |
| Hamid Palangi | [Microsoft Research](/wiki/microsoft_ai) |
| Maarten Sap | Allen Institute for AI (AI2) and Carnegie Mellon University |
| Dipankar Ray | Microsoft |
| Ece Kamar | Microsoft Research |

The project was released under the Microsoft organization on GitHub, reflecting the close collaboration between academic and industry researchers in its development.

## Dataset Composition

### Overview

ToxiGen consists of 274,186 statements in total, split approximately evenly between toxic and benign categories (about 137,000 each).[1] The dataset covers 13 minority demographic groups and was generated entirely by [GPT-3](/wiki/gpt-3) using carefully designed prompting strategies.[1]

### Demographic Groups

The 13 target groups in ToxiGen were selected to cover a broad range of identity categories:[1]

| Group | Category |
|---|---|
| Black | Race/Ethnicity |
| Asian | Race/Ethnicity |
| Chinese | Race/Ethnicity/Nationality |
| Latino | Race/Ethnicity |
| Mexican | Race/Ethnicity/Nationality |
| Native American | Race/Ethnicity |
| Middle Eastern | Race/Ethnicity |
| Jewish | Religion |
| Muslim | Religion |
| LGBTQ+ | Sexual Orientation/Gender Identity |
| Women | Gender |
| Mental Disability | Disability |
| Physical Disability | Disability |

Each group has both toxic and benign statements, and the dataset maintains an approximate balance across groups. For example, the "Black" group contains 10,554 benign statements and 10,306 toxic statements.[1]

### Dataset Splits

The dataset is organized into several components:[1]

| Split | Size | Description |
|---|---|---|
| Full Training Set | ~260,012 | Generated via top-k decoding for model development |
| ALICE Adversarial Set | 14,174 | Generated via adversarial classifier-in-the-loop decoding |
| Human-Validated Test Set (ToxiGen-HumanVal) | 940 | Annotated by 3 human annotators per item |
| Human-Annotated Training Sample | 8,960 | Human-labeled subset for fine-tuning classifiers |
| Raw Annotations | 27,450 | Full set of Mechanical Turk annotation responses |

### Implicit vs. Explicit Content

One of ToxiGen's defining features is the near-total absence of explicit toxic language. Across the full dataset, 98.2% of all statements are implicit, meaning they contain no overt profanity, slurs, or swear words. The ALICE-generated subset is even more implicit at 99.7%.[1] This makes ToxiGen uniquely challenging for toxicity classifiers that rely on keyword-based or surface-level features.
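The notion of "implicit" used above can be illustrated with a simple lexicon check. The sketch below is a minimal illustration, not the authors' procedure: the tokenizer, the function names, and the placeholder lexicon are all assumptions, and a real measurement would use a curated profanity and slur list.

```python
import re

def is_implicit(statement: str, lexicon: set[str]) -> bool:
    """Return True when no token of the statement appears in the lexicon."""
    tokens = re.findall(r"[a-z']+", statement.lower())
    return not any(tok in lexicon for tok in tokens)

def implicit_fraction(statements: list[str], lexicon: set[str]) -> float:
    """Share of statements that contain no lexicon term."""
    if not statements:
        raise ValueError("need at least one statement")
    return sum(is_implicit(s, lexicon) for s in statements) / len(statements)

# Placeholder lexicon (assumption): stand-in words, not a real profanity list.
PLACEHOLDER_LEXICON = {"badword", "slurword"}

examples = [
    "My neighbor is a great cook.",
    "They just have a different way of thinking about property.",
    "You are such a badword.",
    "The new library opens on Monday.",
]

print(implicit_fraction(examples, PLACEHOLDER_LEXICON))  # 0.75
```

A check like this only establishes the absence of flagged words; it says nothing about whether a statement is toxic, which is why keyword-based classifiers struggle on this dataset.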

## Methodology

### Demonstration-Based Prompting

The core generation method in ToxiGen uses a technique called demonstration-based prompting. Rather than providing explicit instructions to [GPT-3](/wiki/gpt-3), the researchers showed the model a small set of example sentences (called demonstrations) and let it continue generating text in the same style.

For each of the 13 demographic groups, the team collected 20 to 50 human-written seed sentences.[1] These seeds were drawn from two sources:

1. **Implicit hate speech examples** taken from existing hate speech datasets, which provided models of subtle, coded toxic language.
2. **Neutral statements** collected from news articles, opinion pieces, podcast transcripts, and other public sources that mentioned minority groups in benign contexts.

During generation, 5 random seed sentences were sampled from the relevant pool, joined together as a prompt, and fed to GPT-3. The model then generated a continuation in a style consistent with the demonstration. This process was repeated thousands of times per group to build up the dataset. In total, 26 prompt sets were created (one toxic set and one benign set for each of the 13 groups), and the process generated approximately 260,000 examples using standard top-k decoding.[1]
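The sampling-and-joining step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the bulleted prompt layout, the function name `build_prompt`, and the placeholder seed pool are illustrative choices, and the call to the language model itself is left out.

```python
import random

def build_prompt(seed_pool, k=5, rng=None):
    """Sample k demonstrations from one seed pool and join them into a prompt."""
    if rng is None:
        rng = random.Random()
    if len(seed_pool) < k:
        raise ValueError("seed pool is smaller than the number of demonstrations")
    demos = rng.sample(seed_pool, k)  # sampling without replacement
    # Assumed layout: one demonstration per bullet; the trailing "-" invites
    # the model to write the next statement in the same style.
    return "".join(f"- {d}\n" for d in demos) + "-"

# Hypothetical pool standing in for the 20 to 50 human-written seeds of one group.
pool = [f"placeholder benign sentence {i}" for i in range(20)]

prompt = build_prompt(pool, k=5, rng=random.Random(0))
print(prompt)
```

Because a fresh subset is drawn for every call, repeated calls yield different prompts from the same pool, which is what drives the diversity of the generated statements.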

The demonstration-based approach had a key advantage: by varying the seed sentences, the researchers could control the style and tone of GPT-3's outputs without relying on explicit instructions that the model might ignore or interpret inconsistently. The model produced linguistically diverse outputs while maintaining the implicit quality of the demonstrations.

### ALICE: Adversarial Classifier-in-the-Loop Decoding

Beyond the standard demonstration-based generation, ToxiGen introduced an adversarial classifier-in-the-loop decoding method called ALICE (Adversarial Language Imitation with Constrained Exemplars).[1] ALICE pairs the [language model](/wiki/large_language_model) with a toxicity classifier during the text generation process, creating a feedback loop that produces statements specifically designed to fool existing detection systems.

The core idea behind ALICE is formalized as a modified decoding objective. At each generation step, the probability of the next token is determined by two factors:

- The language model's own probability of the token given the context (weighted by a parameter λ_L)
- The toxicity classifier's assessment of the generated sequence so far (weighted by a parameter λ_C)

Both weights were set to 0.5 in the experiments. The generation used a beam size of 10, a temperature of 0.9, a maximum generation length of 30 tokens, and restricted the vocabulary to the top 100 tokens at each step. The toxicity classifier used in the loop was HateBERT fine-tuned on the OffensEval dataset.[1]

ALICE operates in two adversarial modes:

1. **Generating false negatives:** Starting with a toxic prompt, ALICE generates toxic text while maximizing the classifier's probability that the output is benign. This produces toxic statements that evade detection.
2. **Generating false positives:** Starting with a benign prompt, ALICE generates benign text while maximizing the classifier's probability that the output is toxic. This produces innocuous statements that classifiers would incorrectly flag.
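The weighted objective and the two adversarial modes can be illustrated with a single re-scoring step. The sketch below is a simplification and not the authors' implementation: it assumes the two factors are combined as a weighted sum of log-probabilities, it scores one step rather than running a full beam search, and the toy probabilities and the name `alice_step` are invented for illustration.

```python
import math

def alice_step(lm_probs, clf_toxic_probs, prompt_is_toxic,
               lam_l=0.5, lam_c=0.5, top_k=100):
    """Rank candidate next tokens by a weighted LM + classifier score.

    lm_probs:        token -> language model probability given the context
    clf_toxic_probs: token -> classifier P(toxic) for the sequence extended
                     by that token
    """
    # Restrict the vocabulary to the top_k tokens under the language model.
    candidates = sorted(lm_probs, key=lm_probs.get, reverse=True)[:top_k]
    scored = []
    for tok in candidates:
        p_toxic = clf_toxic_probs[tok]
        # Adversarial target: toxic prompts steer toward text the classifier
        # calls benign; benign prompts steer toward text it calls toxic.
        p_target = 1.0 - p_toxic if prompt_is_toxic else p_toxic
        score = (lam_l * math.log(max(lm_probs[tok], 1e-12))
                 + lam_c * math.log(max(p_target, 1e-12)))
        scored.append((score, tok))
    scored.sort(reverse=True)
    return [tok for _, tok in scored]

# Toy numbers (assumptions) for three candidate tokens.
lm = {"a": 0.5, "b": 0.3, "c": 0.2}
clf = {"a": 0.9, "b": 0.2, "c": 0.5}

print(alice_step(lm, clf, prompt_is_toxic=True))   # ['b', 'c', 'a']
print(alice_step(lm, clf, prompt_is_toxic=False))  # ['a', 'c', 'b']
```

With a toxic prompt, token "a" falls to the bottom even though the language model prefers it, because the classifier would flag it; with a benign prompt the same token rises to the top.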

Through this adversarial process, ALICE generated 14,174 examples.[1] These examples are particularly valuable for stress-testing toxicity classifiers because they represent the boundary cases where classifiers are most likely to fail. Human evaluation confirmed that ALICE-generated sentences were significantly more successful at deceiving HateBERT, with a 26.4% deception rate compared to 16.8% for standard top-k sampling.[1]

## Human Evaluation

### Annotation Process

To validate the quality of the machine-generated data, the researchers conducted a large-scale human evaluation using [Amazon Mechanical Turk](https://www.mturk.com/). A total of 156 workers were pre-qualified for the annotation task, and 51 of them participated in the actual annotation work.[1]

Each statement in the human-validated subset was annotated by 3 independent annotators.[1] The annotators assessed multiple dimensions of each statement, including:

- Whether the statement was toxic or benign
- Whether the statement contained harmful stereotypes
- Which demographic group the statement referenced
- Whether the statement appeared to be human-written or machine-generated
- Factuality assessments

### Inter-Annotator Agreement

The annotation process achieved the following agreement metrics:[1]

| Metric | Score |
|---|---|
| Fleiss' κ (kappa) | 0.46 (moderate agreement) |
| Krippendorff's α (alpha) | 0.64 |
| Full agreement (all 3 annotators agree) | 55.17% |
| Majority agreement (at least 2 of 3 agree) | 93.4% |

The moderate Fleiss' kappa score reflects the inherent subjectivity of toxicity judgments, particularly for implicit content where reasonable people can disagree about whether a statement crosses the line from insensitive to harmful.
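For readers unfamiliar with the metric, Fleiss' kappa compares observed agreement with the agreement expected by chance. The sketch below implements the standard formula on a toy table of four items rated by three annotators; the data are invented for illustration and are not ToxiGen annotations.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])
    # Observed agreement: mean share of agreeing rater pairs per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)

def full_agreement_rate(counts):
    """Share of items on which all raters chose the same category."""
    return sum(max(row) == sum(row) for row in counts) / len(counts)

# Toy ratings (assumption): columns are [benign, toxic], three raters per item.
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]

print(round(fleiss_kappa(toy), 3))  # 0.333
print(full_agreement_rate(toy))     # 0.5
```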

### Annotator Demographics

By race or ethnicity, the annotator pool was 56.9% White, 9.8% Black, 3.9% Hispanic, and 3.9% Asian; 45.1% of annotators were female.[1] The researchers acknowledged that annotator demographics can influence toxicity judgments and that the pool may not be fully representative of all affected communities.

### Key Findings from Human Evaluation

The human evaluation produced several important findings:

- **94.5% of toxic examples were labeled as hate speech** by human annotators, confirming that the machine-generated toxic content was genuinely harmful rather than merely edgy or ambiguous.[1]
- **90.5% of machine-generated examples were perceived as human-written**, indicating that GPT-3 produced text of sufficient quality and naturalness that annotators could not reliably distinguish it from human-authored content.[1]
- For benign prompts, both decoding methods reliably produced data matching the intended label: 95.2% of benign prompts produced benign outputs with top-k decoding, and 92.1% with ALICE.[1]
- For toxic prompts, the match rate was much lower: 67.7% for top-k and 40.3% for ALICE. The lower rate for ALICE is expected, as ALICE explicitly tries to make toxic content appear benign to classifiers.[1]

## Classifier Evaluation and Results

### Baseline Classifiers Tested

The researchers evaluated several prominent toxicity detection systems on ToxiGen:

- [Google](/wiki/google_deepmind)'s Perspective API
- HateBERT (Caselli et al., 2021)[3]
- OpenAI content filter
- AI2 Delphi
- [RoBERTa](/wiki/roberta) (ToxDectRoBERTa variant by Zhou et al., 2021)[8]

### Fine-Tuning Results

The central experimental finding of ToxiGen is that fine-tuning toxicity classifiers on ToxiGen data substantially improves their performance, not just on ToxiGen's own test set, but on external human-written datasets as well. The researchers tested this using three external benchmarks: SocialBiasFrames, the Implicit Hate Corpus, and DynaHate.[1]

Performance was measured using Area Under the ROC Curve (AUC):

| Model | Dataset | No Fine-Tuning | Fine-Tuned on ToxiGen |
|---|---|---|---|
| [HateBERT](/wiki/bert) | SocialBiasFrames | 0.60 | 0.71 |
| HateBERT | Implicit Hate Corpus | 0.60 | 0.67 |
| HateBERT | DynaHate | 0.47 | 0.66 |
| HateBERT | ToxiGen-Val | 0.57 | 0.96 |
| RoBERTa | SocialBiasFrames | 0.65 | 0.70 |
| RoBERTa | Implicit Hate Corpus | 0.57 | 0.66 |
| RoBERTa | DynaHate | 0.49 | 0.54 |
| RoBERTa | ToxiGen-Val | 0.57 | 0.93 |

On the three external datasets, fine-tuning improved AUC by 0.05 to 0.19 (5 to 19 points), depending on the model-dataset combination.[1] The largest gains overall were on the ToxiGen validation set itself (0.39 for HateBERT, 0.36 for RoBERTa), but the consistent improvements on external human-written datasets suggest that ToxiGen training data generalizes beyond machine-generated text.
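The gains quoted above follow directly from the table. The short script below recomputes them from the reported AUC values; only the dictionary layout and the helper name are assumptions.

```python
# (model, dataset) -> (AUC without fine-tuning, AUC after fine-tuning on ToxiGen)
auc = {
    ("HateBERT", "SocialBiasFrames"): (0.60, 0.71),
    ("HateBERT", "Implicit Hate Corpus"): (0.60, 0.67),
    ("HateBERT", "DynaHate"): (0.47, 0.66),
    ("HateBERT", "ToxiGen-Val"): (0.57, 0.96),
    ("RoBERTa", "SocialBiasFrames"): (0.65, 0.70),
    ("RoBERTa", "Implicit Hate Corpus"): (0.57, 0.66),
    ("RoBERTa", "DynaHate"): (0.49, 0.54),
    ("RoBERTa", "ToxiGen-Val"): (0.57, 0.93),
}

def gain_points(before, after):
    """AUC gain expressed in points (hundredths of AUC)."""
    return round((after - before) * 100)

gains = {key: gain_points(*pair) for key, pair in auc.items()}
external = [g for (_, dataset), g in gains.items() if dataset != "ToxiGen-Val"]

print(min(external), max(external))         # 5 19
print(gains[("HateBERT", "ToxiGen-Val")])   # 39
print(gains[("RoBERTa", "ToxiGen-Val")])    # 36
```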

### Released Models

The team released two pre-trained classifiers fine-tuned on ToxiGen data:

- **HateBERT_ToxiGen**: A HateBERT model fine-tuned on ToxiGen's human-annotated subset
- **RoBERTa_ToxiGen**: A ToxDectRoBERTa model fine-tuned on ToxiGen's human-annotated subset

Both models are available through the ToxiGen GitHub repository and showed significant improvements over their base versions on implicit hate speech detection tasks.

## Comparison with Existing Datasets

ToxiGen represents a significant advancement over prior hate speech and toxicity datasets in several dimensions:

| Dataset | Size | % Implicit | % Hate Class | Year |
|---|---|---|---|---|
| Davidson et al.[6] | 24,802 | 30.2% | 5.0% | 2017 |
| Founta et al.[7] | 80,000 | 26.1% | 53.9% | 2018 |
| Implicit Hate Corpus[4] | 22,584 | 96.8% | 39.8% | 2021 |
| DynaHate[5] | 41,134 | 83.3% | 53.8% | 2021 |
| **ToxiGen**[1] | **274,186** | **98.2%** | **50.1%** | **2022** |

ToxiGen is roughly 3.4 times larger than the next biggest dataset (Founta et al.) and has the highest proportion of implicit content.[1] It also has the most balanced split between toxic and benign classes at 50.1% hate, which is important for training classifiers that do not develop a bias toward predicting one class over the other.

## Usage and Applications

### Evaluating Toxicity Classifiers

The primary intended use of ToxiGen is as a [benchmark](/wiki/ai_benchmark) for evaluating and improving toxicity classifiers. Because the dataset is dominated by implicit toxic content, it tests whether classifiers can detect harmful language that lacks obvious surface-level cues. Research teams and companies building content moderation systems can use ToxiGen to stress-test their models against the kinds of subtle toxic language that are most likely to slip through existing filters.
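A stress test of this kind boils down to scoring labeled statements and computing a ranking metric such as AUC. The sketch below is a hedged illustration: the keyword scorer and the placeholder statements are invented stand-ins, and in practice the scores would come from the classifier under test and the labels from a ToxiGen split.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: chance that a toxic item outscores a benign one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one toxic and one benign example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def keyword_score(text):
    """Hypothetical surface-level classifier: fires only on a flagged word."""
    return 1.0 if "badword" in text.lower() else 0.0

# Placeholder evaluation set (assumption): label 1 = toxic, 0 = benign.
eval_set = [
    ("placeholder benign statement", 0),
    ("placeholder implicit toxic statement", 1),
    ("another placeholder benign statement", 0),
    ("placeholder explicit statement with badword", 1),
]

labels = [label for _, label in eval_set]
scores = [keyword_score(text) for text, _ in eval_set]
print(roc_auc(labels, scores))  # 0.75
```

The keyword scorer ranks the explicit example correctly but cannot separate the implicit toxic statement from the benign ones, which is the failure mode the benchmark is meant to expose.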

### Evaluating Language Models

ToxiGen has also been adopted as a benchmark for evaluating [large language models](/wiki/large_language_model) themselves. By prompting a language model with ToxiGen's demonstration sets, researchers can measure how likely the model is to generate toxic continuations. This application has become increasingly important as LLMs are deployed in consumer-facing products like chatbots and writing assistants.

Deshpande et al. (2023) adapted ToxiGen for investigating the safety of LLM-based chatbots by providing seven sentences from the dataset and prompting models to respond in a similar style.[9] This methodology has been used to evaluate models including [GPT-4](/wiki/gpt-4), [LLaMA](/wiki/llama), and others for their tendency to produce harmful content.
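The measurement described in this section can be sketched as a small loop: prompt the model, score each continuation, and report the share that crosses a threshold. Everything named below is hypothetical; `generate` and `toxicity_score` stand in for a real language model and a real toxicity classifier, and the 0.5 threshold is an arbitrary choice.

```python
from typing import Callable, Iterable

def toxic_continuation_rate(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    toxicity_score: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """Share of prompts whose generated continuation scores at or above threshold."""
    flagged = 0
    total = 0
    for prompt in prompts:
        continuation = generate(prompt)
        total += 1
        if toxicity_score(continuation) >= threshold:
            flagged += 1
    if total == 0:
        raise ValueError("no prompts given")
    return flagged / total

# Stubs (assumptions) so the sketch runs without a model or a classifier.
def stub_generate(prompt: str) -> str:
    return prompt.upper()

def stub_toxicity_score(text: str) -> float:
    return 0.9 if "FLAG" in text else 0.1

demo_prompts = ["- placeholder demo a\n-", "- placeholder flag demo\n-"]
print(toxic_continuation_rate(demo_prompts, stub_generate, stub_toxicity_score))  # 0.5
```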

### Training Data for Safety Systems

Because fine-tuning on ToxiGen data consistently improves classifier performance on human-written benchmarks, the dataset serves as valuable training data for building more robust content moderation tools. Organizations can augment their existing training pipelines with ToxiGen examples to improve their systems' ability to catch implicit toxicity.

### Integration with Evaluation Frameworks

ToxiGen has been integrated into several LLM evaluation frameworks, including the EleutherAI Language Model Evaluation Harness. This integration allows researchers to include ToxiGen as a standard benchmark when evaluating new models, making it part of the broader ecosystem of [AI safety](/wiki/ai_safety) evaluation tools.

## Accessing the Dataset

ToxiGen is available through multiple channels:

- **Hugging Face**: The dataset is hosted at `toxigen/toxigen-data` on the Hugging Face Hub. Access requires filling out a form and authenticating with a Hugging Face token.
- **GitHub**: The full codebase for generating ToxiGen data, including prompt templates, demonstration sets, and generation scripts, is available at the [Microsoft TOXIGEN repository](https://github.com/microsoft/TOXIGEN).
- **Python Package**: A companion package can be installed via pip with `pip install toxigen`; it provides utilities for loading and working with the data.

The GitHub repository includes:

- Human-provided demonstration sentences organized by group (in the `demonstrations/` directory)
- Pre-written prompt files for LLM generation (in the `prompts/` directory)
- Jupyter notebook tutorials for data generation and loading (in the `notebooks/` directory)
- The core generation script (`generate.py`) with support for both standard and ALICE-based generation
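
Loading the Hugging Face copy might look like the sketch below. It uses the `load_dataset` function from the Hugging Face `datasets` library; the configuration names in `KNOWN_CONFIGS` and the wrapper function are assumptions that should be checked against the dataset card, and the `token` argument requires a reasonably recent `datasets` release.

```python
# Assumed configuration names; verify them on the dataset card before relying on them.
KNOWN_CONFIGS = ("train", "annotated", "annotations")

def load_toxigen(config="annotated", token=None):
    """Load one ToxiGen configuration from the Hugging Face Hub.

    The dataset is gated, so a Hugging Face access token is normally needed
    after the access form has been accepted.
    """
    if config not in KNOWN_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {KNOWN_CONFIGS}")
    # Imported lazily so the module can be imported without the dependency.
    from datasets import load_dataset  # pip install datasets
    return load_dataset("toxigen/toxigen-data", name=config, token=token)

# Example call (needs network access and a valid token, so it is left commented out):
# data = load_toxigen("annotated", token="<your Hugging Face token>")
# print(data)
```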

## Limitations and Ethical Considerations

### Subjectivity of Toxicity

The researchers acknowledged that toxicity is inherently subjective. What one person considers harmful, another might view as merely insensitive or even acceptable. The moderate inter-annotator agreement (Fleiss' kappa of 0.46) reflects this reality.[1] The labels in ToxiGen should be understood as reflecting majority annotator judgments rather than objective ground truth.

### Limited Demographic Coverage

Although covering 13 groups is broader than many earlier datasets, ToxiGen does not encompass all communities that experience online hate. The dataset also focuses exclusively on groups that are targets of oppression, without addressing the language of dominant groups (whiteness, heterosexuality, able-bodiedness) and how such language contributes to systems of harm.

### Annotator Representation

The annotator pool was predominantly White (56.9%), which raises questions about whether the toxicity judgments fully capture the perspectives of the communities most affected by the hate speech in the dataset. Members of targeted groups may perceive and evaluate hate speech differently from those who do not share those identities.

### Noise in Large-Scale Generation

Any dataset generated at the scale of 274,000 examples using automated methods will inevitably contain noise. Some statements labeled as toxic may be relatively benign, and vice versa. The human-validated subsets provide higher-quality labels, but the broader training set relies on prompt-based heuristic labels.

### Potential for Misuse

The methods described in ToxiGen, particularly ALICE, can be used to generate toxic text that is specifically designed to evade detection systems. While the authors released these tools for defensive research purposes (building better classifiers), the same techniques could theoretically be used by bad actors to produce harder-to-detect hate speech at scale. The authors flagged this dual-use concern in the paper.

### Context Dependency

Toxic language is highly context-dependent. A statement that is harmful in one context might be acceptable in another (for instance, reclaiming slurs within a community versus using them as outsiders). ToxiGen's statements are evaluated in isolation, without the broader conversational or social context that would exist in real-world deployment scenarios.

## Impact and Follow-Up Work

Since its publication, ToxiGen has been widely cited in the hate speech detection literature. The dataset and its methods have influenced subsequent research in several directions:

- **Improved implicit hate detection**: Multiple research groups have used ToxiGen as training data or as an evaluation benchmark for building classifiers better suited to detecting subtle toxicity.
- **LLM safety evaluation**: ToxiGen has been incorporated into standard evaluation suites for assessing the safety properties of new [large language models](/wiki/large_language_model).
- **Adversarial robustness**: The ALICE methodology has inspired follow-up work on adversarial approaches to both generating and detecting toxic content.
- **Annotation quality research**: Huang et al. (2023) explored whether [ChatGPT](/wiki/chatgpt) could serve as a more consistent annotator than humans for implicit hate speech, using ToxiGen as a test bed.[10] Jinadu and Ding (2024) addressed noise correction challenges in subjective annotation tasks, drawing on ToxiGen data.[11]
- **Cross-lingual and multi-domain extensions**: Researchers have built on ToxiGen's methods to create similar resources for other languages and content domains.

Community contributions have also expanded the dataset. In March 2024, contributors added demonstration sets for additional demographic categories, including immigrants and bisexual individuals, extending ToxiGen's coverage beyond the original 13 groups.

Newer benchmarks such as TET (Thoroughly Engineered Toxicity), introduced in the paper "Realistic Evaluation of Toxicity in Large Language Models", have built on ToxiGen's foundations while attempting to address some of its limitations, for example by using prompts that elicit more toxicity from modern LLMs. These newer benchmarks often use ToxiGen as a baseline for comparison, underscoring its role as a standard reference point in the field.

## See Also

- [BERT](/wiki/bert)
- [RoBERTa](/wiki/roberta)
- [GPT-3](/wiki/gpt-3)
- [AI Safety](/wiki/ai_safety)
- [Natural Language Processing](/wiki/natural_language_processing)
- [Prompt Engineering](/wiki/prompt_engineering)
- [HarmBench](/wiki/harmbench)
- [BBQ Benchmark](/wiki/bbq_benchmark)

## References

1. Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & Kamar, E. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 3309-3326. Dublin, Ireland. DOI: 10.18653/v1/2022.acl-long.234.
2. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). Language Models are Few-Shot Learners. *Advances in Neural Information Processing Systems*, 33, 1877-1901.
3. Caselli, T., Basile, V., Mitrovic, J., Karber, I., & Granitzer, M. (2021). HateBERT: Retraining a BERT Model for Abusive Language Detection in English. *Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)*, pages 17-25.
4. ElSherief, M., Ziems, C., Muchlinski, D., Anber, V., Bau, J., Dahlke, J., & Yang, D. (2021). Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. *Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing*, pages 345-363.
5. Vidgen, B., Thrush, T., Waseem, Z., & Kiela, D. (2021). Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection. *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics*, pages 1667-1682.
6. Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. *Proceedings of the International AAAI Conference on Web and Social Media*, 11(1), 512-515.
7. Founta, A. M., Djouvas, C., Chatzakou, D., Leontiadis, I., Blackburn, J., Stringhini, G., Vakali, A., Sirivianos, M., & Kourtellis, N. (2018). Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. *Proceedings of the International AAAI Conference on Web and Social Media*, 12(1).
8. Zhou, X., Sap, M., Swayamdipta, S., Choi, Y., & Smith, N. A. (2021). Challenges in Automated Debiasing for Toxic Language Detection. *Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics*, pages 3143-3155.
9. Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., & Narasimhan, K. (2023). Toxicity in ChatGPT: Analyzing Persona-Assigned Language Models. *Findings of the Association for Computational Linguistics: EMNLP 2023*.
10. Huang, F., Kwak, H., & An, J. (2023). Is ChatGPT Better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech. *Companion Proceedings of the ACM Web Conference 2023*.
11. Jinadu, B. & Ding, X. (2024). Noise Correction on Subjective Datasets. *arXiv preprint*.