ToxiGen is a large-scale, machine-generated dataset designed for adversarial and implicit hate speech detection. Created by Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar, the dataset was published at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) in Dublin, Ireland. ToxiGen contains 274,186 toxicity-annotated statements spanning 13 minority demographic groups, with a roughly equal split between toxic and benign examples. What distinguishes ToxiGen from earlier hate speech datasets is its focus on implicit toxicity: 98.2% of the statements in the dataset contain no explicit slurs, profanity, or swear words. The dataset was generated using GPT-3 through a novel demonstration-based prompting framework, combined with an adversarial classifier-in-the-loop decoding method called ALICE. ToxiGen is hosted on Hugging Face and released through a GitHub repository maintained by Microsoft.
Toxic language detection is a critical component of online content moderation. Platforms and AI systems need reliable classifiers that can identify hateful, abusive, or harmful text. However, existing toxicity detection systems suffer from two major problems that ToxiGen was specifically designed to address.
First, many classifiers exhibit a high false positive rate for text that merely mentions minority groups. Because minority groups are frequently the targets of online hate, models trained on existing corpora learn spurious correlations between group mentions and toxicity labels. This means that a perfectly benign sentence like "My Muslim neighbor is a great cook" might be flagged as toxic simply because it references a religious minority. This over-flagging has real consequences: it can silence marginalized voices and discourage legitimate discussion of identity and social issues.
Second, existing classifiers struggle with implicit toxicity. Implicit hate speech avoids explicit slurs and profanity but conveys harmful stereotypes, dog whistles, or coded language. A statement like "They just have a different way of thinking about property" directed at a racial group is implicitly toxic because it perpetuates a stereotype without using any flagged keywords. Most toxicity datasets are dominated by explicitly hateful text, leaving classifiers poorly equipped to handle these subtler forms of harm.
Prior to ToxiGen, several datasets attempted to address these challenges. The Davidson et al. (2017) dataset contained 24,802 statements but only 30.2% were implicit. The Founta et al. (2018) dataset had 80,000 statements with just 26.1% implicit content. The Implicit Hate Corpus (ElSherief et al., 2021) improved the ratio to 96.8% implicit but contained only 22,584 statements. DynaHate (Vidgen et al., 2021) reached 41,134 statements with 83.3% implicit content. ToxiGen addressed the need for a dataset that was both large in scale and overwhelmingly implicit, while also maintaining a balanced split between toxic and benign examples.
ToxiGen was developed by a team of researchers from multiple institutions:
| Author | Affiliation |
|---|---|
| Thomas Hartvigsen | Massachusetts Institute of Technology (MIT) |
| Saadia Gabriel | University of Washington |
| Hamid Palangi | Microsoft Research |
| Maarten Sap | Allen Institute for AI (AI2) and Carnegie Mellon University |
| Dipankar Ray | Microsoft |
| Ece Kamar | Microsoft Research |
The project was released under the Microsoft organization on GitHub, reflecting the close collaboration between academic and industry researchers in its development.
ToxiGen consists of 274,186 statements in total, split approximately evenly between toxic and benign categories (roughly 137,093 each). The dataset covers 13 minority demographic groups and was generated entirely by GPT-3 using carefully designed prompting strategies.
The 13 target groups in ToxiGen were selected to cover a broad range of identity categories:
| Group | Category |
|---|---|
| Black | Race/Ethnicity |
| Asian | Race/Ethnicity |
| Chinese | Race/Ethnicity/Nationality |
| Latino | Race/Ethnicity |
| Mexican | Race/Ethnicity/Nationality |
| Native American | Race/Ethnicity |
| Middle Eastern | Race/Ethnicity |
| Jewish | Religion |
| Muslim | Religion |
| LGBTQ+ | Sexual Orientation/Gender Identity |
| Women | Gender |
| Mental Disability | Disability |
| Physical Disability | Disability |
Each group has both toxic and benign statements, and the dataset maintains an approximate balance across groups. For example, the "Black" group contains 10,554 benign statements and 10,306 toxic statements.
The dataset is organized into several components:
| Split | Size | Description |
|---|---|---|
| Full Training Set | 260,012 | Generated via top-k decoding for model development |
| ALICE Adversarial Set | 14,174 | Generated via adversarial classifier-in-the-loop decoding |
| Human-Validated Test Set (ToxiGen-HumanVal) | 940 | Annotated by 3 human annotators per item |
| Human-Annotated Training Sample | 8,960 | Human-labeled subset for fine-tuning classifiers |
| Raw Annotations | 27,450 | Full set of Mechanical Turk annotation responses |
One of ToxiGen's defining features is the near-total absence of explicit toxic language. Across the full dataset, 98.2% of all statements are implicit, meaning they contain no overt profanity, slurs, or swear words. The ALICE-generated subset is even more implicit at 99.7%. This makes ToxiGen uniquely challenging for toxicity classifiers that rely on keyword-based or surface-level features.
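The practical difficulty this creates can be illustrated with a toy keyword-based classifier (a hypothetical sketch with stand-in keywords, not any production system): it flags text containing blocklisted words but scores implicitly toxic statements as clean.

```python
# Toy keyword-based toxicity "classifier" (hypothetical keyword list).
# Surface-level filters like this miss the ~98% of ToxiGen statements
# that contain no slurs, profanity, or swear words.
BLOCKLIST = {"idiot", "stupid", "hate"}  # stand-in keywords

def keyword_score(text: str) -> float:
    """Return 1.0 if any blocklisted word appears, else 0.0."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return 1.0 if tokens & BLOCKLIST else 0.0

explicit = "I hate those people"  # caught: contains a keyword
implicit = "they just have a different way of thinking about property"

assert keyword_score(explicit) == 1.0
assert keyword_score(implicit) == 0.0  # implicitly toxic, but no keyword fires
```

A classifier like this scores every implicit statement identically to genuinely benign text, which is precisely the failure mode ToxiGen is built to expose.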
The core generation method in ToxiGen uses a technique called demonstration-based prompting. Rather than providing explicit instructions to GPT-3, the researchers showed the model a small set of example sentences (called demonstrations) and let it continue generating text in the same style.
For each of the 13 demographic groups, the team collected 20 to 50 human-written seed sentences per class (toxic and benign) to serve as demonstrations.
During generation, 5 random seed sentences were sampled from the relevant pool, joined together as a prompt, and fed to GPT-3. The model then generated a continuation in a style consistent with the demonstration. This process was repeated thousands of times per group to build up the dataset. In total, 26 prompt sets were created (one toxic set and one benign set for each of the 13 groups), and the process generated approximately 260,000 examples using standard top-k decoding.
The demonstration-based approach had a key advantage: by varying the seed sentences, the researchers could control the style and tone of GPT-3's outputs without relying on explicit instructions that the model might ignore or interpret inconsistently. The model produced linguistically diverse outputs while maintaining the implicit quality of the demonstrations.
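The sampling step described above can be sketched in a few lines. The seed pool and formatting here are illustrative, not the actual ToxiGen prompt files:

```python
import random

# Illustrative seed pool; the real pools hold 20-50 human-written
# sentences per (group, label) pair.
benign_seeds_women = [
    "women have made major contributions to science",
    "many women balance demanding careers and family life",
    "women's sports deserve more media coverage",
    "female authors have shaped modern literature",
    "women vote at high rates in many countries",
]

def build_prompt(seed_pool, k=5):
    """Sample k demonstrations and join them into one prompt,
    mirroring ToxiGen's demonstration-based prompting."""
    demos = random.sample(seed_pool, k)
    # List the demonstrations, then leave a dangling bullet for the
    # language model to complete in the same style.
    return "\n".join(f"- {d}" for d in demos) + "\n-"

prompt = build_prompt(benign_seeds_women)
# The prompt is then sent to the language model, which continues
# the final bullet in the style of the demonstrations.
```

Because the model only ever sees examples, not instructions, varying the seed pool is the sole control knob over style and tone.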
Beyond the standard demonstration-based generation, ToxiGen introduced an adversarial generation method called ALICE (Adversarial Classifier-In-the-Loop Constrained dEcoding). ALICE pairs the language model with a toxicity classifier during the text generation process, creating a feedback loop that produces statements specifically designed to fool existing detection systems.
The core idea behind ALICE is formalized as a modified decoding objective. At each generation step, the probability of the next token is determined by two factors: the language model's own next-token probability, and the toxicity classifier's probability that the generated text belongs to the desired class.
Both weights were set to 0.5 in the experiments. The generation used a beam size of 10, a temperature of 0.9, a maximum generation length of 30 tokens, and restricted the vocabulary to the top 100 tokens at each step. The toxicity classifier used in the loop was HateBERT fine-tuned on the OffensEval dataset.
ALICE operates in two adversarial modes: generating from toxic demonstrations while steering the classifier toward a benign prediction (producing toxic text that evades detection), and generating from benign demonstrations while steering toward a toxic prediction (producing benign text that triggers false positives).
Through this adversarial process, ALICE generated 14,174 examples. These examples are particularly valuable for stress-testing toxicity classifiers because they sit at exactly the boundary cases where classifiers are most likely to fail. Human evaluation confirmed that ALICE-generated sentences deceived HateBERT significantly more often, with a 26.4% deception rate compared to 16.8% for standard top-k sampling.
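A toy version of one ALICE decoding step, using the 0.5/0.5 weighting described above; the vocabulary, distributions, and classifier scores below are stand-ins, not GPT-3 or HateBERT:

```python
import math

def alice_step(lm_logprobs, clf_logprobs, w_lm=0.5, w_clf=0.5, top_k=100):
    """Pick the next token by combining LM next-token log-probabilities
    with a classifier's log-probability of the *desired* class, as in
    ALICE's classifier-in-the-loop constrained decoding."""
    # Restrict to the LM's top-k candidate tokens (top 100 in the paper).
    candidates = sorted(lm_logprobs, key=lm_logprobs.get, reverse=True)[:top_k]
    # Weighted sum of the two scores; both weights were 0.5 in the paper.
    scores = {t: w_lm * lm_logprobs[t] + w_clf * clf_logprobs[t]
              for t in candidates}
    return max(scores, key=scores.get)

# Stand-in distributions over a tiny vocabulary.
lm = {"kind": math.log(0.5), "sneaky": math.log(0.4), "tall": math.log(0.1)}
# Classifier log-prob that the continuation lands in the *target* class
# (e.g. "benign" when generating hard-to-detect toxic text).
clf = {"kind": math.log(0.2), "sneaky": math.log(0.7), "tall": math.log(0.1)}

next_token = alice_step(lm, clf)  # classifier pulls decoding off the LM argmax
```

With the classifier weight zeroed out, the step reduces to ordinary greedy decoding over the top-k tokens, which shows exactly where the adversarial pressure enters.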
To validate the quality of the machine-generated data, the researchers conducted a large-scale human evaluation using Amazon Mechanical Turk. A total of 156 workers were pre-qualified for the annotation task, and 51 of them participated in the actual annotation work.
Each statement in the human-validated subset was annotated by 3 independent annotators, who assessed multiple dimensions of each statement, including how toxic or harmful it was, which demographic group it targeted, and whether it appeared to be human- or machine-written.
The annotation process achieved the following agreement metrics:
| Metric | Score |
|---|---|
| Fleiss' κ (kappa) | 0.46 (moderate agreement) |
| Krippendorff's α (alpha) | 0.64 |
| Full agreement (all 3 annotators agree) | 55.17% |
| Majority agreement (at least 2 of 3 agree) | 93.4% |
The moderate Fleiss' kappa score reflects the inherent subjectivity of toxicity judgments, particularly for implicit content where reasonable people can disagree about whether a statement crosses the line from insensitive to harmful.
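Fleiss' kappa compares the observed agreement across raters against the agreement expected by chance. A minimal implementation on toy three-annotator votes (illustrative data, not the ToxiGen annotations):

```python
def fleiss_kappa(tables):
    """Fleiss' kappa for per-item category vote counts.
    Each row sums to the (fixed) number of raters."""
    n_items = len(tables)
    n_raters = sum(tables[0])
    n_cats = len(tables[0])
    # Observed per-item agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in tables]
    p_bar = sum(p_i) / n_items
    # Chance agreement from overall category proportions.
    p_j = [sum(row[j] for row in tables) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Each row: (votes for "toxic", votes for "benign") from 3 annotators.
votes = [(3, 0), (3, 0), (2, 1), (0, 3), (1, 2), (3, 0)]
kappa = fleiss_kappa(votes)  # falls between chance (0) and perfect (1)
```

Even with majority agreement on every item, split votes like (2, 1) drag kappa well below 1, which is why a 93.4% majority-agreement rate can coexist with a kappa of only 0.46.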
The annotator pool had the following demographic breakdown: 56.9% White, 9.8% Black, 3.9% Hispanic, and 3.9% Asian; 45.1% of annotators were female. The researchers acknowledged that annotator demographics can influence toxicity judgments and that the pool may not be fully representative of all affected communities.
The human evaluation also confirmed the realism of the generated data: annotators frequently judged machine-generated statements to be human-written, and the toxicity labels implied by the prompts largely matched human judgments.
The researchers evaluated prominent toxicity detection systems on ToxiGen, including HateBERT and a RoBERTa-based toxicity classifier.
The central experimental finding of ToxiGen is that fine-tuning toxicity classifiers on ToxiGen data substantially improves their performance, not just on ToxiGen's own test set, but on external human-written datasets as well. The researchers tested this using three external benchmarks: SocialBiasFrames, the Implicit Hate Corpus, and DynaHate.
Performance was measured using Area Under the ROC Curve (AUC):
| Model | Dataset | No Fine-Tuning | Fine-Tuned on ToxiGen |
|---|---|---|---|
| HateBERT | SocialBiasFrames | 0.60 | 0.71 |
| HateBERT | Implicit Hate Corpus | 0.60 | 0.67 |
| HateBERT | DynaHate | 0.47 | 0.66 |
| HateBERT | ToxiGen-Val | 0.57 | 0.96 |
| RoBERTa | SocialBiasFrames | 0.65 | 0.70 |
| RoBERTa | Implicit Hate Corpus | 0.57 | 0.66 |
| RoBERTa | DynaHate | 0.49 | 0.54 |
| RoBERTa | ToxiGen-Val | 0.57 | 0.93 |
The improvements ranged from 5 to 19 percentage points in AUC across different model-dataset combinations. The largest gains were observed on the ToxiGen validation set itself (39 points for HateBERT, 36 points for RoBERTa), but the consistent improvements on external human-written datasets demonstrate that ToxiGen training data generalizes beyond machine-generated text.
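AUC, the metric in the table above, is the probability that a randomly chosen toxic example receives a higher score than a randomly chosen benign one. A minimal pairwise implementation (scores and labels below are illustrative):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half-correct."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative classifier scores: label 1 = toxic, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 pairs ranked correctly
```

An AUC near 0.5 (as HateBERT's 0.47 on DynaHate before fine-tuning) means the classifier ranks toxic and benign examples no better than chance, which puts the post-fine-tuning gains in context.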
The team released two pre-trained classifiers fine-tuned on ToxiGen data: a HateBERT-based model and a RoBERTa-based model.
Both models are available through the ToxiGen GitHub repository and showed significant improvements over their base versions on implicit hate speech detection tasks.
ToxiGen represents a significant advancement over prior hate speech and toxicity datasets in several dimensions:
| Dataset | Size | % Implicit | % Hate Class | Year |
|---|---|---|---|---|
| Davidson et al. | 24,802 | 30.2% | 5.0% | 2017 |
| Founta et al. | 80,000 | 26.1% | 53.9% | 2018 |
| Implicit Hate Corpus | 22,584 | 96.8% | 39.8% | 2021 |
| DynaHate | 41,134 | 83.3% | 53.8% | 2021 |
| ToxiGen | 274,186 | 98.2% | 50.1% | 2022 |
ToxiGen is roughly 3.4 times larger than the next biggest dataset (Founta et al.) and has the highest proportion of implicit content. It also has the most balanced split between toxic and benign classes at 50.1% hate, which is important for training classifiers that do not develop a bias toward predicting one class over the other.
The primary intended use of ToxiGen is as a benchmark for evaluating and improving toxicity classifiers. Because the dataset is dominated by implicit toxic content, it tests whether classifiers can detect harmful language that lacks obvious surface-level cues. Research teams and companies building content moderation systems can use ToxiGen to stress-test their models against the kinds of subtle toxic language that are most likely to slip through existing filters.
ToxiGen has also been adopted as a benchmark for evaluating large language models themselves. By prompting a language model with ToxiGen's demonstration sets, researchers can measure how likely the model is to generate toxic continuations. This application has become increasingly important as LLMs are deployed in consumer-facing products like chatbots and writing assistants.
Deshpande et al. (2023) adapted ToxiGen for investigating the safety of LLM-based chatbots by providing seven sentences from the dataset and prompting models to respond in a similar style. This methodology has been used to evaluate models including GPT-4, LLaMA, and others for their tendency to produce harmful content.
Because fine-tuning on ToxiGen data consistently improves classifier performance on human-written benchmarks, the dataset serves as valuable training data for building more robust content moderation tools. Organizations can augment their existing training pipelines with ToxiGen examples to improve their systems' ability to catch implicit toxicity.
ToxiGen has been integrated into several LLM evaluation frameworks, including the EleutherAI Language Model Evaluation Harness. This integration allows researchers to include ToxiGen as a standard benchmark when evaluating new models, making it part of the broader ecosystem of AI safety evaluation tools.
ToxiGen is available through multiple channels:
- The dataset is hosted as `toxigen/toxigen-data` on the Hugging Face Hub. Access requires filling out a form and authenticating with a Hugging Face token.
- A Python package can be installed via `pip install toxigen`, which provides utilities for loading and working with the data.

The GitHub repository includes:
- Human-written demonstration sentences (`demonstrations/` directory)
- Prompt files (`prompts/` directory)
- Example notebooks (`notebooks/` directory)
- A generation script (`generate.py`) with support for both standard and ALICE-based generation

The researchers acknowledged that toxicity is inherently subjective. What one person considers harmful, another might view as merely insensitive or even acceptable. The moderate inter-annotator agreement (Fleiss' kappa of 0.46) reflects this reality. The labels in ToxiGen should be understood as reflecting majority annotator judgments rather than objective ground truth.
While 13 groups represents broader coverage than many earlier datasets, it does not encompass all communities that experience online hate. The dataset also focuses exclusively on groups that are targets of oppression, without addressing the language of dominant groups (whiteness, heterosexuality, able-bodiedness) and how such language contributes to systems of harm.
The annotator pool was predominantly White (56.9%), which raises questions about whether the toxicity judgments fully capture the perspectives of the communities most affected by the hate speech in the dataset. Members of targeted groups may perceive and evaluate hate speech differently from those who do not share those identities.
Any dataset generated at the scale of 274,000 examples using automated methods will inevitably contain noise. Some statements labeled as toxic may be relatively benign, and vice versa. The human-validated subsets provide higher-quality labels, but the broader training set relies on prompt-based heuristic labels.
The methods described in ToxiGen, particularly ALICE, can be used to generate toxic text that is specifically designed to evade detection systems. While the authors released these tools for defensive research purposes (building better classifiers), the same techniques could theoretically be used by bad actors to produce harder-to-detect hate speech at scale. The authors flagged this dual-use concern in the paper.
Toxic language is highly context-dependent. A statement that is harmful in one context might be acceptable in another (for instance, reclaiming slurs within a community versus using them as outsiders). ToxiGen's statements are evaluated in isolation, without the broader conversational or social context that would exist in real-world deployment scenarios.
Since its publication, ToxiGen has become one of the most widely cited resources in the hate speech detection literature. The dataset and its methods have influenced subsequent research in several directions:
Community contributions have also expanded the dataset. In March 2024, contributors added demonstration sets for additional demographic categories, including immigrants and bisexual individuals, extending ToxiGen's coverage beyond the original 13 groups.
Newer benchmarks like TET (Realistic Evaluation of Toxicity) have built on ToxiGen's foundations while attempting to address some of its limitations, such as eliciting more toxicity from modern LLMs in controlled settings. These newer tools often compare their performance against ToxiGen as a baseline, underscoring ToxiGen's role as a standard reference point in the field.