OpenAI Moderation API

Updated Jun 3, 2026

The OpenAI Moderation API is a free classification tool, accessed through a dedicated moderation endpoint of the OpenAI API, that assesses whether text (and, for newer models, images) is potentially harmful across a set of content categories. OpenAI provides it to help developers detect and act on content that may violate the company's usage policies. The endpoint returns an overall flagged boolean and, for each category, a true/false flag together with a category_scores confidence value between 0 and 1.^[1]^[2]

Overview

OpenAI first introduced the Moderation endpoint in August 2022 with a blog post titled "New and improved content moderation tooling." It gave OpenAI API developers free access to GPT-based classifiers that evaluate whether a piece of text is sexual, hateful, violent, or promotes self-harm, categories prohibited by OpenAI's content policy. The stated purpose was to let developers moderate content (including content generated by OpenAI's own models) so that applications could remain compliant with usage policies and the broader AI ecosystem could be made safer.^[3]^[1]

The endpoint is exposed at /v1/moderations. A request supplies an input (text, and for the multimodal model an image) and optionally a model name; the response is a JSON object describing how the content scored against each category. Because the service is free, it is commonly used as a lightweight first-pass filter on both user inputs and model outputs before further processing.^[1]^[2]
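The request described above can be sketched with the Python standard library alone. This is a minimal illustration, not OpenAI's client code: the base URL `https://api.openai.com` and bearer-token authentication are the standard public conventions of the OpenAI API rather than details stated in this article, and the helper names `build_moderation_request` and `moderate` are hypothetical.

```python
import json
import urllib.request

# Assumption: the standard public base URL of the OpenAI API.
MODERATIONS_URL = "https://api.openai.com/v1/moderations"


def build_moderation_request(text, api_key, model="omni-moderation-latest"):
    """Build (but do not send) a POST request for the moderation endpoint."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        MODERATIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def moderate(text, api_key):
    """Send the request and return the parsed JSON response (needs network access)."""
    request = build_moderation_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect and test without a network call or a real API key.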

OpenAI positions the API as a screening aid rather than a complete safety system. The classifiers report likelihoods of policy-relevant content, but the documentation and surrounding guidance frame moderation as one layer among several; flagging thresholds, downstream handling, and product-specific policy decisions remain the developer's responsibility. The category definitions are tied to OpenAI's published usage policies and are conceptually related to the rules expressed in the Model Spec.^[1]

Models (text-moderation and omni-moderation)

The Moderation API has shipped two generations of classifier. The original generation comprised text-only models, exposed as text-moderation-latest and text-moderation-stable. These are GPT-based text classifiers that support text input only and cover a smaller set of categories. OpenAI describes them as legacy models, with the newer omni-moderation family recommended as the default going forward.^[1]

On September 26, 2024, OpenAI released omni-moderation-latest, a moderation model built on GPT-4o. The omni model is multimodal: it accepts both text and image inputs and can evaluate an image on its own or in combination with accompanying text. It also adds two new categories beyond the earlier generation. A pinned snapshot, omni-moderation-2024-09-26, is available alongside the rolling -latest alias.^[4]^[1]
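A combined text-and-image request body for the omni model might look like the sketch below. The list-of-parts shape (`type`, `text`, `image_url`) follows OpenAI's documented input format as understood at the time of writing and should be checked against the current API reference; the helper name `build_multimodal_input` and the example URL are hypothetical.

```python
def build_multimodal_input(text, image_url):
    """Return a two-part input: one text part and one image part."""
    return [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]


# Request body that evaluates the image together with its accompanying text.
payload = {
    "model": "omni-moderation-latest",
    "input": build_multimodal_input(
        "Caption accompanying the image",
        "https://example.com/image.png",  # placeholder URL, not a real asset
    ),
}
```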

OpenAI reported that the omni model is substantially more accurate on non-English content. In a test spanning 40 languages, it improved by 42% over the previous model on non-English material, with notable gains on lower-resource languages. OpenAI describes it as its most capable moderation model and the recommended default.^[4]^[1]

Model	Inputs	Categories	Status
`omni-moderation-latest`	Text and image	13 (adds `illicit`, `illicit/violent`)	Recommended default; built on GPT-4o
`omni-moderation-2024-09-26`	Text and image	13	Pinned snapshot of the omni model
`text-moderation-latest`	Text only	11	Legacy text classifier
`text-moderation-stable`	Text only	11	Legacy text classifier

These moderation models should not be confused with OpenAI's chat or completion models. Although omni-moderation-latest is built on the same GPT-4o base used by chat products, it is a specialized classifier that only emits category scores and flags; it does not generate free-form text.^[1]^[4]

Categories and outputs

The omni-moderation model classifies content across 13 categories, including subcategories. The earlier text-only models cover the same set minus the two illicit categories. For the omni model, some categories accept image input while others remain text only, as shown below.^[1]

Category	Description	Supported inputs
`harassment`	Content that expresses, incites, or promotes harassing language towards any target.	Text only
`harassment/threatening`	Harassment content that also includes violence or serious harm towards any target.	Text only
`hate`	Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste.	Text only
`hate/threatening`	Hateful content that also includes violence or serious harm towards the targeted group.	Text only
`illicit`	Content that gives advice or instruction on how to commit illicit acts.	Text only
`illicit/violent`	The same content flagged by `illicit`, but that also includes references to violence or procuring a weapon.	Text only
`self-harm`	Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders.	Text and image
`self-harm/intent`	Content where the speaker expresses that they are engaging in or intend to engage in acts of self-harm.	Text and image
`self-harm/instructions`	Content that encourages, or gives instructions or advice on how to commit, acts of self-harm.	Text and image
`sexual`	Content meant to arouse sexual excitement, or that promotes sexual services.	Text and image
`sexual/minors`	Sexual content that includes an individual who is under 18 years old.	Text only
`violence`	Content that depicts death, violence, or physical injury.	Text and image
`violence/graphic`	Content that depicts death, violence, or physical injury in graphic detail.	Text and image
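The input support listed in the table can be encoded as data so that an application knows which categories a given request can actually be scored on. The sketch below copies the category names from the table; the set names and the helper `categories_evaluated` are hypothetical, and the logic simply assumes that a category is evaluated only when the request contains an input type the table lists for it.

```python
# Categories the table marks as "Text and image".
IMAGE_CAPABLE = {
    "self-harm", "self-harm/intent", "self-harm/instructions",
    "sexual", "violence", "violence/graphic",
}

# Categories the table marks as "Text only".
TEXT_ONLY = {
    "harassment", "harassment/threatening", "hate", "hate/threatening",
    "illicit", "illicit/violent", "sexual/minors",
}


def categories_evaluated(has_text, has_image):
    """Return the categories that can be scored for the given input types."""
    evaluated = set()
    if has_text:
        evaluated |= TEXT_ONLY | IMAGE_CAPABLE
    if has_image:
        evaluated |= IMAGE_CAPABLE
    return evaluated
```

For an image-only request, `categories_evaluated(False, True)` returns just the six image-capable categories, which makes the coverage gap for hate, harassment, illicit, and sexual/minors content explicit.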

A moderation response contains a top-level flagged boolean indicating whether any category was triggered, a categories object mapping each category name to a true/false flag, and a category_scores object giving the model's confidence (0 to 1) for each category. For the omni models, the response also includes category_applied_input_types, which lists, per category, the input types (text, image, or both) that the score applies to.^[1]^[2]
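The fields described above can be read as in the following sketch. The sample response is illustrative only: its scores and identifier are invented, only two categories are shown, and the outer `results` list wrapper is taken from the API reference rather than from this article. The helper `summarize` is hypothetical.

```python
# Illustrative response; the values are made up and most categories are omitted.
SAMPLE_RESPONSE = {
    "id": "modr-example",
    "model": "omni-moderation-latest",
    "results": [
        {
            "flagged": True,
            "categories": {"violence": True, "harassment": False},
            "category_scores": {"violence": 0.91, "harassment": 0.02},
            "category_applied_input_types": {
                "violence": ["text", "image"],
                "harassment": ["text"],
            },
        }
    ],
}


def summarize(result):
    """Collect the flagged categories with their scores and applied input types."""
    flagged = sorted(name for name, hit in result["categories"].items() if hit)
    # Legacy text-only models may omit this field, so default to an empty dict.
    applied = result.get("category_applied_input_types", {})
    return {
        "flagged": result["flagged"],
        "flagged_categories": flagged,
        "scores": {name: result["category_scores"][name] for name in flagged},
        "input_types": {name: applied.get(name, []) for name in flagged},
    }


summary = summarize(SAMPLE_RESPONSE["results"][0])
```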

Usage and pricing

The Moderation endpoint is free to use. OpenAI has described it as free since the original 2022 launch and reaffirmed that the omni model is free to use for all developers when it shipped in 2024.^[3]^[4]

A typical integration sends candidate content to /v1/moderations, inspects the flagged field for a fast accept-or-review decision, and optionally consults individual category_scores to apply custom thresholds for specific categories. Developers commonly screen end-user prompts before sending them to a model and screen model outputs before displaying them, using the moderation result to block, redact, or route content for human review. Image inputs are supported by the omni model for the image-capable categories, with image files limited to 20 MB.^[1]^[2]
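The accept-or-review flow with custom thresholds could be sketched as follows. The function `decide`, the threshold values, and the three outcome labels are assumptions chosen for illustration; they are not recommendations from OpenAI, and real thresholds need to be calibrated against an application's own data.

```python
def decide(result, block_at=None, review_at=0.5):
    """Map one moderation result to "block", "review", or "allow".

    block_at: hypothetical per-category score limits that trigger a hard block.
    review_at: hypothetical score at which any category is sent to human review.
    """
    block_at = block_at or {}
    scores = result["category_scores"]

    # Stricter, category-specific limits are checked first.
    for category, limit in block_at.items():
        if scores.get(category, 0.0) >= limit:
            return "block"

    # Otherwise fall back to the API's own flag or a generic review threshold.
    if result["flagged"] or any(score >= review_at for score in scores.values()):
        return "review"
    return "allow"
```

An application might call `decide(result, block_at={"self-harm": 0.3})` on user prompts and apply a different set of limits to model outputs, since the acceptable risk for each direction can differ.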

Limitations

OpenAI characterizes the Moderation API as an aid for enforcing usage policies, not a standalone safety guarantee. The classifiers return probabilistic scores rather than definitive judgments, so calibrated thresholds and human oversight are recommended for high-stakes decisions. The legacy text-moderation-* models handle text only and lack the illicit and illicit/violent categories. Even the omni model restricts seven categories (all hate, harassment, and illicit categories, plus sexual/minors) to text input, so those harms are not evaluated from images. Accuracy also varies across languages and contexts despite the documented multilingual gains, and the fixed category taxonomy may not map cleanly onto every application's content policy.^[1]^[4]

References

1. OpenAI, "Moderation," OpenAI API documentation. https://developers.openai.com/api/docs/guides/moderation
2. OpenAI, "Create moderation," OpenAI API Reference. https://developers.openai.com/api/reference/resources/moderations/methods/create
3. OpenAI, "New and improved content moderation tooling," August 2022. https://openai.com/index/new-and-improved-content-moderation-tooling/
4. OpenAI, "Upgrading the Moderation API with our new multimodal moderation model," September 26, 2024. https://openai.com/index/upgrading-the-moderation-api-with-our-new-multimodal-moderation-model/
