OpenAI Moderation API
Last reviewed
Jun 3, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,250 words
The OpenAI Moderation API is a free classification tool, accessed through a dedicated moderation endpoint of the OpenAI API, that assesses whether text (and, for newer models, images) is potentially harmful across a set of content categories. It is provided by OpenAI to help developers detect and act on content that may violate the company's usage policies. For each category the endpoint returns a boolean flag in categories together with a confidence value between 0 and 1 in category_scores, plus an overall flagged decision.[1][2]
OpenAI first introduced the Moderation endpoint in August 2022 with a blog post titled "New and improved content moderation tooling." It gave OpenAI API developers free access to GPT-based classifiers that evaluate whether a piece of text is sexual, hateful, violent, or promotes self-harm, categories prohibited by OpenAI's content policy. The stated purpose was to let developers moderate content (including content generated by OpenAI's own models) so that applications could remain compliant with usage policies and the broader AI ecosystem could be made safer.[3][1]
The endpoint is exposed at /v1/moderations. A request supplies an input (text, and for the multimodal model an image) and optionally a model name; the response is a JSON object describing how the content scored against each category. Because the service is free, it is commonly used as a lightweight first-pass filter on both user inputs and model outputs before further processing.[1][2]
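As an illustration of the request shape described above, the following minimal sketch builds (but does not send) a POST request using only the Python standard library. The base URL https://api.openai.com and the bearer-token header are assumptions drawn from general OpenAI API conventions rather than from this article, and build_moderation_request is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: the standard OpenAI API host; the article only names the path.
MODERATIONS_URL = "https://api.openai.com/v1/moderations"


def build_moderation_request(text, api_key, model=None):
    """Build (but do not send) a POST request for the moderation endpoint."""
    payload = {"input": text}
    if model is not None:
        # The model name is optional; when omitted the server picks a default.
        payload["model"] = model
    return urllib.request.Request(
        MODERATIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen and decoding the body with json.load would perform the actual call; that step is left out here so the sketch runs without network access or credentials.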
OpenAI positions the API as a screening aid rather than a complete safety system. The classifiers report likelihoods of policy-relevant content, but the documentation and surrounding guidance frame moderation as one layer among several; flagging thresholds, downstream handling, and product-specific policy decisions remain the developer's responsibility. The category definitions are tied to OpenAI's published usage policies and are conceptually related to the rules expressed in the Model Spec.[1]
The Moderation API has shipped two generations of classifier. The original generation comprised text-only models, exposed as text-moderation-latest and text-moderation-stable. These are GPT-based text classifiers that support text input only and cover a smaller set of categories. OpenAI describes them as legacy models, with the newer omni-moderation family recommended as the default going forward.[1]
On September 26, 2024, OpenAI released omni-moderation-latest, a moderation model built on GPT-4o. The omni model is multimodal: it accepts both text and image inputs and can evaluate an image on its own or in combination with accompanying text. It also adds two new categories beyond the earlier generation. A pinned snapshot, omni-moderation-2024-09-26, is available alongside the rolling -latest alias.[4][1]
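A combined text-and-image request might be assembled along the following lines. This is a sketch under assumptions: the part shapes ("type": "text" and "type": "image_url" with a nested "url") follow the layout commonly shown for multimodal OpenAI requests and should be checked against the current API reference, and build_multimodal_payload is a hypothetical helper.

```python
def build_multimodal_payload(text=None, image_url=None,
                             model="omni-moderation-2024-09-26"):
    """Return a request body with a text part, an image part, or both.

    Assumption: input parts use the {"type": ...} layout described in the
    lead-in; verify the exact field names against the API reference.
    """
    parts = []
    if text is not None:
        parts.append({"type": "text", "text": text})
    if image_url is not None:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if not parts:
        raise ValueError("provide text, an image URL, or both")
    return {"model": model, "input": parts}
```

Pinning the snapshot name instead of the rolling -latest alias keeps classifier behavior stable across deployments, at the cost of not picking up later model updates automatically.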
OpenAI reported that the omni model is substantially more accurate on non-English content. In a test spanning 40 languages, it scored 42% better on non-English material than the previous model, with notable gains on lower-resource languages. OpenAI describes it as the most capable moderation option and the recommended default.[4][1]
| Model | Inputs | Categories | Status |
|---|---|---|---|
| omni-moderation-latest | Text and image | 13 (adds illicit, illicit/violent) | Recommended default; built on GPT-4o |
| omni-moderation-2024-09-26 | Text and image | 13 | Pinned snapshot of the omni model |
| text-moderation-latest | Text only | 11 | Legacy text classifier |
| text-moderation-stable | Text only | 11 | Legacy text classifier |
These moderation models should not be confused with OpenAI's chat or completion models. Although omni-moderation-latest is built on the same GPT-4o base used by chat products, it is a specialized classifier that only emits category scores and flags; it does not generate free-form text.[1][4]
The omni-moderation model classifies content across 13 categories, including subcategories. The earlier text-only models cover the same set minus the two illicit categories. For the omni model, some categories accept image input while others remain text only, as shown below.[1]
| Category | Description | Supported inputs |
|---|---|---|
| harassment | Content that expresses, incites, or promotes harassing language towards any target. | Text only |
| harassment/threatening | Harassment content that also includes violence or serious harm towards any target. | Text only |
| hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. | Text only |
| hate/threatening | Hateful content that also includes violence or serious harm towards the targeted group. | Text only |
| illicit | Content that gives advice or instruction on how to commit illicit acts. | Text only |
| illicit/violent | The same content flagged by illicit, but that also includes references to violence or procuring a weapon. | Text only |
| self-harm | Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders. | Text and image |
| self-harm/intent | Content where the speaker expresses that they are engaging in or intend to engage in acts of self-harm. | Text and image |
| self-harm/instructions | Content that encourages, or gives instructions or advice on how to commit, acts of self-harm. | Text and image |
| sexual | Content meant to arouse sexual excitement, or that promotes sexual services. | Text and image |
| sexual/minors | Sexual content that includes an individual who is under 18 years old. | Text only |
| violence | Content that depicts death, violence, or physical injury. | Text and image |
| violence/graphic | Content that depicts death, violence, or physical injury in graphic detail. | Text and image |
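The category table above can be transcribed into a small lookup, which is useful for checking whether an image-only submission can be evaluated for a given category at all. The sketch below copies the table as listed in this article; the constant and function names are hypothetical and not part of any OpenAI library.

```python
TEXT, IMAGE = "text", "image"

# Transcribed from the category table in this article (omni model).
SUPPORTED_INPUTS = {
    "harassment": {TEXT},
    "harassment/threatening": {TEXT},
    "hate": {TEXT},
    "hate/threatening": {TEXT},
    "illicit": {TEXT},
    "illicit/violent": {TEXT},
    "self-harm": {TEXT, IMAGE},
    "self-harm/intent": {TEXT, IMAGE},
    "self-harm/instructions": {TEXT, IMAGE},
    "sexual": {TEXT, IMAGE},
    "sexual/minors": {TEXT},
    "violence": {TEXT, IMAGE},
    "violence/graphic": {TEXT, IMAGE},
}

# The two categories the legacy text-only models lack.
OMNI_ONLY = {"illicit", "illicit/violent"}


def categories_for(input_type):
    """Return the set of categories evaluated for the given input type."""
    return {name for name, kinds in SUPPORTED_INPUTS.items() if input_type in kinds}


def legacy_categories():
    """Return the categories covered by the text-moderation-* models."""
    return set(SUPPORTED_INPUTS) - OMNI_ONLY
```

Calling categories_for("image") yields the six self-harm, sexual, and violence categories, which makes explicit that an image containing, for example, hateful imagery is not scored for hate.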
A moderation response wraps its output in a results list with one entry per input. Each entry contains a flagged boolean indicating whether any category was triggered, a categories object mapping each category name to a true/false flag, and a category_scores object giving the model's confidence (0 to 1) for each category. For the omni models, each entry also includes category_applied_input_types, which lists, per category, which input types (text, image, or both) the score applies to.[1][2]
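The fields described above can be read with ordinary dictionary access. The sample below is hand-written for illustration, with made-up scores and only three categories; it is not real API output, and it assumes per-input results are wrapped in a results list as in the API reference. summarize_result is a hypothetical helper.

```python
# Hand-written illustrative response; scores and id are invented.
SAMPLE_RESPONSE = {
    "id": "modr-example",
    "model": "omni-moderation-latest",
    "results": [
        {
            "flagged": True,
            "categories": {"violence": True, "sexual": False, "hate": False},
            "category_scores": {"violence": 0.91, "sexual": 0.02, "hate": 0.01},
            "category_applied_input_types": {
                "violence": ["image"],
                "sexual": ["text", "image"],
                "hate": ["text"],
            },
        }
    ],
}


def summarize_result(result):
    """Collect the flagged categories with their scores and input types."""
    hits = sorted(name for name, hit in result["categories"].items() if hit)
    # Legacy text models omit category_applied_input_types, so default to {}.
    applied = result.get("category_applied_input_types", {})
    return {
        "flagged": result["flagged"],
        "flagged_categories": hits,
        "scores": {name: result["category_scores"][name] for name in hits},
        "input_types": {name: applied.get(name, []) for name in hits},
    }
```

Applied to SAMPLE_RESPONSE["results"][0], the helper reports violence as the only flagged category, with the image portion of the input listed as the applied type.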
The Moderation endpoint is free to use. OpenAI has described it as free since the original 2022 launch and reaffirmed that the omni model is free to use for all developers when it shipped in 2024.[3][4]
A typical integration sends candidate content to /v1/moderations, inspects the flagged field for a fast accept-or-review decision, and optionally consults individual category_scores to apply custom thresholds for specific categories. Developers commonly screen end-user prompts before sending them to a model and screen model outputs before displaying them, using the moderation result to block, redact, or route content for human review. Image inputs are supported by the omni model for the image-capable categories, with image files limited to 20 MB.[1][2]
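The accept-or-review flow with per-category thresholds could look roughly like the sketch below. The threshold values are arbitrary placeholders chosen for illustration, not recommendations from OpenAI, and the block/review/allow policy and the route function are hypothetical; a real application would calibrate thresholds against its own data.

```python
# Placeholder thresholds for illustration only; calibrate before real use.
DEFAULT_THRESHOLD = 0.5
CUSTOM_THRESHOLDS = {
    "self-harm/intent": 0.2,
    "sexual/minors": 0.05,
}


def route(result, thresholds=CUSTOM_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Return (action, categories) for one moderation result.

    "block"  : at least one score meets its custom or default threshold.
    "review" : the API set flagged, but no score met the local thresholds.
    "allow"  : neither condition holds.
    """
    over = sorted(
        name
        for name, score in result["category_scores"].items()
        if score >= thresholds.get(name, default)
    )
    if over:
        return "block", over
    if result["flagged"]:
        flagged = sorted(n for n, hit in result["categories"].items() if hit)
        return "review", flagged
    return "allow", []
```

The same function can be run on both the end-user prompt and the model output, so that a single policy governs what is sent to the model and what is shown to the user.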
OpenAI characterizes the Moderation API as an aid for enforcing usage policies, not a standalone safety guarantee. The classifiers return probabilistic scores rather than definitive judgments, so calibrated thresholds and human oversight are recommended for high-stakes decisions. The legacy text-moderation-* models handle text only and lack the illicit and illicit/violent categories, and even the omni model restricts several categories (including all hate, harassment, and illicit categories, plus sexual/minors) to text input, meaning those harms are not evaluated from images. Accuracy also varies across languages and contexts despite the documented multilingual gains, and the fixed category taxonomy may not map cleanly onto every application's content policy.[1][4]