AI watermarking
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,634 words
AI watermarking is a family of techniques for embedding an imperceptible, machine-detectable signal in content produced by generative artificial intelligence systems so that the content can later be identified as AI-generated and, in some schemes, traced to a specific model. Unlike post-hoc detection classifiers that infer synthetic origin from statistical artefacts, watermarks are inserted at generation time by altering the sampling step, the model decoder, the initial noise, or the final waveform or pixels in ways the producer controls. The field covers four primary modalities (images, text, audio, and video) and includes deployed systems such as Google DeepMind's SynthID and Meta's Stable Signature and AudioSeal, as well as research schemes such as the Kirchenbauer et al. green-list scheme and the Aaronson cryptographic design developed at OpenAI.[1][2][3][4] AI watermarking is distinct from, and frequently positioned as complementary to, cryptographic content provenance frameworks such as the Coalition for Content Provenance and Authenticity (C2PA), which attach signed metadata manifests to files rather than embedding signals in the content itself.[5]
Digital watermarking predates generative AI by several decades; the literature on robust media watermarking for copyright protection, broadcast monitoring, and forensic tracing dates to the early 1990s, when researchers embedded bit strings into the discrete cosine transform coefficients of JPEG images or the spectral coefficients of audio signals. The shift to specifically AI-oriented watermarking arose from a different problem: not protecting an asset from theft, but allowing platforms, regulators, and the public to recognise content that a model produced. Public debate over deepfakes, AI-generated essays in education, and synthetic political imagery in the run-up to the 2024 election cycle drove labs and standards bodies to seek mechanisms that survive incidental editing while remaining invisible to ordinary consumers.[6][7]
The first broadly cited modern proposal came from outside the industrial labs. In November 2022, Scott Aaronson, then a guest researcher at OpenAI, said in a public University of Texas lecture that he had been working with the company on a "statistical watermarking" tool that biases a language model's sampling toward outputs scoring high under a cryptographic pseudorandom function whose key OpenAI would hold.[4] Two months later, John Kirchenbauer and collaborators at the University of Maryland posted "A Watermark for Large Language Models" to arXiv; the paper, later published at ICML 2023, was the first peer-reviewed treatment of a complete text watermarking scheme.[3] In March 2023, Pierre Fernandez and colleagues at Meta posted "The Stable Signature", which moved comparable ideas to image generators by fine-tuning the decoder of a latent diffusion model to imprint a fixed bit string in every output.[2] These three threads, the Aaronson scheme, the green-list scheme, and Stable Signature, established the design space that subsequent industrial systems would refine.
Watermarks for AI-generated media are usually classified along several axes.
A watermark is imperceptible if a human consumer of the content cannot distinguish a watermarked output from an unwatermarked one. Imperceptibility is measured for images and video by peak signal-to-noise ratio and learned perceptual similarity metrics such as LPIPS; for audio by mean opinion scores and signal-to-distortion ratios; and for text by perplexity, downstream task performance, or blind human preference comparisons.[2][3][8] Visible watermarks, by contrast, place a logo or label on the surface of an image; OpenAI used a visible corner watermark on early DALL-E 2 outputs before removing it in 2022.
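As a rough illustration of the image-side metric, the following minimal sketch computes peak signal-to-noise ratio for two flattened pixel sequences. The pixel values and the helper name `psnr` are hypothetical and chosen only for the example; published evaluations typically report PSNR alongside learned metrics such as LPIPS, which are not shown here.

```python
import math

def psnr(original, watermarked, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(watermarked):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)
    if mse == 0:
        return math.inf  # identical signals have no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 8-bit pixels: the toy "watermark" nudges every pixel by +/-1.
clean = [10, 50, 100, 150, 200, 250]
marked = [11, 49, 101, 149, 201, 249]
print(f"PSNR = {psnr(clean, marked):.2f} dB")
```

Higher values mean a smaller perturbation; a change of one grey level per pixel, as in this toy case, sits well above the range in which differences are usually visible.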
A watermark is robust if it survives common transformations that ordinary users apply, including JPEG or MP3 recompression, cropping, mild rotation, brightness or contrast changes, format conversion, or, for text, light editing and synonym substitution. Robustness is normally specified together with a detection threshold and a false positive rate; the Stable Signature paper, for example, reports detection of an image cropped to 10% of its content with above 90% accuracy at a false-positive rate below one in a million.[2] A fragile watermark is the dual: it is designed to break under any modification, so that a detected mark guarantees the content is bit-identical to the model's output. Fragile schemes serve forensic integrity rather than provenance attribution.
A watermark is distortion-free (sometimes "undetectable") if the distribution over outputs induced by a watermarked model is statistically indistinguishable, to anyone without the secret key, from the distribution of the unwatermarked model. The Aaronson scheme and several follow-up cryptographic constructions achieve this property; the Kirchenbauer green-list scheme does not, since the bias toward green tokens slightly shifts the output distribution.[3][4][9]
Other properties of interest include multi-bit capacity (whether the mark carries a payload identifying a specific model, account, or session, or merely a single bit indicating "AI"); publicly detectable versus secret-key detection; and localised detection, where a detector can flag which portion of a longer file is watermarked rather than rendering a single global verdict.[10][8]
The Stable Signature was introduced by Fernandez, Couairon, Jegou, Douze, and Furon of Meta and Inria at ICCV 2023 and posted to arXiv on 27 March 2023.[2] The construction targets latent diffusion models such as Stable Diffusion, in which a denoised latent vector is decoded to an image by a learned VAE decoder. The scheme works in two steps. First, a separate watermark encoder and extractor pair is pre-trained on natural images: the encoder maps an image plus a binary message to a watermarked image, and the extractor recovers the message. Then, for each user or model copy, the latent decoder of the diffusion model is fine-tuned for a few thousand iterations on a small image set so that, regardless of the latent input, its output contains a fixed message that the pre-trained extractor can read. Because the watermark is baked into the decoder weights, an open-source release of the fine-tuned model carries the mark; replacing the decoder is the only obvious removal route, but the original unwatermarked decoder is not distributed.[2][11]
In Meta's October 2023 announcement, the company reported a false-positive rate of one in ten billion for the Stable Signature detector and described plans to integrate the technique into its own image generators and to release the research as a building block for the broader generative ecosystem.[11] The paper itself reports above 99% bit accuracy on the embedded 48-bit message for unmodified outputs and graceful degradation under cropping, JPEG compression, and brightness or contrast changes.[2]
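The relationship between message length, matching threshold, and false-positive rate can be sketched with a binomial tail. The assumption here, which is the standard null model for fixed-message detectors, is that each of the k extracted bits of an unwatermarked image matches the key independently with probability 1/2. The function names are illustrative and are not taken from the Stable Signature code.

```python
from math import comb

def false_positive_rate(k, tau):
    """P(at least tau of k bits match by chance) when each bit matches with prob. 1/2."""
    return sum(comb(k, m) for m in range(tau, k + 1)) / 2 ** k

def min_threshold(k, target_fpr):
    """Smallest number of matching bits whose chance-level tail is at most target_fpr."""
    for tau in range(k + 1):
        if false_positive_rate(k, tau) <= target_fpr:
            return tau
    return None  # target is unreachable with only k bits

k = 48  # message length reported for Stable Signature
for target in (1e-3, 1e-6, 1e-10):
    tau = min_threshold(k, target)
    print(f"target FPR {target:g}: flag when >= {tau} of {k} bits match "
          f"(actual FPR {false_positive_rate(k, tau):.2e})")
```

Under this model a stricter false-positive target simply raises the number of bits that must match, which in turn lowers the amount of damage the mark can absorb before detection fails.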
Google DeepMind announced SynthID on 29 August 2023 as a beta service for Vertex AI customers using Google's Imagen text-to-image model.[1] The system pairs an embedder network that adds a small perturbation to the pixels of the generated image during the final stage of synthesis with a detector network that scans an arbitrary image and returns one of three labels: "watermark detected", "watermark possibly detected", or "watermark not detected".[1][12] The two networks are trained jointly on a corpus of generated images with deliberate adversarial transformations applied during training, so that the mark survives JPEG compression, colour-space changes, mild cropping, and overlay filters typical of social media platforms.[1] DeepMind has not published the embedder or detector weights, and the embedded payload appears to be small; the public material describes a single-bit "is this from Google's model" decision rather than a multi-bit account identifier.[12]
Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein of the University of Maryland proposed Tree-Ring Watermarks at NeurIPS 2023.[13] Rather than modifying pixels or decoder weights, Tree-Ring embeds a pattern in the initial Gaussian noise vector that seeds the diffusion sampling process. The pattern is structured in the Fourier domain with concentric circular rings of fixed values, chosen so that the pattern is approximately invariant under crops, rotations, flips, and small affine transformations. To detect a watermark, the verifier inverts the diffusion process from a candidate image back to a noise vector using a deterministic DDIM inversion and tests whether the resulting noise contains the planted ring pattern. The authors report that the watermark is invisible by FID, is robust to standard perturbations, and can be added to any pretrained diffusion model without modifying the model's weights or training. The inversion-based detection is computationally expensive compared with the small detector networks of SynthID and Stable Signature.[13]
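The symmetry that motivates the ring layout can be illustrated with a toy grid. This is a minimal sketch, not the Tree-Ring implementation: it builds a pattern whose value depends only on distance from the centre and checks that a 90-degree rotation and a flip leave it unchanged. The real scheme writes such a key into the Fourier transform of the initial noise latent and detects it through DDIM inversion, neither of which is shown; the grid size and ring values are arbitrary assumptions.

```python
import math

def ring_pattern(size, ring_values):
    """Return a size x size grid whose value depends only on distance from the centre."""
    centre = (size - 1) / 2
    grid = []
    for y in range(size):
        row = []
        for x in range(size):
            radius = int(round(math.hypot(x - centre, y - centre)))
            # Cells beyond the outermost ring carry no key value.
            row.append(ring_values[radius] if radius < len(ring_values) else 0.0)
        grid.append(row)
    return grid

def rotate90(grid):
    """Rotate a square grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    return [row[::-1] for row in grid]

# Hypothetical key: one fixed value per ring, indexed by integer radius.
key = ring_pattern(9, [0.5, -1.0, 2.0, -0.3, 1.2])
print("unchanged by 90-degree rotation:", rotate90(key) == key)
print("unchanged by horizontal flip:  ", flip_horizontal(key) == key)
```

A pattern laid out on a square block, by contrast, would change under rotations that are not multiples of 90 degrees, which is one reason a radially symmetric key is attractive.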
The first widely studied LLM text watermark, posted to arXiv in January 2023 and recognised at ICML 2023 as an outstanding paper, partitions the model's vocabulary into a "green list" and a "red list" at every generation step, using a hash of the previous tokens as a seed. During sampling, the logits of green-list tokens are increased by a small constant before softmax, so the model is biased toward producing green tokens. A statistical test on the proportion of green tokens in a candidate text computes a z-score under the null hypothesis of unwatermarked text and yields a p-value that the verifier can threshold.[3] The authors of the scheme were John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein, all then at the University of Maryland. Because the watermark depends only on the logit values and the vocabulary partition, it can be implemented as a thin layer over any open-source LLM without retraining and detection does not require access to the model itself, only to the hashing function and partition seed.[3]
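The mechanism can be sketched in a few dozen lines. The following toy version is an illustration under stated assumptions, not the authors' released code: it seeds the partition with a SHA-256 hash of only the single previous token, uses a 50-token vocabulary with flat logits in place of a real language model, and fixes the green fraction at 0.5 and the logit bias at 2.0. All names and parameter values are hypothetical.

```python
import hashlib
import math
import random

def green_list(prev_token, vocab_size, gamma, key):
    """Pseudorandomly pick a gamma fraction of the vocabulary, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def sample_watermarked(logits, prev_token, gamma, delta, key, rng):
    """Add delta to every green-list logit, then sample from the softmax."""
    green = green_list(prev_token, len(logits), gamma, key)
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    top = max(biased)
    weights = [math.exp(x - top) for x in biased]
    return rng.choices(range(len(logits)), weights=weights)[0]

def z_score(tokens, vocab_size, gamma, key):
    """z = (green hits - gamma * T) / sqrt(T * gamma * (1 - gamma)) over T scored tokens."""
    scored = len(tokens) - 1
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

VOCAB, GAMMA, DELTA, KEY = 50, 0.5, 2.0, "demo-key"
rng = random.Random(0)
flat_logits = [0.0] * VOCAB  # stand-in for a language model's logits

marked = [0]
for _ in range(200):
    marked.append(sample_watermarked(flat_logits, marked[-1], GAMMA, DELTA, KEY, rng))
plain = [rng.randrange(VOCAB) for _ in range(201)]

z_marked = z_score(marked, VOCAB, GAMMA, KEY)
z_plain = z_score(plain, VOCAB, GAMMA, KEY)
print(f"watermarked z = {z_marked:.1f}, unwatermarked z = {z_plain:.1f}")
```

Note that `z_score` needs only the key and the hashing rule, never the model, which is the property that makes detection cheap. Flat logits are the most favourable case for the watermark; a real model's peaked distributions leave less room for the bias to act.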
The green-list scheme is not distortion-free: the bias toward green tokens slightly shifts the output distribution, which the authors quantify in terms of perplexity increase. The trade-off between bias strength (which determines detectability) and quality loss is a single knob that practitioners tune to taste. Subsequent work showed that the scheme is robust to light paraphrasing but degrades sharply under heavy rewriting or translation through a second model.[14][9]
In November 2022, Scott Aaronson described in a University of Texas lecture, and shortly afterwards on his blog, a watermarking scheme he had developed during a sabbatical at OpenAI.[4] The scheme uses a cryptographic pseudorandom function with a secret key to compute, for each possible next token i given the recent context, a value r_i in the unit interval; the model then emits the token that maximises r_i^(1/p_i), where p_i is the probability the model assigns to that token (the "exponential" or Gumbel sampling formulation). If the r_i are independent and uniform, as they appear to anyone without the key, this rule selects token i with probability exactly p_i. A verifier with the key can recompute the values for each token of a candidate text and compute a statistic that has a known distribution under unwatermarked text and a distinguishable distribution under watermarked text. The construction is distortion-free in expectation: averaged over the random key, the induced distribution on outputs equals the model's original distribution, so an attacker without the key cannot detect the watermark by inspecting outputs alone.[4][9]
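A minimal sketch of this style of scheme follows, assuming the commonly described form in which each candidate token i receives a keyed pseudorandom value r_i and the emitted token is the one maximising r_i^(1/p_i), with detection summing ln(1/(1 - r)) over the observed tokens. SHA-256 stands in for the pseudorandom function, the four-token context window is an assumption, and a fixed 100-token distribution stands in for a language model; none of these choices comes from OpenAI's prototype.

```python
import hashlib
import math
import random

CONTEXT = 4  # number of previous tokens fed to the PRF (an assumption)

def prf_uniform(key, context, token):
    """Keyed pseudorandom value strictly inside (0, 1) for a (context, token) pair."""
    digest = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    top52 = int.from_bytes(digest[:8], "big") >> 12
    return (top52 + 0.5) / 2 ** 52

def sample_watermarked(probs, context, key):
    """Emit argmax_i r_i^(1/p_i), computed in log space as argmax_i log(r_i) / p_i."""
    return max(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: math.log(prf_uniform(key, context, i)) / probs[i],
    )

def detection_z(tokens, key):
    """Sum ln(1/(1-r)) over scored tokens; each term is Exp(1) for unwatermarked text."""
    scored = range(CONTEXT, len(tokens))
    score = sum(
        -math.log(1.0 - prf_uniform(key, tuple(tokens[t - CONTEXT:t]), tokens[t]))
        for t in scored
    )
    n = len(scored)
    return (score - n) / math.sqrt(n)  # mean n, variance n under the null

VOCAB, KEY = 100, "demo-key"
weights = [1.0 / math.sqrt(i + 1) for i in range(VOCAB)]
norm = sum(weights)
probs = [w / norm for w in weights]  # toy stand-in for a model's next-token distribution

marked = [0, 1, 2, 3]  # hypothetical prompt tokens
for _ in range(200):
    marked.append(sample_watermarked(probs, tuple(marked[-CONTEXT:]), KEY))

plain = random.Random(0).choices(range(VOCAB), weights=probs, k=204)

z_marked = detection_z(marked, KEY)
z_plain = detection_z(plain, KEY)
print(f"watermarked z = {z_marked:.1f}, unwatermarked z = {z_plain:.1f}")
```

Given the key and the context, the chosen token is fully determined, so the variety in the output comes entirely from the changing context. This is why such schemes are usually described as distortion-free over the choice of key rather than for any single fixed key.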
Despite developing the technique in 2022 and demonstrating an internal prototype reportedly able to flag ChatGPT-generated essays with roughly 99.9% accuracy on sufficiently long passages, OpenAI did not deploy the watermark publicly. The Wall Street Journal reported in August 2024 that the company had held the tool back for over a year out of concern that paid users would defect if their outputs were identifiable, and out of recognition that the watermark could be defeated by paraphrasing or translation.[15] OpenAI's public statement at the time confirmed the existence of the system and noted that the company was researching alternatives, including cryptographically signed metadata in the C2PA family.[15]
Google DeepMind announced SynthID-Text in May 2024 and described it in a peer-reviewed Nature paper in October 2024 by Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, and colleagues.[7][16] The system is a production watermark deployed in Gemini's chat outputs. Its central technical contribution is "Tournament Sampling": instead of adding a bias to the logits before sampling, as in Kirchenbauer et al., SynthID-Text first samples several candidate tokens from the unbiased distribution, then runs a deterministic tournament between them using a pseudorandom g-function keyed by the recent context. The token that wins the tournament becomes the output. Because each candidate is drawn from the true model distribution, the watermark is approximately distortion-free over the average of the random seeds, and the authors report no statistically significant degradation in human pairwise preferences between watermarked and unwatermarked Gemini outputs in a large-scale evaluation.[16]
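The tournament idea can be sketched as follows. This is a simplified illustration rather than the production algorithm: it uses three layers and eight candidates, a one-bit g-function built from SHA-256, random tie-breaking, and a plain mean-of-g-values z-score in place of the Bayesian detector described in the paper. The layer count, context length, and toy distribution are all assumptions made for the example.

```python
import hashlib
import math
import random

CONTEXT = 4  # previous tokens that key the g-function (an assumption)
LAYERS = 3   # tournament depth; 2**LAYERS candidates are drawn per token

def g_value(key, context, token, layer):
    """One pseudorandom bit per (context, token, layer)."""
    digest = hashlib.sha256(f"{key}|{context}|{token}|{layer}".encode()).digest()
    return digest[0] & 1

def tournament_sample(probs, context, key, rng):
    """Draw candidates from the unbiased distribution, then knock them out pairwise."""
    candidates = rng.choices(range(len(probs)), weights=probs, k=2 ** LAYERS)
    for layer in range(LAYERS):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga = g_value(key, context, a, layer)
            gb = g_value(key, context, b, layer)
            if ga == gb:
                winners.append(rng.choice((a, b)))  # tie: pick either at random
            else:
                winners.append(a if ga > gb else b)
        candidates = winners
    return candidates[0]

def mean_g_z(tokens, key):
    """z-score of the mean g-value; bits average 0.5 for unwatermarked text."""
    bits = [
        g_value(key, tuple(tokens[t - CONTEXT:t]), tokens[t], layer)
        for t in range(CONTEXT, len(tokens))
        for layer in range(LAYERS)
    ]
    return (sum(bits) / len(bits) - 0.5) / math.sqrt(0.25 / len(bits))

VOCAB, KEY = 100, "demo-key"
weights = [1.0 / math.sqrt(i + 1) for i in range(VOCAB)]
norm = sum(weights)
probs = [w / norm for w in weights]  # toy stand-in for a model's next-token distribution

rng = random.Random(0)
marked = [0, 1, 2, 3]  # hypothetical prompt tokens
for _ in range(300):
    marked.append(tournament_sample(probs, tuple(marked[-CONTEXT:]), KEY, rng))
plain = rng.choices(range(VOCAB), weights=probs, k=304)

z_marked = mean_g_z(marked, KEY)
z_plain = mean_g_z(plain, KEY)
print(f"watermarked z = {z_marked:.1f}, unwatermarked z = {z_plain:.1f}")
```

The winner of each match tends to carry a g-value of 1, so watermarked text shows a mean g-value above one half, while every candidate still originates from the model's own distribution.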
The Nature paper presents data from a live experiment that assessed feedback on nearly 20 million Gemini responses, indicating that users did not detect a quality difference between watermarked and unwatermarked responses. Detection is performed by a small Bayesian classifier whose false-positive and false-negative rates are tuneable for the deployer.[16] Google open-sourced a reference implementation of the watermark in Hugging Face's transformers library version 4.46.0, released on 23 October 2024, with example notebooks for applying SynthID-Text to any model loadable through transformers.generate.[17] The open implementation includes the logits processor and a Bayesian detector, and recommends a default n-gram length of 5 and 20 to 30 random keys per deployment.[17]
Text watermarks face two intrinsic difficulties that image watermarks do not. First, the information capacity of text is much lower than that of images, so a watermark must spend dozens or hundreds of tokens to accumulate enough signal for a confident detection. Both the Kirchenbauer scheme and SynthID-Text require sequences of several hundred tokens for high-confidence detection, and effectively cannot mark very short outputs such as classification labels or one-line completions.[3][16][17] Second, text can be paraphrased, translated, retokenised, or re-encoded by another model with almost no semantic loss, which is a stronger attack than the geometric and compression manipulations that image watermarks face. Both production and academic studies report that running a watermarked text through GPT-4-class paraphrasing reduces detection accuracy substantially.[14][17] Token-level watermarks also struggle on outputs with low entropy: when the next token is nearly deterministic (for example, in code or in factual lists), there is no headroom to encode a watermark without changing the substantive content.[16][17]
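The length requirement follows directly from the form of a count-based test. As a worked sketch, assume the green-list z-score z = (f - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)), where f is the fraction of green tokens actually observed, gamma is the expected fraction in unwatermarked text, and T is the number of scored tokens. Solving for T gives the shortest text that can reach a given threshold. The threshold of 4 and the example fractions are illustrative assumptions.

```python
import math

def min_tokens(green_fraction, gamma=0.5, z_threshold=4.0):
    """Smallest T with (f - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)) >= z_threshold."""
    if green_fraction <= gamma:
        return None  # no excess of green tokens, so no text length suffices
    return math.ceil(z_threshold ** 2 * gamma * (1 - gamma) / (green_fraction - gamma) ** 2)

# High-entropy text lets the bias push f far above gamma; low-entropy text
# (code, factual lists) or partially paraphrased text leaves f close to gamma.
for f in (0.9, 0.7, 0.6, 0.55):
    print(f"observed green fraction {f:.2f}: about {min_tokens(f)} tokens needed")
```

Because the required length grows with the inverse square of the excess green fraction, halving the excess roughly quadruples the number of tokens needed, which is why weakly marked or heavily edited text needs hundreds or thousands of tokens.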
Robin San Roman, Pierre Fernandez, Hady Elsahar, Alexandre Defossez, Teddy Furon, and Tuan Tran, all at Meta and Inria, published AudioSeal at ICML 2024, with an arXiv preprint dated 30 January 2024.[8] AudioSeal trains a generator and detector pair on speech corpora. The generator takes an audio waveform and an optional 16-bit message and outputs an imperceptibly modified waveform; the detector takes a candidate waveform and outputs both a global "watermark present" probability and a per-sample localisation mask indicating which segments of the audio carry a mark. Localisation matters in voice-cloning settings, where an attacker may splice a few seconds of synthetic speech into a longer authentic recording.[8]
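How a per-sample output turns into usable localisation can be sketched with a simple post-processing step. The function below is a hypothetical helper, not part of the AudioSeal API: it takes a list of per-sample watermark probabilities, such as a detector of this kind might emit, and returns the time intervals in which the probability stays at or above a threshold. The threshold, the minimum run length, and the toy sample rate are assumptions.

```python
def watermarked_segments(probs, sample_rate, threshold=0.5, min_samples=1):
    """Return (start_seconds, end_seconds) for each run of samples flagged as watermarked."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a flagged run begins
        elif p < threshold and start is not None:
            if i - start >= min_samples:
                segments.append((start / sample_rate, i / sample_rate))
            start = None
    # Close a run that extends to the end of the recording.
    if start is not None and len(probs) - start >= min_samples:
        segments.append((start / sample_rate, len(probs) / sample_rate))
    return segments

# Toy recording at 4 samples per second: authentic speech, a spliced
# synthetic passage, then authentic speech again.
probs = [0.1] * 4 + [0.9] * 8 + [0.2] * 4
print(watermarked_segments(probs, sample_rate=4))
```

Setting `min_samples` above 1 suppresses isolated spikes, at the cost of missing very short spliced passages.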
The authors report that AudioSeal is robust to MP3 compression down to 16 kbps, to additive noise, to time stretching and pitch shifting, and to encoding through standard codecs, while the perceptual loss based on auditory masking keeps the mark inaudible in listener studies. The detector is single-pass and runs roughly two orders of magnitude faster than prior generator-detector watermark systems, which the authors highlight as essential for screening large volumes of audio.[8]
DeepMind extended SynthID to audio in November 2023, integrating it with Lyria, the company's music generation model launched the same month in collaboration with YouTube for the Dream Track experiment.[18] SynthID-Audio converts the generated waveform into a spectrogram, applies an imperceptible modification in the spectrogram domain, then converts back to audio; the detector recovers the watermark even after MP3 compression, speed changes, and the addition of background noise.[18][12] Lyria-generated outputs distributed through YouTube Shorts and Music AI Sandbox tools carry the SynthID-Audio mark.[18]
Google DeepMind extended SynthID to video on 14 May 2024, announcing that all videos generated by the Veo text-to-video model and accessible through VideoFX would carry an embedded watermark.[7][12] SynthID-Video applies the image watermark frame by frame, with consistency constraints across frames so that the mark survives frame-rate changes, transcoding, and re-encoding at lower bitrates. As of mid-2025 it is the only major commercial video watermarking system deployed at scale.[12]
AI watermarks and the C2PA cryptographic provenance framework approach the same problem, namely "is this content AI-generated and from whom", from opposite directions. C2PA attaches a cryptographically signed manifest to a file's metadata, recording the producing application, the editing history, and the producer's digital signature. The manifest is verifiable offline by anyone with the producer's public key, and richly informative; but it lives in metadata that standard upload pipelines on Instagram, X, TikTok, Facebook, and LinkedIn routinely strip during processing.[5][19] A watermark lives in the content signal itself and survives metadata loss, but carries a much smaller payload, requires the verifier to know which detector to run, and provides probabilistic rather than cryptographic guarantees.
The two approaches are complementary in industry roadmaps. The C2PA 2.0 specification, published in 2024, introduced "soft bindings", in which an invisible watermark in the content encodes a short identifier whose lookup in a manifest store recovers the full provenance record even after metadata stripping.[5][19] DeepMind, Adobe, Microsoft, and Meta have all published positions describing the combination of cryptographic provenance and watermarking as a defence-in-depth strategy, where each layer covers the other's failure modes.[5][11][12]
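The fallback logic behind a soft binding can be sketched as a two-step lookup. Everything in this example is hypothetical and heavily simplified: the manifest is a plain dictionary bound to the content by a SHA-256 hash, signature verification is omitted, the manifest store is an in-memory dictionary, and the watermark extractor is a toy function that looks for a byte prefix. It does not use the C2PA data model or any real detector.

```python
import hashlib

def resolve_provenance(content, metadata, manifest_store, extract_watermark_id):
    """Prefer an attached manifest; fall back to a watermark-keyed lookup."""
    manifest = metadata.get("manifest")
    if manifest is not None:
        # Hard binding: the manifest must refer to exactly these bytes.
        if manifest.get("content_sha256") == hashlib.sha256(content).hexdigest():
            return "embedded manifest", manifest
    # Soft binding: recover a short identifier from the content itself.
    watermark_id = extract_watermark_id(content)
    if watermark_id in manifest_store:
        return "soft binding", manifest_store[watermark_id]
    return "unknown", None

def toy_extract(content):
    """Stand-in for a real watermark detector (assumption for the demo)."""
    return "wm-0042" if content.startswith(b"WM") else None

image = b"WM...pixel data..."
manifest = {"producer": "ExampleGen", "content_sha256": hashlib.sha256(image).hexdigest()}
store = {"wm-0042": manifest}

print(resolve_provenance(image, {"manifest": manifest}, store, toy_extract)[0])
print(resolve_provenance(image, {}, store, toy_extract)[0])  # metadata stripped on upload
print(resolve_provenance(b"plain photo", {}, store, toy_extract)[0])
```

In this arrangement the watermark only has to carry a short identifier, while the rich, signed record lives in the store and survives metadata stripping.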
Across modalities, watermark schemes face a similar attack taxonomy. Compression and format conversion is the baseline attack: JPEG, WebP, MP3, AAC, H.264, and H.265 all discard high-frequency information that early watermarks relied on. Modern schemes are explicitly trained to survive these codecs by including the codecs in the adversarial training pipeline, so that the embedder learns to place the mark in components of the signal that the codec preserves.[1][2][8] Geometric and temporal transformations, including cropping, scaling, rotation, frame-rate change, and time stretching, are routinely included in adversarial training; Tree-Ring's Fourier-domain construction is designed specifically for invariance under these operations, because a rotation in the spatial domain corresponds to the same rotation in the Fourier domain, which leaves a circularly symmetric ring pattern unchanged, and the rings are chosen so that the pattern is also approximately preserved under crops and flips.[13]
Adversarial removal uses optimisation to find a perturbation that suppresses the watermark while preserving the content. Published attacks on Stable Signature, SynthID-Image, and Tree-Ring have all succeeded under various threat models, often using a second diffusion model to "regenerate" the content from a noisy version that lies outside the original watermark's basin of attraction. Saberi and colleagues at the University of Maryland argued in a 2023 paper that, under a sufficiently strong diffusion-based regeneration attack, any imperceptible image watermark must fail in a precise information-theoretic sense, because the regenerated image is in distribution under the original generator and cannot be distinguished from unwatermarked content.[20] Hu and collaborators demonstrated an explicit attack on Stable Signature in a paper titled "Stable Signature is Unstable", showing that an adversary with access to the watermarked decoder can fine-tune it for a small number of steps to scrub the embedded message while preserving image quality.[21] These results illustrate that, while watermarks raise the cost of laundering content, they do not constitute a hard guarantee.
For text, the dominant attack is paraphrasing: rewriting the output through a second LLM or a sentence-level paraphraser reduces detection accuracy from near-perfect to slightly above chance for sufficiently aggressive rewrites.[14][17] Retokenisation and Unicode substitution attacks replace characters with visually identical but differently encoded sequences, which changes the resulting tokens and breaks the n-gram seeds on which both the green-list scheme and SynthID-Text rely; defences require canonicalising the input before detection.[17] Translation, a special case of paraphrasing, reliably defeats current text watermarks because it rewrites the token sequence almost entirely while preserving meaning, and no current scheme is robust across languages.[16][17] Special-character insertion, in which an attacker has the model insert a chosen character (such as an emoji) between every word and then deletes it, was among the attacks OpenAI cited in 2024 when it described its text watermark as "less robust against globalised tampering".[15]
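A canonicalisation step of the kind mentioned above might look like the following minimal sketch, which applies Unicode NFKC normalisation and removes a small, hand-picked set of zero-width characters before the text is tokenised for detection. The character set and function name are assumptions for illustration. NFKC folds compatibility forms such as ligatures and full-width letters, but it does not map look-alike letters from other scripts (for example Cyrillic "а") to Latin, so a real defence would need an additional confusables table.

```python
import unicodedata

# Zero-width and joiner characters that are invisible when rendered
# (an illustrative, non-exhaustive set).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def canonicalise(text):
    """Fold compatibility characters with NFKC, then drop zero-width characters."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in folded if ch not in ZERO_WIDTH)

tampered = "wa\u200bter\ufb01lm"          # zero-width space plus an "fi" ligature
print(canonicalise(tampered))             # compare against the clean string below
print(canonicalise(tampered) == "waterfilm")
print(canonicalise("\uff21\uff29") == "AI")    # full-width letters are folded
print(canonicalise("\u0430") == "a")           # Cyrillic look-alike is NOT folded
```

The detector would then tokenise the canonical text, so that the n-gram seeds match those used at generation time wherever the underlying characters were unchanged.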
A separate category of risk is spoofing: producing a non-AI text or image that triggers a positive watermark detection. The Kirchenbauer paper notes that a determined attacker who has even partial access to the green-list partition function can construct misleading "watermarked" human text, and similar concerns apply when the detector's hashing function is known.[3] Cryptographic schemes such as the Aaronson construction are designed to make spoofing computationally hard, since constructing a spoofing example would require evaluating the pseudorandom function whose key the attacker does not possess. The trade-off is that cryptographic schemes need more elaborate detection machinery and need careful key-management practices to avoid leaking the key through repeated queries.[4][9] Zhang and colleagues, in a 2023 paper titled "Watermarks in the Sand", argue an impossibility result: for sufficiently expressive generative models and sufficiently small allowed distortions, no watermark can be simultaneously robust to all polynomial-time edit attacks; their proof formalises the intuition that an attacker with access to a related generative model can always smooth out a perturbation that another such model could have produced.[10]
AI watermarking cannot, in its current form, prevent a determined and informed adversary from producing AI content that passes as human. All deployed systems are, by design, probabilistic; they raise the cost and skill threshold of laundering AI content rather than eliminating the possibility. Three structural limitations recur across published evaluations.
The paraphrase and regeneration gap in text and image domains means a second generative model can be used to wash out a first model's watermark; as paraphrasing and image-to-image regeneration become commodity tools, the floor of the watermark's robustness drops.[14][20] Low-entropy outputs, especially in text, leave little statistical room for a watermark to operate; some text settings are effectively unwatermarkable.[16][17] Centralised detector dependency is intrinsic: a watermark is only useful if someone trusted holds the detector and is willing to run it, which makes the assurance third-party rather than self-evident, and concentrates verification power in the hands of model providers.[9][16]
A further open issue is interoperability. Each major producer ships its own watermark with its own detector, and there is no common API to ask "is this content watermarked by any participating system". The SynthID Detector portal, launched by Google on 20 May 2025 as a verification site for journalists and researchers, addresses this only within Google's own ecosystem; it can identify content watermarked by Gemini, Imagen, Lyria, and Veo, but not content watermarked by Meta, OpenAI, or other producers.[22][23] Standards activity within C2PA and the Content Authenticity Initiative is moving toward registries of watermark types that a single verifier can query, but as of 2025 this remains an aspiration.[5]
Finally, the business and policy incentives to deploy watermarks are uneven. OpenAI's decision to develop but withhold a text watermark from public deployment, reported in the Wall Street Journal in August 2024, illustrates the tension between provenance commitments and competitive concerns about user defection. Internal surveys cited in that report indicated that around a third of ChatGPT users in a sampled group said they would use the product less if their outputs were detectable as AI-generated, a figure the company weighed against the policy upside of releasing the tool.[15] Regulatory drivers, including the European Union AI Act's transparency obligations on providers of generative systems and the Biden administration's October 2023 executive order on AI, have begun to push providers toward watermarking deployment, but neither instrument mandates a specific technical scheme; the AI Act in particular requires that providers ensure outputs are "marked in a machine-readable format and detectable as artificially generated", language that leaves room for either watermarking or C2PA-style provenance signatures.[6][7]
The trajectory of AI watermarking deployment over 2023 to 2025 reflects a rapid move from research papers to production systems. Through the first half of 2023, the principal artefacts were arXiv preprints of papers that appeared at conferences later that year: the Kirchenbauer green-list paper at ICML, the Stable Signature paper at ICCV, and the Tree-Ring paper at NeurIPS. The Aaronson scheme remained known only through a blog post and a lecture. In August 2023, Google DeepMind shipped SynthID for images via Vertex AI, the first commercial AI watermarking system at scale.[1] In October 2023, Meta announced Stable Signature, framing the technique as both a research release and a candidate for integration into Meta's own image generators.[11] In November 2023, DeepMind launched SynthID-Audio in connection with Lyria.[18] In May 2024, DeepMind extended SynthID to video with Veo and announced production text watermarking for Gemini outputs.[7] The Nature paper describing SynthID-Text appeared in October 2024, and the same day Google open-sourced a reference implementation through Hugging Face transformers.[16][17]
Across the same window, Meta released AudioSeal at ICML 2024.[8] OpenAI continued to defer public deployment of its text watermark, with the WSJ reporting in August 2024 that the system had been ready for roughly a year.[15] On 20 May 2025, Google announced the SynthID Detector portal, opening a single verification site for journalists and researchers to check content against the four SynthID modalities.[22][23] By that date, DeepMind reported that more than ten billion pieces of content had been watermarked with some variant of SynthID since the technology's initial release in 2023.[12][22]