AI and the internet
Last reviewed
May 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,751 words
AI and the internet describes the ways that generative AI is reshaping the open web: how content is produced and discovered, how traffic flows between sites, who (or what) reads the resulting pages, and how the underlying business model of the web is being renegotiated. Since the public release of ChatGPT in November 2022, the marginal cost of producing plausible text and images has fallen close to zero, while large language models have begun to mediate how people find information at all. The result is a set of overlapping pressures: a surge in machine-generated pages and "AI slop," automated traffic that now rivals human visits, search products that answer questions without sending clicks onward, disputes over the scraping of web data to train models, and early attempts to re-establish provenance and trust.
Many of these effects are real and measurable; others are contested or projected. Estimates of how much of the web is AI-generated, for example, vary by an order of magnitude depending on definitions and methods. This article presents ranges with their sources rather than treating any single figure as settled. The closely related topic of the content itself (its types, detection, and regulation) is covered in detail at AI-generated content; this article focuses on the web as a system.
The most visible change is sheer volume. Producing a short article with a language model costs a fraction of a cent in API fees, against tens or hundreds of dollars for a human writer, and image generation is cheaper still. Those economics favor any operation that values quantity over quality, and they have filled the web with what has come to be called "AI slop": low-quality, mass-produced AI content published with little or no human oversight. The Oxford English Dictionary added the word in this sense in 2025 [1].
How much of the web is now machine-generated is genuinely uncertain, and the headline numbers differ sharply:
| Source | Date | What was measured | Finding |
|---|---|---|---|
| Stanford University, Imperial College London, and the Internet Archive | April 2026 | Newly published websites in the Wayback Machine, late 2022 to mid-2025 | 35.3% of new sites were created with AI assistance; 17.6% were fully AI-generated [2] |
| Ahrefs | April 2025 | About 900,000 newly published English-language web pages | 74.2% contained at least some AI-generated text [3] |
| Europol (earlier projection) | 2022 | Forward-looking estimate of online content | As much as 90% could be AI-generated by 2026 [1] |
These figures are not directly comparable. The Stanford-led study counts whole websites and distinguishes "AI-assisted" from "fully AI-generated," whereas the Ahrefs analysis flags any page containing AI text, a much broader category. The Europol number was a projection, not a measurement, and is frequently cited without that context. Detection methods are also imperfect, and the boundary between "AI-assisted" and "human" work is blurry, since many writers now use models to draft or edit. The reasonable summary is that a large and rising minority of new web pages involve generative AI, with the exact share depending heavily on how the question is framed.
The Stanford-led study is notable for complicating the most alarmist narrative. Testing several common assumptions, the researchers found that AI-generated pages were not markedly less factually accurate than human ones and often cited sources through external links. The documented harms were subtler: a contraction in the range of unique ideas and viewpoints, and prose that the authors described as increasingly sanitized and artificially cheerful [2].
The "dead internet theory" is the claim that most online activity and content is now automated rather than human, to the point that the web is effectively a hollow simulation. It began around 2021 on internet forums as a fringe conspiracy theory, originally framed in terms of bots and coordinated manipulation, and was widely dismissed at the time [4].
The rise of generative AI gave the idea a second life and a more sober framing. Commentators now invoke "dead internet" less as a literal conspiracy and more as shorthand for the lived experience of encountering large volumes of synthetic posts, comments, images, and articles. The Stanford-led measurement of AI-generated websites was widely reported as evidence that the theory is "partway" to reality, with outlets noting the gap between the conspiratorial original and the measured trend [2]. The full history, variants, and academic treatment of the idea are beyond the scope of this article.
For two decades the open web ran on an implicit bargain: search engines indexed sites and, in exchange, sent readers (and ad revenue) back to them. Generative AI is straining that arrangement from both ends, by making spam cheaper to produce and by answering queries directly so that fewer clicks ever leave the search page.
On the supply side, Google moved against scaled low-quality content in its March 2024 core update, which introduced three new spam policies. "Scaled content abuse" targets pages generated mainly to manipulate rankings rather than help users, and Google stated that it applies regardless of whether the content is produced by automation, by people, or by some combination, explicitly closing a loophole around AI text. The same update added policies against "site reputation abuse" (parasite SEO) and "expired domain abuse." Google said the changes were intended to reduce low-quality, unoriginal results, and later reported that such content in search results had fallen by roughly 45% [5].
On the demand side, Google's AI Overviews (the AI-written summaries shown above traditional results) appear to be reducing clicks to outside sites. A July 2025 Pew Research Center study of more than 68,000 searches by about 900 U.S. adults found that users clicked a traditional result in 8% of searches that showed an AI summary, against 15% when no summary appeared; clicks on links inside the summary were rarer still, at about 1% of visits. Users were also more likely to end their browsing session after a page with a summary (26%) than without one (16%). Google disputed the study's methodology and conclusions [6]. Separately, Similarweb reported that the share of "zero-click" Google searches (those ending without any click) rose from roughly 56% in May 2024 to about 69% in May 2025, climbing to around 83% when an AI Overview was present [7].
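The figures in this paragraph mix two kinds of change that are easy to conflate: differences in percentage points and relative differences. A minimal sketch of the arithmetic follows, using only the numbers quoted above. The helper names are illustrative and do not come from either study.

```python
def point_change(before: float, after: float) -> float:
    """Absolute change, in percentage points."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value, as a percentage."""
    return (after - before) / before * 100


# Pew: traditional-result click rate without vs. with an AI summary.
print(point_change(15, 8))                # -7 (percentage points)
print(round(relative_change(15, 8), 1))   # -46.7 (percent, relative)

# Similarweb: zero-click share, May 2024 vs. May 2025.
print(point_change(56, 69))               # 13 (percentage points)
print(round(relative_change(56, 69), 1))  # 23.2 (percent, relative)
```

Read this way, a seven-point drop in click rate corresponds to nearly half of the baseline clicks, which is why the same data can be summarized in very different-sounding ways.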
Publishers have reported steep traffic losses over the same period, which they attribute substantially to AI summaries and changing search behavior, though disentangling that from algorithm changes and broader shifts is difficult. A wave of reporting in 2025 described the situation for news sites in stark terms, including an NPR account quoting publishers who called it an "extinction-level event" [8]. The directional effect (fewer referral clicks from search) is well supported; the precise magnitude attributable to AI specifically is contested.
The "who is reading" question has its own answer. According to the 2025 Imperva Bad Bot Report, published by Thales, automated traffic surpassed human traffic for the first time in a decade, reaching about 51% of all web traffic, with malicious "bad bots" alone accounting for roughly 37%, up from 32% the year before [9].
The report attributes much of the rise to generative AI, which lowers the barrier to building and operating bots and lets less sophisticated actors launch higher volumes of attacks. Among AI-related bot activity, Imperva identified ByteSpider (operated by ByteDance) as responsible for about 54% of AI-enabled attacks, followed by AppleBot at 26%, ClaudeBot at 13%, and a ChatGPT user agent at 6% [9]. These automated agents fall into several overlapping categories:
| Bot type | Purpose | Example concern |
|---|---|---|
| Training crawlers | Collect text and media to train models | Scraping volume, server load, copyright |
| Retrieval agents | Fetch pages in real time to answer user prompts | Bypass paywalls and ad views |
| Engagement bots | Post and interact on social platforms | Inflate metrics, spread synthetic content |
| Malicious bots | Credential stuffing, scalping, scraping | Fraud and abuse, now AI-assisted |
The practical consequence for site operators is rising infrastructure cost and a harder distinction between legitimate human readers, helpful crawlers, and abuse.
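One rough first step an operator might take in sorting this traffic is to bucket requests by their User-Agent header. The sketch below is illustrative only. The mapping from tokens to the categories in the table above is an assumption of this example, not Imperva's methodology, and the log entries are hypothetical. User-Agent strings are also trivially spoofed, so the result is a hint and not an identification.

```python
from collections import Counter

# Assumed mapping from User-Agent substrings (tokens commonly seen in server
# logs) to the coarse categories used in the table above.
KNOWN_AGENTS = {
    "GPTBot": "training crawler",
    "ClaudeBot": "training crawler",
    "Bytespider": "training crawler",
    "ChatGPT-User": "retrieval agent",
}


def classify(user_agent: str) -> str:
    """Return a coarse category for a User-Agent string."""
    lowered = user_agent.lower()
    for token, category in KNOWN_AGENTS.items():
        if token.lower() in lowered:
            return category
    # Could be a human, an unlisted bot, or a bot posing as a browser.
    return "unclassified"


def summarize(user_agents: list[str]) -> Counter:
    """Count requests per category for a batch of log entries."""
    return Counter(classify(ua) for ua in user_agents)


# Hypothetical log sample.
sample = [
    "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
]
print(summarize(sample))
```

Because the "unclassified" bucket mixes humans with bots that disguise themselves, header matching alone cannot separate malicious bots from readers. Doing that would need further signals, such as source IP ranges or request behavior.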
Language and image models are trained on enormous quantities of web data, much of it gathered by crawling public pages. As that became widely understood, many site owners moved to limit it, turning the long-standing robots.txt convention into a front line. Robots.txt is a voluntary file that asks crawlers not to visit certain paths; it has no technical enforcement.
Blocking surged once AI crawlers were named. By late 2025, roughly 5.6 million websites listed OpenAI's GPTBot in their robots.txt disallow rules, up from about 3.3 million mid-year, part of a reported 336% year-over-year increase in sites blocking AI crawlers; around 79% of top news sites blocked AI training bots [10]. Compliance, however, is slipping. One analysis found that about 13.26% of AI-bot requests ignored robots.txt directives in the second quarter of 2025, up from 3.3% at the end of 2024, with one OpenAI retrieval agent reportedly ignoring the rules in as many as 42% of cases despite documentation claiming otherwise [10]. Whether crawlers honor robots.txt has begun to surface in litigation, including Reddit's scraping suit against Anthropic [10].
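To make the mechanism concrete, the sketch below shows how a well-behaved crawler could consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `SomeOtherBot` name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, and keep all other
# crawlers out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

article = "https://example.com/articles/story.html"
private = "https://example.com/private/report.html"

print(parser.can_fetch("GPTBot", article))        # False
print(parser.can_fetch("SomeOtherBot", article))  # True
print(parser.can_fetch("SomeOtherBot", private))  # False
```

The check runs entirely on the crawler's side. A crawler that never calls anything like `can_fetch`, or that ignores the answer, can still request the page, which is why the non-compliance figures above matter.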
At the same time, a parallel market in licensing has emerged. Rather than block scraping outright, some publishers have signed paid deals granting AI companies access to their archives. Reported terms range from about $1 million to more than $250 million annually for large publishers; Reddit is reported to earn roughly $60 million a year from Google for data access, and companies including OpenAI and Perplexity have signed agreements with outlets such as USA Today, The Washington Post, and The Guardian [10]. To formalize this, a machine-readable standard called Really Simple Licensing (RSL) was published in December 2025, extending robots.txt so that sites can attach licensing and payment terms; it gathered backing from more than 50 partners, though as of early 2026 no major AI company had committed to honoring it [10].
The tension is structural. The open web's content was largely produced under an assumption of human readership and ad-supported or subscription business models. Training crawlers and answer engines consume that content without necessarily returning traffic, which is why the period is sometimes described as the end of "free scraping" and the beginning of a negotiated, paywalled, or licensed web [11].
If synthetic content is unavoidable, one response is to make origin verifiable rather than guessed. Two complementary approaches dominate: provenance metadata and watermarking.
Provenance is led by the Coalition for Content Provenance and Authenticity (C2PA) and its Content Credentials. C2PA defines a cryptographically signed manifest, bound to an image, video, or audio file, that records the device or model that produced it and the chain of edits applied. Version 2.1, ratified in 2025, was published as an international standard (ISO/IEC 22144), and support was built into Adobe's Creative Cloud applications and Microsoft's image tools. In May 2026, OpenAI joined the C2PA steering committee and committed to pairing Content Credentials with watermarking [12].
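The idea of a signed manifest bound to a file can be illustrated with a toy example. This is not the C2PA format or API. The field names, the key, and the use of HMAC are assumptions made to keep the sketch within Python's standard library, whereas real Content Credentials rely on certificate-based signatures. It shows only the two checks that matter conceptually: the manifest has not been altered, and the file still matches the hash recorded in it.

```python
import hashlib
import hmac
import json

# Stand-in signing key (assumption). HMAC replaces the asymmetric,
# certificate-based signatures a real provenance system would use.
SIGNING_KEY = b"demo-key-not-for-production"


def make_manifest(asset: bytes, generator: str, edits: list[str]) -> dict:
    """Build a toy manifest bound to the asset by its SHA-256 hash."""
    claim = {
        "asset_sha256": hashlib.sha256(asset).hexdigest(),
        "generator": generator,
        "edits": edits,
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify(asset: bytes, manifest: dict) -> bool:
    """Check the signature first, then check the asset against the claim."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False  # the manifest itself was altered
    return manifest["claim"]["asset_sha256"] == hashlib.sha256(asset).hexdigest()


image = b"placeholder image bytes"  # hypothetical asset
manifest = make_manifest(image, "example-model-v1", ["crop", "resize"])

print(verify(image, manifest))                  # True
print(verify(image + b" re-encoded", manifest)) # False: bytes no longer match
```

In this toy version any change to the bytes breaks the binding, and a copy that arrives without its manifest carries no provenance at all.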
Watermarking embeds a signal directly in generated output. Google DeepMind's SynthID, the most widely deployed system, adds an imperceptible watermark to images, audio, video, and text from Google's models and had been applied to more than 20 billion images by 2026 [12]. Watermarking and provenance are also being pulled forward by regulation: Article 50 of the EU AI Act requires AI-generated or manipulated content to be marked in a machine-readable way, with enforcement of those transparency duties beginning in August 2026 [12].
The limits are well understood. Provenance metadata only helps where it is present and preserved, and it can be stripped by re-encoding, screenshotting, or uploading to platforms that discard it. Watermarks can survive more handling but may be weakened by cropping, compression, or paraphrasing of text, and they require broad cooperation among model providers to be useful. After-the-fact detectors that try to classify whether existing text or images are AI-made remain unreliable: accuracy falls on edited or paraphrased content, false positives disproportionately affect non-native English writers, and the best detectors trail the best generators. The detection landscape is treated at length in AI-generated content [1].
A longer-term risk for the web as a data source is "model collapse." A 2024 paper in Nature by Shumailov and colleagues showed that when generative models are trained repeatedly on the output of earlier models, successive generations degrade, first losing the rare tails of the data distribution and eventually producing narrow, repetitive nonsense. In one illustration, a model prompted about medieval architecture had, by the ninth recursive generation, drifted into producing a list of jackrabbits [13].
The connection to the open web is direct. As AI-generated pages become a larger share of what crawlers collect, future models risk training partly on their predecessors' output, which could erode quality over time. Researchers note that the effect can be mitigated, for example by keeping synthetic data a bounded fraction of training sets, filtering AI text, preserving curated archives of pre-AI human writing, or accumulating synthetic data alongside (rather than replacing) human data; the precise conditions under which collapse is avoided remain an active research question [13].
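The dynamic can be reproduced in miniature. The sketch below is a toy one-dimensional analogue, not the paper's language-model experiments. Each "model" is a Gaussian fitted to a small sample, and each new generation is trained on samples drawn from the previous one. The sample size, generation count, seed, and the `real_fraction` parameter are all assumptions chosen for illustration. `real_fraction` stands in for the mitigation of keeping some original human data in every training set.

```python
import random
import statistics


def simulate(generations: int, n: int, real_fraction: float, seed: int) -> list[float]:
    """Refit a Gaussian each generation; return the fitted std per generation.

    real_fraction is the share of each training set drawn from the original
    "human" distribution N(0, 1); the rest is sampled from the previous model.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    history = [sigma]
    n_real = round(n * real_fraction)
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate
        history.append(sigma)
    return history


recursive = simulate(generations=200, n=10, real_fraction=0.0, seed=0)
mixed = simulate(generations=200, n=10, real_fraction=0.5, seed=0)

print(f"synthetic only: final std = {recursive[-1]:.2e}")
print(f"half real data: final std = {mixed[-1]:.2f}")
```

With purely synthetic training data the fitted spread shrinks toward zero, the one-dimensional counterpart of losing the tails of the distribution. Mixing in fresh real samples keeps the spread from collapsing in this toy setting. Whether comparable fractions suffice for large models is part of the open question noted above.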
The same capabilities also have clear upside. Generative tools lower the cost of publishing for people who lack writing, design, or coding skills, and they support translation and accessibility (such as alt text and summaries) at scale. Answer engines and AI Overviews can satisfy simple informational needs quickly. Licensing markets, if they mature, could channel revenue back to the publishers whose work trains and grounds these systems, and provenance standards could eventually make the origin of media more legible than it has ever been. Many working journalists, marketers, and developers now use models routinely as drafting and research aids, with human editing on top [11].
As of mid-2026, the trajectory points toward a web in which synthetic and human content are deeply mixed, automated traffic is the majority, and the click-based economy that funded the open web is being renegotiated through paywalls, licensing standards such as RSL, and direct deals. Whether the outcome is a healthier, provenance-aware web or a degraded, slop-saturated one will depend on unresolved contests: between scraping and licensing, between answer engines and the sites they summarize, and between the cost of generating content and the cost of verifying it. The measured effects so far (more machine-made pages, fewer referral clicks, more bots, and early provenance infrastructure) are clearer than the long-run equilibrium they are heading toward.