See also: AI text, AI content, and AI content detectors
Burstiness (also called text burstiness, word burstiness, or simply burstiness) is a statistical property of a sequence that measures how unevenly events, tokens, or sentence lengths are distributed across that sequence. In writing, it captures the extent to which short and long sentences alternate, how often a word reappears in clumps once introduced, and how much variation exists from one passage to the next. The concept moved from corpus linguistics and information theory into the mainstream when GPTZero, launched in January 2023 by Princeton undergraduate Edward Tian, used burstiness together with perplexity as the two headline signals of its AI detection model. Since then, burstiness has become one of the most widely cited heuristics for distinguishing human writing from text produced by a large language model.
The core intuition is simple. Human prose tends to be uneven. People mix a clipped four-word sentence with a winding thirty-word one, repeat a topical noun several times in one paragraph and then drop it for pages, and shift register without warning. Models trained on next-token prediction tend to regress toward the average. Their sentences cluster around a typical length, their vocabulary use is smoother, and the per-token surprisal varies less from one sentence to the next. Burstiness tries to put a number on that difference.
In the AI detection setting, burstiness is usually framed as the variance of perplexity (or sentence length) across the document, while perplexity itself is the average per-token unpredictability of the text under a reference language model. A text can have high average perplexity but still look machine-written if every sentence has roughly the same perplexity, because that flatness is itself a fingerprint of statistical generation. GPTZero summarises the relationship by saying that burstiness compares perplexity across sentences and that human text is more discontinuous than model output.
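As a rough illustration of that framing, the sketch below scores each sentence with GPT-2 (via the Hugging Face transformers library) as the reference model and then takes the variance across sentences. The naive sentence splitter and the example text are placeholders, and this is a minimal illustration of the idea, not GPTZero's published pipeline.

```python
# Sketch: per-sentence perplexity under a reference model, plus the
# variance across sentences used as a burstiness proxy.
# Requires: pip install torch transformers
import math
import statistics

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_perplexity(sentence: str) -> float:
    """exp(mean per-token negative log-likelihood) under GPT-2."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

text = ("The meeting ran long. Afterwards, exhausted and slightly dazed, "
        "we wandered out into the rain without saying much. Coffee helped.")
# Naive splitter; a real detector would use a proper sentence segmenter.
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

ppls = [sentence_perplexity(s) for s in sentences]
print("mean perplexity:", statistics.mean(ppls))
print("variance across sentences (burstiness proxy):", statistics.pvariance(ppls))
```

A flat document gives a small variance even if its mean perplexity is high, which is exactly the "high average but machine-looking" case described above.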
A loose ranking helps:

1. High burstiness: unedited human prose, which swings between short and long sentences and shows a wide spread of per-sentence perplexity.
2. Low burstiness: raw model output, in which sentence lengths and per-token probabilities stay within a narrow band.
3. Intermediate burstiness: hybrid text, such as an AI draft edited by a human or a human draft polished by a model.
The second category is what detectors are looking for. A typical untouched ChatGPT reply tends to land there because the decoding process favours statistically average continuations, and average continuations produce sentences of similar length and similar surprisal.
The statistical idea is older than the AI detection use case by several decades. It first appears in corpus linguistics around the question of why simple bag-of-words models fail. If words were sprinkled independently across documents, their counts would follow a Poisson distribution. They do not. Topical content words such as boycott, pope, or kennedy arrive in clumps: once a word appears in a document, the probability of seeing it again jumps far above the baseline rate. Kenneth Church and William Gale formalised this in their 1995 paper Poisson Mixtures and in companion work on inverse document frequency, where they showed that word counts are much better captured by a negative binomial distribution, equivalent to a Poisson with a Gamma-distributed rate parameter, than by a single Poisson.
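A small simulation makes the overdispersion point concrete. The sketch below, assuming NumPy and with arbitrary rate parameters, compares counts from a single fixed-rate Poisson with counts from a Gamma-mixed Poisson, the negative binomial of Church and Gale's analysis:

```python
# Sketch: why a single Poisson underfits word counts.
# A Gamma-mixed Poisson (equivalently, a negative binomial) lets the
# per-document rate vary, producing the clumpy counts Church and Gale observed.
import numpy as np

rng = np.random.default_rng(0)
n_docs = 100_000
mean_rate = 0.5  # average occurrences of some topical word per document

# Model A: one fixed rate for every document (pure Poisson).
poisson_counts = rng.poisson(mean_rate, size=n_docs)

# Model B: each document draws its own rate from a Gamma distribution
# (shape k, scale mean_rate / k, so the mean rate is unchanged),
# then counts are Poisson at that per-document rate.
k = 0.2  # small shape -> heavy-tailed rates -> bursty counts
rates = rng.gamma(k, mean_rate / k, size=n_docs)
mixed_counts = rng.poisson(rates)

for name, c in [("Poisson", poisson_counts), ("Gamma-Poisson", mixed_counts)]:
    fano = c.var() / c.mean()  # index of dispersion; 1 for a pure Poisson
    print(f"{name}: mean={c.mean():.3f} variance={c.var():.3f} Fano={fano:.2f}")
```

Both models produce the same mean count, but the mixture's variance is several times its mean, mirroring the clumped arrivals of topical words.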
This observation drives a lot of practical natural language processing. Term-frequency weighting, inverse document frequency, and topic models all have to cope with the fact that words are bursty. In 2005 Rasmus Madsen, David Kauchak, and Charles Elkan introduced the Dirichlet compound multinomial (DCM), also called the multivariate Polya distribution, as an alternative to the multinomial in text classification and clustering. The DCM adds one degree of freedom that lets a model say "this word, once seen, is more likely to appear again in this document," which corrects much of the burstiness problem and produces measurably better perplexity on standard document collections.
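The DCM itself is straightforward to evaluate. Below is a minimal implementation of its log-likelihood, assuming SciPy; the function and variable names are mine, not from the Madsen, Kauchak, and Elkan paper.

```python
# Sketch: log-likelihood of a bag-of-words count vector under the
# Dirichlet compound multinomial (multivariate Polya) distribution.
import numpy as np
from scipy.special import gammaln

def dcm_log_likelihood(counts: np.ndarray, alpha: np.ndarray) -> float:
    """log P(counts | alpha) for the DCM.

    Small alpha values encode strong burstiness: once a word appears,
    repeat appearances become much more likely. As alpha grows large,
    the DCM collapses to an ordinary multinomial.
    """
    n = counts.sum()
    A = alpha.sum()
    log_coeff = gammaln(n + 1) - gammaln(counts + 1).sum()  # n! / prod(x_i!)
    log_polya = gammaln(A) - gammaln(A + n)                 # Gamma(A) / Gamma(A + n)
    log_terms = (gammaln(alpha + counts) - gammaln(alpha)).sum()
    return log_coeff + log_polya + log_terms

# A repeated ("bursty") count vector is more plausible under small alpha.
x = np.array([5, 0, 0, 1])                       # one word repeated five times
print(dcm_log_likelihood(x, np.full(4, 0.1)))    # bursty prior: higher log-likelihood
print(dcm_log_likelihood(x, np.full(4, 50.0)))   # near-multinomial prior: lower
```

The single extra degree of freedom in alpha is what lets the model reward repetition of an already-seen word, which is the correction the paragraph above describes.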
A parallel line of work in physics and complex systems studies burstiness as a property of event sequences in time. Albert-László Barabási's 2005 Nature paper The origin of bursts and heavy tails in human dynamics argued that the timing of human activity, in things like email, library visits, and print jobs, follows non-Poisson statistics, with short stretches of intense activity separated by long quiet periods. Three years later, K.-I. Goh and Barabási proposed a compact way to quantify that pattern, now known as the Goh-Barabási burstiness parameter B. Although it was developed for inter-event times rather than text, the same parameter is sometimes adapted to sentence-length distributions in NLP because it shares the same intuitive scale.
There is no single canonical formula for burstiness. Different fields use different measures, and AI detection tools usually do not publish their exact equations. The most common formulations are summarised below.
| Measure | Formula (σ = standard deviation, μ = mean) | Range / interpretation | Typical use |
|---|---|---|---|
| Goh-Barabási parameter B | (σ - μ) / (σ + μ) | -1 (perfectly regular) to +1 (extremely bursty); 0 corresponds to a Poisson process | Inter-event times, adapted to sentence lengths |
| Index of dispersion (Fano factor) | σ² / μ | = 1 for a Poisson process; > 1 bursty / overdispersed; < 1 regular | Counting processes, word counts |
| Coefficient of variation (CV) | σ / μ | 0 for a constant series; rises as variance grows | General variability of sentence lengths or perplexities |
| Variance of per-sentence perplexity | Var(PPL_i) over sentences i | No fixed scale; higher means more bursty | AI text detection heuristics |
| Inverse participation ratio | Σ p_i² over normalised sentence lengths | Low = uniform, high = concentrated | Concentration measure borrowed from physics |
In practice, most AI detection contexts collapse all of this into one rough question: how much do per-sentence statistics fluctuate compared to their mean? A document where every sentence is around twenty tokens with similar average log-probability scores will look low-burstiness regardless of which formula is plugged in. A document that swings between a five-word fragment and a forty-word complex sentence with a wide spread of per-token probabilities will look high-burstiness on every measure.
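A short sketch shows how little machinery this requires. The function below applies three of the measures from the table to per-sentence token counts, using a deliberately naive sentence splitter:

```python
# Sketch: common burstiness measures applied to per-sentence token counts.
import statistics

def burstiness_measures(text: str) -> dict:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    mu = statistics.mean(lengths)
    sigma = statistics.pstdev(lengths)
    return {
        "goh_barabasi_B": (sigma - mu) / (sigma + mu),  # -1 regular .. +1 bursty
        "fano_factor": sigma**2 / mu,                   # 1 for Poisson-like counts
        "coefficient_of_variation": sigma / mu,         # 0 for constant lengths
    }

flat = "The cat sat on the mat. The dog lay on the rug. The bird sat on a perch."
bursty = ("Rain. It fell for three days straight, flooding the basement "
          "and the road out of town. We waited.")
print(burstiness_measures(flat))    # every measure at or near its minimum
print(burstiness_measures(bursty))  # every measure noticeably higher
```

The two toy inputs land at opposite ends of every measure, which is the agreement-across-formulas point made above.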
Decoder-only transformer models such as the GPT family generate text by sampling one token at a time from a probability distribution conditioned on what has been written so far. Even when temperature, top-p, and other decoding parameters are tuned to encourage diversity, the underlying objective is still next-token prediction, and it tends to regress toward statistically average behaviour at the sentence and paragraph level. Three reinforcing effects show up in the output:

- Sentence lengths cluster tightly around a typical value instead of swinging between fragments and long, winding constructions.
- Vocabulary is smoothed toward high-probability words, so lexical choices repeat more evenly than in human prose.
- Per-token surprisal stays within a narrow band, so per-sentence perplexity varies little across the document.
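To make the decoding step concrete, here is a minimal sketch of temperature scaling and nucleus (top-p) sampling over a vector of logits, assuming PyTorch; production decoders layer repetition penalties and other machinery on top of this.

```python
# Sketch: temperature + nucleus (top-p) sampling of one token from logits.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_p: float = 0.9) -> torch.Tensor:
    # Temperature scaling: values below 1 sharpen the distribution,
    # concentrating probability mass on the most likely tokens.
    probs = torch.softmax(logits / temperature, dim=-1)
    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalise and sample from it.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # always keeps at least one token
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice]

logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -3.0])  # toy vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```

Lowering temperature or top_p concentrates sampling on the most probable continuations, which is precisely the behaviour that flattens sentence-level statistics.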
Reinforcement learning from human feedback and instruction tuning amplify the smoothing. Human raters tend to prefer clear, well-structured, on-topic answers, so models that have been fine-tuned on those preferences become even more uniform in tone and rhythm. The result is text that reads as competent but monochrome, which is exactly what burstiness measures are designed to flag.
GPTZero is the tool most associated with the term in the public conversation. Edward Tian launched a prototype on 2 January 2023, weeks after ChatGPT's public release, and the system gained millions of users almost overnight as schools and universities scrambled to respond to AI-written assignments. GPTZero's documentation describes a statistical layer combining perplexity and burstiness as the first stage of detection, with subsequent layers added as the product matured. The service raised 3.5 million dollars in seed funding in May 2023 and a 10 million dollar Series A in 2024.
Other detectors use related but not identical signals. The table below summarises how a few well-known tools relate to burstiness. Exact algorithms are proprietary, so descriptions reflect public statements rather than published source code.
| Detector | Burstiness role | Other primary signals | Notes |
|---|---|---|---|
| GPTZero | Headline statistic, computed alongside perplexity | Sentence-level classifier, paragraph scoring | Most widely cited use of the term |
| Originality.AI | Used as one feature among many | Supervised classifier trained on AI vs human text | Markets to publishers and SEO professionals |
| OpenAI AI Text Classifier | Implicitly captured in classifier features | Fine-tuned GPT classifier | Released January 2023, withdrawn 20 July 2023 due to low accuracy |
| Turnitin AI detection | Internal statistical features | Sentence classification model | Targets education market |
| Winston AI, Copyleaks, ZeroGPT | Variable, often perplexity plus heuristics | Mixed classifier and statistical approaches | Quality and transparency vary |
Most commercial detectors no longer rely on burstiness alone. The dominant approach today is a supervised classifier trained on large corpora of paired AI and human text, with statistical features such as perplexity and burstiness fed in as inputs alongside richer linguistic signals.
Burstiness is a useful heuristic, not a reliable test. There are several well-documented failure modes.
False positives on simple, technical, or repetitive writing. Documents written for clarity, such as legal text, instructions, or formal reports, naturally tend toward uniform sentence length and predictable vocabulary, and they often score as low-burstiness even when no model was involved. Ars Technica and other outlets have shown that detectors flag the United States Constitution and similar canonical documents as AI-generated.
Bias against non-native English writers. The most cited critique is the 2023 study by Weixin Liang and colleagues at Stanford, GPT detectors are biased against non-native English writers, published in Patterns. The team tested several major detectors on TOEFL essays written by non-native English speakers and on academic writing samples. Detectors classified more than half of the non-native essays as AI-generated, while almost all native-speaker samples were correctly identified as human. The authors traced the effect to lower lexical and syntactic variety in non-native writing, which produces lower perplexity and flatter burstiness in the same way model output does.
Low standalone accuracy. When OpenAI shut down its own AI Text Classifier on 20 July 2023, it cited a true positive rate of only 26 percent for AI-written text and a false positive rate of 9 percent for human writing. A University of Maryland study and several follow-up evaluations have reached similar conclusions: no detector is reliable enough to be used as the sole basis for an accusation of academic dishonesty.
Easy evasion. Because burstiness measures variance, a writer can raise it artificially by mixing in short and long sentences, splicing fragments between long ones, and varying word choice. The Liang paper showed that simple prompting strategies, such as asking the model to rewrite the text "with literary flourish," pushed AI output past most detectors. Paraphrasing tools and so-called "humanisers" exploit the same weakness.
Sensitivity to length and editing. Burstiness scores are noisier on short passages, and they shift sharply when a human edits a model-generated draft or when a model polishes a human draft. The longer the document and the cleaner its provenance, the more reliable the signal; short or mixed-authorship texts give the noisiest scores.
These limitations explain why teachers, journals, and platforms have generally moved away from treating any detector score as proof of authorship. The signal is real, but it is statistical and it can be wrong on either side.
The AI detection use case is the most public, but burstiness shows up in several other corners of natural language processing.
Information retrieval and term weighting. Standard TF-IDF schemes implicitly assume that document term counts follow a Poisson-like distribution. The DCM and similar bursty models give better scores when they replace the multinomial assumption inside language models for retrieval.
Topic modelling. Latent Dirichlet allocation and similar models inherit the multinomial assumption and therefore underestimate how often a topical word repeats. Researchers including Gabriel Doyle and Charles Elkan have proposed accounting for burstiness directly inside topic models to improve held-out perplexity.
Named-entity bursts in news streams. In streaming-text settings, the sudden burst of mentions of a name or place is itself a feature used to detect breaking events, trending topics, and emerging entities; a toy version of this windowed test is sketched below.
Speech and disfluency analysis. Bursty patterns in pause length, filler words, and repetition appear in spoken-language analysis and have been used in clinical NLP for studies of speech disorders.
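As a toy version of the streaming case mentioned above, the sketch below flags time windows whose mention count far exceeds a median baseline; the window granularity, ratio threshold, and data are illustrative choices rather than a published algorithm.

```python
# Sketch: flagging a burst of entity mentions in a news stream.
import statistics

def find_bursts(mention_counts: list[int], ratio: float = 5.0) -> list[int]:
    """Return indices of windows whose count exceeds `ratio` times the median.

    The median is a burst-resistant baseline: unlike the mean, it is not
    dragged upward by the very spike we are trying to detect.
    """
    baseline = statistics.median(mention_counts) or 1  # guard: all-zero streams
    return [i for i, c in enumerate(mention_counts) if c >= ratio * baseline]

# Hourly mentions of a name: quiet baseline, then a spike as a story breaks.
stream = [2, 1, 3, 2, 2, 1, 2, 40, 55, 30, 4, 2]
print(find_bursts(stream))  # -> [7, 8, 9]
```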
Writers who want their drafts to score as more human, whether they are revising AI output or simply trying to liven up flat prose, generally aim for the same handful of moves:

- Vary sentence length deliberately, setting short fragments against long, complex sentences.
- Vary word choice and sentence openings rather than reusing the same constructions.
- Break the rhythm occasionally with an aside, a question, or an abrupt shift in register.
None of this guarantees that a detector will classify the result as human, but it does push the statistical fingerprint closer to typical human writing.
Imagine you are listening to two friends tell you about their day. One friend talks in bursts. They say a lot all at once, then a quick "yeah" or "anyway," then a long story, then a very short joke. The other friend talks in steady, even sentences that all sound about the same length, like they are reading from a list. The bursty friend feels more like a person. The steady one feels a bit like a robot. Burstiness is just a way of measuring that difference in writing instead of speaking. Tools like GPTZero look at a piece of text, count how much the sentences bounce around, and use that to guess whether a person or an AI wrote it. The trick is that the test is not perfect: some people, especially when they are writing in a language that is not their first, also write in even sentences and get mistaken for robots.