# Burstiness

> Source: https://aiwiki.ai/wiki/burstiness
> Updated: 2026-06-28
> Categories: AI Tools & Products, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: AI text, AI content and [AI content detectors](/wiki/ai_content_detectors)*

**Burstiness** is a statistical property of a text: how unevenly its sentence lengths, vocabulary, and per-sentence unpredictability are distributed across the document. In [AI detection](/wiki/ai_content_detectors), high burstiness (a wide mix of short and long sentences with varying [perplexity](/wiki/perplexity)) is treated as a signal of human authorship, while uniform, low-burstiness text is treated as a signal that a [large language model](/wiki/large_language_model) wrote it. The concept moved from corpus linguistics and information theory into the mainstream when [GPTZero](/wiki/gptzero), launched in January 2023 by Princeton undergraduate Edward Tian, used burstiness together with [perplexity](/wiki/perplexity) as the two headline signals of its AI detection model.[1][3] Since then burstiness has become one of the most widely cited heuristics for distinguishing human writing from text produced by a [large language model](/wiki/large_language_model), though research has repeatedly shown that on its own it is an unreliable test (see "How reliable is burstiness, and is it biased?" below).[24][26]

The core intuition is simple. Human prose tends to be uneven. People mix a clipped four-word sentence with a winding thirty-word one, repeat a topical noun several times in one paragraph and then drop it for pages, and shift register without warning. Models trained on next-token prediction tend to regress toward the average. Their sentences cluster around a typical length, their vocabulary use is smoother, and the per-token surprisal varies less from one sentence to the next. Burstiness tries to put a number on that difference. GPTZero's founder Edward Tian frames it as a quality of variation: burstiness is "a measure of how much writing patterns and text perplexities vary over the entire document," and where models "formulaically use the same rule to choose the next word," people "tend to vary their sentence construction and diction throughout a document."[1]

## What is burstiness in AI detection?

In the AI detection setting, burstiness is usually framed as the variance of [perplexity](/wiki/perplexity) (or sentence length) across the document, while perplexity itself is the average per-token unpredictability of the text under a reference language model. A text can have high average perplexity but still look machine-written if every sentence has roughly the same perplexity, because that flatness is itself a fingerprint of statistical generation. GPTZero summarises the relationship by saying that burstiness compares perplexity across sentences and that human text is more discontinuous than model output.[1] In its own support material the company stresses that neither signal is read in isolation: detectors look for a balance of perplexity and burstiness that resembles natural human writing, and GPTZero now layers these statistics inside a larger model rather than treating them as a verdict on their own.[18]

A rough three-way classification helps:

- High burstiness, high perplexity: most likely human.
- Low burstiness, low perplexity: most likely model-generated.
- Mixed scores: ambiguous, often the result of editing, paraphrasing, or non-native English writing.

The second category is what detectors are looking for. A typical untouched [ChatGPT](/wiki/chatgpt) reply tends to land there because the decoding process favours statistically average continuations, and average continuations produce sentences of similar length and similar surprisal.
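The relationship between the two signals can be sketched in a few lines. The example below is a minimal illustration, not any detector's actual method: it swaps in a toy add-one-smoothed unigram model as the reference model (real detectors score text with a neural language model), and the names `UnigramModel` and `perplexity_profile` are hypothetical. It reports the mean per-sentence perplexity and, as a burstiness proxy, the standard deviation of per-sentence perplexity.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; deliberately crude."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text: str) -> list[str]:
    """Split on ., ! or ? followed by whitespace; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if tokenize(p)]


class UnigramModel:
    """Toy reference model (an assumption): add-one-smoothed unigram counts."""

    def __init__(self, reference_text: str):
        self.counts = Counter(tokenize(reference_text))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def log_prob(self, token: str) -> float:
        return math.log((self.counts[token] + 1) / (self.total + self.vocab))


def sentence_perplexity(model: UnigramModel, sentence: str) -> float:
    """exp of the average negative log-probability per token."""
    tokens = tokenize(sentence)
    avg_nll = -sum(model.log_prob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)


def perplexity_profile(model: UnigramModel, text: str) -> dict:
    """Mean perplexity plus its spread across sentences (burstiness proxy)."""
    ppls = [sentence_perplexity(model, s) for s in split_sentences(text)]
    return {"mean_ppl": mean(ppls), "burstiness": pstdev(ppls), "per_sentence": ppls}
```

A text whose sentences all score alike gets a burstiness of zero under this sketch whatever its mean perplexity, which is the "flat" profile the paragraph above describes.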

## Where does burstiness come from in linguistics and information theory?

The statistical idea predates the AI detection use case by several decades. It first appears in corpus linguistics, in the question of why simple [bag-of-words](/wiki/bag_of_words) models fail. If words were sprinkled independently across documents, their counts would follow a Poisson distribution. They do not. Topical content words such as *boycott*, *pope*, or *kennedy* arrive in clumps: once a word appears in a document, the probability of seeing it again jumps far above the baseline rate. Kenneth Church and William Gale formalised this in their 1995 paper *Poisson Mixtures* and in companion work on inverse document frequency, where they showed that word counts are much better captured by a negative binomial distribution than by a single Poisson.[8][9] The negative binomial is the Poisson mixture in which the per-document rate is drawn from a Gamma density, and it gives more accurate estimates of variance, entropy, and IDF.[8] The standard summary of their result is that a word is "bursty or contagious" if, after its first mention, it is likely to be observed again in the same document, which is exactly the behaviour a single Poisson cannot produce.[8][9]
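The gap between a single Poisson and a Gamma-mixed Poisson can be seen in a small simulation. This is an illustrative sketch with made-up parameters (mean rate, Gamma shape, number of documents), not a reproduction of Church and Gale's fits. It draws per-document counts both ways with the same mean and compares the variance-to-mean ratio, which is about 1 for a Poisson and larger when counts are overdispersed.

```python
import math
import random
from statistics import mean, pvariance


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def index_of_dispersion(counts: list[int]) -> float:
    """Variance divided by mean of the counts."""
    return pvariance(counts) / mean(counts)


def simulate_counts(n_docs: int, mean_rate: float, gamma_shape: float,
                    rng: random.Random) -> tuple[list[int], list[int]]:
    """Return (single-Poisson counts, Gamma-mixed Poisson counts), equal means."""
    scale = mean_rate / gamma_shape  # Gamma mean = shape * scale
    plain = [sample_poisson(mean_rate, rng) for _ in range(n_docs)]
    mixed = [sample_poisson(rng.gammavariate(gamma_shape, scale), rng)
             for _ in range(n_docs)]
    return plain, mixed


if __name__ == "__main__":
    rng = random.Random(0)
    plain, mixed = simulate_counts(20000, mean_rate=2.0, gamma_shape=0.5, rng=rng)
    print("single Poisson dispersion:", round(index_of_dispersion(plain), 2))
    print("Gamma-mixed dispersion:   ", round(index_of_dispersion(mixed), 2))
```

In the mixed case most simulated documents contain the word zero times and a few contain it many times, which is the clumping pattern described above.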

This observation drives a lot of practical [natural language processing](/wiki/natural_language_processing). Term-frequency weighting, inverse document frequency, and topic models all have to cope with the fact that words are bursty. In 2005 Rasmus Madsen, David Kauchak, and Charles Elkan introduced the Dirichlet compound multinomial (DCM), also called the multivariate Polya distribution, as an alternative to the multinomial in text classification and clustering. The DCM adds one degree of freedom that lets a model say "this word, once seen, is more likely to appear again in this document," which corrects much of the burstiness problem and produces measurably better perplexity on standard document collections.[10]
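One way to see what the extra degree of freedom does is the Polya urn view of the DCM: words are drawn one at a time, and each draw adds weight to the word just drawn. The sketch below is a toy simulation under assumed parameters (a 50-word vocabulary with equal weights of 0.1), not the fitting procedure from the paper. It compares how many distinct words appear in urn-generated documents and in plain multinomial documents.

```python
import random
from statistics import mean


def sample_multinomial_doc(weights: dict[str, float], length: int,
                           rng: random.Random) -> list[str]:
    """Independent draws: seeing a word does not change its probability."""
    words = list(weights)
    return rng.choices(words, weights=[weights[w] for w in words], k=length)


def sample_polya_doc(alpha: dict[str, float], length: int,
                     rng: random.Random) -> list[str]:
    """Polya urn draws: each drawn word gains one unit of weight."""
    words = list(alpha)
    urn = dict(alpha)
    doc = []
    for _ in range(length):
        word = rng.choices(words, weights=[urn[w] for w in words], k=1)[0]
        urn[word] += 1.0  # reinforcement: this word is now more likely again
        doc.append(word)
    return doc


def mean_distinct_words(sampler, weights: dict[str, float], length: int,
                        n_docs: int, rng: random.Random) -> float:
    """Average number of distinct words per simulated document."""
    return mean(len(set(sampler(weights, length, rng))) for _ in range(n_docs))


if __name__ == "__main__":
    rng = random.Random(1)
    vocab = {f"w{i}": 0.1 for i in range(50)}  # hypothetical vocabulary
    print("multinomial:", mean_distinct_words(sample_multinomial_doc, vocab, 50, 200, rng))
    print("Polya urn:  ", mean_distinct_words(sample_polya_doc, vocab, 50, 200, rng))
```

With small starting weights the urn documents reuse a handful of words heavily, so they contain far fewer distinct words than multinomial documents of the same length.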

The phenomenon is not limited to a word's overall frequency. In a 2009 *PLoS ONE* study, *Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words*, Eduardo Altmann, Janet Pierrehumbert, and Adilson Motter showed that the gaps between successive occurrences of a word are well described by a stretched-exponential (Weibull) distribution rather than the exponential gaps a Poisson process would produce.[20] They found that how bursty a word is depends mainly on its semantic type, a measure related to how "topical" or content-bearing the word is, with raw frequency playing only a secondary role. Pierrehumbert later extended this line of work to morphology, reporting that derived nouns inherit their burstiness from the deverbal suffix that heads them rather than from the verb stem, which ties discourse statistics to linguistic structure.

A parallel line of work in physics and complex systems studies burstiness as a property of event sequences in time. Albert-Laszlo Barabasi's 2005 *Nature* paper *The origin of bursts and heavy tails in human dynamics* argued that human activity timing, things like email, library visits, and printing jobs, follows non-Poisson statistics, with short stretches of intense activity separated by long quiet periods.[11] Three years later K.-I. Goh and Barabasi proposed a compact way to quantify that pattern, now known as the Goh-Barabasi burstiness parameter B.[12] Although it was developed for inter-event times rather than text, the same parameter is sometimes adapted to sentence-length distributions in NLP because it shares the same intuitive scale. The original parameter has a known weakness on short sequences: Eun-Kyeong Kim and Hang-Hyun Jo showed in 2016 that B is strongly distorted by the finite number of events in a series and proposed a corrected definition that removes this bias, a caveat that matters whenever burstiness is estimated from a short passage of text.[13]
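Applied to text, the inter-event times become the gaps between successive occurrences of a word. The following sketch computes those gaps and the uncorrected Goh-Barabasi parameter B = (sigma - mu) / (sigma + mu). The function names are hypothetical, and the Kim-Jo finite-size correction is deliberately left out, so values from short passages should be read with the caveat above in mind.

```python
from statistics import mean, pstdev


def recurrence_gaps(tokens: list[str], word: str) -> list[int]:
    """Distances, in tokens, between successive occurrences of `word`."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def burstiness_parameter(intervals: list[int]) -> float:
    """Uncorrected Goh-Barabasi B = (sigma - mu) / (sigma + mu)."""
    if len(intervals) < 2:
        raise ValueError("need at least two intervals")
    mu, sigma = mean(intervals), pstdev(intervals)
    return (sigma - mu) / (sigma + mu)


if __name__ == "__main__":
    # Evenly spaced occurrences: every gap is identical, so sigma is 0.
    regular = ["x" if i % 5 == 0 else "_" for i in range(30)]
    # Two tight clusters separated by a long lull.
    clumped = ["_"] * 60
    for i in (0, 1, 2, 50, 51, 52):
        clumped[i] = "x"
    print(burstiness_parameter(recurrence_gaps(regular, "x")))
    print(burstiness_parameter(recurrence_gaps(clumped, "x")))
```

The evenly spaced word scores -1, the bottom of the scale, while the clustered word scores above zero.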

## How is burstiness calculated?

There is no single canonical formula for burstiness. Different fields use different measures, and AI detection tools usually do not publish their exact equations.[14][15] The most common formulations are summarised below.

| Measure | Formula | Range / interpretation | Typical use |
|---|---|---|---|
| Goh-Barabasi parameter B | (sigma - mu) / (sigma + mu) | -1 (perfectly regular) to +1 (extremely bursty); 0 corresponds to a Poisson process | Inter-event times, adapted to sentence lengths |
| Index of dispersion (Fano factor) | sigma^2 / mu | =1 for Poisson; >1 indicates bursty / overdispersed; <1 indicates regular | Counting processes, word counts |
| Coefficient of variation (CV) | sigma / mu | 0 for constant series; rises as variance grows | General variability of sentence lengths or perplexities |
| Variance of per-sentence perplexity | Var(PPL_i) over sentences i | No fixed scale; higher means more bursty | AI text detection heuristics |
| Inverse participation ratio | sum(p_i^2) over normalized sentence lengths | Low = uniform, high = concentrated | Concentration measure borrowed from physics |

In practice, most AI detection contexts collapse all of this into one rough question: how much do per-sentence statistics fluctuate compared to their mean? A document where every sentence is around twenty tokens with similar average log-probability scores will look low-burstiness regardless of which formula is plugged in. A document that swings between a five-word fragment and a forty-word complex sentence with a wide spread of per-token probabilities will look high-burstiness on every measure.
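Several of the measures in the table can be computed directly from sentence lengths. The sketch below is a minimal illustration assuming a naive sentence splitter and whitespace word counts; commercial detectors do not publish their formulas, so this should not be read as any tool's implementation.

```python
import re
from statistics import mean, pstdev


def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, splitting on ., ! or ? followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s.split()]


def dispersion_measures(lengths: list[int]) -> dict[str, float]:
    """Coefficient of variation, Fano factor, Goh-Barabasi B, and IPR."""
    mu, sigma = mean(lengths), pstdev(lengths)
    total = sum(lengths)
    shares = [n / total for n in lengths]  # normalized sentence lengths p_i
    return {
        "cv": sigma / mu,
        "fano": sigma ** 2 / mu,
        "goh_barabasi_b": (sigma - mu) / (sigma + mu),
        "ipr": sum(p * p for p in shares),
    }


if __name__ == "__main__":
    print(dispersion_measures([20, 20, 20, 20]))  # uniform lengths
    print(dispersion_measures([5, 40, 8, 30]))    # uneven lengths
```

For perfectly uniform lengths the coefficient of variation and Fano factor are 0, B is -1, and the inverse participation ratio equals 1/N for N sentences, so the last measure is only comparable between texts with the same number of sentences.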

## Why do language models produce low-burstiness text?

Decoder-only [transformer](/wiki/transformer) models such as the [GPT](/wiki/gpt_generative_pre-trained_transformer) family generate text by sampling one token at a time from a probability distribution conditioned on what has been written so far. Even when temperature, top-p, and other decoding parameters are tuned to encourage diversity, the underlying objective is still next-token prediction, and it tends to regress toward statistically average behaviour at the sentence and paragraph level. Three reinforcing effects show up in the output:

1. Sentence lengths cluster around the model's training-distribution mean. Models rarely produce two-word sentences or runaway sixty-word ones unless the prompt specifically pushes them there.
2. Lexical variety is smoothed. The same general vocabulary is reused across paragraphs, with fewer of the topical bursts that human writers create when they discover a useful word and then rely on it.
3. Per-sentence surprisal is flat. Because each sentence is built mostly from locally high-probability tokens, the perplexity from one sentence to the next stays in a narrow band.

[Reinforcement learning from human feedback](/wiki/rlhf) and instruction tuning amplify the smoothing. Human raters tend to prefer clear, well-structured, on-topic answers, so models that have been fine-tuned on those preferences become even more uniform in tone and rhythm. The result is text that reads as competent but monochrome, which is exactly what burstiness measures are designed to flag.
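The temperature setting mentioned in this section can be illustrated with a small sketch. The logits are invented for the example and the function names are hypothetical. The code applies a temperature-scaled softmax and reports the entropy of the resulting next-token distribution: lower temperatures concentrate probability on the top token, while higher ones spread it out.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical next-token scores
    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs], round(entropy_bits(probs), 3))
```

Raising the temperature widens the choice at each step, but it does not change the ranking of candidate tokens, which is one way to read the point above that tuning decoding parameters leaves the underlying objective untouched.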

## How does GPTZero use burstiness?

GPTZero is the tool most associated with the term in the public conversation. Edward Tian launched a prototype on 2 January 2023, weeks after [ChatGPT](/wiki/chatgpt)'s public release, and the system gained millions of users almost overnight as schools and universities scrambled to respond to AI-written assignments.[2][3] In its founding explainer, GPTZero defines the two signals plainly: perplexity is "a measure of how likely an AI model would have chosen the exact same set of words as found in the document," and burstiness is "a measure of how much writing patterns and text perplexities vary over the entire document."[1] The documentation describes the statistical layer that combines perplexity and burstiness as the first stage of detection, with subsequent layers added as the product matured; the company now describes its system as a multilayered model with seven components rather than the original two statistics.[1][18] The service raised 3.5 million dollars in seed funding in May 2023, co-led by Uncork Capital and Neo, and a 10 million dollar Series A in June 2024 led by Footwork VC, bringing total funding to 13.5 million dollars. By the time of the Series A, its registered user base had grown from roughly one million to four million in twelve months, and annual recurring revenue had grown 500 percent in the prior six months.[19]

Other detectors use related but not identical signals. The table below summarises how a few well-known tools relate to burstiness. Exact algorithms are proprietary, so descriptions reflect public statements rather than published source code.

| Detector | Burstiness role | Other primary signals | Notes |
|---|---|---|---|
| [GPTZero](/wiki/gptzero) | Headline statistic, computed alongside perplexity | Sentence-level classifier, paragraph scoring | Most widely cited use of the term |
| [Originality.AI](/wiki/originality_ai) | Used as one feature among many | Supervised classifier trained on AI vs human text | Markets to publishers and SEO professionals |
| [OpenAI](/wiki/openai) AI Text Classifier | Implicitly captured in classifier features | Fine-tuned GPT classifier | Released January 2023, withdrawn 20 July 2023 due to low accuracy[6][7] |
| Turnitin AI detection | Internal statistical features | Sentence classification model | Targets education market |
| Winston AI, Copyleaks, ZeroGPT | Variable, often perplexity plus heuristics | Mixed classifier and statistical approaches | Quality and transparency vary |

Most commercial detectors no longer rely on burstiness alone. The dominant approach today is a supervised classifier trained on large corpora of paired AI and human text, with statistical features such as perplexity and burstiness fed in as inputs alongside richer linguistic signals. The shared task SemEval-2024 Task 8, a multi-domain and multilingual machine-generated text detection benchmark, reported that across all of its subtasks the best-performing systems were built on fine-tuned language models, and the winning English system, Genaios, relied on probabilistic features extracted from a language model rather than on a single hand-crafted statistic like burstiness.[25]

## How do academic zero-shot detectors relate to burstiness?

Alongside commercial products, a research literature has grown up around detecting machine text without training a dedicated classifier. These zero-shot methods are the academic cousins of the perplexity-and-burstiness idea, and they make the link between burstiness and per-token probability explicit.

[DetectGPT](/wiki/detectgpt), introduced by Eric Mitchell and colleagues at Stanford and presented at ICML 2023, is the best known. Its insight is that text sampled from a language model tends to sit in regions of negative curvature of that model's log-probability surface. In plain terms, if you take a passage and make many small paraphrastic perturbations, machine-written text usually drops in log-probability more than human-written text does, because the original already sat near a local maximum. DetectGPT turns that gap into a score using only the log probabilities of the model in question plus perturbations from a generic model such as T5, and it raised detection of fake news from a 20-billion-parameter GPT-NeoX from 0.81 AUROC for the strongest prior zero-shot baseline to 0.95 AUROC.[21]
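The scoring step can be sketched as follows. The example assumes the log-probabilities have already been computed elsewhere (in the real method they come from the model under test, with perturbations generated by a mask-filling model such as T5). The numbers are invented and the function name is hypothetical.

```python
from statistics import mean, stdev


def perturbation_discrepancy(original_log_prob: float,
                             perturbed_log_probs: list[float],
                             normalize: bool = True) -> float:
    """Original log-probability minus the mean over perturbed rewrites.

    With normalize=True the gap is divided by the sample standard deviation
    of the perturbed scores, which requires at least two perturbations.
    """
    gap = original_log_prob - mean(perturbed_log_probs)
    if not normalize:
        return gap
    spread = stdev(perturbed_log_probs)
    return gap / spread if spread > 0 else gap


if __name__ == "__main__":
    # Hypothetical passage whose rewrites all score clearly lower.
    print(perturbation_discrepancy(-100.0, [-110.0, -112.0, -108.0], normalize=False))
    # Hypothetical passage whose rewrites score about the same as the original.
    print(perturbation_discrepancy(-100.0, [-101.0, -99.0, -100.0], normalize=False))
```

A large positive value means the original sits well above its perturbed neighbours, the pattern the method associates with machine-generated text; a value near zero is the pattern it associates with human writing. Any decision threshold would have to be chosen on validation data.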

A later method, Binoculars, presented by Abhimanyu Hans and colleagues at ICML 2024, sharpened the perplexity idea by contrasting two closely related models. It computes the ratio between the perplexity of a passage under an "observer" model and its cross-perplexity under a "performer" model. Machine text tends to look unsurprising to both, so the ratio sits in a narrow range, whereas human text produces a wider spread. Using a fixed pair of open models and no training data, Binoculars reported detection of over 90 percent of ChatGPT samples at a false positive rate of 0.01 percent across many document types, despite never having seen ChatGPT output during development.[22] Methods of this kind share burstiness's core assumption, that machine text is statistically flatter than human text, but they operate at the token level rather than counting sentence lengths.

## How reliable is burstiness, and is it biased?

Burstiness is a useful heuristic, not a reliable test. There are several well-documented failure modes.

False positives on simple, technical, or repetitive writing. Documents written for clarity, such as legal text, instructions, or formal reports, naturally tend toward uniform sentence length and predictable vocabulary, and they often score as low-burstiness even when no model was involved. *Ars Technica* and other outlets have shown that detectors flag the United States Constitution and similar canonical documents as AI-generated.[16] This failure is compounded for perplexity-based detection by a second effect: canonical texts such as the Declaration of Independence appear so often in model training data that the reference model assigns them very low perplexity, which a naive detector reads as a sign of machine authorship.[26]

Bias against non-native English writers. The most cited critique is the 2023 study by Weixin Liang and colleagues at Stanford, *GPT detectors are biased against non-native English writers*, published in *Patterns*. The team ran TOEFL essays written by non-native English speakers through seven widely used commercial detectors. On average 61.3 percent of the non-native essays were flagged as AI-generated, against roughly 5.1 percent of essays written by native speakers, and more than half of the non-native essays were misclassified by the detectors as a group.[4][5] The authors traced the effect to lower lexical and syntactic variety in non-native writing, which produces lower perplexity and flatter burstiness in the same way model output does. Tellingly, when the team used ChatGPT to rewrite the same TOEFL essays "with more sophisticated language," the average false positive rate dropped by about 49 percentage points, confirming that the detectors were keying on writing style rather than true authorship.[4] The pattern persists in later evaluations. A 2025 study by Ahmad Pratama in *PeerJ Computer Science* examined the accuracy-bias trade-off across several tools and found that the most accurate detector it tested, GPTZero, also showed the clearest disparity, with non-native authors of AI-assisted text receiving a median AI-likelihood score of 22.50 percent against 9.50 percent for native authors, exposing roughly one in four non-native writers to over-detection.[24]

Low standalone accuracy. When OpenAI shut down its own AI Text Classifier on 20 July 2023, it cited a true positive rate of only 26 percent for AI-written text and a false positive rate of 9 percent for human writing.[6][7] Follow-up evaluations have reached similar conclusions: no detector is reliable enough to be used as the sole basis for an accusation of academic dishonesty. The same 2025 *PeerJ* study found wide spread between tools, with GPTZero reaching 97.22 percent accuracy at a zero percent false positive rate on its sample while the zero-shot method DetectGPT scored 54.63 percent, which the author described as virtually no better than random guessing.[24]
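What those two rates mean for an individual flagged text depends on how common AI-written text is in the pool being checked. The sketch below applies Bayes' rule to the 26 percent true positive rate and 9 percent false positive rate quoted above. The base rates are assumptions chosen only for illustration, and the function name is hypothetical.

```python
def flag_precision(true_positive_rate: float, false_positive_rate: float,
                   base_rate: float) -> float:
    """Share of flagged texts that are actually AI-written (Bayes' rule)."""
    flagged_ai = true_positive_rate * base_rate
    flagged_human = false_positive_rate * (1.0 - base_rate)
    return flagged_ai / (flagged_ai + flagged_human)


if __name__ == "__main__":
    for base_rate in (0.10, 0.50):  # assumed shares of AI-written submissions
        share = flag_precision(0.26, 0.09, base_rate)
        print(f"base rate {base_rate:.0%}: {share:.1%} of flags are correct")
```

Under the assumed 10 percent base rate, only about 24 percent of flagged texts would actually be AI-written; at an assumed 50 percent base rate the figure rises to about 74 percent.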

Easy evasion. Because burstiness measures variance, a writer can raise it artificially by mixing in short and long sentences, splicing fragments between long ones, and varying word choice. The Liang paper showed that simple prompting strategies, such as asking the model to rewrite the text "with literary flourish," pushed AI output past most detectors.[4] Paraphrasing tools and so-called "humanisers" exploit the same weakness. A 2023 paper by Vinu Sadasivan and colleagues, *Can AI-Generated Text be Reliably Detected?*, made the case both empirically and theoretically. The authors built a recursive paraphrasing attack that broke watermarking schemes, neural classifiers, zero-shot methods, and retrieval-based detectors alike, and they proved a general limit: as machine output grows closer to the distribution of human writing, any detector is forced toward either a high false positive rate or a high miss rate.[23] One widely cited illustration is that DetectGPT, which caught about 70 percent of GPT-2 output unmodified, fell to single-digit detection once the same text was paraphrased.[23]

Sensitivity to length and editing. Burstiness scores are noisier on short passages, and they shift sharply when a human edits a model-generated draft or when a model polishes a human draft. The signal is more reliable on long documents from a single source and less reliable on short or heavily edited ones. The underlying statistic inherits this fragility from its origins: the Goh-Barabasi burstiness parameter is itself biased on short event sequences, which is part of why a few sentences rarely give a stable reading.[13]

These limitations explain why teachers, journals, and platforms have generally moved away from treating any detector score as proof of authorship. The signal is real, but it is statistical and it can be wrong on either side.

## What else is burstiness used for in NLP?

The AI detection use case is the most public, but burstiness shows up in several other corners of [natural language processing](/wiki/natural_language_processing).

Information retrieval and term weighting. Standard TF-IDF schemes implicitly assume that document term counts follow a Poisson-like distribution. The DCM and similar bursty models give better scores when they replace the multinomial assumption inside language models for retrieval.[10]

Topic modelling. Latent Dirichlet allocation and similar models inherit the multinomial assumption and therefore underestimate how often a topical word repeats. Researchers including Gabriel Doyle and Charles Elkan have proposed accounting for burstiness directly inside topic models to improve held-out perplexity.[17]

Named-entity bursts in news streams. In streaming-text settings, the sudden burst of mentions of a name or place is itself a feature used to detect breaking events, trending topics, and emerging entities.

Speech and disfluency analysis. Bursty patterns in pause length, filler words, and repetition appear in spoken-language analysis and have been used in clinical NLP for studies of speech disorders.
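For the news-stream case, a burst can be flagged by comparing each time window's mention count with the recent past. The sketch below is one simple heuristic under stated assumptions, not a method taken from the cited literature: the trailing window size, the threshold, and the floor of 1.0 on the standard deviation are all arbitrary choices made for illustration.

```python
from statistics import mean, pstdev


def find_bursts(counts: list[int], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Indices whose count exceeds trailing mean + threshold * trailing std.

    The trailing standard deviation is floored at 1.0 (an assumption) so that
    a perfectly flat history does not turn every small uptick into a burst.
    """
    bursts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        limit = mean(history) + threshold * max(pstdev(history), 1.0)
        if counts[i] > limit:
            bursts.append(i)
    return bursts


if __name__ == "__main__":
    # Hypothetical mentions of one name per hour.
    mentions = [2, 3, 2, 3, 2, 3, 2, 40, 3, 2]
    print(find_bursts(mentions))
```

In this made-up series only the hour with 40 mentions is flagged. The hours after it are not, because the spike itself inflates the trailing mean and standard deviation.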

## How can you write more bursty text?

Writers who want their drafts to score as more human, whether they are revising AI output or simply trying to liven up flat prose, generally aim for the same handful of moves.

- Vary sentence length deliberately. Drop in a four-word sentence after a long one. Let a fragment land. Then write a longer, more involved sentence that takes its time getting where it is going.
- Use paragraph rhythm. A short paragraph between two long ones changes pace.
- Repeat topical words in clusters when it suits the topic, then drop them. This recreates the natural burstiness of human vocabulary use.
- Allow asides, parentheses, and minor digressions. Models tend to keep every sentence on-topic; people drift.
- Edit at the sentence level rather than at the word level. Substitution-based paraphrasing often preserves the underlying rhythm of the source.

None of this guarantees that a detector will classify the result as human, but it does push the statistical fingerprint closer to typical human writing.

## Explain like I'm 5 (ELI5)

Imagine you are listening to two friends tell you about their day. One friend talks in bursts. They say a lot all at once, then a quick "yeah" or "anyway," then a long story, then a very short joke. The other friend talks in steady, even sentences that all sound about the same length, like they are reading from a list. The bursty friend feels more like a person. The steady one feels a bit like a robot. Burstiness is just a way of measuring that difference in writing instead of speaking. Tools like [GPTZero](/wiki/gptzero) look at a piece of text, measure how much the sentences bounce around, and use that to guess whether a person or an AI wrote it. The catch is that the test is not perfect: some people, especially when they are writing in a language that is not their first, also write in even sentences and get mistaken for robots.

## See also

- [Perplexity](/wiki/perplexity)
- [AI content detectors](/wiki/ai_content_detectors)
- [GPTZero](/wiki/gptzero)
- [Originality.AI](/wiki/originality_ai)
- [DetectGPT](/wiki/detectgpt)
- [Machine-generated text detection](/wiki/machine_generated_text_detection)
- [Large language model](/wiki/large_language_model)
- [Natural language processing](/wiki/natural_language_processing)
- [ChatGPT](/wiki/chatgpt)
- [Hallucination](/wiki/hallucination)

## References

1. Tian, E. "GPTZero: Perplexity and Burstiness, what is it?" GPTZero, 1 March 2023. https://gptzero.me/news/perplexity-and-burstiness-what-is-it/
2. Wikipedia contributors. "GPTZero." Wikipedia. https://en.wikipedia.org/wiki/GPTZero
3. Bowman, E. "A college student created an app that can tell whether AI wrote an essay." NPR, 9 January 2023. https://www.npr.org/2023/01/09/1147549845/gptzero-ai-chatgpt-edward-tian-plagiarism
4. Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., and Zou, J. "GPT detectors are biased against non-native English writers." *Patterns*, vol. 4, no. 7, 2023, article 100779. https://www.cell.com/patterns/fulltext/S2666-3899(23)00130-7
5. Liang, W., et al. Preprint. arXiv:2304.02819, 2023. https://arxiv.org/abs/2304.02819
6. OpenAI. "New AI classifier for indicating AI-written text." 31 January 2023, updated 20 July 2023. https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/
7. Wiggers, K. "OpenAI scuttles AI-written text detector over 'low rate of accuracy.'" TechCrunch, 25 July 2023. https://techcrunch.com/2023/07/25/openai-scuttles-ai-written-text-detector-over-low-rate-of-accuracy/
8. Church, K. W., and Gale, W. A. "Poisson mixtures." *Natural Language Engineering*, vol. 1, no. 2, 1995, pp. 163-190.
9. Church, K. W., and Gale, W. A. "Inverse Document Frequency (IDF): A Measure of Deviations from Poisson." In *Natural Language Processing Using Very Large Corpora*, 1995. https://aclanthology.org/W95-0110.pdf
10. Madsen, R. E., Kauchak, D., and Elkan, C. "Modeling Word Burstiness Using the Dirichlet Distribution." *Proceedings of the 22nd International Conference on Machine Learning (ICML)*, 2005. https://icml.cc/Conferences/2005/proceedings/papers/069_WordBursting_MadsenEtAl.pdf
11. Barabasi, A.-L. "The origin of bursts and heavy tails in human dynamics." *Nature*, vol. 435, 2005, pp. 207-211. https://www.nature.com/articles/nature03459
12. Goh, K.-I., and Barabasi, A.-L. "Burstiness and memory in complex systems." *Europhysics Letters*, vol. 81, no. 4, 2008, article 48002. https://arxiv.org/abs/physics/0610233
13. Kim, E.-K., and Jo, H.-H. "Measuring burstiness for finite event sequences." *Physical Review E*, vol. 94, 2016, article 032311. https://link.aps.org/doi/10.1103/PhysRevE.94.032311
14. Wikipedia contributors. "Burstiness." Wikipedia. https://en.wikipedia.org/wiki/Burstiness
15. Wikipedia contributors. "Index of dispersion." Wikipedia. https://en.wikipedia.org/wiki/Index_of_dispersion
16. Edwards, B. "Why AI detectors think the US Constitution was written by AI." Ars Technica, 14 July 2023. https://arstechnica.com/information-technology/2023/07/why-ai-detectors-think-the-us-constitution-was-written-by-ai/
17. Doyle, G., and Elkan, C. "Accounting for Burstiness in Topic Models." *Proceedings of the 26th International Conference on Machine Learning*, 2009. https://cseweb.ucsd.edu/~elkan/TopicBurstiness.pdf
18. GPTZero. "How Do AI Detectors Work? Techniques, Limitations & More." GPTZero. https://gptzero.me/news/how-ai-detectors-work/
19. Wiggers, K. "GPTZero's founders, still in their 20s, have a profitable AI detection startup, millions in the bank and a new $10M Series A." TechCrunch, 13 June 2024. https://techcrunch.com/2024/06/13/gptzero-profitable-ai-detection-startup-10m-series-a/
20. Altmann, E. G., Pierrehumbert, J. B., and Motter, A. E. "Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words." *PLoS ONE*, vol. 4, no. 11, 2009, article e7678. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0007678
21. Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., and Finn, C. "DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature." *Proceedings of the 40th International Conference on Machine Learning (ICML)*, 2023. arXiv:2301.11305. https://arxiv.org/abs/2301.11305
22. Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., Geiping, J., and Goldstein, T. "Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text." *Proceedings of the 41st International Conference on Machine Learning (ICML)*, 2024. arXiv:2401.12070. https://arxiv.org/abs/2401.12070
23. Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., and Feizi, S. "Can AI-Generated Text be Reliably Detected?" arXiv:2303.11156, 2023. https://arxiv.org/abs/2303.11156
24. Pratama, A. R. "The accuracy-bias trade-offs in AI text detection tools and their impact on fairness in scholarly publication." *PeerJ Computer Science*, 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12453642/
25. Wang, Y., et al. "SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection." *Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)*, 2024. arXiv:2404.14183. https://arxiv.org/abs/2404.14183
26. Emi, B. "Why Perplexity and Burstiness Fail to Detect AI." Pangram Labs. https://www.pangram.com/blog/why-perplexity-and-burstiness-fail-to-detect-ai