
How AI Detection Actually Works (Technical Explainer)

9 min read

AI detectors have become gatekeepers for academic submissions, content platforms, and publishing workflows. But most people who use them — or worry about them — have no idea how they actually work. This is a plain-English breakdown of the statistical methods and machine learning models behind AI text detection, their strengths, and their significant limitations.

Perplexity: How Surprised Is the Model?

Perplexity is the foundational metric in AI detection. It measures how "surprised" a language model is by a piece of text. Technically, it is the exponential of the average negative log-likelihood of each token given the preceding context. In simpler terms: if a model can easily predict every next word in a passage, that passage has low perplexity.
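The definition above can be sketched in a few lines. This is a minimal illustration, not a real detector: the log-probabilities here are invented numbers standing in for what a reference language model would assign to each token.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token.

    token_logprobs: natural-log probability the model assigned to each
    actual token given its preceding context.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A passage the model finds easy to predict (high probabilities -> low perplexity)
predictable = [math.log(p) for p in [0.9, 0.8, 0.85, 0.9]]
# A passage with surprising word choices (low probabilities -> high perplexity)
surprising = [math.log(p) for p in [0.3, 0.05, 0.2, 0.1]]

print(perplexity(predictable))  # ~1.16
print(perplexity(surprising))   # ~7.6
```

A perplexity of 1.16 means the model was barely surprised by anything; 7.6 means it was, on average, choosing among several plausible continuations at each step.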

AI-generated text tends to have low perplexity because it was produced by a model that, by construction, favors high-probability words at each step. Human writing has higher perplexity because humans make surprising word choices — unusual metaphors, sentence fragments, domain-specific jargon, colloquialisms, and deliberate rule-breaking.

Early detectors like GPTZero were built primarily on perplexity scoring. The logic is straightforward: run the text through a reference model, measure the average perplexity, and flag anything below a threshold as likely AI-generated. This works reasonably well on long passages of unedited AI output. It breaks down quickly on shorter texts, edited texts, and text from certain domains.

Burstiness: The Rhythm Test

Burstiness measures the variation in perplexity across a text. Human writing is "bursty" — some sentences are highly predictable (simple factual statements, common phrases) while others are surprising (creative descriptions, unexpected transitions, personal anecdotes). The perplexity jumps around.

AI text has low burstiness. The perplexity stays relatively constant from sentence to sentence because the model maintains a consistent level of "safe" word selection throughout. It does not have boring sentences and interesting sentences. It has uniformly adequate sentences.

Combining perplexity and burstiness gives detectors a two-dimensional signal. Low perplexity plus low burstiness is a strong indicator of AI origin. High perplexity plus high burstiness suggests human authorship. Mixed signals — low perplexity but high burstiness, or vice versa — fall into a gray zone where detectors are unreliable.
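The two-dimensional rule above can be sketched as follows. The per-sentence perplexities and the thresholds are made up for illustration — real detectors calibrate both against a reference model and a large corpus.

```python
import statistics

# Hypothetical per-sentence perplexities for two passages
human = [12.0, 45.0, 8.0, 60.0, 20.0]   # perplexity jumps around
ai    = [14.0, 15.0, 13.5, 14.5, 15.5]  # uniformly "safe" sentences

def classify(sentence_ppls, ppl_thresh=20.0, burst_thresh=5.0):
    """Low average perplexity AND low burstiness (std dev across
    sentences) points to AI; high/high points to human; mixed
    signals land in the gray zone."""
    avg = statistics.mean(sentence_ppls)
    burst = statistics.stdev(sentence_ppls)
    if avg < ppl_thresh and burst < burst_thresh:
        return "likely AI"
    if avg >= ppl_thresh and burst >= burst_thresh:
        return "likely human"
    return "uncertain"

print(classify(human))  # likely human
print(classify(ai))     # likely AI
```

Note how the gray zone falls out of the rule naturally: a passage with low average perplexity but wildly varying sentences satisfies neither branch.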

Statistical Watermarking

Some AI providers embed invisible statistical watermarks in their output. The idea is simple: when generating text, the model slightly biases its token selection toward a specific pattern that is imperceptible to readers but detectable by the provider's verification tool.

For example, a watermarking scheme might partition the vocabulary into "green" and "red" lists at each position and nudge the model to prefer green-list tokens. A detector checks whether the text contains a statistically improbable number of green-list tokens. If so, it was likely generated by that specific model.
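The detection side of a green-list scheme is a one-proportion z-test: if the green list covers a fraction gamma of the vocabulary, unwatermarked text should land on it about gamma of the time, and a large excess is statistical evidence of the watermark. This is a sketch of that test, not any provider's actual verifier.

```python
import math

def watermark_z_score(green_count, total_tokens, gamma=0.5):
    """Z-score for the count of green-list tokens versus the count
    expected by chance. gamma is the fraction of the vocabulary
    placed on the green list at each position."""
    expected = gamma * total_tokens
    variance = total_tokens * gamma * (1 - gamma)
    return (green_count - expected) / math.sqrt(variance)

# 200-token text in which 130 tokens landed on the green list
print(watermark_z_score(130, 200))  # ~4.24: far above chance
```

A z-score above 4 would occur by chance in well under one in ten thousand unwatermarked texts, which is why watermarking is so reliable when it is present — and why paraphrasing, which scrambles token choices, destroys the signal.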

Watermarking is the most reliable detection method when it works, but it has significant limitations. It only detects text from models that implement it. It breaks when text is paraphrased, translated, or even moderately edited. And it requires cooperation from AI providers, which not all are willing to provide.

Classifier Models

The most widely used commercial detectors — GPTZero, Originality.ai, Turnitin's AI detection, Copyleaks — use trained classifier models. These are neural networks (typically fine-tuned transformers) trained on large datasets of human-written and AI-generated text to learn the difference.

The classifier approach has an advantage over pure statistical methods: it can learn subtle patterns that are hard to capture in a single metric. It can pick up on things like the distribution of rare words, the ratio of content words to function words, paragraph-level structure, and stylistic consistency.
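A few of those signals are simple enough to compute by hand. The features below are illustrative stand-ins for the kinds of patterns a classifier learns; commercial detectors work from transformer embeddings and pick up far subtler regularities than this.

```python
# A tiny, hypothetical function-word list for illustration
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "it", "on", "for"}

def stylometric_features(text):
    """Hand-crafted stylometric signals: function-word ratio,
    vocabulary diversity, and average word length."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    n = len(words)
    return {
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / n,
        "type_token_ratio": len(set(words)) / n,  # unique words / total
        "avg_word_length": sum(len(w) for w in words) / n,
    }

feats = stylometric_features("The model is trained on the text of the corpus.")
print(feats)
```

A trained classifier takes hundreds of features like these (or learned equivalents) and maps them to a probability, which is where both the power and the bias come from: whatever correlates with AI text in the training data gets weighted, whether or not it is actually evidence of AI authorship.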

But classifiers also inherit all the biases of their training data. They tend to flag certain kinds of writing as AI even when a human wrote them — particularly formal academic writing, prose by non-native English speakers, and technical documentation. This leads to the false positive problem.

The False Positive Problem

False positives are the Achilles' heel of AI detection. Every major detector has been documented flagging human-written text as AI-generated. These are not edge cases:

  • The US Constitution has been flagged as AI-generated by multiple detectors.
  • Non-native English speakers are disproportionately flagged because their writing patterns (simpler vocabulary, more regular grammar) resemble AI output.
  • Formal academic writing, with its structured argumentation and careful hedging, triggers the same signals as RLHF-trained model output.
  • Technical writing about well-documented topics tends to converge on standard phrasing regardless of whether a human or AI wrote it.

The false positive rate varies by detector and by text type, but independent studies have found rates ranging from 1% to over 15% depending on the context. For any individual piece of text, a detector's confidence score should be treated as a probability estimate, not a verdict.
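Bayes' rule makes the "probability estimate, not a verdict" point concrete. The rates below are hypothetical but in the range the section cites: even a detector with a 5% false positive rate produces mostly wrong accusations when most of the texts it screens are actually human-written.

```python
def posterior_ai_given_flag(prior_ai, tpr, fpr):
    """P(AI | flagged) via Bayes' rule, given the base rate of AI
    text (prior_ai), the detector's true positive rate (tpr), and
    its false positive rate (fpr)."""
    p_flag = tpr * prior_ai + fpr * (1 - prior_ai)
    return (tpr * prior_ai) / p_flag

# Detector with 95% TPR and 5% FPR, screening a pool where
# only 10% of submissions are actually AI-generated:
print(posterior_ai_given_flag(0.10, 0.95, 0.05))  # ~0.68
```

Under these assumptions, roughly one in three flagged texts is human-written — which is why a flag should trigger a conversation, not a penalty.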

Why Detectors Keep Losing the Arms Race

AI detection is fundamentally an adversarial problem. As detectors improve, the tools used to evade them improve too. But there is a deeper issue: as AI models themselves improve, their output becomes harder to distinguish from human text. Better models produce more varied, more natural-sounding prose with higher perplexity and burstiness.

This means detection accuracy is moving in the wrong direction over time. The models that were easy to detect in 2023 (GPT-3.5, early Claude) produced much more uniform text than current models. Each generation of models narrows the statistical gap between AI and human writing.

Why Better Writing Beats Gaming Detectors

There are two approaches to reducing AI detection scores. The first is to game specific detectors — introducing targeted noise, misspellings, or Unicode tricks that exploit weaknesses in particular detection algorithms. This works temporarily and breaks when the detector updates.

The second approach is to actually improve the writing. Text with genuine variation in sentence length, vocabulary, and structure; text with a clear voice and perspective; text with specific examples and natural imperfections — this text scores lower on detectors not because it exploits a bug, but because it genuinely resembles human writing. It has high perplexity and high burstiness because it was written (or rewritten) with the same unpredictability that characterizes human prose.

This is the approach Metric37 takes. Rather than adding noise to fool a specific detector, the multi-LLM pipeline produces text that is statistically closer to human writing across all the metrics detectors measure. The eval gate verifies this before returning the result. The output does not just evade detectors — it reads better, which is the whole point.

What This Means for You

If you are worried about AI detection, the most durable strategy is not to find a tool that games today's detectors. It is to produce text that is genuinely good enough that the question of provenance becomes irrelevant. Detectors will keep changing. Good writing is good writing regardless of how it was produced.

Ready to humanize your AI content?

Paste your AI draft and get prose that sounds like you wrote it. 5,000 words free.

Start free