Detection and generation are locked in a cat-and-mouse game. Each new model release closes the statistical gap that detectors rely on, and each detection improvement is answered by a new humaniser tool. Here's what's actually going on under the hood.
Every AI text detector is ultimately a statistical discriminator — it looks at features of text (token probabilities, perplexity, burstiness, syntactic regularity) and tries to find signals that distinguish machine-generated from human-written content. The Binoculars method (ICML 2024) uses a ratio of cross-perplexity between two language models as its signal. The ModernBERT supervised approach learns the signal directly from labeled examples.
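To make the zero-shot signal concrete, here is a simplified, toy sketch of a Binoculars-style score: the ratio of one model's log-perplexity on the text to the cross-perplexity of the same tokens scored under a second model. A real implementation would compute these log-probabilities from two actual language models; the inputs here are just illustrative numbers, and the exact normalisation in the published method differs in detail.

```python
import math

def perplexity(logprobs):
    # Perplexity from per-token log-probabilities (natural log):
    # exp of the mean negative log-likelihood.
    return math.exp(-sum(logprobs) / len(logprobs))

def binoculars_style_score(observer_logprobs, cross_logprobs):
    # Simplified Binoculars-style ratio: the observer model's
    # log-perplexity divided by the cross-(log-)perplexity of the same
    # tokens under a second model. Intuitively, machine text looks
    # "unsurprising" to both models in a correlated way, pushing the
    # ratio down; human text tends to push it up.
    return math.log(perplexity(observer_logprobs)) / math.log(perplexity(cross_logprobs))
```

The point of the ratio is that raw perplexity alone confuses "easy text" with "machine text"; dividing by a second model's view factors out how predictable the content is in general.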
Both approaches share a fundamental vulnerability: the signals they rely on are side-effects of how models generate text, not intrinsic properties of machine authorship. As generators improve, those side-effects shrink. A model trained to write more like a human will, by definition, be harder to detect.
This is not a research failure. It's a structural fact about the problem. Detection operates on a moving target: every major LLM release narrows the gap, and every humaniser tool explicitly trains against detector outputs. The question is not ‘can we achieve 100% detection forever’ — it cannot be done — but ‘can we stay ahead of the current generation long enough to be useful in practice?’
Three generation trends make detection harder. Size: larger models produce statistically more diverse text because they have richer internal distributions; a 70-billion-parameter model covers a wider range of human-like output than a 7-billion-parameter one. Instruction-tuning: RLHF and constitutional methods teach models to avoid the repetitive, hedging, bland patterns that made GPT-3 easy to spot. Temperature and sampling: chat interfaces have shifted toward nucleus (top-p) sampling and higher temperatures, which break some of the low-variance patterns classical detectors used as anchors.
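The sampling trend is easy to see in code. Below is a minimal top-p (nucleus) sampler over a toy token distribution: keep the smallest set of highest-probability tokens whose cumulative mass reaches p, then sample within that set. The cutoff logic is the standard one; the token set and p value are made up for illustration.

```python
import random

def nucleus_sample(probs, p=0.9, rng=None):
    # Top-p (nucleus) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches p, then sample proportionally
    # within that set. `probs` maps token -> probability.
    rng = rng or random.Random()
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        total += pr
        if total >= p:
            break  # nucleus is complete; discard the long tail
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Because the tail is discarded but the head is sampled rather than argmax-ed, output varies run to run in a way that greedy decoding never did, which is exactly what erodes the low-variance anchors older detectors leaned on.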
GPT-5, Claude 4.5, and Gemini 2.5 are all noticeably harder to detect than their predecessors. Our internal validation confirms this: each model generation drops our AUC on that family by 5–10 percentage points compared to the previous generation. See our accuracy benchmark for per-model numbers.
Humaniser tools — Undetectable AI, StealthWriter, Humanbeing, and a growing list — are the explicit adversaries. They take AI output and paraphrase, rewrite, or style-transfer it specifically to defeat detectors. They are trained against public detectors (including ours, though we never share our model weights) and they get measurably better with each update.
Detectors have three responses to the generation arms race. Ensembling: combining multiple detection signals so that any single evasion tactic is insufficient. Our ensemble of zero-shot Binoculars with supervised ModernBERT exploits this: a humaniser that defeats one component often fails against the other, and the ensemble score captures both.
Continuous retraining: we add samples from every major new generator release within 4 weeks of launch. If GPT-6 drops tomorrow, our training corpus will include it by mid-next-month. This is expensive — compute, annotation, re-validation — but it is the only way to keep detection current. Detectors that retrain annually or less are effectively museum pieces within a year.
Adversarial training: we deliberately train on humanised AI samples and paraphrased outputs, teaching the model to see past surface-level style transfer. This raises the floor of what a humaniser must do to evade us, which in turn slows the arms race.
How do humaniser tools actually work? Three broad categories. Paraphrasing: rewrite the text word-by-word or sentence-by-sentence using a secondary LLM. Effective against naïve detectors that rely on exact token sequences; modestly effective against statistical methods. Style transfer: transform the text to mimic a specific author or register. More effective — our detector's AUC drops by ~8 points on style-transferred AI text.
Hybrid human-AI editing: the author writes a draft, runs it through an LLM for polish, then manually edits the polished version. This is the hardest case — legitimately collaborative work that blends human and machine signals at the sentence level. No detector, including ours, can reliably resolve these without editing-history metadata the detector cannot see.
A useful mental model: a humaniser is not a detector-breaker, it's a cost multiplier for the evader. It takes time, sometimes money, and always adds risk of introducing errors. Most academic cheating attempts do not use humanisers because the friction outweighs the benefit. Where humanisers dominate is professional content farming and AI-generated SEO spam — use cases where throughput matters and quality control is weak.
Paste any document and watch the per-sentence verdict in real time. The ensemble logic described above runs on your text in under 30 seconds.
A single-signal detector has a single failure mode. If you rely only on perplexity, a paraphrased output with altered token probabilities defeats you. If you rely only on a supervised classifier, out-of-distribution text (a new model family, a new writing domain) defeats you. An ensemble averages the weaknesses: the paraphrase that defeats perplexity probably still trips the supervised head, and vice versa.
Our production detector is explicitly ensembled: 35% Binoculars (zero-shot, model-agnostic, robust to out-of-distribution) + 65% ModernBERT (supervised, domain-specific, high precision on in-distribution text). The weights were chosen empirically — ensemble AUC was maximised when ModernBERT dominated but Binoculars retained veto power on edge cases.
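The 35/65 split can be sketched as a simple weighted combination. The assumption that each component score is normalised to [0, 1] (higher meaning ‘more likely AI-generated’), and the 0.5 decision threshold, are illustrative choices for this sketch, not our production calibration.

```python
def ensemble_verdict(binoculars_score, modernbert_prob,
                     w_binoculars=0.35, w_modernbert=0.65,
                     threshold=0.5):
    # Weighted ensemble of two detector signals, both assumed to be
    # normalised to [0, 1] where higher means "more likely AI".
    # The 35/65 weights mirror the split described in the text; the
    # normalisation and threshold here are illustrative only.
    score = w_binoculars * binoculars_score + w_modernbert * modernbert_prob
    return score, ("ai" if score >= threshold else "human")
```

Note how the weighting realises the ‘veto power’ idea: ModernBERT dominates the score, but a strongly divergent Binoculars value can still pull a borderline case across the threshold.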
The consequence: a humaniser tool now has to defeat two substantially different detection architectures simultaneously to evade our verdict. Public humanisers are typically trained against a single target detector, which means they often succeed against that specific detector but fail against an ensemble. This is detection's primary structural advantage in the current arms race.
What should we expect through 2026–2027? GPT-6 and Claude 5 are likely mid-year releases; both will further narrow the gap. Open-weights models — Llama 4, Qwen 4 — will continue to commoditise high-quality generation and make humanisers cheaper to run at scale. Detection AUC on frontier models will probably drop into the 0.80–0.90 band for the first year after release before retraining corrects it.
On the defence side: multi-modal signals (typing dynamics, edit history, authorship verification against a known corpus) are likely to matter more than pure text-based detection within 24 months. Our text-only detector will remain the first filter but will increasingly be a voting member in a richer evidence stack.
The honest bottom line: pure text-based detection will never reach 100%. It will plateau somewhere around 90–95% AUC on in-distribution text and 75–85% on frontier models. If your workflow requires certainty, you need evidence beyond the score. If your workflow requires a strong signal to prioritise human review, text-based detection remains useful and measurably better than doing nothing.
This article describes structural properties of AI text detection. Specific numbers refer to our internal validation and may not generalise. We update this page as new research and generator releases warrant.