
Why AI Text Detection Is Hard: Inside the Arms Race

Detection and generation are locked in a cat-and-mouse game. Each new model release closes the statistical gap that detectors rely on — and each detection improvement is answered by a new humaniser tool. Here's what's actually going on under the hood.

2026-04-17 · Plagiarism Detector Team

The Statistical Basis of Detection

Every AI text detector is ultimately a statistical discriminator — it looks at features of text (token probabilities, perplexity, burstiness, syntactic regularity) and tries to find signals that distinguish machine-generated from human-written content. The Binoculars method (ICML 2024) uses a ratio of cross-perplexity between two language models as its signal. The ModernBERT supervised approach learns the signal directly from labeled examples.
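As a rough illustration of the ratio idea — not the exact Binoculars formulation, and with invented per-token log-probabilities standing in for real model outputs — the core computation looks something like this:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def binoculars_style_score(observer_logprobs, cross_logprobs):
    """Toy cross-perplexity ratio: one model's perplexity on the text,
    divided by a cross term where it scores another model's predictions.
    In the Binoculars setup, lower scores suggest machine generation."""
    return perplexity(observer_logprobs) / perplexity(cross_logprobs)

# Hypothetical per-token log-probs for a short passage
observer = [-1.2, -0.8, -1.5, -0.9]   # observer model scoring the text
cross    = [-1.0, -0.7, -1.3, -0.8]   # cross-scoring term
print(round(binoculars_style_score(observer, cross), 3))  # prints 1.162
```

The point of the ratio is normalisation: raw perplexity varies wildly with topic and vocabulary, but dividing by a cross term cancels much of that variation, leaving a signal that tracks how "expected" the text is to a language model.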

Both approaches share a fundamental vulnerability: the signals they rely on are side-effects of how models generate text, not fundamental features of machine-written-ness. As generators improve, those side-effects shrink. A model trained to write more like a human will — by definition — be harder to detect.

This is not a research failure. It's a structural fact about the problem. Detection operates on a moving target: every major LLM release narrows the gap, and every humaniser tool explicitly trains against detector outputs. The question is not ‘can we achieve 100% detection forever?' (we cannot) but ‘can we stay ahead of the current generation long enough to be useful in practice?'

What the Sword Does — Generation Improves

Three generation trends make detection harder. Size: larger models produce statistically more diverse text because they have richer internal distributions; a 70-billion-parameter model covers a wider range of human-like output than a 7-billion-parameter one. Instruction-tuning: RLHF and constitutional methods teach models to avoid the repetitive, hedging, bland patterns that made GPT-3 easy to spot. Temperature and sampling: chat interfaces now default to nucleus (top-p) sampling at non-zero temperature, which breaks some of the low-variance patterns classical detectors used as anchors.
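To make the sampling point concrete, here is a minimal nucleus-sampling sketch. The token distribution is invented for illustration; real implementations operate on logits over a full vocabulary.

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample from that set.
    probs: dict mapping token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        total += pr
        if total >= p:
            break
    # Sample proportionally from the retained mass
    r = rng.random() * total
    for tok, pr in kept:
        r -= pr
        if r <= 0:
            return tok
    return kept[-1][0]

dist = {"the": 0.5, "a": 0.3, "banana": 0.15, "zzz": 0.05}
# With p=0.8 only "the" and "a" survive the cutoff, so the tail tokens
# can never be drawn — but the choice between the survivors stays random.
print(nucleus_sample(dist, p=0.8))
```

That residual randomness is exactly what erodes the low-variance fingerprints older detectors keyed on: two generations from the same prompt no longer look statistically identical.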

GPT-5, Claude 4.5, and Gemini 2.5 are all noticeably harder to detect than their predecessors. Our internal validation confirms this: each model generation drops our AUC on that family by 5–10 percentage points compared to the previous generation. See our accuracy benchmark for per-model numbers.

Humaniser tools — Undetectable AI, StealthWriter, Humanbeing, and a growing list — are the explicit adversaries. They take AI output and paraphrase, rewrite, or style-transfer it specifically to defeat detectors. They are trained against public detectors (including ours, though we never share our model weights) and they get measurably better with each update.

What the Shield Does — Detection Responds

Detectors have three responses to the generation arms race. Ensembling: combining multiple detection signals so that any single evasion tactic is insufficient. Our ensemble of zero-shot Binoculars with supervised ModernBERT exploits this: a humaniser that defeats one component often fails against the other, and the ensemble score captures both.

Continuous retraining: we add samples from every major new generator release within 4 weeks of launch. If GPT-6 drops tomorrow, our training corpus will include it within about a month. This is expensive — compute, annotation, re-validation — but it is the only way to keep detection current. Detectors that retrain annually or less are effectively museum pieces within a year.

Adversarial training: we deliberately train on humanised AI samples and paraphrased outputs, teaching the model to see past surface-level style transfer. This raises the floor of what a humaniser must do to evade us, which in turn slows the arms race.

Inside the Evasion Landscape

How do humaniser tools actually work? They fall into three broad categories. Paraphrasing: rewrite the text word-by-word or sentence-by-sentence using a secondary LLM. This is effective against naïve detectors that rely on exact token sequences, and only modestly effective against statistical methods. Style transfer: transform the text to mimic a specific author or register. This is more effective — our detector's AUC drops by ~8 points on style-transferred AI text.

Hybrid human-AI editing: the author writes a draft, runs it through an LLM for polish, then manually edits the polished version. This is the hardest case — legitimately collaborative work that blends human and machine signals at the sentence level. No detector, including ours, can reliably resolve these without editing-history metadata the detector cannot see.

A useful mental model: a humaniser is not a detector-breaker, it's a cost multiplier for the evader. It takes time, sometimes money, and always adds risk of introducing errors. Most academic cheating attempts do not use humanisers because the friction outweighs the benefit. Where humanisers dominate is professional content farming and AI-generated SEO spam — use cases where throughput matters and quality control is weak.

See how our detector scores right now

Paste any document and watch the per-sentence verdict in real time. The ensemble logic described above runs on your text in under 30 seconds.

Why Ensembling Matters More Than Any Single Metric

A single-signal detector has a single failure mode. If you rely only on perplexity, a paraphrased output with altered token probabilities defeats you. If you rely only on a supervised classifier, out-of-distribution text (a new model family, a new writing domain) defeats you. An ensemble averages the weaknesses: the paraphrase that defeats perplexity probably still trips the supervised head, and vice versa.

Our production detector is explicitly ensembled: 35% Binoculars (zero-shot, model-agnostic, robust to out-of-distribution) + 65% ModernBERT (supervised, domain-specific, high precision on in-distribution text). The weights were chosen empirically — ensemble AUC was maximised when ModernBERT dominated but Binoculars retained veto power on edge cases.
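A minimal sketch of the weighted combination described above. The input probabilities are hypothetical, and the real production combiner may apply calibration steps not shown here:

```python
def ensemble_score(binoculars_ai_prob, modernbert_ai_prob,
                   w_binoculars=0.35, w_modernbert=0.65):
    """Weighted average of two detector probabilities (both in [0, 1]).
    Weights mirror the 35/65 split described in the article."""
    return w_binoculars * binoculars_ai_prob + w_modernbert * modernbert_ai_prob

# A humanised sample that fools the supervised head (0.3) but still
# trips the zero-shot component (0.9):
print(round(ensemble_score(0.9, 0.3), 3))  # prints 0.51
```

Note how the zero-shot component's 0.9 drags the combined score back toward the suspicious range even though the supervised head was fooled — that "veto power on edge cases" is the behaviour the weights were tuned to preserve.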

The consequence: a humaniser tool now has to defeat two substantially different detection architectures simultaneously to evade our verdict. Public humanisers are typically trained against a single target detector, which means they often succeed against that specific detector but fail against an ensemble. This is detection's primary structural advantage in the current arms race.

Realistic Expectations for the Next 12 Months

What should we expect through 2026–2027? GPT-6 and Claude 5 are likely mid-year releases; both will further narrow the gap. Open-weights models — Llama 4, Qwen 4 — will continue to commoditise high-quality generation and make humanisers cheaper to run at scale. Detection AUC on frontier models will probably drop into the 0.80–0.90 band for the first year after release before retraining corrects it.

On the defence side: multi-modal signals (typing dynamics, edit history, authorship verification against a known corpus) are likely to matter more than pure text-based detection within 24 months. Our text-only detector will remain the first filter but will increasingly be a voting member in a richer evidence stack.

The honest bottom line: pure text-based detection will never reach 100%. It will plateau somewhere around 90–95% AUC on in-distribution text and 75–85% on frontier models. If your workflow requires certainty, you need evidence beyond the score. If your workflow requires a strong signal to prioritise human review, text-based detection remains useful and measurably better than doing nothing.

Frequently Asked Questions

If AI detection will never be perfect, is it worth using at all?
Yes — the question is not ‘is it perfect’ but ‘is it better than not screening at all.’ A 90% AUC detector on your workload is a massive signal-to-noise improvement. The people most vocal about detector limitations are often those trying to defeat them; that's not an argument for abandoning the tool.
Can watermarking replace statistical detection?
Watermarking embeds a hidden statistical signature in generated text that a detector can later retrieve. It works when generators cooperate (OpenAI has deployed it experimentally) but fails entirely on open-weights models, which generate without watermarks. Statistical detection will remain necessary for the foreseeable future because it works even when the generator refuses to cooperate.
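As a toy illustration of how a green-list watermark check works (in the spirit of published schemes such as Kirchenbauer et al.; the key, hash, and even token split here are invented for demonstration):

```python
import hashlib

def green_fraction(tokens, key="demo-key"):
    """Toy green-list watermark check: each token's 'green' status is a
    keyed pseudo-random function of the previous token. A watermarking
    generator oversamples green tokens, so an unusually high green
    fraction suggests watermarked text. Purely illustrative."""
    def is_green(prev, tok):
        digest = hashlib.sha256(f"{key}|{prev}|{tok}".encode()).digest()
        return digest[0] % 2 == 0  # ~half of all tokens are green per step

    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    return sum(is_green(prev, tok) for prev, tok in pairs) / len(pairs)

frac = green_fraction("the quick brown fox jumps over the lazy dog".split())
print(f"green fraction: {frac:.2f}")  # ~0.5 expected for unwatermarked text
```

The check only works because the detector shares the key with the generator — which is precisely why it fails on open-weights models: anyone running the model locally simply doesn't embed the signature.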
What's the single hardest thing to detect today?
Hybrid human-AI editing — an AI-drafted, human-polished text fragment at the sentence level. No current detector reliably resolves these without access to edit-history metadata. If that's your use case, text-based detection is the wrong tool — you need workflow instrumentation.
How often does a new generator actually reduce your AUC?
Every major release, roughly every 3–6 months, reduces AUC on that family by 5–10 percentage points until we retrain. Retraining takes about 4 weeks after we have sufficient samples. The practical result: there is always a 2–8 week window after a new launch where our AUC on that family is lower than average. We disclose these gaps on the benchmark page.
Does ensembling help against humanisers?
Substantially — it's the primary structural defence we have. Humanisers train against a target detector. When that target is an ensemble of two architecturally different detectors, the humaniser has to defeat both simultaneously, which is meaningfully harder than defeating either alone. This is why we use an ensemble in production even when a single component would be cheaper to run.

This article describes structural properties of AI text detection. Specific numbers refer to our internal validation and may not generalise. We update this page as new research and generator releases warrant.