Not all AI text is equally detectable. Here are the results of our per-generator benchmark — which model families our detector catches with near-perfect accuracy, which ones it struggles with, and what that tells you about choosing a detection workflow.
[LEADERBOARD TABLE — fill with real per-model AUC numbers from benchmark before publishing]
Ordered from easiest to hardest to detect on our validation set. The spread is wide — AUC on some model families exceeds 0.99 while others drop into the 0.80s. Detection difficulty correlates with model size, instruction-tuning sophistication, and output variance.
For the full per-generator breakdown methodology, see our accuracy benchmark page. This article summarises the practical implications of that data for users choosing which detector to trust and which model to use.
GPT-3.5 is the easiest modern model to detect — AUC [AUC: ?] on our set. Legacy generation artefacts (repetition, hedging, bland register) remain clearly present. GPT-4 drops to AUC [AUC: ?], GPT-4o to [AUC: ?], reflecting progressively better calibration. GPT-5.x is the hardest of the family — AUC [AUC: ?] — because the instruction-tuning team explicitly targeted detection-artefact removal.
Practical implication: academic workflows concerned about GPT-3.5-era cheating can rely heavily on detection alone. Workflows concerned about GPT-5 need to pair detection with contextual evidence, as described in our teacher workflow guide.
Temperature settings matter. Low-temperature outputs (t≤0.5) are easier to detect because they concentrate probability mass on a narrower vocabulary. Most chat interfaces default to t≈0.7, putting text in a moderately detectable zone. Adversarial users deliberately raise temperature or use diversity-promoting decoding to widen the output distribution and evade detection; our ensemble partially corrects for this, but not completely.
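To see why low temperature concentrates probability mass, here is a minimal toy sketch (the logits are invented for illustration, not taken from any real model): temperature-scaling a softmax and measuring the entropy of the resulting next-token distribution. Lower entropy means a narrower, more predictable vocabulary, which is exactly the statistical signal detectors exploit.

```python
import math

def softmax_entropy(logits, temperature):
    """Entropy (in bits) of a temperature-scaled softmax distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy next-token logits. Raising temperature flattens the distribution,
# raising entropy and pushing output statistics closer to human text.
logits = [4.0, 2.0, 1.0, 0.5, 0.1]
for t in (0.5, 0.7, 1.2):
    print(f"t={t}: entropy = {softmax_entropy(logits, t):.2f} bits")
```

Entropy rises monotonically with temperature for fixed logits, which is the mechanism behind the detectability gap between t≤0.5 and adversarially high settings.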
Claude 3 Opus: AUC [AUC: ?]. Claude 3.5 Sonnet: [AUC: ?]. Claude 4 Opus: [AUC: ?]. Claude 4.5 Sonnet: [AUC: ?]. The Claude family consistently produces less repetitive, more stylistically varied text than same-generation GPT models, which makes it harder to detect via statistical methods.
Claude's constitutional-AI training specifically targets the “machine tells” that our supervised classifier learns from — hedging patterns, overuse of specific connectives, predictable paragraph structure. This is a direct adversarial relationship: the generator is trained against features the detector relies on.
Claude 4.5 Sonnet and GPT-5.x are close in difficulty. Their score distributions overlap the human baseline the most in our validation data. If your workflow targets either of these models, expect reduced recall at the default threshold and consider lowering the threshold to its F1-optimal value for high-sensitivity screening.
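Finding an F1-optimal threshold is a standard exercise; the sketch below uses scikit-learn on invented labels and scores (the numbers are placeholders, not our validation data) to show the mechanics of sweeping thresholds and picking the one that maximises F1.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical detector scores: label 1 = AI-generated, 0 = human.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_score = np.array([0.12, 0.30, 0.45, 0.55, 0.52, 0.70, 0.81, 0.90, 0.66, 0.20])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision/recall have one more element than thresholds; drop the last point.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]
print(f"F1-optimal threshold: {best:.2f}")
```

In a high-sensitivity screening workflow you would compute this on a held-out set for the specific model family you expect, then apply the lowered threshold in production.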
Gemini 1.5 Pro: AUC [AUC: ?]. Gemini 2.0: [AUC: ?]. Gemini 2.5: [AUC: ?]. Gemini has shown the most variable detection performance across versions — some intermediate releases regressed temporarily before improvements landed.
Gemini's multi-modal training means text-only outputs sometimes carry vestigial patterns from image-caption or code-explanation domains. Our detector picks up on these, which explains Gemini's slightly higher detectability on mixed-domain prompts than on pure prose.
For Google Workspace users whose students or employees use Gemini through Docs, the detection signal is similar to the raw API output. We have not observed workspace-integration-specific evasion patterns distinct from direct Gemini API use.
Paste output from any LLM and see the per-sentence verdict. Our detector treats all 22 model families as a single ensemble check.
Llama 3.1: AUC [AUC: ?]. Llama 3.3: [AUC: ?]. Qwen 2.5: [AUC: ?]. Qwen 3: [AUC: ?]. DeepSeek R1: [AUC: ?]. Mistral Large: [AUC: ?]. Open-weights models span a wider range than closed ones — fine-tuning variants, quantised deployments, and community-modified checkpoints all produce subtly different outputs.
Detection on open-weights models is strategically important because humaniser tools are usually built on them: Llama and Mistral derivatives run locally at low cost, which is why paraphrasing and style-transfer services build on them to keep prices down. If your concern is humanised AI, you are ultimately defending against Llama-family generation.
DeepSeek R1 and o3-mini (OpenAI reasoning model) deserve separate mention. Both produce text with reasoning-chain artefacts — explicit step-by-step logic visible in the output — which our detector has learned to recognise. Reasoning models are currently easier to detect than their base-chat counterparts for this reason.
If you're picking a model to write with and detection isn't your concern, Claude 4.5 Sonnet and GPT-5 are the hardest to detect. If you're building a detection workflow, prioritise the models you actually see: most academic misuse still runs on GPT-4/5 through free interfaces; most content-farming runs on Llama-derivative humanisers.
A single detector trained on a single model family will perform worst on the others. Our ensemble approach trains on samples from all 22 generators, which is why per-model AUC on hard cases (Claude 4.5, GPT-5) is still above 0.90 while any single-model-trained detector would drop below 0.80.
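One way to picture the ensemble idea is max-pooling over per-family detectors: score the text against a classifier specialised for each generator family and flag it if any one of them is confident. The detector names and stand-in scoring functions below are hypothetical placeholders, not our production models.

```python
def ensemble_score(text_features, per_model_detectors):
    """Score text against a detector per model family and return the
    highest score with the family that produced it (max-pooling)."""
    scores = {name: clf(text_features) for name, clf in per_model_detectors.items()}
    top = max(scores, key=scores.get)
    return scores[top], top

# Hypothetical stand-ins; real detectors would be trained classifiers
# taking feature vectors extracted from the text.
detectors = {
    "gpt-5":      lambda feats: 0.42,
    "claude-4.5": lambda feats: 0.87,
    "llama-3.3":  lambda feats: 0.55,
}
score, family = ensemble_score(None, detectors)
print(f"max score {score:.2f} from {family}")
```

Max-pooling keeps recall high on hard families: even when the generic signal is weak, the specialist trained on that family's artefacts can still fire.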
The underlying trend: detection difficulty is rising faster than generator release cadence. Each new flagship is harder to detect than the previous one; retraining closes the gap, but not fully. Expect the 2026–2027 baseline to show lower AUC on frontier models and roughly constant AUC on legacy models.
Per-model AUC numbers are derived from our internal validation and may not generalise. Each model's difficulty changes over time as both the generator and our training corpus evolve. Current data reflects the 2026-04 benchmark run.