
Which AI Is Hardest to Detect? GPT vs Claude vs Gemini vs Llama

Not all AI text is equally detectable. Here are the results of our per-generator benchmark — which model families our detector catches with near-perfect accuracy, which ones it struggles with, and what that tells you about choosing a detection workflow.

2026-04-17 · Plagiarism Detector Team

The Short Answer — Leaderboard

[LEADERBOARD TABLE — fill with real per-model AUC numbers from benchmark before publishing]

Ordered from easiest to hardest to detect on our validation set. The spread is wide — AUC on some model families exceeds 0.99 while others drop into the 0.80s. Detection difficulty correlates with model size, instruction-tuning sophistication, and output variance.

For the full per-generator breakdown methodology, see our accuracy benchmark page. This article summarises the practical implications of that data for users choosing which detector to trust and which model to use.
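The benchmark page documents the methodology in full; as a rough sketch, a per-generator breakdown means scoring every validation document once, then computing AUC separately for each generator's samples against the shared human baseline. The snippet below illustrates the idea with hypothetical file and column names — it is not our benchmark code.

```python
# Illustrative sketch only: not our benchmark code. The file name and the
# column names ("generator", "label", "score") are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("validation_scores.csv")   # one row per scored document
human = df[df["label"] == 0]                # shared human baseline

per_model_auc = {}
for model, ai_docs in df[df["label"] == 1].groupby("generator"):
    subset = pd.concat([ai_docs, human])
    # AUC for this generator's samples against the human baseline only
    per_model_auc[model] = roc_auc_score(subset["label"], subset["score"])

for model, auc in sorted(per_model_auc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:20s} AUC = {auc:.3f}")
```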

OpenAI Family — GPT

GPT-3.5 is the easiest modern model to detect — AUC [AUC: ?] on our set. Legacy generation artefacts (repetition, hedging, bland register) remain clearly present. GPT-4 drops to AUC [AUC: ?], GPT-4o to [AUC: ?], reflecting progressively better calibration. GPT-5.x is the hardest of the family — AUC [AUC: ?] — because the instruction-tuning team explicitly targeted detection-artefact removal.

Practical implication: academic workflows concerned about GPT-3.5-era cheating can rely heavily on detection alone. Workflows concerned about GPT-5 need to pair detection with contextual evidence, as described in our teacher workflow guide.

Temperature settings matter. Low-temperature outputs (t≤0.5) are easier to detect because they concentrate probability mass on a narrower vocabulary. Most chat interfaces default to t≈0.7, putting text in a moderately detectable zone. Adversarial users deliberately raise the temperature or use more diverse decoding to widen the output distribution and evade detection; our ensemble partially corrects for this, but not completely.
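To see the mechanism concretely, the toy sketch below divides a set of invented next-token logits by the temperature before applying the softmax. It is an illustration of the sampling maths only, not anything from our detector or any particular model.

```python
# A minimal sketch of why low temperature concentrates probability mass.
# The logits are invented; real model logits behave the same way qualitatively.
import numpy as np

def softmax_with_temperature(logits, t):
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 3.2, 2.5, 1.0, 0.2]     # hypothetical next-token logits

for t in (0.3, 0.7, 1.2):
    p = softmax_with_temperature(logits, t)
    entropy = -(p * np.log(p)).sum()
    # Lower t -> higher top-token probability and lower entropy,
    # i.e. more predictable text that is easier to flag statistically.
    print(f"t={t}: top-token prob={p.max():.2f}, entropy={entropy:.2f} nats")
```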

Anthropic — Claude

Claude 3 Opus: AUC [AUC: ?]. Claude 3.5 Sonnet: [AUC: ?]. Claude 4 Opus: [AUC: ?]. Claude 4.5 Sonnet: [AUC: ?]. The Claude family consistently produces less repetitive, more stylistically varied text than same-generation GPT models, which makes it harder to detect via statistical methods.

Claude's constitutional-AI training specifically targets the “machine tells” that our supervised classifier learns from — hedging patterns, overuse of specific connectives, predictable paragraph structure. This is a direct adversarial relationship: the generator is trained against features the detector relies on.

Claude 4.5 Sonnet and GPT-5.x are close in difficulty. Their score distributions overlap the human baseline the most in our validation data. If your workflow targets either of these models, expect reduced recall at the default threshold and consider lowering it to the F1-optimal point for high-sensitivity screening.
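If you want to find that F1-optimal point on your own labelled sample rather than take our default, a minimal scikit-learn sketch (with placeholder labels and scores) looks like this:

```python
# Illustrative sketch of picking an F1-optimal threshold from validation
# scores; y_true and y_score are placeholders for your own labelled data.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.55, 0.2, 0.7, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])   # last precision/recall point has no threshold
print(f"F1-optimal threshold = {thresholds[best]:.2f} (F1 = {f1[best]:.2f})")
```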

Google — Gemini

Gemini 1.5 Pro: AUC [AUC: ?]. Gemini 2.0: [AUC: ?]. Gemini 2.5: [AUC: ?]. Gemini has shown the most variable detection performance across versions — some intermediate releases regressed temporarily before improvements landed.

Gemini's multi-modal training means text-only outputs sometimes carry vestigial patterns from image-caption or code-explanation domains. Our detector picks up on these, which explains Gemini's slightly higher detectability on mixed-domain prompts than on pure prose.

For Google Workspace users whose students or employees use Gemini through Docs, the detection signal is similar to the raw API output. We have not observed workspace-integration-specific evasion patterns distinct from direct Gemini API use.

Check a sample from any model

Paste output from any LLM and see the per-sentence verdict. Our detector treats all 22 model families as a single ensemble check.

Meta and Open-Weights Models

Llama 3.1: AUC [AUC: ?]. Llama 3.3: [AUC: ?]. Qwen 2.5: [AUC: ?]. Qwen 3: [AUC: ?]. DeepSeek R1: [AUC: ?]. Mistral Large: [AUC: ?]. Open-weights models span a wider range than closed ones — fine-tuning variants, quantised deployments, and community-modified checkpoints all produce subtly different outputs.

Detection on open-weights models is strategically important because humaniser tools are usually built on them: Llama and Mistral derivatives run locally at low cost, which is how paraphrasing and style-transfer services keep their prices down. If your concern is humanised AI, you are ultimately defending against Llama-family generation.

DeepSeek R1 and o3-mini (OpenAI reasoning model) deserve separate mention. Both produce text with reasoning-chain artefacts — explicit step-by-step logic visible in the output — which our detector has learned to recognise. Reasoning models are currently easier to detect than their base-chat counterparts for this reason.

What These Differences Mean for You

If you're picking a model to write with and detection difficulty matters to you, Claude 4.5 Sonnet and GPT-5 are currently the hardest to detect. If you're building a detection workflow, prioritise the models you actually see: most academic misuse still runs on GPT-4/5 through free interfaces; most content-farming runs on Llama-derivative humanisers.

A single detector trained on a single model family will perform worst on the others. Our ensemble approach trains on samples from all 22 generators, which is why per-model AUC on hard cases (Claude 4.5, GPT-5) is still above 0.90 while any single-model-trained detector would drop below 0.80.

The underlying trend: detection difficulty is rising with every new generator release, faster than retraining alone can offset. Each new flagship is harder to detect than the previous one, and retraining closes the gap but not completely. Expect the 2026–2027 baseline to be lower AUC on frontier models and roughly constant AUC on legacy models.

Frequently Asked Questions

If some models are harder to detect, should I avoid using detectors at all?
No — even on the hardest model families our AUC is above 0.85, which is a strong signal. The question is how you use the signal. For hard-to-detect models, pair the score with corroborating evidence (edit history, in-class work, student conversation). For easier models, the score alone is often sufficient.
Which model should I use if I want to avoid detection?
We don't answer this question directly — we run a detection tool, not an evasion guide. What we will say: detectable-vs-undetectable is not the right axis for picking a model. Quality, cost, and fit-for-purpose matter far more than detection difficulty. If you are writing legitimately with AI assistance, disclosure and transparent workflow matter more than hiding the tool.
Do open-weights model variants have different detection profiles?
Yes, and meaningfully so. A community-fine-tuned Llama 3.3 variant trained for a specific writing style can produce text that scores differently from vanilla Llama 3.3. Our benchmark covers the standard checkpoint; custom fine-tunes may be easier to detect (if they narrow the output distribution) or harder (if they are explicitly adversarially trained against detection).
How do temperature and sampling affect detectability?
Higher temperature and more diverse sampling generally reduce detectability because they widen the output distribution. Low-temperature greedy decoding is easiest to detect. Most production chat interfaces run t≈0.7–1.0 with nucleus sampling, which puts them in a moderately detectable regime — our ensemble performs similarly across the default range.
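For readers curious about the mechanics, the sketch below shows nucleus (top-p) truncation on invented token probabilities. It complements the temperature example earlier in the article and is not taken from any model's actual decoding code.

```python
# A minimal sketch of nucleus (top-p) truncation; probabilities are invented.
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    zero out the rest, and renormalise. Sampling from the result is
    'nucleus sampling'."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.42, 0.30, 0.15, 0.08, 0.05])   # hypothetical token probs
print(top_p_filter(probs, p=0.9))
# A wider nucleus (higher p) keeps more of the tail, which makes sampled text
# statistically closer to human variability and harder to flag.
```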
When will GPT-6 or Claude 5 arrive and what should I expect?
Mid-2026 is the consensus forecast for both. Expect detection AUC on the new families to drop into the 0.80–0.85 range for the first 4–8 weeks after launch while we gather samples and retrain. Past release cycles suggest full recovery within 8–12 weeks if the model is widely available, and longer for rare or limited-access models.

Per-model AUC numbers are derived from our internal validation and may not generalise. Each model's difficulty changes over time as both the generator and our training corpus evolve. Current data reflects the 2026-04 benchmark run.