We publish our AI detector's real-world accuracy against 22 generative models, including GPT-5, Claude 4, Gemini 2, and Llama 3. Per-model tables, honest limitations, and a downloadable dataset for researchers.
Most AI detection tools ask you to trust a single opaque score. We think you deserve evidence. On this page we share the full results of our internal validation run — every generator we tested, the AUC-ROC score on each, the essay types that gave us the most trouble, and the decision thresholds we use in production.
This level of transparency is unusual in the AI-detection space. Most competitors — plagiarism-checker vendors, specialist AI-detection services, generic SaaS tools — publish either no accuracy data or a single cherry-picked number. That pattern is unsustainable: educators, publishers, and researchers need reproducible benchmarks before they can rely on any tool.
Our numbers come from a 1,000-sample validation split of the calibration corpus used to train our ModernBERT detector. The same methodology that drives this benchmark runs on every document you submit through our tool. Nothing is held back for demos.
The 1,000-sample validation set is drawn from a 1,200-sample calibration corpus comprising 600 human-written essays (from the PAN25 shared-task data and the PERSUADE argumentative-essays dataset) and 600 AI-generated essays (produced by 22 distinct large language models under controlled prompting). The 80/20 training-validation split is fixed and repeatable.
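A fixed, repeatable split is just a seeded shuffle over a canonical ordering. As a sketch, assuming a seeded shuffle is how the split is pinned (the seed, fraction, and function name here are illustrative, not our actual pipeline):

```python
import random

def split_corpus(sample_ids, val_fraction=0.2, seed=42):
    """Deterministically split sample IDs into train/validation sets.

    Sorting first gives a canonical order, so the same seed always
    produces the same shuffle and therefore the same split.
    """
    ids = sorted(sample_ids)
    rng = random.Random(seed)   # fixed seed -> repeatable shuffle
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]   # (train, validation)

# Re-running with the same seed yields the identical split.
train, val = split_corpus(range(1200))
```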
Each sample is scored in isolation, with no access to metadata that could leak ground truth. The detector returns a score from 0 to 100: the estimated probability, as a percentage, that the sample is AI-generated. We then compute the area under the receiver operating characteristic curve (AUC-ROC) per generator and per essay type.
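AUC-ROC has a convenient rank interpretation: it is the probability that a randomly chosen AI sample receives a higher detector score than a randomly chosen human sample, counting ties as half. A minimal, dependency-free sketch of that computation (the pairwise form is O(n·m), which is fine at validation-set scale):

```python
def auc_roc(human_scores, ai_scores):
    """AUC-ROC as the probability that a random AI sample scores
    higher than a random human sample; ties count 0.5.

    Scores are detector outputs on the 0-100 scale; AI is the
    positive class.
    """
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))

# Perfect separation -> 1.0; complete overlap of identical scores -> 0.5.
auc_roc([10, 20, 30], [40, 60, 90])   # 1.0
```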
All thresholds, training hyperparameters, and raw probability outputs are logged. The dataset itself is available for download at the bottom of this page — CSV format, one row per sample, with generator identity, essay-type label, raw score, and the final binary verdict.
Across the full 1,000-sample set, our ensemble detector achieves an AUC-ROC of 0.9884. At the 50% decision threshold we use in production, it produces zero false positives on the human essays in the validation set and 60% recall on the AI essays. At the F1-optimal threshold of 26.56%, recall rises to 90% at the cost of a 2% false-positive rate, a tradeoff better suited to high-sensitivity screening workflows.
The document-level verdict on our public tool uses the conservative 50% threshold, prioritising zero false positives over maximum recall. Teachers, publishers, and researchers can override this via the sensitivity slider in the widget when they want more aggressive flagging.
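The recall/false-positive tradeoff between the two thresholds can be reproduced directly from raw scores. A minimal sketch using toy scores (not our validation data), where everything at or above the threshold is flagged as AI:

```python
def recall_and_fpr(human_scores, ai_scores, threshold):
    """Recall on AI samples and false-positive rate on human samples
    when every score >= threshold is flagged as AI-generated.
    Thresholds are on the same 0-100 scale as the detector output.
    """
    tp = sum(s >= threshold for s in ai_scores)
    fp = sum(s >= threshold for s in human_scores)
    return tp / len(ai_scores), fp / len(human_scores)

human = [5, 20, 40]
ai = [80, 60, 30, 10]

# Lowering the threshold trades false positives for recall:
recall_and_fpr(human, ai, 50)      # recall 0.5,  FPR 0.0
recall_and_fpr(human, ai, 26.56)   # recall 0.75, FPR ~0.33
```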
For comparison, the Binoculars zero-shot component alone (a 2× Llama-3.1-8B setup) scores an AUC of 0.8509. The fine-tuned ModernBERT component alone scores 1.0000 on in-distribution essays but drops to 0.9069 on out-of-distribution text. The ensemble beats neither component on its strongest axis, yet it outperforms both on average because it corrects their complementary weaknesses.
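This page does not publish the ensemble's exact combination rule, so as an illustration only, here is the simplest common scheme: a weighted average of the two component probabilities. The 0.7 weight and the function name are placeholders, not our production values:

```python
def ensemble_score(bert_prob, binoculars_prob, w_bert=0.7):
    """Weighted average of two component probabilities (0-100 scale).

    ILLUSTRATIVE ONLY: the real combination rule is not published
    here; this just shows how a weight biases the verdict toward the
    component trusted more in-distribution.
    """
    return w_bert * bert_prob + (1.0 - w_bert) * binoculars_prob

# A confident supervised score dominates a neutral zero-shot score:
ensemble_score(100, 0)   # 70.0
```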
Here is the per-model AUC-ROC table. Models are ordered from easiest to hardest to detect on our validation set. [PER-MODEL TABLE — fill real numbers from dkr_eval_pan25/ results before publishing]
Vendor      | Model              | AUC-ROC
OpenAI      | GPT-3.5            | ?
OpenAI      | GPT-4              | ?
OpenAI      | GPT-4 Turbo        | ?
OpenAI      | GPT-4o             | ?
OpenAI      | GPT-5.0            | ?
OpenAI      | GPT-5.3            | ?
OpenAI      | GPT-5.4            | ?
Anthropic   | Claude 3 Opus      | ?
Anthropic   | Claude 3.5 Sonnet  | ?
Anthropic   | Claude 4 Opus      | ?
Anthropic   | Claude 4.5 Sonnet  | ?
Google      | Gemini 1.5 Pro     | ?
Google      | Gemini 2.0         | ?
Google      | Gemini 2.5         | ?
Meta        | Llama 3.1          | ?
Meta        | Llama 3.3          | ?
Other       | Qwen 2.5           | ?
Other       | Qwen 3             | ?
Other       | DeepSeek R1        | ?
Other       | Mistral Large      | ?
Other       | o3-mini            | ?
The headline pattern: newer, larger, instruction-tuned models tend to produce text that looks more human to any statistical detector, including ours. Claude 4.5 Sonnet and the GPT-5.x models are the two families where our score distributions overlap most with the human baseline. This is consistent with independent studies published in 2025: the arms race is real, and model scale is a direct headwind for detection.
Not all text is equally detectable. We break results down by essay type (the PERSUADE prompt categories), and the gap between the best and the worst is wide. [PER-TYPE TABLE]
Argumentative, persuasive, and expository essays: the detector's strongest domain. AUC typically 0.97–1.00 because training corpora overweight these styles. This is where most academic-integrity use cases fall.
Creative writing and literary analysis: our weakest domain. For literary_analysis the AUC drops to 0.69 — human style in fiction converges with LLM outputs and neither our supervised nor zero-shot component can reliably distinguish them. Treat a high AI score on fiction with skepticism.
Three classes of text escape our detector more often than our validation set suggests. Humanised AI text — output passed through an adversarial paraphrasing or style-transfer tool — often scores as human even when the underlying text was fully generated. Short text (under 100 words) is hard to classify at all because there is insufficient statistical signal. Non-native English writing can score as AI-generated because LLMs and ESL writers share certain lexical and syntactic preferences.
Our detector is probabilistic, not evidentiary. A high AI score is a signal to investigate further, not proof of misconduct. We strongly recommend pairing the score with context: recent edit history, version drafts, writing samples from the same author, and — where permitted — a short follow-up conversation with the author.
We continuously retrain on the latest generator outputs, but there is always a lag: a model released last week may not be well-represented in training data. If your workflow depends on catching the latest models, re-check our benchmark page quarterly for the updated numbers.
We publish the raw validation results so researchers, journalists, and educators can independently verify our claims. The CSV contains: sample ID, generator identity (or 'human'), essay-type label, raw probability output, binary verdict at 50% threshold, binary verdict at 26.56% threshold.
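One sanity check anyone can run on the download: the two verdict columns should be pure functions of the raw probability. The column names below are assumed from the description above (the real header may differ), and the rows are made-up examples rather than real benchmark data:

```python
import csv
import io

# Illustrative rows only; column names assumed from the text above.
SAMPLE = """sample_id,generator,essay_type,probability,verdict_50,verdict_2656
s001,human,argumentative,3.2,0,0
s002,gpt-4o,expository,91.7,1,1
s003,claude-4-opus,literary_analysis,31.0,0,1
"""

def check_verdicts(csv_text):
    """Return sample IDs whose binary verdicts contradict the raw score."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        p = float(row["probability"])
        if int(row["verdict_50"]) != (p >= 50.0):
            bad.append(row["sample_id"])
        if int(row["verdict_2656"]) != (p >= 26.56):
            bad.append(row["sample_id"])
    return bad

check_verdicts(SAMPLE)   # [] -> every verdict matches its threshold
```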
Download: ai-detector-benchmark-2026-04.csv (updated quarterly). Academic use is unrestricted; commercial re-publication requires attribution: “Plagiarism Detector — AI Detection Benchmark 2026-04”.
For an interactive version of the same methodology on your own text, try our AI & Plagiarism Checker tool — paste any document and see the per-sentence verdict, the same decision thresholds, and the same confidence interval we use for these published numbers.
Benchmark results are derived from our internal validation set and may not generalise to out-of-distribution text. Published numbers represent average performance across 1,000 samples; your document may score differently. Use AI detection results as one input among many, not as sole evidence of authorship.