2025 PNAS Study: AI Judges Prefer AI-Written Content 70-89% - What Changes Now?

Reviewed by: Andrii Daniv
6 min read
Aug 19, 2025

Large language models (LLMs) preferred AI-written content 70-89% of the time in forced-choice tests across product descriptions, scientific abstracts, and movie summaries, according to a peer-reviewed PNAS study. As AI systems increasingly rank, score, and summarize content, this pattern suggests AI-assisted text may gain an advantage when evaluated by other AI systems.

AI systems prefer AI-written content

A controlled study in Proceedings of the National Academy of Sciences tested LLMs as evaluators. In pairwise comparisons of human-written vs GPT-4-generated versions of the same items, LLMs more often selected the AI-written option, while human raters showed lower preference for AI alternatives. The authors caution that as AI mediates discovery and ranking, AI-assisted text may gain a systematic advantage - a potential "gate tax" on those who do not use AI assistance.

Executive snapshot

  • LLM judges preferred the AI-written versions in 89% of comparisons for product descriptions, 78% for paper abstracts, and 70% for movie summaries when the AI versions were produced by GPT-4. Human raters making the same comparisons chose the AI versions 36%, 61%, and 58% of the time, respectively.
  • Evaluator models included GPT-3.5, GPT-4-1106, Llama-3.1-70B, Mixtral-8x22B, and Qwen2.5-72B in pairwise, forced-choice prompts.
  • Order effects were present; some models favored the first option. The study mitigated this by swapping order and averaging outcomes.
  • The human rater sample was small (n=13), and the task measured preference, not downstream sales or engagement impact.

Implication for marketers: If AI systems score or summarize listings, AI-assisted copy may be more likely to pass AI gates. Validate conversion effects separately.

Study design and source notes (PNAS 2025)

What was measured

Relative preference for AI-written vs human-written versions of the same item using pairwise, forced-choice prompts. Items spanned marketplace product descriptions, scientific abstracts, and movie plot summaries. AI variants were generated with GPT-4 for the reported comparisons. Presentation order was counterbalanced to reduce position bias and results were averaged across permutations. A small human baseline (n=13) completed the same comparisons.
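
The mechanics of this protocol are straightforward to reproduce on your own copy. The sketch below shows one way a pairwise, forced-choice comparison with order counterbalancing could be implemented in Python; it is not the authors' code. The `judge` callable, the `toy_judge` stand-in, and the trial count are illustrative assumptions - in a real run, `judge` would wrap an LLM call with a forced-choice prompt and parse the answer into "A" or "B".

```python
def pairwise_preference(judge, human_text, ai_text, trials=4):
    """Estimate how often a judge picks the AI-written version of the same item.

    `judge` is any callable that takes (option_a, option_b) and returns "A" or "B".
    Presentation order alternates between trials so that a judge's position
    bias (e.g., favoring whichever option appears first) averages out,
    mirroring the counterbalancing described in the study.
    """
    ai_wins = 0
    for i in range(trials):
        if i % 2 == 0:
            # AI text shown second (option B)
            ai_wins += judge(human_text, ai_text) == "B"
        else:
            # AI text shown first (option A)
            ai_wins += judge(ai_text, human_text) == "A"
    return ai_wins / trials


# Toy judge used purely for illustration: it always prefers the longer option,
# standing in for a real LLM call behind a forced-choice prompt.
def toy_judge(option_a, option_b):
    return "A" if len(option_a) >= len(option_b) else "B"


human = "Red mug, 300 ml."
ai = "A durable 300 ml ceramic mug in bright red, dishwasher safe."
print(f"AI-text selection rate: {pairwise_preference(toy_judge, human, ai):.0%}")
```

Swapping which option appears first, as the study did, keeps a judge's position bias from inflating or deflating the measured preference.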

By whom and when

Walter Laurito, Jan Kulveit, and colleagues; peer-reviewed and published in PNAS (2025), DOI: 10.1073/pnas.2415697122. Full details are available in the PNAS study.

Key limitations

  • Small human rater sample (n=13).
  • Pairwise preference may not correlate with real-world outcomes such as click-through or sales.
  • Results may vary by prompt design, model version, domain, and text length.
  • The mechanism behind the LLM preference (stylistic signals, lexical patterns, or formatting regularity) was not identified; the authors call for follow-up work on stylometry and mitigations.

Measured LLM preference across product, abstract, and movie text

Across tasks with GPT-4 as the generator, LLM evaluators selected AI text at materially higher rates than human raters. The divergence was largest for product descriptions (LLMs 89% vs humans 36%) and smallest for movie summaries (LLMs 70% vs humans 58%), with scientific abstracts in between. Evaluator models exhibited order effects; the study mitigated this by alternating presentation and averaging, though residual bias cannot be fully ruled out. The evaluator set included both proprietary and open models, indicating the preference is not confined to a single provider family under the tested conditions.

The authors note the risk of implicit discrimination against human-written text when AI systems act as selection gates for ranking, summarization, or recommendations - a potential operational cost for organizations that do not adapt content for AI-mediated selection.

Marketing implications for AI-mediated ranking and scoring

Interpretation - likely

  • AI-mediated surfaces that rely on LLM scoring, summarization, or arbitration could favor AI-style copy, especially in product-like descriptions where the measured AI preference was highest. Using AI assistance for such copy may yield selection advantages within AI-driven workflows, separate from human conversion considerations.

Interpretation - tentative

  • Style and formatting features common in AI-generated text (consistent structure, declarative clarity, attribute completeness) may align with LLM evaluators' heuristics, contributing to higher selection rates. The study did not isolate which features drive preference.
  • A "gate tax" is plausible where AI assistance becomes necessary to meet AI gatekeepers' expectations, but its magnitude will vary by platform and by how much LLM judgments influence ranking versus behavioral signals.

Interpretation - speculative

  • Hybrid authoring - human claims and brand tone combined with an AI pass for structure, attribute coverage, and consistency - may balance AI-gate friendliness with human persuasiveness. Allocate effort by risk exposure: AI-scored formats may warrant more AI-assisted structure, while brand storytelling may remain human-led.
  • Platform-side mitigations (ensemble evaluators, randomized presentation, human review, or non-LLM scoring features) could reduce self-preference, but adoption depends on platform constraints and was outside the study scope; a rough sketch of the ensemble idea follows this list.
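
As one illustration of what an ensemble-plus-randomization mitigation could look like (purely a sketch under assumed interfaces, not something evaluated in the study), the snippet below polls several judge callables, randomizes which text each one sees first, maps the votes back to the original labels, and takes a majority:

```python
import random
from collections import Counter

def ensemble_vote(judges, option_a, option_b, rng=None):
    """Majority vote over several judges with randomized presentation order.

    Each judge is a callable taking two texts and returning "A" or "B".
    Randomizing which text is shown first and mapping the answer back to the
    original labels keeps any single model's self-preference or position bias
    from deciding the outcome on its own.
    """
    rng = rng or random.Random(0)
    votes = Counter()
    for judge in judges:
        if rng.random() < 0.5:
            votes[judge(option_a, option_b)] += 1
        else:
            # Shown in flipped order, so map the answer back to original labels.
            flipped = judge(option_b, option_a)
            votes["A" if flipped == "B" else "B"] += 1
    return votes.most_common(1)[0][0]


# Toy judges standing in for different models, each with its own heuristic.
judges = [
    lambda a, b: "A" if len(a) > len(b) else "B",              # prefers longer copy
    lambda a, b: "A" if a.count(",") > b.count(",") else "B",  # prefers more listed attributes
    lambda a, b: "A" if len(a) < len(b) else "B",              # prefers shorter copy
]
human = "Red mug, 300 ml."
ai = "A durable 300 ml ceramic mug in bright red, dishwasher safe."
print(ensemble_vote(judges, human, ai))  # majority pick of the panel ("B" here)
```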

Operational guidance - evidence-linked

  • Where visibility is mediated by LLMs, expect AI-assisted versions to test into higher pass rates with AI evaluators. Validate separately with human metrics (CTR, add-to-cart, conversion, returns) because evaluator preference did not equate to sales impact in this study; the sketch below makes this separation explicit.
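
One way to keep those two questions separate in a review workflow is a simple decision rule like the hypothetical sketch below; the threshold, parameter names, and return strings are illustrative assumptions, not findings from the study.

```python
def promote_variant(llm_pass_rate, ctr_lift, conversion_lift, gate_threshold=0.5):
    """Decision sketch keeping AI-gate results and human outcomes separate.

    llm_pass_rate   - share of pairwise wins over the incumbent copy when an
                      LLM judge does the forced-choice comparison
    ctr_lift        - relative click-through change from a live A/B test
    conversion_lift - relative conversion change from the same test
    """
    passes_ai_gate = llm_pass_rate > gate_threshold
    lifts_human_metrics = ctr_lift > 0 and conversion_lift > 0
    if passes_ai_gate and lifts_human_metrics:
        return "promote"
    if passes_ai_gate:
        return "keep testing: preferred by AI evaluators, no measured human lift yet"
    return "hold: revise copy before spending test traffic"


print(promote_variant(llm_pass_rate=0.8, ctr_lift=0.03, conversion_lift=-0.01))
# -> "keep testing: preferred by AI evaluators, no measured human lift yet"
```

The point of the structure is that an AI-gate win alone never promotes a variant; only the live human metrics can do that.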

Limitations, conflicting evidence, and open questions

  • Small human baseline (n=13); generalization to broader audiences or languages is uncertain.
  • Preference in a forced-choice task does not measure commercial outcomes; an LLM-preferred description may not convert better.
  • The AI-written variants were generated by GPT-4; it is unknown whether similar preferences hold for other generators or after extensive human editing.
  • Position and prompt framing effects exist; counterbalancing helps but may not eliminate bias.
  • Domain and length effects are plausible; product descriptions showed the largest gap, movies the smallest in this study. Other content types were not evaluated.
  • Mechanism remains unclear; stylometric signals, formatting regularity, verbosity, or attribute density could contribute.
  • Practical influence on major platforms varies. Many ranking systems still weight behavioral and business signals, and the share of decisions delegated to LLM judges is platform-specific.

Source: PNAS study.

Author
Etavrian AI
Etavrian AI is developed by Andrii Daniv to produce and optimize content for the etavrian.com website.
Reviewed by
Andrii Daniv
Andrii Daniv is the founder and owner of Etavrian, a performance-driven agency specializing in PPC and SEO services for B2B and e-commerce businesses.