TTD-DR vs OpenAI Deep Research: Google introduces a retrieval-driven, draft-first research agent that iteratively denoises a report using search and self-evolution. In head-to-head tests, TTD-DR posted a 74.5% win rate on long-form reports and +7.7 percentage points on HLE-Search, while delivering a stronger latency-quality trade-off than OpenAI’s Deep Research. Full method and results are in the paper.
Test-Time Diffusion Deep Researcher (TTD-DR)
TTD-DR frames research as an iterative draft -> search -> revise cycle. A preliminary draft is repeatedly refined with new retrieved facts, a process akin to denoising with retrieval, while a self-evolution routine improves intermediate steps and merges stronger variants before assembling the final report. The backbone is retrieval-augmented generation with Gemini-2.5-pro as the base model. Evaluations cover long-form report writing and multi-hop reasoning, with comparisons to OpenAI Deep Research.
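The blog does not publish code, but the described loop maps naturally onto a small control flow. Below is a minimal sketch, assuming hypothetical callables (make_plan, draft_report, propose_query, web_search, revise, finalize) stand in for the LLM and retrieval steps; it illustrates the draft -> search -> revise pattern, not the published implementation.

```python
from typing import Callable

def deep_research(
    topic: str,
    make_plan: Callable[[str], str],
    draft_report: Callable[[str, str], str],
    propose_query: Callable[[str, str], str],
    web_search: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    finalize: Callable[[str, str], str],
    max_rounds: int = 8,
) -> str:
    plan = make_plan(topic)            # stage 1: research plan
    draft = draft_report(topic, plan)  # preliminary ("noisy") draft
    for _ in range(max_rounds):        # stage 2: iterative search
        query = propose_query(draft, plan)  # the evolving draft drives the next question
        evidence = web_search(query)        # retrieve supporting documents
        revised = revise(draft, evidence)   # fold new facts into the draft ("denoising")
        if revised == draft:                # crude convergence check (an assumption)
            break
        draft = revised
    return finalize(draft, plan)       # stage 3: final report synthesis
```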
Executive snapshot
- Win rate: 74.5% vs OpenAI Deep Research on long-form report generation (pairwise evaluation).
- Correctness: +7.7 percentage points on a 200-query HLE-Search subset and +1.7 points on GAIA.
- Ablations: the backbone alone underperforms OpenAI DR; adding self-evolution yields a 59.8% win rate on DeepConsult and +4.4/+1.2 points correctness on HLE-Search/GAIA; adding diffusion-with-retrieval drives further gains.
- Efficiency: per a Pareto chart, TTD-DR reaches higher quality (win rate) than OpenAI DR at comparable latency.
- Base model: Gemini-2.5-pro for TTD-DR; baselines use their default LLMs.
Why it matters for marketers: Iterative, draft-first agents that keep searching and revising with citations can raise long-form factual quality at similar or lower time-to-answer.
Method and source notes for deep research agents evaluation
The team evaluates a diffusion-style research agent that plans, iteratively generates search questions and synthesizes answers via a RAG-like process, and drafts a final report. Component-wise self-evolution and report-level denoising are applied.
- Who and when: Google Cloud Research (Han, Lee, et al.). Blog post dated Sep 19, 2025, with an accompanying paper.
- What was measured: pairwise win rate vs OpenAI DR for long-form reports; correctness as exact match for short answers; latency-quality trade-off via a Pareto-frontier diagram (a metric sketch follows this list).
- Models: TTD-DR uses Gemini-2.5-pro; other agents use their default LLMs.
- Datasets: DeepConsult for long-form reports, Humanity's Last Exam with a 200-query HLE-Search subset, and GAIA for multi-hop Q&A.
- Sample sizes: HLE-Search = 200 queries; other set sizes not stated in the blog.
- Procedure notes: self-evolution uses LLM-as-judge for feedback; the blog does not specify human vs automated judges for final win rates.
- Availability: productized as a Research Assistant in Google Agentspace and built with the Agent Development Kit.
- Limitations: vendor-run study, incomplete judging protocol and cost details, baseline model differences, limited disclosure on dataset splits and statistical significance.
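The two headline metrics are straightforward to operationalize. A minimal sketch follows; the tie handling and answer normalization are assumptions, since the blog does not spell them out.

```python
def win_rate(preferences: list[str]) -> float:
    """Pairwise win rate against a fixed baseline.

    `preferences` holds one verdict per query: "win", "loss", or "tie".
    A common convention (assumed here) counts ties as half a win.
    """
    score = sum(1.0 if p == "win" else 0.5 if p == "tie" else 0.0 for p in preferences)
    return 100.0 * score / len(preferences)

def exact_match(predictions: list[str], answers: list[str]) -> float:
    """Correctness as exact match, with minimal normalization (an assumption)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().strip().split())
    hits = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return 100.0 * hits / len(answers)
```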
Findings
TTD-DR design and workflow
- Three stages: plan -> iterative search (question generation plus answer synthesis with retrieved documents) -> final report.
- Self-evolution explores multiple answer variants, scores them with LLM-as-judge, revises with feedback, and merges them into a single stronger output (see the sketch after this list). This follows common patterns such as generating multiple answers and then selecting among them.
- Report-level denoising: the current draft feeds the next search; new facts revise the draft, repeating to convergence.
- Backbone: retrieval-augmented and draft-centric, implemented on Gemini-2.5-pro.
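A minimal sketch of the component-level self-evolution pattern described above, with hypothetical callables standing in for the LLM calls (generation, judging, revision, merging); the variant counts and selection rule are assumptions.

```python
from typing import Callable

def self_evolve(
    prompt: str,
    generate: Callable[[str], str],                  # draft one candidate answer
    judge: Callable[[str, str], tuple[float, str]],  # returns (score, textual feedback)
    revise: Callable[[str, str, str], str],          # revise a candidate using feedback
    merge: Callable[[str, list[str]], str],          # fuse candidates into one stronger answer
    n_variants: int = 4,
    n_rounds: int = 2,
) -> str:
    candidates = [generate(prompt) for _ in range(n_variants)]
    for _ in range(n_rounds):
        scored = []
        for cand in candidates:
            score, feedback = judge(prompt, cand)                   # LLM-as-judge scoring + critique
            scored.append((score, revise(prompt, cand, feedback)))  # revise with feedback
        scored.sort(key=lambda x: x[0], reverse=True)
        candidates = [c for _, c in scored[: max(2, n_variants // 2)]]  # keep stronger variants
    return merge(prompt, candidates)                 # merge survivors into a single output
```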
Head-to-head outcomes vs OpenAI Deep Research
- Long-form reports: 74.5% win rate for TTD-DR. The blog notes win rates are computed relative to the OpenAI DR baseline.
- Multi-hop short answers: correctness +7.7 points on HLE-Search (200 queries) and +1.7 points on GAIA.
Ablation results
- Backbone only: underperforms OpenAI DR.
- + Self-evolution: DeepConsult win rate 59.8%; correctness +4.4 points on HLE-Search and +1.2 on GAIA.
- + Diffusion with retrieval: substantial additional gains across all evaluations.
Latency-quality trade-off
Pareto analysis shows TTD-DR achieves higher quality at the same latency compared with OpenAI DR and other agents. The blog provides a chart but no raw numbers.
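The underlying comparison is a standard Pareto frontier over (latency, quality) points. The sketch below shows the idea with purely illustrative placeholder numbers, not values from the study.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return agents not dominated on (latency, quality).

    Each point is (name, latency_seconds, quality); lower latency and higher
    quality are better. An agent is dominated if another is at least as fast
    and at least as good, and strictly better on one axis.
    """
    frontier = []
    for name, lat, qual in points:
        dominated = any(
            (l <= lat and q >= qual) and (l < lat or q > qual)
            for n, l, q in points if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative placeholders only; the blog publishes no raw latency or quality numbers.
agents = [("agent_a", 120.0, 0.62), ("agent_b", 150.0, 0.70), ("agent_c", 150.0, 0.55)]
print(pareto_frontier(agents))  # agent_c is dominated by agent_b
```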
Interpretation and implications for marketing and content operations
Likely
- Iterative draft-first workflows with targeted retrieval and continuous revision improve long-form factual quality relative to single-pass generation, as indicated by the 74.5% win rate. This fits research-heavy briefs, compliance summaries, and executive reports where verifiable claims matter.
Tentative
- Multi-hop gains (+7.7/+1.7 points) suggest fewer factual gaps for Q&A-style tasks such as briefs, FAQs, and product comparisons, but transfer to marketing domains should be validated against in-domain content and house style.
- Efficiency findings imply quality improvements without longer wait times, supporting SLAs while raising editorial standards. Throughput and cost will depend on retrieval setup, token limits, and model pricing.
Speculative
- Draft-driven search and revision may reduce hallucinations and increase citation density in public-facing content and thought leadership. Search ranking impact remains uncertain and should be tested in production.
Operational guidance
- Favor agents that plan coverage, generate search questions from the evolving draft, ground claims in retrieved sources, then run final synthesis after revisions.
- Use human QA or calibrated LLM raters to assess helpfulness and trace claims to sources before publication. The study itself relies on LLM raters during self-evolution.
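A minimal pre-publication QA sketch in that spirit: every claim must carry a citation, and a rater (human reviewer or calibrated LLM, passed in as a callable) confirms the cited source actually supports it. All names here are hypothetical.

```python
from typing import Callable

def qa_report(
    claims: list[dict],                    # each: {"text": ..., "source_id": ..., "source_text": ...}
    supports: Callable[[str, str], bool],  # rater: does the source text support the claim?
) -> list[str]:
    """Return a list of problems to resolve before publication."""
    issues = []
    for i, claim in enumerate(claims):
        if not claim.get("source_id"):
            issues.append(f"claim {i}: no citation attached")
        elif not supports(claim["text"], claim.get("source_text", "")):
            issues.append(f"claim {i}: cited source does not support the claim")
    return issues
```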
Contradictions and gaps in the evidence
- Judge transparency: the blog references LLM-as-judge for self-evolution but does not clarify judging for reported win rates.
- Baselines: TTD-DR uses Gemini-2.5-pro while other agents use their default models, confounding method vs model effects.
- Costs: compute usage, API calls, and dollar costs per task are not reported.
- Statistical rigor: no confidence intervals or significance tests are provided.
- Dataset coverage: only a 200-query subset is specified for HLE-Search; sizes and topic mix for other sets are not detailed.
- Generalization: results may not extrapolate to non-English content or domain-specific corpora without further tests.
Data appendix: key numbers
- 74.5% win rate vs OpenAI DR on long-form report generation.
- HLE-Search: subset of 200 harder queries; correctness +7.7 points vs OpenAI DR.
- GAIA correctness: +1.7 points vs OpenAI DR.
- Ablation with self-evolution: DeepConsult win rate 59.8%; correctness +4.4 (HLE-Search) and +1.2 (GAIA).