Google's MAD scales differentially private partition selection to 800B entries - what changes for keywords

Reviewed by Andrii Daniv
6 min read · Aug 21, 2025

This report distills new findings on differentially private partition selection from Google Research’s ICML 2025 work and translates them into defensible implications for privacy-safe keyword and content analytics pipelines.

Differentially private partition selection: executive snapshot

Google Research introduced MaxAdaptiveDegree (MAD), a parallel algorithm for differentially private (DP) partition selection that reallocates excess weight from very frequent items to near-threshold items, increasing the number of items released under the same privacy budget. For technical details, see Scalable Private Partition Selection via Adaptive Weighting and background on DP partition selection.

  • Scale: MAD scales to datasets with hundreds of billions of items - up to 10^3× larger than the datasets handled by prior sequential algorithms analyzed in the literature.
  • Utility at scale: On Common Crawl (~800 billion entries), two-round MAD (MAD2R) produced an item set such that 99.9% of entries had at least one item in the output and 97% of database entries corresponded to an item in the output, while satisfying DP.
  • Comparative performance: Across nine public datasets, MAD2R returned more items than baselines (uniform “Basic” and DP-SIPS) under matched privacy parameters, achieving the state of the art reported by the authors.
  • Engineering fit: Designed for MapReduce-like parallelism; per-user sensitivity bounds stay as low as in the basic Gaussian-weighting baseline.
  • Availability: An open-source implementation is available: DP partition selection on GitHub.

One-line implication for marketers: Expect richer privacy-safe aggregate lists (e.g., top queries or n-grams) from large logs - greater coverage of frequent terms without exposing individuals.

Method and source notes on Google Research’s MAD algorithm

  • Object measured: Utility and scalability of DP partition selection - how many unique items can be safely released under user- or record-level DP - comparing MAD (single- and multi-round) to scalable baselines under fixed privacy budgets.
  • Who/when: Google Research; paper accepted at ICML 2025.
  • Datasets and sample size: Nine publicly available datasets; a large-scale run on Common Crawl treating each entry as a “user” (~800 billion entries).
  • Methods: Adaptive weighting within the standard weight-noise-filter DP paradigm (a minimal sketch follows this list); an optional multi-round approach that releases intermediate noisy weights; parallelizable in MapReduce-style systems; per-user sensitivity kept as low as the Gaussian-weighting baseline.
  • Outcomes: Number of items returned and entry coverage on Common Crawl; comparisons to Basic and DP-SIPS.
  • Caveats: The blog summary omits specific ε/δ values and per-dataset numeric deltas vs. baselines; consult the paper for full parameters. Record-level DP on Common Crawl (entries as users) may differ from user-level DP needed in some applications.
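
To make the weight-noise-filter paradigm concrete, here is a minimal sketch of the non-adaptive baseline in Python. All constants - the per-user weighting, the noise scale, and the threshold rule - are illustrative assumptions standing in for the paper's exact calibration, and the function name is ours, not from Google's open-source library.

```python
import math
import random
from collections import defaultdict

def basic_partition_selection(user_items, epsilon, delta, max_items_per_user):
    """Non-adaptive DP partition selection: weight, add noise, filter."""
    # Pass 1: each user spreads a unit L2 weight budget uniformly over
    # at most max_items_per_user of their items.
    weights = defaultdict(float)
    for items in user_items:
        kept = list(items)[:max_items_per_user]  # cap per-user contributions
        if not kept:
            continue
        w = 1.0 / math.sqrt(len(kept))           # user's vector has L2 norm 1
        for item in kept:
            weights[item] += w

    # Adding or removing one user changes the weight vector by at most 1
    # in L2 norm, so Gaussian noise can be calibrated to (epsilon, delta).
    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    # Illustrative threshold: an item backed by a single user (weight <= 1)
    # passes only with probability on the order of delta.
    threshold = 1.0 + sigma * math.sqrt(2.0 * math.log(1.0 / delta))

    return {item for item, weight in weights.items()
            if weight + random.gauss(0.0, sigma) > threshold}
```

The inefficiency this baseline leaves on the table is visible in the last line: an item seen by millions of users clears the threshold by an enormous margin, yet all of that surplus weight is simply discarded rather than helping near-threshold items - the waste MAD targets.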

Findings: algorithm performance and scalability

MAD reframes the weighting step to reduce waste: very frequent items receive far more weight than they need to pass the DP threshold, so MAD reallocates that surplus to near-threshold items, raising the total number of items released while preserving privacy sensitivity bounds. The design fits distributed pipelines, addressing the need to run DP selection at Internet scale. A simplified sketch of the reallocation step follows the list below.

  • Utility gains from adaptivity: MAD identifies items with excess weight and shifts some allocation to under-allocated items without breaching per-user sensitivity, increasing item counts compared with Gaussian weighting in single-round settings.
  • Multi-round refinement: Safely releasing intermediate noisy weights enables further reallocation in later rounds; two-round MAD (MAD2R) achieved the strongest results among tested scalable methods across nine public datasets under the same privacy budgets.
  • Large-scale run: On Common Crawl (~800B entries), MAD2R produced outputs where 99.9% of entries had at least one selected item and 97% of database entries mapped to an item in the released set, while satisfying a DP guarantee defined at the entry (record) level.
  • Comparative baselines: Reported comparisons include uniform “Basic” weighting and DP-SIPS; MAD2R returned more items than both under equal privacy settings.
  • Applicability: DP partition selection underpins vocabulary and n-gram extraction, privacy-preserving analysis of data streams, DP histograms, and private model fine-tuning; improving this step can lift downstream utility across those tasks.
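
The reallocation idea can be sketched in a few lines on top of the baseline above. This is a deliberately simplified, single-round toy: the paper's algorithm bounds every reallocation step so that per-user sensitivity provably never grows, whereas the `slack` heuristic here merely illustrates where reclaimed weight goes.

```python
import math
import random
from collections import defaultdict

def mad_like_selection(user_items, epsilon, delta, max_items_per_user, slack=0.5):
    """Toy single-round sketch of MAD-style adaptive weighting."""
    # Pass 1: initial uniform weights, identical to the baseline sketch.
    weights = defaultdict(float)
    for items in user_items:
        kept = list(items)[:max_items_per_user]
        if not kept:
            continue
        w = 1.0 / math.sqrt(len(kept))
        for item in kept:
            weights[item] += w

    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    threshold = 1.0 + sigma * math.sqrt(2.0 * math.log(1.0 / delta))
    cap = threshold * (1.0 + slack)  # weight above this clears easily anyway

    # Pass 2: each user reclaims its share of the surplus on over-capped
    # items and donates it to its own items still below the threshold.
    adjusted = defaultdict(float)
    for items in user_items:
        kept = list(items)[:max_items_per_user]
        if not kept:
            continue
        w = 1.0 / math.sqrt(len(kept))
        surplus = sum(w * (1.0 - cap / weights[i]) for i in kept if weights[i] > cap)
        needy = [i for i in kept if weights[i] <= threshold]
        for item in kept:
            adjusted[item] += w * min(1.0, cap / weights[item])
        for item in needy:  # a user with no needy items simply drops its surplus
            adjusted[item] += surplus / len(needy)

    return {item for item, weight in adjusted.items()
            if weight + random.gauss(0.0, sigma) > threshold}
```

MAD2R, the two-round variant, spends part of the privacy budget to release the round-one noisy weights and then uses them to steer a second, better-informed reallocation; because that intermediate release is itself noised, the whole procedure stays within the overall DP guarantee.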

Interpretation and implications for marketers

Likely

  • Privacy-safe keyword and query lists can retain broader coverage of frequent terms under the same DP budgets, improving the fidelity of aggregate reporting while keeping rare, potentially identifying queries out of release (a toy example follows this list).
  • Distributed data teams can integrate DP selection into existing big-data stacks (e.g., MapReduce-style), reducing operational friction for privacy-by-design analytics.
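
As a toy end-to-end illustration of the first point above, the sketched `mad_like_selection` routine from the findings section can turn a raw query log into a privacy-safe top-query list. The data, parameters, and expected output below are all invented for the example.

```python
# Synthetic log: one inner list of queries per user.
query_log = []
for u in range(5000):
    queries = ["running shoes"]                 # head term shared by everyone
    if u % 10 == 0:
        queries.append("trail shoes")           # mid-frequency term (500 users)
    queries.append(f"user {u} street address")  # unique, identifying query
    query_log.append(queries)

released = mad_like_selection(query_log, epsilon=1.0, delta=1e-6,
                              max_items_per_user=10)
print(sorted(released))
# With high probability this prints only the shared terms,
# e.g. ['running shoes', 'trail shoes']; no unique query survives.
```

The point for reporting pipelines: frequent, business-relevant terms come through, while any query tied to a single user stays out of the release by construction of the threshold, not by manual scrubbing.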

Tentative

  • Better DP vocabulary extraction may support stronger language and search models trained on privacy-protected corpora, improving features such as on-site search suggestions, content clustering, and topic discovery without access to raw user identifiers.
  • Multi-round adaptivity might let analytics teams tune toward business-relevant frequency thresholds while maintaining similar privacy budgets.

Speculative

  • Ad and analytics platforms could publish richer DP aggregate reports (e.g., top queries by vertical) with improved utility, affecting how marketers source privacy-compliant keyword expansions and competitive intelligence.
  • If adopted widely, improved DP partition selection may shift some research budgets from raw log analysis toward DP-protected pipelines, altering data-sharing agreements and procurement norms.

Contradictions and gaps

  • Parameter transparency: The public summary does not specify ε/δ or item-level error rates for the showcased runs, limiting independent assessment of the exact privacy-utility trade-off; the paper should be consulted for full details.
  • Benchmark breadth: The write-up references nine datasets and a comparative table image but does not enumerate per-dataset numeric deltas in text, constraining meta-analysis.
  • Privacy unit variance: On Common Crawl, entries are treated as users (record-level DP); many commercial contexts require user-level DP across sessions, which can be stricter and may change utility curves.
  • External replication: While code is open-sourced, independent third-party replications and audits were not cited; broader validation would strengthen confidence.

Data appendix

Key quantitative points

  • Scale: “Hundreds of billions” of items; up to 10^3× larger than the datasets handled by prior sequential algorithms analyzed.
  • Common Crawl run: ~800 billion entries; MAD2R coverage - 99.9% of entries had at least one selected item; 97% of database entries corresponded to a selected item; DP guarantee at entry level.
  • Benchmarks: Nine publicly available datasets; MAD2R outperformed Basic and DP-SIPS under matched privacy settings (item counts increased).

Author
Etavrian AI
Etavrian AI is developed by Andrii Daniv to produce and optimize content for the etavrian.com website.
Reviewed by
Andrii Daniv
Andrii Daniv is the founder and owner of Etavrian, a performance-driven agency specializing in PPC and SEO services for B2B and e‑commerce businesses.