DeepPolisher Halves Genome Assembly Errors - See How Google’s Tool Does It

Google Research's Transformer-based tool DeepPolisher halves human genome assembly errors and cuts indel errors by 70 percent. See how the open-source model boosts Q-scores.

On 6 August 2025, Google Research introduced DeepPolisher, an open-source deep learning tool that reduces base-level errors in human genome assemblies by about 50 percent. Technical lead Kishwar Shafin and product lead Andrew Carroll outlined the release in a Google Research blog post from Mountain View, California.

What DeepPolisher Does

The software applies a Transformer model to correct draft assemblies generated from Pacific Biosciences long reads. The model ingests base calls, quality scores, mapping confidence and mismatch flags as separate channels, then predicts the most probable true sequence.
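
To make the idea of separate input channels concrete, here is a minimal sketch of how reads aligned to a short window of a draft assembly could be turned into four parallel matrices, one per signal named above. This is an illustration only, not DeepPolisher's actual input format: the `AlignedRead` type, the `encode_window` function, the channel names and the integer base codes are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical integer codes for bases; "-" marks a gap in the alignment.
BASE_CODES = {"-": 0, "A": 1, "C": 2, "G": 3, "T": 4}


@dataclass
class AlignedRead:
    bases: str        # read bases aligned to the window, "-" for a gap
    quals: list[int]  # per-base Phred quality scores
    mapq: int         # mapping quality of the whole read


def encode_window(draft: str, reads: list[AlignedRead]) -> dict[str, list[list[int]]]:
    """Return one reads-by-positions matrix per channel (illustrative layout)."""
    channels: dict[str, list[list[int]]] = {
        "base": [],
        "base_quality": [],
        "mapping_quality": [],
        "mismatch": [],
    }
    for read in reads:
        if len(read.bases) != len(draft) or len(read.quals) != len(draft):
            raise ValueError("each read must be aligned to the full window")
        channels["base"].append([BASE_CODES[b] for b in read.bases])
        channels["base_quality"].append(list(read.quals))
        # Mapping quality is a per-read value, so it is repeated at every position.
        channels["mapping_quality"].append([read.mapq] * len(draft))
        # Flag positions where the read disagrees with the draft assembly.
        channels["mismatch"].append([int(b != d) for b, d in zip(read.bases, draft)])
    return channels


draft = "ACGT"
reads = [
    AlignedRead("ACGT", [40, 40, 40, 40], 60),
    AlignedRead("ACTT", [40, 38, 12, 40], 60),  # disagrees with the draft at position 2
]
features = encode_window(draft, reads)
print(features["mismatch"])
```

Keeping each signal in its own matrix lets a model weigh, for example, a low-quality mismatch differently from a high-quality one, which is the general motivation for multi-channel encodings of this kind.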

The project is a joint effort between Google Research and the UC Santa Cruz Genomics Institute and is released under the Apache 2.0 license.

Key Performance Results

  • Total assembly errors cut in half in initial benchmarks.
  • Insertion-deletion (indel) mistakes fall by 70 percent, a critical improvement because indels can shift the reading frame of protein-coding genes.
  • Average assembly quality rises from Q66.7 to Q70.1. On the Phred scale, Q70 corresponds to roughly 1 error in every 10 million bases, compared with about 1 in every 4.7 million at Q66.7.
  • Training used chromosomes 1–19 of the Personal Genomes Project HG002 sample, while chromosomes 20–22 were held out for validation.
  • The May 2025 production model polished 232 assemblies for the Human Pangenome Reference Consortium.
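
The Q-score figures in the list above can be checked with the standard Phred formula, error rate = 10^(-Q/10). The short sketch below applies it to the two reported averages; the function names are illustrative and not part of DeepPolisher.

```python
def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def bases_per_error(q: float) -> float:
    """Expected number of bases between errors at a given Phred score."""
    return 1 / phred_to_error_rate(q)


before = phred_to_error_rate(66.7)  # about 2.1e-7
after = phred_to_error_rate(70.1)   # about 9.8e-8
reduction = 1 - after / before      # about 0.54

print(f"Q66.7: one error per {bases_per_error(66.7):,.0f} bases")
print(f"Q70.1: one error per {bases_per_error(70.1):,.0f} bases")
print(f"Relative error reduction: {reduction:.0%}")
```

A gain of 3.4 Q-score points therefore works out to a reduction of a little over half in the per-base error rate, which is consistent with the reported halving of total assembly errors.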

Why Genome Polishing Matters

Short-read sequencers, most commonly from Illumina, achieve high accuracy but read only a few hundred bases at a time. Long-read platforms span tens of thousands of bases yet leave residual errors that complicate gene annotation and structural-variant discovery.

Earlier pipelines pushed error rates below 0.1 percent, which still left thousands of mistakes in a single human genome. Google and PacBio previously reduced single-pass errors with DeepConsensus, the first demonstration of that approach on a human genome. DeepPolisher builds on that work by training on the well-characterized HG002 reference, validated by NIST and the NHGRI.

Using phased data allowed the model to learn maternal and paternal haplotype differences, improving correction accuracy across the genome.

Availability

Source code, pre-trained models and documentation are available in a public code repository. A peer-reviewed Genome Research paper details the architecture and benchmarking methodology.

The training sample comes from the Personal Genomes Project, providing transparent access to raw data for further study.
