DeepPolisher halves genome errors - see how Google's Transformer rewrites DNA accuracy

Reviewed by: Andrii Daniv
2 min read · Aug 7, 2025

Google Research and the UC Santa Cruz Genomics Institute released DeepPolisher on 6 August 2025. The open-source deep-learning tool halves residual errors in long-read genome assemblies and already supports the Human Pangenome Reference Consortium (HPRC) second data release.

DeepPolisher at a glance

The software uses an encoder-only Transformer to identify and correct miscalls in draft assemblies. In tests across 180 human genomes, it delivered substantial accuracy gains.

  • Publication: Paper in Genome Research, 6 Aug 2025.
  • Code: Apache 2.0 Code Repo.
  • Error reduction: total errors drop by roughly 50 percent; indels fall 70 percent.
  • Quality: average Q-score rises from Q66.7 to Q70.1 in benchmark regions.
  • Training data: HG002 cell line from the Personal Genomes Project, certified by NIST and the NHGRI.
  • Inputs: base calls, Phred quality, mapping confidence, phase labels, mismatch flags.
  • Deployment: applied to 232 assemblies in the HPRC second release.
  • Performance: runs on standard GPUs and polishes a human genome in under four hours.
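The quality figures above can be sanity-checked with the standard Phred formula, Q = -10 log10(P). A minimal sketch (the function names are illustrative, not from the DeepPolisher codebase) shows that moving from Q66.7 to Q70.1 corresponds to removing roughly half of the residual errors, matching the reported 50 percent reduction:

```python
import math

def q_to_error_rate(q: float) -> float:
    """Convert a Phred quality score Q to a per-base error probability."""
    return 10 ** (-q / 10)

def error_reduction(q_before: float, q_after: float) -> float:
    """Fraction of errors removed when quality improves from q_before to q_after."""
    return 1 - q_to_error_rate(q_after) / q_to_error_rate(q_before)

print(f"Q66.7 -> {q_to_error_rate(66.7):.2e} errors per base")   # ~2.1e-7
print(f"Q70.1 -> {q_to_error_rate(70.1):.2e} errors per base")   # ~9.8e-8
print(f"{error_reduction(66.7, 70.1):.0%} of residual errors removed")
```

A gain of 3.4 Phred points always removes the same fraction of errors (1 - 10^-0.34 ≈ 54 percent), regardless of the starting quality.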

The tool integrates with existing long-read assembly workflows and outputs polished FASTA files ready for public release.

Why polishing matters

Short-read sequencers – largely developed by Illumina – provide high per-base accuracy but read only a few hundred bases at a time. Long-read platforms from Pacific Biosciences stretch reads to tens of thousands of bases, yet early versions carried error rates near 10 percent.

Pacific Biosciences lowered raw errors to about 1 percent with circular consensus sequencing. Google’s DeepConsensus later pushed residual errors below 0.1 percent, but draft assemblies still contained millions of mismatches and indels that disrupt the reading frame.

DeepPolisher learns from multiple overlapping reads at each genomic position, correcting those final discrepancies. Assemblies polished with the model now approach reference-grade accuracy, supporting variant discovery, population studies and clinical genetics.
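The idea of resolving each position from its overlapping reads can be illustrated with a naive majority vote over pileup columns. This is only a sketch of the input shape: DeepPolisher replaces the vote with a learned Transformer that also weighs qualities and phase, which is how it catches systematic errors a simple vote would repeat.

```python
from collections import Counter

def naive_consensus(pileup_columns):
    """Majority-vote consensus over aligned read columns.

    Each column holds the bases from every read covering one genomic
    position; the most common base wins. DeepPolisher's model consumes
    the same per-position evidence but learns the correction instead.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in pileup_columns)

# Three reads covering five positions; the second read has one miscall (T).
columns = [("A", "A", "A"), ("C", "C", "C"), ("G", "T", "G"),
           ("T", "T", "T"), ("A", "A", "A")]
print(naive_consensus(columns))  # ACGTA
```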

How it works

The Transformer ingests base calls, quality scores, mapping confidence and phase information, then outputs a corrected consensus sequence. By considering every k-mer context, the model recognises systematic errors that evade traditional tools.
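A rough sketch of how the listed inputs could be packed into a per-position feature vector for such a model. The encoding below (one-hot base, scaled qualities, phase bit) and all names are assumptions for illustration, not DeepPolisher's actual featurization:

```python
import numpy as np

BASES = "ACGT"

def encode_position(base: str, phred_q: int, mapq: int, phase: int) -> np.ndarray:
    """Encode one pileup position as a feature vector: one-hot base,
    base quality, mapping confidence, and haplotype phase label.
    Qualities are scaled to [0, 1] so features share a comparable range."""
    one_hot = np.zeros(4)
    one_hot[BASES.index(base)] = 1.0
    return np.concatenate([one_hot, [phred_q / 60.0, mapq / 60.0, float(phase)]])

# A three-position window; the low-quality G is the kind of call a model might fix.
window = np.stack([
    encode_position("A", 40, 60, 0),
    encode_position("C", 35, 60, 0),
    encode_position("G", 12, 30, 1),
])
print(window.shape)  # (3, 7)
```

A window of such vectors is the natural input to an encoder-only Transformer, which attends across positions and emits a corrected base per column.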

Unlike earlier neural methods that operate at the read level, DeepPolisher focuses on the assembly itself, making it agnostic to specific aligners or variant callers. The authors report that phased assemblies maintain accurate haplotype structure after polishing.

Road ahead

DeepPolisher is already part of the HPRC pipeline and is expected to feature in plant reference genomes, biodiversity initiatives and other large projects. Google Research says the approach builds on its earlier demonstration on a human genome and will continue to evolve alongside sequencing technology.

Researchers can download the tool from the Code Repo and review the methods in the Paper. Additional performance data are available on the Google Research blog.

Author
Andrii Daniv
Andrii Daniv is the founder and owner of Etavrian, a performance-driven agency specializing in PPC and SEO services for B2B and e‑commerce businesses.