Google Research and the UC Santa Cruz Genomics Institute released DeepPolisher on 6 August 2025. The open-source deep-learning tool halves residual errors in long-read genome assemblies and already supports the Human Pangenome Reference Consortium (HPRC) second data release.
DeepPolisher at a glance
The software uses an encoder-only Transformer to identify and correct miscalls in draft assemblies. In tests across 180 human genomes, it delivered substantial accuracy gains.
- Publication: paper in Genome Research, 6 Aug 2025.
- Code: open-source repository under the Apache 2.0 license.
- Error reduction: total errors drop by roughly 50 percent; indel errors fall by about 70 percent.
- Quality: average Q-score rises from Q66.7 to Q70.1 in benchmark regions.
- Training data: HG002 cell line from the Personal Genomes Project, certified by NIST and the NHGRI.
- Inputs: base calls, Phred quality, mapping confidence, phase labels, mismatch flags.
- Deployment: applied to 232 assemblies in the HPRC second release.
- Performance: runs on standard GPUs and polishes a human genome in under four hours.
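The Q-score figures above use the Phred scale, in which a score Q corresponds to a per-base error probability of 10^(-Q/10). As a quick sanity check, the minimal sketch below converts the two reported scores into error rates and computes the implied reduction. The function names are illustrative and are not part of DeepPolisher.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert a per-base error probability back to a Phred-scaled score."""
    return -10 * math.log10(p)


# Average Q-scores reported for the benchmark regions.
before_q, after_q = 66.7, 70.1

before_p = phred_to_error_rate(before_q)
after_p = phred_to_error_rate(after_q)

# Fraction of errors removed when moving from the first score to the second.
reduction = 1 - after_p / before_p

print(f"Q{before_q}: about one error per {1 / before_p:,.0f} bases")
print(f"Q{after_q}: about one error per {1 / after_p:,.0f} bases")
print(f"Implied error reduction: {reduction:.0%}")
```

A gain of 3.4 Q-score points works out to an error reduction of about 54 percent, which is consistent with the "roughly 50 percent" figure in the list.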
The tool integrates with existing long-read assembly workflows and outputs polished FASTA files ready for public release.
Why polishing matters
Short-read sequencers – largely developed by Illumina – provide high per-base accuracy but read only a few hundred bases at a time. Long-read platforms from Pacific Biosciences stretch reads to tens of thousands of bases, yet early versions carried error rates near 10 percent.
Pacific Biosciences lowered raw errors to about 1 percent with circular consensus sequencing. Google’s DeepConsensus later pushed residual errors below 0.1 percent, but draft assemblies still contained millions of mismatches and indels. Indels are especially damaging because a single inserted or deleted base can shift the reading frame of a gene.
DeepPolisher learns from multiple overlapping reads at each genomic position, correcting those final discrepancies. Assemblies polished with the model now approach reference-grade accuracy, supporting variant discovery, population studies and clinical genetics.
How it works
The Transformer ingests base calls, quality scores, mapping confidence and phase information, then outputs a corrected consensus sequence. By considering every k-mer context, the model recognises systematic errors that evade traditional tools.
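To make the input description concrete, here is a minimal sketch of how one assembly position and the reads covering it might be turned into numeric features. It is not DeepPolisher's actual encoding: the `ReadBase` record, the `encode_column` function, the channel order and the normalisation caps are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

BASES = "ACGT-"   # '-' marks a base deleted in the read relative to the assembly
MAX_BQ = 60       # assumed cap for base quality normalisation
MAX_MAPQ = 60     # assumed cap for mapping quality normalisation


@dataclass
class ReadBase:
    """One read's evidence at a single assembly position (hypothetical record)."""
    base: str             # base call, one of BASES
    base_quality: int     # Phred base quality
    mapping_quality: int  # mapping confidence of the read alignment
    haplotype: int        # phase label: 0 = unphased, 1 or 2 = haplotype


def encode_column(assembly_base: str, reads: list[ReadBase]) -> list[list[float]]:
    """Encode one pileup column as one row of five channels per read."""
    rows = []
    for read in reads:
        rows.append([
            BASES.index(read.base),                          # base call
            min(read.base_quality, MAX_BQ) / MAX_BQ,         # quality in [0, 1]
            min(read.mapping_quality, MAX_MAPQ) / MAX_MAPQ,  # mapping confidence in [0, 1]
            read.haplotype,                                  # phase label
            int(read.base != assembly_base),                 # mismatch flag
        ])
    return rows


column = encode_column(
    "A",
    [ReadBase("A", 40, 60, 1), ReadBase("G", 20, 30, 2)],
)
for row in column:
    print(row)
```

In a real model, many such columns would be stacked into a fixed-width window around each candidate error and passed to the network as a tensor.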
Unlike earlier neural methods that operate at the read level, DeepPolisher focuses on the assembly itself, making it agnostic to specific aligners or variant callers. The authors report that phased assemblies maintain accurate haplotype structure after polishing.
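Working at the assembly level means that the end product is a corrected contig sequence. One common way to get there, sketched below, is to apply a list of proposed edits to the draft sequence. The text above does not describe how DeepPolisher does this internally, so the `Edit` record and the `apply_edits` function are hypothetical, and the sketch assumes the edits do not overlap.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    """A proposed correction to a draft contig (hypothetical record)."""
    pos: int  # 0-based start position in the draft contig
    ref: str  # sequence currently in the draft at that position
    alt: str  # corrected sequence to put in its place


def apply_edits(contig: str, edits: list[Edit]) -> str:
    """Apply non-overlapping edits to a contig and return the polished sequence."""
    polished = contig
    # Work from right to left so that insertions and deletions do not
    # shift the coordinates of edits that have not been applied yet.
    for edit in sorted(edits, key=lambda e: e.pos, reverse=True):
        end = edit.pos + len(edit.ref)
        if polished[edit.pos:end] != edit.ref:
            raise ValueError(f"draft does not match ref at position {edit.pos}")
        polished = polished[:edit.pos] + edit.alt + polished[end:]
    return polished


draft = "ACGTTTGA"
edits = [
    Edit(2, "G", "C"),      # substitution
    Edit(3, "TTT", "TT"),   # one-base deletion
    Edit(7, "A", "AT"),     # one-base insertion
]
polished = apply_edits(draft, edits)
print(polished)  # ACCTTGAT
```

Checking each edit's `ref` against the draft before replacing it guards against applying corrections to the wrong contig or the wrong coordinates.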
Road ahead
DeepPolisher is already part of the HPRC pipeline and is expected to feature in plant reference genomes, biodiversity initiatives and other large projects. Google Research says the approach builds on its initial demonstration on a human genome and will continue to evolve alongside sequencing technology.
Researchers can download the tool from the code repository and review the methods in the Genome Research paper. Additional performance data are available on the Google Research blog.