We are pleased to announce the addition of Guppy 5 basecalling results to our short-read eliminator (SRE) and ultra-long (ULK) GM24385 dataset.
As with previous releases, the new dataset is available for anonymous download from an Amazon Web Services S3 bucket. The bucket is part of the Open Data on AWS project, which enables sharing and analysis of a wide range of data.
The data is located in the bucket at:
s3://ont-open-data/gm24385_2020.11/
See the tutorials page for information on downloading the dataset.
The Guppy 5 basecalling for the GM24385 dataset was performed using version 5.0.6, driven by the same katuali analysis pipeline as for the initial dataset release. The new basecaller version was given as input the per-chromosome .fast5 files created in the initial pipeline via alignment of the Guppy 4.0.11 basecalls. This allows for easy comparison of results on subsets of the data (but may lead to subtle side-effects). For example, the analysis data structure now contains entries of the form:
gm24385_2020.11/analysis/r9.4.1/{flowcell}/guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/{chromosome}/guppy_v5.0.6_r9.4.1_sup_prom/
├── align_unfiltered
│   ├── align_to_ref.log
│   ├── basecall_stats.log
│   ├── calls2ref.bam
│   ├── calls2ref.bam.bai
│   └── calls2ref_stats.txt
├── pass
├── fail
├── basecalls.fastq.gz
└── sequencing_summary.txt
The file basecalls.fastq.gz contains the reads passing Qscore filtering from the new Guppy version (the pass and fail subfolders contain the reads split by this criterion). Similar to the main folder structure, the align_unfiltered directory contains unfiltered alignments of the basecalls to the reference sequence (calls2ref.bam) along with text files summarizing the properties of the alignments.
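As an illustration of the layout above, the following minimal sketch builds the S3 prefix for one flowcell and chromosome and assembles an anonymous AWS CLI sync command. The function names, the flowcell placeholder and the chromosome label are hypothetical; only the bucket, dataset name and directory names are taken from this post.

```python
from posixpath import join

BUCKET = "s3://ont-open-data"
DATASET = "gm24385_2020.11"


def guppy5_prefix(flowcell: str, chromosome: str) -> str:
    """Return the S3 prefix holding Guppy 5 results for one flowcell/chromosome."""
    return join(
        BUCKET, DATASET, "analysis", "r9.4.1", flowcell,
        "guppy_v4.0.11_r9.4.1_hac_prom", "align_unfiltered", chromosome,
        "guppy_v5.0.6_r9.4.1_sup_prom",
    ) + "/"


def sync_command(prefix: str, destination: str) -> list:
    """Build (but do not run) an anonymous `aws s3 sync` command."""
    # --no-sign-request lets the AWS CLI read public buckets without credentials.
    return ["aws", "s3", "sync", "--no-sign-request", prefix, destination]


# "FLOWCELL_ID" and "chr20" are placeholders, not names confirmed by the dataset.
prefix = guppy5_prefix("FLOWCELL_ID", "chr20")
print(" ".join(sync_command(prefix, "guppy5_chr20")))
```

The command is only printed here; passing the list to subprocess.run would perform the download, assuming the AWS CLI is installed.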
Guppy 5.0.6 implements the CRF-CTC models developed in the research-grade bonito basecaller. The previous Guppy 4.0.11 basecalls leveraged the “flip-flop” algorithm developed in the flappie research caller. For this dataset we have chosen to use the highest accuracy “super” basecalling model. This is an optional model now available in Guppy, which replaces the “high” accuracy line as the model providing state-of-the-art accuracy. The “high” accuracy line remains the default, standard choice, providing a good balance of accuracy and compute performance.
In order to compare the single-read accuracy of the two basecallers, we used the alignment summaries produced by the katuali pipeline. The two alignment summary tables were joined on read ID to allow simple before and after comparison of all reads:
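The join on read ID can be sketched with the Python standard library alone. This is an illustrative sketch, not the analysis code used for the post: the column names `name` and `acc`, the tab delimiter and the tiny inline tables are assumptions standing in for two real alignment summary files, and the sketch assumes one row per read.

```python
import csv
import io


def load_accuracy(handle, id_col="name", acc_col="acc"):
    """Map read ID -> accuracy from a tab-separated alignment summary."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[acc_col]) for row in reader}


def join_on_read_id(old, new):
    """Inner join: keep only reads present in both tables."""
    return {rid: (old[rid], new[rid]) for rid in old.keys() & new.keys()}


# Made-up example tables; real input would be two alignment summary files.
old_txt = "name\tacc\nread1\t94.0\nread2\t95.5\nread3\t93.0\n"
new_txt = "name\tacc\nread1\t96.0\nread2\t97.0\nread4\t98.0\n"

old = load_accuracy(io.StringIO(old_txt))
new = load_accuracy(io.StringIO(new_txt))
paired = join_on_read_id(old, new)

# Per-read change in accuracy (after minus before), in percentage points.
deltas = {rid: after - before for rid, (before, after) in paired.items()}
```

Reads that appear in only one table (read3 and read4 in the toy data) drop out of the comparison, which is the behaviour wanted for a before-and-after view of the same reads.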
The plot shows a general improvement in read accuracy of around 1.8 percentage points, or a 32% reduction in the error rate. To see the change in the average behaviour it is perhaps clearer to plot a density estimate of the read accuracy distribution:
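The relationship between a gain in percentage points and a relative reduction in error rate can be checked with a little arithmetic. In the sketch below the function names are hypothetical, and the baseline error of about 5.6% is only what the two quoted figures imply when taken together; it is not a value reported in this post.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fractional drop in error rate, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


def implied_baseline_error(gain_points: float, reduction: float) -> float:
    """Error rate (percent) before the change, implied by the two summary figures."""
    return gain_points / reduction


# 1.8 points gained at a 32% relative reduction implies roughly 5.6% error before.
baseline = implied_baseline_error(1.8, 0.32)
check = relative_error_reduction(100.0 - baseline, 100.0 - baseline + 1.8)
```

Here `check` recovers the 32% figure to within floating-point rounding, confirming that the two summary numbers are consistent with each other.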
One consideration with the newer Guppy basecaller is that the default Qscore threshold for partitioning “pass” and “fail” reads was amended in Guppy v4.5.2 (released 07/04/2021). The rebasecalling here started from reads classed as “pass” by Guppy v4.0.11. It is expected and observed that after rebasecalling a proportion of the data is now classified as failed. Even with the improvements to the basecaller, the effect is to remove between 10% and 15% of the data. It should be noted that the data removed is of course the lower accuracy data, and that end users are free to use both the pass and fail reads.
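The effect of moving the threshold can be illustrated with a small sketch that splits reads on their mean Qscore. The read names, Qscore values and both thresholds below are made up for illustration and are not Guppy's actual defaults; real per-read values would come from a file such as sequencing_summary.txt.

```python
def split_pass_fail(read_qscores: dict, threshold: float):
    """Partition reads into pass and fail sets by mean Qscore."""
    passed = {r: q for r, q in read_qscores.items() if q >= threshold}
    failed = {r: q for r, q in read_qscores.items() if q < threshold}
    return passed, failed


def fraction_failed(read_qscores: dict, threshold: float) -> float:
    """Share of reads falling below the threshold."""
    _, failed = split_pass_fail(read_qscores, threshold)
    return len(failed) / len(read_qscores)


# Hypothetical mean Qscores for four reads.
reads = {"a": 12.1, "b": 8.4, "c": 9.9, "d": 15.0}

# Illustrative thresholds only: a lower one and a stricter one.
lenient = fraction_failed(reads, 7.0)
strict = fraction_failed(reads, 10.0)
```

Raising the threshold moves the lowest-scoring reads from pass to fail without changing the reads themselves, which is why both sets remain usable.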
The 5.0.6 version of Guppy used here is a development release candidate version of the software. The final release version will provide very similar results to the version used here. Further details of the final release are available in the Nanopore Community pages.
We hope that these data and analyses provide a useful resource to the community.