As with previous releases, the new dataset is available for anonymous download from an Amazon Web Services S3 bucket. The bucket is part of the Open Data on AWS project, which enables sharing and analysis of a wide range of data.

The data is located in the bucket at:

s3://ont-open-data/gm24385_2020.11/

See the tutorials page for information on downloading the dataset.

Rebasecalling

The Guppy 5 basecalling for the GM24385 dataset was performed using version 5.0.6, driven by the same katuali analysis pipeline as for the initial dataset release. The new basecaller version was given as input the per-chromosome .fast5 files created in the initial pipeline via alignment of the Guppy 4.0.11 basecalls. This allows for easy comparison of results on subsets of the data, but may lead to subtle side-effects, since the reads rebasecalled for a chromosome are those that the Guppy 4.0.11 basecalls placed there. For example, the analysis data structure now contains entries of the form:

gm24385_2020.11/analysis/r9.4.1/{flowcell}/guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/{chromosome}/guppy_v5.0.6_r9.4.1_sup_prom/
├── align_unfiltered
│   ├── align_to_ref.log
│   ├── basecall_stats.log
│   ├── calls2ref.bam
│   ├── calls2ref.bam.bai
│   └── calls2ref_stats.txt
├── pass
├── fail
├── basecalls.fastq.gz
└── sequencing_summary.txt

The file basecalls.fastq.gz contains the reads passing Qscore filtering from the new Guppy version (the pass and fail subfolders contain the reads split by this criterion). As in the main folder structure, the align_unfiltered directory contains unfiltered alignments of the basecalls to the reference sequence (calls2ref.bam), along with text files summarizing the properties of the alignments.
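The directory template above can be filled in programmatically when iterating over flowcells and chromosomes. The following is a minimal sketch: the function name `rebasecall_uri` and the example flowcell and chromosome labels are hypothetical, and only the bucket name and template come from the text.

```python
# Bucket and directory template as given in the text above.
BUCKET_PREFIX = "s3://ont-open-data"
TEMPLATE = (
    "gm24385_2020.11/analysis/r9.4.1/{flowcell}/"
    "guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/{chromosome}/"
    "guppy_v5.0.6_r9.4.1_sup_prom/"
)


def rebasecall_uri(flowcell, chromosome, filename=""):
    """Return the S3 URI of a rebasecalled directory, or of a file within it.

    `flowcell` and `chromosome` must match the names used in the bucket;
    the values used below are placeholders, not real identifiers.
    """
    prefix = TEMPLATE.format(flowcell=flowcell, chromosome=chromosome)
    return f"{BUCKET_PREFIX}/{prefix}{filename}"


# Hypothetical flowcell ID and chromosome label, for illustration only.
uri = rebasecall_uri("FLOWCELL_ID", "chr20", "basecalls.fastq.gz")
print(uri)
```

The resulting URI can be passed to whichever S3 client is used for the download; the tutorials page describes the recommended procedure.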

Accuracy improvement from the new basecaller

Guppy 5.0.6 implements the CRF-CTC models developed in the research-grade bonito basecaller. The previous Guppy 4.0.11 basecalls leveraged the “flip-flop” algorithm developed in the flappie research caller. For this dataset we have chosen to use the highest-accuracy “super” basecalling model. This is an optional model now available in Guppy, replacing the “high” accuracy line as the model providing state-of-the-art accuracy. The “high” accuracy line remains the default, standard choice, providing a good balance of accuracy and compute performance.

To compare the single-read accuracy of the two basecallers, we used the alignment summaries produced by the katuali pipeline. The two alignment summary tables were joined on read ID to allow a simple before-and-after comparison of all reads:

Pairwise comparison of single-read accuracies for Guppy 5.0.6 and previous Guppy 4.0.11 basecalling. The majority of points lie above the black diagonal line, indicating an improvement in call quality.
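The join on read ID can be sketched with the Python standard library. This is an illustration rather than the code used for the figure: it assumes tab-separated summary tables with a read ID column called `name` and an accuracy column called `acc`, and the two small tables are made up for the example.

```python
import csv
import io


def load_accuracy(handle, id_col="name", acc_col="acc"):
    """Map read ID -> accuracy from a tab-separated summary table.

    The column names are assumptions; adjust them to the actual header.
    If a read has several rows (e.g. multiple alignments), the last wins.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[acc_col]) for row in reader}


def join_on_read_id(old, new):
    """Inner join: keep only reads present in both tables."""
    return {rid: (old[rid], new[rid]) for rid in old.keys() & new.keys()}


# Made-up tables standing in for two calls2ref_stats.txt files.
old_txt = "name\tacc\nread1\t94.0\nread2\t95.5\nread3\t90.0\n"
new_txt = "name\tacc\nread1\t96.1\nread2\t97.0\nread4\t98.0\n"

old = load_accuracy(io.StringIO(old_txt))
new = load_accuracy(io.StringIO(new_txt))
paired = join_on_read_id(old, new)

# Count reads whose accuracy increased after rebasecalling.
improved = sum(1 for before, after in paired.values() if after > before)
print(f"{improved} of {len(paired)} paired reads improved")
```

An inner join is used so that reads appearing in only one table (for example, a read that no longer aligns) do not enter the comparison.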

The plot shows a general improvement in read accuracy of around 1.8 percentage points, or a 32% reduction in the error rate. To see the change in the average behaviour, it is perhaps clearer to plot a density estimate of the read accuracy distribution:

Guppy 5.0.6 reduces read error rate by 32% compared with previous Guppy 4.0.11 basecalling.
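The two figures quoted above (a gain of about 1.8 percentage points and a 32% reduction in error rate) are related by simple arithmetic. The sketch below back-calculates the error rates they jointly imply; these implied values are derived here for illustration and are not stated in the original analysis.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fractional reduction in error rate, with accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


def implied_error_rates(gain_pp, reduction):
    """Error rates (percent) implied by an absolute gain and a relative reduction.

    If the error falls by `gain_pp` points and that is a fraction `reduction`
    of the starting error, the starting error is gain_pp / reduction.
    """
    err_before = gain_pp / reduction
    return err_before, err_before - gain_pp


err_before, err_after = implied_error_rates(1.8, 0.32)
print(f"implied error before: {err_before:.2f}%, after: {err_after:.2f}%")

# Round trip: the implied accuracies reproduce the quoted reduction.
check = relative_error_reduction(100.0 - err_before, 100.0 - err_after)
print(f"relative reduction: {check:.2f}")
```

Under these assumptions the quoted numbers correspond to an error rate falling from roughly 5.6% to roughly 3.8%.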

Changes to Qscore filtering

One consideration with the newer Guppy basecaller is that the default Qscore threshold for partitioning “pass” and “fail” reads was amended in Guppy v4.5.2 (released 07/04/2021). The rebasecalling here started from reads classed as “pass” by Guppy v4.0.11. It is expected, and observed, that after rebasecalling a proportion of the data is now classified as “fail”. With the improvements to the basecaller, the effect is to remove between 10% and 15% of the data. Note that the data removed is the lower-accuracy data, and that end users are free to use both the pass and fail reads.

Guppy v4.5.2 brought an increased Qscore threshold for determining pass and fail reads. This is carried into Guppy v5.0.6.
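To make the effect of a raised threshold concrete, the sketch below classifies a read as pass or fail from its FASTQ quality string. It assumes Phred+33 encoding and averages in error-probability space, a common convention; whether this matches Guppy's exact calculation is an assumption, and the threshold values used are hypothetical, not the Guppy defaults.

```python
import math


def mean_qscore(quality_string, offset=33):
    """Mean Qscore of a read, averaged over per-base error probabilities."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def classify(quality_string, threshold):
    """Return 'pass' if the mean Qscore meets the threshold, else 'fail'."""
    return "pass" if mean_qscore(quality_string) >= threshold else "fail"


# '+' encodes Q10 and '5' encodes Q20 in Phred+33.
read_quality = "+5+5"
q = mean_qscore(read_quality)

# Hypothetical lower and higher thresholds, for illustration only.
lower, higher = 9, 13
print(f"mean Q = {q:.2f}: {classify(read_quality, lower)} at {lower}, "
      f"{classify(read_quality, higher)} at {higher}")
```

The same read can therefore move from pass to fail purely because the threshold changed, which is why a proportion of previously passing data is reclassified after rebasecalling.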

Further information

The 5.0.6 version of Guppy used here is a development release candidate version of the software. The final release version will provide very similar results to the version used here. Further details of the final release are available in the Nanopore Community pages.

We hope that these data and analyses provide a useful resource to the community.