Basic QC Tutorial

Expected duration: 15 minutes

The Summary statistics and QC tutorial is intended as a functional guide to help assess the quality characteristics of a single Nanopore sequencing run. It aims to enable an objective assessment of the performance of a Nanopore flowcell run and of the sequence characteristics used to benchmark quality.

Sufficient information is provided in the tutorial that the workflow can be tested, validated and replicated. The tutorial is provided with an example dataset from a barcoded sequence library. The tutorial is intended to address important questions:

  • How many reads (and how many gigabases) were sequenced?
  • What fraction of my sequence collection is good quality?
  • How are longer sequence reads represented in my sample?

Methods used in this tutorial include:

  • Python for statistical analysis and reporting, including use of pandas, numpy, and bokeh,
  • the sequencing_summary.txt file from MinKNOW or Guppy as the data source.

Computational requirements for this tutorial include:

  • Computer running the EPI2ME Labs notebook Server
  • At least 8 GB RAM
  • Runtime with provided example data - approximately 10 minutes

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction¶

This tutorial aims to summarise the data characteristics from an Oxford Nanopore Technologies sequencing run. Observations from basecalled reads and their quality characteristics, temporal performance, and barcoded content are presented. The information presented is derived solely from the sequencing_summary.txt file produced during basecalling with the Guppy software.

The goals from this tutorial include:

  • To introduce a literate framework for analysing base-calling summary statistics to evaluate the relative performance of runs.

  • To provide basic QC metrics so that a review and consideration of experimental data can be undertaken.
  • To provide training as to which QC metrics are of most interest and to encourage an understanding of how different aspects of sequence data quality can be attributed to sample characteristics from DNA isolation and library preparation.

The steps of this tutorial will be:

  • Basic QC analysis - including number of reads, length of reads, yield and quality scores.
  • Performance of a sequencing run through time - including sequencing throughput, speed and fragment length.
  • Sequencing channel activity - a way to visualise nanopore channel productivity.

The sequencing_summary.txt file¶

The sequencing_summary.txt file is automatically produced during base-calling with the Guppy software. This summary file contains rich metadata for each sequence read produced during a run. These data include timestamp, pore duration, read quality, and channel information, in addition to the characteristics of the resulting DNA sequence. The workflow presented here uses this summary file rather than the raw FAST5 format data for performance reasons.

Tools such as pomoxis utilise the fastq files for quality metrics, and other tools make extensive use of the fast5 files. Parsing the fast5 files provides additional analytical context but is much more demanding in terms of compute resource and time. This tutorial is lightweight and is intended to run within a few minutes on a desktop computer.
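If you are unsure which metadata columns a particular summary file contains (they can vary between MinKNOW and Guppy versions), a lightweight check is to parse just the first few rows. This is a minimal sketch assuming the sample file downloaded in the next section; substitute your own filepath as required:

In [ ]:
import pandas as pd

# peek at the first rows only; pandas decompresses .bz2 automatically
preview = pd.read_csv(
    'lambda_sequencing_summary.txt.bz2', sep='\t', nrows=5)
print(preview.columns.tolist())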

Basic QC analysis and results¶

To start analysing our experiment we must first collate our data. The workflow below expects to be given a single sequencing summary file as output by the Oxford Nanopore Technologies' sequencing device.

Before anything else we will create and set a working directory:

In [9]:
from epi2melabs import ping
pinger = ping.Pingu()
pinger.send_notebook_ping('start', 'basic_qc_tutorial')

# create a work directory and move into it
tutorial_name = "basicqc_tutorial"
working_dir = '/epi2melabs/{}/'.format(tutorial_name)
!mkdir -p "$working_dir"
%cd "$working_dir"
/epi2melabs/basicqc_tutorial

Sample Data¶

To start analysing our sequencing_summary.txt we must first "read in" the data. For this tutorial we will download a sample file of a Lambda phage sequencing run.

To download the sample file we run the Linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

In [ ]:
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)

!wget -O lambda_sequencing_summary.txt.bz2 \
    "$site/basicqc_tutorial/lambda_sequencing_summary.txt.bz2"

Using your own data¶

If you wish to analyse your own data rather than the sample data, you can edit the input filepath in the form below. To find the correct full path of a file, navigate to it in the Files browser on the left-hand side, right-click on the file and select Copy path.


The location shared with the EPI2ME Labs server from your computer is shown as /epi2melabs. For example, if the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass.
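Before entering a path into the form below, it can be worth confirming that the notebook server can actually see it. A minimal sketch, using the hypothetical mapped path from the example above:

In [ ]:
import os

# hypothetical example path; substitute the mapped path to your own data
path = '/epi2melabs/my_gridion_run/fastq_pass'
print(path, 'exists' if os.path.exists(path) else 'not found')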

Data entry¶

Having downloaded the sample data, or located your own data in the file browser, we need to provide the filepath as input to the notebook.

The form below reads the sequencing summary file using the pandas library. The form can be used to enter the filename of your sequencing summary file.

If you want simply to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select Run after from the Runtime menu above after entering your filepath.

In [1]:
import os
import pandas as pd

import aplanat
import aplanat.report
import aplanat.graphics
from epi2melabs.notebook import InputForm, InputSpec

import ipywidgets as widgets

seq_summary = None
exec_summary = None
report = None


def process_form(inputs):
    summary_file = os.path.abspath(inputs.summary_file)
    global seq_summary, exec_summary, report
    # Read the file, use a tab ('\t') as the separator between columns
    print("Reading input file...")
    seq_summary = pd.read_csv(
        summary_file, delimiter='\t',
        usecols=[
            'channel', 'start_time', 'duration', 'passes_filtering',
            'sequence_length_template', 'mean_qscore_template'])
    # ensure the file is sorted by time and reset its
    # indexing after doing so
    seq_summary.sort_values('start_time', inplace=True)
    seq_summary.reset_index(drop=True, inplace=True)
    print("Done.")

    exec_summary = aplanat.graphics.InfoGraphItems()
    report = aplanat.report.HTMLReport(
        "Basic QC Tutorial", "EPI2ME Labs Summary")
    report.markdown("## Excutive summary", key='exec-head')
    report.plot(None, 'exec-plot')  # placeholder


inputs = InputForm(
    InputSpec('summary_file', 'Sequencing summary', 'lambda_sequencing_summary.txt.bz2'),)
inputs.add_process_button(process_form)
inputs.display()

The variable seq_summary is now a pandas dataframe object. Having read the file we can take a peek at the start of the data using the .head() method of the dataframe object:

In [11]:
# Use the .head() method to check the contents of the dataframe 
seq_summary.head()
Out[11]:
channel start_time duration passes_filtering sequence_length_template mean_qscore_template
0 154 8.71150 10.82900 True 3704 11.429653
1 323 9.33300 10.20775 True 3437 10.364430
2 216 9.34650 10.19425 True 3635 11.148887
3 403 9.41975 10.12100 True 3479 10.644968
4 412 9.54400 9.99675 True 3613 10.794808

Numbers of reads¶

In order to count the number of reads sequenced in the experiment, we can ask the dataframe how many rows it contains:

In [12]:
# Show the total number of reads
print("The total number of reads is:", len(seq_summary))

# Group the rows of the dataframe by the value of the `passes_filtering` column
# then ask the size of each group.
pass_fail = seq_summary.groupby('passes_filtering').size()

print("The number of pass reads is:", pass_fail[True])
print("The number of fail reads is:", pass_fail[False])

exec_summary.append('Total reads', pass_fail.sum(), 'angle-up', '')
The total number of reads is: 975715
The number of pass reads is: 814554
The number of fail reads is: 161161

As a more interesting way of summarising these values we can use the aplanat library to create a stacked bar chart. The aplanat library provides templates to create common plots using the bokeh library:

In [13]:
# Plotting pass and fail reads (click play)
import aplanat
from aplanat import bars

pass_percentage = 100 * pass_fail[True] / len(seq_summary)
values = [pass_percentage, 100 - pass_percentage]
classes = ['Pass', 'Fail']
colors = ['#54B8B1', '#EF4135']

plot = bars.single_hbar(
    values, classes, colors,
    title="Pass / Fail reads")
plot.xaxis.axis_label = 'Percentage of reads'
aplanat.show(plot, background="#f4f4f4")

# add item to report
report.markdown("""
## Pass/fail classification

The run contained {} pass and {} fail reads for a total of {} reads.
""".format(pass_fail[True], pass_fail[False], len(seq_summary)),
    key="pass-fail-head")
report.plot(plot)

Read Lengths and Yield¶

Moving on from counting reads, let us examine the length of reads in this experiment.

We will first calculate the mean read length and the N50 read length:

In [14]:
# Calculating mean and N50 read length

# take the `sequence_length_template` column (the read lengths)
# and sort them in descending order
sorted_lengths = seq_summary.sequence_length_template.sort_values(ascending=False).reset_index(drop=True)
# calculate the cumulative sum of lengths
cumulative_length = sorted_lengths.cumsum()
# find the total number of bases and the mean length
total_bases = cumulative_length.iloc[-1]
mean_length = total_bases / len(seq_summary)
# find the N50 length: the read length such that reads of this
# length or longer contain half the total bases
n50_index = cumulative_length.searchsorted(total_bases / 2)
n50_length = sorted_lengths.iloc[n50_index]

exec_summary.append('Total yield', total_bases, 'signal', 'b')
exec_summary.append('Mean read length', mean_length, 'align-center', 'b')
exec_summary.append('N50 read length', n50_length, 'align-left', 'b')

report.markdown("""
## Read Lengths and yield

The following graphics depict the read-length distribution in a variety
of ways, see the tutorial for additional details.
""",
    key='length-yield-head')
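To make the N50 logic above concrete, here is a toy verification on five reads (values chosen purely for illustration):

In [ ]:
import pandas as pd

# five toy read lengths, already sorted in descending order
toy = pd.Series([10, 8, 5, 4, 3])
csum = toy.cumsum()                          # 10, 18, 23, 27, 30
idx = csum.searchsorted(csum.iloc[-1] / 2)   # first index reaching 15 of 30 bases
print('Toy N50:', toy.iloc[idx])             # 8: reads of >= 8 bases hold half the yield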

These values will be used below to annotate the plots. We can depict the read length distribution in several ways; the simplest is a histogram:

In [15]:
# Plotting the read length distribution (click play)
# calculate counts for a histogram using numpy
from aplanat import hist, annot

datas = [seq_summary.sequence_length_template]
plot = hist.histogram(datas, bins=400, title="Read length distribution", colors=['#0084A9'])
# add vertical lines for mean and N50 read length
annot.marker_vline(plot, mean_length, 'Mean: {:.0f}'.format(mean_length))
annot.marker_vline(plot, n50_length, 'N50: {}'.format(n50_length), text_baseline='top')
# add axis labels
plot.xaxis.axis_label = 'Read Length / bases'
plot.yaxis.axis_label = 'Number of reads'
# show the plot
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, key='length-plain')

The distribution of sequence lengths will be dependent on the protocols that have been used to extract and/or prepare the starting DNA. Sequences from amplicon DNA will have a tight distribution of read lengths, while sequences from genomic DNA will have a broader distribution. The distribution will be further influenced if a size-selection step has been used, and will also be dependent on the choice of sequencing library preparation kits. The read-length distribution should be assessed to see if the distribution is concordant with that expected. For the sample dataset, the graph depicts a spread of read lengths with a strong peak in the distribution at around 48 kbases, the length of the genome under examination.

A second way to visualise the read length distribution is as a mass distribution, or a weighted histogram. This method most similarly portrays the intensities that would be seen in an electrophoresis experiment.

In [16]:
# Plotting read-length mass distribution (click play)
from aplanat import hist, annot

datas = [seq_summary.sequence_length_template]
weights = [seq_summary.sequence_length_template.astype(float)]
plot = hist.histogram(
    datas, weights=weights,
    normalize=True,
    bins=400, title="Read length distribution",
    colors = ["#0084A9"])
# add vertical lines for mean and N50 read length
annot.marker_vline(plot, mean_length, 'Mean: {:.0f}'.format(mean_length))
annot.marker_vline(plot, n50_length, 'N50: {}'.format(n50_length), text_baseline='top')
# add axis labels
plot.xaxis.axis_label = 'Read Length / bases'
plot.yaxis.axis_label = 'Base density'
# show the plot
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, key='length-mass')

*Note that a normalised density has been plotted; plotting simply the total bases per bin is easily misinterpreted.*

The weighted read length histogram above shows the binned distribution of sequence length against the number of nucleotides contained within each bin. This plot will show clear peaks if, for example, amplicons are sequenced or if size selection has been performed.
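The effect of weighting can be seen in miniature with numpy directly; a toy sketch with three reads of lengths 1, 1 and 8:

In [ ]:
import numpy as np

lengths = np.array([1, 1, 8])
bins = [0, 5, 10]
# plain histogram: most *reads* fall in the short bin
counts, _ = np.histogram(lengths, bins=bins)
# weighted histogram: most *bases* fall in the long bin
mass, _ = np.histogram(lengths, bins=bins, weights=lengths)
print(counts, mass)  # [2 1] versus [2 8]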

A final way of viewing this data is to take the cumulative sum of the number of bases and plot it against the read lengths, to obtain a curve expressing the number of bases contained within reads above a given length:

In [17]:
# Plotting cumulative yield by read length (click play)
from aplanat import lines

x = sorted_lengths[0:-1:10000]
y = cumulative_length[0:-1:10000] / 1e9
plot = lines.line([x], [y], xlim=(0, None), ylim=(0, None),
    title='Base yield above given read length',
    colors = ["#0084A9"])
plot.xaxis.axis_label = 'Read Length / bases'
plot.yaxis.axis_label = 'Yield above length / Gbases'
# show the plot
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, 'yield-length')

This graph shows how much of the total yield is contained in reads above any given length; many short fragments can contribute as much to the total yield as a smaller number of longer fragments.

From the graph we can read off the total yield of the experiment at the left-most point of the blue line.
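We can confirm the same value directly from the total_bases variable computed in the N50 cell earlier:

In [ ]:
# total experiment yield, in Gbases
print('Total yield: {:.2f} Gbases'.format(total_bases / 1e9))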

Quality Scores¶

In this section we will review the predicted quality of the sequencing data using the mean per-base quality score of reads written in the sequencing_summary.txt file.

Let's start by simply plotting a histogram of the quality scores:

In [18]:
# Histogram of read quality (click play)

passes = seq_summary['passes_filtering']
datas = [
    seq_summary.loc[passes == status]['mean_qscore_template']
    for status in [True, False]]
plot = hist.histogram(
    datas, colors=['#54B8B1', '#EF4135'],
    names=['Pass', 'Fail'], bins=200, xlim=(1, None),
    title="Read quality score distribution")
plot.xaxis.axis_label = 'Read Quality Score'
plot.yaxis.axis_label = 'Count reads / M'
aplanat.show(plot, background="#f4f4f4")

pass_mean_qscore = seq_summary[seq_summary['passes_filtering'] == True]["mean_qscore_template"].mean()
exec_summary.append('Mean qscore (pass)', pass_mean_qscore, 'thumbs-up', '')

report.markdown("""
## Read quality

Distribution of read quality for pass and fail reads.
""", key='qual-head')
report.plot(plot, key='qual-plot')

The read quality score is calculated as the mean of the per-base quality scores of a read. Each per-base quality score estimates the probability that the base was called in error, expressed on a logarithmic scale: $QV = -10\log_{10}(P_{error})$.
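For intuition, the formula is easily inverted to give the per-base error probability implied by a given score; a quick sketch:

In [ ]:
# convert quality scores to error probabilities: P = 10^(-Q/10)
for qv in (7, 10, 20):
    p_error = 10 ** (-qv / 10)
    print('Q{}: P(error) = {:.3f}'.format(qv, p_error))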

As is evident from the histogram above, a quality score of 7 is the threshold value for determining the Pass or Fail status of reads. The boundary is chosen to be in the tail of the distribution. This QV filter is applied during the base-calling process as a modifiable parameter. For downstream analyses we recommend processing only the higher-quality sequence reads.

Performance through time¶

A key metric in the quality review of a sequencing run is an analysis of its temporal performance. During a run each sequencing channel will address a selection of different pores, and individual pores may become temporarily or permanently blocked. It is therefore expected that sequencing productivity will decrease during a run. It is useful to consider whether the observed productivity decline is normal or whether it happens more rapidly than expected; a rapid decline could be indicative of contaminants within the sequencing library.

Throughput¶

Plotting the number of bases generated per unit time over the course of a sequencing run can reveal unexpected behaviours. In an ideal experiment there should not be any sudden decreases in performance.

In [2]:
def process_form(inputs):
    seq_summary['time'] = (seq_summary['start_time'] / 60 / inputs.discretize).astype(int)
    # group the data by time interval and pass/fail
    groups = seq_summary.groupby(['passes_filtering', 'time'], as_index=False)
    # sum the bases per group
    throughput = groups['sequence_length_template'].agg('sum')

    data = [
        throughput[throughput.passes_filtering == status]
        for status in [True, False]]
    plot = lines.line(
        [d.time / (60 / inputs.discretize) for d in data],
        [d.sequence_length_template / inputs.discretize / 1e6 for d in data],
        names=['Pass', 'Fail'],
        colors=['#54B8B1', '#EF4135'],
        xlim=(0,None), ylim=(0, None),
        title='Throughput',
        x_axis_label = 'time / hr',
        y_axis_label = 'throughput / Mbases per minute')
    aplanat.show(plot, background="#f4f4f4")

    report.markdown("""
    ## Performance through time

    The following charts illustrate read, yield, and flowcell performance through
    the course of the sequencing experiment.
    """, key='time-head')
    report.plot(plot, key='yield-time-plot')


tp_inputs = InputForm(
    InputSpec('discretize', 'Time discretization (min)', (1, 60, 1)), description_width="180px")
tp_inputs.add_process_button(process_form)
tp_inputs.display()

The above plot shows a steady decline in the sequencing rate for Pass reads, with a constant background of Fail reads. The dips in throughput every hour are normal: the sequencing control software MinKNOW periodically reassesses from which pores to record data.

A second way of viewing the sequencing throughput is to plot the cumulative yield over time.

In [19]:
# Plotting base yield through time (click play)

# group the data by pass/fail
groups = seq_summary.set_index(['passes_filtering']).groupby(level=0, as_index=False)
# calculate a cumulative sum of the number of bases sequenced (per group)
sum_tp = groups['sequence_length_template'].cumsum().reset_index()
# attach the start times from the original table
sum_tp['start_time'] = seq_summary['start_time']

data = [
    sum_tp[::1000][sum_tp[::1000].passes_filtering == status]
    for status in [True, False]]
plot = lines.line(
    [d.start_time / 3600 for d in data],
    [d.sequence_length_template / 1e9 for d in data],
    names=['Pass', 'Fail'],
    colors=['#54B8B1', '#EF4135'],
    xlim=(0,None), ylim=(0, None),
    title='Cumulative yield',
    x_axis_label = 'time / hr',
    y_axis_label = 'yield / Gbases')
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, key='yield-time-plot')

Sequencing speed and fragment length through time¶

As well as changes in throughput over time another diagnostic metric to evaluate is the sequencing speed: the number of bases processed per second by the enzymes controlling the movement of DNA through the nanopore channels.

In [20]:
# Plotting Sequencing speed plot (click play)
from bokeh import layouts
hour = (seq_summary.start_time / 3600).astype(int)
speed = seq_summary.sequence_length_template / seq_summary.duration
spd_plot = bars.boxplot_series(
    hour, speed, title='Sequencing speed',
    x_axis_label = 'time / hr',
    y_axis_label = 'speed / (bases / s)')

length = seq_summary.sequence_length_template 
len_plot = bars.boxplot_series(
    hour, length, title='Sequencing Lengths',
    x_axis_label='time / hr',
    y_axis_label='read length / bases')
len_plot.x_range = spd_plot.x_range

p = layouts.column(spd_plot, len_plot)
aplanat.show(p, background="#f4f4f4")
report.plot(p, key='speed-length-boxplots')

Active Channels¶

The graphs above all depict the changing rate of sequencing from the perspective of the volume of data produced. One contribution to the falling sequencing rate is that the number of functional channels falls throughout the experiment.

A channel is defined as being active if one or more sequence reads are observed per time window. It is expected that over the course of the run pores will block and the number of channels producing data will decrease. Changing the pore used by the channel and strategies to unblock pores mean that the number of functional channels may increase or decrease at a given timepoint but generally the number of channels producing data will decrease over time.

In [3]:
def process_form(inputs):
    # calculate a start time for each read rounded to discretize minutes
    seq_summary['time'] = (seq_summary['start_time'] / 60 / inputs.discretize).astype(int)
    # group the data by time interval
    groups = seq_summary.groupby('time', as_index=False)
    # count the unique channels in each interval
    chan_counts = groups['channel'].apply(lambda x: len(x.unique()))
    hours = chan_counts.index / (60 / inputs.discretize)

    plot = lines.line(
        [hours],
        [chan_counts['channel']],
        xlim=(0,None), ylim=(0, None),
        title='Functional channels ({}min period)'.format(inputs.discretize),
        x_axis_label = 'time / hr',
        y_axis_label = 'active channels',
        colors = ["#0084A9"])
    aplanat.show(plot, background="#f4f4f4")
    report.plot(plot, 'active-channels-plot')

ac_inputs = InputForm(
    InputSpec('discretize', 'Time discretization (min)', (1, 60, 1)), description_width="180px")
ac_inputs.add_process_button(process_form)
ac_inputs.display()

Sequencing channel activity plot¶

The nanopores through which DNA is passed, and signal collected, are arrayed as a 2-dimensional matrix. A heatmap can be plotted showing channel productivity against spatial position on the matrix. Such a plot enables the identification of spatial artifacts that could result from membrane damage through e.g. the introduction of an air bubble. This heatmap representation of spatial activity shows only gross spatial aberrations. On the MinION, GridION, and PromethION platforms (but not Flongle) each channel can address four different pores (a "mux"); the activity plot below therefore shows the number of sequences produced per channel, not strictly per pore.

Counting reads and channel map

We will break apart displaying the channel activity into two steps since the code is quite long. First we count reads per channel from our dataset; for this we need to calculate the positions of channels on the flowcell:

In [21]:
# Calculating spatial position of channels on flowcells (click play)
import numpy as np
def minion_array():
    # The array is composed of blocks of 64 channels; the low
    # half is to the right (high to the left), working inwards
    # in 4 x 8 blocks
    def channel_block(i):
        m = np.zeros((4, 16), dtype=int)
        m[4::-1, :7:-1] = np.arange(i, i + 32).reshape(4, 8)
        m[4::-1, 0:8] = np.arange(i + 32, i + 64).reshape(4, 8)
        return m
    # the blocks ascend from high to low, except the
    # lowest block is at the top
    first_channel = list(range(65, 450, 64)) + [1]
    channel_array = np.vstack([channel_block(x) for x in first_channel])
    ca = pd.DataFrame(channel_array).reset_index().melt('index')
    ca.columns = ['row', 'column', 'channel']
    return ca

def flongle_array():
    # channels are in a simple grid except two upper- and
    # lower-most channels on right-hand column are missing
    channel_array = np.concatenate([
        np.arange(1, 13), np.array([0]),
        np.arange(13, 25), np.array([0]),
        np.arange(25, 115), np.array([0]),
        np.arange(115, 127), np.array([0])]).reshape(10, 13)
    ca = pd.DataFrame(channel_array).reset_index().melt('index')
    ca.columns = ['row', 'column', 'channel']
    return ca

def promethion_array():
    # Array is simple blocks of 25*10 channels
    channel_array = np.hstack([
        np.arange(x, x + 250).reshape(25, 10)
        for x in range(1, 2752, 250)])
    ca = pd.DataFrame(channel_array).reset_index().melt('index')
    ca.columns = ['row', 'column', 'channel']
    return ca

def guess_device(max_channel):
    device = "minion"
    if max_channel > 512:
        device = "promethion"
    if max_channel <= 130:
        device = "flongle"
    return device

channel_maps = {
    'minion': minion_array(),
    'flongle': flongle_array(),
    'promethion': promethion_array()}

Using the channel map from the above code cell we can create a table with columns row, column, and reads:

In [22]:
device = guess_device(seq_summary['channel'].max())
channel_map = channel_maps[device]

# count pass reads per channel...
counts = seq_summary[seq_summary['passes_filtering']] \
    .groupby('channel') \
    .size().to_frame('reads') \
    .reset_index()
# ...and merge with the channel map
counts = counts.merge(channel_map, on='channel', how='outer').fillna(0)

Now that we have our counts, we can plot a heat map:

In [23]:
# Plotting activity heat-map plot (click play)
from aplanat import spatial

plot = spatial.heatmap(
    counts['column'], counts['row'], counts['reads'],
    name='# reads')
aplanat.show(plot, background="#f4f4f4")
report.markdown("### Active channels", key='active-channels-head')
report.plot(plot, key='active-channels-heat')

Headline numbers¶

As a final summary of the experiment the infographic below displays a small set of headline numbers captured through the course of the analysis:

In [24]:
# Executive summary (click play)
pinger.send_notebook_ping('stop', 'basic_qc_tutorial')

exec_plot = aplanat.graphics.infographic(exec_summary.values(), ncols=3)
aplanat.show(exec_plot, background="#f4f4f4")
report.plot(exec_plot, key='exec-plot')

fname = os.path.join(os.path.dirname(inputs.summary_file), 'report.html')
report.write(fname)
print("Report written to: {}.".format(fname))
Report written to: report.html.

Summary¶

This tutorial has stepped through basic QC analysis of a Nanopore sequencing run. The sequencing throughput and read lengths have been explored, as has the progression of the experiment through time.

The analysis presented can be run on any sequencing_summary.txt from MinKNOW or Guppy. The code will run within the EPI2ME Labs notebook server environment.