Introduction to Fastq files

The fastq format is a plain-text data format in which each sequence record (usually) spans four lines, pairing a sequence with its corresponding quality score values. There are different ways of encoding quality in a .fastq file; however, files from ONT sequencing devices use Sanger Phred scores. A sequence record is made up of 4 lines:

line 1: Sequence ID and sequence description
line 2: Sequence line, e.g. ATCGs
line 3: Plus symbol (optionally followed by a repeat of the description)
line 4: Quality values for the sequence line

IMPORTANT: Lines 2 and 4 must have the same length or the sequence record is not valid.

For example, a sample record looks like:

@sequence_id sequence_description
ATCG
+
!^%%

The sequence ID must not contain any spaces. Anything after the first space in the sequence ID line will be considered the "description".

A .fastq file may contain multiple records. The default number of records in a fastq file generated during a nanopore run is 4000 reads (16000 lines).
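
Each character in the quality line encodes the Phred quality score of the corresponding base: in the Sanger encoding used by ONT devices, the score is the character's ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). The cell below is a minimal sketch of this decoding applied to the quality line of the example record above (the values are purely illustrative):

In [ ]:
# decode the quality line of the example record above
quality_string = "!^%%"

# Sanger (Phred+33) encoding: ASCII code minus 33 gives the Phred score
phred_scores = [ord(char) - 33 for char in quality_string]
print(phred_scores)  # [0, 61, 4, 4]

# each Phred score Q corresponds to an error probability of 10^(-Q/10)
error_probs = [10 ** (-q / 10) for q in phred_scores]
print(error_probs)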

Useful snippets

The following snippets demonstrate common tasks you might want to perform on a single .fastq file or a set of such files. For many tasks we recommend the excellent seqkit program.

Before anything else we will create and set a working directory:

In [ ]:
# create a work directory and move into it
directory = "fastq_tutorial"
working_dir='/epi2melabs/{}/'.format(directory)
!mkdir -p "$working_dir"
%cd "$working_dir"

from epi2melabs import ping
pinger = ping.Pingu()
pinger.send_notebook_ping('start', 'fastq_introduction')

# Download sample data
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)

!wget "$site/fast_introduction/archive.tar.gz"
!tar -xzvf archive.tar.gz
%cd test0

The snippets all have their code to the left-hand side and a form to the right which can be used to change their inputs (as an alternative to directly editing the code).

How many records in my .fastq file?

To count the number of records in a .fastq file we can use the Linux word count command (wc) to count the number of lines in the file, dividing by four to account for the four lines per record:

In [ ]:
filename = "example3.fastq"
!echo $(( $(wc -l < $filename) / 4 )) reads
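
As an alternative, the seqkit program (which we use elsewhere in this tutorial) can summarise a file directly. The sketch below assumes seqkit is installed; seqkit stats reports the number of sequences along with basic length statistics (the exact columns may differ between seqkit versions):

In [ ]:
filename = "example3.fastq"

# summarise the file: number of sequences plus basic length statistics
!seqkit stats "$filename"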

List all the fastqs in a directory

As Oxford Nanopore Technologies' sequencing devices output multiple .fastq files during the course of an experiment, it can be useful to find and list all such files. We can do this with the Linux find command:

In [ ]:
directory = "."

!find $directory -name "*.fastq"

The default directory value here (.) means "the current working directory."
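
Note that sequencing devices may also write gzip-compressed files with a .fastq.gz extension. A small variation on the command above (a sketch, not part of the original workflow) will list both compressed and uncompressed files:

In [ ]:
directory = "."

# match both plain and gzip-compressed fastq files
!find $directory -name "*.fastq" -o -name "*.fastq.gz"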

Concatenate all fastqs in a directory into a single file

Many bioinformatics programs require all sequence data to be present in a single .fastq file. In order to process sequences spread across multiple files we must therefore concatenate (or "cat") all the .fastq files into a single consolidated file. To perform this task we can use a combination of the Linux find, xargs, and cat commands:

In [ ]:
directory = "."
output_fastq = "all_records.fastq"

!find $directory -type f \( -iname "*.fastq" ! -iname "$output_fastq" \) | \
    xargs cat > "$output_fastq"
!echo $(( $(wc -l < $output_fastq) / 4 )) reads

Again the default directory value here (.) means "the current working directory."

You may often see a simple form of the above:

cat *.fastq > output.fastq

however, this command can fail with an "Argument list too long" error when the number of .fastq files is very large, because the shell expands every filename onto a single command line; the find/xargs construction above avoids this by batching the filenames.

Remove all duplicates in a fastq

It can sometimes be the case that, for whatever reason, a .fastq file contains duplicates of the same read. To remove these we can use the rmdup command of the seqkit program:

In [ ]:
input_fastq = "all_records.fastq"
output_fastq = "deduplicated.fastq"

!seqkit rmdup "$input_fastq" -o "$output_fastq"

For the example data, 200 duplicate records are identified because the three files (containing 100 records each) are in fact copies of the same file.
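
By default seqkit rmdup identifies duplicates by their record ID. If identical sequences might be present under different IDs, the -s (--by-seq) option compares the sequence content instead; the output filename below is illustrative:

In [ ]:
input_fastq = "all_records.fastq"
output_fastq = "deduplicated_by_seq.fastq"

# compare sequence content rather than record IDs when searching for duplicates
!seqkit rmdup -s "$input_fastq" -o "$output_fastq"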

Compress or extract a fastq file

We can save hard disk space on our computer by compressing .fastq files. To do this we recommend using bgzip, which allows for indexing and fast retrieval of sequences by bioinformatics programs:

In [ ]:
input_fastq = "example3.fastq"
compressed_fastq = "example3.fastq.gz"

!ls -lh "$input_fastq"
!bgzip "$input_fastq"
!ls -lh "$compressed_fastq"

The size of the compressed file is roughly half that of the original. To decompress the compressed file, we again use bgzip:

In [ ]:
compressed_fastq = "example3.fastq.gz"

!bgzip -d "$compressed_fastq"
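
Note that a compressed file does not need to be decompressed on disk simply to count its reads: it can be streamed through zcat (or gunzip -c) and counted as before. The sketch below assumes a compressed file is present:

In [ ]:
compressed_fastq = "example3.fastq.gz"

# stream the compressed file and count its lines, dividing by four per record
!echo $(( $(zcat "$compressed_fastq" | wc -l) / 4 )) reads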

Compress a directory structure

In order to compress a directory structure we can use the Linux tar command with the compression option:

In [ ]:
directory = "pass"
archive = "archive.tar.gz"

# the options here mean: create, gzip compress, verbose, output file
!tar -czvf "$archive" "$directory"

When compressing directories and their contents in this way it is good practice to compress a single top-level directory, so that when the archive is decompressed a single top-level directory is retrieved (and the user's working directory isn't polluted).

To decompress the archive we use a similar command:

In [ ]:
archive = "archive.tar.gz"

# A temporary folder (tmp) is created here simply to avoid confusion with the
# original directory compressed in the previous example. This is not necessary
# in practice.

# the options here mean: extract, gzip compressed, verbose, input file
!rm -rf tmp && mkdir tmp && cd tmp && \
    tar -xzvf ../"$archive"
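
Before extracting an archive it can also be useful to check its contents; tar's -t option lists them without writing any files:

In [ ]:
archive = "archive.tar.gz"

# the options here mean: list contents, gzip compressed, verbose, input file
!tar -tzvf "$archive"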

Visualizing fastq

The snippets below demonstrate basic parsing of fastq data in Python. We do not recommend using this code in practice as much of the information is more readily available in the sequencing_summary.txt file produced by Oxford Nanopore Technologies' sequencing devices. See our Basic QC Tutorial for more examples.

In [ ]:
# plotting basic summary graphs
pinger.send_notebook_ping('stop', 'fastq_introduction')

import numpy as np
from pysam import FastxFile
from bokeh.layouts import gridplot

qualities = list()
mean_qualities = list()
lengths = list()

# open the file and iterate through its records
with FastxFile("all_records.fastq") as fq:
    for rec in fq:
        # ONT calculation for "mean Q score"
        quals = np.fromiter(
            (ord(x) - 33 for x in rec.quality),
            dtype=int, count=len(rec.quality))
        mean_p = np.mean(np.power(10, quals/-10))
        mean_qualities.append(-10*np.log10(mean_p))
        # all qualities
        qualities.extend(quals)
        lengths.append(len(quals))

# use the aplanat library to plot some graphs of the
# collected statistics
import aplanat
from aplanat.hist import histogram

p1 = histogram(
    [np.array(mean_qualities)], title="Read quality scores",
    x_axis_label="quality", y_axis_label="count",
    height=250, width=300)
p2 = histogram(
    [qualities], title="Base quality scores",
    x_axis_label="quality", y_axis_label="count",
    height=250, width=300)
p3 = histogram(
    [lengths], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count",
    height=250, width=300)
aplanat.show(gridplot((p1, p2, p3), ncols=3), background="#f4f4f4")