Telomere sequencing workflow

Nextflow pipeline for the analysis of telomeric sequence data generated with the Oxford Nanopore Telo-Seq protocol.

Introduction

wf-teloseq is a Nextflow pipeline to provide telomere length estimates from data generated with the Telo-Seq protocol.

Telo-Seq aims to measure telomere length accurately and assign each telomere to a chromosome arm. wf-teloseq currently defaults to an alignment based analysis. By default the workflow will use a suitable reference which has been created from the telomeric regions of the phased telomere-to-telomere (T2T) HG002 human reference genome, but users may optionally provide their own reference genome. Please note that a T2T reference genome matching the sample that has been sequenced is required for correct assignment of the reads to chromosome arms. If no reference genome is available, a reference-less analysis can be performed, which simply provides a global telomere length estimate based on the sequenced reads without chromosome arm assignment.

NOTE In order to run the analysis, data must first be base-called and demultiplexed with Dorado using Telo-Seq specific configuration files. A bash script to perform this base-calling is available in the wf-teloseq repository under bin/basecalling.sh.

Compute requirements

Recommended requirements:

CPUs = 16
Memory = 12GB

Minimum requirements:

CPUs = 4
Memory = 8GB

Approximate run time: Using recommended compute: approximately 5 minutes per sample, or 1 minute if alignment is not performed.

ARM processor support: False

Install and run

These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.

The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.

The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.

It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.

The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:

nextflow run epi2me-labs/wf-teloseq --help

To update a workflow to the latest version on the command line use the following command:

nextflow pull epi2me-labs/wf-teloseq

A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-teloseq/wf-teloseq-demo.tar.gz
tar -xzvf wf-teloseq-demo.tar.gz

The workflow can then be run with the downloaded demo data using:

nextflow run epi2me-labs/wf-teloseq \
    --bam 'wf-teloseq-demo/data' \
    --reference 'wf-teloseq-demo/HG002qpMP_reference.fasta.gz' \
    -profile standard

For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/

This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.

Find related protocols in the Nanopore community.

Input example

This workflow accepts either FASTQ or BAM files as input.

The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet. The sample sheet can include a "reference" column to assign each sample a specific reference to map to.

(i)                     (ii)                 (iii)    
input_reads.fastq   ─── input_directory  ─── input_directory
                        ├── reads0.fastq     ├── barcode01
                        └── reads1.fastq     │   ├── reads0.fastq
                                             │   └── reads1.fastq
                                             ├── barcode02
                                             │   ├── reads0.fastq
                                             │   ├── reads1.fastq
                                             │   └── reads2.fastq
                                             └── barcode03
                                              └── reads0.fastq

Input parameters

Input Options

Nextflow parameter name	Type	Description	Help	Default
fastq	string	FASTQ files to use in the analysis.	This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.
bam	string	BAM or unaligned BAM (uBAM) files to use in the analysis.	This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.
analyse_unclassified	boolean	Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory.	If selected and if the input is a multiplex directory the workflow will also process the unclassified directory.	False
reference	string	Reference genome of the sequenced sample, if not specified a human derived telomere reference set will be used.

Sample Options

Nextflow parameter name	Type	Description	Help	Default
sample_sheet	string	A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files.	The sample sheet is a CSV file with, minimally, columns named `barcode` and `alias`. Extra columns are allowed. A `type` column is required for certain workflows and should have the following values; `test_sample`, `positive_control`, `negative_control`, `no_template_control`.
sample	string	A single sample name for non-multiplexed data. Permissible if passing a single .FASTQ(.gz) file or directory of .FASTQ(.gz) files.

TeloSeq Options

Nextflow parameter name	Type	Description	Help	Default
skip_mapping	boolean	Perform alignment to assign haplotypes to telomeric reads.	Use `--skip_mapping` if there is no suitable reference available. Only a bulk estimate of telomere lengths per sample will be calculated.	False
alignment_threads	integer	Set max number of threads to use for alignment.		6

Output Options

Nextflow parameter name	Type	Description	Help	Default
out_dir	string	Directory for output of all workflow results.		output

Advanced Options

Nextflow parameter name	Type	Description	Help	Default
min_length	integer	Minimum read length for filtering.	Used in the initial filtering of reads into the workflow. Reads with a sequence length less than this will be removed prior to analysis, and will not be present in the output files. This removes all reads which are either noise, artifacts or partial, and which cannot be full telomeric sequences. Setting this number too high runs the risk of removing short telomeres.	100
read_quality	integer	Minimum read Q score for filtering.	Reads with a mean quality score lower than this value will be filtered out prior to analysis, and will not be present in the output files.	9

Outputs

Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

Title	File path	Description	Per sample or aggregated
workflow report	wf-teloseq-report.html	Report for all samples.	aggregated
Tool versions	versions.txt	A CSV with per row tool and version.	aggregated
Parameters from workflow	params.json	A json of all parameters selected in workflow.	aggregated
Unaligned, filtered and tagged sequences.	{{ alias }}/unaligned_data/{{ alias }}_filtered_telomeric.fastq	These sequences have been tagged (valid SAM format tags) on whether or not they passed filtering (qc:Z), and if detected, also tagged with the Telomere repeat boundary coordinates (tl:I).	per-sample
Summary metrics about detected telomere lengths within the sample.	{{ alias }}/stats/{{ alias }}_telomere_unaligned_metrics.tsv	Aggregated summary metrics about reads which have passed all filtering, with a detected telomere repeat boundary.	per-sample
Aligned, filtered and tagged sequences.	{{ alias }}/aligned_data/{{ alias }}_aligned_filtered_teloseqs.bam	Contains the sequences from the processed_fastq aligned to the provided reference. All sequences which passed initial read length (`--min_length`) and quality (`--read_quality`) filtering will be present, including unmapped. Only produced if alignment is performed.	per-sample
Accompanying index for the aligned BAM.	{{ alias }}/aligned_data/{{ alias }}_aligned_filtered_teloseqs.bam.csi	Coordinate-sorted index file for the aligned data BAM.	per-sample
Summary metrics about aligned telomere lengths within the sample.	{{ alias }}/stats/{{ alias }}_telomere_aligned_metrics.tsv	Aggregated summary metrics about detected telomere lengths within the sample, after alignment. Only reads which have primary alignments to the reference are considered.	per-sample
Summary metrics about aligned telomere lengths grouped by target contig.	{{ alias }}/stats/{{ alias }}_contig_telomere_aligned_metrics.tsv	Aggregated summary metrics about detected telomere lengths within the sample, after alignment. The reads are grouped by target contig, and only reads which have primary alignments to the reference are considered.	per-sample
Summary metrics about the filtering status of each read.	{{ alias }}/stats/{{ alias }}_qc_modes_metrics.tsv	Aggregated summary metrics of the filtering status of each read. Metrics displayed for each filtering status include count, median Q score, length, and alignment identity (where applicable).	per-sample
Telomere length KDE data points.	{{ alias }}/stats/{{ alias }}_kde_data.tsv	Equally spaced data points drawn from a KDE. The KDE is drawn from the calculated telomere lengths.	per-sample
Data points for drawing a boxplot from calculated telomere lengths.	{{ alias }}/stats/{{ alias }}_boxplot_values.tsv	Data points for drawing a boxplot from calculated telomere lengths for a single sample. Only provided if alignment is performed. Each row represents a contig, containing the minimum (greater than median - 1.5 * IQR), Q1, Median, Q3, and Max (smaller than median + 1.5 * IQR).	per-sample

Pipeline overview

The workflow is composed of two steps. Firstly, reads undergo initial basic length and quality control filtering, followed by analysis to determine the telomeric boundary, and filtering to remove reads where the boundary is likely incorrect. If alignment is enabled (default), all reads (irrespective of boundary determination) are then aligned to a reference. Reads with primary alignments with good telomere boundaries are then aggregated, before statistics about the estimated lengths of each contigs telomeres are generated.

1. Input and Sequence preparation.

wf-teloseq requires basecalled nanopore reads in FASTQ or BAM format as input. We recommend basecalling with super accuracy (SUP) models. High accuracy (HAC) basecalling will also work but will result in a slightly reduced number of telomeric reads of sufficient quality for analysis. For best results, we recommend a minimum of 1000 telomere reads per sample for alignment based analyses. If a reference is not provided via the --reference or --sample_sheet parameter, then a telomere reference constructed from HG002 (data/HG002qpMP_reference.fasta.gz) is used by default for alignment.

A base-calling script is included in the wf-teloseq repository under bin/basecalling.sh. This script requires Dorado basecaller to be installed and available on your PATH. Instructions for Dorado installation can be found here. The script uses specific barcode sequences for the TeloSeq protocol to correctly demultiplex data, and arrange the created BAM files in the directory arrangement that wf-teloseq expects.

Example usage:

Assuming pod5 files are in the directory /data/example_experiment/example_run/,

cd wf-teloseq
./bin/basecalling.sh -m hac -i /data/example_experiment/example_run/ -o TeloSeq_data

This will create the TeloSeq_data folder and lay out all barcodes in their own sub-folders.

2. Read filtering and tagging

Filters are listed in the order that they are applied. The first two filters are tunable with the params --min_length and --read_quality. Other filters are currently not tunable via parameters.

Reads which fail a check are tagged with a qc:Z:<Tag> tag, in either the output FASTQ if not performing alignment or the output BAM. Tags for corresponding failure modes are listed below. Reads which pass all filtering checks are tagged with qc:Z:Good.

Filter	Description	Tag
Minimum read length	Reads under 100 bases long are discarded. Removed entirely.	N/A
Minimum Q score	Reads with a Mean Q score <9 are discarded. Removed entirely.	N/A
Too Short	The read was too short (less than 160 bases). Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM.	TooShort
Too Few Repeats	There were fewer than 20 telomeric repeat motifs across the entire read. Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM.	TooFewRepeats
Too Close Start	The telomeric boundary is too close (within 60 bases) of the Start of the read. Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM.	TooCloseStart
Too Close End	The telomeric boundary is too close (within 60 bases) of the end of the read. Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM.	TooCloseEnd
Start Not Repeats	The first 30% of the read is not 80% repeats. Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM.	StartNotRepeats
Low Sub-Telo Qual	The mean basecall Q score of the region after the boundary is below a default value of 9. Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM.	LowSubTeloQual
Too Errorful	At least 5 known basecall error motifs have been observed within a 500 base pair frame in the subtelomere. Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM.	TooErrorful
Telomere only	Sequence after telomere boundary is CCC rich, and so is most likely still actually telomeric repeat sequence, meaning the detected boundary is unreliable.	TelomereOnly

The following filter is only applied if alignment is performed: | Filter | Description | Tag | |----------------------|---------------------------------------------------------------------------|------------------| | Bad Alignment | Query read has a low gap-compressed identity to the reference (\< 0.8), or a low (\< 20) mapping quality. Read is excluded from further analysis, is tagged as failing QC, remaining in output BAM. | BadAlign |

All reads have a tl:i tag set, which represents the estimated telomere length. For reads which a boundary CANNOT be determined, the value of this tag is set to -1.

It is after this stage that a tagged FASTQ file is available per sample, with the tags present in the header.

3. Alignment.

By default, ALL reads are aligned by minimap2 (excepting reads which fail the initial minimum read length and Q score filters) after initial processing. Alignment can be skipped by setting --mapping false when running the pipeline.

Default reference genome

The telomere reference provided with the pipeline is based on the HG002 telomere-to-telomere reference genome. Note that in this reference, the sub-telomeric sequence up to the EcoRV cut site in chromosome 13 (paternal) P arm is identical to chromosome 22 (paternal) P arm, so there is only one representative sequence present in the file, with a sequence name of chr13_22_PATERNAL_P. The pipeline does not separate out these two arms based upon telomere length from the single contig provided, which means the estimated telomere length assigned to this contig is likely inaccurate.

Processing of alignments

After alignment, the alignments are used to aggregate stats based on the reference contigs. One final filtering step is performed here, which is based on the gap-compressed identity between the query sequence and the target reference, as calculated by minimap2. Any read with an identity of less than 0.8 and a mapping quality of less than 20 is excluded from the estimated telomere length statistics. Only primary alignments are used in this aggregation, and only those from reads which have the qc:Z tag set with a value of Good, indicating they passed all initial filtering steps.

After this stage a tagged BAM file containing both mapped and unmapped reads is produced and output per sample.

Troubleshooting

If the workflow fails please run it with the demo data set to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.
See how to interpret some common nextflow exit codes here.

FAQ's

Can I use TeloSeq and wf-teloseq on non human samples? Yes, although Telo-Seq has been tested on human data, it should work well on species with similar telomere repeat sequences.

If your question is not answered here, please report any issues or suggestions on the github issues page or start a discussion on the community.

Additional

Importing third-party workflows into EPI2ME Labs

See the EPI2ME website for lots of other resources and blog posts.