Transcriptomes

Long-read transcript discovery, quantification, differential expression, QC and mod counting.

Introduction

This workflow analyses Oxford Nanopore long-read transcript sequencing data. It uses bambu to build and quantify transcript models, SQANTI3 for transcript classification and QC, can optionally run DESeq2 and DEXSeq for differential analysis, and will run modkit for base modification pileups on aligned reads if relevant tags are present.

The workflow supports:

Transcript identification from either cDNA or direct RNA reads
Transcript discovery guided by a supplied genome and annotation
Quantification against a supplied reference annotation
Transcript classification and QC with SQANTI3
Differential gene expression with DESeq2
Differential transcript usage with DEXSeq
Base modification pileups with modkit

The main transcriptome result is a shared bambu model built from all samples together. The workflow also produces separate per-sample transcriptomes, so each sample has its own GTF, FASTA, count tables, and SQANTI3 summary alongside the shared results.

wf-transcriptomes overview schematic. — Schematic depicting wf-transcriptomes workflow.

For users familiar with earlier transcriptome workflows, the main change is that transcript discovery, quantification, and optional differential analysis now use the shared bambu outputs rather than the older StringTie/GffCompare/Salmon-based approach. The rest of this README explains the current workflow in plain terms, while the FAQ and Troubleshooting sections call out the main differences from the previous workflow version. For bambu implementation details, see the bambu GitHub repository. For SQANTI3 classification categories, see the SQANTI3 isoform classification documentation.

Compute requirements

Recommended requirements:

CPUs = 32
Memory = 96GB

Minimum requirements:

CPUs = 12
Memory = 64GB

Approximate run time: Varies with read depth and sample count; a small (1-5 million reads) single-sample run may finish in under 60 minutes with the recommended resources. Compute requirements are influenced most by total read count, reference complexity, number of samples, and whether optional steps such as DE/DTU, and modified-base summarisation are enabled.

ARM processor support: False

Install and run

These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.

The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.

The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.

It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found in the documentation.

The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:

nextflow run epi2me-labs/wf-transcriptomes --help

To update a workflow to the latest version on the command line use the following command:

nextflow pull epi2me-labs/wf-transcriptomes

A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-transcriptomes/v2/wf-transcriptomes-demo.tar.gz
tar -xzvf wf-transcriptomes-demo.tar.gz

The workflow can then be run with the downloaded demo data using:

nextflow run epi2me-labs/wf-transcriptomes \
    --bam 'wf-transcriptomes-demo/samples' \
    --de_analysis \
    --direct_rna \
    --ref_annotation 'wf-transcriptomes-demo/gencode.v22.annotation.chr20.gtf' \
    --ref_genome 'wf-transcriptomes-demo/hg38_chr20.fa' \
    --sample_sheet 'wf-transcriptomes-demo/sample_sheet.csv' \
    -profile standard

Input example

This workflow accepts either FASTQ or BAM files as input.

The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.

(i)                     (ii)                 (iii)    
input_reads.fastq   ─── input_directory  ─── input_directory
                        ├── reads0.fastq     ├── barcode01
                        └── reads1.fastq     │   ├── reads0.fastq
                                             │   └── reads1.fastq
                                             ├── barcode02
                                             │   ├── reads0.fastq
                                             │   ├── reads1.fastq
                                             │   └── reads2.fastq
                                             └── barcode03
                                                 └── reads0.fastq

Pipeline overview

Background

The methodology implemented within the wf-transcriptomes workflow follows from the largest independent long-read RNA benchmark to date. The Systematic assessment of long-read RNA-seq methods for transcript identification and quantification concluded that, in well-annotated genomes, reference-based methods perform best. Our own previous research, benchmarking, and support of community members has shown that an automated, hands-off de-novo discovery pipeline to be bothersome for many use cases. The wf-transcriptomes workflow therefore focuses on a reference-guided approach rather than a novelty-first one.

The benchmark paper above explicitly recommends bambu for identifying sample-specific transcriptomes in well-annotated organisms when only limited novelty is expected. The paper also names bambu as one of the best options when quantification is important, which supports using it as the core engine for downstream DGE and DTU analyses. For method details and current implementation notes, see the bambu GitHub repository.

In spike-in evaluations, bambu generally showed high precision and was among the better F1 performers. This is an acceptable tradeoff for a production workflow where false transcript calls might be confound downstream analysis. Users interested more in novel discovery may wish to amend the parameters of the workflow away from their defaults. bambu also performed especially well on long non-spliced SIRVs, which supports its use on long-read datasets where transcript-end definition matters.

The workflow's choice of SQANTI3 as a companion QC and annotation layer matches the benchmark paper, which used SQANTI3 categories and metrics as its transcript assessment framework; so our outputs align with the field’s standard reporting language. This step is useful for structural isoform classification and provides standard QC summaries for reporting, method comparison, or deeper transcript model review. See the SQANTI3 repository and the SQANTI3 isoform classification documentation for category definitions and usage details.

1. Getting your files into the workflow

The shared EPI2ME input handling collects FASTQ or BAM inputs, works out whether you have a single sample or a multiplexed run, and produces per-sample FASTQ files plus read statistics. These files are used in the downstream report.

2. Sample sheet formulation

The sample sheet is optional for simple single-sample runs, but it allows sample aliases to be mapped to barcodes in multiplexed runs and is required for --de_analysis.

Every row must contain barcode and alias.
barcode must use the usual ONT-style naming such as barcode01, barcode02, and the values must be unique.
alias is the user-facing sample name, must be unique, must not begin with the word barcode and may contain only letters, numbers, ., _ or -.
If a type column is present, it must use one of: test_sample, positive_control, negative_control, or no_template_control.
If an analysis_group column is present, every row must have a value.
For --de_analysis, the sheet must also contain the primary condition column (condition by default, overridable with --condition_column), plus any columns named in --covariates.

When multiplexed input folders are named by barcode, the workflow matches those folder names against the barcode column. If the folders are named by alias, the workflow can match them against alias, but the sample sheet still needs a barcode column because the shared validator expects it.

Example sample sheets:

Example sample sheet for a simple multiplexed analysis

This example contains the minimal barcode and alias columns:

barcode,alias
barcode01,rep1
barcode02,rep2
barcode03,rep3
barcode04,rep4

Example sample sheet for a differential expression analysis

This example is suitable for a multiplexed run and also satisfies the minimum requirements for a two-group DE/DTU comparison; containing the barcode, alias and condition columns.

barcode,alias,condition
barcode01,control_rep1,control
barcode02,control_rep2,control
barcode03,control_rep3,control
barcode04,treated_rep1,treated
barcode05,treated_rep2,treated
barcode06,treated_rep3,treated

Additional columns to use as contrast facets may be named with the --covariates parameter.

3. Genome alignment

Each sample is aligned to the supplied reference genome with minimap2 in a splice aware mode, then sorted and indexed with samtools. The aligned BAMs under samples/<alias>/alignment/ are the main alignment files used for transcriptome analysis, SQANTI3 QC, and optional IGV viewing.

4. Optional modified base summarisation

When aligned BAMs contain modified base tags (MM and ML), the workflow also runs modkit on each sample alignment. It first checks which modified base codes are present in the BAM, then runs modkit pileup to produce a per-sample bedMethyl file, a simple per-sample summary table, and one bigWig track per requested or inferred modification under samples/<alias>/mods/.

If --mod_codes is set, those codes are passed directly to modkit pileup. If it is omitted, the workflow infers the available primary_base:mod_code pairs from the aligned BAM with modkit modbam check-tags. These outputs are also included in the optional IGV configuration when --igv is enabled.

5. Cohort transcriptome construction

All aligned samples are analysed together with bambu to produce the primary cohort transcriptome, transcript counts, gene counts, and the RDS objects used for downstream differential analysis. This shared model is the main cohort-level result and is published under cohort/. Before writing outputs, transcript filtering removes only transcripts with zero total transcript counts across samples; it does not use fullLengthCounts for this quantification filter.

6. Independent per-sample transcriptomes

Each sample is also processed separately with bambu so the workflow produces sample-specific GTF, FASTA, count tables, and metadata under samples/<alias>/. These per-sample outputs are useful for inspecting sample specific transcript models without changing the shared cohort transcriptome used for DE/DTU.

7. Transcript sequence generation and QC

Transcript FASTA files are derived from GTF plus genome using gffread. SQANTI3 classifies the cohort and per-sample transcriptomes and produces structural QC summaries. The cohort SQANTI3 results live under cohort/sqanti/, while per-sample SQANTI3 directories are published under samples/<alias>/sqanti/.

8. Optional DE and DTU analysis

When --de_analysis is enabled, the workflow checks the experimental design, runs DESeq2 for differential gene expression, and runs DEXSeq for differential transcript usage. These analyses use the shared bambu outputs and the design columns in the sample sheet, and each comparison is written to its own subdirectory under de_analysis/<contrast>/.

9. What you need to provide

The workflow's analysis is controlled by a user provided genome, annotation, and bambu mode.

use --transcriptome_mode to choose between discover and fixed_annotation
--transcriptome_source has been removed; use --transcriptome_mode instead
both --ref_genome and --ref_annotation are required in both modes
--ref_transcriptome has been removed; if you want annotation-based quantification, use --transcriptome_mode fixed_annotation together with --ref_genome and --ref_annotation
when --de_analysis is enabled, the sample sheet must contain alias, the primary condition column, and any requested columns named in --covariates

Input parameters

Main Options

Nextflow parameter name	Type	Description	Help	Default
fastq	string	FASTQ reads to analyse.	You can provide a single FASTQ, a folder of FASTQs, or a multiplexed folder containing one sub-folder per sample or barcode.
bam	string	BAM or uBAM reads to analyse.	You can provide a single BAM or uBAM, a folder of BAMs, or a multiplexed folder containing one sub-folder per sample or barcode.
ref_genome	string	Reference genome FASTA.	Required in both discover and fixed_annotation modes.
ref_annotation	string	Reference transcript annotation in GTF or GFF format.	Required in both discover and fixed_annotation modes.
transcriptome_mode	string	How bambu should prepare the transcriptome model.	Use discover for reference-guided transcript discovery and quantification, or fixed_annotation for quantification only against the supplied annotation.	discover
direct_rna	boolean	Set this for direct RNA sequencing libraries.		False

Read Filtering Options

Nextflow parameter name	Type	Description	Help	Default
analyse_unclassified	boolean	Include unclassified reads from multiplexed input directories.		False
analyse_fail	boolean	Include fail reads from bam_fail and fastq_fail folders found in sample folders on the input path.		False

Sample Options

Nextflow parameter name	Type	Description	Help	Default
sample_sheet	string	CSV file describing barcodes, aliases, and optional experimental design columns.	For multiplexed runs, the sample sheet should contain both barcode and alias. Additionally, for differential analysis, it must also contain the condition column, and any extra columns named in `--covariates`.
sample	string	Single sample name for singleplexed input or to restrict multiplexed analysis to one sample.

Differential Expression Analysis Options

Nextflow parameter name	Type	Description	Help	Default
de_analysis	boolean	Run differential gene expression and differential transcript usage analyses.		False
condition_column	string	Main comparison column in the sample sheet.		condition
covariates	string	Comma-separated extra sample-sheet columns to adjust for, for example batch.	Each listed name must exist as a column in the sample sheet.
reference_level	string	Baseline group for the main comparison column.	If omitted, the workflow will use control when that level exists.

Output Options

Nextflow parameter name	Type	Description	Help	Default
out_dir	string	Directory for user-facing workflow outputs.		output
igv	boolean	Generate an IGV configuration file for the aligned BAM outputs.		False

Advanced Options

Nextflow parameter name	Type	Description	Help	Default
mod_codes	string	Comma-separated modified base codes to pass to modkit pileup.	Provide values accepted by `modkit pileup --modified-bases`, for example `A:a,C:m`. If omitted, the workflow infers `primary_base:mod_code` pairs from the BAM with `modkit modbam check-tags`.
force_alignment	boolean	Force re-alignment of input BAM files.	Read alignment is skipped if the existing sequence names in the aligned BAM match the provided reference. Enable this if the existing alignments used incorrect minimap2 presets (e.g. missing --splice or direct RNA settings).	False
ndr	number	Optional bambu novel discovery rate override.	Lower values are more conservative (higher precision), while higher values are more permissive (higher novel-discovery sensitivity). See the `bambu` repository for method details.
sqanti_skip_orf	boolean	Skip ORF prediction during SQANTI3 QC.		True

Outputs

Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

Title	File path	Description	Per sample or aggregated
Workflow report	wf-transcriptomes-report.html	HTML report summarising transcript discovery, quantification, SQANTI3 classification, and optional differential analysis results.	aggregated
Aligned BAM	samples/{{ alias }}/alignment/reads.bam	Genome-aligned BAM used for bambu, SQANTI3 QC, and IGV.	per-sample
Aligned BAM index	samples/{{ alias }}/alignment/reads.bam.bai	Index for the aligned BAM.	per-sample
Alignment summary	samples/{{ alias }}/alignment/bamstats.flagstat.tsv	bamstats flagstat summary for the aligned BAM.	per-sample
Modified base pileup	samples/{{ alias }}/mods/{{ alias }}.mods.bedmethyl.gz	Per-sample modkit bedMethyl pileup generated from the aligned BAM when MM and ML tags are present.	per-sample
Modified base summary	samples/{{ alias }}/mods/{{ alias }}.mods.summary.tsv	Per-sample global modification-percent summary aggregated from the modkit bedMethyl pileup, with one row per modification code.	per-sample
Modified base bigWig	samples/{{ alias }}/mods/{{ alias }}.mods.*.bw	Per-sample modkit bigWig tracks generated from the aligned BAM, with one file per requested or inferred modification code.	per-sample
Reference and annotation preparation summary	cohort/reference/annotation_reference_summary.json	Summary of reference and annotation preparation, including seqname overlap, build/provider hints, and excluded unstranded annotation counts.	aggregated
Excluded unstranded annotation records	cohort/reference/unstranded_annotation.gtf	Full set of annotation records excluded because their strand was not '+' or '-'. Present only when unstranded records are found.	aggregated
Cohort transcriptome GTF	cohort/transcripts.gtf	Joint bambu transcript model used as the primary cohort transcriptome.	aggregated
Cohort transcriptome FASTA	cohort/cohort.transcriptome.fa	Transcript sequences derived from the joint cohort GTF.	aggregated
Cohort transcript counts	cohort/transcript_counts.tsv	Transcript-level count matrix produced by bambu.	aggregated
Cohort gene counts	cohort/gene_counts.tsv	Gene-level count matrix derived from bambu output.	aggregated
Cohort transcript metadata	cohort/transcript_metadata.tsv	Transcript annotations and bambu transcript classes for the cohort model.	aggregated
Cohort SQANTI3 summary	cohort/sqanti/classification_summary.tsv	SQANTI3 classification summary for the cohort transcriptome.	aggregated
Per-sample transcriptome GTF	samples/{{ alias }}/transcripts.gtf	Independent bambu transcript model for an individual sample.	per-sample
Per-sample transcriptome FASTA	samples/{{ alias }}/{{ alias }}.transcriptome.fa	Transcript sequences derived from the per-sample GTF.	per-sample
Per-sample transcript counts	samples/{{ alias }}/transcript_counts.tsv	Transcript-level abundance estimates for the per-sample bambu model.	per-sample
Per-sample gene counts	samples/{{ alias }}/gene_counts.tsv	Gene-level abundance estimates for the per-sample bambu model.	per-sample
Per-sample transcript metadata	samples/{{ alias }}/transcript_metadata.tsv	Transcript annotations and bambu transcript classes for the per-sample model.	per-sample
Per-sample SQANTI3 summary	samples/{{ alias }}/sqanti/classification_summary.tsv	SQANTI3 classification summary for the per-sample transcriptome.	per-sample
Differential gene expression results	de_analysis/{{ contrast }}/results_dge.tsv	DESeq2 gene-level differential expression results for one contrast.	aggregated
Differential gene expression plots	de_analysis/{{ contrast }}/results_dge.pdf	PDF plots generated during DESeq2 analysis for one contrast.	aggregated
Differential transcript usage results	de_analysis/{{ contrast }}/results_dtu_transcript.tsv	Transcript-level DTU results for one contrast.	aggregated
Differential transcript usage gene summary	de_analysis/{{ contrast }}/results_dtu_gene.tsv	Gene-level DTU summary for one contrast.	aggregated
DEXSeq results	de_analysis/{{ contrast }}/results_dexseq.tsv	Full DEXSeq result table for one contrast.	aggregated
Differential transcript usage plots	de_analysis/{{ contrast }}/results_dtu.pdf	PDF plots generated during DEXSeq analysis for one contrast.	aggregated
Differential analysis QC summary	de_analysis/de_qc_stats.json	Structured DE/DTU QC summary. Use analysis_fallbacks for aggregate counts, and each contrast's deseq2_dispersion_fallback, dexseq_dispersion_method, and dexseq_covariates_dropped fields for interpretation.	aggregated
Differential analysis text summary	de_analysis/de_overall_summary.txt	Human-readable DE/DTU run summary across all contrasts.	aggregated
Per-contrast QC summary	de_analysis/{{ contrast }}/contrast_qc_summary.txt	Human-readable per-contrast DE/DTU QC summary including sample counts and key significance totals.	aggregated
DESeq2 fallback diagnostic	de_analysis/DESeq2_dispersion_fallback_{{ contrast }}.txt	Diagnostic details when DESeq2 falls back to gene-wise dispersion estimation.	aggregated
DGE failure diagnostic	de_analysis/{{ contrast }}/DGE_ANALYSIS_FAILED.txt	Diagnostic details when DESeq2 fails for a contrast.	aggregated
DTU failure diagnostic	de_analysis/{{ contrast }}/DTU_ANALYSIS_FAILED.txt	Diagnostic details when DEXSeq fails for a contrast.	aggregated
Multiple-testing warning	de_analysis/MULTIPLE_TESTING_WARNING.txt	Family-wise error-rate note generated when multiple contrasts are tested.	aggregated
IGV configuration	igv.json	JSON configuration for viewing the aligned BAMs in IGV.	aggregated
Reference FASTA index	reference/{{ ref_genome_file }}.fai	FAI index for the reference genome published for IGV.	aggregated
Reference GZI index	reference/{{ ref_genome_file }}.gzi	GZI index for a compressed reference genome published for IGV.	aggregated

This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.

Find related RNA and cDNA sequencing protocols in the Nanopore community.

Troubleshooting

Check that the reference genome and reference annotation use overlapping sequence names. The workflow checks this early and will fail if there is no overlap at all.
If --de_analysis is enabled, ensure the sample sheet contains alias, the primary condition column, and any columns named in --covariates.
DE/DTU requires at least two condition levels and at least two samples per level.
See how to interpret common Nextflow exit codes here.

Common confusion when coming from the previous workflow version

I supplied `--ref_transcriptome`, but the workflow still built or used `bambu` outputs

The previous workflow version used --ref_transcriptome as a main driver for transcript-level downstream analysis. In the current version the --ref_transcriptome option has been removed and is not part of the current bambu input setup.

Use --transcriptome_mode fixed_annotation together with --ref_genome and --ref_annotation if you want annotation-driven quantification without transcript discovery.

I omitted `--ref_annotation` because I expected the old input setup

The previous workflow version could be driven from a different combination of transcriptome inputs. In the current version both --ref_genome and --ref_annotation are required in discover and fixed_annotation modes. Always provide a compatible genome FASTA and transcript annotation when launching the workflow.

I used `--transcriptome_source` and got behaviour I did not expect

In the previous workflow version the --transcriptome_source was the main setting that chose how the workflow behaved. In the current version, --transcriptome_mode is the main setting that controls this. The --transcriptome_source option has been removed from the workflow interface.

The options --transcriptome_mode discover or --transcriptome_mode fixed_annotation should be used to choose between the modes of operation.

I cannot find the old flat DE output files

In the previous workflow version, DE files appeared directly under de_analysis/. In the current version DE and DTU results are grouped by contrast under de_analysis/<contrast>/. Look for outputs such as de_analysis/<contrast>/results_dge.tsv and de_analysis/<contrast>/results_dtu_transcript.tsv.

I expected the old output layout or transcriptome files

The previous workflow version emitted a single folder of all analysis outputs. In the current version, the output folders are organised around cohort/, samples/<alias>/ and de_analysis/<contrast>/.

The previous workflow output a non-redundant transcriptome whereas the current version of the workflow outputs a true joint transcriptome in the cohort/ folder, with individual transcriptome analyses additionally output in the samples/<alias>/ folders.

My DE/DTU run fails because of `--sample_sheet`

The sample sheet rules have been amended compared to the previous version to allow for multi-way comparisons.

The sample sheet must contain alias, the primary condition column, and any columns named in --covariates. At least two condition levels are required, and each level must contain at least two samples.

What to change: verify the sample sheet columns first, then check --condition_column, --covariates, and --reference_level.

The workflow validates that the annotation and genome share sequence names, and it excludes unstranded annotation entries from the differential analysis path.

Confirm the genome and annotation come from a compatible source, and ensure the annotation uses only + or - strand values where required for DE/DTU and SQANTI3.

FAQs

Does the workflow support both cDNA and direct RNA?

Yes. Use --direct_rna for direct RNA data. cDNA is the default mode.

Do I need both `--ref_genome` and `--ref_annotation` in fixed-annotation mode?

Yes. bambu still uses the genome together with the imported annotation.

Does the workflow create one shared transcriptome or one per sample?

Both. The joint cohort model is the primary result for reporting and DE/DTU, and the workflow also emits independent per-sample transcriptomes.

Can I run DE/DTU without transcript discovery?

Yes. Use --transcriptome_mode fixed_annotation together with --de_analysis.

What changed from the previous workflow version?

The current workflow uses bambu, SQANTI3, DESeq2, and DEXSeq. The main transcriptome result is now one shared bambu model built from all samples together, with separate per-sample bambu outputs published alongside it. The most important differences are summarised in this FAQ section.

Why do the results now mention `bambu` and `SQANTI3` instead of StringTie or GffCompare?

The workflow now uses a different set of transcript analysis tools. It builds its transcript models with bambu, classifies them with SQANTI3, and performs DE/DTU from the cohort bambu outputs.

Why is `--ref_annotation` now required even in fixed-annotation mode?

bambu still requires the annotation together with the genome in both discover and fixed_annotation modes. Fixed-annotation mode means annotation-driven quantification, not annotation-only execution.

Can I still use `--ref_transcriptome`?

No. --ref_transcriptome has been removed from the workflow interface.

Short old-to-new example:

Previous workflow version:
  --transcriptome_source precomputed --ref_transcriptome transcripts.fa

Current workflow:
  --transcriptome_mode fixed_annotation \
  --ref_genome genome.fa \
  --ref_annotation annotation.gtf

Why do I now get both cohort and per-sample transcriptomes?

The workflow now treats both as important outputs. The shared cohort model is the main transcriptome used for reporting and optional DE/DTU, while the per-sample transcriptomes are provided for looking at each sample separately.

Why are DE results under `de_analysis/<contrast>/`?

The workflow now writes one subdirectory per comparison instead of publishing one flat DE result set. This makes the output folder clearer when more than one comparison is present.