RNA Modified Base Best Practices and Benchmarking

By Marcus Stoiber
Published in Data Releases
March 06, 2025

Modified bases play a crucial role in RNA biology as essential regulators of gene expression and cellular function, impacting processes such as translation, splicing, and RNA stability. RNA modifications have been implicated in various biological pathways and are associated with diseases including cancer, neurological disorders, and immune system dysregulation.

Unlike most RNA sequencing technologies that rely on complementary DNA (cDNA) sequencing, direct RNA nanopore sequencing enables the direct detection of modified bases by measuring perturbations in the ionic current. As with DNA nanopore sequencing, these signal differences are leveraged in basecalling and can be further analyzed to characterize RNA modifications in greater detail.

In this post, we present a direct RNA nanopore sequencing dataset derived from synthetic oligonucleotides for modified base validation. This dataset enables researchers to systematically assess detection performance across different RNA modifications, and contains strands with the following modified bases:

  • m6A (N6-methyladenosine)
  • m5C (5-methylcytosine)
  • pseU (pseudouridine)
  • inosine

This post also discusses best practices for benchmarking and evaluating model performance.

Raw data, tools, and step-by-step instructions for running a validation pipeline to replicate these results are provided.

Unlike our previous DNA dataset, this RNA dataset does not systematically include all possible sequence contexts. Due to practical constraints, it contains only a random sampling of contexts. In the spirit of open data, rather than holding this dataset back, we have released the current benchmarking strands while we continue to work on a more complete validation dataset.

In eukaryotic RNA biology, the m6A modification in DRACH sequence contexts (IUPAC codes: D = not C; R = A or G; H = not G) is of particular interest, so a dedicated portion of the data covers DRACH-context sequences.
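The IUPAC definition above translates directly into a pattern match. The following is a minimal sketch, not part of the released tooling; the helper name `is_drach` is hypothetical, and accepting T alongside U is an assumption made because reference FASTA files are often written in the DNA alphabet.

```shell
# Sketch: test whether a 5-mer matches the DRACH motif.
# IUPAC codes: D = A, G or U (not C); R = A or G; H = A, C or U (not G).
# Assumption: T is accepted in place of U for DNA-alphabet references.
is_drach() {
    case "$1" in
        [AGUT][AG]AC[ACUT]) echo yes ;;
        *) echo no ;;
    esac
}

is_drach GGACU   # prints yes
is_drach CGACU   # prints no (C is not allowed at the D position)
```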

Data Access

Raw nanopore data for canonical and modified samples, reference sequences, and annotations of canonical and modified positions are available for download.

These datasets allow users to follow the analyses described in this post or expand upon them to conduct more in-depth investigations. We include two datasets: a “full” dataset and a “subset” dataset. Both are provided as raw sequencer output in pod5 format, ideal for signal visualization with Remora or for developing custom signal processing algorithms. The full dataset comprises all data collected during experimentation. The subset dataset was produced from the full dataset by aligning reads to the provided reference sequences and randomly selecting 10,000 reads per synthetic construct. This reference-balanced subset is intended to let users quickly reproduce results and investigate the synthetic datasets. Provided basecalls allow users to inspect modified base calls without needing to run the basecalling step shown below. The provided BAM files were basecalled from the subset dataset with the SUP v5.1 model.
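The reference-balanced selection described above can be illustrated with a short sketch. This is not the procedure used to build the released subset; it assumes a hypothetical two-column TSV of read ID and reference name (for example, columns 1 and 3 of `samtools view` output) and that GNU `shuf` is available.

```shell
# subsample_per_ref N FILE: print up to N randomly chosen rows per reference.
# FILE is a hypothetical two-column TSV: read_id <TAB> reference_name.
# Assumption: GNU coreutils `shuf` is installed.
subsample_per_ref() {
    n="$1"
    file="$2"
    # shuf randomizes the row order; awk then keeps the first N rows
    # it encounters for each reference name (column 2).
    shuf "$file" | awk -F '\t' -v n="$n" 'seen[$2]++ < n'
}

# Example (hypothetical file name):
#   subsample_per_ref 10000 read_to_reference.tsv | cut -f1 > selected_read_ids.txt
```

The resulting list of read IDs could then be handed to a read-ID-based filtering tool to extract the matching raw reads.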

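# Run basecalling with dorado: SUP model plus the selected modified base models.
# Passing --reference aligns reads to the DRACH-context strands; output is BAM on stdout.
dorado basecaller \
    sup,m6A_inosine,m5C,pseU \
    subset/control_DRACH.pod5 \
    --reference references/drach_context_strands.fa \
    > control_DRACH_mappings.bam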

The data is located on AWS S3 at:

s3://ont-open-data/rna-modbase-validation_2025.03

The structure of the S3 prefix is shown below.

.
├── full
| ├── control_DRACH.pod5
| ├── m6A_DRACH.pod5
| ├── control_rep1.pod5
| ├── control_rep2.pod5
| ├── m6A_rep1.pod5
| ├── m6A_rep2.pod5
| ├── inosine_rep1.pod5
| ├── inosine_rep2.pod5
| ├── m5C_rep1.pod5
| ├── m5C_rep2.pod5
| ├── pseU_rep1.pod5
| └── pseU_rep2.pod5
├── subset
| ├── control_DRACH.pod5
| ├── m6A_DRACH.pod5
| ├── control_rep1.pod5
| ├── control_rep2.pod5
| ├── m6A_rep1.pod5
| ├── m6A_rep2.pod5
| ├── inosine_rep1.pod5
| ├── inosine_rep2.pod5
| ├── m5C_rep1.pod5
| ├── m5C_rep2.pod5
| ├── pseU_rep1.pod5
| └── pseU_rep2.pod5
├── basecalls
| ├── control_DRACH.bam
| ├── m6A_DRACH.bam
| ├── control_rep1.bam
| ├── control_rep2.bam
| ├── m6A_rep1.bam
| ├── m6A_rep2.bam
| ├── inosine_rep1.bam
| ├── inosine_rep2.bam
| ├── m5C_rep1.bam
| ├── m5C_rep2.bam
| ├── pseU_rep1.bam
| └── pseU_rep2.bam
└── references
  ├── drach_context_strands.fa
  ├── drach_context_A_sites.bed
  ├── drach_context_m6A_sites.bed
  ├── sampled_context_strands.fa
  ├── sampled_context_A_sites.bed
  ├── sampled_context_C_sites.bed
  ├── sampled_context_U_sites.bed
  ├── sampled_context_m6A_sites.bed
  ├── sampled_context_inosine_sites.bed
  ├── sampled_context_m5C_sites.bed
  └── sampled_context_pseU_sites.bed

For more information and help downloading data from our open dataset archive, see the Datasets Tutorials page. Analyses based on these data are presented below.
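As a minimal sketch of fetching part of the dataset, the commands below assume the AWS CLI is installed and use anonymous access via `--no-sign-request`. The helper names `s3_uri` and `fetch_subset_example` are hypothetical, and the choice of files is only an example.

```shell
# Public S3 prefix given in this post.
S3_PREFIX="s3://ont-open-data/rna-modbase-validation_2025.03"

# s3_uri SUBDIR FILE: build the full S3 URI for one file under the prefix.
s3_uri() {
    printf '%s/%s/%s\n' "$S3_PREFIX" "$1" "$2"
}

# fetch_subset_example: download one subset pod5 file and the references directory.
# Assumption: the AWS CLI is installed; --no-sign-request requests anonymous access.
fetch_subset_example() {
    mkdir -p subset references
    aws s3 cp --no-sign-request "$(s3_uri subset m6A_DRACH.pod5)" subset/m6A_DRACH.pod5
    aws s3 sync --no-sign-request "$S3_PREFIX/references" references
}

# Print the URI that would be downloaded (no network access needed).
s3_uri subset m6A_DRACH.pod5
```

Calling `fetch_subset_example` performs the actual download; the subset files are the quicker starting point for reproducing the results.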

For more details on setting up tools and performing validation, please see our previous Modified Base Best Practices and Benchmarking blog post. The results presented below use Dorado version 0.9.1 and modkit version 0.4.3. All other setup and execution instructions remain the same.

Benchmarking Modified Base Detection with Modkit

The modkit validate command is a flexible tool for benchmarking modified base detection across different data types. It computes per-read, per-site balanced accuracy using specified ground truth sites, ensuring equal representation of modified and unmodified bases.
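For readers unfamiliar with the metric, balanced accuracy in the two-class case is the mean of the per-class recalls, so that neither class dominates the score. The sketch below uses that standard definition with hypothetical counts; it does not reproduce how modkit balances or filters calls internally, and the function name is illustrative only.

```shell
# balanced_accuracy CC CM MC MM: mean of the per-class recalls, as a percentage.
#   CC = canonical sites called canonical   CM = canonical sites called modified
#   MC = modified sites called canonical    MM = modified sites called modified
balanced_accuracy() {
    awk -v cc="$1" -v cm="$2" -v mc="$3" -v mm="$4" 'BEGIN {
        canonical_recall = cc / (cc + cm)
        modified_recall  = mm / (mm + mc)
        printf "%.2f\n", 100 * (canonical_recall + modified_recall) / 2
    }'
}

# Hypothetical counts for illustration only (not taken from the dataset).
balanced_accuracy 990 10 50 950   # prints 97.00
```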

In these tests, ground truth sites correspond to known modified positions in modified strands and their matching positions in control strands. False positive rates may be elevated near modified bases, a topic briefly explored at the end of this post. Users are encouraged to analyze this dataset to assess the impact of this effect on their applications.

The validate command is run across all replicates using appropriate ground truth files to generate accuracy metrics. The table below summarizes results for:

  • Different basecallers
  • Context-specific strands
  • Excluding specific modified base types (for multi-modification models)

Modified Base    Context    HAC Accuracy (%)    SUP Accuracy (%)
m6A only         DRACH      99.30               99.32
m6A only         Sampled    97.07               97.21
Inosine only     Sampled    95.08               97.05
m6A + Inosine    Sampled    94.54               96.05
m5C              Sampled    93.77               94.32
pseU             Sampled    96.84               97.27
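A validation run of the kind summarized above might be assembled as follows. This is a sketch under stated assumptions: it relies on the `--bam-and-bed` option of `modkit validate`, which takes a BAM file and its ground truth BED file as a pair, and the pairing of files shown here is inferred from the file names in the S3 listing rather than confirmed by this post. The helper `validate_cmd` is hypothetical and only prints the command line.

```shell
# validate_cmd CONTROL_BAM CONTROL_BED MOD_BAM MOD_BED: print a modkit validate command.
# Assumption: each BAM is paired with the ground truth BED that matches its sample.
validate_cmd() {
    printf 'modkit validate --bam-and-bed %s %s --bam-and-bed %s %s\n' \
        "$1" "$2" "$3" "$4"
}

# Example pairing for the sampled-context m6A strands (file pairing is an assumption).
cmd=$(validate_cmd \
    basecalls/control_rep1.bam references/sampled_context_A_sites.bed \
    basecalls/m6A_rep1.bam references/sampled_context_m6A_sites.bed)
echo "$cmd"

# To execute it (requires modkit on PATH and the downloaded files):
#   sh -c "$cmd"
```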

Digging a bit deeper, the HAC m6A + Inosine calls validated on sampled-context sites result in the following confusion matrix (produced directly by modkit validate):

Figure 2. HAC All-context m6A+Inosine Confusion Matrix

High rates of falsely identified modified bases at canonical positions are generally more problematic for downstream pipelines than modified sites falsely identified as canonical. Note that the frequency of false modified base calls (first row) is much lower than the frequency of false canonical calls at modified sites. Model training has been tuned to reduce false modified base calls for all models. For applications requiring higher sensitivity, modkit provides options to adjust filtering thresholds.

Isolating either m6A or inosine calls eliminates the possibility of misclassification between modified base types, resulting in higher accuracy (Fig. 2).

Figure 2a. Isolated m6A calls (HAC, sampled contexts)
Figure 2b. Isolated Inosine calls (HAC, sampled contexts)

For applications where certain modified base types can be excluded without significant loss of informative signal, the following modkit command can be used:

# ignore m6A
modkit adjust-mods \
    --ignore a \
    control_rep1_basecalls.bam \
    control_rep1_basecalls_inosine_only.bam

# ignore inosine (encoded as ChEBI code 17596 - defined in SAM specifications)
modkit adjust-mods \
    --ignore 17596 \
    control_rep1_basecalls.bam \
    control_rep1_basecalls_m6A_only.bam

Upgrading the basecalling model from high-accuracy (HAC) to super-accuracy (SUP) yields a corresponding improvement in modified base accuracy, including a reduction in false modified base calls (Fig. 3).

Figure 3a. SUP Isolated m6A calls (sampled contexts)
Figure 3b. SUP Isolated Inosine calls (sampled contexts)

Analysis of the strands containing DRACH sequence contexts shows additional improvements in accuracy to enable common eukaryotic analyses with higher precision (Fig. 4).

Figure 4. SUP DRACH-context m6A Confusion Matrix

For m5C and pseU, we observe similar model characteristics, with a very low false modified base frequency (Fig. 5).

Figure 5a. SUP m5C Confusion Matrix (sampled contexts)
Figure 5b. SUP pseU Confusion Matrix (sampled contexts)

As discussed earlier, bases adjacent to modified sites can occasionally exhibit false modified base calls. This effect arises because a nearby modification perturbs the nanopore signal, making it harder for the model to determine which base carries the modification.

For example, within the DRACH context, false m5C calls occur at a rate of 13.6% for cytosine bases within two positions of an m6A site, whereas more distant cytosines in the same strands maintain a low 0.9% error rate. On the other hand, false m6A calls at unmodified adenine bases within two positions of an m6A site occur at a 4.2% rate, while more distant adenines experience only a modest decrease to 3.6%.
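To explore this proximity effect on other data, candidate positions can be split into those near a modified site and those further away before computing error rates. The sketch below is one possible way to do that with BED files and is not the analysis used to produce the figures above; the function name is hypothetical, strand is ignored, and coordinates are compared using the BED start column only.

```shell
# label_proximity MOD_BED CANDIDATE_BED [WINDOW]
# Appends a label to each candidate row:
#   site = the candidate is itself a modified site
#   near = within WINDOW bases (default 2) of a modified site on the same contig
#   far  = anything else
# Assumption: tab-separated BED input; strand is ignored.
label_proximity() {
    awk -v w="${3:-2}" 'BEGIN { FS = OFS = "\t" }
        NR == FNR { mod[$1, $2] = 1; next }   # first file: remember modified sites
        {
            label = (($1, $2) in mod) ? "site" : "far"
            for (d = 1; d <= w; d++) {
                if (label == "far" && ((($1, $2 - d) in mod) || (($1, $2 + d) in mod)))
                    label = "near"
            }
            print $0, label
        }' "$1" "$2"
}

# Example (file pairing is an assumption based on the reference file names):
#   label_proximity references/drach_context_m6A_sites.bed references/drach_context_A_sites.bed
```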

These results highlight the impact of signal interactions in modified base detection and reinforce the importance of careful interpretation when analyzing modifications in close proximity. To assist researchers, we provide these benchmarking strands as a valuable resource for calibrating biological experiments and refining analytical pipelines based on empirical error rates. Moreover, we are actively enhancing our models to further improve accuracy in distinguishing true modification sites, ensuring continued advancements in modified base detection with each new release.

Discussion

These benchmarking strands provide a resource for assisting in calibration of modified base experiments, allowing researchers to optimize experimental parameters. For example, they can be used to determine the required sequencing coverage for detecting differential modifications with confidence.

As highlighted in our previous DNA blog post, Modkit offers powerful tools for modified base analysis, including identifying differentially modified regions, mapping modifications to transcriptomic features, performing differential methylation analysis, analyzing modification entropy, and exploring sequence motifs. These capabilities make it easier to extract meaningful insights from modified base data at scale.

This post serves as a comprehensive guide to RNA modified base analysis using synthetic ground truth data. By following best practices and leveraging these datasets, researchers can generate accurate and reproducible modification calls, driving forward discoveries in RNA epitranscriptomics.

While all sequencing technologies face challenges in resolving closely spaced modifications, nanopore sequencing stands apart as the only platform capable of direct, single-molecule RNA modification detection without requiring amplification or chemical conversion. With ongoing model improvements and the continued expansion of supported modified bases, nanopore sequencing is poised to push the boundaries of RNA biology. Future developments will further refine accuracy, expand the repertoire of detectable modifications, and enhance our understanding of the epitranscriptome, solidifying nanopore sequencing as the leading technology for RNA modification analysis.

