Data, Results, and Scripts for "Alignment of RNA Secondary Structures with Arbitrary Pseudoknots using Structural Sequences"

2026 · Dataset · Green Open Access

Authors

Michela Quadrini · Università di Camerino

Emanuela Merelli · Università di Camerino

Abstract

This repository contains the datasets, results, and scripts associated with the following manuscript:

Tesei, L., Levi, F., Quadrini, M., Merelli, E. (2026). “Alignment of RNA Secondary Structures with Arbitrary Pseudoknots using Structural Sequences”.

The software tool SERNAlign, used to process the RNA secondary structures and to generate the SERNA and SERNA-NC distances, is available at:
https://github.com/bdslab/sernalign

Please refer to the SERNAlign documentation at:
https://github.com/bdslab/sernalign/blob/master/README.md

This repository includes curated collections of RNA structures (with and without pseudoknots), comparative analyses performed with a wide set of structural alignment tools, and scripts for clustering analysis.

The following distances were used to generate the comparative results included in this repository: SERNA, SERNA-NC, ASPRA, PSMAlign, RAG-2D, Genus, and PskOrder. For completeness, the tables also include metrics computed using simple global descriptors, namely the number of base pairs (BP), the number of G and C nucleotides in the sequence (GC-Seq), the number of G–C base pairs (GC-Pairs), and the sequence length (Length).

SERNA and SERNA-NC are the distances computed by aligning Structural Sequences with and without structural constraints, respectively. ASPRA is the distance obtained by aligning Structural RNA Trees of RNA secondary structures with arbitrary pseudoknots [1]. PSMAlign and RAG-2D are the distances produced by the algorithms of the same names, [4] and [5] respectively. Genus and PskOrder are distances computed from the genus of each RNA molecule [2] and from its pseudoknot order [3]. BP and Length are the distances computed from the number of base pairs and from the sequence length of each molecule, respectively. GC-Seq and GC-Pairs are distances based on the number of occurrences of the nucleotides G and C, and on the number of G–C and C–G base pairs, respectively.
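
As an illustration of the four global descriptors (BP, GC-Seq, GC-Pairs, and Length), the following minimal Python sketch computes them from a sequence and an extended dot-bracket string. The function names, the supported bracket families, and the choice of the absolute difference as the distance between two descriptor values are assumptions made for this example; the repository does not state how the descriptor-based distances are actually derived.

```python
from dataclasses import dataclass

# Bracket families of extended dot-bracket notation. Each family is matched
# on its own stack, which allows crossing (pseudoknotted) base pairs.
OPENERS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = {close: open_ for open_, close in OPENERS.items()}


def base_pairs(dot_bracket):
    """Return the sorted 0-based (i, j) base pairs encoded by a dot-bracket string."""
    stacks = {open_: [] for open_ in OPENERS}
    pairs = []
    for j, symbol in enumerate(dot_bracket):
        if symbol in OPENERS:
            stacks[symbol].append(j)
        elif symbol in CLOSERS:
            stack = stacks[CLOSERS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {j}")
            pairs.append((stack.pop(), j))
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


@dataclass(frozen=True)
class Descriptors:
    length: int    # Length: number of nucleotides
    bp: int        # BP: number of base pairs
    gc_seq: int    # GC-Seq: number of G and C nucleotides
    gc_pairs: int  # GC-Pairs: number of G-C and C-G base pairs


def descriptors(sequence, dot_bracket):
    """Compute the four global descriptors of one molecule."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper()
    pairs = base_pairs(dot_bracket)
    gc_pairs = sum(1 for i, j in pairs if {seq[i], seq[j]} == {"G", "C"})
    return Descriptors(len(seq), len(pairs), seq.count("G") + seq.count("C"), gc_pairs)


def descriptor_distances(a, b):
    """ASSUMPTION: each distance is the absolute difference of the two values."""
    return {
        "Length": abs(a.length - b.length),
        "BP": abs(a.bp - b.bp),
        "GC-Seq": abs(a.gc_seq - b.gc_seq),
        "GC-Pairs": abs(a.gc_pairs - b.gc_pairs),
    }


# Toy molecules (not taken from the datasets): an H-type pseudoknot and a hairpin.
knot = descriptors("GCAAGUAAGCAAAC", "((..[[..))..]]")
hairpin = descriptors("GGAAACC", "((...))")
print(descriptor_distances(knot, hairpin))
```

The two molecules at the end are invented toy inputs. Real entries would be read from the DB files of the dataset folders, whose exact layout should be checked against the files themselves.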

Pseudoknots_Dataset Folder

This folder contains RNA molecules featuring pseudoknotted secondary structures, extracted from the PseudoBase++ database [7]. The dataset is organized into two main subfolders: pseudoknots_3class and pseudoknots_8class. The first groups molecules into three pseudoknot classes (H, HLout, and LL_HHH), while the second organizes them into eight classes (H, HHH, HLin, HLout, HLout_HHH, LL, LL_HHH, and HLout_HLin), providing a finer level of structural categorization. Within each dataset, the molecules are organized into subdirectories corresponding to different structural formats (BPSEQ, CT, and DB), each an alternative encoding of the RNA secondary structure. These formats allow the pseudoknotted structures to be processed by the different comparison and analysis tools used in this study.

Phylogenetic_Dataset Folder

This folder contains ribosomal RNA molecules extracted from the CRW2 (Comparative RNA Web-2) database [6], organized into three top-level subfolders corresponding to the three domains of life: Archaea, Bacteria, and Eukaryota. Each domain folder includes three subdirectories, one for each ribosomal RNA type present in the dataset: 5S, 16S, and 23S. Within each RNA-type folder, the molecules are provided in several structural formats (Bpseq, CT, and DB), which are different encodings of RNA secondary structures, with or without non-canonical interactions. This organization ensures compatibility with the various alignment and comparison tools employed in our analyses.

Pseudoknots_Dataset_Distances_And_Labels Folder

This folder contains the distances and labels for the pseudoknot datasets and is organized into three subfolders: pseudoknots_3class, pseudoknots_8class, and Labels. The first two subfolders collect the outputs produced by the different alignment and comparison tools applied to the molecules belonging to each pseudoknot dataset.
All result files follow a uniform naming convention of the form ToolName_DatasetName.csv, so that each result can be directly associated with both the tool used and the dataset to which it refers. The Labels folder contains the annotation files for both pseudoknot datasets. Each file follows the naming convention Labels_DatasetName_pseudoknotType.csv, where DatasetName identifies whether the file refers to the three-class or the eight-class dataset, and pseudoknotType denotes that the file reports the structural pseudoknot class assigned to each molecule. In the three-class dataset, pseudoknots are classified into H, HLout, and LL_HHH. In the eight-class dataset, the classification is refined into H, HHH, HLin, HLout, HLout_HHH, LL, LL_HHH, and HLout_HLin. These label files associate every molecule with its pseudoknot class and support downstream analyses based on structural categories.

Phylogenetic_Dataset_Distances_And_Labels Folder

This folder contains the distances and labels for the phylogenetic datasets and is structured into four subfolders: Archaea, Bacteria, Eukaryota, and Labels. The first three folders collect the outputs produced by the different alignment and comparison tools applied to the molecules of each domain of life. For each tool and domain, the results are distinguished by RNA type, and all files follow a consistent naming convention of the form ToolName_Domain_RNAType.csv, where Domain is one of Archaea, Bacteria, or Eukaryota, and RNAType is one of 5S, 16S, or 23S. This convention guarantees that each file can be unambiguously associated with the corresponding tool, domain, and ribosomal RNA type. The Labels folder contains all annotation files for the phylogenetic datasets.
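
The result-file naming conventions described so far (ToolName_DatasetName.csv and ToolName_Domain_RNAType.csv) lend themselves to simple programmatic parsing. The sketch below is one possible way to do it in Python; the function name is hypothetical, and it assumes that tool names contain no underscore (as in SERNA-NC or RAG-2D) and that DatasetName matches the subfolder names pseudoknots_3class and pseudoknots_8class. Both assumptions should be checked against the actual files.

```python
from pathlib import PurePosixPath

DOMAINS = {"Archaea", "Bacteria", "Eukaryota"}
RNA_TYPES = {"5S", "16S", "23S"}
# ASSUMPTION: DatasetName in the file names equals the dataset subfolder names.
PSEUDOKNOT_DATASETS = {"pseudoknots_3class", "pseudoknots_8class"}


def parse_result_filename(path):
    """Split a result file name into its tool and dataset (or domain/RNA type) parts."""
    name = PurePosixPath(path)
    if name.suffix != ".csv":
        raise ValueError(f"not a CSV file: {path}")
    # ASSUMPTION: the tool name is everything before the first underscore.
    tool, sep, rest = name.stem.partition("_")
    if not tool or not sep:
        raise ValueError(f"unexpected file name: {path}")
    if rest in PSEUDOKNOT_DATASETS:
        return {"tool": tool, "dataset": rest}
    domain, _, rna_type = rest.partition("_")
    if domain in DOMAINS and rna_type in RNA_TYPES:
        return {"tool": tool, "domain": domain, "rna_type": rna_type}
    raise ValueError(f"unexpected file name: {path}")


print(parse_result_filename("Bacteria/SERNA-NC_Bacteria_16S.csv"))
print(parse_result_filename("pseudoknots_3class/RAG-2D_pseudoknots_3class.csv"))
```

Rejecting names that match neither pattern, rather than guessing, keeps label files and unrelated CSV files from being silently treated as distance tables.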
Each label file follows the naming convention Labels_Domain_Type_taxonomicClassification.csv, where Domain is Archaea, Bacteria, or Eukaryota, Type is 5S, 16S, or 23S, and taxonomicClassification denotes that the file includes the taxonomic metadata associated with each molecule, namely its phylum, order, and class.

Scripts Folder

This folder contains the Python scripts that perform hierarchical clustering of all datasets using different linkage methods and evaluate the resulting clusters with the metrics reported in the Numerical_Results folder. The script ClusterMatrix.py takes CSV files containing distances, while the script ClusterFeatures.py takes CSV files containing features (RAG-2D).

Numerical_Results Folder

This folder contains all the tables with the numerical values of the metrics for each dataset and each distance, computed by the provided scripts.

References

[1] Quadrini, M., Tesei, L., and Merelli, E. (2020). ASPRAlign: a tool for the alignment of RNA secondary structures with arbitrary pseudoknots. Bioinformatics, 36(11), 3578-3579.
[2] Andersen, J. E., Penner, R. C., Reidys, C. M., and Waterman, M. S. (2013). Topological classification and enumeration of RNA structures by genus. Journal of Mathematical Biology, 67(5), 1261-1278.
[3] Zok, T., Badura, J., Swat, S., Figurski, K., Popenda, M., and Antczak, M. (2020). New models and algorithms for RNA pseudoknot order assignment. International Journal of Applied Mathematics and Computer Science, 30(2), 315-324.
[4] Chiu, J. K. H., and Chen, Y. P. P. (2015). Pairwise RNA secondary structure alignment with conserved stem pattern. Bioinformatics, 31(24), 3914-3921.
[5] Gan, H. H., Fera, D., Zorn, J., Shiffeldrim, N., Tang, M., Laserson, U., Kim, N., and Schlick, T. (2004). RAG: RNA-As-Graphs database--concepts, analysis, and features. Bioinformatics, 20(8), 1285-1291.
[6] Cannone, J. J., Subramanian, S., Schnare, M. N., Collett, J. R., D'Souza, L. M., Du, Y., Feng, B., Lin, N., Madabusi, L. V., Müller, K. M., Pande, N., Shang, Z., Yu, N., and Gutell, R. R. (2002). The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3(1), 2.
[7] Taufer, M., Licon, A., Araiza, R., Mireles, D., Van Batenburg, F. H. D., Gultyaev, A. P., and Leung, M. Y. (2009). PseudoBase++: an extension of PseudoBase for easy searching, formatting and visualization of pseudoknots. Nucleic Acids Research, 37(suppl_1), D127-D135.

Funding

This work was supported by the European Union - NextGenerationEU - National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.1, under the PRIN 2022 PNRR call (Min. Decree No. 1409, dated September 14, 2022), project P2022FFEWN, “RNA secondary structures and their relationship with function: application to non-coding RNAs” (RNA2Fun), CUP: J53D23014960001.
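
For readers who want a feel for the kind of analysis the Scripts folder performs, the following self-contained Python sketch runs a naive agglomerative (hierarchical) clustering on a small distance matrix with three common linkages and scores the result against known labels. It is not the code of ClusterMatrix.py: the function names, the toy matrix, and the use of purity as the evaluation metric are assumptions made for illustration, and the sketch starts from an in-memory matrix because the CSV layout is not described here.

```python
from collections import Counter


def agglomerative(dist, n_clusters, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns one cluster index per item. Cubic-time and meant only for
    small illustrative inputs.
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Inter-cluster distance under the chosen linkage.
                d = combine([dist[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * len(dist)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


def purity(predicted, truth):
    """Fraction of items that belong to the majority true class of their cluster.

    ASSUMPTION: purity is used here only as an example metric.
    """
    total = 0
    for k in set(predicted):
        members = [t for p, t in zip(predicted, truth) if p == k]
        total += Counter(members).most_common(1)[0][1]
    return total / len(truth)


# Toy data (invented): four molecules, two per pseudoknot class.
toy_distances = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
toy_classes = ["H", "H", "HLout", "HLout"]

for method in ("single", "complete", "average"):
    predicted = agglomerative(toy_distances, n_clusters=2, linkage=method)
    print(method, predicted, purity(predicted, toy_classes))
```

For datasets of realistic size, an optimized library implementation of hierarchical clustering would be the natural choice over this quadratic-per-merge loop.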

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18292181
