circ-EnviroPredict: A machine learning-based tool to predict potential involvement of circRNAs with cold and drought stress through a Word2Vec approach - datasets

2026 · 0 citations · Dataset · green Open Access

Authors

Maria Clara Martins Ferreira · Universidade Federal de Pelotas

Frederico Schmitt Kremer · Universidade Federal de Pelotas

Vanessa Galli · Universidade Federal de Pelotas

Abstract

circ-EnviroPredict Zenodo Repository

This repository contains the primary datasets used in the development and validation of circ-EnviroPredict, a machine learning tool designed to predict circular RNA (circRNA) involvement in plant abiotic stress conditions (cold and drought). The repository includes raw genomic sequences for vocabulary construction, labeled database records, processed Word2Vec numerical embeddings based on k-mer segmentation, approximate nearest neighbor search results for sequence similarity analysis, and independent cross-species datasets for external model validation.

Dataset Directory Structure

1. annoy/ (Approximate Nearest Neighbors Analysis)

- neighbors_results_cold_rice.xlsx: Tabular data containing the 5 nearest neighbor search results evaluating sequence similarity between rice circRNAs under cold stress and control conditions.
- neighbors_results_drought.xlsx: Tabular data containing the 5 nearest neighbor search results evaluating sequence similarity between circRNAs under drought stress and control conditions.

2. raw/ (Primary Sequence Data and Metadata)

- maize_db.xlsx & rice_db.xlsx: Labeled database records containing circRNA annotations, environmental condition classifications (control, cold, drought), and metadata extracted from CropCircDB.
- osaj43883_genomic_seq.txt & zma10381_genomic_seq.txt: Genomic sequences in FASTA format for 43,883 rice (Oryza sativa) circRNAs and 10,381 maize (Zea mays) circRNAs, used as the text corpus for Word2Vec vocabulary construction.

3. sample_validation/ (Cross-Species Validation Sets)

- validation_seq_arabidopsis.txt: Independent test set sequences for Arabidopsis thaliana.
- validation_seq_soybean.txt: Independent test set sequences for Glycine max.
- validation_seq_t_aestivum.txt: Independent test set sequences for Triticum aestivum.
- validation_seq_maize.txt: Supplementary validation sequence set for maize.
- validation_seq_control.txt: Unstressed baseline sequences utilized for external model validation.

4. word2vec_datasets/ (Engineered Feature Sets)

- maize_w2vec_3mer_64_dataset.xlsx: Numerically encoded maize dataset used for machine learning training and testing. Sequences are transformed into features via Word2Vec utilizing 3-mer segmentation and a 64-dimensional vector space.
- rice_w2vec_3mer_64_dataset.xlsx: Numerically encoded rice dataset used for machine learning training and testing. Sequences are transformed into features via Word2Vec utilizing 3-mer segmentation and a 64-dimensional vector space.

5. Root Directory Files

- file.txt: General repository documentation or unstructured textual data.
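The 3-mer segmentation that turns FASTA sequences into a Word2Vec text corpus can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `parse_fasta` and `kmer_tokens`, the overlapping stride of 1, and the two demo records are all assumptions, since the abstract states only that 3-mers and a 64-dimensional vector space were used.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append and normalise to upper case.
            records[header] += line.upper()
    return records


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a sequence into k-mer "words".

    stride=1 yields overlapping k-mers; this is an assumption, the
    source does not say whether the 3-mers overlap.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical demo records, not taken from the repository files.
example = [">circ_demo_1", "ATGCGT", "AC", ">circ_demo_2", "ggcatt"]

# One token list ("sentence") per circRNA sequence.
corpus = [kmer_tokens(seq) for seq in parse_fasta(example).values()]
print(corpus)
```

Each token list plays the role of a sentence, so the resulting corpus could be passed to any Word2Vec implementation configured for 64 dimensions (for example gensim's `Word2Vec` with `vector_size=64`). The abstract does not say which implementation was used.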

Topics & Keywords

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19372428
