If you use this workflow in your research, please consider citing:
Orhan, M. E., Demirci, Y. M., & Saçar Demirci, M. D. (2023). NeRNA: A negative data generation framework for machine learning applications of noncoding RNAs. Computers in Biology and Medicine, 159, 106861. https://doi.org/10.1016/j.compbiomed.2023.106861
NeRNA is a novel negative data generation framework developed on the KNIME Analytics Platform. The workflow takes known non-coding RNA sequences as input and generates negative RNA sequences from them.
Supervised machine learning-based non-coding RNA (ncRNA) analysis methods have been developed to classify and identify novel sequences. In such analyses, the positive learning data sets usually consist of known ncRNA examples published in databases. In contrast, there are neither databases listing confirmed negative sequences for a specific ncRNA class nor standardized methodologies for generating high-quality negative examples. To address this challenge, we developed a novel negative data generation method, NeRNA (negative RNA).
You can download the NeRNA workflow from the Knime Workflow folder or directly here.
The NeRNA framework is developed on the KNIME Analytics Platform, so KNIME must be installed first. The second required tool is the RNAfold application from the ViennaRNA Package (please follow the installation instructions for RNAfold on their website). The R software environment and the seqinr and stringr packages are required for the R scripts.
To configure R settings in KNIME:
- Inside KNIME, go to File -> Preferences -> KNIME (left side of the pop-up) -> R.
- Set the R path and the Rserve memory.
- Use the following commands in R / RStudio to install the required packages.
# Install Rserve if it is not already installed, then start it.
install.packages("Rserve")
library("Rserve")
Rserve(args = "--vanilla")
# Additionally, the seqinr and stringr packages are required in order to use the R scripts.
install.packages("seqinr")
install.packages("stringr")
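Before opening the workflow, it can help to confirm that the external tools listed in the requirements are reachable from the command line. The following is a minimal sketch, not part of NeRNA itself; the helper name `find_tools` is hypothetical, and it assumes the executables are called `RNAfold` and `Rscript` and are expected on your `PATH`.

```python
import shutil


def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Assumed executable names; adjust them if your installation differs.
    for tool, path in find_tools(["RNAfold", "Rscript"]).items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If RNAfold is reported as missing, you can still point the workflow at its full path in the first configuration node.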
- Select Sequence File, Sequence Type, and RNAfold Path: This node sets the location of the sequence FASTA file and the RNAfold path. The non-coding RNA type must also be selected here.
- NeRNA Generation: NeRNA Generation is the primary node of the NeRNA workflow. It contains two subgroups: CASE switch and NeRNA Generator.
  - CASE switch: This node changes the RNAfold and Sequence Converter calculation parameters according to the sequence type. For example, for circRNA sequences the --circ parameter is passed to RNAfold, and for tRNA sequences the Sequence Converter node is modified.
  - NeRNA Generator: The main node of the negative RNA workflow.
- RNAfold Calculation: This node calculates the secondary structure of each sequence. Secondary structures are essential because negative sequences are generated from these structural representations. Sequences for which RNAfold does not produce a result are removed; check the Std Output and R error Output of the RNAfold Calculation node for details.
- Checking Wrong Calculation: This node checks for structures without MFE (minimum free energy) values.
- Check Missing Value: This node checks for non-calculated sequences. These sequences are removed before the sequence converter process.
- Sequence Converter: This meta node reconfigures sequences based on their secondary structures and base pairing.
- Negative Generator Binary Index Change: This meta node performs the main calculation of the NeRNA workflow. All sequences are converted to octal representation, and then a novel methodology is applied to each sequence to create negative RNA sequences.
- Column Filter: Filters out unused columns such as the iteration number.
- Column Rename: This node renames the column for the FASTA Writer.
- FASTA Writer: Writes a FASTA file using the file name and output location provided by the user.
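The folding and filtering steps described above can be approximated outside KNIME for a quick check. The sketch below is an illustration under stated assumptions, not the workflow's actual implementation: the function names and the sequence-type labels (`"circRNA"`, `"miRNA"`) are hypothetical, RNAfold is called with its `--noPS` option to skip PostScript output, and `--circ` is added only for circular RNAs as the CASE switch description suggests. Records whose structure line lacks an MFE value are dropped, mirroring the checking nodes.

```python
import re
import subprocess

# Matches an RNAfold structure line such as "(((...))) ( -1.20)".
STRUCTURE_LINE = re.compile(r"^([().]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)\s*$")


def build_rnafold_command(rnafold_path, sequence_type):
    """Build the RNAfold argument list; '--circ' is added only for circRNA."""
    cmd = [rnafold_path, "--noPS"]
    if sequence_type == "circRNA":  # hypothetical label for circular RNAs
        cmd.append("--circ")
    return cmd


def parse_rnafold_output(text):
    """Return (name, sequence, structure, mfe) tuples; skip records without an MFE."""
    records = []
    name = sequence = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name, sequence = line[1:].strip(), None
        elif sequence is None:
            sequence = line
        else:
            match = STRUCTURE_LINE.match(line)
            if match:
                records.append((name, sequence, match.group(1), float(match.group(2))))
            name = sequence = None
    return records


def fold(fasta_text, rnafold_path="RNAfold", sequence_type="miRNA"):
    """Run RNAfold on FASTA text (requires RNAfold to be installed)."""
    result = subprocess.run(
        build_rnafold_command(rnafold_path, sequence_type),
        input=fasta_text, capture_output=True, text=True, check=True,
    )
    return parse_rnafold_output(result.stdout)
```

Only `fold` needs a working RNAfold installation; the parser can be tried on any saved RNAfold output.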
The NeRNA workflow was tested on four non-coding RNA classes: microRNA, long non-coding RNA, circular RNA, and tRNA sequences.
In the case studies, machine learning and deep learning-based classifiers, namely Decision Trees (DT), Random Forest (RF), Naive Bayes (NB), Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and Feed-forward Neural Network (FNN), are employed to test the novel negative sequences. In each test, equal numbers of negative and positive sequences are used to train the models, and the data sets are divided into learning and testing portions at a 70/30 ratio. Additionally, 1000-fold Monte Carlo Cross-Validation is used in the process.
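The evaluation protocol above (balanced classes, a 70/30 split, repeated random resampling) can be sketched as follows. This is a generic illustration of Monte Carlo cross-validation using only the Python standard library, not the code used in the case studies; the function names, the seed, and the default arguments are assumptions chosen for the example.

```python
import random


def balance(positives, negatives, rng):
    """Randomly downsample the larger class so both classes have equal size."""
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)


def monte_carlo_splits(n_samples, n_repeats=1000, train_fraction=0.7, seed=0):
    """Yield (train_indices, test_indices) for repeated random 70/30 splits."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]
```

Each yielded pair can be used to train a classifier on the first index list and score it on the second; averaging the scores over all repeats gives the cross-validated estimate.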
RNA type | Organism | Number of sequences | Min length (nt) | Max length (nt) | Average length (nt) | Source |
---|---|---|---|---|---|---|
miRNA hairpins | Homo sapiens | 1917 | 41 | 180 | 81.89 | miRBase |
miRNA hairpins | Mus musculus | 1234 | 39 | 147 | 82.6 | miRBase |
miRNA hairpins | Bos taurus | 1064 | 43 | 149 | 76.23 | miRBase |
miRNA hairpins | Gallus gallus | 882 | 48 | 169 | 87.36 | miRBase |
miRNA hairpins | Oreochromis niloticus | 812 | 40 | 100 | 61.05 | miRBase |
miRNA hairpins | Equus caballus | 715 | 52 | 145 | 104.61 | miRBase |
miRNA hairpins | Glycine max | 684 | 54 | 473 | 135.92 | miRBase |
miRNA hairpins | Monodelphis domestica | 680 | 44 | 111 | 64.92 | miRBase |
miRNA hairpins | Medicago truncatula | 672 | 54 | 910 | 165.26 | miRBase |
miRNA hairpins | Pan troglodytes | 655 | 69 | 148 | 89.94 | miRBase |
tRNA | 101* | 1110 | 54 | 99 | 77.56 | Psi-C Database |
lncRNA | Homo sapiens | 1000 | 202 | 29066 | 1496.97 | LNCipedia |
circRNA | Mus musculus | 1000 | 51 | 29991 | 1566.49 | circBase |
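To check your own input file against the kind of summary shown in the table, the count and the minimum, maximum, and average sequence length can be computed directly from a FASTA file. The sketch below is a minimal, standard-library illustration; the function names are hypothetical and it assumes a plain FASTA layout with `>` header lines.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def length_summary(sequences):
    """Return count, min, max, and average length (2 decimals) of the sequences."""
    lengths = [len(seq) for seq in sequences]
    return {
        "number": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "average": round(sum(lengths) / len(lengths), 2),
    }
```

For example, `length_summary(seq for _, seq in read_fasta(open("input.fasta").read()))` would summarize a file named `input.fasta` (a placeholder file name).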
Positive sequences, NeRNA-generated negative sequences, and the classification results of the case studies are available in the Case Studies folder. NeRNA Structure Result contains the secondary structures of 5 negative and 5 normal example sequences. The secondary structures of the RNAs are drawn using the StructureEditor tool.
Negative RNA data sets from the literature are used for comparison with the negative data generated by NeRNA.