NeRNA: a negative data generation framework for machine learning applications of non-coding RNAs

About

If you use this workflow in your research, please consider citing:

Orhan, M. E., Demirci, Y. M., & Saçar Demirci, M. D. (2023). NeRNA: A negative data generation framework for machine learning applications of noncoding RNAs. Computers in biology and medicine, 159, 106861. https://doi.org/10.1016/j.compbiomed.2023.106861

NeRNA is a novel negative data generation framework developed on the KNIME Analytics Platform. The workflow takes known non-coding RNA sequences as input and generates negative RNA sequences from them.

Supervised machine learning-based non-coding RNA (ncRNA) analysis methods have been developed to classify and identify novel sequences. In such analyses, the positive learning data sets usually consist of known ncRNA examples published in databases. In contrast, there are neither databases listing confirmed negative sequences for a specific ncRNA class nor standardized methodologies for generating high-quality negative examples. To address this challenge, we developed a novel negative data generation method, NeRNA (negative RNA).

Requirements

You can download the NeRNA workflow from the Knime Workflow folder or directly here.

The NeRNA framework is developed on the KNIME Analytics Platform, so KNIME must be installed first. The second required tool is the RNAfold application from the ViennaRNA package (please follow the installation instructions on their website). The R software environment and the seqinr and stringr packages are required for the R scripts.

To configure R settings in KNIME:

In KNIME, open File -> Preferences -> KNIME (left side of the pop-up) -> R.
Set the R path and the Rserve memory.
Run the following commands in R or RStudio to install the required packages and start Rserve.

# Install and start Rserve, which KNIME uses to communicate with R.
install.packages("Rserve")
library("Rserve")
Rserve(args = "--vanilla")
# The seqinr and stringr packages are also required by the R scripts.
install.packages("seqinr")
install.packages("stringr")
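
Before opening the workflow, it can help to confirm that the external tools listed above are reachable. The following is a minimal sketch, not part of NeRNA: the helper name `find_tools` is hypothetical, and it assumes the executables are called `RNAfold` and `Rscript` and are expected on the system PATH. If RNAfold is installed elsewhere, its full path can still be entered in the workflow.

```python
import shutil


def find_tools(names=("RNAfold", "Rscript")):
    """Map each executable name to its resolved path, or None if it is not on PATH.

    The default names are assumptions; adjust them to your installation.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A `None` entry only means the tool is not on PATH; it does not mean the tool is missing from the machine.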

KNIME Workflow Overview

Select Sequence File, Sequence Type, and RNAfold Path: This node sets the location of the sequence FASTA file and the RNAfold path. The non-coding RNA type must also be selected here.
NeRNA Generation: NeRNA Generation is the primary node of the NeRNA workflow. It contains two subgroups: CASE switch and NeRNA Generator.
- CASE switch: This node changes the RNAfold and Sequence Converter calculation parameters according to the sequence type. For example, for circRNA sequences the --circ parameter is passed to RNAfold, and for tRNA the Sequence Converter node is modified.
- NeRNA Generator: The main node of the negative RNA workflow.
  - RNAfold Calculation: This node calculates secondary structures for each sequence. Secondary structures are essential since negative sequences are generated based on these structural representations.
    
    Sequences for which RNAfold cannot calculate a structure are removed. Check the Std Output and R error Output of the RNAfold Calculation node for details.
  - Checking Wrong Calculation: This node checks for structures without MFE (minimum free energy) values.
  - Check Missing Value: This node checks for non-calculated sequences. These sequences are removed before the sequence converter process.
  - Sequence Converter: This meta node reconfigures sequences based on their secondary structures and base pairing.
  - Negative Generator Binary Index Change: This meta node performs the main calculation of the NeRNA workflow. All sequences are converted to octal representation, and then a novel methodology is applied to each sequence to create negative RNA sequences.
Column Filter: Filters out unused columns, such as the iteration number.
Column Rename: This node renames the column for the FASTA Writer.
FASTA Writer: Writes a FASTA file using the file name and output location provided by the user.
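
The RNAfold-related steps above (CASE switch, RNAfold Calculation, Checking Wrong Calculation, Check Missing Value) can be sketched outside KNIME as follows. This is an illustrative sketch under stated assumptions, not the workflow's actual implementation: the names `rnafold_args`, `parse_rnafold`, `keep_folded`, and `Fold` are hypothetical, the `--noPS` option (which suppresses RNAfold's PostScript drawings) is an assumption that the source does not mention, and the sample text is hand-written in RNAfold's usual three-line output layout rather than real output. The negative-generation step itself is not reproduced here.

```python
import re
from typing import List, NamedTuple, Optional


class Fold(NamedTuple):
    name: str
    sequence: str
    structure: Optional[str]  # dot-bracket string, or None if missing
    mfe: Optional[float]      # minimum free energy, or None if missing


def rnafold_args(rnafold_path: str, sequence_type: str) -> List[str]:
    """Build the RNAfold command line; --circ is added for circRNA only."""
    args = [rnafold_path, "--noPS"]  # --noPS is an assumption, see lead-in
    if sequence_type.lower() == "circrna":
        args.append("--circ")
    return args


# A structure line looks like: "(((....))) ( -1.20)"
_STRUCT_LINE = re.compile(r"^([().]+)\s+\(\s*(-?\d+(?:\.\d+)?)\s*\)$")


def parse_rnafold(text: str) -> List[Fold]:
    """Parse header / sequence / structure records from RNAfold-style text."""
    records: List[List[str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:]])
        elif records:
            records[-1].append(line)

    folds = []
    for name, *rest in records:
        sequence = rest[0] if rest else ""
        match = _STRUCT_LINE.match(rest[1]) if len(rest) > 1 else None
        if match:
            folds.append(Fold(name, sequence, match.group(1), float(match.group(2))))
        else:
            folds.append(Fold(name, sequence, None, None))
    return folds


def keep_folded(folds: List[Fold]) -> List[Fold]:
    """Drop records that lack a structure or an MFE value."""
    return [f for f in folds if f.structure is not None and f.mfe is not None]


if __name__ == "__main__":
    sample = ">seq1\nGGGAAAUCCC\n(((....))) ( -1.20)\n>seq2\nACGU\n"
    for fold in keep_folded(parse_rnafold(sample)):
        print(fold.name, fold.structure, fold.mfe)
```

In this sketch the command list from `rnafold_args` would be passed to something like `subprocess.run` with the FASTA content on standard input; that call is left out so the example runs without RNAfold installed.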

Case Studies

The NeRNA workflow was tested on four non-coding RNA classes: microRNA, long non-coding RNA, circular RNA, and tRNA sequences.

In the case studies, machine learning and deep learning-based classifiers, namely Decision Trees (DT), Random Forest (RF), Naive Bayes (NB), Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and Feed-forward Neural Networks (FNN), are employed to test the novel negative sequences. Equal numbers of negative and positive sequences are used to train the models, and the data sets are divided into learning and testing portions at a 70/30 ratio. Additionally, 1000-fold Monte Carlo Cross-Validation is used in the process.
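
The evaluation protocol described above, repeated random 70/30 splits, can be sketched as follows. This is a generic illustration of Monte Carlo cross-validation using only the Python standard library, not the code used in the case studies; the function name `monte_carlo_splits` and its parameters are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def monte_carlo_splits(
    n_samples: int,
    n_rounds: int = 1000,
    train_fraction: float = 0.7,
    seed: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for repeated random splits.

    Each round reshuffles all indices independently, so a sample may land
    in the test portion in some rounds and in the training portion in others.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    n_train = round(n_samples * train_fraction)
    for _ in range(n_rounds):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Ten samples, three rounds, just to show the shape of the output.
    for train, test in monte_carlo_splits(10, n_rounds=3):
        print(len(train), len(test))
```

In use, a classifier would be trained on the training indices and scored on the test indices in each round, and the scores would then be averaged over all rounds.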

RNA type	Organism	Number of sequences	Min length (nt)	Max length (nt)	Average length (nt)	Source
miRNA hairpins	Homo sapiens	1917	41	180	81.89	miRBase
	Mus musculus	1234	39	147	82.6	miRBase
	Bos taurus	1064	43	149	76.23	miRBase
	Gallus gallus	882	48	169	87.36	miRBase
	Oreochromis niloticus	812	40	100	61.05	miRBase
	Equus caballus	715	52	145	104.61	miRBase
	Glycine max	684	54	473	135.92	miRBase
	Monodelphis domestica	680	44	111	64.92	miRBase
	Medicago truncatula	672	54	910	165.26	miRBase
	Pan troglodytes	655	69	148	89.94	miRBase
tRNA	101*	1110	54	99	77.56	Psi-C Database
lncRNA	Homo sapiens	1000	202	29066	1496.97	LNCipedia
circRNA	Mus musculus	1000	51	29991	1566.49	circBase
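
The count and length columns in the table can be recomputed for any FASTA file with a few lines of code. The sketch below is an illustration only; `read_fasta` and `length_summary` are hypothetical helper names, and the inline sequences are made up for the example rather than taken from the case study data.

```python
from statistics import mean
from typing import Dict, List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs; multi-line sequences are joined."""
    records: List[Tuple[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:], ""))
        elif records:
            name, seq = records[-1]
            records[-1] = (name, seq + line)
    return records


def length_summary(records: List[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the number of sequences and their min, max, and average length."""
    if not records:
        raise ValueError("no sequences found")
    lengths = [len(seq) for _, seq in records]
    return {
        "number": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "average": round(mean(lengths), 2),
    }


if __name__ == "__main__":
    example = ">a\nACGU\nAC\n>b\nACGUACGU\n"
    print(length_summary(read_fasta(example)))
```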

The positive sequences, the NeRNA-generated negative sequences, and the classification results of the case studies are available in the Case Studies folder. NeRNA Structure Result contains the secondary structures of 5 negative and 5 normal example sequences. The secondary structures of the RNAs were drawn using the StructureEditor tool.

Comparison Analysis

Negative data generated by NeRNA are compared with negative RNA sources used in the literature.
