Classification of DNA-Binding Proteins Using Sequence Based Features and Feature Selection

The Datasets

The datasets folder contains all the feature for the experiments. All the features need to unzipped and kept in the datasets folder for the codes to run properly.

"All_32620_Features_Test_and_Train.zip" contains all the features extratced from both the train and the test datasets, these were used for the Recursive feature Selection.
"Group Test Dataset.zip" and "Group Train Dataset.zip" contains test and train files for the Grouped feature Selection. The features groups are separated in different csv files.

The Codes

Grouped Feature Selection

The coding of this technique was done manually and spearately for different combinations of features. We carried out all the experiemnts and stored the results in the "Grouped_Feature_Selection_All_Results.xlsx" files. After carrying out all the experiemnts we found out the best group combination and the tested it on the train dataset. The "Grouped_Feature_Selection_Final_GCEF_Test_Train.py" contains that final code where we calculated both the train and the test results.

Recursive Feature Selection

In this technique we ranked the features using Random Forest classifier and identified the least important feature and removed it from the train dataset. We ran the loop for 32620 times as we have that many features and chose the feature set with the best accuracy. After choosing the optimal feature set we we tested it on the test dataset.

"Recursive_Feature_Selection.py" contains the entire code for recursive feature selecton.
"Recursive_Best_Feature_Set_Train_Test.py" contains the code where we only ran the code till the optimal set of features was reached and then tested the feature set on the testing data.

Classifiers Used

The following classifiers have been used in the experiements:

Random Forest
Extra Tree Classifier
Support Vector Machine
Logistic Regression
AdaBoost
Decision Tree
Gaussian Naive Bayes
K-Nearest Neighbour
Linear Discriminant Analysis

Performance Metrics

The classifiers were evaluated using the following metrics:

Accuracy
Sensitivity or Recall
Specificity
Mathews correlation coefficient (MCC)
area under receiver operating characteristic curve (auROC)
area under precision recall curve (auPR)

skadilina / dna_binding Goto Github PK

dna_binding's Introduction

Classification of DNA-Binding Proteins Using Sequence Based Features and Feature Selection

The Datasets

The Codes

Grouped Feature Selection

Recursive Feature Selection

Classifiers Used

Performance Metrics

dna_binding's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent