dna_binding's Introduction

Classification of DNA-Binding Proteins Using Sequence Based Features and Feature Selection

The Datasets

The datasets folder contains all the features for the experiments. All the feature archives need to be unzipped and kept in the datasets folder for the code to run properly.

  • "All_32620_Features_Test_and_Train.zip" contains all the features extracted from both the train and the test datasets. These were used for the Recursive Feature Selection.
  • "Group Test Dataset.zip" and "Group Train Dataset.zip" contain the test and train files for the Grouped Feature Selection. The feature groups are stored in separate CSV files.
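
The unzipping step can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the archives sit directly inside the datasets folder and should be extracted in place; the helper name `extract_archives` is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def extract_archives(folder):
    """Extract every .zip archive found in `folder` into that same folder."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)            # unpack next to the archive
            extracted.extend(zf.namelist())  # remember what was unpacked
    return extracted


# Only runs when a "datasets" folder exists in the working directory.
if Path("datasets").is_dir():
    for name in extract_archives("datasets"):
        print("extracted:", name)
```

If the archives contain nested folders, the extracted files may need to be moved so the scripts find them where they expect.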

The Codes

Grouped Feature Selection

This technique was coded manually and separately for different combinations of features. We carried out all the experiments and stored the results in the "Grouped_Feature_Selection_All_Results.xlsx" file. After carrying out all the experiments, we identified the best group combination and then tested it on the train dataset. "Grouped_Feature_Selection_Final_GCEF_Test_Train.py" contains the final code, in which we calculated both the train and the test results.
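
The experiments described here were run by hand, one combination at a time. As an illustration only, the search over group combinations could be automated along these lines. This sketch assumes an exhaustive search over all non-empty combinations, which the repository does not confirm. The group names, the `toy_score` function, and its numbers are made up for the example; in practice the scoring function would train a classifier on the chosen groups and return its accuracy.

```python
from itertools import combinations


def best_group_combination(group_names, score_fn):
    """Score every non-empty combination of feature groups and return the best."""
    best_combo, best_score = None, float("-inf")
    for size in range(1, len(group_names) + 1):
        for combo in combinations(group_names, size):
            score = score_fn(combo)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score


# Hypothetical stand-in for "train a classifier on these groups and score it":
# each group adds a fixed gain and every extra group costs a small penalty.
GAINS = {"A": 0.30, "B": 0.10, "C": 0.25}


def toy_score(combo):
    return sum(GAINS[g] for g in combo) - 0.15 * len(combo)


combo, score = best_group_combination(list(GAINS), toy_score)
print(combo, round(score, 2))
```

The number of combinations doubles with each additional group, so an exhaustive search like this is only practical for a small number of groups.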

Recursive Feature Selection

In this technique, we ranked the features using a Random Forest classifier, identified the least important feature, and removed it from the train dataset. We ran the loop 32620 times, once for each feature, and chose the feature set with the best accuracy. After choosing the optimal feature set, we tested it on the test dataset.

  • "Recursive_Feature_Selection.py" contains the entire code for recursive feature selection.
  • "Recursive_Best_Feature_Set_Train_Test.py" contains the code that runs only until the optimal set of features is reached and then tests that feature set on the test data.
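
The elimination loop described above can be sketched as follows. This is not the code from the scripts listed here. It assumes scikit-learn, uses a small synthetic dataset in place of the 32620 extracted features, and measures accuracy with cross-validation on the training data, which is an assumption since the text does not say how accuracy was computed. The function name `recursive_elimination` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def recursive_elimination(X, y, n_estimators=25, cv=3, seed=0):
    """Drop the least important feature one at a time; return the best subset seen."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    best_score, best_features = -1.0, list(remaining)
    while remaining:
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        # Assumption: score the current subset by cross-validated accuracy.
        score = cross_val_score(
            model, X[:, remaining], y, cv=cv, scoring="accuracy"
        ).mean()
        if score > best_score:
            best_score, best_features = score, list(remaining)
        # Rank features by impurity-based importance and remove the weakest one.
        model.fit(X[:, remaining], y)
        weakest = int(np.argmin(model.feature_importances_))
        remaining.pop(weakest)
    return best_features, best_score


# Synthetic stand-in data: 200 samples, 12 features, 4 of them informative.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)
features, score = recursive_elimination(X, y)
print(len(features), "features kept, accuracy", round(score, 3))
```

Each iteration refits the forest, so the cost grows with the number of features; with tens of thousands of features a full run takes correspondingly long.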

Classifiers Used

The following classifiers were used in the experiments:

  • Random Forest
  • Extra Tree Classifier
  • Support Vector Machine
  • Logistic Regression
  • AdaBoost
  • Decision Tree
  • Gaussian Naive Bayes
  • K-Nearest Neighbour
  • Linear Discriminant Analysis
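
For orientation, the classifiers listed above map onto scikit-learn estimators roughly as shown below. This is a sketch under assumptions: the repository's actual imports and hyperparameters are not given here, default settings are used throughout, "Extra Tree Classifier" is taken to mean the ensemble `ExtraTreesClassifier`, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Default hyperparameters are an assumption, not the settings used in the experiments.
classifiers = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Extra Tree Classifier": ExtraTreesClassifier(random_state=0),
    "Support Vector Machine": SVC(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
}

# Synthetic stand-in data, split into train and test parts.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit each classifier on the train split and record its test accuracy.
accuracies = {
    name: clf.fit(X_train, y_train).score(X_test, y_test)
    for name, clf in classifiers.items()
}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```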

Performance Metrics

The classifiers were evaluated using the following metrics:

  • Accuracy
  • Sensitivity or Recall
  • Specificity
  • Matthews correlation coefficient (MCC)
  • Area under the receiver operating characteristic curve (auROC)
  • Area under the precision-recall curve (auPR)
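
The first four metrics follow directly from the confusion-matrix counts. The sketch below computes them in plain Python for binary 0/1 labels, with 1 taken as the positive (DNA-binding) class, which is an assumption. The helper name `binary_metrics` and the example labels are illustrative only. auROC and auPR are left out because they need predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # MCC is defined as 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Made-up labels: 3 TP, 1 FN, 3 TN, 1 FP.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
results = binary_metrics(y_true, y_pred)
print(results)
```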
