mdavari / protein-dihedral-angles-prediction

This project forked from m3h0w/protein-dihedral-angles-prediction

A hybrid approach to protein structure (dihedral / torsional angles) prediction using techniques from Mohammed AlQuraishi's work on End-to-end differentiable learning of protein structure and Gao et al. on RaptorX.

Home Page: https://gacka.space/portfolio/protein-structure-prediction.html

Jupyter Notebook 96.22% Python 3.78%

protein-dihedral-angles-prediction's Introduction

Protein Tertiary Structure Prediction

This project aims to reproduce selected parts of Mohammed AlQuraishi's work on End-to-end differentiable learning of protein structure (https://www.biorxiv.org/content/early/2018/08/29/265231) and Gao et al.'s work on RaptorX (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2065-x).

Report with the results: https://drive.google.com/file/d/1-SFavU5i6XlHK2sswy60k5TowgButezy/view

In the main folder you'll find notebooks that show examples of how to use the model.

Note:

The training_50_dih.joblib and validation_dih.joblib files are not included because they have to be generated from the txt version of the data. The raw data represents each protein as 3D coordinate vectors, while the txt pipeline expects proteins that have already been converted to dihedral angles. If you can't generate these files yourself, I recommend working with the full tensor-based pipeline (Full pipeline - tensor data.ipynb, or the model files directly) instead of the txt data.
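For readers who want to attempt the conversion mentioned in the note, the sketch below shows the standard way to compute a signed dihedral angle from four 3D points. It is not the code used in this repository; the function name dihedral and the tuple-based point format are assumptions made for illustration.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians, in [-pi, pi], for four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)  # central bond
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    # The angle between the two projections is the dihedral angle.
    return math.atan2(_dot(_cross(b1, v), w), _dot(v, w))
```

With backbone atoms, phi is conventionally the dihedral of (C of the previous residue, N, CA, C), psi of (N, CA, C, N of the next residue), and omega of (CA, C, N of the next residue, CA of the next residue).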

Model configuration details

This TensorFlow model uses the ProteinNet dataset (tensor version), as available in the preliminary release here: https://github.com/aqlaboratory/proteinnet

Input

The input consists of amino acid sequences and evolutionary data (PSSM). Parsing is done by the DataHandler object, which is written in the older queue-based paradigm (instead of the newer TensorFlow data pipeline).

Files

The files used as training inputs are selected by the following variables:

  • data_path: path to the ProteinNet containing casps (each casp then contains training, validation and test folders)
  • casps: a list of strings defining which casps should be loaded
  • percentages: a list of integers defining which structure identity clusters should be loaded
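As a rough illustration of how these three variables could combine, the sketch below enumerates one training directory per (casp, percentage) pair. The helper name training_dirs and the "<casp>/training/<percentage>" layout are assumptions made for this example; they are not taken from the DataHandler code, so check them against the local ProteinNet download.

```python
from pathlib import PurePosixPath

def training_dirs(data_path, casps, percentages):
    """List one hypothetical training directory per (casp, percentage) pair.

    The '<casp>/training/<percentage>' layout is an assumption made for
    illustration only.
    """
    root = PurePosixPath(data_path)
    return [str(root / casp / "training" / str(pct))
            for casp in casps
            for pct in percentages]

dirs = training_dirs("proteinnet", ["casp7", "casp8"], [30, 50])
```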

Features

Controlled by a single boolean, include_evo, which determines whether evolutionary features are used together with the amino acid sequences.
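A minimal sketch of what such a switch could do per residue is shown below: a one-hot encoding of the amino acid, optionally concatenated with that residue's PSSM row. The names AMINO_ACIDS, one_hot and build_features are hypothetical, and the real DataHandler may encode its inputs differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def one_hot(residue):
    """One-hot encode a single residue over the 20 standard amino acids."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]

def build_features(sequence, pssm=None, include_evo=True):
    """Return one feature row per residue.

    With include_evo=True, each row is the one-hot encoding followed by the
    matching PSSM row; otherwise only the one-hot encoding is returned.
    """
    if include_evo and pssm is None:
        raise ValueError("include_evo=True requires a PSSM")
    features = []
    for i, residue in enumerate(sequence):
        row = one_hot(residue)
        if include_evo:
            row = row + list(pssm[i])
        features.append(row)
    return features
```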

Model

The model is controlled through the Model class defined in model.py.

Its behaviour is fully determined by the arguments passed to the constructor: n_angles, model_type, prediction_mode, ang_mode, loss_mode, dropout_rate, and regularize_vectors.

  • n_angles: 2 to predict only phi and psi, 3 to predict phi, psi, and omega
  • model_type: see Core (below)
  • prediction_mode: see Predictions and corresponding loss modes (below)
  • ang_mode: see Predictions and corresponding loss modes -> Angularization
  • loss_mode: see Predictions and corresponding loss modes
  • dropout_rate: controls the regularization applied to the core model
  • regularize_vectors: controls whether a regularization loss is applied to the vectors to keep them on the unit circle (only available in 'regression_vectors' mode)
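The constraints described in this list can be summarized as a small validation sketch. ModelConfig below is a hypothetical stand-in, not the Model class from model.py; the default values and the accepted dropout range are assumptions, while the allowed option names come from the sections of this README.

```python
from dataclasses import dataclass

MODEL_TYPES = {"cnn_big", "cnn_small", "bilstm"}
PREDICTION_MODES = {"regression_angles", "regression_vectors",
                    "alphabet_angles", "alphabet_vectors"}
ANG_MODES = {"tanh", "cos"}
LOSS_MODES = {"angular_mae", "mae"}

@dataclass
class ModelConfig:
    # Defaults are illustrative assumptions, not the repository's defaults.
    n_angles: int = 2
    model_type: str = "cnn_big"
    prediction_mode: str = "regression_angles"
    ang_mode: str = "tanh"
    loss_mode: str = "angular_mae"
    dropout_rate: float = 0.0
    regularize_vectors: bool = False

    def __post_init__(self):
        if self.n_angles not in (2, 3):
            raise ValueError("n_angles must be 2 or 3")
        if self.model_type not in MODEL_TYPES:
            raise ValueError("unknown model_type")
        if self.prediction_mode not in PREDICTION_MODES:
            raise ValueError("unknown prediction_mode")
        # ang_mode only matters when angles are regressed directly.
        if (self.prediction_mode == "regression_angles"
                and self.ang_mode not in ANG_MODES):
            raise ValueError("ang_mode must be 'tanh' or 'cos'")
        if self.loss_mode not in LOSS_MODES:
            raise ValueError("unknown loss_mode")
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in [0, 1)")
        if (self.regularize_vectors
                and self.prediction_mode != "regression_vectors"):
            raise ValueError("regularize_vectors requires 'regression_vectors'")
```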

Core

Both CNNs use a ResNet-style architecture with residual connections between layers, batch normalization after each layer, and dropout after every second layer.

The number of filters (neurons per layer) starts at 32 and doubles every 2 layers. The filter size is fixed at 5.

Modes:

  • cnn_big: 8 layers
  • cnn_small: 6 layers
  • bilstm: bidirectional LSTM, 1 layer, 128 units
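Under one reading of the description above (filters double after each pair of layers, dropout follows every second layer), the per-layer settings of the two CNN modes can be sketched as follows. The helper names are hypothetical and the exact schedule in model.py may differ.

```python
CNN_LAYERS = {"cnn_big": 8, "cnn_small": 6}
FILTER_SIZE = 5  # fixed kernel width stated above

def filters_per_layer(n_layers, start=32, double_every=2):
    """Filter count per layer: start at `start`, double every `double_every` layers."""
    return [start * 2 ** (i // double_every) for i in range(n_layers)]

def dropout_layers(n_layers, every=2):
    """0-based indices of the layers that are followed by dropout."""
    return [i for i in range(n_layers) if (i + 1) % every == 0]

big = filters_per_layer(CNN_LAYERS["cnn_big"])
small = filters_per_layer(CNN_LAYERS["cnn_small"])
```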

Predictions and corresponding loss modes

Modes:

  • regression_angles

    n_angles values are predicted in a dense layer, passed through tanh or cos, and multiplied by pi to fit the radian range

    Available Angularization Modes: 'tanh' or 'cos'

    Available loss modes: 'angular_mae' or 'mae', both are applied to angles

  • regression_vectors

    n_angles*2 values are predicted in a dense layer and converted to angles by passing them through an atan2 function

    Available loss modes: 'angular_mae' or 'mae'. Angular mae is applied to angles, mae is applied to vectors.

  • alphabet_angles

    n_angles values are predicted as a weighted average over an alphabet of n_clusters entries, weighted by a probability distribution over that alphabet

    Available loss modes: 'angular_mae' or 'mae', both are applied to angles

  • alphabet_vectors

    As in alphabet_angles, but the network first predicts 2 values per angle and then atan2 is applied, as in regression_vectors.

    Available loss modes: 'angular_mae' or 'mae'. Angular mae is applied to angles, mae is applied to vectors.
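Two ideas from the modes above can be sketched in a few lines: the alphabet-based prediction (a softmax over alphabet entries followed by a weighted average) and the difference between 'mae' and 'angular_mae'. This is an illustration under assumptions, not the repository's implementation; in particular, the wrap-around definition of angular MAE used here is one common choice and may not match model.py exactly.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def alphabet_angle(logits, alphabet):
    """Weighted average of the alphabet angles under softmax(logits)."""
    probs = softmax(logits)
    return sum(p * a for p, a in zip(probs, alphabet))

def mae(predicted, target):
    """Plain mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)

def angular_mae(predicted, target):
    """Mean absolute angular difference, with each difference wrapped to [0, pi]."""
    diffs = [abs(math.atan2(math.sin(p - t), math.cos(p - t)))
             for p, t in zip(predicted, target)]
    return sum(diffs) / len(diffs)
```

The wrap-around matters near the boundary of the radian range: two angles just either side of pi are close on the circle, which angular_mae reflects, while plain mae treats them as almost 2*pi apart.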

Angularization

Depending on the prediction mode, an angularization mode might need to be specified.

regression_vectors: has angularization included in its transformations and has no options to choose from.
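As a sketch of the built-in angularization in regression_vectors, the code below converts pairs of predicted values to angles with atan2 and computes a penalty that pulls each pair toward the unit circle. The (sine-like, cosine-like) ordering of each pair and the squared-deviation form of the penalty are assumptions; the README only states that atan2 is used and that a regularization loss keeps the vectors on the unit circle.

```python
import math

def vectors_to_angles(values):
    """Convert a flat list [y0, x0, y1, x1, ...] to angles via atan2(y, x)."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values (2 per angle)")
    return [math.atan2(values[i], values[i + 1])
            for i in range(0, len(values), 2)]

def unit_circle_penalty(values):
    """Mean squared deviation of each (y, x) pair's length from 1."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values (2 per angle)")
    norms = [math.hypot(values[i], values[i + 1])
             for i in range(0, len(values), 2)]
    return sum((n - 1.0) ** 2 for n in norms) / len(norms)
```

Because atan2 depends only on the direction of each pair, the predicted angle is unaffected by the vector's length, which is why a separate penalty is needed if unit-length vectors are desired.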

regression_angles: continuous value predicted by a linear dense layer is piped through either tanh or cos, specified by the ang_mode argument.
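The two angularization options for regression_angles can be written out directly. The function name angularize is hypothetical; the formulas follow the description above (tanh or cos of the raw value, multiplied by pi).

```python
import math

def angularize(raw, ang_mode="tanh"):
    """Map an unbounded dense-layer output to an angle in [-pi, pi]."""
    if ang_mode == "tanh":
        return math.pi * math.tanh(raw)
    if ang_mode == "cos":
        return math.pi * math.cos(raw)
    raise ValueError("ang_mode must be 'tanh' or 'cos'")
```

Both options cover the same range but behave differently: tanh is monotonic and saturates toward -pi and pi for large inputs, while cos is periodic, so many raw values map to the same angle.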

protein-dihedral-angles-prediction's People

Contributors

m3h0w
