protein-embeddings's Introduction

Learning Dense Vector Representation for Proteins

This repo is largely developed on top of the TAPE repository. Please refer to it for setting up the basic Python environment. For the DeepSF model, we provide the pretrained results on the test set in the deepsf directory.

Overview

Protein sequences and their interactions can be viewed as a natural language, much like English. This opens up a wide spectrum of deep-learning-based language modeling techniques that can be applied to understand the inherent semantics of protein molecules. In this paper, we learn dense vector representations for proteins and compare them with popular sequence alignment methods on the homology detection task. Results indicate that protein vector representations outperform alignment techniques by a significant margin in both supervised and unsupervised learning paradigms. By fusing pretrained protein vectors from TAPE (BERT) and DeepSF, we improve upon the existing state of the art on the SCOP 1.75 dataset by 5% in accuracy.

Contributions

  1. On the unsupervised clustering task, dense protein vectors are found to be significantly more effective than sequence alignment.

  2. t-SNE projection analysis shows that protein embeddings learned from TAPE (a transformer-based architecture) capture protein structure semantics much better than those from DeepSF (a CNN-based architecture).

  3. TAPE embeddings, when combined with DeepSF embeddings, push the state of the art on the homology detection task over the SCOP 1.75 dataset by 5%.
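
The README does not spell out how the TAPE and DeepSF vectors are fused. One common approach, shown here purely as an assumption, is to L2-normalize each model's per-protein vector and concatenate the two along the feature axis before passing the result to a classifier. The function names and the random toy arrays below are hypothetical, and the 100-dimensional DeepSF width is a placeholder rather than the real size.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def fuse_embeddings(tape_vecs, deepsf_vecs):
    """Concatenate per-protein vectors from two models along the feature axis.

    Assumption: fusion by concatenation of normalized vectors. The repo's
    actual fusion strategy may differ.
    """
    if tape_vecs.shape[0] != deepsf_vecs.shape[0]:
        raise ValueError("both inputs must describe the same proteins, in the same order")
    return np.concatenate([l2_normalize(tape_vecs), l2_normalize(deepsf_vecs)], axis=1)


# Toy stand-ins for real embeddings: 4 proteins per model.
rng = np.random.default_rng(0)
tape_vecs = rng.normal(size=(4, 768))    # placeholder width for the TAPE vectors
deepsf_vecs = rng.normal(size=(4, 100))  # placeholder width for the DeepSF vectors
fused = fuse_embeddings(tape_vecs, deepsf_vecs)
print(fused.shape)  # (4, 868)
```

Normalizing before concatenation keeps either model from dominating the fused vector simply because its raw activations are larger.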

Package Requirements

biopython
scikit-learn
pyclustering
tape
pytorch
transformers
matplotlib
scipy
numpy

Dataset

Download the remote_homology JSON dataset from the TAPE repository and unzip it in the data directory.
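
After unzipping, a quick sanity check on the label distribution can catch a wrong path or a truncated download. The sketch below assumes each split is a single JSON file holding a list of records, each with `class_label` and `fold_label` fields (the label names passed to `--cluster_label` in the commands below). The file layout and the helper names are assumptions, not something this README confirms.

```python
import json
from collections import Counter


def load_records(path):
    """Read one dataset split, assumed to be a JSON list of record dicts."""
    with open(path) as handle:
        return json.load(handle)


def label_counts(records, label_key="class_label"):
    """Count how many records carry each value of the given label field."""
    return Counter(record[label_key] for record in records)


# Usage (hypothetical path; adjust to wherever the archive was unzipped):
# records = load_records("data/remote_homology/remote_homology_test_fold_holdout.json")
# print(len(records), label_counts(records, "fold_label").most_common(5))
```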

Global Sequence Alignment Based Clustering

python global_cluster_optim.py --cluster_label class_label --suffix test_fold_holdout --itermax 10
python global_cluster_optim.py --cluster_label fold_label --suffix test_fold_holdout --itermax 10
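
For readers unfamiliar with global alignment, the pairwise score that alignment-based clustering relies on can be sketched with the Needleman-Wunsch dynamic program. This is a minimal illustration with a flat match/mismatch/gap scheme chosen for readability. It is not taken from global_cluster_optim.py, whose scoring scheme is not described here, and real protein alignment typically uses a substitution matrix such as BLOSUM62.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Assumption: flat match/mismatch/gap scores (illustrative only).
    Only two rows of the DP table are kept, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(
                max(
                    prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                    prev[j] + gap,               # gap in b
                    curr[j - 1] + gap,           # gap in a
                )
            )
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2
```

A matrix of such scores over all sequence pairs can then be converted to distances and handed to a clustering routine.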

Vector Based Clustering

bash pretrained_cluster_script.sh
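
Both clustering approaches are compared against a ground-truth label (`class_label` or `fold_label`). One simple way to score a clustering against such labels is purity: the fraction of proteins that share the majority label of their cluster. The sketch below is an illustration only. Which metric the scripts in this repo actually report is not stated here, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of items whose label matches the majority label of their cluster.

    Assumption: purity is used here for illustration; it is not necessarily
    the metric computed by this repo's scripts.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")
    # Group the true labels by the cluster each item was assigned to.
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        members[cluster_id].append(label)
    # For each cluster, count the items carrying its most frequent label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)


print(cluster_purity([0, 0, 0, 1], ["a", "a", "b", "b"]))  # 0.75
```

Purity rises trivially as the number of clusters grows, so it is only meaningful when comparing methods at the same cluster count.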

Supervised Learning (Training)

For starting a new training run (initialized from the pretrained bert-base weights):

CUDA_VISIBLE_DEVICES=0,2,4,5 python train.py transformer remote_homology --from_pretrained bert-base --batch_size 64 --gradient_accumulation_steps 16 --num_train_epochs 15

For resuming training from a previously saved checkpoint:

CUDA_VISIBLE_DEVICES=0,2,4,5 python train.py transformer remote_homology --from_pretrained results/remote_homology_transformer_20-12-19-04-49-07_070319 --batch_size 64 --gradient_accumulation_steps 16 --num_train_epochs 15 --resume_from_checkpoint

Supervised Learning (Evaluation)

CUDA_VISIBLE_DEVICES=4,5,6,7 python test.py transformer remote_homology remote_homology_transformer_20-12-20-05-26-23_907518 --batch_size 64 --metric accuracy --split test_fold_holdout

t-SNE Projections

python tape_cluster.py --name remote_homology --suffix test_fold_holdout --cluster_label class_label
python deepsf_cluster.py --cluster_label class_label

Result Analysis

python result_analysis.py

Acknowledgements

This work was developed as a course project for CS466: Introduction to Bioinformatics. I wholeheartedly thank our course instructor, Prof. Mohammed El-Kebir, and all TAs and course staff for their teaching, guidance, and continuous support. The work is summarized, with detailed results and analysis, in the submitted report.

Contact Details

Jatin Arora
[email protected]
