ProteinGAN

A generative network architecture that can be used to produce de novo protein sequences.

Paper abstract

De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering, due to the broad spectrum of technological, scientific and medical applications. Currently, mapping protein sequence to protein function is, however, neither computationally nor experimentally tangible. Here we developed ProteinGAN, a specialised variant of the generative adversarial network that is able to 'learn' natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase as a template enzyme, we show that 24% of the ProteinGAN-generated and experimentally tested sequences are soluble and display wild-type level catalytic activity in the tested conditions in vitro, even in highly mutated (>100 mutations) sequences. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.

Licenses

All material is made available under the Creative Commons BY-NC 4.0 license. You can use, redistribute, and adapt the material for non-commercial purposes, as long as you give appropriate credit by citing our paper and indicating any changes that you've made.

System requirements

  • Operating System: Linux.
  • 64-bit Python 3.7 installation.
  • blastp: 2.6.0+
  • TensorFlow 1.13.1 or newer with GPU support.
  • One or more NVIDIA GPUs. Recommendation: at least an NVIDIA P100 GPU with 16 GB of memory.
  • NVIDIA driver 418.87 or newer, CUDA toolkit 10.1 or newer, cuDNN 7.6.2 or newer.
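
Part of the list above can be checked before creating the environment. The following is a minimal sketch, not part of the ProteinGAN repository: the function name check_requirements is hypothetical, the Python check assumes that 3.7 or newer is acceptable, and the script does not verify the blastp or TensorFlow versions, GPU support, the NVIDIA driver, CUDA, or cuDNN.

```python
import importlib.util
import shutil
import sys


def check_requirements():
    """Return a dict mapping each checked requirement to True or False.

    Hypothetical helper: it only covers what can be tested without
    importing TensorFlow or touching the GPU.
    """
    return {
        "linux": sys.platform.startswith("linux"),
        # Assumption: any Python >= 3.7 is treated as acceptable.
        "python>=3.7": sys.version_info >= (3, 7),
        "64-bit python": sys.maxsize > 2**32,
        # Only checks that blastp is on PATH, not that it is 2.6.0+.
        "blastp on PATH": shutil.which("blastp") is not None,
        # Only checks that the package can be found, not its version.
        "tensorflow installed": importlib.util.find_spec("tensorflow") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

A failed check here is only a hint; the driver, CUDA, and cuDNN versions still have to be confirmed separately.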

Conda environment

environment.yml lists all the dependencies required to run ProteinGAN. Create the environment with:

conda env create --file environment.yml

Data for training

ProteinGAN expects the following files in order to train and evaluate the network.

| File name | Data |
|---|---|
| properties.json | Information about the maximum sequence length and the enzyme class. |
| db_train.phr | Output of makeblastdb run on the training sequences. Used to evaluate the network during training. |
| db_train.pin | Output of makeblastdb run on the training sequences. Used to evaluate the network during training. |
| db_train.psq | Output of makeblastdb run on the training sequences. Used to evaluate the network during training. |
| db_val.phr | Output of makeblastdb run on the validation sequences. Used to evaluate the network during training. |
| db_val.pin | Output of makeblastdb run on the validation sequences. Used to evaluate the network during training. |
| db_val.psq | Output of makeblastdb run on the validation sequences. Used to evaluate the network during training. |
| train/{1}{2}{3}.tfrecords | Multiple tfrecords files containing training sequences. {2} and {3} are upsampling factors used to balance the training dataset. |
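
The files listed above can be prepared and verified with a small helper. This is a sketch under stated assumptions rather than the repository's own tooling: the function names and the FASTA file name in the usage note are hypothetical, and the layout (all files in one directory with a train/ subdirectory) is inferred from the table. The makeblastdb flags -in, -dbtype and -out are standard BLAST+ options; for a protein database they produce the .phr, .pin and .psq files.

```python
import os
import tempfile

# File names taken from the table above.
EXPECTED_FILES = ["properties.json"] + [
    f"db_{split}.{ext}"
    for split in ("train", "val")
    for ext in ("phr", "pin", "psq")
]


def makeblastdb_command(fasta_path, out_prefix):
    """Build the makeblastdb call that writes <out_prefix>.phr/.pin/.psq."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", out_prefix]


def missing_dataset_files(data_dir):
    """List the expected entries that are absent from data_dir."""
    missing = [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
    # The training sequences live in train/ as one or more .tfrecords files.
    train_dir = os.path.join(data_dir, "train")
    has_records = os.path.isdir(train_dir) and any(
        name.endswith(".tfrecords") for name in os.listdir(train_dir)
    )
    if not has_records:
        missing.append("train/*.tfrecords")
    return missing


if __name__ == "__main__":
    # Demo on an empty temporary directory: everything is reported missing.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in missing_dataset_files(demo_dir):
            print("missing:", name)
```

To build the databases, the command list can be passed to subprocess.run, for example subprocess.run(makeblastdb_command("train.fasta", "db_train"), check=True), where train.fasta is a placeholder for your own training sequences.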

Training networks

Once the data is ready, you can train your own ProteinGAN on a chosen set of sequences as follows:

  1. Edit gan/parameters.py to specify the dataset and training configuration.
  2. Run the training script with python train_gan.
  3. The results and weights are stored in the specified location, which is printed when the training script is executed. You can use TensorBoard to view all the details.
  4. The training may take several days (or weeks) to complete, depending on the configuration.
  5. Once training is completed, you can use generate.py to generate a chosen number of sequences.
  6. Once training is completed, you can use discriminator_scores.py to get discriminator scores for all provided sequences.
  7. Once training is completed, you can use test_gan.py to investigate GAN performance via interpolation.
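
The interpolation mentioned in the last step usually means walking between two points in the generator's latent space and decoding each intermediate point. The sketch below shows generic linear interpolation only; it is not the implementation in test_gan.py, the latent dimension of 8 is a made-up value, and no trained generator is called. Plain Python lists are used to keep the example dependency-free.

```python
import random


def interpolate(z_start, z_end, steps):
    """Return `steps` vectors evenly spaced on the line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at z_start, 1.0 at z_end
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    latent_dim = 8  # hypothetical size, not taken from gan/parameters.py
    z_a = [rng.gauss(0, 1) for _ in range(latent_dim)]
    z_b = [rng.gauss(0, 1) for _ in range(latent_dim)]
    for z in interpolate(z_a, z_b, steps=5):
        # In practice each z would be passed to the trained generator.
        print([round(v, 2) for v in z])
```

If the sequences decoded along such a path change gradually rather than abruptly, that is commonly read as a sign that the generator has learned a smooth latent space.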
