S4 for De Novo Drug Design

Hello hello! 🙋‍♂️ Welcome to the official repository of Structured State Space Sequence Models for De Novo Drug Design!

First things first, thanks a lot for your interest in our work and code 🙏 Please consider starring ⭐ the repository if you find it useful, since it helps us know how much maintenance we should do! 😇

This document will walk you through the installation and usage of our codebase. By the end of it, you'll be able to pre-train, fine-tune, and sample your own structured state-space sequence model (S4) to design molecules in only 4 lines of code 😍 Let's get started 🚀

Installation 🛠️

You first need to download this codebase. You can either click the green button in the top-right corner of this page and download the codebase as a zip file, or clone the repository with the following command if you have git installed:

git clone https://github.com/molML/s4-for-de-novo-drug-design.git

We'll use conda to create a new environment for our codebase. If you haven't used conda before, we recommend you take a look at this tutorial before moving forward.

Otherwise, fire up a terminal in the (root) directory of the codebase and type the following commands:

conda create -n s4_for_dd python==3.8.11 
conda activate s4_for_dd 
conda install pytorch==1.13.1 pytorch-cuda==11.6 -c pytorch -c nvidia  # install pytorch with CUDA support
conda install --file requirements.txt -c conda-forge  
python -m pip install .  # install this codebase -- make sure that you are in the root directory of the codebase

Warning

If you don't have (or need) GPU support for pytorch, replace the third command above with the following: conda install pytorch==1.13.1 -c pytorch
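If you are not sure which variant you ended up with, a quick check like the one below may help. It is only a sketch: the helper name `pick_device` is hypothetical and not part of s4dd. It relies on `torch.cuda.is_available()` and falls back to "cpu" when pytorch is missing or was installed without CUDA support.

```python
def pick_device() -> str:
    """Return "cuda" if pytorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also runs before pytorch is installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# `pick_device` is an illustrative helper, not an s4dd function.
device = pick_device()
print(f"pytorch will run on: {device}")
```

The returned string can then be passed as the `device` argument when creating the model.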

That's it! You have successfully installed our codebase, named s4dd. Now, let's see the magical 4 lines to design bioactive molecules with S4 🔮

Designing Molecules with S4 👩‍💻

Here we are: we pre-train an S4 on ChEMBL, fine-tune on a set of bioactive molecules for the protein target PKM2, and design new molecules. All with the following 4 lines of code:

from s4dd import S4forDenovoDesign

# Create an S4 model with (almost) the same parameters in the paper.
s4 = S4forDenovoDesign(
    n_max_epochs=1,  # This is for demonstration purposes only. Set this to a (much) higher value for actual training. Default: 400.
    batch_size=64,  # This is also for demonstration purposes. The value in the paper is 2048.
    device="cuda",  # Replace this with "cpu" if you didn't install pytorch with CUDA support.
)
# Pretrain the model on ChEMBL
s4.train(
    training_molecules_path="./datasets/chemblv31/mini_train.zip",  # This is a 50K subsample of the ChEMBL training set for quick(er) testing.
    val_molecules_path="./datasets/chemblv31/valid.zip",
)
# Fine-tune the model on bioactive molecules for PKM2
s4.train(
    training_molecules_path="./datasets/pkm2/train.zip",
    val_molecules_path="./datasets/pkm2/valid.zip",
)
# Design new molecules
designs, lls = s4.design_molecules(n_designs=32, batch_size=16, temperature=1.0)

Voila! 🎉 You have successfully trained your own S4 model from scratch for de novo drug design and designed molecules in 4 lines 🧿 Examples for each step are also available in the examples/ folder.
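Once you have `designs` and `lls`, you will probably want to look at the most likely unique designs first. The sketch below assumes that `designs` is a list of SMILES strings and `lls` is a list of floats of the same length; please check the documentation for the exact return types. The helper `unique_designs_by_likelihood` and the example values are made up for illustration.

```python
from typing import Dict, List, Tuple


def unique_designs_by_likelihood(
    designs: List[str], lls: List[float]
) -> List[Tuple[str, float]]:
    """Drop repeated SMILES strings and sort the rest from most to least likely."""
    best: Dict[str, float] = {}
    for smiles, ll in zip(designs, lls):
        # Keep the highest log-likelihood seen for each distinct string.
        if smiles not in best or ll > best[smiles]:
            best[smiles] = ll
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up values standing in for the output of `s4.design_molecules(...)`.
example_designs = ["CCO", "c1ccccc1", "CCO", "CCN"]
example_lls = [-3.2, -5.1, -3.2, -4.0]

ranked = unique_designs_by_likelihood(example_designs, example_lls)
for smiles, ll in ranked:
    print(f"{ll:8.2f}  {smiles}")
```

Note that this compares raw strings, so two different SMILES spellings of the same molecule are treated as distinct designs.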

Warning

Make sure that you replace the "cuda" argument with "cpu" if you didn't install pytorch with CUDA support.

Important

Use a smaller batch size if you face out-of-memory errors.
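If you would rather not search for a fitting batch size by hand, a retry loop along these lines is one option. It is a sketch, not part of s4dd: `run_with_smaller_batches` and `fake_training_run` are hypothetical names, and the loop assumes that memory exhaustion surfaces as a `RuntimeError` whose message contains "out of memory", which is how pytorch reports CUDA out-of-memory errors.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_smaller_batches(
    run: Callable[[int], T], batch_size: int, min_batch_size: int = 1
) -> T:
    """Call `run(batch_size)`, halving the batch size after each out-of-memory error."""
    while True:
        try:
            return run(batch_size)
        except RuntimeError as error:
            # Only out-of-memory errors trigger a retry; anything else is re-raised.
            if "out of memory" not in str(error).lower():
                raise
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"Out of memory, retrying with batch_size={batch_size}")


# Stand-in for a real training call: pretends that batches above 16 do not fit.
def fake_training_run(batch_size: int) -> int:
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory (simulated)")
    return batch_size


used_batch_size = run_with_smaller_batches(fake_training_run, batch_size=64)
print(f"Training ran with batch_size={used_batch_size}")
```

In practice, `run` would create a fresh `S4forDenovoDesign` with the given `batch_size` and call `train` on it, since the batch size is set when the model is constructed.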

You can do more with s4dd, e.g., save/load models, calculate likelihoods of molecules, and monitor model training. Let's quickly cover those 🏃

Additional Functionalities 🕹️

1. Save/Load Models 💾

Saving models is useful to resume training later or to design molecules without repeating the training, e.g., for fine-tuning and chemical space exploration. That's why we made model saving in s4dd as simple as:

s4.save("./models/foo")  # s4 is the S4 model we trained above.

Then to load the same model in another file/session:

# load it back
loaded_s4 = S4forDenovoDesign.from_file("./models/foo")
...  # resume training with `loaded_s4` or design molecules...

2. Calculate Molecule Likelihoods 🎲

In addition to designing molecules, S4 (or any chemical language model) can compute likelihoods of molecules, enabling new evaluation perspectives. A detailed discussion of 'how' is available in our paper.

Let's dive back into the code here and see how we can compute the (log)likelihood of a molecule via s4dd:

lls = s4.compute_molecule_loglikelihoods(["CCCc1ccccc1", "CCO"], batch_size=1)

As usual, it's that easy! 🤷‍♂️
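For intuition only: an autoregressive language model scores a string by summing the log-probabilities it assigns to each token given the preceding ones. The toy example below shows that arithmetic with made-up per-token probabilities. The function `sequence_loglikelihood` is hypothetical, and how s4dd tokenizes molecules or handles start and end tokens is not covered here.

```python
import math
from typing import List


def sequence_loglikelihood(token_probabilities: List[float]) -> float:
    """Sum the log-probabilities a model assigned to each observed token."""
    return sum(math.log(p) for p in token_probabilities)


# Made-up per-token probabilities for a three-token string such as "CCO".
toy_probabilities = [0.5, 0.25, 0.5]
toy_ll = sequence_loglikelihood(toy_probabilities)
print(f"log-likelihood: {toy_ll:.4f}")
print(f"likelihood:     {math.exp(toy_ll):.4f}")
```

Since every factor is at most 1, log-likelihoods are never positive, and values closer to zero mean the model finds the molecule more likely.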

3. Monitor Model Training 🔍

Tracking the model training is crucial for any machine learning project. Our codebase, s4dd, provides out-of-the-box functionality to help you, fellow machine learning researcher 🤞

s4dd implements four "callbacks" to monitor model training:

  • EarlyStopping stops the training if an evaluation metric has not improved for a pre-set number of epochs, saving some precious training time 💰
  • ModelCheckpoint saves the model every fixed number of epochs so that the intermediate models are available for analysis 🔬
  • HistoryLogger saves the training history at every epoch to monitor the training and validation losses 📉
  • DenovoDesign designs molecules at the end of every epoch with selected temperatures to track the model's generation capabilities 💊
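To make the early-stopping idea concrete, here is a toy version of the rule. It is an illustration under assumptions, not the s4dd implementation: the class `ToyEarlyStopping` is hypothetical, and it interprets `patience`, `delta`, and `mode` in the common way (stop once the metric has failed to improve by more than `delta` for `patience` epochs in a row).

```python
class ToyEarlyStopping:
    """Illustrative early-stopping rule; not the s4dd implementation."""

    def __init__(self, patience: int, delta: float = 0.0, mode: str = "min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.patience = patience
        self.delta = delta
        self.mode = mode
        self.best = None
        self.epochs_without_improvement = 0

    def should_stop(self, value: float) -> bool:
        """Record one epoch's metric and report whether training should stop."""
        if self.best is None:
            improved = True
        elif self.mode == "min":
            improved = value < self.best - self.delta
        else:
            improved = value > self.best + self.delta
        if improved:
            self.best = value
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [1.00, 0.80, 0.70, 0.71, 0.70, 0.72, 0.69]
stopper = ToyEarlyStopping(patience=3, delta=1e-5, mode="min")
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}")
```

With these numbers the last improvement happens in epoch 3, so the rule fires three epochs later and the final, slightly better loss is never reached. That is the trade-off a small `patience` makes.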

Integrating any of those callbacks into the model training is almost trivial: you just need to pass them as a list to the train method:

from s4dd import S4forDenovoDesign
from s4dd.torch_callbacks import EarlyStopping, ModelCheckpoint, HistoryLogger, DenovoDesign

s4 = S4forDenovoDesign(
    n_max_epochs=10,
    batch_size=32,
    device="cuda", 
)
s4.train(
    training_molecules_path="./datasets/chemblv31/train.zip",
    val_molecules_path="./datasets/chemblv31/valid.zip",
    callbacks=[
        EarlyStopping(
            patience=5, delta=1e-5, criterion="val_loss", mode="min"
        ),
        ModelCheckpoint(
            save_fn=s4.save, save_per_epoch=3, basedir="./models/"
        ),
        HistoryLogger(savedir="./models/"),
        DenovoDesign(
            design_fn=lambda t: s4.design_molecules(
                n_designs=32, batch_size=16, temperature=t
            ),
            basedir="./models/",
            temperatures=[1.0, 1.5, 2.0],
        ),
    ],
)
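If the `temperatures` argument looks mysterious, the toy example below shows what a sampling temperature usually does. It assumes the common convention of dividing the logits by the temperature before the softmax; whether s4dd applies it in exactly this way is an assumption here, and `softmax_with_temperature` is a hypothetical helper with made-up logits.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)  # subtracted for numerical stability
    exponentials = [math.exp(value - largest) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Made-up logits for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.0]
for temperature in [1.0, 1.5, 2.0]:
    probabilities = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

Higher temperatures flatten the distribution, so sampling becomes more diverse but also less faithful to what the model considers likely.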

Documentation 📜

Are you interested in doing more with s4dd? Or do you need more information about some of s4dd's (very cool) functionalities? Then you may find our online documentation useful. There you can find a detailed description of every class and function in s4dd. Happy reading! 🤓

Closing Remarks 🎆

Thanks again for finding our code interesting! Please consider starring the repository ✨ and citing our work if this codebase has been useful for your research 👩‍🔬 👨‍🔬

@article{ozcelik2023structured,
  title={Structured State-Space Sequence Models for De Novo Drug Design},
  author={{\"O}z{\c{c}}elik, R{\i}za and de Ruiter, Sarah and Grisoni, Francesca},
  journal={ChemRxiv},
  year={2023},
  publisher={ACS Publications}
}

If you have any questions, please don't hesitate to open an issue in this repository. We'll be happy to help 🕺

Hope to see you around! 👋 👋 👋

