S4 for De Novo Drug Design

Hello hello! Welcome to the official repository of Chemical Language Modeling with Structured State Spaces!

First things first, thanks a lot for your interest in our work and code. Please consider starring the repository if you find it useful; it helps us know how much maintenance we should do!

This document will walk you through the installation and usage of our codebase. By the end of it, you'll be able to pre-train, fine-tune, and sample your own structured state-space sequence model (S4) to design molecules in only 4 lines of code. Let's get started!

Installation

You first need to download this codebase. You can either click on the green button on the top-right corner of this page and download the codebase as a zip file or clone the repository with the following command, if you have git installed:

git clone https://github.com/molML/s4-for-de-novo-drug-design.git

We'll use conda to create a new environment for our codebase. If you haven't used conda before, we recommend you take a look at this tutorial before moving forward.

Otherwise, fire up a terminal in the (root) directory of the codebase and type the following commands:

conda create -n s4_for_dd python==3.8.11 
conda activate s4_for_dd 
conda install pytorch==1.13.1 pytorch-cuda==11.6 -c pytorch -c nvidia  # install pytorch with CUDA support
conda install --file requirements.txt -c conda-forge  
python -m pip install .  # install this codebase -- make sure that you are in the root directory of the codebase

Warning

If you don't have (or need) GPU support for pytorch, replace the third command above with the following: conda install pytorch==1.13.1 -c pytorch
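To double-check the environment after running the commands above, a small script along these lines can help. It is only a sketch: the function name `check_environment` is made up for illustration, and it relies on nothing but the Python standard library to report whether `torch` and `s4dd` can be imported.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "s4dd")):
    """Report the running Python version and which packages can be imported."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `s4dd` shows up as `False`, the most likely cause is that the last installation command was not run from the root directory of the codebase.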

That's it! You have successfully installed our codebase, which we call s4dd. Now, let's see the magical 4 lines to design bioactive molecules with S4.

Designing Molecules with S4

Here we are: we pre-train an S4 on ChEMBL, fine-tune on a set of bioactive molecules for the protein target PKM2, and design new molecules. All with the following 4 lines of code:

from s4dd import S4forDenovoDesign

# Create an S4 model with (almost) the same parameters in the paper.
s4 = S4forDenovoDesign(
    n_max_epochs=1,  # This is only for demonstration purposes. Set this to a (much) higher value for actual training. Default: 400.
    batch_size=64,  # This is also for demonstration purposes. The value in the paper is 2048.
    device="cuda",  # Replace this with "cpu" if you didn't install pytorch with CUDA support.
)
# Pretrain the model on ChEMBL
s4.train(
    training_molecules_path="./datasets/chemblv31/mini_train.zip",  # This is a 50K subsample of the ChEMBL training set for quick(er) testing.
    val_molecules_path="./datasets/chemblv31/valid.zip",
)
# Fine-tune the model on bioactive molecules for PKM2
s4.train(
    training_molecules_path="./datasets/pkm2/train.zip",
    val_molecules_path="./datasets/pkm2/valid.zip",
)
# Design new molecules
designs, lls = s4.design_molecules(n_designs=32, batch_size=16, temperature=1.0)

Voila! You have successfully trained your own S4 model from scratch for de novo drug design and designed molecules in 4 lines. Examples for each step are also available in the examples/ folder.
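Once you have `designs` and `lls`, a natural next step is to drop repeated designs and sort the rest by log-likelihood. The sketch below assumes that `designs` is a list of SMILES strings and `lls` holds one numeric log-likelihood per design, in the same order; the helper `rank_unique_designs` and the example values are made up for illustration and are not part of s4dd.

```python
def rank_unique_designs(designs, lls):
    """Return unique designs sorted from most to least likely."""
    best = {}
    for smiles, ll in zip(designs, lls):
        # Keep the highest log-likelihood seen for each distinct string.
        if smiles not in best or ll > best[smiles]:
            best[smiles] = ll
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Placeholder values standing in for the output of s4.design_molecules(...).
example_designs = ["CCO", "c1ccccc1", "CCO"]
example_lls = [-3.2, -5.1, -2.9]
ranked = rank_unique_designs(example_designs, example_lls)
```

Note that this compares SMILES as plain strings, so two different SMILES of the same molecule are treated as distinct; canonicalize them first if that matters for your analysis.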

Warning

Make sure that you replace the "cuda" argument with "cpu" if you didn't install pytorch with CUDA support.
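If the same script should run on machines with and without a GPU, the device string can be chosen at runtime. The helper below is a sketch (the name `pick_device` is made up); it only uses `torch.cuda.is_available()` and falls back to "cpu" when pytorch cannot be imported at all.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on a GPU anyway.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
```

The result can then be passed as `device=device` when creating the model.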

Important

Use a smaller batch size if you face out-of-memory errors.
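One way to find a batch size that fits is to retry with half the batch size whenever pytorch reports an out-of-memory error. The helper below is a generic sketch, not an s4dd feature: `run_with_smaller_batches` is a made-up name, and it assumes the error is a `RuntimeError` whose message contains "out of memory", which is how pytorch reports CUDA memory exhaustion.

```python
def run_with_smaller_batches(run, batch_size, min_batch_size=1):
    """Call run(batch_size), halving batch_size after each out-of-memory error."""
    while True:
        try:
            return run(batch_size)
        except RuntimeError as error:
            out_of_memory = "out of memory" in str(error).lower()
            # Re-raise unrelated errors, or give up once the floor is reached.
            if not out_of_memory or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

For example, `run_with_smaller_batches(lambda bs: s4.design_molecules(n_designs=32, batch_size=bs, temperature=1.0), 16)` would retry molecule design with 8, 4, ... if 16 does not fit.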

You can do more with s4dd, e.g., save/load models, calculate likelihoods of molecules, and monitor model training. Let's quickly cover those.

Additional Functionalities

1. Save/Load Models

Saving models is useful for resuming training later or for designing molecules without repeating the training, e.g., for fine-tuning and chemical space exploration. That's why we made model saving in s4dd as simple as:

s4.save("./models/foo")  # s4 is the S4 model we trained above.

Then to load the same model in another file/session:

# load it back
loaded_s4 = S4forDenovoDesign.from_file("./models/foo")
...  # resume training with `loaded_s4` or design molecules...

2. Calculate Molecule Likelihoods

In addition to designing molecules, S4 (or any chemical language model) can compute likelihoods of molecules, enabling new evaluation perspectives. A detailed discussion of 'how' is available in our paper.

Let's dive back into the code here and see how we can compute the (log)likelihood of a molecule via s4dd:

lls = s4.compute_molecule_loglikelihoods(["CCCc1ccccc1", "CCO"], batch_size=1)

As usual, it's that easy!
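To make the returned values easier to read, you can pair each molecule with its log-likelihood and the corresponding likelihood. This is a sketch under two assumptions: the function returns one numeric value per input molecule in input order, and the values are natural-log likelihoods. The helper `summarize_loglikelihoods` and the numbers below are placeholders, not model output.

```python
import math


def summarize_loglikelihoods(molecules, lls):
    """Map each molecule to its log-likelihood and the corresponding likelihood."""
    return {
        smiles: {"loglikelihood": float(ll), "likelihood": math.exp(float(ll))}
        for smiles, ll in zip(molecules, lls)
    }


molecules = ["CCCc1ccccc1", "CCO"]
placeholder_lls = [-12.0, -4.0]  # Made-up numbers, not model output.
summary = summarize_loglikelihoods(molecules, placeholder_lls)
```

Since a log-likelihood is typically a sum over tokens, longer SMILES tend to receive lower values; keep that in mind when comparing molecules of very different sizes.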

3. Monitor Model Training

Tracking model training is crucial for any machine learning project. Our codebase, s4dd, provides out-of-the-box functionality to help you, fellow machine learning researcher.

s4dd implements four "callbacks" to monitor model training:

  • EarlyStopping stops the training if an evaluation metric stops improving for a pre-set number of epochs, saving some precious training time
  • ModelCheckpoint saves the model every fixed number of epochs so that the intermediate models are available for analysis
  • HistoryLogger saves the training history at every epoch to monitor the training and validation losses
  • DenovoDesign designs molecules at the end of every epoch with selected temperatures to track the model's generation capabilities
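If you are wondering what a sampling temperature does, the sketch below shows the standard temperature-scaled softmax used in autoregressive sampling: logits are divided by the temperature before normalization, so higher temperatures flatten the distribution and give more diverse samples. This is a generic illustration with made-up logits; whether s4dd applies temperature in exactly this way is an assumption.

```python
import math


def temperature_softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability.
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


logits = [2.0, 1.0, 0.0]  # Made-up scores for three candidate tokens.
for t in (1.0, 1.5, 2.0):
    print(t, [round(p, 3) for p in temperature_softmax(logits, t)])
```

Running it shows the probability of the top-scoring token shrinking as the temperature grows.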

Integrating any of those callbacks into model training is almost trivial: you just need to pass them as a list to the train method:

from s4dd import S4forDenovoDesign
from s4dd.torch_callbacks import EarlyStopping, ModelCheckpoint, HistoryLogger, DenovoDesign

s4 = S4forDenovoDesign(
    n_max_epochs=10,
    batch_size=32,
    device="cuda", 
)
s4.train(
    training_molecules_path="./datasets/chemblv31/train.zip",
    val_molecules_path="./datasets/chemblv31/valid.zip",
    callbacks=[
        EarlyStopping(
            patience=5, delta=1e-5, criterion="val_loss", mode="min"
        ),
        ModelCheckpoint(
            save_fn=s4.save, save_per_epoch=3, basedir="./models/"
        ),
        HistoryLogger(savedir="./models/"),
        DenovoDesign(
            design_fn=lambda t: s4.design_molecules(
                n_designs=32, batch_size=16, temperature=t
            ),
            basedir="./models/",
            temperatures=[1.0, 1.5, 2.0],
        ),
    ],
)
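To build intuition for the `patience` and `delta` arguments passed to EarlyStopping above, here is a generic patience-based stopping rule for a metric that should decrease (as with `mode="min"`). It is a sketch of the general technique, not s4dd's implementation: the class `PatienceTracker` and the loss values are made up for illustration.

```python
class PatienceTracker:
    """Generic patience-based stopping rule for a metric that should decrease."""

    def __init__(self, patience, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.stale_epochs = 0

    def should_stop(self, value):
        # An epoch counts as an improvement only if it beats the best value
        # by more than delta; otherwise it adds to the stale-epoch counter.
        if value < self.best - self.delta:
            self.best = value
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


tracker = PatienceTracker(patience=2, delta=1e-5)
val_losses = [1.0, 0.8, 0.8, 0.81, 0.7]  # Made-up validation losses.
stopped_at = next(
    (epoch for epoch, loss in enumerate(val_losses) if tracker.should_stop(loss)),
    None,
)
```

With these numbers, the loss stops improving after the second epoch, so the rule fires two epochs later and the final (better) value is never seen. That is the trade-off a small `patience` makes.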

Documentation

Are you interested in doing more with s4dd? Or do you need more information about some of s4dd's (very cool) functionalities? Then you may find our online documentation useful. There you can find a detailed description of every class and function in s4dd. Happy reading!

Or are you only interested in a deeper look into the results of our work? Then here is a link to our Zenodo repository.

Closing Remarks

Thanks again for finding our code interesting! Please consider starring the repository and citing our work if this codebase has been useful for your research.

@article{ozccelik2024chemical,
  title={Chemical Language Modeling with Structured State Spaces},
  author={{\"O}z{\c{c}}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca},
  year={2024}
}

If you have any questions, please don't hesitate to open an issue in this repository. We'll be happy to help.

Hope to see you around!

s4-for-de-novo-drug-design's Issues

Bug? libmamba error: cannot install both pin-1-1 and pin-1-1

Hi guys,

Thanks a lot for making your research available.

I'm trying to install it locally, following the installation instructions in your readme file. However, in the next-to-last step, I get this weird error:

$ conda install --file requirements.txt -c conda-forge
Channels:
 - conda-forge
 - bioconda
 - defaults
 - https://repo.anaconda.com/pkgs/free
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: \ warning  libmamba Added empty dependency for problem type SOLVER_RULE_UPDATE                                                                                  failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - cannot install both pin-1-1 and pin-1-1

Could not solve for environment specs
The following packages are incompatible
└─ pin-1 is installable with the potential options
   ├─ pin-1 1, which can be installed;
   └─ pin-1 1 conflicts with any installable versions previously reported.

Pins seem to be involved in the conflict. Currently pinned specs:
 - python 3.8.* (labeled as 'pin-1')

Have you seen this before? Do you know how I could solve that?

Thanks!

How to train on custom training data? Thank you!

Thank you for your work!

I would like to ask what needs to be modified to train on our own custom dataset (e.g., a pre-training or fine-tuning dataset). From reading the code, it seems the key is the vocabulary size? Can you give me some hints?

Thanks!

Did you compare your results with those produced by REINVENT?

Thank you so much for your work. People often use RL in de novo drug design to steer the final results toward molecules with better activities. Did you compare your results with those produced by methods such as REINVENT?

In addition, you use SMILES to represent molecules. Did you try to use SELFIES or SAFE?
