
desert's Introduction

DESERT

Zero-Shot 3D Drug Design by Sketching and Generating (NeurIPS 2022)

P.S. Because the project is tightly coupled to ByteDance infrastructure, we cannot guarantee that it will run on your device painlessly.

Requirement

Our method is powered by an old version of ParaGen (previously named ByCha).

Install it with

cd mybycha
pip install -e .
pip install horovod
pip install lightseq

You also need to install

conda install -c "conda-forge/label/cf202003" openbabel # recommend using anaconda for this project 
pip install rdkit-pypi
pip install pybel scikit-image pebble meeko==0.1.dev1 vina pytransform3d
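
As a quick sanity check that the environment is usable, the minimal sketch below just imports the key dependencies and round-trips a SMILES string. It is illustrative only: depending on your Open Babel build, pybel may need to be imported as "import pybel" instead of "from openbabel import pybel".

# sanity_check.py -- illustrative only; adjust imports to your environment
from rdkit import Chem                      # from rdkit-pypi
from openbabel import pybel                 # Open Babel Python bindings
from vina import Vina                       # AutoDock Vina bindings
import skimage
import pebble
import meeko
import pytransform3d

# round-trip a SMILES through RDKit; should print c1ccccc1
print(Chem.MolToSmiles(Chem.MolFromSmiles('c1ccccc1')))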

Pre-training

Data Preparation

Our training data was extracted from the open molecule database ZINC. You need to download it first.
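
ZINC is distributed as tranches of SMILES files. The sketch below is illustrative only (the file name and the whitespace-separated "SMILES id" layout are assumptions; adapt it to the tranche format you actually download). It simply parses the downloaded molecules with RDKit before preprocessing:

# load_zinc.py -- illustrative only; assumes whitespace-separated "SMILES id" lines
from rdkit import Chem

mols = []
with open('zinc_tranche.smi') as fr:        # hypothetical file name for a downloaded tranche
    for line in fr:
        if not line.strip():                # skip blank lines
            continue
        smi = line.split()[0]
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:                 # skip entries RDKit cannot parse
            mols.append(mol)
print('parsed molecules:', len(mols))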

To get the fragment vocabulary

cd preparation
python get_fragment_vocab.py # fill blank paths in the file first
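
The script writes the vocabulary as a pickle (e.g. BRICS_RING_R.vocab.pkl). A minimal sketch for inspecting it, assuming it unpickles to a dict-like mapping from fragment SMILES to per-fragment records (the exact layout may differ between the output files):

# inspect_vocab.py -- illustrative only; the dict layout is an assumption
import pickle

with open('BRICS_RING_R.vocab.pkl', 'rb') as fr:    # path written by get_fragment_vocab.py
    vocab = pickle.load(fr)

print(type(vocab), len(vocab))
if isinstance(vocab, dict):
    for i, (frag_smi, record) in enumerate(vocab.items()):
        print(frag_smi, record)
        if i >= 4:                                   # only show the first few entries
            break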

To get the training data

python get_training_data.py # fill blank paths in the file first
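
The output is written as sharded pickles (the provided partial data uses 0.pkl and 1.pkl). A hedged way to check a shard: the record layout is specific to get_training_data.py, so the sketch only prints the shard size and the type of one record.

# inspect_shard.py -- illustrative only
import pickle

with open('0.pkl', 'rb') as fr:              # one training data shard
    shard = pickle.load(fr)

print('records in shard:', len(shard))
print('type of first record:', type(shard[0]))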

We also provide partial training data and vocabulary Here.

Training Shape2Mol Model

You need to fill in the blank paths in configs/training.yaml and train.sh.

bash train.sh

We also provide a trained checkpoint Here.
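
If you only want to confirm that the downloaded checkpoint (e.g. 1WW_30W_5048064.pt, the file name mentioned in the issues below) loads at all, here is a minimal PyTorch sketch. It assumes the checkpoint unpickles into a dict of tensors/metadata without needing the ParaGen classes on the import path.

# check_ckpt.py -- illustrative only
import torch

state = torch.load('1WW_30W_5048064.pt', map_location='cpu')
print(type(state))
if isinstance(state, dict):
    print('top-level keys:', list(state.keys())[:10])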

Design Molecules

Sketching

For a given protein, you first need to obtain its pocket using CAVITY.

Sample molecular shapes with

cd sketch
python sketching.py # fill blank paths in the file first
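
sketching.py writes the sampled shapes to a pickle, which is later passed to --task.data.valid.path.samples / --task.data.test.path.samples in generate.sh. A minimal check of the output, assuming the file is named sample_shapes.pkl and unpickles to a list of voxel grids (both the name and the layout are assumptions):

# check_shapes.py -- illustrative only
import pickle
import numpy as np

with open('sample_shapes.pkl', 'rb') as fr:   # output path set in sketching.py
    shapes = pickle.load(fr)

print('number of sampled shapes:', len(shapes))
first = np.asarray(shapes[0])
print('first grid shape:', first.shape, '| occupied voxels:', int((first > 0).sum()))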

Generating

bash generate.sh # fill blank paths in the file first

Citation

@inproceedings{long2022DESERT,
  title={Zero-Shot 3D Drug Design by Sketching and Generating},
  author={Long, Siyu and Zhou, Yi and Dai, Xinyu and Zhou, Hao},
  booktitle={NeurIPS},
  year={2022}
}

desert's People

Contributors

longlongman


desert's Issues

Dataset unable to download for preprocessing

I want to reproduce the results of this paper for my research work, but I am not able to download the datasets from ZINC20 or ZINC15 for the data preprocessing. The links I am using are: https://zinc20.docking.org/substances/ and https://zinc15.docking.org/substances/.
The downloads keep failing after a while. Is there a Google Drive/OneDrive link where the data is already available, or any other reference that would help? Please help me sort out this issue.

Thanks in advance.

question about the vocab file

Dear Longlongman,

Is the vocab file you provided a partial vocab file? If it is, could you provide the full vocab file? Some fragments aren't being properly decoded compared to the originals; maybe this is the reason why?

thanks

Some issues during installation and use

Hello longlongman,
I'm new to deep learning. Please bear with me for some dumb questions.

  1. Could you provide more installation details, e.g. which Python version and which pip version to use? I used Python 3.7.1, and at the last step of installation ('pip install pybel scikit-image pebble meeko==0.1.dev1 vina pytransform3d') an error message was printed:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.4.0 requires typing-extensions~=3.7.4, but you have typing-extensions 4.7.1 which is incompatible.

  2. At the pre-training stage, after I executed 'python get_training_data.py', an error occurred:
File "get_training_data.py", line 31, in <module>
    with open(vocab_path, 'rb') as fr:
IsADirectoryError: [Errno 21] Is a directory: '/home/ubuntu/user_space/DESERT/preparation/vocab'

The vocab directory contains two files produced by 'get_fragment_vocab.py': 'BRICS_RING_R.vocab.pkl' and 'BRICS_RING_R.494789.pkl'.

  3. Since I was stuck on step 2, I tried to use the training data and vocab uploaded online to train the Shape2Mol model. What's the difference between 0.pkl and 1.pkl? Do I need to use both? I'm a bit confused about how to fill out the blanks in training.yaml:
class: ShapePretrainingDatasetShard
      path: ---TRAINING DATA PATH---
      vocab_path: ---VOCAB PATH---
      sample_each_shard: 500000
      shuffle: True
    valid:
      class: ShapePretrainingDataset
      path: 
        samples: ---VALID DATA PATH---
        vocab: ---VOCAB PATH---
    test:
      class: ShapePretrainingDataset
      path: 
        samples: ---TEST DATA PATH---
        vocab: ---VOCAB PATH---

The ---TRAINING DATA PATH--- is simply the path containing 0.pkl and 1.pkl, but what about the valid and test data paths? (I just set them to the same as the training data path.)

When executing train.sh, another error occurred:
pkg_resources.DistributionNotFound: The 'typing-extensions~=3.7.4' distribution was not found and is required by tensorflow
I believe it is related to the first error during installation.
  4. For the sketching process, where should I fill in the path to the input ligand SDF in sketching.py? When sketching the pocket with sketching.py, the following error occurred:

File "sketching.py", line 2, in <module>
   from shape_utils import get_atom_stamp
 File "/home/ubuntu/user_space/DESERT/sketch/shape_utils.py", line 3, in <module>
   from common import ATOM_RADIUS, ATOMIC_NUMBER, ATOMIC_NUMBER_REVERSE
ImportError: cannot import name 'ATOM_RADIUS' from 'common' (/home/ubuntu/miniconda3/envs/DESERT/lib/python3.7/site-packages/common/__init__.py)
  5. For the generating process:
bycha-run \
   --config configs/generating.yaml \
   --lib shape_pretraining \
   --task.mode evaluate \
   --task.data.train.path data \
   --task.data.valid.path.samples ❗❗❗FILL_THIS(MOLECULE SHAPES SAMPLED FROM CAVITY)❗❗❗ \
   --task.data.test.path.samples  ❗❗❗FILL_THIS❗❗❗ \
   --task.dataloader.train.max_samples 1 \
   --task.dataloader.valid.sampler.max_samples 1 \
   --task.dataloader.test.sampler.max_samples 1 \
   --task.model.path ❗❗❗FILL_THIS❗❗❗ \
   --task.evaluator.save_hypo_dir ❗❗❗FILL_THIS❗❗❗

What should the path for task.data.test.path.samples be?
Is task.model.path the path to 1WW_30W_5048064.pt?

Sorry for bothering you.

issue with get_training_data

After running get_fragment_vocab.py I get two .pkl files.

In get_training_data.py I put in the path to BRICS_RING_R.vocab.pkl.

There's an error:

Traceback (most recent call last):

frag_idx = vocab[frag_smi][2]

TypeError: 'Mol' object is not subscriptable

fails to work on Mac (M1)

Hi!
Thank you for sharing your excellent work!
When I try to run this on my MacBook Pro (M1 Pro), it just doesn't work.
Could you please share a Dockerfile or something that can work on Mac?
thanks!

how to reduce batch size during generation

Hi, my GPU keeps running out of memory when I'm trying to generate.

horovodrun -np 8 bycha-run \
   --config configs/generating.yaml \
   --lib shape_pretraining \
   --task.mode evaluate \
   --task.data.train.path data \
   --task.data.valid.path.samples /home/kiwoong/DESERT/data/sample_shapes.pkl \
   --task.data.test.path.samples /home/kiwoong/DESERT/data/sample_shapes.pkl \
   --task.dataloader.train.max_samples 1 \
   --task.dataloader.valid.sampler.max_samples 1 \
   --task.dataloader.test.sampler.max_samples 1 \
   --task.model.path /home/kiwoong/DESERT/trainer/save_model_dir/1WW_30W_5048064.pt \
   --task.evaluator.save_hypo_dir /home/kiwoong/DESERT/trainer/save_hypo_di

This is the bash file I'm running right now.
How do you reduce the batch size? In the generation config file it seems that the only option is max_samples, which I set to 1, but it still increases steadily. Thanks.

unable to generate

Hi !
thanks for sharing your excellent work!

I am unable to do the final generation step after modifying part of your code to successfully train and get pockets. Could you please upload the latest, complete code?

Thanks!

question regarding encoder-decoder

Hi longlongman,

I've tried using your encoder-decoder on CASP ligands but wasn't successful in fully decoding (recovering) the ligands after encoding (molecules cut off, some atoms missing).

I suspect it's because the patch size for the shapepretrainingencoder is too small, or maybe because maxlen_coef is too small?

Any ideas on how to get the full molecule?

thanks

about Preprocessing of data

Hello, I have a question. When running get_fragment_vocab.py, the fragment vocab can be saved. Why do the fragments need to be re-extracted in get_training_data.py, aligned with the previously saved fragments, and finally used to get the rotation matrix? Why do that? Thank you very much.
