difflinker's Introduction

DiffLinker: Equivariant 3D-Conditional Diffusion Model for Molecular Linker Design

Official implementation of DiffLinker, an Equivariant 3D-conditional Diffusion Model for Molecular Linker Design by Ilia Igashov, Hannes Stärk, Clément Vignac, Arne Schneuing, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein and Bruno Correia.

Given a set of disconnected fragments in 3D, DiffLinker places missing atoms in between and designs a molecule incorporating all the initial fragments. Our method can link an arbitrary number of fragments, requires no information on the attachment atoms and linker size, and can be conditioned on the protein pockets.

Animations

Environment Setup

The code was tested in the following environment:

Software            Version
Python              3.10.5
CUDA                10.2.89
PyTorch             1.11.0
PyTorch Lightning   1.6.3
OpenBabel           3.0.0

You can create a new conda environment using the provided environment.yml file:

conda env create -f environment.yml

or manually create the base environment:

conda create -c conda-forge -n difflinker rdkit

and install all the necessary packages:

biopython
imageio
networkx
pytorch
pytorch-lightning
scipy
scikit-learn
tqdm
wandb

Activate the environment:

conda activate difflinker

Normally, the whole installation process takes 5-10 min.
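
To confirm that an environment matches the tested versions listed above, you can query the installed package metadata. This is a minimal sketch and not part of the repository; the distribution names `torch` and `pytorch-lightning` are assumptions about how the packages are registered in your environment.

```python
from importlib import metadata

# Versions from the tested environment above; distribution names are assumptions.
TESTED = {
    "torch": "1.11.0",
    "pytorch-lightning": "1.6.3",
}


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


def report(tested=TESTED):
    """Build one readable line per package comparing installed and tested versions."""
    lines = []
    for name, have in installed_versions(tested).items():
        want = tested[name]
        if have is None:
            status = "missing"
        elif have.split("+")[0] == want:  # ignore local build tags such as +cu102
            status = "matches tested version"
        else:
            status = f"differs from tested version {want}"
        lines.append(f"{name}: {have} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

A differing version is not necessarily a problem; the report only shows where your setup departs from the tested one.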

Models

Please find the models here or use direct download links:

Usage

Generating linkers for your own fragments

1. Without protein pocket

First, download the necessary models and create directories (we recommend using the GEOM models as they are the most generic):

mkdir -p models
wget https://zenodo.org/record/7121300/files/geom_difflinker.ckpt?download=1 -O models/geom_difflinker.ckpt
wget https://zenodo.org/record/7121300/files/geom_size_gnn.ckpt?download=1 -O models/geom_size_gnn.ckpt
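
If `wget` is not available, the same downloads can be done from Python. This is a hedged sketch using only the standard library; the helper names `zenodo_file_url` and `fetch` are hypothetical and not part of the repository. The URL pattern is copied from the commands above.

```python
import os
import urllib.request


def zenodo_file_url(record_id, filename):
    """Build a direct-download URL in the same form as the wget commands above."""
    return f"https://zenodo.org/record/{record_id}/files/{filename}?download=1"


def fetch(record_id, filename, out_dir="models"):
    """Download a file into out_dir unless it is already there; return its path."""
    os.makedirs(out_dir, exist_ok=True)
    dest = os.path.join(out_dir, filename)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(zenodo_file_url(record_id, filename), dest)
    return dest


if __name__ == "__main__":
    for name in ("geom_difflinker.ckpt", "geom_size_gnn.ckpt"):
        print(fetch(7121300, name))
```

Skipping files that already exist makes the script safe to rerun after an interrupted session, although a partially downloaded file would need to be deleted by hand.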

Generate linkers for your own fragments:

python -W ignore generate.py --fragments <YOUR_PATH> --model models/geom_difflinker.ckpt --linker_size models/geom_size_gnn.ckpt

2. With protein pocket (full atomic representation)

If you have the full target protein and want the pocket to be computed automatically based on the input fragments:

mkdir -p models
wget https://zenodo.org/records/10988017/files/pockets_difflinker_full_no_anchors_fc_pdb_excluded.ckpt?download=1 -O models/pockets_difflinker_full.ckpt
python -W ignore generate_with_protein.py --fragments <FRAGMENTS_PATH> --protein <PROTEIN_PATH> --model models/pockets_difflinker_full.ckpt --linker_size <DESIRED_LINKER_SIZE> --anchors <COMMA_SEPARATED_ANCHOR_INDICES> 

If you want to use a pocket file that you computed yourself:

mkdir -p models
wget https://zenodo.org/records/10988017/files/pockets_difflinker_full_no_anchors_fc_pdb_excluded.ckpt?download=1 -O models/pockets_difflinker_full.ckpt
python -W ignore generate_with_pocket.py --fragments <FRAGMENTS_PATH> --pocket <POCKET_PATH> --model models/pockets_difflinker_full.ckpt --linker_size <DESIRED_LINKER_SIZE> --anchors <COMMA_SEPARATED_ANCHOR_INDICES> 

3. With protein pocket (backbone representation)

mkdir -p models
wget https://zenodo.org/record/7121300/files/pockets_difflinker_backbone.ckpt?download=1 -O models/pockets_difflinker_backbone.ckpt
python -W ignore generate_with_pocket.py --fragments <FRAGMENTS_PATH> --pocket <POCKET_PATH> --backbone_atoms_only --model models/pockets_difflinker_backbone.ckpt --linker_size <DESIRED_LINKER_SIZE> --anchors <COMMA_SEPARATED_ANCHOR_INDICES>

Note:

  • Fragment file should be passed in one of the following formats: .sdf, .pdb, .mol, .mol2
  • Protein should be passed in .pdb format
  • Currently pocket-conditioned generation does not support prediction and sampling of the linker size (will be added later)
  • To obtain correct anchor indices for your fragments, you can open the file in PyMOL and click Label -> atom identifiers -> ID. You can select anchor atoms and pass the corresponding IDs to the generation script
  • For more options check help: python generate.py --help or python generate_with_pocket.py --help
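
The input rules in the notes above can be checked before launching a long generation run. The following is a minimal sketch, not part of the repository: `check_fragments_path`, `parse_anchors` and `build_command` are hypothetical helpers, and only the flags that appear in the commands above are used.

```python
import os

# Fragment formats listed in the notes above.
FRAGMENT_EXTENSIONS = {".sdf", ".pdb", ".mol", ".mol2"}


def check_fragments_path(path):
    """Raise ValueError if the file extension is not one of the listed formats."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in FRAGMENT_EXTENSIONS:
        raise ValueError(f"unsupported fragment format: {ext or '(none)'}")
    return path


def parse_anchors(text):
    """Turn a string such as '12,22' into [12, 22], rejecting malformed input."""
    ids = []
    for token in text.split(","):
        token = token.strip()
        if not token.isdecimal():
            raise ValueError(f"anchor ID must be a non-negative integer, got {token!r}")
        ids.append(int(token))
    return ids


def build_command(fragments, pocket, model, linker_size, anchors):
    """Assemble the generate_with_pocket.py call shown above as an argument list."""
    return [
        "python", "-W", "ignore", "generate_with_pocket.py",
        "--fragments", check_fragments_path(fragments),
        "--pocket", pocket,
        "--model", model,
        "--linker_size", str(linker_size),
        "--anchors", ",".join(str(i) for i in parse_anchors(anchors)),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.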

Training DiffLinker

First, download datasets:

mkdir -p datasets
wget https://zenodo.org/record/7121271/files/zinc_final_train.pt?download=1 -O datasets/zinc_final_train.pt
wget https://zenodo.org/record/7121271/files/zinc_final_val.pt?download=1 -O datasets/zinc_final_val.pt

Next, create necessary directories:

mkdir -p models
mkdir -p logs

Run training:

python -W ignore train_difflinker.py --config configs/zinc_difflinker.yml

Training Size GNN

In this example, we will consider the training and testing process on the ZINC dataset. All the instructions about downloading or creating datasets from scratch can be found in the data directory.

python -W ignore train_size_gnn.py \
                 --experiment zinc_size_gnn \
                 --data datasets \
                 --train_data_prefix zinc_final_train \
                 --val_data_prefix zinc_final_val \
                 --hidden_nf 256 \
                 --n_layers 5 \
                 --batch_size 256 \
                 --normalization batch_norm \
                 --lr 1e-3 \
                 --task classification \
                 --loss_weights \
                 --device gpu \
                 --checkpoints models \
                 --logs logs

[Figure: distributions of the number of atoms in linkers used for training the linker size prediction GNNs]

Sampling

First, download test dataset:

mkdir -p datasets
wget https://zenodo.org/record/7121271/files/zinc_final_test.pt?download=1 -O datasets/zinc_final_test.pt

Download the necessary models:

mkdir -p models
wget https://zenodo.org/record/7121300/files/zinc_difflinker.ckpt?download=1 -O models/zinc_difflinker.ckpt
wget https://zenodo.org/record/7121300/files/zinc_size_gnn.ckpt?download=1 -O models/zinc_size_gnn.ckpt

Next, create necessary directories:

mkdir -p samples
mkdir -p trajectories

If you want to sample 250 linkers for each input set of fragments, run the following:

python -W ignore sample.py \
                 --checkpoint models/zinc_difflinker.ckpt \
                 --linker_size_model models/zinc_size_gnn.ckpt \
                 --samples samples \
                 --data datasets \
                 --prefix zinc_final_test \
                 --n_samples 250 \
                 --device cuda:0

You will be able to see .xyz files of the generated molecules in the directory ./samples.
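
To inspect the generated files programmatically, a small reader is enough. This sketch assumes the files follow the common XYZ layout (atom count on the first line, a comment line, then one `element x y z` record per atom); that layout, the recursive directory search and the helper names are assumptions, not something the repository documents here.

```python
import glob
import os


def parse_xyz(text):
    """Parse XYZ-format text into a list of (element, x, y, z) tuples."""
    lines = text.splitlines()
    n_atoms = int(lines[0].strip())
    atoms = []
    # Line 0 is the atom count and line 1 is a comment, so atoms start at line 2.
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return atoms


def load_samples(directory):
    """Read every .xyz file below directory into {path: atoms}."""
    samples = {}
    pattern = os.path.join(directory, "**", "*.xyz")
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path) as handle:
            samples[path] = parse_xyz(handle.read())
    return samples


if __name__ == "__main__":
    for path, atoms in load_samples("samples").items():
        print(path, len(atoms), "atoms")
```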

If you want to sample linkers and save trajectories, run the following:

python -W ignore sample_trajectories.py \
                 --checkpoint models/zinc_difflinker.ckpt \
                 --chains trajectories \
                 --data datasets \
                 --prefix zinc_final_test \
                 --keep_frames 10 \
                 --device cuda:0

You will be able to see trajectories as .xyz, .png and .gif files in the directory ./trajectories.

Evaluation

First, you need to download ground-truth SMILES and SDF files of molecules, fragments and linkers from the relevant test sets (recomputed with OpenBabel), as well as the SMILES of the training linkers. Check this resource to find the right ones. Here, we will download files for ZINC:

mkdir -p datasets
wget https://zenodo.org/record/7121448/files/zinc_final_test_smiles.smi?download=1 -O datasets/zinc_final_test_smiles.smi
wget https://zenodo.org/record/7121448/files/zinc_final_test_molecules.sdf?download=1 -O datasets/zinc_final_test_molecules.sdf
wget https://zenodo.org/record/7121448/files/zinc_final_train_linkers.smi?download=1 -O datasets/zinc_final_train_linkers.smi 

Next, you need to run OpenBabel to reformat the data:

mkdir -p formatted
python -W ignore reformat_data_obabel.py \
                 --samples samples \
                 --dataset zinc_final_test \
                 --true_smiles_path datasets/zinc_final_test_smiles.smi \
                 --checkpoint zinc_difflinker \
                 --formatted formatted \
                 --linker_size_model_name zinc_size_gnn

Then you can run evaluation scripts:

python -W ignore compute_metrics.py \
                 ZINC \
                 formatted/zinc_difflinker/sampled_size/zinc_size_gnn/zinc_final_test.smi \
                 datasets/zinc_final_train_linkers.smi \
                 5 1 None \
                 resources/wehi_pains.csv \
                 diffusion

All the metrics will be saved in the directory ./formatted.

Reference

Igashov, I., Stärk, H., Vignac, C. et al. Equivariant 3D-conditional diffusion model for molecular linker design. Nat Mach Intell (2024). https://doi.org/10.1038/s42256-024-00815-9

@article{igashov2024equivariant,
  title={Equivariant 3D-conditional diffusion model for molecular linker design},
  author={Igashov, Ilia and St{\"a}rk, Hannes and Vignac, Cl{\'e}ment and Schneuing, Arne and Satorras, Victor Garcia and Frossard, Pascal and Welling, Max and Bronstein, Michael and Correia, Bruno},
  journal={Nature Machine Intelligence},
  pages={1--11},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

Contact

If you have any questions, please contact me at [email protected]

difflinker's People

Contributors

hannesstark, igashov

difflinker's Issues

Evaluation for reproducing the paper's result

I am currently conducting research on structure-based drug design using proteins.

I find the concept of fragment linking to be a valuable approach in drug design, and I am particularly impressed with your work's ability to condition on the protein pocket. Thank you for your hard work on it!

I have a few questions regarding your research:

  1. First, I attempted to reproduce the results from your paper using the Sampling section (https://github.com/igashov/DiffLinker#sampling). However, I noticed that the results for the ZINC and GEOM datasets differ significantly from the paper's reported results, especially concerning the SA score. While the paper's SA score is approximately 3.x, my results yielded a score of 6.x. I'm unsure why these results are different. Is there an additional step required to accurately reproduce the paper's findings?

  2. Unfortunately, as mentioned in the readme, there is no pocket linker prediction model available. Therefore, I was unable to conduct experiments with the Pocket dataset. Could you provide some suggestions on how I can reproduce the paper's results without this model? Additionally, I am curious about the linker prediction model used in the Table 5 pocket section.

  3. I came across Figure 2 in the paper, which showcases examples of linkers sampled by DiffLinker conditioned on pocket atoms. I attempted to replicate these results using the same molecule fragments from the MOAD dataset and the awesome hugging face. However, I couldn't achieve the same results as presented in the paper, even when utilizing the same protein anchors. Could you please guide me on how to accurately reproduce the results shown in Figure 2?

  4. I am interested in reproducing the results shown in Figure 4 and 5 from the paper. However, I encountered a challenge as there is no index provided in the paper, which prevents me from attempting the test. Could you kindly provide me with the necessary information about the fragments used in the Figure 4 and 5 datasets? This would be immensely helpful in my efforts to replicate the results accurately

If you require any specific information or have any additional questions to aid in reproducing my experiments, please let me know, and I will promptly provide the requested details.

Thank you for your time and consideration.

@igashov

Question: given the limitation of the pocket but without the limitation of --anchors

hi,

When using generate_with_protein.py or generate_with_pocket.py, the --anchors parameter is required. However, to account for the diversity of the generated compounds, is there an approach that conditions on the pocket but does not restrict the location of the anchors for the fragment links?

many many thanks,

Sh-Y

Larger segments as input

Hi DiffLinker Team,

Thank you for your great effort on this.
I was wondering whether the model can sample linkers for larger molecules (such as connecting two 10-mer peptides).
I tried using the existing model (I know it's not appropriate, I just wanted to check whether the model can run without error), and it raised NanError during sample_p_zs_given_zt_only_linker. Is it because the input contains too many atoms?

Thank you for your patience and help!

sascorer.py calls removed RDKit module - ModuleNotFoundError: No module named 'rdkit.six'

delinker_utils/sascorer.py is calling a module of RDKit that was removed in Release_2024.03.1
See change log: https://github.com/rdkit/rdkit/blob/master/ReleaseNotes.md#code-removed-in-this-release-1

I encountered this when trying to run one of your case studies:

$ python ./DiffLinker/generate_with_protein.py \
    --fragments 3hz1_modified_fragments_obabel.sdf \
    --protein 3hz1_protein.pdb \
    --output samples \
    --model models/pockets_difflinker_full_given_anchors.ckpt \
    --linker_size models/zinc_size_gnn.ckpt \
    --anchors 12,22 \
    --n_samples 1000 \
    --max_batch_size 16 \
    --random_seed 1
Traceback (most recent call last):
  File "[...]/DiffLinker/generate_with_protein.py", line 14, in <module>
    from src.lightning import DDPM
  File "[...]/DiffLinker/src/lightning.py", line 7, in <module>
    from src import metrics, utils, delinker
  File "[...]/DiffLinker/src/delinker.py", line 7, in <module>
    from src.delinker_utils import sascorer, calc_SC_RDKit
  File "[...]/DiffLinker/src/delinker_utils/sascorer.py", line 22, in <module>
    from rdkit.six.moves import cPickle
ModuleNotFoundError: No module named 'rdkit.six'
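
One possible workaround, offered as a sketch rather than an official patch: the failing import at line 22 of sascorer.py can fall back to the standard library `pickle` module when `rdkit.six` is unavailable, since on Python 3 `pickle` provides the same `dumps`/`loads` interface. Any other imports from `rdkit.six` in that file would need a similar replacement; that is not checked here.

```python
try:
    # Works on RDKit releases that still ship the rdkit.six shim.
    from rdkit.six.moves import cPickle
except ImportError:
    # ModuleNotFoundError is a subclass of ImportError, so this branch
    # covers both a missing rdkit.six and a missing RDKit installation.
    import pickle as cPickle

# Quick self-check: round-trip a small object through the chosen module.
payload = {"fragment_score": -1.5}
restored = cPickle.loads(cPickle.dumps(payload))
print(restored == payload)
```

Pinning RDKit to a release older than 2024.03.1 is the alternative if editing the source is not an option.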

The format of the fragment file

Your work is especially helpful! I found that the two fragments in the SDF file you provided are complete molecules with 3D structures. If I have two fragments, must they also be complete molecules with 3D structures? Also, are the fragments obtained by breaking up ligands from the original PDB protein structure?
My fragment molecules, for example: c1ccc((=O)(O)NC(C)(C)C)cc1”,c1nnn1

Calculating the "Clash Metric" as Described in Your Paper

Hi,

Thanks for your amazing work!

I am particularly interested in the metric of "steric clashes" as described in your paper

I am attempting to implement this metric in my project, but I did not find specific code in your repository for calculating this metric. Could you please provide it, or guide me on how to implement it?

Thank you for your time and consideration!

@igashov
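
A generic starting point for such a metric, not the authors' implementation: the sketch below assumes that a clash means two atoms lying closer than the sum of their van der Waals radii. The radii are Bondi values for a few common elements, and the `tolerance` parameter is a hypothetical addition that is not taken from the paper.

```python
import math

# Bondi van der Waals radii in angstroms for a few common elements.
VDW_RADII = {
    "H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52,
    "F": 1.47, "S": 1.80, "Cl": 1.75,
}


def count_clashes(ligand_atoms, pocket_atoms, tolerance=0.0):
    """Count ligand-pocket atom pairs closer than the sum of their vdW radii.

    Each atom is (element, (x, y, z)) with coordinates in angstroms.
    tolerance is subtracted from the cutoff (hypothetical parameter).
    """
    clashes = 0
    for lig_el, lig_xyz in ligand_atoms:
        for poc_el, poc_xyz in pocket_atoms:
            cutoff = VDW_RADII[lig_el] + VDW_RADII[poc_el] - tolerance
            if math.dist(lig_xyz, poc_xyz) < cutoff:
                clashes += 1
    return clashes


if __name__ == "__main__":
    ligand = [("C", (0.0, 0.0, 0.0))]
    pocket = [("O", (3.0, 0.0, 0.0)), ("N", (10.0, 0.0, 0.0))]
    print(count_clashes(ligand, pocket))
```

The double loop is quadratic in the number of atoms, which is usually acceptable for a single pocket; for large batches a vectorized distance matrix would be the natural replacement.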

AttributeError: can't set attribute

After using your weights and trying to find the linker of the fragment with the following code:

generate.main(
        input_path = "...generator_0.sdf",
        model = "models/geom_difflinker.ckpt",
        output_dir = "difflinker/output",
        n_samples = 2,
        n_steps = None,
        linker_size = "...difflinker/models/geom_size_gnn.ckpt",
        anchors = None)

I got the following error:

File "difflinker/src/egnn.py", line 134, in __init__
self.device = device
File "lib/python3.7/site-packages/torch/nn/modules/module.py", line 1317, in __setattr__
super().__setattr__(name, value)
AttributeError: can't set attribute

This happened while I was loading the weights. After commenting out the self.device = device lines in several files in src, it started working properly. Please let me know if I misunderstood something or if you have experienced the same. In case it matters, I'm working with CUDA.

cannot repeat the process of fragment generation

hi,

Thank you for providing such an interesting and powerful tool for generating linkers.

While trying to reproduce your model, I ran into an error.

Below is the command line I used to run the model om_difflinker_given_anchors.ckpt:

!python -W ignore generate_with_protein.py --fragments fragments/frags_3hz1.sdf \
--protein proteins/pro_3hz1_protein.pdb --model models/geom_difflinker_given_anchors.ckpt \
--linker_size 3 --anchors 14,11

The anchors were set as you mentioned in other issues, but I got this error:

Will generate linkers with 3 atoms
[15:28:10] Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4 5 6 7 8

Could you please provide some suggestions on how to fix it and make it work?
many thanks,

Sh-Y

Can DiffLinker do scaffold decoration?

Hi team,

I am currently using DiffLinker to design molecules where the scaffold is fixed and only the R group is modified. Is there a way DiffLinker can do this? Based on what I have seen so far, it can only link two fixed fragments, not grow from one side.

Best,
Ziyue

Can I run in CPU mode to generate linkers?

Hey, is it possible to run in CPU mode? From generate.py it looks possible, but when I tried I got this error:

python -W ignore /home/softwares/DiffLinker/generate.py --fragments frag.sdf --model models/geom_difflinker.ckpt --linker_size models/geom_size_gnn.ckpt --anchors 9,19
Will generate linkers with sampled numbers of atoms
Sampling...
  0%|                                                                                                  | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/softwares/DiffLinker/generate.py", line 187, in <module>
    main(
  File "/home/softwares/DiffLinker/generate.py", line 156, in main
    chain, node_mask = ddpm.sample_chain(data, sample_fn=sample_fn, keep_frames=1)
  File "/home/softwares/DiffLinker/src/lightning.py", line 449, in sample_chain
    chain = self.edm.sample_chain(
  File "/home/softwares/mambaforge3/envs/difflinker/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/softwares/DiffLinker/src/edm.py", line 152, in sample_chain
    z = self.sample_p_zs_given_zt_only_linker(
  File "/home/softwares/DiffLinker/src/edm.py", line 188, in sample_p_zs_given_zt_only_linker
    eps_hat = self.dynamics.forward(
  File "/home/softwares/DiffLinker/src/egnn.py", line 383, in forward
    edges = self.get_edges(n_nodes, bs)  # (2, B*N)
  File "/home/softwares/DiffLinker/src/egnn.py", line 464, in get_edges
    return self.get_edges(n_nodes, batch_size)
  File "/home/softwares/DiffLinker/src/egnn.py", line 459, in get_edges
    edges = [torch.LongTensor(rows).to(self.device), torch.LongTensor(cols).to(self.device)]
  File "/home/softwares/mambaforge3/envs/difflinker/lib/python3.10/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

About input fragment file and anchors

Hi, you did a great job connecting diffusion models with fragment-based de novo design.
I'm now working on small-molecule design and want to use this model, but after running the code I ran into some problems.

  1. According to the help instructions, if I have 3 molecular fragments, should I put them into one PDB file? What should the format of such a PDB file look like? Is there an example input file?
  2. Does the anchors argument refer to the IDs of the atoms in the PDB file that link the fragments?
    I'm looking forward to your help ~
