ehoogeboom / e3_diffusion_for_molecules

License: MIT License

Python 100.00%

e3_diffusion_for_molecules's Issues

Training time for Geom Drug & Incorrect mask generation?

How long does training on the GEOM-Drugs dataset take on 8 GPUs? The first epoch alone seems to take about 4 hours.

Training time for QM9?

How long does it take to run the QM9 experiment with n_epochs=3000 and a batch size of 64? It seems like it will take multiple days. Are any of these hyperparameters wrong?

How to convert EDM results to SMILES form?

How many samples are generated in the GEOM-drugs experiment?

Hello, I am enjoying your paper, and thank you for your nice work!

I have a minor question.
How many samples are generated in the GEOM-drugs experiment (Table 4)?
Is it 10000 like the QM9 experiment?

Explicit valence error in training

Hi! I ran into a problem during training. The code works fine in epoch 0, but in epoch 1 it starts raising explicit valence errors for atoms. Two screenshots of the run are attached below. Thanks for your help!

Question related to the equivariance

Hi, nice work, and thanks for sharing the code. I notice that EDM training computes the L2 loss between the sampled $\epsilon$ (with the center of gravity subtracted) and the predicted one, i.e., $\hat{\epsilon}=\phi(z_t, t)$ from the model. The latter is actually computed as a velocity (also with the center of gravity subtracted). I can see that $\hat{\epsilon}$ is equivariant to $z_t$, but as mentioned in another work, "GeoDiff" (https://arxiv.org/abs/2203.02923), we should also want $\epsilon$ to be equivariant to $z_t$. I wonder whether you considered this alignment problem, since it is not mentioned in the paper. If I am missing something, please let me know!
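The center-of-gravity subtraction mentioned in this question can be illustrated with a minimal plain-Python sketch. The function name `remove_mean` and the toy coordinates are illustrative assumptions, not the repository's implementation. The sketch only shows that the projected noise has zero mean along each axis and is unaffected by adding the same offset to every particle; it does not settle the alignment question itself.

```python
def remove_mean(points):
    """Subtract the per-axis mean so the point set has zero center of gravity."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return [[p[d] - means[d] for d in range(dims)] for p in points]


# Toy "noise" for three particles in 3D (values chosen arbitrarily).
eps = [[0.5, -1.0, 2.0], [1.5, 0.0, -1.0], [-2.0, 4.0, 2.0]]
eps_centered = remove_mean(eps)

# Shifting every particle by the same offset leaves the projected noise unchanged.
shifted = [[v + 10.0 for v in p] for p in eps]
eps_shifted_centered = remove_mean(shifted)
```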

Problem is already solved.

Problem is solved, anyway :)
It was my mistake.
Thank you so much -!

energies for the molecules generated

Hello, how did you calculate the molecular energy? If convenient, could you provide the relevant code?

question regarding EGCL layer

Why is the normalizing factor 100 in the code instead of the distance, as in Equation 12?

Questions regarding dataset error

After following the instructions for building the dataset, I encountered this error:
Traceback (most recent call last):
File "/home/jovyan/e3_diffusion_for_molecules/eval_sample.py", line 164, in <module>
main()
File "/home/jovyan/e3_diffusion_for_molecules/eval_sample.py", line 130, in main
dataloaders, charge_scale = dataset.retrieve_dataloaders(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/e3_diffusion_for_molecules/qm9/dataset.py", line 45, in retrieve_dataloaders
split_data = build_geom_dataset.load_split_data(data_file,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/e3_diffusion_for_molecules/build_geom_dataset.py", line 107, in load_split_data
val_data, test_data, train_data = np.split(data_list, [val_index, test_index])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<array_function internals>", line 200, in split
File "/root/mambaforge/envs/my-rdkit-env/lib/python3.11/site-packages/numpy/lib/shape_base.py", line 874, in split
return array_split(ary, indices_or_sections, axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<array_function internals>", line 200, in array_split
File "/root/mambaforge/envs/my-rdkit-env/lib/python3.11/site-packages/numpy/lib/shape_base.py", line 786, in array_split
sary = _nx.swapaxes(ary, axis, 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<array_function internals>", line 200, in swapaxes
File "/root/mambaforge/envs/my-rdkit-env/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 594, in swapaxes
return _wrapfunc(a, 'swapaxes', axis1, axis2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/mambaforge/envs/my-rdkit-env/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 54, in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/mambaforge/envs/my-rdkit-env/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 43, in _wrapit
result = getattr(asarray(obj), method)(*args, **kwds)
^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6922516,) + inhomogeneous part.

Can you tell me how to fix this error? Thank you!
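One possible workaround is sketched below, under the assumption that `data_list` is a Python list of per-conformer arrays with differing numbers of atoms. Recent NumPy versions refuse to build a regular array from such ragged input, so the list can be split with plain slicing instead of `np.split`. The helper name `split_ragged` and the toy data are hypothetical.

```python
def split_ragged(data_list, val_index, test_index):
    """Split a list at two indices, mirroring np.split(data, [val_index, test_index])."""
    val_data = data_list[:val_index]
    test_data = data_list[val_index:test_index]
    train_data = data_list[test_index:]
    return val_data, test_data, train_data


# Toy ragged data: each entry stands in for one conformer with a different atom count.
data_list = [[0.0] * n for n in (3, 5, 2, 7, 4, 6)]
val_data, test_data, train_data = split_ragged(data_list, val_index=1, test_index=3)
```

Slicing never tries to stack the entries, so entries of different lengths are not a problem.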

Use e3_diffusion for sphere packs generation

Hi, do you think it's possible to transfer this work from molecules to a sphere-packing problem (so no charges or atom types)? For example, a list of spheres that should respect some physical properties according to their positions relative to each other?

Question regarding property norms calculation

Hello! I noticed that the line linked below contains identical conditions inside this if statement. I wanted to check if this will affect any of the behavior for your calculation of property norms (e.g., with certain dataset configurations).

e3_diffusion_for_molecules/qm9/utils.py

Line 7 in fce07d7

elif dataset_name == 'qm9_second_half' or dataset_name == 'qm9_second_half':

Validity and uniqueness of molecules generated by G-SchNet

Hello, how are the validity and uniqueness of the molecules sampled by G-SchNet assessed? I get high results when calling Chem.SanitizeMol(mol) on the molecules generated by G-SchNet with the bond generation method it provides.

question regarding generation

Hi, thanks for the awesome paper.

I was wondering whether I can generate a molecule by providing z_t (coordinates, atom types, etc.) instead of sampling z_t?

Question regarding the dimension of QM9 and input to property classifier

Hi,

I have a question about the maximum number of atoms used for QM9 generation. The maximum number of atoms for the QM9 dataset is 9, but when running the code, it used (29,5) as input to the property classifier. Can you let me know where the extra entries are coming from?

Also, do you transform the predicted coordinates or other outputs from EDM before feeding them into the property classifier, and if so, how?

Regards,
Yogesh
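One likely reading, offered as an assumption rather than the authors' answer: QM9 molecules have at most 9 heavy atoms but up to 29 atoms once hydrogens are counted, and 5 would match a one-hot encoding over H, C, N, O and F. The sketch below pads a single molecule to that shape; the constants and the helper `one_hot_padded` are hypothetical.

```python
ATOM_TYPES = ["H", "C", "N", "O", "F"]  # assumed atom vocabulary
MAX_ATOMS = 29  # assumed maximum atom count, hydrogens included


def one_hot_padded(symbols, max_atoms=MAX_ATOMS):
    """Return (features, mask): one-hot rows padded with zero rows up to max_atoms."""
    features = [[1 if s == a else 0 for a in ATOM_TYPES] for s in symbols]
    mask = [1] * len(symbols)
    padding = max_atoms - len(symbols)
    features += [[0] * len(ATOM_TYPES) for _ in range(padding)]
    mask += [0] * padding
    return features, mask


# Methane: one carbon and four hydrogens, padded to 29 rows of 5 entries.
features, mask = one_hot_padded(["C", "H", "H", "H", "H"])
```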

Question about the function log_constants_p_x_given_z0() in EnVariationalDiffusion

Hello, I found that the last line of this function is "return degrees_of_freedom_x * (- log_sigma_x - 0.5 * np.log(2 * np.pi))",
and you negate the value when you call the function: "neg_log_constants = -self.log_constants_p_x_given_z0(x, node_mask)".

I know that - log_sigma_x means sigma_0 / alpha_0.
So according to the formula for Z, should it be "return degrees_of_freedom_x * (- log_sigma_x + 0.5 * np.log(2 * np.pi))"?

Another question: if I replace the atom charge and one-hot features with continuous features, such as a 6-dimensional latent feature, do I need to compute the normalization constant for this latent feature?
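As a quick numerical check of the sign question: for an isotropic Gaussian in d dimensions, the log of the normalizing factor 1/Z is d * (-log(sigma) - 0.5 * log(2 * pi)), so both terms carry a minus sign. The sketch below uses only the standard library with arbitrarily chosen `sigma` and `d`; it does not reproduce the repository's function.

```python
import math


def log_gaussian_density(x, sigma):
    """Log-density of an isotropic zero-mean Gaussian with std sigma at point x."""
    d = len(x)
    squared_norm = sum(v * v for v in x)
    return -squared_norm / (2 * sigma ** 2) + d * (-math.log(sigma) - 0.5 * math.log(2 * math.pi))


sigma = 0.3  # arbitrary illustrative value
d = 6        # arbitrary number of degrees of freedom

# Log of 1/Z, written the same way as the return statement quoted in the issue.
log_inv_z = d * (-math.log(sigma) - 0.5 * math.log(2 * math.pi))

# Independent route: the 1D density at the mean is 1 / (sigma * sqrt(2 * pi)).
one_dim_pdf_at_mean = 1.0 / (sigma * math.sqrt(2 * math.pi))
log_density_at_mean = log_gaussian_density([0.0] * d, sigma)
```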

question about error during training

Hi, I customized your code a bit and am having issues with the error term (between net_out and z_t).

The issue is that in some cases the error explodes to over 50000, because the output of phi reaches values in the hundreds.
Have you had this issue?

Thank you!

Path to pretrained QM9 property classifiers is missing

Hello. Thank you for sharing this great work with us all.

I noticed, after looking at the end of your README.md, that you are offering pretrained QM9 property classifier models within this repository. However, the path you specified at the end of the README does not exist, at least in the main branch for this project.

qm9/property_prediction/outputs/exp_class_alpha_pretrained

May I ask where we might find these checkpoints?

Turning off equivariance?

Hi @ehoogeboom,
Great project! I am interested in trying it on an astronomy problem. Here, we have a point cloud of particles, but we don't care about their positions – only their properties. Therefore, I was wondering if there is a way I can turn off the equivariance module? i.e., only generate particles and their features, without worrying about their positions?
Thanks!
Miles

Question about the paper

Hi, I have been studying your nice work recently, and I have a perhaps basic question: the paper mentions that "it is impossible to have a non-zero distribution that is invariant to translations, since it cannot integrate to one". Could you please explain mathematically why it cannot integrate to one?
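A short sketch of the standard argument, stated for a density $p$ on $\mathbb{R}^n$ (the notation is chosen here and is not taken from the paper):

```latex
% Translation invariance forces the density to be constant:
p(x + t) = p(x) \quad \forall\, t \in \mathbb{R}^n
\;\;\Longrightarrow\;\; p(x) = p(0) =: c \quad \forall\, x \in \mathbb{R}^n .

% A constant cannot integrate to one over an unbounded domain:
\int_{\mathbb{R}^n} p(x)\, dx \;=\; c \int_{\mathbb{R}^n} dx \;=\;
\begin{cases}
0, & c = 0, \\
\infty, & c > 0,
\end{cases}
\qquad \text{so in neither case does the integral equal } 1 .
```

Restricting the positions to the subspace with zero center of gravity removes the translation degree of freedom, which is why the center of gravity is subtracted.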

aromatic bonds??

Hi,

May I ask what proportion of generated compounds are aromatic? I am having difficulty generating aromatic compounds.

Thanks

Request for Release of Pre-trained Model Checkpoints

Hello Authors,

I hope this message finds you well. I am writing to inquire about the possibility of releasing the checkpoints for the pre-trained models associated with this repository.

As an enthusiast and practitioner in the field, I've been greatly impressed by the work done in this project. The models you've developed have the potential to significantly aid in various applications and research. However, to fully leverage these models, having access to the pre-trained checkpoints would be immensely beneficial.

Understanding that there might be constraints or reasons for not having released them yet, I am curious to know if there are any plans to make these checkpoints available in the near future. If there are specific conditions or guidelines that must be met for their release, I would be more than willing to discuss or comply with them.

Thank you for considering this request. Your work is greatly appreciated by the community, and having access to these resources would be invaluable.

Looking forward to your response.

Best regards,
Divin

Missing argument for name of dataset

Hi! Below is a link to the line in your main_geom_drugs.py script where I believe you forgot to include the name of your selected dataset (i.e., args.dataset) as the third function argument here. I hope this helps.

e3_diffusion_for_molecules/main_geom_drugs.py

Line 181 in fce07d7

property_norms = compute_mean_mad(dataloaders, args.conditioning)
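For context, a property normalization of this kind typically stores a mean and a mean absolute deviation (MAD) per conditioning property. The sketch below is a simplified, hypothetical stand-in: the names `mean_and_mad` and `property_norms_for`, and the toy values, are assumptions, and it is not the repository's `compute_mean_mad`.

```python
def mean_and_mad(values):
    """Return the mean and the mean absolute deviation of a list of floats."""
    mean = sum(values) / len(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return mean, mad


def property_norms_for(dataset, properties):
    """Build a {property: {'mean': ..., 'mad': ...}} dictionary from toy data."""
    norms = {}
    for name in properties:
        mean, mad = mean_and_mad(dataset[name])
        norms[name] = {"mean": mean, "mad": mad}
    return norms


# Toy property values; real values would come from the training split.
toy_dataset = {"alpha": [70.0, 75.0, 80.0, 95.0]}
norms = property_norms_for(toy_dataset, ["alpha"])
```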

AssertionError: Variables not masked properly.

I get the following error message when running main_qm9.py with 8 GPU cards, but the error is gone if I use 4 cards or even 1 card. Could you please let me know how to fix it?

Entropy of n_nodes: H[N] -2.475700616836548
alphas2 [9.99990000e-01 9.99988000e-01 9.99982000e-01 ... 2.59676966e-05
1.39959211e-05 1.00039959e-05]
gamma [-11.51291546 -11.33059532 -10.92513058 ... 10.55863126 11.17673063
11.51251595]
Training using 8 GPUs
Traceback (most recent call last):
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/main_qm9.py", line 289, in <module>
main()
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/main_qm9.py", line 241, in main
train_epoch(args=args, loader=dataloaders['train'], epoch=epoch, model=model, model_dp=model_dp,
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/train_test.py", line 53, in train_epoch
nll, reg_term, mean_abs_z = losses.compute_loss_and_nll(args, model_dp, nodes_dist,
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/qm9/losses.py", line 23, in compute_loss_and_nll
nll = generative_model(x, h, node_mask, edge_mask, context)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/root/anaconda3/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
raise exception
AssertionError: Caught AssertionError in replica 4 on device 4.
Original Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/equivariant_diffusion/en_diffusion.py", line 701, in forward
loss, loss_dict = self.compute_loss(x, h, node_mask, edge_mask, context, t0_always=False)
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/equivariant_diffusion/en_diffusion.py", line 606, in compute_loss
diffusion_utils.assert_mean_zero_with_mask(z_t[:, :, :self.n_dims], node_mask)
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/equivariant_diffusion/utils.py", line 47, in assert_mean_zero_with_mask
assert_correctly_masked(x, node_mask)
File "/disk/nvme1n1/mg/e3_diffusion_for_molecules/equivariant_diffusion/utils.py", line 56, in assert_correctly_masked
assert (variable * (1 - node_mask)).abs().max().item() < 1e-4,
AssertionError: Variables not masked properly.
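The failing assertion checks that entries belonging to padded (masked-out) nodes are numerically zero. A plain-Python sketch of the same check, using nested lists instead of tensors, is given below. The helper name `is_correctly_masked` and the toy batch are hypothetical, and the sketch does not diagnose why the check fails on 8 GPUs.

```python
def is_correctly_masked(variable, node_mask, tol=1e-4):
    """Return True if every entry of a masked-out node is (almost) zero.

    variable: list of nodes, each a list of floats.
    node_mask: list with 1 for real nodes and 0 for padding.
    """
    worst = 0.0
    for row, m in zip(variable, node_mask):
        for v in row:
            worst = max(worst, abs(v * (1 - m)))
    return worst < tol


node_mask = [1, 1, 0]  # the third node is padding
good = [[0.2, -0.1, 0.4], [-0.2, 0.1, -0.4], [0.0, 0.0, 0.0]]
bad = [[0.2, -0.1, 0.4], [-0.2, 0.1, -0.4], [0.3, 0.0, 0.0]]
```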

Inquiry Regarding the Creation of the qm9_second_half Dataset

Hi,

I am currently studying your work presented in the EDM paper and am particularly interested in the qm9_second_half dataset. Could you please share insights on how this dataset was created? Understanding its creation process is crucial for my research.

Thank you for your time and assistance.

Best regards,

Divin

Does the code use Unet?

First of all, thank you so much for sharing such wonderful work. I apologize if this is a basic question: is any U-Net-like architecture used in this work?

Instructions to turn off equivariance?

Could you please share how to turn off equivariance? I would like to apply this model to my use case, where I also don't need equivariance.

Originally posted by @Ala1s in #25 (comment)

About dataset_info in datasets_config.py

Hi! Thanks for your impressive work on the molecular diffusion model! I'm wondering how to use EDM on a custom dataset. The keys n_nodes and distances in datasets_config.py confuse me. How can I obtain these items from my custom dataset? I would really appreciate it if you could help. Thanks!
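A minimal sketch for the first of the two keys, assuming that `n_nodes` is a histogram mapping the number of atoms in a molecule to the number of molecules of that size (the config file itself is not shown here, so this reading is an assumption). The helper `count_n_nodes` and the toy molecules are hypothetical, and the `distances` key is not covered.

```python
from collections import Counter


def count_n_nodes(molecules):
    """Histogram: number of atoms -> number of molecules with that many atoms."""
    counts = Counter(len(mol) for mol in molecules)
    return dict(sorted(counts.items()))


# Toy dataset: each molecule is represented only by its list of atom symbols.
molecules = [
    ["C", "H", "H", "H", "H"],
    ["O", "H", "H"],
    ["N", "H", "H", "H"],
    ["O", "H", "H"],
]
n_nodes = count_n_nodes(molecules)
```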

Trained model available?

Hi,

Thank you for this work. I am wondering if you will release the trained GEOM EDM model in the future?

eval_sample.py looking for a model file that doesn't exist

Hi, I ran into a minor issue when trying to sample molecules from the trained edm_qm9 model. In brief, I think the file outputs/edm_qm9/flow_ema.npy might be misnamed or misplaced, or left over from a time when your code used a different naming convention for output files. I'll explain in more detail below:

I ran the following command (taken from README.md) for sampling molecules:

python eval_sample.py --model_path outputs/edm_qm9 --n_samples 1

Note that I am trying to sample molecules with one of the trained models provided in this repository.

This results in the following missing file exception:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'outputs/edm_qm9/generative_model_ema.npy'
  File ".../e3_diffusion_for_molecules/eval_sample.py", line 137, in main
    flow_state_dict = torch.load(join(eval_args.model_path, fn),
  File "...e3_diffusion_for_molecules/eval_sample.py", line 164, in <module>
    main()

It seems that the arguments contained in args.pickle are causing the function eval_sample.main to look for a file in the repository: outputs/edm_qm9/generative_model_ema.npy

When I train a diffusion model myself on the qm9 dataset, using the command given in the readme, the training code produces a file named generative_model_ema.npy, and I am able to run eval_sample.py successfully when pointing it to the args/model file for the model I trained.

There seems to be a model file in the repository, outputs/edm_qm9/flow_ema.npy. Is this file perhaps misnamed, or from another experiment? I thought maybe this file needs to be changed.

If I'm correct, I figured this might be an important update to make.

PS: Congrats on putting out this awesome work and thank you for making it so accessible!
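A possible local workaround is sketched below, under the assumption that the older file is otherwise compatible, which this issue does not confirm. It looks for the expected file name first and falls back to the legacy one. The helper `find_checkpoint` is hypothetical and is not part of the repository.

```python
import os
import tempfile


def find_checkpoint(model_dir, candidates=("generative_model_ema.npy", "flow_ema.npy")):
    """Return the path of the first candidate file that exists in model_dir."""
    for name in candidates:
        path = os.path.join(model_dir, name)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"none of {candidates} found in {model_dir}")


# Demonstration in a temporary directory that only contains the legacy name.
with tempfile.TemporaryDirectory() as model_dir:
    open(os.path.join(model_dir, "flow_ema.npy"), "wb").close()
    found = os.path.basename(find_checkpoint(model_dir))
```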

About the training loss

Hi, this is great work. I have some questions about training: after how many epochs does the loss converge, and what is the final training loss?

Train on custom dataset

Thanks for sharing the code! I was wondering whether it is possible to train the diffusion model on our own graph dataset? If so, can you provide an example?

Differences between paper and code

Hi,
I have been very interested in your equivariant neural network for generating QM9-like molecules and I have been trying to reproduce your results for the unconditional generation without success. I ran the default configurations for training on QM9 and I have observed a few differences between what is said in the paper and what is implemented in the code with the default configs.

Here is what I have identified so far:

The node features given as input to phi_x at layer l are not h^l but h^{l+1}.
The gcl_equiv module of the EquivariantBlock class in egnn_new.py essentially implements the phi_x function by instantiating the EquivariantUpdate class. As shown below in the forward() method of EquivariantBlock, the gcl_equiv module receives the updated h, since h is overwritten in the loop before being passed in. I would define two variables (h_in and h_out) to distinguish them.

    def forward(self, h, x, edge_index, node_mask=None, edge_mask=None, edge_attr=None):
        # Edit Emiel: Remove velocity as input
        distances, coord_diff = coord2diff(x, edge_index, self.norm_constant)
        if self.sin_embedding is not None:
            distances = self.sin_embedding(distances)
        edge_attr = torch.cat([distances, edge_attr], dim=1)
        for i in range(0, self.n_layers):
            h, _ = self._modules["gcl_%d" % i](h, edge_index, edge_attr=edge_attr, node_mask=node_mask, edge_mask=edge_mask)
        x = self._modules["gcl_equiv"](h, x, edge_index, coord_diff, edge_attr, node_mask, edge_mask)

        # Important, the bias of the last linear might be non-zero
        if node_mask is not None:
            h = h * node_mask
        return h, x

https://github.com/ehoogeboom/e3_diffusion_for_molecules/blob/fce07d701a2d2340f3522df588832c2c0f7e044a/egnn/egnn_new.py#L134C5-L134C5

Compared to what's explained in section B of the supplementary information of the paper, phi_x has an extra tanh() function and its output is rescaled by a constant (self.coords_range). This can be seen in the coord_model method of the EquivariantUpdate class.

    def coord_model(self, h, coord, edge_index, coord_diff, edge_attr, edge_mask):
        row, col = edge_index
        input_tensor = torch.cat([h[row], h[col], edge_attr], dim=1)
        if self.tanh:
            trans = coord_diff * torch.tanh(self.coord_mlp(input_tensor)) * self.coords_range
        else:
            trans = coord_diff * self.coord_mlp(input_tensor)
        if edge_mask is not None:
            trans = trans * edge_mask
        agg = unsorted_segment_sum(trans, row, num_segments=coord.size(0),
                                   normalization_factor=self.normalization_factor,
                                   aggregation_method=self.aggregation_method)
        coord = coord + agg
        return coord

https://github.com/ehoogeboom/e3_diffusion_for_molecules/blob/fce07d701a2d2340f3522df588832c2c0f7e044a/egnn/egnn_new.py#L86C1-L99C21

The default value for self.tanh is True, which is inherited from the default --tanh argument in main_qm9.py. This can be fixed simply by setting the default to False.
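The effect of the tanh branch can be seen in a scalar sketch: the per-edge factor is squashed into the interval (-coords_range, coords_range), whereas the other branch leaves it unbounded. Here `raw` stands in for the output of `coord_mlp`, and the value 15.0 for `coords_range` is an arbitrary illustrative choice, not the repository's default.

```python
import math


def edge_translation(coord_diff, raw, coords_range, use_tanh=True):
    """Scale a coordinate difference by a bounded (tanh) or unbounded factor."""
    scale = math.tanh(raw) * coords_range if use_tanh else raw
    return [c * scale for c in coord_diff]


coord_diff = [1.0, 0.0, 0.0]
# A very large raw output is capped at coords_range when tanh is used...
bounded = edge_translation(coord_diff, raw=1000.0, coords_range=15.0)
# ...and passes through unchanged otherwise.
unbounded = edge_translation(coord_diff, raw=1000.0, coords_range=15.0, use_tanh=False)
```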

The code uses gradient clipping with a moving-averaged maximal norm, which is not mentioned at all in the paper.
This can be seen in the train_epoch function in train_test.py:

        if args.clip_grad:
            grad_norm = utils.gradient_clipping(model, gradnorm_queue)
        else:
            grad_norm = 0.

https://github.com/ehoogeboom/e3_diffusion_for_molecules/blob/fce07d701a2d2340f3522df588832c2c0f7e044a/train_test.py#L59C1-L62C27

The default value for args.clip_grad is True in main_qm9.py. This could also be fixed by setting the default to False, but I now wonder whether clipping was used to produce the results in the paper. Some clarification would be appreciated.
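For readers unfamiliar with the idea, the sketch below shows one way a clipping threshold can be derived from a running history of gradient norms. The threshold rule (a multiple of the mean plus a multiple of the standard deviation), the factors 1.5 and 2.0, and the queue length are assumptions made for illustration and may differ from what `utils.gradient_clipping` does.

```python
import statistics
from collections import deque


def clip_with_running_threshold(grad, history, mean_factor=1.5, std_factor=2.0):
    """Rescale grad if its L2 norm exceeds a threshold computed from history.

    Returns (possibly clipped grad, original norm, threshold) and appends the
    norm that was actually applied to the history.
    """
    norm = sum(g * g for g in grad) ** 0.5
    threshold = mean_factor * statistics.mean(history) + std_factor * statistics.pstdev(history)
    if norm > threshold:
        grad = [g * threshold / norm for g in grad]
        history.append(threshold)
    else:
        history.append(norm)
    return grad, norm, threshold


# Recent gradient norms were all 1.0, then a gradient of norm 50 arrives.
history = deque([1.0, 1.0, 1.0, 1.0], maxlen=50)
clipped, norm, threshold = clip_with_running_threshold([30.0, 40.0], history)
```

Because the threshold follows the recent norms, an occasional spike is cut down without a fixed clipping value having to be chosen by hand.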

Training Split for G-SchNet

Hi, was the G-SchNet model used to generate samples here trained on the same data-split? The original paper only used 50k training samples.

Why set the normalize value and bias?

Could you please give me some insight into why you implemented the normalization values and biases, and how you determined their values?

Question about the units of gap, homo and lumo in conditional generation task

Hi,

I have noticed that the units of gap, HOMO and LUMO in Table 3 are meV. Is this a typo? I think it should be eV or Ha.
Besides, the L1 losses of the three properties seem too large. Could you please check?

Thank you very much!
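For reference when comparing magnitudes across the units mentioned here: 1 Ha is approximately 27.211386 eV, and 1 eV is 1000 meV. The sketch below applies these conversions to an arbitrary example value; it says nothing about which unit the table actually uses.

```python
HARTREE_TO_EV = 27.211386  # approximate conversion factor, rounded
EV_TO_MEV = 1000.0


def hartree_to_mev(value_ha):
    """Convert an energy from hartree to milli-electronvolts."""
    return value_ha * HARTREE_TO_EV * EV_TO_MEV


def mev_to_ev(value_mev):
    """Convert an energy from milli-electronvolts to electronvolts."""
    return value_mev / EV_TO_MEV


gap_ha = 0.25  # arbitrary example value in hartree
gap_mev = hartree_to_mev(gap_ha)
gap_ev = mev_to_ev(gap_mev)
```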

Why did you remove the input coordinates in EGNN?

I am sorry to bother you. Why did you remove the input coordinates in EGNN, and why didn't you apply the same operation to the atom features or in GNN?

Question about the code

Hi,
Thank you for sharing such great work. I am a little confused about this line of the code. Can you tell me the role of delta_log_px here, and why it is multiplied by np.log(self.norm_values[0])?
I genuinely appreciate the effort you've put into this work and would be grateful for your guidance.

Thank you.

e3_diffusion_for_molecules/equivariant_diffusion/en_diffusion.py

Line 344 in fce07d7

 delta_log_px = -self.subspace_dimensionality(node_mask) * np.log(self.norm_values[0])
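A term of this form is what a change of variables produces when a density is defined on rescaled data: if x' = x / c, then log p_x(x) = log p_x'(x') - d * log(c), where d is the number of dimensions. Whether this is exactly the role of `delta_log_px` in the repository is an assumption here. The sketch checks the identity numerically for a 1D Gaussian with arbitrary values, where `c` stands in for `norm_values[0]`.

```python
import math


def log_normal_pdf(x, sigma):
    """Log-density of a zero-mean 1D Gaussian with standard deviation sigma."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


c = 4.0              # stand-in for norm_values[0] (arbitrary)
sigma_scaled = 0.7   # std of the normalized variable x' = x / c (arbitrary)
x = 1.3              # evaluation point in the original units
d = 1                # number of dimensions in this toy example

# Since x = c * x', the unnormalized variable has standard deviation c * sigma_scaled.
direct = log_normal_pdf(x, c * sigma_scaled)

# Same quantity via the normalized density plus the volume-change correction.
via_change_of_variables = log_normal_pdf(x / c, sigma_scaled) - d * math.log(c)
```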