
dd_protocol's People

Contributors

fgentile89, jamesgleave, jyaacoub, xllgit


dd_protocol's Issues

Question about test_set/valid_set in iteration_2-N

In the paper "Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with Deep Docking", it says that the validation and test sets are generated only during the first iteration. I take this to mean that the validation and test sets are generated once, in the first iteration, and then reused in all subsequent iterations. However, when I checked the code in "scripts_1/sampling.py", I found that it generates a new test_set and valid_set in iterations 2-N. If the test_set and valid_set are newly generated each time, I would have to dock them again. So which is correct? Are the valid_set and test_set regenerated in every iteration or not?
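For illustration, here is a minimal sketch of how sampling can be gated on the iteration number so that the validation and test sets are drawn only once and reused afterwards. This is an assumption about the intended behaviour, not the actual code of scripts_1/sampling.py, and all names and file paths are placeholders:

import numpy as np

def sample_sets(n_library, n_train, n_valid, n_test, iteration, rng=None):
    """Illustrative only: resample the training set every iteration, but
    draw valid/test indices once (iteration 1) and reload them afterwards."""
    rng = rng or np.random.default_rng(0)
    if iteration == 1:
        idx = rng.choice(n_library, n_train + n_valid + n_test, replace=False)
        train, valid, test = np.split(idx, [n_train, n_train + n_valid])
        np.save("valid_idx.npy", valid)   # persisted so later iterations reuse them
        np.save("test_idx.npy", test)     # (and their docking scores)
    else:
        valid = np.load("valid_idx.npy")
        test = np.load("test_idx.npy")
        pool = np.setdiff1d(np.arange(n_library), np.concatenate([valid, test]))
        train = rng.choice(pool, n_train, replace=False)
    return train, valid, test

Under this reading, only the newly sampled training molecules would need to be docked in iterations 2-N.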

train the model, but I get an empty all_models folder

Hi, I ran the script simple_job_models_manual.py with the following arguments:
--iteration_no 1 --morgan_directory /data3/liuyx/DD_test/morgan_lib --file_path /data3/liuyx/DD_test/project --number_of_hyp 16 --total_iterations 4 --is_last False --number_mol 8000 --percent_first_mols 20 --percent_last_mols 0.1 --recall 0.90
I want to test the pipeline, so I chose a smaller library containing a training set of 8734 molecules (number of scores in training_labels.txt), a test set of 8674 molecules (number of scores in testing_labels.txt), and a validation set of 8760 molecules (number of scores in validation_labels.txt). However, after I ran the script, the created all_models folder is empty and no hyperparameter_morgan_with_freq_v3.txt file is produced. Can you help me with this problem? Thanks a lot.
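As a quick sanity check (not part of the DD scripts; the path and file names below are the ones mentioned in this issue and may need adjusting), one can confirm how many scored molecules each label file actually contains before the hyperparameter jobs are generated:

import os

project = "/data3/liuyx/DD_test/project/iteration_1"   # assumed location of the label files
for name in ("training_labels.txt", "testing_labels.txt", "validation_labels.txt"):
    path = os.path.join(project, name)
    if not os.path.exists(path):
        print("missing:", path)
        continue
    with open(path) as fh:
        # count non-empty lines as a rough proxy for the number of scored molecules
        n = sum(1 for line in fh if line.strip())
    print(name, n)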

[Feature Request] Docker Container

Hello James and the devs,

Is there a way you could provide a working Docker container for everyone to use with a SLURM setup? It would help a lot of people.

Thanks

what happens to the best model after each iteration?

Hello! If I understand correctly, at the end of each iteration the best model is selected and used to predict the hit-likeness of all molecules in the library. Afterwards, if I use the recommended parameters reported in the article, another 24 models with different hyperparameters are trained. Why can't we simply take the best model from the first iteration and continue to train it for another N iterations? What happens to the best models after each iteration? Am I missing something?
Thank you!

file doesn't exist

t_mol = pd.read_csv(mdd+'/Mol_ct_file_%s.csv'%protein,header=None)[[0]].sum()[0]/1000000 # num of compounds in each file is mol_ct_file

Hello, I don't understand where "Mol_ct_file_%s.csv" is generated. I have checked all your scripts and I didn't find any command that creates such a file.
Should the variable t_mol just be the total number of compounds in the general database (e.g. ZINC20)? Supposing that such a database has 1 billion compounds, would t_mol then be 1000? If not, why divide by 1 million?
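For what it's worth, here is a minimal sketch of how a per-file molecule-count CSV with this layout could be produced. This is an assumption about the intended content, consistent with the read_csv line quoted above; in the released scripts it is presumably written by molecular_file_count_updated.py or an equivalent step:

import glob
import os
import pandas as pd

data_directory = "/path/to/morgan_library"   # placeholder
protein = "protein_test"                     # placeholder project name

counts = []
for path in sorted(glob.glob(os.path.join(data_directory, "smiles_all_*.txt"))):
    with open(path) as fh:
        n_mols = sum(1 for _ in fh)          # one molecule per line
    counts.append([n_mols, os.path.basename(path)])

# Column 0 holds the per-file counts, so read_csv(...)[[0]].sum()[0] gives the
# total number of molecules in the library; dividing by 1,000,000 expresses it
# in millions (a 1-billion-compound library would indeed give t_mol = 1000).
pd.DataFrame(counts).to_csv("Mol_ct_file_%s.csv" % protein, header=False, index=False)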

What to do with those 'negative' (non-hit) molecules after each iteration

Hello!

We are very interested in the deep docking workflow.
Here I have a simple question:

After each iteration, there will be a large number of molecules that are not labeled as hits.
One way to deal with these 'negative' molecules is to simply discard them; that is, such 'negative' molecules are removed from the pool from then on.
Alternatively, these 'negative' molecules, along with the 'positive' molecules that were virtual hits but were not sampled for augmentation of the training set, should be kept and used for inference in the next iteration.
Could you please take a look and let me know your opinion? I feel that the 'negative' molecules should go into the next iteration, but I am not sure.
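To make the two options concrete, a toy sketch (illustrative only, not DD code): in option A the predicted non-hits are dropped from the pool for good, while in option B the full pool is kept and the predictions only decide what gets docked now.

import numpy as np

rng = np.random.default_rng(1)
pool_ids = np.arange(1_000)              # toy library
hit_prob = rng.random(pool_ids.size)     # stand-in for model predictions
cutoff = 0.9

# Option A: predicted non-hits are removed and never re-scored later.
pool_option_a = pool_ids[hit_prob >= cutoff]

# Option B: the whole pool is kept for inference in the next iteration;
# predictions are only used to choose what to dock / add to the training set now.
to_dock_now = pool_ids[hit_prob >= cutoff]
pool_option_b = pool_ids

print(len(pool_option_a), len(to_dock_now), len(pool_option_b))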

Thank you!
BEST,

Pei

conda requirements

Hi
Can you explain how to install the dependencies for the DD protocol?
Also, I have a 3-million small-molecule library in SDF format; can I use it as a test for docking?
Thanks
Deb

ValueError: Number of processes must be at least 1

Dear all, I'm running into an issue with this error. I have tried troubleshooting a few times, but to no avail.
Can I get some pointers?

Thanks!

I used this command:
python molecular_file_count_updated.py --project_name protein_test --n_iteration 1 --data_directory /fred/oz241/xchee/[sandbox]_Deepdocking/phase_1/Morgan/library_prepared_fp --tot_process 1 --tot_sampling 3000000

But I get the following error:

molecular_file_count_updated: Parsed Args:
         molecular_file_count_updated:  - Iteration: 1
         molecular_file_count_updated:  - Data Directory: /fred/oz241/xchee/[sandbox]_Deepdocking/phase_1/Morgan/library_prepared_fp
         molecular_file_count_updated:  - Training Size: 1
         molecular_file_count_updated:  - Validation Size: 3000000
         molecular_file_count_updated: Number Of Files: 0
         molecular_file_count_updated: Minimum Value: 0
         molecular_file_count_updated: Error: Minimum value is less than 1
         molecular_file_count_updated: Reading Files...
Traceback (most recent call last):
  File "/fred/oz241/xchee/[sandbox]_Deepdocking/phase_1/molecular_file_count_updated.py", line 68, in <module>
    with closing(Pool(min_value)) as pool:
                 ^^^^^^^^^^^^^^^
  File "/home/xchee/.conda/envs/conda_vs/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xchee/.conda/envs/conda_vs/lib/python3.11/multiprocessing/pool.py", line 205, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1

However, when I list the contents of /fred/oz241/xchee/[sandbox]_Deepdocking/phase_1/Morgan/library_prepared_fp, I can see all the files:

smiles_all_00.txt smiles_all_34.txt smiles_all_68.txt
smiles_all_01.txt smiles_all_35.txt smiles_all_69.txt
smiles_all_02.txt smiles_all_36.txt smiles_all_70.txt
smiles_all_03.txt smiles_all_37.txt smiles_all_71.txt
smiles_all_04.txt smiles_all_38.txt smiles_all_72.txt
smiles_all_05.txt smiles_all_39.txt smiles_all_73.txt
smiles_all_06.txt smiles_all_40.txt smiles_all_74.txt
smiles_all_07.txt smiles_all_41.txt smiles_all_75.txt
smiles_all_08.txt smiles_all_42.txt smiles_all_76.txt
smiles_all_09.txt smiles_all_43.txt smiles_all_77.txt
smiles_all_10.txt smiles_all_44.txt smiles_all_78.txt
smiles_all_11.txt smiles_all_45.txt smiles_all_79.txt
smiles_all_12.txt smiles_all_46.txt smiles_all_80.txt
smiles_all_13.txt smiles_all_47.txt smiles_all_81.txt
smiles_all_14.txt smiles_all_48.txt smiles_all_82.txt
smiles_all_15.txt smiles_all_49.txt smiles_all_83.txt
smiles_all_16.txt smiles_all_50.txt smiles_all_84.txt
smiles_all_17.txt smiles_all_51.txt smiles_all_85.txt
smiles_all_18.txt smiles_all_52.txt smiles_all_86.txt
smiles_all_19.txt smiles_all_53.txt smiles_all_87.txt
smiles_all_20.txt smiles_all_54.txt smiles_all_88.txt
smiles_all_21.txt smiles_all_55.txt smiles_all_89.txt
smiles_all_22.txt smiles_all_56.txt smiles_all_90.txt
smiles_all_23.txt smiles_all_57.txt smiles_all_91.txt
smiles_all_24.txt smiles_all_58.txt smiles_all_92.txt
smiles_all_25.txt smiles_all_59.txt smiles_all_93.txt
smiles_all_26.txt smiles_all_60.txt smiles_all_94.txt
smiles_all_27.txt smiles_all_61.txt smiles_all_95.txt
smiles_all_28.txt smiles_all_62.txt smiles_all_96.txt
smiles_all_29.txt smiles_all_63.txt smiles_all_97.txt
smiles_all_30.txt smiles_all_64.txt smiles_all_98.txt
smiles_all_31.txt smiles_all_65.txt smiles_all_99.txt
smiles_all_32.txt smiles_all_66.txt
smiles_all_33.txt smiles_all_67.txt

An example of the file contents is:
ZINC000000000638_1,2,4,13,29,64,80,120,121,145,147,175,301,356,423,433,491,494,504,568,649,650,659,661,695,726,728,807,832,849,890,891,892,893,910,926,954,967,983,1019
ZINC000000000794_1,1,21,29,33,64,80,104,118,123,124,159,175,207,227,232,234,329,356,367,386,399,428,572,582,666,695,698,705,726,773,789,807,849,861,926,946,948,975,988
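One likely culprit (an assumption, since the failure depends on how molecular_file_count_updated.py enumerates the files) is the square brackets in the path: if the script uses glob, the "[sandbox]" segment is treated as a character class, so the pattern matches zero files, "Number Of Files" becomes 0, and Pool(0) raises the error above. A minimal reproduction and workaround:

import glob
import os

data_directory = "/fred/oz241/xchee/[sandbox]_Deepdocking/phase_1/Morgan/library_prepared_fp"

# '[sandbox]' is read as a glob character class, so this typically matches nothing:
print(len(glob.glob(os.path.join(data_directory, "*"))))

# Escaping the directory part restores the match; renaming the directory to
# avoid '[' and ']' altogether is the simpler permanent fix.
print(len(glob.glob(os.path.join(glob.escape(data_directory), "*"))))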

Inquiry Regarding Ligand Preparation in Stage IV of Deep Docking Project

Dear jamesgleave,

I hope this email finds you well. Firstly, I would like to express my sincere gratitude to the Deep Docking team for their remarkable contributions to the field of accelerating molecular docking through deep learning. Your work in this domain has been invaluable, and it is greatly appreciated.

I am writing to seek clarification on a specific aspect of the project, particularly in Stage IV: ligand preparation (DD phase 2). In this context, my inquiry revolves around the use of the OMEGA module for ligand preparation in Glide docking. Obtaining academic licenses for OpenEye's software can be challenging, and I am exploring alternative options.

Specifically, I would like to know if it is mandatory to utilize the OMEGA module for ligand preparation in Glide docking during Stage IV. Given the potential challenges associated with obtaining academic licenses for OpenEye, I am curious whether the ligprep module in Maestro can be a viable alternative for this purpose. Could you please provide some insights into the compatibility and capabilities of the ligprep module in addressing the ligand preparation requirements for Glide docking in Stage IV?

Your guidance on this matter would be immensely helpful, and I appreciate your time and assistance. Thank you once again for your groundbreaking work in advancing the field of molecular docking, and I look forward to your response.

Best regards,
Shang Xinci
Soochow University School of Pharmacy

Cannot generate model in regular training of DD Phase 4

Hello,
We used the prepared SMILES from the given link and followed the steps in the paper. When the regular training step of DD phase 4 was performed, the oversampling step seemed to run into trouble: nothing was generated in the 'all_models' folder. Below is the traceback info:

100 0.2 0.0001 2 2.0 -9.695906405812957 256 10 /data/sdc/projects/protein_test
Training size not specified, using entire dataset...
Finished parsing args...

Getting data from iteration 1
Data acquired...
Train shape: (645651, 1) Valid shape: (645654, 1) Test shape: (645716, 1)
Data Augmentation iteration 1 data shape: (645651, 1)
Training labels shape:  (645651, 1)
Using binary labels...
Converting y_pos and y_neg to dict (for faster access time)

Oversampling... size: 63800
        Num pos: 6380
        Num neg: 639271
Using morgan fingerprints...
looking through file path: /data/sdc/projects/protein_test/iteration_1/morgan/*
         train
x data from: /data/sdc/projects/protein_test/iteration_1/morgan/train_morgan_1024_updated.csv
         test
x data from: /data/sdc/projects/protein_test/iteration_1/morgan/test_morgan_1024_updated.csv
Done...
Index(['r_i_docking_score'], dtype='object')
r_i_docking_score
x data from: /data/sdc/projects/protein_test/iteration_1/morgan/test_morgan_1024_updated.csv
         valid
x data from: /data/sdc/projects/protein_test/iteration_1/morgan/valid_morgan_1024_updated.csv
Done...
Index(['r_i_docking_score'], dtype='object')
r_i_docking_score
x data from: /data/sdc/projects/protein_test/iteration_1/morgan/valid_morgan_1024_updated.csv
y validation shape: (645654, 1)
oversampled sample: ('ZINC000265314938_1', array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]]))
Done oversampling, number of missing morgan fingerprints: 0
Traceback (most recent call last):
  File "progressive_docking.py", line 399, in <module>
    del bin_valid
NameError: name 'bin_valid' is not defined
complete

Could you give some advice on how to solve this problem?

Thanks a lot
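As a possible stopgap (an assumption on my part, without having traced the surrounding code), the failing del bin_valid near line 399 of progressive_docking.py appears to run on a code path where bin_valid was never created, so guarding the cleanup should at least let training proceed; the root cause still deserves a proper fix:

# Hypothetical workaround for the cleanup around line 399 of progressive_docking.py:
# only free the array if this code path actually created it.
try:
    del bin_valid
except NameError:
    # bin_valid was never defined on this branch (e.g. when binary labels are
    # handled elsewhere during oversampling); nothing to free.
    pass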
