
DeepPocket's Introduction

DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks

DeepPocket is a 3D convolutional neural network framework for ligand binding site detection and segmentation from protein structures. This is the official open-source repository for the following paper:

Aggarwal, Rishal; Gupta, Akash; Chelur, Vineeth; Jawahar, C. V.; Priyakumar, U. Deva (2021): DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks. J. Chem. Inf. Model. 2021. Link

If you want to use this project for development, we recommend going through libmolgrid first. To use DeepPocket for predicting binding sites on an input protein, skip to the "Predicting Binding Sites" section.

Requirements

Fpocket, PyTorch, libmolgrid, Biopython and other frequently used Python packages.

To reproduce the substructure benchmark, ProDy and RDKit are also required.

Dataset Preprocessing

PDB files are first parsed to remove hetero atoms, then converted to "gninatypes" files and finally collected into a "molcache2" file. "gninatypes" and "molcache2" are binary formats that store an efficient representation of the input protein for gridding the molecule, enabling faster input and quicker CNN training with libmolgrid.
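Concretely, each atom record in a "gninatypes" file is three float32 coordinates followed by one int32 atom-type index (the 'fffi' layout written by gninatype() in types_and_gninatyper.py). A minimal sketch of reading one back, with the file name as a placeholder:

import struct

def read_gninatypes(path):
    """Yield (x, y, z, type_index) tuples from a .gninatypes file."""
    record = struct.Struct('fffi')  # matches the struct.pack('fffi', ...) used when writing
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(record.size)
            if len(chunk) < record.size:
                break
            yield record.unpack(chunk)

for x, y, z, t in read_gninatypes('protein_0.gninatypes'):
    print(f'{x:8.3f} {y:8.3f} {z:8.3f}  type {t}')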

The cavity6.mol2 files provided by scPDB (and generated by Volsite for the other datasets) are used as is; the "data_dir" argument in the training scripts has to point to the parent directory in which they are present.

".types" files contain training data points prepared, the first column is the class label, the next three columns are pocket center cordinates (x,y,z) and the final columns contain molecule files required for that datapoint. All the molecule files specified in the types files must be present in either the molcache or in the "data_dir".

Prepared types, molcache and saved model checkpoints can be downloaded here. SC6K can be downloaded from the same link (SC6K.tar.gz). COACH420 and HOLO4k are publicly available here

Edit - I now provide the ligands used in the publication for COACH420 and HOLO4k at the same OneDrive link above. The number of ligands is not the same as in the original manuscript, possibly because different versions of Open Babel disagree on whether a molecule file is valid. For COACH420 I now provide 361 molecules (compared to 359 in the manuscript), and for HOLO4k I provide 4309 (compared to the 4288 mentioned earlier).

I know this may be very cryptic, so I have written down simple steps in the last section of this README that one can use to prepare a new dataset for training.

Predicting Binding Sites

"predict.py" is a simple script that can be used for predicting binding sites from a .pdb file. It follows 6 steps namely:

  1. Hetero atom removal (clean_pdb)
  2. fpocket run
  3. Parsing fpocket output for candidate centers (get_centers)
  4. Creating gninatypes and types file for CNN input (types_and_gninatyper)
  5. Rerank types input according to CNN score (rank_pockets)
  6. Segment shape of top ranked pockets (segment_pockets)

Example usage of predict.py:

python predict.py -p protein.pdb -c first_model_fold1_best_test_auc_85001.pth.tar -s seg0_best_test_IOU_91.pth.tar -r 3

A description of each argument is given in the script.

If the input file is named protein.pdb, fpocket creates a protein_out/pockets directory. The CNN-ranked pockets are written to the bary_centers_ranked.types file in that directory, and the CNN confidence scores to the *_confidence.txt file.

If you asked for segmented pockets ("-r"), the script outputs ".dx" files that can be visualised in PyMOL. It also outputs "pocket*.pdb" files that contain the predicted binding site residues. If no binding site residues are predicted for a pocket, that particular pocket*.pdb file is not created.
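As one way to inspect these outputs, here is a hedged PyMOL sketch; the file names below are placeholders, since the actual names depend on the input protein:

from pymol import cmd

cmd.load('protein_nowat.pdb')                 # cleaned input protein
cmd.load('pocket1.pdb', 'site1')              # predicted binding-site residues
cmd.show('sticks', 'site1')
cmd.load('pocket1.dx', 'seg1_map')            # segmented pocket density
cmd.isosurface('seg1_surf', 'seg1_map', 0.5)  # surface at an assumed isolevel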

Training Classifier

We use wandb to track training performance. It's free and easy to use. If you want to avoid using wandb, simply comment out all lines that contain "wandb" in the training script.

Example usage of train.py:

python train.py -m model.py --train_types scPDB_train0.types --test_types scPDB_test0.types -i 200000 --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -r val0 -o /model_saves/val9 --base_lr 0.001 --solver Adam 

A description of each argument is given in the script.

Training Segmentation

Example usage of train_segmentation.py:

python train_segmentation.py --train_types seg_scPDB_train9.types --test_types seg_scPDB_test9.types -d data/ --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -b 8 -o model_saves/seg9 -e 200 -r seg9

A description of each argument is given in the script.

Preparing Data

The steps below pertain to preparing training data for a dataset like PDBbind, but they can easily be adapted to other datasets by making appropriate changes to file paths and file names in the scripts.

Steps for preparing training data (a batch-pipeline sketch follows the list):

  1. remove hetero atoms (clean_pdb.py)
  2. run fpocket on the structures (fpocket -f *_protein.pdb)
  3. get candidate pocket centers for all structures (get_centers.py)
  4. create .gninatypes files for all structures (gninatype() in types_and_gninatyper.py)
  5. make train and test types (make_types.py)
  6. create molcache file for training (create_molcache2.py)
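Below is a hedged sketch of batching steps 1, 2 and 4 over a PDBbind-style directory; clean_pdb(input_file, output_file) and gninatype(file) are the helpers named above, though the import paths here are assumptions, and steps 3, 5 and 6 still run as separate scripts:

import glob
import subprocess

from clean_pdb import clean_pdb              # step 1 helper (assumed import path)
from types_and_gninatyper import gninatype   # step 4 helper (assumed import path)

for pdb in glob.glob('data/PDBbind/*/*_protein.pdb'):
    nowat = pdb.replace('.pdb', '_nowat.pdb')
    clean_pdb(pdb, nowat)                                  # 1. remove hetero atoms
    subprocess.run(['fpocket', '-f', nowat], check=True)   # 2. candidate pockets
    gninatype(nowat)                                       # 4. gninatypes CNN input
# get_centers.py (3), make_types.py (5) and create_molcache2.py (6) are then run
# over the generated *_out directories and .gninatypes files.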

Example usage of create_molcache2:

python create_molcache2.py -c 4 --recmolcache scPDB_new.molcache2 -d data/scPDB/  scPDB_train0.types scPDB_test0.types

Substructure Benchmark

To reproduce our results on the substructure benchmark, run the following command:

    python subpockets_benchmark_all.py --test_types refined4414_predict.types --model_weights refined_best_test_IOU_88.pth.tar -d ./data/ --test_recmolcache refined4414.molcache2

"refined4414_predict.types" file contains fpocket candidate centers closest to ligand for protein-ligand complexes in the refined dataset. The data directory should contain clean (no water) .pdb files and ligand sdf files.

Citation

If you find this useful please cite the paper mentioned above.

@article{doi:10.1021/acs.jcim.1c00799,
  author  = {Aggarwal, Rishal and Gupta, Akash and Chelur, Vineeth and Jawahar, C. V. and Priyakumar, U. Deva},
  title   = {DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks},
  journal = {Journal of Chemical Information and Modeling},
  year    = {2021},
  doi     = {10.1021/acs.jcim.1c00799},
  note    = {PMID: 34374539},
  url     = {https://doi.org/10.1021/acs.jcim.1c00799},
  eprint  = {https://doi.org/10.1021/acs.jcim.1c00799}
}


DeepPocket's Issues

Prediction Error

Thank you for this valuable work.

When I run the command below, I get an error.

python predict.py -p 6aah.pdb -c first_model_fold1_best_test_auc_85001.pth.tar -s seg0_best_test_IOU_91.pth.tar -r 3

The fpocket files were created properly, but then the program throws an error.

Note: All libraries properly installed before running.

/content/DeepPocket/predict.py:14: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
***** POCKET HUNTING BEGINS ***** 
***** POCKET HUNTING ENDS ***** 
==============================
***** Open Babel Warning in PerceiveBondOrders
  Failed to kekulize aromatic bonds in OBMol::PerceiveBondOrders (title is /content/6aah_protein_nowat.pdb)

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/tarfile.py", line 187, in nti
    n = int(s.strip() or "0", 8)
ValueError: invalid literal for int() with base 8: '\x04ctorch.'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/tarfile.py", line 2289, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/local/lib/python3.7/tarfile.py", line 1095, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/usr/local/lib/python3.7/tarfile.py", line 1037, in frombuf
    chksum = nti(buf[148:156])
  File "/usr/local/lib/python3.7/tarfile.py", line 189, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 555, in _load
    return legacy_load(f)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 466, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "/usr/local/lib/python3.7/tarfile.py", line 1593, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/local/lib/python3.7/tarfile.py", line 1623, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/local/lib/python3.7/tarfile.py", line 1486, in __init__
    self.firstmember = self.next()
  File "/usr/local/lib/python3.7/tarfile.py", line 2301, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/DeepPocket/predict.py", line 87, in <module>
    class_checkpoint=torch.load(args.class_checkpoint)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 559, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: /content/gdrive/MyDrive/Colab Notebooks/Ainnocence/DeepPocket/first_model_fold1_best_test_auc_85001.pth.tar is a zip archive (did you mean to use torch.jit.load()?)

Channels in training script are different from those in your Supporting Information Table S1

Thanks for your great work. I have one question: in your training code, you use the first 14 channels in the gninamap file, which are

Hydrogen, PolarHydrogen, AliphaticCarbonXSHydrophobe, AliphaticCarbonXSNonHydrophobe, AromaticCarbonXSHydrophobe, AromaticCarbonXSNonHydrophobe, Nitrogen, NitrogenXSDonor, NitrogenXSDonorAcceptor, NitrogenXSAcceptor, Oxygen, OxygenXSDonor, OxygenXSDonorAcceptor and OxygenXSAcceptor

However, in Supporting Information Table S1 (https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.1c00799/suppl_file/ci1c00799_si_001.pdf), they are

AliphaticCarbonXSHydrophobe; AliphaticCarbonXSNonHydrophobe; AromaticCarbonXSHydrophobe; AromaticCarbonXSNonHydrophobe; Bromine/Iodine/Chlorine/Fluorine; Nitrogen/NitrogenXSAcceptor; NitrogenXSDonor/NitrogenXSDonorAcceptor; Oxygen/OxygenXSAcceptor; OxygenXSDonorAcceptor/OxygenXSDonor; Sulfur/SulfurAcceptor; Phosphorus; Calcium; Zinc; and GenericMetal/Boron/Manganese/Magnesium/Iron

Why are they different? Did I misunderstand something?

Transform part in Train

for b in range(batch_size):
    center = molgrid.float3(float(centers[b][0]), float(centers[b][1]), float(centers[b][2]))
    # initialise transformer for rotational augmentation
    transformer = molgrid.Transform(center, 0, True)
    # center=transformer.get_quaternion().rotate(center.x,center.y,center.z)
    # random rotation on input protein
    transformer.forward(batch[b], batch[b])
    # update input tensor with b'th datapoint of the batch
    gmaker.forward(center, batch[b].coord_sets[0], input_tensor[b])

The above code is part of train.py. Can you please explain why you use it?

Why do we need to rotate or apply other transformations to the coordinate data while training?

I am asking this question because:

transformer.get_rotation_center().x is equal to centers[b][0]
transformer.get_rotation_center().y is equal to centers[b][1]
transformer.get_rotation_center().z is equal to centers[b][2]

What is your main goal in using the transformer here?

Training Classifier Dataset

Hi @RishalAggarwal,

Firstly, thank you for this repo.

python train.py -m model.py --train_types scPDB_train0.types --test_types scPDB_test0.types -i 200000 --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -r val0 -o /model_saves/val9 --base_lr 0.001 --solver Adam

As seen above, the file scPDB_train0.types is required for training the classifier.

The sample content of the scPDB_train0 file is as follows:

0 -6.417309121621622 37.99337461018711 86.51209004677753 10mh_1/protein_0.gninatypes
0 -48.73792600326857 40.15845814418013 90.75518894134738 10mh_1/protein_0.gninatypes
0 -22.384561944279785 38.16762551867219 62.667952578541794 10mh_1/protein_0.gninatypes
0 4.418982018111255 43.43278783958602 81.18465174644241 10mh_1/protein_0.gninatypes
...

My first question is how you did the labeling (0 or 1) of whether a candidate center is a pocket according to its coordinates. Is this a public dataset? You didn't mention it in the paper either. How did you create this train file?

My second question is: if you labeled this dataset yourself, how can I do this pocket / non-pocket (0 or 1) labeling according to the coordinates for my own protein files?

Note: Neither COACH420 nor HOLO4k nor scPDB contains coordinates for non-druggable regions. How did you label entries in your scPDB_train0 file as 0 (non-druggable) or 1 (druggable)?

Could not open 'gninamap'

Hi,
Thank you for making the code public. I am getting an error in the data preprocessing stage, however. When I try to convert a .pdb file to gninatypes, I get the error "Could not open gninamap". I simply separated out the data preprocessing stage (using your code without any modifications) to create a self-contained example that shows my error.

from Bio.PDB import PDBParser, PDBIO, Select
import Bio
import os
import sys
import molgrid
import struct
import numpy as np

class NonHetSelect(Select):
    def accept_residue(self, residue):
        return 1 if Bio.PDB.Polypeptide.is_aa(residue,standard=True) else 0

def clean_pdb(input_file,output_file):
    pdb = PDBParser().get_structure("protein", input_file)
    io = PDBIO()
    io.set_structure(pdb)
    io.save(output_file, NonHetSelect())

def gninatype(file):
    # creates gninatype file for model input
    f=open(file.replace('.pdb','.types'),'w')
    f.write(file)
    f.close()
    atom_map=molgrid.FileMappedGninaTyper('gninamap')
    dataloader=molgrid.ExampleProvider(atom_map,shuffle=False,default_batch_size=1)
    train_types=file.replace('.pdb','.types')
    dataloader.populate(train_types)
    example=dataloader.next()
    coords=example.coord_sets[0].coords.tonumpy()
    types=example.coord_sets[0].type_index.tonumpy()
    types=np.int_(types)
    print(coords)
    fout=open(file.replace('.pdb','.gninatypes'),'wb')
    for i in range(coords.shape[0]):
        fout.write(struct.pack('fffi',coords[i][0],coords[i][1],coords[i][2],types[i]))
        print(struct.pack('fffi',coords[i][0],coords[i][1],coords[i][2],types[i]))
    fout.close()
    os.remove(train_types)
    return file.replace('.pdb','.gninatypes')

def create_types(file,protein):
    # create types file for model predictions
    fout=open(file.replace('.txt','.types'),'w')
    fin =open(file,'r')
    for line in fin:
        fout.write(' '.join(line.split()) + ' ' + protein +'\n')
    fin.close()
    fout.close()
    return file.replace('.txt','.types')

protein_file="/home/ubuntu/Data/1a8o.pdb"
protein_nowat_file=protein_file.replace('.pdb','_nowat.pdb')
clean_pdb(protein_file,protein_nowat_file)
protein_gninatype=gninatype(protein_nowat_file)

The code ends with the error

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_13436/2408537986.py in <module>
      2 protein_nowat_file=protein_file.replace('.pdb','_nowat.pdb')
      3 clean_pdb(protein_file,protein_nowat_file)
----> 4 protein_gninatype=gninatype(protein_nowat_file)

/tmp/ipykernel_13436/3305498276.py in gninatype(file)
      4     f.write(file)
      5     f.close()
----> 6     atom_map=molgrid.FileMappedGninaTyper('gninamap')
      7     dataloader=molgrid.ExampleProvider(atom_map,shuffle=False,default_batch_size=1)
      8     train_types=file.replace('.pdb','.types')

ValueError: Could not open gninamap

Can you please help me with this issue? Thank you.

Do not have files for running make_types.py when preparing custom data for training a new classifier

I am trying to follow your instructions to prepare data for training a new classifier.
I am stuck at the make_types step because I can't find the train.txt and test.txt files.

Moreover, I have 4 questions:

  1. If I want to add several PDB files to the available scPDB dataset, how can I do that?
  2. Your instructions for preparing data only work for a single PDB file, don't they? If not, I need to write a pipeline to wrap them up.
  3. How do I prepare the train.txt and test.txt files to run make_types.py?
  4. Could you please show me which file/folder from the previous step is needed as input to each step?

I tried this on this PDB.

Thank you very much.

predict.py running error

Hi, I have some issues when running predict.py.

It seems it can't find 'class_checkpoint' and 'seg_checkpoint'.

python predict.py -p Downloads/3g73.pdb -c first_model_fold1_best_test_auc_85001.pth.tar -s seg0_best_test_IOU_91.pth.tar -r 3

DeepPocket/predict.py:14: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/Bio/pairwise2.py:283: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
BiopythonDeprecationWarning,
/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3096.
PDBConstructionWarning,
/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3097.
PDBConstructionWarning,
/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3098.
PDBConstructionWarning,
/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3146.
PDBConstructionWarning,
/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3177.
PDBConstructionWarning,
/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 3218.
PDBConstructionWarning,
***** POCKET HUNTING BEGINS *****
mkdir: cannot create directory ‘Downloads/3g73_nowat_out/pockets’: File exists
***** POCKET HUNTING ENDS *****
Traceback (most recent call last):
  File "DeepPocket/predict.py", line 87, in <module>
    class_checkpoint=torch.load(args.class_checkpoint)
  File "/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/park/miniconda3/envs/torchdrug/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'first_model_fold1_best_test_auc_85001.pth.tar'

Please help!
Thanks

How to avoid data leakage?

In the "Data sets and Preprocessing" section of your paper, you mention that " we removed all proteins from the training set that had either sequence identity greater than 50% or ligand similarity greater than 0.9 and sequence identity greater than 30%".

  1. How do you define sequence identity and ligand similarity?
  2. Could you provide the scripts to calculate sequence identity and ligand similarity?
  3. You mention two sequence identity thresholds, greater than 50% and greater than 30%. Do you mean protein sequence identity greater than 50% and ligand sequence identity greater than 30%?

Pocket Probability


The content of bary_centers_ranked.types is as follows:

3 -5.8872820891631195 2.4274225254193103 16.306759363865474 /content/xxx_nowat.gninatypes
1 -3.399881608822576 0.7210002432695428 11.260844794031787 /content/xxx_nowat.gninatypes
2 4.8732080737995735 -7.189318852006624 6.719965229046755 /content/xxxx_nowat.gninatypes
6 -8.189112907608695 8.49208722826087 8.578613858695654 /content/xxxx_nowat.gninatypes
5 -11.534677691417235 3.978282288738218 31.699375888870517 /content/xxxx_nowat.gninatype

How do you calculate the druggability score for each pocket centre? I'm trying to understand the pocket probability calculation.

No output after running train_segmentation.py

Hi, I've been trying to run this code but keep running into issues with the segmentation script.

The first few times I ran it, I got the same error as shown in Issue #18 - #18 (comment).

I re-downloaded the "scPDB_new" file and tried running it again; now the code doesn't show an error, but it doesn't show any output either. I checked wandb as well, and it doesn't show any output (the train.py run had no issues and was represented perfectly on wandb).

Here is my code (on Google Colab):

!python /content/drive/MyDrive/DeepPocket/train_segmentation.py \
    --train_types /content/drive/MyDrive/DeepPocket/seg_scPDB_train0.types \
    --test_types /content/drive/MyDrive/DeepPocket/seg_scPDB_test0.types \
    -d /content/drive/MyDrive/DeepPocket/data/ \
    --train_recmolcache scPDB_new.molcache2 \
    --test_recmolcache scPDB_new.molcache2 \
    -b 8 \
    -o /content/drive/MyDrive/DeepPocket/model_saves/seg0 \
    -e 200 -r seg0

And here is the output (screenshot attached; it shows no training progress).

I've been trying to fix this for a week, not sure what else I could do here. Any fixes or suggestions would be appreciated.

Broken Link to Model Checkpoints, "404 FILE NOT FOUND"

The link provided in the README/documentation for downloading the prepared types, molcache and saved model checkpoints seems to be broken. Attempting to access the resources through the link results in a 404 File Not Found error.

Could you please look into this and provide an updated link, or guidance on how to access these materials?

Best regards.

rank_pockets.py - UserWarning

I'm getting a UserWarning from rank_pockets.py:

DeepPocket/rank_pockets.py:88: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
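A hedged fix, assuming the output is a (batch, classes) logits tensor, is to pass the class dimension explicitly:

import torch
import torch.nn.functional as F

output = torch.randn(8, 2)        # stand-in for the classifier's logits
probs = F.softmax(output, dim=1)  # explicit dim silences the deprecation warning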

How to prepare the inputs for training segmentation model?

Since I could not find any code related to this issue, I would like to know the details of the preprocessing.
I guess you use the protein and the binding site to create the ground-truth mask, but which files did you use? In the scPDB dataset there are many files, such as protein.mol2, site.mol2, cavity6.mol2 and ligand.mol2, so I am getting confused.

Question about classes

Hello and congrats on your repo!

As I see in the segmentation training, you have num_classes=1. Does that mean label=1 -> pocket and label=0 -> not pocket?

Data Preparation for HOLO4K

Hi, I have some questions about data preparation.

  1. In your paper, you mention that "The proteins and ligands were separated from the corresponding structure files using the Biopython library", but I can't find the corresponding code in this repo. Could you share that part of the code?

  2. A PDB file in HOLO4k may have several ligands. Do you keep all ligands or remove some? What are the criteria for choosing ligands in a PDB file?

  3. When you use fpocket to choose pocket candidates, do you run it on the original PDB file, the PDB file without ligands, or a single chain of the PDB file?

bary_centers.txt Issue

What is the 'bary_centers.txt' file? It is neither automatically created nor given as an external argument, so I'm getting an error. May I ask what this txt file is? Thank you.

segmentation fault

When predicting binding sites for a .pdb file of a protein using:
python predict.py -p pdb/1alb_A.pdb -c first_model_fold1_best_test_auc_85001.pth.tar -s seg0_best_test_IOU_91.pth.tar -r 3

I hit this bug:
'segmentation fault'
after
***** POCKET HUNTING BEGINS *****
***** POCKET HUNTING ENDS *****

I used gdb to view the core file:
'Failed to read a valid object file image from memory.
Core was generated by `python3 predict.py -p pdb/1alb_A.pdb -c first_model_fold1_best_test_auc_85001.p'.
Program terminated with signal 11, Segmentation fault.'

IndexError: list index out of range in output_pocket_pdb (segment_pockets.py)

Hello,

I'm trying to run the example from the Predicting Binding Sites section:

python predict.py -p protein.pdb -c first_model_fold1_best_test_auc_85001.pth.tar -s seg0_best_test_IOU_91.pth.tar -r 3

But it crashes with the following errors:

***** POCKET HUNTING BEGINS *****
***** POCKET HUNTING ENDS *****
/usr/local/lib/python3.7/dist-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1951.
  PDBConstructionWarning,
/usr/local/lib/python3.7/dist-packages/Bio/PDB/StructureBuilder.py:92: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2008.
  PDBConstructionWarning,
/content/DeepPocket/rank_pockets.py:87: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  all_probs.append(F.softmax(output).detach().cpu())
@> 1674 atoms and 1 coordinate set(s) were parsed in 0.01s.
Traceback (most recent call last):
  File "predict.py", line 116, in <module>
    test(seg_model, seg_eptest, seg_gmaker, device, dx_name, args)
  File "/content/DeepPocket/segment_pockets.py", line 142, in test
    output_pocket_pdb(dx_name+'_pocket'+str(count)+'.pdb', prot_prody, pred_aa)
  File "/content/DeepPocket/segment_pockets.py", line 82, in output_pocket_pdb
    pocket=prot_prody.select(sel_str)
  File "/usr/local/lib/python3.7/dist-packages/prody/atomic/atomic.py", line 232, in select
    return SELECT.select(self, selstr, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/prody/atomic/select.py", line 885, in select
    indices = self.getIndices(atoms, selstr, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/prody/atomic/select.py", line 943, in getIndices
    torf = self.getBoolArray(atoms, selstr, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/prody/atomic/select.py", line 995, in getBoolArray
    tokens = parser(selstr, parseAll=True)
  File "/usr/local/lib/python3.7/dist-packages/prody/atomic/select.py", line 1100, in _noParser
    return [self._default(selstr, 0, selstr.split())]
  File "/usr/local/lib/python3.7/dist-packages/prody/atomic/select.py", line 1118, in _default
    torf, err = self._and2(sel, loc, tokens)
  File "/usr/local/lib/python3.7/dist-packages/prody/atomic/select.py", line 1319, in _and2
    firsttoken = tokens[0] if not isinstance(tokens[0], Iterable) else list(tokens[0])
IndexError: list index out of range

Ubuntu 18.04 on Google Colab (PyTorch 1.9 + CUDA 10.2), with ProDy version 2.0.

Understanding the Train Dataset for Training Part

My question is simple, but I believe the answer will help everyone understand the paper better.

The following command needs to be run to train the classifier:

python train.py -m model.py --train_types scPDB_train0.types --test_types scPDB_test0.types -i 200000 --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -r val0 -o /model_saves/val9 --base_lr 0.001 --solver Adam

Here is an example line from the train and test files:

1 50.69633356250253 -8.818796255105756 9.213237190116068 2bel_4/protein_0.gninatypes 2bel_4/cavity6.mol2

I have two questions.

First, what does the number 1 in the first column represent?

My second question: does the last column need to be in the train and test files? (The last column is 2bel_4/cavity6.mol2.) If I delete 2bel_4/cavity6.mol2, will training still work, or do I need the mol2 files too?
Isn't the gninatypes file (2bel_4/protein_0.gninatypes) enough?

The number of pocket in types file

Hi, I read a training types file, e.g. seg_scPDB_train0.types. The first few lines are as follows:
1 -18.161927039784217 32.606813980669806 85.32244760620364 10mh_1/protein_0.gninatypes 10mh_1/cavity6.mol2
1 -11.51310276710522 28.98620689253697 91.02771812783796 10mh_1/protein_0.gninatypes 10mh_1/cavity6.mol2
1 14.198903210849663 9.972515184884662 25.079490147212237 12gs_1/protein_0.gninatypes 12gs_1/cavity6.mol2
1 6.117556524238361 -2.4037784058248697 32.47945066104617 12gs_1/protein_0.gninatypes 12gs_1/cavity6.mol2

My confusion is that for the PDB entry 10mh_1, there seems to be only one cavity in the source folder "scPDB/10mh_1/", yet there are two lines for 10mh_1 in seg_scPDB_train0.types.

Training Classifier Problem

You give the code block below as an example for training the classifier:

python train.py -m model.py --train_types scPDB_train0.types --test_types scPDB_test0.types -i 200000 --train_recmolcache scPDB_new.molcache2 --test_recmolcache scPDB_new.molcache2 -r val0 -o /model_saves/val9 --base_lr 0.001 --solver Adam

You use --data_dir (-d) in train.py as below:

eptrain = molgrid.ExampleProvider(shuffle=True, stratify_receptor=True, labelpos=0, balanced=True,
                                  data_root=args.data_dir, recmolcache=args.train_recmolcache)

Where is data_dir read from? The training classifier example you shared does not include data_dir in the code block.

lack of gninatypes files

Hi, I want to use train_segmentation.py, but I am prompted that gninatypes files are missing. Can you provide this part of the files, or should I generate them myself?
Thanks

Full test scripts to reproduce the metrics results in the paper.

Thank you for open-sourcing this great work!

I am new to this topic, and I notice there are a lot of different metrics used in the paper, such as accuracy, DCA, DCC, DVO, success rate at Top-N and Top-(N+2), and ratio.

Could you kindly provide the testing scripts that calculate these metrics on the four datasets, so the results in your paper can be reproduced?

That would be a great help for citing and comparing with your paper. Thanks in advance.
