
masif's Introduction

MaSIF banner and concept

MaSIF- Molecular Surface Interaction Fingerprints: Geometric deep learning to decipher patterns in protein molecular surfaces.



Description

MaSIF is a proof-of-concept method to decipher patterns in protein surfaces important for specific biomolecular interactions. To achieve this, MaSIF exploits techniques from the field of geometric deep learning. First, MaSIF decomposes a surface into overlapping radial patches with a fixed geodesic radius, wherein each point is assigned an array of geometric and chemical features. MaSIF then computes a descriptor for each surface patch, a vector that encodes a description of the features present in the patch. Then, this descriptor can be processed in a set of additional layers where different interactions can be classified. The features encoded in each descriptor and the final output depend on the application-specific training data and the optimization objective, meaning that the same architecture can be repurposed for various tasks.

This repository contains a protocol to prepare protein structure files into feature-rich surfaces (with both geometric and chemical features), to decompose these into patches, and Tensorflow-based neural network code to identify patterns in them using geometric deep learning. To show the potential of the approach, we showcase three proof-of-concept applications: a) ligand prediction for protein binding pockets (MaSIF-ligand); b) protein-protein interaction (PPI) site prediction in protein surfaces, to predict which surface patches on a protein are more likely to interact with other proteins (MaSIF-site); c) ultrafast scanning of surfaces, where we use surface fingerprints from binding partners to predict the structural configuration of protein-protein complexes (MaSIF-search).

This repository should closely reproduce the experiments of:

Gainza, P., Sverrisson, F., Monti, F., Rodola, E., Boscaini, D., Bronstein, M. M., & Correia, B. E. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods 17, 184–192 (2020). https://doi.org/10.1038/s41592-019-0666-6

Note: Since Feb 2020, we have greatly simplified the installation of MaSIF by replacing all Matlab code with Python code. However, this slightly changes the results from the paper. To reproduce the results exactly as published (with the pretrained neural networks), use the repository at https://github.com/pablogainza/masif_paper.

MaSIF is distributed under an Apache License. This code is meant to serve as a tutorial, and the basis for researchers to exploit MaSIF in protein-surface learning tasks.

System and hardware requirements

MaSIF has been tested on both Linux (Red Hat Enterprise Linux Server release 7.4, with an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz processor and a 16 GB memory allotment) and Mac OS environments (macOS High Sierra, 2.8 GHz Intel Core i7 processor, 16 GB memory). To reproduce the experiments in the paper, the entire datasets for all proteins consume about 1.4 terabytes.

Currently, MaSIF takes about 2 minutes to preprocess each protein. For this reason, we recommend a distributed cluster to preprocess the data for large datasets of proteins. Once the data have been preprocessed, we strongly recommend using a GPU to train or evaluate the trained models, as it can be up to 100 times faster than a CPU.

Software prerequisites

MaSIF relies on external software/libraries to handle protein databank files and surface files, to compute chemical/geometric features and coordinates, and to perform neural network calculations. The following is the list of required libraries and programs, as well as the versions on which MaSIF was tested (in parentheses).

  • Python (3.6)
  • reduce (3.23). To add protons to proteins.
  • MSMS (2.6.1). To compute the surface of proteins.
  • BioPython (1.66). To parse PDB files.
  • PyMesh (0.1.14). To handle ply surface files, attributes, and to regularize meshes.
  • PDB2PQR (2.1.1), multivalue, and APBS (1.5). These programs are necessary to compute electrostatic charges.
  • open3D (0.5.0.0). Mainly used for RANSAC alignment.
  • Tensorflow (1.9). Used to model, train, and evaluate the actual neural networks. Models were trained and evaluated on an NVIDIA Tesla K40 GPU.
  • StrBioInfo. Used for parsing PDB files and generating the biological assembly for MaSIF-ligand.
  • Dask (2.2.0). Runs function calls on multiple threads (optional, for reproducing some benchmarks).
  • PyMOL. Optional; required by our PyMOL plugin to visualize surface files.

Alternatively, you can use the Docker version, which is the easiest to install (see Docker container).

Installation

After installing the dependencies, set the following environment variables, adjusting the paths to your installation:

export APBS_BIN=/path/to/apbs/APBS-1.5-linux64/bin/apbs
export MULTIVALUE_BIN=/path/to/apbs/APBS-1.5-linux64/share/apbs/tools/bin/multivalue
export PDB2PQR_BIN=/path/to/apbs/apbs/pdb2pqr-linux-bin64-2.1.1/pdb2pqr
export PATH=$PATH:/path/to/reduce/
export REDUCE_HET_DICT=/path/to/reduce/reduce_wwPDB_het_dict.txt
export PYMESH_PATH=/path/to/PyMesh
export MSMS_BIN=/path/to/msms/msms
export PDB2XYZRN=/path/to/msms/pdb_to_xyzrn

Clone masif to a local directory

git clone https://github.com/lpdi-epfl/masif
cd masif/

Since MaSIF is written in Python, no compilation is required.

Method overview

From a protein structure, MaSIF computes a molecular surface discretized as a mesh according to the solvent excluded surface (computed using MSMS), and assigns geometric and chemical features to every point (vertex) in the mesh. Around each vertex of the mesh, we extract a patch with geodesic radius of r=9 Å or r=12 Å. Then, MaSIF applies a geometric deep neural network to these patches. The neural network consists of one or more layers applied sequentially; a key component of the architecture is the geodesic convolution, generalizing the classical convolution to surfaces and implemented as an operation on local patches.

MaSIF conceptual framework and method

The procedure is repeated for different patch locations similarly to a sliding window operation on images, producing the surface fingerprint descriptor at each point, in the form of a vector that stores information about the surface patterns of the center point and its neighborhood. The parameter set minimizes a cost function on the training dataset, which is specific to each application that we present here.
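To make the geodesic convolution concrete, below is a minimal NumPy sketch of the operation on a single patch. This is an illustration only, not the repository's Tensorflow implementation; the Gaussian soft-binning over geodesic polar coordinates follows the description above, and all shapes and parameter values are assumptions.

import numpy as np

def geodesic_conv(rho, theta, feat, mu_rho, mu_theta, sigma_rho, sigma_theta, W):
    # rho, theta: (n_vertices,) geodesic polar coordinates within the patch.
    # feat: (n_vertices, n_feat) per-vertex features.
    # mu_rho, mu_theta: (n_bins,) learnable Gaussian bin centers.
    # W: (n_bins * n_feat, n_out) learnable weights.
    d_theta = np.mod(theta[:, None] - mu_theta[None, :] + np.pi, 2 * np.pi) - np.pi
    w = np.exp(-(rho[:, None] - mu_rho[None, :]) ** 2 / (2 * sigma_rho ** 2)
               - d_theta ** 2 / (2 * sigma_theta ** 2))  # soft bin assignments
    binned = w.T @ feat                                  # (n_bins, n_feat)
    return np.maximum(binned.reshape(-1) @ W, 0.0)       # linear map + ReLU

rng = np.random.default_rng(0)
n, n_bins, n_feat, n_out = 50, 8, 5, 16
out = geodesic_conv(
    rng.uniform(0, 9, n), rng.uniform(-np.pi, np.pi, n),  # rho in [0, 9 Å]
    rng.normal(size=(n, n_feat)),
    rng.uniform(0, 9, n_bins), rng.uniform(-np.pi, np.pi, n_bins),
    2.0, 1.0, rng.normal(size=(n_bins * n_feat, n_out)),
)
print(out.shape)  # (16,)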

MaSIF data preparation

For each application, MaSIF requires preprocessing of the data. This entails running a scripted protocol, which performs the following steps:

  1. Download the PDB.
  2. Protonate the PDB, extract the desired chains, triangulate the surface (using MSMS), and compute chemical features.
  3. Extract all patches, with features and coordinates, for each protein.

MaSIF's main speed bottlenecks lie in these three steps: computing the angular coordinates using multidimensional scaling (MDS), computing the Poisson-Boltzmann electrostatics, and regularizing the mesh after computing the MSMS surface.

Each application data directory (under masif/data/masif*) contains a script to precompute the data.

To run this protocol for a single protein (e.g., chain A of PDB id 1MBN), run:

./data_prepare_one.sh 1MBN_A_

To run it on a pair of interacting protein domains (chains A and B of PDB id 1AKJ form the first domain, and chains D and E form the second), run:

./data_prepare_one.sh 1AKJ_AB_DE
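The argument encodes PDBID_CHAINS1_CHAINS2; a trailing underscore (as in 1MBN_A_) means there is no second chain set. As a hypothetical illustration of the convention:

pdb_id, chains1, chains2 = "1AKJ_AB_DE".split("_")
print(pdb_id, list(chains1), list(chains2))  # 1AKJ ['A', 'B'] ['D', 'E']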

If you have access to a cluster (strongly recommended), this process can be run in parallel. If your cluster supports slurm files, we provide one under each application data directory, which can be run using sbatch:

sbatch data_prepare.slurm

Most of the PDBs that were used for the paper, and their corresponding surfaces (with precomputed chemical features), are available at https://doi.org/10.5281/zenodo.2625420. The unbound proteins are available in this repository under data/masif_ppi_search_ub/data_preparation/00-raw_pdbs/.

Note that the preparation of the data can consume a large amount of space for large protein databases. This is due to the fact that the preprocessing step decomposes protein surfaces into overlapping patches, which results in a large amount of duplicated data. In upcoming versions we hope to optimize this process to perform patch-decomposition operations on-the-fly, eliminating the need for large amounts of disk space.

MaSIF proof-of-concept applications

MaSIF was tested on three proof-of-concept applications. For each application we provide the trained neural network model that was used for the main experiments in the paper.


MaSIF-ligand

Change to the masif-ligand data directory.

cd data/masif_ligand/

The lists of pdb ids and chains used in the training and test sets are located, in numpy format, under:

data/masif_ligand/lists/test_pdbs_sequence.npy
data/masif_ligand/lists/train_pdbs_sequence.npy
data/masif_ligand/lists/val_pdbs_sequence.npy

Each of these files can be read using the numpy.load function.
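For example (a minimal sketch, run from the repository root; entries may be stored as byte strings):

import numpy as np

test_pdbs = np.load("data/masif_ligand/lists/test_pdbs_sequence.npy")
print(len(test_pdbs), test_pdbs[:3])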

Precompute the datasets (see MaSIF data preparation), ideally using slurm:

sbatch prepare_data.slurm

Be sure you have enough disk space, about 400GB.

Once the data has been precomputed, MaSIF-ligand requires the generation of Tensorflow TFRecords for training. For this, either run slurm or execute the command present in the make_tfrecord.slurm file:

sbatch make_tfrecord.slurm

Once the tfrecords have been precomputed, the training for the network can start, where we strongly recommend a GPU (run the commands in the slurm file one by one if you do not have slurm):

sbatch train_model.slurm

To evaluate the neural network run:

sbatch evaluate_test.slurm

The output of the evaluation is placed under the data/masif_ligand/test_set_predictions/ directory, with two numpy files per input protein databank structure, e.g.:

5LXM_AD_labels.npy
5LXM_AD_logits.npy

where the labels file contains the ground truth, and the logits file contains the prediction logits.
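A minimal sketch of inspecting one such prediction pair (which axis indexes the ligand classes is an assumption here):

import numpy as np

labels = np.load("data/masif_ligand/test_set_predictions/5LXM_AD_labels.npy")
logits = np.load("data/masif_ligand/test_set_predictions/5LXM_AD_logits.npy")
pred = np.argmax(logits, axis=-1)  # predicted ligand class (assumed last axis)
print(labels.shape, logits.shape)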

MaSIF-site

Change to the masif-site data directory.

cd data/masif_site/

The lists of pdb ids and chains used in the training and test sets are located under:

data/masif_site/data/lists/full_list.txt
data/masif_site/data/lists/training.txt
data/masif_site/data/lists/testing.txt

Precompute the datasets (see MaSIF data preparation), ideally using slurm:

sbatch prepare_data.slurm

Be sure you have enough disk space, about 400GB.

Once the data has been precomputed, the training for the network can start:

./train_nn.sh

For the experiments in the paper we trained MaSIF-site for 40 hours.

Once a network has been trained, specific proteins can be evaluated. For example to evaluate the selected subset of transient interactions:

./predict_site.sh

The predictions for each vertex in each protein are stored in the directory data/masif_site/output/all_feat_3l/pred_data/. The surfaces of the predicted sites can be colored according to the site prediction:

./color_site.sh

and saved to a ply file, under the directory: data/masif_site/output/all_feat_3l/pred_surfaces/

These surfaces can then be visualized using our PyMOL plugin.
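The raw per-vertex scores can also be read directly; a hedged sketch, where the file name pattern inside pred_data/ is an assumption:

import numpy as np

scores = np.load("data/masif_site/output/all_feat_3l/pred_data/pred_4ZQK_A.npy")
iface = np.where(scores.squeeze() > 0.7)[0]  # illustrative cutoff
print(len(iface), "predicted interface vertices")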

A Jupyter notebook with code to compare the predictions on the transient interactions of this test set to the program SPPIDER can be found at:

masif/comparison/masif_site/masif_vs_sppider/masif_sppider_comp.ipynb

MaSIF-search

Change to the masif-search data directory.

cd data/masif_ppi_search/

The lists of pdb ids and chains used in the training and test sets are located under:

data/masif_ppi_search/data/lists/full_list.txt
data/masif_ppi_search/data/lists/training.txt
data/masif_ppi_search/data/lists/testing.txt

Precompute the datasets (see MaSIF data preparation), ideally using slurm:

sbatch prepare_data.slurm

Be sure you have enough disk space, about 400GB.

For speed reasons, the actual data that will be used by the neural network is cached in a separate directory. This data consists of the pairs of patches that pass a shape complementarity threshold and an equal number of random patches. This process is run by executing:

./cache_nn.sh nn_models.sc05.custom_params
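A rough sketch of the idea behind the cache (the file layout and the 0.5 threshold implied by "sc05" are assumptions):

import numpy as np

sc_labels = np.load("p1_sc_labels.npy", allow_pickle=True)
median_sc = np.median(np.nan_to_num(sc_labels[0]), axis=1)  # per-patch shape complementarity
pos_idx = np.where(median_sc > 0.5)[0]  # positives: patches passing the threshold
neg_idx = np.random.choice(len(median_sc), size=len(pos_idx), replace=False)  # random negatives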

Once the data has been cached, the training for the network can start:

./train.sh nn_models.sc05.custom_params

For the paper we trained for about 40 hours. The neural network model is saved in the nn_models/sc05/all_feat/model_data directory whenever the validation ROC AUC improves over the previously saved model's validation ROC AUC.

Once the neural network has been trained and saved, descriptors for specific proteins can be computed using the command:

./compute_descriptors.sh lists/testing.txt

These descriptors are saved under the descriptors/ directory.
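In the first stage of MaSIF-search, these descriptors are compared by Euclidean distance: the "flipped" descriptor of a target patch is matched against the "straight" descriptors of candidate binder patches. A minimal sketch (the paths are illustrative):

import numpy as np

target_desc = np.load("descriptors/sc05/all_feat/4ZQK_A_B/p1_desc_flipped.npy")
source_desc = np.load("descriptors/sc05/all_feat/4ZQK_A_B/p2_desc_straight.npy")
center = 0  # index of the target patch of interest
dists = np.linalg.norm(source_desc - target_desc[center], axis=1)
print(np.argsort(dists)[:10])  # ten best-matching candidate patches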

To evaluate the second-stage RANSAC protocol, go to the masif/comparison/masif_ppi_search directory:

cd $masif_root/comparison/masif_ppi_search/masif_descriptors/
./second_stage.sh

To reproduce the large PD-L1:PD1 benchmark presented in the paper:

cd data/masif_ppi_search/pdl1_benchmark
./run_benchmark.sh

PyMOL plugin

A PyMOL plugin to visualize protein surfaces is provided in the source/pymol subdirectory. We used this plugin for all the structural figures shown in our paper. It requires PyMOL to be installed on your local computer.

Please see the following tutorial on how to install it:

Pymol plugin installation

To load a protein surface file, run this command inside PyMOL:

loadply ABCD_E.ply

Example: MaSIF PyMOL plugin example

Docker container

The easiest way to test MaSIF is through a Docker container. Please see our tutorial on reproducing the paper results here:

Docker container

License

MaSIF is released under an Apache v2.0 license.

Reference

If you use this code, please use the bibtex entry in citation.bib

masif's People

Contributors

dcoukos, freyrs, pablogainza


masif's Issues

Use PDB structures from modelling software like ITASSER

Hi!!!

I wanted to know whether it's possible to use the output PDB files from software like I-TASSER (or any other structural modeller) as an input for MaSIF. My proteins of interest have low-resolution structures in the PDB, and the crystal structures cover only partial sequences.
Hence, I want to create pdb files based on the complete sequences and then use MaSIF to predict the PPI interface. Could you please help me out with this? Thanks in advance for any help.

Regards,

Anupam

Masif_pymol_plugin error

Hi,

I am able to install the masif_pymol_plugin into PyMOL, but when I try to load the plugin, PyMOL gives me this error:
Unable to initialize plugin 'masif_pymol_plugin' (pmg_tk.startup.masif_pymol_plugin).

I tried writing "import pmg_tk.startup.masif_pymol_plugin" in the pymol command line as well but it gives me an error again:
Traceback (most recent call last):
File "/Applications/PyMOL.app/Contents/lib/python3.7/site-packages/pmg_tk/startup/masif_pymol_plugin/init.py", line 5, in
from loadPLY import *
ModuleNotFoundError: No module named 'loadPLY'

I also looked in the masif_pymol_plugin folder to make sure there was a loadPLY file and there was.
Thank you!

Error handling: missing atoms when calculating hydrogen-bonding potential

Sometimes I've found that when looking for hydrogen bond acceptors, the code will break if the acceptor atom is there but coordinates are missing for its bonded, neighbouring atom. I.e. in "triangulation.computeCharges" line 82, res[acceptorAngleAtom[atom_name]].get_coord() will throw a KeyError exception. I found this on PDB entries 2avn and 1vdn.

My suggested fix was already implemented for acceptorPlaneAtom in the same function:

try:
    a = res[acceptorAngleAtom[atom_name]].get_coord()
except KeyError:
    return 0.0

Cannot download large PDB structures

Bio.PDB.PDBList() disallows the downloading of structures >62 chains or >99999 ATOM lines using the 'pdb' (.ent) format. Attempting this gives a "Desired structure doesn't exist" error.

There are a couple of other file_format options for which this is allowed, but it's not completely clear how to utilize these formats in the downstream data_preparation steps. It would be very helpful to be able to use one of these other formats in the MaSIF pipeline.
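A possible workaround, untested against the rest of the MaSIF pipeline, is to fetch large entries as mmCIF with Biopython and parse them with MMCIFParser instead of PDBParser (the PDB id here is only a placeholder):

from Bio.PDB import PDBList, MMCIFParser

pdbl = PDBList()
path = pdbl.retrieve_pdb_file("1abc", pdir=".", file_format="mmCif")  # placeholder id
struct = MMCIFParser(QUIET=True).get_structure("1abc", path)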

how to visualize the output files?

The pdl1_benchmark_nn.py script produces several .pdb files and .vert files. I am just wondering how to use these files, say, to visualize the docking site in PyMOL? Thank you.

I tried to use the docker image & the "fastest and easiest way to reproduce" protocol.

2022-07-18 09:09:30.240557: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
['/masif/source//masif_ppi_search/second_stage_alignment_nn.py', '../../../data/masif_ppi_search', '100', '2000', '1000', 'masif']
Loading patch coordinates for 2I32_A_E
Loading patch coordinates for 2P47_A_B
Loading patch coordinates for 1BRS_A_D
Loading patch coordinates for 3P8B_C_D
Loading patch coordinates for 2Y32_B_D
Loading patch coordinates for 3M85_B_E
Loading patch coordinates for 2J12_A_B
Loading patch coordinates for 2ZXW_O_U
Loading patch coordinates for 3TND_B_D
Loading patch coordinates for 3F74_A_B
Loading patch coordinates for 1JZO_A_B
Loading patch coordinates for 2P45_A_B
Loading patch coordinates for 3IBM_A_B
Loading patch coordinates for 3CDW_A_H
Loading patch coordinates for 3S8V_A_X
Loading patch coordinates for 4KGG_C_A
Loading patch coordinates for 1TQ9_A_B
Loading patch coordinates for 1NPO_A_C
Loading patch coordinates for 3OGF_A_B
Loading patch coordinates for 1Z0K_A_C
Loading patch coordinates for 3Q9U_A_C
Loading patch coordinates for 1XUA_A_B
Loading patch coordinates for 3AXY_B_D
Loading patch coordinates for 2QLC_C_B
Loading patch coordinates for 3QWQ_A_B
Loading patch coordinates for 2O8Q_A_B
Loading patch coordinates for 2JI1_C_D
Loading patch coordinates for 1HBT_I_H
Loading patch coordinates for 3QWN_I_J
Loading patch coordinates for 2Z0P_C_D
Loading patch coordinates for 1XDT_T_R
Loading patch coordinates for 1ID5_H_L
Loading patch coordinates for 1PXV_A_C
Loading patch coordinates for 1I4O_B_D
Loading patch coordinates for 2Z7F_E_I
Loading patch coordinates for 2FE8_A_C
Loading patch coordinates for 1LQM_E_F
Loading patch coordinates for 2Z29_A_B
Loading patch coordinates for 3P71_C_T
Loading patch coordinates for 4CJ0_A_B
Loading patch coordinates for 4TQ1_A_B
Loading patch coordinates for 2WQ4_A_C
Loading patch coordinates for 2LBU_E_D
Loading patch coordinates for 3S9C_A_B
Loading patch coordinates for 1AVX_A_B
Loading patch coordinates for 1A2K_C_AB
Loading patch coordinates for 2WAM_A_C
Loading patch coordinates for 3SGB_E_I
Loading patch coordinates for 3B5U_J_L
Loading patch coordinates for 1YLQ_A_B
Loading patch coordinates for 1YY9_A_D
Loading patch coordinates for 2B3Z_C_D
Loading patch coordinates for 3HN6_B_D
Loading patch coordinates for 1T0F_A_B
Loading patch coordinates for 3PGA_1_4
Loading patch coordinates for 2AQX_A_B
Loading patch coordinates for 1SOT_A_C
Loading patch coordinates for 1SHY_A_B
Loading patch coordinates for 3EYD_C_D
Loading patch coordinates for 1UUG_A_B
Loading patch coordinates for 3KZH_A_B
Loading patch coordinates for 2HEK_A_B
Loading patch coordinates for 4YDJ_HL_G
Loading patch coordinates for 3HCG_A_C
Loading patch coordinates for 3K3C_A_B
Loading patch coordinates for 1JKG_A_B
Loading patch coordinates for 5GPG_A_B
Loading patch coordinates for 4AG2_A_C
Loading patch coordinates for 3SLH_A_B
Loading patch coordinates for 3ISM_A_B
Loading patch coordinates for 3KMT_A_B
Loading patch coordinates for 1XPJ_A_D
Loading patch coordinates for 1UGH_E_I
Loading patch coordinates for 1I07_A_B
Loading patch coordinates for 3CEW_C_D
Loading patch coordinates for 2HDP_A_B
Loading patch coordinates for 2G2W_A_B
Loading patch coordinates for 3WN7_A_B
Loading patch coordinates for 3Q0Y_C_B
Loading patch coordinates for 3CG8_C_B
Loading patch coordinates for 1Q5H_A_B
Loading patch coordinates for 2B42_B_A
Loading patch coordinates for 2YZJ_A_C
Loading patch coordinates for 3ECY_A_B
Loading patch coordinates for 3HRD_E_H
Loading patch coordinates for 1ZR0_A_B
Loading patch coordinates for 3E2U_A_E
Loading patch coordinates for 1ERN_A_B
Loading patch coordinates for 1O9Y_A_D
Loading patch coordinates for 3RDZ_A_C
Loading patch coordinates for 1ZVN_A_B
Loading patch coordinates for 3CHW_A_P
Loading patch coordinates for 4M5F_A_B
Loading patch coordinates for 3Q87_A_B
Loading patch coordinates for 3BTV_A_B
Loading patch coordinates for 3FJS_C_D
Loading patch coordinates for 2GKW_A_B
Loading patch coordinates for 2GD4_C_B
Loading patch coordinates for 2A2L_C_B
Loading patch coordinates for 1XT9_A_B
Docking all binders on target: 2I32_A_E 
Traceback (most recent call last):
  File "/masif/source//masif_ppi_search/second_stage_alignment_nn.py", line 198, in <module>
    target_coord = subsample_patch_coords(target_pdb, "p1", precomp_dir_9A, center_point)
  File "/masif/source/masif_ppi_search/alignment_utils_masif_search.py", line 305, in subsample_patch_coords
    for iii, v in enumerate(cv):
TypeError: 'numpy.int64' object is not iterable

What are the roles of cv and center_point?
If the subsample_patch_coords function were working correctly, center_point should be a one-dimensional list; instead, it is an int64.
I don't know how to run the "fastest and easiest" way.

By the way, the docker image ships a newer version of open3D (0.9). Please fix your code.

Problems when extracting chains from PDB files

Hi,

I was using "input_output.extractPDB" to extract PDB chains for my work and two problems popped up. Essentially the code can end up missing a few residues depending on the following:

1:

class NotDisordered(Select):
    def accept_atom(self, atom):
        return not atom.is_disordered() or atom.get_altloc() == "A" 

According to the comment on this class, it is supposed to exclude disordered atoms. However, this class is actually used to save disordered atoms. The disorder appears to be a result of an error in the validation of the X-ray-derived structure (link), whereby two different configurations are given for a set of atoms. What this code is supposed to do is choose one of those two configurations.

Typically the two configurations are labelled 'A' and 'B', however I have noticed that they can be labelled as '1' or '2'. When they are labelled '1' or '2', the class returns False -- the disordered residues are ignored. For my own work, I have used the following function to ensure that the residues are not ignored, even if they are labelled '1' or '2'.

class NotDisordered(Select):
    def accept_atom(self, atom):
        return not atom.is_disordered() or atom.get_altloc() == "A" or atom.get_altloc() == "1" 

2:

Modified / non-canonical amino acids are designated HETATM in the PDB, even though they are clearly amino acids. I added a function which reads the names of modified / non-canonical amino acids from the PDB SEQRES section, and notes the non-standard codes.

from Bio.SeqUtils import IUPACData
PROTEIN_LETTERS = [x.upper() for x in IUPACData.protein_letters_3to1.keys()]

def find_modified_amino_acids(path):
    res_set = set()
    for line in open(path, 'r'):
        if line[:6] == 'SEQRES':
            for res in line.split()[4:]:
                res_set.add(res)
    for res in list(res_set):
        if res in PROTEIN_LETTERS:
            res_set.remove(res)
    return res_set

def extractPDB(
    infilename, outfilename, chain_ids=None, includeWaters=False, invert=False
):
    # extract the chain_ids from infilename and save in outfilename. 
    # includeWaters: deprecated parameter, include the crystallographic waters (should not be used). 
    # invert: Select all chains EXCEPT those in chain_ids.
    parser = PDBParser(QUIET=True)
    struct = parser.get_structure(infilename, infilename)
    model = Selection.unfold_entities(struct, "M")[0]
    chains = Selection.unfold_entities(struct, "C")

    # Select residues to extract and build new structure
    structBuild = StructureBuilder.StructureBuilder()
    structBuild.init_structure("output")
    structBuild.init_seg(" ")
    structBuild.init_model(0)
    outputStruct = structBuild.get_structure()


    # Load a list of non-standard amino acid names -- these are
    # typically listed under HETATM, so they would be typically
    # ignored by the orginal algorithm
    modified_amino_acids = find_modified_amino_acids(infilename)

    for chain in model:
        if (
            chain_ids == None
            or (chain.get_id() in chain_ids and not invert)
            or invert == True
        ):
            structBuild.init_chain(chain.get_id())
            for residue in chain:
                het = residue.get_id()
                if not invert:
                    if het[0] == " " or (het[0] == "W" and includeWaters):
                        outputStruct[0][chain.get_id()].add(residue)
                    elif het[0][-3:] in modified_amino_acids:
                        print(het[0])
                        outputStruct[0][chain.get_id()].add(residue)
                else:
                    if (het[0] == "W" and includeWaters) or (
                        chain.get_id() not in chain_ids
                    ):
                        outputStruct[0][chain.get_id()].add(residue)
                    elif het[0][-3:] in modified_amino_acids:
                        outputStruct[0][chain.get_id()].add(residue)


    # Output the selected residues
    pdbio = PDBIO()
    pdbio.set_structure(outputStruct)
    pdbio.save(outfilename, select=NotDisordered())

StrBioInfo problem

Hi
I have installed StrBioInfo 0.9a0.dev1.

However, when I execute the line "struct_assembly = struct.apply_biomolecule_matrices()[0]" in 00b-generate_assembly.py, it shows the error:
AttributeError: 'PDBFrame' object has no attribute 'apply_biomolecule_matrices'

Will this package get updated?

Unexpected Errors second stage alignment

In the MaSIF-search process, after computing the descriptors, I am unable to run second_stage_alignment_nn.py successfully; the pretrained model doesn't load. Can you provide any hint, or the correct pretrained .hdf5 file? Thanks, hoping for a positive response.

subprocess.py error while running data_prepare_one

Hi,
I have tried to run data_prepare_one, but I get the following error and the program stops inside ipdb.

MLC02GC4Z3Q05P:masif_site 4464689$ ./data_prepare_one.sh 1AKJ_AB_DE
:/Users/4464689/Downloads/masif/source/
Structure exists: '/var/folders/mn/7xx5f6314c1glph1lkt4k_9m002gyq/T/pdb1akj.ent' 
--Call--
>    /opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py(875)__del__()
   873             self.wait()
   874 
--> 875     def __del__(self, _maxsize=sys.maxsize, _warn=warnings.warn):
    876         if not self._child_created:
    877             # We didn't get to successfully create a child process.

ipdb>

I appreciate any input,
Aleks

multivalue in apbs

I can't find the binary referenced by: export MULTIVALUE_BIN=/path/to/apbs/APBS-1.5-linux64/share/apbs/tools/bin/multivalue

So I did:
export MULTIVALUE_BIN=/home/ubuntu/miniconda3/envs/masif/share/apbs/tools/mesh/multivalue.c

But multivalue.c is C source code, not a compiled binary, so when I run it I get:
OSError: [Errno 8] Exec format error

MaSIF-search second-stage issue

Hi there,
I have some questions about the MaSIF-search second stage: I can't figure out where the code that generates the complexes, as described in the paper, is. It would be great if anyone could give me a clue.

The complex surface is generated in a confusing way.

When computing the iface for parts of the protein complex, which interact with each other, the complex surface should only contain the chains of interest.

For example, Complex_id1 has 5 chains.

We are looking at the interaction between the id1_AB and id1_CD parts; however, the surface for Complex_id1 still contains a fifth chain (E). This way we will wrongly label some parts of the id1_AB and id1_CD surfaces as interfaces because they are being covered by an extra chain.

I am assuming we only want to look at interfaces between the parts we know that interact, so I am assuming that using the full complex surface is not the right way to label the interfaces.

docker run masif_ligand/data_prepare_one.sh meets error OSError: File not found: data_preparation/01-benchmark_surfaces//4ZQK_A.ply

I tried the script in docker, ./masif_ligand/data_prepare_one.sh 4ZQK_A,
and got the following error:
Downloading PDB structure '4ZQK'...
Traceback (most recent call last):
File "/masif/source//data_preparation/00b-generate_assembly.py", line 3, in
from SBI.structure import PDB
ImportError: No module named SBI.structure
Traceback (most recent call last):
File "/masif/source//data_preparation/00c-save_ligand_coords.py", line 2, in
import numpy as np
ImportError: No module named numpy
Traceback (most recent call last):
File "/masif/source//data_preparation/01-pdb_extract_and_triangulate.py", line 48, in
extractPDB(pdb_filename, out_filename1+".pdb", chain_ids1)
File "/masif/source/input_output/extractPDB.py", line 21, in extractPDB
model = Selection.unfold_entities(struct, "M")[0]
IndexError: list index out of range
4ZQK_A
Reading data from input ply surface files.
Traceback (most recent call last):
File "/masif/source//data_preparation/04-masif_precompute.py", line 74, in
input_feat[pid], rho[pid], theta[pid], mask[pid], neigh_indices[pid], iface_labels[pid], verts[pid] = read_data_from_surface(ply_file[pid], params)
File "/masif/source/masif_modules/read_data_from_surface.py", line 23, in read_data_from_surface
mesh = pymesh.load_mesh(ply_fn)
File "/usr/local/lib/python3.6/site-packages/pymesh/meshio.py", line 21, in load_mesh
raise IOError("File not found: {}".format(filename));
OSError: File not found: data_preparation/01-benchmark_surfaces//4ZQK_A.ply

Can anyone tell me where I went wrong? Thanks!

Meaning of mask in precomputed numpy arrays

Hi,

I am exploring the numpy arrays produced by the precomputation script:
$masif_source/data_preparation/04-masif_precompute.py masif_ppi_search

Could you please explain the meaning of p1_mask.npy and p2_mask.npy?
I understand that the mask applies to rho and theta, but I don't understand its meaning. In which cases is the value zero?
Why do we skip some of the neighbors in the patch?

Thank you so much for all your effort!

Question regarding PyMOL plugin

I tried to install the provided plugin into the software (PyMOL), but I got the following error message.
[screenshot]
It says that the plugin is installed but the initialization fails. The above image was taken on a Windows machine,
and the same situation happens on my Ubuntu machine.
Does this mean that the plugin you provided only works on a Mac machine?

How to install reduce(3.23)

Hi,
Thanks for your great work. I was trying to install the dependencies, but I am not sure how to install reduce (3.23) on a Linux system.
Could you show me how?

How can I dock two protein monomers with MaSIF-search?

I want to dock two proteins.
I just have two monomers, not a homomer or heteromer.

my questions of masif-search pipeline

  • prepare data for ppi search: when I run data_prepare with PDBID_CHAIN, I don't get the p1_sc_labels.npy file, but when I run it with PDBID_CHAIN1_CHAIN2, I do, so I copied them over:
cd /masif/data/masif_ppi_search/data_preparation/04b-precomputation_12A/precomputation
cp 1A2K_C_A/p1_sc_labels.npy  1A2K_C/p1_sc_labels.npy
cp 1A2K_C_A/p2_sc_labels.npy  1A2K_A/p1_sc_labels.npy
cp 5JYL_A_B/p1_sc_labels.npy 5JYL_A/p1_sc_labels.npy
cp 5JYL_A_B/p2_sc_labels.npy 5JYL_B/p1_sc_labels.npy

but I don't know whether this is right. Is it?

  • transform from RegistrationResult and PointCloud to pdb format: I get the multidock result, but I don't know how to transform it to pdb format. Is there a tool for this transformation?
  • I am a novice at docking and MaSIF-search; if this pipeline has any other errors, please point them out and make suggestions. I don't know whether I am doing it right.

pipeline

run masif-site

Firstly, I ran masif-site with ./data_prepare_one.sh 2MWS_B, and then ./predict_site.sh 2MWS_B to predict sites, and I got four folders:

00-raw_pdbs  01-benchmark_pdbs  01-benchmark_surfaces  04a-precomputation_9A

It ran very well.

prepare data for ppi search

Next, I ran the ppi search: cd ../masif_ppi_search. It needs the format PDBID_CHAIN1_CHAIN2, but I just have two monomers, not a homomer or heteromer, so the command /masif/data/masif_ppi_search/data_prepare_one.sh 3AXY_B doesn't work.
[image]

So I downloaded and extracted the pdbs manually, and moved them into data_preparation/00-raw_pdbs/:

root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/00-raw_pdbs/
1A2K.pdb  5JYL.pdb
root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/01-benchmark_pdbs/
1A2K_A.pdb  1A2K_C.pdb  5JYL_A.pdb  5JYL_B.pdb

Next I run:

masif_root=$(git rev-parse --show-toplevel)
masif_source=$masif_root/source/
export PYTHONPATH=$PYTHONPATH:$masif_source
PDB_ID='1A2K'
CHAIN1='C'
CHAIN2='A'
# Load your environment here.
python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py $PDB_ID\_$CHAIN1
python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py $PDB_ID\_$CHAIN2
python $masif_source/data_preparation/04-masif_precompute.py masif_site $PDB_ID\_$CHAIN1
python $masif_source/data_preparation/04-masif_precompute.py masif_site $PDB_ID\_$CHAIN2
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search $PDB_ID\_$CHAIN1
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search $PDB_ID\_$CHAIN2

and get the output:

root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/04a-precomputation_9A/precomputation/1A2K_A/
p1_X.npy  p1_Z.npy             p1_input_feat.npy    p1_mask.npy            p1_theta_wrt_center.npy
p1_Y.npy  p1_iface_labels.npy  p1_list_indices.npy  p1_rho_wrt_center.npy
root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/04a-precomputation_9A/precomputation/1A2K_C
p1_X.npy  p1_Z.npy             p1_input_feat.npy    p1_mask.npy            p1_theta_wrt_center.npy
p1_Y.npy  p1_iface_labels.npy  p1_list_indices.npy  p1_rho_wrt_center.npy

and then compute the descriptors:

./compute_descriptors.sh $PDB_ID\_$CHAIN1
./compute_descriptors.sh $PDB_ID\_$CHAIN2

and get the output

root@eb92233498e0:/masif/data/masif_ppi_search# ls descriptors/sc05/all_feat/1A2K_C
p1_desc_flipped.npy  p1_desc_straight.npy
root@eb92233498e0:/masif/data/masif_ppi_search# ls descriptors/sc05/all_feat/1A2K_A
p1_desc_flipped.npy  p1_desc_straight.npy

It doesn't raise any exception, just some warnings.
For the other protein, I do the same:

python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py 5JYL_A
python $masif_source/data_preparation/04-masif_precompute.py masif_site 5JYL_A
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search 5JYL_A
./compute_descriptors.sh 5JYL_A

python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py 5JYL_B
python $masif_source/data_preparation/04-masif_precompute.py masif_site 5JYL_B
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search 5JYL_B
./compute_descriptors.sh 5JYL_B

nohup sh ./data_prepare_one.sh  5JYL_A_B &
nohup sh ./data_prepare_one.sh 1A2K_C_A &
nohup sh ./compute_descriptors.sh 5JYL_A_B &
nohup sh ./compute_descriptors.sh 1A2K_C_A &

Looking at the file /masif/source/masif_ppi_search/second_stage_alignment_nn.py,
it needs some variables to be set:

masif_opts = {}
masif_opts["pdb_chain_dir"] = "data_preparation/01-benchmark_pdbs/"  # per-protein chain pdbs
masif_opts["ply_chain_dir"] = "data_preparation/01-benchmark_surfaces/"  # protein surfaces, computed by masif-site
masif_opts["ppi_search"]={}
masif_opts["ppi_search"][
    "masif_precomputation_dir"
] = "data_preparation/04b-precomputation_12A/precomputation/"
masif_opts["ppi_search"]["desc_dir"] = "descriptors/sc05/all_feat/"  # computed by compute_descriptors.sh
masif_opts["ppi_search"]["gif_descriptors_out"] = "gif_descriptors/"  # only needed for the GIF method, not masif; an empty directory
masif_opts["site"]={}
masif_opts["site"][
    "masif_precomputation_dir"
] = "data_preparation/04a-precomputation_9A/precomputation/"

These are global variables.
Then I define my protein list:

root@eb92233498e0:/masif/data/masif_ppi_search# for i in $(ls data_preparation/01-benchmark_pdbs);do echo ${i/.pdb/}; done > lists/mylist.txt
root@eb92233498e0:/masif/data/masif_ppi_search# cat lists/mylist.txt
1A2K_A
1A2K_C
5JYL_A
5JYL_B
import os
import numpy as np
# Location of surface (ply) files. 
data_dir='/masif/data/masif_ppi_search'
surf_dir = os.path.join(data_dir, masif_opts["ply_chain_dir"])
desc_dir = os.path.join(data_dir, masif_opts["ppi_search"]["desc_dir"])
pdb_dir = os.path.join(data_dir, masif_opts["pdb_chain_dir"])
precomp_dir = os.path.join(
    data_dir, masif_opts["ppi_search"]["masif_precomputation_dir"]
)
precomp_dir_9A = os.path.join(
    data_dir, masif_opts["site"]["masif_precomputation_dir"]
)
benchmark_list=os.path.join(data_dir, 'lists','mylist.txt')
pdb_list = open(benchmark_list).readlines()[0:100]
pdb_list = [x.rstrip() for x in pdb_list]
# Read all surfaces.
all_pc = []
all_desc = []

rand_list = np.copy(pdb_list)
#np.random.seed(0)
np.random.shuffle(rand_list)
rand_list = rand_list[0:100]

p2_descriptors_straight = []
p2_point_clouds = []
p2_patch_coords = []
p2_names = []

The per-chain directories lack the file p1_sc_labels.npy:

root@eb92233498e0:/masif/source/masif_ppi_search# ls /masif/data/masif_ppi_search/data_preparation/04b-precomputation_12A/precomputation/5JYL_B/
p1_X.npy  p1_Z.npy             p1_input_feat.npy    p1_mask.npy            p1_theta_wrt_center.npy
p1_Y.npy  p1_iface_labels.npy  p1_list_indices.npy  p1_rho_wrt_center.npy
root@eb92233498e0:/masif/source/masif_ppi_search# ls /masif/data/masif_ppi_search/data_preparation/04b-precomputation_12A/precomputation/5JYL_A_B/
p1_X.npy             p1_input_feat.npy      p1_sc_labels.npy         p2_Z.npy             p2_mask.npy
p1_Y.npy             p1_list_indices.npy    p1_theta_wrt_center.npy  p2_iface_labels.npy  p2_rho_wrt_center.npy
p1_Z.npy             p1_mask.npy            p2_X.npy                 p2_input_feat.npy    p2_sc_labels.npy
p1_iface_labels.npy  p1_rho_wrt_center.npy  p2_Y.npy                 p2_list_indices.npy  p2_theta_wrt_center.npy

I copied the files over, but I don't know whether this is right:

cd /masif/data/masif_ppi_search/data_preparation/04b-precomputation_12A/precomputation
cp 1A2K_C_A/p1_sc_labels.npy  1A2K_C/p1_sc_labels.npy
cp 1A2K_C_A/p2_sc_labels.npy  1A2K_A/p1_sc_labels.npy
cp 5JYL_A_B/p1_sc_labels.npy 5JYL_A/p1_sc_labels.npy
cp 5JYL_A_B/p2_sc_labels.npy 5JYL_B/p1_sc_labels.npy

move the model into the working directory

cd /masif/source/masif_ppi_search/
cp -r /masif/comparison/masif_ppi_search/masif_descriptors_nn/models .

run docking by masif-search

cd /masif/source/masif_ppi_search;
Create a new python script: touch docking.py

import scipy.sparse as spio
import copy
from Bio.PDB import *
from scipy.spatial import cKDTree
from transformation_training_data.score_nn import ScoreNN
from alignment_utils_masif_search import compute_nn_score, rand_rotation_matrix, \
        get_center_and_random_rotate, get_patch_geo, multidock, test_alignments, \
       subsample_patch_coords
import time
import sklearn.metrics
masif_opts = {}
masif_opts["pdb_chain_dir"] = "data_preparation/01-benchmark_pdbs/"  # per-protein chain pdbs
masif_opts["ply_chain_dir"] = "data_preparation/01-benchmark_surfaces/"  # protein surfaces, computed by masif-site
masif_opts["ppi_search"]={}
masif_opts["ppi_search"][
    "masif_precomputation_dir"
] = "data_preparation/04b-precomputation_12A/precomputation/"
masif_opts["ppi_search"]["desc_dir"] = "descriptors/sc05/all_feat/"  # computed by compute_descriptors.sh
masif_opts["ppi_search"]["gif_descriptors_out"] = "gif_descriptors/"  # only needed for the GIF method, not masif; an empty directory
masif_opts["site"]={}
masif_opts["site"][
    "masif_precomputation_dir"
] = "data_preparation/04a-precomputation_9A/precomputation/"
nn_model = ScoreNN()
import os
import numpy as np
data_dir='/masif/data/masif_ppi_search'
surf_dir = os.path.join(data_dir, masif_opts["ply_chain_dir"])
desc_dir = os.path.join(data_dir, masif_opts["ppi_search"]["desc_dir"])
pdb_dir = os.path.join(data_dir, masif_opts["pdb_chain_dir"])
precomp_dir = os.path.join(
    data_dir, masif_opts["ppi_search"]["masif_precomputation_dir"]
)
precomp_dir_9A = os.path.join(
    data_dir, masif_opts["site"]["masif_precomputation_dir"]
)
benchmark_list=os.path.join(data_dir, 'lists','mylist.txt')
pdb_list = open(benchmark_list).readlines()[0:100]
pdb_list = [x.rstrip() for x in pdb_list]
# Read all surfaces.
all_pc = []
all_desc = []

rand_list = np.copy(pdb_list)
#np.random.seed(0)
np.random.shuffle(rand_list)
rand_list = rand_list[0:100]

p2_descriptors_straight = []
p2_point_clouds = []
p2_patch_coords = []
p2_names = []

from geometry.open3d_import import *
for i, pdb in enumerate(rand_list):
    print("Loading patch coordinates for {}".format(pdb))
    pdb_id = pdb.split("_")[0]
    chains = pdb.split("_")[1]
    # Descriptors for global matching.
    p2_descriptors_straight.append(
        np.load(os.path.join(desc_dir, pdb, "p1_desc_straight.npy"))
    )
    p2_point_clouds.append(
        read_point_cloud(
            os.path.join(surf_dir, "{}.ply".format(pdb_id + "_" + chains))
        )
    )
    pc = subsample_patch_coords(pdb, "p1", precomp_dir_9A)
    p2_patch_coords.append(pc)
    p2_names.append(pdb)

all_positive_scores = []
all_positive_rmsd = []
all_negative_scores = []
# Match all descriptors.
count_found = 0
all_rankings_desc = []


# Now go through each target (p1 in every case) and dock each 'decoy' binder to it. 
# The target will have flipped (inverted) descriptors.
K=30
ransac_iter=100
ttf=[]
for target_ix, target_pdb in enumerate(rand_list):
    target_pdb_id = target_pdb.split("_")[0]
    chains = target_pdb.split("_")[1]
    # Load target descriptors for global matching.
    target_desc = np.load(os.path.join(desc_dir, target_pdb, "p1_desc_flipped.npy"))
    # Load target point cloud
    target_pc = os.path.join(surf_dir, "{}.ply".format(target_pdb_id + "_" + chains))
    target_pcd = read_point_cloud(target_pc)
    # Read the point with the highest shape compl.
    sc_labels = np.load(os.path.join(precomp_dir, target_pdb, "p1_sc_labels.npy"))
    center_point = np.argmax(np.median(np.nan_to_num(sc_labels[0]), axis=1))
    # Go through each source descriptor, find the top descriptors, store id+pdb
    num_negs = 0
    all_desc_dists = []
    all_pdb_id = []
    all_vix = []
    gt_dists = []
    # This is where the desriptors are actually compared (stage 1 of the MaSIF-search protocol)
    for source_ix, source_pdb in enumerate(rand_list):
        source_desc = p2_descriptors_straight[source_ix]
        desc_dists = np.linalg.norm(source_desc - target_desc[center_point], axis=1)
        all_desc_dists.append(desc_dists)
        all_pdb_id.append([source_pdb] * len(desc_dists))
        all_vix.append(np.arange(len(desc_dists)))
        if source_pdb == target_pdb:
            source_pcd = p2_point_clouds[source_ix]
            eucl_dists = np.linalg.norm(
                np.asarray(source_pcd.points)
                - np.asarray(target_pcd.points)[center_point, :],
                axis=1,
            )
            eucl_closest = np.argsort(eucl_dists)
            gt_dists = desc_dists[eucl_closest[0:50]]
            gt_count = len(source_desc)
    all_desc_dists = np.concatenate(all_desc_dists, axis=0)
    all_pdb_id = np.concatenate(all_pdb_id, axis=0)
    all_vix = np.concatenate(all_vix, axis=0)
    ranking = np.argsort(all_desc_dists)
    # Load target geodesic distances.
    target_coord = subsample_patch_coords(target_pdb, "p1", precomp_dir_9A, [center_point])
    # Get the geodesic patch and descriptor patch for the target.
    target_patch, target_patch_descs = get_patch_geo(
        target_pcd, target_coord, center_point, target_desc, flip=True
    )
    # Make a ckdtree with the target.
    target_ckdtree = cKDTree(target_patch.points)
    ## Load the structures of the target and the source (to get the ground truth).
    parser = PDBParser()
    target_struct = parser.get_structure(
        "{}_{}".format(target_pdb_id, chains[0]),
        os.path.join(pdb_dir, "{}_{}.pdb".format(target_pdb_id, chains)),
    )
    #gt_source_struct = parser.get_structure(
    #    "{}_{}".format(target_pdb_id, chains[1]),
    #    os.path.join(pdb_dir, "{}_{}.pdb".format(target_pdb_id, chains[1])),
    #)
    # Get coordinates of atoms for the ground truth and target.
    target_atom_coords = [atom.get_coord() for atom in target_struct.get_atoms()]
    target_ca_coords = [
        atom.get_coord() for atom in target_struct.get_atoms() if atom.get_id() == "CA"
    ]
    target_atom_coord_pcd = PointCloud()
    target_ca_coord_pcd = PointCloud()
    target_atom_coord_pcd.points = Vector3dVector(np.array(target_atom_coords))
    target_ca_coord_pcd.points = Vector3dVector(np.array(target_ca_coords))
    target_atom_pcd_tree = KDTreeFlann(target_atom_coord_pcd)
    target_ca_pcd_tree = KDTreeFlann(target_ca_coord_pcd)
    found = False
    myrank_desc = float("inf")
    chosen_top = ranking[0:K]
    pos_scores = []
    pos_rmsd = []
    neg_scores = []
    # This is where the matched descriptors are actually aligned.
    for source_ix, source_pdb in enumerate(rand_list):
        viii = chosen_top[np.where(all_pdb_id[chosen_top] == source_pdb)[0]]
        source_vix = all_vix[viii]
        if len(source_vix) == 0:
            continue
        source_desc = p2_descriptors_straight[source_ix]
        source_pcd = copy.deepcopy(p2_point_clouds[source_ix])
        source_coords = p2_patch_coords[source_ix]
        # Randomly rotate and translate.
        random_transformation = get_center_and_random_rotate(source_pcd)
        source_pcd.transform(random_transformation)
        # Dock and score each matched patch. 
        #print({'source_pcd':source_pcd,'source_coords':source_coords,'source_desc':source_desc,'source_vix':source_vix\
        #,'target_patch':target_patch,'target_patch_descs':target_patch_descs,'target_ckdtree':target_ckdtree,'ransac_iter':ransac_iter})
        if source_pdb!=target_pdb:#same structure does not need docking
            all_results, all_source_patch, all_source_scores = multidock(
                source_pcd,
                source_coords,
                source_desc,
                source_vix,
                target_patch,
                target_patch_descs,
                target_ckdtree,
                nn_model, 
                ransac_iter=ransac_iter
            )
            res={'target_pdb':target_pdb,'source_pdb':source_pdb,'all_results':all_results,\
            'all_source_patch':all_source_patch,'all_source_scores':all_source_scores}
            ttf.append(res)
            # ttf holds several candidate docking poses, identified via the fingerprint search

Inspecting ttf[0]:

>>> ttf[0]
{'target_pdb': '1A2K_C', 'source_pdb': '5JYL_B', 'all_results': [registration::RegistrationResult with fitness=0.000000e+00, inlier_rmse=0.000000e+00, and correspondence_set size of 0
Access transformation to get result., registration::RegistrationResult with fitness=0.000000e+00, inlier_rmse=0.000000e+00, and correspondence_set size of 0
Access transformation to get result., registration::RegistrationResult with fitness=2.600000e-01, inlier_rmse=6.074436e-01, and correspondence_set size of 26
Access transformation to get result., registration::RegistrationResult with fitness=0.000000e+00, inlier_rmse=0.000000e+00, and correspondence_set size of 0
Access transformation to get result.], 'all_source_patch': [geometry::PointCloud with 100 points., geometry::PointCloud with 100 points., geometry::PointCloud with 100 points., geometry::PointCloud with 100 points.], 'all_source_scores': [0.0014200918, 0.001413941, 2.9317687e-05, 0.001454716]}

all_results[0] is an <open3d.open3d.registration.RegistrationResult> object; fitness measures the fraction of corresponding points, and if it equals 0,
the structure is bad.
But how can I transform a RegistrationResult and PointCloud into pdb format? I want to get the docked structure.
I am a novice at docking, please help me.
Thank you.

step 1: downloading the pdb failed

When I run ./data_prepare_one.sh 1MBN_A_ under data/masif_site, it fails to download the structure and shows:

WARNING: The default download format has changed from PDB to PDBx/mmCif

Getting "docked" coordinates from masif_ppi

I have two proteins and would like to get the "docked" coordinates from masif_ppi. It looks like source/masif_ppi_search/second_stage_alignment.py performs an alignment of points, but I was wondering if there's another function or script that can apply the best N transformations to the input ligand in order to generate a set of N docked poses.
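For illustration, one way to apply a single 4x4 homogeneous transformation to a structure's atoms with Biopython; a hedged sketch, where the transformation is assumed to come from an open3d RegistrationResult and the file names are placeholders:

import numpy as np
from Bio.PDB import PDBParser, PDBIO

T = np.eye(4)  # stand-in for result.transformation
struct = PDBParser(QUIET=True).get_structure("ligand", "ligand.pdb")
for atom in struct.get_atoms():
    xyz = np.append(atom.get_coord(), 1.0)  # homogeneous coordinates
    atom.set_coord((T @ xyz)[:3])
io = PDBIO()
io.set_structure(struct)
io.save("ligand_docked.pdb")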

Lot of variation in masif_ppi_search outputs?

Hi there - I always get very different answers when running the pdl1 benchmark with the neural network, and I am not sure why.

What is the source of variation, shouldn't the neural network weights for pdl1_benchmark_nn.py be preloaded?

Otherwise a completely random model would be initialized with nn_model = ScoreNN(), which doesn't seem right. Thanks!

Can a local pdb file be used as input?

Hi,
Can MaSIF take local pdb files (not deposited in the PDB database) as inputs, perhaps by modifying the scripts in data preparation?
Thanks in advance if anyone can give some suggestions!

NameError: name 'read_point_cloud' is not defined

When I run MaSIF-search ./second_stage_masif.sh 2000 with docker, the problem occurs that this function is not defined, but I do not know where read_point_cloud is supposed to come from...

root@8638240e23fe:/masif/comparison/masif_ppi_search/masif_descriptors_nn# ./second_stage_masif.sh 2000
2021-06-18 05:15:52.852452: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
['/masif/source//masif_ppi_search/second_stage_alignment_nn.py', '../../../data/masif_ppi_search', '2000', '2000', '1000', 'masif']
Loading patch coordinates for 2I32_A_E
Traceback (most recent call last):
  File "/masif/source//masif_ppi_search/second_stage_alignment_nn.py", line 114, in <module>
    read_point_cloud(
NameError: name 'read_point_cloud' is not defined

Ever considered changing the hydrophobicity scale?

I was wondering if you have ever considered using a more detailed treatment of hydrophobicity? I was just thinking about how if a tiny fraction of a leucine residue is solvent-accessible, this is then labelled as extremely hydrophobic. However, in this case it is incorrect - the Kyte-Doolittle scale rates leucine this way due to its size and composition, and if you reduce the (effective / solvent-accessible) size then the hydrophobicity should also be reduced.

For example, this one assigns a +1 or -1 depending on both the residue and the atom type:
A Simple Atomic-Level Hydrophobicity Scale Reveals Protein Interfacial Structure, Kapcha & Rossky, JMB 2014
https://www.sciencedirect.com/science/article/pii/S0022283613006232?via%3Dihub

Also, I notice that the module "triangulation.computeHydrophobicity" has no means of dealing with non-canonical or modified amino acids. I haven't checked, but it looks like the code will break if it encounters one. Maybe you should at least use the .get() method to access the kd_scale dictionary. It would be best if there were some way of matching non-canonical / modified amino acids to some hydrophobicity scale; however, I'm not sure if those terms are used consistently. At least I've never found a look-up table from 3-letter code to modified amino acid, etc. And of course this should be made clear somewhere, whichever way you go.
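As an illustration of the suggested .get() fallback (abbreviated, self-contained sketch; the values shown are from the Kyte-Doolittle scale):

kd_scale = {"LEU": 3.8, "ILE": 4.5, "ARG": -4.5}  # abbreviated Kyte-Doolittle table
res_name = "MSE"  # a modified residue absent from the table
hydrophobicity = kd_scale.get(res_name, 0.0)  # neutral default instead of a KeyError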

second_stage_alignment_nn.py error

Hi there,
I am trying to see if I can train the ppi-search model and use it for new predictions. However, when I was following the instructions and ran ./second_stage_masif.sh 100, there was an error message from Python 3.

File "/masif/source//masif_ppi_search/second_stage_alignment_nn.py", line 114, in
read_point_cloud(
NameError: name 'read_point_cloud' is not defined

Would you mind letting me know how to fix it?
Also, would you kindly describe how to prepare my own data (I have some docking models from Rosetta), as I would like to use them for ppi_search?
Thanks.

masif_ppi_search struggles with rotated chains

Hello,

Thanks for your work on MaSIF and for making the source code available! I'm running MaSIF in the supplied docker container, with some tweaks (for example, I changed the scripts a little to output the high-scoring docked structures for manual review).

I'm running on a very small list of proteins of interest to me. I first ran the proteins through the default pipeline where it downloads the complexes directly from the PDB to recompute the benchmark, followed by the second stage alignment. This was fairly successful.

However, in the next stage, I decided to try something more like a "real world" experiment where I would not necessarily have a native structure; instead, I would have two chains that I suspect interact but don't know how. So I used the same pairs from above and did some standard light protein prep (no backbone changes) and rotation to emulate the perhaps-unknown position of the two proteins to one another. There was a marked decrease in performance, and upon further trials it became clear that the protein preparation was not an issue but that rotation of the chains contributed to a huge decrease in performance, and even an inability to find anything close to a native structure despite previously finding near-native structures with ease.

Based on what I understand from the paper and the code, the interface descriptors and alignment phase should be agnostic to initial rotation of the chain. Is there something I'm missing here?

Thanks!

length assert error while running data_prepare_one.sh 1MBN_A_ under folder data/masif_site

Traceback

Traceback (most recent call last):
  File "/home/masif/source//data_preparation/04-masif_precompute.py", line 67, in <module>
    read_data_from_matfile_full_protein(coord_file, shape_file, ppi_pair_id, params, pid, label_iface=True)
  File "/home/masif/source/masif_modules/read_data_from_matfile.py", line 324, in read_data_from_matfile_full_protein
    assert len(np.unique(rows)) == len(p_s[protein_id]["X"][0])
AssertionError

The lengths differ: left 3273, right 3276.

Any clue for this error? Thanks!

Generating Geometric Data

Hi,

Thanks for sharing your work on MaSIF. I was trying to understand the process in more depth, in particular how the angular and radial coordinates for each patch are generated and used. However, in the masif->source->data preparation section, the README mentions the files that I should look for, but files 02, 03, and 03b from that README are missing on GitHub. Thus I cannot see the code for generating this geometric data.

If you could please upload these missing files, I'd greatly appreciate it!

Thank you
Devesh

pdl1_benchmark do not predict interface between 4ZQK_A and 4ZQK_B

Hi,

I tried to run pdl1_benchmark with pdl1_benchmark_nn.py, but the 4ZQK_A and 4ZQK_B are not in the list of top_scores.
They matched only when I lowered iface_cutoff to 0.5, and only at a single point:
near_points: [1820]
iface: [0.5888073]
diff: [1.696772]

Is the model provided in masif/data/masif_pdl1_benchmark/nn_models/
the same as described in the paper, or should I train the model myself to reproduce the results?

Thank you so much!

Patch extraction

Sorry to bother you, I don't really understand how you divide the mesh surface into patches.

Do you start at one random vertex in the mesh, perform a geodesic distance calculation for a given radius, and then select a new vertex not yet covered by previous calculations, or do you just compute a patch for each vertex in the mesh? The input for the neural network consists of a matrix of size: number of patches x features considered (5).

I cannot find the code lines that explain this patch division.
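This is the per-vertex variant I mean (a sketch using SciPy's Dijkstra over mesh edges as a stand-in for true geodesic distances, not the actual MaSIF code):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import dijkstra

    def extract_patches(verts, faces, radius=9.0):
        """verts: (V, 3) float array; faces: (F, 3) int array.
        Returns, for every vertex, the indices of vertices within the given
        geodesic radius (approximated by shortest paths along mesh edges)."""
        # Collect each undirected edge once, weighted by its Euclidean length.
        edges = np.vstack([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
        edges = np.unique(np.sort(edges, axis=1), axis=0)
        w = np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
        n = len(verts)
        g = csr_matrix((w, (edges[:, 0], edges[:, 1])), shape=(n, n))
        # limit= abandons paths once they exceed the patch radius.
        d = dijkstra(g, directed=False, limit=radius)
        return [np.flatnonzero(np.isfinite(d[v])) for v in range(n)]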

Best,
Liv

Error while running on a predicted structure

Hello,

I encountered an error while running data_prepare_one.sh on a predicted protein structure.

The script ran fine on the downloaded PDB file:

Singularity masif_latest:/global/scratch/software/MaSIF/masif/data/masif_site> ./data_prepare_one.sh --file data_preparation/00-raw_pdbs/4ZQK.pdb 4ZQK_A

Running masif site on data_preparation/00-raw_pdbs/4ZQK.pdb
cp: 'data_preparation/00-raw_pdbs/4ZQK.pdb' and 'data_preparation/00-raw_pdbs/4ZQK.pdb' are the same file
Empty
Removing degenerated triangles
Removing degenerated triangles
4ZQK_A
Reading data from input ply surface files.
Dijkstra took 3.65s
Only MDS time: 15.50s
Full loop time: 24.70s
MDS took 24.70s

It also ran fine on a predicted structure I downloaded online:

Singularity masif_latest:/global/scratch/software/MaSIF/masif/data/masif_site> ./data_prepare_one.sh --file data_preparation/00-raw_pdbs/AvrPita.pdb AvrPita_A

Running masif site on data_preparation/00-raw_pdbs/AvrPita.pdb
cp: 'data_preparation/00-raw_pdbs/AvrPita.pdb' and 'data_preparation/00-raw_pdbs/AvrPita.pdb' are the same file
Empty
Removing degenerated triangles
Removing degenerated triangles
AvrPita_A
Reading data from input ply surface files.
Dijkstra took 7.01s
Only MDS time: 29.31s
Full loop time: 46.89s
MDS took 46.89s

However, for this structure predicted on my local machine, the script drops into the debugger:

Singularity masif_latest:/global/scratch/software/MaSIF/masif/data/masif_site> ./data_prepare_one.sh --file data_preparation/00-raw_pdbs/MGG-01993-ITASSER.pdb MGG-01993-ITASSER_A

Running masif site on data_preparation/00-raw_pdbs/MGG-01993-ITASSER.pdb
cp: 'data_preparation/00-raw_pdbs/MGG-01993-ITASSER.pdb' and 'data_preparation/00-raw_pdbs/MGG-01993-ITASSER.pdb' are the same file
--Call--
> /usr/local/lib/python3.6/subprocess.py(758)__del__()
    756             self.wait()
    757
--> 758     def __del__(self, _maxsize=sys.maxsize, _warn=warnings.warn):
    759         if not self._child_created:
    760             # We didn't get to successfully create a child process.

ipdb>

I'm not sure what is causing this error; the head and tail of the files are shown below. Thank you in advance!

> head AvrPita.pdb
ATOM 1 N MET A 1 50.404 53.465 89.261 1.00 13.70
ATOM 2 CA MET A 1 49.060 53.953 88.970 1.00 13.70
ATOM 3 HA MET A 1 48.349 53.550 89.692 1.00 13.70
ATOM 4 CB MET A 1 49.107 55.497 89.071 1.00 13.70
ATOM 5 HB1 MET A 1 49.608 55.899 88.190 1.00 13.70
ATOM 6 HB2 MET A 1 49.694 55.787 89.940 1.00 13.70
ATOM 7 CG MET A 1 47.733 56.158 89.212 1.00 13.70
ATOM 8 HG1 MET A 1 47.334 55.866 90.163 1.00 13.70
ATOM 9 HG2 MET A 1 47.047 55.782 88.460 1.00 13.70
ATOM 10 SD MET A 1 47.708 57.971 89.163 1.00 13.70

> head MGG-011730-ITASSER.pdb
ATOM 1 H LEU 1 -30.724 18.366 -0.112 1.00 4.24
ATOM 2 N LEU 1 -30.717 19.328 -0.332 1.00 4.24
ATOM 3 CA LEU 1 -31.127 20.240 0.732 1.00 4.24
ATOM 4 C LEU 1 -30.216 20.107 1.947 1.00 4.24
ATOM 5 O LEU 1 -29.360 19.226 1.982 1.00 4.24
ATOM 6 CB LEU 1 -32.579 19.966 1.133 1.00 4.24
ATOM 7 CG LEU 1 -33.577 20.301 0.018 1.00 4.24
ATOM 8 CD1 LEU 1 -34.987 19.880 0.429 1.00 4.24
ATOM 9 CD2 LEU 1 -33.575 21.804 -0.259 1.00 4.24
ATOM 10 N PRO 2 -30.291 20.947 3.077 1.00 2.47

> tail AvrPita.pdb
ATOM 3585 H CYS A 224 76.364 66.692 47.328 1.00 5.19
ATOM 3586 CA CYS A 224 74.809 66.366 45.907 1.00 5.19
ATOM 3587 HA CYS A 224 74.247 65.434 45.807 1.00 5.19
ATOM 3588 CB CYS A 224 73.982 67.329 46.770 1.00 5.19
ATOM 3589 HB1 CYS A 224 73.157 67.727 46.174 1.00 5.19
ATOM 3590 HB2 CYS A 224 74.606 68.173 47.067 1.00 5.19
ATOM 3591 SG CYS A 224 73.262 66.560 48.246 1.00 5.19
ATOM 3592 C CYS A 224 75.018 66.922 44.489 1.00 5.19
ATOM 3593 O CYS A 224 76.123 67.110 43.980 1.00 5.19
TER

> tail MGG-011730-ITASSER.pdb
ATOM 661 OG1 THR 82 2.127 -6.129 7.772 1.00 2.03
ATOM 662 CG2 THR 82 0.326 -5.954 6.201 1.00 2.03
ATOM 663 N PRO 83 0.144 -8.497 9.877 1.00 3.59
ATOM 664 CA PRO 83 0.505 -9.336 10.943 1.00 3.59
ATOM 665 C PRO 83 0.584 -10.638 10.319 1.00 3.59
ATOM 666 O PRO 83 0.347 -10.751 9.103 1.00 3.59
ATOM 667 CB PRO 83 -0.615 -9.288 11.984 1.00 3.59
ATOM 668 CG PRO 83 -1.888 -9.067 11.196 1.00 3.59
ATOM 669 CD PRO 83 -1.811 -9.989 9.990 1.00 3.59
TER
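
Update: comparing the dumps, I notice that the I-TASSER model has no chain identifier (the column that reads "A" in AvrPita.pdb is empty), while data_prepare_one.sh was called with chain A. I can't be sure this is the cause, but it seems a likely suspect. A minimal sketch that stamps a chain ID into column 22 of each record (the file names are just my local ones):

    # PDB chain ID lives at column 22 (index 21); stamp "A" into every
    # ATOM/HETATM record that is long enough to have that column.
    with open("MGG-011730-ITASSER.pdb") as fin, \
         open("MGG-011730-ITASSER_A.pdb", "w") as fout:
        for line in fin:
            if line.startswith(("ATOM", "HETATM", "TER")) and len(line) > 21:
                line = line[:21] + "A" + line[22:]
            fout.write(line)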

PyMol plugin slow

Dear MaSIF team,

Thanks for publishing MaSIF as open source :-)
I find the PyMol plugin very useful, and I would like to use it for general visualization tasks as well. However, I found that the current version of Simple_mesh loads very slowly when the number of vertices is high. I managed to fix it (at least for my purposes) by changing the following:

        for jj in range(len(self.attributes["vertex_x"])):
            self.vertices = np.vstack(
                [
                    self.attributes["vertex_x"],
                    self.attributes["vertex_y"],
                    self.attributes["vertex_z"],
                ]
            ).T

to

        self.vertices = np.vstack(
            [
                self.attributes["vertex_x"],
                self.attributes["vertex_y"],
                self.attributes["vertex_z"],
            ]
        ).T

This reduces the loading time from more than 5 minutes (I killed PyMol at some point) to about 1 second for a surface with roughly 100,000 vertices. I think the change is safe because the jj variable is not used anywhere in the loop (maybe one could also guard it with if len(...):). I don't need any help with this; I just thought I'd report it in case you want to fix it on GitHub.

Best regards,
Franz Waibl

Masif_ppi_search output

Hi, I ran masif_ppi_search on the PDB 6M3M_A and got two output files:
p1_desc_flipped.npy and p1_desc_straight.npy
I cannot find documentation on what these two files mean or how they relate to finding binders for the input PDB. Any help would be appreciated, thanks!
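
From the paper, my working assumption is that the "straight" descriptor describes a patch as-is, while the "flipped" descriptor encodes the complementary surface, so a target's flipped descriptors should lie close (in Euclidean distance) to the straight descriptors of its true binders' patches. A sketch of the matching I have in mind (pairing files from two different proteins is my assumption; here I reuse the same protein's files just to illustrate the shapes):

    import numpy as np
    from scipy.spatial.distance import cdist

    # Flipped descriptors of the target's patches ...
    target_flipped = np.load("p1_desc_flipped.npy")
    # ... matched against straight descriptors of a candidate's patches
    # (in a real search this file would come from another protein).
    candidate_straight = np.load("p1_desc_straight.npy")

    # All pairwise Euclidean distances between descriptor vectors.
    d = cdist(target_flipped, candidate_straight)
    i, j = np.unravel_index(np.argmin(d), d.shape)
    print("best matching patch pair:", (i, j), "distance:", d[i, j])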

Failed to place the graph without changing the devices of some resources

Hi.

I tried to train MaSIF-site and got the following warning. Training still runs, but slowly.

2021-11-10 14:06:26.243439: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:1' assigned_device_name_='' resource_device_name_='/device:GPU:1' supported_device_types_=[CPU] possible_devices_=[]
Identity: CPU XLA_CPU XLA_GPU 
Assign: CPU 
Const: CPU XLA_CPU XLA_GPU 
ApplyAdam: CPU 
VariableV2: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
fully_connected_3/biases/Initializer/zeros (Const) 
fully_connected_3/biases (VariableV2) /device:GPU:1
fully_connected_3/biases/Assign (Assign) /device:GPU:1
fully_connected_3/biases/read (Identity) /device:GPU:1
fully_connected_3/biases/Adam/Initializer/zeros (Const) /device:GPU:1
fully_connected_3/biases/Adam (VariableV2) /device:GPU:1
fully_connected_3/biases/Adam/Assign (Assign) /device:GPU:1
fully_connected_3/biases/Adam/read (Identity) /device:GPU:1
fully_connected_3/biases/Adam_1/Initializer/zeros (Const) /device:GPU:1
fully_connected_3/biases/Adam_1 (VariableV2) /device:GPU:1
fully_connected_3/biases/Adam_1/Assign (Assign) /device:GPU:1
fully_connected_3/biases/Adam_1/read (Identity) /device:GPU:1
Adam/update_fully_connected_3/biases/ApplyAdam (ApplyAdam) /device:GPU:1
save/Assign_62 (Assign) /device:GPU:1
save/Assign_63 (Assign) /device:GPU:1
save/Assign_64 (Assign) /device:GPU:1
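
For what it's worth, the warning says the only candidate device is /device:CPU:0 even though /device:GPU:1 was requested, which usually means this TensorFlow build cannot see a GPU at all, so everything falls back to CPU (that would explain the slow training). A quick sanity check for the TF 1.x that MaSIF uses:

    import tensorflow as tf
    from tensorflow.python.client import device_lib

    # True only if this TensorFlow build was compiled with CUDA support.
    print(tf.test.is_built_with_cuda())
    # Devices TF can actually place ops on; a usable GPU shows up as
    # "/device:GPU:0" (also worth checking CUDA_VISIBLE_DEVICES).
    print([d.name for d in device_lib.list_local_devices()])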

Pretrained model for the second-stage alignment

Hi!

In the MaSIF-search protocol I'm trying to run the second-stage alignment using the pretrained model; however, the directory /masif/data/masif_pdl1_benchmark/models/nn_score contains only .index and .data files. Would it be possible to provide the trained_model.hdf5 file that the score_nn.py script tries to load?
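
As far as I can tell, the .index/.data pair is a TensorFlow checkpoint rather than a Keras .hdf5 file, so in the meantime the most I can do is inspect its variables (a sketch; the checkpoint prefix is a placeholder):

    import tensorflow as tf

    # Pass the common prefix of the .index/.data-00000-of-00001 pair,
    # without any extension.
    reader = tf.train.NewCheckpointReader("models/nn_score/checkpoint_prefix")
    for name, shape in reader.get_variable_to_shape_map().items():
        print(name, shape)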

Thank you so much,
Goran
