Prediction of protein-protein interaction sites using convolutional neural network and improved data sets.
Zengyan Xie, Xiaoya Deng, Kunxian Shu.
Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China.
Protein-protein interaction (PPI) sites play a key role in the formation of protein complexes, which are the basis of a variety of biological processes. Experimental methods for determining PPI sites are expensive and time-consuming, which has led to the development of various prediction algorithms. We propose a convolutional neural network for PPI site prediction and use residue binding propensity to improve the positive samples. Our method obtains an area under the curve (AUC) of 0.912 on the improved data set. In addition, it yields much better results on samples with high binding propensity than on randomly selected samples. This suggests that the positive samples defined by the distance between residue atoms contain a considerable number of false-positive PPI sites.
If you publish pictures or models produced with our software, please cite the following paper:
Xie, Z.; Deng, X.; Shu, K. Prediction of Protein–Protein Interaction Sites Using Convolutional Neural Network and Improved Data Sets. Int. J. Mol. Sci. 2020, 21, 467.
DEPENDENCIES
Our tools depend on the following:
- Python 3.5
- TensorFlow 1.10.0
- Python modules: NumPy, Matplotlib, re, sys, os, random, sklearn
- Tools: PSAIA, PSI-BLAST
Please install these dependencies before using our tools.
USAGE
- Feature Extraction (see Section 4.5 of our paper for details):
- Amino Acid Encoding
The twenty amino acids were encoded using one-hot encoding (Table S4 in the Supplementary Material).
- Profile Features
The position-specific scoring matrix (PSSM) and position-specific frequency matrix (PSFM) of a given protein were computed by running 3 iterations of PSI-BLAST [66] against the NCBI NR database with the E-value threshold set to 0.001. For each residue, the PSSM and PSFM columns within a window of length 3 centered at that residue were taken to obtain a 3 x 40 matrix.
- Amino Acid Physicochemical Properties
Twenty-four physicochemical properties of amino acids [67] are used in this study. For each property, the twenty amino acids are divided into three groups, and the group membership is encoded using one-hot encoding, so each amino acid is represented as a 72-dimensional vector (24 properties x 3 groups).
- Structure Features
Five structure-based features (ASA, RASA, DPX, CX, Hydrophobicity) were computed using PSAIA.
You can use Python scripts for all the above steps, including data preprocessing.
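As an illustration of the amino acid encoding and profile windowing described above, here is a minimal sketch using NumPy. It is not the repository's own script. The residue ordering in `AMINO_ACIDS`, the function names `one_hot_encode` and `window_profile`, the all-zero row for unknown residues, and the zero padding at the chain termini are all assumptions made for this example; Table S4 of the paper may define a different ordering.

```python
import numpy as np

# Assumption: alphabetical one-letter ordering; Table S4 may use another order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an (L, 20) one-hot matrix for a protein sequence.

    Residues outside the twenty standard amino acids are left as
    all-zero rows (an assumption of this sketch).
    """
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


def window_profile(pssm, psfm, center, window=3):
    """Return a (window, 40) matrix of PSSM and PSFM rows around `center`.

    `pssm` and `psfm` are (L, 20) arrays. Window positions that fall
    outside the sequence are zero-padded (an assumption of this sketch).
    """
    profile = np.hstack([pssm, psfm])  # shape (L, 40)
    half = window // 2
    out = np.zeros((window, profile.shape[1]), dtype=np.float32)
    for row, pos in enumerate(range(center - half, center + half + 1)):
        if 0 <= pos < profile.shape[0]:
            out[row] = profile[pos]
    return out
```

Calling `window_profile(pssm, psfm, i)` for every residue index `i` gives one 3 x 40 matrix per residue, matching the shape stated in the Profile Features step.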
- Training and testing are simple. Follow these steps:
- Put the feature files of each complex in a folder.
- Run leave_one_complex.py to obtain the AUC of each complex using leave-one-complex-out validation.
- Run kfold.py to obtain the results of 5-fold cross-validation.
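To make the leave-one-complex-out idea concrete, the sketch below holds out the residues of one complex at a time and reports a per-complex AUC using scikit-learn. It does not reproduce leave_one_complex.py: the function name `leave_one_complex_out_auc`, the synthetic data, and the logistic regression classifier (a lightweight stand-in for the paper's convolutional neural network) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut


def leave_one_complex_out_auc(features, labels, complex_ids):
    """Return {complex_id: AUC}, holding out one complex per round."""
    aucs = {}
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(features, labels, groups=complex_ids):
        held_out = str(complex_ids[test_idx][0])
        # AUC is undefined if the held-out complex contains a single class.
        if len(np.unique(labels[test_idx])) < 2:
            continue
        # Stand-in classifier (assumption); the paper uses a CNN instead.
        model = LogisticRegression(max_iter=1000)
        model.fit(features[train_idx], labels[train_idx])
        scores = model.predict_proba(features[test_idx])[:, 1]
        aucs[held_out] = roc_auc_score(labels[test_idx], scores)
    return aucs


# Synthetic demo data (hypothetical): 3 complexes with 100 residues each.
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=300)
features = rng.normal(size=(300, 8)) + labels[:, None]
complex_ids = np.repeat(np.array(["c1", "c2", "c3"]), 100)

aucs = leave_one_complex_out_auc(features, labels, complex_ids)
for name, auc in sorted(aucs.items()):
    print(name, round(auc, 3))
```

Grouping the split by complex keeps all residues of the held-out complex out of training, so the reported AUC reflects performance on an unseen complex rather than on unseen residues of a complex the model has already seen.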
We tested our model on 8 Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz and NVIDIA Corporation GP102 [TITAN Xp] (rev a1).