
DeepAffinity: Intro

Drug discovery demands rapid quantification of compound-protein interaction (CPI). However, there is a lack of methods that can predict compound-protein affinity from sequences alone with high applicability, accuracy, and interpretability. We present an integration of domain knowledge and learning-based approaches. Under novel representations of structurally-annotated protein sequences, a semi-supervised deep learning model that unifies recurrent and convolutional neural networks is proposed to exploit both unlabeled and labeled data, jointly encoding molecular representations and predicting affinities. Performance on new protein classes with few labeled data is further improved by transfer learning. Furthermore, novel attention mechanisms are developed and embedded in our model to add to its interpretability. Lastly, alternative representations using protein sequences or compound graphs, and a unified RNN/GCNN-CNN model using graph CNN (GCNN), are also explored to reveal algorithmic challenges ahead.

DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks

(What has happened since DeepAffinity? Please check out our latest work: Explainable Deep Relational Networks for Predicting Compound–Protein Affinities and Contacts.

  • More DeepAffinity variants (such as hierarchical RNN for protein amino-acid sequence and GCN/GIN for compound graphs)
  • Much more interpretable DeepAffinity+ with regularized and supervised attentions as well as DeepRelations with intrinsically explainable model architecture
  • Demonstration of how interpretability helps in drug discovery: binding site prediction, ligand docking, and structure activity relationship (SAR; such as ligand scoring and lead optimization)

We have not released the code yet, but we have already released the data, labeled with both the affinity and the explanation of the affinity (binary residue-atom contacts).)

Training DeepAffinity: Illustration

[Figure: the DeepAffinity training process]
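As a rough, illustrative sketch only (not the repository's actual TF1/TFLearn code), the unified idea described above — a recurrent encoder over tokens, convolution on top of it, and a joint regression head — could look as follows in modern Keras. All vocabulary sizes and hidden dimensions here are invented placeholders; only the SPS maximum length of 152 is taken from the training example further below.

# toy_unified_rnn_cnn.py -- an illustrative sketch of the unified RNN/CNN idea,
# NOT the repository's TF1/TFLearn implementation; vocab sizes and hidden
# dimensions are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

def encoder(vocab_size, max_len, name):
    """Tokenized sequence -> embedding -> RNN -> 1D CNN -> fixed-size vector."""
    inp = layers.Input(shape=(max_len,), dtype="int32", name=name + "_tokens")
    x = layers.Embedding(vocab_size, 128)(inp)        # learned token embeddings
    x = layers.GRU(128, return_sequences=True)(x)     # recurrent encoding
    x = layers.Conv1D(64, 4, activation="relu")(x)    # convolution over RNN outputs
    x = layers.GlobalMaxPooling1D()(x)
    return inp, x

# SPS max length 152 matches the translate.py example below; vocab sizes are invented.
prot_in, prot_vec = encoder(vocab_size=76, max_len=152, name="prot")   # SPS tokens
comp_in, comp_vec = encoder(vocab_size=68, max_len=100, name="comp")   # SMILES tokens

joint = layers.Concatenate()([prot_vec, comp_vec])
hidden = layers.Dense(256, activation="relu")(joint)
affinity = layers.Dense(1, name="affinity")(hidden)   # e.g., predicted pIC50

model = Model([prot_in, comp_in], affinity)
model.compile(optimizer="adam", loss="mse")
model.summary()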

Prerequisites:

  • Tensorflow-gpu v1.1
  • Python 3.6
  • TFLearn v0.3
  • Scikit-learn v0.19
  • Anaconda 3/5.0.0.1
  • We provide our environment list as environment.yml. You can create your own environment with:
conda env create -n envname -f environment.yml

Table of contents:

  • data_script: Contains the supervised-learning datasets (pIC50, pKi, pEC50, and pKd)
  • Seq2seq_models: Contains the auto-encoder seq2seq models and their data for both SPS and SMILES representations
  • baseline_models: Contains shallow models for both Pfam/PubChem features and features generated from the encoder part of the seq2seq model
  • Separate_models: Contains deep learning models for features generated from the encoder part of the seq2seq model
  • Joint_models: Contains all the joint models, including:
    • Separate attention mechanism
    • Marginalized attention mechanism
    • Joint attention mechanism
    • Graph convolution neural network (GCNN) with separate attention mechanism
  • (Update: Apr. 22, 2021) data_DeepRelations: A newly curated dataset for explainable prediction of compound-protein affinities, containing 4,446 pairs between 3,672 compounds and 1,287 proteins, labeled with both inter-molecular affinity (pKd/pKi) and residue-atom contacts/interactions.

Testing the model

To test DeepAffinity on a new dataset, please follow the steps below:

  • Download the checkpoints trained on the IC50 training set from the following link
  • Download the checkpoints trained on the whole IC50 dataset from the following link
  • Download the checkpoints trained on the whole Kd dataset from the following link
  • Download the checkpoints trained on the whole Ki dataset from the following link
  • Put your data in the folder "Joint_models/joint_attention/testing/data"
  • cd Joint_models/joint_attention/testing/
  • Run the Python script joint-Model-test.py

You may also use the provided script to run our model in one command; details can be found in our manual (last updated: Apr. 9, 2020).

(Apr. 27, 2021) If the number of testing pairs in the input is below 64 (the batch size), the script returns an error (InvalidArgumentError ... ConcatOp : Dimensions of inputs should match: ...). This rigidity is unfortunately due to TFLearn. An easy workaround is to "pad" the input file to at least 64 pairs using arbitrary compound-protein inputs, as sketched below.
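A minimal sketch of this padding workaround, assuming one entry per line in each parallel input file (the file names below match those in the traceback further down; a labels file, if present, would need the same treatment, and predictions for the padded rows should be discarded afterwards):

# pad_inputs.py -- a minimal sketch of the batch-size padding workaround.
# Assumes one entry per line in each (parallel) input file.
BATCH_SIZE = 64

def pad_file(path, filler_line):
    with open(path) as f:
        lines = f.read().splitlines()
    missing = max(0, BATCH_SIZE - len(lines))
    lines += [filler_line] * missing          # repeat an arbitrary dummy entry
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

pad_file("data/test_sps", "ANGL,CEGL,AEKM,CEKS")  # arbitrary SPS string
pad_file("data/test_smile", "CCO")                # arbitrary SMILES (ethanol)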

(Aug. 21, 2020) We are now providing SPS (Structure Property-annotated Sequence) data for all human proteins! zip (Credit: Dr. Tomas Babak at Queen's University). Columns: 1. Gene identifier, 2. Protein FASTA, 3. SS (Scratch), 4. SS8 (Scratch), 5. acc (Scratch), 6. acc20, 7. SPS

P.S. Given the distribution of protein sequence lengths in our training data, our trained checkpoints are recommended for proteins between a few tens and 1500 residues long.
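For instance, a trivial pre-filter along these lines could be used ("proteins.txt" and its one-sequence-per-line format are assumptions, and 20 is a loose reading of "a few tens" of residues):

# filter_by_length.py -- a sketch; file name and format are assumptions.
MIN_LEN, MAX_LEN = 20, 1500

with open("proteins.txt") as f:
    seqs = [line.strip() for line in f if line.strip()]

kept = [s for s in seqs if MIN_LEN <= len(s) <= MAX_LEN]
print("kept", len(kept), "of", len(seqs), "sequences")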

Re-training the seq2seq models for a new dataset:

(Added on Jan. 18, 2021) To re-train the seq2seq models for a new dataset, please follow the steps below:

  • Use the translate.py script in any of the seq2seq models with the following important arguments:
    • data_dir: data directory that includes all the data
    • train_dir: training directory where all the checkpoints will be saved
    • from_train_data: source training data to be translated from
    • to_train_data: target training data to be translated to (can be the same as from_train_data when auto-encoding, as we did in the paper)
    • from_dev_data: source validation data to be translated from
    • to_dev_data: target validation data to be translated to (can be the same as from_dev_data when auto-encoding, as we did in the paper)
    • num_layers: number of RNN layers (default 2)
    • batch_size: batch size (default 256)
    • num_train_step: number of training steps (default 100K)
    • size: size of the hidden dimension of the RNN models (default 256)
    • SPS_max_length (SMILE_max_length): maximum length of SPS (SMILES)
  • Suggestion for using seq2seq models:
    • For protein encoding: seq2seq_part_FASTA_attention_fw_bw
    • For compound encoding: seq2seq_part_SMILE_attention_fw_bw
  • Example of running for proteins: python translate.py --data_dir ./data --train_dir ./checkpoints --from_train_data ./data/FASTA_from.txt --to_train_data ./data/FASTA_to.txt --from_dev_data ./data/FASTA_from_dev.txt --to_dev_data ./data/FASTA_to_dev.txt --SPS_max_length 152
  • Once training is done, copy the parameter weight files cell_*.txt, embedding_W.txt, and *_layer_states.txt into joint_attention/joint_fixed_RNN/data/prot_init; they will be used for the next step, supervised training of the joint attention model (you can do the same for the separate and marginalized attention models as well). See the sketch below.
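A sketch of that copy step. The source directory is whatever train_dir you used above; the destination shown is for protein weights, and a corresponding folder for compound weights is an assumption to verify against the repo layout:

# copy_seq2seq_weights.py -- a sketch of moving pretrained encoder weights
# into the joint model's data folder; SRC is the train_dir used above.
import glob
import os
import shutil

SRC = "./checkpoints"
DST = "Joint_models/joint_attention/joint_fixed_RNN/data/prot_init"  # proteins

os.makedirs(DST, exist_ok=True)
for pattern in ("cell_*.txt", "embedding_W.txt", "*_layer_states.txt"):
    for path in glob.glob(os.path.join(SRC, pattern)):
        shutil.copy(path, DST)
        print("copied", path)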

Note:

We recommend referring to PubChem for canonical SMILES for compounds, for example as in the sketch below.
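A small helper using PubChem's PUG REST API (network access required; the CID used in the example is aspirin):

# fetch_canonical_smiles.py -- a sketch using PubChem's PUG REST API to get
# the canonical SMILES for a compound CID.
import urllib.request

def canonical_smiles(cid):
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
           "compound/cid/{}/property/CanonicalSMILES/TXT".format(cid))
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode().strip()

print(canonical_smiles(2244))  # aspirin: CC(=O)OC1=CC=CC=C1C(=O)O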

Citation:

If you find the code useful for your research, please consider citing our paper:

@article{karimi2019deepaffinity,
  title={DeepAffinity: interpretable deep learning of compound--protein affinity through unified recurrent and convolutional neural networks},
  author={Karimi, Mostafa and Wu, Di and Wang, Zhangyang and Shen, Yang},
  journal={Bioinformatics},
  volume={35},
  number={18},
  pages={3329--3338},
  year={2019},
  publisher={Oxford University Press}
}

Contacts:

Yang Shen: [email protected]

Di Wu: [email protected]

Mostafa Karimi: [email protected]

deepaffinity's People

Contributors

astrosign, mostafakarimi71, shen-lab, yang-shen, yyou1996


deepaffinity's Issues

TypeError when running joint-Model_test.py

I'm trying to use DeepAffinity with the pretrained model, but this error appeared:

Tokenizing data in ./data/test_sps
Tokenizing data in ./data/test_smile
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 491, in apply_op
preferred_dtype=default_dtype)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 704, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 577, in _TensorTensorConversionFunction
% (dtype.name, t.dtype.name, str(t)))
ValueError: Tensor conversion requested dtype int32 for Tensor with dtype float32: 'Tensor("GRU/GRU/GRUCell/Gates/add:0", shape=(64, 512), dtype=float32)'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "joint-Model_test.py", line 230, in
prot_gru_1 = tflearn.gru(prot_embd, GRU_size_prot,initial_state= prot_init_state_1,trainable=True,return_seq=True,restore=True)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tflearn/layers/recurrent.py", line 294, in gru
scope=scope, name=name)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tflearn/layers/recurrent.py", line 67, in rnn_template
sequence_length=sequence_length)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn.py", line 197, in static_rnn
(output, state) = call_cell()
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn.py", line 184, in
call_cell = lambda: cell(input
, state)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tflearn/layers/recurrent.py", line 601, in call
self.trainable, self.restore, self.reuse))
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1198, in split
split_dim=axis, num_split=num_or_size_splits, value=value, name=name)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3306, in _split
num_split=num_split, name=name)
File "/home/recher/anaconda3/envs/deep_affinity/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 514, in apply_op
(prefix, dtypes.as_dtype(input_arg.type).name))
TypeError: Input 'split_dim' of 'Split' Op has type float32 that does not match expected type of int32.

Both Training and Testing with my own dataset.

Hi, I appreciate your nice work, which provides us with insights into SPS.

I read the README pdf file for running your model on my own datasets, but I couldn't find how to run the model from scratch with my own datasets.

I only found "training with your dataset and testing with my own dataset", or
"using a pretrained model and testing with my own dataset".

What I want to do is "training with my own dataset and testing with my own dataset, too", for a fair comparison.

Thanks in advance!

Generating the protein SPS representation

Hi, I am trying to run DeepAffinity on the PDBbind dataset, and am running into issues when trying to use SSpro/ACCpro to get the secondary structure and exposedness predictions for the proteins in PDBbind. I've followed the installation instructions and was able to run the software on some proteins, but it failed for the majority of them. The error message was pretty cryptic, and I don't think it's a problem with the binaries, since I was able to run the software on some of the PDBbind protein sequences. I am wondering if you have some insight on this? Or do you perhaps have the processed data for PDBbind? Thanks!

Is the tool Mac compatible?

Hello,

Is DeepAffinity Mac compatible? Asking because I'm following the installation using the conda environment.yml and these packages are not found:

  • libstdcxx-ng
  • libgcc-ng

Many thanks

Checkpoint for ki dataset

Hi,

Thanks for sharing this great work.

Is it true that the checkpoints you provided are for the IC50 dataset only? If yes, do you have checkpoints for the Ki dataset as well?

Thank you!

Pubchem fingerprints

Is it possible to share the code that was used to generate the PubChem fingerprints?

Data

Hello,

I was looking for the channel, GPCR, ER, and kinase datasets with train/test splits, but couldn't find them in the "baseline" folder. Could you please help me?

DeepAffinity ID and Public ID Mapping Table

Thanks for your great packages.

I want to get a mapping table (DeepAffinity ID to public ID) for proteins and drugs
(for interpreting DeepAffinity output).

e.g. for proteins (mapped to UniProt AC):
KV37 | P1234
e.g. for drugs (mapped to PubChem CID):
KV37 | 12345

Could you also describe again how you made the PubChem binary fingerprint per compound:
-> Input: PubChem CID or SMILES
-> Output: 0/1 vector of length 881

Thanks in advance

why tflearn?

tflearn is no longer updated; it only supports TF1 and now seems to be abandoned.
Why did you use tflearn when writing the code?

Is it because you like tflearn?

Can't create environment

Hi, thanks for your awesome work on DTA prediction; I really appreciate your team!

I want to make SPS sequences from my own protein data, and I'm struggling to use the SSpro/ACCpro tool.

https://github.com/Shen-Lab/DeepAffinity/blob/master/DeepAffinity_Manual.pdf
I referred to this manual and tried to create the conda environment for running SSpro/ACCpro, but failed to create the environment.

I'm not from a CS background; I'm sorry if my question is too basic 😭.

> Conda info
active environment : base
active env location : /home/miruware/anaconda3
shell level : 1
user config file : /home/miruware/.condarc
populated config files :
conda version : 4.10.3
conda-build version : 3.20.5
python version : 3.8.5.final.0
virtual packages : __cuda=11.4=0
__linux=5.15.0=0
__glibc=2.31=0
__unix=0=0
__archspec=1=x86_64
base environment : /home/miruware/anaconda3 (writable)
conda av data dir : /home/miruware/anaconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /home/miruware/anaconda3/pkgs
/home/miruware/.conda/pkgs
envs directories : /home/miruware/anaconda3/envs
/home/miruware/.conda/envs
platform : linux-64
user-agent : conda/4.10.3 requests/2.24.0 CPython/3.8.5 Linux/5.15.0-60-generic ubuntu/20.04.2 glibc/2.31
UID:GID : 1000:1000
netrc file : None
offline mode : False

> Conda list

Name Version Build Channel
_ipyw_jlab_nb_ext_conf 0.1.0 py38_0
_libgcc_mutex 0.1 main
absl-py 0.13.0 pypi_0 pypi
alabaster 0.7.12 py_0
anaconda 2020.11 py38_0
anaconda-client 1.7.2 py38_0
anaconda-navigator 1.10.0 py38_0
anaconda-project 0.8.4 py_0
argh 0.26.2 py38_0
argon2-cffi 20.1.0 py38h7b6447c_1
asn1crypto 1.4.0 py_0
astroid 2.4.2 py38_0
astropy 4.0.2 py38h7b6447c_0
astunparse 1.6.3 pypi_0 pypi
async_generator 1.10 py_0
atomicwrites 1.4.0 py_0
attrs 20.3.0 pyhd3eb1b0_0
autopep8 1.5.4 py_0
babel 2.8.1 pyhd3eb1b0_0
backcall 0.2.0 py_0
backports 1.0 py_2
backports.functools_lru_cache 1.6.1 py_0
backports.shutil_get_terminal_size 1.0.0 py38_2
backports.tempfile 1.0 py_1
backports.weakref 1.0.post1 py_1
beautifulsoup4 4.9.3 pyhb0f4dca_0
bitarray 1.6.1 py38h27cfd23_0
bkcharts 0.2 py38_0
blas 1.0 mkl
bleach 3.2.1 py_0
blosc 1.20.1 hd408876_0
bokeh 2.2.3 py38_0
boto 2.49.0 py38_0
bottleneck 1.3.2 py38heb32a55_1
brotlipy 0.7.0 py38h7b6447c_1000
bzip2 1.0.8 h7b6447c_0
ca-certificates 2020.10.14 0
cachetools 4.2.2 pypi_0 pypi
cairo 1.14.12 h8948797_3
certifi 2020.6.20 pyhd3eb1b0_3
cffi 1.14.3 py38he30daa8_0
chardet 3.0.4 py38_1003
click 7.1.2 py_0
cloudpickle 1.6.0 py_0
clyent 1.2.2 py38_1
colorama 0.4.4 py_0
conda 4.10.3 py38h06a4308_0
conda-build 3.20.5 py38_1
conda-env 2.6.0 1
conda-package-handling 1.7.2 py38h03888b9_0
conda-verify 3.4.2 py_1
contextlib2 0.6.0.post1 py_0
cryptography 3.1.1 py38h1ba5d50_0
cudatoolkit 11.0.221 h6bb024c_0
curl 7.71.1 hbc83047_1
cycler 0.10.0 py38_0
cython 0.29.21 py38he6710b0_0
cytoolz 0.11.0 py38h7b6447c_0
dask 2.30.0 py_0
dask-core 2.30.0 py_0
dbus 1.13.18 hb2f20db_0
decorator 4.4.2 py_0
defusedxml 0.6.0 py_0
diff-match-patch 20200713 py_0
distributed 2.30.1 py38h06a4308_0
docutils 0.16 py38_1
entrypoints 0.3 py38_0
et_xmlfile 1.0.1 py_1001
expat 2.2.10 he6710b0_2
fastcache 1.1.0 py38h7b6447c_0
filelock 3.0.12 py_0
flake8 3.8.4 py_0
flask 1.1.2 py_0
fontconfig 2.13.0 h9420a91_0
freetype 2.10.4 h5ab3b9f_0
fribidi 1.0.10 h7b6447c_0
fsspec 0.8.3 py_0
future 0.18.2 py38_1
gast 0.3.3 pypi_0 pypi
get_terminal_size 1.0.0 haa9412d_0
gevent 20.9.0 py38h7b6447c_0
glib 2.66.1 h92f7085_0
glob2 0.7 py_0
gmp 6.1.2 h6c8ec71_1
gmpy2 2.0.8 py38hd5f6e3b_3
google-auth 1.35.0 pypi_0 pypi
google-auth-oauthlib 0.4.5 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
graphite2 1.3.14 h23475e2_0
greenlet 0.4.17 py38h7b6447c_0
grpcio 1.39.0 pypi_0 pypi
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
h5py 2.10.0 py38h7918eee_0
harfbuzz 2.4.0 hca77d97_1
hdf5 1.10.4 hb1b8bf9_0
heapdict 1.0.1 py_0
html5lib 1.1 py_0
icu 58.2 he6710b0_3
idna 2.10 py_0
imageio 2.9.0 py_0
imagesize 1.2.0 py_0
importlib-metadata 2.0.0 py_1
importlib_metadata 2.0.0 1
iniconfig 1.1.1 py_0
intel-openmp 2020.2 254
intervaltree 3.1.0 py_0
ipykernel 5.3.4 py38h5ca1d4c_0
ipython 7.19.0 py38hb070fc8_0
ipython_genutils 0.2.0 py38_0
ipywidgets 7.5.1 py_1
isort 5.6.4 py_0
itsdangerous 1.1.0 py_0
jbig 2.1 hdba287a_0
jdcal 1.4.1 py_0
jedi 0.17.1 py38_0
jeepney 0.5.0 pyhd3eb1b0_0
jinja2 2.11.2 py_0
joblib 0.17.0 py_0
jpeg 9b h024ee3a_2
json5 0.9.5 py_0
jsonschema 3.2.0 py_2
jupyter 1.0.0 py38_7
jupyter_client 6.1.7 py_0
jupyter_console 6.2.0 py_0
jupyter_core 4.6.3 py38_0
jupyterlab 2.2.6 py_0
jupyterlab_pygments 0.1.2 py_0
jupyterlab_server 1.2.0 py_0
keras-preprocessing 1.1.2 pypi_0 pypi
keyring 21.4.0 py38_1
kiwisolver 1.3.0 py38h2531618_0
krb5 1.18.2 h173b8e3_0
lazy-object-proxy 1.4.3 py38h7b6447c_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libarchive 3.4.2 h62408e4_0
libcurl 7.71.1 h20c2e04_1
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
liblief 0.10.1 he6710b0_0
libllvm10 10.0.1 hbcb73fb_5
libpng 1.6.37 hbc83047_0
libsodium 1.0.18 h7b6447c_0
libspatialindex 1.9.3 he6710b0_0
libssh2 1.9.0 h1ba5d50_1
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libtool 2.4.6 h7b6447c_1005
libuuid 1.0.3 h1bed415_2
libuv 1.40.0 h7b6447c_0
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 hb55368b_3
libxslt 1.1.34 hc22bd24_0
llvmlite 0.34.0 py38h269e1b5_4
locket 0.2.0 py38_1
lxml 4.6.1 py38hefd8a0e_0
lz4-c 1.9.2 heb0550a_3
lzo 2.10 h7b6447c_2
markdown 3.3.4 pypi_0 pypi
markupsafe 1.1.1 py38h7b6447c_0
matplotlib 3.3.2 0
matplotlib-base 3.3.2 py38h817c723_0
mccabe 0.6.1 py38_1
mistune 0.8.4 py38h7b6447c_1000
mkl 2020.2 256
mkl-service 2.3.0 py38he904b0f_0
mkl_fft 1.2.0 py38h23d657b_0
mkl_random 1.1.1 py38h0573a6f_0
mock 4.0.2 py_0
more-itertools 8.6.0 pyhd3eb1b0_0
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.1.0 py38_0
msgpack-python 1.0.0 py38hfd86e86_1
multipledispatch 0.6.0 py38_0
navigator-updater 0.2.1 py38_0
nbclient 0.5.1 py_0
nbconvert 6.0.7 py38_0
nbformat 5.0.8 py_0
ncurses 6.2 he6710b0_1
nest-asyncio 1.4.2 pyhd3eb1b0_0
networkx 2.5 py_0
ninja 1.10.2 hff7bd54_1
nltk 3.5 py_0
nose 1.3.7 py38_2
notebook 6.1.4 py38_0
numba 0.51.2 py38h0573a6f_1
numexpr 2.7.1 py38h423224d_0
numpy 1.19.2 py38h54aff64_0
numpy-base 1.19.2 py38hfa32c7d_0
numpydoc 1.1.0 pyhd3eb1b0_1
oauthlib 3.1.1 pypi_0 pypi
olefile 0.46 py_0
openpyxl 3.0.5 py_0
openssl 1.1.1h h7b6447c_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 20.4 py_0
pandas 1.1.3 py38he6710b0_0
pandoc 2.11 hb0f4dca_0
pandocfilters 1.4.3 py38h06a4308_1
pango 1.45.3 hd140c19_0
parso 0.7.0 py_0
partd 1.1.0 py_0
patchelf 0.12 he6710b0_0
path 15.0.0 py38_0
path.py 12.5.0 0
pathlib2 2.3.5 py38_0
pathtools 0.1.2 py_1
patsy 0.5.1 py38_0
pcre 8.44 he6710b0_0
pep8 1.7.1 py38_0
pexpect 4.8.0 py38_0
pickleshare 0.7.5 py38_1000
pillow 8.0.1 py38he98fc37_0
pip 20.2.4 py38h06a4308_0
pixman 0.40.0 h7b6447c_0
pkginfo 1.6.1 py38h06a4308_0
pluggy 0.13.1 py38_0
ply 3.11 py38_0
prometheus_client 0.8.0 py_0
prompt-toolkit 3.0.8 py_0
prompt_toolkit 3.0.8 0
protobuf 3.17.3 pypi_0 pypi
psutil 5.7.2 py38h7b6447c_0
ptyprocess 0.6.0 py38_0
py 1.9.0 py_0
py-lief 0.10.1 py38h403a769_0
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pycodestyle 2.6.0 py_0
pycosat 0.6.3 py38h7b6447c_1
pycparser 2.20 py_2
pycurl 7.43.0.6 py38h1ba5d50_0
pydocstyle 5.1.1 py_0
pyflakes 2.2.0 py_0
pygments 2.7.2 pyhd3eb1b0_0
pylint 2.6.0 py38_0
pyodbc 4.0.30 py38he6710b0_0
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 py_0
pyqt 5.9.2 py38h05f1152_4
pyrsistent 0.17.3 py38h7b6447c_0
pysocks 1.7.1 py38_0
pytables 3.6.1 py38h9fd0a39_0
pytest 6.1.1 py38_0
python 3.8.5 h7579374_1
python-dateutil 2.8.1 py_0
python-jsonrpc-server 0.4.0 py_0
python-language-server 0.35.1 py_0
python-libarchive-c 2.9 py_0
pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 pytorch
pytz 2020.1 py_0
pywavelets 1.1.1 py38h7b6447c_2
pyxdg 0.27 pyhd3eb1b0_0
pyyaml 5.3.1 py38h7b6447c_1
pyzmq 19.0.2 py38he6710b0_1
qdarkstyle 2.8.1 py_0
qt 5.9.7 h5867ecd_1
qtawesome 1.0.1 py_0
qtconsole 4.7.7 py_0
qtpy 1.9.0 py_0
readline 8.0 h7b6447c_0
regex 2020.10.15 py38h7b6447c_0
requests 2.24.0 py_0
requests-oauthlib 1.3.0 pypi_0 pypi
ripgrep 12.1.1 0
rope 0.18.0 py_0
rsa 4.7.2 pypi_0 pypi
rtree 0.9.4 py38_1
ruamel_yaml 0.15.87 py38h7b6447c_1
scikit-image 0.17.2 py38hdf5156a_0
scikit-learn 0.23.2 py38h0573a6f_0
scipy 1.4.1 pypi_0 pypi
seaborn 0.11.0 py_0
secretstorage 3.1.2 py38_0
send2trash 1.5.0 py38_0
setuptools 50.3.1 py38h06a4308_1
simplegeneric 0.8.1 py38_2
singledispatch 3.4.0.3 py_1001
sip 4.19.13 py38he6710b0_0
six 1.15.0 py38h06a4308_0
snowballstemmer 2.0.0 py_0
sortedcollections 1.2.1 py_0
sortedcontainers 2.2.2 py_0
soupsieve 2.0.1 py_0
sphinx 3.2.1 py_0
sphinxcontrib 1.0 py38_1
sphinxcontrib-applehelp 1.0.2 py_0
sphinxcontrib-devhelp 1.0.2 py_0
sphinxcontrib-htmlhelp 1.0.3 py_0
sphinxcontrib-jsmath 1.0.1 py_0
sphinxcontrib-qthelp 1.0.3 py_0
sphinxcontrib-serializinghtml 1.1.4 py_0
sphinxcontrib-websupport 1.2.4 py_0
spyder 4.1.5 py38_0
spyder-kernels 1.9.4 py38_0
sqlalchemy 1.3.20 py38h7b6447c_0
sqlite 3.33.0 h62c20be_0
statsmodels 0.12.0 py38h7b6447c_0
sympy 1.6.2 py38h06a4308_1
tbb 2020.3 hfd86e86_0
tblib 1.7.0 py_0
tensorboard 2.2.2 pypi_0 pypi
tensorboard-plugin-wit 1.8.0 pypi_0 pypi
tensorflow-estimator 2.2.0 pypi_0 pypi
tensorflow-gpu 2.2.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
terminado 0.9.1 py38_0
testpath 0.4.4 py_0
threadpoolctl 2.1.0 pyh5ca1d4c_0
tifffile 2020.10.1 py38hdd07704_2
tk 8.6.10 hbc83047_0
toml 0.10.1 py_0
toolz 0.11.1 py_0
torchaudio 0.7.2 py38 pytorch
torchvision 0.8.2 py38_cu110 pytorch
tornado 6.0.4 py38h7b6447c_1
tqdm 4.50.2 py_0
traitlets 5.0.5 py_0
typing_extensions 3.7.4.3 py_0
ujson 4.0.1 py38he6710b0_0
unicodecsv 0.14.1 py38_0
unixodbc 2.3.9 h7b6447c_0
urllib3 1.25.11 py_0
watchdog 0.10.3 py38_0
wcwidth 0.2.5 py_0
webencodings 0.5.1 py38_1
werkzeug 1.0.1 py_0
wheel 0.35.1 py_0
widgetsnbextension 3.5.1 py38_0
wrapt 1.11.2 py38h7b6447c_0
wurlitzer 2.0.1 py38_0
xlrd 1.2.0 py_0
xlsxwriter 1.3.7 py_0
xlwt 1.3.0 py38_0
xmltodict 0.12.0 py_0
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
yapf 0.30.0 py_0
zeromq 4.3.3 he6710b0_3
zict 2.0.0 py_0
zipp 3.4.0 pyhd3eb1b0_0
zlib 1.2.11 h7b6447c_3
zope 1.0 py38_1
zope.event 4.5.0 py38_0
zope.interface 5.1.2 py38h7b6447c_0
zstd 1.4.5 h9ceee32_0


model number in output

Hi,
What do the model numbers in the DeepAffinity output mean?
(e.g. Model_7414, Model_20432, Model_399544, Model_1452500)

Thanks in advance!!

Trained Models

Hi,

Is it possible to share the trained models?

Thanks.

Why should the number of lines in the SPS-representation protein sequence file and the canonical compound SMILES file match?

Hi,

I am trying to get results of my own data with your model.

(1) According to the file "DeepAffinity_inference.sh", it seems that the number of lines in the input protein sequences file and the compound file must match, as shown below.
[Screenshot, 2022-09-22 10-46-42]
Does this mean that the number of entities in both files has to match, or literally that the number of lines in both files has to match?

(2) I got two files for my own data after following your manual.
Could you tell me if their structure is correct for model input?

  • CID_Smi_Feature:
    [Screenshot, 2022-09-22 11-31-44]
  • protein_grouped_finalPresentation:
    [Screenshot, 2022-09-22 11-33-55]

Thank you,
CallMeDek

About 'uniprot.human.scratch_outputs.w_sps.tab_corrected' file

Hello, I am very interested in your work.

I found the 'uniprot.human.scratch_outputs.w_sps.tab_corrected' file
at the path '/data/dataset/uniprot.human.scratch_outputs.w_sps.tab_corrected/'.
It seems to contain an SPS mapping,
but its SPS differs from that in the 'protein_sequence_SPS_mapping' file.

For example, in the case of 'CP3A4', whose sequence is
'MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA'

In the 'protein_sequence_SPS_mapping' file, the SPS of this protein is
'ANGL,CEGL,AEKM,CEKS,BNDS,CNTS,BNGS,ANDM,CEKL,AEKL,CEDS,ANTL,CETM,AEDM,CEDS,ANGL,CEGS,AEKL,CEKS,AEDM,CEKM,ANDM,CNDS,ANDL,CETM,AEDM,CNKM,BEGM,CEKS,BNGS,ANTS,CEDM,AEKM,CETL,ANGL,CETL,BEDS,CEDM'

However, in the 'uniprot.human.scratch_outputs.w_sps.tab_corrected' file, the SPS of this protein is
'ANGL,CEKS,AEKS,CEGM,AEKL,BNGL,AEDM,CETM,AEKS,CEDS,AEDL,CEKS,ANTL,CETM,AEDS,CEDM,AEGL,CEGS,AEKL,CEDM,AETM,CEKM,ANDL,CEGM,ANDL,CNKM,BEGM,CEKS,ANGM,CEKL,ANGL,CETL,BEDS,CEDM'

I would like to know why they are different even though they have the same protein sequence.
Also, is it OK to use the SPS in the 'uniprot.human.scratch_outputs.w_sps.tab_corrected' file for training and testing the DeepAffinity model?

Thank you,
