
heal's Introduction

HEAL

Hierarchical Graph Transformer with Contrastive Learning for Protein Function Prediction

Setup Environment

Clone the current repo

git clone https://github.com/ZhonghuiGu/HEAL.git
conda env create -f environment.yml
conda install pytorch==1.7.0 cudatoolkit=10.2 -c pytorch
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_scatter-2.0.7-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_sparse-0.6.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_spline_conv-1.2.1-cp37-cp37m-linux_x86_64.whl
pip install *.whl
pip install torch_geometric==1.6.3
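
If the installation succeeded, a quick sanity check along the following lines should run without errors (a minimal sketch; the expected version strings assume the pinned packages above):

```python
import torch
import torch_geometric
import torch_cluster, torch_scatter, torch_sparse, torch_spline_conv  # wheels installed above

print(torch.__version__)            # expected: 1.7.0
print(torch.cuda.is_available())    # True if the cudatoolkit 10.2 build can see a GPU
print(torch_geometric.__version__)  # expected: 1.6.3
```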

You also need to install the relevant packages to run the ESM-1b protein language model.
Please see facebookresearch/esm for details.
The ESM-1b model weights we use can be downloaded here.

Protein function prediction

python predictor.py --task mf \
                    --device 0 \
                    --pdb data/4RQ2-A.pdb \
                    --esm1b_model $esm1b_model \
                    --only_pdbch false \
                    --prob 0.5

$task can be one of the three GO-term tasks -- [bp, mf, cc].
$pdb is the path to the PDB file.
$esm1b_model is the path to the ESM-1b model weights.
$prob means only functions with a predicted probability larger than 0.5 are output.
$only_pdbch means using the model parameters trained solely on the PDBch training set.

The default model parameters are trained on the combination of the PDBch and AFch training sets, i.e., model_bpCLaf.pt, model_ccCLaf.pt and model_mfCLaf.pt.
You can also use the model parameters trained only on the PDBch training set by setting $only_pdbch to true, i.e., model_bpCL.pt, model_ccCL.pt and model_mfCL.pt.

Output

The protein may hold the following functions of MF:
Possibility: 0.99 ||| Functions: GO:0034061, DNA polymerase activity
Possibility: 0.98 ||| Functions: GO:0140097, catalytic activity, acting on DNA
Possibility: 0.96 ||| Functions: GO:0003887, DNA-directed DNA polymerase activity
Possibility: 0.79 ||| Functions: GO:0003677, DNA binding
Possibility: 0.95 ||| Functions: GO:0016772, transferase activity, transferring phosphorus-containing groups
Possibility: 0.97 ||| Functions: GO:0016779, nucleotidyltransferase activity

Exploring functions of understudied proteins

To explore the functions of an understudied protein, we recommend considering the predictions from both the _CLaf.pt and _CL.pt parameters.
When no functions are predicted at the probability threshold of 0.5, you can lower $prob to 0.4, 0.3, or even 0.2; the predictions still have reference value.

Model training

cd data

Our data set can be downloaded from here.

tar -zxvf processed.tar.gz

The dataset-related files will be under data/processed. Files with the prefix AF2 belong to the AFch dataset; the others belong to the PDBch dataset. Files with the suffix pdbch record the PDB ID or UniProt accession of each protein, and files with the suffix graph contain the graph we constructed for each protein.

AF2test_graph.pt  AF2train_graph.pt  AF2val_graph.pt  test_graph.pt  train_graph.pt  val_graph.pt
AF2test_pdbch.pt  AF2train_pdbch.pt  AF2val_pdbch.pt  test_pdbch.pt  train_pdbch.pt  val_pdbch.pt

To train the model:

python train.py --device 0 \
                --task bp \
                --batch_size 64 \
                --suffix CLaf \
                --contrast True \
                --AF2model True

$task can be one of the three GO-term tasks -- [bp, mf, cc].
$suffix is the suffix of the model weight file that will be saved.
$contrast controls whether to use contrastive learning.
$AF2model controls whether to add the AFch training set for training.

For those who want to build a new dataset:

The *graph.pt files contain the list of protein graphs; the way each graph is built can be seen in predictor.py.
Each graph is built with PyTorch Geometric and has three attributes.
graph.edge_index, with shape [2, num_edges], is the edge index of residue pairs whose Cα atoms are within 10 angstroms.
graph.native_x is the one-hot embedding of each residue type.
graph.x is the ESM-1b language-model embedding of the sequence.
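
As a rough illustration (not the authors' exact code; the real construction, including PDB parsing and ESM-1b inference, is in predictor.py), a graph with these three attributes could be assembled like this, with random tensors standing in for the real coordinates and embeddings:

```python
import torch
from torch_geometric.data import Data

# Dummy stand-ins for a parsed PDB chain and its ESM-1b per-residue embedding.
num_residues = 120
ca_coords = torch.randn(num_residues, 3) * 10          # Cα coordinates in angstroms
esm_embedding = torch.randn(num_residues, 1280)        # ESM-1b per-residue features
residue_types = torch.randint(0, 20, (num_residues,))  # integer-coded residue identities

# Connect residue pairs whose Cα atoms are within 10 Å (excluding self-pairs).
dist = torch.cdist(ca_coords, ca_coords)
src, dst = torch.nonzero((dist < 10.0) & (dist > 0.0), as_tuple=True)
edge_index = torch.stack([src, dst], dim=0)            # shape [2, num_edges]

graph = Data(
    x=esm_embedding,                                                   # ESM-1b embedding
    native_x=torch.nn.functional.one_hot(residue_types, 20).float(),  # one-hot residue types
    edge_index=edge_index,
)
print(graph)
```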

graph_data.py is the script that loads the data. If you want to train a new model, you can change the self.graph_list and self.y_true variables.
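
For reference, a simplified stand-in for such a loader might look like the sketch below; the attribute names self.graph_list and self.y_true mirror graph_data.py, while the file paths and label layout are hypothetical:

```python
import torch
from torch.utils.data import Dataset

class GoTermGraphSet(Dataset):
    """Illustrative stand-in for the dataset class in graph_data.py."""

    def __init__(self, graph_path, label_path):
        # Replace these with your own graphs and GO-term labels.
        self.graph_list = torch.load(graph_path)  # list of PyTorch Geometric Data objects
        self.y_true = torch.load(label_path)      # [num_proteins, num_GO_terms] multi-hot labels

    def __len__(self):
        return len(self.graph_list)

    def __getitem__(self, idx):
        return self.graph_list[idx], self.y_true[idx]

# Hypothetical usage:
# train_set = GoTermGraphSet('data/processed/train_graph.pt', 'my_labels.pt')
```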


heal's Issues

About all model_*CLaf.pt

Hi,

Thanks for open-sourcing this wonderful work. Using the relevant model_*CLaf.pt weights to predict on the AFch test set, I can't reproduce the results in the paper. For example, the prediction results for 'bp' are much higher than those given in the paper, as shown in the figure below. What could this be caused by? The prediction results of all model_*CL.pt models are consistent with the results in the paper.
[screenshot of the 'bp' prediction results]

dataset link not working

Our data set can be downloaded from here.

When I tried to download the dataset, it said, "This address has expired." Can you update the download link?

Dataset link has expired

Dear sir, thank you very much for your open source work, but your dataset link has expired. Could you please provide a new dataset link? Your help is very much appreciated!

HGT

Hi, thanks for presenting HEAL. I have a doubt: you mention that the HGT takes K, V and a learnable Q as inputs, but the code in "pool.py" seems to use the output of only one of the GCNs to generate both K and V, instead of using two different outputs (denoted as GCN1 and GCN2) to generate K and V separately.

model.load_state_dict(torch.load(f'model/model_{task}CL.pt',map_location=device)) error

When I ran the code, it showed:

RuntimeError: Error(s) in loading state_dict for CL_protNET:
Missing key(s) in state_dict: "gcn.gcn.0.lin.weight", "gcn.gcn.1.lin.weight", "gcn.gcn.2.lin.weight", "gcn.pool.pools.0.mab.layer_k.lin.weight", "gcn.pool.pools.0.mab.layer_v.lin.weight", "gcn.pool.pools.1.mab.layer_k.lin.weight", "gcn.pool.pools.1.mab.layer_v.lin.weight".
Unexpected key(s) in state_dict: "gcn.gcn.0.weight", "gcn.gcn.1.weight", "gcn.gcn.2.weight", "gcn.pool.pools.0.mab.layer_k.weight", "gcn.pool.pools.0.mab.layer_v.weight", "gcn.pool.pools.1.mab.layer_k.weight", "gcn.pool.pools.1.mab.layer_v.weight".

Can you help me? Thank you!

Missing file: args.esm1b_model -contact-regression.pt

Hello, when I run predictor.py, the call esm_model, alphabet = esm.pretrained.load_model_and_alphabet_local(args.esm1b_model) reports an error that it can't find the -contact-regression.pt file corresponding to args.esm1b_model. Do I need to supply additional weights?

Unable to reproduce performance

Hello,

I am currently trying to replicate the performance metrics published in your paper. I used the test.py script with the provided model_bpCL.pt weights on the test_graph.pt data. However, I encountered a significant discrepancy in the Fmax value; I obtained an Fmax of 0.379, which is quite different from the expected 0.595 as reported.

I have set up the environment exactly as specified in the environment.yml and followed the other setup instructions as given.

I have attached a screenshot of the test results for your reference. Could you please help me understand what might be causing this issue? Is there any step or configuration that I might be missing?


graph input

Hello, I would like to ask specifically how the graph.pt files are created, as I would like to add new features to them. Could you explain exactly how the generation process works and what I need to do to build my own graphs with direct inputs, as you have done? Thank you!

predictor.py error

I tried to make predictions on the following protein files taken from the PDB:
https://files.wwpdb.org/pub/pdb/data/structures/divided/structure_factors/rq/r1rq0sf.ent.gz
https://files.wwpdb.org/pub/pdb/data/structures/divided/structure_factors/08/r108lsf.ent.gz
I get the same error on both files with Biopython versions 1.70, 1.80 and 1.81, using the following run command given in the GitHub repository:
python predictor.py --task mf --device 0 --pdb r1rq0sf.ent --esm1b_model ./esm1b_t33_650M_UR50S.pt --only_pdbch false --prob 0.5
Here is the error message.
Traceback (most recent call last):
File "predictor.py", line 63, in <module>
model = struct[0]
File "/home/chs.csb/.anaconda3/envs/heal/lib/python3.7/site-packages/Bio/PDB/Entity.py", line 40, in __getitem__
return self.child_dict[id]
KeyError: 0
The environment was downloaded as per the provided yml file.
How could I fix this? Thank you.

Urgent! Missing key(s) in state_dict of provided models

Hi!

First of all, well done and nice work!

I was trying to use HEAL following the instructions given in the repo; however, it seems that the provided model weights do not match the model definition.

I used the same .pdb file and the same parameter settings:

python predictor.py --task mf \
--device 0 \
--pdb case_study/4RQ2-A.pdb \
--only_pdbch false \
--prob 0.5

However, I got this error message:
RuntimeError: Error(s) in loading state_dict for CL_protNET:
Missing key(s) in state_dict: "gcn.gcn.0.lin.weight", "gcn.gcn.1.lin.weight", "gcn.gcn.2.lin.weight", "gcn.pool.pools.0.mab.layer_k.lin.weight", "gcn.pool.pools.0.mab.layer_v.lin.weight", "gcn.pool.pools.1.mab.layer_k.lin.weight", "gcn.pool.pools.1.mab.layer_v.lin.weight".
Unexpected key(s) in state_dict: "gcn.gcn.0.weight", "gcn.gcn.1.weight", "gcn.gcn.2.weight", "gcn.pool.pools.0.mab.layer_k.weight", "gcn.pool.pools.0.mab.layer_v.weight", "gcn.pool.pools.1.mab.layer_k.weight", "gcn.pool.pools.1.mab.layer_v.weight".

Could you please help me with this problem? It is very urgent, as we would like to use HEAL for one of our projects and the deadline is in almost two days.

Thank you very much for your help!

Issues unzipping the dataset

Hi,

Thanks for open-sourcing this wonderful work. However, I found some issues with the provided dataset. Could you please help me solve them?

