
heal's Introduction

HEAL

Hierarchical Graph Transformer with Contrastive Learning for Protein Function Prediction

Setup Environment

Clone the current repo

git clone https://github.com/ZhonghuiGu/HEAL.git
conda env create -f environment.yml
conda install pytorch==1.7.0 cudatoolkit=10.2 -c pytorch
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_scatter-2.0.7-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_sparse-0.6.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.7.0%2Bcu102/torch_spline_conv-1.2.1-cp37-cp37m-linux_x86_64.whl
pip install *.whl
pip install torch_geometric==1.6.3
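
If the installation succeeded, a quick sanity check along the following lines should run without errors (a minimal sketch; the expected version strings assume the pinned packages above):

```python
import torch
import torch_geometric
import torch_cluster, torch_scatter, torch_sparse, torch_spline_conv  # wheels installed above

print(torch.__version__)            # expected: 1.7.0
print(torch.cuda.is_available())    # True if the cudatoolkit 10.2 build can see a GPU
print(torch_geometric.__version__)  # expected: 1.6.3
```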

You also need to install the relevant packages to run the ESM-1b protein language model.
Please see facebookresearch/esm for details.
The ESM-1b model weights we use can be downloaded here.

Protein function prediction

python predictor.py --task mf \
                    --device 0 \
                    --pdb data/4RQ2-A.pdb \
                    --esm1b_model $esm1b_model \
                    --only_pdbch false \
                    --prob 0.5

$task can be one of the three GO-term tasks -- [bp, mf, cc].
$pdb is the path to the PDB file.
$esm1b_model is the path to the ESM-1b model weights.
$prob means only functions with a predicted probability larger than 0.5 are output.
$only_pdbch means using the model parameters trained solely on the PDBch training set.

The default model parameters are trained on the combination of the PDBch and AFch training sets, i.e., model_bpCLaf.pt, model_ccCLaf.pt and model_mfCLaf.pt.
You can also use the model parameters trained only on the PDBch training set by setting $only_pdbch to true, i.e., model_bpCL.pt, model_ccCL.pt and model_mfCL.pt.

Output

The protein may hold the following functions of MF:
Possibility: 0.99 ||| Functions: GO:0034061, DNA polymerase activity
Possibility: 0.98 ||| Functions: GO:0140097, catalytic activity, acting on DNA
Possibility: 0.96 ||| Functions: GO:0003887, DNA-directed DNA polymerase activity
Possibility: 0.79 ||| Functions: GO:0003677, DNA binding
Possibility: 0.95 ||| Functions: GO:0016772, transferase activity, transferring phosphorus-containing groups
Possibility: 0.97 ||| Functions: GO:0016779, nucleotidyltransferase activity

Exploring functions of understudied proteins

To explore the functions of an understudied protein, we recommend considering the predictions from both the _CLaf.pt and _CL.pt parameters.
When no functions are predicted at the probability threshold of 0.5, you can lower $prob to 0.4, 0.3, or even 0.2; the predictions still have reference value.

Model training

cd data

Our data set can be downloaded from here.

tar -zxvf processed.tar.gz

The dataset-related files will be under data/processed. Files with the prefix AF2 belong to the AFch dataset; the others belong to the PDBch dataset. Files with the suffix pdbch record the PDB ID or UniProt accession of each protein, and files with the suffix graph contain the graph we constructed for each protein.

AF2test_graph.pt  AF2train_graph.pt  AF2val_graph.pt  test_graph.pt  train_graph.pt  val_graph.pt
AF2test_pdbch.pt  AF2train_pdbch.pt  AF2val_pdbch.pt  test_pdbch.pt  train_pdbch.pt  val_pdbch.pt

To train the model:

python train.py --device 0 \
                --task bp \
                --batch_size 64 \
                --suffix CLaf \
                --contrast True \
                --AF2model True

$task can be one of the three GO-term tasks -- [bp, mf, cc].
$suffix is the suffix of the model weight file that will be saved.
$contrast controls whether to use contrastive learning.
$AF2model controls whether to add the AFch training set for training.

For those who want to build a new dataset:

The *graph.pt files contain the list of protein graphs; the way each graph is built can be seen in predictor.py.
Each graph is built with PyTorch Geometric and has three attributes.
graph.edge_index, with shape [2, num_edges], is the edge index of residue pairs whose Cα atoms are within 10 angstroms.
graph.native_x is the one-hot embedding of each residue type.
graph.x is the ESM-1b language-model embedding of the sequence.
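
As a rough illustration (not the authors' exact code; the real construction, including PDB parsing and ESM-1b inference, is in predictor.py), a graph with these three attributes could be assembled like this, with random tensors standing in for the real coordinates and embeddings:

```python
import torch
from torch_geometric.data import Data

# Dummy stand-ins for a parsed PDB chain and its ESM-1b per-residue embedding.
num_residues = 120
ca_coords = torch.randn(num_residues, 3) * 10          # Cα coordinates in angstroms
esm_embedding = torch.randn(num_residues, 1280)        # ESM-1b per-residue features
residue_types = torch.randint(0, 20, (num_residues,))  # integer-coded residue identities

# Connect residue pairs whose Cα atoms are within 10 Å (excluding self-pairs).
dist = torch.cdist(ca_coords, ca_coords)
src, dst = torch.nonzero((dist < 10.0) & (dist > 0.0), as_tuple=True)
edge_index = torch.stack([src, dst], dim=0)            # shape [2, num_edges]

graph = Data(
    x=esm_embedding,                                                   # ESM-1b embedding
    native_x=torch.nn.functional.one_hot(residue_types, 20).float(),  # one-hot residue types
    edge_index=edge_index,
)
print(graph)
```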

graph_data.py is the script that loads the data. If you want to train a new model, you can change the self.graph_list and self.y_true variables.
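
For reference, a simplified stand-in for such a loader might look like the sketch below; the attribute names self.graph_list and self.y_true mirror graph_data.py, while the file paths and label layout are hypothetical:

```python
import torch
from torch.utils.data import Dataset

class GoTermGraphSet(Dataset):
    """Illustrative stand-in for the dataset class in graph_data.py."""

    def __init__(self, graph_path, label_path):
        # Replace these with your own graphs and GO-term labels.
        self.graph_list = torch.load(graph_path)  # list of PyTorch Geometric Data objects
        self.y_true = torch.load(label_path)      # [num_proteins, num_GO_terms] multi-hot labels

    def __len__(self):
        return len(self.graph_list)

    def __getitem__(self, idx):
        return self.graph_list[idx], self.y_true[idx]

# Hypothetical usage:
# train_set = GoTermGraphSet('data/processed/train_graph.pt', 'my_labels.pt')
```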


heal's Issues

About all model_*CLaf.pt

Hi,

Thanks for open-sourcing this wonderful work. Using the relevant model_*CLaf.pt weights to predict on the AFch test set, I can't reproduce the results in the paper. For example, the prediction results for 'bp' are much higher than those given in the paper, as shown in the figure below. What could this be caused by? The prediction results of all model_*CL.pt models are consistent with the results in the paper.
[screenshot of the 'bp' prediction results]

dataset link not working

Our data set can be downloaded from here.

When I tried to download the dataset, it said, "This address has expired." Can you update the download link?

Dataset link has expired

Dear sir, thank you very much for your open source work, but your dataset link has expired. Could you please provide a new dataset link? Your help is very much appreciated!

HGT

Hi, thanks for presenting HEAL. I have a doubt: you mention that the HGT takes K, V and a learnable Q as inputs, but the code in "pool.py" seems to use the output of only one of the GCNs to generate both K and V, instead of using two different outputs (denoted as GCN1 and GCN2) to generate K and V separately.

model.load_state_dict(torch.load(f'model/model_{task}CL.pt',map_location=device)) error

When I ran the code, it showed:

RuntimeError: Error(s) in loading state_dict for CL_protNET:
Missing key(s) in state_dict: "gcn.gcn.0.lin.weight", "gcn.gcn.1.lin.weight", "gcn.gcn.2.lin.weight", "gcn.pool.pools.0.mab.layer_k.lin.weight", "gcn.pool.pools.0.mab.layer_v.lin.weight", "gcn.pool.pools.1.mab.layer_k.lin.weight", "gcn.pool.pools.1.mab.layer_v.lin.weight".
Unexpected key(s) in state_dict: "gcn.gcn.0.weight", "gcn.gcn.1.weight", "gcn.gcn.2.weight", "gcn.pool.pools.0.mab.layer_k.weight", "gcn.pool.pools.0.mab.layer_v.weight", "gcn.pool.pools.1.mab.layer_k.weight", "gcn.pool.pools.1.mab.layer_v.weight".

Can you help me? Thank you!

Missing file: args.esm1b_model -contact-regression.pt

Hello, when I run predictor.py, the call esm_model, alphabet = esm.pretrained.load_model_and_alphabet_local(args.esm1b_model) reports an error that it can't find the -contact-regression.pt file corresponding to args.esm1b_model. Do I need to supply additional weights?

Unable to reproduce performance

Hello,

I am currently trying to replicate the performance metrics published in your paper. I used the test.py script with the provided model_bpCL.pt weights on the test_graph.pt data. However, I encountered a significant discrepancy in the Fmax value; I obtained an Fmax of 0.379, which is quite different from the expected 0.595 as reported.

I have set up the environment exactly as specified in the environment.yml and followed the other setup instructions as given.

I have attached a screenshot of the test results for your reference. Could you please help me understand what might be causing this issue? Is there any step or configuration that I might be missing?


graph input

Hello, I would like to ask specifically how the graph.pt files are created, as I would like to add new features to them. Could you explain exactly how the generation process works and what I need to do to build my own graphs with direct inputs, as you have done? Thank you!

predictor.py error

I tried to make predictions on the following protein files taken from the PDB:
https://files.wwpdb.org/pub/pdb/data/structures/divided/structure_factors/rq/r1rq0sf.ent.gz
https://files.wwpdb.org/pub/pdb/data/structures/divided/structure_factors/08/r108lsf.ent.gz
I get the same error on both files with Biopython versions 1.70, 1.80 and 1.81, using the following run command given in the GitHub repository:
python predictor.py --task mf --device 0 --pdb r1rq0sf.ent --esm1b_model ./esm1b_t33_650M_UR50S.pt --only_pdbch false --prob 0.5
Here is the error message.
Traceback (most recent call last):
File "predictor.py", line 63, in <module>
model = struct[0]
File "/home/chs.csb/.anaconda3/envs/heal/lib/python3.7/site-packages/Bio/PDB/Entity.py", line 40, in __getitem__
return self.child_dict[id]
KeyError: 0
The environment was downloaded as per the provided yml file.
How could I fix this? Thank you.

Urgent! Missing key(s) in state_dict of provided models

Hi!

First of all, well done and nice work!

I was trying to use HEAL following the instructions given in the repo; however, it seems that the provided model weights do not match the model definition.

I used the same .pdb file and the same parameter settings:

python predictor.py --task mf \
--device 0 \
--pdb case_study/4RQ2-A.pdb \
--only_pdbch false \
--prob 0.5

However, I got this error message:
RuntimeError: Error(s) in loading state_dict for CL_protNET:
Missing key(s) in state_dict: "gcn.gcn.0.lin.weight", "gcn.gcn.1.lin.weight", "gcn.gcn.2.lin.weight", "gcn.pool.pools.0.mab.layer_k.lin.weight", "gcn.pool.pools.0.mab.layer_v.lin.weight", "gcn.pool.pools.1.mab.layer_k.lin.weight", "gcn.pool.pools.1.mab.layer_v.lin.weight".
Unexpected key(s) in state_dict: "gcn.gcn.0.weight", "gcn.gcn.1.weight", "gcn.gcn.2.weight", "gcn.pool.pools.0.mab.layer_k.weight", "gcn.pool.pools.0.mab.layer_v.weight", "gcn.pool.pools.1.mab.layer_k.weight", "gcn.pool.pools.1.mab.layer_v.weight".

Could you please help me with this problem? It is very urgent, as we would like to use HEAL for one of our projects and the deadline is in almost two days.

Thank you very much for your help!

Issues unzipping the dataset

Hi,

Thanks for open-sourcing this wonderful work. However, I found some issues with the provided dataset. Could you please help me solve them?

