BERTology Meets Biology: Interpreting Attention in Protein Language Models

This repository is the official implementation of BERTology Meets Biology: Interpreting Attention in Protein Language Models.

ProVis Attention Visualizer

This section provides instructions for generating visualizations of attention projected onto 3D protein structure.

Installation

General requirements:

  • Python >= 3.7
pip install biopython==1.77
pip install tape-proteins==0.5
pip install jupyterlab==3.0.14
pip install nglview
jupyter-nbextension enable nglview --py --sys-prefix
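After installing, it can help to confirm that the environment matches the pins above. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper names `installed_version` and `check_pins` are illustrative and not part of this repository.

```python
from importlib import metadata  # standard library in Python 3.8+

# Pins copied from the pip commands above; nglview is unpinned there.
PINNED = {
    "biopython": "1.77",
    "tape-proteins": "0.5",
    "jupyterlab": "3.0.14",
    "nglview": None,
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package name to (expected, installed, ok)."""
    report = {}
    for package, expected in pins.items():
        found = installed_version(package)
        # Unpinned packages only need to be present.
        ok = found is not None and (expected is None or found == expected)
        report[package] = (expected, found, ok)
    return report


if __name__ == "__main__":
    for name, (expected, found, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected={expected} installed={found} [{status}]")
```

The comparison is a plain string match, so a differently formatted version string for the same release would be reported as a mismatch.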

If you run into problems installing nglview, please refer to their installation instructions for additional installation details and options.

Execution

cd <project_root>/notebooks
jupyter notebook provis.ipynb

If you get an error running the notebook, you may need to execute the notebook as follows:

jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000

See nglview installation instructions for more details.

You may edit the notebook to choose other proteins, attention heads, etc. The visualization tool is based on the excellent nglview library.


Experiments

This section describes how to reproduce the experiments in the paper.

Installation

cd <project_root>
python setup.py develop

To download additional required datasets from TAPE:

cd <project_root>/data
wget http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/secondary_structure.tar.gz
tar -xvf secondary_structure.tar.gz && rm secondary_structure.tar.gz
wget http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/proteinnet.tar.gz
tar -xvf proteinnet.tar.gz && rm proteinnet.tar.gz
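On systems without `wget`, the same download-and-extract steps can be approximated in Python. This is a sketch using only the standard library; the URLs are the ones listed above, while the function names (`archive_url`, `extract_and_remove`, `download_dataset`) are hypothetical helpers, not part of the repository.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL and dataset names taken from the wget commands above.
BASE_URL = "http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch"
DATASETS = ("secondary_structure", "proteinnet")


def archive_url(name):
    """Build the download URL for a TAPE dataset archive."""
    return f"{BASE_URL}/{name}.tar.gz"


def extract_and_remove(archive, dest):
    """Unpack a .tar.gz into dest, then delete the archive (like tar -xvf ... && rm ...)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    Path(archive).unlink()


def download_dataset(name, data_dir):
    """Download one dataset archive into data_dir and unpack it there."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / f"{name}.tar.gz"
    urllib.request.urlretrieve(archive_url(name), archive)
    extract_and_remove(archive, data_dir)
```

To mirror the commands above, call `download_dataset(name, "data")` for each name in `DATASETS` from the project root. The archives are large, so expect the downloads to take a while.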

Attention Analysis

The following steps will reproduce the attention analysis experiments and generate the reports currently found in <project_root>/reports/attention_analysis. This includes all experiments besides the probing experiments (see Probing Analysis).

Before performing these steps, navigate to the appropriate directory:

cd <project_root>/protein_attention/attention_analysis

Tape BERT Model

The following executes the attention analysis (may run for several hours):

sh scripts/compute_all_features_tape_bert.sh

The above script creates a set of extract files in <project_root>/data/cache, one for each property being analyzed. You may edit the script files to remove properties that you are not interested in. If you wish to run the analysis without a GPU, you must specify the --no_cuda flag.
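For readers unfamiliar with this kind of flag, the sketch below shows how a boolean `--no_cuda` option is commonly wired to device selection. It is a hypothetical illustration: the parser and the helpers `select_device` and `cuda_is_available` are assumptions, and the repository's own scripts may handle the flag differently. In particular, this sketch falls back to CPU automatically, whereas the text above says the flag must be given explicitly.

```python
import argparse


def build_parser():
    """Hypothetical command-line parser with a --no_cuda switch."""
    parser = argparse.ArgumentParser(description="Illustrative analysis entry point")
    parser.add_argument("--no_cuda", action="store_true",
                        help="Run on CPU even if a GPU is available")
    return parser


def select_device(no_cuda, cuda_available):
    """Return "cuda" only when a GPU exists and the user did not opt out."""
    return "cuda" if cuda_available and not no_cuda else "cpu"


def cuda_is_available():
    """Ask PyTorch whether a GPU is usable; a missing torch install counts as no GPU."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("device:", select_device(args.no_cuda, cuda_is_available()))
```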

The following generates reports based on the files created in the previous step:

sh scripts/report_all_features_tape_bert.sh

If you removed steps from the analysis script above, you will need to update the reporting script accordingly.

ProtTrans Models

To generate reports for the ProtTrans models, follow the same instructions as for the TAPE BERT model above, but substitute the following commands:

ProtBert:

sh scripts/compute_all_features_prot_bert.sh
sh scripts/report_all_features_prot_bert.sh

ProtBertBFD:

sh scripts/compute_all_features_prot_bert_bfd.sh
sh scripts/report_all_features_prot_bert_bfd.sh

ProtAlbert:

sh scripts/compute_all_features_prot_albert.sh
sh scripts/report_all_features_prot_albert.sh

ProtXLNet:

sh scripts/compute_all_features_prot_xlnet.sh
sh scripts/report_all_features_prot_xlnet.sh
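Because the compute and report scripts follow one naming pattern, running them for several models can be scripted. The sketch below assumes it is run from <project_root>/protein_attention/attention_analysis; the helper names `script_pair` and `run_model` are illustrative, and `dry_run=True` only prints the commands rather than executing them.

```python
import subprocess

# Model suffixes taken from the script names listed in this section.
MODELS = ("tape_bert", "prot_bert", "prot_bert_bfd", "prot_albert", "prot_xlnet")


def script_pair(model):
    """Return the (compute, report) script paths for one model suffix."""
    return (
        f"scripts/compute_all_features_{model}.sh",
        f"scripts/report_all_features_{model}.sh",
    )


def run_model(model, dry_run=True):
    """Run the compute script, then the report script, for one model."""
    commands = [["sh", script] for script in script_pair(model)]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the sequence if a step fails.
            subprocess.run(command, check=True)
    return commands
```

For example, `for m in MODELS: run_model(m)` prints all ten commands in order. Pass `dry_run=False` to execute them, keeping in mind that each compute step may run for several hours.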

Probing Analysis

The following steps will recreate the figures from the probing analysis, currently found in <project_root>/reports/probing.

Navigate to directory:

cd <project_root>/protein_attention/probing

Training

Train diagnostic classifiers. Each script will write out an extract file with evaluation results. Note: each of these scripts may run for several hours.

sh scripts/probe_ss4_0_all
sh scripts/probe_ss4_1_all
sh scripts/probe_ss4_2_all
sh scripts/probe_sites.sh
sh scripts/probe_contacts.sh

Reports

python report.py

License

This project is licensed under the BSD 3-Clause License; see the LICENSE file for details.

Acknowledgments

This project incorporates code from the following repo:

Citation

When referencing this repository, please cite this paper.

@misc{vig2020bertology,
    title={BERTology Meets Biology: Interpreting Attention in Protein Language Models},
    author={Jesse Vig and Ali Madani and Lav R. Varshney and Caiming Xiong and Richard Socher and Nazneen Fatema Rajani},
    year={2020},
    eprint={2006.15222},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2006.15222}
}

provis's Issues

Global Distance Test (GDT)

Hi,
Thanks for this Awesome repo. What's the GDT from this model? Is it better than AlphaFold1?
Greetings!

nglview

Hi,
nglview is problematic. How can I save the resulting PDB structure from the notebook example? Any code directions?
Greetings!

About seaborn required by provis

Problem

Currently, running the install script installs numpy==1.24.2.

Installation output:
Searching for numpy
Reading https://pypi.org/simple/numpy/
Downloading https://files.pythonhosted.org/packages/9c/ee/77768cade9607687fadbcc1dcbb82dba0554154b3aa641f9c17233ffabe8/numpy-1.24.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=2eabd64ddb96a1239791da78fa5f4e1693ae2dadc82a76bc76a14cbb2b966e96
Best match: numpy 1.24.2
Processing numpy-1.24.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing numpy-1.24.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl to /usr/local/lib/python3.8/dist-packages
Adding numpy 1.24.2 to easy-install.pth file
Installing f2py script to /usr/local/bin
Installing f2py3 script to /usr/local/bin
Installing f2py3.8 script to /usr/local/bin

Installed /usr/local/lib/python3.8/dist-packages/numpy-1.24.2-py3.8-linux-x86_64.egg

This version of numpy is not compatible with seaborn==0.10.1, which is a dependency of provis.
seaborn 0.10.1 internally uses np.bool, which was deprecated in NumPy 1.20 and removed in NumPy 1.24.
As a result, all report_*.py scripts fail with the following error.

Error:
report_edge_features.py:27: MatplotlibDeprecationWarning:
The mpl_toolkits.axes_grid1.colorbar module was deprecated in Matplotlib 3.2 and will be removed two minor releases later. Use matplotlib.colorbar instead.
  from mpl_toolkits.axes_grid1.colorbar import colorbar
Namespace(exp_name='edge_features_contact_tape_bert')
Namespace(dataset='proteinnet', exp_name='edge_features_contact_tape_bert', features=['contact_map'], max_seq_len=512, min_attn=0.3, model='bert', model_dir=None, model_version=None, no_cuda=True, num_sequences=5000, seed=123, shuffle=True)
/usr/local/lib/python3.8/dist-packages/seaborn-0.10.1-py3.8.egg/seaborn/matrix.py:78: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.
  dtype=np.bool)
Traceback (most recent call last):
  File "report_edge_features.py", line 147, in <module>
    create_figure(feature_name, weighted_sum, weight_total, report_dir, min_total=min_total, filetype=filetype)
  File "report_edge_features.py", line 69, in create_figure
    heatmap = sns.heatmap((mean_by_head * 100).tolist(), center=0.0, ax=ax1,
  File "/usr/local/lib/python3.8/dist-packages/seaborn-0.10.1-py3.8.egg/seaborn/matrix.py", line 535, in heatmap
    plotter = _HeatMapper(data, vmin, vmax, cmap, center, robust, annot, fmt,
  File "/usr/local/lib/python3.8/dist-packages/seaborn-0.10.1-py3.8.egg/seaborn/matrix.py", line 111, in __init__
    mask = _matrix_mask(data, mask)
  File "/usr/local/lib/python3.8/dist-packages/seaborn-0.10.1-py3.8.egg/seaborn/matrix.py", line 78, in _matrix_mask
    dtype=np.bool)
  File "/usr/local/lib/python3.8/dist-packages/numpy-1.24.2-py3.8-linux-x86_64.egg/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Therefore, I strongly recommend either bumping the seaborn version or downgrading numpy.

Forcing the installation of seaborn==0.12.2 solved the problem for me.

cmyc@oncogene:~/provis/protein_attention/attention_analysis$ python -m pip check                                                                                                                                                                                             
fpygobject 3.36.0 requires pycairo, which is not installed.                                                                              
ipywidgets 7.7.3 has requirement jupyterlab-widgets<3,>=1.0.0; python_version >= "3.6", but you have jupyterlab-widgets 3.0.5.           
ipywidgets 7.7.3 has requirement widgetsnbextension~=3.6.0, but you have widgetsnbextension 4.0.5.                                       
provis 0.0.1 has requirement seaborn==0.10.1, but you have seaborn 0.12.2.
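As a quick way to tell whether an environment is affected, one can check whether the installed NumPy version falls in the range where the `np.bool` alias is absent. The sketch below is an illustration with hypothetical helper names (`parse_version`, `np_bool_missing`). It relies on two facts about NumPy: the alias was removed in 1.24, and the name `np.bool` exists again from 2.0 onward (as the NumPy boolean scalar type).

```python
def parse_version(text):
    """Turn "1.24.2" into (1, 24, 2); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def np_bool_missing(numpy_version):
    """True for NumPy releases that raise AttributeError on np.bool (1.24 up to, not including, 2.0)."""
    return (1, 24) <= parse_version(numpy_version) < (2, 0)


if __name__ == "__main__":
    # Versions mentioned in this issue, plus one earlier release for contrast.
    for version in ("1.23.5", "1.24.2"):
        print(version, "affected" if np_bool_missing(version) else "not affected")
```

In a live environment, `numpy.__version__` can be passed to `np_bool_missing` directly.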

Version Info

cmyc@oncogene:~/provis/protein_attention/attention_analysis$ py -V
Python 3.8.10
cmyc@oncogene:~/provis/protein_attention/attention_analysis$ uname -a
Linux oncogene 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Running provis.ipynb does not generate any figure

Hello,

I was running provis.ipynb and noticed that the output of the cell with # Example for head 7-1 (targets binding sites) shows the chains loading but does not plot any figures. Are there any prerequisites for displaying the figure?
I did not see any figures saved to the git repo either. I am not sure whether the issue is related to NGLWidget. It would be great if I could test it out.

Thank you!

Custom dataset

Hi, I'm wondering if there is a way to run the attention experiments with a custom dataset of my own.

Some problems related to AMINO ACIDS partial

Hi, this work is very interesting. I ran into some questions while reading the article and hope to get your help.
My question concerns Section 4.4 of the paper: Attention heads specialize in particular amino acids.
I don't know how this matrix is calculated, and I couldn't find the corresponding calculation. Could you tell me where the formula and code are located? I hope to hear from you. Thank you very much, and have a good life 😁

Get the report and plots for BindingDB dataset

Hi.

Thanks for your great paper and package.
Could you please let me know how I can regenerate the reports for other datasets, such as the BindingDB, KIBA, and DAVIS datasets?

These are protein datasets for drug-target interaction problems.
