
UDSMProt's Introduction

UDSMProt: universal deep sequence models for protein classification

UDSMProt is an algorithm for the classification of proteins based on the sequence of amino acids alone. Its key component is a self-supervised pretraining step based on a language modeling task. The model is subsequently finetuned to specific classification tasks. In our paper we considered enzyme class classification, gene ontology prediction, and remote homology detection, showcasing the excellent performance of UDSMProt.

For a detailed description of technical details and experimental results, please refer to our paper:

Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek, UDSMProt: universal deep sequence models for protein classification, Bioinformatics 36, no. 8, 2401-2409, 2020.

@article{Strodthoff:2019universal,
author = {Strodthoff, Nils and Wagner, Patrick and Wenzel, Markus and Samek, Wojciech},
title = "{UDSMProt: universal deep sequence models for protein classification}",
journal = {Bioinformatics},
volume = {36},
number = {8},
pages = {2401-2409},
year = {2020},
month = {01},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa003},
}

An earlier preprint of this work is also available at bioRxiv. This is the accompanying code repository, where we also provide links to pretrained language models.

Also have a look at USMPep: Universal Sequence Models for Major Histocompatibility Complex Binding Affinity Prediction, which builds on the same framework.

Dependencies

for training/evaluation: pytorch, fastai, fire

for dataset creation: numpy, pandas, scikit-learn, biopython, sentencepiece, lxml

Installation

We recommend using conda as Python package and environment manager. Either install the environment using the provided proteomics.yml by running conda env create -f proteomics.yml, or follow the steps below (a minimal sketch of such an environment file is shown after the list):

  1. Create conda environment: conda create -n proteomics and conda activate proteomics
  2. Install pytorch: conda install pytorch -c pytorch
  3. Install fastai: conda install -c fastai fastai=1.0.52
  4. Install fire: conda install fire -c conda-forge
  5. Install scikit-learn: conda install scikit-learn
  6. Install Biopython: conda install biopython -c conda-forge
  7. Install sentencepiece: pip install sentencepiece
  8. Install lxml: conda install lxml
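
For reference, a minimal proteomics.yml mirroring the steps above could look as follows; everything beyond the packages and channels named in the steps (the Python pin in particular) is an assumption, not the tested configuration:

name: proteomics
channels:
  - pytorch
  - fastai
  - conda-forge
  - defaults
dependencies:
  - python=3.7        # assumed; pick a version compatible with fastai 1.0.52
  - pytorch
  - fastai=1.0.52
  - fire
  - scikit-learn
  - biopython
  - lxml
  - pip
  - pip:
      - sentencepiece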

Optionally (for support of threshold 0.4 clusters) install cd-hit and add cd-hit to the default search path.
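
As an illustration of that clustering step, a cd-hit invocation at the 0.4 identity threshold might look as follows; the file names and install location are hypothetical, and note that cd-hit expects FASTA input, so sequences have to be extracted from the Swiss-Prot XML first:

export PATH=$PATH:/path/to/cd-hit        # hypothetical install location
cd-hit -i uniprot_sprot.fasta -o cdhit04_uniprot_sprot -c 0.4 -n 2 -T 4 -M 16000

For thresholds between 0.4 and 0.5, cd-hit requires the word size -n 2.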

Data

Swiss-Prot and UniRef

  • Download and extract the desired Swiss-Prot release (by default we use 2017_03) from the UniProt ftp server. Save the contained uniprot_sprot.xml as uniprot_sprot_YEAR_MONTH.xml in the ./data directory (a manual-download sketch follows this list).
  • Download and extract the desired UniRef release (by default we use 2017_03) from the UniProt ftp server. Save the contained uniref50.xml as uniref50_YEAR_MONTH.xml in the ./data directory. As an alternative, and for full reproducibility, we also provide the pickled cluster files cdhit04_uniprot_sprot_2016_07.pkl and uniref50_2017_03_uniprot_sprot_2017_03.pkl, to be placed under ./tmp_data, which avoid downloading the full UniRef file or running cd-hit.
  • Alternatively, just call the provided script ./download_swissprot_uniref.sh 2017 03, which manages everything for you.
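
For manual downloads, the release archives live on the UniProt ftp server under previous_releases; the exact paths below are assumptions based on the server's usual layout and should be verified against the actual directory listing:

wget https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2017_03/knowledgebase/uniprot_sprot-only2017_03.tar.gz
tar xzf uniprot_sprot-only2017_03.tar.gz uniprot_sprot.xml.gz
gunzip uniprot_sprot.xml.gz
mv uniprot_sprot.xml data/uniprot_sprot_2017_03.xml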

EC prediction

  • Preprocessed versions of the DEEPre and ECPred datasets are already contained in the ./git_data folder of the repository.
  • The custom EC40 and EC50 datasets will be created from Swiss-Prot data directly.

GO prediction

  • Download the raw GO prediction data data-2016.tar.gz from DeepGOPlus and extract it into the ./data/deepgoplus_data_2016 folder.

Remote Homology Detection

  • Download the superfamily and fold datasets and extract them into the ./data folder.

Data Preprocessing

  • Run the data preparation script
cd code 
./create_datasets.sh
  • The output is structured as follows (a loading sketch follows this list):
    • tok.npy: sequences as lists of numerical indices (the mapping is provided by tok_itos.npy)
    • label.npy: (if applicable) labels as lists of numerical indices (the mapping is provided by label_itos.npy)
    • train_IDs.npy/val_IDs.npy/test_IDs.npy: numerical indices identifying the training/validation/test sets by specifying rows in tok.npy
    • train_IDs_prev.npy/val_IDs_prev.npy/test_IDs_prev.npy: original non-numerical IDs for all entries that were ever assigned to the respective sets (used to obtain consistent splits for downstream tasks)
    • ID.npy: original non-numerical IDs for all entries in tok.npy
  • The approach is easily extendable to further downstream classification or regression tasks. It only requires implementing a corresponding preprocessing method, similar to the ones provided for the existing tasks in preprocessing_proteomics.py.
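
As a quick sanity check, the exported arrays can be inspected with numpy. A minimal sketch, where the working-folder path is hypothetical and allow_pickle is assumed to be needed for the object arrays:

import numpy as np

folder = "datasets/clas_ec/clas_ec_ec50_level1"  # hypothetical working folder
tok = np.load(folder + "/tok.npy", allow_pickle=True)            # sequences as lists of indices
tok_itos = np.load(folder + "/tok_itos.npy", allow_pickle=True)  # index-to-token mapping
train_ids = np.load(folder + "/train_IDs.npy", allow_pickle=True)
# reconstruct the first training sequence as a string of tokens
print("".join(str(tok_itos[i]) for i in tok[train_ids[0]]))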

Basic Usage

We provide some basic usage information for the most common tasks:

  • Language model pretraining (or skip this step and use the provided pretrained LMs, i.e. forward and backward models trained on Swiss-Prot 2017_03)
cd code
python modelv1.py language_model --epochs=60 --lr=0.01 --working_folder=datasets/lm/lm_sprot_dirty/ --export_preds=False --eval_on_val_test=True
  • Finetuning for enzyme class classification (here for level 1 and the EC50 dataset, assuming the pretrained model folder is located at datasets/lm/lm_sprot_uniref_fwd)
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --metrics=["accuracy","macro_f1"] --lr=0.001 --lr_fixed=True --bs=32 --lr_slice_exponent=2.0 --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --export_preds=True --eval_on_val_test=True
  • Finetuning for gene ontology prediction
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --lr=0.001 --lr_fixed=True --bs=32 --lin_ftrs=[1024] --lr_slice_exponent=2.0 --metrics=[] --working_folder=datasets/clas_go/clas_go_deepgoplus_2016 --export_preds=True --eval_on_val_test=True
  • Finetuning for remote homology detection (here for superfamily level and a single dataset)
cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=10 --bs=64 --metrics=["binary_auc","binary_auc50","accuracy"] --early_stopping=binary_auc --lr=0.05 --fit_one_cycle=False --working_folder=datasets/clas_scop/clas_scop0 --export_preds=True --eval_on_val_test=True

The output is logged to logfile.log in the working directory. The final results are exported for convenience as result.npy, and individual predictions, which can be used for example for ensembling forward and backward models, are exported as preds_valid.npz and preds_test.npz (in case export_preds is set to true).
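
As an illustration of such an ensemble, a minimal sketch that averages forward and backward model scores; the folder paths and the key names inside the npz archives are assumptions and should be checked against the actual exported files (e.g. via the files attribute):

import numpy as np

fwd = np.load("datasets/clas_ec/clas_ec_ec50_level1_fwd/preds_test.npz", allow_pickle=True)  # hypothetical path
bwd = np.load("datasets/clas_ec/clas_ec_ec50_level1_bwd/preds_test.npz", allow_pickle=True)  # hypothetical path
print(fwd.files)                              # inspect the actual keys first
scores = (fwd["preds"] + bwd["preds"]) / 2.0  # "preds" is an assumed key holding per-class scores
accuracy = (scores.argmax(axis=-1) == fwd["targs"]).mean()  # "targs" is an assumed key holding labels
print(accuracy)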

UDSMProt's People

Contributors

nstrodt

UDSMProt's Issues

Error when trying to run the benchmarks

Hello, when trying to run the benchmarks from the Jupyter notebook, which for various reasons I had to port to a local script instead, I received the following error message:
2023-12-23 18:19:26.336247: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 18:19:26.359764: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-23 18:19:26.359816: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-23 18:19:26.360542: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-23 18:19:26.364383: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 18:19:26.364542: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-23 18:19:28.150936: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
14945 training set records, 1661 validation set records, 4152 test set records.
Traceback (most recent call last):
  File "/home/gradwan/protein_bert/sally1.py", line 37, in <module>
    pretrained_model_generator, input_encoder = load_pretrained_model()
  File "/home/gradwan/protein_bert/proteinbert/existing_model_loading.py", line 54, in load_pretrained_model
    return load_pretrained_model_from_dump(dump_file_path, create_model_function, create_model_kwargs = create_model_kwargs, optimizer_class = optimizer_class, lr = lr,
  File "/home/gradwan/protein_bert/proteinbert/model_generation.py", line 159, in load_pretrained_model_from_dump
    n_annotations, model_weights, optimizer_weights = pickle.load(f)
_pickle.UnpicklingError: invalid load key, 'v'.

Could you please help? Many thanks!
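
A note on this error: invalid load key, 'v'. typically means the loaded file is not a real pickle; a Git LFS pointer file, for instance, starts with the letter "v" ("version https://git-lfs.github.com/spec/v1"). A quick check, assuming a Unix shell and with the dump-file path as a hypothetical placeholder:

head -c 60 path/to/the_model_dump.pkl   # hypothetical path; prints the LFS pointer header if the real file was never fetched

If the pointer header appears, re-fetch the file with git lfs pull.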

Recreate same EC dataset

Hello,
I am trying to replicate the dataset for EC prediction (EC40 and EC50) used in your UDSMProt paper, but I have run into some difficulties.

First, in your script code/create_datasets.sh, line 27 reads:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
I think it should be:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level2 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
That is, the working folder should be named clas_ec_ec50_level2, not clas_ec_ec50_level1.

Secondly, when I run the script create_datasets.sh, I get an error that says:
../tmp_data/cdhit04_uniprot_sprot_2017_03.pkl not found.
I think this may be because the link you provide contains two files built from two different Swiss-Prot versions: cdhit04_uniprot_sprot_2016_07.pkl, which uses the 2016_07 release, and uniref50_2017_03_uniprot_sprot_2017_03.pkl, which uses the 2017_03 release.

So I am a little confused: I don't know whether I have to download the Swiss-Prot release 2017_03 or 2016_07 along with the files in your link in order to replicate exactly the same dataset as in your UDSMProt paper. And even with the correct cdhit04 file, will I get exactly the same test dataset? If not, it would be kind if you could provide a link to download the exact train/dev/test splits for ECPred.

Thanks a lot in advance

incompatible packages in the proteomics.yml

Hi,
I'm trying to install UDSMProt through conda env create -f proteomics.yml, but incompatible packages were found. Could you provide an updated proteomics.yml file?
Thanks!

Using new dataset for classification

Hi, thank you for the amazing work!
I am just curious: if I have my own dataset for which I would like to use your model architecture for protein classification, what would be the best way to do that?

Also, I am not able to find the enzyme classification datasets that are mentioned in the code:
path_ec_knna = git_data_path/"suppa.txt"
path_ec_knnb = git_data_path/"suppb.txt"

Thanks.

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions.

Hi,

I tried to replicate the data using ./create_datasets.sh, but I received the following error:

File "/UDSMProt/code/utils/dataset_utils.py", line 268, in prepare_dataset
tok_num = np.array([[tok_stoi[o] for o in p] for p in tok])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (556825,) + inhomogeneous part.

Could you please help me figure out how to fix it? Thanks.
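
A note on this error: it matches the behavior of NumPy >= 1.24, which refuses to build an array from ragged nested sequences (here, sequences of different lengths) unless dtype=object is requested explicitly. A minimal sketch of the likely fix in dataset_utils.py, assuming the downstream code can handle an object array:

tok_num = np.array([[tok_stoi[o] for o in p] for p in tok], dtype=object)

Alternatively, pinning an older NumPy release (pre-1.24) avoids the stricter check.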
