
UniRep-analysis

Analysis and figure code from Alley et al. 2019.

Start by cloning the repo:

git clone https://github.com/churchlab/UniRep-analysis.git

Requirements

python: 3.5.2

For reference on how to install, see https://askubuntu.com/questions/682869/how-do-i-install-a-different-python-version-using-apt-get
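For example, on Ubuntu one common approach is the deadsnakes PPA (an illustrative sketch; whether the python3.5 packages are available depends on your Ubuntu release):

sudo add-apt-repository ppa:deadsnakes/ppa   # community PPA that packages older Python builds
sudo apt-get update
sudo apt-get install python3.5 python3.5-venv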

A venv with the necessary requirements can be set up with:

cd UniRep-analysis # root directory of the repository
python3 -m venv venv/
source venv/bin/activate
pip install -r venv_requirements/requirements-py3.txt
deactivate
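To sanity-check the new environment before running anything, you can activate it and confirm the interpreter and package state (a quick check, not part of the original instructions):

source venv/bin/activate
python --version   # should report Python 3.5.x
pip check          # reports any broken or conflicting package dependencies
deactivate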

conda can be installed with:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# restart the current shell

Two conda environments are needed to run different parts of the code; create them with:

cd UniRep-analysis # root directory of the repository
conda env create -f ./yml/grig_alldatasets_run.yml
conda env create -f ./yml/ethan_analysis.yml
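
You can confirm that both environments were created successfully with:

conda env list   # should list grig_alldatasets_run and ethan_analysis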

Getting the data

mkdir data
cd data

wget https://s3.us-east-2.amazonaws.com/unirep-data-storage/unirep_analysis_data_part2.tar.gz
tar -zxvf unirep_analysis_data_part2.tar.gz # this may take some time
mv data/* ./
rm unirep_analysis_data_part2.tar.gz

wget https://s3.amazonaws.com/unirep-public/unirep_analysis_data.zip
unzip unirep_analysis_data.zip # may need to install unzip with sudo apt install unzip
mv unirep_analysis_data/* ./
rm unirep_analysis_data.zip

cd ..
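
As a quick sanity check that both archives unpacked correctly (the exact file listing is not reproduced here), you can inspect the data directory:

ls data | head   # spot-check that extracted files are present
du -sh data      # total size of the downloaded data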

Project Structure

.
├── analysis
│   ├── analysis_unsupervised_clustering_oxbench_homstrad.ipynb # ethan_analysis
│   ├── FINAL_compute_std_by_val_resampling.py # grig_alldatasets_run
│   ├── FINAL_run_l1_regr_quant_function_stability_and_supp_analyses.py # grig_alldatasets_run
│   ├── FINAL_run_RF_homology_detection.py # grig_alldatasets_run
│   └── FINAL_run_transfer_analysis_function_prediction__stability.py # grig_alldatasets_run
├── figures
│   ├── figure2
│   │   ├── fig_1a.ipynb # ethan_analysis
│   │   ├── fig2b_supp_fig2_upper.ipynb # ethan_analysis
│   │   ├── fig2c.ipynb # ethan_analysis
│   │   ├── fig2e_supp_fig4-5.ipynb # ethan_analysis
│   │   ├── FINAL_Fig2d_AND_SupTableS2_Homology_detection.ipynb # grig_alldatasets_run
│   │   ├── FINAL_Fig2g_alpha-beta_neuron.ipynb # grig_alldatasets_run
│   │   ├── supp_fig2.ipynb # ethan_analysis
│   │   └── supp_fig3.ipynb # ethan_analysis
│   ├── figure3
│   │   ├── fig3b.ipynb # ethan_analysis
│   │   ├── fig3c.ipynb # ethan_analysis
│   │   ├── FINAL_Fig_3a_Rosetta_comparison.ipynb # grig_alldatasets_run
│   │   ├── FINAL_Fig3e_Quant_function_prediction_Fig3b_stability_ssm2_significance_SuppTableS4-5.ipynb # grig_alldatasets_run
│   │   └── supp_fig8.ipynb # ethan_analysis
│   ├── figure4 
│   │   ├── A007h_budget_constrained_functional_sequence_recovery_analysis.ipynb # venv
│   │   ├── A007j_pred_v_actual_fpbase_plots.ipynb # venv
│   │   ├── A008c_visualize_ss_feature_predictors_on_protein.ipynb # venv
│   │   ├── common.py
│   │   ├── supp_fig10a_partial_and_e.ipynb # ethan_analysis
│   │   ├── supp_fig10a_partial_and_f.ipynb # ethan_analysis
│   │   ├── supp_fig10b-d_left.ipynb # ethan_analysis
│   │   ├── supp_fig10b-d_right.ipynb # ethan_analysis
│   │   └── supp_fig_10g_10h.ipynb # venv
│   └── other
│       ├── FINAL_supp_data_3_2_1__supp_fig_S9__Supp_fig_s12.ipynb # grig_alldatasets_run
│       ├── FINAL_SuppFigS1_Seq_db_growth.ipynb # grig_alldatasets_run
│       └── supp_fig13.ipynb # ethan_analysis
├── common
├── common_v2
├── README.md
├── venv_requirements
└── yml

Usage

Reproducing figures

To re-generate all the figures in the main text, execute the Jupyter notebooks in the /figures directory using the right environment.

To run a notebook, do the following:

Activate the right environment (as noted in the Project Structure section for each notebook):

For grig_alldatasets_run:

source activate grig_alldatasets_run

For ethan_analysis:

source activate ethan_analysis

For venv:

source venv/bin/activate

Then execute:

jupyter notebook

This will automatically open a browser window where one can interactively rerun the code generating the figures. Aesthetic components (colors, font sizes) may differ slightly from the final version of the figures in the paper.
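
If you are working on a remote machine without a browser, a notebook can also be executed non-interactively with nbconvert; for example (using one of the repository's ethan_analysis notebooks):

source activate ethan_analysis
jupyter nbconvert --to notebook --execute figures/figure2/fig2c.ipynb --output fig2c_executed.ipynb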

Re-training top models and re-generating performance metrics

To re-train the models and regenerate the metrics from which the figures are constructed, run the Python scripts in the /analysis folder. By default these evaluate all representations and baselines on all available datasets and subsets, and compute metrics (such as MSE and Pearson r) on the test subset.

The easiest way to do this is to start an AWS instance with sufficient resources (we recommend m5.12xlarge or m5.24xlarge for shorter runtime; the code takes advantage of all available CPU cores) running an Ubuntu Server 18.04 LTS AMI (for example, ami-0f65671a86f061fcd). After performing the initial setup above, create the necessary directories:

cd analysis
mkdir results # folder for various model performance metrics
mkdir predictions # folder for model predictions for various datasets
mkdir models # folder for trained models
mkdir params # folder for recording best parameters after hyperparameter search 
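
Equivalently, since mkdir -p creates all of them in one call and is safe to re-run:

mkdir -p results predictions models params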

Activate the right environment:

source activate grig_alldatasets_run

To run SCOP 1.67 Superfamily Remote Homology Detection and SCOP 1.67 Fold-level Similarity Detection with Random Forest, execute:

python FINAL_run_RF_homology_detection.py
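
These scripts can run for a long time even on a large instance, so you may want to detach them from the terminal; for example (rf_homology.log is just an illustrative log name):

nohup python FINAL_run_RF_homology_detection.py > rf_homology.log 2>&1 &
tail -f rf_homology.log   # follow progress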

To run quantitative function prediction, de novo designed mini-protein stability prediction, and DMS stability prediction for the 17 de novo designed and natural protein datasets from Figure 3, as well as supplementary benchmarks such as small-scale function prediction (Supp. Table S4), execute:

python FINAL_run_l1_regr_quant_function_stability_and_supp_analyses.py
python FINAL_compute_std_by_val_resampling.py # computes estimates of standard deviations through validation/test set resampling for significance testing (generates std_results_val_resamp.csv)

To run the analyses from Supp. Fig. S10 (generalized stability prediction, generalized quantitative function prediction, and a special central-to-remote generalized stability prediction task), execute:

python FINAL_run_transfer_analysis_function_prediction__stability.py

Contributors

grnikh, sandias42, surgebiswas

Issues

Resolving conda conflicts

Great work! Thanks for sharing.

I'm interested in running the random forest homology detection code but cannot get past setting up the conda environment. There seem to be many package conflicts for the grig_alldatasets_run environment. Any pointers on resolving these conflicts would be greatly appreciated.

Thanks!

Project is missing some files?

Hi guys,

Thanks for sharing your work!

When trying to reproduce A007j_pred_v_actual_fpbase_plots.ipynb, I found that there is no 'settings' directory in the project directory and no 'common' directory in the subdirectory of the project, so I got an ImportError and the code cannot run properly.

The same problem also exists in A007h_budget_constrained_functional_sequence_recovery_analysis.ipynb and A008c_visualize_ss_feature_predictors_on_protein.ipynb.

I hope you can help me solve this. Thank you!

Embedding with 10 dimensions?

Hi guys,

Thanks for sharing your work!

When trying to reproduce fig_1a.ipynb, we found you are using a 10-dimensional embedding for the amino acids. However, in the UniRep code we find babbler classes for 1900, 256, and 64 dimensions, but not 10.

When using the 1900-unit mLSTM version (instead of the 10-dimensional embedding provided in the example) with the code from fig_1a.ipynb, the resulting plots are similar to those from the 10-dimensional version but do not fully match. How did you come up with the 10-dimensional embedding located at "../../../unirep/weights/1900_weights/embed_matrix:0.npy"?

Thanks!
