PIMMS


PIMMS stands for Proteomics Imputation Modeling Mass Spectrometry and is an homage to our dear British friends, who have been missing from the EU for far too long already (Pimm's is also a British summer drink).

The preprint is available on bioRxiv.

PIMMS was called vaep during development.
Until the refactoring is complete, the imported package will be vaep.

We provide the functionality as a Python package, an executable workflow, and notebooks.

The models can be used with a scikit-learn-style interface, in the spirit of other scikit-learn imputers. You can try this in Colab.

Python package

For interactive use of the models provided in PIMMS, you can use our Python package pimms-learn. The interface is similar to scikit-learn.

pip install pimms-learn

Then you can use the models on a pandas DataFrame with missing values. Try this in the tutorial on Colab; a minimal sketch is shown below.
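A minimal sketch of this interface, assuming the AETransformer class from the development package vaep (see above); the import path and parameter names are assumptions, so consult the Colab tutorial for the authoritative API.

import pandas as pd
from vaep.sklearn.ae_transformer import AETransformer  # assumed import path

# wide-format DataFrame: samples as rows, features as columns, NaN = missing
df = pd.read_csv('intensities.csv', index_col=0)  # hypothetical input file

model = AETransformer(model='VAE', latent_dim=50)  # parameter names are assumptions
model.fit(df)
df_imputed = model.transform(df)  # same shape, with missing values filled in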

Notebooks as scripts using papermill

If you want to run a model on your prepared data, you can run the notebooks prefixed with 01_, i.e. project/01_*.ipynb, after cloning the repository. Python percent-format script versions are also kept in sync using jupytext.

cd project # project folder as pwd
papermill 01_0_split_data.ipynb --help-notebook
papermill 01_1_train_vae.ipynb --help-notebook

Note that papermill does not raise an error for mistyped argument names, so double-check parameter spelling.
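Parameters are passed with papermill's -p flag; the parameter name below is hypothetical, so list the actual parameters with --help-notebook first.

# execute a notebook to a new output file, overriding one parameter
# ('file_format' is an assumed parameter name, check --help-notebook)
papermill 01_0_split_data.ipynb runs/01_0_split_data.ipynb -p file_format pkl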

PIMMS comparison workflow

The PIMMS comparison workflow is a snakemake workflow that runs all selected PIMMS models and R models on a user-provided dataset and compares the results. An example for the smaller HeLa development dataset on the protein groups level is rebuilt regularly and available at rasmussenlab.org/pimms.

Setup comparison workflow

The core functionality is available as standalone software on PyPI under the name pimms-learn. However, running the entire snakemake workflow requires conda (or mamba) and pip to set up an analysis environment. For a detailed description of setting up conda (or mamba), see the instructions on setting up a virtual environment.

Download the repository

git clone https://github.com/RasmussenLab/pimms.git
cd pimms

Using conda (or mamba), install the dependencies and the package in editable mode

# from main folder of repository (containing environment.yml)
conda env create -n pimms -f environment.yml # slower
mamba env create -n pimms -f environment.yml # faster, less than 5 minutes

If you are on a Mac with an M1 or M2 chip, or otherwise have issues using your accelerator (e.g. GPUs), install the PyTorch dependencies first and then the rest of the environment:

Install development dependencies

Check how to install PyTorch for your system here.

  • Select the version compatible with your CUDA version if you have an NVIDIA GPU, or the appropriate build for a Mac M-chip.
conda create -n vaep python=3.8 pip
conda activate vaep
# Follow instructions on https://pytorch.org/get-started 
# conda env update -f environment.yml -n vaep # should not install the rest.
pip install pimms-learn
pip install jupyterlab papermill # to run notebooks interactively or as scripts

cd project
# choose one of the following to test the code
jupyter lab # open 04_1_train_pimms_models.ipynb
papermill 04_1_train_pimms_models.ipynb 04_1_train_pimms_models_test.ipynb # second notebook is output
python 04_1_train_pimms_models.py # just execute the code

Entire development installation

conda create -n pimms_dev -c pytorch -c nvidia -c fastai -c bioconda -c plotly -c conda-forge --file requirements.txt --file requirements_R.txt --file requirements_dev.txt
pip install -e . # other pip dependencies missing
snakemake --configfile config/single_dev_dataset/example/config.yaml -F -n

or if you want to update an existing environment

conda update -c defaults -c conda-forge -c fastai -c bioconda -c plotly --file requirements.txt --file requirements_R.txt --file requirements_dev.txt

or using the environment.yml file (can fail on certain systems)

conda env create -f environment.yml

Troubleshooting

Troubleshoot your R installation by opening JupyterLab:

# in the project folder
jupyter lab # open 01_1_train_NAGuideR.ipynb

Run an analysis

Change to the project folder and see its README. You can subselect models by editing the config file config.yaml; a sketch is shown below.
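A hypothetical excerpt of such an edit; the key names are assumptions, so check project/config.yaml for the actual schema.

# hypothetical excerpt of config.yaml (key names are assumptions)
models:
  - CF
  - DAE
  - VAE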

conda activate pimms # activate virtual environment
cd project # go to project folder
pwd # should show ./pimms/project
snakemake -c1 -p -n # dry-run the demo workflow
snakemake -c1 -p

The demo will run an example on a small data set of 50 HeLa samples (protein groups):

  1. it describes the data and creates the splits based on the example data
    • see 01_0_split_data.ipynb
  2. it runs the three semi-supervised models alongside some default heuristic methods
    • see 01_1_train_collab.ipynb, 01_1_train_dae.ipynb, 01_1_train_vae.ipynb
  3. it creates a comparison
    • see 01_2_performance_plots.ipynb

The results are written to ./pimms/project/runs/example, including HTML versions of the notebooks for inspection, with the following structure:

│   01_0_split_data.html
│   01_0_split_data.ipynb
│   01_1_train_collab.html
│   01_1_train_collab.ipynb
│   01_1_train_dae.html
│   01_1_train_dae.ipynb
│   01_1_train_vae.html
│   01_1_train_vae.ipynb
│   01_2_performance_plots.html
│   01_2_performance_plots.ipynb
│   data_config.yaml
│   tree_folder.txt
├───data
├───figures
├───metrics
├───models
└───preds

The predictions of the three semi-supervised models can be found under ./pimms/project/runs/example/preds. To combine them with the observed data, you can run:

# ipython or python session
# be in ./pimms/project
import pandas as pd
import vaep.io.datasplits

folder_data = 'runs/example/data'
data = vaep.io.datasplits.DataSplits.from_folder(
    folder_data, file_format='pkl')
observed = pd.concat([data.train_X, data.val_y, data.test_y])
# load predictions for missing values of a certain model
model = 'vae'
fpath_pred = f'runs/example/preds/pred_real_na_{model}.csv'
pred = pd.read_csv(fpath_pred, index_col=[0, 1]).squeeze()
df_imputed = pd.concat([observed, pred]).unstack()
# assert no missing values for retained features
assert df_imputed.isna().sum().sum() == 0
df_imputed

Available imputation methods

The packages are either based on this repository or were referenced by NAGuideR (Table S1). The exact procedure is not always clear from the brief description in the table.

| Method | Package | Source | Status | Description |
|---|---|---|---|---|
| CF | pimms | pip | | Collaborative Filtering |
| DAE | pimms | pip | | Denoising Autoencoder |
| VAE | pimms | pip | | Variational Autoencoder |
| ZERO | - | - | | replace NA with 0 |
| MINIMUM | - | - | | replace NA with the global minimum |
| COLMEDIAN | e1071 | CRAN | | replace NA with the column median |
| ROWMEDIAN | e1071 | CRAN | | replace NA with the row median |
| KNN_IMPUTE | impute | BIOCONDUCTOR | | k-nearest neighbor imputation |
| SEQKNN | SeqKnn | tar file | | sequential k-nearest neighbor imputation; starts with the feature with the fewest missing values and re-uses imputed values for features not yet imputed |
| BPCA | pcaMethods | BIOCONDUCTOR | | Bayesian PCA missing value imputation |
| SVDMETHOD | pcaMethods | BIOCONDUCTOR | | replace NA initially with zero, then impute using the k most significant eigenvalues from a Singular Value Decomposition until convergence |
| LLS | pcaMethods | BIOCONDUCTOR | | local least squares imputation of a feature based on the k most correlated features |
| MLE | norm | CRAN | | maximum likelihood estimation |
| QRILC | imputeLCMD | CRAN | | quantile regression imputation of left-censored data, i.e. random draws from a truncated distribution whose parameters were estimated by quantile regression |
| MINDET | imputeLCMD | CRAN | | replace NA with the q-quantile minimum in a sample |
| MINPROB | imputeLCMD | CRAN | | replace NA by random draws from a distribution centered on the q-quantile minimum |
| IRM | VIM | CRAN | | iterative robust model-based imputation (one feature at a time) |
| IMPSEQ | rrcovNA | CRAN | | sequential imputation of missing values by minimizing the determinant of the covariance matrix with imputed values |
| IMPSEQROB | rrcovNA | CRAN | | sequential imputation of missing values using robust estimators |
| MICE-NORM | mice | CRAN | | Multivariate Imputation by Chained Equations (MICE) using Bayesian linear regression |
| MICE-CART | mice | CRAN | | Multivariate Imputation by Chained Equations (MICE) using regression trees |
| TRKNN | - | script | | truncated k-nearest neighbor imputation |
| RF | missForest | CRAN | | Random Forest imputation (one feature at a time) |
| PI | - | - | | downshifted normal distribution (per sample) |
| GSIMP | - | script | | QRILC initialization and iterative Gibbs sampling with generalized linear models (glmnet) |
| MSIMPUTE | msImpute | BIOCONDUCTOR | | missing-at-random algorithm using low-rank approximation |
| MSIMPUTE_MNAR | msImpute | BIOCONDUCTOR | | missing-not-at-random algorithm using low-rank approximation |
| GRR | DreamAI | - | fails to install | Ridge regression |
| GMS | GMSimpute | tar file | fails on Windows | Lasso model |

pimms's Issues

FASTA file analysis

Does every gene have unique peptides?

  • add an analysis to 01_FASTA_tryptic_digest.ipynb on how many genes have no unique peptide associated with them.

How many duplicated entries are in the FASTA files?

I3L1U9 and I3L3I0 have identical AA sequences

  • run the analysis per gene for proteins of equal length; see the sketch below for finding identical sequences
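A minimal sketch for counting identical amino acid sequences in a FASTA file; the file name is hypothetical, and grouping by full sequence is a simplification of the per-gene analysis suggested above.

from collections import defaultdict

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, ''.join(seq)

# group entries by their amino acid sequence
by_seq = defaultdict(list)
for header, seq in read_fasta('uniprot_human.fasta'):  # hypothetical file name
    by_seq[seq].append(header)

duplicated = {s: h for s, h in by_seq.items() if len(h) > 1}
print(f'{len(duplicated)} sequences occur under more than one entry')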

Aggregating peptides

# up to two missed cleavage sites.
peptides = ("ILTERGYSFTTTAEREIVR",
                 "GYSFTTTAEREIVRDIK",
                           "EIVRDIKEK",
                               "DIKEKLCYVALDFEQEMATAASSSSLEK")
  • match consecutive sequences (the order is known and leads to consecutive overlaps); see the sketch after this list
  • aggregate peptides in case of overlap with peptides resulting from no miscleavage (having a min length of 6), otherwise keep them?
  • distinguish observed vs non-observed for aggregation (only consider peptides with evidence)
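A minimal sketch of matching consecutive peptides by suffix/prefix overlap; the function and output format are illustrative, not the project's implementation.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

peptides = ("ILTERGYSFTTTAEREIVR",
            "GYSFTTTAEREIVRDIK",
            "EIVRDIKEK",
            "DIKEKLCYVALDFEQEMATAASSSSLEK")
# the order is known, so only consecutive pairs need to be checked
for a, b in zip(peptides, peptides[1:]):
    print(f"{a} -> {b}: overlap of {overlap(a, b)} residues")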

Application ideas

On the gene level, the model could be used to:

improve batches of samples

  • each condition is dealt with separately

fill in missing values for an ML follow-up task

  • replace the current proteome imputation with vaep-model-based imputation

3_select_data.ipynb

Focus: peptides.txt

  • change data loading to the new format (less memory needed with a folder-based structure, gene-focused)
    • read non-filtered peptide dumps (
  • transfer code to library
  • get tensorboard in notebook running

imputation strategies

For masking the missing inputs (to be recovered), one needs to set a value.

Possible options

  • [ ] lower detection limit
  • [ ] standardize the data and impute the mean (feature-wise or sample-wise?)
  • [ ] specific token (representing "non-detected"). The lower detection limit in the data acts as a numeric "non-detected" token. One could try to find a learned representation for this (e.g. check BERT). See the sketch below.
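A minimal numpy sketch of the three masking options; the values, shapes, and sentinel are illustrative only.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=25.0, scale=2.0, size=(4, 5))  # toy log-intensities
mask = rng.random(X.shape) < 0.2                  # entries to mask and recover

# option 1: lower detection limit (global minimum of the observed values)
X_ldl = X.copy()
X_ldl[mask] = X[~mask].min()

# option 2: standardize feature-wise, then set masked inputs to the mean (0 after scaling)
mu, sd = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sd
X_std[mask] = 0.0

# option 3: a dedicated token value outside the data range ("non-detected")
TOKEN = -10.0  # arbitrary sentinel; a learned representation would replace this
X_tok = (X - mu) / sd
X_tok[mask] = TOKEN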

contaminants

Idea: Replace MaxQuant's internal contaminant list with an explicit list of contaminants

To reduce the dependency on the internal contaminant list of a specific tool (or MQ version), specify a list of contaminants explicitly.

Next Steps

  • Create an index by date

  • Cluster HeLa cell lines into two parts to see whether a change of the biological HeLa sample is matched.

  • Download HeLa samples for comparison from PRIDE

  • copy files to /tmp/ on Computerome1

  • UniProt search space of possible peptide sequences (using defined constraints)

  • get the latest MaxQuant parameter file for v1.6.1.*

  • fasta-trypsin-digest.ipynb (Johannes Müller)

  • gene-name (and more) look-up using a knowledge graph package

  • BLAST tool (to see how unique peptides match to the genome)

:bug: Windows-latest is not windows desktop, but server

Installing all NAGuideR dependencies fails when Bioconductor images are used, as the GitHub Action testing with windows-latest runs on Windows Server, which seems to be compatible with noarch Linux packages.

It will probably be good to support NAGuideR through a separate environment linked to the Snakemake rule, and to provide instructions on how to install the dependencies on Windows; a sketch follows.
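A sketch of a rule-level environment using Snakemake's conda: directive (a real Snakemake feature); the rule name, paths, and environment file are hypothetical.

# Snakefile excerpt; run with 'snakemake --use-conda' so the directive takes effect
rule train_NAGuideR:
    input:
        notebook='01_1_train_NAGuideR.ipynb',
    output:
        'runs/example/01_1_train_NAGuideR.ipynb',  # executed copy of the notebook (assumed path)
    conda:
        'envs/naguider.yaml'  # hypothetical environment file with the R dependencies
    shell:
        'papermill {input.notebook} {output}'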

MQ files

  • use the previous notebook (to process MQ output) to analyze an entire MQ output folder
  • provide a set of peptides and check for different additional information in this specific MQ output ("Retention time", protein groups)

Experiment02

  • consolidate experiment 02 notebooks
    • move bullet points with tasks to issues
    • make model specific issues with ideas for evaluating them
    • make it work as a script with parameters
    • put more code into configuration files
    • add some more tests for important classes/functions
  • Select training data for performance comparisons
    • Entire data PCA, t-SNE, UMAP -> find a cluster of samples
    • fix data, describe it, order it (500-1000 samples for training, validation and testing data)
  • Compare different setups and performances
    • On fixed data, create performance difference plots between models, etc.
  • denoising evaluation

project config

  • try to find examples for placing project configs
  • will potentially have to be moved to a src folder, or read in as a file in src.__init__.py?

build docs

  • structure and commit docs folder
  • create API docs with Sphinx for the first set of objects defined
