rasmussenlab / pimms

Imputing proteomics data using deep learning models

Home Page: https://pimms.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 47.75% Jupyter Notebook 49.00% Shell 0.28% R 2.97%

pimms's Issues

Find a way to upload data to Computerome2

Use https://erda.dk/ ?

VAE description

Add overview over different sets of VAEs, see docs/vae_notes.md

:bug: Windows-latest is not windows desktop, but server

Installing all NAGuideR dependencies fails if Bioconductor images are used, because the GitHub Actions test on windows-latest runs on Windows Server, which seems to be compatible with noarch Linux packages.

It will probably be good to support NAGuideR through a separate environment linked to the Snakemake rule and to provide instructions on how to install the dependencies on Windows.

imputation strategies

To mask the missing inputs (which are to be recovered), one needs to set a value.

Possible options

[ ] lower detection limit
[ ] standardize the data and use the mean as input (feature-wise or sample-wise?)
[ ] specific token (representing "non-detected"). The lower detection limit in the data acts as a numeric "non-detected" token. One could try to find a learned representation for this (e.g. check BERT)
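
A minimal sketch of the first two options, using only the Python standard library. The function name `mask_and_fill`, the toy intensity matrix and the use of the smallest visible value as a stand-in for the lower detection limit are assumptions made for illustration, not code from the repository. For standardized data the feature-wise mean would simply be 0.

```python
import random
from statistics import mean


def mask_and_fill(data, frac=0.25, strategy="lower_limit", seed=0):
    """Hide a fraction of the observed values and fill every gap with a placeholder.

    data: list of samples (rows), each a list of feature values; None marks a missing value.
    Returns the filled matrix and a dict {(row, col): true value} of the masked entries.
    """
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(data)
                for j, value in enumerate(row) if value is not None]
    n_mask = max(1, int(len(observed) * frac))
    masked = {pos: data[pos[0]][pos[1]] for pos in rng.sample(observed, n_mask)}

    # Values that stay visible to the model after masking.
    visible = {pos: data[pos[0]][pos[1]] for pos in observed if pos not in masked}
    # Assumption: the smallest visible value stands in for the lower detection limit.
    lower_limit = min(visible.values())

    n_features = len(data[0])
    if strategy == "lower_limit":
        fill = [lower_limit] * n_features
    elif strategy == "feature_mean":
        fill = []
        for j in range(n_features):
            column = [v for (_, col), v in visible.items() if col == j]
            fill.append(mean(column) if column else lower_limit)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    filled = [[visible.get((i, j), fill[j]) for j in range(n_features)]
              for i in range(len(data))]
    return filled, masked


# Toy log-intensities: 4 samples x 3 features, None = not detected.
intensities = [
    [25.1, 27.3, None],
    [24.8, None, 30.2],
    [26.0, 27.9, 29.5],
    [25.5, 28.1, 30.0],
]
filled, masked = mask_and_fill(intensities, frac=0.2, strategy="lower_limit")
```

The returned `masked` dictionary keeps the true values, so a model's reconstruction can be scored on exactly the entries that were hidden.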

compare to other imputation approaches

scRNA (Erslan) approach
R package imputeLCMD source code for left censored missing data
- docs contain examples, e.g. for QRILC (contained in the R package)

build docs

structure and commit docs folder
create API docs with sphinx for first set of objects defined

map protein isotopes to gene sequence

In a webinar, the gene sequences for a set of proteins were mentioned:

UniProt has the cross-reference to the ENA database which has the DNA sequence (-> possible for predicted proteins)

protein existence levels

Sample: How many of the identified peptides result from miscleavages?

incorporate sample specific information in ProcessingStrategy (to be defined as base class of all processing strategies)

Experiment02

Application ideas

On the gene level, the model could be used to

improved batch of samples

each condition is dealt with separately

fill in missing for ML follow-up task

replace current proteome imputation by vaep-model based imputation

project config

try to find examples for placing project configs
will have to be moved to a src folder potentially or read in as a file in src.__init__.py?

Download more samples from PRIDE

Using bioservices to access PRIDE
dataset RESTful API from the command line.
pride-py - check out

peptide aggregation to proteins

openms code for peptide to protein aggregation
msfragger code for peptide to protein aggregation

3_select_data.ipynb

Focus: peptides.txt

change data loading to new format (less memory needed with folder based structure, gene focused)
- read non-filtered peptide dumps (
transfer code to library
get tensorboard in notebook running

FASTA file analysis

Does every gene have unique peptides?

add analysis to 01_FASTA_tryptic_digest.ipynb on how many genes have no unique peptide associated with them.
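
One way such an analysis could look, as a sketch under stated assumptions: trypsin cleaves after K or R unless the next residue is P (a common simplification), up to two missed cleavages are allowed, and peptides shorter than six residues are dropped. The function names and the toy gene-to-sequence mapping are hypothetical and not taken from the notebook.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence, missed_cleavages=2, min_length=6):
    """Return the set of tryptic peptides of a protein sequence."""
    # Split after K or R, but not in front of P (requires Python 3.7+).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = set()
    for start in range(len(fragments)):
        for n_missed in range(missed_cleavages + 1):
            end = start + n_missed + 1
            if end > len(fragments):
                break
            peptide = "".join(fragments[start:end])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def genes_without_unique_peptides(proteins_by_gene, **digest_kwargs):
    """List genes for which every peptide also occurs in another gene."""
    peptide_to_genes = defaultdict(set)
    for gene, sequences in proteins_by_gene.items():
        for sequence in sequences:
            for peptide in tryptic_digest(sequence, **digest_kwargs):
                peptide_to_genes[peptide].add(gene)
    genes_with_unique = {next(iter(genes))
                         for genes in peptide_to_genes.values() if len(genes) == 1}
    return sorted(set(proteins_by_gene) - genes_with_unique)


# Artificial sequences, chosen only to make the outcome easy to follow.
toy_proteins = {
    "geneA": ["AAAAAAKCCCCCCR"],
    "geneB": ["AAAAAAK"],
    "geneC": ["DDDDDDK"],
}
no_unique = genes_without_unique_peptides(toy_proteins)
```

Here `geneB` is reported because its only peptide is also produced by `geneA`.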

How many duplicated entries are in fasta files?

I3L1U9 and I3L3I0 have identical AA sequences
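
A rough sketch for counting such duplicates: group the FASTA identifiers by sequence and report every group with more than one member. The parser is deliberately minimal, and the headers and sequences in the example are made up; real UniProt headers carry more fields, so the identifier extraction would need adjusting.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}."""
    parts = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first whitespace-separated token.
            identifier = line[1:].split()[0]
            parts[identifier] = []
        elif identifier is not None:
            parts[identifier].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


def duplicated_sequences(records):
    """Return lists of identifiers that share an identical sequence."""
    by_sequence = defaultdict(list)
    for identifier, sequence in records.items():
        by_sequence[sequence].append(identifier)
    return [ids for ids in by_sequence.values() if len(ids) > 1]


# Made-up entries; protA and protC are identical once the lines are joined.
toy_fasta = """\
>protA example entry
MKTAYIAK
QRQISFVK
>protB example entry
MKTAYIAKQR
>protC example entry
MKTAYIAKQRQISFVK
"""
duplicates = duplicated_sequences(read_fasta(toy_fasta))
```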

run analysis per gene for proteins of equal length

Aggregating peptides

# up to two missed cleavage sites.
peptides = ("ILTERGYSFTTTAEREIVR",
                 "GYSFTTTAEREIVRDIK",
                           "EIVRDIKEK",
                               "DIKEKLCYVALDFEQEMATAASSSSLEK")

match consecutive sequences (the order is known and leads to consecutive overlaps)
aggregate peptides in case of overlap with peptides resulting from no miscleavage (having a min length of 6), otherwise keep them?
distinguish observed vs non-observed for aggregation (only consider peptides with evidence)
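
The matching of consecutive sequences could be sketched as follows: each peptide is joined to the previous one when a suffix of the running sequence equals a prefix of the next peptide. The helper names and the `min_overlap` threshold are assumptions for illustration, and the sketch does not yet distinguish observed from non-observed peptides.

```python
def overlap_length(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_consecutive(peptides, min_overlap=3):
    """Merge peptides, given in sequence order, into contiguous stretches."""
    merged = []
    for peptide in peptides:
        if merged:
            k = overlap_length(merged[-1], peptide, min_overlap)
            if k:
                # Append only the part that extends the current stretch.
                merged[-1] += peptide[k:]
                continue
        merged.append(peptide)
    return merged


peptides = ("ILTERGYSFTTTAEREIVR",
            "GYSFTTTAEREIVRDIK",
            "EIVRDIKEK",
            "DIKEKLCYVALDFEQEMATAASSSSLEK")
stretches = merge_consecutive(peptides)
```

For the four peptides above the overlaps are 14, 7 and 5 residues long, so they collapse into a single stretch. A peptide without sufficient overlap starts a new stretch instead.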

Next Steps

Create index by date
Cluster HeLa cell lines into two groups to see whether a change of the biological HeLa sample is matched.
Download HeLa samples for comparison from PRIDE
copy files to /tmp/ on Computerome1
UniProt search space of possible peptide sequences (using defined constraints)
get latest MaxQuant parameter file for v1.6.1.*
fasta-trypsin-digest.ipynb (Johannes Müller)
gene-name (and more) look-up using knowledge graph package
Blast tool (to see how unique peptides match to the genome)

MQ files

use previous notebook (to process MQ-output) to analyze an entire MQ-OUTPUT folder
provide a set of peptides and check for different additional information in this specific MQ-OUTPUT ("Retention Time", group of proteins

contaminants

Idea: Replace MQ list of internal contaminants by explicit list of contaminants

To reduce the dependency on the internal contaminant list of a specific tool (or MQ version), specify the list of contaminants explicitly.
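
A small sketch of what filtering against an explicit list might look like. It assumes each protein group is a dictionary with a semicolon-separated "Protein IDs" entry and drops a group as soon as any of its IDs is on the list; the function name, the accessions and the intensity values are placeholders, not a curated contaminant list.

```python
def drop_contaminants(protein_groups, contaminants):
    """Keep only protein groups that contain no ID from the explicit list."""
    contaminants = set(contaminants)
    kept = []
    for group in protein_groups:
        ids = {pid.strip() for pid in group["Protein IDs"].split(";")}
        if ids.isdisjoint(contaminants):
            kept.append(group)
    return kept


# Placeholder accessions; a real list would be version-controlled with the project.
explicit_contaminants = ["CONT001", "CONT002"]
toy_groups = [
    {"Protein IDs": "PROT001;PROT002", "Intensity": 1.0e7},
    {"Protein IDs": "CONT001", "Intensity": 5.0e8},
    {"Protein IDs": "PROT003;CONT002", "Intensity": 2.0e6},
]
clean_groups = drop_contaminants(toy_groups, explicit_contaminants)
```

Dropping a group when any member matches is the stricter choice; requiring all members to match would keep mixed groups such as the third one.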