
A package for MSMS spectral library prediction models from the field of (phospho-)proteomics, intended to facilitate the testing and comparison of different neural network architectures and existing models.

Home Page: https://aiproteomics.github.io/aiproteomics/

License: Apache License 2.0


aiproteomics python package

This package contains various tools, datasets and ML model implementations from the field of (phospho-)proteomics. It is intended to facilitate the testing and comparison of different neural network architectures and existing models, using the same datasets. Both retention time and fragmentation (MSMS) models are included.

Implementations of existing models from the literature are intended to be modifiable/extendable, for example so that tests can be carried out with different peptide input lengths.

Installation instructions

The package can be installed using Poetry after cloning the repository.
Installation instructions for Poetry itself can be found here.
Once Poetry is installed, run:

git clone git@github.com:aiproteomics/aiproteomics.git
cd aiproteomics/
poetry install

Try demo notebook

After installation, you can try out the demo notebook by running the following:

poetry run jupyter lab demo/uniprot_e2e.ipynb

This will open the notebook in JupyterLab.

Redesign in progress

This package is in the process of being redesigned to make it more general and portable. The redesign is focusing on the creation of:

  1. Generators of models (in the open and portable ONNX format)
  2. Converters from .msp format to input for each model type
  3. Converters from each model type to .msp

Below is a diagram showing how the proposed tools will be combined to produce a pipeline for training proteomics models and using them to generate synthetic spectral libraries:

Proposed aiproteomics pipeline

Contributing

If you want to contribute to the development of aiproteomics, have a look at the contribution guidelines.

Contributors

danibodor, drcandacemakedamoore, dsmits, raar1

aiproteomics's Issues

Check preloading/loading of datasets on snellius

When we want to run training on e.g. Snellius or other HPC platforms, it needs to be possible to load the TensorFlow datasets and use them on the compute nodes. Compute nodes often don't allow outgoing network connections, and even if they do, it is obviously not efficient to burn GPU time waiting for a 5 GB download.

  • Check whether pre-loading the dataset is convenient/simple enough to do in HPC environments, prior to submission of the training job on the GPU nodes
  • Are there better ways of doing this?

Add DeepDIA fragmentation model

Convert/adapt the DeepDIA model to be buildable in the aiproteomics package. See e.g. aiproteomics/frag/models/prosit_model.py for an example of a model from the literature that has been adapted. DeepDIA should be easier to adapt than the Prosit model was.

Remove TensorFlow Datasets from the package

As described in the current README (and as can be seen in the redesigned pipeline here), we will no longer be using TFDS for dataset conversion and creation. This is partly because it is currently a bit of a dependency nightmare (although this will no doubt improve as it matures), but also because we want users to have more flexibility when converting between different input sets, without relying on TensorFlow.

In preparation for the creation of the msp2model converters, we should:

  • Remove all tfds related code from the package
  • Make sure it now builds correctly

Add test for first msp2model converter

Once the test .msp data has been obtained in #21, and the first msp2model converter has been created in #20, the first test for this converter can be added to this repo. We will use pytest as the testing framework. The test should check that the right data is filtered for the chosen model (e.g. sequences longer than the maximum sequence length accepted by the model should not make it into the training set).
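A hypothetical sketch of such a filtering check, written as a pytest-style test. The function name `filter_sequences` and its `max_len` parameter are invented for illustration and are not the actual aiproteomics API:

```python
def filter_sequences(sequences, max_len=30):
    """Drop sequences longer than the model's maximum input length.

    Stand-in for the real msp2model filtering step (hypothetical API).
    """
    return [seq for seq in sequences if len(seq) <= max_len]


def test_long_sequences_are_filtered_out():
    sequences = ["PEPTIDE", "A" * 31, "ACDEFGHIK"]
    kept = filter_sequences(sequences, max_len=30)
    # The 31-residue sequence must not make it into the training set.
    assert "A" * 31 not in kept
    assert kept == ["PEPTIDE", "ACDEFGHIK"]
```

Run with `poetry run pytest` once the converter exists; the real test would call the converter on the #21 test .msp file instead of an in-memory list.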

pip vs conda install clash

pip install does not work for tensorflow==2.6.0 (it might work for a different version).
conda install does not work for tensorflow-datasets==4.5.2 (I believe it is just not set up for conda at all).

Add the autort dataset

Add the training/testing/validation data for autort as a tensorflow dataset to this package (in aiproteomics.rt.datasets)

Add 'U' filtering to csv_to_speclib

When given a csv containing sequences that are not compatible with the prediction model (e.g. the 'U' amino acid with Prosit), the library should filter them out rather than fail. Alternatively, add an earlier validation/filtering function.
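A minimal sketch of such an early validation/filtering step, assuming the model accepts only the 20 standard amino acids; the function name and alphabet are illustrative, not the real aiproteomics interface:

```python
# Assumed model alphabet: the 20 standard amino acids. The real Prosit
# alphabet also includes modified residues, so treat this as a placeholder.
PROSIT_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")

def drop_incompatible(sequences, alphabet=PROSIT_ALPHABET):
    """Return only sequences whose residues the model can handle,
    silently dropping e.g. selenocysteine ('U')."""
    return [seq for seq in sequences if set(seq) <= alphabet]
```

For example, `drop_incompatible(["PEPTIDE", "PEUPTIDE"])` keeps only `"PEPTIDE"`. A variant could log or return the dropped sequences so the user knows what was excluded.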

Next step: Read the Docs

Your Python package should have publicly available documentation, including API documentation for your users.
Read the Docs can host your user documentation for you.

To host the documentation of this repository please perform the following instructions:

  1. Go to Read the Docs
  2. Log in with your GitHub account
  3. Find https://github.com/ai-proteomics/aiproteomics in the list and press the + button.
  4. Wait for the first build to complete at https://readthedocs.org/projects/aiproteomics/builds
  5. Check that the documentation badge link in README.md works

See README.dev.md for how to build the documentation site locally.

Next step: Citation data

It is likely that your CITATION.cff currently doesn't pass validation. The error messages you get from the cffconvert GitHub Action are unfortunately a bit cryptic, but doing the following helps:

  • Check if the given-name and family-name keys need updating. If your family name has a name particle like von or van or de, use the name-particle key; if your name has a suffix like Sr or IV, use name-suffix. For details, refer to the schema description: https://github.com/citation-file-format/citation-file-format
  • Update the value of the orcid key. If you do not have an ORCID yet, you can get one at https://orcid.org/.
  • Add more authors if needed
  • Update date-released using the YYYY-MM-DD format.
  • Update the doi key with the conceptDOI for your repository (see https://help.zenodo.org for more information on what a conceptDOI is). If your project doesn't have a DOI yet, you can use the string 10.0000/FIXME to pass validation.
  • Verify that the keywords array accurately describes your project.

Once you do all the steps above, the cffconvert workflow will tell you what content it expected to see in .zenodo.json. Copy-paste from the GitHub Action log into a new file .zenodo.json. Afterwards, the cffconvert GitHub Action should be green.

To help you keep the citation metadata up to date and synchronized, the cffconvert GitHub Action checks the following 6 aspects:

  1. Whether your repository includes a CITATION.cff file.

    By including this file, authors of the software can receive credit for the work they put in.

  2. Whether your CITATION.cff is valid YAML.

    Visit http://www.yamllint.com/ to see if the contents of your CITATION.cff are valid YAML.

  3. Whether your CITATION.cff adheres to the schema (as listed in the CITATION.cff file itself under key cff-version).

    The Citation File Format schema can be found here, along with an explanation of all the keys. You're advised to use the latest available schema version.

  4. Whether your repository includes a .zenodo.json file.

    With this file, you can control what metadata should be associated with any future releases of your software on Zenodo: things like the author names, along with their affiliations and their ORCIDs, the license under which the software has been released, as well as the name of your software and a short description. If your repository doesn't have a .zenodo.json file, Zenodo will take a somewhat crude guess to assign these metadata.

    The cffconvert GitHub action will tell you what it expects to find in .zenodo.json, just copy and paste it to a new file named .zenodo.json. The suggested text ignores CITATION.cff's version, commit, and date-released. cffconvert considers these keys suspect in the sense that they are often out of date, and there is little purpose to telling Zenodo about these properties: Zenodo already knows.

  5. Whether .zenodo.json is valid JSON.

    Currently unimplemented, but you can check for yourself on https://jsonlint.com/.

  6. Whether CITATION.cff and .zenodo.json contain equivalent data.

    This final check verifies that the two files are in sync. The check ignores CITATION.cff's version, commit, and date-released.

Create comparison module for generating metrics of spectral similarity etc

We need a comparison module in the package, containing one or more tools that assess:

  1. How well does an RT model fit with a given validation set
  2. How well does a fragmentation model fit with a given validation set

It seems likely we could use scikit-learn for some of the metric calculations and possibly to make nice plots too.

For metrics of similarity between spectra, we can apparently make use of matchms to calculate a similarity score.

The comparison tools should go in a comparison directory in this repo, such that they are accessible via the package as aiproteomics.comparison.*
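As a sketch of what a spectral similarity metric could look like (independent of matchms), the following bins each spectrum's peaks onto an m/z grid and computes the cosine of the resulting intensity vectors. The bin width and function names are illustrative choices, not part of the planned aiproteomics API:

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """peaks: iterable of (mz, intensity). Returns {bin_index: summed intensity}."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned

def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra, in [0, 1] for
    non-negative intensities; 0.0 if either spectrum is empty."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

matchms offers more sophisticated scores (e.g. greedy peak matching with m/z tolerances), so this simple binned cosine would mainly serve as a baseline.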

Create or obtain an example .msp file with training data

For testing and developing the pipeline, it would be useful to have some training set data encoded in "standard" .msp format. This would be, essentially, the retention time and spectral information for as many different sequences as possible. The sequence can be encoded using PEPSEQ (or something similar) in the metadata for each entry.

Perhaps we can get something like this from @tvpham ?

Create the first msp2model converter

  • Add a msp2model directory to the base of aiproteomics (so that it would be accessible as aiproteomics.msp2model in the python package)
  • Add a module containing a function that converts a provided standard .msp spectral library file into the raw input expected by the transformer retention time model (this is the model currently generated here, but that will be made into a model generator in issue #19)
  • Make converter flexible enough to handle e.g. different input sequence lengths
  • Output as HDF5 containing the necessary input and output layer data for training
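The sequence-encoding part of such a converter could look roughly like this sketch; the alphabet ordering, padding value and function name are assumptions, not the actual aiproteomics encoding:

```python
# Hypothetical integer encoding of peptide sequences for the model input.
# Index 0 is reserved for padding; residues are numbered from 1.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(ALPHABET)}

def encode_sequence(seq, max_len=30):
    """Encode a peptide as a zero-padded list of integers,
    or None if it exceeds the model's input length (i.e. filtered out)."""
    if len(seq) > max_len:
        return None
    encoded = [AA_TO_INT[aa] for aa in seq]
    return encoded + [0] * (max_len - len(encoded))
```

The full converter would apply this to every .msp entry and write the resulting arrays (plus the retention time targets) to the HDF5 output, e.g. via h5py. Making `max_len` a parameter is what gives the flexibility for different input sequence lengths mentioned above.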

Demo notebook could use some changes and/or documentation updates

The demo notebook produces some error messages and warnings, and requires some additional packages:

  • Tensorflow gives warnings when imported and when used if it cannot find a GPU. For a demo notebook it is probably fine to run locally on systems without a GPU. It may be good to warn the user with a comment that this warning will appear, or to suppress it, for example by using the warnings module:
    import warnings
    warnings.filterwarnings('ignore')
    
  • The third cell tries to load already-trained weights; because this file is not present in the GitHub repository, it throws a 'file not found' error. The problematic line is model_frag.load_weights('trained_transformer_frag/weight_49_0.25681.hdf5'). Either the line should be removed or the file should be provided.
  • The fourth cell uses tf.keras.utils.plot_model, which needs pydot and graphviz to work. Running the cell produces the following message:
    You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.
    However, it might still be a good idea to add extra documentation (in the README?) so that the user can install the required packages beforehand. If they are used more often, they could also be added to the requirements.

Fix memory issue with spectral library generation

The csv_to_msp() function currently runs out of memory because it holds the entire output in memory. Also check the new Prosit code to see if it is more efficient/improved. (The old repo is now archived.)
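One possible fix is to stream the conversion, writing each entry as soon as it is read instead of accumulating the whole library in memory. This sketch assumes hypothetical `sequence` and `irt` CSV columns and a minimal entry format; the real csv_to_msp() schema may differ:

```python
import csv

def csv_to_msp_streaming(csv_path, msp_path):
    """Convert a CSV to .msp one row at a time, keeping memory use constant."""
    with open(csv_path, newline="") as src, open(msp_path, "w") as dst:
        for row in csv.DictReader(src):
            # Each entry is written immediately; nothing accumulates in memory.
            dst.write(f"Name: {row['sequence']}\n")
            dst.write(f"Comment: iRT={row['irt']}\n")
            dst.write("Num peaks: 0\n\n")  # real fragment peaks would go here
```

If the predictions themselves are generated in batches, the same pattern applies: predict a batch, write its entries, discard it, repeat.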

Next step: Linting

Your repository has a workflow which lints your code after every push and when creating a pull request.

The linter workflow may fail if the description or keywords field in setup.cfg is empty. Please update these fields. To validate your changes, run:

prospector

Enabling the githook will automatically lint your code on every commit. You can enable it by running the command below:

git config --local core.hooksPath .githooks

Investigate how DeepDIA might be split into msp2model, modelgen and model2msp components

A literature model we haven't adapted yet is DeepDIA (https://github.com/lmsac/DeepDIA). It would be useful to:

  • Install it, following the instructions on the DeepDIA README
  • Try it with some example input and see what outputs it is capable of producing (can it go straight to a spectral library format, for example?)
  • Look through the codebase and identify which sections could be converted to each of the three components in our design: msp2model, modelgen and model2msp
  • Write up a small report on your findings in this issue thread

Next step: Enable Zenodo integration

By enabling Zenodo integration, your package will automatically get a DOI which can be used to cite your package. After enabling Zenodo integration for your GitHub repository, Zenodo will create a snapshot and archive each release you make on GitHub. Moreover, Zenodo will create a new DOI for each GitHub release of your code.

To enable Zenodo integration:

  1. Go to http://zenodo.org and log in with your GitHub account. When you are redirected to GitHub, click Authorize application to give Zenodo permission to access your account.
  2. Go to https://zenodo.org/account/settings/github/ and enable Zenodo integration for your repository by clicking the On toggle button.
  3. Your package will get a DOI only after you make a release. Create a new release as described in README.dev.md
  4. At this point you should have a DOI. To find out the DOI generated by Zenodo:
    1. Visit https://zenodo.org/deposit and click on your repository link
    2. You will find the latest DOI in the right column in Versions box in Cite all versions? section
    3. Copy the text of the link. For example 10.5281/zenodo.1310751
  5. Update the badge in your repository
    1. Edit README.md and replace the badge placeholder with the badge link you copied in previous step.
      The badge placeholder is shown below.

      [![DOI](https://zenodo.org/badge/DOI/<replace-with-created-DOI>.svg)](https://doi.org/<replace-with-created-DOI>)

For FAQ about Zenodo please visit https://help.zenodo.org/.

Add Prosit RT dataset

Make the Prosit retention time dataset into a tensorflow dataset, available in aiproteomics.rt.datasets

_generate_examples() is really slow for large data sets

Tensorflow Datasets extend tfds.core.GeneratorBasedBuilder in order to create datasets. This has a user implemented _generate_examples() method that reads the data and returns a dict of each 'row'. This works fine for the retention time data sets which are typically very small (see e.g. aiproteomics.rt.datasets.AIProteomicsHela1) but is horrifically slow for the large datasets involved in fragmentation models.

For this issue, look at the aiproteomics.frag.datasets.AIProteomicsProsit1Frag dataset.

  • Are tensorflow datasets the best way of packaging/serving large data sets?
  • Is there a way to do this optimally? Through batching of data, or running a transform on the hdf5 data etc?

Data should be loaded in a lazy fashion

Currently, the data loading module DataSetPrositFrag loads all the data in one go.

This puts unnecessary strain on machine resources. It needs to be changed to load one batch at a time.
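A batch-at-a-time loader can be sketched with a plain Python generator; in DataSetPrositFrag the source would be an HDF5 file read slice by slice, rather than the in-memory iterable used here for illustration:

```python
def iter_batches(source, batch_size):
    """Yield consecutive batches lazily instead of materialising
    the whole dataset at once."""
    batch = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

A generator like this can also be handed to tf.data.Dataset.from_generator, so the training loop itself does not need to change.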

Create the first model generator

For this issue we will use the transformer retention time model generating code already available in the package here

  • Add a modelgen directory to the repo (that will be accessible in the python package as aiproteomics.modelgen) to put any new model generator code in
  • Adapt the code so that it outputs the model in .onnx format (optionally still allowing output as a Keras model, as it currently does)
  • Make the file optionally also runnable from the command line (using the if __name__ == "__main__": approach). When run from the command line, it should output only in ONNX format
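The command-line entry point could follow the usual argparse pattern; the flag names and defaults below are invented for illustration, and the actual model building/ONNX export step is stubbed out:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Generate a retention time model in ONNX format.")
    parser.add_argument("--output", default="rt_model.onnx",
                        help="Path of the ONNX file to write")
    parser.add_argument("--max-seq-len", type=int, default=30,
                        help="Peptide input length for the generated model")
    return parser

if __name__ == "__main__":
    # parse_known_args ignores unrelated argv entries (e.g. test-runner flags)
    args, _unknown = build_parser().parse_known_args()
    # build_and_export_onnx(args) would construct the model and save it (not shown)
    print(f"Would write ONNX model to {args.output}")
```

The Python API could expose the same builder function and additionally return the Keras model, while the CLI path always exports ONNX, matching the issue's requirement.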

Add compare module for fragmentation models

Add a compare module to aiproteomics for the fragmentation models branch that contains tools for comparing the output MSMS spectra of different models against each other and against validation data sets. A user could then import tools from aiproteomics.frag.compare.

Look into the possibility of using https://github.com/matchms/matchms (or something like it) to compare mass spectra with each other. Also check out the prosit paper for metrics used there.

Create first model2msp converter

The output layer(s) of a given model need translation/conversion to produce the spectra/iRT data in .msp format. Especially for the fragmentation models, this can represent a lot of post-processing.

  • Add a model2msp directory to the base of aiproteomics (so that it would be accessible as aiproteomics.model2msp in the python package)
  • Add a module containing a function that converts the output of the transformer retention time model (currently generated here) into .msp format. It's ok if that .msp only contains retention time for now (no spectra).
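A minimal sketch of the retention-time-only .msp output described above; the field names follow common .msp conventions, but the exact metadata layout is an assumption:

```python
def rt_predictions_to_msp(predictions):
    """predictions: iterable of (sequence, irt) pairs.
    Returns the text of a minimal .msp library with no fragment peaks."""
    entries = []
    for sequence, irt in predictions:
        entries.append(
            f"Name: {sequence}\n"
            f"Comment: iRT={irt:.2f}\n"
            "Num peaks: 0\n"   # spectra would be listed here for frag models
        )
    return "\n".join(entries)
```

For fragmentation models, the same converter would additionally translate the model's output layer into annotated (m/z, intensity) peak lists, which is where most of the post-processing effort lies.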

Improve data loading

  • Explore tensorflow dataset
  • Explore alternatives

Problem: currently the whole dataset is loaded into memory; a TensorFlow dataset might solve this.

There also needs to be functionality to download the dataset from the internet (figshare)

@raar1 can you share the link to the dataset?

Update dependencies to use tensorflow datasets

TensorFlow Datasets doesn't support the current version of TensorFlow, so we have to upgrade to the newest version.

The newest TensorFlow version doesn't support Python 3.9 anymore, so we also have to drop support for 3.9.

Next step: Sonarcloud integration

Continuous code quality can be handled by Sonarcloud. This repository is configured to use Sonarcloud to perform quality analysis and code coverage report on each push.

In order to configure Sonarcloud analysis GitHub Action workflow you must follow the steps below:

  1. go to Sonarcloud to create a new Sonarcloud project
  2. log in with your GitHub account
  3. add a Sonarcloud organization or reuse an existing one
  4. set up a repository
  5. go to the new code definition administration page and select the Number of days option
  6. To be able to run the analysis:
    1. create a token in your Sonarcloud account
    2. add the created token as SONAR_TOKEN to the secrets on GitHub

Add compare module to aiproteomics for RT

Add a compare module to aiproteomics for the retention time branch that contains tools for comparing the performance of different models against each other. A user could then import tools from aiproteomics.rt.compare and compute metrics describing how well different models perform when trained on the same data.
