
A package for MSMS spectral library prediction models from the field of (phospho-)proteomics, intended to facilitate the testing and comparison of different neural network architectures and existing models.

Home Page: https://aiproteomics.github.io/aiproteomics/

License: Apache License 2.0


aiproteomics python package

This package contains various tools, datasets and ML model implementations from the field of (phospho-)proteomics. It is intended to facilitate the testing and comparison of different neural network architectures and existing models, using the same datasets. Both retention time and fragmentation (MSMS) models are included.

Implementations of existing models from the literature are intended to be modifiable/extendable, for example so that tests can be carried out with different peptide input lengths.

Installation instructions

The package can be installed using Poetry after cloning the repository.
Installation instructions for Poetry itself can be found here.
Once Poetry is installed, run:

git clone git@github.com:aiproteomics/aiproteomics.git
cd aiproteomics/
poetry install

Try demo notebook

After installation, you can try out the demo notebook by running the following:

poetry run jupyter lab demo/uniprot_e2e.ipynb

This will open the notebook in JupyterLab.

Redesign in progress

This package is in the process of being redesigned to make it more general and portable. The redesign is focusing on the creation of:

  1. Generators of models (in the open and portable ONNX format)
  2. Converters from .msp format to input for each model type
  3. Converters from each model type to .msp

Below is a diagram showing how the proposed tools will be combined to produce a pipeline for training proteomics models and using them to generate synthetic spectral libraries:

Proposed aiproteomics pipeline

Contributing

If you want to contribute to the development of aiproteomics, have a look at the contribution guidelines.

Contributors

danibodor, drcandacemakedamoore, dsmits, raar1

aiproteomics's Issues

Check preloading/loading of datasets on snellius

When we want to run training on e.g. Snellius or other HPC platforms, it needs to be possible to load the TensorFlow datasets and use them on the compute nodes. Compute nodes often don't allow outgoing network connections, and even if they do, it is obviously not efficient to burn GPU time waiting for a 5 GB download.

  • Check whether pre-loading the dataset is convenient/simple enough to do in HPC environments, prior to submission of the training job on the GPU nodes
  • Are there better ways of doing this?

Add DeepDIA fragmentation model

Convert/adapt the DeepDIA model to be buildable in the aiproteomics package. See e.g. aiproteomics/frag/models/prosit_model.py for an example of a model from the literature that has been adapted. DeepDIA should be easier to adapt than the Prosit model was.

Remove TensorFlow Datasets from the package

As described in the current README (and as can be seen in the redesigned pipeline here), we will no longer be using TFDS for dataset conversion and creation. This is partly because it is currently a bit of a dependency nightmare (although this will no doubt improve as it matures), but also because we want users to have more flexibility when converting between different input sets, without relying on TensorFlow.

In preparation for the creation of the msp2model converters, we should:

  • Remove all tfds related code from the package
  • Make sure it now builds correctly

Add test for first msp2model converter

Once the test .msp data has been obtained in #21, and the first msp2model converter has been created in #20, the first test for this converter can be added to this repo. We will use pytest as the testing framework. The test should check that the right data is filtered for the chosen model (e.g. sequences longer than the maximum sequence length accepted by the model should not make it into the training set).
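A hypothetical sketch of such a filtering check, written as a pytest-style test. The function name `filter_sequences` and its `max_len` parameter are invented for illustration and are not the actual aiproteomics API:

```python
def filter_sequences(sequences, max_len=30):
    """Drop sequences longer than the model's maximum input length.

    Stand-in for the real msp2model filtering step (hypothetical API).
    """
    return [seq for seq in sequences if len(seq) <= max_len]


def test_long_sequences_are_filtered_out():
    sequences = ["PEPTIDE", "A" * 31, "ACDEFGHIK"]
    kept = filter_sequences(sequences, max_len=30)
    # The 31-residue sequence must not make it into the training set.
    assert "A" * 31 not in kept
    assert kept == ["PEPTIDE", "ACDEFGHIK"]
```

Run with `poetry run pytest` once the converter exists; the real test would call the converter on the #21 test .msp file instead of an in-memory list.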

pip vs conda install clash

pip install does not work for tensorflow==2.6.0 (it might work for a different version).
conda install does not work for tensorflow-datasets==4.5.2 (I believe it is just not set up for conda at all).

Add the autort dataset

Add the training/testing/validation data for autort as a tensorflow dataset to this package (in aiproteomics.rt.datasets)

Add 'U' filtering to csv_to_speclib

When given a csv containing sequences that are not compatible with the prediction model (e.g. the 'U' amino acid with Prosit), the library should filter them out rather than fail. Alternatively, add an earlier validation/filtering function.
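A minimal sketch of such an early validation/filtering step, assuming the model accepts only the 20 standard amino acids; the function name and alphabet are illustrative, not the real aiproteomics interface:

```python
# Assumed model alphabet: the 20 standard amino acids. The real Prosit
# alphabet also includes modified residues, so treat this as a placeholder.
PROSIT_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")

def drop_incompatible(sequences, alphabet=PROSIT_ALPHABET):
    """Return only sequences whose residues the model can handle,
    silently dropping e.g. selenocysteine ('U')."""
    return [seq for seq in sequences if set(seq) <= alphabet]
```

For example, `drop_incompatible(["PEPTIDE", "PEUPTIDE"])` keeps only `"PEPTIDE"`. A variant could log or return the dropped sequences so the user knows what was excluded.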

Next step: Read the Docs

Your Python package should have publicly available documentation, including API documentation for your users.
Read the Docs can host your user documentation for you.

To host the documentation of this repository please perform the following instructions:

  1. Go to Read the Docs
  2. Log in with your GitHub account
  3. Find https://github.com/ai-proteomics/aiproteomics in the list and press the + button.
  4. Wait for the first build to complete at https://readthedocs.org/projects/aiproteomics/builds
  5. Check that the documentation badge link in README.md works

See README.dev.md for how to build the documentation site locally.

Next step: Citation data

It is likely that your CITATION.cff currently doesn't pass validation. The error messages you get from the cffconvert GitHub Action are unfortunately a bit cryptic, but doing the following helps:

  • Check if the given-name and family-name keys need updating. If your family name has a name particle like von or van or de, use the name-particle key; if your name has a suffix like Sr or IV, use name-suffix. For details, refer to the schema description: https://github.com/citation-file-format/citation-file-format
  • Update the value of the orcid key. If you do not have an ORCID yet, you can get one at https://orcid.org/.
  • Add more authors if needed
  • Update date-released using the YYYY-MM-DD format.
  • Update the doi key with the conceptDOI for your repository (see https://help.zenodo.org for more information on what a conceptDOI is). If your project doesn't have a DOI yet, you can use the string 10.0000/FIXME to pass validation.
  • Verify that the keywords array accurately describes your project.

Once you do all the steps above, the cffconvert workflow will tell you what content it expected to see in .zenodo.json. Copy-paste from the GitHub Action log into a new file .zenodo.json. Afterwards, the cffconvert GitHub Action should be green.

To help you keep the citation metadata up to date and synchronized, the cffconvert GitHub Action checks the following 6 aspects:

  1. Whether your repository includes a CITATION.cff file.

    By including this file, authors of the software can receive credit for the work they put in.

  2. Whether your CITATION.cff is valid YAML.

    Visit http://www.yamllint.com/ to see if the contents of your CITATION.cff are valid YAML.

  3. Whether your CITATION.cff adheres to the schema (as listed in the CITATION.cff file itself under key cff-version).

    The Citation File Format schema can be found here, along with an explanation of all the keys. You're advised to use the latest available schema version.

  4. Whether your repository includes a .zenodo.json file.

    With this file, you can control what metadata should be associated with any future releases of your software on Zenodo: things like the author names, along with their affiliations and their ORCIDs, the license under which the software has been released, as well as the name of your software and a short description. If your repository doesn't have a .zenodo.json file, Zenodo will take a somewhat crude guess to assign these metadata.

    The cffconvert GitHub action will tell you what it expects to find in .zenodo.json, just copy and paste it to a new file named .zenodo.json. The suggested text ignores CITATION.cff's version, commit, and date-released. cffconvert considers these keys suspect in the sense that they are often out of date, and there is little purpose to telling Zenodo about these properties: Zenodo already knows.

  5. Whether .zenodo.json is valid JSON.

    Currently unimplemented, but you can check for yourself on https://jsonlint.com/.

  6. Whether CITATION.cff and .zenodo.json contain equivalent data.

    This final check verifies that the two files are in sync. The check ignores CITATION.cff's version, commit, and date-released.

Create comparison module for generating metrics of spectral similarity etc

We need a comparison module in the package, containing one or more tools that assess:

  1. How well does an RT model fit with a given validation set
  2. How well does a fragmentation model fit with a given validation set

It seems likely we could use scikit-learn for some of the metric calculations and possibly to make nice plots too.

For metrics of similarity between spectra, we can apparently make use of matchms to calculate a similarity score.

The comparison tools should go in a comparison directory in this repo, such that they are accessible via the package as aiproteomics.comparison.*
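As a sketch of what a spectral similarity metric could look like (independent of matchms), the following bins each spectrum's peaks onto an m/z grid and computes the cosine of the resulting intensity vectors. The bin width and function names are illustrative choices, not part of the planned aiproteomics API:

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """peaks: iterable of (mz, intensity). Returns {bin_index: summed intensity}."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned

def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra, in [0, 1] for
    non-negative intensities; 0.0 if either spectrum is empty."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

matchms offers more sophisticated scores (e.g. greedy peak matching with m/z tolerances), so this simple binned cosine would mainly serve as a baseline.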

Create or obtain an example .msp file with training data

For testing and developing the pipeline, it would be useful to have some training set data encoded in "standard" .msp format. This would be, essentially, the retention time and spectral information for as many different sequences as possible. The sequence can be encoded using PEPSEQ (or something similar) in the metadata for each entry.

Perhaps we can get something like this from @tvpham ?

Create the first msp2model converter

  • Add a msp2model directory to the base of aiproteomics (so that it would be accessible as aiproteomics.msp2model in the python package)
  • Add a module containing a function that converts a provided standard .msp spectral library file into the raw input expected by the transformer retention time model (this is the model currently generated here, but that will be made into a model generator in issue #19)
  • Make converter flexible enough to handle e.g. different input sequence lengths
  • Output as HDF5 containing the necessary input and output layer data for training
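The sequence-encoding part of such a converter could look roughly like this sketch; the alphabet ordering, padding value and function name are assumptions, not the actual aiproteomics encoding:

```python
# Hypothetical integer encoding of peptide sequences for the model input.
# Index 0 is reserved for padding; residues are numbered from 1.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(ALPHABET)}

def encode_sequence(seq, max_len=30):
    """Encode a peptide as a zero-padded list of integers,
    or None if it exceeds the model's input length (i.e. filtered out)."""
    if len(seq) > max_len:
        return None
    encoded = [AA_TO_INT[aa] for aa in seq]
    return encoded + [0] * (max_len - len(encoded))
```

The full converter would apply this to every .msp entry and write the resulting arrays (plus the retention time targets) to the HDF5 output, e.g. via h5py. Making `max_len` a parameter is what gives the flexibility for different input sequence lengths mentioned above.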

Demo notebook could use some changes and/or documentation updates

The demo notebook produces some error messages and warnings, and requires some additional packages:

  • Tensorflow gives warnings when imported and when used if it cannot find a GPU. For a demo notebook it is probably fine to run locally on systems without a GPU. It may be good to warn the user with a comment that this warning will appear, or to suppress it, for example by using the warnings module:
    import warnings
    warnings.filterwarnings('ignore')
    
  • The third cell tries to load already-trained weights; because this file is not present in the GitHub repository, it throws a 'file not found' error. The problematic line is model_frag.load_weights('trained_transformer_frag/weight_49_0.25681.hdf5'). Either the line should be removed or the file should be provided.
  • The fourth cell uses tf.keras.utils.plot_model, which needs pydot and graphviz to work. Running the cell produces the following message:
    You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.
    However, it might still be a good idea to add extra documentation (in the README?) so that the user can install the required packages beforehand. If they are used more often, they could also be added to the requirements.

Fix memory issue with spectral library generation

The csv_to_msp() function currently runs out of memory because it holds the entire output in memory. Also check the new Prosit code to see if it is more efficient/improved. (The old repo is now archived.)
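One possible fix is to stream the conversion, writing each entry as soon as it is read instead of accumulating the whole library in memory. This sketch assumes hypothetical `sequence` and `irt` CSV columns and a minimal entry format; the real csv_to_msp() schema may differ:

```python
import csv

def csv_to_msp_streaming(csv_path, msp_path):
    """Convert a CSV to .msp one row at a time, keeping memory use constant."""
    with open(csv_path, newline="") as src, open(msp_path, "w") as dst:
        for row in csv.DictReader(src):
            # Each entry is written immediately; nothing accumulates in memory.
            dst.write(f"Name: {row['sequence']}\n")
            dst.write(f"Comment: iRT={row['irt']}\n")
            dst.write("Num peaks: 0\n\n")  # real fragment peaks would go here
```

If the predictions themselves are generated in batches, the same pattern applies: predict a batch, write its entries, discard it, repeat.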

Next step: Linting

Your repository has a workflow which lints your code after every push and when creating a pull request.

The linter workflow may fail if the description or keywords field in setup.cfg is empty. Please update these fields. To validate your changes, run:

prospector

Enabling the githook will automatically lint your code on every commit. You can enable it by running the command below:

git config --local core.hooksPath .githooks

Investigate how DeepDIA might be split into msp2model, modelgen and model2msp components

A literature model we haven't adapted yet is DeepDIA (https://github.com/lmsac/DeepDIA). It would be useful to:

  • Install it, following the instructions on the DeepDIA README
  • Try it with some example input and see what outputs it is capable of producing (can it go straight to a spectral library format, for example?)
  • Look through the codebase and identify which sections could be converted to each of the three components in our design: msp2model, modelgen and model2msp
  • Write up a small report on your findings in this issue thread

Next step: Enable Zenodo integration

By enabling Zenodo integration, your package will automatically get a DOI which can be used to cite your package. After enabling Zenodo integration for your GitHub repository, Zenodo will create a snapshot and archive each release you make on GitHub. Moreover, Zenodo will create a new DOI for each GitHub release of your code.

To enable Zenodo integration:

  1. Go to http://zenodo.org and log in with your GitHub account. When you are redirected to GitHub, click Authorize application to give Zenodo permission to access your account.
  2. Go to https://zenodo.org/account/settings/github/ and enable Zenodo integration for your repository by clicking the On toggle button.
  3. Your package will get a DOI only after you make a release. Create a new release as described in README.dev.md
  4. At this point you should have a DOI. To find out the DOI generated by Zenodo:
    1. Visit https://zenodo.org/deposit and click on your repository link
    2. You will find the latest DOI in the right column in Versions box in Cite all versions? section
    3. Copy the text of the link. For example 10.5281/zenodo.1310751
  5. Update the badge in your repository
    1. Edit README.md and replace the badge placeholder with the badge link you copied in previous step.
      The badge placeholder is shown below.

      [![DOI](https://zenodo.org/badge/DOI/<replace-with-created-DOI>.svg)](https://doi.org/<replace-with-created-DOI>)

For FAQ about Zenodo please visit https://help.zenodo.org/.

Add Prosit RT dataset

Make the Prosit retention time dataset into a tensorflow dataset, available in aiproteomics.rt.datasets

_generate_examples() is really slow for large data sets

Tensorflow Datasets extend tfds.core.GeneratorBasedBuilder in order to create datasets. This has a user implemented _generate_examples() method that reads the data and returns a dict of each 'row'. This works fine for the retention time data sets which are typically very small (see e.g. aiproteomics.rt.datasets.AIProteomicsHela1) but is horrifically slow for the large datasets involved in fragmentation models.

For this issue, look at the aiproteomics.frag.datasets.AIProteomicsProsit1Frag dataset.

  • Are tensorflow datasets the best way of packaging/serving large data sets?
  • Is there a way to do this optimally? Through batching of data, or running a transform on the hdf5 data etc?

Data should be loaded in a lazy fashion

Currently, the data loading module DataSetPrositFrag loads all the data in one go.

This puts unnecessary strain on machine resources. It needs to be changed to load one batch at a time.
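A batch-at-a-time loader can be sketched with a plain Python generator; in DataSetPrositFrag the source would be an HDF5 file read slice by slice, rather than the in-memory iterable used here for illustration:

```python
def iter_batches(source, batch_size):
    """Yield consecutive batches lazily instead of materialising
    the whole dataset at once."""
    batch = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

A generator like this can also be handed to tf.data.Dataset.from_generator, so the training loop itself does not need to change.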

Create the first model generator

For this issue we will use the transformer retention time model generating code already available in the package here

  • Add a modelgen directory to the repo (that will be accessible in the python package as aiproteomics.modelgen) to put any new model generator code in
  • Adapt the code so that it outputs the model in .onnx format (optionally still allowing output as a Keras model, as it currently does)
  • Make the file optionally also runnable from the command line (using the if __name__ == "__main__": approach). When run from the command line, it should output only in ONNX format
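The command-line entry point could follow the usual argparse pattern; the flag names and defaults below are invented for illustration, and the actual model building/ONNX export step is stubbed out:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Generate a retention time model in ONNX format.")
    parser.add_argument("--output", default="rt_model.onnx",
                        help="Path of the ONNX file to write")
    parser.add_argument("--max-seq-len", type=int, default=30,
                        help="Peptide input length for the generated model")
    return parser

if __name__ == "__main__":
    # parse_known_args ignores unrelated argv entries (e.g. test-runner flags)
    args, _unknown = build_parser().parse_known_args()
    # build_and_export_onnx(args) would construct the model and save it (not shown)
    print(f"Would write ONNX model to {args.output}")
```

The Python API could expose the same builder function and additionally return the Keras model, while the CLI path always exports ONNX, matching the issue's requirement.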

Add compare module for fragmentation models

Add a compare module to aiproteomics for the fragmentation models branch that contains tools for comparing the output MSMS spectra of different models against each other and against validation data sets. A user could then import tools from aiproteomics.frag.compare.

Look into the possibility of using https://github.com/matchms/matchms (or something like it) to compare mass spectra with each other. Also check out the prosit paper for metrics used there.

Create first model2msp converter

The output layer(s) of a given model need translation/conversion to produce the spectra/iRT data in .msp format. Especially for the fragmentation models, this can represent a lot of post-processing.

  • Add a model2msp directory to the base of aiproteomics (so that it would be accessible as aiproteomics.model2msp in the python package)
  • Add a module containing a function that converts the output of the transformer retention time model (currently generated here) into .msp format. It's ok if that .msp only contains retention time for now (no spectra).
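A minimal sketch of the retention-time-only .msp output described above; the field names follow common .msp conventions, but the exact metadata layout is an assumption:

```python
def rt_predictions_to_msp(predictions):
    """predictions: iterable of (sequence, irt) pairs.
    Returns the text of a minimal .msp library with no fragment peaks."""
    entries = []
    for sequence, irt in predictions:
        entries.append(
            f"Name: {sequence}\n"
            f"Comment: iRT={irt:.2f}\n"
            "Num peaks: 0\n"   # spectra would be listed here for frag models
        )
    return "\n".join(entries)
```

For fragmentation models, the same converter would additionally translate the model's output layer into annotated (m/z, intensity) peak lists, which is where most of the post-processing effort lies.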

Improve data loading

  • Explore tensorflow dataset
  • Explore alternatives

Problem: currently the whole dataset is loaded into memory; a TensorFlow dataset might solve this.

There also needs to be functionality to download the dataset from the internet (figshare)

@raar1 can you share the link to the dataset?

Update dependencies to use tensorflow datasets

TensorFlow Datasets doesn't support the current version of TensorFlow, so we have to upgrade to the newest version.

The newest TensorFlow version doesn't support Python 3.9 anymore, so we also have to drop support for 3.9.

Next step: Sonarcloud integration

Continuous code quality can be handled by Sonarcloud. This repository is configured to use Sonarcloud to perform quality analysis and code coverage report on each push.

In order to configure Sonarcloud analysis GitHub Action workflow you must follow the steps below:

  1. go to Sonarcloud to create a new Sonarcloud project
  2. log in with your GitHub account
  3. add a Sonarcloud organization or reuse an existing one
  4. set up a repository
  5. go to the new code definition administration page and select the Number of days option
  6. To be able to run the analysis:
    1. create a token in your Sonarcloud account
    2. add the created token as SONAR_TOKEN to the secrets on GitHub

Add compare module to aiproteomics for RT

Add a compare module to aiproteomics for the retention time branch that contains tools for comparing the performance of different models against each other. A user could then import tools from aiproteomics.rt.compare and compute metrics describing how well different models perform when trained on the same data.
