
aalto-ics-kepaco / msms_rt_score_integration


Code, Data and Results of the publication: "Probabilistic Framework for Integration of Tandem-Mass Spectrum and Retention Time Information in Small Molecule Identification" by Bach et al. 2020

License: Other

MAXScript 48.58% Python 22.52% Jupyter Notebook 24.70% Shell 4.20%
metabolite-identification mass-spectrum retention-order-prediction lc-msms

msms_rt_score_integration's People

Contributors

bachi55

Forkers

animesh mattoslmp

msms_rt_score_integration's Issues

[REV] Error Bars for Figure 2

Figure 2 shows the performance of the MS + RT score-integration framework, averaged across all four data-setups, for different numbers of random spanning trees (used to approximate the MRF). The reviewer asked us to add error bars indicating the variance to the plot.

I propose to present the variance for each data-setup (i.e. ionization mode and dataset) separately. The baseline performance (i.e. top-k accuracy) is very different for each data-setup, so much of the variance in the averaged plot would come from this difference alone.
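
A minimal sketch of the per-setup summary proposed above. The setup names and accuracy values are made up for illustration only and are not results from the paper; `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev

# Made-up top-k accuracies (%), one value per repetition, for illustration only.
accuracies = {
    ("dataset_A", "positive"): [20.0, 22.0, 24.0],
    ("dataset_A", "negative"): [40.0, 41.0, 45.0],
}


def summarize(acc_by_setup):
    """Return {setup: (mean, sample standard deviation)} for each data-setup."""
    return {
        setup: (mean(values), stdev(values))
        for setup, values in acc_by_setup.items()
    }


summary = summarize(accuracies)
for setup, (avg, spread) in summary.items():
    print(setup, round(avg, 2), round(spread, 2))
```

The standard deviations could then be passed as `yerr` to matplotlib's `errorbar`, with one curve per data-setup instead of a single averaged curve.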

Incorrect Margin Normalization

Background
The normalized marginals are calculated as follows (see Sections 2.3.1 and 2.3.4):

Margin type    Formula
Sum-margin     [image]
Max-margin     [image]

The current implementation of the margin normalization is not exactly the (correct) one described in the paper.
In the _marginals function, all marginals ([image] and [image]) are divided by [image]. Although this was originally intended to "increase the numerical stability" of the exponentiation (np.exp), it actually interferes with the normalization for the sum-margin. Further down the pipeline, get_marginals and get_max_marginals both use the same normalization function, _normalize_marginals_sum_to_one, if normalized marginals are requested. This is only correct for the sum-marginals, though.

Up to this point: if no normalization is requested (normalize=False), then the max-marginals are calculated correctly, but the sum-marginals go through an additional "normalization" step.

In the eval__TFG scripts, we actually request un-normalized marginals. Those are passed to the margin aggregation (over the tree ensemble), where they are normalized according to the sum-margin formula before the top-k accuracy is calculated for the parameter selection. Later, when the marginals are calculated for the test set, normalized marginals are requested, and this normalization is incorrect for the max-marginals.
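
To illustrate why one shared normalization function cannot serve both margin types, here is a small sketch. The formulas themselves are not preserved in the text above, so this rests on an assumption: sum-marginals are scaled to sum to one, while max-marginals are scaled by their largest value. Both function names are hypothetical and are not part of the repository.

```python
def normalize_sum(marginals):
    """Scale the values so that they sum to one (assumed sum-margin rule)."""
    total = sum(marginals)
    return [m / total for m in marginals]


def normalize_max(marginals):
    """Scale the values so that the largest is one (assumed max-margin rule)."""
    largest = max(marginals)
    return [m / largest for m in marginals]


unnormalized = [2.0, 6.0, 8.0]  # toy values for one candidate list
sum_normalized = normalize_sum(unnormalized)
max_normalized = normalize_max(unnormalized)
print(sum_normalized)
print(max_normalized)
```

Both variants preserve the candidate ranking within a single list, but the resulting values differ, which matters as soon as marginals are aggregated over the tree ensemble.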

How to fix this

Interface to apply score-integration framework to new data

Design of an interface able to process a list of MS-features, with

  • retention times (RTs)
  • precursor ion mass (MS1)
  • (optional) fragmentation spectra (MS2)
  • (optional) molecular candidate list

and output ranked candidate lists using the MS and RT information.
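
One possible shape for the input described above, as a sketch only. The class name, field names and the choice of plain tuples for MS2 peaks are assumptions for illustration; nothing like this exists in the repository yet, and the retention time unit is left open.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MSFeature:
    """Hypothetical container for one MS-feature."""

    retention_time: float  # RT, unit to be fixed by the interface
    precursor_mz: float    # precursor ion mass (MS1)
    # Optional fragmentation spectrum (MS2) as (m/z, intensity) pairs.
    ms2_peaks: Optional[List[Tuple[float, float]]] = None
    # Optional molecular candidate list, e.g. as structure identifiers.
    candidates: Optional[List[str]] = None

    def has_ms2(self) -> bool:
        return self.ms2_peaks is not None


def validate(features: List[MSFeature]) -> None:
    """Raise ValueError if a feature has a non-physical RT or precursor mass."""
    for i, feat in enumerate(features):
        if feat.retention_time < 0:
            raise ValueError(f"feature {i}: negative retention time")
        if feat.precursor_mz <= 0:
            raise ValueError(f"feature {i}: non-positive precursor m/z")


features = [
    MSFeature(retention_time=3.2, precursor_mz=181.0707),
    MSFeature(retention_time=5.9, precursor_mz=302.1, ms2_peaks=[(85.03, 100.0)]),
]
validate(features)
print([f.has_ms2() for f in features])
```

Keeping the MS2 spectrum and the candidate list optional mirrors the list above: a feature with only RT and MS1 information is still a valid input.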

First steps:

  • Which input to expect: MS2 (which format), mzXML (more raw format), ...?
  • Outline different pipelines, e.g.
    • Data -> SIRIUS -(rest api)-> Score integration -> Output
    • Data -> matchms -> Score integration -> Output
  • Evaluate the potential of OpenMS (+ KNIME)

Add slurm scripts to re-run experiments

Slurm scripts to re-run experiments are not in the repository.

To Do

  • add the slurm scripts
  • rename them according to the sections / experiments in the paper
  • add a status file tracking the re-running progress

Fix calculation of protonated and deprotonated mass for the Missing MS2 experiments

Background
To calculate the protonated mass of a molecule, we need to add the mass of a proton (i.e. a hydrogen ion). For deprotonation, we subtract it.

Bug
The current implementation of get_measured_mass, however, adds or removes the mass of a hydrogen atom instead of the mass of a proton.

Fix

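mass_of_proton = 1.007276  # mass of a proton in Da (fixed value)
if adduct == "[M+H]":
    # protonated ion: subtract the proton mass from the precursor m/z
    measured_mass = precursor_mz - mass_of_proton
elif adduct == "[M-H]":
    # deprotonated ion: add the proton mass to the precursor m/z
    measured_mass = precursor_mz + mass_of_proton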
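
The fix can be wrapped into a self-contained function, sketched below. The function names are hypothetical (this is not the repository's get_measured_mass), and the sketch assumes singly charged ions and that the result is meant as the mass of the neutral molecule. The inverse function is added only to check the round trip.

```python
PROTON_MASS = 1.007276  # Da, same fixed value as in the fix above


def neutral_mass_from_precursor(precursor_mz: float, adduct: str) -> float:
    """Convert a precursor m/z to a neutral mass (singly charged ions assumed)."""
    if adduct == "[M+H]":
        return precursor_mz - PROTON_MASS
    if adduct == "[M-H]":
        return precursor_mz + PROTON_MASS
    raise ValueError(f"unsupported adduct: {adduct}")


def precursor_mz_from_neutral(neutral_mass: float, adduct: str) -> float:
    """Inverse conversion: protonation adds, deprotonation subtracts a proton."""
    if adduct == "[M+H]":
        return neutral_mass + PROTON_MASS
    if adduct == "[M-H]":
        return neutral_mass - PROTON_MASS
    raise ValueError(f"unsupported adduct: {adduct}")


neutral = neutral_mass_from_precursor(181.070665, "[M+H]")
print(round(neutral, 6))
```

Raising an error for any other adduct makes unsupported ionization types visible instead of silently leaving the mass undefined.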

Create a separate function to reproduce each table and figure in paper

Encapsulate the code to produce the tables and figures from the paper into separate functions.

To Do

  • Extract code related to each table and figures from the notebooks
  • Use the encapsulated code in the notebooks
  • Write a script that produces all figures from the paper in the correct format and size

[REV] Analyse the Margins for correct and incorrect top-1 Ranked Structures

The reviewer pointed out that we only look at the ranking performance of our score-integration framework. She/he suggests analysing the candidate margins and, for example, looking at the margin values for correct and incorrect top-1 ranked molecular structures.

  • First, do we have the relevant margin values already computed?
  • Comparison between margin-values of correct and incorrect top-1 structures (e.g. for Only MS vs. MS + RT)
  • Are the margin values of correct top-1 structures significantly different from the remaining ones (log-odds?)?
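
A sketch of the comparison in the second bullet, assuming the margin values are available as one list of candidate scores per spectrum together with the index of the correct candidate. The function name, data layout and values are made up for illustration.

```python
from statistics import median


def split_top1_margins(margins_per_spectrum, correct_index_per_spectrum):
    """Collect the top-1 margin value, split by whether top-1 is correct."""
    correct, incorrect = [], []
    for margins, correct_idx in zip(margins_per_spectrum, correct_index_per_spectrum):
        # Index of the highest-scoring candidate for this spectrum.
        top1_idx = max(range(len(margins)), key=margins.__getitem__)
        if top1_idx == correct_idx:
            correct.append(margins[top1_idx])
        else:
            incorrect.append(margins[top1_idx])
    return correct, incorrect


# Toy margin values for three spectra (not from the paper).
margins = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.1, 0.3, 0.6]]
correct_indices = [0, 1, 2]

top1_correct, top1_incorrect = split_top1_margins(margins, correct_indices)
print(median(top1_correct), median(top1_incorrect))
```

Running the same split on the Only MS and the MS + RT margins would give the two groups per setting that the bullet list asks to compare.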

[REV] Measure Running Times

I propose we run the score integration for each (dataset, ionization) pair separately for a growing number of (MS, RT)-tuples. Let's say we run it for 15, 30, 45, 60 and 75 (MS, RT)-tuples. We need to track the following quantities:

  • training runtime, i.e. hyper-parameter estimation (D)
  • test runtime, i.e. scoring a set of (MS, RT)-tuples
  • memory consumption
  • median length of the candidate lists (maybe even more statistics)

All experiments will run on a T470p with i5 4-core.

Note: Assuming the same number of (MS, RT)-tuples and candidates, the training runtime should be just a factor of |D| longer than the test runtime.
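
A minimal sketch of how runtime and memory could be tracked with the Python standard library. `measure` is a hypothetical helper, and `dummy_scoring` is only a stand-in for the actual score-integration call; the peak value covers only allocations traced by tracemalloc.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Return (result, elapsed seconds, peak traced memory in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def dummy_scoring(n_tuples):
    """Placeholder workload standing in for scoring n (MS, RT)-tuples."""
    return [i * i for i in range(n_tuples)]


for n in (15, 30, 45, 60, 75):
    _, seconds, peak_bytes = measure(dummy_scoring, n)
    print(n, seconds, peak_bytes)
```

Calling the same helper once around the hyper-parameter estimation and once around the test-set scoring would yield the training and test runtimes listed above.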
