aalto-ics-kepaco / msms_rt_score_integration

Code, Data and Results of the publication: "Probabilistic Framework for Integration of Tandem-Mass Spectrum and Retention Time Information in Small Molecule Identification" by Bach et al. 2020

License: Other

MAXScript 48.58% Python 22.52% Jupyter Notebook 24.70% Shell 4.20%

metabolite-identification mass-spectrum retention-order-prediction lc-msms

msms_rt_score_integration's Issues

[REV] Error Bars for Figure 2

Figure 2 shows the averaged (across all four data-setups) MS + RT score-integration framework performance for different numbers of random spanning trees (used to approximate the MRF). The reviewer asked to add some bars indicating the variance to the plot.

I propose to present the variance for each data-setup (i.e. ionization mode and dataset) separately. The baseline performance (i.e. topk-accuracy) differs strongly between the data-setups, so much of the variance of the averaged curve would come from this fact alone.
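A minimal sketch of the per-setup summary proposed above. The function name `summarize_by_setup` and the record layout are hypothetical, and it assumes that the spread comes from several repeated evaluations per data-setup and number of trees.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_setup(records):
    """Mean and standard deviation of the topk-accuracy per (data_setup, n_trees).

    records: iterable of (data_setup, n_trees, topk_accuracy) tuples,
    one tuple per repeated evaluation (hypothetical layout).
    """
    grouped = defaultdict(list)
    for setup, n_trees, acc in records:
        grouped[(setup, n_trees)].append(acc)

    summary = {}
    for key, accs in grouped.items():
        # stdev needs at least two values; fall back to 0.0 otherwise
        spread = stdev(accs) if len(accs) > 1 else 0.0
        summary[key] = (mean(accs), spread)
    return summary
```

Computing the spread within each data-setup keeps the difference in baseline performance between setups out of the error bars.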

Incorrect Margin Normalization

Background
The normalized marginals are calculated as follows (see Section 2.3.1 and 2.3.4):

Margin type	Formula
Sum-margin
Max-margin

The current implementation of the margin normalization is not exactly the (correct) one described in the paper
In the _marginals function all marginals ( and ) are divided by . Originally intended to "increase the numerical stability" of the exponentiation (np.exp), this division actually interferes with the normalization for the sum-margin. Further down in the pipeline, get_marginals and get_max_marginals both use the same normalization function, _normalize_marginals_sum_to_one, if normalized marginals are requested. This is only correct for the sum-marginals, though.

Up to this point: if no normalization is requested (normalize=False), then the max-marginals are calculated correctly, but the sum-marginals have undergone an additional "normalization" step.

In the eval__TFG scripts, we actually request un-normalized marginals. Those are passed to the margin aggregation (over the tree ensemble), where they are normalized according to the sum-margin formula before the topk-acc is calculated for the parameter selection. Later, when the marginals are calculated for the test set, normalized marginals are requested, and this normalization is incorrect for the max-marginals.

How to fix this

Implement a normalization function for the exact solver that is specific to the desired margin
Remove the "numerical stability trick"
Simplify the aggregation function for the parameter estimation so that it always takes normalized margins as input, since the topk-acc is calculated from the normalized marginals anyway
Extra: Implement a class that handles a set of tree models and cleanly implements the margin aggregation approach presented in the paper.
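A minimal sketch of what margin-specific normalization could look like. Both function names are hypothetical and not part of the repository. It assumes that the un-normalized marginals are available in log-space, that sum-marginals are normalized to sum to one, and that max-marginals are normalized so that the largest value equals one (an assumption about the paper's max-margin formula). Subtracting the maximum before exponentiating is the usual way to keep the exponentiation stable without changing the normalized result.

```python
import math


def normalize_sum_marginals(log_marginals):
    """Normalize sum-marginals so that they sum to one (hypothetical helper)."""
    shift = max(log_marginals)  # cancels out in the ratio below
    exps = [math.exp(v - shift) for v in log_marginals]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_max_marginals(log_marginals):
    """Normalize max-marginals so that the largest value is one (assumed convention)."""
    shift = max(log_marginals)
    return [math.exp(v - shift) for v in log_marginals]
```

Because the shift cancels in the sum-marginal ratio, the stability trick no longer interferes with the normalization, and each margin type has its own function.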

Interface to apply score-integration framework to new data

Design of an interface able to process a list of MS-features, with

retention times (RTs)
precursor ion mass (MS1)
(optional) fragmentation spectra (MS2)
(optional) molecular candidate list

and output ranked candidate lists using the MS and RT information.
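The input described above could be represented roughly as follows. This is only a sketch: the class name `MSFeature`, its field names, and the peak-list representation are hypothetical, and the units of the retention time are left to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MSFeature:
    """One MS-feature as input to the score integration (hypothetical container)."""

    retention_time: float  # RT, unit chosen by the caller
    precursor_mz: float    # precursor ion mass (MS1)
    ms2_peaks: Optional[List[Tuple[float, float]]] = None  # (m/z, intensity) pairs
    candidates: Optional[List[str]] = None  # molecular candidate identifiers

    def has_ms2(self) -> bool:
        """True if a non-empty fragmentation spectrum is attached."""
        return bool(self.ms2_peaks)


def sort_by_retention_time(features: List[MSFeature]) -> List[MSFeature]:
    """Return the features ordered by increasing retention time."""
    return sorted(features, key=lambda f: f.retention_time)
```

Keeping MS2 and the candidate list optional mirrors the list above: features without a spectrum or without candidates can still be passed through the same interface.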

First steps:

Which input to expect: MS2 (which format), mzXML (more raw format), ...?
Outline different pipelines, e.g.
- Data -> SIRIUS -(rest api)-> Score integration -> Output
- Data -> matchms -> Score integration -> Output
Evaluate the potential of OpenMS (+ KNIME)

Add slurm scripts to re-run experiments

Slurm scripts to re-run experiments are not in the repository.

To Do

add the slurm scripts
rename them according to the sections / experiments in the paper
add a status file tracking the re-running progress

Fix calculation of protonated and deprotonated mass for the Missing MS2 experiments

Background
To calculate the protonated mass of a molecule, we need to add the mass of a proton (i.e. a hydrogen ion). For deprotonation, we subtract it.

Bug
The current implementation of get_measured_mass, however, adds or subtracts the mass of a hydrogen atom instead.

Fix

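mass_of_proton = 1.007276  # proton mass in Da (fixed value), not the hydrogen atom mass
if adduct == "[M+H]":
    # protonated ion: remove the proton mass from the precursor m/z
    measured_mass = precursor_mz - mass_of_proton
elif adduct == "[M-H]":
    # deprotonated ion: add the proton mass back
    measured_mass = precursor_mz + mass_of_proton
else:
    raise ValueError(f"Unsupported adduct: {adduct}")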
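To get a feeling for the size of the bug, the following sketch computes the relative error introduced by using the hydrogen atom mass instead of the proton mass. The helper `ppm_error` is hypothetical, and 1.007825 Da is the monoisotopic mass of a neutral hydrogen atom. The two masses differ by roughly the mass of one electron.

```python
MASS_OF_PROTON = 1.007276    # Da, value used in the fix above
MASS_OF_HYDROGEN = 1.007825  # Da, monoisotopic mass of a neutral hydrogen atom


def ppm_error(precursor_mz):
    """Mass error in ppm of the precursor m/z when the hydrogen mass is used."""
    return (MASS_OF_HYDROGEN - MASS_OF_PROTON) / precursor_mz * 1e6
```

For a precursor m/z of 200 this gives an error of roughly 2.7 ppm, which can matter when candidates are selected with a narrow mass window.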

Create a separate function to reproduce each table and figure in paper

Encapsulate the code to produce the tables and figures from the paper into separate functions.

To Do

Extract the code related to each table and figure from the notebooks
Use the encapsulated code in the notebooks
Write a script that produces all figures from the paper in the correct format and size

Add function to regenerate Table S2

Comparison of the retention order prediction using RankSVM and CDK XLogP.

[REV] Analyse the Margins for correct and incorrect top-1 Ranked Structures

The reviewer pointed out that we only look at the ranking performance of our score-integration framework. She/he suggests analysing the candidate margins, e.g. by looking at the margin values of correct and incorrect top-1 ranked molecular structures.

First, do we have the relevant margin values already computed?
Comparison between margin-values of correct and incorrect top-1 structures (e.g. for ~~Only MS vs.~~ MS + RT)
Are the correct top-1 structure margin values significantly different from the remaining ones (log-odds?)
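One way to approach the comparison is sketched below. The function names and the input layout are hypothetical, and the log-odds are taken here between the best and the second-best normalized marginal, which is only one possible reading of "log-odds". Each candidate list needs at least two candidates with strictly positive marginals.

```python
import math


def top1_log_odds(marginals):
    """Log-odds between the best and the second-best marginal (assumed definition)."""
    ordered = sorted(marginals, reverse=True)
    return math.log(ordered[0]) - math.log(ordered[1])


def split_top1_by_correctness(candidate_lists):
    """Group the top-1 log-odds by whether the top-1 candidate is the correct one.

    candidate_lists: iterable of (marginals, index_of_correct_candidate).
    """
    correct, incorrect = [], []
    for marginals, correct_idx in candidate_lists:
        top1_idx = max(range(len(marginals)), key=marginals.__getitem__)
        target = correct if top1_idx == correct_idx else incorrect
        target.append(top1_log_odds(marginals))
    return correct, incorrect
```

The two returned lists can then be compared, for example with histograms or a two-sample test.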

[REV] Measure Running Times

I propose we run the score integration for each (dataset, ionization) separately for a growing number of (MS, RT)-tuples. Let's say we run it for 15, 30, 45, 60 and 75 (MS, RT)-tuples. We need to track the following quantities:

training runtime, i.e. hyper-parameter estimation (D)
test runtime, i.e. scoring a set of (MS, RT)-tuples
memory consumption
median length of the candidate lists (maybe even more statistics)

~~All experiments will run on a T470p with i5 4-core.~~

Note: Assuming the same number of (MS, RT)-tuples and candidates, the training runtime should be just a factor of |D| longer than the test runtime, where |D| is the number of hyper-parameter values evaluated.
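The quantities listed above could be collected with the Python standard library, as in this sketch. `profile_call` and `median_candidate_list_length` are hypothetical helpers. `tracemalloc` only sees memory allocated through Python's allocator, so memory used inside native libraries may be missed, and tracing itself slows the call down. For clean timings, runtime and memory may be better measured in separate runs.

```python
import time
import tracemalloc
from statistics import median


def profile_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds, peak traced memory in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes


def median_candidate_list_length(candidate_lists):
    """Median number of candidates per (MS, RT)-tuple."""
    return median(len(candidates) for candidates in candidate_lists)
```

Calling `profile_call` once for the hyper-parameter estimation and once for the scoring, for each number of (MS, RT)-tuples, would give the training runtime, test runtime, and memory figures per data-setup.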

Fix triton script to run TFG with hinge_sigmoid and re-run experiments

fix triton script for TFG with hinge_sigmoid
re-run experiments (only goes to supplementary material)
