How does acquisitionNum get determined in mzR? (lgatto/msidmatching)

Comments (17)

lgatto commented on August 25, 2024

Laurent, do you know what governs the choice of number for the acquisitionNum in the mzR header()?

As some of the native ID formats contain multiple integer values, this is important to ensure correct indexing. Examples from the psi-ms.obo:

A complicated case:
id: MS:1000770
name: WIFF nativeID format
def: "sample=xsd:nonNegativeInteger period=xsd:nonNegativeInteger cycle=xsd:nonNegativeInteger experiment=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format
A simple case:
id: MS:1000772
name: Bruker BAF nativeID format
def: "scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

No, I don't know. I have never used a WIFF file, so I'm not even sure
what the former would look like once converted into mzML and read into
R. Do you have a WIFF file at hand? I am happy to convert it and give it
a go.

If we can get the mzR mapping I think a simple lookup table with the
name of the IDFormats as keys and a regular expression that extracts
the correct number from the spectrumID column in the mzIDpsm class as
value would do the trick.

Then we could have:
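library(stringr)
## `lookup` is assumed to be a named character vector:
## names are ID format names, values are regular expressions
getConverter <- function(nativeID) {
  if (nativeID %in% names(lookup)) {
    regexp <- lookup[[nativeID]]
  } else {
    ## unknown format name: treat the argument itself as a regular expression
    regexp <- nativeID
  }
  ## return a function that extracts the matching part of a spectrumID
  function(spectrumID) {
    str_extract(spectrumID, regexp)
  }
}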

which would easily allow us to add to the known cases, and let users
specify their own regular expressions if they are working with an
esoteric MS data format.
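
A minimal sketch of the lookup-table idea above, using base R only. The table entries, the lookbehind regular expressions and the extra `lookup` argument are illustrative assumptions, not part of mzID or mzR; the keys follow the psi-ms.obo naming style.

```r
## Hypothetical lookup table: ID format name -> regular expression (PCRE).
lookup <- c(
  "Thermo nativeID format"             = "(?<=scan=)[0-9]+",
  "Bruker BAF nativeID format"         = "(?<=scan=)[0-9]+",
  "multiple peak list nativeID format" = "(?<=index=)[0-9]+"
)

## Build a converter for a known format name, or for a user-supplied regex.
getConverter <- function(nativeID, lookup) {
  regexp <- if (nativeID %in% names(lookup)) lookup[[nativeID]] else nativeID
  function(spectrumID) {
    m <- regexpr(regexp, spectrumID, perl = TRUE)
    out <- rep(NA_character_, length(spectrumID))
    out[m > 0] <- regmatches(spectrumID, m)  # only matched elements are filled
    as.integer(out)                          # unmatched elements stay NA
  }
}

thermo <- getConverter("Thermo nativeID format", lookup)
thermo(c("controllerType=0 controllerNumber=1 scan=17", "no scan here"))

## A user-defined regular expression for a format missing from the table
lastNumber <- getConverter("[0-9]+$", lookup)
lastNumber("sample=1 period=1 cycle=5 experiment=2")
```

Passing the table as an argument rather than relying on a global keeps the converter easy to test and lets users supply their own entries.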

Yes, that seems a good way forward.

Laurent


thomasp85 commented on August 25, 2024

No, I don't have any of the files. The examples were just picked semi-randomly to illustrate the difficulties. Furthermore, I don't think we should schedule an investigation of all possible MS data formats. For this to be viable we need to get the information from the source code of the parser. As far as I remember, it still uses RAMP?

In reply to Laurent Gatto's email of 05 Feb 2014, 15:41.

lgatto commented on August 25, 2024

Yes, indeed.


thomasp85 commented on August 25, 2024

Do you have contact with any of the SPC folks who might know the inner workings of RAMP, or do we need to dive into the TPP source code?

In reply to Laurent Gatto's email of 05 Feb 2014, 16:23.


lgatto commented on August 25, 2024

See https://github.com/sneumann/mzR/tree/master/src


thomasp85 commented on August 25, 2024

Yeah I realized it must be in the mzR source right after I hit the send button… I’ll go diggin’ tomorrow

In reply to Laurent Gatto's email of 05 Feb 2014, 17:25.


sgibb commented on August 25, 2024

Hello Thomas, hello Laurent,

I am not quite sure whether I have nailed it down completely, but it seems to be safe to ignore the vendor-specific nativeIDs and to match the acquisitionNum against the last number present in the mzID spectrumID.

RAMP ignores most of the nativeID content (e.g. controllerType=0, controllerNumber=1, scan=xsd:positiveInteger for Thermo) and simply uses the scan number:
mzR/src/pwiz/data/msdata/RAMPAdapter.cpp, ll. 141-211

void RAMPAdapter::Impl::getScanHeader(size_t index, ScanHeaderStruct& result, bool reservePeaks /*= true*/) const
{
    // ...
    result.acquisitionNum = getScanNumber(index); 
    // ...
}

mzR/src/pwiz/data/msdata/RAMPAdapter.cpp, ll. 126-138

int RAMPAdapter::Impl::getScanNumber(size_t index) const
{
    const SpectrumIdentity& si = msd_.run.spectrumListPtr->spectrumIdentity(index);
    string scanNumber = id::translateNativeIDToScanNumber(nativeIdFormat_, si.id);

    if (scanNumber.empty()) // unsupported nativeID type
    {
        // assume scanNumber is a 1-based index, consistent with this->index() method
        return static_cast<int>(index) + 1;
    } 
    else
        return lexical_cast<int>(scanNumber);
}

mzR/src/pwiz/data/msdata/MSData.cpp, ll. 552-580

PWIZ_API_DECL string translateNativeIDToScanNumber(CVID nativeIdFormat, const string& id)
{
    switch (nativeIdFormat)
    {
        case MS_spectrum_identifier_nativeID_format: // mzData
            return value(id, "spectrum");

        case MS_multiple_peak_list_nativeID_format: // MGF
            return value(id, "index");

        case MS_Agilent_MassHunter_nativeID_format:
            return value(id, "scanId");

        case MS_Thermo_nativeID_format:
            // conversion from Thermo nativeIDs assumes default controller information
            if (id.find("controllerType=0 controllerNumber=1") != 0)
                return "";

            // fall through to get scan

        case MS_Bruker_Agilent_YEP_nativeID_format:
        case MS_Bruker_BAF_nativeID_format:
        case MS_scan_number_only_nativeID_format:
            return value(id, "scan");

        default:
            return "";
    }
}
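
To make the behaviour of the three C++ excerpts above easier to experiment with, here is a rough R sketch of the same logic. The function names, the use of psi-ms.obo format names as keys, and the 1-based `index` argument are assumptions made for illustration; this is not code from mzR or ProteoWizard.

```r
## Extract the value of "key=value" from a space-separated nativeID string.
nativeIdValue <- function(id, key) {
  fields <- strsplit(id, " ", fixed = TRUE)[[1]]
  hit <- fields[startsWith(fields, paste0(key, "="))]
  if (length(hit) == 0) return(NA_integer_)
  as.integer(sub("^[^=]*=", "", hit[1]))
}

## Mirror of getScanNumber(): translate the nativeID if the format is
## supported, otherwise fall back to the 1-based spectrum index.
scanNumberFromNativeId <- function(id, format, index) {
  key <- switch(format,
    "spectrum identifier nativeID format" = "spectrum",
    "multiple peak list nativeID format"  = "index",
    "Agilent MassHunter nativeID format"  = "scanId",
    ## Thermo: only the default controller information is translated
    "Thermo nativeID format" =
      if (startsWith(id, "controllerType=0 controllerNumber=1")) "scan"
      else NA_character_,
    "Bruker/Agilent YEP nativeID format"  = ,
    "Bruker BAF nativeID format"          = ,
    "scan number only nativeID format"    = "scan",
    NA_character_)                          # unsupported format
  scan <- if (is.na(key)) NA_integer_ else nativeIdValue(id, key)
  if (is.na(scan)) as.integer(index) else scan
}

scanNumberFromNativeId("controllerType=0 controllerNumber=1 scan=42",
                       "Thermo nativeID format", index = 7)
scanNumberFromNativeId("sample=1 period=1 cycle=5 experiment=2",
                       "WIFF nativeID format", index = 7)
```

The second call shows the fallback: an unsupported format such as WIFF yields the spectrum index rather than a value parsed from the nativeID.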

It seems that MS:1000770 (WIFF), MS:1000773 (Bruker FID) and MS:1000775 (single peak list), mentioned in section 5.1.3 "Use of identifiers for input spectra to a search" of the mzIdentML Specification Document, are not yet supported by PWIZ/RAMP, so for these formats getScanNumber() falls back to the 1-based spectrum index.

IMHO something like this should be sufficient (at least for all ID formats supported by PWIZ/RAMP):

acquisitionNum <- header(mzML)$acquisitionNum
mzIdScanNum <- as.numeric(sub("^.*=([[:digit:]]+)$", "\\1",
                          flattenMzId$spectrumID))
m <- match(acquisitionNum, mzIdScanNum)
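
The same three steps can be tried without any data files. The vectors below are made-up stand-ins for header(mzML)$acquisitionNum and flattenMzId$spectrumID, so this is only a sketch of the matching step, not of the mzR or mzID API.

```r
## Toy stand-ins (hypothetical values)
acquisitionNum <- c(1L, 2L, 3L, 4L)
spectrumID <- c("controllerType=0 controllerNumber=1 scan=3",
                "controllerType=0 controllerNumber=1 scan=1",
                "controllerType=0 controllerNumber=1 scan=4")

## Keep only the digits after the last "=" in each spectrumID
mzIdScanNum <- as.integer(sub("^.*=([[:digit:]]+)$", "\\1", spectrumID))

## For each spectrum, the row of the first identification with that scan number
m <- match(acquisitionNum, mzIdScanNum)
m
```

Here `m` is NA for spectra without an identification. Note that match() returns only the first hit, so spectra with several identifications would need a different lookup (for example split() on the scan number).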

I will try to implement a prototype in the next few days. Any comments?

Best wishes,

Sebastian


thomasp85 commented on August 25, 2024

Hi Sebastian

This is great news as it greatly reduces the complexity of the problem - furthermore I have implemented something similar in MSGFgui as a placeholder (albeit with a different regex), so it’s nice to know that it should be pretty stable.

I still believe this should be put in a new package that handles mzR/mzID interfacing and common operations, so that mzR and mzID stay focused on parsing raw data…

Thanks for looking into it!

best

Thomas

In reply to Sebastian Gibb's email of 19 Mar 2014, 13:58.


lgatto commented on August 25, 2024

Dear Thomas,

Could you clarify what your aims are with an mzID/mzR interface package? If it is just that one function, I think it might be a bit light. You probably have other plans.

Laurent


thomasp85 commented on August 25, 2024

Certainly : )

My idea is to expose a single high-level object that takes care of communicating between mzR and mzID objects and contains proteomics-related methods useful for evaluating proteomics experiments, or for extending when building new proteomics packages.

Relevant methods include several plots, such as annotated MS2 spectra, parent ion EIC etc., as well as summary functions, getters and filters that incorporate information from both raw data and identification data.

This would potentially be a rather lightweight package, but I don’t see this as a problem. I see a lot of benefits in the future for this kind of class, as the two data types often go hand in hand…

best

Thomas

In reply to Laurent Gatto's email of 20 Mar 2014, 04:43.


lgatto commented on August 25, 2024

That's pretty much what MSnbase already does, except that the link with the
identification data was not straightforward. But it will be now.

Laurent


thomasp85 commented on August 25, 2024

Well, then there's less work : ) I’ll begin contributing there then…

best

Thomas

In reply to Laurent Gatto's email of 20 Mar 2014, 09:33.


thomasp85 commented on August 25, 2024

The reason why I didn’t think of this is that when I first read about MSnbase it was described as a package for labelled proteomics, which I don’t do. Has the scope of the package moved beyond that since its release?


sgibb commented on August 25, 2024

In my opinion it would be best to "translate" the nativeIDs into acquisitionNum in the mzID package. Matching spectra and identification information would then be very easy, e.g. m <- match(header(mzML)$acquisitionNum, flattenMzId$spectrumID) would be enough.

Maybe it would also be good to rename the spectrumID column to acquisitionNum (perhaps only if the translation was done) to avoid confusing users with different names/IDs.

Please see also my PR: thomasp85/mzID#17


thomasp85 commented on August 25, 2024

That's also a possibility, though in that case I would just add an additional column and keep spectrumID as is…
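
As a rough illustration of the "additional column" option, the sketch below adds a translated scan number next to the untouched nativeID. The data frame and its column names are hypothetical stand-ins for a flattened mzID result, not output of the mzID package.

```r
## Toy stand-in for a flattened identification table (hypothetical values)
psms <- data.frame(
  spectrumid = c("controllerType=0 controllerNumber=1 scan=12",
                 "controllerType=0 controllerNumber=1 scan=30"),
  pepseq = c("PEPTIDEK", "SAMPLER"),
  stringsAsFactors = FALSE
)

## Add the translated number as a new column; spectrumid is left as parsed
psms$acquisitionnum <- as.integer(
  sub("^.*=([[:digit:]]+)$", "\\1", psms$spectrumid))

psms[, c("spectrumid", "acquisitionnum")]
```

The parsed data stays intact at the cost of one extra integer column.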

In reply to Sebastian Gibb's email of 20 Mar 2014, 11:31.


sgibb commented on August 25, 2024

I think there is no need to add a new column. Nobody who uses mzR and mzID is interested in the nativeIDs (and anyone who is could use translateNativeIDs=FALSE). IMHO it is just a waste of memory (OK, it is only 1 Mb, but size sometimes matters 😉):

fid <- flatten(mzID("Thermo_Hela_PRTC_1_MS2cent.mzid", translateNativeIDs=FALSE, verbose=FALSE))
fid2 <- flatten(mzID("Thermo_Hela_PRTC_1_MS2cent.mzid", translateNativeIDs=TRUE, verbose=FALSE))
print(object.size(fid$spectrumid), units="Mb")
## 1.2 Mb
print(object.size(fid2$spectrumid), units="Mb")
## 0.1 Mb
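
The size difference can be reproduced without the .mzid file. This is a small sketch with synthetic Thermo-style nativeIDs (made-up values); exact sizes depend on the platform and R version, so none are quoted here.

```r
## Synthetic nativeIDs, one per spectrum (hypothetical data)
n <- 10000L
nativeIDs <- paste0("controllerType=0 controllerNumber=1 scan=", seq_len(n))

## The translated representation: one integer per spectrum
scanNums <- as.integer(sub("^.*=([[:digit:]]+)$", "\\1", nativeIDs))

sizeChar <- object.size(nativeIDs)  # character vector of unique strings
sizeInt <- object.size(scanNums)    # plain integer vector
print(sizeChar, units = "Kb")
print(sizeInt, units = "Kb")
```

The integer vector is smaller because each unique string carries its own per-object overhead on top of its characters, while an integer takes four bytes.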


thomasp85 commented on August 25, 2024

It is mainly from the viewpoint that a parser should not change or remove existing data that gets parsed. Per my reply to your PR, I think it should be calculated at parsing, so that people are not limited to using flatten() for this feature, and in that case removing spectrumID would be changing the parsed data… Don’t know whether this makes sense?

In reply to Sebastian Gibb's email of 20 Mar 2014, 11:46.
