Is it feasible to switch to biobambam2 for the duplicate marking and BAM validation st

biobambam2 instead of picard-tools? about paleomix HOT 10 OPEN

sameoldmike commented on July 29, 2024

biobambam2 instead of picard-tools?

from paleomix.

Comments (10)

MikkelSchubert commented on July 29, 2024

Hi Mike,

Thank you for the suggestion!
I've actually been looking for alternatives to ValidateSamFile due to the poor performance, so I'll definitely take a look at biobambam2. However, at a glance it seems like it seems like bamvalidate performs a much more superficial validation (it only seems to check for straight up corrupt BAM files/records), so I'm not sure can be used is a full replacement for ValidateSamFile.

For duplicate marking I've switched to samtools markdup on 2.0-alpha version of paleomix, which was a pretty significant improvement, but note that this version is undergoing a lot of changes at the moment.

Cheers,
Mikkel

from paleomix.

sameoldmike commented on July 29, 2024

Regarding it being a superficial check, agreed that this is the case! So possibly not a great suggestion, but for practical reasons we often have to use it as an alternative to picard-tools.

By the way, on large HPC environments (e.g. computerome2 or similar), our solution for paleomix has been to run each sample in its own yaml file on its own compute node. Then we don't get all the failing BAM validation and markduplicates steps -- I still feel it's related to all the open files, but we can't seem to solve this issue when running >100 large genome samples in a single yaml on our smaller local cluster.

from paleomix.

MikkelSchubert commented on July 29, 2024

In that case you could probably just disable validation outright, since the tools used in the pipeline are unlikely to generate straight up invalid files. The easiest way to do that is probably to add a return node (4 spaces indentation) at

paleomix/paleomix/pipelines/bam/nodes.py

Line 34 in 0fafea3

The problem with ValidateSamFile is most likely that it creates/opens up to 8,000 files by default, one per sequence in your target genome, which it uses to track mate information (it's a really poor implementation!). And I believe that picard MarkDuplicates does something similar to track read pairs. So the problem is probably way too many open file handles.

I'll get back to you once I've figured out how to handle this. The solution might just be to implement some minimal checks myself (I already check for some things not handled by ValidateSamFile) and then just drop ValidateSamFile entirely. The tools used by paleomix are quite mature at this point, so there is less of a need to check everything all the time.

from paleomix.

MikkelSchubert commented on July 29, 2024

After some consideration, I've decided to stop running ValidateSamFile as part of the pipeline going forward.
The overhead is too large and it is very rare that problems are detected due to the maturity of the tools used in the rest of the pipeline.

I am not going to remove picard ValidateSamFile or swap out picard MarkDuplicates in the 1.3.x branch, since that is a significant change in methodology, but if it makes your lives easier then I can add an option to skip the validation checks while running the pipeline.

However, as I mentioned before, the 2.0-alpha release already uses samtools markdup, so you could try that version (the master branch). Just note that this branch is under active development, that the coverage/depths/coverage reports are currently disabled (you can still run the latter to by hand), and that I am planning on making more methodological changes in connection with the next major release of AdapterRemoval that I am working on (`--collapse-conservatively´ will become the standard[1] and the collapsed truncated will be written to the same file as full collapsed reads). The YAML file layout is also changing, though it should be pretty easy to convert existing YAML files to the new layout if you compare with the template that the pipeline generates.

[1] Motivated by the poor performance a merging algorithm similar to the current default in AdapterRemoval in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2579-2

from paleomix.

sameoldmike commented on July 29, 2024

Thank you Mikkel!

from paleomix.

sameoldmike commented on July 29, 2024

Just to be clear, have you already added the option to skip the BAM validation checks? Or is that going to be in a future release?

from paleomix.

MikkelSchubert commented on July 29, 2024

I haven't added it yet.

I'll try to make a point release (v1.3.5) before the end of the week, that adds a "Validation" feature to the YAML file. Or maybe it'd be more useful with a command-line option? E.g. --validation off and --validation strict. What do you think?

from paleomix.

sameoldmike commented on July 29, 2024

Command-line option sounds great if that makes sense for implementation. We would use it every time! Thanks so much Mikkel

On Tue, Oct 12, 2021 at 8:21 PM Mikkel Schubert ***@***.***> wrote: I haven't added it yet. I'll try to make a point release (v1.3.5) before the end of the week, that adds a "Validation" feature to the YAML file. Or maybe it'd be more useful with a command-line option? E.g. --validation off and --validation strict. What do you think? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#44 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTSQC6JZ6JLBQN7H6BOVRLUGR4BZANCNFSM5FB2OK3Q> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Mike D. Martin Associate professor, Holomuseomics research group <https://www.ntnu.edu/museum/holomuseomics> Department of Natural History Norwegian University of Science and Technology (NTNU) Google scholar <https://scholar.google.com/citations?user=WosqmUMAAAAJ&hl=en> DarwinPlants <https://darwin-plants.com/>

from paleomix.

MikkelSchubert commented on July 29, 2024

I couldn't think of anything else that needed to be fixed, so v1.3.5 is available now:

# Validate everything with ValidateSamFile; this is the default behavior
$ paleomix bam run --validation full project.yaml

# Validate only the final BAM file
$ paleomix bam run --validation partial project.yaml

# Validate nothing
$ paleomix bam run --validation off project.yaml

from paleomix.

sameoldmike commented on July 29, 2024

Cool, thanks so much Mikkel. I've got it running, and it seems to be working beautifully. Mike

…

On Tue, Oct 12, 2021 at 9:20 PM Mikkel Schubert ***@***.***> wrote: I couldn't think of anything else that needed to be fixed, so v1.3.5 is available now: # Validate everything with ValidateSamFile; this is the default behavior $ paleomix bam run --validation full project.yaml # Validate only the final BAM file $ paleomix bam run --validation partial project.yaml # Validate nothing $ paleomix bam run --validation off project.yaml — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#44 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTSQC6VYV2DDHYZYZDJGITUGSC57ANCNFSM5FB2OK3Q> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

from paleomix.

biobambam2 instead of picard-tools? about paleomix HOT 10 OPEN

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent