
Comments (10)

MikkelSchubert avatar MikkelSchubert commented on July 29, 2024

Hi Mike,

Thank you for the suggestion!
I've actually been looking for alternatives to ValidateSamFile due to its poor performance, so I'll definitely take a look at biobambam2. However, at a glance it seems like bamvalidate performs a much more superficial validation (it only seems to check for straight up corrupt BAM files/records), so I'm not sure it can be used as a full replacement for ValidateSamFile.

For duplicate marking I've switched to samtools markdup in the 2.0-alpha version of paleomix, which was a pretty significant improvement, but note that this version is undergoing a lot of changes at the moment.

Cheers,
Mikkel

from paleomix.

sameoldmike avatar sameoldmike commented on July 29, 2024

Regarding it being a superficial check, agreed that this is the case! So possibly not a great suggestion, but for practical reasons we often have to use it as an alternative to picard-tools.

By the way, on large HPC environments (e.g. computerome2 or similar), our solution for paleomix has been to run each sample in its own yaml file on its own compute node. Then we don't get all the failing BAM validation and markduplicates steps -- I still feel it's related to all the open files, but we can't seem to solve this issue when running >100 large genome samples in a single yaml on our smaller local cluster.

MikkelSchubert avatar MikkelSchubert commented on July 29, 2024

In that case you could probably just disable validation outright, since the tools used in the pipeline are unlikely to generate straight up invalid files. The easiest way to do that is probably to add a return node (4 spaces indentation) at

The problem with ValidateSamFile is most likely that it creates/opens up to 8,000 files by default, one per sequence in your target genome, which it uses to track mate information (it's a really poor implementation!). And I believe that picard MarkDuplicates does something similar to track read pairs. So the problem is probably way too many open file handles.
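A quick way to test the open-file-handle theory is to compare the shell's soft limit on open files with the roughly 8,000 files mentioned above. This is only a minimal sketch: `check_open_file_limit` is a hypothetical helper, not part of paleomix or picard, and the threshold is simply the figure quoted in this comment.

```shell
# Hypothetical helper (not part of paleomix or picard): compare the soft
# limit on open files with the number of files a tool may need to open.
check_open_file_limit() {
    needed="${1:-8000}"        # 8000 is the figure quoted in this thread
    limit="$(ulimit -Sn)"      # soft limit on open file descriptors
    if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$needed" ]; then
        echo "ok: soft limit ($limit) covers $needed open files"
    else
        echo "too low: soft limit ($limit) is below $needed open files"
        return 1
    fi
}

check_open_file_limit 8000 || echo "consider raising the limit before running the pipeline"
```

If the limit turns out to be too low, it may be possible to raise it with `ulimit -n` up to the hard limit, though on shared clusters that usually depends on what the administrators allow.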

I'll get back to you once I've figured out how to handle this. The solution might just be to implement some minimal checks myself (I already check for some things not handled by ValidateSamFile) and then just drop ValidateSamFile entirely. The tools used by paleomix are quite mature at this point, so there is less of a need to check everything all the time.

MikkelSchubert avatar MikkelSchubert commented on July 29, 2024

After some consideration, I've decided to stop running ValidateSamFile as part of the pipeline going forward.
The overhead is too large and it is very rare that problems are detected due to the maturity of the tools used in the rest of the pipeline.

I am not going to remove picard ValidateSamFile or swap out picard MarkDuplicates in the 1.3.x branch, since that is a significant change in methodology, but if it makes your lives easier then I can add an option to skip the validation checks while running the pipeline.

However, as I mentioned before, the 2.0-alpha release already uses samtools markdup, so you could try that version (the master branch). Just note that this branch is under active development, that the coverage/depths/coverage reports are currently disabled (you can still run the latter by hand), and that I am planning on making more methodological changes in connection with the next major release of AdapterRemoval that I am working on (`--collapse-conservatively` will become the standard[1] and collapsed truncated reads will be written to the same file as fully collapsed reads). The YAML file layout is also changing, though it should be pretty easy to convert existing YAML files to the new layout if you compare with the template that the pipeline generates.

[1] Motivated by the poor performance of a merging algorithm similar to the current default in AdapterRemoval, as reported in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2579-2

sameoldmike avatar sameoldmike commented on July 29, 2024

Thank you Mikkel!

sameoldmike avatar sameoldmike commented on July 29, 2024

Just to be clear, have you already added the option to skip the BAM validation checks? Or is that going to be in a future release?

MikkelSchubert avatar MikkelSchubert commented on July 29, 2024

I haven't added it yet.

I'll try to make a point release (v1.3.5) before the end of the week that adds a "Validation" feature to the YAML file. Or maybe a command-line option would be more useful? E.g. --validation off and --validation strict. What do you think?

sameoldmike avatar sameoldmike commented on July 29, 2024

MikkelSchubert avatar MikkelSchubert commented on July 29, 2024

I couldn't think of anything else that needed to be fixed, so v1.3.5 is available now:

# Validate everything with ValidateSamFile; this is the default behavior
$ paleomix bam run --validation full project.yaml

# Validate only the final BAM file
$ paleomix bam run --validation partial project.yaml

# Validate nothing
$ paleomix bam run --validation off project.yaml
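For the one-YAML-file-per-sample setup described earlier in this thread, the new option can be applied in a loop. The following is a rough sketch only: `run_samples` and the file names are hypothetical, and the first argument is a command prefix, so that `echo` gives a dry run and `env` actually executes the command.

```shell
# Hypothetical wrapper: run every per-sample YAML file separately with
# validation switched off. The first argument is a command prefix:
# "echo" prints the commands (dry run), "env" executes them.
run_samples() {
    runner="$1"
    shift
    for yaml in "$@"; do
        "$runner" paleomix bam run --validation off "$yaml" || return 1
    done
}

# Dry run: print the commands that would be executed
run_samples echo sample_a.yaml sample_b.yaml
```

Replacing `echo` with `env` runs the samples one after another; on a cluster, each command would more likely be submitted as its own job.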

sameoldmike avatar sameoldmike commented on July 29, 2024
