Comments (10)
Hi Mike,
Thank you for the suggestion!
I've actually been looking for alternatives to ValidateSamFile due to its poor performance, so I'll definitely take a look at biobambam2. However, at a glance it seems like bamvalidate performs a much more superficial validation (it only seems to check for straight up corrupt BAM files/records), so I'm not sure it can be used as a full replacement for ValidateSamFile.
For duplicate marking I've switched to samtools markdup in the 2.0-alpha version of paleomix, which was a pretty significant improvement, but note that this version is undergoing a lot of changes at the moment.
Cheers,
Mikkel
from paleomix.
Regarding it being a superficial check, agreed that this is the case! So possibly not a great suggestion, but for practical reasons we often have to use it as an alternative to picard-tools.
By the way, on large HPC environments (e.g. computerome2 or similar), our solution for paleomix has been to run each sample in its own yaml file on its own compute node. Then we don't get all the failing BAM validation and markduplicates steps -- I still feel it's related to all the open files, but we can't seem to solve this issue when running >100 large genome samples in a single yaml on our smaller local cluster.
from paleomix.
In that case you could probably just disable validation outright, since the tools used in the pipeline are unlikely to generate straight up invalid files. The easiest way to do that is probably to add a "return node" line (4 spaces indentation) at line 34 of paleomix/paleomix/pipelines/bam/nodes.py (as of commit 0fafea3).
The problem with ValidateSamFile is most likely that it creates/opens up to 8,000 files by default, one per sequence in your target genome, which it uses to track mate information (it's a really poor implementation!). And I believe that picard MarkDuplicates does something similar to track read pairs. So the problem is probably way too many open file handles.
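If open file handles are indeed the culprit, one quick way to check is to compare the per-process limit on the compute node against the roughly 8,000 files mentioned above. The following is only a minimal sketch using Python's standard resource module (Unix only); the helper name can_open and the constant FILES_NEEDED are made up for illustration, and the figure ignores files that are already open in the process.

```python
import resource

# Figure taken from the discussion above: ValidateSamFile may open up to
# roughly 8,000 files by default. Treat this as an approximation.
FILES_NEEDED = 8000


def can_open(n_files, soft_limit):
    """Return True if soft_limit permits n_files simultaneously open files.

    resource.RLIM_INFINITY means that no limit is enforced.
    """
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= n_files


if __name__ == "__main__":
    # RLIMIT_NOFILE is the maximum number of open file descriptors per process
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")
    if not can_open(FILES_NEEDED, soft):
        print(f"soft limit is below {FILES_NEEDED}; consider raising it (e.g. ulimit -n)")
```

Running this inside a job on the affected cluster should show whether the soft limit is lower than what the validation step may need.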
I'll get back to you once I've figured out how to handle this. The solution might just be to implement some minimal checks myself (I already check for some things not handled by ValidateSamFile) and then just drop ValidateSamFile entirely. The tools used by paleomix are quite mature at this point, so there is less of a need to check everything all the time.
from paleomix.
After some consideration, I've decided to stop running ValidateSamFile as part of the pipeline going forward.
The overhead is too large and it is very rare that problems are detected due to the maturity of the tools used in the rest of the pipeline.
I am not going to remove picard ValidateSamFile or swap out picard MarkDuplicates in the 1.3.x branch, since that is a significant change in methodology, but if it makes your lives easier then I can add an option to skip the validation checks while running the pipeline.
However, as I mentioned before, the 2.0-alpha release already uses samtools markdup, so you could try that version (the master branch). Just note that this branch is under active development, that the coverage/depths/coverage reports are currently disabled (you can still run the latter two by hand), and that I am planning on making more methodological changes in connection with the next major release of AdapterRemoval that I am working on (`--collapse-conservatively` will become the standard[1] and the collapsed truncated reads will be written to the same file as fully collapsed reads). The YAML file layout is also changing, though it should be pretty easy to convert existing YAML files to the new layout if you compare with the template that the pipeline generates.
[1] Motivated by the poor performance of a merging algorithm similar to the current default in AdapterRemoval, as reported in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2579-2
from paleomix.
Thank you Mikkel!
from paleomix.
Just to be clear, have you already added the option to skip the BAM validation checks? Or is that going to be in a future release?
from paleomix.
I haven't added it yet.
I'll try to make a point release (v1.3.5) before the end of the week that adds a "Validation" feature to the YAML file. Or maybe it'd be more useful with a command-line option? E.g. --validation off and --validation strict. What do you think?
from paleomix.
I couldn't think of anything else that needed to be fixed, so v1.3.5 is available now:
# Validate everything with ValidateSamFile; this is the default behavior
$ paleomix bam run --validation full project.yaml
# Validate only the final BAM file
$ paleomix bam run --validation partial project.yaml
# Validate nothing
$ paleomix bam run --validation off project.yaml
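For the one-YAML-per-sample setup mentioned earlier in the thread, the commands above could be generated per project file. This is only a rough sketch: the build_commands helper and the projects/ directory are hypothetical, and how each command is actually submitted to a compute node depends on your scheduler, so the example just prints the commands.

```python
import shlex
from pathlib import Path

# The three modes listed above for v1.3.5
VALIDATION_MODES = ("full", "partial", "off")


def build_commands(yaml_files, validation="off"):
    """Build one 'paleomix bam run' command (as an argument list) per YAML file."""
    if validation not in VALIDATION_MODES:
        raise ValueError(f"unknown validation mode: {validation!r}")

    return [
        ["paleomix", "bam", "run", "--validation", validation, str(path)]
        for path in yaml_files
    ]


if __name__ == "__main__":
    # "projects" is a placeholder directory holding one YAML file per sample
    for command in build_commands(sorted(Path("projects").glob("*.yaml"))):
        # Print a shell-safe command line; wrap it in your own job script
        print(shlex.join(command))
```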
from paleomix.