delocalizer / streammd
Single-pass probabilistic duplicate marking of alignments with a Bloom filter.
Home Page: https://doi.org/10.1093/bioinformatics/btad181
License: MIT License
Tests of SamRecord.update_dup_status fail when compiling with clang...
Looks like pgtag_ is empty, and therefore pgidx_ that is set in parse() is actually the value of PNEXT instead of either the position of PG:Z (if present) or the end of the record.
Something to do with the inline static const declaration and initialization of pgtag_ ???
The fractional capacity reported in the log message after processing is calculated from templates/n, where templates is actually "templates seen". The numerator should be templates - templates_marked_duplicate, i.e. "unique templates".
This has no effect on duplicate marking or metrics, but it can cause an unnecessary exit for overcapacity, because the true used capacity of the Bloom filter is lower than stated.
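A minimal sketch of the corrected calculation. The function and argument names are hypothetical and chosen for illustration; they are not taken from the streammd source.

```python
def used_capacity_fraction(templates_seen: int,
                           templates_marked_duplicate: int,
                           capacity_n: int) -> float:
    """Fraction of Bloom filter capacity used by unique templates."""
    if capacity_n <= 0:
        raise ValueError("capacity_n must be positive")
    # Only templates not marked duplicate count towards used capacity.
    unique_templates = templates_seen - templates_marked_duplicate
    return unique_templates / capacity_n


# Hypothetical counts: 1.2M templates seen, 300k marked duplicate,
# filter sized for 1M items.
reported = 1_200_000 / 1_000_000  # templates/n: looks over capacity
actual = used_capacity_fraction(1_200_000, 300_000, 1_000_000)
print(f"reported={reported:.2f} actual={actual:.2f}")
```

With these made-up numbers the templates/n figure exceeds 1 while the unique-template figure does not, which is the spurious overcapacity case described above.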
Compilation fails due to inclusion of immintrin.h. It looks like this can be fixed pretty easily. See e.g. coin-or/Clp#127.
Make it clear in the help and documentation that this is the maximum acceptable marginal FP rate, not the true bulk FP rate that will be achieved after processing, which will be lower.
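To illustrate the distinction, here is a rough sketch using the standard Bloom filter approximation p(i) = (1 - e^(-k*i/m))^k for the false positive probability after i insertions. It assumes "bulk" means the mean of that marginal rate over the whole run; the sizes are illustrative and are not streammd defaults.

```python
import math


def marginal_fp(items: float, m: int, k: int) -> float:
    """Approximate FP probability of a Bloom filter holding `items` entries."""
    return (1.0 - math.exp(-k * items / m)) ** k


def bulk_fp(n: int, m: int, k: int, steps: int = 10_000) -> float:
    """Mean marginal FP rate while filling from 0 to n items (midpoint rule)."""
    total = sum(marginal_fp((j + 0.5) * n / steps, m, k) for j in range(steps))
    return total / steps


# Illustrative sizes only.
m, k, n = 2 ** 24, 10, 4_000_000
print(f"marginal at n: {marginal_fp(n, m, k):.3e}")
print(f"bulk over run: {bulk_fp(n, m, k):.3e}")
```

Because the marginal rate only grows as the filter fills, its average over the run is always below the value reached at the end.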
use branches, tags and bump versions when required
Hi,
streammd is a great tool, but it is not easy to compile in some system environments. Would you please add it to bioconda?
Best,
Kun
streammd uses broadly the same strategy as SAMBLASTER to detect and mark duplicates on templates where one end is not mapped: the signature of the mapped read must match a previously seen orphan, i.e. orphans are compared only to orphans. This is in contrast to MarkDuplicates, where orphans may be compared to and match reads in completely mapped templates.
There is however one difference between streammd 4.1.7 and SAMBLASTER, in that streammd currently aligns with SAMBLASTER logic prior to 0.1.25, i.e. before this change:
Starting with SAMBLASTER version 0.1.25: forward and reverse strand orphans/singletons are allowed to be duplicates of each other
In our test data the newer logic marks between 5% and 10% more orphan dups. The logic behind this is probably good, and what certainly isn't debatable is that aligning streammd logic exactly with the most recent SAMBLASTER makes it easier to compare the tools directly.
Expose m value from optimal_m_k as a CLI option
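For context, a sketch of how m, k, n and p relate in a Bloom filter. Only the name optimal_m_k appears in this issue; its implementation is not shown here, so the code below uses the textbook sizing formulas and hypothetical function names, and streammd may round or constrain m differently.

```python
import math
from typing import Tuple


def textbook_m_k(n: int, p: float) -> Tuple[int, int]:
    """Textbook Bloom filter sizing for n items at target FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def capacity_for(m: int, k: int, p: float) -> int:
    """Items that fit before the marginal FP rate reaches p, for fixed m and k.

    Obtained by inverting p = (1 - exp(-k * n / m)) ** k for n.
    """
    return int(-(m / k) * math.log(1.0 - p ** (1.0 / k)))


m, k = textbook_m_k(1_000_000, 0.01)
print(m, k)
# If m were supplied directly on the command line, capacity would follow:
print(capacity_for(2 ** 35, 10, 1e-6))
```

An explicit m option would let the user fix memory first and accept whatever capacity results, rather than deriving memory from n and p.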
we should explicitly test builds on other platforms before releases to avoid things like #24
This is in 2.0.0 and all previous releases:
When batchsize is > 4 and streammd output is piped directly to samtools view, the latter dies with a parser error and we get a broken pipe.
Although docs and --help have been updated in an attempt to clarify what exactly the -p, --fp-rate setting does, to minimize potential for confusion it would be good to output an estimate of the actual marginal false positive rate reached at the end of a run.
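One way such an estimate could be produced is from the observed fill of the bit array: if a fraction X/m of bits is set, a new item is a false positive with probability about (X/m)^k. The sketch below assumes the number of set bits is available at the end of a run, which is an assumption about the implementation; the function names are hypothetical.

```python
import math


def fp_from_fill(bits_set: int, m: int, k: int) -> float:
    """FP probability for a new item given the observed fill of the bit array."""
    return (bits_set / m) ** k


def items_from_fill(bits_set: int, m: int, k: int) -> float:
    """Estimate the number of distinct items inserted from the observed fill."""
    if bits_set >= m:
        return math.inf  # a saturated filter carries no count information
    return -(m / k) * math.log(1.0 - bits_set / m)


# Illustrative values only.
print(f"{fp_from_fill(500, 1000, 3):.3f}")
print(f"{items_from_fill(500, 1000, 3):.1f}")
```

If the set-bit count is not tracked, the same estimate can be approximated from the count of unique templates inserted using p = (1 - e^(-k*n/m))^k.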
either try converting, or fail with an explicit ValueError
... basically do anything except fail naively as currently happens
--metrics argument
Hello!
I'd like to use your tool as an alternative to Picard Markdup, but I have a problem building the tool.
I use gcc version 7.5.0 (requirements gcc >= 7.1), but the command make returns an error:
depbase=`echo bloomfilter.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I. -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT bloomfilter.o -MD -MP -MF $depbase.Tpo -c -o bloomfilter.o bloomfilter.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo markdups.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I. -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT markdups.o -MD -MP -MF $depbase.Tpo -c -o markdups.o markdups.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo streammd.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I. -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT streammd.o -MD -MP -MF $depbase.Tpo -c -o streammd.o streammd.cxx &&\
mv -f $depbase.Tpo $depbase.Po
In file included from streammd.cxx:4:0:
../external/argparse/argparse.hpp:36:10: fatal error: charconv: No such file or directory
#include <charconv>
^~~~~~~~~~
compilation terminated.
Makefile:444: recipe for target 'streammd.o' failed
make: *** [streammd.o] Error 1
Could you help me, please?
Best regards,
Polina.
Replace current README.md (which is essentially dev notes) with proper install instructions and usage.
Keep performance profiling, perhaps.
Hi,
I got this error when running streammd 4.0.2,
[2023-01-07 21:40:43.242] [main] [info] BloomFilter initialized with p=1e-06 n=993917924 m=34359738368 k=10
[2023-01-07 21:40:43.242] [main] [info] BloomFilter capacity: 993917924 items
[2023-01-07 21:40:43.993] [main] [error] A00601:371:HLNYMDSXY:3:1101:3983:19413: got 1 primary alignment(s). Input is not paired or not qname-grouped?
The pipeline I used (for huge sequencing data):
step 1: split fastq into many pieces,
step 2: bwa mem map each piece separately,
step 3: samtools sort each piece separately,
step 4: samtools merge the pieces into a single bam,
step 5:
samtools-1.9 view -@ 4 -h F1.sort.bam|streammd --metrics F1.dedup.metrics |samtools-1.9 view -@ 4 -bS -o F1.dedup.bam
And got the error I have mentioned above.
Best,
Kun
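The error text itself points at grouping: by default samtools sort orders records by coordinate, so after steps 3 and 4 the two ends of a template are generally not adjacent. A minimal sketch for checking whether a SAM stream is qname-grouped follows; the helper is hypothetical and not part of streammd, and it keeps every QNAME in memory, so it is better suited to a sample of the data than to a full run.

```python
from typing import Iterable, Optional


def first_ungrouped_qname(sam_lines: Iterable[str]) -> Optional[str]:
    """Return the first QNAME that reappears after a different QNAME, else None."""
    seen = set()
    previous = None
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        qname = line.split("\t", 1)[0]
        if qname != previous:
            if qname in seen:
                return qname  # this QNAME occurred in an earlier, separate run
            seen.add(qname)
            previous = qname
    return None


grouped = ["@HD\tVN:1.6", "r1\t99", "r1\t147", "r2\t99", "r2\t147"]
by_coordinate = ["r1\t99", "r2\t99", "r1\t147", "r2\t147"]
print(first_ungrouped_qname(grouped))
print(first_ungrouped_qname(by_coordinate))
```

If the check fails, things to try would be piping the bwa mem output to streammd before any coordinate sort, or regrouping by name first (for example with samtools sort -n or samtools collate).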
For convenience we currently perform template end calculations using pysam.AlignedSegment.
We do however incur a fair deserialize/serialize cost to get reads into and out of that form.
Investigate using just the raw record.
Most of it is pretty easy:
split by TAB, cast FLAG and POS to int, and use bitwise masks on the flag for all the is_mapped, is_forward, is_first etc.
Use regex to capture leading (fwd) and trailing (rev) soft clips.
The only mildly tricky bit is doing the CIGAR calculation properly to get reference_end: need to handle D and I ops.
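The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the streammd implementation: the function name and returned keys are hypothetical, coordinates follow the pysam convention (0-based start, exclusive end), and only soft clips are added to the unclipped ends.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
LEADING_SOFT = re.compile(r"^(?:\d+H)?(\d+)S")    # soft clip at the start
TRAILING_SOFT = re.compile(r"(\d+)S(?:\d+H)?$")   # soft clip at the end
REF_CONSUMING = frozenset("MDN=X")                # I, S, H, P do not consume reference

UNMAPPED, REVERSE, FIRST = 0x4, 0x10, 0x40        # SAM FLAG bits


def parse_raw(line: str) -> dict:
    """Derive template-end fields from one raw SAM record."""
    fields = line.rstrip("\n").split("\t")
    flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
    rec = {
        "qname": fields[0],
        "is_mapped": not flag & UNMAPPED,
        "is_forward": not flag & REVERSE,
        "is_first": bool(flag & FIRST),
        "reference_start": None,
        "reference_end": None,
        "unclipped_start": None,
        "unclipped_end": None,
    }
    if not rec["is_mapped"] or cigar == "*":
        return rec
    # D consumes reference, I does not.
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    lead = LEADING_SOFT.search(cigar)
    trail = TRAILING_SOFT.search(cigar)
    start = pos - 1                                # SAM POS is 1-based
    end = start + ref_len                          # exclusive
    rec["reference_start"], rec["reference_end"] = start, end
    rec["unclipped_start"] = start - (int(lead.group(1)) if lead else 0)
    rec["unclipped_end"] = end + (int(trail.group(1)) if trail else 0)
    return rec


rec = parse_raw("r1\t99\tchr1\t100\t60\t5S10M2D3I20M\t=\t300\t250\tACGT\t####")
print(rec["reference_start"], rec["reference_end"], rec["unclipped_start"])
```

A forward read would then use unclipped_start and a reverse read unclipped_end as its template end, without constructing a pysam.AlignedSegment.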
Currently TEMPLATE_DUPLICATE_FRACTION uses the count of all templates in the denominator, i.e. it's just equal to TEMPLATES_MARKED_DUPLICATE/TEMPLATES. The denominator should however omit templates where neither end is aligned, as they are not assessed for duplicate status.
Whatever formula is used, it should be documented in the help or at least in the source README.
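A minimal sketch of the proposed formula. The argument templates_unaligned (templates where neither end is aligned) is a hypothetical counter introduced for illustration; the other two names mirror the metrics mentioned above.

```python
def template_duplicate_fraction(templates: int,
                                templates_unaligned: int,
                                templates_marked_duplicate: int) -> float:
    """Duplicate fraction over templates that were actually assessed."""
    assessed = templates - templates_unaligned
    if assessed <= 0:
        return 0.0  # nothing was assessed for duplicate status
    return templates_marked_duplicate / assessed


# Hypothetical counts: 1000 templates, 200 with neither end aligned, 100 duplicates.
print(template_duplicate_fraction(1000, 200, 100))
print(100 / 1000)  # current formula, for comparison
```

With these made-up counts the assessed-only denominator gives a higher fraction than the current formula, since unassessed templates no longer dilute it.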
Capture Exceptions in child processes
ID:streammd PN:streammd VN:...
etc