delocalizer / streammd

Single-pass probabilistic duplicate marking of alignments with a Bloom filter.

Home Page: https://doi.org/10.1093/bioinformatics/btad181

License: MIT License

Makefile 0.84% Shell 0.06% M4 0.56% C++ 97.83% C 0.13% Python 0.58%

streammd's Issues

some tests fail when compiling with clang

Tests of SamRecord.update_dup_status fail when compiling with clang...
It looks like pgtag_ is empty, so pgidx_, which is set in parse(), ends up as the value of PNEXT instead of either the position of PG:Z (if present) or the end of the record.

Possibly related to the inline static const declaration and initialization of pgtag_?

unit tests from MarkDuplicates

https://twitter.com/nilshomer/status/1582603486629199873?t=dy0HNME-WofbtnTNdzzaWA&s=19

stored items count incorrect

The fractional capacity reported in the log message after processing is calculated as templates/n, where templates is actually "templates seen". It should use templates - templates_marked_duplicate, i.e. "unique templates".

This has no effect on duplicate marking or metrics, but it can cause an unnecessary exit due to overcapacity, because the true used capacity of the Bloom filter is lower than stated.
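
A minimal sketch of the corrected calculation, assuming the counters are plain integers. The function and argument names are illustrative and are not taken from the streammd source.

```python
def used_capacity_fraction(templates_seen: int,
                           templates_marked_duplicate: int,
                           n: int) -> float:
    """Fraction of Bloom filter capacity n actually used.

    Only templates that were NOT marked duplicate are stored in the
    filter, so duplicates are subtracted before dividing by n.
    """
    unique_templates = templates_seen - templates_marked_duplicate
    return unique_templates / n


# With 120 templates seen, 30 of them duplicates, and capacity 100:
# templates_seen / n would report 1.2 (over capacity),
# while the unique-template count gives 0.9 (within capacity).
fraction = used_capacity_fraction(120, 30, 100)
```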

Does not build on ARM

Compilation fails due to inclusion of immintrin.h. It looks like this can be fixed pretty easily. See e.g. coin-or/Clp#127.

Improve help for --fp-rate option

Make it clear in the help and documentation that this is the maximum acceptable marginal FP rate, not the true bulk FP rate that will be achieved after processing, which will be lower.

Versioning

use branches, tags and bump versions when required

Add streammd to bioconda clouds

Hi,

streammd is a great tool, but it is not so easy to compile in some system environments. Would you please add it to bioconda clouds?

Best,
Kun

align orphan read logic with SAMBLASTER

streammd uses broadly the same strategy as SAMBLASTER to detect and mark duplicates on templates where one end is not mapped:

the signature of the mapped read must match a previously seen orphan

i.e. orphans are compared only to orphans. This is in contrast to MarkDuplicates where orphans may be compared to and match reads in completely mapped templates.

There is, however, one difference between streammd 4.1.7 and SAMBLASTER: streammd currently follows the SAMBLASTER logic from before version 0.1.25, i.e.

all orphans are treated as if on forward strand

Starting with SAMBLASTER version 0.1.25:

forward and reverse strand orphans/singletons are allowed to be duplicates of each other

In our test data this marks 5-10% more orphan dups. The logic behind this is probably sound, and what is certain is that aligning streammd logic exactly with the most recent SAMBLASTER makes it easier to compare the tools directly.

Tool to calculate mem from input n, p

Expose m value from optimal_m_k as a CLI option
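
A rough sketch of what such a tool could compute, using the textbook Bloom filter sizing formulas. This is an assumption, not a copy of optimal_m_k: the log output quoted in another issue on this page (p=1e-06 n=993917924 m=34359738368 k=10) does not match the textbook result, so streammd evidently applies its own constraints. The function name is hypothetical.

```python
import math


def textbook_m_k(n: int, p: float) -> tuple[int, int, int]:
    """Textbook Bloom filter sizing for n items at false positive rate p.

    Returns (m, k, mem_bytes) where m is the number of bits, k the number
    of hash functions, and mem_bytes the size of an m-bit array.
    """
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    # m = -n * ln(p) / (ln 2)^2
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest integer and at least 1
    k = max(1, round(m / n * math.log(2)))
    mem_bytes = math.ceil(m / 8)
    return m, k, mem_bytes


m, k, mem_bytes = textbook_m_k(1_000_000, 0.01)
print(f"m={m} bits, k={k}, memory={mem_bytes / 2**20:.2f} MiB")
```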

set up local cross-compile envs

we should explicitly test builds on other platforms before releases to avoid things like #24

samtools parser error when stdout output into samtools view

This is in 2.0.0 and all previous releases:
When batchsize is > 4 and streammd output is piped directly to samtools view, the latter dies with a parser error and we get a broken pipe.

Handle pairs where one read is unmapped

output an estimate for actual marginal FP rate at the end of a run

Although docs and --help have been updated in an attempt to clarify what exactly -p, --fp-rate setting does, to minimize potential for confusion it would be good to output an estimate of the actual marginal false positive rate reached at the end of a run.
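
One way such an estimate could be computed, as a sketch: it uses the standard approximation (1 - e^(-k*n/m))^k for the probability that a new item is a false positive once n distinct items are stored in a filter of m bits with k hash functions. Whether streammd would report exactly this quantity is an assumption, and the function name is hypothetical.

```python
import math


def marginal_fp_rate(n_stored: int, m: int, k: int) -> float:
    """Approximate probability that the next new item tests positive.

    n_stored: number of distinct items already added
    m:        number of bits in the filter
    k:        number of hash functions
    """
    return (1.0 - math.exp(-k * n_stored / m)) ** k


# Filter parameters as they appear in a log message quoted on this page.
m, k, capacity = 34359738368, 10, 993917924

at_capacity = marginal_fp_rate(capacity, m, k)
half_full = marginal_fp_rate(capacity // 2, m, k)
print(f"at capacity: {at_capacity:.3g}, half full: {half_full:.3g}")
```

A run that stores fewer unique templates than the capacity n ends with a lower marginal rate than the -p, --fp-rate setting, which is the point of reporting it.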

Handle non-string values in BloomFilter.add

either try converting, or fail with an explicit ValueError... basically do anything except fail naively as currently happens
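
A minimal sketch of the "convert or fail explicitly" behaviour. The class below is a toy stand-in written for illustration; its hashing scheme, constructor, and return value are assumptions and do not reflect the real BloomFilter implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter used only to illustrate input validation in add()."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    @staticmethod
    def _to_bytes(item) -> bytes:
        # Accept bytes as-is, encode str, convert int via its decimal
        # string (so 12 and "12" are the same item); reject the rest.
        if isinstance(item, bytes):
            return item
        if isinstance(item, str):
            return item.encode("utf-8")
        if isinstance(item, int) and not isinstance(item, bool):
            return str(item).encode("ascii")
        raise ValueError(
            f"cannot add item of type {type(item).__name__}; "
            "expected str, bytes or int"
        )

    def _positions(self, data: bytes):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.blake2b(data, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item) -> bool:
        """Add item; return True if at least one bit was newly set."""
        data = self._to_bytes(item)
        newly_set = False
        for pos in self._positions(data):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit
                newly_set = True
        return newly_set
```

With this shape, bf.add(3.14) raises a ValueError naming the offending type instead of failing somewhere inside the hashing code.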

Tests

write metrics to their own file

--metrics argument

Problem with installation

Hello!
I'd like to use your tool as an alternative to Picard Markdup, but I have a problem building the tool.
I use gcc version 7.5.0 (the requirement is gcc >= 7.1), but the make command returns an error:

depbase=`echo bloomfilter.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT bloomfilter.o -MD -MP -MF $depbase.Tpo -c -o bloomfilter.o bloomfilter.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo markdups.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT markdups.o -MD -MP -MF $depbase.Tpo -c -o markdups.o markdups.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo streammd.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT streammd.o -MD -MP -MF $depbase.Tpo -c -o streammd.o streammd.cxx &&\
mv -f $depbase.Tpo $depbase.Po
In file included from streammd.cxx:4:0:
../external/argparse/argparse.hpp:36:10: fatal error: charconv: No such file or directory
 #include <charconv>
          ^~~~~~~~~~
compilation terminated.
Makefile:444: recipe for target 'streammd.o' failed
make: *** [streammd.o] Error 1

Could you help me, please?
Best regards,
Polina.

Documentation

Replace current README.md (which is essentially dev notes) with proper install instructions and usage.

Keep performance profiling, perhaps.

got 1 primary alignment(s). Input is not paired or not qname-grouped?

Hi,

I got this error when running streammd 4.0.2:
[2023-01-07 21:40:43.242] [main] [info] BloomFilter initialized with p=1e-06 n=993917924 m=34359738368 k=10
[2023-01-07 21:40:43.242] [main] [info] BloomFilter capacity: 993917924 items
[2023-01-07 21:40:43.993] [main] [error] A00601:371:HLNYMDSXY:3:1101:3983:19413: got 1 primary alignment(s). Input is not paired or not qname-grouped?

The pipeline I used (for huge sequencing data):
step 1: split fastq into many pieces
step 2: bwa mem map each piece separately
step 3: samtools sort each piece separately
step 4: samtools merge pieces into a single bam
step 5:
samtools-1.9 view -@ 4 -h F1.sort.bam|streammd --metrics F1.dedup.metrics |samtools-1.9 view -@ 4 -bS -o F1.dedup.bam

And got the error I have mentioned above.

Best,
Kun
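
For anyone hitting the same message: it indicates that the records of a template were not adjacent in the input. samtools sort orders by coordinate unless told otherwise, which separates mates, so name-grouping the merged file first (for example with samtools sort -n) is the likely remedy. The sketch below is a hypothetical pre-flight check written for illustration; it is not part of streammd and only approximates the condition the error describes.

```python
from itertools import groupby

SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def check_qname_grouped_pairs(sam_lines) -> int:
    """Check that each run of consecutive records sharing a QNAME holds
    exactly two primary alignments. Returns the number of templates.
    """
    records = (line for line in sam_lines if not line.startswith("@"))
    templates = 0
    for qname, group in groupby(records, key=lambda l: l.split("\t", 1)[0]):
        primaries = 0
        for line in group:
            flag = int(line.split("\t", 2)[1])
            if not flag & (SECONDARY | SUPPLEMENTARY):
                primaries += 1
        if primaries != 2:
            raise ValueError(
                f"{qname}: got {primaries} primary alignment(s); "
                "input is not paired or not qname-grouped"
            )
        templates += 1
    return templates
```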

option to discard or separate duplicates

investigate alternatives to pysam.AlignedSegment

For convenience we currently perform template end calculations using pysam.AlignedSegment.
We do however incur a fair deserialize/serialize cost to get reads into and out of that form.
Investigate using just the raw record.
Most of it is pretty easy:
split by TAB, cast FLAG and POS to int, and use bitwise masks on the flag for is_mapped, is_forward, is_first etc.
Use regex to capture leading (fwd) and trailing (rev) soft clips.
The only mildly tricky bit is doing the CIGAR calculation properly to get reference_end: need to handle D and I ops.
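
A sketch of the raw-record approach described above, assuming a single tab-separated SAM line as input. The function name and the returned keys are illustrative. Coordinates follow the pysam convention (0-based start, exclusive end). In the CIGAR walk, D consumes reference bases and I does not, along with N, = and X on the reference-consuming side.

```python
import re

# SAM FLAG bits
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_READ1 = 0x40

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Soft clips may be preceded (leading) or followed (trailing) by a hard clip.
LEADING_SOFT = re.compile(r"^(?:\d+H)?(\d+)S")
TRAILING_SOFT = re.compile(r"(\d+)S(?:\d+H)?$")
# Operations that advance the position on the reference.
REF_CONSUMING = frozenset("MDN=X")


def parse_ends(line: str) -> dict:
    """Extract the values needed for template end calculations."""
    fields = line.rstrip("\n").split("\t")
    flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
    ref_len = sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in REF_CONSUMING
    )
    lead = LEADING_SOFT.search(cigar)
    trail = TRAILING_SOFT.search(cigar)
    return {
        "is_mapped": not flag & FLAG_UNMAPPED,
        "is_forward": not flag & FLAG_REVERSE,
        "is_first": bool(flag & FLAG_READ1),
        "reference_start": pos - 1,           # SAM POS is 1-based
        "reference_end": pos - 1 + ref_len,   # exclusive
        "leading_soft_clip": int(lead.group(1)) if lead else 0,
        "trailing_soft_clip": int(trail.group(1)) if trail else 0,
    }
```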

Handle single-ended reads

TEMPLATE_DUPLICATE_FRACTION calculation

Currently TEMPLATE_DUPLICATE_FRACTION uses the count of all templates in the denominator; i.e. it's just equal to TEMPLATES_MARKED_DUPLICATE/TEMPLATES. The denominator should however omit templates where neither end is aligned, as they are not assessed for duplicate status.
Whatever formula is used, it should be documented in the help or at least in the source README.
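
A sketch of the proposed formula. The count of templates with neither end aligned is passed in under a hypothetical name (templates_unmapped); the actual metric name, if one exists, may differ.

```python
def template_duplicate_fraction(templates: int,
                                templates_unmapped: int,
                                templates_marked_duplicate: int) -> float:
    """Duplicate fraction over templates that were actually assessed.

    Templates where neither end is aligned are never assessed for
    duplicate status, so they are excluded from the denominator.
    """
    assessed = templates - templates_unmapped
    if assessed == 0:
        return 0.0
    return templates_marked_duplicate / assessed


# 1000 templates, 200 with neither end aligned, 100 marked duplicate:
# current formula:  100 / 1000 = 0.1
# proposed formula: 100 / 800  = 0.125
fraction = template_duplicate_fraction(1000, 200, 100)
```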

Error queue

Capture Exceptions in child processes

Add @PG line into output SAM file header

ID:streammd PN:streammd VN:... etc
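
A sketch of how the header line could be built, assuming the header is available as a list of tab-separated lines. ID, PN, VN, CL and PP are standard @PG tags in the SAM specification. Chaining PP to the last @PG line found is a simplification, and the function name is hypothetical.

```python
def add_pg_line(header_lines, version: str, command_line: str):
    """Return a new header with a streammd @PG line appended.

    If the header already has @PG lines, the new line points at the
    last one via PP so the processing chain stays traceable. A real
    implementation would also need to keep ID unique if ID:streammd
    is already present.
    """
    previous_id = None
    for line in header_lines:
        if line.startswith("@PG\t"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("ID:"):
                    previous_id = field[3:]
    fields = [
        "@PG",
        "ID:streammd",
        "PN:streammd",
        f"VN:{version}",
        f"CL:{command_line}",
    ]
    if previous_id is not None:
        fields.append(f"PP:{previous_id}")
    return list(header_lines) + ["\t".join(fields)]
```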
