streammd's Issues

some tests fail when compiling with clang

Tests of SamRecord.update_dup_status fail when compiling with clang...
Looks like pgtag_ is empty, and therefore pgidx_ that is set in parse() is actually the value of PNEXT instead of either the position of PG:Z (if present) or the end of the record.

Possibly related to the inline static const declaration and initialization of pgtag_?

stored items count incorrect

The fractional capacity reported in the log message after processing is calculated from templates/n where templates is actually "templates seen". It should be templates - templates_marked_duplicate i.e. "unique templates".

This has no effect on duplicate marking or metrics but can give an unnecessary exit due to overcapacity because the true used capacity of the Bloom filter is lower than stated.
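
The correction described above can be sketched as follows. This is a minimal illustration, not streammd's actual code: the function name and the example counts are hypothetical, and `n` stands for the Bloom filter capacity in items as in the log message.

```python
def used_capacity_fraction(templates_seen: int,
                           templates_marked_duplicate: int,
                           n: int) -> float:
    """Fraction of Bloom filter capacity n taken up by unique templates."""
    if n <= 0:
        raise ValueError("capacity n must be positive")
    # Count only templates that were not marked duplicate ("unique templates").
    unique_templates = templates_seen - templates_marked_duplicate
    return unique_templates / n


# Hypothetical counts, chosen only to show how the two calculations diverge.
seen, dups, capacity = 1_200_000, 300_000, 1_000_000
reported = seen / capacity                                  # templates seen / n
corrected = used_capacity_fraction(seen, dups, capacity)    # unique templates / n
print(f"reported={reported:.2f} corrected={corrected:.2f}")
```

With these made-up numbers the reported fraction exceeds 1 while the corrected fraction does not, which is the spurious overcapacity case the issue describes.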

Improve help for --fp-rate option

Make it clear in the help and documentation that this is the maximum acceptable marginal FP rate, not the true bulk FP rate that will be achieved after processing, which will be lower.
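
To illustrate the distinction, here is a rough sketch using the standard Bloom filter approximation (1 - e^(-k*i/m))^k for the false positive probability once i items are stored. The function names are hypothetical, and treating the bulk rate as the average of the marginal rate over an even fill from 0 to n is an assumption made for illustration; streammd's own accounting may differ. The n, m and k values are the ones shown in the BloomFilter log line quoted in another issue on this page.

```python
import math


def marginal_fp_rate(i: float, m: int, k: int) -> float:
    """Approximate FP probability for a query made when i items are stored."""
    return (1.0 - math.exp(-k * i / m)) ** k


def bulk_fp_rate(n: int, m: int, k: int, steps: int = 10_000) -> float:
    """Average marginal FP rate while the filter fills from 0 to n items.

    Midpoint-rule approximation; assumes items arrive at an even rate.
    """
    total = 0.0
    for j in range(steps):
        i = (j + 0.5) * n / steps
        total += marginal_fp_rate(i, m, k)
    return total / steps


# Values from the log line: p=1e-06 n=993917924 m=34359738368 k=10
n, m, k = 993_917_924, 34_359_738_368, 10
print(f"marginal FP rate at capacity: {marginal_fp_rate(n, m, k):.3g}")
print(f"bulk FP rate over the whole run: {bulk_fp_rate(n, m, k):.3g}")
```

The marginal rate is only reached by the last items processed when the filter is at full capacity; every earlier query sees a emptier filter, so the average over the run is lower.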

Versioning

use branches, tags and bump versions when required

Add streammd to bioconda clouds

Hi,

streammd is a great tool, but it is not so easy to compile in some system environments. Would you please add it to bioconda clouds?

Best,
Kun

align orphan read logic with SAMBLASTER

streammd uses broadly the same strategy as SAMBLASTER to detect and mark duplicates on templates where one end is not mapped:

the signature of the mapped read must match a previously seen orphan

i.e. orphans are compared only to orphans. This is in contrast to MarkDuplicates where orphans may be compared to and match reads in completely mapped templates.

There is however one difference between streammd 4.1.7 and SAMBLASTER in that streammd currently aligns with SAMBLASTER logic prior to 0.1.25 i.e.

all orphans are treated as if on forward strand

Starting with SAMBLASTER version 0.1.25:

forward and reverse strand orphans/singletons are allowed to be duplicates of each other

In our test data this marks between 5% and 10% more orphan dups. The logic behind this is probably sound, and in any case aligning streammd logic exactly with the most recent SAMBLASTER makes it easier to compare the tools directly.

Problem with installation

Hello!
I'd like to use your tool as an alternative to Picard Markdup, but I have a problem building the tool.
I use gcc version 7.5.0 (the requirement is gcc >= 7.1), but the make command returns an error:

depbase=`echo bloomfilter.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT bloomfilter.o -MD -MP -MF $depbase.Tpo -c -o bloomfilter.o bloomfilter.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo markdups.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT markdups.o -MD -MP -MF $depbase.Tpo -c -o markdups.o markdups.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo streammd.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT streammd.o -MD -MP -MF $depbase.Tpo -c -o streammd.o streammd.cxx &&\
mv -f $depbase.Tpo $depbase.Po
In file included from streammd.cxx:4:0:
../external/argparse/argparse.hpp:36:10: fatal error: charconv: No such file or directory
 #include <charconv>
          ^~~~~~~~~~
compilation terminated.
Makefile:444: recipe for target 'streammd.o' failed
make: *** [streammd.o] Error 1

Could you help me, please?
Best regards,
Polina.

Documentation

Replace current README.md (which is essentially dev notes) with proper install instructions and usage.

Keep performance profiling, perhaps.

got 1 primary alignment(s). Input is not paired or not qname-grouped?

Hi,

I got this error when running streammd 4.0.2:
[2023-01-07 21:40:43.242] [main] [info] BloomFilter initialized with p=1e-06 n=993917924 m=34359738368 k=10
[2023-01-07 21:40:43.242] [main] [info] BloomFilter capacity: 993917924 items
[2023-01-07 21:40:43.993] [main] [error] A00601:371:HLNYMDSXY:3:1101:3983:19413: got 1 primary alignment(s). Input is not paired or not qname-grouped?

The pipeline I used (for huge sequencing data):
step 1: split the fastq into many pieces
step 2: bwa mem to map each piece separately
step 3: samtools sort each piece separately
step 4: samtools merge the pieces into a single bam
step 5:
samtools-1.9 view -@ 4 -h F1.sort.bam|streammd --metrics F1.dedup.metrics |samtools-1.9 view -@ 4 -bS -o F1.dedup.bam

And got the error I have mentioned above.

Best,
Kun
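
One possible cause, going by the error text: samtools sort orders records by coordinate unless told otherwise, so the two ends of a template are usually no longer adjacent in the merged BAM. The sketch below is a hypothetical checker (not part of streammd) that counts primary alignments in each run of consecutive records sharing a QNAME; the abbreviated SAM records are made up for illustration.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple

# SAM FLAG bits marking non-primary records.
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def primary_counts(sam_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (qname, primary alignment count) per run of consecutive records."""
    records = (
        line.split("\t", 2)
        for line in sam_lines
        if line.strip() and not line.startswith("@")  # skip blanks and header
    )
    for qname, group in groupby(records, key=lambda fields: fields[0]):
        n_primary = sum(
            1 for fields in group
            if not (int(fields[1]) & (SECONDARY | SUPPLEMENTARY))
        )
        yield qname, n_primary


# Made-up records, truncated to the first four SAM fields.
coordinate_sorted = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100",
    "r2\t99\tchr1\t150",
    "r1\t147\tchr1\t300",
    "r2\t147\tchr1\t350",
]
qname_grouped = [
    "r1\t99\tchr1\t100",
    "r1\t147\tchr1\t300",
    "r1\t2147\tchr2\t500",  # supplementary, not counted as primary
    "r2\t99\tchr1\t150",
    "r2\t147\tchr1\t350",
]

for name, lines in (("coordinate", coordinate_sorted), ("grouped", qname_grouped)):
    bad = [(q, c) for q, c in primary_counts(lines) if c != 2]
    print(name, "ok" if not bad else f"not paired or not qname-grouped: {bad}")
```

If the input turns out to be coordinate sorted, feeding streammd the bwa mem output before sorting, or regrouping the BAM with samtools collate or samtools sort -n, should restore the required ordering.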

investigate alternatives to pysam.AlignedSegment

For convenience we currently perform template end calculations using pysam.AlignedSegment.
We do however incur a fair deserialize/serialize cost to get reads into and out of that form.
Investigate using just the raw record.
Most of it is pretty easy:
split by TAB, cast FLAG and POS to int, and use bitwise masks on the flag for is_mapped, is_forward, is_first etc.
Use regex to capture leading (fwd) and trailing (rev) soft clips.
The only mildly tricky bit is doing the CIGAR calculation properly to get reference_end: need to handle D and I ops.
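
A minimal sketch of the raw-record approach outlined above, under some assumptions: the names RawRead and parse_raw are hypothetical, coordinates follow the pysam convention (0-based start, exclusive end), and reference_end simply equals reference_start when the CIGAR is `*`, whereas pysam returns None for unmapped reads.

```python
import re
from typing import NamedTuple

# SAM FLAG bits used below.
UNMAPPED, REVERSE, FIRST_IN_PAIR = 0x4, 0x10, 0x40

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# A soft clip may sit inside a hard clip at either end of the CIGAR.
LEADING_SOFT_CLIP = re.compile(r"^(?:\d+H)?(\d+)S")
TRAILING_SOFT_CLIP = re.compile(r"(\d+)S(?:\d+H)?$")
# Ops that advance along the reference: D counts, I does not.
REF_CONSUMING = frozenset("MDN=X")


class RawRead(NamedTuple):
    is_mapped: bool
    is_forward: bool
    is_first: bool
    reference_start: int  # 0-based, inclusive
    reference_end: int    # 0-based, exclusive
    leading_clip: int
    trailing_clip: int


def parse_raw(record: str) -> RawRead:
    """Extract template-end inputs from one raw SAM record."""
    fields = record.rstrip("\n").split("\t")
    flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
    ref_len = sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in REF_CONSUMING
    )
    lead = LEADING_SOFT_CLIP.search(cigar)
    trail = TRAILING_SOFT_CLIP.search(cigar)
    start = pos - 1  # SAM POS is 1-based
    return RawRead(
        is_mapped=not (flag & UNMAPPED),
        is_forward=not (flag & REVERSE),
        is_first=bool(flag & FIRST_IN_PAIR),
        reference_start=start,
        reference_end=start + ref_len,
        leading_clip=int(lead.group(1)) if lead else 0,
        trailing_clip=int(trail.group(1)) if trail else 0,
    )


# Made-up record: reverse strand, first in pair, with clips, a D and an I.
example = "r1\t83\tchr1\t1000\t60\t5S10M2D3I20M4S\t=\t900\t-130\t*\t*"
print(parse_raw(example))
```

Here the reference span is 10 + 2 + 20 = 32 bases: the 2D contributes to it, while the 3I and both soft clips do not.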

TEMPLATE_DUPLICATE_FRACTION calculation

Currently TEMPLATE_DUPLICATE_FRACTION uses the count of all templates in the denominator; i.e. it's just equal to TEMPLATES_MARKED_DUPLICATE/TEMPLATES. The denominator should however omit templates where neither end is aligned, as they are not assessed for duplicate status.
Whatever formula is used, it should be documented in the help or at least in the source README.
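
A small sketch of the proposed calculation, for concreteness. The function name and the `templates_unmapped` parameter (templates where neither end is aligned) are hypothetical names introduced for this example, and the counts are made up.

```python
def template_duplicate_fraction(templates: int,
                                templates_marked_duplicate: int,
                                templates_unmapped: int) -> float:
    """Duplicate fraction over templates that were assessed for duplication.

    templates_unmapped counts templates where neither end is aligned; these
    are never assessed, so they are left out of the denominator.
    """
    assessed = templates - templates_unmapped
    if assessed <= 0:
        return 0.0  # nothing was assessed, so nothing can be a duplicate
    return templates_marked_duplicate / assessed


# Made-up counts: 1000 templates, 100 fully unmapped, 90 marked duplicate.
current = 90 / 1000                                      # current formula
proposed = template_duplicate_fraction(1000, 90, 100)    # proposed formula
print(f"current={current:.3f} proposed={proposed:.3f}")
```

The two formulas agree only when every template has at least one aligned end; otherwise the current formula understates the fraction.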
