Git Product home page Git Product logo

Comments (11)

vdauwera avatar vdauwera commented on June 2, 2024

Oops, didn't mean to close this -- wrong button.

@s-andrews I agree this looks like an overflow. Forgive the naive question, but does it make biological sense to want to try to handle such a large intron? To be honest my instinct would be to treat this as a bad input and filter out this read.

@nh13 Has any thought been put into setting specific thresholds for what size elements are acceptable in CIGARs, or are we just using some variable type's max value? Would be good to do the former if only so we could document what are the limitations. Unless we do want to be able to handle insanely large introns (no offense @s-andrews)?

from htsjdk.

s-andrews avatar s-andrews commented on June 2, 2024

I don't actually think this makes any biological sense, but if there are programs which are generating these long splices (which apparently there are) then the program should parse them correctly, or give a sensible error if the value is too long, rather than just overflowing.

Personally I'd have thought an int (31 bits) should be enough for this. I don't know of any examples of individual chromosomes which are longer than this, which is the ultimate limit. Internally picard must be using something like this anyway since parsing the same data from a sam file works - it's only the BAM parser which is affected.

from htsjdk.

jmarshall avatar jmarshall commented on June 2, 2024

The practical limit on this is the way CIGAR is encoded in the BAM format. It gets 28 bits: there is a uint32_t, but the bottom four bits are used to encode the MIDN… operator.

Your N operator has a length > 227 so sets the top/"sign" bit of that unit32_t, so probably the htsjdk bug is that it is doing an arithmetic shift to unpack the cigar length rather than a logical shift.

from htsjdk.

nh13 avatar nh13 commented on June 2, 2024

Looks like an arithmetic shift (>>) is being used rather than a logical shift (>>>): https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BinaryCigarCodec.java#L84

@vdauwera looks like a bug to me, but we should also fix the other arithmetic shifts in this code too.

from htsjdk.

vdauwera avatar vdauwera commented on June 2, 2024

Ah ok, thanks all for the very helpful clarifications. This looks like a very secon-friendly task.

I'll put in a separate task to find & fix other instances of injudicious arithmetic shifts since that's a broader scope than this particular bug fix.

from htsjdk.

nh13 avatar nh13 commented on June 2, 2024

@vdauwera only need to look for incorrect arithmetic shifts in this class as it affects the cigar decoding and encoding, so probably not worth another task.

from htsjdk.

vdauwera avatar vdauwera commented on June 2, 2024

To be clear, the work for this issue consists of:

  1. Fix the shift type that is causing the bug
  2. Add a test to check that the test case included above gets handled correctly

from htsjdk.

vdauwera avatar vdauwera commented on June 2, 2024

Oh I thought you meant throughout the codebase -- so it's just the operations on lines 50 and 86 then?

from htsjdk.

nh13 avatar nh13 commented on June 2, 2024

Ah, now you have made me think about it. Line 50 shouldn't matter since it is shifting up not down.

from htsjdk.

vdauwera avatar vdauwera commented on June 2, 2024

@s-andrews Do we have your permission to use your bam snippet in the integration test set?

from htsjdk.

vdauwera avatar vdauwera commented on June 2, 2024

Thanks Nick!

from htsjdk.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.