Comments (11)
Oops, didn't mean to close this -- wrong button.
@s-andrews I agree this looks like an overflow. Forgive the naive question, but does it make biological sense to want to try to handle such a large intron? To be honest my instinct would be to treat this as a bad input and filter out this read.
@nh13 Has any thought been put into setting specific thresholds for what size elements are acceptable in CIGARs, or are we just using some variable type's max value? Would be good to do the former if only so we could document what are the limitations. Unless we do want to be able to handle insanely large introns (no offense @s-andrews)?
from htsjdk.
I don't actually think this makes any biological sense, but if there are programs which are generating these long splices (which apparently there are) then the program should parse them correctly, or give a sensible error if the value is too long, rather than just overflowing.
Personally I'd have thought an int (31 bits) should be enough for this. I don't know of any examples of individual chromosomes which are longer than this, which is the ultimate limit. Internally picard must be using something like this anyway since parsing the same data from a sam file works - it's only the BAM parser which is affected.
from htsjdk.
The practical limit on this is the way CIGAR is encoded in the BAM format. It gets 28 bits: there is a uint32_t
, but the bottom four bits are used to encode the MIDN
… operator.
Your N operator has a length > 227 so sets the top/"sign" bit of that unit32_t
, so probably the htsjdk bug is that it is doing an arithmetic shift to unpack the cigar length rather than a logical shift.
from htsjdk.
Looks like an arithmetic shift (>>
) is being used rather than a logical shift (>>>
): https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BinaryCigarCodec.java#L84
@vdauwera looks like a bug to me, but we should also fix the other arithmetic shifts in this code too.
from htsjdk.
Ah ok, thanks all for the very helpful clarifications. This looks like a very secon-friendly task.
I'll put in a separate task to find & fix other instances of injudicious arithmetic shifts since that's a broader scope than this particular bug fix.
from htsjdk.
@vdauwera only need to look for incorrect arithmetic shifts in this class as it affects the cigar decoding and encoding, so probably not worth another task.
from htsjdk.
To be clear, the work for this issue consists of:
- Fix the shift type that is causing the bug
- Add a test to check that the test case included above gets handled correctly
from htsjdk.
Oh I thought you meant throughout the codebase -- so it's just the operations on lines 50 and 86 then?
from htsjdk.
Ah, now you have made me think about it. Line 50 shouldn't matter since it is shifting up not down.
from htsjdk.
@s-andrews Do we have your permission to use your bam snippet in the integration test set?
from htsjdk.
Thanks Nick!
from htsjdk.
Related Issues (20)
- Nashorn engine removed from OpenJDK 17 causes htsjdk to throw exceptions HOT 2
- Question: how to code a GTFCodec beside of the existing GFF3Codec HOT 1
- LittleEndianInputStream EOF test is incorrect HOT 9
- EdgeReadIterator (used in Picard's CollectWgsMetrics) doesn't handle intervals correctly HOT 2
- IntervalTree.Node's methods have package visibility only HOT 1
- What is the non-deprecated replacement for acceptableAlleleBases? HOT 4
- Our FastaSequenceIndex too large to fit in memory HOT 4
- The last-but-one argument of test testQueryAlignmentStart in HtsCRAMCodec30And21QueryTest.java is long HOT 1
- CRAM files have zero-length blocks with a compression method associated with them (so cannot be decompressed). HOT 1
- VCF sorting change breaks existing VCFs HOT 1
- Race conditions accessing CRAM's reference data through ReferenceSource
- Add a better error message when the reference is too long for indexing HOT 5
- Question: Tabix index to find min(POS) and max(POS) in each contig
- Histogram class uses unsafe methods for returning min/max
- CRAM: 8-Bin Base Quality Compression HOT 1
- SamReader slower than expected on network drive HOT 1
- Escaped doublequotes in INFO descriptions result in invalid VCF file HOT 2
- Question: How to generate FAI files and TBI files HOT 5
- SamReaderFactory should use current defaults when makeDefault() is called
- Error when reading a VCF 4.3 file with spaces in INFO field values HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from htsjdk.