Git Product home page Git Product logo

cramtools's Introduction

CRAMTools is no longer supported. Please use Samtools: http://www.htslib.org/.

CRAMTools is a set of Java tools and APIs for efficient compression of sequence read data. Although this is intended as a stable version the code is released as early access. Parts of the CRAMTools are experimental and may not be supported in the future.

http://www.ebi.ac.uk/ena/about/cram_toolkit Version 3.0

Change log:

Input files:

  1. Reference sequence in fasta format <fasta file>
  2. Reference sequence index file <fasta file>.fai created using samtools (samtools faidx <fasta file>)
  3. Input BAM file <BAM file> sorted by reference coordinates
  4. BAM index file <BAM file>.bai created using samtools (samtools index <BAM file>)
  5. Download and run the program: download the prebuilt runnable jar file from https://github.com/enasequence/cramtools/blob/master/cramtools-3.0.jar?raw=true
  6. Execute the command line program: java -jar cramtools-3.0.jar. Usage is printed if no arguments were given

To convert a BAM file to CRAM:

java -jar cramtools-3.0.jar cram \
    --input-bam-file <bam file> \
    --reference-fasta-file <reference fasta file> \
    [--output-cram-file <output cram file>]

To convert a CRAM file to BAM:

java -jar cramtools-3.0.jar bam \
    --input-cram-file <input cram file> \
    --reference-fasta-file <reference fasta file> \
    --output-bam-file <output bam file>

To build the program from source:

  1. To check out the source code from github you will need git client: http://git-scm.com/
  2. Make sure you have java 1.7 or higher: http://openjdk.java.net/ or http://www.oracle.com/us/technologies/java/index.html
  3. Make sure you have ant version 1.7 or higher: http://ant.apache.org/
  4. Run the following commands
# Clone the repository to your local directory:
git clone https://github.com/enasequence/cramtools.git

# Change to the directory: 
cd cramtools

# Build a runnable jar file: 
ant -f build/build.xml runnable

# Run cramtools
java -jar cramtools-3.0.jar 

Picard integraion

Picard tools have been removed from cramtools because Picard supports CRAM via htsjdk.

Reference sequence discovery

cramtools supports the following reference discovery mechanism:

  1. check local file provided in the command line (-R or --reference-fasta-file option)
  2. download sequences from the ENA reference registry using MD5 checksums from the SAM header
  3. access local cache using REF_CACHE/REF_PATH environment variables. Please refer to http://www.htslib.org/doc/samtools.html for more details.

The following tools have been included into this release

  • cram: (SAM/BAM to CRAM conversion)
  • bam: (CRAM to SAM/BAM conversion)
  • index: (CRAM indexing, can produce CRAI or BAI index)
  • merge: (merges several SAM/BAM/CRAM files into one)
  • fastq: (dump reads in fastq format)
  • fixheader: (fix CRAM file header, namely reference sequence MD5 checksums)
  • getref: (download all reference sequences mentioned in a CRAM file)
  • qstat: (produce some basic quality score stats for a CRAM file)

The usage can be accessed by calling cramtools with the corresponding command as a single argument.

Lossy model

Bam2Cram allows to specify lossy model via a string which can be composed of one or more words separated by '-'.

Each word is read or base selector and quality score treatment, which can be binning (Illumina 8 bins) or full scale (40 values).

Here are some examples:
  • N40-D8: preserve quality scores for non-matching bases with full precision, and bin quality scores for positions flanking deletions.
  • m5: preserve quality scores for reads with mapping quality score lower than 5
  • R40X10-N40: preserve non-matching quality scores and those matching with coverage lower than 10
  • *8: bin all quality scores

Selectors:

  • R: bases matching the reference sequence
  • N: aligned bases mismatching the reference, this only applies to 'M', '=' (EQ) or 'X' BAM cigar elements.
  • U: unmapped read
  • Pn: pileup: capture all bases at a given position on the reference if there are at least n mismatches
  • D: read positions flanking a deletion
  • Mn: reads with mapping quality score higher than n
  • mn: reads with mapping quality score lower than n
  • I: insertions
  • *: all

By default no quality scores will be preserved.

Illimuna 8-binning scheme:

0, 1, 6, 6, 6, 6, 6, 6, 6, 6, 15, 15, 15, 15, 15, 15, 15, 15, 15,
15, 22, 22, 22, 22, 22, 27, 27, 27, 27, 27, 33, 33, 33, 33, 33, 37,
37, 37, 37, 37, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40 

Check for more on our web site: http://www.ebi.ac.uk/ena/about/cram_toolkit

cramtools's People

Contributors

fduckart avatar flashton2003 avatar raskoleinonen avatar vadimzalunin avatar zyxue avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

cramtools's Issues

NullPointerException at Bam2Cram.java:283

When I run:
cramtools cram --input-bam-file 5M.bam --reference-fasta-file /hive/groups/encode/3/encValData/hg19/hg19.fa --output-cram-file 5M.cram

I get:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at net.sf.cram.CramTools.invoke(CramTools.java:93)
at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.NullPointerException
at net.sf.cram.Bam2Cram.main(Bam2Cram.java:283)
... 6 more
If you got time it'd be great if you'd look into it. I'm not sure how best to get the 5M.bam file to you. It's 228,839,335 bytes.

Lossy compression: cramtools 2.1 outperforms other implementations

Experimenting with reducing Q-score resolution I have found that cramtools.2.1 producing significantly smaller crams. For instance if we throw away all Q-scores from a 25x genome with cramtool.2.1 I get cram of 2.3GB. With cramtools.3.0 or scramble (-V 2.1 and -V 3.0) crams are around 6GB.
I get similar behaviour with multiple datasets when reducing the number of Q-score bins to 4 or 2 but not with an 8 scheme.
I wonder if anyone else has seen this behaviour?

CIGAR alignment length changed for sequence "*"

A SAM sequence of "*" with a CIGAR aliginment of 10M gets the alignment converted to 0M after passing through cramtools 3.0 (2.1 blows up). Additionally it computes the checksum differently to htslib, but it is unclear which is incorrect. In part parts:

Part 1) Using htslib/test/c1#noseq.sam

jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar cram -R c1.fa -I c1#noseq.sam -O _tmp.cram -n -Q --capture-all-tags
jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar bam -R c1.fa -I _tmp.cram -O _tmp.sam
jkb@seq3a[htslib/test] grep -n SQ1 c1#noseq.sam _tmp.sam
c1#noseq.sam:4:SQ1      0       c1      1       0       10M     *       0       0       *       *  MD:Z:10  NM:i:0
_tmp.sam:7:SQ1  0       c1      1       0       0M      *       0       0       *       *

It appears to be because it's stored read length 0 in the file, which is the length of the BAM sequence, but not the length that is needed in order to reconstruct the CIGAR string. Cram_dump shows:

Rec 3/9 at 1,3
BF = 0 => SAM 0x0 (ret 0, out_sz 1)
CF = 10 (ret 0, out_sz 1)
RL = 0 (ret 0, out_sz 1)
AP = 0 (ret 0, out_sz 1)
RG = -1 (ret 0, out_sz 1)
RN = SQ1 (ret 0, out_sz 3)
Detached
MF = 0 (ret 0, out_sz 1)
NS = -1 (ret 0, out_sz 1)
NP = 0 (ret 0, out_sz 1)
TS = 0 (ret 0, out_sz 1)
TL = 0 (ret 0, out_sz 1)
FN = 0 (ret 0, out_sz 1)
MQ = 0 (ret 0, out_sz 1)

If I create the cram file using htslib then Cramtools can decode it and get the 10M back again. Cram_dump shows the difference:

Rec 3/9 at 1,3
BF = 0 => SAM 0x0 (ret 0, out_sz 1)
CF = 11 (ret 0, out_sz 1)
RL = 10 (ret 0, out_sz 1)
AP = 0 (ret 0, out_sz 1)
RG = -1 (ret 0, out_sz 1)
RN = SQ1 (ret 0, out_sz 3)
Detached
MF = 0 (ret 0, out_sz 1)
NS = -1 (ret 0, out_sz 1)
NP = 0 (ret 0, out_sz 1)
TS = 0 (ret 0, out_sz 1)
TL = 1 (ret 0, out_sz 1)
0: TN= MDZ
id=5063770
0: Val 31 30 00
1: TN= NMC
id=5131587
1: Val 00
FN = 0 (ret 0, out_sz 1)
MQ = 0 (ret 0, out_sz 1)
QS = (ret 0, out_sz 10)

Note here it has also stored the MD:Z: and NM:i tag verbatim as it cannot be sure that they would be recreated correctly during decoding as it has no sequence to work from.

Part 2)

Decoding the above htslib generated file, cramtools-3.0.jar gives me an error regarding the sequence checksums:

jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar bam -R c1.fa -I _tmp.cram -O _tmp.sam
ERROR 2015-04-16 15:07:16 ContentDigests Content hash mismatch for tag SD, actual: 06607F2A; expected: 8244A410

The actual sequence shown doesn't seem to differ between the input and output files:

jkb@seq3a[htslib/test] cat c1#noseq.sam _tmp.sam
@SQ     SN:c1   LN:10
sq1     0       c1      1       0       10M     *       0       0       AACCGCGGTT      ********** MD:Z:10  NM:i:0
sQ1     0       c1      1       0       10M     *       0       0       AACCGCGGTT      *       MD:Z:10     NM:i:0
SQ1     0       c1      1       0       10M     *       0       0       *       *       MD:Z:10 NM:i:0
sq2     0       c1      1       0       4M1D5M  *       0       0       AACCCGGTT       *********  MD:Z:4^G5        NM:i:1
sQ2     0       c1      1       0       4M1D5M  *       0       0       AACCCGGTT       *       MD:Z:4^G5   NM:i:1
SQ2     0       c1      1       0       4M1D5M  *       0       0       *       *       MD:Z:4^G5  NM:i:1
sq3     4       c1      1       0       *       *       0       0       AACCCGGTT       *********
sQ3     4       c1      1       0       *       *       0       0       AACCCGGTT       *
SQ3     4       c1      1       0       *       *       0       0       *       *
@HD     VN:1.4  SO:unsorted
@SQ     SN:c1   LN:10   UR:/nfs/users/nfs_j/jkb/work/samtools_master/htslib/test/c1.fa  M5:e224bd8a70e5466c6a033be886045c85
@PG     ID:0    PN:cramtools    VN:3.0-b12      CL:java /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar bam -R c1.fa -I _tmp.cram -O _tmp.sam
sq1     0       c1      1       0       10M     *       0       0       AACCGCGGTT      **********
sQ1     0       c1      1       0       10M     *       0       0       AACCGCGGTT      *
SQ1     0       c1      1       0       10M     *       0       0       *       *       MD:Z:10 NM:i:0
sq2     0       c1      1       0       4M1D5M  *       0       0       AACCCGGTT       *********
sQ2     0       c1      1       0       4M1D5M  *       0       0       AACCCGGTT       *
SQ2     0       c1      1       0       4M1D5M  *       0       0       *       *       MD:Z:4^G5  NM:i:1
sq3     4       c1      1       0       *       *       0       0       AACCCGGTT       *********
sQ3     4       c1      1       0       *       *       0       0       AACCCGGTT       *
SQ3     4       c1      1       0       *       *       0       0       *       *

Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.utils.CountingOutputStream

Getting this exception using the versioned download in v3.0 (https://github.com/enasequence/cramtools/archive/v3.0.tar.gz):

time java \
    -jar "cramtools-3.0/cramtools-3.0.jar" \
    cram \
    --input-bam-file '130551a4-2010-4556-b817-6a295940c8e5/257235f3926b2be84e8a9e80acdfb345.bam' \
    --reference-fasta-file 'genome.fa' \
    --output-cram-file 'test-cramtools-3.0.cram'
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at net.sf.cram.CramTools.invoke(CramTools.java:91)
    at net.sf.cram.CramTools.main(CramTools.java:121)
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/CountingOutputStream
    at htsjdk.samtools.cram.build.CramIO.writeContainerForSamFileHeader(CramIO.java:247)
    at htsjdk.samtools.cram.build.CramIO.writeCramHeader(CramIO.java:161)
    at net.sf.cram.Bam2Cram.main(Bam2Cram.java:405)
    ... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.utils.CountingOutputStream
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 9 more

Sequence * with cigar strings

I have a SAM file (htslib/test/ce#5b.sam) that has a cigar string (7S20M1D23M10I30M10S) and sequence/quality of *.

This is legal, but is not supported from Cramtools. I get this error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 14
        at net.sf.cram.Bam2Cram.updateTracks(Bam2Cram.java:93)
        at net.sf.cram.Bam2Cram.convert(Bam2Cram.java:132)
        at net.sf.cram.Bam2Cram.main(Bam2Cram.java:387)
        ... 6 more

fastq for paired-end sequencing

I am trying to create fastq file(s) for each respective read group within the cram file. There should be a total of 4 different files instead of the 1 large fastq file that has been made. Are there any special parameters that need to be added if this is possible through cramtools?

java -jar cramtools-3.0.jar
fastq
-I cram_file.cram
-R ref_genome \

Cramtools-3.0.jar error

I don't seem able to create any CRAM files using the cramtools-3.0.jar. I checked out the latest "cram3" branch and rebuilt. I get this error:

java -Xmx4000m -jar ~/work/cram/cramtools-3.0.jar cram -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa --capture-all-tags -Q -n -I _.sam -O _.cram
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.NullPointerException
        at net.sf.cram.encoding.writer.Writer.write(Writer.java:185)
        at net.sf.cram.build.ContainerFactory.buildSlice(ContainerFactory.java:199)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:89)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:58)
        at net.sf.cram.Bam2Cram.main(Bam2Cram.java:433)
        ... 6 more

Vadim, please email me for an example if you need one,.

Unsorted data crash in gamma codec

jkb@seq3a[htslib/test] cat a.sam
@SQ     SN:xx   LN:20
s2      0       xx      11      1       10M     *       0       0       TTTTTTTTTT      *
s3      0       xx      1       1       10M     *       0       0       AAAAAAAAAA      *
jkb@seq3a[htslib/test] java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar cram -R xx.fa -I a.sam -O _.cram -l debug
DEBUG   2015-04-09 17:18:38     Bam2Cram        Creating tracks for reference: name=xx, length=20.

INFO    2015-04-09 17:18:38     FixBAMFileHeader        Reference sequence MD5 not found, calculating: xx
DEBUG   2015-04-09 17:18:38     Bam2Cram        Writing 2 records for sequence 0, xx
INFO    2015-04-09 17:18:38     Bam2Cram        create: tracks 0ms, records 1ms, mating 0ms.
DEBUG   2015-04-09 17:18:38     CompressionHeaderFactory        Assigned external id to bases: 0
DEBUG   2015-04-09 17:18:38     CompressionHeaderFactory        Assigned external id to quality scores: 1
DEBUG   2015-04-09 17:18:38     CompressionHeaderFactory        Assigned external id to read names: 2
DEBUG   2015-04-09 17:18:38     CompressionHeaderFactory        Assigned external id to mate info: 3
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.RuntimeException: Invalid valid: -9
        at net.sf.cram.encoding.GammaIntegerCodec.numberOfBits(GammaIntegerCodec.java:68)
        at net.sf.cram.encoding.GammaIntegerCodec.numberOfBits(GammaIntegerCodec.java:23)
        at net.sf.cram.build.CompressionHeaderFactory$EncodingLengthCalculator.add(CompressionHeaderFactory.java:750)
        at net.sf.cram.build.CompressionHeaderFactory$IntegerEncodingCalculator.addValue(CompressionHeaderFactory.java:810)
        at net.sf.cram.build.CompressionHeaderFactory.getOptmialIntegerEncoding(CompressionHeaderFactory.java:618)
        at net.sf.cram.build.CompressionHeaderFactory.build(CompressionHeaderFactory.java:129)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:67)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:58)
        at net.sf.cram.Bam2Cram.main(Bam2Cram.java:369)
        ... 6 more

If I swap the two reads around to make them sorted, so pos 1, pos 11, then it works.

I am unsure if this is just lack of support for unsorted data or if it is just triggering a bug somewhere else.

Cramtools can decode this file if it is created by htslib.

Handling of sequences beyond reference extents.

If I have a sequence with position and CIGAR string that aligns bases off the end of the reference, the slice coordinates end up beyond the reference extents too.

For example:

@SQ     SN:c1   LN:10
s0      0       c1      1       0       10M     *       0       0       AACCGCGGTT      **********
s1      0       c1      2       0       10M     *       0       0       ACCGCGGTTC      **********
s2      0       c1      3       0       10M     *       0       0       CCGCGGTTCG      **********

(Note @sq LN:10, but alignments extend to base 12)

Cram_dump on the output tells me:

Slice 1/1, container offset 251, file offset 719
    Slice content type MAPPED_SLICE
    Slice ref seq    0
    Slice ref start  1
    Slice ref span   12
    Slice MD5        e224bd8a70e5466c6a033be886045c85
    Rec counter      0
    No. records      3
    No. blocks       6
    Blk IDS:         {0, 1, 2, 3, 4}
    Ref base id:     -1

The Slice MD5 is infact from base 1 to 10 inclusive as it's all it can possibly compute, so that is fine.

The question is, is 1+12 a valid slice start/span or should it be 1+10? I can see an argument for both, but personally I feel the slice coordinates should match the reference limits and be corrected.

The second related issue is that the overhanging bases are not stored as the encoder believes they match. They should be encoded as B features so they can be reconstructed. Arguably this isn't a bug though: broken data in => broken data out!

Note: this isn't hypothetical. I have seen aligners that produce such data.

BAM to CRAM conversion: Error:New position is beyond the reference: 1

Hi all,

I am trying to convert a BAM to CRAM and CRAMtools produces the following error:
$: java -jar cramtools-1.0-b357/cramtools-1.0.jar cram --input-bam-file input.bam --referencefasta-file GCA_000001405.15_GRCh38_genomic.fa --output-cram-file output.cram

Exception in thread "main" java.lang.RuntimeException: New position is beyond the reference: 1

Any help?
many thanks in advance
anto

RG lines are created where they should not exist

htslib/test/xx#rg.sam contains two different RG tags and also some reads with no RG at all. These latter reads get an RG tag assigned to them regardless, thus changing the contents of the SAM file on a SAM->CRAM->SAM round trip.

This happens with both 2.1 and 3.0 versions.

Edit: this is an encoder issue. If I decode the file created by Cramtools with scramble then it has the extra RG tags too. If I decode the file created by scramble with cramtools then it does not (working).

Before:

@HD     VN:1.4  SO:coordinate
@SQ     SN:xx   LN:20   AS:?    SP:?    UR:?    M5:bbf4de6d8497a119dda6e074521643dc
@RG     ID:x1   SM:x1
@RG     ID:x2   SM:x2   LB:x    PG:foo:bar      PI:1111
@PG     ID:emacs        PN:emacs        VN:23.1.1
@CO     also test
@CO     other   headers
a1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      **********      RG:Z:x1
b1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      **********      RG:Z:x2
c1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      **********
a2      16      xx      11      1       10M     *       0       0       TTTTTTTTTT      **********      RG:Z:x1
b2      16      xx      11      1       10M     *       0       0       TTTTTTTTTT      **********      RG:Z:x2
c2      16      xx      11      1       10M     *       0       0       TTTTTTTTTT      **********

After:

@HD     VN:1.4  SO:coordinate
@SQ     SN:xx   LN:20   AS:?    UR:?    M5:bbf4de6d8497a119dda6e074521643dc     SP:?
@RG     ID:x1   SM:x1
@RG     ID:x2   LB:x    PI:1111 SM:x2   PG:foo:bar
@PG     ID:emacs        PN:emacs        VN:23.1.1
@PG     ID:0    PN:cramtools    VN:2.1-b268     CL:java /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-2.1.jar cram -R xx.fa -I xx#rg.sam -O _tmp.cram -n -Q --capture-all-tags
@PG     ID:1    PN:cramtools    VN:2.1-b268     CL:java /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-2.1.jar bam -R xx.fa -I _tmp.cram -O _tmp.sam
@CO     also test
@CO     other   headers
a1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      **********      RG:Z:x1
b1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      **********      RG:Z:x2
c1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      **********      RG:Z:x2
a2      16      xx      11      1       10M     *       0       0       TTTTTTTTTT      **********      RG:Z:x1
b2      16      xx      11      1       10M     *       0       0       TTTTTTTTTT      **********      RG:Z:x2
c2      16      xx      11      1       10M     *       0       0       TTTTTTTTTT      **********      RG:Z:x2

NoClassDefFoundError in 2.1

Hi,
I got an exception when running following command:

$ java -cp cramtools-2.1.jar net.sf.picard.sam.ValidateSamFile
Exception in thread "main" java.lang.NoClassDefFoundError: net/sf/samtools/util/CollectionUtil$MultiMap
        at net.sf.picard.cmdline.CommandLineParser.<init>(CommandLineParser.java:178)
        at net.sf.picard.cmdline.CommandLineParser.<init>(CommandLineParser.java:203)
        at net.sf.picard.cmdline.CommandLineProgram.parseArgs(CommandLineProgram.java:232)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:124)
        at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:100)
Caused by: java.lang.ClassNotFoundException: net.sf.samtools.util.CollectionUtil$MultiMap
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        ... 5 more

Unmapped read decoding failure

My trivial SAM file:

I       4       *       0       1       *       *       0       0       CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
II      4       *       0       1       *       *       0       0       CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
III     4       *       0       1       *       *       0       0       CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
IV      4       *       0       1       *       *       0       0       CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
V       4       *       0       1       *       *       0       0       CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
VI      4       *       0       1       *       *       0       0       ACTAAGCCTAAGCCTAAGCCTAAGCCAATTATCGATTTCTGAAAAAATTATCGAATTTTCTAGAAATTTTGCAAATTTTTTCATAAAATTATCGATTTTA    #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

It encodes to a cram file, but decodes as blank. Htslib cannot decode the CRAM produced at all and gets a CRC failure.

If I convert this file to CRAM using htslib then Cramtools 3 gets the same case as before - no data reported and a zero exit code.

(I cannot test version 2.1 on this file as cramtools 2.1 fails to encode it.)

I haven't yet worked out who is at fault for the CFC failure, but definitely decoding as a blank file seems incorrect.

Exception running ValidateCramFile

Best,

We would like to use CRAMtools to archive large amounts of BAM files. We want to make sure the proces of cramming went without problems and therefor we run ValidateCramFile afterward.

Yesterday I crammed about 400 BAM files. For all files I ran ValidateCramFile like this:

java -cp /share/apps/cramtools-2.1.jar net.sf.cram.ValidateCramFile -I /scratch/DNAXXXXX.cram -R /data/references/GrCh37/GrCh37_reference.fa

For all 400 samples I run into the following exception:

Exception in thread "main" net.sf.samtools.SAMException: Exception counting mismatches for read 4923571 1/2 100b aligned read.
at net.sf.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:296)
at net.sf.samtools.util.SequenceUtil.calculateSamNmTag(SequenceUtil.java:481)
at net.sf.picard.sam.SamFileValidator.validateNmTag(SamFileValidator.java:461)
at net.sf.picard.sam.SamFileValidator.validateSamRecords(SamFileValidator.java:296)
at net.sf.cram.ValidateCramFile.main(ValidateCramFile.java:116)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 142535434
at net.sf.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:282)
... 4 more

Does someone know what goes wrong and what I can do to fix this?

Many thanks,

Rick

X vs B read features

For mismatching non-ACTGN bases cramtools always uses substitution from N. It should use a 'base' read feature or similar.

Content hash mismatch for tag SD

From a subdirectory below htslib/test dir, I pick the first 7 lines (5 header, 2 seq) from ce#5b.sam and convert from SAM to CRAM and back again via the latest cramtools-3.0.jar.

jkb@seq3a[test/_3.0] head -7 ../ce#5b.sam > a.sam
jkb@seq3a[test/_3.0] java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar cram -R ../ce.fa -I a.sam -O _.cram && java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -R ../ce.fa -I _.cram
ERROR   2015-04-09 12:15:55     ContentDigests  Content hash mismatch for tag SD, actual: DEB5218B; expected: 03D28681
ERROR   2015-04-09 12:15:55     ContentDigests  Content hash mismatch for tag SD, actual: DEB5218B; expected: 03D28681
I       16      CHROMOSOME_I    2       1       27M1D73M        *       0       0       CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA ????????????????????????????????????????????????????????????????????????????????????????????????????
II.14978392     16      CHROMOSOME_II   2       1       27M1D73M        *       0       0     CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA    ????????????????????????????????????????????????????????????????????????????????????????????????????

It complains about a hash mismatch. Version 2.1 looks to work OK.

Bam2Cram is slow

Converting a 80G BAM file to CRAM takes more than 10 hours. I wonder if this is something which can be improved by e.g. parallel processing?

3.0 noclassdef errors

Hi.
I tried cramtools for the first time this morning. Running 3.0 I go the following error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at net.sf.cram.CramTools.invoke(CramTools.java:91)
at net.sf.cram.CramTools.main(CramTools.java:121)
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/CountingOutputStream
at htsjdk.samtools.cram.build.CramIO.writeContainerForSamFileHeader(CramIO.java:247)
at htsjdk.samtools.cram.build.CramIO.writeCramHeader(CramIO.java:161)
at net.sf.cram.Bam2Cram.main(Bam2Cram.java:405)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.utils.CountingOutputStream
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 9 more

Command:
java -jar cramtools-3.0.jar cram -I A52294.sorted.bam -Q -O A52294.sorted.bam.cram -n -R /genome/GRCh37-lite.fa -l INFO

Using 2.1 I've been able to cram a bam without error

CRAM with data out of bounds causes a Java stack trace.

Given a 10 base read aligned to base 3 of a 10 base reference (so 3 to 12 inclusive), io_lib/htslib generate the CRAM file by matching the first 8 bases and then adding explicit features for the bases overhanging the end of the reference. (Unfortunately this sort of brokenness in alignments can occur in the wild from BWA.)

Eg from cram_dump:

Rec 3/3 at 1,7
BF = 0 => SAM 0x0 (ret 0, out_sz 1)
CF = 3 (ret 0, out_sz 1)
RL = 10 (ret 0, out_sz 1)
AP = 1 (ret 0, out_sz 1)
RG = -1 (ret 0, out_sz 1)
RN = s2 (ret 0, out_sz 2)
Detached
MF = 0 (ret 0, out_sz 1)
NS = -1 (ret 0, out_sz 1)
NP = 0 (ret 0, out_sz 1)
TS = 0 (ret 0, out_sz 1)
TL = 0 (ret 0, out_sz 1)
FN = 2 (ret 0, out_sz 1)
  0: FC = B (ret 0, out_sz 1)
  0: FP = 9+0 = 9 (ret 0, out_sz 1)
  0: BA/QS(B) = C/9 (ret 0)
  1: FC = B (ret 0, out_sz 1)
  1: FP = 1+9 = 10 (ret 0, out_sz 1)
  1: BA/QS(B) = G/9 (ret 0)

This is from the c1#bounds.sam file in the htslib/test directory.

On the cram file generated by htslib, Java cramtools-2.1 and 3.0 reports an error:

jkb@seq3a[test/_2.1] java -Xmx4000m -jar ~/work/cram/cramtools-3.0.jar bam -R ../c1.fa -I c1#bounds.sam.cram
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 10
        at net.sf.cram.build.CramNormalizer.restoreReadBases(CramNormalizer.java:278)
        at net.sf.cram.build.CramNormalizer.normalize(CramNormalizer.java:130)
        at net.sf.cram.Cram2Bam.main(Cram2Bam.java:285)
        ... 6 more

Supplementary reads need to be supported

See htslib/test/ce#supp.sam

jkb@seq3a[htslib/test] cat ce#supp.sam
@SQ     SN:CHROMOSOME_I LN:1009800
@CO     Test supplementary reads, for CRAM
supp    99      CHROMOSOME_I    100     1       50M50S  *       0       0       TAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTACCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCC    *
supp    2147    CHROMOSOME_I    200     1       50H50M  *       0       0       CCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCC      *
supp    2195    CHROMOSOME_I    500     1       50M50H  *       0       0       TTTTAGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCC      *
supp    147     CHROMOSOME_I    600     1       50S50M  *       0       0       TTTTAGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCAGCCTAATACTAAGCCTAAGCCTAAGACTAAGCCTAATACTAAGCCTAAG    *

jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar cram -R ce.fa -I ce#supp.sam -O _tmp.cram -n -Q --capture-all-tags
jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar bam -R ce.fa -I _tmp.cram
supp    67      CHROMOSOME_I    100     1       50M50S  =       200     150     TAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTACCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCC    *
supp    67      CHROMOSOME_I    200     1       50H50M  =       100     -150    CCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCC      *
supp    179     CHROMOSOME_I    500     1       50M50H  =       600     150     TTTTAGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCC      *
supp    179     CHROMOSOME_I    600     1       50S50M  =       500     -150    TTTTAGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCAGCCTAATACTAAGCCTAAGCCTAAGACTAAGCCTAATACTAAGCCTAAG    *

Regression: some v2.1 files cannot be decoded by 3.0

ce#5b.sam.cram and xx#unsorted.sam.cram produced by htslib test_view with file format version 2.1 cannot be decoded by cramtools-3.0.jar. The cramtools-2.1.jar can decode these.

This is probably the most important problem to fix with 3.0 as most of the other bugs I've reported have now either been fixed or affected 2.1 too (so it's not a regression, just a corner case never covered).

The stack traces are similar:

ce#5b.sam.cram
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at net.sf.cram.CramTools.invoke(CramTools.java:93)
at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at net.sf.cram.build.CramNormalizer.restoreReadBases(CramNormalizer.java:255)
at net.sf.cram.build.CramNormalizer.normalize(CramNormalizer.java:133)
at net.sf.cram.Cram2Bam.main(Cram2Bam.java:287)
... 6 more

xx#unsorted.sam.cram
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at net.sf.cram.CramTools.invoke(CramTools.java:93)
at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at net.sf.cram.build.CramNormalizer.restoreReadBases(CramNormalizer.java:240)
at net.sf.cram.build.CramNormalizer.normalize(CramNormalizer.java:133)
at net.sf.cram.Cram2Bam.main(Cram2Bam.java:287)
... 6 more

I'll email you the two files and their references.

MD5sum checking

I notice that if I encode a CRAM file with a reference, then change that reference (eg a blatant "tr A C") then cramtools.jar will happily decode it again without complaining. Could it please check the md5sums in the slice headers?

cramtools package name is illigetimate and confusing

Cramtools uses the package names net.sf. *. The domain sf.net is owned by slashdotmedia.com and is aliased to sourceforge.net. The cramtools project should not be using a domain name it does not control. The java package name rules state that package names should be constructed from a domain name associated with the group creating the software:

http://docs.oracle.com/javase/specs/jls/se5.0/html/packages.html#7.7

It's also very confusing and adds no value to understanding the code using these packages.

I would suggest registering a domain name for this project. While an EBI name would at least make it legitimate. An independent .org for the project (maybe include hstlib) would make it more welcoming to non-EBI developers.

Stub net.sf.* packages that import the new package names would support backwards compatibility.

Problem with fastq command?

I have been testing out the cram toolkit but came across a problem when I tried to use the fastq function to extract fastqs from my cram file.

This is my command line...

java -jar ~/programs/cramtools-2.0.jar fastq -R refs/gasMGAS315.fa -I bwa-mem/gasH113360108.lossless.cram -F gas.lossless

I made the cram file using...

java -jar ~/programs/cramtools-2.0.jar cram -Q -I bwa-mem/gasH113360108.sorted.bam -R refs/gasMGAS315.fasta -O bwa-mem/gasH113360108.lossless.cram

This is the error thrown in DEBUG mode...

DEBUG 2014-02-08 14:57:57 SubstitutionMatrix Subs matrix: [27, 39, 99, -121, 27]: 0001101100100111011000111000011100011011
DEBUG 2014-02-08 14:57:57 SubstitutionMatrix Subs matrix decoded: A:CGTN C:ATGN G:TACN T:CGAN N:ACGT
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: BF_BitFlags, HUFFMAN, [9, 65, 67, 81, 83, -128, -127, -128, -125, -128, -123, -128, -111, -128, -109, 9, 7, 2, 5, 3].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: AP_AlignmentPositionOffset, HUFFMAN, [13, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 13, 1, 2, 3, 4, 5].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: FP_FeaturePosition, BETA, [0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: FC_FeatureCode, HUFFMAN, [4, 68, 83, 88, 105, 4, 3, 2, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: QS_QualityScore, EXTERNAL, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: DL_DeletionLength, HUFFMAN, [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: BA_Base, EXTERNAL, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: TN_TagNameAndType, HUFFMAN, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: NF_RecordsToNextFragment, GAMMA, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: RL_ReadLength, HUFFMAN, [51, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: RG_ReadGroup, HUFFMAN, [1, -1, -1, -1, -1, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: MQ_MappingQualityScore, HUFFMAN, [19, 0, 8, 13, 20, 37, 39, 40, 42, 43, 45, 47, 49, 51, 53, 54, 55, 57, 58, 60].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: RN_ReadName, BYTE_ARRAY_LEN, [3, 10, 4, 40, 41, 42, 43, 4, 3, 2, 1, 3, 1, 1, 2, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: NP_NextFragmentAlignmentStart, EXTERNAL, [3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: TS_InsetSize, EXTERNAL, [3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: FN_NumberOfReadFeatures, HUFFMAN, [7, 0, 1, 2, 3, 4, 12, 21, 7, 1, 2, 3, 4, 5, 6, 6, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: BS_BaseSubstitutionCode, HUFFMAN, [4, 0, 1, 2, 3, 4, 1, 2, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: IN_Insertion, BYTE_ARRAY_STOP, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: TC_TagCount, HUFFMAN, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: MF_MateBitFlags, HUFFMAN, [3, 0, 1, 2, 3, 2, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: NS_NextFragmentReferenceSequenceID, HUFFMAN, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: CF_CompressionBitFlags, HUFFMAN, [3, 1, 3, 5, 3, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: TL_TagIdList, HUFFMAN, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: RI_RefId, HUFFMAN, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: RS_RefSkip, HUFFMAN, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: SC_SoftClip, BYTE_ARRAY_STOP, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: HC_HardClip, HUFFMAN, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CompressionHeader FOUND ENCODING: PD_padding, HUFFMAN, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2014-02-08 14:57:57 CramIO READ CONTAINER: seqid=0, astart=1, aspan=7926, records=10000, slices=1, blocks=8.
Failed at record 0.
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at net.sf.cram.CramTools.invoke(CramTools.java:93)
at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at net.sf.cram.encoding.reader.AbstractFastqReader.read(AbstractFastqReader.java:153)
at net.sf.cram.Cram2Fastq$Dumper.doRun(Cram2Fastq.java:235)
at net.sf.cram.Cram2Fastq$CollatingDumper.doRun(Cram2Fastq.java:329)
at net.sf.cram.Cram2Fastq$Dumper.run(Cram2Fastq.java:253)
at net.sf.cram.Cram2Fastq.main(Cram2Fastq.java:121)
... 6 more
Caused by: java.lang.NullPointerException
at net.sf.cram.encoding.reader.FastqRead.(FastqRead.java:22)
at net.sf.cram.encoding.reader.MultiFastqOutputter.writeRead(MultiFastqOutputter.java:124)
at net.sf.cram.encoding.reader.AbstractFastqReader.read(AbstractFastqReader.java:146)
... 10 more

Do you have any suggestions? I'm keen to use the format but extracting the fastqs is an important part of that.

Thanks!

CIGAR "N" produces incorrect sequence when decoding.

A cram file using the CIGAR "N" operator should perform a reference skip. It appears to just be ignored. This means subsequent bases are decoded incorrectly.

See c1#clip.sam in htslib tests.

jkb@seq3a[htslib/test] grep 3M4N3M c1#clip.sam
s0A 0 c1 1 0 3M4N3M * 0 0 AACGTT ******

After passing through Cramtools (both 2.1 and 3.0) and back again it becomes:

jkb@seq3a[htslib/test] grep 3M4N3M _tmp.sam
s0A 0 c1 1 0 3M4N3M * 0 0 AACCGC ******

The reference is:

c1
AACCGCGGTT

So you can see the 3M4N3M has read "AAC", failed to skip "CGCG" and then decoded "CGC" instead of "GTT".

CRAM 3.0 position change for unmapped reads

I seem to be having a position decrement by one in some cases. Eg:

jkb@seq3a[new/encoding...] scramble example.bam | grep H06JUADXX130110:1:2101:11706:99386
H06JUADXX130110:1:2101:11706:99386      69      1       64843   0       *       =       64843   0GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGAAGAACATGTCGGATCACCTCCTCGCACGTCTAGTCTGGCCGAACCCCGACCTGTGCAAAGCACAACACCTGACACACACGACAGTAGCTGAAGGAACCATGTATCGACTGCATGCGACGAGACCAACCGATGGCCGTGACCTCATGTGGAGACCTGCGATGAATGATGACTAGT   ?@<=9@A@@@??@=@=8?>@?@?>@?>@>>??<@@>?;88*(8@80??;?<65=@??7?=AA?A@@@B######################################################################################################################################################################################    BD:Z:IIKLMMKKLLKONHHGIJLHLLHIJIHIMKLKGGIKKGIJJFFKIJJFILLHJLHHHHLLLHILHBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJJJKKKKKKLLLMMNNIIII     BI:Z:KKNNNMMLMLMQNIJIJLNKNNJLMJJKNNNMIJKMMIJLMHIMKLMILNNILNKKIKNNOKLNJFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKJKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLMMMMMMMMMMMNNNNNNNNMLKIJKKKK       AS:i:0  XS:i:0  RG:Z:H06JU.1
H06JUADXX130110:1:2101:11706:99386      137     1       64843   0       233M17S =       64843   0CTTAAAAGCACATTTAATGTGCCTGTGCTTAACTTAAGGTGCTTAGGACAAAGAAGGCGATTGACATCTTTCAGGTAAAACCTGGTAAGTTTGGTGGTCAAGGAACACAACTGAGACATCACTTGGATGTATTTCTATGACTATTTTAAGAAACATAAATTGTGGTGACTCACTCAGCTCACTTTTAACTACTGCATGGTAATTAAAGATGCAGATTAACATAAGTTACAAGACGCAAGGCATCGTATTG   =?=>?=?@=?;A@??@@???@==??-?>6?@?>@@?@??<????@?>@>@?@AAA@;?7A:<?B>@??B?@@@@@B@@@@>@CA@BBBABBBA@<@>AABBAACB@B@BB@CABAB@BAAA?CBA@BAA@AABAACBB@B@C>@CACBBBBCBAC=@>A@C<:<59>B?BAA@DBC66;@BA@A9?@AAA@;A=@BA#####################################################    BD:Z:IIIIMFELPNHHJIBIJHIGGLLKLGGLLHHIIJHHIJLKGLLHHKLHKGIBJJHJLMKKKIJMLHJLIIBILNMLKJCCJLLMJLKJKLJBJJLHJLMLJKMIIJHHHJJKMMKKLHJLLHKIJJIKJHKGIBIIKGJMLKKGIBBIJKKICJHJGKDJJKIIKMINMLKMILKMOPNKMJMKDDKLLMMMMOOPMLLLLLLLLLLLLMMMMMLLLLLLKKKKJJJJJIIIIIIIIIIIIIIIIIIIII     BI:Z:KKLMNHHNRPJKLKEMLLMKKOONOKKOPLMLMNLMLMNOKOPLMMNLMKLFMLKMNOLMNKMOMKLNLLEJNONONLFFMNNOLONLMOLEMLOKLOPNMNOMLNLKLMNOPPNMNLMOOKOMNMMONLOKLFKMOKNPNOOKLFFNMNMLGNLMJMGMLNLLMPLPOOLPLPMPQSRMPLPNGGPOPQQOQRRRONNNOOOOOOOOOOOOONNNNNNMMMMLLLLKKKKKKKKKKKKKKKKKKKKKKK       BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       AS:i:213        XS:i:213        RG:Z:H06JU.1

jkb@seq3a[new/encoding...] java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa -I v30.cram |grep H06JUADXX130110:1:2101:11706:99386
H06JUADXX130110:1:2101:11706:99386      69      1       64842   0       *       =       64843   0GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGAAGAACATGTCGGATCACCTCCTCGCACGTCTAGTCTGGCCGAACCCCGACCTGTGCAAAGCACAACACCTGACACACACGACAGTAGCTGAAGGAACCATGTATCGACTGCATGCGACGAGACCAACCGATGGCCGTGACCTCATGTGGAGACCTGCGATGAATGATGACTAGT   ?@<=9@A@@@??@=@=8?>@?@?>@?>@>>??<@@>?;88*(8@80??;?<65=@??7?=AA?A@@@B######################################################################################################################################################################################    BD:Z:IIKLMMKKLLKONHHGIJLHLLHIJIHIMKLKGGIKKGIJJFFKIJJFILLHJLHHHHLLLHILHBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJJJKKKKKKLLLMMNNIIII     RG:Z:H06JU.1    BI:Z:KKNNNMMLMLMQNIJIJLNKNNJLMJJKNNNMIJKMMIJLMHIMKLMILNNILNKKIKNNOKLNJFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKJKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLMMMMMMMMMMMNNNNNNNNMLKIJKKKK       AS:i:0  XS:i:0
H06JUADXX130110:1:2101:11706:99386      137     1       64843   0       233M17S =       64843   0CTTAAAAGCACATTTAATGTGCCTGTGCTTAACTTAAGGTGCTTAGGACAAAGAAGGCGATTGACATCTTTCAGGTAAAACCTGGTAAGTTTGGTGGTCAAGGAACACAACTGAGACATCACTTGGATGTATTTCTATGACTATTTTAAGAAACATAAATTGTGGTGACTCACTCAGCTCACTTTTAACTACTGCATGGTAATTAAAGATGCAGATTAACATAAGTTACAAGACGCAAGGCATCGTATTG   =?=>?=?@=?;A@??@@???@==??-?>6?@?>@@?@??<????@?>@>@?@AAA@;?7A:<?B>@??B?@@@@@B@@@@>@CA@BBBABBBA@<@>AABBAACB@B@BB@CABAB@BAAA?CBA@BAA@AABAACBB@B@C>@CACBBBBCBAC=@>A@C<:<59>B?BAA@DBC66;@BA@A9?@AAA@;A=@BA#####################################################    BD:Z:IIIIMFELPNHHJIBIJHIGGLLKLGGLLHHIIJHHIJLKGLLHHKLHKGIBJJHJLMKKKIJMLHJLIIBILNMLKJCCJLLMJLKJKLJBJJLHJLMLJKMIIJHHHJJKMMKKLHJLLHKIJJIKJHKGIBIIKGJMLKKGIBBIJKKICJHJGKDJJKIIKMINMLKMILKMOPNKMJMKDDKLLMMMMOOPMLLLLLLLLLLLLMMMMMLLLLLLKKKKJJJJJIIIIIIIIIIIIIIIIIIIII     RG:Z:H06JU.1    BI:Z:KKLMNHHNRPJKLKEMLLMKKOONOKKOPLMLMNLMLMNOKOPLMMNLMKLFMLKMNOLMNKMOMKLNLLEJNONONLFFMNNOLONLMOLEMLOKLOPNMNOMLNLKLMNOPPNMNLMOOKOMNMMONLOKLFKMOKNPNOOKLFFNMNMLGNLMJMGMLNLLMPLPOOLPLPMPQSRMPLPNGGPOPQQOQRRRONNNOOOOOOOOOOOOONNNNNNMMMMLLLLKKKKKKKKKKKKKKKKKKKKKKK       BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       AS:i:213        XS:i:213

jkb@seq3a[new/encoding...] scramble v30.cram | grep H06JUADXX130110:1:2101:11706:99386
H06JUADXX130110:1:2101:11706:99386      69      1       64843   0       *       =       64843   0GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGAAGAACATGTCGGATCACCTCCTCGCACGTCTAGTCTGGCCGAACCCCGACCTGTGCAAAGCACAACACCTGACACACACGACAGTAGCTGAAGGAACCATGTATCGACTGCATGCGACGAGACCAACCGATGGCCGTGACCTCATGTGGAGACCTGCGATGAATGATGACTAGT   ?@<=9@A@@@??@=@=8?>@?@?>@?>@>>??<@@>?;88*(8@80??;?<65=@??7?=AA?A@@@B######################################################################################################################################################################################    BD:Z:IIKLMMKKLLKONHHGIJLHLLHIJIHIMKLKGGIKKGIJJFFKIJJFILLHJLHHHHLLLHILHBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJJJKKKKKKLLLMMNNIIII     BI:Z:KKNNNMMLMLMQNIJIJLNKNNJLMJJKNNNMIJKMMIJLMHIMKLMILNNILNKKIKNNOKLNJFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKJKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLMMMMMMMMMMMNNNNNNNNMLKIJKKKK       AS:i:0  XS:i:0  RG:Z:H06JU.1
H06JUADXX130110:1:2101:11706:99386      137     1       64843   0       233M17S =       64843   0CTTAAAAGCACATTTAATGTGCCTGTGCTTAACTTAAGGTGCTTAGGACAAAGAAGGCGATTGACATCTTTCAGGTAAAACCTGGTAAGTTTGGTGGTCAAGGAACACAACTGAGACATCACTTGGATGTATTTCTATGACTATTTTAAGAAACATAAATTGTGGTGACTCACTCAGCTCACTTTTAACTACTGCATGGTAATTAAAGATGCAGATTAACATAAGTTACAAGACGCAAGGCATCGTATTG   =?=>?=?@=?;A@??@@???@==??-?>6?@?>@@?@??<????@?>@>@?@AAA@;?7A:<?B>@??B?@@@@@B@@@@>@CA@BBBABBBA@<@>AABBAACB@B@BB@CABAB@BAAA?CBA@BAA@AABAACBB@B@C>@CACBBBBCBAC=@>A@C<:<59>B?BAA@DBC66;@BA@A9?@AAA@;A=@BA#####################################################    BD:Z:IIIIMFELPNHHJIBIJHIGGLLKLGGLLHHIIJHHIJLKGLLHHKLHKGIBJJHJLMKKKIJMLHJLIIBILNMLKJCCJLLMJLKJKLJBJJLHJLMLJKMIIJHHHJJKMMKKLHJLLHKIJJIKJHKGIBIIKGJMLKKGIBBIJKKICJHJGKDJJKIIKMINMLKMILKMOPNKMJMKDDKLLMMMMOOPMLLLLLLLLLLLLMMMMMLLLLLLKKKKJJJJJIIIIIIIIIIIIIIIIIIIII     BI:Z:KKLMNHHNRPJKLKEMLLMKKOONOKKOPLMLMNLMLMNOKOPLMMNLMKLFMLKMNOLMNKMOMKLNLLEJNONONLFFMNNOLONLMOLEMLOKLOPNMNOMLNLKLMNOPPNMNLMOOKOMNMMONLOKLFKMOKNPNOOKLFFNMNMLGNLMJMGMLNLLMPLPOOLPLPMPQSRMPLPNGGPOPQQOQRRRONNNOOOOOOOOOOOOONNNNNNMMMMLLLLKKKKKKKKKKKKKKKKKKKKKKK       BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       AS:i:213        XS:i:213        RG:Z:H06JU.1

This is the v30.cram from ftp://ngs.sanger.ac.uk/scratch/project/jkb/test_crams

I think it is valid. Note that cramtools-2.1.jar on the v21.cram works as expected.

Lossless compression ?

In order to not lose any data, I use the following commands compression:
/************************** .BAM ==> .CRAM *******/
cramtools cram --capture-all-tags
--lossless-quality-score --preserve-read-names
...
/**
* .CRAM ==> .cram.BAM ***********************/
cramtools bam --calculate-md-tag --calculate-nm-tag
...
But, .BAm != .cram.BAM,
compression: provide capture-all-tags, uncompression:provide only two tags(nm,md).
Lossless compression, how should I do?

Count records in *.cram fails with FileNotFoundException

Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: http://www.ebi.ac.uk/ena/cram/md5/2648ae1bacce4ec4b6cf337dcae37816

java -jar cramtools-3.0.jar bam -I ./GSM1973810.cram -c
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at net.sf.cram.CramTools.invoke(CramTools.java:91)
        at net.sf.cram.CramTools.main(CramTools.java:121)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: http://www.ebi.ac.uk/ena/cram/md5/2648ae1bacce4ec4b6cf337dcae37816
        at net.sf.cram.ref.ReferenceSource.findBases(ReferenceSource.java:184)
        at net.sf.cram.ref.ReferenceSource.getReferenceBases(ReferenceSource.java:128)
        at net.sf.cram.FixBAMFileHeader.fixSequence(FixBAMFileHeader.java:55)
        at net.sf.cram.FixBAMFileHeader.fixSequences(FixBAMFileHeader.java:48)
        at net.sf.cram.Cram2Bam.main(Cram2Bam.java:120)
        ... 6 more
Caused by: java.io.FileNotFoundException: http://www.ebi.ac.uk/ena/cram/md5/2648ae1bacce4ec4b6cf337dcae37816
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1834)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
        at java.net.URL.openStream(URL.java:1038)
        at net.sf.cram.ref.ReferenceSource.loadFromPath(ReferenceSource.java:250)
        at net.sf.cram.ref.ReferenceSource.findBasesByMD5(ReferenceSource.java:287)
        at net.sf.cram.ref.ReferenceSource.findBases(ReferenceSource.java:180)
        ... 10 more

P.S.: Works ok, if reference (hg38) is specified.

print out variations

Someone requested such feature, although possibly it could be achieved with existing tools. Needs checking.

MD5 on reference sequences that are too short

A single read file consisting of an entry 9 bases long causes an error.

 jkb@seq3a[htslib/test] cat a.sam
@SQ     SN:xx   LN:20
a1      0       xx      1       50      9M      *       0       0       AAAAAAAAA       *
jkb@seq3a[htslib/test] java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar cram -R xx.fa -I a.sam -O _.cram
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at java.util.Arrays.copyOfRange(Arrays.java:2551)
        at net.sf.cram.structure.Slice.setRefMD5(Slice.java:272)
        at net.sf.cram.Bam2Cram.main(Bam2Cram.java:372)
        ... 6 more

The cause appears to be the use of "int shoulder = 10" on line 268 of src/main/java/net/sf/cram/structure/Slice.java.

What is the purpose of this shoulder? Is it to create a slightly larger pos/span range than is used by the reads within the slice?

Runtime exception while converting the bam to fastq

I converted the bam file to cram by using the following command.

CRT=/home/bin/cramtools/
java -jar $CRT/cramtools-3.0.jar cram --input-bam-file sample_sorted.bam --reference-fasta-file HG19.fa --output-cram-file sample_sorted.cram

However on trying to convert it to fastq results in the following error.
$java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
$cramtools fastq -I sample_sorted.cram | head -20

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at net.sf.cram.CramTools.invoke(CramTools.java:91)
at net.sf.cram.CramTools.main(CramTools.java:121)
Caused by: java.lang.RuntimeException: java.net.UnknownHostException: www.ebi.ac.uk
at net.sf.cram.ref.ReferenceSource.findBases(ReferenceSource.java:184)
at net.sf.cram.ref.ReferenceSource.getReferenceBases(ReferenceSource.java:128)
at net.sf.cram.FixBAMFileHeader.fixSequence(FixBAMFileHeader.java:55)
at net.sf.cram.FixBAMFileHeader.fixSequences(FixBAMFileHeader.java:48)
at net.sf.cram.Cram2Fastq.main(Cram2Fastq.java:95)
... 6 more
Caused by: java.net.UnknownHostException: www.ebi.ac.uk
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at sun.net.NetworkClient.doConnect(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.(Unknown Source)
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.URL.openStream(Unknown Source)
at net.sf.cram.ref.ReferenceSource.loadFromPath(ReferenceSource.java:250)
at net.sf.cram.ref.ReferenceSource.findBasesByMD5(ReferenceSource.java:287)
at net.sf.cram.ref.ReferenceSource.findBases(ReferenceSource.java:180)

Error decoding c1#pad1.tmp.cram

The CRAM file produced by htslib's tests from c1#pad1.sam fails when decoding with cramtools (both 2.1 or 3.0):

Failed at record 8. Here is the previously read record: [s5; flags=0; aloffset=0; mateoffset=-1; mappingQuality=0; D@0+3; D@0+3; Insertion[position=0; sequence=TA] ; P@0+1; D@6+3; qscores: ??????????]
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.RuntimeException: java.io.EOFException: End of stream.
        at net.sf.cram.encoding.reader.CramRecordReader.read(CramRecordReader.java:183)
        at net.sf.cram.build.ContainerParser.getRecords(ContainerParser.java:118)
        at net.sf.cram.build.ContainerParser.getRecords(ContainerParser.java:60)
        at net.sf.cram.Cram2Bam.main(Cram2Bam.java:222)
        ... 6 more
Caused by: java.io.EOFException: End of stream.
        at net.sf.cram.io.DefaultBitInputStream.readBits(DefaultBitInputStream.java:73)
        at net.sf.cram.encoding.huffint.Helper.read(Helper.java:164)
        at net.sf.cram.encoding.huffint.CanonicalHuffmanIntegerCodec2.read(CanonicalHuffmanIntegerCodec2.java:46)
        at net.sf.cram.encoding.huffint.CanonicalHuffmanIntegerCodec2.read(CanonicalHuffmanIntegerCodec2.java:33)
        at net.sf.cram.encoding.reader.DataReaderFactory$DefaultDataReader.readData(DataReaderFactory.java:133)
        at net.sf.cram.encoding.reader.CramRecordReader.read(CramRecordReader.java:134)
        ... 9 more

It also fails decoding a cram file it produced itself from the same SAM input. See the htslib/test directory for the source.

Content hash mismatch for tag BD

The restoreReadBases function in CramNormalizer restores the read bases inconsistently.

Let us assume we have a reference sequence with an IUPAC ambiguity base (as it occurs on Chr3 of GRCh37 with an R) and a bam file with a read being aligned with an N to the IUPAC base (e.g. R) of the reference sequence. Let's further assume, this mismatch is the only variation within this read.

Now, lets perform a bam to cram conversion using cramtools.
First note that cramtools will calculate a CRC32 value for the base digests (BD) based on the original unchanged read sequences. We will come back to this point later.
Cramtools Bam2Cram will convert the reference's R to an N via the method Sam2CramRecordFactory.addSubstitutionsAndMaskedBases and therein via the net.sf.cram.common.Utils.normalizeBase method (with its default case in switch).
Therefore, the mismatch between the read's N and the reference's original R (now an N) becomes a match and no variation instance will be created for this position.
Hence, the base information for this position is discarded in the cram file and has to be restored from the reference - which will cause the respective base to be an R instead of the original N!
When we now perform a samtools view on the cram file, we get an R for this position, which is not what we had in the original bam file.
This is due to the application of the normalizeBase method.

No let's convert this cram file back to bam using cramtools. As we assumed to have only this single varation event in the aligned read and this variation has been removed as described, the condition
(record.readFeatures == null || record.readFeatures.isEmpty())
in the method CramNormalizer.restoreReadBases(CramRecord record, byte[] ref, int refOffset_zeroBased,SubstitutionMatrix substitutionMatrix)
holds. Therefore, the full sequence is restored from the reference sequence an the R (instead of original N) will make it into the cram.bam file (normalizeBase is not called here!)
Now we come back to the CRC32 value of the base digest. As the read sequence differs from the original read sequence (N has become an R), the CRC32 value has changed. Hence, cramtools will throw a Content hash mismatch error for the BD tag when we perform the cram2bam conversion.
All this happens, when there is only the one variation which is removed via the normalizeBase method and hence a perfect match results in the cram and therefore also in the cram.bam.

However, when there are other variations besides such a IUPAC normalization issue, the read sequence restoration is a little bit different:
The positions without variation events are restored directly from the reference sequence and the positions with variation events are treated separatedly (e.g. substitutions with the substitutionMatrix). When the full read sequence has been restored, and this is the important difference to the case above, a normalizeBase is called for each base of the read sequence. This will remove the IUPAC symbol again and replace it by an N. In this case, the read sequence from the cram.bam file is equal to that one of the original bam file. The CRC32 values are identical and no Content hash mismatch error is thrown.
However, the cram file has an incorrect base whereas the cram.bam file does not.

I would suggest to adapt the normalizeBase method and also apply it to the restored read sequence in the restoreReadBases section with no variation events.

Best
Gregor

best approach for lossless compression

Hi,
I have been testing cramtools (2.1 and 3.0) with bams created by a number of versions of BWA. I have yet to be able to completely reproduce my original bam after converting to CRAM and back to BAM.

It seems that when I have either NM or MD tags in the original bam, I can't seem to get them stored in the CRAM and instead I need to recreate them during decompression. If I don't ask to have those two fields recalculated during decompression, they don't show up for any reads, but when I use the "--calculate-md-tag" and " --calculate-nm-tag" switches I get the flags for all reads.

What I want is to get the NM and MD flags for only the reads that had those flags in the original bam. is there a way to do this?

Java Error

Using cramtools 3.0

The same error was posted in Issue 24 #24

but the issue still looks open, so posting as separate


 **cramtools cram -I TEST.bam -R Human_Decoy_REF/hs37d5.fa  -O TEST.bam.cram**

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:91)
        at net.sf.cram.CramTools.main(CramTools.java:121)
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/CountingOutputStream
        at htsjdk.samtools.cram.build.CramIO.writeContainerForSamFileHeader(CramIO.java:247)
        at htsjdk.samtools.cram.build.CramIO.writeCramHeader(CramIO.java:161)
        at net.sf.cram.Bam2Cram.main(Bam2Cram.java:405)
        ... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.utils.CountingOutputStream
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 9 more

>= 129 auxiliary tags

The htslib/test/xx#large_aux.sam has some sequences containing lots of auxiliary tags. If I cut it down to just the first read with 129 tags it fails, but at 128 it passes.

An example SAM file with 129 ??:i:1 entries is:

@SQ     SN:xx   LN:20
a1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      *     aa:i:1   ab:i:1  ac:i:1  ad:i:1  ae:i:1  af:i:1  ag:i:1  ah:i:1  ai:i:1  aj:i:1  ak:i:1  al:i:1am:i:1   an:i:1  ao:i:1  ap:i:1  aq:i:1  ar:i:1  as:i:1  at:i:1  au:i:1  av:i:1  aw:i:1  ax:i:1ay:i:1   az:i:1  ba:i:1  bb:i:1  bc:i:1  bd:i:1  be:i:1  bf:i:1  bg:i:1  bh:i:1  bi:i:1  bj:i:1bk:i:1   bl:i:1  bm:i:1  bn:i:1  bo:i:1  bp:i:1  bq:i:1  br:i:1  bs:i:1  bt:i:1  bu:i:1  bv:i:1bw:i:1   bx:i:1  by:i:1  bz:i:1  ca:i:1  cb:i:1  cc:i:1  cd:i:1  ce:i:1  cf:i:1  cg:i:1  ch:i:1ci:i:1   cj:i:1  ck:i:1  cl:i:1  cm:i:1  cn:i:1  co:i:1  cp:i:1  cq:i:1  cr:i:1  cs:i:1  ct:i:1cu:i:1   cv:i:1  cw:i:1  cx:i:1  cy:i:1  cz:i:1  da:i:1  db:i:1  dc:i:1  dd:i:1  de:i:1  df:i:1dg:i:1   dh:i:1  di:i:1  dj:i:1  dk:i:1  dl:i:1  dm:i:1  dn:i:1  do:i:1  dp:i:1  dq:i:1  dr:i:1ds:i:1   dt:i:1  du:i:1  dv:i:1  dw:i:1  dx:i:1  dy:i:1  dz:i:1  ea:i:1  eb:i:1  ec:i:1  ed:i:1ee:i:1   ef:i:1  eg:i:1  eh:i:1  ei:i:1  ej:i:1  ek:i:1  el:i:1  em:i:1  en:i:1  eo:i:1  ep:i:1eq:i:1   er:i:1  es:i:1  et:i:1  eu:i:1  ev:i:1  ew:i:1  ex:i:1  ey:i:1
jkb@seq3a[htslib/test] java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar cram -R xx.fa -n -Q --capture-all-tags -I a.sam -O _.cram -l debug
DEBUG   2015-04-09 16:50:40     Bam2Cram        Creating tracks for reference: name=xx, length=20.

INFO    2015-04-09 16:50:40     FixBAMFileHeader        Reference sequence MD5 not found, calculating: xx
DEBUG   2015-04-09 16:50:40     Bam2Cram        Writing 1 records for sequence 0, xx
INFO    2015-04-09 16:50:40     Bam2Cram        create: tracks 0ms, records 8ms, mating 0ms.
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to bases: 0
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to quality scores: 1
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to read names: 2
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to mate info: 3
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for NF_RecordsToNextFragment: GAMMA[1]
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for read feature position: GAMMA[1]
DEBUG   2015-04-09 16:50:40     SubstitutionMatrix      Subs matrix: [27, 27, 27, 27, 27]: 0001101100011011000110110001101100011011
DEBUG   2015-04-09 16:50:40     SubstitutionMatrix      Subs matrix decoded: A:CGTN     C:AGTNG:ACTN   T:ACGN  N:ACGT
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for HC_HardClip: GAMMA[0]
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for PD_padding: GAMMA[0]
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        NS: HUFFMAN:[0x01 0xff 0xff 0xff 0xff 0xff 0x01 0x00]
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Buffer.java:492)
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:135)
        at net.sf.cram.io.ByteBufferUtils.readUnsignedITF8(ByteBufferUtils.java:305)
        at net.sf.cram.encoding.HuffmanByteEncoding.fromByteArray(HuffmanByteEncoding.java:72)
        at net.sf.cram.encoding.writer.DataWriterFactory.createWriter(DataWriterFactory.java:87)
        at net.sf.cram.encoding.writer.DataWriterFactory.buildWriter(DataWriterFactory.java:54)
        at net.sf.cram.build.ContainerFactory.buildSlice(ContainerFactory.java:194)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:89)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:58)
        at net.sf.cram.Bam2Cram.main(Bam2Cram.java:369)
        ... 6 more

This is strictly an encoder issue as cramtools can decode the data generated from htslib.

NPE in Cram2Bam for refid -2

Caused by: java.lang.IllegalArgumentException: Reference index -2 not found in sequence dictionary.
at net.sf.samtools.SAMRecord.setReferenceIndex(SAMRecord.java:367)
at net.sf.cram.Cram2Bam.main(Cram2Bam.java:317)
... 6 more
Caused by: java.lang.NullPointerException
at net.sf.samtools.SAMRecord.setReferenceIndex(SAMRecord.java:365)

Looks like slices with ref id -2 are not handled correctly: Cram2Bam tries to set refid in SAMRecord but it is not a valid refid.

Unindexed FASTA might lead to NPE

Using a FASTA file without index causes NPE in Bam2Cram, because all of the options in ReferenceSource#getReferenceBases fail:

ref = referenceSource.getReferenceBases(samSequenceRecord, true);  // null.
log.debug(String.format("Creating tracks for reference: name=%s, length=%d.\n",
        samSequenceRecord.getSequenceName(), ref.length));  // NPE!

Validated for both the v3.0 release and the master version

Index piped cram file

According to the help for the indexing option of cramtools, I am allowed to create a crai for streamed data:

Version 3.0-b12

Usage: <main class> [options]
  Options:    --bam-style-index  Choose between BAM index (bai) and CRAM index (crai).  (default: false)
    --help, -h         Print help and exit. (default: false)
    --input-file, -I   Path to a BAM or CRAM file to be indexed. Omit if standard input (pipe).
    -l, --log-level    Change log level: DEBUG, INFO, WARNING, ERROR. (default: ERROR)

However, if I leave the -I parameter away, I get either the help screen or when I call cramtools index with log-level debug (-l DEBUG), I will get a NullPointerException for the FileInputStream:

Exception in thread "main" java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at net.sf.cram.CramTools.invoke(CramTools.java:93)
   at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.NullPointerException
   at java.io.FileInputStream.<init>(FileInputStream.java:130)
   at net.sf.samtools.CRAMFileReader.isCRAMFile(CRAMFileReader.java:331)
   at net.sf.cram.index.CramIndexer.main(CramIndexer.java:78)
   ... 6 more

Am I missing something or is the option to create a crai for a streamed cram file just a copy and paste error from e.g. cramtools bam/cram (conversion parts of cramtools)?
If it should be possible to create a crai with piped data, what has to be changed?

Support for multiple reference containers.

If I have a bunch of reads with 1 read per reference, htslib/scramble on the 3rd such occurance in quick succession will switch to multi-ref mode were the container reference sequence id is set to -2. htslib/test/ce#5b.sam is an example of this.

This used to work with cramtools-2.1 but seems to fail with 3.0.

A trivial example with cram_dump output for the 3rd container:

Container pos 11030 size 286
    Ref id:            -2
    Ref pos:           0 + 0
    Rec counter:       2
    No. recs:          1
    No. bases          10
    No. blocks:        5
    No. landmarks:     1
    {164}
...
Rec 1/1 at 0,7
BF = 0 => SAM 0x0 (ret 0, out_sz 1)
CF = 3 (ret 0, out_sz 1)
RI = 2 (ret 0, out_sz 1)
RL = 10 (ret 0, out_sz 1)
AP = 1 (ret 0, out_sz 1)
RG = -1 (ret 0, out_sz 1)
RN = III (ret 0, out_sz 3)
Detached
MF = 0 (ret 0, out_sz 1)
NS = -1 (ret 0, out_sz 1)
NP = 0 (ret 0, out_sz 1)
TS = 0 (ret 0, out_sz 1)
TL = 0 (ret 0, out_sz 1)
FN = 0 (ret 0, out_sz 1)
MQ = 1 (ret 0, out_sz 1)
QS =            (ret 0, out_sz 10)

The corresponding SAM record for this looks like:

III     0       s3      1       1       10M     *       0       0       AAAAAAAAAA      *

Cramtools 3.0 extracts this as:

III     0       s3      1       1       10M     *       0       0       NNNNNNNNNN      ??????????

NB this also shows another issue. The quality should be "*" for this case, although I am unsure if that is due to my encoding or your decoding!

Note that the CRAM specification is perhaps a bit ambiguous. I set ref id to -2 and pos/span to 0 as they have no meaning for ref -2. Unless that is you interpret the pos/span 0 to indicate unmapped reads. I assume that statement in the spec is simply there to indicate with ref id -1 we'd expect those fields to be zero rather than those fields being zero implies unmapped data. It should probably be replaced with text like "the alignment start position or 0 when reference sequence id < 0".

Strange exception behaviour

Hi all,

We have accumulated quite a lot of BAM files over the years and we would like (need) to archive the old(er) ones. We have tested CRAMtools 2.1 and we are very happy with the results. Since we are talking about thousands of BAM files, we cannot go without automation. But...

The problem is as followed:

  • when I run CRAMtools in a job file, it terminates immediatly (throwing an exception).
  • when I copy the command from the job file and execute it on the head node or on the node it previously failed on, I have no issues what so ever.

My job file looks like this:

PBS -N DNAXXXXXX

PBS -j oe

PBS -o /data/results/CRAM_archive/2013/october//DNAXXXXXX.log

PBS -r y

PBS -q bs_secondary

PBS -m ea

PBS -M [email protected]

PBS -l nodes=1:ppn=12

cp /data/results/BGI/DNAXXXXXX/DNAXXXXXX.bam /scratch/
cp /data/results/BGI/DNAXXXXXX/DNAXXXXXX.bam.bai /scratch/
java -jar /share/apps/cramtools-2.1.jar cram -I /data/results/BGI/DNAXXXXXX/DNAXXXXXX.bam -O /scratch/DNAXXXXXX.cram -R /data/references/GrCh37/GrCh37_reference.fa --preserve-read-names --capture-all-tags -L *8
java -cp /share/apps/cramtools-2.1.jar net.sf.cram.ValidateCramFile -I /scratch/DNAXXXXXX.cram -R /data/references/GrCh37/GrCh37_reference.fa
mv /scratch/DNAXXXXXX.cram /data/results/CRAM_archive/2013/october//
rm -f /scratch/DNAXXXXXX.bam
rm -f /scratch/DNAXXXXXX.bam.bai

The exception I get when submitting the job file to torque:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at net.sf.cram.CramTools.invoke(CramTools.java:93)
at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.RuntimeException: Uknown read or base category: 6
at net.sf.cram.lossy.QualityScorePreservation.parseSinglePolicy(QualityScorePreservation.java:155)
at net.sf.cram.lossy.QualityScorePreservation.parsePolicies(QualityScorePreservation.java:79)
at net.sf.cram.lossy.QualityScorePreservation.(QualityScorePreservation.java:39)
at net.sf.cram.Bam2Cram.main(Bam2Cram.java:273)
... 6 more

I have tried to include the full path to java to make sure the same version is used in all cases, but the problem remains. The issue only seem to occur when applied to Illumina (mapped with BWA) BAM files. When I apply the same job file on our SOLiD data (mapped with Lifescope) I have no issues.

Does anyone have an idea what could cause this problem and how I can fix it?

Many thanks,

Rick

sam roundtrip does not respect N in cigar

For instance, with this SAM file:
@hd VN:1.4 SO:coordinate
@sq SN:chr1 LN:249250621
HWI-D00259:191:C5KM0ANXX:5:1312:14834:20114 99 chr1 15003 3 36M757N65M = 15004
859 CCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCC
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
NH:i:2 HI:i:1 AS:i:204 nM:i:0
HWI-D00259:191:C5KM0ANXX:5:1312:14834:20114 147 chr1 15004 3 35M757N66M = 15003
-859 CGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCT
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB
NH:i:2 HI:i:1 AS:i:204 nM:i:0

doing a SAM -> CRAM -> SAM loop leads to the first sequence being changed
to
CCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTGCAACGGGAAAGCCACAGACTGGGGTGAAGAGTTCAGTCACATGCGACCGGTGACTCCCTGTC
which is the sequence starting at chr1:15003 for 101 bases.

Cramtools 3.0 decoding error

A file containing just the two reads mentioned in #21 caused cramtools to have a decoding failure.

Cramtools generated the CRAM (v3.0) just fine, but on decoding produces this error:

java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa -I _.cram
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.lang.NullPointerException
        at net.sf.cram.Cram2Bam.main(Cram2Bam.java:342)
        ... 6 more

Edit: now stored in Gist as a binary blob:

https://gist.github.com/jkbonfield/d4e391a8a5867d74a222/raw/50ea17ca7af6cd305052ea60ad1aab94fb73233a/_.cram

Old uuencoded form:

https://gist.githubusercontent.com/jkbonfield/d4e391a8a5867d74a222/raw/d2f8a70e2d6cec0cfa69399b64db8741f11cc70a/_.cram

add proper support for multi-fragment reads

cramtools treats trios as pair+1, where the third fragment becomes a separate read with the same name. Even though in some cases this restores correctly this should be revised. More details follow:

H06JUADXX130110:1:2101:14964:72615 99 1 13699 0 25
H06JUADXX130110:1:2101:14964:72615 147 1 13765 0 1
H06JUADXX130110:1:2101:14964:72615 387 1 13878 0 3

Ie it's

read1 read2 2ndary2
----> <---- ---->

When I convert this to CRAM and back again with Java then I correctly
get 99, 147 and 387 as the flags. Similarly if I convert it to CRAM
and back again with Scramble I also get 99, 147 and 387.

More interestingly is scramble's conversion of cramtools output works
(and gives 99, 147, 387) but cramtools conversion from scramble cram
output gives 99, 147, 419. 419 is 0x1a3 vs 387 being 0x183, so it
gains an extra +0x20 (seq of the next segment is reversed).

On the face of it it sounds like my output isn't correct, but I am
unsure what the correct solution is. cram_dump shows some of the
differences between the reads.


Cramtools output:

Rec 1/3
BF = 67 => SAM 0x43 (ret 0, out_sz 1)
CF = 5 (ret 0, out_sz 1)
Not detached, and mate is downstream

Rec 2/3
BF = 147 => SAM 0x93 (ret 0, out_sz 1)
CF = 1 (ret 0, out_sz 1)
Not detached, and no mate downstream

Rec 3/3
BF = 387 => SAM 0x183 (ret 0, out_sz 1)
CF = 3 (ret 0, out_sz 1)
Detached, and no mate downstream


Scramble output:

Rec 1/3 at 0,7
BF = 99 => SAM 0x63 (ret 0, out_sz 1)
CF = 5 (ret 0, out_sz 1)
Not detached, and mate is downstream

Rec 2/3 at 1,5
BF = 147 => SAM 0x93 (ret 0, out_sz 1)
CF = 5 (ret 0, out_sz 1)
Not detached, and mate is downstream

Rec 3/3 at 4,1
BF = 387 => SAM 0x183 (ret 0, out_sz 1)
CF = 1 (ret 0, out_sz 1)
Not detached, and no mate downstream


I just have 3 reads and they're all in the same slice.

It seems like Cramtools has read 1+2 as a pair, and then 3 as the
start of a new pair. While Scramble has read 1+2+3 as a trio.
Regardless of anything else I think Cramtools Rec 3 with CF=3 is wrong
as it is stating the read has the next segment out of the horizon
which is wrong - we only have 1 slice.

Now it's sounding more like Scramble is the correct behaviour, but it
all hinges on what the CRAM spec means by "CF 0x4; has mate
downstream; tells is the next segment should be expected further in
the stream" means.

Does "has mate downstream" restrict use of 0x4 solely to proper pairs
and not to trios? (Note they all have BF 0x2 set so the aligner claims
a proper pair.)

James

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.