Git Product home page Git Product logo

cljam's Introduction

cljam

A DNA Sequence Alignment/Map (SAM) library for Clojure. [API Reference] [Annotated Source]

Clojars Project

Build Status

codecov

Installation

cljam is available as a Maven artifact from Clojars.

Clojure CLI/deps.edn:

cljam {:mvn/version "0.8.5"}

Leiningen/Boot:

[cljam "0.8.5"]

Breaking changes in 0.8.0

  • cljam.io.tabix is rewritten. #180
  • cljam.io.bam-index.writer/pos->lidx-offset is moved to cljam.io.util.bin/pos->lidx-offset. #180
  • cljam.io.sam.util/reg->bin is moved to cljam.io.util.bin/reg->bin. Also, a coordinate system of its argument is changed from 0-based half-open to 1-based fully-closed. #190

Getting started

To read a SAM/BAM format file,

(require '[cljam.io.sam :as sam])

;; Open a file
(with-open [r (sam/reader "path/to/file.bam")]
  ;; Retrieve header
  (sam/read-header r)
  ;; Retrieve alignments
  (doall (take 5 (sam/read-alignments r))))

To create a sorted file,

(require '[cljam.io.sam :as sam]
         '[cljam.algo.sorter :as sorter])

(with-open [r (sam/reader "path/to/file.bam")
            w (sam/writer "path/to/sorted.bam")]
  ;; Sort by chromosomal coordinates
  (sorter/sort-by-pos r w))

To create a BAM index file,

(require '[cljam.algo.bam-indexer :as bai])

;; Create a new BAM index file
(bai/create-index "path/to/sorted.bam" "path/to/sorted.bam.bai")

To calculate coverage depth for a BAM file,

(require '[cljam.io.sam :as sam]
         '[cljam.algo.depth :as depth])

(with-open [r (sam/reader "path/to/sorted.bam")]
  ;; Pileup "chr1" alignments
  (depth/depth r {:chr "chr1", :start 1, :end 10}))
;;=> (0 0 0 0 0 0 1 1 3 3)

If you are Clojure beginner, read Getting Started for Clojure Beginners.

Command-line tool

cljam provides a command-line tool to use the features easily.

NOTICE The command-line tool functionality will be removed from cljam on the next release. The functionality has been already integrated into Gnife, which is a pure command-line tool.

Executable installation

lein bin creates standalone console executable into target directory.

$ lein bin
Creating standalone executable: /path/to/cljam/target/cljam

Copy the executable cljam somewhere in your $PATH.

Usage

All commands are displayed by cljam -h, and detailed help for each command are displayed by cljam [cmd] -h.

$ cljam view -h

For example, to display contents of a SAM file including the header,

$ cljam view --header path/to/file.sam

See command-line tool manual for more information.

Development

Test

To run tests,

  • lein test for basic tests,
  • lein test :slow for slow tests with local resources,
  • lein test :remote for tests with remote resources.

To get coverage

$ lein cloverage

And open target/coverage/index.html.

Generating document

cljam uses Codox for API reference and Marginalia for annotated source code.

$ lein docs

generates these documents in target/docs and target/literate directories.

Citing cljam

T. Takeuchi, A. Yamada, T. Aoki, and K. Nishimura. cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing. Source Code for Biology and Medicine, Vol. 11, No. 1, pp. 1-4, 2016.

Contributors

Sorted by first commit.

License

Copyright 2013-2024 Xcoo, Inc.

Licensed under the Apache License, Version 2.0.

cljam's People

Contributors

alumi avatar athos avatar ayamada avatar dependabot[bot] avatar federkasten avatar k-kom avatar matsutomo81 avatar niyarin avatar nuggetoriniku avatar r6eve avatar toshimitsuarai avatar totakke avatar xckitahara avatar yito88 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

cljam's Issues

Read/write invariance breaks with integer tags in BAM/CRAM

When reading alignments from a BAM or CRAM file and writing them to another BAM/CRAM file as they are, the values of integer tags may change.

Repro

$ samtools view int_tag_overflow.bam
r1      4       *       0       0       *       *       0       0       ATGC    ####    XA:i:4294967295
(require '[cljam.io.sasm :as sam])

(with-open [r (sam/reader "int_tag_overflow.bam")]
  (doall (sam/read-alignments r)))
;=>
({:qname "r1",
  :flag 4,
  :rname "*",
  ...
  :seq "ATGC",
  :qual "####",
  :options ({:XA {:type "i", :value 4294967295}})})

(with-open [r (sam/reader "int_tag_overflow.bam")
            w (sam/writer "int_tag_overflow.rewrite.bam")]
  (sam/write-header w (sam/read-header r))
  (sam/write-refs w (sam/read-refs r))
  (sam/write-alignments w (sam/read-alignments r) (sam/read-header r)))

(with-open [r (sam/reader "int_tag_overflow.rewrite.bam")]
  (doall (sam/read-alignments r)))
;=>
({:qname "r1",
  :flag 4,
  :rname "*",
  ...
  :seq "ATGC",
  :qual "####",
  :options ({:XA {:type "i", :value -1}})})  ;; <- this value has changed from the original one

Cause

  • The SAM format defines the only integer tag type i (signed arbitrary-precision integer) while the BAM/CRAM format has the i integer tag type with different semantics (signed 32bit integer), as well as other integer types (c/C/s/S/I)
  • cljam's BAM/CRAM reader interprets any integer tag value as the i tag type
  • cljam's BAM/CRAM writer doesn't check if each integer tag value fits the specified tag type. It writes a tag value as the i tag type even if it can't be represented as a signed 32bit integer.

Random read of 2bit is quite slower than FASTA

(require '[cljam.io.sequence :as cseq])

(with-open [r (cseq/reader "path/to/hg38.2bit")]
  (time (cseq/read-sequence r {:chr "chr7" :start 55174773, :end 55174783})))
;; "Elapsed time: 1678.031253 msecs"

(with-open [r (cseq/reader "path/to/hg38.fa")]
  (time (cseq/read-sequence r {:chr "chr7" :start 55174773, :end 55174783})))
;; "Elapsed time: 15.879277 msecs"

Seemingly useless nil entry occasionally included in VCF parsing results

When parsing a VCF file just having the 8 required fields (ie. chrom/pos/id/ref/alt/qual/filter/info), the resulting maps contain a seemingly useless nil nil entry as follows:

(def r (vcf/reader "/path/to/file.vcf"))
(def vs (vcf/read-variants r))
(first vs)
;=> {nil nil, :alt ["C"], :ref "T", :pos 123456, :filter (:PASS), :id nil, :info {:DP 40, :RD 31, :AD (9), :AF (0.225)}, :qual nil, :chr "chr1"}
;    ^^^^^^^

If the VCF file has some extra fields other than the required ones, it dosen't happen.

Suggestion on effective VCF file reading?

I am a experiencing OOM java heap problem when trying to read a multisample, chromosome partitioned VCF from the 1000 genomes project, specifically ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.

I am doing:

(with-open [r (vcf/reader "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz")]
    (into [] (vcf/read-variants (vcf/clone-vcf-reader r) {:depth :deep})))

Is there a better approach to parsing VCFs witch cljam, which will not run out of memory?
I tried running this on an M2 Pro, allocating 20GB to jvm.

Coding style fixes from clj-kondo's linter with default :off

This issue is a answer to comment(#269 (comment))

  • Lists options that clj-kondo defaults to :off that should be introduced.
  • After listing, fix the corresponding code and add the clj-lint-action settings.

Linters that should be introduced

:single-key-in

  • This option warns about get-in and update-in where the number of elements in keys is 1.
  • Cljam reported 1 warning.
./src/cljam/io/fasta/reader.clj:74:39: warning: update-in with single key

:aliased-namespace-symbol

  • This option warns about code that doesn't use namespace aliases.
  • cljam reported 2 warnings.
./src/cljam/tools/cli.clj:332:32: warning: An alias is defined for clojure.java.io: cio
./src/cljam/tools/cli.clj:333:32: warning: An alias is defined for clojure.java.io: cio

Linters that may be introduced

:reduce-without-init
:redundant-fn-wrapper

Linters that may be introduced (without Kibit)

:minus-one :plus-one :equals-true

CRAM support?

This is a neat clojure library for bioinformatics. Do you plan on adding CRAM file support?

bgzipped VCF file over HTTP causes infinite loop

A bgzipped VCF file over HTTP causes infinite loop.

$ gzip -c test-resources/vcf/test-v4_3.vcf > test-resources/vcf/test-v4_3-gzip.vcf.gz
$ bgzip -c test-resources/vcf/test-v4_3.vcf > test-resources/vcf/test-v4_3-bgzip.vcf.gz
$ python3 -m http.server 8000
(require '[cljam.io.vcf :as vcf])

;; gzipped file correctly finished.
(with-open [rdr (vcf/vcf-reader "http://localhost:8000/test-resources/vcf/test-v4_3-gzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;;=> ({:NA00001 {:GT "0|0", :GQ 48, :DP 1, :HQ (51 51)},
;;     :alt ["A"],
;;     :ref "G",
;;     :FORMAT (:GT :GQ :DP :HQ),
;;     :NA00002 {:GT "1|0", :GQ 48, :DP 8, :HQ (51 51)},
;;     :pos 14370,
;;     :filter (:PASS),
;;     :id "rs6054257",
;;     :info {:NS 3, :DP 14, :AF (0.5), :DB :exists, :H2 :exists},
;;     :qual 29.0,
;;     :NA00003 {:GT "1/1", :GQ 43, :DP 5, :HQ (nil nil)},
;;     :chr "20"}
;;    ...)

;; bgzipped file causes infinite loop.
(with-open [rdr (vcf/vcf-reader "http://localhost:8000/test-resources/vcf/test-v4_3-bgzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;; infinite loop...

cljam can read the bgzipped file as a local file, so I think this issue is related to bgzf4j.

(with-open [rdr (vcf/vcf-reader "test-resources/vcf/test-v4_3-bgzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;;=> ({:NA00001 {:GT "0|0", :GQ 48, :DP 1, :HQ (51 51)},
;;     :alt ["A"],
;;     :ref "G",
;;     :FORMAT (:GT :GQ :DP :HQ),
;;     :NA00002 {:GT "1|0", :GQ 48, :DP 8, :HQ (51 51)},
;;     :pos 14370,
;;     :filter (:PASS),
;;     :id "rs6054257",
;;     :info {:NS 3, :DP 14, :AF (0.5), :DB :exists, :H2 :exists},
;;     :qual 29.0,
;;     :NA00003 {:GT "1/1", :GQ 43, :DP 5, :HQ (nil nil)},
;;     :chr "20"}
;;    ...)

I noticed the issue while reading COSMIC's bgzipped VCF, but this NegativeArraySizeException seems to be another problem. A further inverstigation is needed.

(with-open [rdr (vcf/vcf-reader "http://localhost:8000/CosmicCodingMuts.vcf.gz")]
  (dorun (vcf/read-variants rdr)))
;; Caused by java.lang.NegativeArraySizeException
;; -481634578
;;
;;    BGZFInputStream.java:  365  bgzf4j.BGZFInputStream/inflateBlock
;;    BGZFInputStream.java:  353  bgzf4j.BGZFInputStream/readBlock
;;    BGZFInputStream.java:  102  bgzf4j.BGZFInputStream/available
;;    BGZFInputStream.java:  202  bgzf4j.BGZFInputStream/readLine
;;              reader.clj:  179  cljam.io.vcf.reader/read-data-lines
;;              reader.clj:  175  cljam.io.vcf.reader/read-data-lines
;;              reader.clj:  182  cljam.io.vcf.reader/read-data-lines/fn

Bgziped VCF is illegal.

We can reproduce it with the following code.

(with-open [wtr (vcf/writer "./test.vcf.gz"
                            {:format nil :contig {:id "1", :length nil} :filter [{:id "PASS", :description "All filters passed"}] :info nil}
                            ["CHROM" "POS" "ID" "REF" "ALT" "QUAL" "FILTER" "INFO" "FORMAT"])]
  (vcf/write-variants wtr [{:chr "1" :pos 10 :ref "A" :alt ["G"]}]))

(with-open [rdr (vcf/reader "./test.vcf.gz")]
  (vcf/read-variants rdr))
;Error printing return value (IOException) at java.io.BufferedReader/ensureOpen (BufferedReader.java:122).

The generated bgzip file also fails with the'bcftools index'.
Now, if we do bgzip -d and bgzip, the error doesn't occur.

exec binary does not work on Java 9+

$ lein bin
...
Creating standalone executable: /path/to/cljam/target/cljam

# 1.8
$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
$ target/cljam version
0.5.2-SNAPSHOT

# 9
$ java -version
java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)
$ target/cljam version
Exception in thread "main" java.lang.ExceptionInInitializerError
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:375)
        at clojure.lang.RT.classForName(RT.java:2168)
        at clojure.lang.RT.classForName(RT.java:2177)
        at clojure.lang.RT.loadClassForName(RT.java:2196)
        at clojure.lang.RT.load(RT.java:443)
        at clojure.lang.RT.load(RT.java:419)
        at clojure.core$load$fn__5677.invoke(core.clj:5893)
        at clojure.core$load.invokeStatic(core.clj:5892)

# 10
$ java -version
java 10.0.1 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)
$ target/cljam -h
Exception in thread "main" java.lang.ExceptionInInitializerError
        at java.base/java.lang.Class.forName0(Native Method)
        ... (same error)

`dict` command after cljam 0.8.2 fails

$ cljam version
0.8.4
$ cljam dict test-resources/fasta/test.fa /tmp/test.dict
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.commons.compress.compressors.xz.XZUtils.<clinit>(XZUtils.java:59)
        at org.apache.commons.compress.compressors.CompressorStreamFactory.detect(CompressorStreamFactory.java:271)
        at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:522)
        at cljam.util$compressor_input_stream.invokeStatic(util.clj:122)
        at cljam.io.fasta.core$reader.invokeStatic(core.clj:38)
        at cljam.io.sequence$fasta_reader.invokeStatic(sequence.clj:20)
        at cljam.algo.dict$create_dict.invokeStatic(dict.clj:9)
        at cljam.tools.cli$dict.invokeStatic(cli.clj:297)
        at cljam.tools.cli$run.invokeStatic(cli.clj:357)
        at cljam.tools.main$_main.invokeStatic(main.clj:6)
        at cljam.tools.main$_main.doInvoke(main.clj:6)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at cljam.tools.main.main(Unknown Source)
Caused by: java.lang.NullPointerException: Cannot invoke "Object.getClass()" because the return value of "java.lang.Class.getClassLoader()" is null
        at org.apache.commons.compress.utils.OsgiUtils.<clinit>(OsgiUtils.java:31)
        ... 13 more
$ echo $?
1

<=0.8.1 succeeds but 0.8.2<= fails.

$ cljam version
0.8.1
$ cljam dict test-resources/fasta/test.fa /tmp/test.dict
$ echo $?
0

Problem that it takes a long time to create an index file from VCF / BCF with many columns.

I think it's better to minimize the interpretation function of each line of read-file-offsets in vcf / bcf.

The current parse-fn of read-file-offsets is set to : deep, but it takes time due to unnecessary interpretation. When I set the row interpretation function to :shallow and the end value of the variant to only (dec (+ pos (count ref)))), I found that the computation time finished faster than if I had not done so.

Character escaping in VCF string fields

As discussed in PR #135, the character escaping for " and \ should be handled correctly in VCF string fields.
More enhancement is needed for the VCF reader to support character escaping.

Problematic behaviors around BAM index generation

Currently, cljam's BAM writer supports simultaneous writing of a BAM file and its index file. This feature can be enabled by specifying true as the second argument of cljam.io.sam/writer. There are two problematic behaviors around this:

  1. Index file writing is only performed if SO:coordinate is specified in the BAM file header
    • If SO:coordinate is not specified, enabling index file writing is ignored, and no index file is created without any error
  2. The BAM indexer can also create BAM index files although it does not check for SO:coordinate in the BAM file header and always creates an index file
    • This is not only inconsistent with the BAM writer's behavior but also can result in invalid index files if the input BAM file is not sorted

I think the following changes can be made to improve usability and consistency:

  1. If index file writing is enabled for the BAM writer and SO:coordinate is not specified in the BAM file header, an error should be raised to tell the user that the BAM writer didn't succeed in generating an index file
  2. The BAM indexer should also check for SO:coordinate in the input BAM file header and raise an error if SO:coordinate is not specified, ensuring consistency with the BAM writer

Note that these changes will introduce breaking changes in that they will cause errors in situations where no errors were found previously.

Add tests to check generated data correctness

(Came from #60 (review) )

Add many tests by #60, but some tests doesn't check generated(converted) result data.
Should check these data (but may need correct result data).

  1. ;; TODO: examine result

  2. (are [?args] (not-throw? (with-out-file temp-out (cli/pileup ?args)))

    • Done by #65
    • But a comment say NB: "pileup" output format may change in future (maybe).
  3. ;; TODO: examine contents of temp-bam

  4. ;; TODO: check out-file

  5. (deftest about-create-dict

  6. ;; TODO: need bam-file include PCR duplications

  7. ;; TODO: Add test sorter/sort-by-pos with bam file includes many blocks

    • Need lightweight bam-file with many blocks (require to over 1500000) for sorter test coverage.
    • I commit ayamada@f2ed61b , it can work, but it cause very long time in lein cloverage, may not merge it.

Wrong FASTA index

cljam generates wrong FASTA index (.fai) from exampleFASTA.fasta (ftp://gftp.broadinstitute.org/bundle/2.8/exampleFASTA/)

$ lein run faidx /tmp/exampleFASTA.fasta
$ cat /tmp/exampleFASTA.fasta.fai 
chr1    100000  6       100000  100001

But correct index is the following.

chr1    100000  6       60      61

Features for validating VCF/SAM records

While we are writing out VCF/SAM records to a file, and if we have some invalid data in them, we often end up seeing a confusing error that's pretty hard to tell what was wrong.

It would be nice to have easy-to-understand error messages in order to make it easier to debug errors, but it's highly likely to degrade the serialization performance. So, how about adding features for validating the input VCF/SAM records to see if they are conformant to the format the cljam writer expects, apart from the existing VCF/SAM file IO APIs.

##META meta info support for VCF/BCF

Currently, the VCF reader/writer aren't supporting ##META meta info line, which is mentioned in the spec v4.3.

They don't throw an error when encountering such lines, but silently process them in an incorrect way.
Need to be more compliant with that part of the spec (though the spec itself looks kind of half-baked).

Replacing LSB protocol

The LSB protocol (i.e., LSBReadable/LSBWritable) is widely used within cljam as a foundation for encoding and decoding various binary formats.

However, the LSB protocol has some performance issues:

  • In general, each method of Clojure protocols has its own method cache, which keeps the overhead of method invocation relatively small as long as the method is called only on one implementation (in a so-called monomorphic situation)
  • However, the LSB protocol has multiple implementations, and decoding a file of a particular binary format occasionally involves method invocations to multiple implementations
  • Even worse, decoding in parallel causes method invocations to occur in parallel, which disturbs the method cache for each method invocation and results in much more overhead

In particular, reading a BAM file (including reading the BAM index) involves three implementations of the LSB protocol (namely, DataInput/InputStream/ByteBuffer) and has a potentially large overhead. My rough profiling for cljam.algo.depth, where a BAM file is read in parallel, shows that the overhead to do with the method invocations to the LSB protocol sums up around 10%.

The LSB protocol does not inherently require dynamic dispatching and has too much overhead to be one of the underpinning parts of cljam’s low layers. I think we should replace the LSB protocol with something more efficient.

A concrete plan to replace the LSB protocol is like the following:

  • #290
  • #291
    • The current implementation of TwoBitWriter is somewhat tricky in that two LSB protocol implementations are used at the same time
  • #293

I’m filing three individual PRs for these.

Pileup reads with adjacent indels.

CIGAR code with adjacent indels (e.g. 1M2D3I, 1M2N3I) is allowed by SAM file format specification.

Insertions and deletions are piled up at the base position just before them according to samtools.

I couldn't figure out the specification for mpileup format, but mpileup output of samtools 1.3.1 seems to be incorrect.

input SAM
@SQ SN:ref  LN:20
r001 0 ref 2 60  1M5D5I  * 0 0 ATGCAT  IIIIII
samtools 1.3.1 output
ref	2	N	1	^]A-5NNNNN	I
ref	3	N	1	*	I
ref	4	N	1	*	I
ref	5	N	1	*	I
ref	6	N	1	*	I
ref	7	N	1	*+5GCATC$	I
cljam output
ref	2	N	1	A-5NNNNN	I
ref	3	N	1	*	~
ref	4	N	1	*	~
ref	5	N	1	*	~
ref	6	N	1	*	~
ref	7	N	1	*+5TGCAT	~

I think inserted sequence +5GCATC should be +5TGCAT.
The last base C of the samtools output is affected by upper nibble of the first quality score.
There might be an off-by-one error & buffer overrun or something in htslib/samtools implementation.

What should cljam output? Should it be the same as samtools?

`mpileup` loses bases of alignments starting with a insertion

input SAM

@SQ	SN:ref	LN:45
q001	0	ref	10	60	4I4M4D4M4I	*	0	0	ATGCATGCATGCATGC	ABCDEFGHIJKLMNOP

samtools output

ref	10	N	1	^]A	E
ref	11	N	1	T	F
ref	12	N	1	G	G
ref	13	N	1	C-4NNNN	H
ref	14	N	1	*	I
ref	15	N	1	*	I
ref	16	N	1	*	I
ref	17	N	1	*	I
ref	18	N	1	A	I
ref	19	N	1	T	J
ref	20	N	1	G	K
ref	21	N	1	C+4ATGC$	L

cljam output

ref	10	N	1	^]A	E
ref	11	N	1	T	F
ref	12	N	1	G	G
ref	13	N	1	C-4NNNN	H
ref	14	N	1	*	~
ref	15	N	1	*	~
ref	16	N	1	*	~
ref	17	N	1	*	~
ref	18	N	1	A	I
ref	19	N	1	T	J
ref	20	N	1	G	K
ref	21	N	1	C+4ATGC$	L

Losing contigs from VCF info

Hi,
There is an issue with load-meta-info function in cljam.vcf.reader with VCF files
which have more than one config; it returns only last one.
Adding :contig key in

(if (#{:info :filter :format :alt :sample :pedigree} k)

should fixed this.

Regards
Vladimir

SAM/BAM/CRAM readers don't support @CO header

The SAM/BAM/CRAM specification has the @CO header for one-line comments. It's defined as below:

@CO: One-line text comment. Unordered multiple @CO lines are allowed. UTF-8 encoding may be
used.
https://github.com/samtools/hts-specs/blob/be74ef71f3fad34eb86af83bd66338d7d569af99/SAMv1.tex#L356

However, the current implementation of the SAM/BAM reader doesn't read the @CO header properly.

Repro

$ samtools view -h header_comment.sam
@SQ	SN:chr1	LN:1000	M5:258e88dcbd3cd44d8e7ab43f6ecb6af0
@CO	This is a comment.
@CO	This is also a comment.
(require '[cljam.io.sam :as sam])

(with-open [r (sam/reader "header_comment.sam")]
  (sam/read-header r))
;=>
{:SQ [{:SN "chr1", :LN 1000, :M5 "258e88dcbd3cd44d8e7ab43f6ecb6af0"}],
 :CO [{:This is a comment. nil} {:This is also a comment. nil}]}

Note that the comment contents themselves are read as keywords including whitespaces.

Ambiguity in empty SEQ representation of SAM/BAM format

If reading an alignment with an "empty" (void) SEQ from a SAM file using the SAM reader, you'll get something like:

{..., :seq "*", ...}

while you'll get {..., :seq "", ...} if you read such an alignment from a BAM file using the BAM reader.

Similarly, the SAM/BAM writers expect "*" / "", respectively, as the empty SEQ representation.

So, users wouldn't get in trouble with this inconsistency as far as they use the same file format as their input/output. But once trying to convert SAM to BAM (or vice versa), they would see an error or unexpected behavior.

Considering that empty (void) values for most other fields are represented as "*", I think it would be better to make the empty SEQ representation of BAM aligned with that of SAM.

The use of ByteBuffer & CharBuffer breaks cljam's binary compatibility

Once you build cljam on a Java 9+ environmont:

$ java -version
openjdk version "11" 2018-09-25
OpenJDK Runtime Environment 18.9 (build 11+28)
OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
$ lein uberjar
Compiling cljam.tools.main                                     
Compiling cljam.tools.cli                                                 
Compiling cljam.util.region                                                                                                                                                                                                                                                                       
Compiling cljam.util.chromosome                                                                                                                                                                                                                                                                   
Compiling cljam.util.whole-genome  
...
Created /Users/sohta/work/cljam/target/cljam-0.7.1-SNAPSHOT.jar
Created /Users/sohta/work/cljam/target/cljam-0.7.1-SNAPSHOT-standalone.jar
$

And use the generated JAR on Java 8, you'll run into a NoSuchMethodError:

$ java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
$ lein repl
user=> (require '[cljam.io.sam.util.quality :as sam-qual])
nil
user=> (sam-qual/phred-bytes->fastq (.getBytes "A"))
NoSuchMethodError java.nio.CharBuffer.flip()Ljava/nio/CharBuffer;  cljam.io.sam.util.quality/phred-bytes->fastq (quality.clj:30)
user=>

The root cause of this issue is that Java 9 introduced some overriden methods for ByteBuffer and other Buffer impl classes, and that may sometimes break binary-level compatibility. See the links below for more details on the issue itself:

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.