chrovis / cljam Goto Github PK

View Code? Open in Web Editor NEW

90.0 11.0 12.0 36.54 MB

A DNA Sequence Alignment/Map (SAM) library for Clojure

Home Page: https://chrovis.github.io/cljam

License: Apache License 2.0

Clojure 99.99% Shell 0.01%

clojure bioinformatics genomics sam bam vcf bcf fasta 2bit bed

cljam's Introduction

cljam

A DNA Sequence Alignment/Map (SAM) library for Clojure. [API Reference] [Annotated Source]

Installation

cljam is available as a Maven artifact from Clojars.

Clojure CLI/deps.edn:

cljam {:mvn/version "0.8.5"}

Leiningen/Boot:

[cljam "0.8.5"]

Breaking changes in 0.8.0

cljam.io.tabix is rewritten. #180
cljam.io.bam-index.writer/pos->lidx-offset is moved to cljam.io.util.bin/pos->lidx-offset. #180
cljam.io.sam.util/reg->bin is moved to cljam.io.util.bin/reg->bin. Also, a coordinate system of its argument is changed from 0-based half-open to 1-based fully-closed. #190

Getting started

To read a SAM/BAM format file,

(require '[cljam.io.sam :as sam])

;; Open a file
(with-open [r (sam/reader "path/to/file.bam")]
  ;; Retrieve header
  (sam/read-header r)
  ;; Retrieve alignments
  (doall (take 5 (sam/read-alignments r))))

To create a sorted file,

(require '[cljam.io.sam :as sam]
         '[cljam.algo.sorter :as sorter])

(with-open [r (sam/reader "path/to/file.bam")
            w (sam/writer "path/to/sorted.bam")]
  ;; Sort by chromosomal coordinates
  (sorter/sort-by-pos r w))

To create a BAM index file,

(require '[cljam.algo.bam-indexer :as bai])

;; Create a new BAM index file
(bai/create-index "path/to/sorted.bam" "path/to/sorted.bam.bai")

To calculate coverage depth for a BAM file,

(require '[cljam.io.sam :as sam]
         '[cljam.algo.depth :as depth])

(with-open [r (sam/reader "path/to/sorted.bam")]
  ;; Pileup "chr1" alignments
  (depth/depth r {:chr "chr1", :start 1, :end 10}))
;;=> (0 0 0 0 0 0 1 1 3 3)

If you are Clojure beginner, read Getting Started for Clojure Beginners.

Command-line tool

cljam provides a command-line tool to use the features easily.

NOTICE The command-line tool functionality will be removed from cljam on the next release. The functionality has been already integrated into Gnife, which is a pure command-line tool.

Executable installation

lein bin creates standalone console executable into target directory.

$ lein bin
Creating standalone executable: /path/to/cljam/target/cljam

Copy the executable cljam somewhere in your $PATH.

Usage

All commands are displayed by cljam -h, and detailed help for each command are displayed by cljam [cmd] -h.

$ cljam view -h

For example, to display contents of a SAM file including the header,

$ cljam view --header path/to/file.sam

See command-line tool manual for more information.

Development

Test

To run tests,

lein test for basic tests,
lein test :slow for slow tests with local resources,
lein test :remote for tests with remote resources.

To get coverage

$ lein cloverage

And open target/coverage/index.html.

Generating document

cljam uses Codox for API reference and Marginalia for annotated source code.

$ lein docs

generates these documents in target/docs and target/literate directories.

Citing cljam

T. Takeuchi, A. Yamada, T. Aoki, and K. Nishimura. cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing. Source Code for Biology and Medicine, Vol. 11, No. 1, pp. 1-4, 2016.

Contributors

Sorted by first commit.

Toshiki Takeuchi (@totakke)
Takashi Aoki (@federkasten)
Atsuo Yamada (@ayamada)
Jun Imura (@alumi)
Shogo Ohta (@athos)
Shunya Kawabata (@r6eve)
Yuji Ito (@yito88)
Akira Inoue (@niyarin)
Atsushi Kitahara (@xckitahara)
Kei Komazaki (@k-kom)
Tomoki Matsuda (@matsutomo81)
Ryutaro Nagata (@nuggetoriniku)

License

Licensed under the Apache License, Version 2.0.

cljam's People

Contributors

Stargazers

Watchers

Forkers

federkasten ayamada biolisp na4da stjordanis clojure-pirates dagams abhi18av-forked-repos niyarin-another k-kom matsutomo81 nihilee

cljam's Issues

Test coverage fails sometimes on some environments

I ran into several errors and test failures while running test coverage on my local environment (Apple M1 Max).

The test coverage result is here: https://gist.github.com/athos/7a707012cacd16850951a98b5a44755a

This issue is reproducible, though it only occurs about once in a few attempts. I haven't seen this on an x64 machine, so it might be a processor-dependent problem (or might be just my M1 Mac? Idk).

Read/write invariance breaks with integer tags in BAM/CRAM

When reading alignments from a BAM or CRAM file and writing them to another BAM/CRAM file as they are, the values of integer tags may change.

Repro

$ samtools view int_tag_overflow.bam
r1      4       *       0       0       *       *       0       0       ATGC    ####    XA:i:4294967295

(require '[cljam.io.sasm :as sam])

(with-open [r (sam/reader "int_tag_overflow.bam")]
  (doall (sam/read-alignments r)))
;=>
({:qname "r1",
  :flag 4,
  :rname "*",
  ...
  :seq "ATGC",
  :qual "####",
  :options ({:XA {:type "i", :value 4294967295}})})

(with-open [r (sam/reader "int_tag_overflow.bam")
            w (sam/writer "int_tag_overflow.rewrite.bam")]
  (sam/write-header w (sam/read-header r))
  (sam/write-refs w (sam/read-refs r))
  (sam/write-alignments w (sam/read-alignments r) (sam/read-header r)))

(with-open [r (sam/reader "int_tag_overflow.rewrite.bam")]
  (doall (sam/read-alignments r)))
;=>
({:qname "r1",
  :flag 4,
  :rname "*",
  ...
  :seq "ATGC",
  :qual "####",
  :options ({:XA {:type "i", :value -1}})})  ;; <- this value has changed from the original one

Cause

The SAM format defines the only integer tag type i (signed arbitrary-precision integer) while the BAM/CRAM format has the i integer tag type with different semantics (signed 32bit integer), as well as other integer types (c/C/s/S/I)
cljam's BAM/CRAM reader interprets any integer tag value as the i tag type
cljam's BAM/CRAM writer doesn't check if each integer tag value fits the specified tag type. It writes a tag value as the i tag type even if it can't be represented as a signed 32bit integer.

Random read of 2bit is quite slower than FASTA

(require '[cljam.io.sequence :as cseq])

(with-open [r (cseq/reader "path/to/hg38.2bit")]
  (time (cseq/read-sequence r {:chr "chr7" :start 55174773, :end 55174783})))
;; "Elapsed time: 1678.031253 msecs"

(with-open [r (cseq/reader "path/to/hg38.fa")]
  (time (cseq/read-sequence r {:chr "chr7" :start 55174773, :end 55174783})))
;; "Elapsed time: 15.879277 msecs"

Seemingly useless nil entry occasionally included in VCF parsing results

When parsing a VCF file just having the 8 required fields (ie. chrom/pos/id/ref/alt/qual/filter/info), the resulting maps contain a seemingly useless nil nil entry as follows:

(def r (vcf/reader "/path/to/file.vcf"))
(def vs (vcf/read-variants r))
(first vs)
;=> {nil nil, :alt ["C"], :ref "T", :pos 123456, :filter (:PASS), :id nil, :info {:DP 40, :RD 31, :AD (9), :AF (0.225)}, :qual nil, :chr "chr1"}
;    ^^^^^^^

If the VCF file has some extra fields other than the required ones, it dosen't happen.

Create an index file while writing a sorted BAM file

When alignments are sorted by coordinate (SO:coordinate in @HD), create an .bai file simultaneously with the .bam.

Suggestion on effective VCF file reading?

I am a experiencing OOM java heap problem when trying to read a multisample, chromosome partitioned VCF from the 1000 genomes project, specifically ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.

I am doing:

(with-open [r (vcf/reader "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz")]
    (into [] (vcf/read-variants (vcf/clone-vcf-reader r) {:depth :deep})))

Is there a better approach to parsing VCFs witch cljam, which will not run out of memory?
I tried running this on an M2 Pro, allocating 20GB to jvm.

Coding style fixes from clj-kondo's linter with default :off

This issue is a answer to comment(#269 (comment))

Lists options that clj-kondo defaults to :off that should be introduced.
After listing, fix the corresponding code and add the clj-lint-action settings.

Linters that should be introduced

:single-key-in

This option warns about get-in and update-in where the number of elements in keys is 1.
Cljam reported 1 warning.

./src/cljam/io/fasta/reader.clj:74:39: warning: update-in with single key

:aliased-namespace-symbol

This option warns about code that doesn't use namespace aliases.
cljam reported 2 warnings.

./src/cljam/tools/cli.clj:332:32: warning: An alias is defined for clojure.java.io: cio
./src/cljam/tools/cli.clj:333:32: warning: An alias is defined for clojure.java.io: cio

Linters that may be introduced

:reduce-without-init
:redundant-fn-wrapper

Linters that may be introduced (without Kibit)

:minus-one :plus-one :equals-true

CRAM support?

This is a neat clojure library for bioinformatics. Do you plan on adding CRAM file support?

Removing dupricate reads

A new feature for removing dupricate reads like Picard MarkDupricates.

bgzipped VCF file over HTTP causes infinite loop

A bgzipped VCF file over HTTP causes infinite loop.

$ gzip -c test-resources/vcf/test-v4_3.vcf > test-resources/vcf/test-v4_3-gzip.vcf.gz
$ bgzip -c test-resources/vcf/test-v4_3.vcf > test-resources/vcf/test-v4_3-bgzip.vcf.gz
$ python3 -m http.server 8000

(require '[cljam.io.vcf :as vcf])

;; gzipped file correctly finished.
(with-open [rdr (vcf/vcf-reader "http://localhost:8000/test-resources/vcf/test-v4_3-gzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;;=> ({:NA00001 {:GT "0|0", :GQ 48, :DP 1, :HQ (51 51)},
;;     :alt ["A"],
;;     :ref "G",
;;     :FORMAT (:GT :GQ :DP :HQ),
;;     :NA00002 {:GT "1|0", :GQ 48, :DP 8, :HQ (51 51)},
;;     :pos 14370,
;;     :filter (:PASS),
;;     :id "rs6054257",
;;     :info {:NS 3, :DP 14, :AF (0.5), :DB :exists, :H2 :exists},
;;     :qual 29.0,
;;     :NA00003 {:GT "1/1", :GQ 43, :DP 5, :HQ (nil nil)},
;;     :chr "20"}
;;    ...)

;; bgzipped file causes infinite loop.
(with-open [rdr (vcf/vcf-reader "http://localhost:8000/test-resources/vcf/test-v4_3-bgzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;; infinite loop...

cljam can read the bgzipped file as a local file, so I think this issue is related to bgzf4j.

(with-open [rdr (vcf/vcf-reader "test-resources/vcf/test-v4_3-bgzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;;=> ({:NA00001 {:GT "0|0", :GQ 48, :DP 1, :HQ (51 51)},
;;     :alt ["A"],
;;     :ref "G",
;;     :FORMAT (:GT :GQ :DP :HQ),
;;     :NA00002 {:GT "1|0", :GQ 48, :DP 8, :HQ (51 51)},
;;     :pos 14370,
;;     :filter (:PASS),
;;     :id "rs6054257",
;;     :info {:NS 3, :DP 14, :AF (0.5), :DB :exists, :H2 :exists},
;;     :qual 29.0,
;;     :NA00003 {:GT "1/1", :GQ 43, :DP 5, :HQ (nil nil)},
;;     :chr "20"}
;;    ...)

I noticed the issue while reading COSMIC's bgzipped VCF, but this NegativeArraySizeException seems to be another problem. A further inverstigation is needed.

(with-open [rdr (vcf/vcf-reader "http://localhost:8000/CosmicCodingMuts.vcf.gz")]
  (dorun (vcf/read-variants rdr)))
;; Caused by java.lang.NegativeArraySizeException
;; -481634578
;;
;;    BGZFInputStream.java:  365  bgzf4j.BGZFInputStream/inflateBlock
;;    BGZFInputStream.java:  353  bgzf4j.BGZFInputStream/readBlock
;;    BGZFInputStream.java:  102  bgzf4j.BGZFInputStream/available
;;    BGZFInputStream.java:  202  bgzf4j.BGZFInputStream/readLine
;;              reader.clj:  179  cljam.io.vcf.reader/read-data-lines
;;              reader.clj:  175  cljam.io.vcf.reader/read-data-lines
;;              reader.clj:  182  cljam.io.vcf.reader/read-data-lines/fn

Bgziped VCF is illegal.

We can reproduce it with the following code.

(with-open [wtr (vcf/writer "./test.vcf.gz"
                            {:format nil :contig {:id "1", :length nil} :filter [{:id "PASS", :description "All filters passed"}] :info nil}
                            ["CHROM" "POS" "ID" "REF" "ALT" "QUAL" "FILTER" "INFO" "FORMAT"])]
  (vcf/write-variants wtr [{:chr "1" :pos 10 :ref "A" :alt ["G"]}]))

(with-open [rdr (vcf/reader "./test.vcf.gz")]
  (vcf/read-variants rdr))
;Error printing return value (IOException) at java.io.BufferedReader/ensureOpen (BufferedReader.java:122).

The generated bgzip file also fails with the'bcftools index'.
Now, if we do bgzip -d and bgzip, the error doesn't occur.

exec binary does not work on Java 9+

$ lein bin
...
Creating standalone executable: /path/to/cljam/target/cljam

# 1.8
$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
$ target/cljam version
0.5.2-SNAPSHOT

# 9
$ java -version
java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)
$ target/cljam version
Exception in thread "main" java.lang.ExceptionInInitializerError
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:375)
        at clojure.lang.RT.classForName(RT.java:2168)
        at clojure.lang.RT.classForName(RT.java:2177)
        at clojure.lang.RT.loadClassForName(RT.java:2196)
        at clojure.lang.RT.load(RT.java:443)
        at clojure.lang.RT.load(RT.java:419)
        at clojure.core$load$fn__5677.invoke(core.clj:5893)
        at clojure.core$load.invokeStatic(core.clj:5892)

# 10
$ java -version
java 10.0.1 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)
$ target/cljam -h
Exception in thread "main" java.lang.ExceptionInInitializerError
        at java.base/java.lang.Class.forName0(Native Method)
        ... (same error)

`dict` command after cljam 0.8.2 fails

$ cljam version
0.8.4
$ cljam dict test-resources/fasta/test.fa /tmp/test.dict
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.commons.compress.compressors.xz.XZUtils.<clinit>(XZUtils.java:59)
        at org.apache.commons.compress.compressors.CompressorStreamFactory.detect(CompressorStreamFactory.java:271)
        at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:522)
        at cljam.util$compressor_input_stream.invokeStatic(util.clj:122)
        at cljam.io.fasta.core$reader.invokeStatic(core.clj:38)
        at cljam.io.sequence$fasta_reader.invokeStatic(sequence.clj:20)
        at cljam.algo.dict$create_dict.invokeStatic(dict.clj:9)
        at cljam.tools.cli$dict.invokeStatic(cli.clj:297)
        at cljam.tools.cli$run.invokeStatic(cli.clj:357)
        at cljam.tools.main$_main.invokeStatic(main.clj:6)
        at cljam.tools.main$_main.doInvoke(main.clj:6)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at cljam.tools.main.main(Unknown Source)
Caused by: java.lang.NullPointerException: Cannot invoke "Object.getClass()" because the return value of "java.lang.Class.getClassLoader()" is null
        at org.apache.commons.compress.utils.OsgiUtils.<clinit>(OsgiUtils.java:31)
        ... 13 more
$ echo $?
1

<=0.8.1 succeeds but 0.8.2<= fails.

$ cljam version
0.8.1
$ cljam dict test-resources/fasta/test.fa /tmp/test.dict
$ echo $?
0

Problem that it takes a long time to create an index file from VCF / BCF with many columns.

I think it's better to minimize the interpretation function of each line of read-file-offsets in vcf / bcf.

The current parse-fn of read-file-offsets is set to : deep, but it takes time due to unnecessary interpretation. When I set the row interpretation function to :shallow and the end value of the variant to only (dec (+ pos (count ref)))), I found that the computation time finished faster than if I had not done so.

QUAL value overflows when writing VCF

VCF QUAL values may contain huge numbers that are out of the Integer range.
Attempts to write such data to a file will fail.

Example¹

chr1	143214613	rs1483186290	T	A	5.60878e+09	VQSRTrancheSNP99.95to100.00;PlatformBiasObserved	...

tommo-38kjpn-20220630-GRCh38-af-autosome.vcf.gz ↩

Character escaping in VCF string fields

As discussed in PR #135, the character escaping for " and \ should be handled correctly in VCF string fields.
More enhancement is needed for the VCF reader to support character escaping.

Problematic behaviors around BAM index generation

Currently, cljam's BAM writer supports simultaneous writing of a BAM file and its index file. This feature can be enabled by specifying true as the second argument of cljam.io.sam/writer. There are two problematic behaviors around this:

Index file writing is only performed if SO:coordinate is specified in the BAM file header
- If SO:coordinate is not specified, enabling index file writing is ignored, and no index file is created without any error
The BAM indexer can also create BAM index files although it does not check for SO:coordinate in the BAM file header and always creates an index file
- This is not only inconsistent with the BAM writer's behavior but also can result in invalid index files if the input BAM file is not sorted

I think the following changes can be made to improve usability and consistency:

If index file writing is enabled for the BAM writer and SO:coordinate is not specified in the BAM file header, an error should be raised to tell the user that the BAM writer didn't succeed in generating an index file
The BAM indexer should also check for SO:coordinate in the input BAM file header and raise an error if SO:coordinate is not specified, ensuring consistency with the BAM writer

Note that these changes will introduce breaking changes in that they will cause errors in situations where no errors were found previously.

Add tests to check generated data correctness

(Came from #60 (review) )

Add many tests by #60, but some tests doesn't check generated(converted) result data.
Should check these data (but may need correct result data).

cljam/test/cljam/t_cli.clj

Line 53 in ff91c66

;; TODO: examine result
- Done by #66
cljam/test/cljam/t_cli.clj

Line 96 in ff91c66

(are [?args] (not-throw? (with-out-file temp-out (cli/pileup ?args)))
- Done by #65
- But a comment say NB: "pileup" output format may change in future (maybe).
cljam/test/cljam/t_cli.clj

Line 142 in ff91c66

;; TODO: examine contents of temp-bam
- Done by #70
cljam/test/cljam/t_pileup.clj

Line 218 in ff91c66

;; TODO: check out-file
- Done by #67
cljam/test/cljam/t_dict.clj

Line 17 in ff91c66

(deftest about-create-dict
- Done by #64
cljam/test/cljam/t_dedupe.clj

Line 11 in ff91c66

;; TODO: need bam-file include PCR duplications
- Done by #63
cljam/test/cljam/t_sorter.clj

Line 110 in ff91c66

;; TODO: Add test sorter/sort-by-pos with bam file includes many blocks
- Need lightweight bam-file with many blocks (require to over 1500000) for sorter test coverage.
- I commit ayamada@f2ed61b , it can work, but it cause very long time in lein cloverage, may not merge it.

Wrong FASTA index

cljam generates wrong FASTA index (.fai) from exampleFASTA.fasta (ftp://gftp.broadinstitute.org/bundle/2.8/exampleFASTA/)

$ lein run faidx /tmp/exampleFASTA.fasta
$ cat /tmp/exampleFASTA.fasta.fai 
chr1    100000  6       100000  100001

But correct index is the following.

chr1    100000  6       60      61

Features for validating VCF/SAM records

While we are writing out VCF/SAM records to a file, and if we have some invalid data in them, we often end up seeing a confusing error that's pretty hard to tell what was wrong.

It would be nice to have easy-to-understand error messages in order to make it easier to debug errors, but it's highly likely to degrade the serialization performance. So, how about adding features for validating the input VCF/SAM records to see if they are conformant to the format the cljam writer expects, apart from the existing VCF/SAM file IO APIs.

##META meta info support for VCF/BCF

Currently, the VCF reader/writer aren't supporting ##META meta info line, which is mentioned in the spec v4.3.

They don't throw an error when encountering such lines, but silently process them in an incorrect way.
Need to be more compliant with that part of the spec (though the spec itself looks kind of half-baked).

Replacing LSB protocol

The LSB protocol (i.e., LSBReadable/LSBWritable) is widely used within cljam as a foundation for encoding and decoding various binary formats.

However, the LSB protocol has some performance issues:

In general, each method of Clojure protocols has its own method cache, which keeps the overhead of method invocation relatively small as long as the method is called only on one implementation (in a so-called monomorphic situation)
However, the LSB protocol has multiple implementations, and decoding a file of a particular binary format occasionally involves method invocations to multiple implementations
Even worse, decoding in parallel causes method invocations to occur in parallel, which disturbs the method cache for each method invocation and results in much more overhead

In particular, reading a BAM file (including reading the BAM index) involves three implementations of the LSB protocol (namely, DataInput/InputStream/ByteBuffer) and has a potentially large overhead. My rough profiling for cljam.algo.depth, where a BAM file is read in parallel, shows that the overhead to do with the method invocations to the LSB protocol sums up around 10%.

The LSB protocol does not inherently require dynamic dispatching and has too much overhead to be one of the underpinning parts of cljam’s low layers. I think we should replace the LSB protocol with something more efficient.

A concrete plan to replace the LSB protocol is like the following:

#290
#291
- The current implementation of TwoBitWriter is somewhat tricky in that two LSB protocol implementations are used at the same time
#293

I’m filing three individual PRs for these.

Report position in read for reads in a pileup

context: #260

Report an original position in a read for a base piled up

Pileup reads with adjacent indels.

CIGAR code with adjacent indels (e.g. 1M2D3I, 1M2N3I) is allowed by SAM file format specification.

Insertions and deletions are piled up at the base position just before them according to samtools.

I couldn't figure out the specification for mpileup format, but mpileup output of samtools 1.3.1 seems to be incorrect.

input SAM

@SQ SN:ref  LN:20
r001 0 ref 2 60  1M5D5I  * 0 0 ATGCAT  IIIIII

samtools 1.3.1 output

ref	2	N	1	^]A-5NNNNN	I
ref	3	N	1	*	I
ref	4	N	1	*	I
ref	5	N	1	*	I
ref	6	N	1	*	I
ref	7	N	1	*+5GCATC$	I

cljam output

ref	2	N	1	A-5NNNNN	I
ref	3	N	1	*	~
ref	4	N	1	*	~
ref	5	N	1	*	~
ref	6	N	1	*	~
ref	7	N	1	*+5TGCAT	~

I think inserted sequence +5GCATC should be +5TGCAT.
The last base C of the samtools output is affected by upper nibble of the first quality score.
There might be an off-by-one error & buffer overrun or something in htslib/samtools implementation.

What should cljam output? Should it be the same as samtools?

`mpileup` loses bases of alignments starting with a insertion

input SAM

@SQ	SN:ref	LN:45
q001	0	ref	10	60	4I4M4D4M4I	*	0	0	ATGCATGCATGCATGC	ABCDEFGHIJKLMNOP

samtools output

ref	10	N	1	^]A	E
ref	11	N	1	T	F
ref	12	N	1	G	G
ref	13	N	1	C-4NNNN	H
ref	14	N	1	*	I
ref	15	N	1	*	I
ref	16	N	1	*	I
ref	17	N	1	*	I
ref	18	N	1	A	I
ref	19	N	1	T	J
ref	20	N	1	G	K
ref	21	N	1	C+4ATGC$	L

cljam output

ref	10	N	1	^]A	E
ref	11	N	1	T	F
ref	12	N	1	G	G
ref	13	N	1	C-4NNNN	H
ref	14	N	1	*	~
ref	15	N	1	*	~
ref	16	N	1	*	~
ref	17	N	1	*	~
ref	18	N	1	A	I
ref	19	N	1	T	J
ref	20	N	1	G	K
ref	21	N	1	C+4ATGC$	L

Losing contigs from VCF info

Hi,
There is an issue with load-meta-info function in cljam.vcf.reader with VCF files
which have more than one config; it returns only last one.
Adding :contig key in

(if (#{:info :filter :format :alt :sample :pedigree} k)

should fixed this.

Regards
Vladimir

May need to fix a test for bam-reader with `:chr` argument

(came from #60 (comment) )

I add test for bam-reader with :chr argument in #60, but I found peculiar results in

cljam/test/cljam/t_bam.clj

Line 84 in ff91c66

(deftest bamreader-with-index

.
Possibly I wrong something, but I didn't know this reason.

Case insensitive support for NaN/Infinity

The current cljam implementation of VCF reader throws if it encounters a lowercased nan as a Float value.

The VCF spec says that the lexical syntax of Float allows for case insensitive [-+]?(INF|INFINITY|NAN), and in practice, there are sometimes VCFs out there (e.g. some of ToMMo’s VCFs) that contain such values.

SAM/BAM/CRAM readers don't support @CO header

The SAM/BAM/CRAM specification has the @CO header for one-line comments. It's defined as below:

@CO: One-line text comment. Unordered multiple @CO lines are allowed. UTF-8 encoding may be
used.
https://github.com/samtools/hts-specs/blob/be74ef71f3fad34eb86af83bd66338d7d569af99/SAMv1.tex#L356

However, the current implementation of the SAM/BAM reader doesn't read the @CO header properly.

Repro

$ samtools view -h header_comment.sam
@SQ	SN:chr1	LN:1000	M5:258e88dcbd3cd44d8e7ab43f6ecb6af0
@CO	This is a comment.
@CO	This is also a comment.

(require '[cljam.io.sam :as sam])

(with-open [r (sam/reader "header_comment.sam")]
  (sam/read-header r))
;=>
{:SQ [{:SN "chr1", :LN 1000, :M5 "258e88dcbd3cd44d8e7ab43f6ecb6af0"}],
 :CO [{:This is a comment. nil} {:This is also a comment. nil}]}

Note that the comment contents themselves are read as keywords including whitespaces.

Ambiguity in empty SEQ representation of SAM/BAM format

If reading an alignment with an "empty" (void) SEQ from a SAM file using the SAM reader, you'll get something like:

{..., :seq "*", ...}

while you'll get {..., :seq "", ...} if you read such an alignment from a BAM file using the BAM reader.

Similarly, the SAM/BAM writers expect "*" / "", respectively, as the empty SEQ representation.

So, users wouldn't get in trouble with this inconsistency as far as they use the same file format as their input/output. But once trying to convert SAM to BAM (or vice versa), they would see an error or unexpected behavior.

Considering that empty (void) values for most other fields are represented as "*", I think it would be better to make the empty SEQ representation of BAM aligned with that of SAM.

The use of ByteBuffer & CharBuffer breaks cljam's binary compatibility

Once you build cljam on a Java 9+ environmont:

$ java -version
openjdk version "11" 2018-09-25
OpenJDK Runtime Environment 18.9 (build 11+28)
OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
$ lein uberjar
Compiling cljam.tools.main                                     
Compiling cljam.tools.cli                                                 
Compiling cljam.util.region                                                                                                                                                                                                                                                                       
Compiling cljam.util.chromosome                                                                                                                                                                                                                                                                   
Compiling cljam.util.whole-genome  
...
Created /Users/sohta/work/cljam/target/cljam-0.7.1-SNAPSHOT.jar
Created /Users/sohta/work/cljam/target/cljam-0.7.1-SNAPSHOT-standalone.jar
$

And use the generated JAR on Java 8, you'll run into a NoSuchMethodError:

$ java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
$ lein repl
user=> (require '[cljam.io.sam.util.quality :as sam-qual])
nil
user=> (sam-qual/phred-bytes->fastq (.getBytes "A"))
NoSuchMethodError java.nio.CharBuffer.flip()Ljava/nio/CharBuffer;  cljam.io.sam.util.quality/phred-bytes->fastq (quality.clj:30)
user=>

The root cause of this issue is that Java 9 introduced some overriden methods for ByteBuffer and other Buffer impl classes, and that may sometimes break binary-level compatibility. See the links below for more details on the issue itself:

Support CIGARs longer than 65535 operations

Detect CIGARs like kSmN and refer to CG tags.

chrovis / cljam Goto Github PK

cljam's Introduction

cljam

Installation

Breaking changes in 0.8.0

Getting started

Command-line tool

Executable installation

Usage

Development

Test

Generating document

Citing cljam

Contributors

License

cljam's People

Contributors

Stargazers

Watchers

Forkers

cljam's Issues

Repro

Cause

Linters that should be introduced

:single-key-in

:aliased-namespace-symbol

Linters that may be introduced

Linters that may be introduced (without Kibit)

Example1

Footnotes

input SAM

samtools 1.3.1 output

cljam output

input SAM

samtools output

cljam output

Repro

Recommend Projects

Recommend Topics

Recommend Org

Example¹