htsjdk's Introduction

A Java API for high-throughput sequencing data (HTS) formats.

HTSJDK is an implementation of a unified Java library for accessing common file formats, such as SAM and VCF, used for high-throughput sequencing data. There are also a number of useful utilities for manipulating HTS data.
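For orientation, here is a minimal sketch of reading a BAM file with the SamReaderFactory API (the input path is hypothetical; any SAM or BAM readable by htsjdk would do):

import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import java.io.File;

public class ReadBamExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical input path.
        File input = new File("example.bam");
        try (SamReader reader = SamReaderFactory.makeDefault().open(input)) {
            for (SAMRecord record : reader) {
                System.out.println(record.getReadName() + "\t" + record.getAlignmentStart());
            }
        }
    }
}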

NOTE: HTSJDK has only partial support for the latest Variant Call Format specification: VCFv4.3 can be read but not written, and there is no support for BCFv2.2.

Documentation & Getting Help

API documentation for all versions of HTSJDK since 1.128 is available through javadoc.io.

If you believe you have found a bug or have an issue with the library, please (a) search the open and recently closed issues to ensure it has not already been reported, and then (b) log a new issue.

The project has a gitter chat room if you would like to chat with the developers and others involved in the project.

To receive announcements of releases and other significant project news please subscribe to the htsjdk-announce google group.

Building HTSJDK

HTSJDK is built using gradle.

A wrapper script (gradlew) is included which will download the appropriate version of gradle on the first invocation.

Example gradle usage from the htsjdk root directory:

  • compile and build a jar
./gradlew

or

./gradlew jar

The jar will be in build/libs/htsjdk-<version>.jar, where <version> is based on the current git commit.

  • run tests, a specific test class, or run a test and wait for the debugger to connect
./gradlew test

./gradlew test --tests AlleleUnitTest

./gradlew test --tests AlleleUnitTest --debug-jvm
  • run tests and collect coverage information (report will be in build/reports/jacoco/test/html/index.html)
./gradlew jacocoTestReport
  • clean the project directory
./gradlew clean
  • build a monolithic jar that includes all of htsjdk's dependencies
./gradlew shadowJar
  • create a snapshot and install it into your local maven repository
./gradlew install
  • for an exhaustive list of all available targets
./gradlew tasks

Create an HTSJDK project in IntelliJ

To create a project for htsjdk in the IntelliJ IDE, do the following:

  1. Select from the menu: File -> New -> Project from Existing Sources
  2. In the resulting dialog, choose Import from existing model, select Gradle and click Next
  3. Choose the default gradle wrapper and Finish.

From time to time, if dependencies change in htsjdk, you may need to refresh the project from the View -> Gradle menu.

Code style

Style guide files are included for IntelliJ and Eclipse. These are a variation of the Google Java Style with 4-space indentation. This style is suggested for new code but not rigidly checked. Contributors may deviate from the style when it improves clarity or matches the surrounding code. Existing code does not necessarily conform to this style and does not need to be modified to do so, but users are encouraged to correct the formatting of code that they modify.

Licensing Information

Not all sub-packages of htsjdk are subject to the same license, so a license notice is included in each source file or sub-package as appropriate. Please check the relevant license notice whenever you start working with a part of htsjdk that you have not previously worked with, to avoid any surprises. Broadly speaking, the majority of the code is covered under the MIT license, with the following notable exceptions:

  • Much of the CRAM code is under the Apache License, Version 2
  • Core tribble code (underlying VCF reading/writing amongst other things) is under LGPL
  • Code supporting the reading/writing of SRA format is uncopyrighted & public domain

Java Minimum Version Support Policy

Htsjdk currently targets Java 8 and is tested on both 8 and 11.

We intend to drop support for 8/11 and switch exclusively to 17+ in our next release (4.0.0).

Given our promise of 6 months of warning before a move off of Java 8, we will attempt to provide 3.x releases on demand if any critical bugs are discovered in the next 6 months.

Java SE Major Release | End of Java SE Oracle Public Updates / OpenJDK support | Proposed End of Support in HTSJDK | Actual End of Support in HTSJDK
6  | Feb 2013 | Aug 2013 | Oct 2015
7  | Apr 2015 | Oct 2015 | Oct 2015
8  | Jan 2019 | Feb 2022 | TBD
11 | Sep 2022 | Feb 2022 | TBD
17 | TBD      | TBD      | TBD

Meaning of the Htsjdk version number

We encourage downstream projects to use the most recent htsjdk release in order to have access to the most up-to-date features and bug fixes. It is therefore important to make upgrading to newer versions as easy as possible. We make a best effort to adhere to the following principles in order to minimize disruption to projects that depend on htsjdk:

  • Avoid making breaking changes whenever possible. A breaking change is one which requires downstream projects to recompile against the new version of htsjdk or make changes to their source code. These include both binary incompatibilities and source incompatibilities.
  • Deprecate and provide new alternatives instead of removing existing APIs.
  • Document breaking changes in the release notes.
  • Provide clear instructions for upgrading to new APIs when breaking changes or deprecations occur.
  • Provide explanations for the rare cases when functionality is deprecated or removed without replacement.

We treat any accessible class/method/field as part of our API and attempt to minimize changes to it with the following exceptions:

  • The htsjdk.samtools.cram package and subpackages are considered unstable and are undergoing major changes.
  • Code which has not yet been released in a numbered version is considered unstable and subject to change without warning.
  • We consider changes to public code more disruptive than changes to protected code in classes that we believe are not generally subclassed by the downstream community.

Our current version number has three parts, e.g. 2.19.0:

  • Major version bumps (2.19.0 -> 3.0.0) allow large changes to the existing APIs and require substantial changes in downstream projects. These are extremely rare.
  • Minor version bumps (2.18.2 -> 2.19.0) may include additions to the API as well as breaking changes which may require recompiling downstream projects. We attempt to limit breaking changes as much as possible, and generally most projects which depend on htsjdk should be able to update to a new minor version with no changes or only simple and obvious ones. We may introduce deprecations which suggest, but don't mandate, more complex code changes. Minor releases may also remove functionality which has been deprecated for a long time.
  • Patch version changes (2.18.1 -> 2.18.2) include additions and possibly deprecations but no breaking changes.

htsjdk's People

Contributors

akiezun, alecw, bbimber, bpow, bw2, cmnbroad, cristynkells, d-cameron, droazen, eitanbanks, gbggrant, geoffjentry, jean-philippe-martin, jmthibault79, jrobinso, ktibbett, lbergelson, lenbok, lindenb, magicdgs, mattsooknah, mccowan, meganshand, nh13, ronlevine, shuang-broad, tfenne, tomwhite, vadimzalunin, yfarjoun


htsjdk's Issues

How to work with BED or vcf files

Hello,
I'm unsure if this is the right place to ask a development question, but I searched for examples of how to use htsjdk, and for SamReader/SamReaderFactory there are nice examples.
Are there also examples of how to read BED or VCF files?

Best Air
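For what it's worth, a minimal sketch of reading a VCF with VCFFileReader (the path is hypothetical; requireIndex=false allows plain, unindexed VCFs). For BED files, the tribble BEDCodec together with AbstractFeatureReader is the usual route:

import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFFileReader;
import java.io.File;

public class ReadVcfExample {
    public static void main(String[] args) {
        // Hypothetical input path; requireIndex=false so a plain VCF works.
        try (VCFFileReader reader = new VCFFileReader(new File("variants.vcf"), false)) {
            for (VariantContext ctx : reader) {
                System.out.println(ctx.getContig() + ":" + ctx.getStart() + " " + ctx.getAlleles());
            }
        }
    }
}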

Add a check for correct platform values, case insensitively

We never use INVALID_PLATFORM_VALUE, but the spec defines platform values. The platform values are uppercase, but in production there are BAMs with mixed-case or lowercase (:/) platform values. Let's add a check in SamFileValidator for the correct set of values, case-insensitively, along with unit tests. This is motivated by a new @RG-PL value in the SAM Spec: samtools/hts-specs#39

Separate JAR files

Hello, I want to use the samtools part of the htsjdk, but the current Ant file doesn't have a task for building separate JAR files for each of the subprojects. Is there any particular reason for this?

CRAM support for non-ref encoding

I'm not sure if this has been fixed yet or not, as I was only able to test cramtools-2.1.jar from the EBI web site. Is there a way to run picard tools with CRAM? (I tried ViewSam and it didn't work.)

Anyway, that aside, cramtools is unable to read CRAM files that have no reference. My file entirely uses CRAM features to store the base calls and never needs to reference an external sequence. It does now have @SQ M5 auxiliary tags (but does not require them), nor is it using an internally embedded reference.

I think this is still a legal file (apart from the oddity of officially needing @SQ M5, but I think we already decided to relax that for the case of embedded reference sequences), albeit a fairly inefficient one.

CRAM code base javadoc and code documentation.

There are a number of classes that are lacking documentation. In general, any public API methods should be documented (unless they are self-explanatory). Also, any complicated control logic should be properly commented.

moving the code from picard to htsjdk

Hi, I'm currently moving my code from picard 1.100 to htsjdk. So far I cannot see how I should now map these 'old' classes:

  • PicardException
  • net.sf.samtools.cmdline.*

Any hint?
Thanks.

JEXLMap update to help GATK VariantFiltration

JEXLMap already has JEXL context variables for isHomRef, isHet and isHomVar (circa line 97 in variant/variantcontext/JEXLMap.java), but it would be great if variables could be added for the other genotype types, specifically noCall.

Of course if GSA had commit rights we could take care of some of these issues on our own.... ;)

CRAM rewriting fails on assert

I'm not going to say I think it's CRAM and the NM tag, but I think it's CRAM and the NM tag.

This command line works, using the test crams in the current cramtools master branch:

$ java \
  -jar ~/src/picard/dist/picard.jar SamFormatConverter \
  R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
  I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
  O= test.cram
[Sun Dec 21 10:05:07 EST 2014] picard.sam.SamFormatConverter INPUT=test.bam OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Dec 21 10:05:07 EST 2014] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17; Picard version: 1.126(5b984f92664175debe3826040fdae1496d1d8fa7_1418347776) JdkDeflater
[Sun Dec 21 10:05:08 EST 2014] picard.sam.SamFormatConverter done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=163053568
$

The same command with the addition of -ea crashes on an assert:

$ java -ea \
  -jar ~/src/picard/dist/picard.jar SamFormatConverter \
  R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
  I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
  O= test.cram
[Sun Dec 21 10:05:19 EST 2014] picard.sam.SamFormatConverter INPUT=test.bam OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Dec 21 10:05:19 EST 2014] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17; Picard version: 1.126(5b984f92664175debe3826040fdae1496d1d8fa7_1418347776) JdkDeflater
[Sun Dec 21 10:05:19 EST 2014] picard.sam.SamFormatConverter done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=128974848
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.AssertionError
    at htsjdk.samtools.CRAMFileWriter.flushContainer(CRAMFileWriter.java:275)
    at htsjdk.samtools.CRAMFileWriter.finish(CRAMFileWriter.java:337)
    at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:205)
    at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:88)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:89)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:99)
$

Bam slicing issue: Unable to query by chromosome

I tried to use the SamReader:

reader.query(intervals, false);

where intervals contains a single QueryInterval:

QueryInterval interval = new QueryInterval(0,0,0);

It will throw an exception as follows:

htsjdk.samtools.util.RuntimeIOException: Expected 8 bytes, got 4
    at htsjdk.samtools.AbstractBAMFileIndex$IndexStreamBuffer.readLong(AbstractBAMFileIndex.java:684) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
    at htsjdk.samtools.AbstractBAMFileIndex.readLong(AbstractBAMFileIndex.java:438) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
    at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:332) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
    at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:60) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
    at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:729) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
    at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:412) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:468) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]

The BAM file size is about 15 GB downloaded from 1000 Genomes: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00595/alignment/HG00595.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522.bam

Any help is appreciated.

NPE in TabixReader

There is a bug in the TabixReader constructor.
You can specify the path to the index file ("idxFn", the second argument of the last constructor), but when this value is not null, it is never assigned to the class variable "mIdxFn". Because of that, you will get a NullPointerException in the readIndex() method.

VCFFileReader: InputStream?

Hello, I need to show completion percentages for an import operation that uses the VCFFileReader.
At the moment the constructor asks for a java.io.File object which, being only a fancy way to represent a path, doesn't allow me to tell() how much data has been consumed (and this is the only way to get an idea of how much time is left to read the entire VCF file).

Is there a way to access the underlying InputStream?
Otherwise, can someone add this feature?

Being able to know (even approximately) how much time is left seems like a must-have feature, IMO.

Thanks!
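Until a stream-based constructor exists, here is a rough sketch of the byte-counting wrapper the reporter is asking for; it is plain java.io, hypothetical, and only illustrates the progress-tracking idea:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Counts bytes as they are consumed so a caller can report progress
// against the known file size. Not part of htsjdk.
public class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}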

htsjdk.version.property.xml not saved as xml

I'm running ant 1.9.4 :

$ /commun/data/packages/apache-ant/apache-ant-1.9.4/bin/ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014
$ echo $ANT_HOME
/commun/data/packages/apache-ant/apache-ant-1.9.4

In build.xml the file htsjdk.version.property.xml should be saved as XML, but it is not:

$ /commun/data/packages/apache-ant/apache-ant-1.9.4/bin/ant write-version-property
(...)
$ cat /commun/data/users/lindenb/src/jvarkit-git/htsjdk/htsjdk.version.property.xml
#htsjdk version
#Mon, 08 Sep 2014 10:53:51 +0200

htsjdk-version=1.119

It breaks ant when reading the XML to get the htsjdk.version (lindenb/jvarkit#11).

Maybe it should be named 'htsjdk.version.properties'?

What's with switching to "contig" everywhere? It's not even correct/accurate?

I just noticed that there are changes to many core classes (SAMRecord, VariantContext, Interval) to implement a new Locatable interface. The main result seems to be deprecation of the getChr(), getChrom(), getChromosome(), getReference(), etc. methods in preference to a new getContig().

I get the desire to standardize, but it's confusing to long-time users of the library to suddenly change and deprecate all the existing method names. But more to the point, "contig" is not really even an accurate term for most uses. For human, and most larger reference genomes, the chromosomes are scaffolds of multiple contigs (contiguous sequences) assembled with Ns in the middle. How about using something more generic like getReferenceSequence()?

Use commons-compress instead of apache-ant-1.8.2-bzip2.jar (library provenance and support more compression types)?

htsjdk.samtools.util.IOUtil depends on CBZip2{Input,Output}Stream in order to implement openBzipFileForReading and openBzipFileForWriting and the calls in openFileForReading and openFileForWriting where file types are autodetected by extension.

CBZip2{Input,Output}Stream are provided in the source in lib/apache-ant-1.8.2-bzip2.jar, but this is effectively a binary blob. There is no expressed provenance for how this jar was compiled (although presumably the relevant files from apache ant were carved out). Also, since this portion of ant is not distributed by apache as an individual component, it causes problems for the possible move towards maven for compilation and deployment (the pull request for #55 currently declares a dependency on all of apache-ant-1.8.2, which is a bit excessive just for bzip2 support).

Possible solutions would be to:

  1. Include the source to CBZip2InputStream and CBZip2OutputStream in htsjdk, probably in the htsjdk.samtools.util.bzip2 package. This should be fine from a license standpoint (ant is under Apache License, version 2.0)
  2. Depend upon commons-compress instead, which would require some small rewriting of the relevant portions of IOUtil. On the up side, this would allow for support for several more compression types (in addition to bzip2, also pack200, .Z, xz and lzma). A slight downside is that this dependency is about 357k (compared to the 30k or so that the apache-ant-1.8.2-bzip2.jar is).

I would be happy to submit a pull request for either (it would probably take less time than it did to type up this message), but wanted to know if there is a consensus as to which is the better approach.
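To make option 2 concrete, a rough sketch of what an IOUtil-style helper could look like with commons-compress (the method name mirrors the existing one, but this is not the actual implementation):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public final class Bzip2Example {
    // Sketch of openBzipFileForReading backed by commons-compress instead of the ant classes.
    public static InputStream openBzipFileForReading(final File file) throws IOException {
        return new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(file)));
    }
}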

Request that Feature have getters for start/end that use a consistent coordinate system

Feature is a very useful interface when dealing with different input file types. However, the Feature interface explicitly states: "The coordinate conventions for start and end are implementation dependent, no specific contract is specified here." As a feature request, I would propose that the interface provide some method to obtain the start/end coordinates in a known coordinate scheme, such that every type of feature will give you the same value. This way you know whether you're getting 0- or 1-based numbers. This could be either:

  1. create get1BasedStart()/get1BasedEnd() methods

  2. Create some sort of CoordinateType enum with 1- and 0-based values, and add a getCoordinateType() method.

or there are plenty of other options. It doesn't really matter which, so long as it always returns the same value from any implementation of Feature.

Apologies if there's already a way to get at this problem...
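As a sketch of option 2, something along these lines (all names here are hypothetical; nothing like this exists in tribble today):

import htsjdk.tribble.Feature;

public final class CoordinateExample {
    // Hypothetical enum describing the convention a Feature implementation uses.
    public enum CoordinateType { ZERO_BASED_HALF_OPEN, ONE_BASED_CLOSED }

    // Normalizes a feature's start to 1-based, inclusive, given its native convention.
    public static int oneBasedStart(final Feature feature, final CoordinateType type) {
        return type == CoordinateType.ZERO_BASED_HALF_OPEN
                ? feature.getStart() + 1
                : feature.getStart();
    }
}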

move HTSJDK to java 8

I suspect this suggestion will be controversial.

It would be helpful for a lot of reasons to allow Java 8 code in htsjdk. There are many useful features included in 8 which were not in 6 or 7. See [here](http://www.oracle.com/technetwork/java/javase/8-whats-new-2157071.html) for a complete list.

In my view, the biggest and most desirable changes are the introduction of lambdas, default methods in interfaces, and the stream APIs.
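As an illustration of the kind of code this would enable, a hedged sketch (the BAM path is hypothetical; SamReader is already Iterable, so streams and lambdas compose directly with it under Java 8):

import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import java.io.File;
import java.util.stream.StreamSupport;

public class Java8Sketch {
    public static void main(String[] args) throws Exception {
        try (SamReader reader = SamReaderFactory.makeDefault().open(new File("example.bam"))) {
            // Lambdas + streams: count mapped reads with MAPQ >= 20 in one expression.
            long count = StreamSupport.stream(reader.spliterator(), false)
                    .filter(r -> !r.getReadUnmappedFlag())
                    .filter(r -> r.getMappingQuality() >= 20)
                    .count();
            System.out.println("High-quality mapped reads: " + count);
        }
    }
}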

Licence

Hi, what about the MIT licence in htsjdk?
If I use some classes to develop a new tool, must I include the licence text in all my classes?
Or am I not allowed to use these packages in my new tool?

CRAM code base coding style

Update CRAM code base to our code standards. This mostly involves code style, variable names and standard conventions.

CRAM to support interval querying

Before we release, we would like to support interval range querying of CRAM files. @vadimzalunin could this be supported?

CloseableIterator<SAMRecord> query(final QueryInterval[] intervals, final boolean contained)

Bug when parsing spliced BAM record with very large intron

I've had a bug report against our software which appears to be caused by an underlying bug in the htsjdk parser for BAM files. The read which fails to parse is a spliced read with a huge intron in it (226 Mbp). When parsing the cigar string, htsjdk corrupts this value (it looks like it overflows, so I get a negative value), but if I use samtools to convert the BAM file to SAM, then htsjdk reads it just fine.

I've put the example files I used (only 1 read in them) up at:

http://ftp2.babraham.ac.uk/ftpusr68/

Some code which illustrates the problem is below. For the sam file I get:
Cigar is 109M226643263N25M17S Start=10002 End=226653398

..but for the equivalent BAM file I get:
Ignoring SAM validation error: ERROR: Record 1, Read name GR_10257_idx_5/2, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
Cigar is 109M-41792193N25M17S Start=10002 End=-41782058

This is using htsjdk-1.129.jar

import java.io.File;

import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.ValidationStringency;

public class PicardCigarBug {

    public static void main(String[] args) {

        File smallBugBam = new File("C:/Users/andrewss/Desktop/Bug/small.bam");
        // File smallBugBam = new File("C:/Users/andrewss/Desktop/Bug/small.sam");

        SamReaderFactory samFactory = SamReaderFactory.makeDefault().validationStringency(ValidationStringency.LENIENT);
        SamReader inputSam = samFactory.open(smallBugBam);

        for (SAMRecord samRecord : inputSam) {
            System.err.println("Cigar is " + samRecord.getCigarString() + " Start=" + samRecord.getAlignmentStart() + " End=" + samRecord.getAlignmentEnd());
        }
    }
}

Move Picard's CompareSAMs functionality into HTSJDK

The ability to compare two *AM files seems generally useful, especially for unit tests.

There is currently a Picard CLP named CompareSAMs that tests the equality of two such files, and collects more detailed statistics as well. Many unit tests in Picard instantiate this class in a non-CLP context, making dangerous assumptions about mutated fields and such.

It could easily be refactored into a standalone class that takes two SamReaders and returns an object with various collected metrics, with the Picard tool being a thin wrapper.

Alternatively, if such functionality already exists in HTSJDK, the Picard tool should be using it...

(see also the SamFileValidator / ValidateSamFile duality)

Refactor Filters to be functional interface

Currently filters implement 2 different functions, a filter on a single SAMRecord and a filter on 2 mated SAMRecords.

This should be reduced to just a single method that takes one SAMRecord. Classes that need a mate-pair filter can build their own, or filters that are needed for mated pairs could implement their own MateFilter interface. In almost all cases the result on mates is just && or || of the primary filter, and it's easy to compose these. It would also eliminate many filters throwing UnimplementedOperation exceptions.

This would make #177 easier since filters would then fit into the standard notion of what a filter or predicate is.
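A sketch of the single-method shape being proposed, expressed as a Java 8 Predicate (the interface name here is hypothetical, not the current SamRecordFilter API):

import htsjdk.samtools.SAMRecord;
import java.util.function.Predicate;

public final class FunctionalFilterExample {
    // Hypothetical functional filter: one abstract method over a single SAMRecord.
    public interface RecordFilter extends Predicate<SAMRecord> {}

    public static void main(String[] args) {
        RecordFilter mapped = r -> !r.getReadUnmappedFlag();
        RecordFilter highQuality = r -> r.getMappingQuality() >= 30;
        // Mate-pair behaviour is composed by the caller (&& here) rather than
        // being a second method on every filter.
        Predicate<SAMRecord> both = mapped.and(highQuality);
        System.out.println("Composed filter ready: " + (both != null));
    }
}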

Merging out of order sequence headers results in error

Recently, I got an error

Cannot merge sequence dictionaries because sequence chrM and chrY are in different orders in two input sequence dictionaries.

Can someone explain this error further?

Does this need to be an error, or is this case just not implemented yet? Is there a downstream problem if sequence headers that are in different orders are still added?

CIO.isCRAM uses DIS.readFully instead of SFR.readBytes and throws EOFE

CramIO.isCRAM() should not use DataInputStream.readFully() on unknown input streams. SamFileReader.isBAMFile() instead uses SamFileReader.readBytes(), and hence does not throw an exception when the stream is not as expected.

Test case:

Assert.assertFalse(CramIO.isCRAM(new ByteArrayInputStream(new byte[0])));

Also, like SamFileReader.isBAMFile() (via isValidFile()), isCRAM() should probably check markSupported() before using mark().
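A sketch of the defensive probing being suggested (this is not the actual CramIO code; the helper is hypothetical):

import java.io.IOException;
import java.io.InputStream;

public final class MagicProbeExample {
    // Reads up to magic.length bytes without consuming them and returns false
    // (instead of throwing EOFException) when the stream is shorter than the magic.
    public static boolean startsWith(final InputStream in, final byte[] magic) throws IOException {
        if (!in.markSupported()) return false;
        in.mark(magic.length);
        try {
            for (final byte expected : magic) {
                final int b = in.read();
                if (b == -1 || (byte) b != expected) return false;
            }
            return true;
        } finally {
            in.reset();
        }
    }
}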

Specify Index-File in VCFFileReader

It would be nice to be able to specify the index file in the VCFFileReader.
The underlying FeatureReader allows the index resource to be specified, but there is no way to set it through the VCFFileReader (e.g. via an additional constructor). At the moment the VCFFileReader only works if the index file is located in the same folder as the VCF file.

IndexFactory.createTabixIndex() is not able to handle block-compressed files

Currently, if you pass a block-compressed file into IndexFactory.createTabixIndex(), the file never gets decompressed internally, causing the underlying codec to blow up with errors like the following:

Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
    at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:115)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:88)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:41)
    at htsjdk.tribble.index.IndexFactory$FeatureIterator.readHeader(IndexFactory.java:413)
    at htsjdk.tribble.index.IndexFactory$FeatureIterator.<init>(IndexFactory.java:401)
    at htsjdk.tribble.index.IndexFactory.createTabixIndex(IndexFactory.java:327)

I see two possible solutions, and wanted to solicit feedback on which would be preferable:

  1. Create an overload of createTabixIndex() that accepts a stream rather than a File (this places the burden of setting up the stream correctly on the client)
  2. Have the existing FeatureIterator.initStream() check for block-compressed input via AbstractFeatureReader.hasBlockCompressedExtension(), and wrap the stream returned within a BlockCompressedInputStream as needed.

Thoughts?
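For option 2, the wrapping step might look roughly like this (a hedged sketch; the helper name and the extension test are illustrative rather than the actual IndexFactory code):

import htsjdk.samtools.util.BlockCompressedInputStream;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public final class TabixInputExample {
    // Wraps block-compressed (.gz/.bgz/.bgzf) input so downstream codecs see plain text.
    public static InputStream openPossiblyBlockCompressed(final File file) throws IOException {
        final InputStream raw = new BufferedInputStream(new FileInputStream(file));
        final String name = file.getName();
        if (name.endsWith(".gz") || name.endsWith(".bgz") || name.endsWith(".bgzf")) {
            return new BlockCompressedInputStream(raw);
        }
        return raw;
    }
}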

Potential Bug: Missing reads when querying using Reader API

Hi developers,

I need your help again! I got an issue when I try to query bam files using SamReader/SAMFileReader.
Basically, there are two reads in the BAM:

ERR019497.24419495      73      1       11966   0       108M    =       11966   0       TTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGGTT    =@BDCFFGEEGGFFBGGDFHDEEDFBIHHBF3GIDDHF;IAHHIHFAGGID9CHDEEFIHJHGJIHJ?1CCG5?EFFB.AAEHEEFGFBFGFIIBHGGFD35==C%:9    X0:i:3  X1:i:4  MD:Z:105T2      RG:Z:ERR019497  AM:i:0  NM:i:1  SM:i:0  BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@D@BB       XT:A:R
ERR019497.24419495      133     1       11966   0       104M4S  =       11966   0       ACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGAGGGCCTGCCTGATCTTCCACCCCCTCTCCCAGGGCCAAACCTAAACCCCCCTACCCCCCCCCCCCCGCCGGCTG    9@@EAC8C6DF?EEFDEFFFCEEFFDFGFE?:&8),4.51.+59*88'/9//62-$%,.-;%+420#&.2C4B'%2-,'&/$%%$$'%)%.4(*'0/5+*/25#%-%&    XC:i:104        RG:Z:ERR019497

Note that they both have the same QNAME (ERR019497.24419495), RNAME (1) and POS (11966).

When I use the reader query API as follows against this BAM file:

SAMRecordIterator itr = reader.queryOverlapping(new QueryInterval[] { query });

where

QueryInterval query = new QueryInterval(reader.getFileHeader().getSequenceIndex("1"), 11966, 11966);

I got 2 reads back which is perfect!

However when I use:

QueryInterval query = new QueryInterval(reader.getFileHeader().getSequenceIndex("1"), 11967, 11967);

Only one read is back:

ERR019497.24419495      73      1       11966   0       108M    =       11966   0       TTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGGTT    =@BDCFFGEEGGFFBGGDFHDEEDFBIHHBF3GIDDHF;IAHHIHFAGGID9CHDEEFIHJHGJIHJ?1CCG5?EFFB.AAEHEEFGFBFGFIIBHGGFD35==C%:9    X0:i:3  X1:i:4  MD:Z:105T2      RG:Z:ERR019497  AM:i:0  NM:i:1  SM:i:0  BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@D@BB       XT:A:R

Note that both reads are about 100 bases long, so I would expect to get 2 reads back from the second query as well, but that didn't happen.

Is it a bug in the library? If it is not, can you provide me a rationale?

You can download the BAM files to reproduce the issue at http://s000.tinyupload.com/index.php?file_id=00308273047953999316

Thank you!

Jerry
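For reference, a self-contained version of the failing query (the BAM path is hypothetical; the coordinates are the ones from the report):

import htsjdk.samtools.QueryInterval;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SAMRecordIterator;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import java.io.File;

public class QueryOverlapExample {
    public static void main(String[] args) throws Exception {
        try (SamReader reader = SamReaderFactory.makeDefault().open(new File("example.bam"))) {
            final int refIndex = reader.getFileHeader().getSequenceIndex("1");
            final QueryInterval[] query = { new QueryInterval(refIndex, 11967, 11967) };
            // Both reads start at 11966 and are ~100 bp long, so both should overlap 11967.
            try (SAMRecordIterator it = reader.queryOverlapping(query)) {
                while (it.hasNext()) {
                    final SAMRecord rec = it.next();
                    System.out.println(rec.getReadName() + " " + rec.getAlignmentStart());
                }
            }
        }
    }
}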

Maven central?

Hi,

Are you planning on distributing htsjdk on maven central? I can only see 1.118 there, from user "seqdoop" which I expect isn't you guys.

(The scenario is that I'm using the GATK framework to develop a custom walker but need to update htsjdk from the supplied version 1.120.)

thanks

Daniel

aux.bam and aux.sam are not valid file names on Windows

Hello,

I have previously been doing some work on Picard and htsjdk, but now when I try to update to the latest master branch in htsjdk there is a problem with the aux.bam and aux.sam files in the testdata/htsjdk/samtools directory. Apparently files named 'aux' are forbidden on Windows due to restrictions dating back to the MS-DOS days (https://social.technet.microsoft.com/Forums/windows/en-US/e22c021d-d188-4ff2-a4dd-b5d58d979c58/the-specified-device-name-is-invalid).

Would it be possible for you to change the file names of those two files?

Integer overflow problem in tribble.util.popgen.HardyWeinbergCalculation

Users like Monkol are finding the HWP annotation to be very valuable, but unfortunately we stopped emitting it in GenotypeGVCFs because there is an integer overflow problem in the code with 91K samples. Hopefully this stack trace from Tim will be enough to reproduce the issue.

Hi All,

First I thought you’d all like to know that the 91k got kicked off on Saturday. However, all of the GenotypeGVCF jobs are failing with the following message:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -4408
at htsjdk.tribble.util.popgen.HardyWeinbergCalculation.hwCalculate(HardyWeinbergCalculation.java:94)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.computeHardyWeinbergPvalue(GATKVariantContextUtils.java:1878)
at org.broadinstitute.gatk.tools.walkers.annotator.GenotypeSummaries.annotate(GenotypeSummaries.java:94)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:177)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:227)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:186)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:110)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-163-g4284d7a):

Ryan: would you be able to take a look at this ASAP today? The new joint calling is blocked on this. A few things you might need:

logs: /seq/dax/macarthur_joint_calling/v2/logs/genotypeGvcfs.scattered.macarthur_joint_calling.*.log
GATK: /seq/tng/tfenne/gatkbin/GenomeAnalysisTK-v3.1-163-g4284d7a.jar (built from gsa-unstable on Friday afternoon)
Example command:

java -Djava.io.tmpdir=/seq/picardtemp3/macarthur_joint_calling
-Djava.io.tmpdir=/local/scratch/macarthur_joint_calling
-Djava.io.tmpdir=/seq/picardtemp3/macarthur_joint_calling
-Djava.io.tmpdir=/seq/picardtemp4/macarthur_joint_calling
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx16000m
-jar /seq/tng/tfenne/gatkbin/GenomeAnalysisTK-v3.1-163-g4284d7a.jar
-T GenotypeGVCFs
--disable_auto_index_creation_and_locking_when_reading_rods
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta
-o /seq/dax/macarthur_joint_calling/v2/scattered/temp_0102_of_1000/genotypes.unfiltered.vcf.gz
-D /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dbsnp.vcf
-L /seq/dax/macarthur_joint_calling/v2/scattered/temp_0102_of_1000/scattered.intervals
-V /seq/dax/macarthur_joint_calling/v2/combined_gvcfs/all_combined_gvcfs.list

htsjdk.samtools.fastq.FastqReader doesn't accept empty records

I used htsjdk.samtools.fastq.FastqReader to read fastq files processed with cutadapt: some reads were empty, and the FastqReader throws an exception from FastqReader.checkLine.

Is there a way to fix this? The field 'skipBlankLines' makes it difficult to fix.

Another idea: in FastqReader, would it be advantageous to use htsjdk.samtools.util.LineReader instead of a java.io.BufferedReader?
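For context, this is roughly how the reader is being used (a hedged sketch; the file name is hypothetical, and a record with a zero-length sequence currently triggers the checkLine exception):

import htsjdk.samtools.fastq.FastqReader;
import htsjdk.samtools.fastq.FastqRecord;
import java.io.File;

public class FastqExample {
    public static void main(String[] args) {
        // Hypothetical cutadapt output containing some empty reads.
        try (FastqReader reader = new FastqReader(new File("trimmed.fastq"))) {
            for (FastqRecord record : reader) {
                System.out.println(record.getReadHeader() + " length=" + record.getReadString().length());
            }
        }
    }
}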

CRAM code base test cases.

Test cases need to be converted to testng and additional coverage added. A fair amount of the tests are currently run by main()s. These tests should be converted to a full testng test suite such that our automated builds can run them. This is also a good time to look at code coverage. Once the tests are converted to testng we can run some code coverage plugins to get a good sense of how well this is currently covered.

add getBackingResourceName() to SAMReader

There is currently no way to get information about the datasource backing the SAMReader. This makes it difficult to return useful error messages when there is any sort of exception.

It should return a String which describes the resource well enough that a human could identify which file, URL, database, etc. was the problem.

CRAM doesn't set Slice.refMD5

There is a Slice.setRefMD5() method, but it appears that it's not wired in. By the time SliceIO writes Slice.refMD5, it is still null.

Creating the cram "works", but crashes in CRAMTools 2.1 unless passed the hidden --resilient argument.

$ java \
  -jar ~/src/picard/dist/picard.jar SamFormatConverter \
  R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
  I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
  O= test.cram
[Wed Jan 14 15:58:07 ART 2015] picard.sam.SamFormatConverter INPUT=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.cram OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jan 14 15:58:07 ART 2015] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_67-b01; Picard version: 1.128(c8e12338d226532b30e9ecdbf33180a073c3ffc7_1421081159) JdkDeflater
[Wed Jan 14 15:58:09 ART 2015] picard.sam.SamFormatConverter done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=163577856
$ java \
  -jar cramtools-2.1.jar \
  bam \
  -R ~/src/cramtools/src/test/resources/data/set1/small.fa \
  -I test.cram
ERROR   2015-01-14 15:55:38 Slice   Reference MD5 mismatch for slice 0:2-22180, CCTAAGCCTA...TCGGTACTCC
ERROR   2015-01-14 15:55:38 Cram2Bam    Reference sequence MD5 mismatch for slice: seq id 0, start 2, span 22179, expected MD5 00000000000000000000000000000000
$

write tests for VariantContext validation

htsjdk 1.129: none of the validation methods on VariantContext is tested:

  • extraStrictValidation
  • validateReferenceBases
  • validateRSIDs
  • validateAlternateAlleles
  • validateChromosomeCounts

Support VN:1.5

I've used biobambam to merge some files, and it outputs
@HD VN:1.5 SO:coordinate
using https://github.com/gt1/libmaus/blob/master/src/libmaus/bambam/BamCatHeader.hpp#L116

According to the spec ( http://samtools.github.io/hts-specs/SAMv1.pdf ), 1.5 is the current version, but picard ValidateSamFile gives an error, which the diff below eliminates.

diff --git a/src/java/htsjdk/samtools/SAMFileHeader.java b/src/java/htsjdk/samtools/SAMFileHeader.java
index 2e237c4..ac361b1 100644
--- a/src/java/htsjdk/samtools/SAMFileHeader.java
+++ b/src/java/htsjdk/samtools/SAMFileHeader.java
@@ -46,9 +46,9 @@ public class SAMFileHeader extends AbstractSAMHeaderRecord
     public static final String VERSION_TAG = "VN";
     public static final String SORT_ORDER_TAG = "SO";
     public static final String GROUP_ORDER_TAG = "GO";
-    public static final String CURRENT_VERSION = "1.4";
+    public static final String CURRENT_VERSION = "1.5";
     public static final Set<String> ACCEPTABLE_VERSIONS =
-            new HashSet<String>(Arrays.asList("1.0", "1.3", "1.4"));
+            new HashSet<String>(Arrays.asList("1.0", "1.3", "1.4", "1.5"));

     /**
      * These tags are of known type, so don't need a type field in the text representation.

Publish "unstable" build on each push to master?

(related to #115)

There's been some interest in having an "unstable" jar published to Maven on each push to master, in addition to the "stable" release every ~2 weeks.

This would be useful for larger projects that depend on HTSJDK where a few members of the project want to contribute features to HTSJDK, without everyone having to worry about syncing their local versions of HTSJDK before building the other project.

I'm not sure of the best way to do this. We'd also need to make it very clear that it's for experimental use only. In the short term, a fork might be an easier solution, but let's keep this on the radar.

@lbergelson @akiezun @droazen are interested in this.

Explicit declaration of license is missing

The README doesn't state a license, and there is also no LICENSE file. It seems from at least some of the source code that the code is under the MIT license. I'd recommend making that explicit by putting a LICENSE file in the root folder of the repo, and/or declaring the license under which the code is released explicitly in the README.

Extended API for symbolic alleles

Extending the htsjdk to parse symbolic alt alleles and return meaningful Alleles based on the parsed result would be useful to structural variation users of the API. In particular the following features would be useful:

  • separation of symbolic alleles into named structural variations, named contig insertions (both with sequence such as "A" and without ""), breakends and breakpoints
  • methods to expose the parsed breakpoint contig, position, directions
  • incorporation of all 31 SV headers into VCFStandardHeaderLines (or split out into VCFStructuralVariationHeaderLines)
  • methods to return the base call portions of symbolic alleles (ie: extract the ACGT from ACGT[chr1:1[ )

One possible design is to subclass Allele:
abstract Allele
Allele <- BaseCallAllele
Allele <- SymbolicAllele
Allele <- NoCallAllele
Allele <- MissingAllele
SymbolicAllele <- StructuralVariationAllele (with enum, subclass hierarchy or getters for SV type and subtypes (subsubtypes?) such as DUP:TANDEM)
SymbolicAllele <- BreakendAllele
SymbolicAllele <- BreakpointAllele
SymbolicAllele <- NamedContigAllele

The low-level design and implementation depend on the resolution of ambiguities in the VCF specifications, as there are multiple points at which the VCF specifications are under-defined and ambiguous. For example, it is unclear as to whether "AA", "", "AA" are valid alleles.

Clean up Iterators / Iterables

  1. See the comment on SecondaryOrSupplementarySkippingIterator - it combines a peekable iterator with an implicit filter, but it would be cleaner if these were implemented separately and then composed.
  2. Should we get rid of, or rename, NotPrimarySkippingIterator (currently unused)? The problem is that it only filters out secondary (but not supplementary) alignments, but someone new to the code might well assume that "not primary" means neither secondary nor supplementary.
  3. Where possible, we should make Iterators also implement Iterable so that for-each syntax can be used.
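On point 3, a small sketch of the pattern (a hypothetical wrapper, single-use like the existing htsjdk record iterators):

import htsjdk.samtools.SAMRecord;
import java.util.Iterator;

public class IterableRecordIterator implements Iterator<SAMRecord>, Iterable<SAMRecord> {
    private final Iterator<SAMRecord> underlying;

    public IterableRecordIterator(final Iterator<SAMRecord> underlying) {
        this.underlying = underlying;
    }

    @Override public boolean hasNext() { return underlying.hasNext(); }
    @Override public SAMRecord next() { return underlying.next(); }

    // Returning 'this' lets callers use for-each directly, at the cost of being single-use.
    @Override public Iterator<SAMRecord> iterator() { return this; }
}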

Header lines with same key are lost when parsing VCF.

Consider a VCF file with the following two lines in the header:

##GATKCommandLine=<ID=ExtractConsensusSites, ....
##GATKCommandLine=<ID=FilterLiftedVariants,....

When this is parsed by the VCFHeader class, both lines are fed into the meta data object as follows: (VCFHeader.java: 279)

mOtherMetaData.put(line.getKey(), line);

And the second entry overwrites the first, so the first line is lost from the meta-data.

Is there supposed to be a restriction on the number of elements with the same key allowed in the VCF format?
