samtools / htslib Goto Github PK

C library for high-throughput sequencing data formats

License: Other

C 91.12% Makefile 1.89% Perl 2.76% Shell 0.77% M4 1.41% Roff 1.35% Scilab 0.62% C++ 0.07%

htslib bioinformatics sam bam cram vcf bcf ngs

htslib's Introduction

HTSlib is an implementation of a unified C library for accessing common file formats, such as SAM, CRAM and VCF, used for high-throughput sequencing data, and is the core library used by samtools and bcftools. HTSlib only depends on zlib. It is known to be compatible with gcc, g++ and clang.

HTSlib implements a generalized BAM index, with file extension .csi (coordinate-sorted index). The HTSlib file reader first looks for the new index and then for the old if the new index is absent.

This project also includes the popular tabix indexer, which creates both .tbi and .csi formats, and the bgzip compression utility.

Building HTSlib

See INSTALL for complete details. Release tarballs contain generated files that have not been committed to this repository, so building the code from a Git repository requires extra steps:

autoreconf -i  # Build the configure script and install files it uses
./configure    # Optional but recommended, for choosing extra functionality
make
make install

Citing

Please cite this paper when using HTSlib for your publications.

HTSlib: C library for reading/writing high-throughput sequencing data
James K Bonfield, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, Robert M Davies
GigaScience, Volume 10, Issue 2, February 2021, giab007, https://doi.org/10.1093/gigascience/giab007

@article{10.1093/gigascience/giab007,
    author = {Bonfield, James K and Marshall, John and Danecek, Petr and Li, Heng and Ohan, Valeriu and Whitwham, Andrew and Keane, Thomas and Davies, Robert M},
    title = "{HTSlib: C library for reading/writing high-throughput sequencing data}",
    journal = {GigaScience},
    volume = {10},
    number = {2},
    year = {2021},
    month = {02},
    abstract = "{Since the original publication of the VCF and SAM formats, an explosion of software tools have been created to process these data files. To facilitate this a library was produced out of the original SAMtools implementation, with a focus on performance and robustness. The file formats themselves have become international standards under the jurisdiction of the Global Alliance for Genomics and Health.We present a software library for providing programmatic access to sequencing alignment and variant formats. It was born out of the widely used SAMtools and BCFtools applications. Considerable improvements have been made to the original code plus many new features including newer access protocols, the addition of the CRAM file format, better indexing and iterators, and better use of threading.Since the original Samtools release, performance has been considerably improved, with a BAM read-write loop running 5 times faster and BAM to SAM conversion 13 times faster (both using 16 threads, compared to Samtools 0.1.19). Widespread adoption has seen HTSlib downloaded \\&gt;1 million times from GitHub and conda. The C library has been used directly by an estimated 900 GitHub projects and has been incorporated into Perl, Python, Rust, and R, significantly expanding the number of uses via other languages. HTSlib is open source and is freely available from htslib.org under MIT/BSD license.}",
    issn = {2047-217X},
    doi = {10.1093/gigascience/giab007},
    url = {https://doi.org/10.1093/gigascience/giab007},
    note = {giab007},
    eprint = {https://academic.oup.com/gigascience/article-pdf/10/2/giab007/36332285/giab007.pdf},
}

htslib's People

Contributors

Stargazers

Watchers

Forkers

arq5x pd3 mp15 bpow kmsquire podpearson alfyboza jkbonfield avilella thomasleyer kdm9 mcshane mnagasaki dkj debian goldenhelix pchines dreampursue theoryno3 mlin dnanexus-archive ryansisk intronic daviesrob sjackman broadinstitute spalte tonynaggs gdroc lomereiter monkollek samstudio8 pkrusche lh3 winni2k ytchen0323 azlicic mtmorgan idiv-biodiversity ashwinkalbhor hjanime adeshpande jlec nh13 litswu theboocock kjsanger reinders allenday gkno smowton cambridgehackers kontakter shesketh mdshw5 rafalcode ekg aertslab wtsi-hgi bosz wewela legaultmarc ctwu1 ionzhangj8 bradleycolquitt cnh johanneskoester ryan10x weese humanlongevity xtmgah kirkmcclure modelorganism gabrielnicolasavellaneda nsubtil rtvt123 donfreed pb-cdunn nordhuang holtgrewe gchensi dannon cyenyxe zhmz90 noporpoise mbekritsky angelven ctsa gatoravi ousamg jianlong-peng adamspargo sg-james rrahn hannespetur pontikos vsuaste dlaehnemann annaisystems dpryan79

htslib's Issues

_vcf_parse_format() crashes when trailing genotype fields are omitted

See test/omitgenotypes.vcf in the bcftools repository. This is distilled from a test case from @aylwyn in which

bcftools view test/omitgenotypes.vcf

segfaults on the 1110696 site. This record has a four-part genotype format ("GT:GQ:DP:HQ") like the other records, but omits the trailing missing fields. @sm15 suspected that the similar 1230237 site (with "." sample fields, for completely missing) might succeed, but _vcf_parse_format() also crashes on that record.

Omitting these trailing fields is valid; the penultimate paragraph of VCFv4.2 §1.4.2 contains

Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field).

vcfmerge doesn't take "String" type FORMAT values

When trying to merge VCF files with "String" type of data on the (per sample) format fields, htscmd outputs an error: "Unexpected case: 7, FT" (For FORMAT tag FT).

Fix might be an easy addition of a line vcfmerge.c:821 starting with
"case BCF_BT_CHAR: BRANCH("

Rest of the line and knowing whether it works, are non-trivial.

cram_write_SAM_hdr generates incorrect M5 tags

cram_write_SAM_hdr tries generates the @sq M5 if it isn't present. At the moment it gets this wrong if the reference has not been loaded yet, and puts in the MD5 of an empty file.

To reproduce, supply a SAM file with @sq headers that include UR: but not M5:. Convert it to CRAM using samtools view -C, and look at the header it makes (can be done with less). It will include:

@SQ   .... M5:d41d8cd98f00b204e9800998ecf8427e

I'll send a pull request for this that includes the fix that I've already put into io_lib.

bcf_dup function

An empty BGZF block elsewhere in the file stream causes premature EOF.

Adding ^_\213^H^D^@^@^@^@^@377^F^@bc^B^@^[^@^C^@^@^@^@^@^@^@^@^@ elsewhere in a BAM file (other than the end) causes samtools view to end prematurely. This will either produce a decoding error, or if it happens to coincide with a genuine end of sequence record then it'll silently just truncate the output. This is visible if we add it as the first block after the SAM header.

Picard continues past this and ignores it as an end of file identifier unless it actually occurs at the end. The proposed amendment to the specification states "should by convention finish with a specific empty BGZF block". Note it does not say anything about the behaviour of empty blocks anywhere else, so we should not be interpreting them as a special indicator for EOF.

hts_readlines() should check that the lines are textual

Some uses of samtools's read_file_list() are being recoded to use hts_readlines() instead.

This is mostly equivalent functionality, but read_file_list() checks that the lines are extant filenames and stops reading and fails if not. This is in response to samtools/samtools#21 (fixed by samtools/samtools@4b0058f), in which read_file_list() previously appeared to read forever when accidentally presented with a BAM file instead of a file list.

At the least, hts_readlines() needs to validate that it is indeed getting lines composed primarily of sensible printable textual characters. Perhaps also a mode (or an associated hts_readfilenames() function) that also checks that it is getting the names of extant files.

checking of existence of filter flags

Can you add a mechanism for checking existence of filter flags?

e.g. bcf_exists_filter_flag("PASS");

or is there already a mechanism in place?

Expose bcf_fmt_sized_array in vcf.h

Hi Heng,

I am working on a C++ interface to BCF2 that exposes bcf1_t structs through a C++ class. In order to do so, I need to mimic much of the logic currently in vcf_format1(). This function uses bcf_fmt_sized_array (and possibly other functions) that are not declared in vcf.h, which prevents mimicking the unwinding of the BCF record in my class.

Would you be willing to declare a prototype for this function in vcf.h?

Interoperability with other BCF implementations

We need to ensure that BCF files produced by Picard and other tools can be read by htslib/bcftools/etc, and also vice versa.

Writing BCF for consumption by Picard-based tools

First generate some test files based on bcftools/test/ex2.vcf:

bcftools view -ob -n ex2-bcftools.bcf test/ex2.vcf
bcftools view -ou -n ex2-bcftools.u.bcf test/ex2.vcf

picard VcfFormatConverter REQUIRE_INDEX=false I=ex2-bcftools.u.bcf O=fred.vcf

This produces fred.vcf with equivalent contents to the original test/ex2.vcf (provided the input file has no instances of BCFv2.2-specific sentinels), as expected.

picard VcfFormatConverter REQUIRE_INDEX=false I=ex2-bcftools.bcf O=fred.vcf

This currently produces

Exception in thread "main" org.broad.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:, for input source: /Users/jm18/bcftools/ex2-bcftools.bcf

as it seems that Picard currently only accepts what we call uncompressed-BCF.

Reading BCF files produced by Picard-based tools

This produces what we call an uncompressed BCF file, which should be autodetected by hts_open() and bcftools:

picard VcfFormatConverter REQUIRE_INDEX=false INPUT=test/ex2.vcf OUTPUT=ex2-picard.bcf

However bcftools refuses to read it:

bcftools view ex2-picard.bcf

[vcf.c:529 bcf_hdr_read] invalid BCF2 magic string: only BCFv2.2 is supported.
Fail to read VCF/BCF header: ex2-picard.bcf

The expected result is that we read ex2-picard.bcf successfully.

[bcf_get_genotypes] mixed ploidy

In the case of mixed ploidy, say on the X chromosome. How do we differentiate an individual with ./. genotype compared to an individual which has .

Am I correct to say that for the single ploidy individual, it will be represented by -1 and a number that is much lower than -1? Is there a specified value in this case?

This is with respect to bcf_get_genotypes.

[bcf_get_info_*]

Right now there is the API for INT and FLOAT but it is missing for STR.

vcf.h lines 458-460

Is there another mechanism for picking up string types?

writing out of VCF records with no individual data even though the header says otherwise

Should this return an error or at least a warning?

Segmentation Fault when using vcf-merge

When using vcf-merge the program segfaults after it gets through the header. When I use the non-C version of vcf-merge it works fine. There are some duplicate lines in the vcfs I am trying to merge.

htscmd vcfmerge file_a.gz file_b.gz

bam/sam API consistency with bcf/vcf API

Is it possible to make it so?

e.g. rename sam_* to bam_*

int sam_parse1(kstring_t *s, bam_hdr_t *h, bam1_t *b);
int sam_format1(const bam_hdr_t *h, const bam1_t *b, kstring_t *str);
int sam_read1(samFile *fp, bam_hdr_t *h, bam1_t *b);
int sam_write1(samFile *fp, const bam_hdr_t *h, const bam1_t *b);

mising parameters in vcfmerge.c at line 195

line 195 in vcfmerge.c should be
if ( strncmp(a[0],b[0],rla<rlb?rla:rlb) ) error("The REF prefixes differ: %s vs %s (%d,%d)\n", a[0],b[0],rla,rlb );

Before was if ( strncmp(a[0],b[0],rla<rlb?rla:rlb) ) error("The REF prefixes differ: %s vs %s (%d,%d)\n", a[0],b[0] )

[bcf_itr_querys] bcf_name2id not defined

as above.

vcf.h line 516

vcfmerge

Hi,

I'm getting a segfault with vcfmerge (checkout sept 5th). Some of the INFO/FORMAT fields aren't getting picked up from the header. Error and header below.

$ htscmd vcfmerge aaa.capture.tcga.vcf.gz bbb.indel.cap
ture.tcga.vcf.gz > tmp.vcf
[W::vcf_parse1] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'Gene' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'MQ0' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'SOMATIC' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'SS' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'VC' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'VT' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'TID' is not defined in the header, assuming Type=String
[W::vcf_parse1] INFO 'VLSC' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'GT' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'AD' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'FA' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'MQ0' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'BQ' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'SS' is not defined in the header, assuming Type=String
[W::vcf_parse1] FORMAT 'SSC' is not defined in the header, assuming Type=String
Segmentation fault

bcftools tabix segfaults on vcf.gz input that is actually not gzipped

bcftools tabix segfaults where tabix used to give a helpful warning.

Demonstrative test case, using bcftools and htslib tag 0.2.0-rc3:

$ cp test/norm.vcf norm.vcf.gz
$ bcftools tabix norm.vcf.gz
Segmentation fault
$ tabix norm.vcf.gz
[tabix] was bgzip used to compress this file? norm.vcf.gz

I've run this on Linux x86_64.

hts_close calls cram_eof after closing the cram file

hts.c line 200 calls cram_eof after having closed the file at line 198. This leads to an invalid read.

P CIGAR operator causes mpileup to stutter

When given a BAM file using the P operator, mpileup can sometimes skip bases and duplicate others. Additionally there is no way to actually use the P operator properly, to know which bases align with which in an insertion region, but that is a different fault and a fundamental flaw in the output format.

An example:

@sq SN:c1 LN:10
s0a 0 c1 1 0 10M * 0 0 AACCGCGGTT *
s0b 0 c1 1 0 10M * 0 0 AACCGCGGTT *
s0c 0 c1 1 0 10M * 0 0 AACCGCGGTT *
s1 0 c1 1 0 5M6I5M * 0 0 AACCGGTTAACCGGTT *
s2 0 c1 1 0 5M1P4I1P5M * 0 0 AACCGTTAACGGTT *
s3 0 c1 1 0 5M3I3P5M * 0 0 AACCGGTTCGGTT *
s4 0 c1 1 0 5M3P3I5M * 0 0 AACCGAACCGGTT *
s5 0 c1 1 0 4M1D2P2I2P1D4M * 0 0 AACCTAGGTT *
s6 0 c1 1 0 2M3D6I3D2M * 0 0 AAGTTAACTT *

jkb@seq3a[work/samtools...] ./samtools mpileup c1#pad1.bam
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
c1 1 N 9 ^!A^!A^!A^!A^!A^!A^!A^!A^!A ~~~~~~~~~
c1 2 N 9 AAAAAAAAA-3NNN ~~~~~~~~~
c1 3 N 9 CCCCCCCC* ~~~~~~~~~
c1 4 N 9 CCCCCCCC-1N* ~~~~~~~~~
c1 5 N 9 GGGG+6GTTAACG+4TTAAG+3GTTG+3AAC_+2AG_+6TTAACT ~~~~~~~~~
c1 6 N 9 CCCCCCC** ~~~~~~~~~
c1 7 N 9 GGGGGGGG* ~~~~~~~~~
c1 8 N 9 GGGGGGGG* ~~~~~~~~~
c1 9 N 9 TTTTTTTTT ~~~~~~~~~
c1 10 N 9 T$T$T$T$T$T$T$T$T$ ~~~~~~~~~

Note that position 5 should end +2TA+6GTTAAC

Uuencoded SAM file to demonstrate the problem. (It's also c1#pad1.sam)

begin 644 -
M0%-1"5-..F,Q"4Q..C$P"G,P80DP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#
M1T=45`DJ"G,P8@DP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#1T=45`DJ"G,P
M8PDP"6,Q"3$),`DQ,$T)*@DP"3`)04%#0T=#1T=45`DJ"G,Q"3`)8S$),0DP
M"35--DDU30DJ"3`),`E!04-#1T=45$%!0T-'1U14"2H*<S(),`EC,0DQ"3`)
M-4TQ4#1),5`U30DJ"3`),`E!04-#1U1404%#1T=45`DJ"G,S"3`)8S$),0DP
M"35-,TDS4#5-"2H),`DP"4%!0T-'1U140T='5%0)*@IS-`DP"6,Q"3$),`DU
M33-0,TDU30DJ"3`),`E!04-#1T%!0T-'1U14"2H*<S4),`EC,0DQ"3`)-$TQ
M1#)0,DDR4#%$-$T)*@DP"3`)04%#0U1!1T=45`DJ"G,V"3`)8S$),0DP"3)-
<,T0V23-$,DT)*@DP"3`)04%'5%1!04-45`DJ"@``
`
end

missing htscmd binary

after running make I don't get a htscmd binary /htslib-master. Has anyone seen/fixed this problem? (I'm trying to install on an iMac OSX 10.9 Mavericks)

vcfcheck.c:703:39: warning: format specifies type 'long' but the argument has type 'uint64_t' (aka 'unsigned long long') [-Wformat]
printf("\t%ld\t%f\n", stats->dp.vals[i], stats->dp.vals[i]*100./sum);

  %llu 
  1 warning generated. 

.....
.....
.....

Undefined symbols for architecture x86_64:
"_bcf_gt_type", referenced from:
_main_vcfcheck in vcfcheck.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [htscmd] Error 1

Unchecked results from "write"s

Given the propensity of lustre and other network files systems to give up the ghost we probably should be checking our return values more carefully. The current situations means its possible we'll miss a failed write and return success when we should bail out and fail. _razf_write is probably one place to start looking.

bcf_remove_alleles not ready for BCF output

This can be tested with

  bcftools view -asA test/view.vectors.vcf -Ob > rmme.bcf
  bcftools view rmme.bcf

[bcf_get_genotypes & bcf_update_genotypes]

I'm trying to merge the genotype fields of multiple VCF files but there appears to be some inconsistency in how genotypes are read and then written in the API.

int8_t *gt = NULL;
int32_t n = 0;

int k = bcf_get_genotypes(h, v, &gt, &n);
//according to its usage in vcfgtcheck.c, k is suppose to be the ploidy but because
//this is processed as a string type, it returns ploidy+1

int ngt = 0; //keeps track of samples across all vcf files
for (uint32_t i=0; i<no_vcf_files; ++i) {
for (int32_t j=0; j<bcf_hdr_nsamples(h); ++j) {
int8_t * igt = gt+j * 2;
int a = bcf_gt_allele(igt[0]);
int b = bcf_gt_allele(igt[1]);
std::cerr << a << "/" << b << "\n";
combined_gt[ngt_2] = igt[0];
combined_gt[ngt_2+1] = igt[1];
}
++ngt;
}

//this update genotypes but as a INT type. I followed the usage shown in ccall.c
bcf_update_genotypes(output_hdr,nv,combined_gt,ngt*2);

When I print this out later, I get garbage. Is there anything wrong with what I did? The code here is not complete, if it does not help, I can point you to the source later.

I tried just reading one file and writing out using the API on this file, you can find it here.
ftp://share.sph.umich.edu/vt/grch37/mills.chip.158samples.8904indels.genotypes.bcf

Invalid memory access in hts_idx_push when passed an unaligned reading

Running samtools index on the bam version of this file:

@HD     VN:1.4  SO:coordinate
@RG     ID:grp3 DS:Group 3      LB:Library 3    SM:Sample
@SQ     SN:ref2 LN:60   M5:7c35feac7036c1cdef3bee0cc4b21437
ref2_grp3_p002  147     ref2    16      99      15M     =       46      45      CTAAATAGCTTGGCG }}}}}}}}}}}}}}} RG:Z:grp3
ref2_grp3_p002  99      ref2    46      99      15M     =       16      -45     CTGTTTCCTGTGTGA {{{{{{{{{{{{{{{ RG:Z:grp3
unaligned_grp3_p001     77      *       0       0       *       *       0       0       CACTCGTTCATGACG 0123456789abcde RG:Z:grp3
unaligned_grp3_p001     141     *       0       0       *       *       0       0       GAAAGTGAGGAGGTG edcba9876543210 RG:Z:grp3

prints up:

[E::hts_idx_push] chromosome blocks not continuous

This is due to a bug in hts_idx_push at hts.c line 591, where it tries to use tid, which has the value -1 for the unaligned reads, in an array lookup. The comparison at line 580 suggests that this code block should be able to handle tid < 0.

Make command error

Undefined symbols for architecture x86_64:
"_bcf_gt_type", referenced from:
_main_vcfcheck in vcfcheck.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [htscmd] Error 1

I can't compile and instead get this error. Anyone know why?

(Solution: download an older version.)

Cannot build HTSlib due to missing zlib.h

When I install HTSlib, using make, it says "bgzf.h:33:18: fatal error: zlib.h: no such file or directory, and cause compile interruption.In this case, what should I do? And how can I install it correctly?

Please pass LDFLAGS to main compilations.

Hello,

in Debian, we are passing hardening options when compiling, by overriding the CFLAGS, CPPFLAGS and LDFLAGS environment variables.

However, in htslib's Makefile, LDFLAGS are only passed to the compilations made for the test suite. Could you also pass it in the other cases ?

For your information, as of today, the flags that we pass are:

$ dpkg-buildflags --get CFLAGS
-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security

$ dpkg-buildflags --get CPPFLAGS
-D_FORTIFY_SOURCE=2

$ dpkg-buildflags --get LDFLAGS
-Wl,-z,relro

Have a nice day,

Charles Plessy
Debian Med packaging team
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan

pileup outputs incorrect bases for insertions following deletions

I think this is an HTSlib issue rather than samtools issue. It was initially reported in 2010, but has only partially been fixed:

https://sourceforge.net/mailarchive/forum.php?thread_name=424A790A-3AC0-4FC2-8981-5E3F6671C0DD%40sanger.ac.uk&forum_name=samtools-devel

For example, the following SAM file:

@SQ     SN:c2   LN:9
@CO      c2    CC***AA**T**AA***CC
@CO
@CO     +s1    CT***AA**T**AA***TC
@CO     +s1b   CT*******T*******TC
@CO     +s2    CT*****G**G******TC
@CO     +s3    CT*****GG*GG*****TC
@CO     +s3b   CT****CGGCGGC****TC
@CO     +s4    CT***AAG**G*AA***TC
@CO     +s5    CTGGG*********GGGTC
s1      0       c2      1       0       9M      *       0       0       CTAATAATC       XXXXXXXXX
s1b     0       c2      1       0       2M2D1M2D2M      *       0       0       CTTTC   *
s2      0       c2      1       0       2M2D1I1D1I2D2M  *       0       0       CTGGTC  *
s3      0       c2      1       0       2M2D2I1D2I2D2M  *       0       0       CTGGGGTC        *
s3b     0       c2      1       0       2M1D1M2I1M2I1M1D2M      *       0       0       CTCGGCGGCTC     *
s4      0       c2      1       0       4M1I1D1I4M      *       0       0       CTAAGGAATC      *
s5      0       c2      1       0       2M3I5D3I2M      *       0       0       CTGGGGGGTC      *

displays for position 5

c2      5       T       7       ..-2AA*+1T*+2GTC+2GG*+1A*       X~~~~~~

It should be:

c2      5       T       7       ..-2AA*+1G*+2GGC+2GG*+1A*       X~~~~~~

Information on how to use vcf filter

Hi,
I am trying to use the vcf filter to filter some variants I have called using samtools on a set of exomes. I wonder if you have any information about what is vcf filter is actually doing and how beyond the short help text. I am guessing it is using self organized maps but I do not really know how to interpret the ouput files or what sensible parameter values can be.

Thanks in advance,
Inti

Packaging htslib

Hello, this is Charles Plessy from the Debian Med project.

We are already distributing samtools, tabix and vcftools. I have seen that next version of samtools will be built on htslib so I am preparing a package for it.

In large distributions such as Debian, we benefit a lot from dynamic linking since it allows us to better propagate bug fixes and security fixes.

Would it be possible for you to support dynamic linking of the htslib ?

By the way, can you confirm that the license that applies to files without copyright notices in htslib repository is the MIT license ?

Have a nice day,

Charles Plessy

[bcf_clear1] initial value of QUAL

I think it is more appropriate for an empty BCF record to have a QUAL being missing rather than 0.

vcf.c 584 : v->qual = 0; => bcf_float_set_missing(x);

although this generates a warning when compiling: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]

htscmd vcfview segfaults for .vcf.gz file when regions are given

I bgzipped "test/norm.vcf" and indexed it with "tabix -p vcf".

htscmd vcfview -S norm.vcf.gz 2
ignores the region, even though there is an index file.

htscmd vcfview norm.vcf.gz 2
writes the correct header, but then segfaults.

Expected behaviour: Show only variants from chromosome "2".

bam_plp_set_mask defined in sam.h but not sam.c

commit cf12fba removed bam_plp_set_mask from sam.c without removing it from htslib/sam.h

[bcf_update_info] dies when an bcf_info_t object exists but does not have sufficient allocated space to accomodate

Dies when an bcf_info_t object exists but does not have sufficient space to accommodate new info to be updated.

Is it possible accomodate for this, my other solution is to construct a kstring_t of the text VCF record and then parse it which is not very elegant.

issue on vcf.c line 1901.

// Is the INFO tag already present
if ( inf )
{
// Is it big enough to accommodate new block?
if ( str.l <= inf->vptr_len + inf->vptr_off )
{
if ( str.l != inf->vptr_len + inf->vptr_off ) line->d.shared_dirty |= BCF1_DIRTY_INF;
uint8_t ptr = inf->vptr - inf->vptr_off;
memcpy(ptr, str.s, str.l);
free(str.s);
bcf_unpack_info_core1(ptr, inf);
}
else
{
assert( !inf->vptr_free ); // fix the caller or improve here: this has been modified before
bcf_unpack_info_core1((uint8_t)str.s, inf);
inf->vptr_free = 1;
line->d.shared_dirty |= BCF1_DIRTY_INF;
}
}

Add Windows support

Would it be possible to add (64 bit) Windows support to htslib?
I've just emailed Petr Danecek, who says he is unaware of anyone already having done this. If noone is already doing it, we will have a go in my team, but we need only a tiny fraction of htslib functionality (reading BAM, fastq, fastq.gz).

Non-portable inline usage

The way we're using inline in htslib is incompatible with the C99 inline model, it looks like it's the old gnu89_inline way. Not a top priority but we should probably clean this up at some point. In the mean time if we wish to use any C99 stuff we need to add -fgnu89-inline to the Makefile.

bcf_hdr_subset

When bcf_hdr_subset() is called, it calls bcf_hdr_init() which always initializes the VCF format version as 4.2. I think it is more appropriate to initialize the VCF version as that of the header that is being subsetted or at least allow the programmer to assign an appropriate version.

The reason for this is that tools like GATK will keep failing if it encounters non v4.1 VCF formats. So it is a pain to have to go back and fix the headers.

On the other hand, htslib implements v4.2 so it is appropriate to always set v4.2.

I think a flexible solution is to have a bcf_hdr_get_version and bcf_hdr_set_version() API.

reverse complement of S and W not quite right in bam2fq.c

Hi,
I believe I found a tiny bug.
In bam2fq.c there is a table for the reverse complement of all the 16 possible nucleotide codes:
static int8_t seq_comp_table[16] = { 0, 8, 4, 12, 2, 10, 9, 14, 1, 6, 5, 13, 3, 11, 7, 15 };

If I got it right, 6 and 9 denote S=CG and W=AT respectively (or the other way around).
Either way, the reverse complement of S is S and the reverse complement of W is W.
However the table in the code maps S to W and W to S, which is wrong.
Apologies if I'm completely off.
Regards,
~Filipe

Handle INFO field names that contain underscores

I have a VCF file where one of the the INFO fields contains an underscore in the name: GENE_NAME

When I attempt to use GENE_NAME in the --format argument with htscmd vcfquery, the _NAME part of the string is dropped and the program quits with a warning as shown below.

htscmd vcfquery -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/GENE_NAME\n' file.vcf.gz
Error: no such tag defined in the VCF header: INFO/GENE

Instead, it should say:

Error: no such tag defined in the VCF header: INFO/GENE_NAME

Are INFO field names with underscores ok?
If not, I'll have to change the program that produced these INFO fields.
Otherwise, I'd like htscmd vcfquery to handle them properly.

I'll dig around the code a bit and see if I can figure out what to change, but a first glance suggests this will take me a long time. Perhaps this change is a trivial task for one of the developers of this project?

__skip_tag in sam.c doesn't work for floating-point tags

The __skip_tag macro used by bam_aux_get and bam_aux_del converts the tag type to upper case before passing it over to bam_aux_type2size. The latter assumes that the tag type is in the original case, with the result that it returns 0 instead of 4 for type 'f'.

The following patch fixes the problem:

diff --git a/sam.c b/sam.c
index 0004c0e..262d836 100644
--- a/sam.c
+++ b/sam.c
@@ -932,7 +932,7 @@ void bam_aux_append(bam1_t *b, const char tag[2], char type, int len, uint8_t *d
 }

 #define __skip_tag(s) do { \
-               int type = toupper(*(s)); \
+               int type = *(s); \
                ++(s); \
                if (type == 'Z' || type == 'H') { while (*(s)) ++(s); ++(s); } \
                else if (type == 'B') (s) += 5 + bam_aux_type2size(*(s)) * (*(int32_t*)((s)+1)); \

Additionally, __skip_tag should really check for a zero return value from bam_aux_type2size as currently a BAM file with an invalid tag type will cause bam_aux_get and bam_aux_del to get out of synch. leading to undefined behaviour.

[bcf_get_info_flag] Info flag API

Is there an API for checking if a flag exists in the INFO fields?

[bcf_update_format_*] breaks down when updating a new record

When you update an empty bcf record, the variable n_sample in the record is not
updated resulting an error in line 1952 in vcf.c
1952: int nps = n / line->n_sample; // number of values per sample

e.g.

bcf_hdr_t h = bcf_hdr_init();
bcf_hdr_append(h, "##FORMAT=<ID=E,Number=1,Type=Integer,Description="Number of reads">")
bcf1_t v* = bcf_init1();
int n = 1;

bcf_update_format_int32(hdr, v, "E", &n, 1);

I think a good solution is to make sure that the update method get the appropriate number of samples from the hdr and update the record correspondingly before updating the genotype field.

Get tabix regions from BED file disappeared

this used to be an option:

         -B         region1 is a BED file (entire file will be read)

Licensing

Hi!

Just wondering how htslib is licensed as I didn't see a license file in the repo. Is it made available under the MIT license like samtools?

Thanks,

Andrew

Segfault in vcfmerge

After bgzip:ing vcf files in test/ directory, I get a segfault with following command

./htscmd vcfmerge test/check.vcf.gz test/ex2.vcf.gz

I don't know where the problem is but segfault appears because of Invalid Read of Size 8 at merge_buffer (vcfmerge.c:885)

hts_close should not be void

Cram_close can (and does on my mini test files) return -1 due to reference file issues, but hts_close not only blindly ignores the return value it also cannot do anything with it as the function itself has a void return type.

[bcf_get_format_string]

can you also include an API for getting string types for individual data?

bam2fq doesn't generate interleaved FASTQ compatible with "bwa mem -p"?

This may be more of a feature request than a bug, so feel free to assign accordingly.

I am testing "htscmd bam2fq -a" to generates interleaved FASTQ, so I can pipe the output to "bwa mem -p". According to http://bio-bwa.sourceforge.net/bwa.shtml "bwa mem" expects mate-pairs reads in an interleaved FASTQ file (2i), (2i+1), where i is the i-th read.

In other words, if I use:

bam2FastQ --in 111582.test.bam

to create two FASTQ files, I would get the following:

$ grep '^\@F' 111582.test_1.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/1
@FCC0837ACXX:5:1202:10619:55648#/1
$ grep '^\@F' 111582.test_2.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/2
@FCC0837ACXX:5:1202:10619:55648#/2

which would be able to be sent to "bwa mem" since the the i-th read in the first FASTQ file lines up with the i-ith read in the second FASTQ file.

In contrast using the output from
htscmd bam2fq -a 111582.test.bam > 111582.test.interleaved.fastq

on the same original bam I get this:

$ grep '^\@F' 111582.test.interleaved.fastq |head -4
@FCC0837ACXX:5:1307:4114:17474#/2
@FCD0JP0ACXX:6:1101:11361:95577#/1
@FCC0837ACXX:5:1101:8706:79404#/2
@FCD0JP0ACXX:6:1104:14809:174913#/2

and here it appears that the reads aren't in the order that bwa mem (using the "-p" option) would expect.

using the index for remote files

samtools view ftp:://ftp.some.site/some.bam <region>

and

bcftools view ftp:://ftp.some.site/some.vcf.gz <region>

does not currently work as the code for downloading a local copy of the index seems to be commented out by #ifdef _USE_KETFILE in hts.c.

The same commands without the <region> are working fine.