Git Product home page Git Product logo

seqlib's Introduction

Build Status Coverage Status

C++ interface to HTSlib, BWA-MEM and Fermi

License: Apache2

API Documentation

API Documentation

Citation

If you use SeqLib in your applications, please cite: http://bioinformatics.oxfordjournals.org/content/early/2016/12/21/bioinformatics.btw741.full.pdf+html

Note that the values for the SeqAn benchmarking in Table 2 should be corrected to 7.7 Gb memory and 33.92 seconds in CPU time, when compiling SeqAn with -O3 -DNDEBUG. SeqAn also does full string decompression. Wall times for SeqAn may be shorter than CPU time because it uses embedded multi-threading during BAM IO.

Table of contents

Installation

#######

git clone --recursive https://github.com/walaj/SeqLib.git
cd SeqLib
mkdir build
cd build
cmake .. ## if system htslib not found, add to the cmake .. command this: -DHTSLIB_DIR=/path/to/htslib
make

You will need to have JsonCPP (v1.9.4 or higher) installed.

I have successfully compiled with GCC-4.5+ and Clang on Linux and OSX.

SeqLib is compatible with c++11 and later.

Integrating into build system

After building, you will need to add the relevant header directories:

SEQ=<path_to_seqlib_git_repos>
C_INCLUDE_PATH=$C_INCLUDE_PATH:$SEQ:$SEQ/htslib

And need to link the SeqLib static library and Fermi, BWA and HTSlib libraries

SEQ=<path_to_seqlib>
LDFLAGS="$LDFLAGS $SEQ/bin/libseqlib.a $SEQ/bin/libbwa.a $SEQ/bin/libfml.a $SEQ/bin/libhts.a"

To add support for reading BAMs, etc with HTTPS, FTP, S3, Google cloud, etc, you must compile and link with libcurl.

## set hts to build with libcurl links and hfile_libcurl.c
cd SeqLib/htslib
./configure --enable-libcurl 
## compile seqlib with libcurl support
cd ../ # back to SeqLib main directory
./configure LDFLAGS="-lcurl -lcrypto"
make 
make install

Remember then to then link any projects made with SeqLib with the additional -lcurl -lcrypto flags.

Description

SeqLib is a C++ library for querying BAM/SAM/CRAM files, performing BWA-MEM operations in memory, and performing sequence assembly. Core operations in SeqLib are peformed by:

The primary developer for these three projects is Heng Li.

SeqLib also has support for storing and manipulating genomic intervals via GenomicRegion and GenomicRegionCollection. It uses an interval tree (provided by Erik Garrison @ekg) to provide for rapid interval queries.

SeqLib is built to be extendable. See VariantBam for examples of how to take advantage of C++ class extensions to build off of the SeqLib base functionality.

Memory management

SeqLib is built to automatically handle memory management of C code from BWA-MEM and HTSlib by using C++ smart pointers that handle freeing memory automatically. One of the main motivations behind SeqLib is that all access to sequencing reads, BWA, etc should completely avoid malloc and free. In SeqLib all the mallocs/frees are handled automatically in the constructors and destructors.

Other C++ APIs

There are overlaps between this project and BamTools from Derek Barnett, Gamgee from the Broad Institute, and SeqAn from Freie Universitat Berlin. These projects provide excellent and high quality APIs. SeqLib provides further performance enhancement and new capabilites for certain classes of bioinformatics problems.

Some differences:

  • SeqLib has ~2-4x faster read/write speed over BamTools and lower memory footprint.
  • SeqLib has support for CRAM file
  • SeqLib provides in memory access to BWA-MEM, BLAT, chromosome aware interval tree, read correction, and sequence assembly with Fermi.
  • SeqAn provide a substantial amount of additional capabilites not in SeqLib, including graph operations and an expanded suite of multi-sequence alignments.
  • SeqAn embeds multi-threading into some functionality like BAM IO to improve wall times.

For your particular application, our hope is that SeqLib will provide a comprehensive and powerful envrionment to develop bioinformatics tools, or to be used in conjuction with the capablities in SeqAn and BamTools. Feature requests and comments are welcomed.

Examples

Targeted re-alignment of reads to a given region with BWA-MEM
#include "SeqLib/RefGenome.h"
#include "SeqLib/BWAWrapper.h"
using namespace SeqLib;
RefGenome ref;
ref.LoadIndex("hg19.fasta");

// get sequence at given locus
std::string seq = ref.QueryRegion("1", 1000000,1001000);

// Make an in-memory BWA-MEM index of region
BWAWrapper bwa;
UnalignedSequenceVector usv = {{"chr_reg1", seq}};
bwa.ConstructIndex(usv);

// align an example string with BWA-MEM
std::string querySeq = "CAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAG";
BamRecordVector results;
// hardclip=false, secondary score cutoff=0.9, max secondary alignments=10
bwa.AlignSequence(querySeq, "my_seq", results, false, 0.9, 10); 

// print results to stdout
for (auto& i : results)
    std::cout << i << std::endl;
Read a BAM line by line, realign reads with BWA-MEM, write to new BAM
#include "SeqLib/BamReader.h"
#include "SeqLib/BWAWrapper.h"
using namespace SeqLib;

// open the reader BAM/SAM/CRAM
BamReader bw;
bw.Open("test.bam");

// open a new interface to BWA-MEM
BWAWrapper bwa;
bwa.LoadIndex("hg19.fasta");

// open the output BAM
BamWriter writer; // or writer(SeqLib::SAM) or writer(SeqLib::CRAM) 
writer.SetWriteHeader(bwa.HeaderFromIndex());
writer.Open("out.bam");
writer.WriteHeader();

BamRecord r;
bool hardclip = false;
float secondary_cutoff = 0.90; // secondary alignments must have score >= 0.9*top_score
int secondary_cap = 10; // max number of secondary alignments to return
while (GetNextRecord(r)) {
      BamRecordVector results; // alignment results (can have multiple alignments)
      bwa.AlignSequence(r.Sequence(), r.Qname(), results, hardclip, secondary_cutoff, secondary_cap);

      for (auto& i : results)
        writer.WriteRecord(i);
}
Perform sequence assembly with Fermi directly from a BAM

#include "SeqLib/FermiAssembler.h"
using namespace SeqLib;

FermiAssembler f;

// read in data from a BAM
BamReader br;
br.Open("test_data/small.bam");

// retreive sequencing reads (up to 20,000)
BamRecord r;
BamRecordVector brv;
size_t count = 0;
while(br.GetNextRead(r) && count++ < 20000) 
  brv.push_back(r);

// add the reads and error correct them  
f.AddReads(brv);
f.CorrectReads();

// peform the assembly
f.PerformAssembly();

// retrieve the contigs
std::vector<std::string> contigs = f.GetContigs();

// write as a fasta to stdout
for (size_t i = 0; i < contigs.size(); ++i)
    std::cout << ">contig" << i << std::endl << contigs[i] << std::endl;
Plot a collection of gapped alignments
using namespace SeqLib;
BamReader r;
r.Open("test_data/small.bam");

GenomicRegion gr("X:1,002,942-1,003,294", r.Header());
r.SetRegion(gr);

SeqPlot s;
s.SetView(gr);

BamRecord rec;
BamRecordVector brv;
while(r.GetNextRecord(rec))
  if (!rec.CountNBases() && rec.MappedFlag())
    brv.push_back(rec);
s.SetPadding(20);

std::cout << s.PlotAlignmentRecords(brv);

Trimmed output from above (NA12878):

CTATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTGTCCATCCATCCATCCATCCA
CTATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTAT                    CATCCATCCATCCATCCATCCACCCATTCATCCATCCACCTATCCATCTATCAATCCATCCATCCATCCA
 TATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCA              
 TATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTGTCCATCCATCCATCCATC                        
 TATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCA              
  ATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCAT             
   TCTATCTATCTCTTCTTCTGTCCGCTCATGTGTCTGTCCATCTATCTATC                    GTCCATCCATCCATCCATCCATCCATCCACCCATTCATCCATCCACCTATCCATCTATCAATCCATCCATCCATCCATCCGTCTATCTTATGCATCACAGC
   TCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATT                                                                  
    CTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTC                                                                 
    CTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTC                                                                 
      ATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTCAT                                                               
      ATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTCAT                                                               
Read simultaneously from a BAM, CRAM and SAM file and send to stdout
using namespace SeqLib;
#include "SeqLib/BamReader.h"
BamReader r;
  
// read from multiple streams coordinate-sorted order
r.Open("test_data/small.bam");
r.Open("test_data/small.cram");
r.Open("test_data/small.sam");

BamWriter w(SeqLib::SAM); // set uncompressed output
w.Open("-");              // write to stdout
w.SetHeader(r.Header());  // specify the header
w.WriteHeader();          // write out the header

BamRecord rec;
while(r.GetNextRecord(rec)) 
  w.WriteRecord(rec);
w.Close();               // Optional. Will close on destruction
Perform error correction on reads, using BFC
#include "SeqLib/BFC.h"
using namespace SeqLib;

// brv is some set of reads to train the error corrector
for (BamRecordVector::const_iterator r = brv.begin(); r != brv.end(); ++r)
    b.AddSequence(r->Sequence().c_str(), r->Qualities().c_str(), r->Qname().c_str());
b.Train();
b.clear(); // clear the training sequences. Training parameters saved in BFC object

// brv2 is some set to correct
for (BamRecordVector::const_iterator r = brv2.begin(); r != brv2.end(); ++r)
    b.AddSequence(r->Sequence().c_str(), r->Qualities().c_str(), r->Qname().c_str());
b.ErrorCorrect();

// retrieve the sequences
UnalignedSequenceVector v;
std::string name, seq;
while (b.GetSequences(seq, name))
  v.push_back({name, seq});      			   

Support

This project is being actively developed and maintained by Jeremiah Wala ([email protected]).

Attributions

We would like to thank Heng Li (htslib/bwa/fermi), Erik Garrison (interval tree), Christopher Gilbert (aho corasick), and Mengyao Zhao (sw alignment), for providing open-source and robust bioinformatics solutions.

Development, support, guidance, testing:

  • Steve Huang - Research Scientist, Broad Institute
  • Steve Schumacher - Computational Biologist, Dana Farber Cancer Institute
  • Cheng-Zhong Zhang - Research Scientist, Broad Institute
  • Marcin Imielinski - Assistant Professor, Cornell University
  • Rameen Beroukhim - Assistant Professor, Harvard Medical School

seqlib's People

Contributors

agraubert avatar dnbaker avatar edawson avatar jmarshall avatar julianhess avatar mr-c avatar nathanweeks avatar shuang-broad avatar walaj avatar wulj2 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

seqlib's Issues

undefined reference to xx error for a test

Hi Jeremiah,

Following your intructions,I have successfully installed the library. However,when I tried to test it,I got following error although I specified the static library path:
g++ -L/home/jinsheng_tao/software/SeqLib/bin/ test.cpp
/tmp/ccGnVDiX.o: In function main': test.cpp:(.text+0xc8): undefined reference to SeqLib::BamReader::BamReader()'
test.cpp:(.text+0x10b): undefined reference to SeqLib::BamReader::Open(std::string const&)' test.cpp:(.text+0x148): undefined reference to SeqLib::BamReader::Header() const'
test.cpp:(.text+0x15e): undefined reference to SeqLib::BamWriter::SetHeader(SeqLib::BamHeader const&)' test.cpp:(.text+0x1ad): undefined reference to SeqLib::BamWriter::Open(std::string const&)'
test.cpp:(.text+0x1d4): undefined reference to SeqLib::BamWriter::WriteHeader() const' test.cpp:(.text+0x20c): undefined reference to SeqLib::BamWriter::WriteRecord(SeqLib::BamRecord const&)'
test.cpp:(.text+0x222): undefined reference to SeqLib::BamReader::GetNextRecord(SeqLib::BamRecord&)' test.cpp:(.text+0x235): undefined reference to SeqLib::BamWriter::Close()'
test.cpp:(.text+0x244): undefined reference to `SeqLib::BamReader::Close()'
collect2: error: ld returned 1 exit status

Below is my source code:
#include <SeqLib/BamReader.h>
#include <SeqLib/BamWriter.h>// #include "SeqLib/BWAWrapper.h"
#include <SeqLib/BamRecord.h>
using namespace SeqLib;

int main(int argc,char* argv[]) {
// open the reader BAM/SAM/CRAM
SeqLib::BamReader r;
r.Open(argv[1]);

// open a new interface to BWA-MEM
// BWAWrapper bwa;
// bwa.LoadIndex("hg19.fasta");

// open the output BAM
SeqLib::BamWriter w; // or writer(SeqLib::SAM) or writer(SeqLib::CRAM)
w.SetHeader(r.Header());
w.Open(argv[2]);
w.WriteHeader();

SeqLib::BamRecord rec;
bool hardclip = false;
float secondary_cutoff = 0.90; // secondary alignments must have score >= 0.9*top_score
int secondary_cap = 10; // max number of secondary alignments to return
while (r.GetNextRecord(rec)) {
    /*
    BamRecordVector results; // alignment results (can have multiple alignments)
    bwa.AlignSequence(r.Sequence(), r.Qname(), results, hardclip, secondary_cutoff, secondary_cap);

    for (auto& i : results)
        writer.WriteRecord(i);
    */
    w.WriteRecord(rec);
}
w.Close();
r.Close();
return 0;

}
It would be greatly appreciated if you could help me with this problem.Many thanks.

Best regards,
Jinsheng

lzma.h: No such file or directory - but works for htslib 1.9

Hi,
I am trying to install SeqLib and do not have a system install of lzma.h. I was able to work around that with LDFLAGS and CPPFLAGS with htslib 1.9 and xz-5.2.4

Install xz-5.2.4

cd ~/bin
wget https://tukaani.org/xz/xz-5.2.4.tar.bz2
cd xz-5.2.4
./configure CC=`which icc` --prefix=$PWD

Install samtools 1.9, bcftools 1.9, htslib 1.9

cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar xjf htslib-1.9.tar.bz2
cd ../htslib-1.9
./configure CC=`which icc` CPPFLAGS="-I${HOME}/bin/xz-5.2.4/include" LDFLAGS="-L${HOME}/bin/xz-5.2.4/lib" --prefix=$PWD/
make && make install

Try to install SeqLib

cd ~/bin
git clone --recursive https://github.com/walaj/SeqLib.git
cd SeqLib
./configure CC=`which icc` CPPFLAGS="-I${HOME}/bin/xz-5.2.4/include" LDFLAGS="-L${HOME}/bin/xz-5.2.4/lib"

Make

make

cram/cram_io.c:60:18: fatal error: lzma.h: No such file or directory
 #include <lzma.h>
                  ^
compilation terminated.
make[2]: *** [cram/cram_io.o] Error 1
make[2]: Leaving directory `/home/jelber2/bin/SeqLib/htslib'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/jelber2/bin/SeqLib'
make: *** [all] Error 2

What am I doing wrong here? Operating system is Red Hat Enterprise Linux Server release 6.8 (Santiago). Really, I am trying to install SeqLib as part of STITCH, but I couldn't install it through the recommended way. So, I am trying to compile SeqLib by itself with no success.

AlignmentPosition() is wrong with hard clips

There's a bug in the AlignmentPosition code in the BamRecord object (code below).
Hard clipping doesn't impact the start position in the read (removing the other half of the or statement should make the code correct)-- those bases are just missing from the reported read.

e.g., I have a read that looks like:
8:32868238-32874773_1151_1695_2:0:0_3:0:0_3cb 2048 1 1151 0 90H60M 0 -1 0 ACAGCTTAAAACTCAAAGGACTTGGCAGTGGTTTATATCCCTCTAGAGGAGCCTGTTCTA

For which the bamrecord AlignmentPosition() and AlignmentPositionEnd() is 90 and 60, respectively.

/** Get the start of the alignment on the read, by removing soft-clips
/
inline int32_t AlignmentPosition() const {
uint32_t
c = bam_get_cigar(b);
int32_t p = 0;
for (size_t i = 0; i < b->core.n_cigar; ++i) {
// Remove the code after the ||
if ( (bam_cigar_opchr(c[i]) == 'S') || (bam_cigar_opchr(c[i]) == 'H'))
p += bam_cigar_oplen(c[i]);
else // not a clip, so stop counting
break;
}
return p;
}

Build failure with newer libjsoncpp

Hello, with the new libjsoncpp 1.9.4 looks like the package fails to build from source:

libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/include -I/usr/include/htslib -I/usr/include/fml -I/usr/include/jsoncpp -Wno-sign-compare -Wdate-time -D_FORTIFY_SOURCE=2 -g -Wno-unknown-pragmas -g -O2 -fdebug-prefix-map=/libseqlib-1.1.2+dfsg=. -fstack-protector-strong -Wformat -Werror=format-security -c RefGenome.cpp -o libseqlib_la-RefGenome.o >/dev/null 2>&1
ReadFilter.cpp: In constructor 'SeqLib::Filter::ReadFilterCollection::ReadFilterCollection(const string&, const SeqLib::BamHeader&)':
ReadFilter.cpp:181:18: warning: 'Reader' is deprecated: Use CharReader and CharReaderBuilder instead. [-Wdeprecated-declarations]
  181 |     Json::Reader reader;
      |                  ^~~~~~
In file included from /usr/include/jsoncpp/json/json.h:11,
                 from ../SeqLib/ReadFilter.h:10,
                 from ReadFilter.cpp:1:
/usr/include/jsoncpp/json/reader.h:37:63: note: declared here
   37 |     "Use CharReader and CharReaderBuilder instead.") JSON_API Reader {
      |                                                               ^~~~~~
ReadFilter.cpp:181:18: warning: 'Json::Reader::Reader()' is deprecated: Use CharReader and CharReaderBuilder instead [-Wdeprecated-declarations]
  181 |     Json::Reader reader;
      |                  ^~~~~~
In file included from /usr/include/jsoncpp/json/json.h:11,
                 from ../SeqLib/ReadFilter.h:10,
                 from ReadFilter.cpp:1:
/usr/include/jsoncpp/json/reader.h:56:3: note: declared here
   56 |   Reader();
      |   ^~~~~~
ReadFilter.cpp:196:41: error: conversion from 'void' to non-scalar type 'Json::Value' requested
  196 |     Json::Value glob = root.removeMember("global");
      |                        ~~~~~~~~~~~~~~~~~^~~~~~~~~~
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/include -I/usr/include/htslib -I/usr/include/fml -I/usr/include/jsoncpp -Wno-sign-compare -Wdate-time -D_FORTIFY_SOURCE=2 -g -Wno-unknown-pragmas -g -O2 -fdebug-prefix-map=/libseqlib-1.1.2+dfsg=. -fstack-protector-strong -Wformat -Werror=format-security -c BamWriter.cpp -o libseqlib_la-BamWriter.o >/dev/null 2>&1

I don't know, if we can just discard the return value of removeMember (because it now returns void), or some more code changes are needed

BWA/Fermi submodules need update for GCC 10

Hi Jeremiah,

New versions of GCC choke on the current bwa/fermi-lite submodules, due to an issue fixed in the following commit: lh3/bwa@2a1ae7b (there is not a complementary commit in fermi-lite, so you have to add the extern keyword yourself in a forked repo).

As a result, SeqLib cannot build on GCC 10. Updating the bwa submodule to its latest version (and manually fixing the fermi-lite submodule) fixes this.

—Julian

Example showing how to compile my own program containing SeqLib

I am using SeqLib in my own program, because I need to align DNA sequences with BWA-MEM in memory.

Now I am struggling with an endless stream of compiling errors, so can you give me a complete example showing how to compile my program containing SeqLib?

can I use SeqLib at my C scripts?

HI,
I need to write C scripts to deal with bam , and I wanna use SeqLib ,other than htslib, but I am not sure
whether SeqLib is work .
I do not want to use htslib,because it can provide enough examples to tell me how to use .I think your library is better .

Segmentation Fault when running

I am attempting to migrate a program I wrote from using bamtools to using SeqLib. Unfortunately, my C++ isn't that great (I do most of my programing in Python, but would like to use C++ for this purpose). I managed to figure out compiling (I think), but I'm getting a segmentation fault when I call BarRecord::SetQname. The problematic section of code (as determined by getting an output every few lines until the error) is:

SeqLib::BamReader in_bam_file;
SeqLib::BamWriter temp_bam;
in_bam_file.Open(input);
temp_bam.SetHeader(in_bam_file.Header());
//const SeqLib::RefVector references = in_bam_file.GetReferenceData();
int paired_end_count = 1;
temp_bam.Open(prefix + ".temp.bam");
temp_bam.WriteHeader(); 
SeqLib::BamRecord line;
SeqLib::BamRecord temp_read1_entry;
SeqLib::BamRecord temp_bam_entry;
while (in_bam_file.GetNextRecord(line)) {
    if (paired_end_count % 2 == 1) {
        temp_read1_entry = line;
    }
    if (paired_end_count % 2 == 0) {
        std::string tagName;
        SeqLib::BamRecord temp_bam_entry;
        if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) > line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + \
            line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + "#ab";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) < line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + \
            temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + "#ba";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) == line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            paired_end_count++;
            continue;
        }
        // Write entries for Read 1
        temp_bam_entry.SetQname(tagName + ":1"); //ERROR OCCURS AT THIS LINE
        temp_bam_entry.SetSequence(temp_read1_entry.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(temp_read1_entry.Qualities().substr(taglen + spacerlen), 33);
        temp_bam_entry.AddZTag("X?", temp_read1_entry.Qname());
        temp_bam.WriteRecord(temp_bam_entry);
        // Write entries for Read 2
        temp_bam_entry.SetQname(tagName + ":2");
        temp_bam_entry.SetSequence(line.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(line.Qualities().substr(taglen + spacerlen), 33);
        temp_bam_entry.AddZTag("X?", line.Qname());
        temp_bam.WriteRecord(temp_bam_entry);
    }
    paired_end_count++;
}

and the error I am getting is "Segmentation fault: 11". I am working on a Apple computer running MacOSX 10.12.6.

Any idea why this is happening/what I can do to fix it?

Brendan

BamRecord's PositionEnd() inconsistent with Position()

Position() gives the 0-based position of the start of the read (that is, it's the start position in the genome). PositionEnd() is not the (1-based) stop position in the genome as I think most people would expect. It's computed as:

b->core.pos + GetCigar().NumQueryConsumed()

Which mixes the read-coordinate system (number of query bases consumed) with the genome coordinate system (core.pos).

I would suggest:

b->core.pos + GetCigar().NumReferenceConsumed()

instead, which gives the stop position in the genome of the read.

-August

Could you settle with official bwa and fermi-lite libraries

Hi,
I intend to package freebayes for Debian and the latest version is using SeqLib. So I tried to package SeqLib as well and was stumbling about the problem that BWA and FML have conflicting bseq1_t
declaration.
It turned out that you using personal forks of BWA and FML, not the upstream ones. When looking for bseq1_t in the SeqLib clones of BWA and FML, no definition appears. However, they do appear in the respective upstream repositories. Looks like you did patch the BWA and FML code bases to be able to mix them without a multiple-definition error.
Do you see any chance to solve this to make BWA and FML co-exist nicely?
Kind regards, Andreas.

Add different modes to BamWriter.Open

Hello, Jeremiah!
I use SeqLib and bxtools in my research really extensively, thank you a lot:)
Is it possible to add "append" mode for BamWriter.Open function? I faced a problem when this is an only fast solution requires opening bam-files multiple times.

Thanks,
Dima

SeqLib::BamWriter.WriteRecord() seems to be truncating the qname field

When I create a new bam record, the new bam record seems to truncate the read name at 23 characters. I am using the following code (note that this is part of a larger program; I just stopped it in the middle to check the file quickly):

SeqLib::BamReader in_bam_file;
SeqLib::BamWriter temp_bam = SeqLib::BamWriter(SeqLib::SAM);
in_bam_file.Open(input);
bool testOpen = temp_bam.Open(prefix + ".temp.bam");
temp_bam.SetHeader(in_bam_file.Header());
temp_bam.WriteHeader(); 
SeqLib::BamRecord line;
SeqLib::BamRecord temp_read1_entry;
SeqLib::BamRecord temp_bam_entry;
while (in_bam_file.GetNextRecord(line)) {
    if (paired_end_count % 2 == 1) {
        temp_read1_entry = line;
    }
    if (paired_end_count % 2 == 0) {
        std::string tagName;
        SeqLib::BamRecord temp_bam_entry;
        if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) > line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + \
            line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + "#ab";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) < line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + \
            temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + "#ba";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) == line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            paired_end_count++;
            continue;
        }
        // Write entries for Read 1
        std::string tempQualities = temp_read1_entry.Qualities().substr(taglen + spacerlen);
        temp_bam_entry = temp_read1_entry;
        temp_bam_entry.AddZTag("X?", temp_read1_entry.Qname());
        temp_bam_entry.SetQname("abcdefghijklmnopqrstuvwxyz");
        temp_bam_entry.SetSequence(temp_read1_entry.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(tempQualities, 33);
        temp_bam.WriteRecord(temp_bam_entry);
       
        // Write entries for Read 2
        tempQualities = line.Qualities().substr(taglen + spacerlen);
        temp_bam_entry = line;
        temp_bam_entry.AddZTag("X?", line.Qname());
        temp_bam_entry.SetQname("abcdefghijklmnopqrstuvwxyz");
        temp_bam_entry.SetSequence(line.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(tempQualities, 33);
        temp_bam.WriteRecord(temp_bam_entry);
        temp_bam.Close();
        abort();

When I look at the resulting file, I see this:

@hd VN:1.5 SO:queryname
@rg ID:A SM:testNonRef
abcdefghijklmnopqrstuvw 77 * 0 0 * * 0 0 GGGGTGCCCCCAAAGCGGGCCCGTGGGGGCCTCTGGGATGATGATGATGCACCTCGGCCATCCCAATTCGAGGAGGACCTGGCA DGHIIIIIIIIIIHIIIIIHHHIIHIIIGIIIIIIIIIHIIIIIIIIIIIIIIFIIGIIIGIIIIIHHIIIHIIIIIIGIHHII RG:Z:A X?:Z:HISEQ_0537:1:1101:10000:20114#ACGCTC
abcdefghijklmnopqrstuvw 141 * 0 0 * * 0 0 AAGAGAATGAGGGTCAAGGGTCAGGGCCAGAATCACTGTAGGCCACACAGGCTGTACGGGGTGGAAGGGCAGTCCTTTGTAGGA DGHHIIIIIIIHIHHIIIIIIIIIGHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIGHIIIIIHIHHIIIIIFHFGHHIG RG:Z:A X?:Z:HISEQ_0537:1:1101:10000:20114#ACGCTC

Based on my understanding, the read name should be "abcdefghijklmnopqrstuvwxyz", but it isn't.

Thanks for any help you can give,
Brendan

Paired end/mate mair reads

First I'd like to thank you for the time you've dedicated to this project. I really (for one) really appreciate it!

I have two questions than can be construed as "issues". First, it's a little unclear how one might use the BWAWrapper object to perform paired end read mapping. I see that we have access to the underlying bwa-mem routines, but I have to admit it's a little daunting. I don't suppose a routine could be added to BWAWrapper to illustrate how this might be done (this is not entirely a selfless question, I realize, but I do imagine that it would be of general use to the community)?

Second, it's not entirely clear (from the documentation) how and if any of the routines you have provided are thread safe, or if multithreading is happening behind the scenes (it appears that the underlying bwa-mem routines may be doing some sort of bag of tasks multithreading.). Obviously there is some degree of state that is being recorded (getters/setters), so the potential is certainly there for race conditions.

Thanks again, and I apologize if this is not the correct medium to ask this of you.
-August Woerner

BamReader::SetCramReference

Hello, I'm unable to reliably read from a CRAM file when using BamReader::SetCramReference().

Failed to populate reference for id 4
Unable to fetch reference #4 16121..256523

Since I'm able to get ~16750000 reads into the bam before I encounter this issue, would you say that this is more of an issue with the fasta I'm using?

determine if a quality string is empty

In freebayes, I've found an issue where an empty quality string ("*" in SAM) is translated into a string of null characters.

Is there a way to check if the quality string should be empty other than by checking if Quality() returns a bunch of nulls?

function ProperOrientation() have a bug?

Hi, it seems that the function ProperOrientation returns the read pairs in FR. But in the code:

if (b->core.pos < b->core.mpos) {
  return (b->core.flag&BAM_FREVERSE) == 0 && (b->core.flag&BAM_FMREVERSE) != 0 ? true : false;
} else {
  return (b->core.flag&BAM_FREVERSE) == 0 && (b->core.flag&BAM_FMREVERSE) != 0 ? false : true; # here meet a bug?
}

In the "else" part, it return the read pairs in FR, also the FF and RR, as the code only remove the pairs in RF.

It will be better that rewrite it like this?

}else{
return (b->core.flag&BAM_FREVERSE) != 0 && (b->core.flag&BAM_FMREVERSE) == 0 ? true : false;
}

Second, in small insert size data, such as cfDNA data, mpos may == pos. So need add supporting for this situation.

if (b->core.pos < b->core.mpos) {
return (b->core.flag&BAM_FREVERSE) == 0 && (b->core.flag&BAM_FMREVERSE) != 0 ? true : false;
} else if(b->core.pos > b->core.mpos) {
return (b->core.flag&BAM_FREVERSE) != 0 && (b->core.flag&BAM_FMREVERSE) == 0 ? true : false;
}
else{
return (((b->core.flag&BAM_FREVERSE) == 0 && (b->core.flag&BAM_FMREVERSE) != 0) || ((b->core.flag&BAM_FREVERSE) != 0 && (b->core.flag&BAM_FMREVERSE) == 0)) ? true : false;
}

Looking for functionalities to replace BamTools with SeqLib

Hi,

I'm looking for replacing BamTools with SeqLib and I'm actually having problems to find some functionalities in SeqLib.
I already have checked the documentation and didn't find them.
The methods I need are:

  • Setters such as : SetIsPaired, SetIsProperPair, SetIsFirstmate, SetIsMateMapped, SetIsPrimaryAlignment, SetIsReverseStrand
  • Getter IsPrimaryAlignment
  • GetSoftClips for an alignment
  • A lot of information from read groups. I know there is a method "BamHeader.AsString()" but is there a parser available?

Thank you,
Best regards

master refers to non-existent htslib commit

[sam@Sams-MacBook-Pro SeqLib (master *=)]$ git diff
diff --git htslib htslib
index 5163833..0f298ce 160000
--- htslib
+++ htslib
@@ -1 +1 @@
-Subproject commit 5163833ede355f8c2a6788780c55d7598279e767
+Subproject commit 0f298ce22c5c825c506129bf242348a31630c382
[sam@Sams-MacBook-Pro SeqLib (master *=)]$ git submodule update --init --recursive
error: no such remote ref 5163833ede355f8c2a6788780c55d7598279e767
Fetched in submodule path 'htslib', but it did not contain 5163833ede355f8c2a6788780c55d7598279e767. Direct fetching of that commit failed.

Is it even necessary to refer to your fork instead of upstream htslib? I don't immediately see any differences.

Libraries are placed in {prefix}/bin and not {prefix}/lib

When building from source, my libseqlib.a, libbwa.a, libhts.a and libfml.a all get placed in a directory called bin. I'd usually expect this directory to contain binaries, and to find libraries in a directory called lib. This isn't a huge deal, but it does kind of throw off my automated install processes. PR #44 tweaks the Makefile.am to change this.

incomplete error handling

hi, I'm using SeqLib via freebayes. I just unmounted the file system holding the bams that freebayes was accessing. but the program still had a 0 exit status.

I see this message:
[E::bgzf_read] bgzf_read_block error -1 after 1115 of 1202 bytes
which is correctly propagated by htslib (AFAICT).

My C++ is rusty/non-existent, but it looks like _Bam::load_read(BamRecord& r) in SeqLib is just using a non-zero return from sam_read1 as an indicator to end iteration without actually propagating the error.

So it appears that the iterator just stops after getting the -1 return value from sam_read1 and can never error. Am I mistaken?

Here is the code in question:

SeqLib/src/BamReader.cpp

Lines 305 to 355 in 770cd10

bool _Bam::load_read(BamRecord& r) {
// allocated the memory
bam1_t* b = bam_init1();
int32_t valid;
if (hts_itr.get() == NULL) {
valid = sam_read1(fp.get(), m_hdr.get_(), b);
if (valid < 0) {
#ifdef DEBUG_WALKER
std::cerr << "ended reading on null hts_itr" << std::endl;
#endif
//goto endloop;
bam_destroy1(b);
return false;
}
} else {
//changed to sam from hts_itr_next
// move to next region of bam
valid = sam_itr_next(fp.get(), hts_itr.get(), b);
}
if (valid < 0) { // read not found
do {
#ifdef DEBUG_WALKER
std::cerr << "Failed read, trying next region. Moving counter to " << m_region_idx << " of " << m_region.size() << " FP: " << fp_htsfile << " hts_itr " << std::endl;
#endif
// try next region, return if no others to try
++m_region_idx; // increment to next region
if (m_region_idx >= m_region->size()) {
bam_destroy1(b);
return false;
}
//goto endloop;
// next region exists, try it
SetRegion(m_region->at(m_region_idx));
valid = sam_itr_next(fp.get(), hts_itr.get(), b);
} while (valid <= 0); // keep trying regions until works
}
// if we got here, then we found a read in this BAM
empty = false;
next_read.assign(b); // assign the shared_ptr for the bam1_t
r = next_read;
return true;
}

Benchmark run without optimization?

Hi,
this is Hannes from SeqAn. From your benchmark file it looks like you are neither setting c++ optimization levels, nor deactivating debug:
https://github.com/walaj/SeqLib/blob/master/benchmark/Makefile
Were the published benchmark results indeed run without -O3 -DNDEBUG?

If yes, could you please repeat them? As the other libraries are included pre-compiled, it is unclear which options they were built with.
Other third party benchmarks previously had similar problems where optimized versions of htslib were tested against unoptimized versions of SeqAn (the difference is huge!):
wilzbach/bam-perf-test@5aea37e
Since all our tests show that we are faster than htslib in almost all cases we would of course be interested in your test data if it turns out we indeed perform notably worse.

If on the other hand the results are more close to what we are seeing, we would kindly ask you to update the table also in the paper.

Thank you very much,
Hannes

how to know mismatch base position and cigar value of every base at a read ?

HI,Author,
I wanna use SeqLib to deal with BAM files to know mismatch base position and cigar value of every base at a read . for example, we get a read from BAM records , cigar string is 2S3M2I2M ,the read sequence is
'ATCGGCATG'. So I wanna get mismatch index at this read and get cigar value for every base at this read ,the final result ,just like this : ('m' meand mismatch )
A T C G G C A T G
S S M m M I I M M

thank you!

Could SeqLib and htslib work out what a public cram api should be?

Hello,

In investigating packaging issues with htslib on Debian, we discovered that SeqLib is using functions htslib considers private (anything in cram/*.h)

Debian modified htslib to allow linking to the cram functions in the normal shared library, but the htslib developers have told us to not do that. Normally you're using an embedded copy of htslib, and can directly access private functions.

In an ideal world htslib would actually have a supported public cram API.

Could you talk a bit with the htslib developers about what you'd actually need from them?

Version 1.1.2 has changed API

Hi,
in Debian we have invented an ABI version 0 to flag once a library has changed its interface. I realised that you have removed some symbols. For instance TrainAndCorrect and ErrorCorrectToTag were removed. For allocate_sequences_from_char only the declaration in SeqLib/BFC.h remained but there is no definition of this function in the C code.
I think it makes sense to introduce an official ABI version to inform programmers about changes. I'll switch the Debian package now to ABI version 1.
Kind regards, Andreas.

Implicit conversion from istream to bool

I found a little problem inside SGA/Util/ClusterRead.cpp (line 70), due to an implicit conversion from istream to bool. It can be easily solved with the following patch:

diff --git a/SGA/Util/ClusterReader.cpp b/SGA/Util/ClusterReader.cpp
index 5c9460b..a7367dd 100644
--- a/SGA/Util/ClusterReader.cpp
+++ b/SGA/Util/ClusterReader.cpp
@@ -67,8 +67,8 @@ bool ClusterReader::generate(ClusterVector& out)
 bool ClusterReader::readCluster(ClusterRecord& record)
 {
     std::string line;
-    bool good = getline(*m_pReader, line);
-    if(!good || line.empty())
+    getline(*m_pReader, line);
+    if(!m_pReader->good() || line.empty())
         return false;
     std::stringstream parser(line);
     parser >> record.clusterID;

Thank you for efforts with this awesome project!

malloc(): smallbin double linked list corrupted

Hey,

I have high coverage long read WGS. I'm running code like the following, and getting an error like show below. The weird thing is that while I can reproducibly associate it with a specific read when I run my code per-chromosome, the code runs over the read in question when I run the program over a specific region. Any advice? This is using the most recent version of SeqLib (e0daf60)

Thanks,
Robbie

SeqLib::BamReader reader;
// stuff
SeqLib::BamRecord record;
while(reader.GetNextRecord(record)) { <--- error occurs here
 // stuff
}

code around here https://github.com/rwdavies/STITCH/blob/master/STITCH/src/bam_access.cpp#L178

*** glibc detected *** /apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/bin/exec/R: malloc(): smallbin double linked list corrupted: 0x0000000003c63a20 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3b16a75f3e]
/lib64/libc.so.6[0x3b16a7a608]
/lib64/libc.so.6(__libc_malloc+0x5c)[0x3b16a7abfc]
/apps/well/gcc/5.4.0/lib64/libstdc++.so.6(_Znwm+0x18)[0x7f09a78962c8]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZN9__gnu_cxx13new_allocatorISt19_Sp_counted_deleterIP6bam1_tN6SeqLib11free_deleteESaIvELNS_12_Lock_policyE2EEE8allocateEmPKv+0x49)[0x7f09a757ea5f]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZNSt16allocator_traitsISaISt19_Sp_counted_deleterIP6bam1_tN6SeqLib11free_deleteESaIvELN9__gnu_cxx12_Lock_policyE2EEEE8allocateERS9_m+0x28)[0x7f09a757e651]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZSt18__allocate_guardedISaISt19_Sp_counted_deleterIP6bam1_tN6SeqLib11free_deleteESaIvELN9__gnu_cxx12_Lock_policyE2EEEESt15__allocated_ptrIT_ERSB_+0x21)[0x7f09a757e07a]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC1IP6bam1_tN6SeqLib11free_deleteESaIvEEET_T0_T1_+0x42)[0x7f09a757d50e]   
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC1IP6bam1_tN6SeqLib11free_deleteEEET_T0_+0x32)[0x7f09a757c6b4]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZNSt12__shared_ptrI6bam1_tLN9__gnu_cxx12_Lock_policyE2EEC2IS0_N6SeqLib11free_deleteEEEPT_T0_+0x37)[0x7f09a757b945]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZNSt10shared_ptrI6bam1_tEC2IS0_N6SeqLib11free_deleteEEEPT_T0_+0x28)[0x7f09a757b050]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZN6SeqLib9BamRecord6assignEP6bam1_t+0x28)[0x7f09a7576566]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZN6SeqLib4_Bam9load_readERNS_9BamRecordE+0x1dd)[0x7f09a756ed93]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_ZN6SeqLib9BamReader13GetNextRecordERNS_9BamRecordE+0x11a)[0x7f09a756e58e]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_Z31get_sampleReadsRaw_using_SeqLibbiiSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS5_EES7_iS_IiSaIiEES5_S5_S5_+0x453)[0x7f09a754a8c3]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_Z30get_sampleReadsRaw_from_SeqLibbiiSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS5_EES7_iS_IiSaIiEES5_S5_S5_+0x405)[0x7f09a7551855]
/gpfs0/users/flint/rwdavies/R/x86_64-pc-linux-gnu-library/3.3/STITCH/libs/STITCH.so(_STITCH_get_sampleReadsRaw_from_SeqLib+0x667)[0x7f09a753f8a7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0xdb872)[0x7f09aaaae872]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0xe6660)[0x7f09aaab9660]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x83b)[0x7f09aaaf1be8]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12220f)[0x7f09aaaf520f]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_applyClosure+0x6f9)[0x7f09aaaf2ce4]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x94e)[0x7f09aaaf1cfb]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x1231bb)[0x7f09aaaf61bb]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12220f)[0x7f09aaaf520f]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_applyClosure+0x6f9)[0x7f09aaaf2ce4]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x94e)[0x7f09aaaf1cfb]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12220f)[0x7f09aaaf520f]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x121b52)[0x7f09aaaf4b52]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12220f)[0x7f09aaaf520f]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_applyClosure+0x6f9)[0x7f09aaaf2ce4]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(R_forceAndCall+0x516)[0x7f09aaaf36e9]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x74863)[0x7f09aaa47863]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x174ace)[0x7f09aab47ace]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12dffc)[0x7f09aab00ffc]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x253)[0x7f09aaaf1600]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_applyClosure+0x6f9)[0x7f09aaaf2ce4]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12da09)[0x7f09aab00a09]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x253)[0x7f09aaaf1600]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_applyClosure+0x6f9)[0x7f09aaaf2ce4]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x94e)[0x7f09aaaf1cfb]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x1231bb)[0x7f09aaaf61bb]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12220f)[0x7f09aaaf520f]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x1211b6)[0x7f09aaaf41b6]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(+0x12220f)[0x7f09aaaf520f]
/apps/well/R/3.3.2-openblas-0.2.18-omp-gcc5.4.0/lib64/R/lib/libR.so(Rf_eval+0x62a)[0x7f09aaaf19d7]

Improvement: Remove use of "__" reserved C++ prefix

Hi all,

I'm very excited to see this project! I've had the experience of replacing BamTools with htslib + samtools in the past before to reap the substantial speed improvements it provides. Of course, the interface is not nearly as nice, and I miss the nice, clean, c++ interface provided by BamTools. SeqLib seems like a great solution to this tradeoff.

Anyway, on to my issue. The use of __ as a prefix is reserved in C and C++ for implementation and library-specific use. Therefore, it's highly desirable to avoid beginning variables, functions or methods with __ as a prefix. Though I'm sure there is likely no conflict at this point, it's best to observe this rule to avoid potential naming conflicts in the future (and because all such names are technically reserved for the language).

--Rob

Close CRAM reference

Hey, have been using seqLib successfully for a while, but I just ran into an issue. I have a function, and within this function, I create a BamReader object, get reads out, etc. This works well, except I've recently noticed that when I use CRAM files and set the reference using BamReader::SetCramReference, the reference is not closed when calling BamReader::Close(), or when the BamReader object goes out of scope when the function ends. So when I call my function on several thousand CRAM files, the code crashes because I don't have enough file handles. Can you please implement a method to close the cram reference?

Error message wise, when running I get this

connect: Network is unreachable

one for each file (this is run on a cluster without internet access), then once I run out of file handles, I get

connect: Network is unreachable
test-data/mouse_data/mm10_2016_10_02.fa: Too many open files
Failed to populate reference for id 10
Unable to fetch reference #10 10001343..10999757
Failure to decode slice

heap-buffer-overflow

I'm getting a heap overflow associated with memory set in the SetCigar function. The simplest way to reproduce the problem is:

#include
#include
#include
#include
#include
#include <unordered_map>

#include "SeqLib/RefGenome.h"
#include "SeqLib/BWAWrapper.h"
#include "SeqLib/FastqReader.h"
#include "SeqLib/BamHeader.h"
#include "SeqLib/BamWriter.h"
#include "SeqLib/BamReader.h"
#include "SeqLib/BamRecord.h"

using namespace SeqLib;
using namespace std;

int
main(int argc, char **argv) {

BamRecord r;
r.init();
SeqLib::BamReader br;
br.Open( "foo.bam" );
while (br.GetNextRecord(r) ) {
Cigar orig = r.GetCigar();

Cigar newCiggy;

newCiggy.add( CigarField( 'S', 1));
newCiggy.add( CigarField('M', 59));
newCiggy.add( CigarField( 'S', 1));

r.SetCigar(newCiggy);
r.AddZTag("ST",  "0");

}

}

I gave this code the catchy name "test.c"

And compile it with (changing the -I flag as needed):
g++ -fsanitize=address -Wall -std=c++11 -ggdb3 -I/usr/local/src/SeqLib/SeqLib/htslib -I/usr/local/src/SeqLib/SeqLib -L/usr/local/src/SeqLib/SeqLib/src -L/usr/local/src/SeqLib/SeqLib/bwa -Wall -o test test.c -lhts -lseqlib -lhts -lbwa -lz -lpthread

And the one-read bam foo is available here:

https://www.dropbox.com/s/z0zro4pzvq9gx7r/foo.bam?dl=1

When you run ./test you get:

augustw@odin:~/src/RemoveThoseNumts$ ./test

==35523==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6160000005b8 at pc 0x7f001426a77a bp 0x7fff8eee46b0 sp 0x7fff8eee3e58
WRITE of size 2 at 0x6160000005b8 thread T0
#0 0x7f001426a779 (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779)
#1 0x7f0013f7d06a in memcpy /usr/include/x86_64-linux-gnu/bits/string_fortified.h:34
#2 0x7f0013f7d06a in bam_aux_append /usr/local/src/SeqLib/SeqLib/htslib/sam.c:1704
#3 0x5640aed0aa3b in main /home/augustw/src/RemoveThoseNumts/test.c:44
#4 0x7f00133adb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
#5 0x5640aed0a1f9 in _start (/home/augustw/src/RemoveThoseNumts/test+0x51f9)

0x6160000005b8 is located 3 bytes to the right of 565-byte region [0x616000000380,0x6160000005b5)
allocated by thread T0 here:
#0 0x7f00142cfd38 in __interceptor_calloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xded38)
#1 0x5640aed194af in SeqLib::BamRecord::SetCigar(SeqLib::Cigar const&) /usr/local/src/SeqLib/SeqLib/src/BamRecord.cpp:123
#2 0x60200000019b ()

SUMMARY: AddressSanitizer: heap-buffer-overflow (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779)
Shadow bytes around the buggy address:
0x0c2c7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2c7fff8070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2c7fff8080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2c7fff8090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2c7fff80a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c2c7fff80b0: 00 00 00 00 00 00 05[fa]fa fa fa fa fa fa fa fa
0x0c2c7fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2c7fff80d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2c7fff80e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2c7fff80f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2c7fff8100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==35523==ABORTING

The curious part is that you appear to need to both set the tag and change the cigar to induce this behavior. After reading through the error log it seems to suggest that the error is with setting the tags (either setting int tags or string z tags both trip the error), but I'm sure you could tell me better. And in case your curious, doing this with some (near) production code is causing a segfault that points to the same line of code (line 123 of BamRecord.cpp)

Any and all help is greatly appreciated!
Thanks
-August

Some functionality enchancement

Hello walaj:
I have been using your excellent library SeqLib for sometime, in this period, I have added some functionality which will do some good to my own work. You're welcome to put these functionality into your original repository if it's worth and I will be much happier. I myself have tested these functionality and they work properly as I wished in my work.
All the functionality I have made is recorded in the following file:
https://github.com/vanNul/SeqLib/blob/master/change.md

Fermi assembler issue

I'm having some with the Fermi assembler. I've tried various versions of the attached code; none manage to produce any contigs. Further, calling GetSequences() after calling PerformAssembly produces mangled strings (e.g., ascii value of 4 for an 'A'), which is probably a bad sign!

Here's the (very rough) source:
#include
#include
#include "SeqLib/BamReader.h"
#include "SeqLib/FastqReader.h"
#include "SeqLib/BWAWrapper.h"
#include "SeqLib/FermiAssembler.h"
#include "SeqLib/UnalignedSequence.h"

using namespace SeqLib;
using namespace std;

const string REFGENOME = "chrAll.standardChroms.fa";

int
main(int argc, char **argv) {

/* Mapping w/ BWA
BWAWrapper bwa;

bwa.LoadIndex(REFGENOME);

bwa.SetGapOpen(4);
bwa.SetGapExtension(1);
bwa.Set3primeClippingPenalty(1000);
bwa.Set5primeClippingPenalty(1000);

BamRecordVector results;

for (string line; getline(cin, line) ; ) {
bwa.AlignSequence(line, "m", results, false, 0.9, 0);
}
*/

if (argc < 2) {
cerr << "I need a fastq file to work!\n" << endl;
return 1;
}

FermiAssembler f;
FastqReader q(argv[1]);
UnalignedSequence s;

while (q.GetNextSequence(s) )
f.AddRead(s);

UnalignedSequenceVector seqs = f.GetSequences();
cout << seqs.size() << " also is the size!\n";
for (unsigned i = 0; i < seqs.size(); ++i)
cout << seqs[i].Seq << endl;

f.SetMinOverlap(10);
f.SetKmerMinThreshold(1);
f.SetKmerMaxThreshold(1000);

f.PerformAssembly();

// retrieve the contigs
std::vectorstd::string contigs = f.GetContigs();
cerr << contigs.size() << " contigs made! " << endl;

seqs = f.GetSequences();
const char *seq = seqs[0].Seq.c_str();
unsigned len = seqs[0].Seq.length();

cout << len << " is the length...? " << ((int)seq[0]) << " is the first char" << endl;

return 0;
}

Here's the fastq file (please forgive the copy pasting!):
@M01451:39:000000000-AGNHG:1:1101:18315:2050 1:N:0:25
ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT

  • F vWA
    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFDFFGGGGEEFGFGGGGGGGGGGGGGGGGGGGGGFF
    @M01451:39:000000000-AGNHG:1:1101:15489:2067 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    CEGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:11753:2476 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GGGGFFGGGGFGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGFECFGGGFGGFFGGGGGGGGGFGGGGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:8601:2804 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GCGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGFGGGFFGFGGGGGGGGGGGGGGFGGGGGGFG
    @M01451:39:000000000-AGNHG:1:1101:19616:3182 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGFGGGGGGGGGGGEFFGGGGEGG
    @M01451:39:000000000-AGNHG:1:1101:12281:3408 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGFGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:9468:3637 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    FFFGFGGGGGGEFE<EGGGGGGFEGGGGGGGFGGGFFGGDFCFFAFGGGGDEEFGGFFGGGGGGFFFCFFCFDFGCG@EFFDFGGFGGGGGGGGGG
    @M01451:39:000000000-AGNHG:1:1101:15793:4032 1:N:0:25
    ATTGATCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATCCATCCATCCTATGTATT
  • F vWA
    C6C@,CFFEFGGFFEFEGGGCFEF<EGGCGFFGGDFFFFGFGGGGGGGGFFGFGFGCFGGGGGGBFFFGFGGGGFFGGGGGGGGGGGGFGGGGGGGGGFCFFFG

Default Constructor may be optimized in case segment fault happens

hello walaj
the default FastqReader constructor will take some default value of
fp and seq, while you did not define their default value there.
fp: https://github.com/walaj/SeqLib/blob/master/SeqLib/FastqReader.h/#L52
seq: https://github.com/walaj/SeqLib/blob/master/SeqLib/FastqReader.h/#L53
FastqReader(): https://github.com/walaj/SeqLib/blob/master/SeqLib/FastqReader.h/#L22

however the default value of seq is not null, which will lead to segment fault if some one accidentally create a FastqReader class by FastqReader() and never use it to open any fast/q file at all, in the end,
before the end of main function, the destroyer of FastqReader() will try to destroy something that does not exists and lead to segmentation fault use code in https://github.com/walaj/SeqLib/blob/master/SeqLib/FastqReader.h/#L40

My suggestion is to change FastqReader(): https://github.com/walaj/SeqLib/blob/master/SeqLib/FastqReader.h/#L22
into something like below:
FastqReader(){
seq = NULL;
}

Error when compiling

Hi Jeremiah,

I was compiling SnowTools on my machine and was hit by the following error:

MiniRules.cpp:535:8: error: use of undeclared identifier 'atm'
MiniRules.cpp:540:12: error: expected unqualified-id

I checked and found that possibly the cause is lack of #ifndef APPLE #endif pair around the lines.

Steve

Name space pollution - there is another seqlib

Hi,
in the COVID-19 fighting effort of the Debian Med team where we try to package software that is relevant to do research on COVID-19 I stumbled upon seqlib as a dependency of some of the projects. At first I've thought that we just have this packaged but than I checked the URL and realised that the software that was using seqlib was linking to a totally different project. I have fileed an issue on "the other seqlib" to trigger some discussion about a sensible name. In any case we need distinguishable names inside Debian and I would love to do this with both authors confirmance.
Kind regards, Andreas.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.