genomicsdb / genomicsdb

High performance data storage for importing, querying and transforming variants.

Home Page: https://genomicsdb.readthedocs.io

License: Other

CMake 2.79% C 0.36% Dockerfile 0.08% Shell 2.46% Python 4.89% Java 19.22% C++ 69.60% Scala 0.60%

bioinformatics cpp gatk genomics genomicsdb java mpi precision-medicine scala spark variant variant-calling

genomicsdb's Introduction

GenomicsDB is a highly performant, scalable data store written in C++ for importing, querying and transforming genomic variant data. It is built on top of a fork of htslib and a tile-based array storage system. Variant data is sparse by nature (sparse relative to the whole genome), so sparse array data stores are a natural fit for it. See genomicsdb.readthedocs.io for documentation and usage.

Supported platforms: Linux and macOS.
Supported filesystems: POSIX, HDFS, EMRFS (S3), GCS and Azure Blob.

Included are

JVM/Spark wrappers that allow for streaming VariantContext buffers to/from the C++ layer among other functions. GenomicsDB jars with native libraries and only zlib dependencies are regularly published on Maven Central.
Native tools for incremental ingestion of variants in the form of VCF/BCF/CSV into GenomicsDB for performance.
MPI and Spark support for parallel querying of GenomicsDB.

GenomicsDB is packaged into gatk4 and benefits qualitatively from a large user base.

External Contributions

GenomicsDB is open source and all participation is welcome. GenomicsDB is released under the MIT License and all external contributors are expected to grant an MIT License for their contributions.

Checklist before creating Pull Request

Please ensure that the code is well documented in Javadoc style for Java/Scala. For Java/C/C++ code formatting, roughly adhere to the Google Style Guides. See GenomicsDB Style Guide

genomicsdb's Issues

ubuntu install help

I'd like some help troubleshooting the install process on Ubuntu 16.04.

Attached is the output of cmake and make:

genomecmake.txt
genomemake.txt

Maven wget url invalid

When running install_prereqs.sh, the Maven installation step tries to wget from the following URL:
https://www-us.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz. This fails with the error wget: unable to resolve host address 'www-us.apache.org'. The host www-us.apache.org is no longer valid. Should this URL be updated?

REF, ALT, QUAL, FILTER, INFO fields stored multiple times for each sample?

Hi,
If I import a VCF file containing many samples into GenomicsDB, say something from the 1000 Genomes Project, are the fields that are common to all samples (the ones mentioned in the subject line) stored in each TileDB cell? Or are they stored just once per variant?
Regards,
Shubham

Moving to 24.04 for the docker builds is broken.

See #327

Improve (mainly java) logging in GenomicsDB

This is a placeholder to

Hold issues we are seeing with logging currently
Collect ideas on how to improve logging.

Refine gt_mpi_gather and other tools

GenomicsDB tools should not throw exceptions and dump core for expected issues; they should exit gracefully with meaningful messages.
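One way to picture the requested behaviour is a thin wrapper around a tool's body that converts any escaping exception into a message and a nonzero exit code. The sketch below is only an illustration: ToolResult and run_tool are hypothetical names, not part of GenomicsDB, and the exit codes are arbitrary.

```cpp
#include <exception>
#include <functional>
#include <string>

// Hypothetical result type: what a tool's main() would print and return.
struct ToolResult {
  int exit_code;
  std::string message;
};

// Hypothetical wrapper: runs the tool body and turns any escaping
// exception into a readable message instead of letting it reach
// std::terminate (which is what produces the core dump).
ToolResult run_tool(const std::function<void()>& body) {
  try {
    body();
    return {0, ""};
  } catch (const std::exception& e) {
    // Known failure: report what() and use a conventional error code.
    return {1, std::string("error: ") + e.what()};
  } catch (...) {
    // Anything not derived from std::exception.
    return {2, "error: unknown failure"};
  }
}
```

A tool's main() would then write the message to stderr and return the exit code, so expected failures end with a diagnostic rather than an abort.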

Use unflattened genomic coordinates in GenomicsDB error messages

This issue surfaced on the gatk user forum. VariantQueryProcessorException prints the flattened coordinates in the exception message, making it difficult for the user to discern what is happening.
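As a rough sketch of what unflattening could look like, assume the contigs are laid end to end on a single axis and that each contig's starting offset and length are known. The Contig struct and the unflatten function below are hypothetical and do not reflect the actual GenomicsDB data structures.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical contig table entry: the contig starts at `offset` on the
// flattened axis and covers `length` positions.
struct Contig {
  std::string name;
  std::int64_t offset;
  std::int64_t length;
};

// Maps a 0-based flattened coordinate to "contig:position" (1-based).
// `contigs` must be sorted by offset. Coordinates outside every contig
// are reported as "unknown:<coordinate>".
std::string unflatten(const std::vector<Contig>& contigs,
                      std::int64_t flattened) {
  // Find the first contig that starts after the coordinate, then step back.
  auto it = std::upper_bound(
      contigs.begin(), contigs.end(), flattened,
      [](std::int64_t value, const Contig& c) { return value < c.offset; });
  if (it == contigs.begin()) {
    return "unknown:" + std::to_string(flattened);
  }
  --it;
  if (flattened >= it->offset + it->length) {
    return "unknown:" + std::to_string(flattened);
  }
  return it->name + ":" + std::to_string(flattened - it->offset + 1);
}
```

An error message built this way would show something like chr2:1 instead of a raw column index.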

plink support has introduced mpi dependency in test_gen.cc

The C unit tests are meant to test the basic functionality exposed by <GenomicsDB>/src/main/cpp and libtiledbgenomicsdb.so, so should not have any mpi dependencies. However, gt_mpi_gather can be used for testing bgen_support as a functional test from either run.py or test_tools.sh from <GenomicsDB>/tests, but test_bgen should only test the functionality exposed by libtiledbgenomicsdb.so

GenomicsDB api returns a null list when querying for variant calls with no rows specified

The default is all rows when no rows are specified, so the GenomicsDB api calls query_variants and query_variant_calls should consider all rows. Thanks @jacmarjorie for pointing out this issue.

protobuf version

Hi,
I tried to follow the instructions by first building protobuf 3.0.2, but got an error "google/protobuf/port_def.inc: No such file or directory". Then I did git checkout 3.7.x inside protobuf followed by the installation and that succeeded.
Regards,
Shubham

Dependencies missing in the pom

Hello,

The dependencies commons-io:commons-io, com.google.guava:guava and org.apache.commons:commons-lang3 (test scope only for the last one) are missing in pom.xml, whereas they are needed by some files in the code.

Cheers,
Pierre

Extend Github Actions to run with cloud emulators - azurite, minio, etc.

This could be part of a separate workflow. We could just run run_spark_hdfs.py for pull requests.

Use Protobuf in the query api

@kgururaj mentioned in #41 (review) that some of the structures used have a lot in common with the Protobuf structs and that they can be unified.

Annotation Service needs to be extended to support multiple alternate alleles

This is something to keep in mind when we start using this functionality.

Originally posted by @jPleyte in #186 (comment)

[TileDB::FileSystem] Error and FeatureReader to initialize with exception Error

Dear team,

GenomicsDB failed (same command and dataset) with different errors on two file systems.
A summary follows:

1. BeeGFS file system:

$ cat Cq5A.chunk_1.GenomicsDBImport.11705869.err | tail -13
01:08:38.145 INFO GenomicsDBImport - Starting batch input file preload
01:09:38.916 INFO GenomicsDBImport - Finished batch preload
01:09:38.916 INFO GenomicsDBImport - Importing batch 1 with 564 samples
[TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error; path=/ibex/scratch/projects/c2071/1000quinoa/naga/1128_Phase2/RESULTS/gVCF/Cq5A.1/Cq5A$1$8993623/__06ef12cd-3e3e-445b-8a17-78d392d6e35347261215135488_1600294179899/ExcessHet.tdb; errno=121(Remote I/O error)
[TileDB::WriteState] Error: Cannot write segment to file.
02:44:19.847 erro NativeGenomicsDB - pid=208539 tid=208928 VariantStorageManagerException exception : Error while writing to TileDB array
TileDB error message : [TileDB::WriteState] Error: Cannot write segment to file
[TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error; path=/ibex/scratch/projects/c2071/1000quinoa/naga/1128_Phase2/RESULTS/gVCF/Cq5A.1/Cq5A$1$8993623/__06ef12cd-3e3e-445b-8a17-78d392d6e35347261215135488_1600294179899/ExcessHet.tdb; errno=121(Remote I/O error)
terminate called after throwing an instance of 'std::exception'
what(): std::exception

2. NFS file system

$ cat Cq5A.chunk_1.GenomicsDBImport.11739158.err | tail -13
11:49:12.994 INFO GenomicsDBImport - Starting batch input file preload
11:52:16.216 INFO GenomicsDBImport - Shutting down engine
[September 21, 2020 11:52:16 AM AST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 8.28 minutes.
Runtime.totalMemory()=19555942400

A USER ERROR has occurred: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize with exception: org.broadinstitute.hellbender.exceptions.UserException: Failed to create reader from file:///ibex/scratch/projects/c2071/1000quinoa/OUT/Sept2020/VCF/S7E1_batch2.snps.indels.g.vcf.gz

Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

Note:

The environment variable was set with export TILEDB_DISABLE_FILE_LOCKING=1 before running GenomicsDB.
The same error seen on the NFS file system was also observed on the BeeGFS file system when running GenomicsDB with different chromosome names/chunks. I used to overcome this issues

Background:

GATK version: 4.1.8
GenomicDB native library version: 1.3.0-e701905
Command used:
gatk GenomicsDBImport --variant $INPUT --genomicsdb-workspace-path $gVCF/$ChrName.$size --intervals $ChrName:$Start-$End --reader-threads $CORES

Where,

$INPUT is the list of HaplotypeCaller gVCF files from 1128 samples.

$ChrName is Cq5A

$size is 1,2,3 ..8 (8 chunks)

$ChrName:$Start-$End is based on this below summary:

Cq5A split into 8 parts and here is the summary.

$ cat Chromosome_distribution.txt | grep Cq5A
Cq5A split into 8 parts
Chromosome name:Cq5A, Chunk number: 1, and Interval(Start:1-End:8993623)
Chromosome name:Cq5A, Chunk number: 2, and Interval(Start:8993624-End:17987246)
Chromosome name:Cq5A, Chunk number: 3, and Interval(Start:17987247-End:26980869)
Chromosome name:Cq5A, Chunk number: 4, and Interval(Start:26980870-End:35974492)
Chromosome name:Cq5A, Chunk number: 5, and Interval(Start:35974493-End:44968115)
Chromosome name:Cq5A, Chunk number: 6, and Interval(Start:44968116-End:53961738)
Chromosome name:Cq5A, Chunk number: 7, and Interval(Start:53961739-End:62955361)
Chromosome name:Cq5A, Chunk number: 8, and Interval(Start:62955362-End:64666259)

Observations:

Chunks 1, 2, 3, 7 and 8 failed (chunks 4, 5 and 6 were successful) on BeeGFS.
Chunks 1, 2, 3, 4 and 7 failed (chunks 5, 6 and 8 were successful) on NFS.
For example:

At BeeGFS:
$ cat Cq5A.chunk_5.GenomicsDBImport.11705873.out
Tool returned:
true

At NFS
$ cat Cq5A.chunk_5.GenomicsDBImport.11739162.out
Tool returned:
true

System environment:

OS: CentOS Linux release 7.7.1908 (Core)
Java version: 1.8.0_242
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

Thanks and Regards,
Naga

ERROR: failed to solve: failed to compute cache key

Building 0.1s (12/13) docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.32kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 34B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:latest 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 40.18kB 0.0s
=> [release 1/6] FROM docker.io/library/ubuntu 0.0s
=> CACHED [full 2/4] COPY . /build/GenomicsDB/ 0.0s
=> CACHED [full 3/4] WORKDIR /build/GenomicsDB 0.0s
=> CACHED [full 4/4] RUN ./scripts/prereqs/install_prereqs.sh full && useradd -r -U -m genomicsdb && ./scripts/install_genomicsdb.sh genomicsdb /tmp/genomicsdb/install true 0.0s
=> CACHED [release 2/6] COPY ./scripts/prereqs /build 0.0s
=> CACHED [release 3/6] WORKDIR /build 0.0s
=> CACHED [release 4/6] RUN ./install_prereqs.sh release && useradd -r -U -m genomicsdb 0.0s
=> ERROR [release 5/6] COPY --from=full /usr/local/bin/genomicsdb /usr/local/bin/gt_mpi_gather /usr/local/bin/vcf* /tmp/genomicsdb/install/bin/ 0.0s

[release 5/6] COPY --from=full /usr/local/bin/genomicsdb /usr/local/bin/gt_mpi_gather /usr/local/bin/vcf* /tmp/genomicsdb/install/bin/:

Dockerfile:57

55 | useradd -r -U -m ${user}
56 |
57 | >>> COPY --from=full /usr/local/bin/genomicsdb /usr/local/bin/gt_mpi_gather /usr/local/bin/vcf* ${install_dir}/bin/
58 |
59 | USER ${user}

ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 667d83a6-2fb1-41e6-a5d0-fe43821cfc6b::2e3shtccixb8ww9htyrm6vzh8: "/usr/local/bin/gt_mpi_gather": not found

Extend query_variants() api to optionally return only shallow, top level variant info

It would be useful for query_variants() to optionally return top-level variant information without the genotypes field. This should make it easy to associate calls from query_variant_calls() back to variant metadata (reference, alternates, etc.) without having to carry along unnecessary, and repetitive genotype information.

Originally posted by @jacmarjorie in #41 (comment)

Excessive logging with --max-alternate-alleles

@mlathara Hello - we're running GATK GenotypeGVCFs with --max-alternate-alleles=6, with a genomicsDB input (~2500 samples). This line:

GenomicsDB/src/main/cpp/src/query_operations/variant_operations.cc

Line 585 in 6f71676

ss << " has too many genotypes in the combined VCF record : ";

logs every single sample (not site) where it exceeds max-alternate-alleles, and in our case it is writing a ridiculous amount to our log. Is there a way to tone this down? Can it log a max of XX times, and then stop? Can we toggle it with some other parameter?

thanks for any ideas.
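The "log a max of XX times, and then stop" idea could be sketched roughly as below. LimitedLogger is a hypothetical helper, not an existing GenomicsDB class, and it collects lines in memory purely so the behaviour is easy to inspect; a real version would forward them to the logger.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of GenomicsDB): keeps the first
// `max_messages` warnings and replaces the rest with a single notice.
class LimitedLogger {
 public:
  explicit LimitedLogger(std::size_t max_messages)
      : max_messages_(max_messages) {}

  // Records the warning and returns true while under the cap; afterwards
  // returns false and only counts the call.
  bool warn(const std::string& message) {
    ++seen_;
    if (seen_ <= max_messages_) {
      lines_.push_back(message);
      return true;
    }
    if (seen_ == max_messages_ + 1) {
      // Emit the suppression notice exactly once.
      lines_.push_back("further warnings of this kind are suppressed");
    }
    return false;
  }

  // Lines that would have reached the log, in order.
  const std::vector<std::string>& emitted() const { return lines_; }

  // Number of warnings that were counted but not written.
  std::size_t suppressed() const {
    return seen_ > max_messages_ ? seen_ - max_messages_ : 0;
  }

 private:
  std::size_t max_messages_;
  std::size_t seen_ = 0;
  std::vector<std::string> lines_;
};
```

The suppressed count could be reported once at the end of the run, so the user still learns how many samples exceeded the limit.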

Query based on attribute filters

This seems to be an oft-requested feature along the lines of "Is there any option to filter by attribute, for example BaseQRankSum > 1.23?"
See Intel-HLS/GenomicsDB#213.

Compile error: ‘bcf_int64_missing’ was not declared in this scope

I have trouble installing GenomicsDB on CentOS Linux (release 7.4.1708).

I ran

cmake ../ -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="/home/riestma1/" -DDISABLE_MPI=1 -DUSE_LIBCSV=0

The output looked good to me:

-- The C compiler identification is GNU 6.4.0
-- The CXX compiler identification is GNU 6.4.0
-- Check for working C compiler: /usr/prog/GCCcore/6.4.0/bin/cc
-- Check for working C compiler: /usr/prog/GCCcore/6.4.0/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/prog/GCCcore/6.4.0/bin/c++
-- Check for working CXX compiler: /usr/prog/GCCcore/6.4.0/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Performing Test test_cpp_2011
-- Performing Test test_cpp_2011 - Success
-- Performing Test test_stack_protector_strong
-- Performing Test test_stack_protector_strong - Success
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Performing Test OPENMPV4_FOUND
-- Performing Test OPENMPV4_FOUND - Success
-- Found libuuid: /usr/include  
-- Found CURL: /lib64/libcurl.so (found version "7.29.0") 
-- Found RapidJSON: /home/riestma1/include  
-- Found htslib: /home/riestma1/git/GenomicsDB/dependencies/htslib  
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7") 
-- Found OpenSSL: /usr/lib64/libssl.so;/usr/lib64/libcrypto.so (found version "1.0.2k") 
-- Found TileDB: /home/riestma1/git/GenomicsDB/dependencies/TileDB/core/include/c_api  
-- Performing Test HAS_STD_CXX11
-- Performing Test HAS_STD_CXX11 - Success
-- Building muparserx version: 4.0.8
-- Using install prefix: /home/riestma1
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.5") 
-- Found OpenMP: -fopenmp  
-- Found JNI: /usr/prog/Java/1.8.0_162/jre/lib/amd64/libjawt.so  
-- Found HDFS: /home/riestma1/git/GenomicsDB/dependencies/TileDB/deps/HDFSWrapper/hadoop-hdfs-native/main/native/libhdfs  
-- Performing Test HAVE_BETTER_TLS
-- Performing Test HAVE_BETTER_TLS - Success
-- Performing Test HAVE_INTEL_SSE_INTRINSICS
-- Performing Test HAVE_INTEL_SSE_INTRINSICS - Success
-- Looking for dlopen in dl
-- Looking for dlopen in dl - found
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Performing Test CXX_2011_FOUND
-- Performing Test CXX_2011_FOUND - Success
-- Compiler supports C++ 2011.
-- The TileDB library is compiled with verbosity.
-- Found Protobuf: /home/riestma1/include  
-- Found PROTOBUF: /home/riestma1/lib/libprotobuf.a  
-- Performing Test PROTOBUF_V3_STABLE_FOUND
-- Performing Test PROTOBUF_V3_STABLE_FOUND - Success
-- Could not find libcsv headers and/or libraries  (missing:  LIBCSV_INCLUDE_DIR LIBCSV_LIBRARY) 
-- Could not find safestring headers and/or libraries  (missing:  SAFESTRINGLIB_INCLUDE_DIR SAFESTRINGLIB_LIBRARY) 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/riestma1/git/GenomicsDB/build

[  7%] Built target htslib
[ 10%] Built target PROTOBUF_GENERATED_CXX_TARGET
[ 10%] Building CXX object src/main/CMakeFiles/GenomicsDB_library_object_files.dir/cpp/src/query_operations/variant_operations.cc.o
In file included from /home/riestma1/git/GenomicsDB/src/main/cpp/include/utils/gt_common.h:29:0,
                 from /home/riestma1/git/GenomicsDB/src/main/cpp/include/genomicsdb/variant.h:27,
                 from /home/riestma1/git/GenomicsDB/src/main/cpp/include/query_operations/variant_operations.h:26,
                 from /home/riestma1/git/GenomicsDB/src/main/cpp/src/query_operations/variant_operations.cc:23:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h: In function ‘T get_bcf_missing_value() [with T = long int]’:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h:83:10: error: ‘bcf_int64_missing’ was not declared in this scope
   return bcf_int64_missing;
          ^~~~~~~~~~~~~~~~~
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h: In function ‘T get_bcf_missing_value() [with T = long unsigned int]’:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h:88:10: error: ‘bcf_int64_missing’ was not declared in this scope
   return bcf_int64_missing;

Seems like the htslib/vcf.h is not included in vcf/vcf.h. But not sure why:

HTSLIB_INCLUDE_DIR:PATH=/home/riestma1/git/GenomicsDB/dependencies/htslib
HTSLIB_INSTALL_DIR:PATH=
HTSLIB_SOURCE_DIR:PATH=/home/riestma1/git/GenomicsDB/dependencies/htslib

Thanks a lot!

Explore using zstd as an optional alternative to using intel zlib from GenomicsDB

Looks like zstd has really good decompress performance and we should at least investigate this.

Extend query_variants and query_variant_calls api to accept filter expressions.

Since the query json allows for filter expressions, we should extend query_variants and query_variant_calls api to accept filter expressions.

Originally posted by @mlathara in #41

GATK GenomicsDBImport - use list as input

I am using GenomicsDBImport, but my input is a file containing a list of files, and I don't know how to pass this to GATK with the -V option. I've been looking for this everywhere, but I can't find this information in the GATK manual.

Missing links to fmt

Hello,

When building, I get failures due to missing fmt links; the enclosed changes solved this for me.

Cheers,
Pierre

missing_links_in_libs.txt

The compatibility of array_schema.cc

I want to run GATK on an ARM machine, and I did the following:

1. Compile the source code of GenomicsDB.
2. Generate the shared library: libtiledbgenomicsdb.so
3. Insert the shared library into the JAR package (gatk-package-4.4.0.0-local.jar) with:
jar uvf gatk-package-4.4.0.0-local.jar libtiledbgenomicsdb.so
4. However, I found a bug when I ran GATK.
5. It turns out that the function deflateInit in libz.so requires its second parameter, "level", to be in the range -1 to 9:
https://stackoverflow.com/questions/42893519/zlib-deflateinit-always-returns-z-stream-error
6. This bug came from the function ArraySchema::deserialize in array_schema.cc:

if (get_version() >= 1L) {
  // char compression_level;
  signed char compression_level;
  for(int i=0; i<=attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&compression_level, buffer + offset, sizeof(char));
    offset += sizeof(char);
    compression_level_.push_back(compression_level);
  }
}
// Load offsets_compression_. Support added in array schema version 2L
if (get_version() >= 2L) {
  // char offsets_compression;
  signed char offsets_compression;
  for(int i=0; i<attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&offsets_compression, buffer + offset, sizeof(char));
    offset += sizeof(char);
    offsets_compression_.push_back(static_cast<int>(offsets_compression));
  }
}
// Load offsets_compression_level_. Support added in array schema version 2L
if (get_version() >= 2L) {
  // char offsets_compression_level;
  signed char offsets_compression_level;
  for(int i=0; i<attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&offsets_compression_level, buffer + offset, sizeof(char));
    offset += sizeof(char);
    offsets_compression_level_.push_back(offsets_compression_level);
  }
}

where compression_level_ is a vector of int and compression_level was a char.
7. The conversion of char to int is ambiguous: char is treated as signed or unsigned depending on the machine, so int(0xff) yields either -1 or 255.
8. When char is replaced by signed char, the bug is fixed.
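The effect described in steps 7 and 8 can be reproduced in isolation. The two functions below are hypothetical names used only for this illustration; each copies the same raw byte into an explicitly signed or unsigned char and widens it to int, which is what plain char does implicitly on platforms where it is signed or unsigned respectively.

```cpp
#include <cstring>

// Reads one raw byte as `signed char` and widens it to int.
// The byte 0xff becomes -1, a valid "default" level for deflateInit.
int widen_as_signed(unsigned char raw) {
  signed char value;
  std::memcpy(&value, &raw, sizeof(value));
  return static_cast<int>(value);
}

// Reads the same raw byte as `unsigned char` and widens it to int.
// The byte 0xff becomes 255, which is outside the -1 to 9 range.
int widen_as_unsigned(unsigned char raw) {
  unsigned char value;
  std::memcpy(&value, &raw, sizeof(value));
  return static_cast<int>(value);
}
```

Declaring the temporary as signed char, as in the snippet above, makes the result independent of the platform's default signedness for char.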

Compilation on Ubuntu jammy fails in aws-cpp-sdk

I tried installing GenomicsDB following the Wiki using an Ubuntu jammy Docker image. Looks like the compilation fails here:


/root/awssdk-install/1.8.x/src/awssdk-build/aws-cpp-sdk-core/source/utils/crypto/openssl/CryptoImpl.cpp:373:31: error: ‘int HMAC_CTX_reset(HMAC_CTX*)’ is deprecated: Since OpenSSL 3.0 [-Werror=deprecated-declarations]
  373 |                 HMAC_CTX_reset(m_ctx);

A workaround (also used in aws-cpp-sdk: aws/aws-sdk-cpp#1582) is setting -Wno-error=deprecated-declarations, but I'm not sure where I have to provide this; my amateurish attempts failed.

Import should automatically increase buffer size as needed

Currently the import tools expect to have a buffer size set that allows reading an entire VCF line (size_per_column_partition: see here).

It would be much more user friendly to have the import process automatically increase the buffer size as needed if the initial buffer size does not fit the input VCF line.
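A minimal sketch of the growth policy, assuming a simple doubling strategy: ensure_capacity is a hypothetical helper, not the import tool's actual buffer management, and a real implementation would probably also cap the maximum size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: grows `buffer` by doubling until it can hold
// `needed` bytes, and returns the resulting size. A buffer that is
// already large enough is left untouched.
std::size_t ensure_capacity(std::vector<char>& buffer, std::size_t needed) {
  std::size_t capacity = buffer.empty() ? 1 : buffer.size();
  while (capacity < needed) {
    capacity *= 2;  // Geometric growth keeps the number of resizes small.
  }
  if (capacity != buffer.size()) {
    buffer.resize(capacity);
  }
  return buffer.size();
}
```

The importer would call this with the length of the VCF line that did not fit and then retry the read, instead of failing and asking the user to raise size_per_column_partition.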

Support MacOS/Linux arm64 architecture

gatk forum users have started requesting support for MacOS and Linux arm64 architectures - see forum post.

Ambiguous call to Assert in the tests

Hello,

In src/test/java/org/genomicsdb/reader/GenomicsDBQueryTest.java in version 1.5.1, there is the call
Assert.assertEquals(intervals.get(0).calls.size(), expected_calls[i]);
which is ambiguous as both method assertEquals(java.lang.Object,java.lang.Object) in org.testng.Assert and method assertEquals(long,long) in org.testng.Assert match. This raises an error, at least with openjdk-17.

Explicitly converting Long to long with
Assert.assertEquals(intervals.get(0).calls.size(), expected_calls[i].longValue());
solves this.

Best,

Pierre

Api should allow for reference genome be optional

See #46

Referencing Issue "CSV Data Import in GenomicsDB" in new repository.

Referencing Issue 227 as development has moved here.

string_view needs c++-17

Hello,

I am packaging GenomicsDB for Debian; I had to make the enclosed change, as string_view needs C++17, not C++14.

Cheers,

Pierre

string_view_needs_cxx17.txt

Move away from protobuf-v3.0.0-beta-1 and into later versions for GenomicsDB

We were stuck with protobuf-v3.0.0-beta-1 because of gatk, but now that gatk has moved to using 3.8.0 of protobuf, genomicsdb can move to this version as well.

Also allow for building protobuf from the cmake Modules, so users do not have to figure out yet another prerequisite to build/install.

Couldn't create GenomicsDBFeatureReader

When I executed the command
gatk --java-options "-Xmx4g" GenotypeGVCFs
-R ${ref}
-V gendb://chr1_database
--tmp-dir=${temp_dir}
-O DB76.chr1.vcf.gz
the problem always occurred with the message "[TileDB::StorageManager] Error: Cannot lock consolidation filelock; Cannot lock ....... A USER ERROR has occurred: Couldn't create GenomicsDBFeatureReader".
I have been confused by this problem for a month. Can anyone help? Thanks so much!

Github action sporadic failure with spark

Some builds fail with sparkDriver not able to find a random free port. Most of the time, restarting the build works. But just wondering if you can take a look.

2021-04-11T18:38:04.1426794Z Sanity test with query.json : spark-submit --master local --deploy-mode client --total-executor-cores 1 --executor-memory 512M  --conf "spark.executor.extraJavaOptions=" --conf "spark.driver.extraJavaOptions=" --class org.genomicsdb.spark.api.GenomicsDBSparkBindings /home/runner/work/GenomicsDB/GenomicsDB/build/target/genomicsdb-x.y.z.test-allinone.jar /tmp/tmp69y8hzf7/sanity_test/t0_1_2.json /tmp/tmp69y8hzf7/sanity_test/query.json failed
2021-04-11T18:38:04.1429971Z Stdout:
2021-04-11T18:38:04.1431279Z Stderr: 21/04/11 18:38:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-04-11T18:38:04.1432436Z 21/04/11 18:38:03 INFO SparkContext: Running Spark version 3.0.1
2021-04-11T18:38:04.1433443Z 21/04/11 18:38:03 INFO ResourceUtils: ==============================================================
2021-04-11T18:38:04.1437692Z 21/04/11 18:38:03 INFO ResourceUtils: Resources for spark.driver:
2021-04-11T18:38:04.1438166Z
2021-04-11T18:38:04.1438643Z 21/04/11 18:38:03 INFO ResourceUtils: ==============================================================
2021-04-11T18:38:04.1439513Z 21/04/11 18:38:03 INFO SparkContext: Submitted application: GenomicsDB API Experimental Bindings
2021-04-11T18:38:04.1440442Z 21/04/11 18:38:03 INFO SecurityManager: Changing view acls to: runner
2021-04-11T18:38:04.1441398Z 21/04/11 18:38:03 INFO SecurityManager: Changing modify acls to: runner
2021-04-11T18:38:04.1442177Z 21/04/11 18:38:03 INFO SecurityManager: Changing view acls groups to:
2021-04-11T18:38:04.1442970Z 21/04/11 18:38:03 INFO SecurityManager: Changing modify acls groups to:
2021-04-11T18:38:04.1444518Z 21/04/11 18:38:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(runner); groups with view permissions: Set(); users  with modify permissions: Set(runner); groups with modify permissions: Set()
2021-04-11T18:38:04.1446748Z 21/04/11 18:38:04 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
...
2021-04-11T18:38:04.1456342Z 21/04/11 18:38:04 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
2021-04-11T18:38:04.1472401Z java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
2021-04-11T18:38:04.1475406Z  at sun.nio.ch.Net.bind0(Native Method)
2021-04-11T18:38:04.1476149Z  at sun.nio.ch.Net.bind(Net.java:461)
2021-04-11T18:38:04.1476956Z  at sun.nio.ch.Net.bind(Net.java:453)
...

Allow for multiple inputs(bed, etc.) to be ingested into GenomicsDB

Currently, only vcf files can be ingested. Refactor code to have an api to ingest from multiple sources. This api should be made callable from Java/Python/R bindings as well.

GenomicsDB IO exception has very limited debugging information

Originally reported in #6745 in GATK. When running a test, the reporting user got a GenomicsDB JNI error, thrown by jniGenomicsDBInit() in GenomicsDBQueryStream.java.

Looking at the C++ code for the native method in genomicsdb_GenomicsDBQueryStream.cc I find that the method handleJNIException() is called. However, that method takes the generic class std::exception, and the stack trace reports only std::exception, rather than a specific exception.

Is there a way to determine the specific exception type here?
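One possible approach, assuming the handler receives the exception by reference: typeid applied to a reference to a polymorphic class reports the dynamic type, so the handler can log it alongside what(). StorageWriteError and describe_exception below are hypothetical names for this sketch, and the string returned by name() is implementation-defined (mangled on GCC and Clang).

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <typeinfo>

// Hypothetical exception type standing in for a specific GenomicsDB error.
struct StorageWriteError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Builds "<dynamic type name>: <what()>" from a base-class reference.
// This only works when the exception is caught and passed by reference;
// catching std::exception by value would slice away the derived type.
std::string describe_exception(const std::exception& e) {
  return std::string(typeid(e).name()) + ": " + e.what();
}
```

If the name needs to be human readable, the mangled form can be passed through a demangler, but that step is compiler-specific and is left out here.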

Changes to JsonFormat methods when building with protobuf-java-format 1.3 instead of 1.2

Hello,

When building GenomicsDB in Debian unstable, I had to make the enclosed changes related to JsonFormat methods.

Cheers,
Pierre

JsonFormat_methods.txt

Python library?

Hi, is there some Python interface for GenomicsDB?

ulimit issue isn't obvious from error message

Hit this today and took a while to figure it out. I was using vcf2tiledb to load some vcfs into genomicsdb arrays, and kept running into an error of the sort:

VCF2BinaryException : Could not open file : could not load index (VCF/BCF files must be block compressed and indexed)

And both the vcf and index clearly did exist. Took a while (and some help) to figure out that ulimit wasn't high enough and number of open file descriptors was the problem. Would be very helpful to change the error message somehow to hint at the problem.
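A hint like this could be produced on POSIX systems by checking errno after the failed open and querying the descriptor limit with getrlimit. The sketch below assumes errno is still EMFILE or ENFILE at the point where the message is built, which may not hold once the failure has passed through htslib; fd_limit_hint is a hypothetical helper.

```cpp
#include <sys/resource.h>

#include <cerrno>
#include <string>

// Hypothetical helper: returns text to append to a "could not open file"
// message when the saved errno indicates descriptor exhaustion, and an
// empty string for every other error.
std::string fd_limit_hint(int saved_errno) {
  if (saved_errno != EMFILE && saved_errno != ENFILE) {
    return "";
  }
  struct rlimit limit {};
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0) {
    return " (too many open files)";
  }
  // rlim_cur is the soft limit, the one adjusted by `ulimit -n`.
  return " (too many open files; soft limit is " +
         std::to_string(limit.rlim_cur) +
         ", consider raising it with ulimit -n)";
}
```

The caller would capture errno immediately after the failing call and append the returned hint to the existing VCF2BinaryException message.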

Add incremental support to GenomicsDBImport

See broadinstitute/gatk#4773.

Will have to persist the callset.json files at the workspace/array/fragment level in GenomicsDB as opposed to the workspace level as it is currently done.

Porting to gcc-13

Hello,

I tried to build GenomicsDB with gcc-13, and the good news is that only a few changes are needed!

Specifically, one has to add
#include <cstdint>
in

src/main/cpp/include/utils/headers.h
core/src/codec/codec_filter_delta_encode.cc in the tiledb repo

This is so that the compiler knows about the uint8_t, uint32_t, ... types.

Best,
Pierre
