genomicsdb / genomicsdb

High performance data storage for importing, querying and transforming variants.

Home Page: https://genomicsdb.readthedocs.io

License: Other

Languages: CMake 2.79%, C 0.36%, Dockerfile 0.08%, Shell 2.46%, Python 4.89%, Java 19.22%, C++ 69.60%, Scala 0.60%
Topics: bioinformatics, cpp, gatk, genomics, genomicsdb, java, mpi, precision-medicine, scala, spark, variant, variant-calling

genomicsdb's Introduction


GenomicsDB is built on top of a fork of htslib and a tile-based array storage system for importing, querying and transforming variant data. Variant data is sparse by nature (sparse relative to the whole genome), so a sparse array data store is a natural fit. GenomicsDB itself is a high-performance, scalable storage engine written in C++. See genomicsdb.readthedocs.io for documentation and usage.

  • Supported platforms: Linux and macOS.
  • Supported filesystems: POSIX, HDFS, EMRFS (S3), GCS and Azure Blob.

Included are:

  • JVM/Spark wrappers that allow, among other functions, streaming VariantContext buffers to and from the C++ layer. GenomicsDB jars with native libraries and only a zlib dependency are regularly published on Maven Central.
  • Native tools for performant, incremental ingestion of variants from VCF/BCF/CSV into GenomicsDB.
  • MPI and Spark support for parallel querying of GenomicsDB.

GenomicsDB is packaged into GATK4 and benefits from its large user base.

External Contributions

GenomicsDB is open source and all participation is welcome. GenomicsDB is released under the MIT License and all external contributors are expected to grant an MIT License for their contributions.

Checklist before creating Pull Request

Please ensure that the code is well documented, in Javadoc style for Java/Scala. For Java/C/C++ code formatting, roughly adhere to the Google Style Guides; see the GenomicsDB Style Guide.
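For orientation only, here is what a comment in that rough style might look like on a C++ declaration; the function and its parameters are hypothetical and not part of the GenomicsDB API:

#include <cstdint>
#include <string>

// Returns the number of variant calls overlapping the half-open query
// interval [start, end).
//
// Args:
//   array_name: name of the GenomicsDB array to query.
//   start: 1-based inclusive start of the query interval.
//   end: exclusive end of the query interval.
uint64_t CountOverlappingCalls(const std::string& array_name,
                               int64_t start, int64_t end);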

genomicsdb's People

Contributors

aoblebea, cpjagan, danking, dependabot[bot], eneskuluk, francares, gitmach, hillsd, jackgoldsmith4, jacmarjorie, jakebolewski, jeffhammond, joshblum, kdatta, kgururaj, lbergelson, luszczek, mingrutar, mishalinaik, mlathara, nalinigans, paolonarvaez, psfoley, raonyguimaraes, shreyas-muralidhara, stavrospapadopoulos


genomicsdb's Issues

Maven wget url invalid

When running install_prereqs.sh, the Maven installation step tries to wget from the following URL:
https://www-us.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz. However, this leads to the error wget: unable to resolve host address 'www-us.apache.org'. The host www-us.apache.org is no longer valid. Should this URL be updated?

Refine gt_mpi_gather and other tools

GenomicsDB tools should not throw exceptions or core dump on expected issues; rather, they should exit gracefully with meaningful messages.
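A minimal sketch of the desired behavior, assuming a hypothetical run_tool() entry point rather than the actual gt_mpi_gather code:

#include <exception>
#include <iostream>

int run_tool(int argc, char** argv);  // hypothetical tool entry point

int main(int argc, char** argv) {
  try {
    return run_tool(argc, argv);
  } catch (const std::exception& e) {
    // An expected failure becomes a clean message and a non-zero exit code
    // instead of an uncaught exception and a core dump.
    std::cerr << "gt_mpi_gather: error: " << e.what() << std::endl;
    return 1;
  }
}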

plink support has introduced mpi dependency in test_gen.cc

The C unit tests are meant to test the basic functionality exposed by <GenomicsDB>/src/main/cpp and libtiledbgenomicsdb.so, so they should not have any MPI dependencies. gt_mpi_gather can still be used to exercise bgen support as a functional test from either run.py or test_tools.sh in <GenomicsDB>/tests, but test_bgen should only test the functionality exposed by libtiledbgenomicsdb.so.

protobuf version

Hi,
I tried to follow the instructions by first building protobuf 3.0.2, but got the error "google/protobuf/port_def.inc: No such file or directory". I then ran git checkout 3.7.x inside the protobuf checkout, repeated the installation, and it succeeded.
Regards,
Shubham

Dependencies missing in the pom

Hello,

The dependencies commons-io:commons-io, com.google.guava:guava and org.apache.commons:commons-lang3 (test scope only for the last one) are missing from pom.xml, although they are needed by some files in the code.

Cheers,
Pierre

[TileDB::FileSystem] Error and FeatureReader to initialize with exception Error

Dear team,

GenomicsDB failed (same command and dataset) with different errors on two file systems.
Summary is as follows:

1. BeeGFS file system:

$ cat Cq5A.chunk_1.GenomicsDBImport.11705869.err | tail -13
01:08:38.145 INFO GenomicsDBImport - Starting batch input file preload
01:09:38.916 INFO GenomicsDBImport - Finished batch preload
01:09:38.916 INFO GenomicsDBImport - Importing batch 1 with 564 samples
[TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error; path=/ibex/scratch/projects/c2071/1000quinoa/naga/1128_Phase2/RESULTS/gVCF/Cq5A.1/Cq5A$1$8993623/__06ef12cd-3e3e-445b-8a17-78d392d6e35347261215135488_1600294179899/ExcessHet.tdb; errno=121(Remote I/O error)
[TileDB::WriteState] Error: Cannot write segment to file.
02:44:19.847 erro NativeGenomicsDB - pid=208539 tid=208928 VariantStorageManagerException exception : Error while writing to TileDB array
TileDB error message : [TileDB::WriteState] Error: Cannot write segment to file
[TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error; path=/ibex/scratch/projects/c2071/1000quinoa/naga/1128_Phase2/RESULTS/gVCF/Cq5A.1/Cq5A$1$8993623/__06ef12cd-3e3e-445b-8a17-78d392d6e35347261215135488_1600294179899/ExcessHet.tdb; errno=121(Remote I/O error)
terminate called after throwing an instance of 'std::exception'
what(): std::exception

2. NFS file system

$ cat Cq5A.chunk_1.GenomicsDBImport.11739158.err | tail -13
11:49:12.994 INFO GenomicsDBImport - Starting batch input file preload
11:52:16.216 INFO GenomicsDBImport - Shutting down engine
[September 21, 2020 11:52:16 AM AST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 8.28 minutes.
Runtime.totalMemory()=19555942400

A USER ERROR has occurred: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize with exception: org.broadinstitute.hellbender.exceptions.UserException: Failed to create reader from file:///ibex/scratch/projects/c2071/1000quinoa/OUT/Sept2020/VCF/S7E1_batch2.snps.indels.g.vcf.gz

Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

Note:

  1. The TILEDB_DISABLE_FILE_LOCKING=1 environment variable was exported before running GenomicsDB.
  2. The same error seen on NFS was also observed on the BeeGFS file system when running GenomicsDB with different chromosome names/chunks. I used to overcome this issue

Background:

GATK version: 4.1.8
GenomicsDB native library version: 1.3.0-e701905
Command used:
gatk GenomicsDBImport --variant $INPUT --genomicsdb-workspace-path $gVCF/$ChrName.$size --intervals $ChrName:$Start-$End --reader-threads $CORES

Where,

$INPUT is a list of HaplotypeCaller gVCF files from 1128 samples.

$ChrName is Cq5A

$size is 1, 2, 3, …, 8 (8 chunks)

$ChrName:$Start-$End is based on this below summary:

Cq5A split into 8 parts and here is the summary.

$ cat Chromosome_distribution.txt | grep Cq5A
Cq5A split into 8 parts
Chromosome name:Cq5A, Chunk number: 1, and Interval(Start:1-End:8993623)
Chromosome name:Cq5A, Chunk number: 2, and Interval(Start:8993624-End:17987246)
Chromosome name:Cq5A, Chunk number: 3, and Interval(Start:17987247-End:26980869)
Chromosome name:Cq5A, Chunk number: 4, and Interval(Start:26980870-End:35974492)
Chromosome name:Cq5A, Chunk number: 5, and Interval(Start:35974493-End:44968115)
Chromosome name:Cq5A, Chunk number: 6, and Interval(Start:44968116-End:53961738)
Chromosome name:Cq5A, Chunk number: 7, and Interval(Start:53961739-End:62955361)
Chromosome name:Cq5A, Chunk number: 8, and Interval(Start:62955362-End:64666259)

Observations:

  1. Chunks 1, 2, 3, 7 and 8 failed (chunks 4, 5 and 6 succeeded) on BeeGFS.
  2. Chunks 1, 2, 3, 4 and 7 failed (chunks 5, 6 and 8 succeeded) on NFS.
  3. For example:

At BeeGFS:
$ cat Cq5A.chunk_5.GenomicsDBImport.11705873.out
Tool returned:
true

At NFS
$ cat Cq5A.chunk_5.GenomicsDBImport.11739162.out
Tool returned:
true

System environment:

OS: CentOS Linux release 7.7.1908 (Core)
Java version: 1.8.0_242
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

Thanks and Regards,
Naga

ERROR: failed to solve: failed to compute cache key

Building 0.1s (12/13) docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.32kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 34B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:latest 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 40.18kB 0.0s
=> [release 1/6] FROM docker.io/library/ubuntu 0.0s
=> CACHED [full 2/4] COPY . /build/GenomicsDB/ 0.0s
=> CACHED [full 3/4] WORKDIR /build/GenomicsDB 0.0s
=> CACHED [full 4/4] RUN ./scripts/prereqs/install_prereqs.sh full && useradd -r -U -m genomicsdb && ./scripts/install_genomicsdb.sh genomicsdb /tmp/genomicsdb/install true 0.0s
=> CACHED [release 2/6] COPY ./scripts/prereqs /build 0.0s
=> CACHED [release 3/6] WORKDIR /build 0.0s
=> CACHED [release 4/6] RUN ./install_prereqs.sh release && useradd -r -U -m genomicsdb 0.0s
=> ERROR [release 5/6] COPY --from=full /usr/local/bin/genomicsdb /usr/local/bin/gt_mpi_gather /usr/local/bin/vcf* /tmp/genomicsdb/install/bin/ 0.0s

[release 5/6] COPY --from=full /usr/local/bin/genomicsdb /usr/local/bin/gt_mpi_gather /usr/local/bin/vcf* /tmp/genomicsdb/install/bin/:


Dockerfile:57

55 | useradd -r -U -m ${user}
56 |
57 | >>> COPY --from=full /usr/local/bin/genomicsdb /usr/local/bin/gt_mpi_gather /usr/local/bin/vcf* ${install_dir}/bin/
58 |
59 | USER ${user}

ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 667d83a6-2fb1-41e6-a5d0-fe43821cfc6b::2e3shtccixb8ww9htyrm6vzh8: "/usr/local/bin/gt_mpi_gather": not found

Excessive logging with --max-alternate-alleles

@mlathara Hello - we're running GATK GenotypeGVCFs with --max-alternate-alleles=6 against a GenomicsDB input (~2500 samples). This line:

ss << " has too many genotypes in the combined VCF record : ";

logs for every single sample (not site) that exceeds max-alternate-alleles, and in our case it writes a ridiculous amount to our log. Is there a way to tone this down? Can it log a max of XX times and then stop? Can we toggle it with some other parameter?

thanks for any ideas.
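One possible mitigation, sketched here purely as an illustration and not as the actual GenomicsDB logging code, is to cap how many times the warning is emitted:

#include <atomic>
#include <iostream>
#include <string>

// Illustrative cap; kMaxWarnings is a made-up limit, not a GenomicsDB option.
constexpr int kMaxWarnings = 100;

void warn_too_many_genotypes(const std::string& sample_name) {
  static std::atomic<int> warning_count{0};
  int n = ++warning_count;
  if (n < kMaxWarnings) {
    std::cerr << sample_name
              << " has too many genotypes in the combined VCF record\n";
  } else if (n == kMaxWarnings) {
    std::cerr << "Suppressing further 'too many genotypes' warnings\n";
  }
}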

Compile error: ‘bcf_int64_missing’ was not declared in this scope

I have trouble installing GenomicsDB on CentOS Linux (release 7.4.1708).

I ran

cmake ../ -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="/home/riestma1/" -DDISABLE_MPI=1 -DUSE_LIBCSV=0

The output looked good to me:

-- The C compiler identification is GNU 6.4.0
-- The CXX compiler identification is GNU 6.4.0
-- Check for working C compiler: /usr/prog/GCCcore/6.4.0/bin/cc
-- Check for working C compiler: /usr/prog/GCCcore/6.4.0/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/prog/GCCcore/6.4.0/bin/c++
-- Check for working CXX compiler: /usr/prog/GCCcore/6.4.0/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Performing Test test_cpp_2011
-- Performing Test test_cpp_2011 - Success
-- Performing Test test_stack_protector_strong
-- Performing Test test_stack_protector_strong - Success
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Performing Test OPENMPV4_FOUND
-- Performing Test OPENMPV4_FOUND - Success
-- Found libuuid: /usr/include  
-- Found CURL: /lib64/libcurl.so (found version "7.29.0") 
-- Found RapidJSON: /home/riestma1/include  
-- Found htslib: /home/riestma1/git/GenomicsDB/dependencies/htslib  
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7") 
-- Found OpenSSL: /usr/lib64/libssl.so;/usr/lib64/libcrypto.so (found version "1.0.2k") 
-- Found TileDB: /home/riestma1/git/GenomicsDB/dependencies/TileDB/core/include/c_api  
-- Performing Test HAS_STD_CXX11
-- Performing Test HAS_STD_CXX11 - Success
-- Building muparserx version: 4.0.8
-- Using install prefix: /home/riestma1
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.5") 
-- Found OpenMP: -fopenmp  
-- Found JNI: /usr/prog/Java/1.8.0_162/jre/lib/amd64/libjawt.so  
-- Found HDFS: /home/riestma1/git/GenomicsDB/dependencies/TileDB/deps/HDFSWrapper/hadoop-hdfs-native/main/native/libhdfs  
-- Performing Test HAVE_BETTER_TLS
-- Performing Test HAVE_BETTER_TLS - Success
-- Performing Test HAVE_INTEL_SSE_INTRINSICS
-- Performing Test HAVE_INTEL_SSE_INTRINSICS - Success
-- Looking for dlopen in dl
-- Looking for dlopen in dl - found
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Performing Test CXX_2011_FOUND
-- Performing Test CXX_2011_FOUND - Success
-- Compiler supports C++ 2011.
-- The TileDB library is compiled with verbosity.
-- Found Protobuf: /home/riestma1/include  
-- Found PROTOBUF: /home/riestma1/lib/libprotobuf.a  
-- Performing Test PROTOBUF_V3_STABLE_FOUND
-- Performing Test PROTOBUF_V3_STABLE_FOUND - Success
-- Could not find libcsv headers and/or libraries  (missing:  LIBCSV_INCLUDE_DIR LIBCSV_LIBRARY) 
-- Could not find safestring headers and/or libraries  (missing:  SAFESTRINGLIB_INCLUDE_DIR SAFESTRINGLIB_LIBRARY) 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/riestma1/git/GenomicsDB/build
[  7%] Built target htslib
[ 10%] Built target PROTOBUF_GENERATED_CXX_TARGET
[ 10%] Building CXX object src/main/CMakeFiles/GenomicsDB_library_object_files.dir/cpp/src/query_operations/variant_operations.cc.o
In file included from /home/riestma1/git/GenomicsDB/src/main/cpp/include/utils/gt_common.h:29:0,
                 from /home/riestma1/git/GenomicsDB/src/main/cpp/include/genomicsdb/variant.h:27,
                 from /home/riestma1/git/GenomicsDB/src/main/cpp/include/query_operations/variant_operations.h:26,
                 from /home/riestma1/git/GenomicsDB/src/main/cpp/src/query_operations/variant_operations.cc:23:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h: In function ‘T get_bcf_missing_value() [with T = long int]’:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h:83:10: error: ‘bcf_int64_missing’ was not declared in this scope
   return bcf_int64_missing;
          ^~~~~~~~~~~~~~~~~
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h: In function ‘T get_bcf_missing_value() [with T = long unsigned int]’:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h:88:10: error: ‘bcf_int64_missing’ was not declared in this scope
   return bcf_int64_missing;

It seems that htslib/vcf.h is not being included by vcf/vcf.h, but I am not sure why:

HTSLIB_INCLUDE_DIR:PATH=/home/riestma1/git/GenomicsDB/dependencies/htslib
HTSLIB_INSTALL_DIR:PATH=
HTSLIB_SOURCE_DIR:PATH=/home/riestma1/git/GenomicsDB/dependencies/htslib

Thanks a lot!
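For reference, a hedged sketch of one way the 64-bit specialization could be guarded so the template header still compiles against an older bundled htslib fork. This is not the actual GenomicsDB vcf.h; HTS_VERSION is used as a stand-in version check on the assumption that it and bcf_int64_missing were introduced around the same htslib release:

#include <cstdint>
#include <htslib/vcf.h>

template<class T> T get_bcf_missing_value();

template<> inline int32_t get_bcf_missing_value<int32_t>() {
  return bcf_int32_missing;
}

// bcf_int64_missing only exists in sufficiently recent htslib, so guard the
// 64-bit specialization to keep older forks compiling.
#ifdef HTS_VERSION
template<> inline int64_t get_bcf_missing_value<int64_t>() {
  return bcf_int64_missing;
}
#endif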

GATK GenomicsDBImport - use list as input

I am using GenomicsDBImport, but my input is a file containing a list of files, and I don't know how to pass it to GATK with the -V option. I've been looking for this everywhere, but I can't find this information in the GATK manual.

The compatibility of array_schema.cc

I want to run GATK on an ARM machine, and I did the following:

  1. compiled the source code of GenomicsDB
  2. generated the shared library libtiledbgenomicsdb.so
  3. inserted the shared library into the JAR package (gatk-package-4.4.0.0-local.jar) with:
    jar uvf gatk-package-4.4.0.0-local.jar libtiledbgenomicsdb.so
  4. hit a bug when running GATK (screenshot not reproduced here)
  5. It turns out that the function deflateInit in libz.so requires its second parameter, "level", to be in the range -1 to 9:
    https://stackoverflow.com/questions/42893519/zlib-deflateinit-always-returns-z-stream-error
  6. The bug comes from the function ArraySchema::deserialize in array_schema.cc:
if (get_version() >= 1L) {
  // char compression_level;
  signed char compression_level;
  for(int i=0; i<=attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&compression_level, buffer + offset, sizeof(char));
    offset += sizeof(char);
    compression_level_.push_back(compression_level);
  }
}
// Load offsets_compression_. Support added in array schema version 2L
if (get_version() >= 2L) {
  // char offsets_compression;
  signed char offsets_compression;
  for(int i=0; i<attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&offsets_compression, buffer + offset, sizeof(char));
    offset += sizeof(char);
    offsets_compression_.push_back(static_cast<int>(offsets_compression));
  }
}
// Load offsets_compression_level_. Support added in array schema version 2L
if (get_version() >= 2L) {
  // char offsets_compression_level;
  signed char offsets_compression_level;
  for(int i=0; i<attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&offsets_compression_level, buffer + offset, sizeof(char));
    offset += sizeof(char);
    offsets_compression_level_.push_back(offsets_compression_level);
  }
}

where compression_level_ is a std::vector<int> and compression_level was originally declared as plain char.
7. The conversion from char to int is ambiguous: plain char is signed on some machines and unsigned on others, so int(0xff) comes out as -1 or 255 depending on the platform.
8. Replacing char with signed char fixes the bug.
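The platform dependence is easy to reproduce in isolation; here is a self-contained demo of the ambiguity described above, independent of the TileDB code:

#include <iostream>

int main() {
  char plain = '\xff';         // signed or unsigned: implementation-defined
  signed char fixed = '\xff';  // always signed
  // Prints "-1 -1" where plain char is signed (typical x86 Linux) but
  // "255 -1" where it is unsigned (typical ARM Linux ABIs).
  std::cout << static_cast<int>(plain) << " "
            << static_cast<int>(fixed) << std::endl;
  return 0;
}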

Compilation on Ubuntu jammy fails in aws-cpp-sdk

I tried installing GenomicsDB following the wiki in an Ubuntu jammy Docker image. The compilation fails here:


/root/awssdk-install/1.8.x/src/awssdk-build/aws-cpp-sdk-core/source/utils/crypto/openssl/CryptoImpl.cpp:373:31: error: ‘int HMAC_CTX_reset(HMAC_CTX*)’ is deprecated: Since OpenSSL 3.0 [-Werror=deprecated-declarations]
  373 |                 HMAC_CTX_reset(m_ctx);

A workaround (also used in aws-cpp-sdk; see aws/aws-sdk-cpp#1582) is to set -Wno-error=deprecated-declarations, but I'm not sure where I have to provide this; my amateurish attempts failed.

Import should automatically increase buffer size as needed

Currently the import tools expect a buffer size large enough to hold an entire VCF line (size_per_column_partition; see here).

It would be much more user friendly for the import process to automatically grow the buffer when the initial size does not fit the input VCF line.
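A sketch of the suggested behavior under stated assumptions: read_vcf_line() is hypothetical, not the actual GenomicsDB import API.

#include <vector>

// Hypothetical reader: returns false when the buffer is too small to hold
// the current VCF line.
bool read_vcf_line(std::vector<char>& buffer);

void read_line_with_growth(std::vector<char>& buffer) {
  // Grow geometrically until the line fits, instead of failing on the
  // user's initial size_per_column_partition setting. Assumes a non-empty
  // starting buffer.
  while (!read_vcf_line(buffer)) {
    buffer.resize(buffer.size() * 2);
  }
}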

Ambiguous call to Assert in the tests

Hello,

In src/test/java/org/genomicsdb/reader/GenomicsDBQueryTest.java in version 1.5.1, there is the call
Assert.assertEquals(intervals.get(0).calls.size(), expected_calls[i]);
which is ambiguous: both assertEquals(java.lang.Object, java.lang.Object) and assertEquals(long, long) in org.testng.Assert match. This raises an error, at least with OpenJDK 17.

Explicitly converting the Long to long with
Assert.assertEquals(intervals.get(0).calls.size(), expected_calls[i].longValue());
solves this.

Best,

Pierre

Couldn't create GenomicsDBFeatureReader

When I executed the command

gatk --java-options "-Xmx4g" GenotypeGVCFs \
  -R ${ref} \
  -V gendb://chr1_database \
  --tmp-dir=${temp_dir} \
  -O DB76.chr1.vcf.gz

it always failed with the message "[TileDB::StorageManager] Error: Cannot lock consolidation filelock; Cannot lock ....... A USER ERROR has occurred: Couldn't create GenomicsDBFeatureReader".
I have been confused by this problem for a month. Can anyone help? Thanks so much!

Github action sporadic failure with spark

Some builds fail because sparkDriver cannot bind to a random free port. Most of the time, restarting the build works, but just wondering if you can take a look.

2021-04-11T18:38:04.1426794Z Sanity test with query.json : spark-submit --master local --deploy-mode client --total-executor-cores 1 --executor-memory 512M  --conf "spark.executor.extraJavaOptions=" --conf "spark.driver.extraJavaOptions=" --class org.genomicsdb.spark.api.GenomicsDBSparkBindings /home/runner/work/GenomicsDB/GenomicsDB/build/target/genomicsdb-x.y.z.test-allinone.jar /tmp/tmp69y8hzf7/sanity_test/t0_1_2.json /tmp/tmp69y8hzf7/sanity_test/query.json failed
2021-04-11T18:38:04.1429971Z Stdout:
2021-04-11T18:38:04.1431279Z Stderr: 21/04/11 18:38:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-04-11T18:38:04.1432436Z 21/04/11 18:38:03 INFO SparkContext: Running Spark version 3.0.1
2021-04-11T18:38:04.1433443Z 21/04/11 18:38:03 INFO ResourceUtils: ==============================================================
2021-04-11T18:38:04.1437692Z 21/04/11 18:38:03 INFO ResourceUtils: Resources for spark.driver:
2021-04-11T18:38:04.1438166Z
2021-04-11T18:38:04.1438643Z 21/04/11 18:38:03 INFO ResourceUtils: ==============================================================
2021-04-11T18:38:04.1439513Z 21/04/11 18:38:03 INFO SparkContext: Submitted application: GenomicsDB API Experimental Bindings
2021-04-11T18:38:04.1440442Z 21/04/11 18:38:03 INFO SecurityManager: Changing view acls to: runner
2021-04-11T18:38:04.1441398Z 21/04/11 18:38:03 INFO SecurityManager: Changing modify acls to: runner
2021-04-11T18:38:04.1442177Z 21/04/11 18:38:03 INFO SecurityManager: Changing view acls groups to:
2021-04-11T18:38:04.1442970Z 21/04/11 18:38:03 INFO SecurityManager: Changing modify acls groups to:
2021-04-11T18:38:04.1444518Z 21/04/11 18:38:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(runner); groups with view permissions: Set(); users  with modify permissions: Set(runner); groups with modify permissions: Set()
2021-04-11T18:38:04.1446748Z 21/04/11 18:38:04 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
...
2021-04-11T18:38:04.1456342Z 21/04/11 18:38:04 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
2021-04-11T18:38:04.1472401Z java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
2021-04-11T18:38:04.1475406Z  at sun.nio.ch.Net.bind0(Native Method)
2021-04-11T18:38:04.1476149Z  at sun.nio.ch.Net.bind(Net.java:461)
2021-04-11T18:38:04.1476956Z  at sun.nio.ch.Net.bind(Net.java:453)
...

GenomicsDB IO exception has very limited debugging information

Originally reported in GATK as #6745. When running a test, the reporting user got a GenomicsDB JNI error, thrown by jniGenomicsDBInit() in GenomicsDBQueryStream.java.

Looking at the C++ code for the native method in genomicsdb_GenomicsDBQueryStream.cc, I find that the method handleJNIException() is called. However, that method takes the generic class std::exception, and the stack trace reports only std::exception rather than a specific exception.

Is there a way to determine the specific exception type here?
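One way a JNI shim could surface the dynamic type; this is a sketch only, since the actual handleJNIException() signature may differ:

#include <exception>
#include <string>
#include <typeinfo>

// Catching by reference preserves the dynamic type, so typeid reports the
// concrete (if compiler-mangled) exception class instead of std::exception.
std::string describe_exception(const std::exception& e) {
  return std::string(typeid(e).name()) + ": " + e.what();
}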

ulimit issue isn't obvious from error message

I hit this today and it took a while to figure out. I was using vcf2tiledb to load some VCFs into GenomicsDB arrays, and kept running into an error of the sort:

VCF2BinaryException : Could not open file : could not load index (VCF/BCF files must be block compressed and indexed)

Both the VCF and its index clearly existed. It took a while (and some help) to figure out that ulimit was not high enough and the number of open file descriptors was the problem. It would be very helpful to change the error message to hint at the actual problem.
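An illustrative check (not the actual VCF2Binary code) that would make the root cause visible by inspecting errno at the point of failure:

#include <cerrno>
#include <stdexcept>
#include <string>

void throw_index_load_error(const std::string& path) {
  std::string msg = "Could not open file " + path +
      " : could not load index (VCF/BCF files must be block compressed and indexed)";
  // EMFILE/ENFILE indicate the process or system is out of file
  // descriptors, which is exactly the ulimit problem described above.
  if (errno == EMFILE || errno == ENFILE)
    msg += "; too many open files - consider raising 'ulimit -n'";
  throw std::runtime_error(msg);
}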

Porting to gcc-13

Hello,

I tried to build GenomicsDB with gcc-13, and the good news is that only a few changes are needed!

Specifically, one has to add
#include <cstdint>
in

  • src/main/cpp/include/utils/headers.h
  • core/src/codec/codec_filter_delta_encode.cc in the tiledb repo

This is so that the compiler knows about the uint8_t, uint32_t, … fixed-width types.

Best,
Pierre
