genomicsdb / genomicsdb
High performance data storage for importing, querying and transforming variants.
Home Page: https://genomicsdb.readthedocs.io
License: Other
Hello,
When building GenomicsDB in Debian unstable, I had to make the enclosed changes related to JsonFormat methods.
Cheers,
Pierre
I'd like some help troubleshooting the install process on Ubuntu 16.04.
Attached is the output from cmake and make.
Hello,
In src/test/java/org/genomicsdb/reader/GenomicsDBQueryTest.java in version 1.5.1, there is the call
Assert.assertEquals(intervals.get(0).calls.size(), expected_calls[i]);
which is ambiguous as both method assertEquals(java.lang.Object,java.lang.Object) in org.testng.Assert and method assertEquals(long,long) in org.testng.Assert match. This raises an error, at least with openjdk-17.
Explicitly converting Long to long with
Assert.assertEquals(intervals.get(0).calls.size(), expected_calls[i].longValue());
solves this.
Best,
Pierre
This seems to be an oft-requested feature, along the lines of "Is there any option we can filter with an attribute, for example BaseQRankSum > 1.23?"
See Intel-HLS/GenomicsDB#213.
The C unit tests are meant to test the basic functionality exposed by <GenomicsDB>/src/main/cpp and libtiledbgenomicsdb.so, so they should not have any MPI dependencies. However, gt_mpi_gather can be used for testing bgen_support as a functional test from either run.py or test_tools.sh in <GenomicsDB>/tests, but test_bgen should only test the functionality exposed by libtiledbgenomicsdb.so.
Will have to persist the callset.json files at the workspace/array/fragment level in GenomicsDB, as opposed to the workspace level as is currently done.
It seems that when running install_prereqs.sh, the Maven installation step tries to wget from the URL https://www-us.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz. However, this leads to the error wget: unable to resolve host address 'www-us.apache.org'. The host www-us.apache.org is no longer valid. Should this URL be updated?
See #327
When I executed the command
"gatk --java-options "-Xmx4g" GenotypeGVCFs
-R ${ref}
-V gendb://chr1_database
--tmp-dir=${temp_dir}
-O DB76.chr1.vcf.gz"
the problem always occurred with the message "[TileDB::StorageManager] Error: Cannot lock consolidation filelock; Cannot lock ....... A USER ERROR has occurred: Couldn't create GenomicsDBFeatureReader".
I have been confused by this problem for a month. Can anyone help? Thanks so much!
I tried installing GenomicsDB following the Wiki using an Ubuntu jammy Docker image. Looks like the compilation fails here:
/root/awssdk-install/1.8.x/src/awssdk-build/aws-cpp-sdk-core/source/utils/crypto/openssl/CryptoImpl.cpp:373:31: error: ‘int HMAC_CTX_reset(HMAC_CTX*)’ is deprecated: Since OpenSSL 3.0 [-Werror=deprecated-declarations]
373 | HMAC_CTX_reset(m_ctx);
A workaround (also used in aws-cpp-sdk: aws/aws-sdk-cpp#1582) is setting -Wno-error=deprecated-declarations, but I'm not sure where I have to provide this; my amateurish attempts failed.
Hello,
The dependencies commons-io:commons-io, com.google.guava:guava and org.apache.commons:commons-lang3 (test scope only for the last one) are missing in pom.xml, whereas they are needed by some files in the code.
Cheers,
Pierre
Hit this today and took a while to figure it out. I was using vcf2tiledb to load some vcfs into genomicsdb arrays, and kept running into an error of the sort:
VCF2BinaryException : Could not open file : could not load index (VCF/BCF files must be block compressed and indexed)
Both the VCF and the index clearly existed. It took a while (and some help) to figure out that ulimit wasn't set high enough and the number of open file descriptors was the problem. It would be very helpful to change the error message to hint at this.
It would be useful for query_variants() to optionally return top-level variant information without the genotypes field. This would make it easy to associate calls from query_variant_calls() back to variant metadata (reference, alternates, etc.) without having to carry along unnecessary and repetitive genotype information.
Originally posted by @jacmarjorie in #41 (comment)
I am using GenomicsDBImport, but since my input is a file containing a list of files, I don't know how to pass it to GATK with the -V option. I've been looking for this everywhere, but I can't find this information in the GATK manual.
[release 5/6] COPY --from=full /usr/local/bin/genomicsdb /usr/local/bin/gt_mpi_gather /usr/local/bin/vcf* /tmp/genomicsdb/install/bin/:
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 667d83a6-2fb1-41e6-a5d0-fe43821cfc6b::2e3shtccixb8ww9htyrm6vzh8: "/usr/local/bin/gt_mpi_gather": not found
Looks like zstd has really good decompress performance and we should at least investigate this.
Hello,
I tried to build GenomicsDB with gcc-13, and the good news is that only a few changes are needed!
Specifically, one has to add
#include <cstdint>
so that the compiler knows about the uint8, uint32, ... types.
Best,
Pierre
I want to run GATK on an ARM machine, and I did this:
jar uvf gatk-package-4.4.0.0-local.jar libtiledbgenomicsdb.so
if (get_version() >= 1L) {
  // char compression_level;
  signed char compression_level;
  for (int i = 0; i <= attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&compression_level, buffer + offset, sizeof(char));
    offset += sizeof(char);
    compression_level_.push_back(compression_level);
  }
}
// Load offsets_compression_. Support added in array schema version 2L
if (get_version() >= 2L) {
  // char offsets_compression;
  signed char offsets_compression;
  for (int i = 0; i < attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&offsets_compression, buffer + offset, sizeof(char));
    offset += sizeof(char);
    offsets_compression_.push_back(static_cast<int>(offsets_compression));
  }
}
// Load offsets_compression_level_. Support added in array schema version 2L
if (get_version() >= 2L) {
  // char offsets_compression_level;
  signed char offsets_compression_level;
  for (int i = 0; i < attribute_num_; ++i) {
    assert(offset + sizeof(char) <= buffer_size);
    memcpy(&offsets_compression_level, buffer + offset, sizeof(char));
    offset += sizeof(char);
    offsets_compression_level_.push_back(offsets_compression_level);
  }
}
where compression_level_ is a vector of int and compression_level was a char.
The conversion of char to int is ambiguous: char is treated as signed or unsigned on different machines, so int(0xff) yields -1 on one and 255 on the other.
When char is replaced by signed char, this bug is fixed.
This is a placeholder to
Hi, is there some Python interface for GenomicsDB?
Dear team,
GenomicsDBImport failed (same command and dataset) with different errors across two file systems.
Summary is as follows:
1. BeeGFS file system:
$ cat Cq5A.chunk_1.GenomicsDBImport.11705869.err | tail -13
01:08:38.145 INFO GenomicsDBImport - Starting batch input file preload
01:09:38.916 INFO GenomicsDBImport - Finished batch preload
01:09:38.916 INFO GenomicsDBImport - Importing batch 1 with 564 samples
[TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error; path=/ibex/scratch/projects/c2071/1000quinoa/naga/1128_Phase2/RESULTS/gVCF/Cq5A.1/Cq5A$1$8993623/__06ef12cd-3e3e-445b-8a17-78d392d6e35347261215135488_1600294179899/ExcessHet.tdb; errno=121(Remote I/O error)
[TileDB::WriteState] Error: Cannot write segment to file.
02:44:19.847 erro NativeGenomicsDB - pid=208539 tid=208928 VariantStorageManagerException exception : Error while writing to TileDB array
TileDB error message : [TileDB::WriteState] Error: Cannot write segment to file
[TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error; path=/ibex/scratch/projects/c2071/1000quinoa/naga/1128_Phase2/RESULTS/gVCF/Cq5A.1/Cq5A$1$8993623/__06ef12cd-3e3e-445b-8a17-78d392d6e35347261215135488_1600294179899/ExcessHet.tdb; errno=121(Remote I/O error)
terminate called after throwing an instance of 'std::exception'
what(): std::exception
2. NFS file system
$ cat Cq5A.chunk_1.GenomicsDBImport.11739158.err | tail -13
11:49:12.994 INFO GenomicsDBImport - Starting batch input file preload
11:52:16.216 INFO GenomicsDBImport - Shutting down engine
[September 21, 2020 11:52:16 AM AST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 8.28 minutes.
Runtime.totalMemory()=19555942400
A USER ERROR has occurred: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize with exception: org.broadinstitute.hellbender.exceptions.UserException: Failed to create reader from file:///ibex/scratch/projects/c2071/1000quinoa/OUT/Sept2020/VCF/S7E1_batch2.snps.indels.g.vcf.gz
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Note:
Background:
GATK version: 4.1.8
GenomicsDB native library version: 1.3.0-e701905
Command used:
gatk GenomicsDBImport --variant $INPUT --genomicsdb-workspace-path $gVCF/$ChrName.$size --intervals $ChrName:$Start-$End --reader-threads $CORES
Where,
$INPUT is a list of HaplotypeCaller gVCF files from 1128 samples.
$ChrName is Cq5A
$size is 1,2,3 ..8 (8 chunks)
$ChrName:$Start-$End is based on this below summary:
Cq5A split into 8 parts and here is the summary.
$ cat Chromosome_distribution.txt | grep Cq5A
Cq5A split into 8 parts
Chromosome name:Cq5A, Chunk number: 1, and Interval(Start:1-End:8993623)
Chromosome name:Cq5A, Chunk number: 2, and Interval(Start:8993624-End:17987246)
Chromosome name:Cq5A, Chunk number: 3, and Interval(Start:17987247-End:26980869)
Chromosome name:Cq5A, Chunk number: 4, and Interval(Start:26980870-End:35974492)
Chromosome name:Cq5A, Chunk number: 5, and Interval(Start:35974493-End:44968115)
Chromosome name:Cq5A, Chunk number: 6, and Interval(Start:44968116-End:53961738)
Chromosome name:Cq5A, Chunk number: 7, and Interval(Start:53961739-End:62955361)
Chromosome name:Cq5A, Chunk number: 8, and Interval(Start:62955362-End:64666259)
Observations:
At BeeGFS:
$ cat Cq5A.chunk_5.GenomicsDBImport.11705873.out
Tool returned:
true
At NFS
$ cat Cq5A.chunk_5.GenomicsDBImport.11739162.out
Tool returned:
true
System environment:
OS: CentOS Linux release 7.7.1908 (Core)
Java version: 1.8.0_242
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
Thanks and Regards,
Naga
Hello,
When building, I get failures due to missing fmt links; the enclosed changes solved this for me.
Cheers,
Pierre
GenomicsDB tools should not throw exceptions and core dump for expected issues; rather, they should exit gracefully with meaningful messages.
GATK forum users have started requesting support for macOS and Linux arm64 architectures - see forum post.
We were stuck with protobuf-v3.0.0-beta-1 because of GATK, but now that GATK has moved to protobuf 3.8.0, GenomicsDB can move to this version as well.
Also allow for building protobuf from the CMake modules, so users do not have to figure out yet another prerequisite to build/install.
Currently the import tools expect to have a buffer size set that allows reading an entire VCF line (size_per_column_partition: see here).
It would be much more user friendly to have the import process automatically increase the buffer size as needed if the initial buffer size does not fit the input VCF line.
This is something to keep in mind when we start using this functionality.
Originally posted by @jPleyte in #186 (comment)
Hi,
I tried to follow the instructions by first building protobuf 3.0.2, but got the error "google/protobuf/port_def.inc: No such file or directory". Then I did git checkout 3.7.x inside protobuf, followed by the installation, and that succeeded.
Regards,
Shubham
I have trouble installing GenomicsDB on CentOS Linux (release 7.4.1708).
I ran
cmake ../ -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="/home/riestma1/" -DDISABLE_MPI=1 -DUSE_LIBCSV=0
The output looked good to me:
-- The C compiler identification is GNU 6.4.0
-- The CXX compiler identification is GNU 6.4.0
-- Check for working C compiler: /usr/prog/GCCcore/6.4.0/bin/cc
-- Check for working C compiler: /usr/prog/GCCcore/6.4.0/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/prog/GCCcore/6.4.0/bin/c++
-- Check for working CXX compiler: /usr/prog/GCCcore/6.4.0/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Performing Test test_cpp_2011
-- Performing Test test_cpp_2011 - Success
-- Performing Test test_stack_protector_strong
-- Performing Test test_stack_protector_strong - Success
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Performing Test OPENMPV4_FOUND
-- Performing Test OPENMPV4_FOUND - Success
-- Found libuuid: /usr/include
-- Found CURL: /lib64/libcurl.so (found version "7.29.0")
-- Found RapidJSON: /home/riestma1/include
-- Found htslib: /home/riestma1/git/GenomicsDB/dependencies/htslib
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7")
-- Found OpenSSL: /usr/lib64/libssl.so;/usr/lib64/libcrypto.so (found version "1.0.2k")
-- Found TileDB: /home/riestma1/git/GenomicsDB/dependencies/TileDB/core/include/c_api
-- Performing Test HAS_STD_CXX11
-- Performing Test HAS_STD_CXX11 - Success
-- Building muparserx version: 4.0.8
-- Using install prefix: /home/riestma1
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.5")
-- Found OpenMP: -fopenmp
-- Found JNI: /usr/prog/Java/1.8.0_162/jre/lib/amd64/libjawt.so
-- Found HDFS: /home/riestma1/git/GenomicsDB/dependencies/TileDB/deps/HDFSWrapper/hadoop-hdfs-native/main/native/libhdfs
-- Performing Test HAVE_BETTER_TLS
-- Performing Test HAVE_BETTER_TLS - Success
-- Performing Test HAVE_INTEL_SSE_INTRINSICS
-- Performing Test HAVE_INTEL_SSE_INTRINSICS - Success
-- Looking for dlopen in dl
-- Looking for dlopen in dl - found
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE
-- Performing Test CXX_2011_FOUND
-- Performing Test CXX_2011_FOUND - Success
-- Compiler supports C++ 2011.
-- The TileDB library is compiled with verbosity.
-- Found Protobuf: /home/riestma1/include
-- Found PROTOBUF: /home/riestma1/lib/libprotobuf.a
-- Performing Test PROTOBUF_V3_STABLE_FOUND
-- Performing Test PROTOBUF_V3_STABLE_FOUND - Success
-- Could not find libcsv headers and/or libraries (missing: LIBCSV_INCLUDE_DIR LIBCSV_LIBRARY)
-- Could not find safestring headers and/or libraries (missing: SAFESTRINGLIB_INCLUDE_DIR SAFESTRINGLIB_LIBRARY)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/riestma1/git/GenomicsDB/build
[ 7%] Built target htslib
[ 10%] Built target PROTOBUF_GENERATED_CXX_TARGET
[ 10%] Building CXX object src/main/CMakeFiles/GenomicsDB_library_object_files.dir/cpp/src/query_operations/variant_operations.cc.o
In file included from /home/riestma1/git/GenomicsDB/src/main/cpp/include/utils/gt_common.h:29:0,
from /home/riestma1/git/GenomicsDB/src/main/cpp/include/genomicsdb/variant.h:27,
from /home/riestma1/git/GenomicsDB/src/main/cpp/include/query_operations/variant_operations.h:26,
from /home/riestma1/git/GenomicsDB/src/main/cpp/src/query_operations/variant_operations.cc:23:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h: In function ‘T get_bcf_missing_value() [with T = long int]’:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h:83:10: error: ‘bcf_int64_missing’ was not declared in this scope
return bcf_int64_missing;
^~~~~~~~~~~~~~~~~
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h: In function ‘T get_bcf_missing_value() [with T = long unsigned int]’:
/home/riestma1/git/GenomicsDB/src/main/cpp/include/vcf/vcf.h:88:10: error: ‘bcf_int64_missing’ was not declared in this scope
return bcf_int64_missing;
It seems like htslib's vcf.h is not included by vcf/vcf.h, but I'm not sure why:
HTSLIB_INCLUDE_DIR:PATH=/home/riestma1/git/GenomicsDB/dependencies/htslib
HTSLIB_INSTALL_DIR:PATH=
HTSLIB_SOURCE_DIR:PATH=/home/riestma1/git/GenomicsDB/dependencies/htslib
Thanks a lot!
Originally reported in #6745 in GATK. When running a test, the reporting user got a GenomicsDB JNI error, thrown by jniGenomicsDBInit() in GenomicsDBQueryStream.java.
Looking at the C++ code for the native method in genomicsdb_GenomicsDBQueryStream.cc, I find that the method handleJNIException() is called. However, that method takes the generic class std::exception, and the stack trace reports only std::exception rather than a specific exception.
Is there a way to determine the specific exception type here?
Some builds fail with sparkDriver not able to find a random free port. Most of the time, restarting the build works, but just wondering if you can take a look.
2021-04-11T18:38:04.1426794Z Sanity test with query.json : spark-submit --master local --deploy-mode client --total-executor-cores 1 --executor-memory 512M --conf "spark.executor.extraJavaOptions=" --conf "spark.driver.extraJavaOptions=" --class org.genomicsdb.spark.api.GenomicsDBSparkBindings /home/runner/work/GenomicsDB/GenomicsDB/build/target/genomicsdb-x.y.z.test-allinone.jar /tmp/tmp69y8hzf7/sanity_test/t0_1_2.json /tmp/tmp69y8hzf7/sanity_test/query.json failed
2021-04-11T18:38:04.1429971Z Stdout:
2021-04-11T18:38:04.1431279Z Stderr: 21/04/11 18:38:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-04-11T18:38:04.1432436Z 21/04/11 18:38:03 INFO SparkContext: Running Spark version 3.0.1
2021-04-11T18:38:04.1433443Z 21/04/11 18:38:03 INFO ResourceUtils: ==============================================================
2021-04-11T18:38:04.1437692Z 21/04/11 18:38:03 INFO ResourceUtils: Resources for spark.driver:
2021-04-11T18:38:04.1438166Z
2021-04-11T18:38:04.1438643Z 21/04/11 18:38:03 INFO ResourceUtils: ==============================================================
2021-04-11T18:38:04.1439513Z 21/04/11 18:38:03 INFO SparkContext: Submitted application: GenomicsDB API Experimental Bindings
2021-04-11T18:38:04.1440442Z 21/04/11 18:38:03 INFO SecurityManager: Changing view acls to: runner
2021-04-11T18:38:04.1441398Z 21/04/11 18:38:03 INFO SecurityManager: Changing modify acls to: runner
2021-04-11T18:38:04.1442177Z 21/04/11 18:38:03 INFO SecurityManager: Changing view acls groups to:
2021-04-11T18:38:04.1442970Z 21/04/11 18:38:03 INFO SecurityManager: Changing modify acls groups to:
2021-04-11T18:38:04.1444518Z 21/04/11 18:38:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(runner); groups with view permissions: Set(); users with modify permissions: Set(runner); groups with modify permissions: Set()
2021-04-11T18:38:04.1446748Z 21/04/11 18:38:04 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
...
2021-04-11T18:38:04.1456342Z 21/04/11 18:38:04 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
2021-04-11T18:38:04.1472401Z java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
2021-04-11T18:38:04.1475406Z at sun.nio.ch.Net.bind0(Native Method)
2021-04-11T18:38:04.1476149Z at sun.nio.ch.Net.bind(Net.java:461)
2021-04-11T18:38:04.1476956Z at sun.nio.ch.Net.bind(Net.java:453)
...
@kgururaj mentioned in #41 (review) that some of the structures used have a lot in common with the Protobuf structs and that they can be unified.
See #46
Currently, only VCF files can be ingested. Refactor the code to have an API for ingesting from multiple sources. This API should be made callable from the Java/Python/R bindings as well.
The default is all rows when no rows are specified, so the GenomicsDB API query_variants and query_variant_calls should consider all rows. Thanks @jacmarjorie for pointing out this issue.
This could be part of a separate workflow. We could just run run_spark_hdfs.py for pull requests.
Hi,
If I import a VCF file containing many samples into GenomicsDB, say something from the 1000 Genomes Project, are the fields that are common to all samples (the ones mentioned in the subject line) stored in each TileDB cell? Or are they stored just once per variant?
Regards,
Shubham
@mlathara Hello - we're running GATK GenotypeGVCFs with --max-alternate-alleles=6, with a genomicsDB input (~2500 samples). This line:
logs every single sample (not site) where it exceeds max-alternate-alleles, and in our case it is writing a ridiculous amount to our log. Is there a way to tone this down? Can it log a maximum of XX times and then stop? Can we toggle it with some other parameter?
thanks for any ideas.
This issue surfaced from the GATK user forum. VariantQueryProcessorException puts out the flattened coordinates in the exception, making it difficult for the user to discern what is happening.
Hello,
I am packaging GenomicsDB for Debian; I had to make the enclosed change, as string_view needs C++17, not C++14.
Cheers,
Pierre