tiledb-inc / tiledb-vcf

Efficient variant-call data storage and retrieval library using the TileDB storage library.

Home Page: https://tiledb-inc.github.io/TileDB-VCF/

License: MIT License

CMake 3.39% Python 7.29% Shell 1.43% C++ 66.99% C 5.89% Java 14.44% Makefile 0.27% Batchfile 0.03% CSS 0.08% SCSS 0.20%
genomics vcf variant-calling tiledb bioinformatics data-science spark python gwas

tiledb-vcf's Introduction


TileDB-VCF

A C++ library for efficient storage and retrieval of genomic variant-call data using TileDB Embedded.

Features

  • Easily ingest large amounts of variant-call data at scale
  • Supports ingesting single-sample VCF and BCF files
  • New samples are added incrementally, avoiding computationally expensive merging operations
  • Allows for highly compressed storage using TileDB sparse arrays
  • Efficient, parallelized queries of variant data stored locally or remotely on S3
  • Export lossless VCF/BCF files or extract specific slices of a dataset

What's Included?

  • Command line interface (CLI)
  • APIs for C, C++, Python, and Java
  • Integrates with Spark and Dask

Quick Start

The documentation website provides comprehensive usage examples, but here are a few quick exercises to get you started.

We'll use a dataset that includes 20 synthetic samples, each containing over 20 million variants. We host a publicly accessible version of this dataset on S3, so if you have TileDB-VCF installed and would like to follow along, just swap out the URIs below for s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20. If you don't have TileDB-VCF installed yet, you can use our Docker images to test things out.

CLI

Export complete chr1 BCF files for a subset of samples:

tiledbvcf export \
  --uri vcf-samples-20 \
  --regions chr1:1-248956422 \
  --sample-names v2-usVwJUmo,v2-WpXCYApL

Create a TSV file containing all variants within one or more regions of interest:

tiledbvcf export \
  --uri vcf-samples-20 \
  --sample-names v2-tJjMfKyL,v2-eBAdKwID \
  -Ot --tsv-fields "CHR,POS,REF,S:GT" \
  --regions "chr7:144000320-144008793,chr11:56490349-56491395"
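The `--regions` value is a comma-separated list of `contig:start-end` strings. As a quick illustration of that format (a minimal sketch in plain Python, not part of TileDB-VCF itself), each region can be split into its components like this:

```python
# Minimal sketch (not TileDB-VCF code): split a region string of the
# form "contig:start-end" into (contig, start, end).
def parse_region(region: str):
    contig, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return contig, int(start), int(end)

regions = "chr7:144000320-144008793,chr11:56490349-56491395"
parsed = [parse_region(r) for r in regions.split(",")]
print(parsed[0])  # -> ('chr7', 144000320, 144008793)
```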

Python

Running the same query in Python:

import tiledbvcf

ds = tiledbvcf.Dataset(uri = "vcf-samples-20", mode="r")

ds.read(
    attrs = ["sample_name", "pos_start", "fmt_GT"],
    regions = ["chr7:144000320-144008793", "chr11:56490349-56491395"],
    samples = ["v2-tJjMfKyL", "v2-eBAdKwID"]
)

returns the results as a pandas DataFrame:

     sample_name  pos_start    fmt_GT
0    v2-nGEAqwFT  143999569  [-1, -1]
1    v2-tJjMfKyL  144000262  [-1, -1]
2    v2-tJjMfKyL  144000518  [-1, -1]
3    v2-nGEAqwFT  144000339  [-1, -1]
4    v2-nzLyDgYW  144000102  [-1, -1]
..           ...        ...       ...
566  v2-nGEAqwFT   56491395    [0, 0]
567  v2-ijrKdkKh   56491373    [0, 0]
568  v2-eBAdKwID   56491391    [0, 0]
569  v2-tJjMfKyL   56491392  [-1, -1]
570  v2-nzLyDgYW   56491365  [-1, -1]
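In the `fmt_GT` column, missing allele calls are encoded as `-1`. A hypothetical post-processing step (plain Python with invented row data, not part of the tiledbvcf API) to separate called from missing genotypes might look like:

```python
# Hypothetical post-processing sketch (not tiledbvcf code): rows mimic
# the (sample_name, pos_start, fmt_GT) results above, where -1 marks a
# missing allele call.
rows = [
    ("v2-tJjMfKyL", 144000262, [-1, -1]),
    ("v2-nGEAqwFT",  56491395, [0, 0]),
    ("v2-eBAdKwID",  56491391, [0, 0]),
]
called = [r for r in rows if all(a >= 0 for a in r[2])]
missing = [r for r in rows if any(a < 0 for a in r[2])]
print(len(called), len(missing))  # -> 2 1
```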

Want to Learn More?

Code of Conduct

All participants in TileDB spaces are expected to adhere to high standards of professionalism in all interactions. This repository is governed by the standards and reporting procedures detailed in the TileDB core repository's Code of Conduct.

tiledb-vcf's People

Contributors

aaronwolen, awenocur, dependabot[bot], dhoke4tdb, dimitrisstaratzis, eddelbuettel, gspowley, gsvic, ihnorton, jdblischak, leipzig, mitochon, nguyenv, shelnutt2, stavrospapadopoulos, tdenniston, teo-tsirpanis



tiledb-vcf's Issues

missing libcurl.so.4 in latest tiledbvcf-cli docker image

Hi

The latest Docker image for tiledbvcf-cli is missing the libcurl.so.4 library:

Fri Sep 17 08:50:12 alexanderstuckey@GEL-C02WN56EG8WN: ~ $ docker pull tiledb/tiledbvcf-cli
Using default tag: latest
latest: Pulling from tiledb/tiledbvcf-cli
35807b77a593: Pull complete 
ee1ab4236f38: Pull complete 
c4f48a4e2189: Pull complete 
e698f529638d: Pull complete 
a00061aafe04: Pull complete 
Digest: sha256:5fbb8dd68a23d594e05914abd3e06de027b96eee990f217b6ae5235b20a19b15
Status: Downloaded newer image for tiledb/tiledbvcf-cli:latest
docker.io/tiledb/tiledbvcf-cli:latest
Fri Sep 17 08:51:23 alexanderstuckey@GEL-C02WN56EG8WN: ~ $ docker run --rm tiledb/tiledbvcf-cli tiledbvcf -h
/usr/local/bin/tiledbvcf: error while loading shared libraries: libcurl.so.4: cannot open shared object file: No such file or directory

The previous version that I had works fine:

Fri Sep 17 08:49:49 alexanderstuckey@GEL-C02WN56EG8WN: ~ $ docker run --rm tiledb/tiledbvcf-cli tiledbvcf -h
TileDBVCF -- efficient variant-call data storage and retrieval.

This command-line utility provides an interface to create, store and efficiently retrieve variant-call data in the TileDB storage format.

More information: TileDB <https://tiledb.io>
TileDB-VCF version 0.9.0
TileDB version 2.3.1
htslib version 1.10

Python API installation is missing pyarrow

Thanks for this great project.

If you follow the instructions in this document, the result is a missing pyarrow dependency, as seen below.

root@e47916114dc4:/opt/TileDB-VCF/apis/python# python setup.py install --user
running install
running bdist_egg
-- Using default install prefix /opt/TileDB-VCF/dist. To control CMAKE_INSTALL_PREFIX, set OVERRIDE_INSTALL_PREFIX=OFF
-- Install prefix is /opt/TileDB-VCF/dist.
-- Skipping search for htslib, building it as an external project. To use system htslib, set FORCE_EXTERNAL_HTSLIB=OFF
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Starting TileDB-VCF superbuild.
-- Could NOT find Clipp (missing: CLIPP_INCLUDE_DIR) 
-- Adding Clipp as an external project
-- Could NOT find HTSlib (missing: HTSLIB_LIBRARIES HTSLIB_INCLUDE_DIR) 
-- Adding HTSlib as an external project
-- Adding TileDB as an external project
-- searching for catch in /opt/TileDB-VCF/libtiledbvcf/build/externals/src
-- Could NOT find Catch (missing: CATCH_INCLUDE_DIR) 
-- Adding Catch as an external project
-- Not found clang-tidy
-- Not found clang-format
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/TileDB-VCF/libtiledbvcf/build
Scanning dependencies of target ep_tiledb
Scanning dependencies of target ep_htslib
Scanning dependencies of target ep_catch
Scanning dependencies of target ep_clipp
[  2%] Creating directories for 'ep_tiledb'
[  5%] Creating directories for 'ep_htslib'
[  7%] Creating directories for 'ep_catch'
[ 10%] Creating directories for 'ep_clipp'
[ 12%] Performing download step (download, verify and extract) for 'ep_tiledb'
[ 15%] Performing download step (download, verify and extract) for 'ep_htslib'
[ 17%] Performing download step (download, verify and extract) for 'ep_clipp'
[ 20%] Performing download step (download, verify and extract) for 'ep_catch'
-- ep_clipp download command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_clipp-stamp/ep_clipp-download-*.log
[ 22%] No update step for 'ep_clipp'
[ 25%] No patch step for 'ep_clipp'
[ 27%] No configure step for 'ep_clipp'
[ 30%] No build step for 'ep_clipp'
[ 32%] Performing install step for 'ep_clipp'
-- ep_clipp install command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_clipp-stamp/ep_clipp-install-*.log
[ 35%] Completed 'ep_clipp'
-- ep_htslib download command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_htslib-stamp/ep_htslib-download-*.log
[ 35%] Built target ep_clipp
[ 37%] No patch step for 'ep_htslib'
[ 40%] No update step for 'ep_htslib'
[ 42%] Performing configure step for 'ep_htslib'
-- ep_catch download command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_catch-stamp/ep_catch-download-*.log
-- ep_tiledb download command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_tiledb-stamp/ep_tiledb-download-*.log
[ 45%] No patch step for 'ep_catch'
[ 47%] No update step for 'ep_catch'
[ 50%] No patch step for 'ep_tiledb'
[ 52%] No update step for 'ep_tiledb'
[ 55%] No configure step for 'ep_catch'
[ 57%] Performing configure step for 'ep_tiledb'
[ 60%] No build step for 'ep_catch'
[ 62%] No install step for 'ep_catch'
[ 65%] Completed 'ep_catch'
[ 65%] Built target ep_catch
-- ep_tiledb configure command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_tiledb-stamp/ep_tiledb-configure-*.log
[ 67%] Performing build step for 'ep_tiledb'
-- ep_htslib configure command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_htslib-stamp/ep_htslib-configure-*.log
[ 70%] Performing build step for 'ep_htslib'
-- ep_htslib build command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_htslib-stamp/ep_htslib-build-*.log
[ 72%] Performing install step for 'ep_htslib'
-- ep_htslib install command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_htslib-stamp/ep_htslib-install-*.log
[ 75%] Completed 'ep_htslib'
[ 75%] Built target ep_htslib
-- ep_tiledb build command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_tiledb-stamp/ep_tiledb-build-*.log
[ 77%] Performing install step for 'ep_tiledb'
-- ep_tiledb install command succeeded.  See also /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_tiledb-stamp/ep_tiledb-install-*.log
[ 80%] Completed 'ep_tiledb'
[ 80%] Built target ep_tiledb
Scanning dependencies of target libtiledbvcf
[ 82%] Creating directories for 'libtiledbvcf'
[ 85%] No download step for 'libtiledbvcf'
[ 87%] No patch step for 'libtiledbvcf'
[ 90%] No update step for 'libtiledbvcf'
[ 92%] Performing configure step for 'libtiledbvcf'
-- Using default install prefix /opt/TileDB-VCF/dist. To control CMAKE_INSTALL_PREFIX, set OVERRIDE_INSTALL_PREFIX=OFF
-- Install prefix is /opt/TileDB-VCF/dist.
-- Skipping search for htslib, building it as an external project. To use system htslib, set FORCE_EXTERNAL_HTSLIB=OFF
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Starting TileDB-VCF regular build.
-- Starting TileDB-VCF build.
-- Found Clipp: /opt/TileDB-VCF/libtiledbvcf/build/externals/install/include  
-- Found HTSlib: /opt/TileDB-VCF/libtiledbvcf/build/externals/install/lib/libhts.so  
-- Found TileDB: /opt/TileDB-VCF/libtiledbvcf/build/externals/install/lib/libtiledb.so.1.7
-- Found Git: /usr/bin/git (found version "2.20.1") 
-- Building with commit hash 
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Found TileDB: /opt/TileDB-VCF/libtiledbvcf/build/externals/install/lib/libtiledb.so.1.7
-- searching for catch in /opt/TileDB-VCF/libtiledbvcf/build/externals/src
-- Found Catch: /opt/TileDB-VCF/libtiledbvcf/build/externals/src/ep_catch/single_include  
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/TileDB-VCF/libtiledbvcf/build/libtiledbvcf
[ 95%] Performing build step for 'libtiledbvcf'
Scanning dependencies of target TILEDB_VCF_OBJECTS
[  4%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/read/in_memory_exporter.cc.o
[  9%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/dataset/tiledbvcfdataset.cc.o
[ 13%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/c_api/tiledbvcf.cc.o
[ 18%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/dataset/attribute_buffer_set.cc.o
[ 36%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/write/record_heap.cc.o
[ 36%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/read/reader.cc.o
[ 36%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/read/exporter.cc.o
[ 36%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/read/bcf_exporter.cc.o
[ 40%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/read/tsv_exporter.cc.o
[ 50%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/vcf/vcf.cc.o
[ 50%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/vcf/region.cc.o
[ 63%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/utils/sample_utils.cc.o
[ 63%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/write/writer.cc.o
[ 68%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/utils/utils.cc.o
[ 72%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/utils/buffer.cc.o
[ 72%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/utils/bitmap.cc.o
[ 77%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/write/writer_worker.cc.o
[ 81%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/read/read_query_results.cc.o
[ 86%] Building CXX object src/CMakeFiles/TILEDB_VCF_OBJECTS.dir/__/external/base64/base64.cc.o
[ 86%] Built target TILEDB_VCF_OBJECTS
Scanning dependencies of target tiledbvcf-bin
Scanning dependencies of target tiledbvcf
[ 90%] Linking CXX shared library libtiledbvcf.so
[ 95%] Building CXX object src/CMakeFiles/tiledbvcf-bin.dir/cli/tiledbvcf.cc.o
[ 95%] Built target tiledbvcf
[100%] Linking CXX executable tiledbvcf
[100%] Built target tiledbvcf-bin
[ 97%] No install step for 'libtiledbvcf'
[100%] Completed 'libtiledbvcf'
[100%] Built target libtiledbvcf
Scanning dependencies of target install-libtiledbvcf
[ 86%] Built target TILEDB_VCF_OBJECTS
[ 95%] Built target tiledbvcf-bin
[100%] Built target tiledbvcf
Install the project...
-- Install configuration: "Release"
-- Installing: /opt/TileDB-VCF/dist/lib/libhts.so.1.8
-- Installing: /opt/TileDB-VCF/dist/lib/libtiledb.so.1.7
-- Installing: /opt/TileDB-VCF/dist/lib/libtiledbvcf.so
-- Set runtime path of "/opt/TileDB-VCF/dist/lib/libtiledbvcf.so" to "$ORIGIN/"
-- Installing: /opt/TileDB-VCF/dist/include/tiledbvcf/tiledbvcf.h
-- Installing: /opt/TileDB-VCF/dist/include/tiledbvcf/tiledbvcf_enum.h
-- Installing: /opt/TileDB-VCF/dist/include/tiledbvcf/arrow.h
-- Installing: /opt/TileDB-VCF/dist/include/tiledbvcf/tiledbvcf_export.h
-- Installing: /opt/TileDB-VCF/dist/bin/tiledbvcf
-- Set runtime path of "/opt/TileDB-VCF/dist/bin/tiledbvcf" to "/opt/TileDB-VCF/dist/lib"
-- Up-to-date: /opt/TileDB-VCF/dist/lib/libhts.so.1.8
-- Up-to-date: /opt/TileDB-VCF/dist/lib/libtiledb.so.1.7
Built target install-libtiledbvcf
Adding to package_data: ['libhts.so.1.8', 'libtiledb.so.1.7', 'libtiledbvcf.so']
running egg_info
creating src/tiledbvcf.egg-info
writing src/tiledbvcf.egg-info/PKG-INFO
writing dependency_links to src/tiledbvcf.egg-info/dependency_links.txt
writing top-level names to src/tiledbvcf.egg-info/top_level.txt
writing manifest file 'src/tiledbvcf.egg-info/SOURCES.txt'
reading manifest file 'src/tiledbvcf.egg-info/SOURCES.txt'
writing manifest file 'src/tiledbvcf.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/tiledbvcf
copying src/tiledbvcf/dask_functions.py -> build/lib.linux-x86_64-3.7/tiledbvcf
copying src/tiledbvcf/__init__.py -> build/lib.linux-x86_64-3.7/tiledbvcf
copying src/tiledbvcf/dataset.py -> build/lib.linux-x86_64-3.7/tiledbvcf
copying src/tiledbvcf/libhts.so.1.8 -> build/lib.linux-x86_64-3.7/tiledbvcf
copying src/tiledbvcf/libtiledb.so.1.7 -> build/lib.linux-x86_64-3.7/tiledbvcf
copying src/tiledbvcf/libtiledbvcf.so -> build/lib.linux-x86_64-3.7/tiledbvcf
running build_ext
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    'Programming Language :: Python :: 3.7',
  File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib/python3/dist-packages/setuptools/command/install.py", line 67, in run
    self.do_egg_install()
  File "/usr/lib/python3/dist-packages/setuptools/command/install.py", line 109, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "setup.py", line 232, in run
    bdist_egg.run(self)
  File "/usr/lib/python3/dist-packages/setuptools/command/bdist_egg.py", line 172, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/usr/lib/python3/dist-packages/setuptools/command/bdist_egg.py", line 158, in call_command
    self.run_command(cmdname)
  File "/usr/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib/python3/dist-packages/setuptools/command/install_lib.py", line 24, in run
    self.build()
  File "/usr/lib/python3.7/distutils/command/install_lib.py", line 109, in build
    self.run_command('build_ext')
  File "/usr/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 78, in run
    _build_ext.run(self)
  File "/usr/lib/python3.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "setup.py", line 218, in build_extensions
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

tiledbvcf extract --merge

An important use case for me is using tiledbvcf extract to export a single cVCF with a subset of samples of interest for downstream analyses (e.g. eQTL). I see that I needed to use the -m option with export for that; otherwise the default output is a VCF/BCF file for each sample (which does not seem very useful to me, but I am obviously biased towards cVCFs).

The problem is that tiledbvcf does not seem to support the -m option in version v2.6.4, which is the current conda version at this time. Strangely enough, it was present in an older version, v2.6.3, so I tried it with that version and it did not go well.

After ingesting ~2400 samples (for ~6 million SNPs), I wanted to test the performance of extracting a subset of samples (all SNPs, not a region subset) as a cVCF. This worked pretty well for 2 sample IDs passed to the tiledbvcf extract -m -s option (much faster than bcftools, which is what I am comparing it to, since bcftools has to scan the whole cVCF to do the same). However, it failed when I tried tiledbvcf extract -m -f with a list of 280 randomly picked samples: the process ran for days with no end in sight, so I had to kill it (bcftools did the same job in about 16 minutes; I had hoped tiledbvcf would be able to do it faster).

Considering this extraction feature (the -m, --merge option) is no longer present in the most recent conda version, are there any plans to bring it back (and fix it)?

python api missing for write to vcf?

Hi,

I was reading the documentation and saw that the Docker version is able to export to VCF, but I would like to do this within the Python environment as well and have not found documentation for it. Is there a wrapper function to write VCFs out directly?

Wrong type hint for dataset python api

Issue Description

The Python API's Dataset class contains a mode attribute that is type-hinted as a boolean value but is handled as a string in the code.

Expected Behavior

The correct type hint should be str

Actual Behavior

The current type hint is bool

Environment

tiledbvcf-py version 0.27.0
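For reference, a corrected annotation would look something like the sketch below (names and validation are illustrative only; the real tiledbvcf dataset.py may differ):

```python
# Sketch of the corrected type hint (illustrative, not the actual
# tiledbvcf implementation): mode is a string, e.g. "r" for read
# or "w" for write, not a bool.
class Dataset:
    def __init__(self, uri: str, mode: str = "r"):
        if mode not in ("r", "w"):
            raise ValueError(f"invalid mode: {mode!r}")
        self.uri = uri
        self.mode = mode

ds = Dataset("vcf-samples-20", mode="r")
print(ds.mode)  # -> r
```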

Cannot submit_and_finalize query

Hi!
I am trying to ingest samples, but run into the following problem:

_40b48cf415984c70a4ab49e225d7c8ef_21 samples = SAMPLE
[2024-02-13 09:59:31.061] [tiledb-vcf] [...] [debug] Finalizing last contig batch of [1, 1]
[2024-02-13 09:59:31.061] [tiledb-vcf] [...] [debug] AlleleCount: Finalize query with 0 records
[2024-02-13 09:59:31.104] [tiledb-vcf] [...] [debug] VariantStats: Finalize query with 0 records
[2024-02-13 09:59:31.146] [tiledb-vcf] [...] [debug] Query buffer for 'contig' contains 3272 elements
[2024-02-13 09:59:31.146] [tiledb-vcf] [...] [critical] Cannot submit_and_finalize query with buffers set.

As far as I can tell, one of the following steps fails. Is there something I can try tweaking to make this work, or does anything else stand out as the obvious culprit here?

    File: libtiledbvcf/src/stats/allele_count.cc
 160   if (contig_records_ > 0) {
 161     if (utils::query_buffers_set(query_.get())) {
 162       LOG_FATAL("Cannot submit_and_finalize query with buffers set.");                                                                                                                                      
 163     }
 164     query_->submit_and_finalize();
 File: libtiledbvcf/src/stats/variant_stats.cc
 158   if (contig_records_ > 0) {
 159     if (utils::query_buffers_set(query_.get())) {
 160       LOG_FATAL("Cannot submit_and_finalize query with buffers set.");                                                                                                                                      
 161     }
 162     query_->submit_and_finalize();

Different behaviour using regions or regions-file and maybe two bugs

Hello!

I noticed a different behaviour between the --regions and --regions-file parameters.

A query using --regions works "almost" as expected:

./tiledbvcf export --uri rewelldb --regions "7:150553605-150553605" -Ot --tsv-fields "CHR,POS,REF,S:GT"

This returns the information for this position in the DB, but the DB has 9 samples and the results contain 12 records; 3 records are duplicated:

./tiledbvcf export --uri XXX --regions "7:150553605-150553605" -Ot --tsv-fields "CHR,POS,REF,S:GT" | sort | uniq -c
1 X000305-S 7 150553605 C 0,0
2 X000610-S 7 150553605 C 0,0
1 X001008-S 7 150553605 C 0,0
2 X004903-S 7 150553605 C 0,0
2 X005007-S 7 150553605 C 0,0
1 X005105-S 7 150553605 C 0,0
1 X005301-S 7 150553605 C 0,0
1 X005508-S 7 150553605 C 0,1
1 X005606-S 7 150553605 C 0,1
1 SAMPLE CHR POS REF S:GT

If I use the same region inside a .bed file with the --regions-file parameter, the CLI starts to print the whole DB instead of only the information at that position.
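For anyone reproducing this, note that BED intervals are zero-based and half-open, so the single 1-based position 7:150553605 corresponds to BED start 150553604 and end 150553605. A small sketch (plain Python, not tiledbvcf code) that writes such a one-line BED file:

```python
# Sketch (not tiledbvcf code): write a one-line BED file for a single
# 1-based position. BED is zero-based, half-open, so position 150553605
# becomes start=150553604, end=150553605.
import os
import tempfile

chrom, pos = "7", 150553605  # 1-based position from the report
bed_line = f"{chrom}\t{pos - 1}\t{pos}\n"

with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as f:
    f.write(bed_line)
    path = f.name

print(open(path).read(), end="")  # prints the single BED line
os.unlink(path)
```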

thanks!

How to compile with conda?

Hi, could you please provide some information on how to compile this package within a conda environment?
I guess I have to somehow build libtiledb with cmake -DFORCE_EXTERNAL_HTSLIB=OFF?

store: Segmentation fault

Dear,

I tried to add the dbSNP VCF file to TileDB-VCF, but I got the following error:

Segmentation fault

Here is the header of the file:

##fileformat=VCFv4.2
##fileDate=20210513
##source=dbSNP
##dbSNP_BUILD_ID=155
##reference=GRCh38.p13
##fileformat=VCFv4.2
##fileDate=20210513
##phasing=partial
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|).  Does not include pseudogenes.">
##INFO=<ID=PSEUDOGENEINFO,Number=1,Type=String,Description="Pairs each of pseudogene symbol:gene id.  The pseudogene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant has associated publication">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3">
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available.">
##INFO=<ID=PUB,Number=0,Type=Flag,Description="RefSNP or associated SubSNP is mentioned in a publication">
##INFO=<ID=FREQ,Number=.,Type=String,Description="An ordered list of allele frequencies as reported by various genomic studies, starting with the reference allele followed by alternate alleles as ordered in the ALT column. When not already in the dbSNP allele set, alleles from the studies are added to the ALT column.  The minor allele, which was previuosly reported in VCF as the GMAF, is the second largest value in the list.  This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter">
##INFO=<ID=COMMON,Number=0,Type=Flag,Description="RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS.    The order of these variants corresponds to the order of the info in the other clinical  INFO tags.">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="Variant Identifiers provided and maintained by organizations outside of NCBI, such as OMIM.  Source and id separated by colon (:).  Each identifier is separated by a vertical bar (|)">
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be summed: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-te
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic, 6 - drug response, 8 - confers sensitivity, 9 - risk-factor, 10 - as
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Variant disease database name and ID, separated by colon (:)">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="Preferred ClinVar disease name">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar Review Status: no_assertion - No asserition provided by submitter, no_criteria - No assertion criteria provided by submitter, single - Classified by single submitter, mult - Classified by multiple submit
##INFO=<ID=CLNACC,Number=.,Type=String,Description="For each allele (comma delimited), this is a pipe-delimited list of the Clinvar RCV phenotype accession.version strings associated with that allele.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  dbSNP

For your information, I added the 2 columns (FORMAT and dbSNP) with fixed values to the original data because I thought that could be the issue.

Here is what the data looks like:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dbSNP
NC_000001.11 10001 rs1570391677 T A,C . . RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON GT:GQ:DP 0/1:35:4
NC_000001.11 10002 rs1570391692 A C . . RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9944,0.005597 GT:GQ:DP 0/1:35:4
NC_000001.11 10003 rs1570391694 A C . . RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9902,0.009763 GT:GQ:DP 0/1:35:4
NC_000001.11 10007 rs1639538116 T C,G . . RS=1639538116;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=dbGaP_PopFreq:1,0,0 GT:GQ:DP 0/1:35:4
NC_000001.11 10008 rs1570391698 A C,G,T . . RS=1570391698;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9969,.,0.003086,.|dbGaP_PopFreq:1,0,.,0 GT:GQ:DP 0/1:35:4
NC_000001.11 10009 rs1570391702 A C,G . . RS=1570391702;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9911,.,0.008916|dbGaP_PopFreq:1,0,0 GT:GQ:DP 0/1:35:4
NC_000001.11 10013 rs1639538192 T C,G . . RS=1639538192;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=dbGaP_PopFreq:1,0,0 GT:GQ:DP 0/1:35:4
NC_000001.11 10013 rs1639538231 TA T . . RS=1639538231;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0 GT:GQ:DP 0/1:35:4
NC_000001.11 10014 rs1639538207 A C,G,T . . RS=1639538207;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=dbGaP_PopFreq:1,0,0,0 GT:GQ:DP 0/1:35:4

the original data can be found here: https://ftp.ncbi.nih.gov/snp/archive/b155/VCF/

Could you please check this.

thanks in advance
Amin

tiledb-vcf-java jar doesn't include native libraries

Hello - we have a project that includes tiledb-vcf to store our VCF data files. To use the Java API from within our project, I have compiled the TileDB-VCF and TileDB-Java jar files and included them in the build.gradle file. This is working fine when we run on a linux box.

I now want to compile and run on a MAC (with Intel chip, not ARM)

When I try to access the VCFReader, I get the error:

java.lang.UnsatisfiedLinkError: 'java.lang.String io.tiledb.libvcfnative.LibVCFNative.tiledb_vcf_version()'

Using "jar -tf" to look at the tiledb-vcf-java-0.28.0.jar created on the Mac, I see it does not include the native libraries shown below. These ARE included when we compiled version 0.25 on a Linux machine.

lib/
lib/libspdlog.a
lib/libtiledb.so.2.16
lib/libtiledbvcfjni.so
lib/libhts.so.1.15.1
lib/libtiledbvcf.so

The commands I am using on the MAC (with Intel chip, NOT M1) are:

(pre-run: set JAVA version to 11)
git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF/apis/java
./gradlew --debug shadowjar

I have also compiled and loaded the jar file for TileDB-Java using these commands:

git clone https://github.com/TileDB-Inc/TileDB-Java.git
./gradlew assemble

The jar file contains the files below.
lib/
lib/libtiledb.dylib
lib/libtiledbjni.dylib

Has something changed in the later versions that excludes the native libraries when compiling TileDB-VCF/apis/java into a jar file? Or perhaps there is a step or parameter I am missing?

Thanks - Lynn

Ingesting of combined VCF (cVCF) is not supported

Hi, I just tried TileDB-VCF with a cVCF containing ~ 900 samples.
Unfortunately, I get the error message:

Error registering samples; a file has more than 1 sample. Ingestion from cVCF is not supported.

I do not want to split a cVCF of this size into separate files.
Is there an alternative?
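One common workaround (outside TileDB-VCF itself) is to split the combined file into single-sample VCFs first, for example with bcftools' `+split` plugin, and ingest those. As a rough illustration of what such a split does, here is a minimal stdlib-only sketch; the `split_vcf` helper is hypothetical, not part of any API, and handles only simple tab-delimited VCF text:

```python
# Hypothetical sketch: split a combined, multi-sample VCF string into
# per-sample single-sample VCF strings. In practice `bcftools +split`
# does this far more robustly (indexing, compression, edge cases).

def split_vcf(text):
    """Return {sample_name: single-sample VCF text} for a simple VCF string."""
    header_lines, records = [], []
    for line in text.strip().splitlines():
        (header_lines if line.startswith("#") else records).append(line)
    chrom_line = header_lines[-1].split("\t")   # '#CHROM ... FORMAT s1 s2 ...'
    fixed, samples = chrom_line[:9], chrom_line[9:]
    out = {}
    for i, sample in enumerate(samples):
        lines = header_lines[:-1] + ["\t".join(fixed + [sample])]
        for rec in records:
            cols = rec.split("\t")
            # keep the 9 fixed columns plus this sample's genotype column
            lines.append("\t".join(cols[:9] + [cols[9 + i]]))
        out[sample] = "\n".join(lines) + "\n"
    return out

vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB\n"
    "1\t100\t.\tT\tC\t.\t.\t.\tGT\t0/1\t1/1\n"
)
per_sample = split_vcf(vcf)
print(sorted(per_sample))  # ['A', 'B']
```

Each resulting file can then be passed to `tiledbvcf store` individually, which also makes batch-parallel ingestion possible.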

gVCF support

Dear,

Does TileDB-VCF support the gVCF files like GATK-Genomics DB?

Thanks in advance

Regards
Amin

TileDB-VCF Functionalities

Dear,

I recently started looking into TileDB-VCF and testing it, and I have some questions:

1. Does TileDB-VCF have the ability to calculate the unique list of chromosomal positions across all samples, or should we implement this functionality ourselves with, for example, the Python API?

2. If I add the same sample that I previously stored in the database, the data is added on top of the previous sample and the records are doubled. As far as I can tell, --no-duplicates must be defined when the dataset is created, and I assume this prevents a duplicate sample or record from being added to the dataset. But can I use the timestamp to always query the latest version of each record?

3. Is the timestamp an autogenerated value, or do I need to define it beforehand?

Sorry for all the questions, and thanks in advance.

Kind Regards
Amin
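For context on the timestamp question above: TileDB records a timestamp for each write fragment automatically, and reads resolve duplicates in favor of the newest write. A toy, stdlib-only illustration of that last-write-wins idea (the `VersionedStore` class is hypothetical, not the tiledbvcf API):

```python
# Hypothetical illustration (NOT the TileDB-VCF API): keep multiple
# timestamped versions of each record and resolve reads to the newest
# version at or before a chosen timestamp, the way timestamped fragments
# shadow older writes.
from collections import defaultdict

class VersionedStore:
    def __init__(self):
        self._versions = defaultdict(list)   # key -> [(timestamp, value), ...]

    def write(self, key, value, timestamp):
        self._versions[key].append((timestamp, value))

    def read_latest(self, key, at=float("inf")):
        """Return the newest value written at or before `at`, or None."""
        candidates = [(t, v) for t, v in self._versions[key] if t <= at]
        return max(candidates)[1] if candidates else None

store = VersionedStore()
store.write(("chr1", 10001), "0/1", timestamp=1)
store.write(("chr1", 10001), "1/1", timestamp=2)   # same sample re-ingested
print(store.read_latest(("chr1", 10001)))          # newest wins -> 1/1
print(store.read_latest(("chr1", 10001), at=1))    # time travel -> 0/1
```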

Build failing in linux/arm64 Ubuntu VM

Hi!

I am experiencing trouble when building TileDB-VCF under the following conditions:

Host: Apple Silicon M2
VM Manager: UTM
VM Image OS: Ubuntu 22.04.3 LTS aarch64

Do you have any ideas of what may be the culprit here? Apart from the fact that ld fails, that is. For the record, TileDB appears to install fine via pip install tiledb, just in case that is at all relevant.

NB! Using conda is a no go.

[ 95%] Linking CXX shared library libtiledbvcf.so
[ 97%] Building CXX object src/CMakeFiles/tiledbvcf-bin.dir/cli/tiledbvcf.cc.o
/usr/bin/ld: /TileDB-VCF/libtiledbvcf/build/externals/install/lib/libtiledb.so.2.17: error adding symbols: file in wrong format
collect2: error: ld returned 1 exit status
make[5]: *** [src/CMakeFiles/tiledbvcf.dir/build.make:157: src/libtiledbvcf.so] Error 1
make[4]: *** [CMakeFiles/Makefile2:147: src/CMakeFiles/tiledbvcf.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....
[100%] Linking CXX executable tiledbvcf
/usr/bin/ld: /TileDB-VCF/libtiledbvcf/build/externals/install/lib/libtiledb.so.2.17: error adding symbols: file in wrong format
collect2: error: ld returned 1 exit status
make[5]: *** [src/CMakeFiles/tiledbvcf-bin.dir/build.make:173: src/tiledbvcf] Error 1
make[4]: *** [CMakeFiles/Makefile2:173: src/CMakeFiles/tiledbvcf-bin.dir/all] Error 2
make[3]: *** [Makefile:146: all] Error 2
make[2]: *** [CMakeFiles/libtiledbvcf.dir/build.make:86: libtiledbvcf-prefix/src/libtiledbvcf-stamp/libtiledbvcf-build] Error 2
make[1]: *** [CMakeFiles/Makefile2:228: CMakeFiles/libtiledbvcf.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
Traceback (most recent call last):
  File "/TileDB-VCF/apis/python/setup.py", line 274, in <module>
CMake configure command: ['cmake', '-DFORCE_EXTERNAL_HTSLIB=ON', '-DFORCE_EXTERNAL_TILEDB=OFF', '-DTILEDB_S3=ON', '-DDOWNLOAD_TILEDB_PREBUILT=ON', '-DCMAKE_BUILD_TYPE=Release', '-DENABLE_ARROW_EXPORT=ON', '../../libtiledbvcf']
    setup(
  File "/usr/local/lib/python3.11/site-packages/setuptools/__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/usr/local/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
    self.run_command(cmd)
  File "/usr/local/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.11/site-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()
  File "/usr/local/lib/python3.11/site-packages/setuptools/command/install.py", line 123, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/local/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/TileDB-VCF/apis/python/setup.py", line 264, in run
    find_or_build_libtiledbvcf(self)
  File "/TileDB-VCF/apis/python/setup.py", line 179, in find_or_build_libtiledbvcf
    build_libtiledbvcf()
  File "/TileDB-VCF/apis/python/setup.py", line 164, in build_libtiledbvcf
    subprocess.check_call(build_cmd, cwd=build_dir)
  File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['make', '-j6']' returned non-zero exit status 2.
------
failed to solve: process "/bin/sh -c python setup.py install" did not complete successfully: exit code: 1

Ingestion of 2k VCFs in parallel on a cluster?

Hi,

I wonder whether TileDB-VCF can support running ingestion of batches of VCFs on separate nodes of a cluster (possibly using some throwaway, temporary databases) and then merging them into one proper database to be used for queries/extraction of SNPs from a given interval.

That would greatly help with creating a database. Hogging one server for days is something to be avoided if possible in our setup.

Thanks a lot for delivering an already great solution.

Darek Kedra

Java API: Request to support loading Mac-ARM libraries

Our group is developing software, PHG (Practical Haplotype Graph, https://github.com/maize-genetics/phg_v2), that makes use of TileDB-VCF to store VCF files. Our own code is written using Kotlin, a popular Java replacement, which can call Java code seamlessly. We plan to use the TileDB-VCF Java interface to read from the database and the CLI to write, since the Java interface does not support writing. We have written and tested code to access the TileDB database using the API and have that working. Some calls we need (sample names and headers) use the TileDB library, so we need to use both. Also, we want to support Linux, Mac-Intel, and Mac-ARM. We plan to combine the TileDB-Java and TileDB-VCF jar files to avoid distributing two copies of the tiledb shared library, which is rather big. (We have tested combining them and it works fine.) We have noticed that there is a TileDB-Java jar file on Maven Central that contains libraries for Linux, Mac-Intel, Mac-ARM, and Windows. This leads to the following questions:
(1) Are there plans to port the native library loader from TileDB-Java to the TileDB-VCF Java api to support automatically loading Mac-ARM libraries from the jar file?
(2) If not, would you like us to port that code and contribute the change to your project?
(3) We would put any jars that we build (such as a combined TileDB-Java:TileDB-VCF jar) in the Maven Central net.maizegenetics group. Do you have any issues with that? (Our distribution plans are preliminary and could change.)

Preparing for future bioconda submission

This isn't urgent, but I recently investigated the potential submission to bioconda, so I wanted to create this reminder and share my notes:

  • Because TileDB already uses the conda-forge feedstock infrastructure to build conda binaries for TileDB-VCF, TileDB-Inc/tiledb-vcf-feedstock, it would be trivial to migrate to the conda-forge channel
  • However, TileDB-VCF cannot be submitted to conda-forge because it depends on htslib, which is only available from bioconda. conda-forge has a long-standing policy of requiring all of its dependencies to be available from conda-forge (recent example to confirm this policy still holds)
  • The bioconda channel depends on conda-forge packages, but it uses its own distinct mechanisms for building and maintaining recipes. The biggest difference is that they use a mono-repo, bioconda-recipes
  • The bioconda channel accepts GitHub Releases as stable URLs, so TileDB-VCF already meets its requirements (ie no need to first submit to an external repository such as PyPI)
  • The bioconda channel has its own auto-updater that can detect GitHub releases, and will open a new PR with the new version and checksum
  • Thus once on bioconda, it would be extra work to continue maintaining TileDB-Inc/tiledb-vcf-feedstock, so this would only be worth the effort if there was a separate use case (eg nightly conda builds)

xref: #47

Very high RAM usage when storing plant variant data from GVCFs

Hello!

I have been using TileDB-VCF to store variant data from several dozen sorghum lines. After several attempts to run tiledbvcf store resulted in the process being killed, I noticed that it was using upwards of 400 GB of RAM (more than my machine had free at the time). Even trying to store variant data from a single sample required more than 350 GB of RAM. Each sorghum gvcf is about 1-2GB uncompressed and contains 6-8 million records, so that seemed rather high! I had similar results trying to store maize genomes as well.

Original conditions were as follows:

OS: Rocky Linux version 9.0
tiledb: 2.17.4
tiledbvcf: 0.25

I also tested it on a Mac with an M1 chip running macOS version 12.5 (same tiledb/vcf versions). This machine only had 64 GB of RAM to begin with, so the process ran out of free memory very quickly!

I have tracked down the source of this problem to the large deletions present in my gvcfs. The gvcfs were generated from whole-genome alignments of assemblies using the AnchorWave tool, and as a result they capture large insertions and deletions, some of which are 10s of Mbp long. It was most convenient for me to include the full sequence of the insertions and deletions rather than using symbolic <INS> or <DEL> alleles, so that sequence could be reconstructed with only the reference fasta. However, the large deletions specifically seemed to be the cause of the high RAM usage. Removing all deletions more than 100 bp long or replacing deletions with symbolic <DEL> alleles dropped the peak RAM usage to a more workable 50 GB with the default 10 samples per batch. Large insertions do not appear to have a great effect on the RAM usage.

That said, I'm unclear as to why the deletions are such a problem in the first place. Do you have any idea what would be eating up that much RAM? Is it a bug, or just a logical consequence of how deletions are handled? Any insight you have would be appreciated.
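The workaround described above (replacing long explicit deletions with symbolic `<DEL>` alleles) can be sketched as follows. The `symbolize_long_deletions` helper is hypothetical and stdlib-only, handles only the simple biallelic case, and records the deleted span via `INFO/END` as the VCF spec requires for symbolic alleles:

```python
# Hypothetical sketch of the workaround: rewrite VCF records whose REF
# spells out a deletion longer than `max_len` bases to use the symbolic
# <DEL> allele instead, keeping only the anchor base in REF and adding
# INFO/END for the deleted span. Multi-allelic records are left untouched
# (they would need per-allele handling).

def symbolize_long_deletions(line, max_len=100):
    """Rewrite one VCF record line; header and short records pass through."""
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    ref, alt = cols[3], cols[4]
    if "," in alt or len(ref) - len(alt) <= max_len:
        return line
    end = int(cols[1]) + len(ref) - 1          # last deleted reference base
    cols[7] = ("" if cols[7] in (".", "") else cols[7] + ";") + f"END={end}"
    cols[3], cols[4] = ref[0], "<DEL>"         # anchor base + symbolic allele
    return "\t".join(cols) + "\n"

rec = "chr1\t100\t.\t" + "A" * 200 + "\tA\t.\t.\t.\tGT\t0/1\n"
out = symbolize_long_deletions(rec)
print(out.split("\t")[3:5])  # ['A', '<DEL>']
```

In practice a tool like bcftools is a more robust way to do this kind of rewriting, but the sketch shows why the records shrink so dramatically: the multi-Mbp REF strings disappear.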

Docker build does not work from within docker file

In order to have Dockerfile-cli build correctly, I had to move it to the root directory. Some of the COPY paths probably need to be changed in the Dockerfiles now that they are in the docker directory.

locale::facet::_S_create_c_locale name not valid (ver 0.11.1)

Hello,

I am getting the error in the issue subject while executing:

 singularity exec --bind ../release_7:/data  ~/SIFs/tiledbvcf-cli_0.11.1.sif tiledbvcf store --uri /data/R7_tiledbvcf /data/02_tsv2vcf_results/finngen_R7_AB1_AMOEBIASIS.gwas2vcf.annot.vcf.gz
terminate called after throwing an instance of 'std::runtime_error'
  what():  locale::facet::_S_create_c_locale name not valid
fish: Job 1, 'singularity exec --bind ../rele…' terminated by signal SIGABRT (Abort)

To get the above Singularity image and initiate the TileDB database I did:

#get the image
singularity pull tiledbvcf-cli_0.11.1.sif  docker://tiledb/tiledbvcf-cli:0.11.1

#create the store/initiate db
singularity exec --bind ../release_7:/data  ~/SIFs/tiledbvcf-cli_0.11.1.sif tiledbvcf create --uri /data/R7_tiledbvcf

The commands work just fine with version obtained from: docker://tiledb/tiledbvcf-cli:0.9.0

On the positive side:
tiledbvcf extract took just 2.5 minutes to dump over 700k rows from a query on a fairly large dataset (3k VCFs with over 16M variants each, 1.9 TB total) using not exactly optimal NFS-mounted storage. Very impressive!

Thank you for your help

Darek Kedra

Azure blob storage credential parsing problem

Error when using the credentials to connect to an azure blob

azure_cred = ["vfs.azure.storage_account_name=myblob", "vfs.azure.storage_sas_token=?sv=123456789=="]
config = tiledbvcf.ReadConfig(tiledb_config=azure_cred)
ds = tiledbvcf.Dataset(uri,
                       mode = "w",
                       cfg=config)
ds.create_dataset()

RuntimeError: TileDB-VCF exception: Error setting TileDB config parameter; bad value 'vfs.azure.storage_sas_token=?sv=123456789==

This could be due to the utils::split function splitting on the multiple "=" characters that typically exist in Azure SAS tokens & Azure account keys.
See utils.cc, lines 318-331:

void set_tiledb_config(
    const std::vector<std::string>& params, tiledb_config_t* cfg) {
  tiledb_error_t* err;
  for (const auto& s : params) {
    auto kv = utils::split(s, '=');
    if (kv.size() != 2)
      throw std::runtime_error(
          "Error setting TileDB config parameter; bad value '" + s + "'");
    utils::trim(&kv[0]);
    utils::trim(&kv[1]);
    tiledb_config_set(cfg, kv[0].c_str(), kv[1].c_str(), &err);
    tiledb::impl::check_config_error(err);
  }
}
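A minimal Python illustration of the suspected behavior and the likely fix, mirroring the C++ snippet above: splitting on every '=' fragments the token, while splitting only on the first '=' keeps the key and value intact.

```python
# The failing parse, illustrated: SAS tokens contain '=' themselves, so a
# naive split yields more than two pieces and the parameter is rejected.
param = "vfs.azure.storage_sas_token=?sv=123456789=="

pieces = param.split("=")
print(len(pieces))          # 5, not 2 -> triggers the "bad value" error

# Splitting only on the FIRST '=' (the equivalent of a maxsplit-1 split in
# the C++ utils::split) preserves the value:
key, value = param.split("=", 1)
print(key)                  # vfs.azure.storage_sas_token
print(value)                # ?sv=123456789==
```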

coordinates

Hi

I tried to load some example BCFs... and got this message. Does ingestion require a sorted file?

tiledbvcf store --uri my-dataset C_SNP_INDEL_recal.vcf.bcf
[2019-11-14 22:25:28.049] [tiledb] [Process: 550] [Thread: 684] [error] [TileDB::Writer] Error: Write failed; Coordinates (5,6233885) succeed (5,6233870) in the global order
terminate called after throwing an instance of 'tiledb::TileDBError'
what(): [TileDB::Writer] Error: Write failed; Coordinates (5,6233885) succeed (5,6233870) in the global order
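The error suggests the input is not coordinate-sorted; `bcftools sort` will reorder a VCF/BCF before ingestion. As a quick stdlib-only sanity sketch (the `first_unsorted` helper is hypothetical), records can be checked for per-contig coordinate order first:

```python
# Hypothetical sketch: find the first record that violates per-contig
# coordinate order, which is what the "Coordinates ... succeed ... in the
# global order" error above is complaining about.

def first_unsorted(records):
    """records: iterable of (contig, pos). Return the index of the first
    out-of-order record within its contig, or None if sorted."""
    prev = None
    for i, rec in enumerate(records):
        if prev is not None and rec[0] == prev[0] and rec[1] < prev[1]:
            return i
        prev = rec
    return None

# Mirrors the coordinates from the error message: 6233885 written before
# 6233870 on contig 5.
recs = [("5", 6233885), ("5", 6233870)]
print(first_unsorted(recs))  # 1
```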

Thanks
Steve

tiledbvcf export missing ID column in the TSV output mode?

Hello,

I have run a few queries like this one:

  singularity exec \
--bind ../release_7:/data  ~/SIFs/tiledbvcf-cli_0.11.1.sif tiledbvcf export  --uri /data/R7_tiledbvcf  \
--regions '22:39349925-39385575'  \
-Ot \
--tsv-fields CHR,POS,ID,REF,ALT,I:PGENE,I:ALTFCAS,I:ALTFCTR,S:ES,S:SE,S:LP,S:AF,S:AF  \
--output-path  /data/SYNGR1_out_all_full.tsv --samples-file /data/test_all.samples.list \
--mem-budget-mb 24000  

The problem:

There are no values in the ID column. This does not seem to be an issue with a single TileDB-VCF database/set of VCFs, nor a case where, say, a number of the top positions simply have no ID values in the first place.

I will recheck if this is also the case for the other output types.

UPDATE

As expected, there are values in the ID column in the VCF output.

DK

Segmentation fault on export

When exporting to any format but TSV, I get the following error.

$ tiledbvcf export --uri 150_debug_cli --verbose --output-format z --sample-names SZAXPI008746-45,SZAXPI008747-46,SZAXPI008748-47 --regions SL2.50ch00:1-100000 --output-dir 150_debug_cli_query
[W::bcf_hdr_check_sanity] GL should be declared as Number=G
Sorted 1 regions in 5.84e-05 seconds.
Allocating 11 fields (17 buffers) of size 63161283 bytes (60.2353MB)
Initialized TileDB query with 1 start_pos ranges,3 samples for contig SL2.50ch00 (contig batch 1/1, sample batch 1/1).
Segmentation fault (core dumped)
$ tiledbvcf stat --uri 150_debug_cli
[W::bcf_hdr_check_sanity] GL should be declared as Number=G
Statistics for dataset '150_debug_cli':
- Version: 4
- Tile capacity: 10,000
- Anchor gap: 1,000
- Number of samples: 84
- Extracted attributes: none

Core Dump

$ gdb `which tiledbvcf` /tmp/core
(No debugging symbols found in /home/saulo/anaconda3/bin/tiledbvcf)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--c
Core was generated by `tiledbvcf export --uri 150_debug_cli --verbose --output-format z --sample-names'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:383
383     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
[Current thread is 1 (Thread 0x7f0f791e3bc0 (LWP 31549))]
(gdb) bt
#0  __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:383
#1  0x00007f0f7a56689d in bcf1_sync () from /home/saulo/anaconda3/bin/../lib/libhts.so.3
#2  0x00007f0f7a566f74 in bcf_copy () from /home/saulo/anaconda3/bin/../lib/libhts.so.3
#3  0x000056262389dc67 in tiledb::vcf::BCFExporter::buffer_record(tiledb::vcf::SampleAndId const&, bcf_hdr_t const*, bcf1_t*) ()
#4  0x000056262389ddd1 in tiledb::vcf::BCFExporter::export_record(tiledb::vcf::SampleAndId const&, bcf_hdr_t const*, tiledb::vcf::Region const&, unsigned int, tiledb::vcf::ReadQueryResults const&, unsigned long) ()
#5  0x00005626238af750 in tiledb::vcf::Reader::report_cell(tiledb::vcf::Region const&, unsigned int, unsigned long) ()
#6  0x00005626238afb32 in tiledb::vcf::Reader::process_query_results_v4() ()
#7  0x00005626238b0ac9 in tiledb::vcf::Reader::read_current_batch() ()
#8  0x00005626238bb732 in tiledb::vcf::Reader::read() ()
#9  0x0000562623847d40 in main ()
$ tiledbvcf --version
TileDB-VCF version 159d3ec-modified
TileDB version 2.1.6
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

Broken doc link

The link from the data model doc to the [genomics use case](https://docs.tiledb.com/main/ssrSpaces/spaces/main/use-cases/genomics) doc seems to be broken.


ds.count(...) always reports 0 (samples not really ingested, but no error thrown)

First, is there a gitter or slack channel for tiledb? Pretty interesting project, and I'd like to get more plugged in with the community if there is one.

Second,

I have been able to ingest your test data just fine. For instance, queries over all of the smalli.vcf.gz files behave as expected. However, I am not able to get a real cohort ingested properly.

Specifically

import tiledbvcf

ds = tiledbvcf.TileDBVCFDataset('test-tiledbvcf', mode='w')
sample_names = [...]
sample_files = [...]
ds.ingest_samples(sample_files)

Seems to work fine. At least there are no errors, but the folder of the array has very little data in it.

$tree test-tiledbvcf/
test-tiledbvcf/
├── data
│   ├── __array_schema.tdb
│   ├── __lock.tdb
│   └── __meta
│       ├── __1586830166573_1586830166573_b44eea17956346ea998c06fec5f44d13
│       └── __1586830197336_1586830197336_8fe33e6c13b04b72bba2e232f32d22e3
├── metadata
│   ├── __tiledb_group.tdb
│   └── vcf_headers
│       ├── __1586830197325_1586830197325_ebdd203f6abb4b35ae8afdb6c0acb1b0
│       │   ├── __fragment_metadata.tdb
│       │   ├── header.tdb
│       │   └── header_var.tdb
│       ├── __array_schema.tdb
│       ├── __lock.tdb
│       └── __meta
└── __tiledb_group.tdb
$du -h ./test-tiledbvcf/
16K     ./test-tiledbvcf/metadata/vcf_headers/__1586830197325_1586830197325_ebdd203f6abb4b35ae8afdb6c0acb1b0
4.0K    ./test-tiledbvcf/metadata/vcf_headers/__meta
28K     ./test-tiledbvcf/metadata/vcf_headers
32K     ./test-tiledbvcf/metadata
12K     ./test-tiledbvcf/data/__meta
20K     ./test-tiledbvcf/data
56K     ./test-tiledbvcf/

Also, any count query such as

ds.count(samples=sample_names, regions=[...])

returns 0.

The records of my cohort all look like:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  !REDACTED!
1       1       .       N       <NON_REF>       .       .       END=10003       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
1       10004   .       C       <NON_REF>       .       .       END=10006       GT:DP:GQ:MIN_DP:PL      0/0:27:9:25:0,9,135
1       10007   .       T       <NON_REF>       .       .       END=10007       GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,653
1       10008   .       A       <NON_REF>       .       .       END=10008       GT:DP:GQ:MIN_DP:PL      0/0:30:15:30:0,15,225

And the header seems to have all of the necessary fields and seems the same as the test samples.

Any ideas?

Spark Shell ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2

Hello,

I am getting the following error while trying to import the data into spark:
val df = spark.read.format("io.tiledb.vcf").option("uri", "/data/workspace/dataset").option("range_partitions", 2).load()

The error is:
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2

and I load the shell using:
./spark-shell --jars ~/TileDB-VCF/apis/spark/build/libs/TileDB-VCF-Spark-0.7.0-SNAPSHOT-default.jar

Upgrading to Docker desktop v4.15 breaks building Dockerfile-py

Docker Desktop for Windows released version 4.15 on 1 Dec 2022. After upgrading from 4.14.1 to 4.15, I encountered errors in the docker build step for tiledbvcf.

SOLUTION: After uninstalling docker 4.15 and reinstalling 4.14.1, the build proceeds normally.

Command:
docker build -f .\docker\Dockerfile-py .

Output:

[+] Building 272.9s (8/15)
 => [internal] load build definition from Dockerfile-py                                                            0.0s
 => => transferring dockerfile: 35B                                                                                0.0s
 => [internal] load .dockerignore                                                                                  0.0s
 => => transferring context: 34B                                                                                   0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04                                                    2.2s
 => [ 1/11] FROM docker.io/library/ubuntu:20.04@sha256:450e066588f42ebe1551f3b1a535034b6aa46cd936fe7f2c6b0d72997e  0.0s
 => [internal] load build context                                                                                  0.3s
 => => transferring context: 109.55kB                                                                              0.3s
 => CACHED [ 2/11] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommend  0.0s
 => CACHED [ 3/11] RUN pip install --no-cache-dir cython pandas>=1.21.0 tiledb==0.11.3                             0.0s
 => ERROR [ 4/11] RUN cd /tmp     && git clone https://github.com/apache/arrow.git -b apache-arrow-5.0.0     &&  270.6s
------
 > [ 4/11] RUN cd /tmp     && git clone https://github.com/apache/arrow.git -b apache-arrow-5.0.0     && cd arrow/cpp     && mkdir build     && cd build     && cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local                 -DARROW_FLIGHT=ON                 -DARROW_GANDIVA=ON                 -DARROW_ORC=ON                 -DARROW_WITH_BZ2=ON                 -DARROW_WITH_ZLIB=ON                 -DARROW_WITH_ZSTD=ON                 -DARROW_WITH_LZ4=ON                 -DARROW_WITH_SNAPPY=ON                 -DARROW_WITH_BROTLI=ON                 -DARROW_PARQUET=ON                 -DARROW_PYTHON=ON                 -DARROW_PLASMA=ON     && make -j$(nproc)     && make install     && cd /tmp/arrow/python     && python3 setup.py install     && cd /tmp     && rm -r arrow:
#6 0.426 Cloning into 'arrow'...
#6 16.96 Note: switching to '4591d76fce2846a29dac33bf01e9ba0337b118e9'.
#6 16.96
#6 16.96 You are in 'detached HEAD' state. You can look around, make experimental
#6 16.96 changes and commit them, and you can discard any commits you make in this
#6 16.96 state without impacting any branches by switching back to a branch.
#6 16.96
#6 16.96 If you want to create a new branch to retain commits you create, you may
#6 16.96 do so (now or later) by using -c with the switch command. Example:
#6 16.96
#6 16.96   git switch -c <new-branch-name>
#6 16.96
#6 16.96 Or undo this operation with:
#6 16.96
#6 16.96   git switch -
#6 16.96
#6 16.96 Turn off this advice by setting config variable advice.detachedHead to false
#6 16.96
#6 17.32 -- Building using CMake version: 3.16.3
#6 17.41 -- The C compiler identification is GNU 9.4.0
#6 17.49 -- The CXX compiler identification is GNU 9.4.0

--- truncated for brevity ---

src/arrow/dataset/CMakeFiles/arrow_dataset_objlib.dir/file_csv.cc.o
#6 170.8 [ 33%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/cache.cc.o
#6 170.9 [ 34%] Building CXX object src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/protocol_internal.cc.o
#6 171.7 [ 34%] Building CXX object src/arrow/dataset/CMakeFiles/arrow_dataset_objlib.dir/file_parquet.cc.o
#6 172.0 [ 35%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/cast_time.cc.o
#6 172.3 [ 35%] Building CXX object src/arrow/python/CMakeFiles/arrow_python_objlib.dir/common.cc.o
#6 173.2 [ 36%] Building CXX object src/arrow/python/CMakeFiles/arrow_python_objlib.dir/datetime.cc.o
#6 173.3 [ 36%] Building CXX object src/arrow/python/CMakeFiles/arrow_python_objlib.dir/decimal.cc.o
#6 174.2 [ 36%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/array_binary.cc.o
#6 190.9 c++: fatal error: Killed signal terminated program cc1plus
#6 190.9 compilation terminated.
#6 190.9 make[2]: *** [src/arrow/python/CMakeFiles/arrow_python_objlib.dir/build.make:63: src/arrow/python/CMakeFiles/arrow_python_objlib.dir/arrow_to_pandas.cc.o] Error 1
#6 190.9 make[2]: *** Waiting for unfinished jobs....
#6 191.0 [ 36%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/configuration.cc.o
#6 191.1 [ 37%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/array_decimal.cc.o
#6 191.3 [ 37%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/array_dict.cc.o
#6 191.6 [ 37%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/context_helper.cc.o
#6 191.9 [ 37%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/decimal_ir.cc.o
#6 192.3 [ 37%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/array_nested.cc.o
#6 192.3 [ 37%] Built target arrow_python_flight_objlib
#6 192.3 [ 37%] Building CXX object src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/serialization_internal.cc.o
#6 192.4 make[1]: *** [CMakeFiles/Makefile2:2641: src/arrow/python/CMakeFiles/arrow_python_objlib.dir/all] Error 2
#6 192.4 make[1]: *** Waiting for unfinished jobs....
#6 192.4 [ 38%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/decimal_type_util.cc.o
#6 194.0 c++: fatal error: Killed signal terminated program cc1plus
#6 194.0 compilation terminated.
#6 194.0 make[2]: *** [src/arrow/dataset/CMakeFiles/arrow_dataset_objlib.dir/build.make:141: src/arrow/dataset/CMakeFiles/arrow_dataset_objlib.dir/scanner.cc.o] Error 1
#6 194.0 make[2]: *** Waiting for unfinished jobs....
#6 194.1 [ 38%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/decimal_xlarge.cc.o
#6 194.2 [ 38%] Building CXX object src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/server.cc.o
#6 194.2 [ 38%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/array_primitive.cc.o
#6 194.3 [ 38%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_adaptive.cc.o
#6 194.6 [ 39%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_base.cc.o
#6 195.1 [ 39%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_binary.cc.o
#6 195.6 [ 39%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/engine.cc.o
#6 196.2 [ 39%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_decimal.cc.o
#6 196.4 [ 39%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o
#6 196.5 [ 39%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/date_utils.cc.o
#6 197.1 [ 39%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_nested.cc.o
#6 197.7 [ 40%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_primitive.cc.o
#6 197.8 [ 40%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_union.cc.o
#6 198.1 [ 40%] Building CXX object src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/server_auth.cc.o
#6 198.6 [ 40%] Building CXX object src/gandiva/CMakeFiles/gandiva_objlib.dir/expr_decomposer.cc.o
[... ~140 similar "Building CXX object" progress lines and compiler warnings elided ...]
#6 201.1 c++: fatal error: Killed signal terminated program cc1plus
#6 201.1 compilation terminated.
#6 201.1 make[2]: *** [src/gandiva/CMakeFiles/gandiva_objlib.dir/build.make:167: src/gandiva/CMakeFiles/gandiva_objlib.dir/decimal_xlarge.cc.o] Error 1
#6 201.1 make[2]: *** Waiting for unfinished jobs....
#6 206.8 c++: fatal error: Killed signal terminated program cc1plus
#6 206.8 compilation terminated.
#6 206.8 make[2]: *** [src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/build.make:63: src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/client.cc.o] Error 1
#6 206.8 make[2]: *** Waiting for unfinished jobs....
#6 210.8 make[1]: *** [CMakeFiles/Makefile2:3245: src/gandiva/CMakeFiles/gandiva_objlib.dir/all] Error 2
#6 216.9 make[1]: *** [CMakeFiles/Makefile2:2317: src/arrow/dataset/CMakeFiles/arrow_dataset_objlib.dir/all] Error 2
#6 221.5 make[1]: *** [CMakeFiles/Makefile2:2445: src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/all] Error 2
#6 242.4 c++: fatal error: Killed signal terminated program cc1plus
#6 242.4 compilation terminated.
#6 242.4 make[2]: *** [src/arrow/CMakeFiles/arrow_objlib.dir/build.make:1480: src/arrow/CMakeFiles/arrow_objlib.dir/csv/reader.cc.o] Error 1
#6 242.4 make[2]: *** Waiting for unfinished jobs....
#6 270.5 make[1]: *** [CMakeFiles/Makefile2:2118: src/arrow/CMakeFiles/arrow_objlib.dir/all] Error 2
#6 270.5 make: *** [Makefile:141: all] Error 2
------
executor failed running [/bin/sh -c cd /tmp     && git clone https://github.com/apache/arrow.git -b apache-arrow-5.0.0     && cd arrow/cpp     && mkdir build     && cd build     && cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local                 -DARROW_FLIGHT=ON                 -DARROW_GANDIVA=ON                 -DARROW_ORC=ON                 -DARROW_WITH_BZ2=ON                 -DARROW_WITH_ZLIB=ON                 -DARROW_WITH_ZSTD=ON                 -DARROW_WITH_LZ4=ON                 -DARROW_WITH_SNAPPY=ON                 -DARROW_WITH_BROTLI=ON                 -DARROW_PARQUET=ON                 -DARROW_PYTHON=ON                 -DARROW_PLASMA=ON     && make -j$(nproc)     && make install     && cd /tmp/arrow/python     && python3 setup.py install     && cd /tmp     && rm -r arrow]: exit code: 2

invalid next size on export -m gatk gvcfs

Hi,

I am testing whether tiledb-vcf can also store gVCFs produced by the GATK pipeline, and I stumbled on an issue where I get an "invalid next size" error message when I try to export a VCF that has been stored in the tiledb-vcf database. Any hint as to what it actually means?
I only loaded 2 samples into it.

Here the debug message:

[tiledb-vcf] [Process: 1] [Thread: 1] [info] Initialized TileDB query with 1 start_pos ranges, all samples for contig 1 (contig batch 1/194, sample batch 1/0).
 [tiledb-vcf] [Process: 1] [Thread: 1] [info] TileDB query started. (VmRSS = 0.285 GiB)
 [tiledb-vcf] [Process: 1] [Thread: 1] [info] TileDB query completed in 10.081 sec. (VmRSS = 2.401 GiB)
realloc(): invalid next size

export with -m (merge) option

Hello -

I am using tiledbvcf to create a dataset that I would later like to export as a merged VCF file. I can successfully load and export data from this dataset. What I would like to do is export to a multi-sample VCF file. It looks like export with the -m option should handle this, though it gives me memory errors. I added the -b flag to increase the buffer size, but still no luck. The command I am running:

tiledbvcf export --uri tiledb_datasets/gvcf_dataset  -m -b 65536 -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf

The error I get:

Exception: SubarrayPartitioner: Trying to partition a unary range because of memory budget, this will cause the query to run very slow. Increase `sm.memory_budget` and `sm.memory_budget_var` through the configuration settings to avoid this issue. To override and run the query with the same budget, set `sm.skip_unary_partitioning_budget_check` to `true`.
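For reference, the error names the exact settings it wants raised. Assuming the CLI accepts TileDB config overrides via a --tiledb-config option taking comma-separated key=value pairs (an assumption here, as are the 4 GiB values), a retry might look like:

```shell
# Hypothetical retry: raise the memory budgets named in the error message.
# The --tiledb-config flag and the 4 GiB budget values are assumptions,
# not verified against this tiledbvcf build.
tiledbvcf export \
  --uri tiledb_datasets/gvcf_dataset \
  -m -b 65536 \
  --tiledb-config "sm.memory_budget=4294967296,sm.memory_budget_var=4294967296" \
  -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf
```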

Is there another trick to running the tiledbvcf export command to create a merged vcf? Thank you

I am running tiledbvcf version:

(phgv2-conda) [lcj34@cbsubl01 phg_v2]$ tiledbvcf --version
TileDB-VCF version 0f72331-modified
TileDB version 2.16.3
htslib version 1.16

My machine runs Linux, with these specifics:

NAME="Rocky Linux"
VERSION="9.0 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"

std::out_of_range when exporting to VCF

Hello,

After inserting the SZAXPI009289-84.vcf.gz file (just uploaded to the same location as in our previous communication), I get the following error when querying it:

tiledbvcf export --uri 150_debug_cli --verbose --output-format z --sample-names SZAXPI009289-84 --regions SL2.50ch00:1-500000 --output-dir 150_debug_cli_query
Sorted 1 regions in 3.99e-05 seconds.
Allocating 11 fields (17 buffers) of size 28422577 bytes (27.1059MB)
Initialized TileDB query with 1 start_pos ranges,1 samples for contig SL2.50ch00 (contig batch 1/1, sample batch 1/1).
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted

Exporting VCF writes "binary" DB INFO field

Exporting BCF to VCF format writes a DB INFO field that contains non-ASCII characters, for example DB=<B8>^^žDP.

NC_000014.9	94378225	rs2073333	C	T	64.64	.	AC=1;AF=0.5;AN=2;BaseQRankSum=-0.589;DB=���DP;DP=9;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=43.96;MQRankSum=-2.655;QD=7.18;ReadPosRankSum=1.309;SOR=0.446	GT:AD:DP:GQ:PL	0/1:5,4:9:72:72,0,107

which looks like

00000040  51 52 61 6e 6b 53 75 6d  3d 2d 30 2e 35 38 39 3b  |QRankSum=-0.589;|
00000050  44 42 3d b4 c8 16 bf 44  50 3b 44 50 3d 39 3b 45  |DB=....DP;DP=9;E|
00000060  78 63 65 73 73 48 65 74  3d 30 3b 46 53 3d 30 3b  |xcessHet=0;FS=0;|
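The hexdump pins the problem down: the bytes after DB= (b4 c8 16 bf) fall outside printable ASCII. As a minimal sketch (a hypothetical stdlib-only helper, not part of TileDB-VCF), one can flag such fields in exported output:

```python
def has_binary_bytes(field: bytes) -> bool:
    """Return True if the field contains bytes outside printable ASCII.

    Tab (0x09) is deliberately excluded from the check elsewhere by
    splitting on it first, since VCF columns are tab-separated.
    """
    return any(b < 0x20 or b > 0x7E for b in field)

# The DB value from the hexdump above: 44 42 3d b4 c8 16 bf 44 50
corrupted = b"DB=\xb4\xc8\x16\xbfDP"
clean = b"DP=9"
print(has_binary_bytes(corrupted), has_binary_bytes(clean))  # prints: True False
```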

Errors building on Mac Catalina

Hi

I am hitting a lot of errors when trying to build the Python API on Catalina. The errors come after this command:

clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Users/nuin/src/TileDB-VCF/apis/python/.eggs/pybind11-2.6.2-py3.8.egg/pybind11/include -I/Users/nuin/src/TileDB-VCF/apis/python/.eggs/pybind11-2.6.2-py3.8.egg/pybind11/include -I/Users/nuin/Projects/VariantsDB/variants_tile/venv/lib/python3.8/site-packages/pyarrow/include -I/Users/nuin/src/TileDB-VCF/dist/include -I/Users/nuin/Projects/VariantsDB/variants_tile/venv/include -I/Users/nuin/.pyenv/versions/3.8.3/include/python3.8 -c src/tiledbvcf/binding/libtiledbvcf.cc -o build/temp.macosx-10.15-x86_64-3.8/src/tiledbvcf/binding/libtiledbvcf.o -std=c++11 -g -O2

and end with these errors

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:767:29: note: '::std::greater_equal' declared here
struct _LIBCPP_TEMPLATE_VIS greater_equal : binary_function<_Tp, _Tp, bool>
                            ^
In file included from src/tiledbvcf/binding/libtiledbvcf.cc:1:
In file included from /Users/nuin/src/TileDB-VCF/apis/python/.eggs/pybind11-2.6.2-py3.8.egg/pybind11/include/pybind11/pybind11.h:45:
In file included from /Users/nuin/src/TileDB-VCF/apis/python/.eggs/pybind11-2.6.2-py3.8.egg/pybind11/include/pybind11/attr.h:13:
In file included from /Users/nuin/src/TileDB-VCF/apis/python/.eggs/pybind11-2.6.2-py3.8.egg/pybind11/include/pybind11/cast.h:13:
In file included from /Users/nuin/src/TileDB-VCF/apis/python/.eggs/pybind11-2.6.2-py3.8.egg/pybind11/include/pybind11/pytypes.h:12:
In file included from /Users/nuin/src/TileDB-VCF/apis/python/.eggs/pybind11-2.6.2-py3.8.egg/pybind11/include/pybind11/detail/common.h:160:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/unordered_set:364:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__hash_table:18:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:325:9: error: no member named 'isless' in the global namespace
using ::isless;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:326:9: error: no member named 'islessequal' in the global namespace
using ::islessequal;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:327:9: error: no member named 'islessgreater' in the global namespace
using ::islessgreater;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:328:9: error: no member named 'isunordered' in the global namespace
using ::isunordered;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:329:9: error: no member named 'isunordered' in the global namespace
using ::isunordered;
      ~~^
197 warnings and 13 errors generated.

If it helps

❯ clang --version
Apple clang version 12.0.0 (clang-1200.0.32.29)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Any help appreciated.

store dbSNP VCF

Hello,

I am trying to store the dbSNP version 155 VCF file obtained from this link:

https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz

The problem:

I am getting address boundary errors while running:

singularity exec --no-home --bind ./data:/data tiledbvcf-cli_latest.sif tiledbvcf store --uri /data/tiledb_dbSNP_155_hg38 /data/dbSNP_155.GRCh38.names_fixed.vcf.gz

The same happens with the original GCF_000001405.39.gz; both files were indexed with bcftools using --csi.
Storing/ingesting another VCF works. Good VCF format:

##contig=<ID=GL000192.1,length=547496,assembly=GRCh37>
##Gwas2VCF_command=--data /data/example.1k.txt --json /data/params.repo_example.json --id example.1kxxx --ref /data/human_g1k_v37.fasta --dbsnp /data/dbsnp.v153.b37.vcf.gz --out /data/example.1k.out.vcf --alias /app/alias.txt; 1.3.1
##file_date=2021-06-28T11:57:57.764058
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  example.1kxxx
10      1011166 rs11253538      C       T       .       PASS    AF=0.1939       ES:SE:LP:AF:SS:ID       0.0055:0.0067:0.168578:0.1939:89888:rs11253538
10      1011347 rs11253539      G       A       .       PASS    AF=0.1939       ES:SE:LP:AF:SS:ID       0.0045:0.0067:0.105518:0.1939:89888:rs11253539
1

dbSNP bad format:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
NC_000001.11    10001   rs1570391677    T       A,C     .       .       RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
NC_000001.11    10002   rs1570391692    A       C       .       .       RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9944,0.005597
NC_000001.11    10003   rs1570391694    A       C       .       .       RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9902,0.009763

In short: is the dbSNP VCF format not yet supported?
