Comments (2)
Hi @darked89, thanks for the question.
TileDB supports fully parallel writes, so you can easily split up the ingestion across as many nodes as you like. Just launch each job with its own disjoint subset of VCF files and they'll all be stored in the same TileDB-VCF dataset, with no temporary databases or merging operations needed.
I don't have a cluster handy but I can show how we typically run parallel ingestions using TileDB Cloud's serverless compute. You didn't mention which API you're using so I'm defaulting to Python.
First, create a new array on S3:
import tiledbvcf
from tiledb.cloud.compute import Delayed
array_uri = "s3://my-array"
ds = tiledbvcf.Dataset(array_uri, stats = True, verbose = True, mode = "w")
ds.create_dataset(extra_attrs = ["fmt_GT"])
Then create a UDF (user-defined function) to handle the ingestion for each batch of samples:
def ingest_vcf_files(array_uri, vcf_uris, cfg):
    print(f"Ingesting {len(vcf_uris)} starting with {vcf_uris[0]}")
    cfg = tiledbvcf.ReadConfig(tiledb_config = cfg)
    ds = tiledbvcf.Dataset(array_uri, mode = "w", verbose = True, cfg = cfg)
    ds.ingest_samples(sample_uris = vcf_uris, threads = 2, memory_budget_mb = 512)
    return vcf_uris

def return_samples(file_list):
    # Flatten the per-batch lists into a single list of ingested URIs
    out = []
    for i in file_list:
        out.extend(i)
    return out
Then, assuming vcf_uris holds a list of 100 VCF file URIs, we'll split them into 5 batches of 20:
n = 20
batched_vcfs = [vcf_uris[i * n:(i + 1) * n] for i in range((len(vcf_uris) + n - 1) // n )]
Finally, create an instance of our delayed UDF for each batch and a last node that combines the results:
delayed_writes = [Delayed(ingest_vcf_files)(array_uri, b, cfg) for b in batched_vcfs]
ingested_samples = Delayed(return_samples, name = "Combine")(delayed_writes)
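For anyone who wants to check the batching and fan-out logic before touching a real dataset, here is a minimal local sketch of the same pattern. It is not TileDB Cloud code: `ingest_batch_stub` is a hypothetical stand-in for `ingest_vcf_files` that performs no I/O, the `s3://my-bucket/...` URIs are placeholders, and a thread pool takes the place of the `Delayed` task graph.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def make_batches(uris, batch_size):
    """Split a list into consecutive, non-overlapping batches."""
    return [uris[i:i + batch_size] for i in range(0, len(uris), batch_size)]


def ingest_batch_stub(batch):
    """Hypothetical stand-in for ingest_vcf_files: no I/O, returns its input."""
    return batch


def run_batches(uris, batch_size, max_workers=5):
    """Run one task per batch, then flatten the per-batch results."""
    batches = make_batches(uris, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input batches
        results = list(pool.map(ingest_batch_stub, batches))
    return batches, list(chain.from_iterable(results))


# Placeholder URIs, not real files.
vcf_uris = [f"s3://my-bucket/sample{i}.vcf.gz" for i in range(100)]
batches, ingested = run_batches(vcf_uris, batch_size=20)
print(len(batches), len(ingested))  # 5 100
```

Because the batches are disjoint slices of the input list, every file ends up in exactly one task, which is the property the parallel ingestion above relies on.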
Hope that helps. Let me know if you have any follow-up questions!
from tiledb-vcf.
Dear Aaron,
That really made my day ;).
I will have to test it out using our environment but I assume it will work as described.
FYI: I have tested tiledbvcf-cli using Singularity: works out of the box.