Comments (2)

aaronwolen commented on June 9, 2024

Hi @darked89, thanks for the question.

TileDB supports fully parallel writes, so you can easily split up the ingestion across as many nodes as you like. Just launch each job with its own disjoint subset of VCF files and they’ll all be stored in the same TileDB-VCF dataset, no temporary databases or merging operations needed 🙂.
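
As a rough sketch of what "its own disjoint subset" could look like on a scheduler-driven cluster: each job derives its share of the file list from its own index, so no two jobs ever receive the same VCF. The `subset_for_job` helper, the `JOB_INDEX` environment variable, and the file names are all hypothetical; most schedulers expose some per-task index that could play this role.

```python
import os

def subset_for_job(all_uris, job_index, n_jobs):
    """Return the URIs owned by one job; subsets of different jobs never overlap."""
    if not 0 <= job_index < n_jobs:
        raise ValueError("job_index must be in [0, n_jobs)")
    # Sort first so every job sees the same ordering, then take every n-th item.
    return sorted(all_uris)[job_index::n_jobs]

# Hypothetical inputs, for illustration only
all_uris = [f"s3://my-bucket/sample-{i:03d}.vcf.gz" for i in range(10)]
n_jobs = 3

# JOB_INDEX is an assumed variable name; substitute your scheduler's task index.
job_index = int(os.environ.get("JOB_INDEX", "0")) % n_jobs
my_uris = subset_for_job(all_uris, job_index, n_jobs)
print(f"job {job_index} would ingest {len(my_uris)} files")
```

Each job would then pass only its own `my_uris` to the ingestion call against the shared dataset URI.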

I don't have a cluster handy but I can show how we typically run parallel ingestions using TileDB Cloud's serverless compute. You didn't mention which API you're using so I'm defaulting to Python.

First, create a new array on S3:

import tiledbvcf
from tiledb.cloud.compute import Delayed

array_uri = "s3://my-array"

ds = tiledbvcf.Dataset(array_uri, stats = True, verbose = True, mode = "w")
ds.create_dataset(extra_attrs = ["fmt_GT"])

then create a UDF to handle the ingestion for each batch of samples.

def ingest_vcf_files(array_uri, vcf_uris, cfg):
    print(f"Ingesting {len(vcf_uris)} starting with {vcf_uris[0]}")
    cfg = tiledbvcf.ReadConfig(tiledb_config = cfg)
    ds = tiledbvcf.Dataset(array_uri, mode = "w", verbose = True, cfg = cfg)
    ds.ingest_samples(sample_uris = vcf_uris, threads = 2, memory_budget_mb = 512)
    return vcf_uris

def return_samples(file_list):
    # Flatten the per-batch lists of URIs into a single list
    out = []
    for batch in file_list:
        out.extend(batch)
    return out

Then, assuming vcf_uris is a list of 100 VCF files, we'll split them into 5 batches of 20

n = 20
batched_vcfs = [vcf_uris[i * n:(i + 1) * n] for i in range((len(vcf_uris) + n - 1) // n )]

and create an instance of our delayed UDF for each batch.

delayed_writes = [Delayed(ingest_vcf_files)(array_uri, b, cfg) for b in batched_vcfs]
ingested_samples = Delayed(return_samples, name = "Combine")(delayed_writes)
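
The batching and combining steps above can be sanity-checked locally without TileDB at all. The following is a minimal sketch, assuming made-up file names and a hypothetical `batch` helper; it only confirms that the batches are disjoint and that flattening them returns every input exactly once.

```python
from itertools import chain

def batch(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical URIs, for illustration only
vcf_uris = [f"s3://my-bucket/sample-{i:03d}.vcf.gz" for i in range(100)]
batched_vcfs = batch(vcf_uris, 20)

# Flattening the batches should give back the original list, in order
flattened = list(chain.from_iterable(batched_vcfs))

print(len(batched_vcfs), "batches of sizes", [len(b) for b in batched_vcfs])
print("all files covered once:", flattened == vcf_uris)
```

A final batch shorter than the rest is expected whenever the number of files is not a multiple of the batch size.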

Hope that helps. Let me know if you have any follow-up questions!

from tiledb-vcf.

darked89 commented on June 9, 2024

Dear Aaron,

That really made my day ;).
I will have to test it out using our environment but I assume it will work as described.
FYI: I have tested tiledbvcf-cli using Singularity: works out of the box.
