DuckDB Delta Extension

This is the experimental DuckDB extension for Delta Lake. It is built on the (also experimental) Delta Kernel. The extension currently offers read support for Delta tables, both local and remote.

Supported platforms

The supported platforms are:

  • linux_amd64, linux_amd64_gcc4, and linux_arm64
  • osx_amd64 and osx_arm64
  • windows_amd64

Support for the other DuckDB platforms is a work in progress.

How to use

Note

This extension requires DuckDB v0.10.3 or higher.

This extension is distributed as a binary extension. To use it, simply use one of its functions from DuckDB and the extension will be autoloaded:

FROM delta_scan('s3://some/delta/table');

To scan a local table, use the full path prefixed with file://:

FROM delta_scan('file:///some/path/on/local/machine');
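If autoloading is disabled in your DuckDB configuration, the extension can probably be installed and loaded explicitly before calling delta_scan. The following is a minimal sketch, assuming the extension is published under the name `delta` and that the placeholder path points at a real Delta table directory:

```sql
-- Explicit install and load; only needed when autoloading is turned off.
-- Assumes the extension is registered under the name "delta".
INSTALL delta;
LOAD delta;

-- Placeholder path from the example above; replace with a real Delta table.
SELECT count(*) AS row_count
FROM delta_scan('file:///some/path/on/local/machine');
```

Because delta_scan is a table function, it can be used anywhere a table reference is accepted, for example in joins or in CREATE TABLE ... AS statements.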

Cloud Storage authentication

DuckDB Secrets can be used for cloud storage authentication.

S3 Example

CREATE SECRET (
  TYPE S3,
  PROVIDER CREDENTIAL_CHAIN
);
FROM delta_scan('s3://some/delta/table/with/auth');

Azure Example

CREATE SECRET (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'cli',
    ACCOUNT_NAME 'mystorageaccount'
);
FROM delta_scan('abfss://some/delta/table/with/auth');
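When a credential chain is not available, a secret with explicit credentials may work as well. This is a sketch rather than a confirmed configuration for this extension: the secret name, key values, and region below are hypothetical placeholders, and it assumes the extension picks up key-based S3 secrets the same way DuckDB's regular S3 reads do.

```sql
-- Hypothetical named secret with explicit credentials (placeholder values).
CREATE SECRET my_delta_s3 (
    TYPE S3,
    KEY_ID 'my_key_id',        -- placeholder access key ID
    SECRET 'my_secret_key',    -- placeholder secret access key
    REGION 'us-east-1'         -- placeholder region
);

-- Same placeholder table path as in the S3 example above.
FROM delta_scan('s3://some/delta/table/with/auth');
```

Avoid committing real credentials to scripts; the credential chain provider shown above is usually the safer choice.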

Features

Although the extension is still experimental, it already supports many scanning features and optimizations because it reuses most of DuckDB's regular Parquet scanning logic:

  • multithreaded scans and Parquet metadata reading
  • data skipping/filter pushdown
    • skipping row groups within a file (based on Parquet metadata)
    • skipping complete files (based on Delta partition info)
  • projection pushdown
  • scanning tables with deletion vectors
  • all primitive types
  • structs
  • Cloud storage (AWS, Azure, GCP) support with secrets
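To illustrate how projection and filter pushdown come into play, here is a hedged sketch of a query that reads only two columns and filters on a partition column. The column names `id`, `value`, and `part` are hypothetical and not taken from any real table; the exact plan output depends on the DuckDB version.

```sql
-- Only the referenced columns should be read (projection pushdown), and the
-- filter on the hypothetical partition column "part" gives the scan a chance
-- to skip whole files or row groups.
SELECT id, value
FROM delta_scan('s3://some/delta/table')
WHERE part = 'a';

-- Inspect the query plan to check which filters and projections
-- were pushed into the scan.
EXPLAIN
SELECT id, value
FROM delta_scan('s3://some/delta/table')
WHERE part = 'a';
```

Filtering on partition columns is generally the cheapest way to reduce the data read, since it lets complete files be skipped before any Parquet metadata is opened.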

More features coming soon!

Building

See the Extension Template for generic build instructions.

Running tests

There are various tests available for the delta extension:

  1. Delta Acceptance Test (DAT) based tests in /test/sql/dat
  2. delta-kernel-rs based tests in /test/sql/delta_kernel_rs
  3. Generated data based tests in /test/sql/generated (generated using delta-rs, PySpark, and DuckDB)

To run the first two sets of tests in debug mode:

make test_debug

or, in release mode:

make test

To also run the tests on generated data:

make generate-data
GENERATED_DATA_AVAILABLE=1 make test

Contributors

samansmink, nfoerster2, nicklan, stephaniewang526, szarnyasg
