

duckberg

duckDB vs Spark SQL on Apache Iceberg for 1 billion rows - NYC yellow taxi dataset.

Introduction

This project compares the performance of duckDB and Spark SQL on Apache Iceberg using the NYC yellow taxi dataset. The dataset is available in Parquet format from the NYC Taxi & Limousine Commission website and contains information about taxi trips in New York City, including pick-up and drop-off times, locations, trip distances, and fares.

Dataset

The dataset contains the following columns:

root
|-- VendorID: long (nullable = true)
|-- tpep_pickup_datetime: timestamp (nullable = true)
|-- tpep_dropoff_datetime: timestamp (nullable = true)
|-- trip_distance: double (nullable = true)
|-- store_and_fwd_flag: string (nullable = true)
|-- PULocationID: long (nullable = true)
|-- DOLocationID: long (nullable = true)
|-- payment_type: long (nullable = true)
|-- fare_amount: double (nullable = true)
|-- extra: double (nullable = true)
|-- mta_tax: double (nullable = true)
|-- tip_amount: double (nullable = true)
|-- tolls_amount: double (nullable = true)
|-- total_amount: double (nullable = true)
|-- airport_fee: integer (nullable = true)
|-- passenger_count: long (nullable = true)
|-- congestion_surcharge: double (nullable = true)
|-- improvement_surcharge: double (nullable = true)
|-- RatecodeID: long (nullable = true)
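The listing above is Spark's printSchema output; with duckDB the same schema can be inspected straight from the Parquet files. A sketch (the file path here is illustrative, not the actual layout used in the benchmark):

```sql
-- Print column names and types without loading the data (path is illustrative)
DESCRIBE SELECT * FROM read_parquet('data/yellow_tripdata_*.parquet');
```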

Dataset size on disk

The dataset is in Parquet format and is approximately 16 GB on disk for the years 2014-2023 (inclusive), containing approximately 1 billion rows. One file was corrupt and was removed from the dataset, reducing the row count for the benchmark to 0.88 billion.
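As a back-of-the-envelope check on these figures (a sketch: the 16 GB and 0.88 billion numbers come from the text above; the ~2x compression ratio is the estimate used in the discussion further down):

```python
# Rough sizing check for the benchmark dataset (figures from the text above)
dataset_bytes = 16 * 1024**3   # ~16 GB of compressed Parquet on disk
rows = 0.88e9                  # ~0.88 billion rows after dropping the corrupt file

bytes_per_row = dataset_bytes / rows
print(f"~{bytes_per_row:.0f} compressed bytes per row")  # ~20

# Assuming roughly 2x Parquet compression, as estimated later in the README
uncompressed_bytes = 2 * dataset_bytes
print(f"~{uncompressed_bytes / 1024**3:.0f} GB uncompressed")  # ~32
```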

Setup

I used the following setup for the benchmark:

  • 10 years of data was pulled, for the years 2014-2023 (inclusive), from the NYC Taxi & Limousine Commission website.
  • Data was stored on the local file system in Parquet format.
  • 1 MacBook Air (M1, 2020): 16 GB RAM, 512 GB flash storage, macOS Monterey, 8 cores (4 performance and 4 efficiency).
  • duckDB was run in Python 3.10.0 and read the Parquet files from the local file system.
  • Spark was run in Java in local mode, with the Iceberg warehouse catalog pointing to the local file system.
    • Data was loaded (~301 sec) into Iceberg warehouse tables before running the benchmark; loading time was excluded from the benchmark.
  • 9 GB of memory was allocated to both duckDB and Spark.
  • All cores were used for both duckDB and Spark.
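To mirror the resource caps above on the duckDB side, the limits can be set explicitly; a sketch using duckDB's standard settings, with the values taken from the setup above:

```sql
-- Pin duckDB to the same resources given to Spark (9 GB, all 8 cores)
SET memory_limit = '9GB';
SET threads = 8;
```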

Build jar

# needs Java 17
mvn clean package

Benchmark Queries

There were a total of 16 queries. Some were taken from existing benchmarks and the rest were written by me. You can check the queries in the queries.py file.
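For flavor, here is a sketch of the kind of aggregation the benchmark runs; this is an illustrative query (and an assumed table name), not one of the 16 in queries.py:

```sql
-- Illustrative: trip counts, average fare and tip per payment type
SELECT payment_type,
       COUNT(*)         AS trips,
       AVG(fare_amount) AS avg_fare,
       AVG(tip_amount)  AS avg_tip
FROM trips
GROUP BY payment_type
ORDER BY trips DESC;
```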

Results

Not surprisingly, duckDB outperformed Spark SQL on Iceberg on all queries.

Note:

  • Data load time and program startup time were not considered.
    • duckDB's load time was 0 sec, as it read the Parquet files directly.
    • Apache Iceberg data loading was a one-time job and took ~300 sec.
  • Only query execution times were considered.

If startup times were considered, duckDB would outperform Spark SQL on Iceberg by even larger margins.

(Chart: query times, duckDB vs Spark SQL on Iceberg)

The speedup of duckDB over Spark SQL on Iceberg was:

  • at a minimum ~200x
  • at a maximum ~5000x
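The gap can be expressed either as a multiplier or as a percent difference; the two are related by a simple conversion (a sketch, where the speedup is Spark's query time divided by duckDB's):

```python
def speedup_to_pct_difference(speedup: float) -> float:
    """A query that runs `speedup` times faster differs by (speedup - 1) * 100 percent."""
    return (speedup - 1) * 100.0

print(speedup_to_pct_difference(200))   # 19900.0 -> a ~200x speedup is a ~19,900% difference
print(speedup_to_pct_difference(5000))  # 499900.0
```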

(Chart: % difference)

Including duckDB on Iceberg (duckberg) in the comparison

I also ran the same queries on duckDB with Iceberg. Query 15 was omitted: it had to be rewritten to avoid a duckdb-iceberg bug, and the rewritten query was not performant.
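For reference, duckDB reads Iceberg tables through its iceberg extension; a minimal sketch (the table path here is illustrative):

```sql
-- Read an Iceberg table from duckDB (table path is illustrative)
INSTALL iceberg;
LOAD iceberg;
SELECT COUNT(*) FROM iceberg_scan('warehouse/db/taxi_trips');
```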

(Chart: duckberg query times)

Thoughts for Technical Debate

  1. Why did all the data reside on local disk and not on S3/Azure Blob Storage?
  2. It's unfair that Spark SQL was run in local mode rather than cluster mode, i.e. on more than one node.

My Side

The compressed Parquet dataset for 1 billion rows was 16 GB. Any node on EC2/Azure/GCP can handle 16 GB of disk. Downloading the data to disk (say, via aws s3 cp) and then reading from the local file system is faster than reading from S3/blob storage, so this pattern is feasible. Feel free to read from S3/Azure Blob Storage instead.

The longest-running query was query 16. Peak memory usage for Spark was about ~8.5 GB, well within limits (I could have set Xmx=10g for safety). When the job completes on one node, why run a cluster? You might think: 'So why use Spark for a small 16 GB dataset? Doesn't that invalidate the benchmark?'. Spark in local mode is a well-known processing pattern: people move to clusters when limits are breached, with no code changes. And the dataset is only small on disk, i.e. as compressed Parquet; uncompressed it would be ~32 GB. You might still think: 'Still small! Not a case for Spark?'. Either way (whether you go to a cluster or treat it as a small dataset), duckDB edges out Spark on Iceberg on this 1 billion row dataset :) Perhaps Polars is a better competitor to duckDB for this benchmark? PS: with Spark in local mode, shuffling is eliminated.

(Chart: memory profile of query 16)

So why would I go for duckDB?

Cost is the hidden, overlooked dimension of data engineering. Cost-based engineering is the new OIL.

Like me? Need a consultation?

Happy to help you with your data engineering needs. Reach out to me on LinkedIn or Twitter.

PS: I am the Chief Inspiration Officer at Bytespire.io.

Buy me a coffee?
