
duckdb-pyspark-demo's Introduction

DuckDB & PySpark Demo

This repository demonstrates how to run the same PySpark pipeline, duckspark.py, on the DuckDB engine, thanks to DuckDB's compatibility with the PySpark API. It compares a standalone PySpark pipeline with the same pipeline powered by DuckDB, using an openly available dataset. The entire setup is containerized for easy deployment and quick startup.
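As a rough illustration of what "the same pipeline on two engines" can look like, the sketch below picks the import package at runtime. The package path duckdb.experimental.spark.sql is where DuckDB exposes its experimental PySpark-compatible API; the USE_DUCKDB environment variable and the helper function names are hypothetical and chosen only for this example, so they may not match what duckspark.py actually does.

```python
import importlib
import os


def spark_module_name(use_duckdb: bool) -> str:
    """Return the package that provides SparkSession for the chosen engine."""
    # DuckDB ships its PySpark-compatible API under an experimental namespace.
    return "duckdb.experimental.spark.sql" if use_duckdb else "pyspark.sql"


def use_duckdb_from_env(environ=os.environ) -> bool:
    """Read the engine choice from the environment.

    USE_DUCKDB is a hypothetical variable name used only in this sketch.
    """
    return environ.get("USE_DUCKDB", "false").strip().lower() == "true"


def load_spark_api(use_duckdb: bool):
    """Import SparkSession and the functions module from the chosen backend.

    Requires the matching package (duckdb or pyspark) to be installed.
    """
    base = spark_module_name(use_duckdb)
    session_cls = importlib.import_module(base).SparkSession
    functions = importlib.import_module(base + ".functions")
    return session_cls, functions


# Typical use inside a pipeline script (not executed here):
#   SparkSession, F = load_spark_api(use_duckdb_from_env())
#   spark = SparkSession.builder.getOrCreate()
```

Because only the import location changes, the rest of the pipeline code can stay identical, provided it sticks to the subset of the PySpark API that DuckDB currently implements.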

Read the full blog post or watch the video.

Disclaimer

โš ๏ธ Please note that this feature is experimental. For details on what's available from the PySpark API, please visit DuckDB's GitHub repository.

Getting Started

Prerequisites

Before diving into the demo, ensure you have Docker installed on your system. This demo relies on Docker containers to run the PySpark and DuckDB environments.

Download the Data

Run the following command to download the dataset. It contains about 1 GB of Hacker News data in Parquet format.

make data

Running the Demos

After setting up, you can run the demos using the following commands. Each command runs in a container and targets the same codebase, duckspark.py.

DuckDB with PySpark

To run the demo using DuckDB with PySpark, execute the following command. It builds the Docker image (if not already built) and runs the script using DuckDB.

make duckspark

result:

real    0m1.225s
user    0m1.970s
sys     0m0.160s

Standalone PySpark

make pyspark

result:

real    0m5.411s
user    0m12.700s
sys     0m1.221s

These commands will execute the respective pipelines and display the time taken for each process, allowing you to compare the performance between the pure PySpark implementation and the DuckDB version.
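To put a number on the comparison, the short sketch below parses the wall-clock ("real") line from the two timing outputs shown above and computes their ratio. The function name is hypothetical, and the figures are single runs copied from this README, so the ratio is indicative only and not a benchmark.

```python
import re


def parse_real_seconds(time_output: str) -> float:
    """Extract the 'real' (wall-clock) duration in seconds from `time` output."""
    match = re.search(r"^real\s+(\d+)m([\d.]+)s", time_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'real' line found in the timing output")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Timing outputs copied from the two runs reported in this README.
duckdb_run = """real    0m1.225s
user    0m1.970s
sys     0m0.160s"""

pyspark_run = """real    0m5.411s
user    0m12.700s
sys     0m1.221s"""

speedup = parse_real_seconds(pyspark_run) / parse_real_seconds(duckdb_run)
print(f"DuckDB run was about {speedup:.1f}x faster in wall-clock time")
```

With the figures above, the ratio comes out at roughly 4.4. Your own numbers will vary with hardware, Docker configuration, and whether the image is already built.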
