pyxn / spark-etl

A scalable, efficient, and robust ETL pipeline built with PySpark and packaged in Docker for consistent, portable execution across platforms. Designed with high-performance data processing in mind, this repository demonstrates containerized data engineering practices.

License: MIT



High-Performance ETL Solution with PySpark and Docker

This project provides a robust, easily scalable data-processing pipeline that uses PySpark for fast data handling and Docker for a flexible, consistent runtime. The pipeline reads CSV data, applies the required transformations, and writes the results back out as CSV. Because it runs in Docker, the pipeline produces the same results regardless of the host environment.
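
For illustration, here is a minimal sketch of that flow in PySpark. This is not the project's actual transform.py; the paths, column names, and transformation logic are placeholders.

    # Illustrative ETL sketch -- paths and columns are hypothetical
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("spark-etl-sketch").getOrCreate()

    # Extract: read CSV input with a header row and inferred column types
    df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

    # Transform: placeholder logic (rename a column, drop non-positive rows)
    result = (df
              .withColumnRenamed("Amount", "amount")
              .filter(F.col("amount") > 0))

    # Load: write the transformed data back out as CSV
    result.write.mode("overwrite").csv("output/result", header=True)

    spark.stop()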

Prerequisites

Before you start, make sure Docker is installed and set up properly on your system.

Project Files

  • transform.py: The ETL script; it contains all of the extract, transform, and load logic.
  • Dockerfile: Defines the container environment for the script, including Python 3.9 and Apache Spark 3.1.2.
  • dbuild.sh: A shell script that automates building and running the Docker container.
  • config.ini: Lets you customize the input/output directories and lookup-table paths.

How to Set Up the Docker Environment

⚠️ Important Note ⚠️

🔥 The dbuild.sh script stops and removes ALL active Docker containers and images on your system. 🔥 Please be careful when using this script. If you have important Docker containers running, you should modify this script to avoid stopping and removing them.

How to Build and Run the Docker Image

  1. In your terminal, go to the project directory.
  2. Make the shell script executable with this command:
    chmod +x dbuild.sh
  3. Run the dbuild.sh script to build and run the Docker image. Don't forget to replace my_image_name with the name you want for your image:
    ./dbuild.sh my_image_name

How to Configure the Pipeline

You can modify the config.ini file to set your data sources and destinations. In this file, set paths for the following (a short reading sketch follows the list):

  • DataDir: The directory with your input data.
  • OutputPathmatics and OutputVivvix: Where the pipeline should save the Pathmatics and Vivvix data.
  • Lookup table paths: Where to find the lookup tables that the pipeline uses to join data.
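
As a minimal sketch, the script might read these settings with Python's configparser. The [paths] section name and the LookupTable key below are assumptions; check config.ini for the actual layout.

    # Illustrative config loading -- section and key names marked below are assumptions
    import configparser

    config = configparser.ConfigParser()
    config.read("config.ini")

    paths = config["paths"]                        # hypothetical section name
    data_dir = paths["DataDir"]                    # input data directory
    output_pathmatics = paths["OutputPathmatics"]  # Pathmatics output path
    output_vivvix = paths["OutputVivvix"]          # Vivvix output path
    lookup_table = paths["LookupTable"]            # hypothetical lookup-table key

    print(f"Reading input from {data_dir}")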

How to Use the Pipeline

As noted under Prerequisites, Docker must be installed and configured correctly before you run the pipeline.

📢 This pipeline is designed to run on a Spark cluster. Standalone mode works for testing and development, but for production use you should deploy the pipeline on a full Spark cluster.
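
To illustrate the difference, a SparkSession can point at a local master for development or at a cluster master URL for production. The cluster URL below is a placeholder; in practice the master is usually supplied via spark-submit's --master flag rather than hard-coded.

    # Illustrative master selection -- the cluster URL is a placeholder
    from pyspark.sql import SparkSession

    # Development: run Spark locally, using all available cores
    spark = SparkSession.builder.master("local[*]").appName("etl-dev").getOrCreate()
    spark.stop()

    # Production: point at a Spark standalone cluster master (hypothetical host)
    spark = (SparkSession.builder
             .master("spark://spark-master:7077")
             .appName("etl-prod")
             .getOrCreate())
    spark.stop()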

Understanding dbuild.sh and When to Use It

The dbuild.sh script ensures that each Docker image build and run starts from scratch by stopping and removing all Docker containers and images. This clean slate helps keep builds and test runs consistent and reproducible.

However, the script is deliberately aggressive and is best used in controlled environments. If you use Docker for other projects on your local machine, or in shared or production environments, this script could delete important data or disrupt services. Understand what it does, and adjust it as needed, before you run it.

Collaborate and Contribute

Your input is important! Feel free to point out issues, suggest improvements, or provide feedback to enhance the pipeline.

