haojunsng / data_voyager

A containerised ELT pipeline that ingests data from the Strava API and the Open Meteo API, landing it in S3 and Postgres, with dbt for data modeling

Home Page: https://snghaojun18.atlassian.net/jira/software/projects/SNG/boards/2

License: MIT License

HCL 47.10% Makefile 1.16% Python 26.16% Dockerfile 1.25% Go 24.04% Ruby 0.30%

data_voyager

Data Architecture

[Image: data architecture diagram]

Repository Navigation

This repository contains three main parts: strava/, weather/ and iac/.

I adopted a monorepo approach only because this is exploratory/hobby work and I did not want the hassle of maintaining multiple repositories.

strava/ contains all code around the strava pipeline.

  • A batch ELT data pipeline in Python that loads into a Postgres database. Airflow orchestrates the extract, load and dbt tasks, which run on ECS.

weather/ contains all code around the weather pipeline.

  • A real-time data pipeline in Go that uses Kafka on a Kubernetes service and connects to Cassandra, with Terraform as the IaC.

iac/ contains all Terraform (chosen IaC) code.

  • All cloud resources with the exception of SSM Parameters are provisioned using Terraform.

strava/

  • extract/ contains the data extraction logic for the Strava API.
  • load/ contains the logic for loading data from the landing buckets into the database.
  • transformation/ contains the transformation logic.
  • orchestration/ contains the Airflow code and DAGs.

extract


Description
  1. Obtain the following credentials from STRAVA App Integration:

    • CLIENT_ID
    • CLIENT_SECRET
    • REFRESH_TOKEN
  2. Store credentials in AWS SSM Parameter Store.

  3. Implement logic of data extraction in main.py.

  4. Pass credentials into container through ECS Task Definition.

  5. Dockerise extract/ and push the image to ECR. CD through GHA is implemented in cd.yaml and updates the image on every merge to main.

  6. Compute is orchestrated by Airflow through the custom operator StravaToS3Operator, which inherits from EcsRunTaskOperator.
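
As a rough illustration of what step 3 could look like, the sketch below exchanges the stored refresh token for a short-lived access token. It is not the actual main.py: the function names are made up for this example, and it assumes the three credentials arrive in the container as environment variables named exactly as in step 1. The token endpoint and the refresh_token grant type are the ones Strava documents for its OAuth flow.

```python
import json
import os
import urllib.parse
import urllib.request

# Strava's documented OAuth token endpoint.
TOKEN_URL = "https://www.strava.com/oauth/token"


def build_refresh_payload(env=os.environ):
    """Collect the three credentials into a token refresh request body.

    Assumes the credentials are exposed as CLIENT_ID, CLIENT_SECRET and
    REFRESH_TOKEN; the real container may name them differently.
    """
    required = ("CLIENT_ID", "CLIENT_SECRET", "REFRESH_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing credentials: {', '.join(missing)}")
    return {
        "client_id": env["CLIENT_ID"],
        "client_secret": env["CLIENT_SECRET"],
        "refresh_token": env["REFRESH_TOKEN"],
        "grant_type": "refresh_token",
    }


def fetch_access_token(payload, url=TOKEN_URL):
    """POST the payload and return the access token (performs a network call)."""
    data = urllib.parse.urlencode(payload).encode()
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

Keeping the payload construction separate from the network call makes the credential handling easy to check locally without hitting the API.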

load


Description

Supabase is the Postgres database for this project, mostly because it recently went GA, the UI looks clean and, most importantly, I can stay within the free tier very comfortably.

Supabase

  • Data loaded from S3.

transformation


dbt

dbt is chosen to handle all data transformation work required.

dbt Project Management

The dbt projects are also managed as a monorepo because the strava dbt_project and the weather dbt_project will depend on each other. I'd prefer to keep them in one place so that dbt can capture the dependencies between them.

orchestration


Description

Airflow manages all orchestration of data extraction, loading and transformation.

StravaToS3Operator & S3ToSupabaseOperator:

  • The custom StravaToS3Operator inherits from EcsRunTaskOperator and calls the Strava API for extraction.
  • Similarly, the custom S3ToSupabaseOperator inherits from EcsRunTaskOperator and loads data from my S3 bucket into the Supabase Postgres database.
  • Lastly, DbtOperator triggers dbt tasks through ECS to execute the transformation logic.
  • The logic for all three (Strava extraction, loading to Supabase and dbt transformation) lives in extract/, load/ and transformation/ respectively.
Deployment of DAGs to Airflow:

  • Deployed to an AWS S3 bucket for the MWAA cluster through GitHub Actions (aws s3 sync)
dev/:
  • Local Airflow development environment for testing
  • Symlinked to orchestration/dags/
  • Use docker-compose up to spin up a local Airflow instance

weather/

Kafka Producer

This is implemented in Go using the Open-Meteo API.

Kafka Consumer

This is also implemented in Go. It consumes events from a specified Kafka topic, processes them, and lands the data in an S3 bucket and a Cassandra instance.


iac/

Description

Terraform is chosen to support the IaC for this entire strava pipeline project.

Resources maintained using Terraform:
  • ECS Task Definition
  • ECR
  • CloudWatch Logs
  • S3 Buckets
  • Networking
    • VPC
    • Subnets
    • Security Group
  • Identity and Access Management
    • Service User for GHA
    • ECS Task Execution Role
    • ECS Task Role
    • Respective IAM policies required around authorisation management
Resources NOT maintained using Terraform:
  • SSM Parameters

Scalr

Scalr was chosen to support remote Terraform operations. The free tier supports up to 50 Terraform operations monthly.

terraform plan runs when a PR is raised with commits that touch the declared directory, iac/.

Auto-apply is disabled, so plans have to be approved manually in the Scalr UI, which is linked from the PR comments.

Using GitHub Workflows with OIDC to Push Images to Amazon ECR

This project uses GitHub Workflows with OIDC authentication to automate pushing and updating images in ECR on AWS.

Environment Variables Management

Using direnv and .envrc

direnv, together with a .envrc file, manages environment variables in the development environment. It loads them automatically when entering the project directory.

Terraform Environment Variables

Terraform-related environment variables are prefixed with TF_VAR_. Terraform reads an environment variable named TF_VAR_<name> as the value of the input variable <name> declared in the .tf files.
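
To make the naming convention concrete, here is a small sketch in Python of how the prefix maps onto variable names. The helper terraform_env and the variable name bucket_name are hypothetical and only serve the example; in this project the variables are set through .envrc rather than from Python.

```python
import os


def terraform_env(variables, base_env=None):
    """Return a copy of the environment with each entry exposed as TF_VAR_<name>."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in variables.items():
        # Terraform maps TF_VAR_<name> onto the input variable <name>.
        env[f"TF_VAR_{name}"] = str(value)
    return env
```

The resulting mapping could then be handed to something like subprocess.run(["terraform", "plan"], env=terraform_env({"bucket_name": "demo"})), assuming a variable called bucket_name is declared in the .tf files.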

Python Environment Variables

Python code reads environment variables with os.environ.get, which returns None when a variable is unset.
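
Because os.environ.get silently returns None for a missing variable, a thin wrapper that fails loudly can save some debugging. The sketch below is one possible shape for it; require_env is a hypothetical helper, not something that exists in this repository.

```python
import os


def require_env(name, default=None):
    """Read an environment variable, raising if it is unset and no default is given."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"environment variable {name} is not set")
    return value
```

Calling require_env("CLIENT_ID") at startup would surface a missing credential immediately instead of later, inside an API call.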

Secure Credential Management

Confidential credentials, such as API keys and database passwords, are stored in the AWS Systems Manager (SSM) Parameter Store. They are retrieved at runtime and passed into the ECS containers.
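
In this project the parameters reach the containers through the ECS Task Definition, so the application code does not need to call SSM itself. For local testing, though, reading a parameter directly can be handy. The sketch below assumes a boto3 SSM client is passed in; get_secret and the parameter name in the usage note are hypothetical.

```python
def get_secret(ssm_client, name):
    """Fetch one parameter from SSM Parameter Store and return its decrypted value.

    ssm_client is expected to behave like boto3.client("ssm").
    """
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```

With boto3 installed and AWS credentials configured, this could be called as get_secret(boto3.client("ssm"), "/strava/client_secret"), where the parameter path is only a placeholder.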

Workflow Management

[Image: Jira board]
