- Fact table: Japanese trade statistics from 1988 to 2019 (~100M rows)
- Dimension table: HS Code (2017)
- Dimension table: Japanese customs country code
- Apache Airflow (`docker-compose` template based on data-engineering-zoomcamp)
- Apache Spark
- Apache Hadoop
- Google Cloud Platform (Cloud Storage, Dataproc)
- prepare_fs: create the staging directory in the staging volume if it does not already exist.
- Transform & load stage 1 (runs locally)
- download_country_data: download the Japanese customs country code dataset.
- download_hs_data: download the HS Code (2017) dataset.
- process_country_data: simple transformations (set the header row, cast types). Save to the staging directory.
- process_hs_data: simple transformation (delete erroneous rows). Save to the staging directory.
- copy_spark_job: copy the Spark job responsible for running stage 2 to the staging area together with the datasets.
- upload_staging_to_gcs: upload staged files to GCS.
- Transform & load stage 2 (runs on Dataproc cluster)
- run_dataproc_spark
- Since the main dataset (Japanese trade statistics, ~110M rows) is too big to process locally, doing so on the Dataproc cluster is a better idea (a PySpark sketch of these transformations follows this list).
- Load the fact table, then:
  - Split the `ym` column into separate `year` and `month` columns.
  - Convert the `exp_imp` column from number-encoded values to text.
  - Extract `hs6_code` from the Japanese 9-digit statistics tracking ID (6-digit HS code + 3-digit domestic code). I could not find the domestic code catalogue, so I just got rid of the domestic codes.
  - Cast several other columns to their correct types.
- Ingest the fact table into Hive, partitioned by `year`.
- Ingest the dimension tables into Hive.
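
For illustration, the DAG wiring could look roughly like the sketch below. Task IDs match the names above, but the operators, commands, paths, bucket/cluster names, and region are simplified placeholder assumptions, not the actual implementation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Placeholder values, not the project's real configuration.
STAGING = "/opt/airflow/staging"
BUCKET = "your-staging-bucket"

with DAG(
    dag_id="japan_trade_stats",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Create the staging directory if it does not already exist.
    prepare_fs = BashOperator(task_id="prepare_fs", bash_command=f"mkdir -p {STAGING}")

    # Stage 1: download and lightly clean the two dimension datasets.
    download_country_data = BashOperator(
        task_id="download_country_data",
        bash_command=f"curl -sL <COUNTRY_CODE_URL> -o {STAGING}/country_raw.csv",
    )
    download_hs_data = BashOperator(
        task_id="download_hs_data",
        bash_command=f"curl -sL <HS_CODE_URL> -o {STAGING}/hs_raw.csv",
    )
    process_country_data = BashOperator(
        task_id="process_country_data",
        bash_command=f"python /opt/airflow/scripts/process_country.py {STAGING}",
    )
    process_hs_data = BashOperator(
        task_id="process_hs_data",
        bash_command=f"python /opt/airflow/scripts/process_hs.py {STAGING}",
    )

    # Copy the stage-2 Spark job next to the staged datasets, then push everything to GCS.
    copy_spark_job = BashOperator(
        task_id="copy_spark_job",
        bash_command=f"cp /opt/airflow/scripts/stage2_job.py {STAGING}/",
    )
    upload_staging_to_gcs = LocalFilesystemToGCSOperator(
        task_id="upload_staging_to_gcs",
        src=f"{STAGING}/*",
        dst="staging/",
        bucket=BUCKET,
        gcp_conn_id="gcp_connection",
    )

    # Stage 2: run the heavy transformations on the Dataproc cluster.
    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        project_id="<YOUR PROJECT ID>",
        region="asia-northeast1",
        job={
            "placement": {"cluster_name": "trade-stats-cluster"},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/staging/stage2_job.py"},
        },
        gcp_conn_id="gcp_connection",
    )

    prepare_fs >> [download_country_data, download_hs_data]
    download_country_data >> process_country_data
    download_hs_data >> process_hs_data
    [process_country_data, process_hs_data] >> copy_spark_job
    copy_spark_job >> upload_staging_to_gcs >> run_dataproc_spark
```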
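
The stage-2 transformations themselves could look roughly like this PySpark sketch. The column layout, the encoding of `exp_imp` (1 = export), and the table name are assumptions based on the description above, not the project's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, LongType

spark = SparkSession.builder.appName("stage2").enableHiveSupport().getOrCreate()

# Read the raw fact table from HDFS; the column layout here is an assumption.
raw = spark.read.csv("hdfs:///custom_1988_2020.csv", header=False).toDF(
    "ym", "exp_imp", "hs9", "customs", "country", "q1", "q2", "value"
)

fact = (
    raw
    # Split the ym column (e.g. "198801") into separate year and month columns.
    .withColumn("year", F.substring("ym", 1, 4).cast(IntegerType()))
    .withColumn("month", F.substring("ym", 5, 2).cast(IntegerType()))
    # Convert the number-encoded exp_imp column to text (assuming 1 = export, 2 = import).
    .withColumn("exp_imp", F.when(F.col("exp_imp") == "1", "export").otherwise("import"))
    # Keep only the 6-digit HS code; drop the 3-digit domestic suffix.
    .withColumn("hs6_code", F.substring("hs9", 1, 6))
    # Cast the remaining numeric columns to their proper types.
    .withColumn("value", F.col("value").cast(LongType()))
    .drop("ym", "hs9")
)

# Register the fact table in the Hive metastore, partitioned by year.
# The dimension tables are ingested the same way with spark.read.csv(...).write.saveAsTable(...).
fact.write.mode("overwrite").partitionBy("year").saveAsTable("trade_facts")
```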
Super painful!!!
- Create a service account and give it the following permissions (this is a really bad idea; I'm just too lazy to set up proper custom permissions):
- Storage Admin
- Storage Object Admin
- Dataproc Admin
- Viewer
- Create the following objects in GCP:
- A storage bucket.
- A Dataproc cluster.
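If you would rather script these objects than click through the console, a minimal sketch using the `google-cloud-storage` and `google-cloud-dataproc` client libraries might look like this (project, region, bucket, and cluster names are placeholders; the machine spec mirrors the cluster listed at the bottom of this README):

```python
from google.cloud import dataproc_v1, storage

# Placeholder values; substitute your own project, region, and names.
PROJECT_ID = "<YOUR PROJECT ID>"
REGION = "asia-northeast1"
BUCKET = "your-staging-bucket"
CLUSTER = "trade-stats-cluster"

# Staging bucket for the datasets and the Spark job.
storage.Client(project=PROJECT_ID).create_bucket(BUCKET, location=REGION)

# Small Dataproc cluster: 1 master + 3 workers, n1-standard-4 with 40 GB boot disks.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)
operation = cluster_client.create_cluster(
    request={
        "project_id": PROJECT_ID,
        "region": REGION,
        "cluster": {
            "project_id": PROJECT_ID,
            "cluster_name": CLUSTER,
            "config": {
                "master_config": {
                    "num_instances": 1,
                    "machine_type_uri": "n1-standard-4",
                    "disk_config": {"boot_disk_size_gb": 40},
                },
                "worker_config": {
                    "num_instances": 3,
                    "machine_type_uri": "n1-standard-4",
                    "disk_config": {"boot_disk_size_gb": 40},
                },
            },
        },
    }
)
operation.result()  # block until the cluster is ready
```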
- Download your service account's credential `json` file and place it at `~/.google/credentials/google_credentials.json`.
cd src
mkdir -p ./logs ./plugins
chmod -R 777 dags logs plugins
mv .env-template .env
- Open the `.env` file and fill in the necessary information about your GCP project and storage bucket.
docker compose build
docker compose up
- Go to `localhost:8080` (Airflow web UI), open Admin > Connections and add a Google Cloud connection with the following information:
  - Name: `gcp_connection`
  - Credential file: `/.google/credentials/google_credentials.json`
  - Project ID: `<YOUR PROJECT ID>`
- Manually load the file `custom_1988_2020.csv` (from the first dataset listed above) into the root directory of your HDFS. The file is very heavy, so downloading it directly to the cluster seems like a better idea:
wget <URL> -O custom_1988_2020.csv.zip
unzip custom_1988_2020.csv.zip
hdfs dfs -put custom_1988_2020.csv /
- Run the DAG from the local Airflow dashboard.
- Done. All three datasets should be loaded into Apache Hive running on Dataproc.
The data has now been transformed and stored in Apache Hive for further analytics. The steps below are expected to be done manually.
- Some analytics with PySpark
- A simple MapReduce job
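
As an example of the kind of query this enables, here is a rough PySpark sketch; the table and column names (`trade_facts`, `country_codes`, `country_name`, `value`) are assumptions based on the schema described above, not the project's actual analytics code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trade-analytics").enableHiveSupport().getOrCreate()

# Assumed table names from the ingestion step described above.
trade = spark.table("trade_facts")
countries = spark.table("country_codes")

# Top 10 export partners by total trade value in 2019.
(
    trade.filter((F.col("year") == 2019) & (F.col("exp_imp") == "export"))
    .join(countries, on="country", how="left")
    .groupBy("country_name")
    .agg(F.sum("value").alias("total_value"))
    .orderBy(F.desc("total_value"))
    .show(10)
)
```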
- Study GCP permission management to create a service account with only the permissions it actually needs.
- Get rid of Linux permission "hacks" like `chmod 777`.
- My Dataproc cluster:
- 1 master & 3 workers:
n1-standard-4
, 40 GB disk.
- 1 master & 3 workers: