- Fact table: Japanese trade statistics from 1988 to 2019 (~100M rows)
- Dimension table: HS Code (2017)
- Dimension table: Japanese customs country code
- Apache Airflow (`docker-compose` template based on data-engineering-zoomcamp)
- Apache Spark
- Apache Hadoop
- Google Cloud Platform (Cloud Storage, Dataproc)
- prepare_fs: create the staging directory in the staging volume if it does not already exist.
- Transform & load stage 1 (runs locally)
- download_country_data: download the Japanese customs country code dataset.
- download_hs_data: download the HS Code (2017) dataset.
- process_country_data: simple transformations (set the header row, cast types). Save to the staging directory.
- process_hs_data: simple transformation (delete erroneous rows). Save to the staging directory.
- copy_spark_job: copy the Spark job responsible for running stage 2 to the staging area together with the datasets.
- upload_staging_to_gcs: upload staged files to GCS.
- Transform & load stage 2 (runs on Dataproc cluster)
- run_dataproc_spark
- Since the main dataset (Japanese trade statistics, ~110M rows) is too big to process locally, doing so on the Dataproc cluster is a better idea (a PySpark sketch of these transformations follows this list).
- Load the fact table, then:
  - Split the `ym` column into separate `year` and `month` columns.
  - Convert the `exp_imp` column from number-encoded values to text.
  - Extract `hs6_code` from the Japanese 9-digit statistics tracking ID (6-digit HS code + 3-digit domestic code). I could not find the domestic code catalogue, so I just got rid of the domestic codes.
  - Cast several other columns to their correct types.
- Ingest the fact table into Hive, partitioned by `year`.
- Ingest the dimension tables into Hive.
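
For illustration, the DAG wiring could look roughly like the sketch below. Task IDs match the names above, but the operators, commands, paths, bucket/cluster names, and region are simplified placeholder assumptions, not the actual implementation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Placeholder values, not the project's real configuration.
STAGING = "/opt/airflow/staging"
BUCKET = "your-staging-bucket"

with DAG(
    dag_id="japan_trade_stats",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Create the staging directory if it does not already exist.
    prepare_fs = BashOperator(task_id="prepare_fs", bash_command=f"mkdir -p {STAGING}")

    # Stage 1: download and lightly clean the two dimension datasets.
    download_country_data = BashOperator(
        task_id="download_country_data",
        bash_command=f"curl -sL <COUNTRY_CODE_URL> -o {STAGING}/country_raw.csv",
    )
    download_hs_data = BashOperator(
        task_id="download_hs_data",
        bash_command=f"curl -sL <HS_CODE_URL> -o {STAGING}/hs_raw.csv",
    )
    process_country_data = BashOperator(
        task_id="process_country_data",
        bash_command=f"python /opt/airflow/scripts/process_country.py {STAGING}",
    )
    process_hs_data = BashOperator(
        task_id="process_hs_data",
        bash_command=f"python /opt/airflow/scripts/process_hs.py {STAGING}",
    )

    # Copy the stage-2 Spark job next to the staged datasets, then push everything to GCS.
    copy_spark_job = BashOperator(
        task_id="copy_spark_job",
        bash_command=f"cp /opt/airflow/scripts/stage2_job.py {STAGING}/",
    )
    upload_staging_to_gcs = LocalFilesystemToGCSOperator(
        task_id="upload_staging_to_gcs",
        src=f"{STAGING}/*",
        dst="staging/",
        bucket=BUCKET,
        gcp_conn_id="gcp_connection",
    )

    # Stage 2: run the heavy transformations on the Dataproc cluster.
    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        project_id="<YOUR PROJECT ID>",
        region="asia-northeast1",
        job={
            "placement": {"cluster_name": "trade-stats-cluster"},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/staging/stage2_job.py"},
        },
        gcp_conn_id="gcp_connection",
    )

    prepare_fs >> [download_country_data, download_hs_data]
    download_country_data >> process_country_data
    download_hs_data >> process_hs_data
    [process_country_data, process_hs_data] >> copy_spark_job
    copy_spark_job >> upload_staging_to_gcs >> run_dataproc_spark
```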
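
The stage-2 transformations themselves could look roughly like this PySpark sketch. The column layout, the encoding of `exp_imp` (1 = export), and the table name are assumptions based on the description above, not the project's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, LongType

spark = SparkSession.builder.appName("stage2").enableHiveSupport().getOrCreate()

# Read the raw fact table from HDFS; the column layout here is an assumption.
raw = spark.read.csv("hdfs:///custom_1988_2020.csv", header=False).toDF(
    "ym", "exp_imp", "hs9", "customs", "country", "q1", "q2", "value"
)

fact = (
    raw
    # Split the ym column (e.g. "198801") into separate year and month columns.
    .withColumn("year", F.substring("ym", 1, 4).cast(IntegerType()))
    .withColumn("month", F.substring("ym", 5, 2).cast(IntegerType()))
    # Convert the number-encoded exp_imp column to text (assuming 1 = export, 2 = import).
    .withColumn("exp_imp", F.when(F.col("exp_imp") == "1", "export").otherwise("import"))
    # Keep only the 6-digit HS code; drop the 3-digit domestic suffix.
    .withColumn("hs6_code", F.substring("hs9", 1, 6))
    # Cast the remaining numeric columns to their proper types.
    .withColumn("value", F.col("value").cast(LongType()))
    .drop("ym", "hs9")
)

# Register the fact table in the Hive metastore, partitioned by year.
# The dimension tables are ingested the same way with spark.read.csv(...).write.saveAsTable(...).
fact.write.mode("overwrite").partitionBy("year").saveAsTable("trade_facts")
```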
Super painful!!!
- Create a service account and give it the following permissions (this is a really bad idea; I'm just too lazy to set up proper custom permissions):
- Storage Admin
- Storage Object Admin
- Dataproc Admin
- Viewer
- Create the following objects in GCP:
- A storage bucket.
- A Dataproc cluster.
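If you would rather script these objects than click through the console, a minimal sketch using the `google-cloud-storage` and `google-cloud-dataproc` client libraries might look like this (project, region, bucket, and cluster names are placeholders; the machine spec mirrors the cluster listed at the bottom of this README):

```python
from google.cloud import dataproc_v1, storage

# Placeholder values; substitute your own project, region, and names.
PROJECT_ID = "<YOUR PROJECT ID>"
REGION = "asia-northeast1"
BUCKET = "your-staging-bucket"
CLUSTER = "trade-stats-cluster"

# Staging bucket for the datasets and the Spark job.
storage.Client(project=PROJECT_ID).create_bucket(BUCKET, location=REGION)

# Small Dataproc cluster: 1 master + 3 workers, n1-standard-4 with 40 GB boot disks.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)
operation = cluster_client.create_cluster(
    request={
        "project_id": PROJECT_ID,
        "region": REGION,
        "cluster": {
            "project_id": PROJECT_ID,
            "cluster_name": CLUSTER,
            "config": {
                "master_config": {
                    "num_instances": 1,
                    "machine_type_uri": "n1-standard-4",
                    "disk_config": {"boot_disk_size_gb": 40},
                },
                "worker_config": {
                    "num_instances": 3,
                    "machine_type_uri": "n1-standard-4",
                    "disk_config": {"boot_disk_size_gb": 40},
                },
            },
        },
    }
)
operation.result()  # block until the cluster is ready
```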
- Download your service account's credential `json` file and place it at `~/.google/credentials/google_credentials.json`.
cd src
mkdir -p ./logs ./plugins
chmod -R 777 dags logs plugins
mv .env-template .env
- Open the `.env` file and fill in the necessary information about your GCP project and storage bucket.
docker compose build
docker compose up
- Go to `localhost:8080` (Airflow web UI), open Admin > Connections and add a Google Cloud connection with the following information:
  - Name: `gcp_connection`
  - Credential file: `/.google/credentials/google_credentials.json`
  - Project ID: `<YOUR PROJECT ID>`
- Manually load the file `custom_1988_2020.csv` (from the first dataset listed above) into the root directory of your HDFS. The file is very heavy, so downloading it directly to the cluster seems like a better idea:
wget <URL> -O custom_1988_2020.csv.zip
unzip custom_1988_2020.csv.zip
hdfs dfs -put custom_1988_2020.csv /
- Run the DAG from the local Airflow dashboard.
- Done. All three datasets should be loaded into Apache Hive running on Dataproc.
The data has now been transformed and stored in Apache Hive for further analytics. The steps below are expected to be done manually.
- Some analytics with PySpark
- A simple MapReduce job
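
As an example of the kind of query this enables, here is a rough PySpark sketch; the table and column names (`trade_facts`, `country_codes`, `country_name`, `value`) are assumptions based on the schema described above, not the project's actual analytics code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trade-analytics").enableHiveSupport().getOrCreate()

# Assumed table names from the ingestion step described above.
trade = spark.table("trade_facts")
countries = spark.table("country_codes")

# Top 10 export partners by total trade value in 2019.
(
    trade.filter((F.col("year") == 2019) & (F.col("exp_imp") == "export"))
    .join(countries, on="country", how="left")
    .groupBy("country_name")
    .agg(F.sum("value").alias("total_value"))
    .orderBy(F.desc("total_value"))
    .show(10)
)
```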
- Study GCP permission management to create a service account with only the permissions it actually needs.
- Get rid of Linux permission "hacks" like `chmod 777`.
- My Dataproc cluster:
- 1 master & 3 workers:
n1-standard-4
, 40 GB disk.
- 1 master & 3 workers: