A demo of how to use K8s CronJobs to automate DBT pipelines.
You'll find here all you need to run a simple DBT project on Kubernetes with BigQuery as a SQL backend.
We use several tools to run this project.
❗ Mandatory tools
- `python` and `poetry` to generate and upload some fake data, and to run DBT locally if you want
- `gcloud` CLI to interact with GCP. Installation guide here
- `kubectl` to interact with your Kubernetes cluster. Installation with `gcloud` here
- `helm` to install and manage Kubernetes resources. Installation guide here
- `terraform` to deploy and manage the infrastructure. Installation guide here
- `docker` to build and push the DBT image to GCP Artifact Registry. Installation guide here
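To confirm everything is installed, a quick sanity check (plain shell, nothing project-specific):

```bash
python --version && poetry --version
gcloud version
kubectl version --client
helm version
terraform version
docker --version
```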
ℹ️ Optional tools
- `k9s`, a TUI to visualize, explore and interact with the cluster
- `kubens` to set a default namespace when running `kubectl` commands
- `kubectx` to handle multiple contexts with `kubectl`
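For example (a sketch; the context and namespace names depend on your setup):

```bash
# List available contexts, then switch to your GKE cluster's context
kubectx
kubectx <your-gke-context>
# Set a default namespace so you can drop -n from kubectl commands
kubens <your-namespace>
```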
The very first step is to set up a new GCP project. Once this is done, you can start deploying your infrastructure.
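As a sketch, project creation with `gcloud` could look like this (the project ID is a placeholder, and you'll still need to link a billing account):

```bash
gcloud auth login
gcloud projects create my-kube-bq-jobs   # hypothetical project ID
gcloud config set project my-kube-bq-jobs
# Terraform and DBT use application-default credentials locally
gcloud auth application-default login
```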
We use Terraform to deploy and manage the infrastructure: everything can be done without clicking through the GCP Console.
Follow the README in the `terraform` folder.
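The workflow is the standard Terraform one; from inside that folder, something like:

```bash
cd terraform
terraform init    # download providers, set up the state backend
terraform plan    # review the resources that will be created
terraform apply   # create the GCP resources
```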
There are 2 scripts in the `data` folder that generate some fake data and upload it to GCS. You can have a look at the data they generate by inspecting the 2 `.ndjson` files.
Make sure you have installed the project with `poetry install` and just run:

```bash
poetry run python data/data_faker.py
poetry run python data/storage_load.py
```
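Once the scripts have run, you can peek at the generated files, e.g.:

```bash
head -n 3 data/*.ndjson
```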
You can run DBT locally if you want to check that DBT can correctly run your SQL queries on BigQuery.
You'll need to define your `profiles.yml` so that DBT can connect to and query BigQuery. The default profile name is `kube_bq_jobs`. You can find an example in the ConfigMap here.
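With the profile in place, a quick local check could look like this (assuming DBT is installed as a project dependency, so it is available through `poetry run`):

```bash
# Validate the connection defined in profiles.yml
poetry run dbt debug
# Run the models against BigQuery
poetry run dbt run
```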
Follow the README in the `kubernetes/helm` folder.
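As an illustration of what the deployment boils down to (the release and namespace names here are hypothetical, and `.` assumes the folder is the chart root; the folder's README has the real values):

```bash
cd kubernetes/helm
helm install dbt-cronjob . --namespace dbt --create-namespace
# Check that the CronJob exists, and trigger a one-off run to test it
kubectl get cronjobs --namespace dbt
kubectl create job --from=cronjob/dbt-cronjob dbt-manual-run --namespace dbt
```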