Airflow chart

Helm chart for deploying Apache Airflow on kubernetes.

Read more about Kubernetes Executor and Operator here.

Guidance

To help you avoid performance and other issues, this chart takes some operating experience into account.

Chain of configuration

Any Airflow setting can be set by the scheme AIRFLOW__{SECTION}__{KEY} in the config section of your values.yaml. A rollout is performed when running helm upgrade after config changes, thanks to a checksum annotation on the pods.

The chart might seem to have secrets and configmaps that are not used, and worker pods might seem to be missing mounts. The configuration design of Airflow is not straightforward: the webserver and scheduler get their config overridden via a ConfigMap, while a Secret populates the configuration file that in turn sets environment variables for the worker (task) pods. All of this is configurable from the values.yaml file in one single place.

The chart automatically sets the following variables:

AIRFLOW__CORE__SQL_ALCHEMY_CONN,
AIRFLOW__KUBERNETES__AIRFLOW_CONFIGMAP,
AIRFLOW__KUBERNETES__NAMESPACE

and:

AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME

if RBAC is enabled.
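
For example, a config block in values.yaml might look like this (a sketch; the two settings shown are ordinary Airflow options chosen for illustration, not something the chart requires):

config:
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"
  AIRFLOW__WEBSERVER__BASE_URL: http://airflow.192.168.99.100.xip.io

The variables listed above are set by the chart itself, so they do not need to be repeated here.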

Database backend

If you want your own DB backend for Airflow, just disable postgresql in the values file and set the sqlAlchemyConn value:

sqlAlchemyConn: somespec+other://username:password@db-hostname:5432/schema
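
Put together, the relevant part of a values file for an external PostgreSQL database could look like this (a sketch; the hostname, credentials and schema are placeholders):

postgresql:
  enabled: false
sqlAlchemyConn: postgresql+psycopg2://my_user:my_secret_password@my-db-hostname:5432/airflow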

Provision connections

The backend DB needs to be initialized, and connections also have to be provisioned in Airflow. There is a provision job for this. Look at the example in the provisioner section of the values.yaml file for some inspiration.

provisioner:
  enabled: true
  cmds: |-
    airflow initdb;
    airflow connections --add --conn_id my_rs_connection \
    --conn_type jdbc \
    --conn_host jdbc:redshift://my-redshift.eu-west-1.redshift.amazonaws.com \
    --conn_login my_rs_user \
    --conn_password my_secret_password \
    --conn_schema my_database \
    --conn_port 5439 \
    --conn_extra '{"extra__jdbc__drv_path": "/usr/local/ariflow/drivers/RedshiftJDBC42-no-awssdk-1.2.15.1025.jar", "extra__jdbc__drv_clsname": "com.amazon.redshift.jdbc42.Driver"}';

This shows how to provision an AWS Redshift JDBC connection, which is supported by the default Docker image.
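
To check the result of the provisioning job, you can list the connections from inside one of the Airflow pods (a sketch; the pod name is a placeholder):

$ kubectl --namespace my-airflow exec -it <webserver-pod-name> -- airflow connections --list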

Provision users

The provision job can also be used to provision users.

provisioner:
  enabled: true
  cmds: |-
    airflow initdb;
    airflow create_user \
    --role Admin \
    --username airflow \
    --password airflow \
    --firstname Air \
    --lastname Flow \
    --email [email protected];

This shows how to provision an admin user called airflow with the password airflow.

Worker logs

Worker (task) logs are not available by default; for now, check the Debugging section to see how to get at them. You can configure remote logging, for example on AWS S3:

  AIRFLOW__CORE__REMOTE_LOGGING: "True"
  AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER: s3://eu-production-airflow/logs/
  AIRFLOW__CORE__ENCRYPT_S3_LOGS: "False"

In this case we rely on the EC2 node instance profile having access to that bucket.
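
If the nodes do not have such an instance profile, one possible sketch is to provision an S3 connection with the provisioner job and point remote logging at it (the connection id s3_logs and the credentials are placeholders; this assumes the standard Airflow 1.10 S3 logging support in the image):

config:
  AIRFLOW__CORE__REMOTE_LOGGING: "True"
  AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER: s3://eu-production-airflow/logs/
  AIRFLOW__CORE__REMOTE_LOG_CONN_ID: s3_logs
provisioner:
  enabled: true
  cmds: |-
    airflow initdb;
    airflow connections --add --conn_id s3_logs \
    --conn_type s3 \
    --conn_extra '{"aws_access_key_id": "my_access_key", "aws_secret_access_key": "my_secret_key"}';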

TODO: Figure out a way to serve logs to webservers.

Either add an optional persistent volume shared between all pods, which requires a ReadWriteMany persistent volume that is not that common to have, or somehow use the Airflow serve_logs functionality.

Provision dags

Airflow's inner workings seem to work best if you do not share a volume for DAGs, but rather put your materialized DAGs as a layer on your Docker image. Not hosting files on a remote filesystem also improves performance a little. It might seem cumbersome to rebuild your Docker image each time, and you might have DAGs in many different repos with different pipelines. I am sorry, but the most reliable way is still to do it like this.

The problem is that most filesystems used for Kubernetes are not ReadWriteMany, and thus not mountable by more than one pod at a time. And most ReadWriteMany solutions are hideously slow, like AWS EFS.

The alternative is to use the git-sync function built into Airflow, which should still work in this configuration. But it performs a git sync for each task into an EmptyDir mount, so basically a full clone per task...

So I made a tiny patch of Airflow in my Docker image. Edit: worker_container_contains_dags = True is set by default.
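
A minimal Dockerfile sketch of this bake-the-DAGs-in approach (assuming the image keeps DAGs under /usr/local/airflow/dags; check the AIRFLOW_HOME of the image you actually use):

FROM tekn0ir/airflow-docker:1.10.1rc2
# Bake the materialized DAG files into the image so scheduler, webserver and workers all ship identical code.
COPY dags/ /usr/local/airflow/dags/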

Installing the Chart

To install the chart with the release name my-airflow in the my-airflow namespace:

$ helm repo add --username <your_github_username> --password <your_github_token> tekn0ir-airflow 'https://raw.githubusercontent.com/tekn0ir/airflow-chart/master/'
$ helm repo update
$ helm upgrade --install my-airflow --namespace=my-airflow tekn0ir-airflow/airflow-chart

This chart includes a postgresql chart as a dependency in its requirements.yaml by default. The chart can be customized using the following configurable parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| airflowImage | Airflow container image name | tekn0ir/airflow-docker |
| airflowImageTag | Airflow container image tag | 1.10.1rc2 |
| imagePullPolicy | Airflow container pull policy | IfNotPresent |
| fernetKey | Airflow Fernet key, for encryption of data | af7CN0q6ag5U3g08IsPsw3K45U7Xa0axgVFhoh-3zB8= |
| ingress.enabled | Enables Ingress for the Airflow webserver | true |
| ingress.annotations | Ingress annotations | {} |
| ingress.labels | Ingress labels | {} |
| ingress.hosts | Ingress accepted hostnames | [airflow.192.168.99.100.xip.io] |
| ingress.tls | Ingress TLS configuration | [] |
| service.annotations | Service annotations | {prometheus.io scrape config} |
| webserver.replicas | Number of webserver replicas | 2 |
| webserver.annotations | Webserver annotations | {} |
| scheduler.annotations | Scheduler annotations | {} |
| rbac.enabled | Enable a service account and role for the cluster to use | true |
| serviceAccountName | ServiceAccount name to use if it cannot be created with RBAC | `` |
| provisioner.enabled | Enable the provisioning job to run arbitrary bash commands in the Airflow cluster, e.g. to initialize the DB and provision connections | true |
| provisioner.cmds | The provisioning commands | ... |
| config | Set any environment variable, mostly used to set Airflow settings by the scheme AIRFLOW__{SECTION}__{KEY} | ... |
| postgresql.enabled | Configure dependency: https://github.com/helm/charts/tree/master/stable/postgresql | true |

Specify parameters using the --set key=value[,key=value] argument to helm install.
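
For example (a sketch reusing parameters from the table above):

$ helm upgrade --install my-airflow --namespace=my-airflow --set webserver.replicas=1,ingress.enabled=false tekn0ir-airflow/airflow-chart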

Alternatively, a YAML file that specifies the values for the parameters can be provided like this:

$ git clone https://github.com/tekn0ir/airflow-chart.git
$ cd airflow-chart
$ helm dependency update
$ helm install --name my-airflow -f values.yaml .

Debugging

One simple thing you can do is to set AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "False" in the config section of your values file. This stops Airflow from removing terminated worker pods, so you can check their logs and descriptions to verify that everything is set correctly.
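
For example (a sketch; the worker pod name is a placeholder):

config:
  AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "False"

$ kubectl --namespace my-airflow get pods
$ kubectl --namespace my-airflow logs <worker-pod-name>
$ kubectl --namespace my-airflow describe pod <worker-pod-name>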
