falconeri's Introduction

falconeri: Run batch data-processing jobs on Kubernetes

Falconeri runs on a pre-existing Kubernetes cluster, and it allows you to use Docker images to transform large data files stored in cloud buckets.

For detailed instructions, see the Falconeri guide.

Setup is simple:

falconeri deploy
falconeri proxy
falconeri migrate

Running is similarly simple:

falconeri job run my-job.json

REST API

Note that falconerid has a complete REST API, so you don't actually need to use the falconeri command-line tool during normal operations. The API is used internally at Faraday and should be fairly self-explanatory, but it isn't documented.
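For example, with falconeri proxy running, you could in principle submit a job spec straight to falconerid. The sketch below is only an educated guess based on how the CLI behaves (the issues further down this page show it posting to http://localhost:8089/jobs); the header and exact payload handling are assumptions, not documented API:

# Hedged sketch only: POST a job spec directly to falconerid via the proxy.
curl -X POST http://localhost:8089/jobs \
    -H 'Content-Type: application/json' \
    --data @my-job.json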

Contributing to falconeri

First, you'll need to set up some development tools:

cargo install just
cargo install cargo-deny
cargo install cargo-edit

# If you want to change the SQL schema, you'll also need the `diesel` CLI. This
# may also require installing some C development libraries.
cargo install diesel_cli

Next, check out the available tasks in the justfile:

just --list

For local development, you'll want to install minikube. Start it as follows, and point your local Docker at it:

minikube start
eval $(minikube docker-env)
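As an optional sanity check (not part of the original instructions), you can confirm that your shell is now pointed at minikube's Docker daemon; this should print minikube rather than your host's name:

docker info --format '{{.Name}}'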

Then build an image. You must have docker-env set up as above if you want to test this image.

just image

Now you can deploy a development version of falconeri to minikube:

cargo run -p falconeri -- deploy --development

Check to see if your cluster comes up:

kubectl get all

# Or if you have `watch`, try:
watch -n 5 kubectl get all

Running the example program

To make sure falconeri works, run the example program. First, run:

cd examples/word-frequencies

Next, you'll need to set up an S3 bucket. If you're at Faraday, run:

# Faraday only!
just secret

If you're not at Faraday, create an S3 bucket and place a *.txt file in $MY_BUCKET/texts/ (example aws commands below, if you need them). Then set up an AWS access key with read/write access to the bucket, and save the key pair in files named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Then run:

# Not for Faraday!
kubectl create secret generic s3 \
    --from-file=AWS_ACCESS_KEY_ID \
    --from-file=AWS_SECRET_ACCESS_KEY
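If you still need to create the bucket and the sample input themselves, the standard AWS CLI can do both. For example (assuming the AWS CLI is configured, $MY_BUCKET holds your bucket name, and example.txt is any local text file):

# Not needed at Faraday.
aws s3 mb "s3://$MY_BUCKET"
aws s3 cp example.txt "s3://$MY_BUCKET/texts/example.txt"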

Then edit word-frequencies.json to point at your bucket.

Now you can build the worker image using:

# This assumes you previously ran `just image` in the top-level directory.
just image

In another terminal, start the falconeri proxy:

just proxy

In the original terminal, start the job:

just run

From here, you can use falconeri job describe $ID and kubectl normally. See the guide for more details.

Releasing a new falconeri

For now, this process should only be done by Eric, because there are some semver issues that we haven't fully thought out yet.

First, edit the CHANGELOG.md file to describe the release. Next, bump the version:

just set-version $MY_NEW_VERSION

Commit your changes with a subject like:

$MY_NEW_VERSION: Short description
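For example, using git directly:

git commit -a -m "$MY_NEW_VERSION: Short description"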

You should be able to make a release by running:

just MODE=release release

Once the binaries have been built, you can find them at https://github.com/faradayio/falconeri/releases. The CHANGELOG.md entry should be automatically converted into release notes.

Changing the database schema

We use diesel as our ORM. This has complex tradeoffs, and we've been considering whether to move to sqlx or tokio-postgres in the future. See above for instructions on installing diesel_cli.

To create a new migration, run:

cd falconeri_common
diesel migration generate add_some_table_or_columns

This will generate new up.sql and down.sql files which you can edit as needed. These work like Rails migrations: up.sql makes the necessary changes to the database, and down.sql reverts those changes. In this case, though, migrations are written in plain SQL.
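As a purely illustrative sketch (the notes column is invented for this example and is not part of falconeri's actual schema), a migration pair might look like:

-- up.sql: add a hypothetical `notes` column to the `jobs` table.
ALTER TABLE jobs ADD COLUMN notes TEXT;

-- down.sql: revert the change.
ALTER TABLE jobs DROP COLUMN notes;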

You can show a list of migrations using:

diesel migration list

To apply pending migrations, run:

diesel migration run

# Test the `down.sql` file as well.
diesel migration revert
diesel migration run

After doing this, review falconeri_common/src/schema.rs and revert any generated changes that break the schema or introduce warnings. You will probably also need to update the corresponding files in falconeri_common/src/models/.

Migrations are also compiled into the server and run automatically on deploys.

falconeri's People

Contributors

emk, seamusabshere, jeteipel

Stargazers

Umang Bhalla, Shashank Pachava, Roman Hossain Shaon, GAURAV, Nikolay Tsutsarin, Arthur Rand, Prithaj Nath

Watchers

Andy Rossmeissl, James Cloos

Forkers

icodein

falconeri's Issues

falconeri job run timeout is too short, causing false negatives

falconeri job run pipelines/x.json
Error: error posting http://localhost:8089/jobs
  caused by: http://localhost:8089/jobs: timed out

the timeout appears to be about 10 seconds, but it should be more like 60, so that the server has time to (for example) `gcloud ls` everything

this makes you think a job didn't start when in fact it did

Better cleanup of output buckets after failed datums

This is a follow-on to fixing #33.

From the source:

            // Remove `OutputFile` records for this datum, so we can upload the
            // same output files again.
            //
            // TODO: Unfortunately, there's an issue here. It takes one of two
            // forms:
            //
            // 1. Workers use deterministic file names. In this case, we
            //    _should_ be fine, because we'll just overwrite any files we
            //    did manage to upload.
            // 2. Workers use random filenames. Here, there are two subcases:
            //    a. We have successfully created an `OutputFile` record.
            //    b. We have yet to create an `OutputFile` record.
            //
            // We need to fix (2b) by pre-creating all our `OutputFile` records
            // _before_ uploading, and then updating them later to show that the
            // output succeeded. This turns them into case (2a). And then we can
            // fix (2a) by deleting any S3/GCS files corresponding to
            // `OutputFile::uri`.

can't detect failure of own pods ~doesn't scale down reliably~

Hours after a job has finished successfully, I see things like:

Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    vendor-union-az8f1-pvwvl                                    1 (12%)       0 (0%)      4G (14%)         4G (14%)       15h

I delete the job (kubectl delete job/vendor-union-az8f1) and the node is autoscaled away.

➡️ maybe: when it's done with a falc job, delete the k8s job.

Make postgres proxy port configurable

Currently the proxy seems to be hard-coded to listen on 5432, which conflicts with my local postgres instance.

It would be nice if falconeri proxy could listen on some other port.
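Until that exists, one possible stopgap (the resource name below is a guess; substitute whatever kubectl get pods shows for falconeri's database) is to run the port-forward yourself on a free local port, which at least lets psql and similar tools reach falconeri's database while a local postgres keeps 5432:

# Hypothetical pod name; adjust it to match your deployment.
kubectl port-forward pod/falconeri-postgres-0 5433:5432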

falconeri_common::storage::s3 spams the log

[2020-11-30T22:14:49Z TRACE falconeri_common::storage::s3] uploading /pfs/out/ to s3://mybucket/myfolder/
Nov 30 17:14:55 worker falconeri-production[myjob: Completed 256.0 KiB/272.1 MiB (858.2 KiB/s) with 1 file(s) remaining Completed 512.0 KiB/272.1 MiB (1.6 MiB/s) with 1 file(s) remaining Completed 768.0 KiB/272.1 MiB (2.4 MiB/s) with 1 file(s) remaining Completed 1.0 MiB/272.1 MiB (3.1 MiB/s) with 1 file(s) remaining Completed 1.2 MiB/272.1 MiB (3.8 MiB/s) with 1 file(s) remaining Completed 1.5 MiB/272.1 MiB (4.4 MiB/s) with 1 file(s) remaining Completed 1.8 MiB/272.1 MiB (5.0 MiB/s) with 1 file(s) remaining Completed 2.0 MiB/272.1 MiB (5.7 MiB/s) with 1 file(s) remaining Completed 2.2 MiB/272.1 MiB (6.3 MiB/s) with 1 file(s) remaining Completed 2.5 MiB/272.1 MiB (6.9 MiB/s) with 1 file(s) remaining Completed 2.8 MiB/272.1 MiB (7.6 MiB/s) with 1 file(s) remaining Completed 3.0 MiB/272.1 MiB (8.2 MiB/s) with 1 file(s) remaining Completed 3.2 MiB/272.1 MiB (8.8 MiB/s) with 1 file(s) remaining Completed 3.5 MiB/272.1 MiB (9.5 MiB/s) with 1 file(s) remaining Completed 3.8 MiB/272.1 MiB (10.1 MiB/s) with 1 file(s) remaining Completed 4.0 MiB/272.1 MiB (10.7 MiB/s) with 1 file(s) remaining Completed 4.2 MiB/272.1 MiB (10.9 MiB/s) with 1 file(s) remaining Completed 4.5 MiB/272.1 MiB (11.4 MiB/s) with 1 file(s) remaining Completed 4.8 MiB/272.1 MiB (12.0 MiB/s) with 1 file(s) remaining Completed 5.0 MiB/272.1 MiB (12.6 MiB/s) with 1 file(s) remaining Completed 5.2 MiB/272.1 MiB (13.2 MiB/s) with 1 file(s) remaining Completed 5.5 MiB/272.1 MiB (13.5 MiB/s) with 1 file(s) remaining Completed 5.8 MiB/272.1 MiB (14.0 MiB/s) with 1 file(s) remaining Completed 6.0 MiB/272.1 MiB (14.6 MiB/s) with 1 file(s) remaining Completed 6.2 MiB/272.1 MiB (15.1 MiB/s) with 1 file(s) remaining Completed 6.5 MiB/272.1 MiB (15.7 MiB/s) with 1 file(s) remaining Completed 6.8 MiB/272.1 MiB (16.1 MiB/s) with 1 file(s) remaining Completed 7.0 MiB/272.1 MiB (16.7 MiB/s) with 1 file(s) remaining Completed 7.2 MiB/272.1 MiB (17.1 MiB/s) with 1 file(s) remaining Completed 7.5 MiB/272.1 MiB (17.5 MiB/s) with 1 file(s) remaining Completed 7.8 MiB/272.1 MiB (18.1 MiB/s) with 1 file(s) remaining Completed 8.0 MiB/272.1 MiB (18.6 MiB/s) with 1 file(s) remaining Completed 8.2 MiB/272.1 MiB (19.1 MiB/s) with 1 file(s) remaining Completed 8.5 MiB/272.1 MiB (19.6 MiB/s) with 1 file(s) remaining Completed 8.8 MiB/272.1 MiB (20.2 MiB/s) with 1 file(s) remaining Completed 9.0 MiB/272.1 MiB (20.6 MiB/s) with 1 file(s) remaining Completed 9.2 MiB/272.1 MiB (21.1 MiB/s) with 1 file(s) remaining Completed 9.5 MiB/272.1 MiB (21.7 MiB/s) with 1 file(s) remaining Completed 9.8 MiB/272.1 MiB (22.1 MiB/s) with 1 file(s) remaining Completed 10.0 MiB/272.1 MiB (22.5 MiB/s) with 1 file(s) remaining Completed 10.2 MiB/272.1 MiB (23.0 MiB/s) with 1 file(s) remaining Completed 10.5 MiB/272.1 MiB (23.3 MiB/s) with 1 file(s) remaining Completed 10.8 MiB/272.1 MiB (23.8 MiB/s) with 1 file(s) remaining Completed 11.0 MiB/272.1 MiB (24.3 MiB/s) with 1 file(s) remaining Completed 11.2 MiB/272.1 MiB (24.6 MiB/s) with 1 file(s) remaining Completed 11.5 MiB/272.1 MiB (25.1 MiB/s) with 1 file(s) remaining Completed 11.8 MiB/272.1 MiB (25.6 MiB/s) with 1 file(s) remaining Completed 12.0 MiB/272.1 MiB (26.1 MiB/s) with 1 file(s) remaining Completed 12.2 MiB/272.1 MiB (26.5 MiB/s) with 1 file(s) remaining Completed 12.5 MiB/272.1 MiB (26.8 MiB/s) with 1 file(s) remaining Completed 12.8 MiB/272.1 MiB (27.3 MiB/s) with 1 file(s) remaining Completed 13.0 MiB/272.1 MiB (27.8 MiB/s) with 1 file(s) remaining Completed 13.2 MiB/272.1 MiB (28.0 MiB/s) with 1 file(s) 
remaining Completed

prints sensitive information in log entries

app: dbcrossbar.2 MiB/    0.0 B]    3.3 MiB/s
 ver: 0.0.13
  from_locator: postgres://SECRET
   stream: seamus_recipients
    table: seamus_recipients

from_locator (and likewise to_locator) probably contains secrets

falconeri deploy asks for wrong roles

  • get rid of nodes
  • specify what batch you want
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "delete", "deletecollection", "patch", "update", "get", "list", "watch"]
# We'll eventually need read-only access to pod and node information to manage
# various monitoring and recovery tasks.
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

`falc proxy` silently fails if 5432 is already taken

this is what success looks like:

$ falc proxy
Forwarding from [::1]:5432 -> 5432
Forwarding from 127.0.0.1:5432 -> 5432

this is what silent failure looks like (because a local postgres is running on 5432):

$ falc proxy
Forwarding from 127.0.0.1:5432 -> 5432

but

$ falc job list
Error: DatabaseError(__Unknown, "relation \"jobs\" does not exist")

could not list jobs

"job describe" should show pod names next to datum ids

most of the time you want the pod name so you can look at its logs

just put an extra column in here:

Running datums:
ID  STARTED_AT
9ee7b37d-80d2-4b6f-a30b-4c98220e5697  2020-08-25T00:53:05.251386

alternative: falconeri datum log X
