falconeri's Introduction

falconeri: Run batch data-processing jobs on Kubernetes

Falconeri runs on a pre-existing Kubernetes cluster, and it allows you to use Docker images to transform large data files stored in cloud buckets.

For detailed instructions, see the Falconeri guide.

Setup is simple:

falconeri deploy
falconeri proxy
falconeri migrate

Running is similarly simple:

falconeri job run my-job.json

REST API

Note that falconerid has a complete REST API, so you don't actually need to use the falconeri command-line tool during normal operations. The API is used internally at Faraday and should be fairly self-explanatory, but it isn't documented.
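For example, with falconeri proxy running, you could in principle submit a job spec straight to falconerid. The sketch below is only an educated guess based on how the CLI behaves (the issues further down this page show it posting to http://localhost:8089/jobs); the header and exact payload handling are assumptions, not documented API:

# Hedged sketch only: POST a job spec directly to falconerid via the proxy.
curl -X POST http://localhost:8089/jobs \
    -H 'Content-Type: application/json' \
    --data @my-job.json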

Contributing to falconeri

First, you'll need to set up some development tools:

cargo install just
cargo install cargo-deny
cargo install cargo-edit

# If you want to change the SQL schema, you'll also need the `diesel` CLI. This
# may also require installing some C development libraries.
cargo install diesel_cli

Next, check out the available tasks in the justfile:

just --list

For local development, you'll want to install minikube. Start it as follows, and point your local Docker at it:

minikube start
eval $(minikube docker-env)
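As an optional sanity check (not part of the original instructions), you can confirm that your shell is now pointed at minikube's Docker daemon; this should print minikube rather than your host's name:

docker info --format '{{.Name}}'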

Then build an image. You must have docker-env set up as above if you want to test this image.

just image

Now you can deploy a development version of falconeri to minikube:

cargo run -p falconeri -- deploy --development

Check to see if your cluster comes up:

kubectl get all

# Or if you have `watch`, try:
watch -n 5 kubectl get all

Running the example program

To make sure falconeri works, run the example program. First, run:

cd examples/word-frequencies

Next, you'll need to set up an S3 bucket. If you're at Faraday, run:

# Faraday only!
just secret

If you're not at Faraday, create an S3 bucket and place a *.txt file in $MY_BUCKET/texts/ (example aws commands below, if you need them). Then set up an AWS access key with read/write access to the bucket, and save the key pair in files named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Then run:

# Not for Faraday!
kubectl create secret generic s3 \
    --from-file=AWS_ACCESS_KEY_ID \
    --from-file=AWS_SECRET_ACCESS_KEY
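If you still need to create the bucket and the sample input themselves, the standard AWS CLI can do both. For example (assuming the AWS CLI is configured, $MY_BUCKET holds your bucket name, and example.txt is any local text file):

# Not needed at Faraday.
aws s3 mb "s3://$MY_BUCKET"
aws s3 cp example.txt "s3://$MY_BUCKET/texts/example.txt"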

Then edit word-frequencies.json to point at your bucket.

Now you can build the worker image using:

# This assumes you previously ran `just image` in the top-level directory.
just image

In another terminal, start the falconeri proxy:

just proxy

In the original terminal, start the job:

just run

From here, you can use falconeri job describe $ID and kubectl normally. See the guide for more details.

Releasing a new falconeri

For now, this process should only be done by Eric, because there are some semver issues that we haven't fully thought out yet.

First, edit the CHANGELOG.md file to describe the release. Next, bump the version:

just set-version $MY_NEW_VERSION

Commit your changes with a subject like:

$MY_NEW_VERSION: Short description
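For example, using git directly:

git commit -a -m "$MY_NEW_VERSION: Short description"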

You should be able to make a release by running:

just MODE=release release

Once the binaries have been built, you can find them at https://github.com/faradayio/falconeri/releases. The CHANGELOG.md entry should be automatically converted into release notes.

Changing the database schema

We use diesel as our ORM. This has complex tradeoffs, and we've been considering whether to move to sqlx or tokio-postgres in the future. See above for instructions on installing diesel_cli.

To create a new migration, run:

cd falconeri_common
diesel migration generate add_some_table_or_columns

This will generate new up.sql and down.sql files which you can edit as needed. These work like Rails migrations: up.sql makes the necessary changes to the database, and down.sql reverts those changes. In this case, though, migrations are written in plain SQL.
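As a purely illustrative sketch (the notes column is invented for this example and is not part of falconeri's actual schema), a migration pair might look like:

-- up.sql: add a hypothetical `notes` column to the `jobs` table.
ALTER TABLE jobs ADD COLUMN notes TEXT;

-- down.sql: revert the change.
ALTER TABLE jobs DROP COLUMN notes;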

You can show a list of migrations using:

diesel migration list

To apply pending migrations, run:

diesel migration run

# Test the `down.sql` file as well.
diesel migration revert
diesel migration run

After doing this, review falconeri_common/src/schema.rs and revert any generated changes that break the schema or introduce warnings. You will probably also need to update the corresponding files in falconeri_common/src/models/.

Migrations are also compiled into the server and run automatically on deploys.

falconeri's People

Contributors

emk, seamusabshere, jeteipel

Stargazers

Umang Bhalla, Shashank Pachava, Roman Hossain Shaon, GAURAV, Nikolay Tsutsarin, Arthur Rand, Prithaj Nath

Watchers

Andy Rossmeissl, James Cloos

Forkers

icodein

falconeri's Issues

falconeri job run timeout is too short, causing false negatives

falconeri job run pipelines/x.json
Error: error posting http://localhost:8089/jobs
  caused by: http://localhost:8089/jobs: timed out

the timeout appears to be about 10 seconds, but it should be more like 60, so that the server has time to (for example) `gcloud ls` everything

this makes you think a job didn't start when in fact it did

Better cleanup of output buckets after failed datums

This is a follow-on to fixing #33.

From the source:

            // Remove `OutputFile` records for this datum, so we can upload the
            // same output files again.
            //
            // TODO: Unfortunately, there's an issue here. It takes one of two
            // forms:
            //
            // 1. Workers use deterministic file names. In this case, we
            //    _should_ be fine, because we'll just overwrite any files we
            //    did manage to upload.
            // 2. Workers use random filenames. Here, there are two subcases:
            //    a. We have successfully created an `OutputFile` record.
            //    b. We have yet to create an `OutputFile` record.
            //
            // We need to fix (2b) by pre-creating all our `OutputFile` records
            // _before_ uploading, and then updating them later to show that the
            // output succeeded. This turns them into case (2a). And then we can
            // fix (2a) by deleting any S3/GCS files corresponding to
            // `OutputFile::uri`.

can't detect failure of own pods ~doesn't scale down reliably~

Hours after a job has finished successfully, I see things like:

Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    vendor-union-az8f1-pvwvl                                    1 (12%)       0 (0%)      4G (14%)         4G (14%)       15h

I delete the job (kubectl delete job/vendor-union-az8f1) and the node is autoscaled away.

➡️ maybe: when it's done with a falc job, delete the k8s job.

Make postgres proxy port configurable

Currently the proxy seems to be hard-coded to listen on 5432, which conflicts with my local postgres instance.

It would be nice if falconeri proxy could listen on some other port.
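Until that exists, one possible stopgap (the resource name below is a guess; substitute whatever kubectl get pods shows for falconeri's database) is to run the port-forward yourself on a free local port, which at least lets psql and similar tools reach falconeri's database while a local postgres keeps 5432:

# Hypothetical pod name; adjust it to match your deployment.
kubectl port-forward pod/falconeri-postgres-0 5433:5432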

falconeri_common::storage::s3 spams the log

[2020-11-30T22:14:49Z TRACE falconeri_common::storage::s3] uploading /pfs/out/ to s3://mybucket/myfolder/
Nov 30 17:14:55 worker falconeri-production[myjob: Completed 256.0 KiB/272.1 MiB (858.2 KiB/s) with 1 file(s) remaining Completed 512.0 KiB/272.1 MiB (1.6 MiB/s) with 1 file(s) remaining Completed 768.0 KiB/272.1 MiB (2.4 MiB/s) with 1 file(s) remaining Completed 1.0 MiB/272.1 MiB (3.1 MiB/s) with 1 file(s) remaining Completed 1.2 MiB/272.1 MiB (3.8 MiB/s) with 1 file(s) remaining Completed 1.5 MiB/272.1 MiB (4.4 MiB/s) with 1 file(s) remaining Completed 1.8 MiB/272.1 MiB (5.0 MiB/s) with 1 file(s) remaining Completed 2.0 MiB/272.1 MiB (5.7 MiB/s) with 1 file(s) remaining Completed 2.2 MiB/272.1 MiB (6.3 MiB/s) with 1 file(s) remaining Completed 2.5 MiB/272.1 MiB (6.9 MiB/s) with 1 file(s) remaining Completed 2.8 MiB/272.1 MiB (7.6 MiB/s) with 1 file(s) remaining Completed 3.0 MiB/272.1 MiB (8.2 MiB/s) with 1 file(s) remaining Completed 3.2 MiB/272.1 MiB (8.8 MiB/s) with 1 file(s) remaining Completed 3.5 MiB/272.1 MiB (9.5 MiB/s) with 1 file(s) remaining Completed 3.8 MiB/272.1 MiB (10.1 MiB/s) with 1 file(s) remaining Completed 4.0 MiB/272.1 MiB (10.7 MiB/s) with 1 file(s) remaining Completed 4.2 MiB/272.1 MiB (10.9 MiB/s) with 1 file(s) remaining Completed 4.5 MiB/272.1 MiB (11.4 MiB/s) with 1 file(s) remaining Completed 4.8 MiB/272.1 MiB (12.0 MiB/s) with 1 file(s) remaining Completed 5.0 MiB/272.1 MiB (12.6 MiB/s) with 1 file(s) remaining Completed 5.2 MiB/272.1 MiB (13.2 MiB/s) with 1 file(s) remaining Completed 5.5 MiB/272.1 MiB (13.5 MiB/s) with 1 file(s) remaining Completed 5.8 MiB/272.1 MiB (14.0 MiB/s) with 1 file(s) remaining Completed 6.0 MiB/272.1 MiB (14.6 MiB/s) with 1 file(s) remaining Completed 6.2 MiB/272.1 MiB (15.1 MiB/s) with 1 file(s) remaining Completed 6.5 MiB/272.1 MiB (15.7 MiB/s) with 1 file(s) remaining Completed 6.8 MiB/272.1 MiB (16.1 MiB/s) with 1 file(s) remaining Completed 7.0 MiB/272.1 MiB (16.7 MiB/s) with 1 file(s) remaining Completed 7.2 MiB/272.1 MiB (17.1 MiB/s) with 1 file(s) remaining Completed 7.5 MiB/272.1 MiB (17.5 MiB/s) with 1 file(s) remaining Completed 7.8 MiB/272.1 MiB (18.1 MiB/s) with 1 file(s) remaining Completed 8.0 MiB/272.1 MiB (18.6 MiB/s) with 1 file(s) remaining Completed 8.2 MiB/272.1 MiB (19.1 MiB/s) with 1 file(s) remaining Completed 8.5 MiB/272.1 MiB (19.6 MiB/s) with 1 file(s) remaining Completed 8.8 MiB/272.1 MiB (20.2 MiB/s) with 1 file(s) remaining Completed 9.0 MiB/272.1 MiB (20.6 MiB/s) with 1 file(s) remaining Completed 9.2 MiB/272.1 MiB (21.1 MiB/s) with 1 file(s) remaining Completed 9.5 MiB/272.1 MiB (21.7 MiB/s) with 1 file(s) remaining Completed 9.8 MiB/272.1 MiB (22.1 MiB/s) with 1 file(s) remaining Completed 10.0 MiB/272.1 MiB (22.5 MiB/s) with 1 file(s) remaining Completed 10.2 MiB/272.1 MiB (23.0 MiB/s) with 1 file(s) remaining Completed 10.5 MiB/272.1 MiB (23.3 MiB/s) with 1 file(s) remaining Completed 10.8 MiB/272.1 MiB (23.8 MiB/s) with 1 file(s) remaining Completed 11.0 MiB/272.1 MiB (24.3 MiB/s) with 1 file(s) remaining Completed 11.2 MiB/272.1 MiB (24.6 MiB/s) with 1 file(s) remaining Completed 11.5 MiB/272.1 MiB (25.1 MiB/s) with 1 file(s) remaining Completed 11.8 MiB/272.1 MiB (25.6 MiB/s) with 1 file(s) remaining Completed 12.0 MiB/272.1 MiB (26.1 MiB/s) with 1 file(s) remaining Completed 12.2 MiB/272.1 MiB (26.5 MiB/s) with 1 file(s) remaining Completed 12.5 MiB/272.1 MiB (26.8 MiB/s) with 1 file(s) remaining Completed 12.8 MiB/272.1 MiB (27.3 MiB/s) with 1 file(s) remaining Completed 13.0 MiB/272.1 MiB (27.8 MiB/s) with 1 file(s) remaining Completed 13.2 MiB/272.1 MiB (28.0 MiB/s) with 1 file(s) 
remaining Completed

prints sensitive information in log entries

app: dbcrossbar.2 MiB/    0.0 B]    3.3 MiB/s
 ver: 0.0.13
  from_locator: postgres://SECRET
   stream: seamus_recipients
    table: seamus_recipients

from_locator (and likewise to_locator) probably contains secrets

falconeri deploy asks for wrong roles

  • get rid of nodes
  • specify what batch you want
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "delete", "deletecollection", "patch", "update", "get", "list", "watch"]
# We'll eventually need read-only access to pod and node information to manage
# various monitoring and recovery tasks.
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

`falc proxy` silently fails if 5432 is already taken

this is what success looks like:

$ falc proxy
Forwarding from [::1]:5432 -> 5432
Forwarding from 127.0.0.1:5432 -> 5432

this is what silent failure looks like (because a local postgres is running on 5432):

$ falc proxy
Forwarding from 127.0.0.1:5432 -> 5432

but

$ falc job list
Error: DatabaseError(__Unknown, "relation \"jobs\" does not exist")

could not list jobs

"job describe" should show pod names next to datum ids

most of the time you want the pod name so you can look at its logs

just put an extra column in here:

Running datums:
ID  STARTED_AT
9ee7b37d-80d2-4b6f-a30b-4c98220e5697  2020-08-25T00:53:05.251386

alternative: falconeri datum log X
