flyteorg / flyte


Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

Home Page: https://flyte.org

License: Apache License 2.0

flyte machine-learning golang scale workflow data-science data-analysis data kubernetes-operator kubernetes orchestration-engine mlops dataops grpc python production production-grade declarative fine-tuning llm

flyte's Introduction

Flyte and LF AI & Data Logo

Flyte

πŸ—οΈ πŸš€ πŸ“ˆ

Current Release label Sandbox Status label Test Status label License label OpenSSF Best Practices label Flyte Helm Chart label Flyte Slack label

Flyte is an open-source orchestrator that facilitates building production-grade data and ML pipelines. It is built for scalability and reproducibility, leveraging Kubernetes as its underlying platform. With Flyte, user teams can construct pipelines using the Python SDK, and seamlessly deploy them on both cloud and on-premises environments, enabling distributed processing and efficient resource utilization.

Build

Write code in Python or any other language and leverage a robust type engine.

Getting started with Flyte

Deploy & Scale

Either locally or on a remote cluster, execute your models with ease.

Getting started with Flyte


Quick start

  1. Install Flyte's Python SDK:
pip install flytekit
  2. Create a workflow (see example)
  3. Run it locally with:
pyflyte run hello_world.py hello_world_wf

Ready to try a Flyte cluster?

  1. Create a new sandbox cluster, running as a Docker container:
flytectl demo start
  2. Now execute your workflows on the cluster:
pyflyte run --remote hello_world.py hello_world_wf

Getting started with Flyte, showing the welcome screen and Flyte dashboard

Do you want to see more but don't want to install anything?

Head over to https://sandbox.union.ai/. It allows you to experiment with Flyte's capabilities from a hosted Jupyter notebook.

Ready to productionize?

Go to the Deployment guide for instructions on installing Flyte in different environments.

Tutorials

Features

πŸš€ Strongly typed interfaces: Validate your data at every step of the workflow by defining data guardrails using Flyte types.
🌐 Any language: Write code in any language using raw containers, or choose Python, Java, Scala or JavaScript SDKs to develop your Flyte workflows.
πŸ”’ Immutability: Immutable executions help ensure reproducibility by preventing any changes to the state of an execution.
🧬 Data lineage: Track the movement and transformation of data throughout the lifecycle of your data and ML workflows.
πŸ“Š Map tasks: Achieve parallel code execution with minimal configuration using map tasks.
🌎 Multi-tenancy: Multiple users can share the same platform while maintaining their own distinct data and configurations.
🌟 Dynamic workflows: Build flexible and adaptable workflows that can change and evolve as needed, making it easier to respond to changing requirements.
⏯️ Wait for external inputs before proceeding with the execution.
🌳 Branching: Selectively execute branches of your workflow based on static or dynamic data produced by other tasks or input data.
πŸ“ˆ Data visualization: Visualize data, monitor models and view training history through plots.
πŸ“‚ FlyteFile & FlyteDirectory: Transfer files and directories between local and cloud storage.
πŸ—ƒοΈ Structured dataset: Convert dataframes between types and enforce column-level type checking using the abstract 2D representation provided by Structured Dataset.
πŸ›‘οΈ Recover from failures: Recover only the failed tasks.
πŸ” Rerun a single task: Rerun workflows at the most granular level without modifying the previous state of a data/ML workflow.
πŸ” Cache outputs: Cache task outputs by passing cache=True to the task decorator.
🚩 Intra-task checkpointing: Checkpoint progress within a task execution.
⏰ Timeout: Define a timeout period, after which the task is marked as failed.
🏭 Dev to prod: As simple as changing your domain from development or staging to production.
πŸ’Έ Spot or preemptible instances: Schedule your workflows on spot instances by setting interruptible to True in the task decorator.
☁️ Cloud-native deployment: Deploy Flyte on AWS, GCP, Azure and other cloud services.
πŸ“… Scheduling: Schedule your data and ML workflows to run at a specific time.
πŸ“’ Notifications: Stay informed about changes to your workflow's state by configuring notifications through Slack, PagerDuty or email.
βŒ›οΈ Timeline view: Evaluate the duration of each of your Flyte tasks and identify potential bottlenecks.
πŸ’¨ GPU acceleration: Enable and control your tasks’ GPU demands by requesting resources in the task decorator.
🐳 Dependency isolation via containers: Maintain separate sets of dependencies for your tasks so no dependency conflicts arise.
πŸ”€ Parallelism: Flyte tasks are inherently parallel to optimize resource consumption and improve performance.
πŸ’Ύ Allocate resources dynamically at the task level.

Who's using Flyte

Join the likes of LinkedIn, Spotify, Freenome, Pachama, Warner Bros. and many others in adopting Flyte for mission-critical use cases. For a full list of adopters and information on how to add your organization or project, please visit our ADOPTERS page.

How to stay involved

πŸ“† Weekly office hours: Live informal sessions with the Flyte team held every week. Book a 30-minute slot and get your questions answered.
πŸ‘₯ Monthly community sync: Happening the first Tuesday of every month, this is where the Flyte team provides updates on the project, and community members can share their progress and ask questions.
πŸ’¬ Slack: Join the Flyte community on Slack to chat with other users, ask questions, and get help.
⚠️ Newsletter: Join this group to receive the Flyte Monthly newsletter.
πŸ“Ή YouTube: Tune into panel discussions, customer success stories, community updates and feature deep dives.
πŸ“„ Blog: Here, you can find tutorials and feature deep dives to help you learn more about Flyte.
πŸ’‘ RFCs: RFCs are used for proposing new ideas and features to improve Flyte. You can refer to them to stay updated on the latest developments and contribute to the growth of the platform.

How to contribute

There are many ways to get involved in Flyte, including:

We ❀️ our contributors


License

Flyte is available under the Apache License 2.0. Use it wisely.

flyte's People

Contributors

akhurana001, anandswaminathan, andrewwdye, bnsblue, byronhsu, chanadian, cosmicbboy, davidmirror-ops, ddl-ebrown, eapolinario, enghabu, flyte-bot, future-outlier, goreleaserbot, hamersaw, honnix, jeevb, katrogan, kumare3, mayitbeegh, migueltol22, neverett, pingsutw, pmahindrakar-oss, samhita-alla, sandragh5, smritisatyanv, surindersinghp, wild-endeavor, yindia


flyte's Issues

Parallel Node (Propeller Side)

TCS is excited about the native parallelization offered in Flyte 2.0. This task is for the Propeller-side execution of parallel nodes.

Expanded error message collapses when scrolling out of view

  • Find an execution in the executions table (workflow details page) that has a long error message.
  • Click to expand the error message.
  • Scroll the row out of view
  • Scroll the row back into view

Expected: The error message should still be expanded.

Actual: The error message renders collapsed, but the row is still the size that it would be with the error message expanded. Now the content sits in the middle of a row that is too tall.

Allow download of Inputs / Outputs

It's unclear exactly what format things should be in, but for I/O types like CSV/Blob/Schema we should be able to provide a download link for the user.

Options:

  1. Convert it to a signed S3 link. This is probably not the right move because we need to verify the identity of a user before allowing them to download
  2. Convert the s3:// protocol to an actual s3 link. It would be up to the user to ensure they are assuming the correct role to be able to download the file.

Likely it will be option 2.

For things like CSV list, we have to consider how to display a list of these items.

Console sends `undefined` instead of `false` for unchecked toggle switches

For workflows which take boolean values, the Console renders a toggle switch. When the toggle remains switched to "off", the resulting computed value is undefined instead of false. This translates to passing no value for the input when making the launch request.
For required inputs with no default value, that will result in a 400.

At the very least, if a boolean value is required and has no default, we should be translating an unchecked toggle to false to make sure the launch request succeeds.

Once default values are implemented for the form, this should become less of an issue.
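The minimal fix described above is a one-line coercion at form-submission time. A language-agnostic sketch (shown in Python for illustration; the real fix would live in the Console's TypeScript):

```python
def coerce_toggle(value):
    """Map an unset toggle (None, i.e. JS `undefined`) to an explicit False.

    Booleans that were actually set pass through unchanged, so the
    launch request always carries a concrete value for required inputs.
    """
    return value if isinstance(value, bool) else False
```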

Audit of UI / UX tests

We need a story around what types of testing we are doing for the UI, and an update of the existing test coverage to move toward that goal.
Right now, we have a mixture of tests implemented with react-testing-library, Enzyme(?), and react-test-renderer (mostly snapshots which we don't really need).

The target will be:

  • Use react-testing-library for all unit/component tests.
  • Remove Enzyme / react-test-renderer
  • Make a decision on whether we need any integration / end-to-end / automated UI testing (something like Cypress / BrowserStack / etc.)
  • Choose a target for code coverage and open one or more issues to track hitting that target.

Node Validators

It should be possible to specify pre and post validators on nodes to prevent advancement of a node (or cache poisoning) if the input/output data does not match standards.
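As an illustration of the idea, pre and post validators could be plain predicates run around a node's execution. Everything below (names, signatures) is hypothetical, not Flyte's actual API:

```python
def run_node(node_fn, inputs, pre_validators=(), post_validators=()):
    """Run a node only if all pre-validators accept its inputs, and
    refuse to advance (or cache) the output unless all post-validators
    accept it. Hypothetical sketch of the validator concept."""
    for check in pre_validators:
        if not check(inputs):
            raise ValueError(f"pre-validation failed: {check.__name__}")
    output = node_fn(inputs)
    for check in post_validators:
        if not check(output):
            raise ValueError(f"post-validation failed: {check.__name__}")
    return output
```

For example, a post-validator that rejects empty results would stop a bad output from poisoning the cache.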

Handle edge cases around schedule updates

Background: We don't have any transactional guarantees for the case where a schedule rule in CloudWatch is, say, deleted but the subsequent database update fails. Although we return an error and a user can retry (and the delete call to CloudWatch is idempotent), unless the user retries we have no guarantee of being in a non-corrupt state.

We could update the scheduled workflow event dequeuing logic to trigger a call to delete a rule when no active launch plan versions exist. Unfortunately there's a possible race condition this exposes in the case of an end-user calling disable in one step, and then enable separately after that.

As a solution, [~matthewsmith] proposed adding an epoch to schedule names to distinguish them. Since we already want to make schedule names more descriptive (with some kind of truncated project & domain in the name) that work can fall under this work item.

Add Auth to Console

Admin handles most of the auth flow. Console needs to properly handle 401 responses and redirect to the auth flow to refresh cookies.

Support additional input types in the Launch UI

We don't currently support list/map or some of the less common types. This task is to at least implement list/map and explore if there is anything we can do about supporting the other types.

Default timeout policy

Right now if a container is misconfigured or something, the job sticks around forever. Propeller should garbage collect and fail it.

Plugin Default Behavior Update

{"json":{"exec_id":"","node":"","ns":"-development","routine":"worker-13","src":"handler.go:216","tasktype":"spark","wf":"***.SparkTasksWorkflow"},"level":"warning","msg":"No plugin found for Handler-type [spark], defaulting to [container]","ts":"2019-11-11T21:09:36Z"}

Defaulting Spark to a container doesn't make sense; ideally we should fail cleanly at the Propeller level and expose the error to users, instead of executing it as a container task and causing unknown/weird container failures. This also applies to other tasks like Hive/Sidecar.

HTTP 400 returned when attempting to retrieve data for NodeExecution child of a Dynamic Task

Update:

This is a UI bug. We should not attempt to retrieve inputs if no inputsUri is set, and should not attempt to retrieve outputs if closure.outputsUri is unset.


Direct child

[https://flyte.lyft.net/api/v1/data/node_executions/flytekit/production/y9n8xi9amd/task1-b0e1be7f74-h-task-sqb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0]

Grandchild (nested subtask)

[https://flyte.lyft.net/api/v1/data/node_executions/flytekit/production/y9n8xi9amd/task1-b0e1be7f74-h-task-sqb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0-78d085b30a--sub-taskb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0]

The above URLs should both return NodeExecution data for the ids provided, but instead they return an error "invalid URI".

Move flytegraph into a separate package

The graph components in the console are designed to be a reusable package, but while it's under active development I'm leaving it inside the flyteconsole repo. This ticket is for tracking the work to be done to publish it as a standalone package.

Sorting/filtering by inputs

It's useful to filter executions down by the value of certain inputs. For instance, if a workflow takes a region code as an input and is run frequently with different values for the region code, a user may want to only see executions using one given value of that code ("SEA").

This functionality will require a design spec, since workflows may have many inputs of varying types and indexing across those types and values is non-trivial.

Note: There is an internal design document that could be cleaned up and moved to public in order to provide guidance for this item.

Graph Enhancements

This is to cover any overflow / nice-to-haves on the graph implementation after the initial usable version. Some ideas:

  • Diving into layers of the graph (i.e. expanding subworkflow nodes inline)
  • Zooming/panning
  • Hover animations, including highlighting data flow in adjacent nodes
  • Animations on nodes in progress
  • Different rendering for nodes which were not executed

Execution IDs aren't copy-pastable across UI, CLI

The full execution ID has the form ex:project:domain:id.

In the UI we only show the last portion ("id"), while the CLI requires the full "ex:project:domain:id", meaning you can't easily copy-paste between the two.

Request from pricing.

Parallel/Map Node

Allow loose parallelism as a native part of the Flyte spec. In other words, allow a 'parallel node' to take a list of inputs and map the work out to replicas of the same executable: task, workflow, or launch plan.

Hotkeys

There are probably some hotkeys worth implementing. This is a placeholder to determine what those should be.

Render Logs directly in the UI

We have enough information from the activity execution entity to make calls directly to AWS to retrieve log stream events.

Accessing log streams requires specific permissions. These won't exist on the client (nor should they). But the server side could be granted that role and be a proxy for the logs.

So it might look something like this:

  • Client makes a request to UI server side to open logs for a specific execution, passing the execution ID. This opens a long-lived TCP request which will be used to stream the log back to the client
  • Server-side opens a connection to AWS to get the log stream for that execution. These have to be retrieved in chunks. Server-side begins streaming the chunks to the client
  • Server-side listens for (pings? Can AWS do push for these?) additional log stream lines and pushes them to the client as they are discovered.

Questions/Concerns:

  • This could be simpler if there was a way for the UI to retrieve a temporary token to use for AWS access. Can the server generate one of these and return it?
  • How do we know when the log stream has ended and we can close the connection to the client? Can we check for a specific string in it?
  • Each one of these will consume a connection to the server and hold it open for what could be a long time. This could cause resource constraints, but we can always scale the UI servers to accommodate.
  • Should we consider web sockets for this type of thing? We could have a mechanism where, while an active websocket connection is open watching a particular execution, the server-side will continue to poll for the latest logs and deliver them to whatever listeners are active. This has the benefit of only making the requests to AWS once if there are multiple listeners
  • If we do use Websockets, this functionality is almost complicated enough to warrant spinning up a separate service to handle it.
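The polling variant of the design above can be sketched as a server-side generator that relays log chunks to the client connection. `fetch_chunk` is a hypothetical stand-in for the AWS get-log-events call made with the server's credentials:

```python
import time

def stream_logs(fetch_chunk, poll_interval=2.0,
                done=lambda chunk: chunk is None):
    """Server-side proxy loop: repeatedly fetch log chunks and yield
    them to the client until the stream reports it is finished.
    `fetch_chunk` and the end-of-stream check are hypothetical; the
    real implementation would wrap the AWS log-stream API."""
    while True:
        chunk = fetch_chunk()
        if done(chunk):
            return
        yield chunk          # push this chunk down to the client
        time.sleep(poll_interval)  # wait before polling AWS again
```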

Filter/view executions by SHA in Flyte 2.0 UI

Already in the CLI:

flyte-cli -h flyte.lyft.net -p flytekit -d development list-executions -f "eq(workflow.version,gitsha)"

This is to track the potential for this in the UI.

Customer notes:

NOTE

The UI can already filter executions by Version, but we don't show versions in the executions table. The work here is mostly for adding that.

Will require a small amount of UX work to determine how to surface versions in the table rows.

Switch flyteidl output to be commonjs

flyteidl is currently being output as an es6 module, which makes it incompatible with NodeJS unless it is run through webpack first. There's no real reason to do it that way, and protobufjs supports commonjs output, so we should switch to that.

Replace loading indicators

We want to make some updates to the way we load items:

  • Show no loading indicator if the request returns within 1 second
  • After 1 second, show a shimmer/skeleton state

TODO: Document all the places where we currently use loading spinners.

Better document the local testing story

The local testing story is weak... we can do a better job documenting tips for how to improve it.

Our initial idea is that the pyflyte execute command can be run locally, but this has some problems: it uses an auto-deleting temp dir, it might mess up real outputs in S3, etc.

We'll play around with stuff and at least come up with some short term workarounds.

Support specifying notifications when launching workflows via the UI

The Inputs for launching a workflow accept a Notifications field, which can be used to specify notification rules for specific states. It's a little complicated (it can be email, PagerDuty, or Slack, to multiple recipients, for multiple states), so we'll tackle it as a separate task.

Rework dynamic node relationships in data model

Admin currently allows tasks to be parents of other nodes (1->many) and nodes to be parents of other tasks (1-1). This has led to some confusion/assumptions:

  • While tasks do yield nodes, the tasks finish executing well before those nodes start, so it's not entirely accurate to have this task->node parent relationship
  • Due to how they are currently presented in the data model, the nested UX looks confusing with the task row showing success and sub-rows showing running (indicating the yielded nodes are still running).

We have talked separately on different occasions about how this should ideally be represented. This task is to track the concrete steps towards a better model.

Figure out validation / default value implementation for JS

Problem:

The messages coming back from the API are decoded by protobufjs. But since all the fields in a proto message are optional by convention, we don't have any assurance that the records are valid and usable. This has caused errors before on the client side.

Solution options:

  • Manual validation of the records and type-casting (message as X) or type-guarding (: message is X) to the stricter types present on the client side. This has the advantage of being flexible in the UI requirements, and the disadvantage of being difficult to keep in sync with the protobuf source of truth.
  • Automated validation via some type of schema definition stored on the client side (JSON Schema is one such option). This has the advantage of generating consistent code on the client side which is kept up-to-date automatically as the schema is updated, as well as providing a schema document that can be used to validate the JSON output from the API. It has the same disadvantage of being a separate solution which must be updated manually any time the API contract changes.
  • Switch the console to use protoc-generated JS/TS libraries and decorate all protobuf messages with the appropriate validation. This has the advantage of the validation rules being identical on both server and client (and updating automatically) as well as providing a generic solution for validation (call validate() on the message class coming back from the server). It has the disadvantage of requiring a non-trivial amount of work: Switching from protobufjs to protoc, enabling the TS output from protoc, updating console code to work with the new typings and decoding strategy.

Option 3 is ideal, but the amount of work necessary to do so is concerning (especially considering it may not work correctly and we might have to back it out).
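Option 1 amounts to hand-written guards at the decode boundary. A language-agnostic sketch (shown in Python; the real code would be TypeScript type guards over the protobufjs output, and the field names here are illustrative, not the real proto schema):

```python
def is_valid_execution(msg: dict) -> bool:
    """Hand-written guard: every field is optional on the wire, so
    explicitly check the fields the client actually relies on before
    trusting the record. Field names are hypothetical examples."""
    return (
        isinstance(msg.get("id"), str)
        and msg.get("id") != ""
        and isinstance(msg.get("closure"), dict)
    )
```

The stated disadvantage applies: each guard must be kept in sync with the protobuf source of truth by hand.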

Breadcrumbs for the UI

We need to determine what info should be available in the breadcrumbs.

  • Show project, domain, entity type, (sub-entity type), version. In this case, sub-entity is something like an execution or launch plan belonging to a particular workflow.
  • Show a static project/domain combo just to set context, but don't make them links, then show the same as in #1
  • Leave out project/domain entirely

Update visuals used for errors

This is a task to audit our usage of error messages.

  • Ensure that all places where we use error messages are using an appropriately sized component
  • Evaluate messaging used
  • Discover any views/components which currently do not use error messages in their failure states and update them

Implement Launch Plan details

This will probably be similar to Workflow Version details, in that it will show information from the closure. But it may not show the graph, or it may optionally allow a user to show a graph view of the workflow at that version.

TODO: Determine which details of a LP are useful to show.
