uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine for executing asynchronous, long-running business logic in a resilient way.

Home Page: https://cadenceworkflow.io

License: MIT License


cadence's Introduction

Cadence


This repo contains the source code of the Cadence server and other tooling, including the CLI, schema tools, bench, and canary.

You can implement your workflows with one of our client libraries. The Go and Java libraries are officially maintained by the Cadence team, while the Python and Ruby client libraries are developed by the community.

You can also use iWF as a DSL framework on top of Cadence.

See Maxim's talk at the Data@Scale Conference for an architectural overview of Cadence.

Visit cadenceworkflow.io to learn more about Cadence. Join us in the Cadence Documentation project, and feel free to raise an Issue or Pull Request there.

Community

  • Github Discussion
    • Best for Q&A, support/help, general discussion, and announcements
  • StackOverflow
    • Best for Q&A and general discussion
  • Github Issues
    • Best for reporting bugs and feature requests
  • Slack
    • Best for contributing/development discussion

Getting Started

Start the cadence-server

To run Cadence services locally, we highly recommend using the Cadence service Docker setup. You can also follow the instructions to build and run it yourself.

Please visit our documentation site for production/cluster setup.

Run the Samples

Try out the sample recipes for Go or Java to get started.
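
To give a flavor of the Go samples, below is a minimal, hypothetical hello-world workflow written against the Go client (go.uber.org/cadence). The function names, timeouts, and greeting are illustrative only, not code from the samples repo; a worker process would register these functions and poll the task list, as shown in the samples.

package helloworld

import (
    "context"
    "time"

    "go.uber.org/cadence/workflow"
)

// HelloWorkflow schedules a single activity and returns its result.
// Register the workflow and the activity with a worker polling your task list.
func HelloWorkflow(ctx workflow.Context, name string) (string, error) {
    ao := workflow.ActivityOptions{
        ScheduleToStartTimeout: time.Minute,
        StartToCloseTimeout:    time.Minute,
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    var greeting string
    err := workflow.ExecuteActivity(ctx, HelloActivity, name).Get(ctx, &greeting)
    return greeting, err
}

// HelloActivity is a plain Go function; Cadence records its result in the
// workflow history so it is not re-executed on replay.
func HelloActivity(ctx context.Context, name string) (string, error) {
    return "Hello, " + name + "!", nil
}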

Use Cadence CLI

The Cadence CLI can be used to operate workflows, tasklists, domains, and even clusters.

You can install the Cadence CLI in any of the following ways:

  • Use brew to install the CLI: brew install cadence-workflow
    • Follow the instructions if you need to install older versions of the CLI via Homebrew. Usually this is only needed when you are running a much older version of the server.
  • Use the Docker image for the CLI: docker run --rm ubercadence/cli:<releaseVersion> or docker run --rm ubercadence/cli:master. Be sure to update your image when you want to try new features: docker pull ubercadence/cli:master
  • Build the CLI binary yourself: check out the repo and run make cadence to build all tools. See CONTRIBUTING for the prerequisites of the make command.
  • Build the CLI image yourself; see the instructions.

The Cadence CLI is a powerful tool. The commands are organized into tabs, e.g. workflow->batch->start or admin->workflow->describe.

Please read the documentation and always try --help on any tab to learn and explore.

Use Cadence Web

Try out Cadence Web UI to view your workflows on Cadence. (This is already available at localhost:8088 if you run Cadence with docker compose)

Contributing

We'd love your help in making Cadence great. Please review our contribution guide.

If you'd like to propose a new feature, first join the Slack channel to start a discussion and check if there are existing design discussions. Also peruse our design docs in case a feature has been designed but not yet implemented. Once you're sure the proposal is not covered elsewhere, please follow our proposal instructions.

Other binaries in this repo

Bench/stress test workflow tools

See bench documentation.

Periodic feature health check workflow tools (aka Canary)

See canary documentation.

Schema tools for SQL and Cassandra

These tools are for manually setting up or upgrading the database schema.

The easiest way to get the schema tool is via homebrew.

brew install cadence-workflow also includes cadence-sql-tool and cadence-cassandra-tool.

  • The schema files are located at /usr/local/etc/cadence/schema/.
  • To upgrade, make sure you remove the old ElasticSearch schema first: mv /usr/local/etc/cadence/schema/elasticsearch /usr/local/etc/cadence/schema/elasticsearch.old && brew upgrade cadence-workflow. Otherwise, the ElasticSearch schemas may not be updated.
  • Follow the instructions if you need to install older versions of the schema tools via Homebrew. However, the easier way is to use new versions of the schema tools with old versions of the schemas: just check out the older version of the schemas from this repo, e.g. run git checkout v0.21.3 to get the v0.21.3 schemas in the schema folder.


License

MIT License, please see LICENSE for details.

cadence's People

Contributors

3vilhamster, abhishekj720, agautam478, andrewjdawson2016, anish531213, bowenxia, davidporter-id-au, demirkayaender, groxx, jakobht, ketsiambaku, longquanzheng, mantas-sidlauskas, meiliang86, mkolodezny, neil-xie, samarabbas, sankari165, shaddoll, shreyassrivatsan, sivakku, taylanisikdemir, timl3136, vancexu, venkat1109, vytautas-karpavicius, wxing1292, yiminc, yux0, yycptt


cadence's Issues

History cache invalidation on Cassandra timeouts

We have an issue where, if we get a timeout error while updating the workflow mutable state, we cannot guarantee that we read the correct, latest state on reload. This is because the write could still be applied after the read executes.
This could lead to corrupting the Events table if we tried to use a stale next_event_id value for subsequent writes.

DeleteWorkflowExecution is not transactional with workflow completion update

When the decider responds with a complete-workflow decision, we first update the execution with the new events and then delete the workflow execution as a separate transaction. This causes issues when the update call times out but the update is actually applied: we may never delete that workflow execution.
We need to make sure that the execution is updated and deleted in the same transaction.

Matching Engine can lose decision tasks

By design, the matching engine can lose tasks even before recording in the execution history that they started. This is OK for activity tasks, since there are always timeouts for them.
On the other hand, there is no ScheduledToStart timeout for decision tasks (to avoid unnecessary timeouts in case the decider is down or not polling for tasks). If a decision task is lost, the workflow execution will get stuck forever.

Matching Service should not serve requests before it's ready

The matching service registers its thrift handler and starts the TChannel RPC server before it is fully initialized. We have seen issues where requests that reach the matching engine before it is properly initialized cause panics.
We need to handle this the same way we handle the History Service: block incoming requests on a wait group until service initialization is complete.
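
For illustration only (not the actual service code), the pattern described here can be sketched in Go as follows: every RPC handler blocks on a startup wait group that is released once initialization finishes. All names below are hypothetical.

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// handler sketches the "don't serve before ready" pattern: requests wait on
// startWG, which is only released after initialization completes.
type handler struct {
    startWG sync.WaitGroup
}

func newHandler() *handler {
    h := &handler{}
    h.startWG.Add(1) // RPC methods block until Start is called
    return h
}

// Start finishes initialization (engines, caches, membership, ...) and then
// unblocks any requests that arrived early.
func (h *handler) Start() {
    // ... initialize the engine here ...
    h.startWG.Done()
}

// AddDecisionTask stands in for any RPC handler method.
func (h *handler) AddDecisionTask() error {
    h.startWG.Wait() // never touch the engine before it exists
    return errors.New("not implemented in this sketch")
}

func main() {
    h := newHandler()
    go func() {
        time.Sleep(100 * time.Millisecond) // simulate slow startup
        h.Start()
    }()
    fmt.Println(h.AddDecisionTask()) // blocks briefly, then proceeds safely
}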

Design task to expose mutable state to client-side

Certain workflows are easier to write if mutable state, rather than history, is exposed directly to the client for making decisions. Workflows like cron prefer this model, and it is much better optimized for such scenarios. Also, using mutable state for things like activity retries is much preferable to having the client implement the retry logic.

History Service: Fix timer task creation on activity heartbeat

The history service seems to be creating a timeout task on each heartbeat. Instead, we should record the last heartbeat time in mutable state and only create a new timeout task when the current one expires, based on the last recorded heartbeat value.

Matching Engine: Rate limit creation of new tasks for any TaskList

Every TaskList is mapped to a single Cassandra partition, so if all shards are writing to a single TaskList, it becomes a scalability bottleneck for the system. If sync matching is not happening, we end up writing all the tasks to Cassandra, and under heavy load Cassandra transactions start timing out. This behavior ends up generating a very large number of duplicate tasks.
I think we need to put a rate limiter on each TaskList to prevent this situation. We should just return a throttle error back to the client and have the client back off and retry failures. This should cause the system to degrade gracefully under extreme load.
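
A hedged sketch of the proposed behavior, using golang.org/x/time/rate as a stand-in limiter: task writes over the per-TaskList budget are rejected with a throttle error so the client backs off and retries. The type and error names are made up for illustration.

package main

import (
    "errors"
    "fmt"

    "golang.org/x/time/rate"
)

// errTaskListThrottled stands in for the throttle/service-busy error that
// would be returned to the client.
var errTaskListThrottled = errors.New("task list throttled, back off and retry")

// taskListWriter guards task writes for a single TaskList partition with a
// token-bucket rate limiter.
type taskListWriter struct {
    limiter *rate.Limiter
}

func newTaskListWriter(tasksPerSecond float64, burst int) *taskListWriter {
    return &taskListWriter{limiter: rate.NewLimiter(rate.Limit(tasksPerSecond), burst)}
}

// addTask persists a task only if it fits in the rate budget; otherwise it
// returns a throttle error instead of piling more writes onto the partition.
func (w *taskListWriter) addTask(task string) error {
    if !w.limiter.Allow() {
        return errTaskListThrottled
    }
    // ... write the task to the TaskList's Cassandra partition here ...
    _ = task
    return nil
}

func main() {
    w := newTaskListWriter(2, 2)
    for i := 0; i < 5; i++ {
        fmt.Println(w.addTask(fmt.Sprintf("task-%d", i)))
    }
}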

History Engine: Timer optimizations

Currently, timers are created for each activity and decision task. We need to implement logic that creates a single timer per workflow execution and sets the next earliest timer when that one fires.

History Service: Mutable state API cleanup

Currently, mutable state is only used for a small part of the API. This work item tracks making it used by all API calls on the History service and keeping it updated with all relevant information, such as the following (a rough sketch follows the list):

  1. ActivityInfos
  2. TimerInfos
  3. OutstandingDecision
  4. NextEventID
  5. ChildWorkflows
  6. Potentially any Signal ID if it makes sense for any API
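
As a rough illustration only (these are not the server's actual types), a consolidated mutable state record could look something along these lines:

package sketch

import "time"

// mutableState is a hypothetical, simplified view of the information the
// History service would track per execution; field names mirror the list above.
type mutableState struct {
    ActivityInfos       map[int64]activityInfo      // schedule ID -> in-flight activity
    TimerInfos          map[string]timerInfo        // user timer ID -> timer
    OutstandingDecision *decisionInfo               // at most one in-flight decision task
    NextEventID         int64                       // next event ID to append to history
    ChildWorkflows      map[int64]childWorkflowInfo // initiated ID -> child workflow
    SignalRequestedIDs  map[string]struct{}         // de-dupe IDs for outstanding signals
}

type activityInfo struct {
    ScheduleID    int64
    StartedID     int64
    LastHeartbeat time.Time
}

type timerInfo struct {
    TimerID    string
    ExpiryTime time.Time
}

type decisionInfo struct {
    ScheduleID int64
    StartedID  int64
}

type childWorkflowInfo struct {
    InitiatedID int64
    WorkflowID  string
    RunID       string
}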

Create ActivityTaskScheduleFailed event in history on bad decisions

If RespondDecisionTaskCompleted sends in a bad request or corrupted data, then we just silently ignore the activity-schedule decisions. Instead, we need to add a relevant failure event, such as ActivityTaskScheduleFailed, and then also create a new DecisionTask for the decider. Here is an instance of the failure:
{"RunID":"c09c5b10-d240-4f8b-bc4c-5735c0bb3805","ScheduleID":212,"Service":"cadence-frontend","WorkflowID":"48018f57-0c39-4d4e-b055-e3df3fff7464","level":"error","msg":"RespondDecisionTaskCompleted. Error: BadRequestError({Message:Missing StartToCloseTimeoutSeconds in the activity scheduling parameters.})","time":"2017-03-07T13:56:56-08:00"}

Deletion of history events on workflow completion

We mark the workflow execution row with a TTL in the executions table on completion. This takes care of the workflow execution entry in the executions table, but we still leak space in the events table because we don't clean up the history associated with that execution.
We could use the timer queue processor for this purpose and queue up a timer task to delete the execution history after the retention period.

Basic Server Side Throttling

Cadence is a multi-tenant service, and we need to protect against a single bad user bringing the entire system down. This task is to implement basic throttling and quotas for each client.

Handling of history corruption

Hopefully, execution history should never get corrupted. If, for any reason (bugs?), we get into a state where this happens, we should not just return a retriable error to the callers.

Frontend should not serve requests before it's ready

The frontend service registers its thrift handler and starts the TChannel RPC server before it is fully initialized. We have seen issues where requests that reach the frontend before it is properly initialized cause panics.
We need to handle this the same way we handle the history and matching services: block incoming requests on a wait group until service initialization is complete.

Create Lock Manager to serialize access to executions

Right now, every request gets a WorkflowExecutionContext from the cache and then acquires a lock on that object. It is possible in edge conditions that two requests end up with two different context objects (request 1 gets the context, the context gets evicted from the cache, then request 2 creates a new object). This will break the guarantee that only one write per execution originates from the history engine at a time.
We can fix this by having a central lock manager that grants locks on executions instead of locking the context object itself.
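
For illustration, a reference-counted per-execution lock manager could look roughly like the sketch below (the names are hypothetical, not the history engine's actual types). The key point is that requests lock on an execution key rather than on whichever context object the cache happened to hand out.

package main

import (
    "fmt"
    "sync"
)

// lockManager grants a lock per execution key, independent of any cached
// context object, so two requests for the same execution always contend on
// the same mutex even if the cache entry was evicted in between.
type lockManager struct {
    mu    sync.Mutex
    locks map[string]*executionLock
}

type executionLock struct {
    sync.Mutex
    refCount int
}

func newLockManager() *lockManager {
    return &lockManager{locks: map[string]*executionLock{}}
}

// Lock acquires the lock for the given execution key, creating it on demand.
func (m *lockManager) Lock(key string) {
    m.mu.Lock()
    l, ok := m.locks[key]
    if !ok {
        l = &executionLock{}
        m.locks[key] = l
    }
    l.refCount++
    m.mu.Unlock()
    l.Lock()
}

// Unlock releases the lock and drops it from the map once no one is waiting.
func (m *lockManager) Unlock(key string) {
    m.mu.Lock()
    l := m.locks[key]
    l.refCount--
    if l.refCount == 0 {
        delete(m.locks, key)
    }
    m.mu.Unlock()
    l.Unlock()
}

func main() {
    m := newLockManager()
    m.Lock("domain/workflow/run")
    fmt.Println("holding per-execution lock")
    m.Unlock("domain/workflow/run")
}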

History client support for host redirect

We now have support for returning the correct host information when an API call to the history service fails with ShardOwnershipLostError.
The history client needs to look at the ShardOwnershipLostError and retry the request against the host information carried in the error.
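
A hedged sketch of the retry behavior the history client needs; the error type below is a simplified stand-in for the generated Thrift ShardOwnershipLostError, and the helper names are made up for illustration.

package main

import (
    "errors"
    "fmt"
)

// shardOwnershipLostError is a simplified stand-in for the Thrift error the
// history service returns, carrying the address of the new shard owner.
type shardOwnershipLostError struct {
    Owner string // host:port of the instance that now owns the shard
}

func (e *shardOwnershipLostError) Error() string {
    return "shard ownership lost, new owner: " + e.Owner
}

// callWithRedirect retries the request against the owner reported in the
// error instead of failing the call outright.
func callWithRedirect(call func(host string) error, initialHost string, maxRedirects int) error {
    host := initialHost
    for i := 0; i <= maxRedirects; i++ {
        err := call(host)
        var solErr *shardOwnershipLostError
        if errors.As(err, &solErr) {
            host = solErr.Owner // follow the redirect and retry
            continue
        }
        return err
    }
    return fmt.Errorf("gave up after %d redirects", maxRedirects)
}

func main() {
    attempts := 0
    err := callWithRedirect(func(host string) error {
        attempts++
        if host == "10.0.0.1:7934" {
            return nil // the new owner accepts the request
        }
        return &shardOwnershipLostError{Owner: "10.0.0.1:7934"}
    }, "10.0.0.2:7934", 3)
    fmt.Println(err, "attempts:", attempts)
}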

Cadence Feature: Restart failed workflows

This feature is to support restarting workflows from a given point in the workflow execution history. Basically, you want to preserve the history of an execution up to a point and restart from that location. This is very useful when a workflow fails due to a bug at a certain point and you want to restart it after fixing the bug.
