Comments (17)
After further discussion with @m4dcoder, I'm proposing we move the retry out of the TaskTransitionSpec and into the TaskSpec. When I think of a retry, it is an attribute of the task itself. After the retries are exhausted, we then concern ourselves with which task we want to transition to. We can do this without losing the conditional functionality that was originally proposed. Here's an example task written in this style.
my_task:
  action: core.foo
  retry:
    count: 3
  next:
    - when: <% failed() %>
      do: fail
The retry spec can support something like:
retry:
  when: condition for retry
  delay: number of seconds before retry
  count: number of times to retry
This means we can say something along the lines of
my_task:
  action: core.foo
  retry:
    when: <% failed() and result().status_code = 401 %>
    count: 3
  next:
    - when: <% failed() %>
      do: fail
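To make these semantics concrete, here is a minimal Python sketch of that contract: re-run the action while the `when` condition holds, at most `count` extra times, waiting `delay` seconds between attempts. The helper `run_with_retry` and its signature are invented for illustration and are not part of orquesta.

```python
import time

def run_with_retry(action, should_retry, count, delay=0):
    """Run `action` once, then retry while `should_retry(result)` is true,
    up to `count` additional attempts, sleeping `delay` seconds between tries."""
    result = action()
    attempts = 0
    while attempts < count and should_retry(result):
        time.sleep(delay)
        result = action()
        attempts += 1
    return result, attempts
```

Read this way, count: 3 means up to three retries after the initial attempt, and the when expression plays the role of should_retry.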
@punkrokk I like the backoff idea, but I'm probably going to start with a smaller-scoped implementation. After I finish that, maybe we can revisit your idea.
from orquesta.
From @m4dcoder
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
I would suggest the retry function also allows for some sort of backoff or back pressure. Maybe this overlaps with policies.
Consider the following use case:
I have a list of computers to patch. I then fire off, say, a Tanium action to get all my computers to a baseline. Trouble is, there are 750 computers. While I can fire off the action to patch 750 computers in one action run, I need to keep checking the status of the patch process over time. The following things make the retry slightly more complex:
- Some computers may be on vacation, in China, or off for a while. That time period could be days, weeks, etc.
- We don't want to hit the API every 3 sec/min or even every 3 hours as we get farther away from the initial launch.
So the following might be useful:
retry:
  delay: number of seconds before retry
  count: number of times to retry
  max_time_retry: number of seconds or minutes to stop retrying if we are still failing (e.g. two weeks)
  wait_until_exponential: how many 'normal' retries before we go exponential
  wait_exponential_multiplier: number of ms to increase wait time exponentially
  wait_exponential_max: 10000
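As a way to pin down those fields, here is a rough Python sketch of the delay calculation; the function name and the exact capping behavior are my assumptions, not an agreed design.

```python
def retry_delay_ms(attempt, base_delay_ms, wait_until_exponential,
                   wait_exponential_multiplier, wait_exponential_max):
    """Delay (in ms) before retry number `attempt` (1-based): a flat
    `base_delay_ms` for the first `wait_until_exponential` retries,
    then multiplier * 2**n, capped at `wait_exponential_max`."""
    if attempt <= wait_until_exponential:
        return base_delay_ms
    n = attempt - wait_until_exponential
    return min(wait_exponential_multiplier * (2 ** n), wait_exponential_max)
```

With, for example, three "normal" 1-second retries, a 500 ms multiplier, and a 10-second cap, the delays stay flat for a while and then double on each retry until the cap is hit, which matches the slow-moving patch-status use case above.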
I've used this library in the past for other shorter lived actions with pxGrid. There are some other interesting options, like random retry delays.
I really like the idea of being able to detect that we got an HTTP 401 when we were expecting an HTTP 200, and retry; or other functional variations.
Our current thinking is to put the retry in the task transition model. For example, when a task fails (or on other conditions), retry the task execution.
next:
  - when: <% failed() %>
    retry: ...
In orquesta, the condition for task transition allows for flexibility (e.g. core.http succeeded but returned a status code of 401). So if we put the retry in the task model, the orquesta conductor will not know under which condition or state it should honor the retries.
next:
  - when: <% succeeded() and result().status_code = 401 %>
    retry: ...
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
On implementation (having thought this through only briefly), the retry should be evaluated by the conductor in the update_task_flow method at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L562. If a retry is decided, the task should be returned by get_next_tasks at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L479. In the st2workflowengine, we want this to result in a new request (and thus a new DB record) for the task execution.
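As a rough illustration of that idea, the sketch below (hypothetical; these are not orquesta's actual classes or methods) keeps a per-task retry counter in runtime state, roughly the way update_task_flow might, and answers whether get_next_tasks should return the task again.

```python
class RetryTracker:
    """Hypothetical per-task retry bookkeeping for a workflow conductor.
    A real conductor would persist this alongside its task flow state."""

    def __init__(self):
        self.retries = {}  # task_id -> retries consumed so far

    def should_retry(self, task_id, task_failed, max_count):
        """Decide whether a just-completed task should be rescheduled."""
        if not task_failed:
            self.retries.pop(task_id, None)  # success resets the counter
            return False
        used = self.retries.get(task_id, 0)
        if used < max_count:
            self.retries[task_id] = used + 1
            return True
        return False  # retries exhausted; let the normal transitions fire
```

When should_retry returns True, the conductor would hand the task back out of get_next_tasks as a fresh request (and thus a new DB record in st2), rather than reusing the failed execution.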
@trstruth We are near the deadline for st2 v3.0, so this feature would be a good target for the next release (3.1) or the one after (a 3-month window). If you'd like to contribute, we want to help you be successful, so let us know how we can help you.
Working on adding the retry syntax to the TaskSpec.
I've taken what @trstruth and @m4dcoder described in their last messages and think it's a great start.
My only question is: should the count property have a special value of 0 or -1 that means "retry forever"?
In order to jump-start some discussion:
I've spent a couple of hours this morning reading through the orquesta source and inspecting the modules in a REPL. It seems like the first step would be to modify the TaskSpec to take a retry key, then modify the composer to either generate an "unrolled" series of nodes representing potential retries, or make the conductor aware of how many retries have been executed so far. Do these approaches sound reasonable?
In reference to the README:
"The criteria for task transition is an attribute of the edge."
Should retry be a task transition to itself?
"The workflow conductor traverses the graph, directs the flow of the workflow execution, and tracks runtime state of the execution."
Should a "retry counter" be an attribute of the runtime state?
+1
@trstruth How are you doing with this? Are you still planning to contribute this feature to orquesta?
@m4dcoder I have not gotten any time to write code for this so far due to work, but I'm still interested in contributing it when I do. When are you planning the release for it?
@trstruth Thank you for identifying the use cases, issues with the original proposal, and proposing the alternative solution. I'm looking forward to the PR.
For background information, in our discussion, @trstruth highlighted issues about multiple retries defined in more than one task transition, how to handle them in the conductor, and how to fail the workflow (or do other logging/cleanup/remediation steps) after retries are exhausted.
We also look forward to this feature becoming available.
@trstruth Is this under development?
I'm OK with having -1 mean retry forever, and 0 mean no retry, in case someone wants to pass a Jinja/YAQL expression that decides not to retry. When we integrate this back into st2, we need to make sure there's a test case for cancelling a task that is retrying forever.
Just putting this up for consideration - I think retry forever is a bad pattern. The concept of "infinite retries" probably shouldn't actually be captured by a retry, but rather baked into the workflow logic itself. Couldn't one simply create a task transition that recursively references the same task to create the equivalent of an infinite retry?
I'm not a fan of using count: 0 or count: -1 to indicate "no retry" or "retry forever", because that forces a user to go look up documentation on what those special values mean. This makes workflow definitions more opaque to users as they read through them.
Instead of using special values, I would simply use the string none or a null value to indicate disabling retry, and the string forever or infinity to indicate infinite retry.
And while I would tend to agree with @trstruth that retry forever is a bad workflow pattern, I would expect some of our users will require that.
And retries can (and for some of our internal workflows, already are) be implemented in the workflow logic itself:
vars:
  - do_thing_retry_delay: 10 # seconds
  - do_thing_count: 5
  - do_thing_iteration: 0

tasks:
  do_thing:
    action: ...
    next:
      # try again
      - when: <% failed() and ctx().do_thing_iteration < ctx().do_thing_count %>
        publish:
          - do_thing_iteration: <% ctx().do_thing_iteration + 1 %>
        do: sleep_and_try_again
      - when: <% failed() and ctx().do_thing_iteration >= ctx().do_thing_count %>
        do: fail_entire_workflow
      - when: <% succeeded() %>
        do: next_thing

  sleep_and_try_again:
    action: core.local
    input:
      command: sleep <% ctx().do_thing_retry_delay %>
    next:
      - do: do_thing
That retry logic is difficult to parse and not straightforward.
Another problem is that the execution history is expanded every time the retry task is executed:
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # succeeded
* next_thing
At least in Mistral*, when retry is used explicitly, the failed tasks don't appear in the execution history:
* do_thing # failed, failed, succeeded
* next_thing
* I'm not saying we have to implement things exactly like Mistral.