Comments (17)
After further discussion with @m4dcoder, I'm proposing we move the retry out of the TaskTransitionSpec and into the TaskSpec. When I think of a retry, it is an attribute of the task itself. After the retries are exhausted, we then concern ourselves with which task we want to transition to. We can do this without losing the conditional functionality that was originally proposed. Here's an example task written in this style.
my_task:
  action: core.foo
  retry:
    count: 3
  next:
    - when: <% failed() %>
      do: fail
The retry spec can support something like:
retry:
  when: condition for retry
  delay: number of seconds before retry
  count: number of times to retry
This means we can say something along the lines of
my_task:
  action: core.foo
  retry:
    when: <% failed() and result().status_code = 401 %>
    count: 3
  next:
    - when: <% failed() %>
      do: fail
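To make these semantics concrete, here is a minimal Python sketch of that contract: re-run the action while the `when` condition holds, at most `count` extra times, waiting `delay` seconds between attempts. The helper `run_with_retry` and its signature are invented for illustration and are not part of orquesta.

```python
import time

def run_with_retry(action, should_retry, count, delay=0):
    """Run `action` once, then retry while `should_retry(result)` is true,
    up to `count` additional attempts, sleeping `delay` seconds between tries."""
    result = action()
    attempts = 0
    while attempts < count and should_retry(result):
        time.sleep(delay)
        result = action()
        attempts += 1
    return result, attempts
```

Read this way, count: 3 means up to three retries after the initial attempt, and the when expression plays the role of should_retry.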
@punkrokk I like the backoff idea, but I'm probably going to start with a smaller-scoped implementation. After I finish that, maybe we can revisit your idea.
from orquesta.
From @m4dcoder
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
I would suggest the retry function also allows for some sort of backoff or back pressure. Maybe this overlaps with policies.
Consider the following use case:
I have a list of computers to patch. I then fire off, say, a Tanium action to get all my computers to a baseline. Trouble is, there are 750 computers. While I can fire off the action to patch 750 computers in one action run, I need to keep checking the status of the patch process over time. The following things make the retry slightly more complex:
- Some computers may be on vacation, in China, or off for a while. That time period could be days, weeks, etc.
- We don't want to hit the API every 3 sec/min or even every 3 hours as we get farther away from the initial launch.
So the following might be useful:
retry:
  delay: number of seconds before retry
  count: number of times to retry
  max_time_retry: number of seconds or minutes to stop retrying if we are still failing (e.g. two weeks)
  wait_until_exponential: how many 'normal' retries before we go exponential
  wait_exponential_multiplier: number of ms to increase wait time exponentially
  wait_exponential_max: 10000
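As a way to pin down those fields, here is a rough Python sketch of the delay calculation; the function name and the exact capping behavior are my assumptions, not an agreed design.

```python
def retry_delay_ms(attempt, base_delay_ms, wait_until_exponential,
                   wait_exponential_multiplier, wait_exponential_max):
    """Delay (in ms) before retry number `attempt` (1-based): a flat
    `base_delay_ms` for the first `wait_until_exponential` retries,
    then multiplier * 2**n, capped at `wait_exponential_max`."""
    if attempt <= wait_until_exponential:
        return base_delay_ms
    n = attempt - wait_until_exponential
    return min(wait_exponential_multiplier * (2 ** n), wait_exponential_max)
```

With, for example, three "normal" 1-second retries, a 500 ms multiplier, and a 10-second cap, the delays stay flat for a while and then double on each retry until the cap is hit, which matches the slow-moving patch-status use case above.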
I've used this library in the past for other shorter lived actions with pxGrid. There are some other interesting options, like random retry delays.
I really like the idea of being able to detect that we got an HTTP 401 when we were expecting an HTTP 200, and retry; or other functional variations.
Our current thinking is to put the retry in the task transition model. For example, when a task fails (or on other conditions), retry the task execution.
next:
  - when: <% failed() %>
    retry: ...
In orquesta, the condition for task transition allows for flexibility (e.g. core.http succeeded but returned a status code of 401). So if we put the retry in the task model, the orquesta conductor will not know under which condition or state it should honor the retries.
next:
  - when: <% succeeded() and result().status_code = 401 %>
    retry: ...
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
On implementation (having thought this through only briefly), the retry should be evaluated by the conductor in the update_task_flow method at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L562. If a retry is decided, the task should be returned by get_next_tasks at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L479. In the st2workflowengine, we want this to result in a new request (and thus a new DB record) for the task execution.
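As a rough illustration of that idea, the sketch below (hypothetical; these are not orquesta's actual classes or methods) keeps a per-task retry counter in runtime state, roughly the way update_task_flow might, and answers whether get_next_tasks should return the task again.

```python
class RetryTracker:
    """Hypothetical per-task retry bookkeeping for a workflow conductor.
    A real conductor would persist this alongside its task flow state."""

    def __init__(self):
        self.retries = {}  # task_id -> retries consumed so far

    def should_retry(self, task_id, task_failed, max_count):
        """Decide whether a just-completed task should be rescheduled."""
        if not task_failed:
            self.retries.pop(task_id, None)  # success resets the counter
            return False
        used = self.retries.get(task_id, 0)
        if used < max_count:
            self.retries[task_id] = used + 1
            return True
        return False  # retries exhausted; let the normal transitions fire
```

When should_retry returns True, the conductor would hand the task back out of get_next_tasks as a fresh request (and thus a new DB record in st2), rather than reusing the failed execution.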
@trstruth We are near the deadline for st2 v3.0, so this feature would be a good target for the next release (3.1) or the one after (a 3-month window). If you'd like to contribute, we want to help you be successful, so let us know how we can help you.
Working on adding the retry syntax to the TaskSpec.
I've taken what @trstruth and @m4dcoder described in their last messages and think it's a great start.
My only question is: should the count property have a special value of 0 or -1 that means "retry forever"?
In order to jump-start some discussion:
I've spent a couple of hours this morning reading through the orquesta source and inspecting the modules in a REPL. It seems like the first step would be to modify the TaskSpec to take a retry key, then modify the composer to either generate an "unrolled" series of nodes representing potential retries, or make the conductor aware of how many retries have been executed so far. Do these approaches sound reasonable?
In reference to the README:
"The criteria for task transition is an attribute of the edge."
Should retry be a task transition to itself?
"The workflow conductor traverses the graph, directs the flow of the workflow execution, and tracks runtime state of the execution."
Should a "retry counter" be an attribute of the runtime state?
+1
@trstruth How are you doing with this? Are you still planning to contribute this feature to orquesta?
@m4dcoder I have not gotten any time to write code for this so far due to work, but I'm still interested in contributing it when I do. When are you planning the release for it?
@trstruth Thank you for identifying the use cases, issues with the original proposal, and proposing the alternative solution. I'm looking forward to the PR.
For background information, in our discussion, @trstruth highlighted issues about multiple retries defined in more than one task transition, how to handle them in the conductor, and how to fail the workflow (or do other logging/cleanup/remediation steps) after retries are exhausted.
We also look forward to this feature becoming available.
@trstruth Is this under development?
I'm OK with having -1 mean retry forever, and 0 mean no retry, in case someone wants to pass a Jinja/YAQL expression that decides not to retry. When we integrate this back into st2, we need to make sure there's a test case for cancelling a task that is retrying forever.
Just putting this up for consideration - I think retry forever is a bad pattern. The concept of "infinite retries" probably shouldn't actually be captured by a retry, but rather baked into the workflow logic itself. Couldn't one simply create a task transition that recursively references the same task to create the equivalent of an infinite retry?
I'm not a fan of using count: 0 or count: -1 to indicate "no retry" or "retry forever", because that forces a user to go look up documentation on what those special values mean. This makes workflow definitions more opaque to users as they read through them.
Instead of using special values, I would simply use the string none or a null value to indicate disabling retry, and the string forever or infinity to indicate infinite retry.
And while I would tend to agree with @trstruth that retry forever is a bad workflow pattern, I would expect some of our users will require that.
And retries can (and for some of our internal workflows, already are) be implemented in the workflow logic itself:
vars:
  - do_thing_retry_delay: 10 # seconds
  - do_thing_count: 5
  - do_thing_iteration: 0

tasks:
  do_thing:
    action: ...
    next:
      # try again
      - when: <% failed() and ctx().do_thing_iteration < ctx().do_thing_count %>
        publish:
          - do_thing_iteration: <% ctx().do_thing_iteration + 1 %>
        do: sleep_and_try_again
      - when: <% failed() and ctx().do_thing_iteration >= ctx().do_thing_count %>
        do: fail_entire_workflow
      - when: <% succeeded() %>
        do: next_thing

  sleep_and_try_again:
    action: core.local
    input:
      command: sleep <% ctx().do_thing_retry_delay %>
    next:
      - do: do_thing
That retry logic is difficult to parse and not straightforward.
Another problem is that the execution history is expanded every time the retry task is executed:
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # succeeded
* next_thing
At least in Mistral*, when retry is used explicitly, the failed tasks don't appear in the execution history:
* do_thing # failed, failed, succeeded
* next_thing
* I'm not saying we have to implement things exactly like Mistral.