
Comments (5)

bachsh commented on May 12, 2024

An alternative implementation suggestion.

Accept an Iterable[timedelta] for retry_delay whose iterator yields the next retry delay each time one is needed. If that isn't possible because the context doesn't allow it, it could instead be a Callable that receives the retry count and returns a retry delay.
Then, a set of retry strategies can be built on top of it, as sketched below.
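To make the idea concrete, here is a rough sketch of what the Callable form might look like. The exponential_backoff helper and the idea of retry_delay accepting a callable are illustrative assumptions, not existing prefect APIs:

import random
from datetime import timedelta
from typing import Callable

def exponential_backoff(base: float = 2, cap: float = 600) -> Callable[[int], timedelta]:
    # return a function mapping the retry count to a delay, with full jitter
    def delay(retry_count: int) -> timedelta:
        return timedelta(seconds=random.uniform(0, min(cap, base ** retry_count)))
    return delay

# hypothetical usage -- retry_delay accepting a callable is the proposal, not current behavior:
# @task(max_retries=5, retry_delay=exponential_backoff(base=2, cap=60))
# def pull_data():
#     ...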


jameslamb commented on May 12, 2024

I saw this has a "needs description" label. I just came here because my first thought while reading https://docs.prefect.io/core/tutorial/04-handling-failure.html#if-at-first-you-don-t-succeed was "I wonder if prefect also supports exponential backoff". I'd like to propose a description; hope it helps:


Description of the problem

It's common for tasks that pull data from external systems to fail because of temporary unavailability of those systems. Those systems may, for example, include REST APIs, databases, file systems, or message brokers.

Some of the root causes for that unavailability may be unaffected by consumers putting load on the system, including:

  • temporary downtime due to maintenance like OS upgrades
  • temporary downtime due to data migrations
  • unreliable connection due to poor internet connectivity

In those cases, it makes sense to wait a bit and retry the tasks. prefect offers an easy way to do this with arguments max_retries and retry_delay to @task(). These arguments allow you to say "if this task fails, try it n more times and wait s seconds between each retry".
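For reference, with the current API that looks roughly like this (a minimal sketch based on the linked tutorial; pull_data is a placeholder task):

import datetime
from prefect import task

# retry up to 3 times, waiting a fixed 10 seconds between attempts
@task(max_retries=3, retry_delay=datetime.timedelta(seconds=10))
def pull_data():
    ...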

However, this type of logic could defeat its own purpose if the unavailability is because of a problem which is made worse by new load from consumers, including:

  • resource exhaustion
    • example: 100% CPU utilization caused by a large number of requests or a few very-expensive requests (e.g. database queries that result in a full table scan)
  • quota limits reached
    • example: many relational databases cap the number of connections which can be open simultaneously
  • routing bottleneck
    • example: service sits behind a load balancer and the load balancer is receiving new requests faster than it can route them

In these situations, it's advisable to use a retry strategy which waits longer and longer after each failure. If all consumers do this, it will give the external system a better chance to recover. For one example reference (there are many), see this AWS blog post.

What it might look like to add this to prefect

Implementing exponential backoff means that the amount of time to wait between retries is a function of the number of retries so far.

It might look like this pseudocode:

import random
import time

# what is the shortest time you're willing to wait after a failed attempt?
wait_min = 1

# what is the longest you're willing to wait between retries?
wait_max = 10

# how fast do you want the waiting time to scale relative to the number of attempts?
wait_base = 2

# how many attempts are you willing to make before declaring the task failed?
max_attempts = 5

keep_retrying = True
num_attempts_so_far = 0
while keep_retrying:
    result = task.run()
    num_attempts_so_far += 1
    if result == SUCCESS:
        keep_retrying = False
    elif num_attempts_so_far == max_attempts:
        keep_retrying = False
    else:
        # exponential backoff with "full jitter": random wait, capped at wait_max
        time_to_wait = max(
            wait_min,
            min(
                wait_max,
                random.uniform(0, wait_base * 2 ** num_attempts_so_far)
            )
        )
        time.sleep(time_to_wait)

To support this for prefect tasks, max_retries and retry_delay from the existing task() API could probably be reused. It might make sense to map retry_delay to wait_min. Then you'd have to give people the ability to add wait_base and wait_max, and probably one more keyword argument like retry_strategy="exponential_backoff".
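To make that concrete, the decorator call might end up looking something like the sketch below; retry_strategy, wait_base, and wait_max are hypothetical names, not existing prefect parameters:

import datetime
from prefect import task

# hypothetical keyword arguments -- retry_strategy, wait_base, and wait_max do not exist today
@task(
    max_retries=5,
    retry_delay=datetime.timedelta(seconds=1),  # reused as wait_min
    retry_strategy="exponential_backoff",       # hypothetical
    wait_base=2,                                # hypothetical
    wait_max=datetime.timedelta(seconds=60),    # hypothetical
)
def pull_data():
    ...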

I don't know enough about the prefect API to suggest other implementations but hopefully this description at least makes the problem and the value of solving it concrete.

Thanks for your time and consideration!


evan-burke commented on May 12, 2024

+1, I use this a lot via https://github.com/litl/backoff.
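For anyone unfamiliar with that library, usage looks roughly like this (a small sketch; fetch_data and the exception type are placeholders):

import backoff
import requests

# retry on request failures with exponential backoff (jitter is applied by default)
@backoff.on_exception(backoff.expo,
                      requests.exceptions.RequestException,
                      max_tries=5)
def fetch_data(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()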


anna-geller commented on May 12, 2024

[image]


anna-geller commented on May 12, 2024

Closing since we have another issue about first adding random jitter to the retry delay

