Comments (5)
An alternative implementation suggestion: accept an `Iterable[TimeDelta]` for `retry_delay` that will yield the next retry delay each time it is invoked. If that is not possible because the context doesn't allow it, it might be a `Callable` that receives the retry count and returns a retry delay.
Then, a set of retry strategies can be built on top of it.
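As a rough sketch of the iterable idea (the names and defaults here are hypothetical, not an existing prefect API), a generator could produce the delays:

```python
import itertools
from datetime import timedelta
from typing import Iterator

def exponential_delays(base: float = 1.0, factor: float = 2.0,
                       cap: float = 60.0) -> Iterator[timedelta]:
    """Yield exponentially growing retry delays, capped at `cap` seconds."""
    for attempt in itertools.count():
        yield timedelta(seconds=min(cap, base * factor ** attempt))

delays = exponential_delays()
next(delays)  # 1 second
next(delays)  # 2 seconds
next(delays)  # 4 seconds
```

The `Callable` variant would be the same idea expressed as `lambda retry_count: timedelta(...)`, with named strategies layered on top of either shape.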
from prefect.
I saw this has a "needs description" label. I just came here because my first thought reading https://docs.prefect.io/core/tutorial/04-handling-failure.html#if-at-first-you-don-t-succeed was "I wonder if prefect also supports exponential backoff". I'd like to propose a description; hope it helps:
Description of the problem
It's common for tasks that pull data from external systems to fail because of temporary unavailability of those systems. Those systems may, for example, include REST APIs, databases, file systems, or message brokers.
Some of the root causes for that unavailability may be unaffected by consumers putting load on the system, including:
- temporary downtime due to maintenance like OS upgrades
- temporary downtime due to data migrations
- unreliable connection due to poor internet connectivity
In those cases, it makes sense to wait a bit and retry the tasks. prefect offers an easy way to do this with the `max_retries` and `retry_delay` arguments to `@task()`. These arguments allow you to say "if this task fails, try it `n` more times and wait `s` seconds between each retry".
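In plain Python, the fixed-delay behaviour described by those two arguments amounts to something like this (a hypothetical helper for illustration, not prefect's actual implementation):

```python
import time

def run_with_retries(task_fn, max_retries=3, retry_delay=10):
    """Call task_fn; on failure, retry up to max_retries more times,
    sleeping a fixed retry_delay seconds between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return task_fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the failure
            time.sleep(retry_delay)
```

The key property is that the delay is constant, which is exactly what the next paragraph argues can backfire.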
However, this type of logic could defeat its own purpose if the unavailability is because of a problem which is made worse by new load from consumers, including:
- resource exhaustion
- example: 100% CPU utilization caused by a large number of requests or a few very-expensive requests (e.g. database queries that result in a full table scan)
- quota limits reached
- example: many relational databases cap the number of connections which can be open simultaneously
- routing bottleneck
- example: service sits behind a load balancer and the load balancer is receiving new requests faster than it can route them
In these situations, it's advisable to use a retry strategy which waits longer and longer after each failure. If all consumers do this, it will give the external system a better chance to recover. For one example reference (there are many), see this AWS blog post.
What it might look like to add this to prefect
Implementing exponential backoff means that the amount of time to wait between retries is a function of the number of retries so far.
It might look like this pseudocode:
import random
import time

# what is the shortest time you're willing to wait after a failed attempt?
wait_min = 1
# what is the longest you're willing to wait between retries?
wait_max = 10
# how fast do you want waiting time to scale relative to number of attempts?
wait_base = 2
# what is the most times you're willing to retry before saying a task failed?
max_attempts = 5

keep_retrying = True
num_attempts_so_far = 0
while keep_retrying:
    result = task.run()
    num_attempts_so_far += 1
    if result == SUCCESS:
        keep_retrying = False
    elif num_attempts_so_far == max_attempts:
        keep_retrying = False
    else:
        # "full jitter": pick a random wait in [0, wait_base * 2**attempts],
        # then clamp it to the [wait_min, wait_max] range
        time_to_wait = max(
            wait_min,
            min(
                wait_max,
                random.uniform(0, wait_base * 2 ** num_attempts_so_far),
            )
        )
        time.sleep(time_to_wait)
To support this for prefect tasks, `max_retries` and `retry_delay` from the existing `task()` API could probably be reused. It might make sense to map `retry_delay` to `wait_min`. Then you'd have to give people the ability to add `wait_base` and `wait_max`, and probably one more keyword argument like `retry_strategy="exponential_backoff"`.
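To make the combined API concrete, here is a stand-alone decorator sketch of how those keyword arguments could interact (everything here is hypothetical and unrelated to prefect's real `task()` internals):

```python
import functools
import random
import time

def task(max_retries=0, wait_min=1.0, wait_max=10.0, wait_base=2.0,
         retry_strategy="fixed"):
    """Hypothetical decorator: retry a function with either a fixed delay
    (wait_min) or full-jitter exponential backoff clamped to
    [wait_min, wait_max]."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # out of retries: surface the failure
                    if retry_strategy == "exponential_backoff":
                        delay = max(wait_min, min(
                            wait_max,
                            random.uniform(0, wait_base * 2 ** (attempt + 1)),
                        ))
                    else:
                        delay = wait_min
                    time.sleep(delay)
        return wrapper
    return decorator
```

Here `wait_min` plays the double role suggested above: it is the fixed delay for the default strategy and the floor of the backoff window for `"exponential_backoff"`.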
I don't know enough about the prefect API to suggest other implementations, but hopefully this description at least makes the problem and the value of solving it concrete.
Thanks for your time and consideration!
from prefect.
+1, I use this a lot via https://github.com/litl/backoff.
from prefect.
Closing, since we have another issue about first adding random jitter to the retry delay.
from prefect.