Comments (8)
I'd like a feature like `rate-limit` in muffet, which applies across all requests.
I'm happy to offer a bounty of 100€ (payable via PayPal or SEPA) to whoever implements this. If multiple people work on it, I'm happy to split the money.
from lychee.
There is no delay between the requests at the moment.
In fact, we run many requests concurrently, even to the same host.
We could add a `--request-delay` argument, and in combination with `--max-concurrency 1` it would resolve your issue, but it would just be a band-aid for the underlying problem, which is the rate limiting itself. You see, delays need to be tweaked by humans to find a sweet spot between throughput and stability. This can't be solved with a global delay.
Instead, my proposal is to add better rate-limiting support per website. I wrote https://github.com/mre/rate-limits a while ago and would like to integrate it into lychee. We would keep track of the current rate limits for each host in a `HashMap` (a concurrent `HashMap`, actually) or maybe even create one `reqwest` client per host; I don't know which one is the better option right now.
E.g. the hash map could look like this:
```rust
use rate_limits::ResetTime;
use std::collections::HashMap;
use time::OffsetDateTime;

fn main() {
    // Maps each host to the time at which its rate limit resets.
    let mut rate_limits = HashMap::new();
    rate_limits.insert(
        "github.com".to_string(),
        ResetTime::DateTime(OffsetDateTime::from_unix_timestamp(1350085394).unwrap()),
    );
}
```
An entry would be inserted once we get rate-limited.
Before every request, we would check if the current time is after the `ResetTime`.
If not, we'd wait the difference and finally remove the rate limit entry from the map.
This would scale much better than a global delay and would cover more use-cases.
What do you think?
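The check-then-wait flow described above could be sketched roughly as follows. This is only an illustration using the standard library: `HostRateLimits`, `record`, and `take_wait` are hypothetical names, `std::time::Instant` stands in for the `ResetTime` from the `rate-limits` crate, and a plain `HashMap` is used instead of a concurrent one.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-host tracker (not lychee's actual implementation).
/// Maps a host name to the instant at which its rate limit resets.
struct HostRateLimits {
    reset_times: HashMap<String, Instant>,
}

impl HostRateLimits {
    fn new() -> Self {
        Self { reset_times: HashMap::new() }
    }

    /// Called once a response tells us that `host` rate-limited us until `reset_at`.
    fn record(&mut self, host: &str, reset_at: Instant) {
        self.reset_times.insert(host.to_string(), reset_at);
    }

    /// Called before every request to `host`. Removes the entry (if any) and
    /// returns how long the caller should still wait; zero if the reset time
    /// has already passed or the host was never rate-limited.
    fn take_wait(&mut self, host: &str, now: Instant) -> Duration {
        match self.reset_times.remove(host) {
            Some(reset_at) => reset_at.saturating_duration_since(now),
            None => Duration::ZERO,
        }
    }
}
```

A caller would sleep for the returned duration (an async timer would probably be the better fit in lychee's case) before sending the request. Passing `now` in explicitly keeps the logic easy to test.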
from lychee.
It is pretty common for APIs, but not for websites I would guess. Realistically we might still need both, the rate-limit headers and a way to configure the delay.
Let's start with rate-limit headers, though, because that's a common way for a website to tell us to slow down. Another common way is the infamous `429`. We currently retry those requests with an exponential backoff, which is a start. (We could do better by estimating an optimal request interval based on the response time with various request delays, but let's not get ahead of ourselves.)
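For readers unfamiliar with the idea, here is a minimal sketch of exponential backoff after a `429`, preferring the server's `Retry-After` hint when it is given in seconds. It is not lychee's actual retry code: the function names, the 1-second starting delay, and the 60-second cap are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Parses a `Retry-After` header value given in seconds.
/// The HTTP-date form of the header is not handled in this sketch.
fn retry_after(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

/// Prefers the server's `Retry-After` hint and falls back to exponential backoff.
fn delay_after_429(retry_after_header: Option<&str>, attempt: u32) -> Duration {
    let base = Duration::from_secs(1); // assumed starting delay
    let max = Duration::from_secs(60); // assumed upper bound
    retry_after_header
        .and_then(retry_after)
        .unwrap_or_else(|| backoff_delay(attempt, base, max))
}
```

The cap matters because the delay doubles with each attempt; without it, a handful of retries would already wait for minutes.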
from lychee.
> Instead, my proposal is to add better rate-limiting support per website.

This!
Checking my awesome-falsehood project for dead links reveals some false positives for the `news.ycombinator.com` and `twitter.com` domains. Of course, these two protect themselves from abuse, and concurrent access by lychee is seen as such.
`--max-concurrency 1` solves the issue, but we forfeit any performance. The ideal solution would be either a `--max-concurrency-per-domain 1` or a `--delay-per-domain 1.5s` option.
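To make the per-domain delay idea concrete, here is a rough sketch of how requests to the same host could be spaced out while other hosts stay unaffected. Neither option exists in lychee today; `PerHostDelay` and `reserve` are hypothetical names, and the sketch only computes wait times rather than doing any actual scheduling.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical scheduler that spaces out requests to the same host by `delay`.
struct PerHostDelay {
    delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl PerHostDelay {
    fn new(delay: Duration) -> Self {
        Self { delay, next_allowed: HashMap::new() }
    }

    /// Reserves the next free slot for `host` and returns how long the caller
    /// should wait (relative to `now`) before sending its request.
    fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        // The slot is the host's next allowed time or `now`, whichever is later.
        let slot = match self.next_allowed.get(host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_allowed.insert(host.to_string(), slot + self.delay);
        slot.saturating_duration_since(now)
    }
}
```

With a delay of 1.5s, three simultaneous links to `news.ycombinator.com` would be sent 0s, 1.5s, and 3s from now, while a link to any other host could still go out immediately.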
from lychee.
You are absolutely right that the delay would need to be tweaked to fit all queried hosts... and there is likely no common ground. And of course concurrency would need to be set to 1, and everything would be slowed down.
Your proposal sounds like a pretty smart solution!
How common is the usage of these headers already? I see that the newest IETF document also covers APIs, but everything is still in draft?
from lychee.
Did you manage to check twitter links lately? It's failing on my end, even with our workaround to use nitter instead. Maybe the concurrency is what's killing it for me.
Haven't encountered any issues with HN yet, even though it's probably a matter of not triggering their rate-limiting. Out of curiosity, how many requests to `news.ycombinator.com` does it take until you encounter any issues?
from lychee.
> Did you manage to check twitter links lately?

It's more complicated than that: `--max-concurrency 1` fixed it on my machine, but it doesn't in my GitHub Actions workflows. So there is hard rate-limiting by Twitter for requests originating from GitHub.
> how many requests to `news.ycombinator.com` does it take until you encounter any issues?

Around 4 requests:
![Screenshot 2023-06-12 at 19 06 27](https://private-user-images.githubusercontent.com/159718/245186531-07d49b59-b69a-4a10-9d4f-a42fa1365304.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDk1NDA4MDcsIm5iZiI6MTcwOTU0MDUwNywicGF0aCI6Ii8xNTk3MTgvMjQ1MTg2NTMxLTA3ZDQ5YjU5LWI2OWEtNGExMC05ZDRmLWE0MmZhMTM2NTMwNC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwMzA0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDMwNFQwODIxNDdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kMGY4OGIzNjYwZjg5MjJmZGZiNTZlNTkwOTkyYjc4NmM5OGRmM2RiNjI5MDEzYjZhY2FiYjYxMjU2MGEyZGJkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.WTOKe-j2o-PMTjF4miSQXB32DmrLsq5GmnMskiNAAjI)
Source: kdeldycke/awesome-falsehood#159
from lychee.
That's great to hear! To whoever might be interested in tackling that, feel free to post a comment here.
Generally, we can follow the approach of muffet that was linked above.
from lychee.