
Comments (10)

musabgultekin avatar musabgultekin commented on May 21, 2024

Hi William. Thanks for reporting that! A while back I was working on a simple queue system, based on a Go channel.
Here is the branch: https://github.com/geziyor/geziyor/tree/queue

The change is basically like this: d4d9c07

Can you test that branch with your scraper and give feedback on it, so that we can merge it to master? I've used that approach a couple of times, but I was unsure about it.

from geziyor.

musabgultekin avatar musabgultekin commented on May 21, 2024

But the maximum in-memory request queue size is hardcoded right now at 1,000,000. We could make it configurable, though, or use a dynamic default based on available RAM, e.g. max queue size = (RAM in MB × 10), with a configurable override on top.
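A minimal sketch of that dynamic default, assuming a helper that receives the system RAM in MB (geziyor has no such function; the name, floor, and ceiling here are illustrative):

```go
package main

import "fmt"

// queueSize sketches the dynamic default discussed above. ramMB is the
// system RAM in MB (how it is obtained is left out here); the 10x factor
// comes from the comment, while the floor and ceiling are assumptions.
func queueSize(ramMB int) int {
	size := ramMB * 10
	if size < 10_000 {
		size = 10_000 // floor so small machines still get a usable queue
	}
	if size > 10_000_000 {
		size = 10_000_000 // ceiling to bound worst-case memory use
	}
	return size
}

func main() {
	fmt.Println(queueSize(8192)) // 8 GB machine → 81920
}
```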


williamjulianvicary avatar williamjulianvicary commented on May 21, 2024

The queue size could be a problem for larger websites, as the crawl will deadlock once the queue fills up (I tested with a lower limit). Could we perhaps add a select for that, defaulting to dropping any queued items beyond the 1,000,000 cap?
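The select idea can be sketched as a non-blocking send on the buffered queue channel (the request type here is a stand-in for geziyor's client.Request):

```go
package main

import "fmt"

// request is a stand-in for geziyor's client.Request.
type request struct{ URL string }

// tryEnqueue performs a non-blocking send: if the buffered queue is full,
// the request is dropped instead of blocking the producer goroutine.
func tryEnqueue(queue chan *request, r *request) bool {
	select {
	case queue <- r:
		return true
	default:
		// Queue full: drop (or log/persist) rather than deadlock.
		return false
	}
}

func main() {
	queue := make(chan *request, 2) // tiny capacity to show the drop
	for i := 0; i < 3; i++ {
		r := &request{URL: fmt.Sprintf("https://example.com/%d", i)}
		fmt.Println(tryEnqueue(queue, r)) // true, true, false
	}
}
```

Dropping silently loses URLs, so in practice the default branch would want at least a log line or a spill-to-disk path.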

I've tested the branch, and memory usage is far, far better, but I've made a couple of tweaks in this PR which improve performance further. In short:

  • RequestMiddleware is run before the g.do method is called - this avoids queuing requests unnecessarily and...
  • Importantly, it stops the rate limiter from being invoked (and blocking) for requests that are going to be cancelled!

The next optimisation step would be to reduce the weight of each queue item; I can only assume that client.Request is fairly heavy, since even with the clear optimisations these changes bring, there is room to improve. That change is much deeper, though, and I think what this branch brings is already a significant improvement.

To illustrate the improvements, here are runs with 100 threads against Wikipedia.org (I'm polling memory/goroutines each second):

Before queue branch:

> go run main.go
2022/04/01 14:42:56 Starting crawl
2022/04/01 14:42:57 Num goroutines: 665 - 14 MB Mem
2022/04/01 14:42:58 Num goroutines: 1416 - 45 MB Mem
2022/04/01 14:42:59 Num goroutines: 23359 - 151 MB Mem
2022/04/01 14:43:00 Num goroutines: 62491 - 191 MB Mem
2022/04/01 14:43:01 Num goroutines: 85113 - 326 MB Mem
2022/04/01 14:43:02 Num goroutines: 134556 - 399 MB Mem
2022/04/01 14:43:03 Num goroutines: 174339 - 644 MB Mem
2022/04/01 14:43:04 Num goroutines: 242212 - 563 MB Mem
2022/04/01 14:43:05 Num goroutines: 295873 - 875 MB Mem
2022/04/01 14:43:06 Num goroutines: 342488 - 1191 MB Mem

After queue branch:

> go run main.go
2022/04/01 14:41:27 Starting crawl
2022/04/01 14:41:28 Num goroutines: 474 - 37 MB Mem
2022/04/01 14:41:29 Num goroutines: 1039 - 58 MB Mem
2022/04/01 14:41:30 Num goroutines: 1004 - 139 MB Mem
2022/04/01 14:41:31 Num goroutines: 1053 - 170 MB Mem
2022/04/01 14:41:32 Num goroutines: 1283 - 208 MB Mem
2022/04/01 14:41:33 Num goroutines: 1447 - 331 MB Mem
2022/04/01 14:41:34 Num goroutines: 1564 - 392 MB Mem
2022/04/01 14:41:35 Num goroutines: 1520 - 387 MB Mem
2022/04/01 14:41:36 Num goroutines: 1693 - 596 MB Mem
2022/04/01 14:41:37 Num goroutines: 1864 - 707 MB Mem

After my PR:

> go run main.go
2022/04/01 14:42:15 Starting crawl
2022/04/01 14:42:16 Num goroutines: 412 - 23 MB Mem
2022/04/01 14:42:17 Num goroutines: 394 - 74 MB Mem
2022/04/01 14:42:18 Num goroutines: 527 - 151 MB Mem
2022/04/01 14:42:19 Num goroutines: 617 - 153 MB Mem
2022/04/01 14:42:20 Num goroutines: 651 - 168 MB Mem
2022/04/01 14:42:21 Num goroutines: 643 - 182 MB Mem
2022/04/01 14:42:22 Num goroutines: 638 - 167 MB Mem
2022/04/01 14:42:23 Num goroutines: 655 - 124 MB Mem
2022/04/01 14:42:24 Num goroutines: 639 - 130 MB Mem
2022/04/01 14:42:25 Num goroutines: 644 - 224 MB Mem


williamjulianvicary avatar williamjulianvicary commented on May 21, 2024

The PR mentioned above doesn't cover this, but I also feel the robots middleware should be extracted from the middleware stack and moved to the g.do method. The robots middleware creates I/O overhead which is quite likely unnecessary if the request would otherwise be cancelled by a separate middleware.

Potentially we could extract a third middleware type, "PreProcessMiddleware" or similar, or add a flag at middleware registration specifying when it should run. That way consumers of the library could also defer I/O-heavy checks until immediately before execution, rather than at request addition as is the case right now.

That would mean:

  • RequestMiddleware - Executed prior to queuing a request
  • PreProcessMiddleware - Executed immediately prior to executing the HTTP request from the queue
  • ResponseMiddleware - as-is

Naming might need some work, but I think there is value in having three middleware types for different purposes.
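A rough sketch of how the three stages could sit around the queue (type names and signatures are illustrative, not geziyor's actual API):

```go
package main

import "fmt"

// Illustrative types: not geziyor's actual API.
type Request struct{ URL string }
type Response struct{ StatusCode int }

type RequestMiddleware func(*Request) bool    // before queuing; false cancels
type PreProcessMiddleware func(*Request) bool // on dequeue, just before the HTTP call
type ResponseMiddleware func(*Response)       // on the response, as today

type Engine struct {
	requestMW    []RequestMiddleware
	preProcessMW []PreProcessMiddleware
	responseMW   []ResponseMiddleware
}

// enqueue runs the cheap pre-queue checks; cancelled requests never
// occupy a queue slot or touch the rate limiter.
func (e *Engine) enqueue(r *Request, queue chan *Request) {
	for _, mw := range e.requestMW {
		if !mw(r) {
			return
		}
	}
	queue <- r
}

// fetch runs the heavier checks (robots, etc.) only when the request is
// actually about to be executed.
func (e *Engine) fetch(queue chan *Request) {
	r := <-queue
	for _, mw := range e.preProcessMW {
		if !mw(r) {
			return
		}
	}
	// ... perform the HTTP request here, then run e.responseMW on it.
}

func main() {
	e := &Engine{requestMW: []RequestMiddleware{
		func(r *Request) bool { return r.URL != "" }, // cancel empty URLs
	}}
	queue := make(chan *Request, 2)
	e.enqueue(&Request{URL: "https://example.com"}, queue)
	e.enqueue(&Request{URL: ""}, queue) // cancelled before queuing
	fmt.Println(len(queue))             // 1
}
```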


musabgultekin avatar musabgultekin commented on May 21, 2024

Yeah, I understand what you mean.
This wasn't a concern before the queueing system; since we're planning to integrate it, we should think about better ways of doing this, like you mentioned.

On further thought: if we change when the request middlewares run (just after the Do function rather than when the requests are about to be executed), that could affect production systems that rely on custom middlewares.

So instead of running all the middlewares before the queue, we should add another middleware stage that runs just before requests are queued. That way you (or anyone) can cancel a request before it is queued, and it won't consume resources either.


musabgultekin avatar musabgultekin commented on May 21, 2024

BUT, if that's the case, why doesn't the developer do that cancel check themselves just before queuing, instead of relying on middleware? I.e. before even calling the g.Do method?


williamjulianvicary avatar williamjulianvicary commented on May 21, 2024

As it stands, I don't think you lose anything by adding the queue system; it's just an internal method of handling queuing. The current use of middleware is already inefficient regardless, so these changes won't make that worse (existing middlewares just won't benefit from the improvements made here).

Personally, I don't feel this should be an optional flag; it should be the default. Without the queue, the internals simply block and the semaphore blocks instead (just much more heavily, given the goroutines) - the queue "just" handles queuing more efficiently.


musabgultekin avatar musabgultekin commented on May 21, 2024

Sorry, I meant changing the middleware system, not the queue. I really like the current state of the queuing system.
So instead of adding a delay middleware, we'll have a different middleware stage. Let me write some code so you can see what I mean.


musabgultekin avatar musabgultekin commented on May 21, 2024

I just made some changes in addition to your code in the queue branch.

So I basically added reqPreQueueMiddlewares and moved into it all the middlewares that can cancel a request (allowed domains, duplicate requests, robots).
The delay middleware stays in the request middlewares.
I also made the queue size configurable.

In this commit: 24250c6

Can you take a look? Is it good in your opinion? It fixes the issue you mentioned.
Also, do you think 1M is a good default, or should we make it more dynamic, e.g. based on the RAM available on the system?


williamjulianvicary avatar williamjulianvicary commented on May 21, 2024

1M is ~8 MB of memory usage, I believe - I think that's a reasonable default.
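That figure checks out if the queue is a buffered channel of pointers (an assumption about the implementation): the channel buffer costs one machine word per slot, with the queued request values allocated on top of that.

```go
package main

import "fmt"

func main() {
	// Assumption: the queue is a buffered channel of pointers, so the
	// channel buffer itself costs one 8-byte word per slot on 64-bit.
	// The queued request values allocate additional memory on top.
	const queueSize = 1_000_000
	const wordSize = 8 // bytes per pointer on a 64-bit platform
	fmt.Printf("channel buffer: ~%d MB\n", queueSize*wordSize/1_000_000) // ~8 MB
}
```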

I've checked the commit and all looks good from my perspective - thanks!

