
Comments (8)

neon-mmd commented on May 18, 2024

I added logs before and after random_user_agent() and found that its processing time even exceeded 10 seconds. Is this expected?

This was unexpected 🙂 After testing it on my side, it turns out that it takes 7 seconds for me, so it seems to depend on the performance of the system. But we can try enabling one option: let's change the caching option from false to true when generating the user agent and see if it improves the speed 🙂 .


The file that needs to be changed is located under src/search_results_handler/user_agent.rs.

from websurfx.

neon-mmd commented on May 18, 2024

So after studying and digging deep into the crate itself, I found that the crate actually fetches data from the upstream website and scrapes it to get the required user agents, which is why it causes a delay. It also looks like the project has been abandoned, because the last commit seems to be 5 years old, which is a very long time for an open source repository.

Also, enabling the cache option did improve speed slightly, by 2-3 seconds. But I think that a delay of around 5 seconds is acceptable, since some random delay between requests helps to evade IP blocking. I can think of reducing the random time delay that I have added in the code from 1-10 secs to 1-5 secs to improve speed. What do you say @xffxff??

Also, maybe in the future, we might need to either explore an alternative for this crate or maybe implement our own 😄 .
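If we ever do implement our own, a minimal self-contained sketch could look like the following. This is purely hypothetical and not websurfx code: it samples from a hard-coded pool of user-agent strings, so there is no network fetch or scraping at runtime; the pool contents and the clock-based index are illustrative stand-ins for a proper list and a real RNG.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical replacement for the crate: a fixed pool of
// user-agent strings, sampled pseudo-randomly at call time.
const USER_AGENT_POOL: &[&str] = &[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0",
];

/// Returns a pseudo-randomly chosen user-agent string from the pool.
pub fn random_user_agent() -> &'static str {
    // Derive an index from the clock's nanosecond field; a real
    // implementation would use a proper RNG such as the `rand` crate.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos() as usize)
        .unwrap_or(0);
    USER_AGENT_POOL[nanos % USER_AGENT_POOL.len()]
}
```

Since everything is compiled in, each call is a constant-time array lookup with no startup cost at all.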


xffxff commented on May 18, 2024

Also, enabling the cache option did improve speed slightly, by 2-3 seconds. But I think that a delay of around 5 seconds is acceptable, since some random delay between requests helps to evade IP blocking. I can think of reducing the random time delay that I have added in the code from 1-10 secs to 1-5 secs to improve speed.

@neon-mmd Hmm, do we have to insert a delay between different requests? This may conflict with our lightning-fast goal. Additionally, when there are many concurrent search requests, even with a delay there will still be a lot of requests hitting the engine at the same time.


xffxff commented on May 18, 2024

pub fn random_user_agent() -> String {
    UserAgentsBuilder::new()
        .cache(false)
        .dir("/tmp")
        .thread(1)
        .set_browsers(
            Browsers::new()
                .set_chrome()
                .set_safari()
                .set_edge()
                .set_firefox()
                .set_mozilla(),
        )
        .build()
        .random()
        .to_string()
}

I believe that it is unnecessary to construct a new `UserAgents` object for every request; all requests can share the same instance. Instead, you can call its `random()` method to obtain a random user agent string for each request. Here is an example of how this can be implemented:

// Construct the UserAgents object once when the server starts
let user_agents = UserAgentsBuilder::new()
    .cache(false)
    .dir("/tmp")
    .thread(1)
    .set_browsers(
        Browsers::new()
            .set_chrome()
            .set_safari()
            .set_edge()
            .set_firefox()
            .set_mozilla(),
    )
    .build();
...

// Retrieve a random user agent string in aggregator.rs
let user_agent = user_agents.random().to_string();


alamin655 commented on May 18, 2024

I think @xffxff is right. Here is my implementation using the lazy_static crate:

use fake_useragent::{Browsers, UserAgents, UserAgentsBuilder};
use lazy_static::lazy_static;

lazy_static! {
    static ref USER_AGENTS: UserAgents = {
        UserAgentsBuilder::new()
            .cache(false)
            .dir("/tmp")
            .thread(1)
            .set_browsers(
                Browsers::new()
                    .set_chrome()
                    .set_safari()
                    .set_edge()
                    .set_firefox()
                    .set_mozilla(),
            )
            .build()
    };
}

/// A function to generate a random user agent to improve privacy of the user.
///
/// # Returns
///
/// A randomly generated user agent string.
pub fn random_user_agent() -> String {
    USER_AGENTS.random().to_string()
}


neon-mmd commented on May 18, 2024

@neon-mmd Hmm, do we have to insert a delay between different requests? This may conflict with our lightning-fast goal. Additionally, when there are many concurrent search requests, even with a delay there will still be a lot of requests hitting the engine at the same time.

No, actually we need it. If we do not add a random delay between requests, especially for large-scale server use cases, these servers will have thousands of users and will create a lot of traffic. This, in turn, may cause the upstream search engines to get DDoSed, which is not good, and they might ban the IP that caused the DDoS. But I can see one option: a config option like production_use which, when enabled, puts a random delay after, let's say, every 4 concurrent requests, and when disabled either reduces the random delays or removes them completely. This would be very helpful for small-scale use, for example if you are hosting on your home server just for your own use. What do you say @xffxff @alamin655 ??
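To make the idea concrete, here is a hedged sketch of how such throttling could work. The `production_use` flag, the `throttle_delay` name, and the clock-based jitter are all assumptions for illustration, not websurfx code: it counts outgoing requests and, when the flag is on, returns a 1-5 second delay after every 4th request.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical global request counter shared across requests.
static REQUEST_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Returns the delay to apply before the next upstream request,
/// or `None` when no delay is needed.
pub fn throttle_delay(production_use: bool) -> Option<Duration> {
    // fetch_add returns the previous value, so add 1 for the current count.
    let n = REQUEST_COUNT.fetch_add(1, Ordering::Relaxed) + 1;
    if !production_use || n % 4 != 0 {
        return None; // small-scale use: no artificial delay
    }
    // Pseudo-random 1-5 second delay derived from the clock;
    // a real implementation would use a proper RNG.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos() as u64)
        .unwrap_or(0);
    Some(Duration::from_secs(1 + nanos % 5))
}
```

A caller would then do `if let Some(d) = throttle_delay(config.production_use) { std::thread::sleep(d); }` (or the async equivalent) before hitting the upstream engine.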


neon-mmd commented on May 18, 2024

I think @xffxff is right. Here is my implementation using the lazy_static crate:

use fake_useragent::{Browsers, UserAgents, UserAgentsBuilder};
use lazy_static::lazy_static;

lazy_static! {
    static ref USER_AGENTS: UserAgents = {
        UserAgentsBuilder::new()
            .cache(false)
            .dir("/tmp")
            .thread(1)
            .set_browsers(
                Browsers::new()
                    .set_chrome()
                    .set_safari()
                    .set_edge()
                    .set_firefox()
                    .set_mozilla(),
            )
            .build()
    };
}

/// A function to generate a random user agent to improve privacy of the user.
///
/// # Returns
///
/// A randomly generated user agent string.
pub fn random_user_agent() -> String {
    USER_AGENTS.random().to_string()
}

This looks good 👍 but after doing some research to see whether there are any better and faster implementations than this, I found that lazy_static seems to be a bit slow and there is an even better and faster crate for the same purpose called once_cell. It has also been merged into std::lazy and is available as an experimental feature in the nightly version right now, so I see once_cell should be the way to go forward. What do you say @alamin655 ??
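For illustration, here is a std-only sketch of that lazy-initialization pattern using `std::sync::OnceLock`, the standard-library API that grew out of once_cell. A plain vector of strings stands in for the crate's `UserAgents` object, and the clock-based index is a stand-in for a real RNG; everything here is an assumption, not the project's actual code.

```rust
use std::sync::OnceLock;
use std::time::{SystemTime, UNIX_EPOCH};

// Initialized once, on first access, like once_cell's Lazy.
static USER_AGENTS: OnceLock<Vec<String>> = OnceLock::new();

fn user_agents() -> &'static Vec<String> {
    USER_AGENTS.get_or_init(|| {
        // The expensive one-time setup (e.g. the crate's builder
        // call) would go here; a fixed pool stands in for it.
        vec![
            "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0".to_string(),
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36".to_string(),
        ]
    })
}

/// Returns a pseudo-randomly chosen user agent from the shared pool.
pub fn random_user_agent() -> String {
    let pool = user_agents();
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos() as usize)
        .unwrap_or(0);
    pool[nanos % pool.len()].clone()
}
```

The shape is the same as the lazy_static version above; only the initialization mechanism changes, so swapping crates later stays a local edit.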

Here are some links to follow:


xffxff commented on May 18, 2024

No, actually we need it. If we do not add a random delay between requests, especially for large-scale server use cases, these servers will have thousands of users and will create a lot of traffic. This, in turn, may cause the upstream search engines to get DDoSed, which is not good, and they might ban the IP that caused the DDoS. But I can see one option: a config option like production_use which, when enabled, puts a random delay after, let's say, every 4 concurrent requests, and when disabled either reduces the random delays or removes them completely. This would be very helpful for small-scale use, for example if you are hosting on your home server just for your own use.

@neon-mmd Thank you for the explanation! I think you are right, and having a config option like production_use would be helpful.

from websurfx.
