
Comments (46)

Extravi commented on June 15, 2024

This change should add more redundancy and make everything faster and more reliable.

Extravi commented on June 15, 2024
# NOTE: WHITELISTED_DOMAINS and user_agents are defined elsewhere in the project
import json
import random

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote


def makeHTMLRequest(url: str):
    # block unwanted request from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    # get google cookies
    with open("./2captcha.json", "r") as file:
        data = json.load(file)
    GOOGLE_OGPC_COOKIE = data["GOOGLE_OGPC_COOKIE"]
    GOOGLE_NID_COOKIE = data["GOOGLE_NID_COOKIE"]
    GOOGLE_AEC_COOKIE = data["GOOGLE_AEC_COOKIE"]
    GOOGLE_1P_JAR_COOKIE = data["GOOGLE_1P_JAR_COOKIE"]
    GOOGLE_ABUSE_COOKIE = data["GOOGLE_ABUSE_COOKIE"]

    # Choose a user-agent at random
    user_agent = random.choice(user_agents)
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    cookies = {
        "OGPC": f"{GOOGLE_OGPC_COOKIE}",
        "NID": f"{GOOGLE_NID_COOKIE}",
        "AEC": f"{GOOGLE_AEC_COOKIE}",
        "1P_JAR": f"{GOOGLE_1P_JAR_COOKIE}",
        "GOOGLE_ABUSE_EXEMPTION": f"{GOOGLE_ABUSE_COOKIE}"
    }

    # Force all requests to only use IPv4
    requests.packages.urllib3.util.connection.HAS_IPV6 = False
    
    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
    

Extravi commented on June 15, 2024

this might be a useful cookie to add
[screenshot]

amogusussy commented on June 15, 2024

Here's a few changes I'd add:

# NOTE: WHITELISTED_DOMAINS and user_agents are defined elsewhere in the project
import json
import random

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse


# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = urlparse(url).netloc
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        with open("./2captcha.json", "r") as file:
            data = json.load(file)
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")

This sets requests.packages.urllib3.util.connection.HAS_IPV6 before the function, because it only needs to be set once.
Uses urlparse rather than splitting strings.
Only uses the cookies if the function is called as makeHTMLRequest(url, is_google=True), so other requests don't send unnecessary cookies and don't waste time parsing the file.
And it removes a few one-time-use variables, because they don't need to be variables.

Extravi commented on June 15, 2024

yeah I'm still working on that request function

Extravi commented on June 15, 2024

there will be more changes in the next few days

amogusussy commented on June 15, 2024

Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.
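
For reference, a sketch of the two variants; the q-values are standard HTTP content-negotiation weights, so the explicit types mainly express preference order rather than capability:

# The thread's current Accept header: explicit types are preferred
# (q defaults to 1.0), everything else is still accepted at weight 0.8.
accept_full = ("text/html,application/xhtml+xml,application/xml;q=0.9,"
               "image/avif,image/webp,*/*;q=0.8")

# A functionally similar minimal form, since */* already matches everything.
# Real browsers send the long form, though, so the short form may stand out
# to bot detection, which is the trade-off discussed in this thread.
accept_minimal = "*/*"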

Extravi commented on June 15, 2024

I'm running tests on various captcha-blocked VPN connections to see which headers and cookies will make the requests more reliable

Extravi commented on June 15, 2024

Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.

I'm still going to specify it just in case, and I'll continue to run tests

Extravi commented on June 15, 2024

2captcha is very cheap, but it adds up over time, so I need to make it harder to detect and block so it uses the API less
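
For context, a minimal sketch of what a solve looks like with the official 2captcha-python client; the API key and site key below are placeholders, and araa-search's actual integration may differ:

from twocaptcha import TwoCaptcha  # pip install 2captcha-python

solver = TwoCaptcha("YOUR_2CAPTCHA_API_KEY")  # placeholder key
result = solver.recaptcha(
    sitekey="SITE_KEY_FROM_BLOCKED_PAGE",      # data-sitekey on the captcha page
    url="https://www.google.com/sorry/index",  # page where the captcha appeared
)
token = result["code"]  # g-recaptcha-response token to submit back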

Extravi commented on June 15, 2024

I noticed that once the first reCAPTCHA pops up, it pops up more often after that, so I need to find ways to make the request system seem like a real user
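
One simple measure along those lines, as an illustrative sketch rather than anything from araa-search: jitter the delay between outgoing requests so the timing pattern looks less machine-generated.

import random
import time

def polite_sleep(min_s: float = 0.5, max_s: float = 2.5) -> None:
    # Sleep a random amount between requests; the bounds are arbitrary
    # illustrative values, not tuned against Google's detection.
    time.sleep(random.uniform(min_s, max_s))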

Extravi commented on June 15, 2024

I did notice that a headless Chrome browser doesn't really use that much memory, and there are undetected versions of it, so that could become a scraping option in the config at some point

Extravi commented on June 15, 2024

nvm that might not be practical

Extravi commented on June 15, 2024

odd, I can't seem to get the "_GRECAPTCHA" cookie

Extravi commented on June 15, 2024

oh, in Chrome-based browsers it's not stored under cookies, it's stored under local storage

amogusussy commented on June 15, 2024

Does 2captcha also work for self-hosted instances without the hoster having to pay?

Extravi commented on June 15, 2024

oh, in Chrome-based browsers it's not stored under cookies, it's stored under local storage

odd, I can't get it atm
[screenshot]

Extravi commented on June 15, 2024

Does 2captcha also work for self-hosted instances without the hoster having to pay?

no, but if you want to help test I can send you some credits

Extravi commented on June 15, 2024

[screenshot]

Extravi commented on June 15, 2024

if you make an account, email me the email you used and I can send some credit that could help

Extravi commented on June 15, 2024

some cookies used by Google are region-based, so in the UK you won't get everything I can get testing in NA

Extravi commented on June 15, 2024

but "_GRECAPTCHA" is in the EU, UK and NA

Extravi commented on June 15, 2024

I also do my tests using high-load free VPN servers; to make sure the connection triggers a reCAPTCHA, I send a request in a private window using "https://www.google.com/search?q=google"
[screenshot]

amogusussy commented on June 15, 2024

I think there should be an option in the config file for whether you want to use a captcha solver, then.
Maybe have something like #106 (the PR isn't that great, so I might redo it) for when the admin chooses not to use a captcha solver.
Having to pay will probably turn most people away from self-hosting.

Extravi commented on June 15, 2024

it's already an option in the config

Extravi commented on June 15, 2024

it's turned off by default, but I have it on for testing

[screenshot]

Extravi commented on June 15, 2024

I have done a total of 182 captchas in my test and it only used $0.54
[screenshot]

Extravi commented on June 15, 2024

most of this is from debugging the code; on an instance it will use the API far less

amogusussy commented on June 15, 2024

Do you think an alternative search engine would be good for people who choose not to have it, though? Some people might just not want to give their credit card info to them.
Also, how accurate is it? Does it usually work on the first attempt at solving, or does it take several? I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.

Extravi commented on June 15, 2024

It does it on the first attempt: out of the 182 sent, it only got one wrong. For the server to do everything with it and the web driver, it totals 43.99 seconds.
[screenshot]

Extravi commented on June 15, 2024

results will look something like this in the file
[screenshot]

Extravi commented on June 15, 2024

I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.

Google's reCAPTCHA also uses AI, btw

Extravi commented on June 15, 2024

Here's a few changes I'd add: […]

It's been added:

# NOTE: WHITELISTED_DOMAINS, user_agents, and load_config are defined elsewhere in the project
import random

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        data = load_config()
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }
    
    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
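
A hypothetical usage example of the merged function, assuming "www.google.com" is in WHITELISTED_DOMAINS; the selector is illustrative and depends on Google's markup at the time:

# Hypothetical call; result titles are usually in <h3> elements.
soup = makeHTMLRequest("https://www.google.com/search?q=test", is_google=True)
for h3 in soup.select("h3"):
    print(h3.get_text())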

Extravi commented on June 15, 2024

alternative search engine

an alternative search engine is not a bad idea and I will be looking into that soon, but I want to finish how requests are made first

Extravi commented on June 15, 2024

I'm going to add support to proxy Google autocomplete as a setting, because it's faster than DuckDuckGo's
[screenshot]
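
A sketch of what proxying Google autocomplete could look like; the suggestqueries.google.com endpoint with client=firefox is a widely used JSON variant, but it is not an officially documented API, so treat the response format as an assumption:

import requests

def google_autocomplete(query: str) -> list[str]:
    resp = requests.get(
        "https://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": query},
        timeout=5,
    )
    # Observed response shape: [query, [suggestion, suggestion, ...]]
    return resp.json()[1]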

Extravi commented on June 15, 2024

Each domain will now have its own persistent session, so I won't need to establish a new https/tls connection for each domain, and I can take advantage of connection reuse. This should greatly improve speeds. Also, each session will be isolated and have its own cookies, etc., making everything more reliable.
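
A minimal sketch of the idea, with illustrative names rather than araa-search's actual code: cache one requests.Session per domain, so repeated requests to the same host reuse the pooled TLS connection.

import requests
from urllib.parse import urlparse

_sessions: dict[str, requests.Session] = {}

def get_session(url: str) -> requests.Session:
    # One Session (and therefore one connection pool) per domain.
    domain = urlparse(url).netloc
    if domain not in _sessions:
        _sessions[domain] = requests.Session()
    return _sessions[domain]

Each Session also keeps its own cookie jar, which gives the per-session isolation mentioned above.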

Extravi commented on June 15, 2024

Video demo of what's possible with persistent sessions and connection reuse. Persistent sessions have already been added to my instance, but I cannot take advantage of connection reuse unless I set up a persistent session for each domain, and that's something I am currently working on.
https://github.com/Extravi/araa-search/assets/98912029/96a7d011-9efe-4e03-9120-578760f97b77

Extravi commented on June 15, 2024

A good example is autocomplete. Instead of opening a new TLS/SSL connection for every input or request, it can just resume its connection to that domain. This will greatly reduce delay and improve response time.
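
A rough way to see the effect, reusing the hypothetical get_session helper sketched earlier: the first call pays the TCP/TLS handshake, and later calls reuse the pooled connection.

import time

url = "https://duckduckgo.com/ac/?q=test"  # example autocomplete endpoint
for attempt in (1, 2):
    start = time.perf_counter()
    get_session(url).get(url, timeout=5)
    print(f"request {attempt}: {time.perf_counter() - start:.3f}s")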

Extravi commented on June 15, 2024

[screenshot]

I will need to check each request, and each will need its own persistent session; each session is in memory/RAM, so it's quite fast

Extravi commented on June 15, 2024

I'll add it tmr with some other stuff

Extravi commented on June 15, 2024

@amogusussy @TEMtheLEM
[screenshot]

Extravi commented on June 15, 2024

the first request will look something like this:
[screenshot]
any request after will look like this:
[screenshot]

Extravi commented on June 15, 2024

Now there is no need to start a new connection every time, saving on response time and making everything faster.

Extravi commented on June 15, 2024

at first request:
[screenshots]

any request after:
[screenshots]

Extravi commented on June 15, 2024

If you have any ideas on how I can make the requests better or faster, let me know.
