
Comments (46)

Extravi commented on June 15, 2024

This change should add more redundancy and make everything faster and more reliable.

Extravi commented on June 15, 2024
# NOTE: WHITELISTED_DOMAINS and user_agents are defined elsewhere in the project
import json
import random

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote


def makeHTMLRequest(url: str):
    # block unwanted request from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    # get google cookies
    with open("./2captcha.json", "r") as file:
        data = json.load(file)
    GOOGLE_OGPC_COOKIE = data["GOOGLE_OGPC_COOKIE"]
    GOOGLE_NID_COOKIE = data["GOOGLE_NID_COOKIE"]
    GOOGLE_AEC_COOKIE = data["GOOGLE_AEC_COOKIE"]
    GOOGLE_1P_JAR_COOKIE = data["GOOGLE_1P_JAR_COOKIE"]
    GOOGLE_ABUSE_COOKIE = data["GOOGLE_ABUSE_COOKIE"]

    # Choose a user-agent at random
    user_agent = random.choice(user_agents)
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    cookies = {
        "OGPC": f"{GOOGLE_OGPC_COOKIE}",
        "NID": f"{GOOGLE_NID_COOKIE}",
        "AEC": f"{GOOGLE_AEC_COOKIE}",
        "1P_JAR": f"{GOOGLE_1P_JAR_COOKIE}",
        "GOOGLE_ABUSE_EXEMPTION": f"{GOOGLE_ABUSE_COOKIE}"
    }

    # Force all requests to only use IPv4
    requests.packages.urllib3.util.connection.HAS_IPV6 = False
    
    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
    

Extravi commented on June 15, 2024

this might be a useful cookie to add
[screenshot]

amogusussy commented on June 15, 2024

Here's a few changes I'd add:

# NOTE: WHITELISTED_DOMAINS and user_agents are defined elsewhere in the project
import json
import random

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse


# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = urlparse(url).netloc
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        with open("./2captcha.json", "r") as file:
            data = json.load(file)
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")

This sets requests.packages.urllib3.util.connection.HAS_IPV6 before the function, because it only needs to be set once.
Uses urlparse rather than splitting strings.
Only uses the cookies if the function is called as makeHTMLRequest(url, is_google=True), so other requests don't send unnecessary cookies and don't waste time parsing the file.
And it removes a few one-time-use variables, because they don't need to be variables.

Extravi commented on June 15, 2024

yeah I'm still working on that request function

Extravi commented on June 15, 2024

there will be more changes in the next few days

amogusussy commented on June 15, 2024

Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.
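
For reference, a sketch of the two variants; the q-values are standard HTTP content-negotiation weights, so the explicit types mainly express preference order rather than capability:

# The thread's current Accept header: explicit types are preferred
# (q defaults to 1.0), everything else is still accepted at weight 0.8.
accept_full = ("text/html,application/xhtml+xml,application/xml;q=0.9,"
               "image/avif,image/webp,*/*;q=0.8")

# A functionally similar minimal form, since */* already matches everything.
# Real browsers send the long form, though, so the short form may stand out
# to bot detection, which is the trade-off discussed in this thread.
accept_minimal = "*/*"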

Extravi commented on June 15, 2024

I'm running tests on various captcha-blocked VPN connections to see which headers and cookies will make the requests more reliable

Extravi commented on June 15, 2024

Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.

I'm still going to specify it just in case, and I'll continue to run tests

Extravi commented on June 15, 2024

2captcha is very cheap, but it adds up over time, so I need to make it harder to detect and block so it uses the API less
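
For context, a minimal sketch of what a solve looks like with the official 2captcha-python client; the API key and site key below are placeholders, and araa-search's actual integration may differ:

from twocaptcha import TwoCaptcha  # pip install 2captcha-python

solver = TwoCaptcha("YOUR_2CAPTCHA_API_KEY")  # placeholder key
result = solver.recaptcha(
    sitekey="SITE_KEY_FROM_BLOCKED_PAGE",      # data-sitekey on the captcha page
    url="https://www.google.com/sorry/index",  # page where the captcha appeared
)
token = result["code"]  # g-recaptcha-response token to submit back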

Extravi commented on June 15, 2024

I noticed that once the first reCAPTCHA pops up, it pops up more often after that, so I need to find ways to make the request system seem like a real user
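
One simple measure along those lines, as an illustrative sketch rather than anything from araa-search: jitter the delay between outgoing requests so the timing pattern looks less machine-generated.

import random
import time

def polite_sleep(min_s: float = 0.5, max_s: float = 2.5) -> None:
    # Sleep a random amount between requests; the bounds are arbitrary
    # illustrative values, not tuned against Google's detection.
    time.sleep(random.uniform(min_s, max_s))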

Extravi commented on June 15, 2024

I did notice that a headless Chrome browser doesn't really use that much memory, and there are undetected versions of it, so that could become a scraping option in the config at some point

Extravi commented on June 15, 2024

nvm that might not be practical

Extravi commented on June 15, 2024

odd, I can't seem to get the "_GRECAPTCHA" cookie

Extravi commented on June 15, 2024

oh, in Chrome-based browsers it's not stored under cookies, it's stored under local storage

amogusussy commented on June 15, 2024

Does 2captcha also work for self-hosted instances without the hoster having to pay?

Extravi commented on June 15, 2024

oh, in Chrome-based browsers it's not stored under cookies, it's stored under local storage

odd, I can't get it atm
[screenshot]

Extravi commented on June 15, 2024

Does 2captcha also work for self-hosted instances without the hoster having to pay?

no, but if you want to help test I can send you some credits

Extravi commented on June 15, 2024

[screenshot]

Extravi commented on June 15, 2024

if you make an account, email me the email you used and I can send some credit that could help

Extravi commented on June 15, 2024

some cookies used by Google are region-based, so in the UK you won't get everything I can get testing in NA

Extravi commented on June 15, 2024

but "_GRECAPTCHA" is in the EU, UK and NA

Extravi commented on June 15, 2024

I also do my tests using high-load free VPN servers; to make sure the connection triggers a reCAPTCHA, I send a request in a private window using "https://www.google.com/search?q=google"
[screenshot]

amogusussy commented on June 15, 2024

I think there should be an option in the config file for whether you want to use a captcha solver, then.
Maybe have something like #106 (the PR isn't that great, so I might redo it) for when the admin chooses not to use a captcha solver.
Having to pay will probably turn most people away from self-hosting.

Extravi commented on June 15, 2024

it's already an option in the config

Extravi commented on June 15, 2024

it's turned off by default, but I have it on for testing

[screenshot]

Extravi commented on June 15, 2024

I have done a total of 182 captchas in my test and it only used $0.54
[screenshot]

Extravi commented on June 15, 2024

most of this is from debugging the code; on an instance it will use the API far less

amogusussy commented on June 15, 2024

Do you think an alternative search engine would be good for people who choose not to have it, though? Some people might just not want to give their credit card info to them.
Also, how accurate is it? Does it usually work on the first attempt at solving, or does it take several? I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.

Extravi commented on June 15, 2024

It does it on the first attempt: out of the 182 sent, it only got one wrong. For the server to do everything with it and the web driver, it totals 43.99 seconds.
[screenshot]

Extravi commented on June 15, 2024

results will look something like this in the file
[screenshot]

Extravi commented on June 15, 2024

I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.

Google's reCAPTCHA also uses AI, btw

Extravi commented on June 15, 2024

Here's a few changes I'd add: […]

It's been added:

# NOTE: WHITELISTED_DOMAINS, user_agents, and load_config are defined elsewhere in the project
import random

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        data = load_config()
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }
    
    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
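
A hypothetical usage example of the merged function, assuming "www.google.com" is in WHITELISTED_DOMAINS; the selector is illustrative and depends on Google's markup at the time:

# Hypothetical call; result titles are usually in <h3> elements.
soup = makeHTMLRequest("https://www.google.com/search?q=test", is_google=True)
for h3 in soup.select("h3"):
    print(h3.get_text())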

Extravi commented on June 15, 2024

alternative search engine

an alternative search engine is not a bad idea and I will be looking into that soon, but I want to finish how requests are made first

Extravi commented on June 15, 2024

I'm going to add support to proxy Google autocomplete as a setting, because it's faster than DuckDuckGo's
[screenshot]
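
A sketch of what proxying Google autocomplete could look like; the suggestqueries.google.com endpoint with client=firefox is a widely used JSON variant, but it is not an officially documented API, so treat the response format as an assumption:

import requests

def google_autocomplete(query: str) -> list[str]:
    resp = requests.get(
        "https://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": query},
        timeout=5,
    )
    # Observed response shape: [query, [suggestion, suggestion, ...]]
    return resp.json()[1]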

Extravi commented on June 15, 2024

Each domain will now have its own persistent session, so I won't need to establish a new https/tls connection for each domain, and I can take advantage of connection reuse. This should greatly improve speeds. Also, each session will be isolated and have its own cookies, etc., making everything more reliable.
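
A minimal sketch of the idea, with illustrative names rather than araa-search's actual code: cache one requests.Session per domain, so repeated requests to the same host reuse the pooled TLS connection.

import requests
from urllib.parse import urlparse

_sessions: dict[str, requests.Session] = {}

def get_session(url: str) -> requests.Session:
    # One Session (and therefore one connection pool) per domain.
    domain = urlparse(url).netloc
    if domain not in _sessions:
        _sessions[domain] = requests.Session()
    return _sessions[domain]

Each Session also keeps its own cookie jar, which gives the per-session isolation mentioned above.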

Extravi commented on June 15, 2024

Video demo of what's possible with persistent sessions and connection reuse. Persistent sessions have already been added to my instance, but I cannot take advantage of connection reuse unless I set up a persistent session for each domain, and that's something I am currently working on.
https://github.com/Extravi/araa-search/assets/98912029/96a7d011-9efe-4e03-9120-578760f97b77

Extravi commented on June 15, 2024

A good example is autocomplete. Instead of opening a new TLS/SSL connection for every input or request, it can just resume its connection to that domain. This will greatly reduce delay and improve response time.
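
A rough way to see the effect, reusing the hypothetical get_session helper sketched earlier: the first call pays the TCP/TLS handshake, and later calls reuse the pooled connection.

import time

url = "https://duckduckgo.com/ac/?q=test"  # example autocomplete endpoint
for attempt in (1, 2):
    start = time.perf_counter()
    get_session(url).get(url, timeout=5)
    print(f"request {attempt}: {time.perf_counter() - start:.3f}s")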

Extravi commented on June 15, 2024

[screenshot]

I will need to check each request, and each will need its own persistent session; each session is in memory/RAM, so it's quite fast

Extravi commented on June 15, 2024

I'll add it tmr with some other stuff

Extravi commented on June 15, 2024

@amogusussy @TEMtheLEM
[screenshot]

Extravi commented on June 15, 2024

the first request will look something like this:
[screenshot]
any request after will look like this:
[screenshot]

Extravi commented on June 15, 2024

Now there is no need to start a new connection every time, saving on response time and making everything faster.

Extravi commented on June 15, 2024

at first request:
[screenshots]

any request after:
[screenshots]

Extravi commented on June 15, 2024

If you have any ideas on how I can make the requests better or faster, let me know.
