Comments (46)
This change should add more redundancy and make everything faster and more reliable.
from araa-search.
import json
import random
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

# WHITELISTED_DOMAINS and user_agents are defined elsewhere in the module

def makeHTMLRequest(url: str):
    # block unwanted requests from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    # get google cookies
    with open("./2captcha.json", "r") as file:
        data = json.load(file)
    GOOGLE_OGPC_COOKIE = data["GOOGLE_OGPC_COOKIE"]
    GOOGLE_NID_COOKIE = data["GOOGLE_NID_COOKIE"]
    GOOGLE_AEC_COOKIE = data["GOOGLE_AEC_COOKIE"]
    GOOGLE_1P_JAR_COOKIE = data["GOOGLE_1P_JAR_COOKIE"]
    GOOGLE_ABUSE_COOKIE = data["GOOGLE_ABUSE_COOKIE"]

    # Choose a user-agent at random
    user_agent = random.choice(user_agents)
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }
    cookies = {
        "OGPC": f"{GOOGLE_OGPC_COOKIE}",
        "NID": f"{GOOGLE_NID_COOKIE}",
        "AEC": f"{GOOGLE_AEC_COOKIE}",
        "1P_JAR": f"{GOOGLE_1P_JAR_COOKIE}",
        "GOOGLE_ABUSE_EXEMPTION": f"{GOOGLE_ABUSE_COOKIE}"
    }

    # Force all requests to only use IPv4
    requests.packages.urllib3.util.connection.HAS_IPV6 = False

    # Grab HTML content with the specific cookies
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
from araa-search.
this might be a useful cookie to add
from araa-search.
Here are a few changes I'd add:
from urllib.parse import urlparse

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted requests from an edited cookie
    domain = urlparse(url).netloc
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        with open("./2captcha.json", "r") as file:
            data = json.load(file)
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookies
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
This sets requests.packages.urllib3.util.connection.HAS_IPV6 before the function, because it only needs to be set once.
It uses urlparse rather than splitting strings.
Cookies are only used if the function is called as makeHTMLRequest(url, is_google=True), so other requests don't send unnecessary cookies and don't waste time parsing the file.
It also removes a few one-time-use variables, because they don't need to be variables.
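To illustrate the urlparse point, here is a small sketch (not project code) of why urlparse(url).netloc is more robust than splitting the string on '/':

```python
from urllib.parse import urlparse

# Explicit URL parsing extracts the network location reliably:
assert urlparse("https://www.google.com/search?q=test").netloc == "www.google.com"

# Scheme-relative URLs still parse correctly:
assert urlparse("//www.google.com").netloc == "www.google.com"

# Naive splitting assumes a "scheme://host/..." layout and raises
# IndexError on anything else, e.g. a bare domain:
try:
    "www.google.com".split('/')[2]
except IndexError:
    pass  # the split approach fails here
```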
from araa-search.
yeah I'm still working on that request function
from araa-search.
there will be more changes in the next few days
from araa-search.
Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.
from araa-search.
I'm running tests on various captcha-blocked VPN connections to see which headers and cookies make the request more reliable.
from araa-search.
Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.
I'm still going to specify them just in case, and I'll continue to run tests.
from araa-search.
2captcha is very cheap, but it adds up over time, so I need to make it harder to detect and block so it uses the API less.
from araa-search.
Once the first reCAPTCHA pops up, I've noticed it pops up more often, so I need to find ways to make the request system seem like a real user.
from araa-search.
I did notice that a headless Chrome browser doesn't really use that much memory, and there are undetected versions of it, so that could become a scraping option in the config at some point.
from araa-search.
nvm that might not be practical
from araa-search.
Odd, I can't seem to get the "_GRECAPTCHA" cookie.
from araa-search.
Oh, in Chrome-based browsers it's not stored under cookies; it's stored under local storage.
from araa-search.
Does 2captcha also work for self hosted instances without the hoster having to pay?
from araa-search.
No, but if you want to help test, I can send you some credits.
from araa-search.
If you make an account, email me the email you used and I can send some credit; that could help.
from araa-search.
Some cookies used by Google are region-based, so in the UK you won't get everything I can get testing in NA.
from araa-search.
but "_GRECAPTCHA" is in the EU, UK and NA
from araa-search.
I also do my tests using high-load free VPN servers. To make sure the test hits a reCAPTCHA, I send a request in a private window using "https://www.google.com/search?q=google".
from araa-search.
I think there should be an option in the config file for if you want to use a captcha solver then.
Maybe have something like #106 (the PR isn't that great, so I might redo it) for when the admin chooses not to use a captcha solver.
Having to pay will probably turn most people away from self hosting.
from araa-search.
It's already an option in the config.
from araa-search.
It's turned off by default, but I have it on for testing.
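For reference, such a toggle could look something like this (a hypothetical sketch; the option names are illustrative only and may not match the actual araa-search config):

```python
# Hypothetical config fragment; option names are illustrative, not the project's.
USE_CAPTCHA_SOLVER = False   # off by default, so self-hosters don't need an API key
CAPTCHA_SOLVER_API_KEY = ""  # 2captcha API key, only read when the solver is enabled
```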
from araa-search.
I have done a total of 182 captchas in my tests and only used about $0.54.
from araa-search.
Most of this is from debugging the code; on a live instance it will use the API far less.
from araa-search.
Do you think an alternative search engine would be good for people who choose not to have it, though? Some people might just not want to give their credit card info to them.
Also, how accurate is it? Does it usually work on the first attempt at solving, or does it take several? I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.
from araa-search.
It does it on its first attempt: out of the 182 sent, it got only one wrong. For the server to do everything with it and the web driver, it totals 43.99 seconds.
from araa-search.
results will look something like this in the file
from araa-search.
Google's reCAPTCHA also uses AI, by the way.
from araa-search.
It's been added
from urllib.parse import unquote, urlparse

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted requests from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        data = load_config()
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookies
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
from araa-search.
alternative search engine
An alternative search engine is not a bad idea, and I will be looking into that soon, but I want to finish how requests are made first.
from araa-search.
I'm going to add support to proxy Google autocomplete as a setting, because it's faster than DuckDuckGo's.
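A proxy for Google autocomplete could be sketched like this (an illustration, not the project's code; the suggest endpoint is unofficial and undocumented, so its response format is an assumption worth verifying):

```python
import requests

def parse_suggestions(data: list) -> list:
    # The suggest response shape is assumed to be [query, [suggestion, ...]]
    return data[1]

def google_autocomplete(query: str) -> list:
    # Google's unofficial suggest endpoint; with client=firefox it is
    # generally reported to return plain JSON.
    resp = requests.get(
        "https://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": query},
        timeout=5,
    )
    resp.raise_for_status()
    return parse_suggestions(resp.json())
```

Keeping the parsing in its own function makes it easy to test without hitting the network.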
from araa-search.
Each domain will now have its own persistent session, so I won't need to establish a new https/tls connection for each domain, and I can take advantage of connection reuse. This should greatly improve speeds. Also, each session will be isolated and have its own cookies, etc., making everything more reliable.
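The per-domain session idea can be sketched like this (a minimal illustration, not the project's actual code); makeHTMLRequest could then call get_session(domain).get(...) instead of requests.get:

```python
import requests

# One requests.Session per domain: connections are kept alive and reused,
# and each session's cookies stay isolated from the others.
_sessions = {}

def get_session(domain: str) -> requests.Session:
    # Return the persistent session for a domain, creating it on first use.
    if domain not in _sessions:
        _sessions[domain] = requests.Session()
    return _sessions[domain]
```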
from araa-search.
Video demo of what's possible with persistent sessions and connection reuse. Persistent sessions have already been added to my instance, but I cannot take advantage of connection reuse unless I set up a persistent session for each domain, and that's something I am currently working on.
https://github.com/Extravi/araa-search/assets/98912029/96a7d011-9efe-4e03-9120-578760f97b77
from araa-search.
A good example is autocomplete. Instead of opening a new TLS/SSL connection for every input or request, it can just resume its connection to that domain. This will greatly reduce delay and improve response time.
from araa-search.
I will need to check each request, and each will need its own persistent session. Each session is in memory/RAM, so it's quite fast.
from araa-search.
I'll add it tomorrow with some other stuff.
from araa-search.
the first request will look something like this
any request after will look like this
from araa-search.
Now there is no need to open a new connection every time, saving on response time and making everything faster.
from araa-search.
image at first request
any request after
from araa-search.
If you have any ideas on how I can make the requests better or faster, let me know.
from araa-search.