omkarcloud / botasaurus
The All in One Framework to build Awesome Scrapers.
Home Page: https://www.omkar.cloud/botasaurus/
License: MIT License
I would like to ask about the framework's performance. When parsing a large amount of data, even through queries, the program's memory usage exceeded 9 GB, although it did not cache anything and wrote everything straight to the database. I also got the error "[Errno 24] Too many open files"; I tried to investigate on my own but didn't find anything. My assumption is that when running in parallel, the program retains the startup context even after completion, which could lead to such problems. If I find anything else, I'll definitely add it!
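For the "[Errno 24] Too many open files" part, a common workaround while the suspected leak is investigated is to raise the process's file-descriptor limit. A minimal sketch using the standard library (Linux/macOS only; this is independent of botasaurus itself and does not fix the underlying leak, it only buys headroom):

```python
import resource

def raise_open_file_limit():
    """Raise the soft file-descriptor limit toward the hard limit.

    Only buys headroom before Errno 24 hits again; the real fix is
    closing leaked descriptors in the scraper itself.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if 0 < soft < hard:
        # Unprivileged processes may raise the soft limit up to the hard limit
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling this once at startup, before launching parallel tasks, delays the error; tools like `lsof -p <pid>` can then show which descriptors actually accumulate.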
On Linux (Raspberry Pi).
After running the project, this error appears.
File "/home/username/.local/lib/python3.11/site-packages/selenium/webdriver/common/service.py", line 71, in start
self.process = subprocess.Popen(cmd, env=self.env,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 1024, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.11/subprocess.py", line 1901, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 8] Exec format error: '/home/username/Desktop/projects/aa/bb/cc/build/chromedriver-120'
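"OSError: [Errno 8] Exec format error" almost always means the chromedriver binary was built for a different CPU architecture than the machine, e.g. an x86-64 driver on the Raspberry Pi's ARM CPU. A quick check from the shell (the driver path is the one from the traceback; adjust as needed):

```shell
# Machine architecture: expect armv7l or aarch64 on a Raspberry Pi
uname -m

# Inspect the driver binary; its reported architecture must match the above
# (|| true keeps the check harmless if the path or `file` tool is missing)
file /home/username/Desktop/projects/aa/bb/cc/build/chromedriver-120 || true
```

If the two don't match, an ARM build of chromedriver (e.g. from the distribution's chromium-driver package) is needed instead of the downloaded x86-64 one.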
from botasaurus import *

@request(use_stealth=True)
def scrape_heading_task(request: AntiDetectRequests, data):
    response = request.get('ANY_CF_UAM_WEBSITE')
    print(response.status_code)
    return response.text

scrape_heading_task()
This returns output almost instantaneously for all Cloudflare UAM websites and doesn't work.
Tried with and without proxies.
With the browser alternative, it waits those ~8 seconds, which makes it work. What could be wrong here?
Is there any way to simulate the wait here too? The wait argument doesn't work.
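As a stopgap until the wait argument works in request mode, one can retry the stealth request with a delay until the body stops looking like a Cloudflare challenge page. A sketch with a hypothetical `fetch` callable and challenge check (neither is botasaurus API):

```python
import time

def get_with_wait(fetch, url, attempts=3, delay=8, is_challenge=None):
    """Retry fetch(url), sleeping between attempts, until the page no
    longer looks like a Cloudflare challenge ("Just a moment..." page)."""
    is_challenge = is_challenge or (lambda text: "Just a moment" in text)
    text = fetch(url)
    for _ in range(attempts - 1):
        if not is_challenge(text):
            break
        time.sleep(delay)     # mimic the ~8 s the browser mode waits
        text = fetch(url)
    return text
```

Here `fetch` would wrap something like `lambda u: request.get(u).text`; whether re-requesting actually clears the UAM page depends on the site and is not guaranteed.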
So the issue is with the Google Maps tutorial (https://www.omkar.cloud/botasaurus/docs/google-maps-scraping-tutorial/). I don't think anything is wrong with the code (aside from the missing comma on line 57), but I do seem to have issues with Google Maps itself. I'm testing this from Hong Kong using a residential HK IP. My Internet speed is 1 Gbps and Google has data centers right in HK, so it is usually a very low-latency experience, not to mention good download speeds on anything hosted by Google. I also have access to a US residential proxy and know how to use botasaurus with it if I need to.
So anyway I fired up the bot built from the tutorial. It's scrolling...., scrolling....., and then after a few times it looks like it's stuck as it's not scrolling anymore. I thought that there was an issue with the bot at first so I terminated it and then opened up a chrome browser in incognito mode prepared to inspect some elements using chrome developer tools.
I go to https://www.google.com/maps/search/restaurants+in+delhi/ and I noticed that it loads after scrolling fine initially (just like with the bot), but after scroll loading several times, I hit a chokepoint where google takes forever to load (just like the bot too). It actually took like 5 to 10 minutes to get through the first chokepoint. The second chokepoint is taking even longer. I've been waiting 30 minutes now to load what is after "Smoke Trailer Grill" but the multi colored circle is still spinning.
Are you seeing the same phenomenon on your end for the "restaurants in delhi" query? I can't see the actual end of the page, so I am unable to see the element(s) that would indicate to the bot that it has reached the end of the page while scraping.
I am just seeing a ton of exceptions trying to run the first Selenium scraping task that goes to https://www.omkar.cloud/ and grabs the h1 heading. It's the first Botasaurus script here:
from botasaurus import *

@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")

    # Retrieve the heading element's text
    heading = driver.text("h1")

    # Save the data as a JSON file in output/all.json
    return {
        "heading": heading
    }

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()
It's the first script in what is botasaurus: https://www.omkar.cloud/botasaurus/docs/what-is-botasaurus/
Expected behavior: [What you expect to happen]
Scrape the h1 heading and store it as a string called heading, which is returned when the function is called (and presumably saved automatically into a JSON file by the botasaurus framework).
Actual behavior: [What actually happens]
Lots of errors:
(py311selenium) C:\py311seleniumbot>python main.py
Running
DevTools listening on ws://127.0.0.1:64985/devtools/browser/6520850b-e749-463b-9c45-8e5ecdea678e
[24816:3140:1224/150501.718:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate
[24816:3140:1224/150501.917:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate
Traceback (most recent call last):
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
close_driver(driver)
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
driver.quit()
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
self.close_proxy()
TypeError: 'bool' object is not callable
Error getting page source: Message: invalid session id
Stacktrace:
GetHandleVerifier [0x00916EE3+174339]
(No symbol) [0x00840A51]
(No symbol) [0x00556E8A]
(No symbol) [0x00580980]
(No symbol) [0x00581F8D]
GetHandleVerifier [0x009B4B1C+820540]
sqlite3_dbdata_init [0x00A753EE+653550]
sqlite3_dbdata_init [0x00A74E09+652041]
sqlite3_dbdata_init [0x00A697CC+605388]
sqlite3_dbdata_init [0x00A75D9B+656027]
(No symbol) [0x0084FE6C]
(No symbol) [0x008483B8]
(No symbol) [0x008484DD]
(No symbol) [0x00835818]
BaseThreadInitThunk [0x76FBFCC9+25]
RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]
Traceback (most recent call last):
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
close_driver(driver)
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
driver.quit()
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
self.close_proxy()
TypeError: 'bool' object is not callable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 431, in save_screenshot
self.get_screenshot_as_file(
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 927, in get_screenshot_as_file
png = self.get_screenshot_as_png()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 963, in get_screenshot_as_png
return b64decode(self.get_screenshot_as_base64().encode('ascii'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 975, in get_screenshot_as_base64
return self.execute(Command.SCREENSHOT)['value']
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
self.error_handler.check_response(response)
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
GetHandleVerifier [0x00916EE3+174339]
(No symbol) [0x00840A51]
(No symbol) [0x00556E8A]
(No symbol) [0x00580862]
(No symbol) [0x005A6EBA]
(No symbol) [0x005A2036]
(No symbol) [0x005A1CC2]
(No symbol) [0x005370DB]
(No symbol) [0x005375DE]
(No symbol) [0x005379EB]
GetHandleVerifier [0x009B4B1C+820540]
sqlite3_dbdata_init [0x00A753EE+653550]
sqlite3_dbdata_init [0x00A74E09+652041]
sqlite3_dbdata_init [0x00A697CC+605388]
sqlite3_dbdata_init [0x00A75D9B+656027]
(No symbol) [0x0084FE6C]
(No symbol) [0x00536F4C]
(No symbol) [0x00536AEA]
(No symbol) [0x006A526C]
BaseThreadInitThunk [0x76FBFCC9+25]
RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]
Failed to save screenshot
Failed for input: None
We've paused the browser to help you debug. Press 'Enter' to close.
Traceback (most recent call last):
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
close_driver(driver)
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
driver.quit()
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
self.close_proxy()
TypeError: 'bool' object is not callable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\py311seleniumbot\main.py", line 18, in <module>
scrape_heading_task()
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 443, in wrapper_browser
current_result = run_task(data_item, False, 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 411, in run_task
close_driver(driver)
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 249, in close_driver
driver.close()
^^^^^^^^^^^^^^
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
self.execute(Command.CLOSE)
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
self.error_handler.check_response(response)
File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
GetHandleVerifier [0x00916EE3+174339]
(No symbol) [0x00840A51]
(No symbol) [0x00556E8A]
(No symbol) [0x00580862]
(No symbol) [0x005A6EBA]
(No symbol) [0x005A2036]
(No symbol) [0x005A1CC2]
(No symbol) [0x005370DB]
(No symbol) [0x005375DE]
(No symbol) [0x005379EB]
GetHandleVerifier [0x009B4B1C+820540]
sqlite3_dbdata_init [0x00A753EE+653550]
sqlite3_dbdata_init [0x00A74E09+652041]
sqlite3_dbdata_init [0x00A697CC+605388]
sqlite3_dbdata_init [0x00A75D9B+656027]
(No symbol) [0x0084FE6C]
(No symbol) [0x00536F4C]
(No symbol) [0x00536AEA]
(No symbol) [0x006A526C]
BaseThreadInitThunk [0x76FBFCC9+25]
RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]
Reproduces how often: [What percentage of the time does it reproduce?]
It happens every time.
I set up a virtual environment with botasaurus.
Are you aware of this? I cannot download a PDF from a remote site. The headers show the correct content-length, but the content always comes out bigger (it looks like it was decoded as text somehow) and I cannot figure out how to decode it correctly.
Is it possible to correctly download a remote PDF from a Cloudflare-protected site? I guess this happens with all kinds of remote binary files...
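A likely cause of the size growth is reading the body as decoded text: `response.text` runs the raw bytes through a character decoding, which both inflates and corrupts binary data. A minimal sketch of the bytes-only path using the standard library (the stealth request object is assumed to expose raw bytes similarly, e.g. via `response.content`):

```python
from urllib.request import urlopen

def looks_like_pdf(data: bytes) -> bool:
    """Sanity-check the PDF magic bytes before saving."""
    return data.startswith(b"%PDF")

def download_pdf(url: str, path: str) -> int:
    """Download a binary file as raw bytes; never go through a text decode."""
    with urlopen(url, timeout=30) as resp:
        data = resp.read()  # raw bytes; a .text-style decode would corrupt them
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF")
    with open(path, "wb") as f:  # "wb": write bytes, not text
        f.write(data)
    return len(data)
```

With a requests-like API the equivalent is writing `response.content` to a file opened in `"wb"` mode; whether that bypasses the Cloudflare protection is a separate question.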
The framework fails on a fresh installation because it cannot find ChromeDriver: its download logic only goes up to version 114, due to the driver restructuring by the Chromium team for the new Chrome for Testing.
Traceback
Traceback (most recent call last):
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/runpy.py", line 198, in _run_module_as_main
return _run_code(code, main_globals, None,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/runpy.py", line 88, in _run_code
exec(code, run_globals)
File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/Users/shubhamgarg/src/google-maps-scraper/main.py", line 18, in <module>
launch_tasks(*tasks_to_be_run)
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/launch_tasks.py", line 54, in launch_tasks
current_output = task.begin_task(current_data, task_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 214, in begin_task
final = run_task(False, 0)
^^^^^^^^^^^^^^^^^^
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 155, in run_task
create_directories(self.task_path)
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 99, in create_directories
_download_driver()
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 34, in _download_driver
download_driver()
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/download_driver.py", line 50, in download_driver
move_driver()
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/download_driver.py", line 39, in move_driver
move_chromedriver()
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/download_driver.py", line 38, in move_chromedriver
shutil.move(src_path, dest_path)
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/shutil.py", line 845, in move
copy_function(src, real_dst)
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/shutil.py", line 436, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/shutil.py", line 256, in copyfile
with open(src, 'rb') as fsrc:
^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'build/115/chromedriver'
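For Chrome 115+ the drivers moved to the Chrome for Testing distribution, so any downloader capped at the old version-114 bucket fails like this. A sketch of the new download URL layout (the bucket path follows the Chrome for Testing endpoints, but treat the exact pattern as an assumption and verify against the official known-good-versions JSON):

```python
def cft_driver_url(version: str, platform: str = "linux64") -> str:
    """Build a chromedriver download URL under the Chrome for Testing scheme.

    platform is one of the CfT platform labels, e.g. linux64, mac-arm64,
    mac-x64, win32, win64 (assumed from the CfT endpoint layout).
    """
    base = "https://storage.googleapis.com/chrome-for-testing-public"
    return f"{base}/{version}/{platform}/chromedriver-{platform}.zip"
```

A fixed downloader would resolve the installed Chrome's full version first, then fetch from this layout instead of the retired chromedriver.storage.googleapis.com bucket.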
The build fails without the packaging package.
Line 6 in 34ad082
Hello. The function is as follows:
def confirm_email():
    link = bt.TempMail.get_email_link_and_delete_mailbox(email)
    driver.get(link)
However, this way it tries to confirm using the first link it finds, and I think it cannot find my confirmation link. What I mean is:
def confirm_email():
    link = bt.TempMail.get_email_link_and_delete_mailbox(email)
    if link and link.startswith("https://www.xxx.com/index.php?app=core&module=global&section=register&do=auto_validate"):
        driver.get(link)
I want it to open only the link that contains auto_validate, or whose beginning matches what I specified in the function. I couldn't find out how to do this. Could you help? Right now it opens xxx.com directly.
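One way to express the filtering, assuming the candidate links can be collected into a list first (`pick_confirmation_link` is a hypothetical helper, not botasaurus API):

```python
def pick_confirmation_link(links, marker="do=auto_validate"):
    """Return the first link containing the marker substring, else None."""
    for link in links:
        if link and marker in link:
            return link
    return None
```

`driver.get()` would then only be called when `pick_confirmation_link` returns a link, instead of navigating to whatever link the mailbox yields first.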
I used
@browser(
    headless=True,
    profile='my-profile',
    proxy="http://your_proxy_address:your_proxy_port",
    user_agent=bt.UserAgents.user_agent_106
)
but it does not work, and I don't know where the profile was created.
Thanks!
Hey, Brave is a Chromium browser too, but the framework still asks me to download Google Chrome. I believe undetected_chromedriver is compatible with Brave as well. I would appreciate it if you could support this.
Thank you!
Running
JavaScript Error Call to 'launch' failed:
scrape_heading_task()
at (/root/folder/file.py:16)
current_result = run_task(data_item, False, 0)
at wrapper_browser (/usr/local/lib/python3.10/dist-packages/botasaurus/decorators.py:650)
driver = create_driver(data, options, desired_capabilities)
at run_task (/usr/local/lib/python3.10/dist-packages/botasaurus/decorators.py:528)
return do_create_stealth_driver(
at run (/usr/local/lib/python3.10/dist-packages/botasaurus/create_stealth_driver.py:282)
chrome = launch_chrome(start_url, options._arguments)
at do_create_stealth_driver (/usr/local/lib/python3.10/dist-packages/botasaurus/create_stealth_driver.py:229)
instance = ChromeLauncherAdapter.launch(**kwargs)
at launch_chrome (/usr/local/lib/python3.10/dist-packages/botasaurus/create_stealth_driver.py:102)
response = chrome_launcher.launch(kwargs, timeout=300)
at launch (/usr/local/lib/python3.10/dist-packages/botasaurus/chrome_launcher_adapter.py:12)
... across the bridge ...
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)
Error: connect ECONNREFUSED 127.0.0.1:39307
^
Error: connect ECONNREFUSED 127.0.0.1:39307
When using the AntiDetectDriver from Botasaurus to access websites that generate JavaScript alerts, the standard Selenium Alert class methods do not work. This incompatibility makes it impossible to handle JavaScript alerts, which is a crucial feature for many web scraping tasks. There is also a broader need to access native Selenium functions and classes while using the modified driver.

Steps to reproduce:
1. Use AntiDetectDriver to visit a website that generates a JavaScript alert (e.g., http://www.restaurant-schwabenstuben.de/).
2. Attempt to handle the alert using Selenium's Alert class.
3. Observe that the Alert class methods do not work with AntiDetectDriver.

Expected behavior: The Alert class methods should work seamlessly with AntiDetectDriver, allowing users to handle JavaScript alerts on web pages.

Actual behavior: The Alert class methods are incompatible with AntiDetectDriver, causing an inability to interact with JavaScript alerts on web pages.

Reproduces how often: 100% of the time when encountering JavaScript alerts with AntiDetectDriver.

Additional context: The inability to use Selenium's native alert handling with AntiDetectDriver significantly limits the driver's usefulness for scraping tasks that encounter JavaScript alerts. More generally, integrating native Selenium functions and classes with AntiDetectDriver would enhance its utility.
selenium.common.exceptions.UnexpectedAlertPresentException: Alert Text: [Alert text]
Message: unexpected alert open: {Alert text : [Alert text]}
(Session info: chrome=[version])
Stacktrace:
[Full stack trace]
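For reference, this is roughly the standard Selenium alert-handling pattern the reporter is attempting; the issue is that it fails under AntiDetectDriver. A duck-typed sketch that works with any object exposing the remote-WebDriver `switch_to.alert` API:

```python
def accept_alert_if_present(driver):
    """Accept a JavaScript alert via Selenium's standard API, if one is open.

    Returns the alert text, or None when no alert is present or the
    driver does not support the switch_to.alert interface (which is
    exactly what this issue reports for AntiDetectDriver).
    """
    try:
        alert = driver.switch_to.alert  # standard Selenium entry point
        text = alert.text
        alert.accept()                  # dismissing via .dismiss() also exists
        return text
    except Exception:
        return None
```

In plain Selenium one can also wait for the alert with `WebDriverWait(driver, t).until(EC.alert_is_present())`; whether any of this can be made to work through AntiDetectDriver is the open question here.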
Possible solutions:
1. Investigate the incompatibility between Selenium's Alert class and AntiDetectDriver.
2. Add support within AntiDetectDriver for handling JavaScript alerts.
3. Expose native Selenium functions and classes through AntiDetectDriver.

Google Maps scraper fails on many queries (around 10k and more).
The work machine is a Windows Server with 4 GB RAM (enough for 16 threads, as I tested).
Actual behavior:
Error:
Failed to save screenshot
Closing Browser
Traceback (most recent call last):
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 192, in run_task
close_driver(driver)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 181, in close_driver
driver.close()
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\bose_driver.py", line 335, in close
return super().close()
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
self.execute(Command.CLOSE)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
self.error_handler.check_response(response)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: disconnected: Unable to receive message from renderer
(failed to check if window was closed: disconnected: not connected to DevTools)
(Session info: chrome=116.0.5845.141)
Stacktrace:
GetHandleVerifier [0x005B37C3+48947]
(No symbol) [0x00548551]
(No symbol) [0x0044C92D]
(No symbol) [0x0043E26E]
(No symbol) [0x0043D09F]
(No symbol) [0x0043D678]
(No symbol) [0x0043C695]
(No symbol) [0x00435811]
(No symbol) [0x00435AC4]
(No symbol) [0x0049D688]
(No symbol) [0x00495053]
(No symbol) [0x004716C7]
(No symbol) [0x0047284D]
GetHandleVerifier [0x007FFDF9+2458985]
GetHandleVerifier [0x0084744F+2751423]
GetHandleVerifier [0x00841361+2726609]
GetHandleVerifier [0x00630680+560624]
(No symbol) [0x0055238C]
(No symbol) [0x0054E268]
(No symbol) [0x0054E392]
(No symbol) [0x005410B7]
BaseThreadInitThunk [0x745962C4+36]
RtlSubscribeWnfStateChangeNotification [0x77191B69+1081]
RtlSubscribeWnfStateChangeNotification [0x77191B34+1028]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Admin\Desktop\google-maps-scraper-master\main.py", line 19, in <module>
launch_tasks(*tasks_to_be_run)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\launch_tasks.py", line 54, in launch_tasks
current_output = task.begin_task(current_data, task_config)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 219, in begin_task
final = run_task(False, 0)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 214, in run_task
close_driver(driver)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 181, in close_driver
driver.close()
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\bose_driver.py", line 335, in close
return super().close()
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
self.execute(Command.CLOSE)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
self.error_handler.check_response(response)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: disconnected: not connected to DevTools
(failed to check if window was closed: disconnected: not connected to DevTools)
(Session info: chrome=116.0.5845.141)
Stacktrace:
GetHandleVerifier [0x005B37C3+48947]
(No symbol) [0x00548551]
(No symbol) [0x0044C92D]
(No symbol) [0x0043D249]
(No symbol) [0x0043D79A]
(No symbol) [0x0043D738]
(No symbol) [0x004326FD]
(No symbol) [0x00432F8D]
(No symbol) [0x0049D288]
(No symbol) [0x00495053]
(No symbol) [0x004716C7]
(No symbol) [0x0047284D]
GetHandleVerifier [0x007FFDF9+2458985]
GetHandleVerifier [0x0084744F+2751423]
GetHandleVerifier [0x00841361+2726609]
GetHandleVerifier [0x00630680+560624]
(No symbol) [0x0055238C]
(No symbol) [0x0054E268]
(No symbol) [0x0054E392]
(No symbol) [0x005410B7]
BaseThreadInitThunk [0x745962C4+36]
RtlSubscribeWnfStateChangeNotification [0x77191B69+1081]
RtlSubscribeWnfStateChangeNotification [0x77191B34+1028]
Reproduces how often:
Every time I start it, but only after 30 minutes or more of running.
My log looks like that:
[7080:3484:0907/152340.096:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
[7080:3484:0907/152340.107:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
Done: V and B Le Mans
[7080:3484:0907/152342.742:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
[7080:3484:0907/152342.762:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
Done: V and B La Roche Nord
Filtered 5 links from 5.
View written JSON file at output/vandb-fr-in-france.json
View written CSV file at output/vandb-fr-in-france.csv
Closing Browser
Closed Browser
View Final Screenshot at tasks/1112/final.png
View written JSON file at output/all.json
Creating Driver with window_size=1920,1080 and user_agent=Mozilla/5.0 (Windows NT 10.0)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
DevTools listening on ws://127.0.0.1:63583/devtools/browser/54835e3a-595b-4cce-8ce0-c9d1
f0639475
Launched Browser
[6804:3312:0907/152354.671:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
[6804:3312:0907/152354.717:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
Fetched 5 links.
Creating Driver with window_size=1920,1080 and user_agent=Mozilla/5.0 (Windows NT 10.0)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
This library needs mobile proxy support, which requires running an IP-changer request before the browser opens, etc.
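A sketch of what such support might look like: call the proxy provider's IP-change endpoint, wait for the rotation, then launch the browser. The rotation URL and the injectable fetch/sleep hooks are assumptions, not existing botasaurus API:

```python
import time
from urllib.request import urlopen

def rotate_ip(change_url, wait_seconds=10, fetch=None, sleep=time.sleep):
    """Hit the mobile proxy's IP-change endpoint, then pause while it rotates.

    change_url is whatever rotation URL the proxy provider supplies.
    fetch/sleep are injectable for testing; by default the endpoint is
    fetched over HTTP and the process sleeps wait_seconds.
    """
    fetch = fetch or (lambda url: urlopen(url, timeout=30).read())
    fetch(change_url)          # trigger the rotation
    sleep(wait_seconds)        # give the proxy time to switch IPs
```

The framework would need to run such a hook before `@browser` creates the driver, so each task starts on a fresh mobile IP.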
Each selenium.common.exceptions.InvalidSessionIdException error breaks execution of the bose.launch_tasks.launch_tasks function: an InvalidSessionIdException raised inside task.run(self, driver: BoseDriver, data: any) ends with
TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'
Expected behavior: the broken task can be finished normally.
Actual behavior: the broken task stops the whole process; subsequent tasks are not executed.
Reproduces how often: for sites with bot detection, 99% of cases.
Can't reproduce on the host machine, only inside a Docker container.
Full stack-trace:
Traceback (most recent call last):
2023-07-16T19:02:56.132915200Z File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 210, in run_task
2023-07-16T19:02:56.132923000Z close_driver(driver)
2023-07-16T19:02:56.132927600Z File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 203, in close_driver
2023-07-16T19:02:56.132939700Z driver.close()
2023-07-16T19:02:56.133008400Z File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 551, in close
2023-07-16T19:02:56.133173900Z self.execute(Command.CLOSE)
2023-07-16T19:02:56.133260300Z File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 429, in execute
2023-07-16T19:02:56.133411400Z self.error_handler.check_response(response)
2023-07-16T19:02:56.133519800Z File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
2023-07-16T19:02:56.133590700Z raise exception_class(message, screen, stacktrace)
2023-07-16T19:02:56.133747600Z selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
2023-07-16T19:02:56.133802800Z Stacktrace:
2023-07-16T19:02:56.133812700Z #0 0x55a7b015a233 <unknown>
2023-07-16T19:02:56.133817900Z #1 0x55a7afe89770 <unknown>
2023-07-16T19:02:56.133822800Z #2 0x55a7afeb9589 <unknown>
2023-07-16T19:02:56.133826700Z #3 0x55a7afee4b86 <unknown>
2023-07-16T19:02:56.133830800Z #4 0x55a7afee0dea <unknown>
2023-07-16T19:02:56.133834700Z #5 0x55a7afee0516 <unknown>
2023-07-16T19:02:56.133838800Z #6 0x55a7afe593a3 <unknown>
2023-07-16T19:02:56.133843100Z #7 0x55a7b011a114 <unknown>
2023-07-16T19:02:56.133858100Z #8 0x55a7b011df67 <unknown>
2023-07-16T19:02:56.133863200Z #9 0x55a7b01286b0 <unknown>
2023-07-16T19:02:56.133867700Z #10 0x55a7b011ebb3 <unknown>
2023-07-16T19:02:56.133871100Z #11 0x55a7b00ec95a <unknown>
2023-07-16T19:02:56.133874900Z #12 0x55a7afe57b83 <unknown>
2023-07-16T19:02:56.133878600Z #13 0x7f92a414e18a <unknown>
2023-07-16T19:02:56.133882400Z
2023-07-16T19:02:56.133886500Z
2023-07-16T19:02:56.133890300Z During handling of the above exception, another exception occurred:
2023-07-16T19:02:56.133893900Z
2023-07-16T19:02:56.133898000Z Traceback (most recent call last):
2023-07-16T19:02:56.133902300Z File "/code/main.py", line 5, in <module>
2023-07-16T19:02:56.133907800Z launch_tasks(*tasks_to_be_run)
2023-07-16T19:02:56.133919000Z File "/code/venv/lib/python3.11/site-packages/bose/launch_tasks.py", line 54, in launch_tasks
2023-07-16T19:02:56.134112000Z current_output = task.begin_task(current_data, task_config)
2023-07-16T19:02:56.134164700Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134195200Z File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 237, in begin_task
2023-07-16T19:02:56.134256000Z final = run_task(False, 0)
2023-07-16T19:02:56.134311100Z ^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134322600Z File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 221, in run_task
2023-07-16T19:02:56.134434300Z end_task(driver)
2023-07-16T19:02:56.134487500Z File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 149, in end_task
2023-07-16T19:02:56.134570900Z task.end()
2023-07-16T19:02:56.134647300Z File "/code/venv/lib/python3.11/site-packages/bose/task_info.py", line 38, in end
2023-07-16T19:02:56.134716200Z self.data["duration"] = format_time_diff(self.data["start_time"],self.data["end_time"])
2023-07-16T19:02:56.134774900Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134807900Z File "/code/venv/lib/python3.11/site-packages/bose/task_info.py", line 11, in format_time_diff
2023-07-16T19:02:56.134914700Z time_diff = end_time - start_time
2023-07-16T19:02:56.134991400Z ~~~~~~~~~^~~~~~~~~~~~
2023-07-16T19:02:56.135022200Z TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'
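The final TypeError suggests `start_time` was stored or deserialized as an ISO string while `end_time` is a datetime, so the subtraction in format_time_diff fails. A defensive version of the computation (hypothetical replacement; bose's real format_time_diff may format the result differently):

```python
from datetime import datetime

def format_time_diff(start_time, end_time):
    """Compute a duration even when a timestamp was serialized to a string."""
    if isinstance(start_time, str):
        start_time = datetime.fromisoformat(start_time)  # coerce back to datetime
    if isinstance(end_time, str):
        end_time = datetime.fromisoformat(end_time)
    seconds = (end_time - start_time).total_seconds()
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} sec"
```

Coercing both operands makes the cleanup path robust regardless of whether the task data came from memory or from a JSON round-trip.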
Hi, congratulations on your work. This is an amazing repository. I was checking the source code and wondering whether it fits a library I'm developing, which runs asynchronous requests against WebDrivers/Winium. I didn't find any unit tests to help me check. This is the library. Let me know if you think it is possible to adapt/extend your code to use it, please. If so, I'd be happy to play around and send some related PRs. Thank you!
Why are we doing this? What use cases does it support? What is the expected outcome?
I have a fresh library to interact with WebDrivers and Winium, but the code is quite verbose. I think your code may make my library simpler to use.
Use another repository to wrap my code, or develop it myself.
When I run the Docker service on a real server and make a request, I get the following error. However, when I run it on my local computer, the error does not appear and everything works properly.
v12tj Waiting 10 seconds before connecting to Chrome...
v12tj 10.0.0.2 - - [05/Feb/2024 13:56:03] "GET /scrape?url=https://someurl.com HTTP/1.1" 500 -
v12tj INFO:werkzeug:10.0.0.2 - - [05/Feb/2024 13:56:03] "GET /scrape?url=https://semizotomotivburdur.sahibinden.com HTTP/1.1" 500 -
v12tj Traceback (most recent call last):
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1488, in __call__
v12tj return self.wsgi_app(environ, start_response)
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1466, in wsgi_app
v12tj response = self.handle_exception(e)
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1463, in wsgi_app
v12tj response = self.full_dispatch_request()
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 872, in full_dispatch_request
v12tj rv = self.handle_user_exception(e)
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 870, in full_dispatch_request
v12tj rv = self.dispatch_request()
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 855, in dispatch_request
v12tj return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
v12tj File "/app/main.py", line 20, in scrape
v12tj result = parser(dealer_url)
v12tj File "/app/boto_scraper.py", line 91, in parser
v12tj return scrape_dealer_page()
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/decorators.py", line 633, in wrapper_browser
v12tj current_result = run_task(data_item, False, 0)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/decorators.py", line 484, in run_task
v12tj driver = create_driver(
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 263, in run
v12tj return do_create_stealth_driver(
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 234, in do_create_stealth_driver
v12tj bypass_detection(remote_driver, raise_exception)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 193, in bypass_detection
v12tj wait_till_cloudflare_leaves(driver, previous_ray_id, raise_exception)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 109, in wait_till_cloudflare_leaves
v12tj current_ray_id = get_rayid(driver)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 91, in get_rayid
v12tj ray = driver.text(".ray-id code")
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/anti_detect_driver.py", line 172, in text
v12tj return el.text
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webelement.py", line 84, in text
v12tj return self._execute(Command.GET_ELEMENT_TEXT)['value']
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webelement.py", line 396, in _execute
v12tj return self._parent.execute(command, params)
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 429, in execute
v12tj self.error_handler.check_response(response)
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
v12tj raise exception_class(message, screen, stacktrace)
v12tj selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
v12tj from unknown error: cannot determine loading status
v12tj from tab crashed
v12tj (Session info: chrome=120.0.6099.109)
v12tj Stacktrace:
v12tj #0 0x55ae79f57f83
v12tj #1 0x55ae79c10b2b
v12tj #2 0x55ae79bf816d
v12tj #3 0x55ae79bf7882
v12tj #4 0x55ae79bf6586
v12tj #5 0x55ae79bf644a
v12tj #6 0x55ae79bf47e1
v12tj #7 0x55ae79bf518a
v12tj #8 0x55ae79c0607c
v12tj #9 0x55ae79c1e7c1
v12tj #10 0x55ae79c246bb
v12tj #11 0x55ae79bf592d
v12tj #12 0x55ae79c1e459
v12tj #13 0x55ae79ca9204
v12tj #14 0x55ae79c89e53
v12tj #15 0x55ae79c51dd4
v12tj #16 0x55ae79c531de
v12tj #17 0x55ae79f1c531
v12tj #18 0x55ae79f20455
v12tj #19 0x55ae79f08f55
v12tj #20 0x55ae79f210ef
v12tj #21 0x55ae79eec99f
v12tj #22 0x55ae79f45008
v12tj #23 0x55ae79f451d7
v12tj #24 0x55ae79f57124
v12tj #25 0x7feac06e3044
# my docker-compose.yaml
version: "3"
services:
  botoscrape:
    restart: "no"
    container_name: botasaurus
    shm_size: 2gb
    build:
      dockerfile: Dockerfile
      context: .
    volumes:
      - ./output:/app/output
      - ./tasks:/app/tasks
      - ./profiles:/app/profiles
      - ./profiles.json:/app/profiles.json
      - ./local_storage.json:/app/local_storage.json
    ports:
      - "8191:9090"
    command: ["python", "-u", "main.py"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/"]
      interval: 1m
      timeout: 10s
      retries: 3
# Dockerfile
FROM chetan1111/botasaurus:latest
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN mkdir app
WORKDIR /app
COPY . /app
Hello, I saw this mentioned in a post about undetected-chromedriver and decided to check it out, but I couldn't bypass a certain website, and I imagine some others using the same tech would have the same issue.
bet365.com doesn't load the main page, and other pages are super inconsistent: it works about 1 in 20 times, so it does work; I just need to find what pattern makes it consistent. The code I've tried (with occasional success) is the one from the example:
from botasaurus import *
from botasaurus.create_stealth_driver import create_stealth_driver
@browser(
    create_driver=create_stealth_driver(
        start_url="https://www.bet365.com/#/AC/B151/C1/D50/E3/F163/",
        wait=8,  # it seems like the wait doesn't matter
    ),
)
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.prompt()
    heading = driver.text('h1')
    return heading

scrape_heading_task()
By the way, this website is only accessible via undetected-chromedriver when using a workaround of disconnecting and reconnecting to the driver, so I imagine botasaurus would need something similar.
Python: 3.11
OS: Ubuntu 23
I wanted to share my solution in case it helps someone else who might run into this issue in the future.
npm install got-scraping-export
result = run_parallel(run, used_data, n)
at wrapper_browser (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:664)
parallel_thread.join(0.2) # time out not to block KeyboardInterrupt
at run_parallel (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:166)
raise self._exception
at join (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:152)
self.result = target(*args, **kwargs)
at function (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:143)
return Parallel(n_jobs=n_workers, backend="threading")(
at execute_parallel_tasks (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:158)
return output if self.return_generator else list(output)
at call (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1952)
yield from self._retrieve()
at _get_outputs (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1595)
self._raise_error_fast()
at _retrieve (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1699)
error_job.get_result(self.timeout)
at _raise_error_fast (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1734)
return self._return_or_raise()
at get_result (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:736)
raise self._result
at _return_or_raise (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:754)
... across the bridge ...
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1605:16)
Error: connect ECONNREFUSED 127.0.0.1:38387
^
Error: connect ECONNREFUSED 127.0.0.1:38387
Keep getting this error with headless=False, this is the config:
@browser(
    window_size=bt.WindowSize.REAL,
    parallel=8,
    create_driver=create_stealth_driver(
        start_url=lambda data: data["link"],
        wait=12
    ),
    add_arguments=add_arguments,
    raise_exception=True,
    headless=False,
    keep_drivers_alive=True,
    cache=True,
    output=None,
    reuse_driver=True,
    block_resources=True,
    block_images=True,
    max_retry=10
)
It works fine with headless=False, using these arguments:
def add_arguments(data, options):
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--no-sandbox')
    options.add_argument('--server')
    options.add_argument('--disable-setuid-sandbox')
    options.add_argument('--no-zygote')
    options.add_argument('--disable-gpu-sandbox')
    options.add_argument('--disable-software-rasterizer')
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--ignore-ssl-errors')
    options.add_argument('--use-gl=swiftshader')
    options.add_argument('--window-size=1920,1080')
Also tried using only these:
def add_arguments(data, options):
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--no-sandbox')
    options.add_argument('--server')
    options.add_argument('--disable-setuid-sandbox')
Tried with both proxy turned on and off, nothing seems to help!
I am following the guide here: https://www.omkar.cloud/botasaurus/docs/sign-up-tutorial/
I pasted the following script:
from botasaurus import *

@browser(
    data=lambda: bt.generate_users(3, country=bt.Country.IN),
    block_resources=True,
    profile=lambda account: account['username'],
    tiny_profile=True,
)
def create_accounts(driver: AntiDetectDriver, account):
    name = account['name']
    email = account['email']
    password = account['password']

    def sign_up():
        driver.type('input[name="name"]', name)
        driver.type('input[type="email"]', email)
        driver.type('input[type="password"]', password)
        driver.click('button[type="submit"]')

    def confirm_email():
        link = bt.TempMail.get_email_link_and_delete_mailbox(email)
        driver.get(link)

    driver.google_get("https://www.omkar.cloud/auth/sign-up/")
    sign_up()
    confirm_email()
    bt.Profile.set_profile(account)

@browser(
    data=lambda: bt.Profile.get_profiles(),
    block_resources=True,
    profile=lambda account: account['username'],
    tiny_profile=True,
)
def take_screenshots(driver: AntiDetectDriver, account):
    username = account['username']
    driver.get("https://www.omkar.cloud/")
    driver.save_screenshot(username)

if __name__ == "__main__":
    create_accounts()
    take_screenshots()
So the example does take 3 screenshots, saves them in output, and each screenshot is saved under the username, but since images are blocked there isn't much to see in the screenshots.
There is a create_accounts.json and a take_screenshots.json, but they only contain 3 null entries. Is this normal?
Also, the tasks directory is empty and doesn't contain any metadata on the bot run.
I ran the sample code and this error occurred:
selenium.common.exceptions.WebDriverException: Message: 'chromedriver-122' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home
I think the system does not wait until the driver download finishes, so it raises the exception.
Here Code example:
from botasaurus import *

@browser(parallel=bt.calc_max_parallel_browsers, block_resources=True, block_images=True, data=["https://www.yahoo.com/", "https://www.google.com", "https://stackoverflow.com/"])
def scrape_heading_task(driver: AntiDetectDriver, data):
    # print("metadata:", metadata)
    print("data:", data)
    # Navigate to the given URL
    driver.get(data)
    # Retrieve the heading element's text
    heading = driver.text("h1")
    title = driver.title
    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading,
        "title": title
    }

if __name__ == '__main__':
    scrape_heading_task()
The scraping stuff is all clear to me, as I have experience with Selenium (and the Beautiful Soup stuff wasn't too bad either), but I am not getting how to properly use anything here: https://github.com/omkarcloud/botasaurus?tab=readme-ov-file#i-have-automated-the-creation-of-user-accounts-now-i-want-to-store-the-user-account-credentials-like-email-and-password-how-to-store-it
Here's the script; its only purpose is to learn how to use these functions:
from botasaurus import *
user = bt.generate_user(country=bt.Country.IN)
print(user)
bt.Profile.set_profile(user)
bt.Profile.set_item("api_key", "BDEC26")
profiles = bt.Profile.get_all_profiles()
print(profiles)
Expected behavior: Save user data in a Chrome profile?
Actual behavior:
Traceback (most recent call last):
File "C:\py311botasaurus\btprofiles.py", line 5, in <module>
bt.Profile.set_profile(user)
File "C:\py311botasaurus\py311botasaurus\Lib\site-packages\botasaurus\profile.py", line 146, in set_profile
self.check_profile()
File "C:\py311botasaurus\py311botasaurus\Lib\site-packages\botasaurus\profile.py", line 83, in check_profile
raise Exception('This method can only be run in run method of Task and when you have given the current profile in the Browser Config.')
Exception: This method can only be run in run method of Task and when you have given the current profile in the Browser Config.
Reproduces how often:
Every time. I have no idea how to use anything here: https://github.com/omkarcloud/botasaurus?tab=readme-ov-file#i-have-automated-the-creation-of-user-accounts-now-i-want-to-store-the-user-account-credentials-like-email-and-password-how-to-store-it
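For what it's worth, the guard that raises here is easy to reproduce without botasaurus: set_profile only works while the framework has bound a "current profile", which happens inside a task run when profile= is passed to the browser config. A library-free sketch of that check (class and attribute names below are hypothetical, not botasaurus's internals):

```python
class ProfileStore:
    """Mimics the guard: profile writes are only legal inside a running task."""
    _current_profile = None  # the framework sets this while a task runs

    @classmethod
    def set_profile(cls, data):
        if cls._current_profile is None:
            raise RuntimeError(
                "This method can only be run inside a task, with the current "
                "profile given in the browser config."
            )
        return {cls._current_profile: data}

# A top-level call (like btprofiles.py line 5) raises, because no task is
# running. Inside a task the framework would first bind the profile:
ProfileStore._current_profile = "user-1"
stored = ProfileStore.set_profile({"email": "a@b.c"})
```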
Python has a wonderful library, webdriver-manager, which lets you easily control driver versions and install the latest ones, or the ones that match the current browser version. This lets you stop worrying about driver selection, and it is convenient across different devices and platforms. How about using it to download the regular (detected) drivers, and using it as a template for the undetected ones? I hope for a positive answer; I really want to help with the development of the project.
I tried to get botasaurus running on NixOS, but the "hello world" test script fails:
heading None
expected result
heading "Elementasaurus helps you become a 10x Web Designer"
# debug log is not helpful
import logging
logging_level = "INFO"
logging_level = "DEBUG"
logging.basicConfig(
    #format='%(asctime)s %(levelname)s %(message)s',
    # also log the logger %(name)s, so we can filter by logger name
    format='%(asctime)s %(name)s %(levelname)s %(message)s',
    level=logging_level,
)
logger = logging.getLogger("test")
# https://github.com/omkarcloud/botasaurus
from botasaurus import *
@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")
    # Retrieve the heading element's text
    heading = driver.text("h1")
    # FIXME heading == None
    print("heading", repr(heading))
    # keep browser open
    #import time; time.sleep(9999)
    # Save the data as a JSON file in output/scrape_heading_task.json
    # "return" would write "null" to the output file
    return {
        "heading": heading
    }

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()
$ python pkgs/python3/pkgs/botasaurus/test-botasaurus.py
Running
2024-01-16 17:36:27,932 selenium.webdriver.common.service DEBUG Started executable: `/nix/store/yjq4z3n7p66l8jp06s8cgq647s6iwm7c-chromedriver-117.0.5938.149/bin/chromedriver` in a child process with pid: 1034600
2024-01-16 17:36:28,440 selenium.webdriver.remote.remote_connection DEBUG POST http://localhost:33991/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"excludeSwitches": ["enable-automation"], "useAutomationExtension": false, "extensions": [], "args": ["--start-maximized", "--window-size=1440,900", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36", "--disable-blink-features=AutomationControlled", "--disable-site-isolation-trials"]}}}}
2024-01-16 17:36:28,443 urllib3.connectionpool DEBUG Starting new HTTP connection (1): localhost:33991
2024-01-16 17:36:31,726 urllib3.connectionpool DEBUG http://localhost:33991 "POST /session HTTP/1.1" 200 860
2024-01-16 17:36:31,727 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":{"capabilities":{"acceptInsecureCerts":false,"browserName":"chrome","browserVersion":"117.0.5938.149","chrome":{"chromedriverVersion":"117.0.5938.149 (e3344ddefa12e60436fa28c81cf207c1afb4d0a9-refs/branch-heads/5938@{#1539})","userDataDir":"/run/user/1000/.org.chromium.Chromium.xNxfuG"},"fedcm:accounts":true,"goog:chromeOptions":{"debuggerAddress":"localhost:34929"},"networkConnectionEnabled":false,"pageLoadStrategy":"normal","platformName":"linux","proxy":{},"setWindowRect":true,"strictFileInteractability":false,"timeouts":{"implicit":0,"pageLoad":300000,"script":30000},"unhandledPromptBehavior":"dismiss and notify","webauthn:extension:credBlob":true,"webauthn:extension:largeBlob":true,"webauthn:extension:minPinLength":true,"webauthn:extension:prf":true,"webauthn:virtualAuthenticators":true},"sessionId":"2dc31237c852897577d0335baf49b4fb"}} | headers=HTTPHeaderDict({'Content-Length': '860', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:36:31,728 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:36:31,729 selenium.webdriver.remote.remote_connection DEBUG POST http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/url {"url": "https://www.omkar.cloud/"}
2024-01-16 17:37:00,778 urllib3.connectionpool DEBUG http://localhost:33991 "POST /session/2dc31237c852897577d0335baf49b4fb/url HTTP/1.1" 200 14
2024-01-16 17:37:00,781 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:00,782 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:37:00,783 selenium.webdriver.remote.remote_connection DEBUG POST http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/element {"using": "css selector", "value": "h1"}
2024-01-16 17:37:01,478 urllib3.connectionpool DEBUG http://localhost:33991 "POST /session/2dc31237c852897577d0335baf49b4fb/element HTTP/1.1" 200 95
2024-01-16 17:37:01,480 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":{"element-6066-11e4-a52e-4f735466cecf":"407A8AABF752A82E39C6A1463C008FA9_element_32"}} | headers=HTTPHeaderDict({'Content-Length': '95', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:01,482 selenium.webdriver.remote.remote_connection DEBUG Finished Request
heading None
2024-01-16 17:37:01,484 selenium.webdriver.remote.remote_connection DEBUG GET http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/url {}
2024-01-16 17:37:01,631 urllib3.connectionpool DEBUG http://localhost:33991 "GET /session/2dc31237c852897577d0335baf49b4fb/url HTTP/1.1" 200 36
2024-01-16 17:37:01,633 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":"https://www.omkar.cloud/"} | headers=HTTPHeaderDict({'Content-Length': '36', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:01,642 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:37:01,646 selenium.webdriver.remote.remote_connection DEBUG DELETE http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/window {}
2024-01-16 17:37:02,299 urllib3.connectionpool DEBUG http://localhost:33991 "DELETE /session/2dc31237c852897577d0335baf49b4fb/window HTTP/1.1" 200 12
2024-01-16 17:37:02,301 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:02,302 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:37:02,308 selenium.webdriver.remote.remote_connection DEBUG DELETE http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb {}
2024-01-16 17:37:02,368 urllib3.connectionpool DEBUG http://localhost:33991 "DELETE /session/2dc31237c852897577d0335baf49b4fb HTTP/1.1" 200 14
2024-01-16 17:37:02,370 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:02,372 selenium.webdriver.remote.remote_connection DEBUG Finished Request
Written
output/scrape_heading_task.json
First I suspected that javascript.require fails to find the NPM dependencies and that require would silently fail, but got_adapter and chrome_launcher_adapter are never used:
/lib/python3.10/site-packages/botasaurus/got_adapter.py
got = require("got-scraping-export")
raise Exception(f"botasaurus/got_adapter.py: require got-scraping-export -> {got}")
/lib/python3.10/site-packages/botasaurus/chrome_launcher_adapter.py
chrome_launcher = require("chrome-launcher")
raise Exception(f"botasaurus/chrome_launcher_adapter.py: require chrome-launcher -> {chrome_launcher}")
Clone my nur-packages repo:
git clone --depth=1 https://github.com/milahu/nur-packages
cd nur-packages
Relevant files:
Start a nix-shell with botasaurus:
nix-shell -E '
let
pkgs = import <nixpkgs> {};
nurRepo = import ./. {};
in
pkgs.mkShell {
buildInputs = [
nurRepo.python3.pkgs.botasaurus
];
}
'
Run the botasaurus test script:
python pkgs/python3/pkgs/botasaurus/test-botasaurus.py
In my aiohttp_chromium I'm creating something similar to botasaurus, so I'm curious how similar projects work.
Any ideas?
I don't want to spend too much time debugging, because I will probably not need this.
Hey,
I just started using your tool and it is truly amazing!
However, I have some trouble to bypass some cloudflare protections.
from botasaurus import *

@browser()
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.google_get("https://vulbis.com")
    driver.prompt()
    heading = driver.text('html')
    return heading

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()
Issue: it keeps refreshing the page and is not able to solve the challenge.
Can you please tell me how to add extensions and persist all settings?
The link to Bose Framework Community page (https://omkar.cloud/community/) reports 404 Page not found
Expected behavior: I'd expect to see the page content.
Actual behavior: Page not found.
Reproduces how often: 100%
Peace be upon you. I am a beginner in programming and I love projects like this one. I would like some help, if possible: I want to connect this project to a SQL Server 2008 database. Please help me.
this should not create any temporary files
import botasaurus
currently, this will create
local_storage.json
output/
profiles/
profiles.json
tasks/
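One way a library can honor this request is to defer all filesystem side effects from import time to first use; a minimal sketch of the pattern (not botasaurus code, names are mine):

```python
import functools
import os

@functools.lru_cache(maxsize=None)
def ensure_dir(path):
    """Create a working directory lazily, on first write, not at import time."""
    os.makedirs(path, exist_ok=True)
    return path

def write_output(path, name, text):
    # Importing the module triggers nothing; only this call creates `path`.
    out_dir = ensure_dir(path)
    with open(os.path.join(out_dir, name), "w") as f:
        f.write(text)
```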
Hi, I want to build a full-stack application scraping Yellow Pages, where the user can enter the Yellow Pages URL they want to scrape and get the scraped data back. I'm having issues integrating Flask and the API routes in the application using the requests module. There is a conflict because botasaurus's @request decorator has the same name as Flask's request object. Do you have any examples of how this can be done? Thanks in advance.
from botasaurus import *
from flask import Flask, jsonify, request
from scraper.yp_usa_scraper import *
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route('/scrape/yp-usa', methods=["POST"])
@request(use_stealth=True)
def scrape_heading_task(request: AntiDetectRequests, data):
    data = request.get_json()
    response = request.get('https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=1475848896&keywords=hairdressers&location=hatfield%2C+hertfordshire')
    return response.text

if __name__ == "__main__":
    # Run the Flask development server
    app.run(debug=True)
AttributeError: 'AntiDetectRequests' object has no attribute 'get_json'
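The AttributeError is the name collision in action: both `from botasaurus import *` and the decorated function's own `request` parameter shadow Flask's `request`, so `request.get_json()` runs on an AntiDetectRequests object, which has no such method. A library-free demonstration of the shadowing, plus the usual fix of aliasing one import (the classes below are stand-ins, not real APIs):

```python
class AntiDetectRequestsStandIn:
    """Stand-in for botasaurus's request object (no get_json method)."""
    def get(self, url):
        return f"GET {url}"

class FlaskRequestStandIn:
    """Stand-in for flask.request."""
    def get_json(self):
        return {"url": "https://example.com"}

# from flask import request
request = FlaskRequestStandIn()
# from botasaurus import *  -> rebinds the same name, shadowing Flask's object
request = AntiDetectRequestsStandIn()

# request.get_json() would now raise AttributeError, matching the reported error.

# Fix: alias the Flask import (e.g. `from flask import request as flask_request`)
# and give the decorated function's parameter a name other than `request`.
flask_request = FlaskRequestStandIn()
payload = flask_request.get_json()
```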
If I run the following code:
from time import sleep  # needed for the simulated delay below
from botasaurus import browser, AntiDetectDriver  # Replace with your actual scraping library

@browser(async_queue=True, close_on_crash=True)
def scrape_data(driver: AntiDetectDriver, data):
    print("Starting a task.")
    print(data)
    sleep(1)  # Simulate a delay, e.g., waiting for a page to load
    print("Task completed.")
    return data

if __name__ == "__main__":
    # Start scraping tasks without waiting for each to finish
    async_queue = scrape_data()  # Initializes the queue

    # Add tasks to the queue
    async_queue.put([1])
    async_queue.put(4)
    async_queue.put([5, 6])

    # Retrieve results when ready
    results = async_queue.get()  # Expects to receive: [1, 2, 3, 4, 5, 6]
It fails because the queue's put method expects a list, and I'm passing it an int. This is solved by changing it to a list.
But my question is: how can I make the program fail fast? Right now the program raises an exception but doesn't finish; it just hangs.
Thanks for the lib, amazing job!
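On the first point, a tiny normalizer keeps put() happy whether you hand it a scalar or a list; a sketch (as_list is my name, not a botasaurus API):

```python
def as_list(data):
    """Wrap scalars in a list so queue.put always receives a list."""
    if isinstance(data, (list, tuple)):
        return list(data)
    return [data]

# async_queue.put(as_list(4)) then behaves like async_queue.put([4])
```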
Hi omkarcloud, I really appreciate your project for solving Cloudflare detection.
I use stealth drivers to scrape cookies (scrape_cookies) with my own proxy providers and particular URLs. However, the browsers are not closed properly after finishing:
- python -m botasaurus.close: not usable as you suggest, because it interrupts concurrent requests.
- driver.close(): closes the browser but not the Chrome instance.
- driver.quit(): this method doesn't work at all.
Here is my change in decorators.py:
# line 339
def close_driver(driver: AntiDetectDriver):
    if tiny_profile:
        save_cookies(driver, driver.about.profile)
    try:
        # driver.close()
        driver.quit()
    except WebDriverException as e:
        # This error occurs due to connectivity issues
        if "not connected to DevTools" in str(e):
            print("Unable to close driver due to network issues")
        else:
            raise
    os.kill(driver.service.process.pid, 9)
    driver.server.process.kill()
    driver.server.process.terminate()
Do you have any suggestion for me?
OS: MacOS
Proxy: datacenter IPs.
The library closed the browsers but not the instances.
OS: Docker on Mac
Dockerfile: Based on this botasaurus-starter.
Proxy: datacenter IPs.
Memory Usage: keeps going up after requests.
from botasaurus import *
from typing import List
from botasaurus.create_stealth_driver import create_stealth_driver
import json
from pydantic import BaseModel
from close import close_chrome

class CookieResponse(BaseModel):
    heading: str
    cookies: List[dict]
    chromeOptions: dict
    remoteAddress: str

def get_proxy(data):
    return data["proxy"]

class Input(BaseModel):
    proxy: str
    url: str | None = "https://www.instacart.com/"

# I have web APIs that call this function
def scape_cookies(input: Input) -> CookieResponse:
    pid = None

    @browser(
        create_driver=create_stealth_driver(
            start_url=input["url"],
        ),
        max_retry=3,
        proxy=input["proxy"],
    )
    def scrape_website_args(driver: AntiDetectDriver, data) -> CookieResponse:
        heading = driver.text('h1')
        cookies = driver.get_cookies()
        serialized_data = json.dumps(cookies)
        nonlocal pid
        pid = driver.service.process.pid
        # I tried these three functions but they don't work.
        # driver.service.process.kill()
        # driver.service.process.terminate()
        # driver.quit()
        return {
            "heading": heading,
            "cookies": cookies,
            "chromeOptions": driver.capabilities['goog:chromeOptions'],
        }

    response = scrape_website_args(input)
    print(response)
    # scrape_website_args.close() => this also does not work, even with reuse_driver=True and keep_driver_alive=True
    return response

if __name__ == "__main__":
    response = scape_cookies(
        {
            "proxy": "proxy here",
            "url": "https://www.instacart.com/",
        }
    )
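As a generic cleanup pattern (not botasaurus-specific), when you hold the chromedriver Popen object (e.g. via driver.service.process) you can terminate gracefully first and escalate to a hard kill only if needed; a stdlib sketch:

```python
import subprocess

def terminate_then_kill(proc: subprocess.Popen, timeout: float = 3.0) -> int:
    """Ask the process to exit (SIGTERM); SIGKILL it if it ignores us."""
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()            # unblockable on POSIX
        return proc.wait()     # reap the process and return its exit code
```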
How can I capture network responses (traffic) using Botasaurus?
Hello,
I would like to know where the Dockerfile is located so that I can update the packages on my end.
Thanks.
I was testing Botasaurus with the stealth driver, but I receive this error in the console:
"Your Node.js version is 12, which is less than 16. To use the stealth and auth proxy features of Botasaurus, you need Node.js 16, Kindly install it by visiting https://nodejs.org/. An exception has occurred, use %tb to see the full traceback."
This occurs even with the examples below. My Node version is 20.11; I tried other versions using NVM, but the error persists. Can you help me?
Locally it works fine; I want to move the scraper to my server, but I faced an error:
Code:
from botasaurus import *
@browser(
    reuse_driver=True,
    keep_drivers_alive=True,
    headless=True,
    block_resources=[
        ".css",
        ".jpg",
        ".jpeg",
        ".png",
        ".svg",
        ".gif",
        ".woff",
        ".pdf",
        ".zip",
    ],
)
def get_url_text(driver: AntiDetectDriver, url: str) -> str:
    driver.get(url)
    soup = driver.bs4()

get_url_text("https://github.com/Nv7-GitHub/googlesearch")
Failed with error:
Running
Message: session not created: Chrome failed to start: exited normally.
(session not created: DevToolsActivePort file doesn't exist)
(The process started from chrome location /opt/google/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x55bdcb62bd93 <unknown>
#1 0x55bdcb30f337 <unknown>
#2 0x55bdcb343bc0 <unknown>
#3 0x55bdcb33f765 <unknown>
#4 0x55bdcb389b7c <unknown>
#5 0x55bdcb37d1e3 <unknown>
#6 0x55bdcb34d135 <unknown>
#7 0x55bdcb34e13e <unknown>
#8 0x55bdcb5efe4b <unknown>
#9 0x55bdcb5f3dfa <unknown>
#10 0x55bdcb5dc6d5 <unknown>
#11 0x55bdcb5f4a6f <unknown>
#12 0x55bdcb5c069f <unknown>
#13 0x55bdcb619098 <unknown>
#14 0x55bdcb619262 <unknown>
#15 0x55bdcb62af34 <unknown>
#16 0x7fc31a837ac3 <unknown>
Seems like something wrong with chromedriver. The current version is:
chromedriver-121 --version
ChromeDriver 121.0.6167.85 (3f98d690ad7e59242ef110144c757b2ac4eef1a2-refs/branch-heads/6167@{#1539})
Can anybody please share code to download an MP3 file from https://example.com/file.mp3? Thank you.
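A minimal stdlib sketch for downloading a binary file in chunks (the URL is the example one from the question; no botasaurus needed):

```python
import shutil
from urllib.request import urlopen

def save_stream(src, dest_path, chunk_size=64 * 1024):
    """Copy any readable binary stream to disk in chunks (memory-friendly)."""
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out, length=chunk_size)
    return dest_path

def download_file(url, dest_path):
    """Download e.g. https://example.com/file.mp3 to dest_path."""
    with urlopen(url) as resp:
        return save_stream(resp, dest_path)
```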
The Extension class should support a directory path where a local extension is saved.
I found a workaround by creating the custom class below, but it would be cool if you could add this to the Extension class. Thanks.
class Extension:
    def __init__(self, path):
        self.path = path

    def load(self, *args, **kwargs):
        return os.path.abspath(self.path)
I believe there was an oversight: the wrong By enum parameter is used in this function.
file: anti_detect_driver.py
def get_elements_or_none_by_xpath(self: WebDriver, xpath, wait=Wait.SHORT):
    try:
        if wait is None:
            return self.find_elements(By.XPATH, xpath)
        else:
            WebDriverWait(self, wait).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, xpath))
            )
            return self.find_elements(By.XPATH, xpath)
    except:
        return None
the line
EC.presence_of_element_located((By.CSS_SELECTOR, xpath))
should be changed to:
EC.presence_of_element_located((By.XPATH, xpath))
Will the docs be back? Want to build a bot with this sweet framework.
When I try to run the hello world script on a VPS (Ubuntu 22):
from botasaurus import *

@browser(headless=True)
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")
    # Retrieve the heading element's text
    heading = driver.text("h1")
    # Save the data as a JSON file in output/scrape_heading_task.json
    return {"heading": heading}

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()
I obtain this error:
(venv) root@1941865-hj59931:~/realm_of_python/sandbox# python botasaurus_collector.py
Running
[INFO] Downloading Chrome Driver. This is a one-time process. Download in progress...
Traceback (most recent call last):
File "/root/realm_of_python/sandbox/botasaurus_collector.py", line 18, in <module>
scrape_heading_task()
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/decorators.py", line 501, in wrapper_browser
current_result = run_task(data_item, False, 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/decorators.py", line 399, in run_task
driver = create_selenium_driver(options, desired_capabilities)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/create_driver_utils.py", line 221, in create_selenium_driver
driver = AntiDetectDriver(
^^^^^^^^^^^^^^^^^
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/anti_detect_driver.py", line 33, in __init__
super().__init__(*args, **kwargs)
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 92, in __init__
super().__init__(
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 272, in __init__
self.start_session(capabilities, browser_profile)
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 364, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 429, in execute
self.error_handler.check_response(response)
File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
(session not created: DevToolsActivePort file doesn't exist)
(The process started from chrome location /opt/google/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x55ba42034fb3 <unknown>
#1 0x55ba41d084a7 <unknown>
#2 0x55ba41d3bc93 <unknown>
#3 0x55ba41d3810c <unknown>
#4 0x55ba41d7aac6 <unknown>
#5 0x55ba41d71713 <unknown>
#6 0x55ba41d4418b <unknown>
#7 0x55ba41d44f7e <unknown>
#8 0x55ba41ffa8d8 <unknown>
#9 0x55ba41ffe800 <unknown>
#10 0x55ba42008cfc <unknown>
#11 0x55ba41fff418 <unknown>
#12 0x55ba41fcc42f <unknown>
#13 0x55ba420234e8 <unknown>
#14 0x55ba420236b4 <unknown>
#15 0x55ba42034143 <unknown>
#16 0x7f1c63bcaac3 <unknown>
Is there any advice?
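"DevToolsActivePort file doesn't exist" on a VPS usually means Chrome is running as root on a display-less machine. A hedged sketch: the standard Chromium switches that commonly fix this, merged into an existing argument list (whether and where botasaurus lets you pass extra switches may vary by version, so treat the merge helper as illustrative):

```python
# Standard Chromium switches that commonly fix "DevToolsActivePort file
# doesn't exist" when Chrome runs as root on a server without a display.
ROOT_VPS_FLAGS = [
    "--no-sandbox",             # root cannot use the Chrome sandbox
    "--disable-dev-shm-usage",  # /dev/shm is small on many VPS/containers
    "--headless=new",           # no X server available
]

def with_root_vps_flags(args):
    """Merge the fix-up flags into an existing arg list, deduplicated."""
    return list(dict.fromkeys(list(args) + ROOT_VPS_FLAGS))
```

Also make sure the installed Chrome and the downloaded chromedriver have matching major versions.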
This is a great project, but I am having issues integrating it into my existing code.
I was previously using UndetectedChromeDriver and would like to replace it with Botasaurus.
The goals are to handle sign-in, get user profiles, and complete some user flows (fill forms, upload documents, and click buttons).
I have created classes to easily integrate each part into the program.
Here is the code for the helper class
import subprocess
import os
from pathlib import Path
import logging
# from os import path
# import random
from time import sleep
# import undetected_chromedriver as uc
# from selenium.webdriver.chrome.options import Options
# from selenium.webdriver.chrome.service import Service
# from webdriver_manager.chrome import ChromeDriverManager
# from Tools.Bot.chrome_launcher_adapter import ChromeLauncherAdapter
# from Tools.Bot.create_stealth_driver import create_stealth_driver
from Tools.Bot.chrome_launcher_adapter import ChromeLauncherAdapter
from Tools.Bot.create_stealth_driver import create_stealth_driver
from selenium.webdriver.chrome.options import Options
from chromedriver_autoinstaller import install
from botasaurus import *
# from botasaurus_proxy_authentication import add_proxy_options
logger = logging.getLogger()
# COPIED FROM chrome-launcher code (https://github.com/GoogleChrome/chrome-launcher/blob/main/src/flags.ts), Mostly same but the extensions, media devices etc are not disabled to avoid detection
DEFAULT_FLAGS = [
# safe browsing service, upgrade detector, translate, UMA
"--disable-background-networking",
# Don't update the browser 'components' listed at chrome://components/
"--disable-component-update",
# Disables client-side phishing detection.
"--disable-client-side-phishing-detection",
# Disable syncing to a Google account
"--disable-sync",
# Disable reporting to UMA, but allows for collection
"--metrics-recording-only",
# Disable installation of default apps on first run
"--disable-default-apps",
# Disable the default browser check, do not prompt to set it as such
"--no-default-browser-check",
# Skip first run wizards
"--no-first-run",
# Disable backgrounding renders for occluded windows
"--disable-backgrounding-occluded-windows",
# Disable renderer process backgrounding
"--disable-renderer-backgrounding",
# Disable task throttling of timer tasks from background pages.
"--disable-background-timer-throttling",
# Disable the default throttling of IPC between renderer & browser processes.
"--disable-ipc-flooding-protection",
# Avoid potential instability of using Gnome Keyring or KDE wallet. crbug.com/571003 crbug.com/991424
"--password-store=basic",
# Use mock keychain on Mac to prevent blocking permissions dialogs
"--use-mock-keychain",
# Disable background tracing (aka slow reports & deep reports) to avoid 'Tracing already started'
"--force-fieldtrials=*BackgroundTracing/default/",
# Suppresses hang monitor dialogs in renderer processes. This flag may allow slow unload handlers on a page to prevent the tab from closing.
"--disable-hang-monitor",
# Reloading a page that came from a POST normally prompts the user.
"--disable-prompt-on-repost",
# Disables Domain Reliability Monitoring, which tracks whether the browser has difficulty contacting Google-owned sites and uploads reports to Google.
"--disable-domain-reliability",
]
class BotasaurusChromeHandler:
    def __init__(self):
        print("ChromeHandler init")
        sleep(5)
        self._driver = self.launch_chrome("https://ca.yahoo.com/?p=us", [])
        create_stealth_driver()
        print("UndetectedChromeHandler launched (Google.com)")

    def driver(self):
        return self._driver

    # @browser(profile='Profile 1',)
    def launch_chrome(self, start_url, additional_args):
        # Set Chrome options
        chrome_options = Options(
            # headless=True,
            # add_argument(r"--user-data-dir=/Users/lifen/Library/Application Support/Google/Chrome/Profile 1"),
        )
        chrome_options.add_argument("--remote-debugging-port=9222")
        # chrome_options.add_argument("--no-sandbox")
        # chrome_options.add_argument("--disable-gpu")
        # chrome_options.add_argument("--disable-extensions")
        # chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--user-data-dir=/Users/lifen/Library/Application Support/Google/Chrome/Profile 1")
        # add_proxy_options(chrome_options)
        unique_flags = list(dict.fromkeys(DEFAULT_FLAGS + additional_args))
        kwargs = {
            "ignoreDefaultFlags": True,
            "chromeFlags": unique_flags,
            "userDataDir": "/Users/MacUser/Library/Application Support/Google/Chrome/Profile 1",
            "port": 9222,
            "headless": False,
            "autoClose": True,
        }
        if start_url:
            kwargs["startingUrl"] = start_url
        instance = ChromeLauncherAdapter.launch(**kwargs)
        return instance
Where the code is used:
import re
import logging
import random
from time import sleep
from configs.configs_model import ConfigsModel
from helpers.jobs_sql import JobsSQL
from helpers.html_page_handler import HTMLPageHandler
from helpers.shared import notification
from models.job_listing import JobListingModel
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.remote.webelement import WebElement
from helpers.botasaurus_chrome_handler import BotasaurusChromeHandler
from botasaurus import *
logger = logging.getLogger()
class IndeedChromeApplier:
    def __init__(self, jobs_sql: JobsSQL, jobs: list):
        print("IndeedChromeApplier init")
        self.jobs = jobs
        self.chrome = BotasaurusChromeHandler()
        # self.chrome.driver().maximize_window()
        driver = bt.create_driver()
        self.driver = driver
        self.page = HTMLPageHandler(driver=driver)
        self.jobs_sql = jobs_sql

    def get_uid(self):
        configs = ConfigsModel()
        uid = configs.user_id
        return uid

    # @browser
    def check_auth(self):
        # driver = self.chrome.driver()
        driver = self.driver
        driver.get("https://profile.indeed.com/")
        sleep(2)
        url = driver.current_url
        substring = "secure"
        print(f"{url=}")
        if substring in url:
            print("Not logged in")
            # Ask the user to try again after they log in
            notification(
                message="Please log in to Indeed.com and try again (y/n): ")
            _input = input("Please log in to Indeed.com and try again (y/n): ")
            if "y" in _input.lower():
                return self.check_auth()
            elif "n" in _input.lower():
                return False
            else:
                sleep(20000)
        elif "profile.indeed.com" in url:
            print("Logged in")
            return True
    def answer_questions(self):
        # Define a WebDriverWait with a timeout of 10 seconds
        wait = WebDriverWait(self.chrome.driver(), 10)
        # Wait for the radio button for commuting/relocation to be clickable and select it
        try:
            commute_option: WebElement = wait.until(
                EC.element_to_be_clickable(
                    (
                        By.XPATH,
                        "//label[@for='input-q_38d8e685bb4b5228c2494ac85bc44d69-0']",
                    )
                )
            )
            commute_option.click()
            sleep(random.uniform(0.7, 2.2))
        except TimeoutException:
            print("Failed to find or click the commute option.")

    def replace_resume(self, job_title):
        print("replace_resume")
        is_upload_resume = (
            "Upload or build a resume for this application"
            in self.chrome.driver().title
        )
        paths = self.get_paths()
        if is_upload_resume:
            print("is_upload")
            # Find the "Replace" link using the full link text
            replace_link = self.page.try_find_element(
                driver=self.chrome.driver(),
                name="Replace",
                by=By.CSS_SELECTOR,
                value='[data-testid="ResumeFileInfoCardReplaceButton-button"]',
            )
            sleep(1)
            if replace_link:
                print("replace_link")
                sleep(1)
                # Find the file input element
                file_input: WebElement = WebDriverWait(self.chrome.driver(), 10).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, 'input[type="file"]')
                    )
                )
                # Send the file path to the file input element
                file_input.send_keys(
                    f"{paths.output_resumes_pdf_dir}/RalphNduwimana-{job_title}.pdf"
                )
                sleep(random.uniform(0.9, 1.8))
                # self.page.click_to_next_page(name="Continue", by=By.CLASS_NAME, value='ia-continueButton ia-Resume-continue css-vw73h2 e8ju0x51')
                notification(message=f"Resume replaced by {job_title}")
                self.page.click_to_go_to_page(
                    name="Continue",
                    by=By.XPATH,
                    value="//div[contains(text(), 'Continue')]",
                )

    def submit_application(self):
        print("review_application")
        notification(message="Reviewing application")
        sleep(1.7)
        notification(message="No cover letter required!")
        submit = self.page.click_to_go_to_page(
            name="Submit your application",
            by=By.XPATH,
            value="//button[contains(@class, 'ia-continueButton')]",
        )
        if submit:
            notification("Application Submitted")
        else:
            notification("Application Submitted", code=0)
        # submit_application_button.click()
        # Wait for 2 seconds for the submission to be completed
        sleep(2)
        # Check if the page contains "Application Submitted"
        application_submitted = (
            "Application Submitted" in self.chrome.driver().page_source
        )
        # Return True if "Application Submitted" was found
        if application_submitted:
            notification("Application submitted successfully!")
            return True
        else:
            print("Application submission failed.")
            return False

    def click_button(self):
        # Logic to click on buttons
        pass

    def type_text(self):
        # Logic to type text
        pass
    def run(self):
        print("IndeedChromeApplier run")
        driver = self.chrome.driver()
        authenticated = self.check_auth()
        jobs_row = self.jobs_sql.load_jobs_by_status(query_status="Generated")
        jobs_data = [job_row for job_row in jobs_row]
        print(str(jobs_data)[0:200])
        if authenticated:
            for data in jobs_data:
                if not data:
                    print("No data in jobs_data")
                job_data = self.convert_tuple_to_dict(data)
                job = JobListingModel(job_data)
                url = job.jobUrl
                print(job.jobUrl)
                page_loaded = self.page.go_to_page(url)
                if not page_loaded:
                    print(f"{url} not loaded")
                    # continue
                if page_loaded:
                    print("page_loaded")
                    application_started = self.page.click_to_go_to_page(
                        name="Apply",
                        by=By.ID,
                        value="indeedApplyButton",
                    )
                    data = re.search(
                        "This job has expired on Indeed",
                        driver.page_source,
                    )
                    # Get True or False
                    expired = data is not None
                    print(f"{expired=}")
                    # sleep(10000)
                    sleep(random.uniform(0.2, 0.5))
                    if not application_started:
                        print("Application not started")
                        sleep(1000)
                    if "indeed" not in driver.current_url:
                        print("Cannot apply on company websites (just indeed.com)")
                        sleep(10000)
                    pages = {
                        "questions": False,
                        "resume": False,
                        "review": False,
                        "work-experience": False,
                        "submitted": False,
                    }
                    try:
                        # there is a page that has not been completed
                        while False in pages.values():
                            print('')
                    except NoSuchElementException:
                        print("Failed to get page")

    def log_in(self, username, password):
        print(f"Starting log_in {username} {password}")
        page = self.page
        try:
            username_bar = page.try_find_element(
                name="username_bar",
                by=By.ID,
                value="session_key",
                driver=self.driver,
            )
            assert username_bar is not None
            username_bar.send_keys(f"{username}")
            password_bar = page.try_find_element(
                name="password_bar", by=By.ID, value="session_password", driver=self.chrome.driver()
            )
            assert password_bar is not None
            password_bar.send_keys(f"{password}")
            password_bar.send_keys(Keys.ENTER)
            print("User logged in")
        except NoSuchElementException:
            print("No such element found")
        except Exception:
            print("Other exception")
        print(f"Finished log_in {username} {password}")

    def log_out(self):
        url = self.chrome.driver().current_url
        print(f"Starting log_out from {url}")
        xpath = (
            "/html/body/div[5]/header/div/nav/ul/li[6]/div/button"
            if "Home" in url
            else "/html/body/header/div/div[2]/div/div/button"
        )
        page = self.page
        icon_button = page.try_find_element(
            driver=self.chrome.driver(),
            name="Log-Out",
            by=By.XPATH,
            value=xpath,
            element_type="button",
        )
        try:
            print(f"{icon_button=}")
            try:
                sign_out_option: WebElement = WebDriverWait(
                    self.chrome.driver(), 10
                ).until(EC.presence_of_element_located((By.LINK_TEXT, "Sign Out")))
                sign_out_option.click()
                print("User logged out")
            except TimeoutException:
                print("Sign Out not found")
        except Exception:
            print("Avatar button not found")
        print(f"Finished log_out from {url}")
I would appreciate any guidance on how to integrate Botasaurus features into my code.
Thanks in advance!!!
Hello. When I try to build a project with botasaurus into a Docker image, I get this error.
45.73 ChefBuildError
45.73
45.73 Backend subprocess exited when trying to invoke get_requires_for_build_wheel
45.73
45.73 You do not have node installed on your system, Kindly install it by visiting https://nodejs.org/
45.73
45.73
45.73 at /usr/local/lib/python3.11/site-packages/poetry/installation/chef.py:164 in _prepare
45.74 160│
45.74 161│ error = ChefBuildError("\n\n".join(message_parts))
45.74 162│
45.74 163│ if error is not None:
45.74 → 164│ raise error from None
45.74 165│
45.74 166│ return path
45.74 167│
45.74 168│ def _prepare_sdist(self, archive: Path, destination: Path | None = None) -> Path:
45.74
45.74 Note: This error originates from the build backend, and is likely not a problem with poetry but with botasaurus-proxy-authentication (1.0.8) not supporting PEP 517 builds. You
can verify this by running 'pip wheel --no-cache-dir --use-pep517 "botasaurus-proxy-authentication (==1.0.8)"'.
Is this okay? Can I just run `pip wheel --no-cache-dir --use-pep517 "botasaurus-proxy-authentication (==1.0.8)"` instead?
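The error message itself points at the fix: the build of botasaurus-proxy-authentication needs Node.js available at install time. A minimal Dockerfile sketch, assuming a Debian-based Python image (the Debian-repo `nodejs` package may be old, but the build only requires that `node` be on PATH; pin a newer Node source if your setup needs it):

```dockerfile
FROM python:3.11-slim

# botasaurus-proxy-authentication needs Node.js at build time
RUN apt-get update \
    && apt-get install -y --no-install-recommends nodejs npm \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir botasaurus
```

Installing Node before the Python dependency step avoids the ChefBuildError without bypassing PEP 517.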
Hello. There is a bug: Error [ERR_MODULE_NOT_FOUND]: Cannot find package 'chrome-launcher' imported from /files/botasaurus/.venv/lib/python3.11/site-packages/javascript/js/deps.js
proxy-chain is required by botasaurus-proxy-authentication, but not by botasaurus itself:
$ grep -r -w -h 'require(' botasaurus/botasaurus/
got = require("got-scraping-export")
chrome_launcher = require("chrome-launcher")
Currently, setup.py (lines 57 to 61 in 34ad082) will also install proxy-chain.
When I try to open the form page through a proxy, the Cloudflare challenge fails; without the proxy it solves.
Code:
import os.path
from botasaurus import *
from botasaurus.create_stealth_driver import create_stealth_driver

@browser(
    # user_agent=bt.UserAgent.REAL,
    # window_size=bt.WindowSize.REAL,
    create_driver=create_stealth_driver(
        start_url="https://dashboard.capsolver.com/passport/login",
    ),
)
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.prompt()
    heading = driver.text('h1')
    return heading

scrape_heading_task()
Thanks
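A common cause here is the proxy itself: Cloudflare frequently blocks datacenter proxy IPs regardless of browser stealth, so a residential or mobile proxy may be needed. If the proxy string format is the issue instead, a small helper (hypothetical name; decorators that accept a proxy generally take a full URL of this shape) to build it:

```python
def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Build a proxy URL like http://user:pass@host:port."""
    auth = f"{user}:{password}@" if user and password else ""
    return f"{scheme}://{auth}{host}:{port}"
```

Testing the same proxy in a plain browser first helps separate "bad proxy IP" from "bad stealth".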
Can you add a C# wrapper?
And can we use this library without decorating a function? If yes, can you give me an example?
I looked at the code, and a lot of the anti-bot logic is based on English strings like "Please verify you are a human". Can you add localization support for that?
Enums and consts could live in a separate folder.
def short_random_sleep(self):
    sleep_for_n_seconds(uniform(2, 4))

def long_random_sleep(self):
    sleep_for_n_seconds(uniform(6, 9))

def sleep_forever(self):
    sleep_forever()
# anti_detect_driver.py, lines 85-92: I think these belong in utils.py, with a reference to an enum in wait.py.
accept_cookies_btn = driver.get_element_or_none_by_selector("button#L2AGLb", None)
# accept_google_cookies.py, line 27: the "button#L2AGLb" selector (and others) can be changed or randomized by Google. You could add a JSON-based configuration parser so users can define their own selectors, captcha solvers, and clickers without forking the repository.
def google_get(self, link, wait=None, accept_cookies=False):
    self.get("https://www.google.com/")
    if accept_cookies:
        accept_google_cookies(self)
    return self.get_by_current_page_referrer(link, wait)

def get_google(self, accept_cookies=False):
    self.get("https://www.google.com/")
    if accept_cookies:
        accept_google_cookies(self)
    # self.get_element_or_none_by_selector('input[role="combobox"]', Wait.VERY_LONG)
# anti_detect_driver.py, lines 342-352: there are many hard-coded Google URLs. First, you could add an enum/consts class for "https://www.google.com/"; second, you could support customizing the URL (letting the user select a regional Google domain).
# You could also add support for searching Google by keyword and listing the results. I think it should be enumerable, because the user may not want to click/navigate to the pages in the first results list (or the second, etc.).
# And there appears to be a spelling error in accept_google_cookies.py, line 25 ("Unabe" should be "Unable"):
raise Exception("Unabe to load Google")
Also, Selenium automatically sets the navigator.webdriver property to true; you need to change it to false.
And you should add --disable-blink-features=AutomationControlled (I haven't looked at all the code; maybe it has already been added).
I haven't used recent Selenium versions, so maybe this has changed.
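As a sketch of the two suggestions above (the flag name is a standard Chromium switch; the JS patch and the helper name are illustrative, and how botasaurus wires them in may differ):

```python
# Standard Chromium switch that stops Blink from exposing automation.
AUTOMATION_FLAG = "--disable-blink-features=AutomationControlled"

# JS commonly injected (e.g. via Selenium 4's
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": ...}))
# to hide navigator.webdriver before page scripts run.
WEBDRIVER_PATCH = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

def stealth_chrome_args(existing_args):
    """Return the arg list with the anti-automation flag added once."""
    args = list(existing_args)
    if AUTOMATION_FLAG not in args:
        args.append(AUTOMATION_FLAG)
    return args
```

Note that with --disable-blink-features=AutomationControlled, recent Chrome versions report navigator.webdriver as false on their own, so the JS patch is a belt-and-braces fallback.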
Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///C:/Users/cntow/Downloads/Compressed/botasaurus-master/botasaurus-master
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [47 lines of output]
C:\Users\cntow\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe: No module named pip
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\importlib\metadata\__init__.py", line 397, in from_name
return next(cls.discover(name=name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 77, in install_javascript_package
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\importlib\metadata\__init__.py", line 861, in distribution
return Distribution.from_name(distribution_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\importlib\metadata\__init__.py", line 399, in from_name
raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for javascript
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\cntow\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
main()
File "C:\Users\cntow\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\cntow\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 132, in get_requires_for_build_editable
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 441, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 480, in run_setup
super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 87, in <module>
File "<string>", line 83, in pre_install
File "<string>", line 79, in install_javascript_package
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\cntow\\AppData\\Local\\Microsoft\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\\python.exe', '-m', 'pip', 'install', 'javascript']' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.