webrecorder / browsertrix-crawler Goto Github PK

View Code? Open in Web Editor NEW

551.0 23.0 69.0 53.67 MB

Run a high-fidelity browser-based crawler in a single Docker container

Home Page: https://crawler.docs.browsertrix.com

License: GNU Affero General Public License v3.0

JavaScript 22.32% Dockerfile 0.57% HTML 2.51% Shell 0.18% TypeScript 74.42%

crawler crawling wacz warc web-archiving web-crawler webrecorder

browsertrix-crawler's Introduction

Browsertrix Crawler 1.x

Browsertrix Crawler is a standalone browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses Puppeteer to control one or more Brave Browser browser windows in parallel. Data is captured through the Chrome Devtools Protocol (CDP) in the browser.

For information on how to use and develop Browsertrix Crawler, see the hosted Browsertrix Crawler documentation.

For information on how to build the docs locally, see the docs page.

Support

Initial support for 0.x version of Browsertrix Crawler, was provided by Kiwix. The initial functionality for Browsertrix Crawler was developed to support the zimit project in a collaboration between Webrecorder and Kiwix, and this project has been split off from Zimit into a core component of Webrecorder.

Additional support for Browsertrix Crawler, including for the development of the 0.4.x version has been provided by Portico.

License

AGPLv3 or later, see LICENSE for more details.

browsertrix-crawler's People

Contributors

Stargazers

Watchers

browsertrix-crawler's Issues

Support Politeness Settings

Support settings for limiting number of pages by domain, possibly using existing settings in puppeteer-cluster.
Possibly options:

Limit number of concurrent pages to same domain
Delay for loading subsequent pages from same domain (supported in puppeteer-cluster)

Crawler SIGTERM by itself while running

Hi there.

When I run browsertrix-crawler on large sites (ex: www.androidcentral.com), after 2-3 hrs of crawling, the crawler will crash by itself - specifically, it looks like for some reason, it SIGTERMS itself.

Terminal output is as below:

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://www.<somewebsite>.com/ --generateWACZ --text --workers 8 --collection androidcentral
Text Extraction: Enabled
Load timeout for https://<somewebsite>.com/968 TimeoutError: Navigation timeout of 90000 ms exceeded 15:23:00.924
    at /app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
Load timeout for https://<somewebsite>.com/793 TimeoutError: Navigation timeout of 90000 ms exceeded 15:23:00.924
    at /app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
Load timeout for https://<somewebsite>.com/595 TimeoutError: Navigation timeout of 90000 ms exceeded15:23:00.924
    at /app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
Load timeout for https://<somewebsite>.com/1145 Error: net::ERR_TOO_MANY_RETRIES at https://<somewebsite>.com/1145
    at navigate (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at runMicrotasks (<anonymous>), errors: 1 (0.05%)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async FrameManager.navigateFrame (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:90:21)   8
    at async Frame.goto (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:789:16)
    at async Crawler.loadPage (/app/crawler.js:427:7)tails/1406
    at async Crawler.module.exports [as driver] (/app/defaultDriver.js:3:3)02
    at async Crawler.crawlPage (/app/crawler.js:258:7)ails/1145
SIGTERM received, exiting
ERRO[12641] error waiting for container: unexpected EOF

Any way to resolve this? thanks!

Windows 1251 (cyrillic) encoded text incorrectly encoded

When scraping a website encoded in Windows Cyrillic (windows-1251), the convertion to UTF-8 is faulty, resulting in tons of пїЅпїЅпїЅпїЅпїЅ strings.

Sample website: https://sattvinfo.net/
Sample ZIM (valid only 7d from now) https://s3.us-west-1.wasabisys.com/org-kiwix-zimit/other/sattvinfo.net_99fa2e51.zim
Recipe took 50mn to generate this 130MB ZIM.

Logs shows 2 instances of [WARNING] Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

Hint on what to do on stackoverflow

website

ZIM

Not all images in srcset are scraped

Working on openzim/zimit#70 I ended up finding that I get different results for different runs of the same query.

Using --url "https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/" on three consecutive runs, I got the following A articles for the first image on that page (img_9104.jpg)

run1

A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=1024&h=765
A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=300&h=224
A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=640&h=478'

run2

A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=1280&h=956
A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=640&h=478

run3

A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=1024&h=765
A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=640&h=478

run4

A/mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=640&h=478

The full list from srcset is:

https://mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=640&amp;h=478 640w,
https://mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=1280&amp;h=956 1280w,
https://mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=150&amp;h=112 150w,
https://mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=300&amp;h=224 300w,
https://mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=768&amp;h=574 768w,
https://mesquartierschinois.files.wordpress.com/2014/07/img_9104.jpg?w=1024&amp;h=765 1024w

The fallback on src is the 640w one which got included in all 3 runs.

Given the very same args (only url and name) were specified, I'm confident this is unrelated to image selection by the browser (it's always using the default mobile device) and the replayer works fine is the (dynamic) requested image is in the ZIM.

I've looked into autofetcher.js but couldn't find a problem. When injected in a running browser, it attempts to fetch all the links. I couldn't get it to output to the crawler process as this is running inside the launched browser so I'm not sure what is exactly happening. When testing manually, I had to disable CORS in chrome but that shouldn't apply here.

Note: I am mentioning ZIM articles here as this is easier to manipulate but checked the WARC files and it matches the content of the ZIM.

Wordpress lightbox plugin content skipped

When trying to crawl a website using the Wonder Lightbox Wordpress plugin, content using the lightbox (marked with class wplightbox) are unfortunately skipped.

Here are examples: https://www.wonderplugin.com/wordpress-lightbox/#examples.

docker vhdx issue

Last week I ran a Heritrix crawl (on Ubuntu 18LTS), resulting in about 23GB.

This week I try to run a Browsertrix crawl of the same domain (on WSL2 Ubuntu 20LTS with 8 workers), but have problems with the Docker ext4.vhdx file on Windows. It keeps inflating to all the available space left on the hard disk, after which the crawl hangs. The vhdx grows to over 60GB, while the warcs of the Browsertrix crawl are about 6GB.

Is this a Docker issue or is this influenced by Browerstrix? Solution for now is stopping the crawl (doesn't always succeed), purging WSL data in Docker, which also deletes the Browsertrix image. I could try to relocate the vhdx, but this is a crawl of less then 4000 pages (and quite some images). What would happen with a 40K pages crawl?

Improved Crawl Log Data

Determine what crawl logs should be generated to help debug crawl.
The page list (stored in pages.jsonl) already includes info on what pages were visited and when.
Possible additions:

Page crawl graph data? (seed and crawl depth of each page)
Behavior state log?
Page resources? (which pages were each resources loaded from?)

Thinking the crawl graph data per page (seed and depth) and additional logging of behaviors would be most useful. The page resources will of course be available in the cdx.

In future, a pageId may also be added to WARC headers to be able to better map resources->page

'create profile' should accept cookies from Facebook

Automatic script 'create profile' manages to login into Facebook with username and password,
but it doesn't click the 'Accept All' cookies button.

You can see that when using debugScreenshot option.

When browsing in replayweb, 'accept cookies' window pops up on every page.

Add Crawl Depth Option

To support #12, will also need to add a crawlDepth option to support crawling to various depths, eg. one-hop-out

'Browsertrix Profile Create" --> NameError

Hey everybody,

I've run into an issue with browsertrix. I've managed to install browsertrix itself but when I try to fire it up and create a profile to start crawling, I keep getting the following notice:

`Traceback (most recent call last):
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\docker-4.4.4-py3.9.egg\docker\api\client.py", line 159, in init
self._custom_adapter = NpipeHTTPAdapter(
NameError: name 'NpipeHTTPAdapter' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\Scripts\browsertrix-script.py", line 33, in
sys.exit(load_entry_point('browsertrix-cli==0.1.0.dev0', 'console_scripts', 'browsertrix')())
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\click-8.0.0a1-py3.9.egg\click\core.py", line 1025, in call
return self.main(*args, **kwargs)
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\click-8.0.0a1-py3.9.egg\click\core.py", line 955, in main
rv = self.invoke(ctx)
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\click-8.0.0a1-py3.9.egg\click\core.py", line 1517, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\click-8.0.0a1-py3.9.egg\click\core.py", line 1514, in invoke
super().invoke(ctx)
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\click-8.0.0a1-py3.9.egg\click\core.py", line 1279, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\click-8.0.0a1-py3.9.egg\click\core.py", line 710, in invoke
return callback(*args, **kwargs)
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\browsertrix_cli-0.1.0.dev0-py3.9.egg\browsertrix_cli\profile.py", line 42, in profile
docker_api = docker.from_env(version='auto')
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\docker-4.4.4-py3.9.egg\docker\client.py", line 96, in from_env
return cls(
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\docker-4.4.4-py3.9.egg\docker\client.py", line 45, in init
self.api = APIClient(*args, **kwargs)
File "C:\Users\u0102607\AppData\Local\Programs\Python\Python39\lib\site-packages\docker-4.4.4-py3.9.egg\docker\api\client.py", line 164, in init
raise DockerException(
docker.errors.DockerException: Install pypiwin32 package to enable npipe:// support`

The way I see it however, is that I've updated and installed everything necessary:
Pypiwin32 package is installed
pip 21.1.3 and python 3.9 is installed

Any idea's? Googling for the nameerror brings me to a similar issue with another application, the suggested fixes there don't resolve the problem for browsertrix.

Thanks in advance

profile.tar.gz not found

This is the error I get when trying to use a profile I created. I can see the file. its there.

`
Error: Command failed: tar xvfz /home/bill/crawls/profiles/profile.tar.gz
tar (child): /home/bill/crawls/profiles/profile.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

This is my commandline

docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --profile /home/bill/crawls/profiles/profile.tar.gz --url https://www.dummyhiddenurl.com/ --screencastPort 9037 --behaviors autoscroll,autoplay,autofetch,siteSpecific --scopeType host --workers 10 --headless --newContext window

Here is the full error:

`tar (child): /home/bill/crawls/profiles/profile.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
node:child_process:903
throw err;
^

Error: Command failed: tar xvfz /home/bill/crawls/profiles/profile.tar.gz
tar (child): /home/bill/crawls/profiles/profile.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

at checkExecSyncError (node:child_process:826:11)
at Object.execSync (node:child_process:900:15)
at module.exports.loadProfile (/app/util/browser.js:10:19)
at new Crawler (/app/crawler.js:59:23)
at Object.<anonymous> (/app/main.js:16:1)
at Module._compile (node:internal/modules/cjs/loader:1095:14)
at Object.Module._extensions..js (node:internal/modules/cjs/loader:1124:10)
at Module.load (node:internal/modules/cjs/loader:975:32)
at Function.Module._load (node:internal/modules/cjs/loader:816:12)
at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:79:12) {

status: 2,
signal: null,
output: [
null,
Buffer(0) [Uint8Array] [],
Buffer(218) [Uint8Array] [
116, 97, 114, 32, 40, 99, 104, 105, 108, 100, 41, 58,
32, 47, 104, 111, 109, 101, 47, 98, 105, 108, 108, 47,
99, 114, 97, 119, 108, 115, 47, 112, 114, 111, 102, 105,
108, 101, 115, 47, 112, 114, 111, 102, 105, 108, 101, 46,
116, 97, 114, 46, 103, 122, 58, 32, 67, 97, 110, 110,
111, 116, 32, 111, 112, 101, 110, 58, 32, 78, 111, 32,
115, 117, 99, 104, 32, 102, 105, 108, 101, 32, 111, 114,
32, 100, 105, 114, 101, 99, 116, 111, 114, 121, 10, 116,
97, 114, 32, 40,
... 118 more items
]
],
pid: 14,
stdout: Buffer(0) [Uint8Array] [],
stderr: Buffer(218) [Uint8Array] [
116, 97, 114, 32, 40, 99, 104, 105, 108, 100, 41, 58,
32, 47, 104, 111, 109, 101, 47, 98, 105, 108, 108, 47,
99, 114, 97, 119, 108, 115, 47, 112, 114, 111, 102, 105,
108, 101, 115, 47, 112, 114, 111, 102, 105, 108, 101, 46,
116, 97, 114, 46, 103, 122, 58, 32, 67, 97, 110, 110,
111, 116, 32, 111, 112, 101, 110, 58, 32, 78, 111, 32,
115, 117, 99, 104, 32, 102, 105, 108, 101, 32, 111, 114,
32, 100, 105, 114, 101, 99, 116, 111, 114, 121, 10, 116,
97, 114, 32, 40,
... 118 more items
]
}
`

Here's the file:


root@bill-HP-Pavilion-15-Notebook-PC:/home/bill/crawls/profiles# ls -lh
total 9,0M
-rw-r--r-- 1 root root 8,9M aug  9 12:45 profile.tar.gz

Also I am logged in as root... I know. bad practice. But its just a junk test machine. what am I doing wrong?

Support URL-level WARC-writing inclusion/exclusion lists

The system already has Page level scope inclusion/exclusion rules for determining which pages to visit.

The URL-level inclusion/exclusion rules would allow for excluding certain URLs from being written to WARC, even if loaded from another page. These could include known ad domains, etc..

By default, all URLs encountered are written.
This would support possible use cases:

Excluding URLs from exclusion list from being written to WARC
Excluding all URLs except those on a specific include list (maybe equal to the scope rules)
Writing a specific response to indicate exclusion (eg. a status 451 placeholder)

Add pending request count in pywb to ensure all requests written to WARC before terminating

Instead of the 5s wait for WARCs at the end, add actual counter to pywb to track pending requests and wait until all are done.
This should help with instances where a large video is being streamed, and page timeout may have exceeded, but still need to wait for full video to load and be written to WARC.

`--urlFile` option errors

When attempting to use the --urlFile option I am getting an error Missing required argument: url.

Command:
docker run -v $PWD/urlFile.txt:/app/urlFile.txt -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --urlFile urlFile.txt

What seems to be happening is that when running the above command urlFile.txt is being created as a directory and so there is no file for the crawler to use as a seed list - hence the error (see second paragraph here).

I have added the line RUN touch urlFile.txt to a local copy of the Dockerfile and this appears to have solved that issue but then I run into another error: pages/pages.jsonl creation failed [Error: ENOENT: no such file or directory, mkdir '/crawls/collections/testcollection/pages'].

Am I missing something obvious?
Thanks

Upload final output to an S3-compatible storage + webhook notification

If final output is a WACZ or single-WARC, support uploading to an S3-compatible endpoint, as per:
https://github.com/webrecorder/kubecaptures-backend/blob/main/driver/driver.js#L289

Also support optional notification via webhooks (if any), as per:
https://github.com/webrecorder/kubecaptures-backend/blob/main/driver/driver.js#L364

Need to determine best way to pass credentials, maybe not via command-line..

Feature - Nested seed rules

Hi again! So I've been trying to create this specific crawling rule:
Crawl all pages that have this prefix: "https://example.com", and for all outbound links, crawl no further than the link's destination page. I believe this isn't possible right now but here's what I think I think it could look like:

seeds:
  - url: https://example.com
    scopeType: prefix
    include:
      - url: /.*/
        scopeType: page

I think that would work if only the rules corresponding to the first valid "include" regex are used. So here, the implied [/^https://example.com/ rule would take precedence over /.*/ (if not, it would change the scopeType of example.com pages too)

It could lead to more complex behaviors such as:

seeds:
  - url: https://example.com
    scopeType: prefix
    include:
      - url: github.com
        depth: 2
      - url: /.*/
        scopeType: page

I would say, it's a bit like generating a seedfile on the fly, and applying seed rules on its urls. Or it could be seen as seed rules applied only on pages with a given referring domain (here example.com). There is probably a lot of little details to think about to avoid weird parsing issues and unwanted behaviors.

For now, I'll just crawl my target domain once. Then redo a crawl on the same collection using a seedfile containing all the outbound links and a scopeType set to page.

using profile

I'm having troubles crawling certain sites where browsertrix doesn't seem getting past the cookie consent form. So I've created an interactive profile which contains the cookies created after accepting the cookie consent form. (works fine!)

The issue is however using the profile. With code
sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /profiles/profile.tar.gz --url https://test.com/--generateWACZ --collection test-with-profile
the error is "No such file or directory" (also when using /home/../profiles/profile.tar.gz)

When trying to update the profile with code
sudo docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles --filename /profiles/newProfile.tar.gz -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://test.com/ --profile /profiles/profile.tar.gz"
the error is "unknown flag: --filename"

Is this an issue or rather a typo?

full error message when trying to use the profile:

tar (child): /home/testbtrix/profiles/profile.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
node:child_process:903
    throw err;
    ^

Error: Command failed: tar xvfz /home/testbtrix/profiles/profile.tar.gz
tar (child): /home/testbtrix/profiles/profile.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

Not following iframes relative links

I'm having issues pulling links from an iframe (it does not seem to follow them)

<iframe src="./somesite/Pages/This%20Is%20the%20'link'.html" scrollbars="1" height="900" width="100%"></iframe>

I'm trying to scrape a site and just get a list of all the URLs on the site to send over to ArchiveBox for backup.

Also is there a simple way to just get the URLs and not the text?

This is my config:

    browsertrixcrawler:
        image: webrecorder/browsertrix-crawler
        command: crawl --url https://example.com/ --workers 2 --generateWACZ --text --collection otrlib
        volumes:
            - /data/archivebox/crawls:/crawls
        network_mode: service:vpn
        depends_on:
          - vpn

Thanks!

userAgent option

Hello,
it seems that userAgent option doesn't work in my installation (latest image).

The crawl doesn't even start, message is
"Error: Unable to launch browser, error message: Could not find expected browser (chrome) locally."
When option is removed, chrome is suddenly found ok.

I have tried simple and more complex strings as userAgents, for example

--userAgent "https://www.kansalliskirjasto.fi/en/legal-deposit-office"
--userAgent https://www.kansalliskirjasto.fi/en/legal-deposit-office
--userAgent "something"
--userAgent something

Also tried to place this option in different places in the command line.

Regards,

Collection name validation

If the desired collection name passed to browsertrix via --collection contains ., /, :, or potentially other special characters, pywb silently fails to create the necessary directory structure for the crawl. Despite that failure, the crawl then proceeds, resulting in a series of ENOENT errors as the process attempts to write to non-existent paths as it goes.

Browsertrix should make sure the collection name is valid before launching the crawl. wb manager init should also fail noisily if it is passed an invalid collection name, ideally cleanly by throwing a validation error, but also messily if directory creation fails for whatever reason.

Sample output:

HLSC02VK05ZHTDG:browsertrix-crawler rcremona$ time docker-compose run crawler crawl --url https://example.com/ --limit 1 --collection job-1-example.com
Creating browsertrix-crawler_crawler_run ... done
Exclusions Regexes:  []
Scope Regexes:  [ /^https:\/\/example\.com\// ]
pages/pages.jsonl creation failed Error: ENOENT: no such file or directory, mkdir '/crawls/collections/job-1-example.com/pages'
    at Object.mkdirSync (fs.js:987:3)
    at Crawler.initPages (/app/crawler.js:596:12)r 503.0 ms)
    at Crawler.crawl (/app/crawler.js:492:10)0%)
    at async Crawler.run (/app/crawler.js:411:7) {
  errno: -2,: 0.0% CPU / 0.0% memory
  syscall: 'mkdir',
  code: 'ENOENT',
  path: '/crawls/collections/job-1-example.com/pages'
}
pages/pages.jsonl append failed Error: ENOENT: no such file or directory, open '/crawls/collections/job-1-example.com/pages/pages.jsonl'tart:     2021-03-24 20:08:28.092
    at Object.openSync (fs.js:476:3)6 (running for 3.5 seconds)
    at Object.writeFileSync (fs.js:1467:35)0%)
    at Object.appendFileSync (fs.js:1506:6)
    at Crawler.writePage (/app/crawler.js:624:10)
    at Crawler.crawlPage (/app/crawler.js:458:12)
    at processTicksAndRejections (internal/process/task_queues.js:93:5) {
  errno: -2,
  syscall: 'open',
  code: 'ENOENT',
  path: '/crawls/collections/job-1-example.com/pages/pages.jsonl'
}

== Start:     2021-03-24 20:08:28.092
== Now:       2021-03-24 20:08:31.914 (running for 3.8 seconds)
== Progress:  1 / 1 (100.00%), errors: 0 (0.00%)
== Remaining: 0.0 ms (@ 0.26 pages/second)
== Sys. load: 58.5% CPU / 79.9% memory
== Workers:   1
   #0 IDLE 
Waiting 5s to ensure WARCs are finished

real	0m12.088s
user	0m0.661s
sys	0m0.209s

How can I solve cloudflare's browser check...? (login profile, useragent)

I've tried creating a login profile, but no success.... and --useragent I've tried several times and it keeps giving me an error. Could you please give me an example?

Add support for WACZ creation.

This can be a command-line flag, say --generateWACZ similar to the --generateCDX option, which will generate a WACZ file after the crawl is done. This will also require keeping track of the pages crawled in a list that can be passed into py-wacz

This would involve:

Adding a --generateWACZ command line option.
Generate a pages/pages.jsonl file in the collection directory. Will need to make the pages dir also.
Run py-wacz to create the WACZ at the end of the crawl. For now, can just regenerate the CDX during wacz creation. In the future, can use the existing index in redis to speed up the process.

Support mounting a browser profile directory or tarball

The profile could be mounted as a docker volume and passed to the puppeteer options as userDataDir

Or, perhaps profile is mounted as a tar.gz and extracted in the crawler before use to simplify management of the profiles.

The profile will allow crawling with a pre-configured browser.

Error: "Missing required argument: url" when using yaml config option

Hello,

I am currently trying to record a collection of urls using the crawler. However, when I was doing this command:
"docker run -v $PWD/crawl-1.yaml:/app/crawl-1.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl —-config /app/crawl-1.yaml" The program will show "Missing required argument: url". The craw-1.yaml is downloaded from the fixtures folder. Did I do something wrong?

Thank you so much!

Post-process smaller WARCs to larger WARC with warcinfo

The crawl generates many WARCs due to several processes running in parallel.
If the final output of the crawl needs to be a WARC, there can be a 'post-processing' step to generate a single WARC:

Concatenate small WARCs to a larger WARC
Ensure concatenated WARC has a warcinfo record
Add max size limit (eg. 1GB), if limit exceeded, generate several WARCs upto size limit (each with warcinfo).

YAML based config

Similar to original Browsertrix, should probably have a YAML-based config as an alternative to the command-line options
and to simplify complex configurations. Will help with #12 also.

The command-line options can take precedence.

One or more workers are not working.

Below is my command...
workers is 6 and it keeps coming out as 1.
My cpu has 6 cores and 12 threads. The ram is 64gb, no problem.....

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://example.com/ --generateWACZ --text --collection test --scopeType any -workers 6

Support Full-Text Extraction

Support text extraction from the DOM, using existing approaches implemented here:
https://github.com/webrecorder/archiveweb.page/blob/main/src/recorder.js#L1061
https://github.com/webrecorder/browsertrix/blob/pywb-instance/simple-driver/index.js#L197

This essentially involves calling the DOM.getDocument which returns JSON. Ideally, these approaches can be unified into a single reusable implementation also

The extracted text could then be added to:

The 'text' field in the WACZ
A separate text only WARC conversion record under say urn:text:<timestamp>/<url>

Both approaches have been tried in the past. If generating a WACZ, the WARC-record duplicate is not needed.
But if only generating a WARC, probably good to have the text in WARCs as well?

Add basic integration testing.

Using github CI, add basic integration test for running on a single page, maybe just https://example.com/
The test would be just to check basically functionality to start off:

Running the container in a crawl and ensure it succeeds, perhaps with WACZ output.
Verifying the WACZ output is valid
Running linter from #19

Can model it off of what the downstream zimit does for running a crawl:
https://github.com/openzim/zimit/blob/master/.github/workflows/ci.yaml

This is just to ensure a basic crawl is working. Will create a separate issue for domain-specific verification.

Videos missing

Investigating openzim/zimit#71 I realized I can't seem to be able to scrape videos reliably with the current version.

Even a very simple tests doesn't work:

https://wulu.yeleman.ml/yt.html (YT share with privacy mode)
https://wulu.yeleman.ml/yt2.html (default YT share)
https://wulu.yeleman.ml/vim.html (vimeo)

The actual video content is not fetched.

I think this is the root cause behind fondamentaux missing most (if not all) of videos. See openzim/zimit#78

That may indicate that the problem is not new (those November runs were using zimit:dev which at that time used webrecorder/browsertrix-crawler:0.1.0). Maybe something has changed on Youtube.com at that time?

Crawling brightcove videos

Hi there,
I'm trying to crawl an education website that streams the videos using brightcove player.
My crawling attempts keep timing out on pages and I end up with no video pages when I play it back.

Any suggestion on how this can be resolved?

Cheers

Support Screenshot Creation

Add option to generate screenshots, as per:
https://github.com/webrecorder/browsertrix/blob/pywb-instance/simple-driver/index.js#L177

Store in WARC records as urn:screenshot:<timestamp>/<url> based on previous convention.

To decide: need options for different dimensions (eg. thumbnails?), default dimensions for screenshot? (1024x768?)

Docker build currently broken due to setuptools changes

Attempting to build this repo from scratch, the pip package installation goes wonky after this happens:

Collecting simpleeval>=0.9
  Downloading simpleeval-0.9.10.tar.gz (26 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3.8 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-wx1yd77f/simpleeval_fe762bf3bf5e4b539cf9a0fdf1656944/setup.py'"'"'; __file__='"'"'/tmp/pip-inst
all-wx1yd77f/simpleeval_fe762bf3bf5e4b539cf9a0fdf1656944/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup
()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-kay8kd7a
         cwd: /tmp/pip-install-wx1yd77f/simpleeval_fe762bf3bf5e4b539cf9a0fdf1656944/
    Complete output (1 lines):
    error in simpleeval setup command: use_2to3 is invalid.
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/62/25/aec98426834844b70b7ab24b4cce8655d31e654f58e1fa9861533f5f2af1/simpleeval-0.9.10.tar.gz#sha256=692055488c2864637f6c2edb5fa48175978a2a07318009e7cf0
3c9790ca17bea (from https://pypi.org/simple/simpleeval/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

This appears to be an instance of this problem, where a breaking change in setuptools can break builds.

It works fine if we pin setuptools==57.5.0 but we should probably look to see if simpleeval can support Python 3.

Add optional fields to warcinfo that can identify who created the warc

The warcinfo metadata does not currently include properties that could identify which institution generated the WARC file. Fields such as operator, ip, and hostname would be helpful additions. These could be enabled/defined via the new YAML configuration file (#13).

Auto-generate new collection name if unspecified

Instead of crawling to the same capture collection by default, perhaps make a new crawl.<timestamp> collection automatically so each new crawl goes into a new isolated directory by default.

Experiment with supporting screencasting

Experiment with support for optional viewing of a screencast of all crawlers in a separate HTML page, streaming frames via a websocket.

Attribute name parameter for crawler.extractLinks

Having access to the crawler.extractLinks function within the overridable defaultDriver.js is very helpful for controlling the scope of the crawl. At the moment, though, it will only work if your selector is for tags with href attributes (see elem.href):

browsertrix-crawler/crawler.js

Line 480 in 7cfeefd

return [...document.querySelectorAll(selector)].map(elem => elem.href);

Converting the function to include an optional attribute property would make this more powerful:
extractLinks (selector = 'a[href]', attribute= 'href')
For example, the selector might be figure[data-url] and the attribute either data-url if using getAttribute(), or dataset.url if appending to elem.

proxy support

Hey,

from what I can see, the docker container has no support for (socks) proxies, so all outgoing requests are started from the machine running the docker container?

Puppeteer (and chromium) can be started with args: ["--proxy-server=socks5://localhost:1234"]

It seems like puppeteer-cluster allows passing puppeteer args: thomasdondorf/puppeteer-cluster#368 (comment)

But those are not configurable in browsertrix-crawler:

browsertrix-crawler/crawler.js

Lines 196 to 206 in c5494be

 get puppeteerArgs() { 

 // Puppeter Options 

 return { 

 headless: this.params.headless, 

 executablePath: this.browserExe, 

 ignoreHTTPSErrors: true, 

 args: this.chromeArgs, 

 userDataDir: this.profileDir, 

 defaultViewport: null, 

 }; 

 }

Invalid URLs in seeds file

Hi,
crawl doesn't start if there is even one 'Invalid URL' in seeds file, for example

yle.fi
instead of
https://yle.fi

Crawler should jump over invalid URLs, log them in errors and continue with valid URLs?

Support for submitting an initial seed list instead of single URL

Support for crawling multiple URLs in a fixed list.
These may be 'seeds' to a larger crawl or just the fixed list (need a depth option)

Probably support via multiple --url args as well as via a pre-made YAML config file.

error codes: handling - logging - reporting

While running a crawl with 8 workers I got as of a certain moment 403 response codes (generated by Cloudfront).

in the pages file the errors are mentioned
{"id":".....","url":"https://.........../04-12-1942/6538","title":"ERROR: The request could not be satisfied","text":"403 ERROR\nThe request could not be satisfied.\nRequest blocked. ......."}

in the crawl file state is "finished"
'{"url":"https://........./04-12-1942/6538","seedId":0,"depth":2,"started":"2022-02-12T07:21:41.004Z","finished":"2022-02-12T07:21:42.412Z"}'

As is, the yaml file cannot be used as config to relaunch the crawl. Which would be helpful.

Having a global error report/overview would also be helpful.

Stopping and resuming the crawler

Is there a way to stop the crawler and resume it later so I can do things like resuming the crawl in case of failure, changing configuration, or crawling the site only in idle hours?

Add eslint

Add eslint to cleanup all the code and keep it consistent!

"type" vs "scopeType" in the YAML config file

Hi! So while testing the config files I think I've found a discrepancy in the README.md.
So we are given this example:

seeds:
  - url: https://webrecorder.net/
    depth: 1
    type: "prefix"

Then when I tried to change type to "none", it didn't change anything. This is confirmed by the logs (using the parameter --logging stats,debug):

Seeds [
  ScopedSeed {
    url: 'https://webrecorder.net/',
    include: [ /^https:\/\/webrecorder\.net\// ],
    exclude: [],
    scopeType: 'prefix',
    sitemap: false,
    allowHash: false,
    maxDepth: 99999
  }
]

When using "scopeType" instead of "type", it does allow me to change the scope type. So I think it's either a problem with the way the config file is parsed, or an error in the documentation.

Stuck with all workers idle at the end of a crawl

Screenshot:

Seems that it thinks it needs to crawl another page, but there aren't actually any more pages to crawl. It's been unchanged for multiple hours.

Make background behaviors more modular

Currenly, the defaultDriver has several hard-coded functions for autoplay, autofetch and autoscroll behaviors.
Instead, move the to separate files, and add a background behavior maanager.
Allow toggling which background behaviors are run via a separate command-line option.

This will make the background behaviors more modular and further simplify the defaultDriver

Crawl shows error and exits if option `--urlFile` is used without setting `--scope`

Crawl fails if called with --urlFile but without --scope:

$> docker run -v$PWD/test-urls.txt:/test-urls.txt webrecorder/browsertrix-crawler:0.4.0-beta.1 crawl --urlFile /test-urls.txt
Exclusions Regexes:  []
Scope Regexes:  undefined
creating pages without full text
Queuing Error:  TypeError: Cannot read property 'length' of undefined
    at Crawler.shouldCrawl (/app/crawler.js:854:43)
    at Crawler.queueUrls (/app/crawler.js:758:33)r 4.0 seconds)
    at Crawler.extractLinks (/app/crawler.js:752:10)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Crawler.loadPage (/app/crawler.js:735:7)
    at async Crawler.module.exports [as driver] (/app/defaultDriver.js:4:3)
    at async Crawler.crawlPage (/app/crawler.js:570:7)

Built from ae4ce97. See #55: I can confirm that 0.4.0-beta.0 (from hub.docker.com) is not affected.

Switch from Puppeteer to Playwright

https://github.com/microsoft/playwright-python is almost call-for-call compatible with puppeteer, and gives you access to Firefox and Webkit as well.

Certainly not a priority by any means, but would you hypothetically be open to a PR for this?

File suffix of combined WARCs should be .warc.gz

The combined WARC files (see #33) are compressed and should get the file suffix .warc.gz correspondingly.

	get puppeteerArgs() {
	// Puppeter Options
	return {
	headless: this.params.headless,
	executablePath: this.browserExe,
	ignoreHTTPSErrors: true,
	args: this.chromeArgs,
	userDataDir: this.profileDir,
	defaultViewport: null,
	};
	}