
zeno's Introduction

Zeno

State-of-the-art web crawler 🔱

Introduction

Zeno is a web crawler designed to operate wide crawls or to simply archive a single web page. Zeno's key concepts are portability, performance, and simplicity, with an emphasis on performance.

It was originally developed by Corentin Barreau at the Internet Archive. It relies heavily on the warc module for recording traffic into WARC files.

The name Zeno comes from Zenodotus (Ζηνόδοτος), a Greek grammarian, literary critic, Homeric scholar, and the first librarian of the Library of Alexandria.

Usage

See ./Zeno -h

COMMANDS:
   get      Archive the web!
   version  Show the version number.
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --user-agent value                                     User agent to use when requesting URLs. (default: "Zeno")
   --job value                                            Job name to use, will determine the path for the persistent queue, seencheck database, and WARC files.
   --workers value, -w value                              Number of concurrent workers to run. (default: 1)
   --max-concurrent-assets value, --ca value              Max number of concurrent assets to fetch PER worker. E.g. if you have 100 workers and this setting at 8, Zeno could do up to 800 concurrent requests at any time. (default: 8)
   --max-hops value, --hops value                         Maximum number of hops to execute. (default: 0)
   --cookies value                                        File containing cookies that will be used for requests.
   --keep-cookies                                         Keep a global cookie jar (default: false)
   --headless                                             Use headless browsers instead of standard GET requests. (default: false)
   --local-seencheck                                      Simple local seencheck to avoid re-crawling of URIs. (default: false)
   --json                                                 Output logs in JSON (default: false)
   --debug                                                Enable debug logging. (default: false)
   --live-stats                                           Display live crawl statistics in the terminal. (default: false)
   --api                                                  Enable the API server. (default: false)
   --api-port value                                       Port to listen on for the API. (default: "9443")
   --prometheus                                           Export metrics in Prometheus format; using this setting implies --api. (default: false)
   --prometheus-prefix value                              String used as a prefix for the exported Prometheus metrics. (default: "zeno:")
   --max-redirect value                                   Specifies the maximum number of redirections to follow for a resource. (default: 20)
   --max-retry value                                      Number of retries if an error happens when executing an HTTP request. (default: 20)
   --http-timeout value                                   Number of seconds to wait before timing out a request. (default: 30)
   --domains-crawl                                        If this is turned on, seeds will be treated as domains to crawl, therefore same-domain outlinks will be added to the queue as hop=0. (default: false)
   --disable-html-tag value [ --disable-html-tag value ]  Specify HTML tag to not extract assets from
   --capture-alternate-pages                              If turned on, <link> HTML tags with "alternate" values for their "rel" attribute will be archived. (default: false)
   --exclude-host value [ --exclude-host value ]          Exclude a specific host from the crawl, note that it will not exclude the domain if it is encountered as an asset for another web page.
   --max-concurrent-per-domain value                      Maximum number of concurrent requests per domain. (default: 16)
   --concurrent-sleep-length value                        Number of milliseconds to sleep when max concurrency per domain is reached. (default: 500)
   --crawl-time-limit value                               Number of seconds until the crawl will automatically set itself into the finished state. (default: 0)
   --crawl-max-time-limit value                           Number of seconds until the crawl will automatically panic itself. Defaults to crawl-time-limit + (crawl-time-limit / 10). (default: 0)
   --proxy value                                          Proxy to use when requesting pages.
   --bypass-proxy value [ --bypass-proxy value ]          Domains that should not be proxied.
   --warc-prefix value                                    Prefix to use when naming the WARC files. (default: "ZENO")
   --warc-operator value                                  Contact information of the crawl operator, written to the Warc-Info record in each WARC file.
   --warc-cdx-dedupe-server value                         Identify the server to use for CDX deduplication. This also activates CDX deduplication.
   --warc-on-disk                                         Do not use RAM to store payloads when recording traffic to WARCs, everything will happen on disk (usually used to reduce memory usage). (default: false)
   --warc-pool-size value                                 Number of concurrent WARC files to write. (default: 1)
   --warc-temp-dir value                                  Custom directory to use for WARC temporary files.
   --disable-local-dedupe                                 Disable local URL-agnostic deduplication. (default: false)
   --cert-validation                                      Enables certificate validation on HTTPS requests. (default: false)
   --disable-assets-capture                               Disable assets capture. (default: false)
   --warc-dedupe-size value                               Minimum size to deduplicate WARC records with revisit records. (default: 1024)
   --cdx-cookie value                                     Pass custom cookie during CDX requests. Example: 'cdx_auth_token=test_value'
   --hq                                                   Use Crawl HQ to pull URLs to process. (default: false)
   --hq-address value                                     Crawl HQ address.
   --hq-key value                                         Crawl HQ key.
   --hq-secret value                                      Crawl HQ secret.
   --hq-project value                                     Crawl HQ project.
   --hq-batch-size value                                  Crawl HQ feeding batch size. (default: 0)
   --hq-continuous-pull                                   If turned on, the crawler will pull URLs from Crawl HQ continuously. (default: false)
   --hq-strategy value                                    Crawl HQ feeding strategy. (default: "lifo")
   --es-url value                                         ElasticSearch URL to use for indexing crawl logs.
   --exclude-string value [ --exclude-string value ]      Discard any (discovered) URLs containing this string.
   --help, -h                                             show help
   --version, -v                                          print the version
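
For example, a small single-site crawl might look like the following; the URL and flag values are illustrative, not recommended defaults:

./Zeno get url https://example.com/ --workers 4 --warc-prefix EXAMPLE --local-seencheck --live-stats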

zeno's People

Contributors

corentinb, equals215, harshnarayanjha, machawk1, ngtmeaty, yzqzss

zeno's Issues

Ensure invalid HTTPS certificates still get crawled

Examples:
x509: cannot validate certificate for x.x.x.x because it doesn't contain any IP SANs (on individual IPs)
Get \"https://self-signed.badssl.com/\": x509: certificate signed by unknown authority (on self-signed)

edit: "fixed" by CorentinB@warc#14, will need to be implemented into Zeno.

Send on closed channel panic

Caused by hitting CTRL+C on a crawl; it was in the finishing state when this happened.

panic: send on closed channel
        panic: send on closed channel

goroutine 778 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc0016dd2c0)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc00024f888, 0xc0016dd2c0)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc000594480, 0xc0016dd2c0)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc000594480)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119

Happened on 2 different VMs on 2 different crawls. The second one was more verbose.

panic: send on closed channel
        panic: send on closed channel

goroutine 210 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc013a64180)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc000492008, 0xc013a64180)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc0003c84c0, 0xc013a64180)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc0003c84c0)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119
panic: send on closed channel
        panic: send on closed channel

goroutine 336 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc013c1c000)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc000492008, 0xc013c1c000)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc00029e500, 0xc013c1c000)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc00029e500)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119
panic: send on closed channel
        panic: send on closed channel

goroutine 346 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc01cf08060)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc000492008, 0xc01cf08060)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc0012322c0, 0xc01cf08060)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc0012322c0)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119
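
The failure mode is the classic one: workers are still sending results while the shutdown path closes their channel. Below is a minimal sketch of the pattern and a common guard; Zeno's actual channel wiring is not visible in the traces, so all names here are hypothetical:

package main

import (
	"fmt"
	"sync"
)

type crawl struct {
	results chan string
	done    chan struct{} // closed exactly once when CTRL+C lands
	once    sync.Once
}

// capture sends a result unless shutdown has begun; selecting on done
// avoids the "send on closed channel" panic a bare send would risk.
func (c *crawl) capture(item string) {
	select {
	case c.results <- item:
	case <-c.done:
		// shutting down: drop the item instead of panicking
	}
}

func (c *crawl) shutdown() {
	// Signal shutdown; never close(results) while senders may still run.
	c.once.Do(func() { close(c.done) })
}

func main() {
	c := &crawl{results: make(chan string, 1), done: make(chan struct{})}
	c.capture("https://example.com/") // buffered send succeeds
	c.shutdown()
	c.capture("https://example.org/") // dropped quietly, no panic
	fmt.Println(<-c.results)
}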

Log rotation causes panic / SIGSEGV

This happened on two distinct, very different crawls, after a long period of simply hanging (no crawling, no logging, for more than an hour).

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x125056e]

goroutine 82 [running]:
github.com/internetarchive/Zeno/internal/pkg/log.(*Logfile).Filename(0x0)
        /home/corentin/projects/Zeno/internal/pkg/log/file.go:51 +0x4e
github.com/internetarchive/Zeno/internal/pkg/log.(*fileHandler).Rotate(0xc000129ce0)
        /home/corentin/projects/Zeno/internal/pkg/log/file.go:30 +0x45
github.com/internetarchive/Zeno/internal/pkg/log.(*Logger).rotate(0xc000260e10)
        /home/corentin/projects/Zeno/internal/pkg/log/rotate.go:57 +0x17c
github.com/internetarchive/Zeno/internal/pkg/log.(*Logger).startRotation.func1()
        /home/corentin/projects/Zeno/internal/pkg/log/rotate.go:22 +0x25
created by github.com/internetarchive/Zeno/internal/pkg/log.(*Logger).startRotation in goroutine 1
        /home/corentin/projects/Zeno/internal/pkg/log/rotate.go:17 +0x85
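
The trace shows Filename being invoked on a nil *Logfile (receiver 0x0), so the rotation goroutine is dereferencing a logfile that was never set or was already torn down. A minimal sketch of that failure and a defensive guard; the field name is hypothetical:

package main

import "fmt"

type Logfile struct {
	path string
}

// Filename tolerates a nil receiver instead of dereferencing it,
// which is what the rotation goroutine crashes on above.
func (l *Logfile) Filename() string {
	if l == nil {
		return ""
	}
	return l.path
}

func main() {
	var l *Logfile // nil, as in the crashing rotation path
	fmt.Printf("filename: %q\n", l.Filename()) // "" instead of SIGSEGV
}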

More efficient deduplication hash table

This is quite a half-baked idea, but we'd be looking to implement some sort of hit counter for items in the hash table, allowing us to clean it up once it fills. This would also allow us to set a size cap on the table, helping during larger runs. A rough sketch follows the list below.

  • Hit counter and removal of items with low number of hits when we're running low on space
  • File based storage of the hash table (somehow...)
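
A sketch of what the capped table could look like; the type names, eviction policy, and cap value are all placeholders, not a committed design:

package main

import "fmt"

type seenTable struct {
	cap  int
	hits map[string]int
}

func newSeenTable(cap int) *seenTable {
	return &seenTable{cap: cap, hits: make(map[string]int, cap)}
}

// Seen records a URL hash and reports whether it was already present,
// evicting a cold entry first when the table is at capacity.
func (t *seenTable) Seen(hash string) bool {
	if _, ok := t.hits[hash]; ok {
		t.hits[hash]++
		return true
	}
	if len(t.hits) >= t.cap {
		t.evictColdest()
	}
	t.hits[hash] = 1
	return false
}

// evictColdest removes one entry with the lowest hit count. A real
// implementation would batch this or keep a heap instead of scanning.
func (t *seenTable) evictColdest() {
	coldKey, coldHits := "", int(^uint(0)>>1)
	for k, h := range t.hits {
		if h < coldHits {
			coldKey, coldHits = k, h
		}
	}
	delete(t.hits, coldKey)
}

func main() {
	t := newSeenTable(2)
	fmt.Println(t.Seen("a"), t.Seen("a"), t.Seen("b"), t.Seen("c")) // false true false false
}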

live-stats flag is broken

Standard logging is interspersed with the expected live-stats information, which breaks the previous behavior of the live-stats flag. In other words, as the typical logs fly by in the terminal window, Zeno repeatedly logs the live-stats info, which then scrolls out of the viewport. The issue seems to be that Zeno no longer suppresses normal logging when the --live-stats flag is active. A screenshot of Zeno's output with the --live-stats flag is included below.

[Screenshot: Zeno output with --live-stats, normal log lines interleaved with the stats display]

Flush HQ finished array on shutdown

If we're attempting to shut down, we should instantly flush the HQ finished array and drop the minimum batch size, so that we can ensure everything gets sent.

WAL tests fail

Logs are flooded with messages like this:

2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test1369699584/index_wal: file already closed"
2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test4114659518/index_wal: file already closed"
2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test1369699584/index_wal: file already closed"
2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test4114659518/index_wal: file already closed"
[the same two lines repeat continuously]

When executing tests:

go test ./... -v
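
The flood suggests the retry loop treats a permanently failed Sync (the WAL file was already closed) like a transient error. A standalone sketch of making that failure terminal; the function shape is hypothetical, not the test code's actual structure:

package main

import (
	"errors"
	"log"
	"os"
)

// syncWAL retries transient sync failures but gives up when the file
// is already closed, since that can never succeed.
func syncWAL(f *os.File) error {
	for {
		err := f.Sync()
		if err == nil {
			return nil
		}
		if errors.Is(err, os.ErrClosed) {
			return err
		}
		log.Println("failed to sync WAL, retrying", "error:", err)
	}
}

func main() {
	f, err := os.CreateTemp("", "index_wal")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	f.Close()
	if err := syncWAL(f); err != nil {
		log.Println("giving up:", err) // "file already closed", logged once
	}
}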

AWS and mismatch for User-Agent Zeno

Hi @CorentinB,

See https://web.archive.org/web/20221119203341/https://www.coolblue.de/en/stores for context.

Has the Internet Archive organisation started using this web crawler in public? We have been noticing requests from the Archive being blocked since the end of October by our AWS WAF Bot Control rules, because the user agent does not match archive.org_bot.

If you check the request-blocking details below, you can see it missed the label awswaf:managed:aws:bot-control:bot:name:internet_archive, which would automatically be added if AWS recognized the request as originating from the Internet Archive organisation. The mismatch is caused by the user agent not matching.

action BLOCK
formatVersion 1
httpRequest.clientIp 207.241.235.164
httpRequest.country US
httpRequest.headers.0.name Host
httpRequest.headers.0.value www.coolblue.de
httpRequest.headers.1.name User-Agent
httpRequest.headers.1.value Zeno
httpRequest.headers.2.name Accept-Encoding
httpRequest.headers.2.value gzip
httpRequest.headers.3.name Connection
httpRequest.headers.3.value close
httpRequest.httpMethod GET
httpRequest.httpVersion HTTP/1.1
httpRequest.requestId xxxxxxxxMELtQmGpKoibv3XfAfSMPkDRiNSZprf8ktWPMkrNg==
httpRequest.uri /en/stores
httpSourceId xxxxWA9W0E5
httpSourceName CF
labels.0.name awswaf:managed:token:absent
labels.1.name awswaf:managed:aws:bot-control:signal:non_browser_user_agent
...
ruleGroupList.5.ruleGroupId AWS#AWSManagedRulesBotControlRuleSet
ruleGroupList.5.terminatingRule.action BLOCK
ruleGroupList.5.terminatingRule.ruleId SignalNonBrowserUserAgent
terminatingRuleId AWSManagedRulesBotControlRuleSet
terminatingRuleType MANAGED_RULE_GROUP
timestamp 1668890021691
webaclId arn:aws:wafv2:us-east-1:xxxx:global/webacl/webshop-firewall-cf-webacl-v2/xxxxxx-41dc-9162-020f7bb917e3
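
Until the default changes, operators can override the shipped "Zeno" value with the existing --user-agent flag. The UA string below is illustrative only, not the Archive's official bot identifier:

./Zeno get url https://www.coolblue.de/en/stores --user-agent "Mozilla/5.0 (compatible; archive.org_bot)"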

Allow free space threshold to be customizable

I built Zeno from source (687b5d5) and ran Zeno get url, only to be told I did not have enough space. It would be great if (1) this value were customizable (it appears to be hard-coded to 20 GB) and/or (2) the amount of space needed were reported to the user.
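
A sketch of the requested behavior under stated assumptions: the 20 GB figure comes from the issue, the flag name is made up, and syscall.Statfs is Unix-only:

package main

import (
	"flag"
	"fmt"
	"syscall"
)

func main() {
	minFree := flag.Uint64("min-free-gb", 20, "free space required before crawling (GB)")
	flag.Parse()

	var st syscall.Statfs_t
	if err := syscall.Statfs(".", &st); err != nil {
		panic(err)
	}
	freeGB := st.Bavail * uint64(st.Bsize) / (1 << 30)
	if freeGB < *minFree {
		// Report both numbers instead of a bare refusal.
		fmt.Printf("not enough disk space: %d GB free, %d GB required\n", freeGB, *minFree)
		return
	}
	fmt.Printf("ok to crawl: %d GB free\n", freeGB)
}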

Panic when starting Zeno with go run

Reported by @HarshNarayanJha in #85.

go run . get url https://some.url

panic: runtime error: slice bounds out of range [:7] with length 6

goroutine 1 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.GenerateCrawlConfig(0xc000478008)
	/home/corentin/Documents/work/ia/Zeno/internal/pkg/crawl/config.go:248 +0x108b
github.com/internetarchive/Zeno/cmd.init.func8(0xc000418c00?, {0xc000428520, 0x1, 0x14da456?})
	/home/corentin/Documents/work/ia/Zeno/cmd/get_url.go:24 +0x39
github.com/spf13/cobra.(*Command).execute(0x21cdf40, {0xc0004284f0, 0x1, 0x1})
	/home/corentin/go/pkg/mod/github.com/spf13/[email protected]/command.go:983 +0xaca
github.com/spf13/cobra.(*Command).ExecuteC(0x21cd6a0)
	/home/corentin/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
	/home/corentin/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039
github.com/internetarchive/Zeno/cmd.Run()
	/home/corentin/Documents/work/ia/Zeno/cmd/cmd.go:56 +0x211
main.main()
	/home/corentin/Documents/work/ia/Zeno/main.go:21 +0x13
exit status 2
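
The [:7] bound suggests config.go truncates a string, likely a short commit hash, without checking its length first; a go run build presumably yields a shorter value than a release build. A guard sketch (the function is hypothetical):

package main

import "fmt"

// shortVersion truncates to 7 characters only when that is safe; a bare
// s[:7] panics when len(s) < 7, matching "with length 6" in the trace.
func shortVersion(s string) string {
	if len(s) >= 7 {
		return s[:7]
	}
	return s
}

func main() {
	fmt.Println(shortVersion("abc123"))    // 6 chars: returned whole, no panic
	fmt.Println(shortVersion("abc123def")) // truncated to "abc123d"
}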

No such file or directory panic

panic: open jobs/warcs/SPNOUTLINKS-20221021045127671-00030-crawl900.us.archive.org.warc.gz.open: no such file or directory

goroutine 149 [running]:
github.com/CorentinB/warc.isFileSizeExceeded({0xc166f684e0?, 0xc0001b4520?}, 0x408f400000000000)
        /var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/utils.go:196 +0x10e
github.com/CorentinB/warc.recordWriter(0xc00057e0f0, 0x0?, 0x0?)
        /var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/warc.go:120 +0x499
created by github.com/CorentinB/warc.(*RotatorSettings).NewWARCRotator
        /var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/warc.go:50 +0x75

Extract URLs from images

It would be interesting to try OCR on images (as an option) to extract URLs from watermarks and the like.

Define Zeno's queuing behavior properly

As discussed with @equals215, Zeno's behavior in terms of queuing should be as follows (a sketch of the handover idea follows the list):

  • By default: handover + WAL writes batching
  • As an option: --disable-handover, --whatever (--whatever being a placeholder for a good argument name asking for more atomicity, i.e. not using WAL batch writing)
  • When using HQ: handover++ (no queuing AT ALL)
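
A toy sketch of the handover part; the real frontier would track idle workers explicitly rather than probing an unbuffered channel, and all names here are invented:

package main

import (
	"fmt"
	"time"
)

type frontier struct {
	handover chan string // unbuffered: a send succeeds only if a worker is receiving
	queue    []string    // stand-in for the persistent queue + batched WAL writes
}

// enqueue hands a URL straight to an idle worker when possible and only
// touches the (slower, durable) queue otherwise.
func (f *frontier) enqueue(u string) {
	select {
	case f.handover <- u:
	default:
		f.queue = append(f.queue, u)
	}
}

func main() {
	f := &frontier{handover: make(chan string)}
	done := make(chan struct{})
	go func() { fmt.Println("worker got:", <-f.handover); close(done) }()
	time.Sleep(10 * time.Millisecond) // let the worker block on receive
	f.enqueue("https://a.example/")   // handed over directly
	<-done
	f.enqueue("https://b.example/") // no receiver now: persisted instead
	fmt.Println("queued:", len(f.queue))
}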

Panic on /workers access

2024/08/15 07:49:36 http: panic serving 127.0.0.1:45290: runtime error: invalid memory address or nil pointer dereference
goroutine 212832422 [running]:
net/http.(*conn).serve.func1()
        /var/www/.go/src/net/http/server.go:1903 +0xbe
panic({0x1371ac0?, 0x21a7f70?})

Replace github.com/tomnomnom/linkheader with stdlib

github.com/tomnomnom/linkheader is being used to parse Link HTTP headers; we're pretty sure it can be replaced with standard-library-only code.

Whatever code we add needs unit tests (we don't have unit tests for that right now).
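
A stdlib-only starting point, simplified from RFC 8288: it pulls out the URI and rel, which is what outlink discovery needs, but ignores quoted commas inside parameters, exactly the kind of edge case the unit tests should cover:

package main

import (
	"fmt"
	"strings"
)

type link struct {
	URL string
	Rel string
}

// parseLinkHeader splits a Link header into <uri>; param pairs and
// extracts the rel attribute of each.
func parseLinkHeader(h string) []link {
	var links []link
	for _, part := range strings.Split(h, ",") {
		fields := strings.Split(part, ";")
		raw := strings.TrimSpace(fields[0])
		if !strings.HasPrefix(raw, "<") || !strings.HasSuffix(raw, ">") {
			continue
		}
		l := link{URL: strings.Trim(raw, "<>")}
		for _, p := range fields[1:] {
			k, v, ok := strings.Cut(strings.TrimSpace(p), "=")
			if ok && k == "rel" {
				l.Rel = strings.Trim(v, `"`)
			}
		}
		links = append(links, l)
	}
	return links
}

func main() {
	h := `<https://example.com/page2>; rel="next", <https://example.com/style.css>; rel="stylesheet"`
	for _, l := range parseLinkHeader(h) {
		fmt.Printf("%s -> %s\n", l.Rel, l.URL)
	}
}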

Add basic UI to manage Zeno

So the idea is basically to "replicate" the excellent Heritrix3 web UI.

We want to give a way to start, stop, pause, and unpause the crawl, but also inject seeds, search crawl logs, and maybe remove entries matching a query from the frontier... The possibilities are endless.

[Screenshot: the Heritrix3 web UI]

Disk space pause is not working

It appears that something is resetting the pause state immediately after the disk space pause, causing it to continue crawling when we no longer have space.

Custom headers defined by yml file

Allow operators to define headers in a YAML file, per domain, to allow for greater control over headers like User-Agent or similar headers that may need to be configured per host.
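
A sketch of what the file and its loading could look like; the YAML layout, the header names, and the use of gopkg.in/yaml.v3 are all assumptions, not an agreed-upon format:

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// headersConfig maps a host to the headers to send when requesting it.
type headersConfig map[string]map[string]string

// sample is an illustrative config; hosts and values are placeholders.
const sample = `
www.coolblue.de:
  User-Agent: "archive.org_bot"
  Accept-Language: "de-DE"
`

func main() {
	var cfg headersConfig
	if err := yaml.Unmarshal([]byte(sample), &cfg); err != nil {
		panic(err)
	}
	for name, value := range cfg["www.coolblue.de"] {
		fmt.Printf("%s: %s\n", name, value)
	}
}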
