
zeno's Introduction

Zeno

State-of-the-art web crawler 🔱

Introduction

Zeno is a web crawler designed to operate wide crawls or to simply archive a single web page. Zeno's key concepts are portability, performance, and simplicity, with an emphasis on performance.

It was originally developed by Corentin Barreau at the Internet Archive. It relies heavily on the warc module for recording traffic into WARC files.

The name Zeno comes from Zenodotus (Ζηνόδοτος), a Greek grammarian, literary critic, Homeric scholar, and the first librarian of the Library of Alexandria.

Usage

See ./Zeno -h

COMMANDS:
   get      Archive the web!
   version  Show the version number.
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --user-agent value                                     User agent to use when requesting URLs. (default: "Zeno")
   --job value                                            Job name to use, will determine the path for the persistent queue, seencheck database, and WARC files.
   --workers value, -w value                              Number of concurrent workers to run. (default: 1)
   --max-concurrent-assets value, --ca value              Max number of concurrent assets to fetch PER worker. E.g. if you have 100 workers and this setting at 8, Zeno could do up to 800 concurrent requests at any time. (default: 8)
   --max-hops value, --hops value                         Maximum number of hops to execute. (default: 0)
   --cookies value                                        File containing cookies that will be used for requests.
   --keep-cookies                                         Keep a global cookie jar (default: false)
   --headless                                             Use headless browsers instead of standard GET requests. (default: false)
   --local-seencheck                                      Simple local seencheck to avoid re-crawling of URIs. (default: false)
   --json                                                 Output logs in JSON (default: false)
   --debug                                                Enable debug logging. (default: false)
   --live-stats                                           Display live crawl statistics in the terminal. (default: false)
   --api                                                  Enable the API server. (default: false)
   --api-port value                                       Port to listen on for the API. (default: "9443")
   --prometheus                                           Export metrics in Prometheus format; using this setting implies --api. (default: false)
   --prometheus-prefix value                              String used as a prefix for the exported Prometheus metrics. (default: "zeno:")
   --max-redirect value                                   Specifies the maximum number of redirections to follow for a resource. (default: 20)
   --max-retry value                                      Number of retries if an error happens when executing an HTTP request. (default: 20)
   --http-timeout value                                   Number of seconds to wait before timing out a request. (default: 30)
   --domains-crawl                                        If this is turned on, seeds will be treated as domains to crawl, therefore same-domain outlinks will be added to the queue as hop=0. (default: false)
   --disable-html-tag value [ --disable-html-tag value ]  Specify HTML tag to not extract assets from
   --capture-alternate-pages                              If turned on, <link> HTML tags with "alternate" values for their "rel" attribute will be archived. (default: false)
   --exclude-host value [ --exclude-host value ]          Exclude a specific host from the crawl, note that it will not exclude the domain if it is encountered as an asset for another web page.
   --max-concurrent-per-domain value                      Maximum number of concurrent requests per domain. (default: 16)
   --concurrent-sleep-length value                        Number of milliseconds to sleep when max concurrency per domain is reached. (default: 500)
   --crawl-time-limit value                               Number of seconds until the crawl will automatically set itself into the finished state. (default: 0)
   --crawl-max-time-limit value                           Number of seconds until the crawl will automatically panic itself. Defaults to crawl-time-limit + (crawl-time-limit / 10). (default: 0)
   --proxy value                                          Proxy to use when requesting pages.
   --bypass-proxy value [ --bypass-proxy value ]          Domains that should not be proxied.
   --warc-prefix value                                    Prefix to use when naming the WARC files. (default: "ZENO")
   --warc-operator value                                  Contact information of the crawl operator, written to the Warc-Info record in each WARC file.
   --warc-cdx-dedupe-server value                         Identify the server to use for CDX deduplication. This also activates CDX deduplication.
   --warc-on-disk                                         Do not use RAM to store payloads when recording traffic to WARCs, everything will happen on disk (usually used to reduce memory usage). (default: false)
   --warc-pool-size value                                 Number of concurrent WARC files to write. (default: 1)
   --warc-temp-dir value                                  Custom directory to use for WARC temporary files.
   --disable-local-dedupe                                 Disable local URL-agnostic deduplication. (default: false)
   --cert-validation                                      Enables certificate validation on HTTPS requests. (default: false)
   --disable-assets-capture                               Disable assets capture. (default: false)
   --warc-dedupe-size value                               Minimum size to deduplicate WARC records with revisit records. (default: 1024)
   --cdx-cookie value                                     Pass custom cookie during CDX requests. Example: 'cdx_auth_token=test_value'
   --hq                                                   Use Crawl HQ to pull URLs to process. (default: false)
   --hq-address value                                     Crawl HQ address.
   --hq-key value                                         Crawl HQ key.
   --hq-secret value                                      Crawl HQ secret.
   --hq-project value                                     Crawl HQ project.
   --hq-batch-size value                                  Crawl HQ feeding batch size. (default: 0)
   --hq-continuous-pull                                   If turned on, the crawler will pull URLs from Crawl HQ continuously. (default: false)
   --hq-strategy value                                    Crawl HQ feeding strategy. (default: "lifo")
   --es-url value                                         ElasticSearch URL to use for indexing crawl logs.
   --exclude-string value [ --exclude-string value ]      Discard any (discovered) URLs containing this string.
   --help, -h                                             show help
   --version, -v                                          print the version
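
For example, a small single-site crawl might look like the following; the URL and flag values are illustrative, not recommended defaults:

./Zeno get url https://example.com/ --workers 4 --warc-prefix EXAMPLE --local-seencheck --live-stats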

zeno's People

Contributors

corentinb, equals215, harshnarayanjha, machawk1, ngtmeaty, yzqzss

zeno's Issues

Ensure invalid HTTPS certificates still get crawled

Examples:
x509: cannot validate certificate for x.x.x.x because it doesn't contain any IP SANs (on individual IPs)
Get \"https://self-signed.badssl.com/\": x509: certificate signed by unknown authority (on self-signed)

edit: "fixed" by CorentinB@warc#14, will need to be implemented into Zeno.

Send on closed channel panic

Caused by hitting CTRL+C on a crawl; it was in the finishing state when this happened.

panic: send on closed channel
        panic: send on closed channel

goroutine 778 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc0016dd2c0)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc00024f888, 0xc0016dd2c0)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc000594480, 0xc0016dd2c0)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc000594480)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119

Happened on 2 different VMs on 2 different crawls. The second one was more verbose.

panic: send on closed channel
        panic: send on closed channel

goroutine 210 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc013a64180)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc000492008, 0xc013a64180)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc0003c84c0, 0xc013a64180)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc0003c84c0)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119
panic: send on closed channel
        panic: send on closed channel

goroutine 336 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc013c1c000)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc000492008, 0xc013c1c000)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc00029e500, 0xc013c1c000)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc00029e500)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119
panic: send on closed channel
        panic: send on closed channel

goroutine 346 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc01cf08060)
        /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50
panic({0x13699e0?, 0x169a9f0?})
        /var/www/.go/src/runtime/panic.go:770 +0x132
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture(0xc000492008, 0xc01cf08060)
        /X/Zeno/internal/pkg/crawl/capture.go:364 +0x12a7
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).unsafeCapture(0xc0012322c0, 0xc01cf08060)
        /X/Zeno/internal/pkg/crawl/worker.go:147 +0xa5
github.com/internetarchive/Zeno/internal/pkg/crawl.(*Worker).Run(0xc0012322c0)
        /X/Zeno/internal/pkg/crawl/worker.go:129 +0x7d1
created by github.com/internetarchive/Zeno/internal/pkg/crawl.(*WorkerPool).Start in goroutine 1
        /X/Zeno/internal/pkg/crawl/worker_pool.go:35 +0x119
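
The failure mode is the classic one: workers are still sending results while the shutdown path closes their channel. Below is a minimal sketch of the pattern and a common guard; Zeno's actual channel wiring is not visible in the traces, so all names here are hypothetical:

package main

import (
	"fmt"
	"sync"
)

type crawl struct {
	results chan string
	done    chan struct{} // closed exactly once when CTRL+C lands
	once    sync.Once
}

// capture sends a result unless shutdown has begun; selecting on done
// avoids the "send on closed channel" panic a bare send would risk.
func (c *crawl) capture(item string) {
	select {
	case c.results <- item:
	case <-c.done:
		// shutting down: drop the item instead of panicking
	}
}

func (c *crawl) shutdown() {
	// Signal shutdown; never close(results) while senders may still run.
	c.once.Do(func() { close(c.done) })
}

func main() {
	c := &crawl{results: make(chan string, 1), done: make(chan struct{})}
	c.capture("https://example.com/") // buffered send succeeds
	c.shutdown()
	c.capture("https://example.org/") // dropped quietly, no panic
	fmt.Println(<-c.results)
}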

Log rotation causes panic / SIGSEGV

This happened on two distinct, very different crawls, after a long period of simply hanging (no crawling, no logging, for more than an hour).

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x125056e]

goroutine 82 [running]:
github.com/internetarchive/Zeno/internal/pkg/log.(*Logfile).Filename(0x0)
        /home/corentin/projects/Zeno/internal/pkg/log/file.go:51 +0x4e
github.com/internetarchive/Zeno/internal/pkg/log.(*fileHandler).Rotate(0xc000129ce0)
        /home/corentin/projects/Zeno/internal/pkg/log/file.go:30 +0x45
github.com/internetarchive/Zeno/internal/pkg/log.(*Logger).rotate(0xc000260e10)
        /home/corentin/projects/Zeno/internal/pkg/log/rotate.go:57 +0x17c
github.com/internetarchive/Zeno/internal/pkg/log.(*Logger).startRotation.func1()
        /home/corentin/projects/Zeno/internal/pkg/log/rotate.go:22 +0x25
created by github.com/internetarchive/Zeno/internal/pkg/log.(*Logger).startRotation in goroutine 1
        /home/corentin/projects/Zeno/internal/pkg/log/rotate.go:17 +0x85
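
The trace shows Filename being invoked on a nil *Logfile (receiver 0x0), so the rotation goroutine is dereferencing a logfile that was never set or was already torn down. A minimal sketch of that failure and a defensive guard; the field name is hypothetical:

package main

import "fmt"

type Logfile struct {
	path string
}

// Filename tolerates a nil receiver instead of dereferencing it,
// which is what the rotation goroutine crashes on above.
func (l *Logfile) Filename() string {
	if l == nil {
		return ""
	}
	return l.path
}

func main() {
	var l *Logfile // nil, as in the crashing rotation path
	fmt.Printf("filename: %q\n", l.Filename()) // "" instead of SIGSEGV
}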

More efficient deduplication hash table

This is quite a half-baked idea, but we'd be looking to implement some sort of hit counter for items in the hash table, allowing us to clean it up once it fills. This would also allow us to set a size cap on the table, helping during larger runs. A rough sketch follows the list below.

  • Hit counter and removal of items with low number of hits when we're running low on space
  • File based storage of the hash table (somehow...)
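
A sketch of what the capped table could look like; the type names, eviction policy, and cap value are all placeholders, not a committed design:

package main

import "fmt"

type seenTable struct {
	cap  int
	hits map[string]int
}

func newSeenTable(cap int) *seenTable {
	return &seenTable{cap: cap, hits: make(map[string]int, cap)}
}

// Seen records a URL hash and reports whether it was already present,
// evicting a cold entry first when the table is at capacity.
func (t *seenTable) Seen(hash string) bool {
	if _, ok := t.hits[hash]; ok {
		t.hits[hash]++
		return true
	}
	if len(t.hits) >= t.cap {
		t.evictColdest()
	}
	t.hits[hash] = 1
	return false
}

// evictColdest removes one entry with the lowest hit count. A real
// implementation would batch this or keep a heap instead of scanning.
func (t *seenTable) evictColdest() {
	coldKey, coldHits := "", int(^uint(0)>>1)
	for k, h := range t.hits {
		if h < coldHits {
			coldKey, coldHits = k, h
		}
	}
	delete(t.hits, coldKey)
}

func main() {
	t := newSeenTable(2)
	fmt.Println(t.Seen("a"), t.Seen("a"), t.Seen("b"), t.Seen("c")) // false true false false
}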

live-stats flag is broken

Standard logging is interspersed with the expected live-stats information, which breaks the previous behavior of the live-stats flag. In other words, as the typical logs fly by in the terminal window, Zeno repeatedly logs the live-stats info, which then scrolls out of the viewport. The issue seems to be that Zeno no longer suppresses normal logging when the --live-stats flag is active. A screenshot of Zeno's output with the --live-stats flag is included below.

[Screenshot: Zeno output with --live-stats, normal log lines interleaved with the stats display]

Flush HQ finished array on shutdown

If we're attempting to shut down, we should instantly flush the HQ finished array and drop the minimum batch size, so that we can ensure everything gets sent.

WAL tests fail

Logs are flooded with messages like this:

2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test1369699584/index_wal: file already closed"
2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test4114659518/index_wal: file already closed"
2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test1369699584/index_wal: file already closed"
2024/08/04 23:35:09 ERROR failed to sync WAL, retrying error="sync /tmp/index_test4114659518/index_wal: file already closed"
[the same two lines repeat continuously]

When executing tests:

go test ./... -v
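
The flood suggests the retry loop treats a permanently failed Sync (the WAL file was already closed) like a transient error. A standalone sketch of making that failure terminal; the function shape is hypothetical, not the test code's actual structure:

package main

import (
	"errors"
	"log"
	"os"
)

// syncWAL retries transient sync failures but gives up when the file
// is already closed, since that can never succeed.
func syncWAL(f *os.File) error {
	for {
		err := f.Sync()
		if err == nil {
			return nil
		}
		if errors.Is(err, os.ErrClosed) {
			return err
		}
		log.Println("failed to sync WAL, retrying", "error:", err)
	}
}

func main() {
	f, err := os.CreateTemp("", "index_wal")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	f.Close()
	if err := syncWAL(f); err != nil {
		log.Println("giving up:", err) // "file already closed", logged once
	}
}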

AWS and mismatch for User-Agent Zeno

Hi @CorentinB,

See https://web.archive.org/web/20221119203341/https://www.coolblue.de/en/stores for context.

Has the Internet Archive organisation started using this web crawler in public? We have been noticing requests from the Archive being blocked since the end of October by our AWS WAF Bot Control rules, because the user agent does not match archive.org_bot.

If you check the request-blocking details below, you can see it missed the label awswaf:managed:aws:bot-control:bot:name:internet_archive, which would automatically be added if AWS recognized the request as originating from the Internet Archive organisation. The mismatch is caused by the user agent not matching.

action BLOCK
formatVersion 1
httpRequest.clientIp 207.241.235.164
httpRequest.country US
httpRequest.headers.0.name Host
httpRequest.headers.0.value www.coolblue.de
httpRequest.headers.1.name User-Agent
httpRequest.headers.1.value Zeno
httpRequest.headers.2.name Accept-Encoding
httpRequest.headers.2.value gzip
httpRequest.headers.3.name Connection
httpRequest.headers.3.value close
httpRequest.httpMethod GET
httpRequest.httpVersion HTTP/1.1
httpRequest.requestId xxxxxxxxMELtQmGpKoibv3XfAfSMPkDRiNSZprf8ktWPMkrNg==
httpRequest.uri /en/stores
httpSourceId xxxxWA9W0E5
httpSourceName CF
labels.0.name awswaf:managed:token:absent
labels.1.name awswaf:managed:aws:bot-control:signal:non_browser_user_agent
...
ruleGroupList.5.ruleGroupId AWS#AWSManagedRulesBotControlRuleSet
ruleGroupList.5.terminatingRule.action BLOCK
ruleGroupList.5.terminatingRule.ruleId SignalNonBrowserUserAgent
terminatingRuleId AWSManagedRulesBotControlRuleSet
terminatingRuleType MANAGED_RULE_GROUP
timestamp 1668890021691
webaclId arn:aws:wafv2:us-east-1:xxxx:global/webacl/webshop-firewall-cf-webacl-v2/xxxxxx-41dc-9162-020f7bb917e3
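
Until the default changes, operators can override the shipped "Zeno" value with the existing --user-agent flag. The UA string below is illustrative only, not the Archive's official bot identifier:

./Zeno get url https://www.coolblue.de/en/stores --user-agent "Mozilla/5.0 (compatible; archive.org_bot)"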

Allow free space threshold to be customizable

I built Zeno from source (687b5d5) and ran Zeno get url, only to be told I did not have enough space. It would be great if (1) this value were customizable (it appears to be hard-coded to 20 GB) and/or (2) the amount of space needed were reported to the user.
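
A sketch of the requested behavior under stated assumptions: the 20 GB figure comes from the issue, the flag name is made up, and syscall.Statfs is Unix-only:

package main

import (
	"flag"
	"fmt"
	"syscall"
)

func main() {
	minFree := flag.Uint64("min-free-gb", 20, "free space required before crawling (GB)")
	flag.Parse()

	var st syscall.Statfs_t
	if err := syscall.Statfs(".", &st); err != nil {
		panic(err)
	}
	freeGB := st.Bavail * uint64(st.Bsize) / (1 << 30)
	if freeGB < *minFree {
		// Report both numbers instead of a bare refusal.
		fmt.Printf("not enough disk space: %d GB free, %d GB required\n", freeGB, *minFree)
		return
	}
	fmt.Printf("ok to crawl: %d GB free\n", freeGB)
}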

Panic when starting Zeno with go run

Reported by @HarshNarayanJha in #85.

go run . get url https://some.url

panic: runtime error: slice bounds out of range [:7] with length 6

goroutine 1 [running]:
github.com/internetarchive/Zeno/internal/pkg/crawl.GenerateCrawlConfig(0xc000478008)
	/home/corentin/Documents/work/ia/Zeno/internal/pkg/crawl/config.go:248 +0x108b
github.com/internetarchive/Zeno/cmd.init.func8(0xc000418c00?, {0xc000428520, 0x1, 0x14da456?})
	/home/corentin/Documents/work/ia/Zeno/cmd/get_url.go:24 +0x39
github.com/spf13/cobra.(*Command).execute(0x21cdf40, {0xc0004284f0, 0x1, 0x1})
	/home/corentin/go/pkg/mod/github.com/spf13/[email protected]/command.go:983 +0xaca
github.com/spf13/cobra.(*Command).ExecuteC(0x21cd6a0)
	/home/corentin/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
	/home/corentin/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039
github.com/internetarchive/Zeno/cmd.Run()
	/home/corentin/Documents/work/ia/Zeno/cmd/cmd.go:56 +0x211
main.main()
	/home/corentin/Documents/work/ia/Zeno/main.go:21 +0x13
exit status 2
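
The [:7] bound suggests config.go truncates a string, likely a short commit hash, without checking its length first; a go run build presumably yields a shorter value than a release build. A guard sketch (the function is hypothetical):

package main

import "fmt"

// shortVersion truncates to 7 characters only when that is safe; a bare
// s[:7] panics when len(s) < 7, matching "with length 6" in the trace.
func shortVersion(s string) string {
	if len(s) >= 7 {
		return s[:7]
	}
	return s
}

func main() {
	fmt.Println(shortVersion("abc123"))    // 6 chars: returned whole, no panic
	fmt.Println(shortVersion("abc123def")) // truncated to "abc123d"
}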

No such file or directory panic

panic: open jobs/warcs/SPNOUTLINKS-20221021045127671-00030-crawl900.us.archive.org.warc.gz.open: no such file or directory

goroutine 149 [running]:
github.com/CorentinB/warc.isFileSizeExceeded({0xc166f684e0?, 0xc0001b4520?}, 0x408f400000000000)
        /var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/utils.go:196 +0x10e
github.com/CorentinB/warc.recordWriter(0xc00057e0f0, 0x0?, 0x0?)
        /var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/warc.go:120 +0x499
created by github.com/CorentinB/warc.(*RotatorSettings).NewWARCRotator
        /var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/warc.go:50 +0x75

Extract URLs from images

It would be interesting to try OCR on images (as an option) to extract URLs from watermarks and the like.

Define Zeno's queuing behavior properly

As discussed with @equals215, Zeno's behavior in terms of queuing should be as follows (a sketch of the handover idea follows the list):

  • By default: handover + WAL writes batching
  • As an option: --disable-handover, --whatever (--whatever being a placeholder for a good argument name asking for more atomicity, i.e. not using WAL batch writing)
  • When using HQ: handover++ (no queuing AT ALL)
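
A toy sketch of the handover part; the real frontier would track idle workers explicitly rather than probing an unbuffered channel, and all names here are invented:

package main

import (
	"fmt"
	"time"
)

type frontier struct {
	handover chan string // unbuffered: a send succeeds only if a worker is receiving
	queue    []string    // stand-in for the persistent queue + batched WAL writes
}

// enqueue hands a URL straight to an idle worker when possible and only
// touches the (slower, durable) queue otherwise.
func (f *frontier) enqueue(u string) {
	select {
	case f.handover <- u:
	default:
		f.queue = append(f.queue, u)
	}
}

func main() {
	f := &frontier{handover: make(chan string)}
	done := make(chan struct{})
	go func() { fmt.Println("worker got:", <-f.handover); close(done) }()
	time.Sleep(10 * time.Millisecond) // let the worker block on receive
	f.enqueue("https://a.example/")   // handed over directly
	<-done
	f.enqueue("https://b.example/") // no receiver now: persisted instead
	fmt.Println("queued:", len(f.queue))
}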

Panic on /workers access

2024/08/15 07:49:36 http: panic serving 127.0.0.1:45290: runtime error: invalid memory address or nil pointer dereference
goroutine 212832422 [running]:
net/http.(*conn).serve.func1()
        /var/www/.go/src/net/http/server.go:1903 +0xbe
panic({0x1371ac0?, 0x21a7f70?})

Replace github.com/tomnomnom/linkheader with stdlib

github.com/tomnomnom/linkheader is being used to parse Link HTTP headers; we're pretty sure it can be replaced with standard-library-only code.

Whatever code we add needs unit tests (we don't have unit tests for that right now).
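
A stdlib-only starting point, simplified from RFC 8288: it pulls out the URI and rel, which is what outlink discovery needs, but ignores quoted commas inside parameters, exactly the kind of edge case the unit tests should cover:

package main

import (
	"fmt"
	"strings"
)

type link struct {
	URL string
	Rel string
}

// parseLinkHeader splits a Link header into <uri>; param pairs and
// extracts the rel attribute of each.
func parseLinkHeader(h string) []link {
	var links []link
	for _, part := range strings.Split(h, ",") {
		fields := strings.Split(part, ";")
		raw := strings.TrimSpace(fields[0])
		if !strings.HasPrefix(raw, "<") || !strings.HasSuffix(raw, ">") {
			continue
		}
		l := link{URL: strings.Trim(raw, "<>")}
		for _, p := range fields[1:] {
			k, v, ok := strings.Cut(strings.TrimSpace(p), "=")
			if ok && k == "rel" {
				l.Rel = strings.Trim(v, `"`)
			}
		}
		links = append(links, l)
	}
	return links
}

func main() {
	h := `<https://example.com/page2>; rel="next", <https://example.com/style.css>; rel="stylesheet"`
	for _, l := range parseLinkHeader(h) {
		fmt.Printf("%s -> %s\n", l.Rel, l.URL)
	}
}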

Add basic UI to manage Zeno

So the idea is basically to "replicate" the excellent Heritrix3 web UI.

We want to give a way to start, stop, pause, and unpause the crawl, but also inject seeds, search crawl logs, and maybe remove entries matching a query from the frontier... The possibilities are endless.

[Screenshot: the Heritrix3 web UI]

Disk space pause is not working

It appears that something is resetting the pause state immediately after the disk space pause, causing it to continue crawling when we no longer have space.

Custom headers defined by yml file

Allow operators to define headers in a YAML file, per domain, to allow for greater control over headers like User-Agent or similar headers that may need to be configured per host.
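
A sketch of what the file and its loading could look like; the YAML layout, the header names, and the use of gopkg.in/yaml.v3 are all assumptions, not an agreed-upon format:

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// headersConfig maps a host to the headers to send when requesting it.
type headersConfig map[string]map[string]string

// sample is an illustrative config; hosts and values are placeholders.
const sample = `
www.coolblue.de:
  User-Agent: "archive.org_bot"
  Accept-Language: "de-DE"
`

func main() {
	var cfg headersConfig
	if err := yaml.Unmarshal([]byte(sample), &cfg); err != nil {
		panic(err)
	}
	for name, value := range cfg["www.coolblue.de"] {
		fmt.Printf("%s: %s\n", name, value)
	}
}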
