crau's Introduction

crau: Easy-to-use Web Archiver

crau is the way (most) Brazilians pronounce "crawl". It aims to be the easiest command-line tool for archiving the Web and playing back archives: you just need a list of URLs.

Installation

pip install crau

Running

Archiving

Archive a list of URLs by passing them via command-line:

crau archive myarchive.warc.gz http://example.com/page-1 http://example.org/page-2 ... http://example.net/page-N

or passing a text file (one URL per line):

echo "http://example.com/page-1" > urls.txt
echo "http://example.org/page-2" >> urls.txt
echo "http://example.net/page-N" >> urls.txt

crau archive myarchive.warc.gz -i urls.txt

Run crau archive --help for more options.

Extracting data from an archive

List archived URLs in a WARC file:

crau list myarchive.warc.gz

Extract a file from an archive:

crau extract myarchive.warc.gz https://example.com/page.html extracted-page.html

Playing the archived data on your Web browser

Run a server on localhost:8080 to play your archive:

crau play myarchive.warc.gz

Packing downloaded files into a WARC

If you've mirrored a website using wget -r, httrack or a similar tool, so that the files are already on your file system, you can use crau to create a WARC file from them. Run:

crau pack [--inner-directory=path] <start-url> <path-or-archive> <warc-filename>

Where:

  • start-url: base URL you've downloaded (this will be joined with the actual file names to create the complete URL).
  • path-or-archive: path where the files are located. Can also be a .tar.gz, .tar.bz2, .tar.xz or .zip archive. crau will retrieve all files recursively.
  • warc-filename: file to be created.
  • --inner-directory: used when a TAR/ZIP archive is passed, to select which directory inside the archive the files will be read from. Example: the archive has a backup/ directory at its root and a www.example.com/ directory inside it, so the files actually live under backup/www.example.com/ - pass --inner-directory=backup/www.example.com/ and only files inside this path will be considered (in this case, backup/www.example.com/contact.html will be archived as <start-url>/contact.html). See the example command below.
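
For example, given the backup/www.example.com/ layout described above (the archive and WARC file names here are purely illustrative):

crau pack --inner-directory=backup/www.example.com/ https://www.example.com/ site-backup.tar.gz myarchive.warc.gz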

Why not X?

There are other archiving tools, of course. The motivation to start this project was a lack of easy, fast and robust software to archive URLs - I just wanted to execute one command without thinking and get a WARC file. Depending on your problem, crau may not be the best answer - check out more archiving tools in awesome-web-archiving.

Why not GNU Wget?

  • Lacks parallel downloading;
  • Some versions simply crash with a segmentation fault, depending on the website;
  • Its many options make the simple task of archiving harder than it should be;
  • There's no easy way to extend its behavior.

Why not Wpull?

  • Its many options make the simple task of archiving harder than it should be;
  • Easier to extend than Wget, but still difficult compared to crau (since crau uses Scrapy).

Why not crawl?

  • Lacks some features and is difficult to contribute to (the GitLab instance where it's hosted doesn't allow registration);
  • Has some bugs in collecting page dependencies (such as static assets referenced inside a CSS file);
  • Has a bug where it enters a loop: if a static asset returns an HTML page instead of the expected file, it ignores the depth limit and keeps fetching that page's dependencies; if any of those dependencies has the same problem, the crawl goes on to infinite depth.

Why not archivenow?

This tool makes it easy to submit URLs to archiving services such as archive.is from the command line, and it can also create local archives, but when doing so it calls wget to do the job.

Contributing

Clone the repository:

git clone https://github.com/turicas/crau.git

Install development dependencies (you may want to create a virtualenv):

cd crau && pip install -r requirements-development.txt

Install an editable version of the package:

pip install -e .

Make your changes, commit them to a new branch and then create a pull request on GitHub.

crau's People

Contributors

rhenanbartels · turicas · victor-torres


crau's Issues

Transfer encoding is not preserved

Based on the code (read: no tests have been conducted), it looks like crau does not preserve transfer encoding in the HTTP responses. Instead, the data will be stored with transfer encoding stripped, but the headers will likely still contain e.g. a Transfer-Encoding: chunked header, which means that the playback of these WARCs requires special code to handle this mismatch. (The relevant section in the specification is 6.3.2 in both WARC/1.1 and WARC/1.0.)
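
A minimal sketch of one possible mitigation on the writing side - assuming the headers reach the WARC writer as a list of (name, value) string tuples; the helper name is hypothetical and not part of crau's code:

    def drop_transfer_encoding(headers):
        """Drop Transfer-Encoding before writing headers to the WARC record.

        Scrapy hands over an already de-chunked body, so keeping the header
        would contradict the stored payload (see WARC spec section 6.3.2).
        """
        return [
            (name, value)
            for name, value in headers
            if name.lower() != "transfer-encoding"
        ]

Alternatively, playback tools can be taught to ignore the header, but fixing it at write time keeps the records self-consistent.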

UnicodeDecodeError: 'ascii' codec can't decode byte

2022-03-31 20:32:47 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.aosfatos.org/todas-as-declara%C3%A7%C3%B5es-de-bolsonaro/?q=cloroquina&o=> (referer: https://www.aosfatos.org/noticias/medicamentos-aprovados-covid-19/)
Traceback (most recent call last):
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
    yield next(it)
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/username/.pyenv/versions/3.9.10/envs/crau/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/username/Documents/projects/pythonic-cafe/crau/crau/spider.py", line 164, in parse
    self.write_warc(response)
  File "/Users/username/Documents/projects/pythonic-cafe/crau/crau/spider.py", line 132, in write_warc
    write_warc_request_response(self.warc_writer, response)
  File "/Users/username/Documents/projects/pythonic-cafe/crau/crau/utils.py", line 172, in write_warc_request_response
    get_headers_list(response.headers),
  File "/Users/username/Documents/projects/pythonic-cafe/crau/crau/utils.py", line 147, in get_headers_list
    return [
  File "/Users/username/Documents/projects/pythonic-cafe/crau/crau/utils.py", line 148, in <listcomp>
    (key.decode("ascii"), value[0].decode("ascii"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 42: ordinal not in range(128)
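
The failing list comprehension decodes header bytes as ASCII. A minimal sketch of one possible fix, decoding with Latin-1 instead (the historical fallback encoding for HTTP header bytes); the function mirrors the name seen in the traceback, but the real code in crau/utils.py may differ:

    def get_headers_list(headers):
        # Scrapy stores headers as a mapping of bytes -> [bytes, ...]; decoding
        # with Latin-1 maps every byte to a character, so non-ASCII bytes such
        # as 0xe7 ("ç") no longer raise UnicodeDecodeError.
        return [
            (key.decode("latin-1"), values[0].decode("latin-1"))
            for key, values in headers.items()
        ]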

Capture any HTTP code

By default, Scrapy does not pass responses with certain HTTP status codes (404, for example) to our spider. We need to save everything.
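
A sketch of how stock Scrapy settings can allow this - HTTPERROR_ALLOW_ALL lets every non-2xx response reach the spider callback, and the handle_httpstatus_all request meta key is the per-request equivalent; the spider below is a hypothetical example, not crau's real spider:

    import scrapy

    class ArchiveEverythingSpider(scrapy.Spider):
        name = "archive_everything"  # illustrative name
        start_urls = ["http://example.com/"]

        # Let 404s, 500s etc. reach parse() so they can be written to the WARC too.
        custom_settings = {"HTTPERROR_ALLOW_ALL": True}

        def parse(self, response):
            self.logger.info("Got HTTP %s for %s", response.status, response.url)
            # The same behaviour can also be enabled per request:
            yield scrapy.Request(
                response.urljoin("/some-missing-page"),
                meta={"handle_httpstatus_all": True},
                callback=self.parse,
            )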

Remove custom_settings from spider

We must pass all settings through the command-line interface, so we can override every key.
The default settings could live in a classmethod on the Spider class, which the CLI code calls, overrides as needed, and then passes to CrawlerProcess. This way, we could run crau archive --settings DOWNLOAD_TIMEOUT=60 --settings DOWNLOAD_DELAY=30 ....
@rhenanbartels
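
A rough sketch of that idea (the class body, setting values and function names are illustrative, not crau's actual code):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class CrauSpider(scrapy.Spider):  # heavily simplified illustration
        name = "crau"

        @classmethod
        def default_settings(cls):
            # What used to live in custom_settings, exposed so the CLI can override any key.
            return {"DOWNLOAD_TIMEOUT": 240, "DOWNLOAD_DELAY": 0}

    def run_archive(cli_settings):
        # cli_settings comes from repeated `--settings KEY=VALUE` options,
        # e.g. {"DOWNLOAD_TIMEOUT": "60", "DOWNLOAD_DELAY": "30"}.
        settings = CrauSpider.default_settings()
        settings.update(cli_settings)
        process = CrawlerProcess(settings=settings)
        process.crawl(CrauSpider)
        process.start()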

Black error on Ubuntu 18.04.03

Freshly installed and updated system:

Could not find a version that satisfies the requirement black (from -r requirements-development.txt (line 3)) (from versions: )
No matching distribution found for black (from -r requirements-development.txt (line 3))

Tried it with plain pip as well, which results in a setuptools egg error. Not sure why it didn't update the outdated packages.
If you also hit this error, run:

pip install --upgrade setuptools

And then run

pip install crau

again.

Create a Scrapy cache storage backed by WARC

If we extract the WARC-writing code from the spider and add a WARC-reading routine, we can implement a Scrapy cache storage that reads and writes WARC files. This could be very handy for creating archives from old spiders just by changing the cache setting.
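
A rough skeleton of what such a cache storage could look like, following the interface Scrapy expects from HTTPCACHE_STORAGE backends (open_spider, close_spider, retrieve_response, store_response); all names and the WARC_CACHE_PATH setting are hypothetical, and the WARC read/write internals are left as comments:

    class WarcCacheStorage:
        """Hypothetical HTTPCACHE_STORAGE backend that reads and writes WARC files."""

        def __init__(self, settings):
            self.warc_path = settings.get("WARC_CACHE_PATH", "cache.warc.gz")

        def open_spider(self, spider):
            # Open the WARC file and index existing records by URL.
            pass

        def close_spider(self, spider):
            # Flush and close the WARC writer.
            pass

        def retrieve_response(self, spider, request):
            # Look up request.url in the WARC index and rebuild a scrapy
            # Response from the stored record; return None on a cache miss.
            return None

        def store_response(self, spider, request, response):
            # Write request/response records, reusing the WARC-writing code
            # extracted from the spider.
            pass

    # Enabling it would then just be a settings change:
    # HTTPCACHE_ENABLED = True
    # HTTPCACHE_STORAGE = "crau.cache.WarcCacheStorage"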

Headers not preserved correctly

The HTTP headers appear to not be written to WARC correctly. The request headers are constructed independently of what Scrapy actually sends to the server; the response's original status line, as mentioned in a code comment, is discarded entirely; and the headers get normalised by Scrapy (in scrapy.http.headers.Headers). Instead, the exact bytes sent to and received from the server (on the HTTP layer) should be written to the WARC file.

Invalid Syntax

Freshly installed and updated Ubuntu 18.04.03:

crau archive bild.warc.gz https://bild.de
Traceback (most recent call last):
  File "/usr/local/bin/crau", line 11, in <module>
    load_entry_point('crau==0.1.0', 'console_scripts', 'crau')()
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2852, in load_entry_point
    return ep.load()
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2443, in load
    return self.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2449, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/dist-packages/crau/__init__.py", line 1, in <module>
    from .spider import CrauSpider
  File "/usr/local/lib/python2.7/dist-packages/crau/spider.py", line 101
    f"{request.method} {path} HTTP/1.1", headers_list, is_http_request=True
                                       ^
SyntaxError: invalid syntax

(The paths show the package was installed under Python 2.7, which cannot parse f-strings such as the one on spider.py line 101; installing crau with Python 3.6 or newer, e.g. via python3 -m pip install crau, avoids this error.)

Change default settings to optimize broad crawls

Something like:

    custom_settings = {
        "CONCURRENT_REQUESTS": 256,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
        "DNSCACHE_ENABLED": True,
        "DNSCACHE_SIZE": 500000,
        "DNS_TIMEOUT": 5,
        "DOWNLOAD_MAXSIZE": 5 * 1024 * 1024,
        "DOWNLOAD_TIMEOUT": 5,
        "HTTPCACHE_ALWAYS_STORE": True,
        "REACTOR_THREADPOOL_MAXSIZE": 40,
        "SCHEDULER_PRIORITY_QUEUE": "scrapy.pqueues.DownloaderAwarePriorityQueue",
    }

Also: check Scrapy's documentation on broad crawls.

Change User-Agent

Tasks:

  • Set the spider's User-Agent to crau {crau.__version__} (the USER_AGENT setting);
  • Add a --user-agent option to crau archive for changing the User-Agent (a sketch of both tasks follows this list).
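
A sketch of both tasks, assuming the CLI is built with click (the option wiring below is illustrative, not crau's actual CLI code):

    import click

    import crau  # the installed package exposes crau.__version__

    DEFAULT_USER_AGENT = f"crau {crau.__version__}"

    @click.command()
    @click.option("--user-agent", default=DEFAULT_USER_AGENT,
                  help="Value of the User-Agent header sent with every request.")
    @click.argument("warc_filename")
    @click.argument("urls", nargs=-1)
    def archive(user_agent, warc_filename, urls):
        # USER_AGENT ends up in the settings handed to CrawlerProcess.
        settings = {"USER_AGENT": user_agent}
        click.echo(f"Archiving {len(urls)} URL(s) into {warc_filename} as {settings['USER_AGENT']!r}")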
