
warcworker's Introduction

Warcworker

A dockerized, queued, high-fidelity web archiver based on Squidwarc (headless Chrome), RabbitMQ and a small web frontend. Using the scripting abilities of Squidwarc, you can add scripts to run for a specific job (e.g. srcset enrichment, comment expansion). Please note that Warcworker is not a crawler: it will not crawl a website automatically, so you have to use other software to build lists of URLs to send to Warcworker.

screenshot of Warcworker

Installation

Copy .env_example to .env. Update information in .env.

Start with docker-compose up -d --scale worker=3, then wait a minute for everything to start up.

Archiving and playback

Open the web frontend at http://0.0.0.0:5555 to enter URLs for archiving. You can prefill the text fields with the url and description request parameters. Play back the resulting WARC files with Webrecorder Player.
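As a rough illustration of the url and description request parameters, the sketch below builds a prefilled frontend link in Python. The function name prefill_link and the FRONTEND constant are hypothetical, and the address is assumed to be the one used in this README; adjust it to your own setup.

```python
from urllib.parse import urlencode

# Assumption: the frontend listens on the address used in this README.
FRONTEND = "http://0.0.0.0:5555"


def prefill_link(url, description=""):
    """Return a frontend link that prefills the url and description fields.

    Hypothetical helper for illustration; not part of Warcworker itself.
    """
    params = {"url": url}
    if description:
        params["description"] = description
    # urlencode percent-encodes the values so they survive as query parameters
    return FRONTEND + "?" + urlencode(params)


if __name__ == "__main__":
    print(prefill_link("https://www.peterkrantz.com/", "Example page"))
```

Opening the printed link in a browser should show the form with both fields filled in, the same way the bookmarklet does.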

Using

Bookmarklet

Add a bookmarklet to your browser with the following link:

javascript:window.open('http://0.0.0.0:5555?url='+encodeURIComponent(location.href) + '&description=' + encodeURIComponent(document.title));window.focus();

Now you have two-click web archiving from your browser.

Command line

To use from the command line with curl:

curl -d "scripts=srcset&scripts=scroll_everything&url=https://www.peterkrantz.com/" -X POST http://0.0.0.0:5555/process/
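The same request can be sketched in Python with only the standard library. This is a minimal sketch, assuming the endpoint and form fields shown in the curl command; the names build_request and submit are hypothetical. The scripts field is repeated once per script, in the order given.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: same endpoint as the curl example.
PROCESS_URL = "http://0.0.0.0:5555/process/"


def build_request(url, scripts=()):
    """Build a form-encoded POST equivalent to the curl command.

    Hypothetical helper for illustration; not part of Warcworker itself.
    """
    # A list of pairs repeats the "scripts" key and keeps the given order
    fields = [("scripts", name) for name in scripts] + [("url", url)]
    body = urlencode(fields).encode("ascii")
    return Request(PROCESS_URL, data=body, method="POST")


def submit(url, scripts=(), timeout=120):
    """Send the request and return the HTTP status code."""
    with urlopen(build_request(url, scripts), timeout=timeout) as response:
        return response.status
```

Since Warcworker does not crawl on its own, a list of URLs produced by other software could be queued by calling submit once per URL, for example submit("https://www.peterkrantz.com/", ["srcset", "scroll_everything"]).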

Archivenow handler

To use from archivenow add a handler file handlers/ww_handler.py like this:

import requests
import json

class WW_handler(object):

    def __init__(self):
        self.enabled = True
        self.name = 'Warcworker'
        self.api_required = False

    def push(self, uri_org, p_args=[]):
        msg = ''
        try:
            # add scripts in the order you want them to be run on the page
            payload = {"url":uri_org, "scripts":["scroll_everything", "srcset"]}

            r = requests.post('http://0.0.0.0:5555/process/', timeout=120,
                    data=payload,
                    allow_redirects=True)

            r.raise_for_status()
            return "%s added to queue" % uri_org

        except Exception as e:
            msg = "Error (" + self.name+ "): " + str(e)
        return msg

warcworker's People

Contributors

n0tan3rd, peterk, tripleo1


warcworker's Issues

[Question] Squidwarc Frontier Management and long scalable crawls

One of the use cases I have wanted to support in Squidwarc is multiple worker crawlers populating and pulling from a single master frontier, as well as a move from the current in-memory frontier to a more scalable frontier scheme.

Since warcworker is light years ahead in this regard (i.e. a frontend for Squidwarc with multiple crawler workers and expandability potential for managing long crawls), I thought it best to see if warcworker has any interest in this functionality and, if so, to coordinate development 😃

FileNotFoundError after clicking Start archiving

Hi, I'm exploring tools for crawling social media. I got a FileNotFoundError after starting a crawl. I chose scroll_everything as script.

FileNotFoundError

FileNotFoundError: [Errno 2] No such file or directory: '/scripts/job/0fe3e4dc888e2f497d59d20ccf551c38.js'

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/main.py", line 121, in process
    message = make_job(jobid, output_path, seeds, description, scripts)
  File "/app/main.py", line 29, in make_job
    with open(jobscript_file, 'w') as outfile:
FileNotFoundError: [Errno 2] No such file or directory: '/scripts/job/8c5bad7d59ecfbe5f8aa2a4df4bffa6b.js'
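In Python, open(path, 'w') raises FileNotFoundError with Errno 2 when a directory in the path does not exist, so one possible cause is that the /scripts/job directory is missing inside the container. That is an assumption based only on the traceback, not a confirmed diagnosis. A minimal sketch of a guard, using a hypothetical helper write_job_script rather than Warcworker's actual make_job:

```python
import os
import tempfile


def write_job_script(jobscript_file, source):
    """Write a job script, creating its parent directory first if needed.

    Hypothetical helper for illustration; not Warcworker's make_job.
    """
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(os.path.dirname(jobscript_file), exist_ok=True)
    with open(jobscript_file, "w") as outfile:
        outfile.write(source)


if __name__ == "__main__":
    # A temporary directory stands in for /scripts in this demonstration
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "job", "example.js")
        write_job_script(path, "// job script placeholder\n")
        print(os.path.exists(path))
```

If the directory is expected to come from a Docker volume, checking that the volume is mounted and writable may be the more appropriate fix than creating it at runtime.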

Rewrite the worker using JavaScript instead of Python

Currently the worker uses Python 3.6 compiled from source. It could probably just as well use the bundled JavaScript facilities from the base image to work on queue items. That would reduce dependencies and make installation faster.

Only pulled one page

  • How do I pull an entire website with this?
  • How do I see what it is doing internally?

Consider using text other than "Archive it!" in the interface

Internet Archive runs a service called "Archive-It" that many people who do personal archiving use.

The screenshot (and interface elements when testing) made me initially question what this tool has to do with Archive-It. As a suggestion: maybe put the name of the tool (warcworker) in the header box and something more representative (e.g., Archive URLs) in the button text. This would prevent any confusion and still be descriptive of what the tool accomplishes.

Docker compose doesn't work

Instructions in README give this:

/bin/sh: 1: wget: not found
E: gnupg, gnupg2 and gnupg1 do not seem to be installed, but one of them is required for this operation
ERROR: Service 'worker' failed to build: The command '/bin/sh -c wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -' returned a non-zero code: 255

Can't run scripts

Getting tracebacks when ticking a script checkbox.
FileNotFoundError: [Errno 2] No such file or directory: '/scripts/job/6469c99f84619919ec151be6e5d28a3c.js'

Works as expected when no script boxes are ticked.
