Browsertrix 0.1.1

Browsertrix is a web archiving automation system, designed to create high-fidelity web archives by automating real browsers running in Docker containers using Selenium and other automation tools. The system does not currently do any archiving of its own; instead, it automates browser page loading through existing archiving and recording tools.

By loading pages directly through a browser, it will be possible to fully recreate a page as the user experiences it, including all dynamic content and interaction.

Browsertrix is named after Heritrix, the venerable web crawler technology which has become a standard for web archiving.

What Browsertrix Does

The first iteration of Browsertrix supports archiving a single web page, through an existing archiving back-end.

URLs can be submitted to Browsertrix via HTTP, and it will attempt to load each URL in an available browser right away. Browsertrix can operate synchronously or asynchronously. If the operation does not complete within the specified timeout (30 seconds by default), a queued response is returned, and the user may retry the request later to get the result. The results of the archiving operation are cached (for 10 minutes if successful, for 30 seconds otherwise), so that repeated requests within that window return the cached result.

Redis is used to queue URLs for archiving and to cache the results of archiving operations. Configurable options are currently available in the config.py module.
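The caching rule described above can be illustrated with a small sketch. It is not the Browsertrix implementation: it uses an in-memory dictionary as a stand-in for Redis, and the names SUCCESS_TTL_SECS, FAILURE_TTL_SECS, cache_ttl and ResultCache are hypothetical. Only the two durations (10 minutes and 30 seconds) come from the text.

```python
import time

# Durations taken from the description above; the constant names are hypothetical.
SUCCESS_TTL_SECS = 600  # 10 minutes for a successful archiving result
FAILURE_TTL_SECS = 30   # 30 seconds for any other result


def cache_ttl(result):
    """Return how long (in seconds) a result should stay cached."""
    return SUCCESS_TTL_SECS if result.get("archived") else FAILURE_TTL_SECS


class ResultCache:
    """In-memory stand-in for the Redis result cache (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expiry time, result)

    def put(self, key, result):
        self._entries[key] = (self._clock() + cache_ttl(result), result)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, result = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the next request re-archives the page.
            del self._entries[key]
            return None
        return result
```

With a real Redis server, the same effect would normally come from storing each result with an expiry time rather than checking the clock on read.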

Additional automated browser "crawling" and multi-url features are planned for the next iteration.

Installation

Docker and Docker Compose are the only requirements for running Browsertrix.

Install Docker as recommended at: https://docs.docker.com/installation/

Install Docker Compose with: pip install docker-compose

After cloning this repository, run docker-compose up

Web Interface

In this version, a basic 'Archive This Website' UI is available on the home page and provides a form to submit URLs to be archived through Chrome or Firefox. The interface wraps the Archiving API explained below.

The supported backends are https://webrecorder.io/ and the Internet Archive's Save Page Now feature.

The interface is available at http://$DOCKER_HOST/, where DOCKER_HOST is the host where Docker is running.

Scaling Workers

By default, Browsertrix starts with one Chrome and one Firefox worker. docker-compose scale can be used to set the number of workers as needed.

The set-scale.sh script is provided as a convenience for setting the number of Chrome and Firefox workers at once. For example, to have 4 of each browser, you can run:

./set-scale.sh 4

Archiving API /archivepage

This first iteration of Browsertrix provides an API endpoint, /archivepage, for archiving a single page.

To archive a URL, make a GET request to http://<DOCKER_HOST>/archivepage?url=URL&archive=ARCHIVE[&browser=BROWSER]

  • url - The URL to be archived

  • archive - One of the available archives specified in config.py. Current archives are ia-save and webrecorder

  • browser - (Optional) Currently either chrome or firefox. Chrome is the default if omitted.
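As a minimal sketch, the request URL can be assembled with Python's standard library so that the target URL is percent-encoded correctly. The helper name build_archive_request and the host value used with it are assumptions for illustration; the endpoint and parameter names are the ones listed above.

```python
from urllib.parse import urlencode


def build_archive_request(docker_host, url, archive, browser=None):
    """Build the /archivepage request URL for the given page and archive.

    docker_host is the host where Docker is running. browser is optional;
    when it is omitted the server-side default (Chrome) applies.
    """
    params = {"url": url, "archive": archive}
    if browser is not None:
        params["browser"] = browser
    # urlencode percent-encodes the target url so that its own ':' and '/'
    # characters do not get mixed up with the request's query string.
    return "http://{0}/archivepage?{1}".format(docker_host, urlencode(params))
```

For example, build_archive_request("localhost", "http://example.com/", "ia-save", browser="firefox") produces a URL that asks the Firefox worker to archive the page through the ia-save archive.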

Results

The result of the archiving operation is a JSON block. The block contains one of the following.

  • error: true is set if the operation failed, and the msg field contains more details about the error. The type field indicates a specific type of error, e.g. type: blocked currently indicates that the archiving service cannot archive this page.

  • queued: true is set if the timeout for archiving the page (currently 30 secs) has been exceeded. In this case, the url has been put on a queue and the query should be retried until the page is archived. The queue-pos field indicates the position in the queue: queue-pos: 1 means the url is up next, and queue-pos: 0 means the url is currently being loaded in the browser.

  • archived: true is set if the archiving of the page has fully finished. The following additional properties may be set in the JSON result:

    • replay_url - if the archived page is immediately available for replay, this is the url to access the archived content.

    • download_url - if the archived content is available for download as a WARC file, this is the link to the WARC.

    • actual_url - if the original url caused a redirect, this will contain the actual url that was archived (only present if different from original).

    • browser_url - The actual url loaded by the browser to "seed" the archive.

    • time - Timestamp of when the page was archived.

    • ttl - time remaining (in seconds) for this entry to be stored in the cache. After the entry expires, a subsequent query will re-archive the page. Default is 10 min (600 secs) and can be configured in config.py.

    • log - HTTP response log from the browser, available only in Chrome. The format is {<URL>: <STATUS>} for each url loaded to archive the current page.
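The three result states above suggest a simple client-side retry loop. The following is a sketch under stated assumptions, not part of Browsertrix: the function names, the retry count and the delay are hypothetical, and it assumes the server returns the JSON block with a successful HTTP status in all three cases.

```python
import json
import time
from urllib.request import urlopen


def fetch_json(request_url):
    """Issue the GET request and decode the JSON result block."""
    with urlopen(request_url) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_archive(request_url, fetch=fetch_json, max_attempts=10,
                     delay_secs=5, sleep=time.sleep):
    """Retry the query until the page is archived or an error is reported.

    max_attempts and delay_secs are illustrative values, not Browsertrix
    settings. fetch and sleep are injectable so the loop can be exercised
    without a running server.
    """
    for attempt in range(max_attempts):
        result = fetch(request_url)
        if result.get("archived"):
            return result
        if result.get("error"):
            raise RuntimeError("archiving failed ({0}): {1}".format(
                result.get("type"), result.get("msg")))
        if not result.get("queued"):
            raise ValueError("unexpected response: {0!r}".format(result))
        # Still queued: wait before retrying, unless this was the last attempt.
        if attempt < max_attempts - 1:
            sleep(delay_secs)
    raise TimeoutError("page still queued after {0} attempts".format(max_attempts))
```

On success, the returned dictionary carries the optional properties listed above, such as replay_url or download_url, when the archive provides them.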

Support

Initial work on this project was sponsored by the Hypothes.is Annotation Fund
