
Awesome Web Archiving

Introduction

An Awesome List for getting started with web archiving. Inspired by the awesome list.

Contribute

Please ensure your pull request adheres to the following guidelines:

  • Use the following format:
    • [Name](link) (Status: Stable or In Development) - Brief Description of what the module does
  • Make an individual pull request for each new item.
  • Link additions should be inserted alphabetically into the relevant category.
  • New categories or improvements to the existing categorization are welcome.
  • Check your spelling and grammar.
  • The pull request and commit should have a useful title.
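
The entry format above can be checked mechanically before opening a pull request. The following is a minimal sketch and not part of this repository's tooling: the function name `check_entry`, the regular expression, and the `example.org` link are assumptions made for illustration.

```python
import re

# Hypothetical pattern for: [Name](link) (Status) - Brief description
ENTRY_RE = re.compile(
    r"^\[(?P<name>[^\]]+)\]\((?P<link>https?://[^)\s]+)\)"
    r" \((?P<status>Stable|In Development)\)"
    r" - (?P<description>\S.*)$"
)


def check_entry(line):
    """Return the parsed fields of a list entry, or None if it does not match."""
    text = line.strip()
    # Drop a leading Markdown bullet marker, if present.
    if text[:2] in ("* ", "- "):
        text = text[2:]
    match = ENTRY_RE.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # The link below is a placeholder, not the project's real URL.
    print(check_entry(
        "* [twarc](https://example.org/twarc) (Stable) - "
        "A command line tool and Python library for archiving Twitter JSON data."
    ))
```

The sketch only accepts the two statuses named in the guideline; other statuses would need to be added to the pattern.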

License

CC0

To the extent possible under law, the owner has waived all copyright and related or neighboring rights to this work.

The List

Training/Documentation

Tools & Software

This list of tools and software is intended to briefly describe some of the most important and widely-used tools related to web archiving. For more details, we recommend you refer to (and contribute to!) these excellent resources from other groups:

Acquisition

  • ArchiveFacebook (Stable) - A Mozilla Firefox add-on for individuals to archive their Facebook accounts.

  • archivenow (Stable) - A Python library to push web resources into on-demand web archives.

  • Brozzler (Stable) - A distributed web crawler that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links.

  • F(b)arc (Stable) - A commandline tool and Python library for archiving data from Facebook using the Graph API.

  • grab-site (Stable) - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns.

  • Heritrix (Stable) - An open source, extensible, web-scale, archival quality web crawler.

  • html2warc (Stable) - A simple script to convert offline data into a single WARC file.

  • HTTrack (Stable) - An open source website copying utility.

  • Lentil (Stable) - A Ruby on Rails Engine that supports the harvesting of images from Instagram and provides several browsing views, mechanisms for sharing, tools for users to select their favorite images, an administrative interface for moderating images, and a system for harvesting images and submitting donor agreements in preparation of ingest into external repositories.

  • SiteStory (Stable) - A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server.

  • Squidwarc (In Development) - An open source, high-fidelity, page interacting archival crawler that uses Chrome or Chrome Headless directly.

  • twarc (Stable) - A command line tool and Python library for archiving Twitter JSON data.

  • WARCreate (Stable) - A Google Chrome extension for archiving an individual webpage or website to a WARC file.

  • WAIL (Stable) - A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages. (Python, Electron)

  • Web2Warc (Stable) - An easy-to-use and highly customizable crawler that enables anyone to create their own little Web archives (WARC/CDX).

  • Webrecorder (Stable) - Create high-fidelity, interactive recordings of any web site you browse.

  • Wget (Stable) - An open source file retrieval utility that, as of version 1.14, supports writing WARC files.

  • Wget-lua (Stable) - Wget with Lua extension.

  • Wpull (Stable) - A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler.
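
As a small illustration of the Wget entry above, the sketch below builds a Wget command line that records a fetch to a WARC file using the `--warc-file` option (available from Wget 1.14). The helper name `wget_warc_command`, the URL, and the file prefix are hypothetical; the command is only constructed here, not executed.

```python
import shlex


def wget_warc_command(url, warc_prefix):
    """Build a Wget command line that also records the fetch in a WARC file."""
    # --warc-file takes a file name prefix; Wget appends the WARC extension.
    return ["wget", f"--warc-file={warc_prefix}", url]


if __name__ == "__main__":
    # Placeholder URL and prefix, chosen for illustration only.
    cmd = wget_warc_command("http://example.org/", "example")
    print(shlex.join(cmd))
```

To actually run the crawl, the returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where Wget 1.14 or later is installed.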

Replay

  • PyWb (Stable) - A Python (2 and 3) implementation of web archival replay tools, sometimes also known as 'Wayback Machine'.

  • OpenWayback (Stable) - An open source project that develops the Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user's browser.

  • Webrecorder Player - Webrecorder Player for Desktop OSX/Windows/Linux. (Built with Electron + Webrecorder)

  • InterPlanetary Wayback (ipwb) - Web Archive (WARC) indexing and replay using IPFS.

Search & Discovery

  • Shine (Stable) - A prototype web archives exploration UI, based on a Solr back-end that has been populated using the warc-discovery indexer.

  • Tempas v1 (Stable) - Temporal web archive search based on Delicious tags.

  • Tempas v2 (Stable) - Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., Obama@2005-2009 in Tempas).

  • warc-discovery (Stable) - WARC and ARC indexing and discovery tools.

  • Warclight (In Development) - A Project Blacklight based Rails engine that supports the discovery of web archives held in the WARC and ARC formats. Designed to work with warc-discovery.

Utilities

  • HadoopConcatGz (Stable) - A Splittable Hadoop InputFormat for Concatenated GZIP Files (and *.warc.gz).

  • har2warc - Convert HTTP Archive (HAR) -> Web Archive (WARC) format. (Python)

  • httpreserve.info (Stable) - Service to return the status of a web page or save it to the Internet Archive. Returns JSON via the browser, or via the command line using curl with a GET request. (Golang Package)

  • HTTPreserve Workbench (In Development) - Tool and API that report the status of a web page as simple JSON, including its current status and the earliest and latest links on wayback.org. It can also save a web page to the Internet Archive, and audit lists of URIs and output a CSV with the data described above. (Golang)

  • Jwat (Stable) - Libraries and tools for reading/writing/validating WARC/ARC/GZIP files. (Java)

  • node-cdxj (Stable) - CDXJ file parser. (Node.js)

  • node-warc (Stable) - Parse WARC files or create WARC files using either Electron or chrome-remote-interface. (Node.js)

  • The Archive Browser - The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported. (OSX only, Proprietary app)

  • The Unarchiver - Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser. (OSX only, Proprietary app)

  • tikalinkextract (In Development) - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika. (Golang, Apache Tika Server)

  • Warcat (Stable) - Tool and library for handling Web ARChive (WARC) files. (Python)

  • warcio - Streaming WARC/ARC library for fast web archive IO. (Python)

  • warctools - Library to work with ARC and WARC files. (Python)

  • wasapi-downloader (Stable) - Java command line application to download crawls from WASAPI.

  • WarcPartitioner (Stable) - Partition (W)ARC Files by MIME Type and Year.

  • webarchive - Golang readers for ARC and WARC webarchive formats.

  • webarchive-indexing - Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
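
Many of the utilities above read or write WARC records. To show roughly what such a record looks like on disk, here is a minimal, standard-library-only sketch of the uncompressed WARC/1.0 layout: a version line, named header fields, a blank line, a block of `Content-Length` bytes, and two trailing CRLFs. The function names `build_warc_record` and `read_warc_records` are hypothetical, and the sketch ignores gzip compression, header line folding, and validation. For real work, one of the libraries listed above is the better choice.

```python
import io
import uuid
from datetime import datetime, timezone


def build_warc_record(warc_type, target_uri, payload):
    """Serialize one uncompressed WARC/1.0 record as bytes."""
    headers = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) for each uncompressed record in a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator line between records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {version!r}")
        headers = {}
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            # Split on the first colon only; values such as dates contain colons.
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


if __name__ == "__main__":
    # Placeholder URI and payload, for illustration only.
    data = build_warc_record("resource", "http://example.org/a", b"hello")
    for headers, payload in read_warc_records(io.BytesIO(data)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], payload)
```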

Analysis

  • ArchiveSpark (Stable) - An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction as well as derivation.

  • aut (Stable) - Archives Unleashed Toolkit is an open-source platform for managing & analyzing web archives.

Community Resources

Blogs and Scholarship

  • IIPC Blog
  • Web Archiving Roundtable - Currently dormant, but is a great archive of web archiving resources and links.
  • The Web as History - An open-source book that provides a conceptual overview to web archiving research, as well as several case studies.

Mailing Lists

Slack

Twitter


Deprecated

  • pywb Wayback Web Recorder (Archiver) (Sunsetted) - A bare-bones example of how to create a simple web recording and replay system.

  • Warrick (Unknown) - An open source downloadable tool or web service for reconstructing websites from web archives, using Memento.

Contributors

ablwr, anjackson, atomotic, helgeho, ianmilligan1, machawk1, n0tan3rd, patcon, ross-spencer, ruebot, steffenfritz
