archer-rs: The Article Archiver

NOTE: Project unfinished and abandoned!

The internet doesn’t forget, they say. Except when it does. More than once I went to an old bookmark only to be greeted by a 404 page. The goal of this project was to create local backup copies of these cool/insightful/interesting articles (I originally had a Python script that parsed a text file and executed a bunch of wget calls, but it didn’t work reliably).

Why is this project abandoned? Annoyed by the bugs in my Python script, I set out to build a “proper” solution in Rust. Halfway through the project, I happened to discover a simpler alternative: OneNote. The OneNote Clipper integrates with your web browser and can extract the contents of an article into OneNote. It works reasonably well and saves me roughly one to two months of work on this project.

Still, the source code might be useful to folks out there, which is why it is published on GitHub along with these notes. Maybe someone can learn from it or wants to keep hacking on it.

The Design

The project is split into two components:

  1. The scraper.
  2. The viewer.

The Scraper

The basic data structure is a directed acyclic graph (DAG) of resources (HTML, CSS, JS, images) and their dependencies (see tasks/mod.rs). The main loop traverses the DAG in order to find open tasks (a sketch of the resource/task structure follows the list below). Tasks may be:

  • Download a resource (download/).
  • Parse a resource to discover dependencies (parser/).
  • Store a resource in a SQLite database (storage/).
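
To make this concrete, here is a minimal sketch of how such a DAG might be modeled. All names (Resource, ResourceKind, State) are invented for illustration; the actual definitions live in tasks/mod.rs and may well differ:

    /// The kinds of resources the scraper tracks.
    enum ResourceKind {
        Html,
        Css,
        Js,
        Image,
    }

    /// The lifecycle of a resource, mirroring the three task types above.
    enum State {
        NeedsDownload,          // open download task
        NeedsParsing(Vec<u8>),  // open parse task (holds the raw bytes)
        NeedsStoring(Vec<u8>),  // open store task
        Done,
    }

    /// One node in the dependency DAG.
    struct Resource {
        url: String,
        kind: ResourceKind,
        /// Indices of the resources this one depends on (outgoing edges).
        dependencies: Vec<usize>,
        state: State,
    }

    /// The main loop repeatedly scans the graph for the next open task.
    fn next_open_task(resources: &[Resource]) -> Option<usize> {
        resources.iter().position(|r| !matches!(r.state, State::Done))
    }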

Download Pool

For performance reasons, downloads run in parallel. A download pool implementation (see download/pool.rs) receives URLs from the main thread, fetches them, and feeds the resulting HTTP responses back to the main loop for parsing.
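
A hedged sketch of this pattern using standard-library threads and channels; the actual pool in download/pool.rs may be structured differently, and fetch below is a stub standing in for the real HTTP call:

    use std::sync::mpsc::{channel, Receiver, Sender};
    use std::sync::{Arc, Mutex};
    use std::thread;

    /// A fetched response handed back to the main loop for parsing.
    struct Fetched {
        url: String,
        body: Vec<u8>,
    }

    fn spawn_pool(workers: usize) -> (Sender<String>, Receiver<Fetched>) {
        let (url_tx, url_rx) = channel::<String>();
        let (res_tx, res_rx) = channel::<Fetched>();
        // mpsc receivers are single-consumer, so the workers share
        // the URL queue behind a mutex.
        let url_rx = Arc::new(Mutex::new(url_rx));

        for _ in 0..workers {
            let url_rx = Arc::clone(&url_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // Take the next URL; stop once the main thread
                // drops its sender.
                let url = match url_rx.lock().unwrap().recv() {
                    Ok(url) => url,
                    Err(_) => break,
                };
                let body = fetch(&url);
                let _ = res_tx.send(Fetched { url, body });
            });
        }
        (url_tx, res_rx)
    }

    fn fetch(_url: &str) -> Vec<u8> {
        Vec::new() // stub: a real worker would issue an HTTP GET here
    }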

Resource Parsing

Parsing serves two purposes: it discovers new resources the website depends on (think JS/CSS files, images, etc.) and rewrites their URLs so they point to the viewer.
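
For illustration, dependency discovery could look like the following. The scraper crate used here is an assumption made for this sketch, not a known dependency of the project; the actual logic lives in parser/:

    use scraper::{Html, Selector};

    /// Collect the URLs of sub-resources (scripts, stylesheets, images)
    /// that a downloaded HTML document depends on.
    fn discover_dependencies(html: &str) -> Vec<String> {
        let document = Html::parse_document(html);
        let selector =
            Selector::parse("script[src], link[href], img[src]").unwrap();

        document
            .select(&selector)
            .filter_map(|el| {
                // Scripts and images use src, stylesheets use href.
                el.value().attr("src").or_else(|| el.value().attr("href"))
            })
            .map(String::from)
            .collect()
    }

The rewriting step would then replace each discovered URL with a viewer route (for example /resource/<id>) before the document is stored.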

Storage

A SQLite file serves as a storage mechanism. The scraper puts the resources there and the viewer uses it to – well – let you view the websites.
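
A hedged sketch of what that layer could look like using the rusqlite crate; both the crate choice and the schema are assumptions, as the real code lives in storage/:

    use rusqlite::{params, Connection, Result};

    /// Open (or create) the archive database.
    fn open_store(path: &str) -> Result<Connection> {
        let conn = Connection::open(path)?;
        // One row per downloaded resource; the viewer reads this table.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS resources (
                 url  TEXT PRIMARY KEY,
                 mime TEXT NOT NULL,
                 body BLOB NOT NULL
             )",
            [],
        )?;
        Ok(conn)
    }

    /// Insert a resource, replacing any earlier download of the same URL.
    fn store_resource(
        conn: &Connection,
        url: &str,
        mime: &str,
        body: &[u8],
    ) -> Result<()> {
        conn.execute(
            "INSERT OR REPLACE INTO resources (url, mime, body)
             VALUES (?1, ?2, ?3)",
            params![url, mime, body],
        )?;
        Ok(())
    }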

Run It

To run the scraper, execute $ cargo run in your console.

The Viewer

The viewer has two parts: a server and a JS-based web client. The server is written in Rust and based on Iron, while the web client is written in ES6 JavaScript and uses React and Semantic UI under the hood.
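
At its core, an Iron server is just a handler plus a listen address. This minimal skeleton is purely illustrative; the handler and its response are placeholders, not the project's actual API:

    extern crate iron;

    use iron::prelude::*;
    use iron::status;

    fn main() {
        // Answer every request with a stub response, listening on the
        // port the development client expects the backend on (see below).
        Iron::new(|_: &mut Request| {
            Ok(Response::with((status::Ok, "archived sites as JSON")))
        })
        .http("localhost:3000")
        .unwrap();
    }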

There are two ways to serve the web client:

  • In development, you can run npm start to compile the client and serve it on http://localhost:8080/. The backend is expected to be available at http://localhost:3000; to run it, execute $ cargo run -- --server.
  • In release mode, the client is bundled into the executable using a build.rs script that inserts the compiled files into the Rust source (sketched below). Don’t forget to run $ npm run build first to compile the client, then compile and run the server using $ cargo run --release -- --server.
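
One common shape for such a build.rs is sketched below. The client/dist directory and the generated ASSETS table are assumptions about the layout, not the project's actual script:

    // build.rs
    use std::env;
    use std::fs;
    use std::path::Path;

    fn main() {
        let out_dir = env::var("OUT_DIR").unwrap();
        let dest = Path::new(&out_dir).join("assets.rs");

        // Generate a static table mapping file names to embedded bytes.
        let mut code = String::from("static ASSETS: &[(&str, &[u8])] = &[\n");
        for entry in fs::read_dir("client/dist").unwrap() {
            let path = entry.unwrap().path();
            let name = path.file_name().unwrap().to_string_lossy();
            code.push_str(&format!(
                "    (\"{}\", include_bytes!(\"{}\")),\n",
                name,
                path.canonicalize().unwrap().display()
            ));
        }
        code.push_str("];\n");
        fs::write(&dest, code).unwrap();
    }

The server can then pull the table in with include!(concat!(env!("OUT_DIR"), "/assets.rs")) and serve the embedded files straight from memory.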

What’s left to do

The scraper is functional so far, but still very unpolished. Open tasks:

  • Handle redirects properly. This is currently completely untested.
  • Use GZIP when requesting content. Save some bandwidth.
  • Send a User-Agent header. That’s what good scrapers do.
  • Add support for robots.txt files. Don’t be a dick.
  • Add a comprehensive test suite. Because reasons.

The viewer is a work in progress. There’s a basic overview page that displays the downloaded websites, as well as an unfinished Add Website to Queue dialog. Note that currently no websites are shown, as the corresponding table isn’t yet filled by the scraper. What’s left to do here:

  • Finish the overview page and Add Website dialog.
  • Add a Queue overview page.
  • Add a page to add/modify/remove tags.

If you have questions about the project, feel free to ping me on Twitter (@siem3m) or open a GitHub issue.
