ZIM Farm

License: GPL v3

The ZIM farm (zimfarm) is a half-decentralised software solution to build ZIM files efficiently. This means scraping Web content, packaging it into a ZIM file and uploading the result to an online repository of ZIM files.

How does it work?

The Zimfarm platform is a combination of different tools:

dispatcher

The dispatcher is a central database and API that records recipes (the metadata of the ZIM files to produce) and tasks. It includes a scheduler that decides when a ZIM file should be recreated (based on its recipe) and a dispatcher that creates tasks and assigns them to workers.
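
As a rough illustration of the scheduling decision described above, here is a minimal sketch in Python. It is not the Zimfarm's actual code: the Recipe fields, the periodicity names and the intervals are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical periodicity names and intervals, for illustration only.
PERIODS = {
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=90),
}


@dataclass
class Recipe:
    name: str
    periodicity: str               # a key of PERIODS, or anything else for "never"
    last_run: Optional[datetime]   # None if the ZIM file was never built


def is_due(recipe: Recipe, now: datetime) -> bool:
    """Return True when the recipe's ZIM file should be recreated."""
    interval = PERIODS.get(recipe.periodicity)
    if interval is None:
        # Unknown periodicity (e.g. "manually"): never scheduled automatically.
        return False
    if recipe.last_run is None:
        # Never built yet: schedule it right away.
        return True
    return now - recipe.last_run >= interval
```

A scheduler loop built on this would call is_due() for each recipe and create a task for every recipe that returns True.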

frontend

The frontend, available at farm.openzim.org, is a simple consumer of the API.

It is used to create, clone and edit recipes, but also to monitor the evolution of tasks and workers.

Anybody can use it in read-only mode.

workers

Workers are always-running computers that get assigned ZIM creation tasks by the dispatcher. If you are interested in providing us with worker resources, please read these instructions.

A worker is made of two software components:

worker-manager

The manager is responsible for declaring its available resources and configuration, and for receiving the tasks assigned to it by the dispatcher. It is a very low-resource container whose job is to spawn task-worker containers.

task-worker

The task-worker is responsible for running a specific task. It is also a very low-resource container but, unlike the manager, one is spawned for each task assigned to the worker (the manager defines the concurrency based on resources).

The task-worker's role is to start and monitor the scraper's container for the task and to spawn uploader containers for both created ZIM files and logs.
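
The resource-based concurrency mentioned above can be pictured with the following sketch. It assumes, hypothetically, that a worker's resources and a task's requirements are each expressed as CPU, memory and disk figures; the names and fields are illustrative and not taken from the Zimfarm code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Resources:
    cpu: int      # number of CPU cores
    memory: int   # arbitrary unit, e.g. GiB
    disk: int     # arbitrary unit, e.g. GiB


def remaining(total: Resources, running: Iterable[Resources]) -> Resources:
    """Subtract what the currently running tasks use from the worker's total."""
    running = list(running)
    return Resources(
        cpu=total.cpu - sum(r.cpu for r in running),
        memory=total.memory - sum(r.memory for r in running),
        disk=total.disk - sum(r.disk for r in running),
    )


def can_run(available: Resources, required: Resources) -> bool:
    """A new task fits only if every requirement fits in what is left."""
    return (
        required.cpu <= available.cpu
        and required.memory <= available.memory
        and required.disk <= available.disk
    )
```

With this model, the number of concurrent task-workers is not a fixed setting: it follows from how many tasks fit within the declared resources at a given time.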

uploader

The uploader is instantiated by the task-worker to upload, individually, each created ZIM file, as well as the scraper's container log.

The uploader supports both SCP and SFTP. We are currently using SFTP for all uploads due to a slight speed gain.

The uploader is fast and convenient (it can watch files and resume uploads) but works only from files at the moment.

receiver

The receiver is a jailed OpenSSH server that receives scraper logs and ZIM files. It passes the ZIM files through a quarantine that runs the zimcheck tool, which either puts them aside (invalid ZIM) or moves them to the public download server.
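
The quarantine step can be sketched as follows. This is only an illustration under stated assumptions: the directory layout and function names are hypothetical, and treating a zero exit code from zimcheck as "valid" is an assumption about the tool rather than something stated here. The check is passed in as a parameter so the routing logic can be exercised without zimcheck installed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def zimcheck_passes(zim_path: Path) -> bool:
    """Run zimcheck on the file; exit code 0 is treated as valid (assumption)."""
    result = subprocess.run(["zimcheck", str(zim_path)], capture_output=True)
    return result.returncode == 0


def process_upload(
    zim_path: Path,
    public_dir: Path,
    rejected_dir: Path,
    check: Callable[[Path], bool] = zimcheck_passes,
) -> Path:
    """Move a quarantined ZIM file to the public or the rejected folder."""
    target_dir = public_dir if check(zim_path) else rejected_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / zim_path.name
    shutil.move(str(zim_path), str(destination))
    return destination
```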

scrapers

Scrapers are the tools used to actually convert a scraping request (recorded in a Zimfarm recipe) into one or several ZIM files.

The most important one is the MediaWiki scraper, called mwoffliner, but there are many others, for Stack Exchange, Project Gutenberg, PhET and more.

Scrapers are not part of the Zimfarm. Those are completely independent projects for which the requirements to integrate into the Zimfarm are minimal:

  • Should work entirely from a Docker image
  • Arguments should be set on the command line
  • ZIM output folder should be settable via an argument
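
To show why these three requirements are enough, here is a sketch of how a scraper run could be assembled from a recipe. The image name, the scraper command, the flag names (including --output) and the /output mount point are hypothetical; only the docker run, --rm and -v options are real Docker CLI features.

```python
from typing import Dict, List


def build_docker_command(
    image: str,
    scraper_command: str,
    flags: Dict[str, str],
    host_output_dir: str,
    output_flag: str = "--output",          # hypothetical flag name
    container_output_dir: str = "/output",  # hypothetical mount point
) -> List[str]:
    """Build a `docker run` argument list for a scraper that meets the requirements."""
    command = [
        "docker", "run", "--rm",
        # Mount a host folder so the created ZIM files survive the container.
        "-v", f"{host_output_dir}:{container_output_dir}",
        image,
        scraper_command,
    ]
    # Every scraper setting is passed on the command line.
    for name, value in flags.items():
        command.append(f"--{name}={value}")
    # The ZIM output folder is itself just another argument.
    command.append(f"{output_flag}={container_output_dir}")
    return command
```

Because the whole run reduces to an image and a list of arguments, the same mechanism can start any scraper without scraper-specific code.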

How do I request a ZIM file?

ZIM file requests are handled in the zim-requests repository.

If there's already a scraper for the website you want to convert to ZIM, someone with editor access to the Zimfarm will create the recipe and, in a few days, a ZIM file should be available.
