xmldump2zim

Converts MediaWiki XML dumps to the OpenZim format

This is very much a work in progress. For now, conversion is somewhat possible using the MWPhpParser, but without images. Much is hardcoded. There is no autoconfig, just a Makefile (in which the dependencies have to be configured).

Compile

The XML parsing and ZIM generation depend on

Parser

There are two parser attempts; at the moment only the MWPhpParser is used.

MWPhpParser

The MWPhpParser uses the MediaWiki PHP code to do the HTML conversion. To use it you need:

  • PHP built with the embed SAPI (--enable-embed); I'm using php-5.5.1
  • MediaWiki; I'm using mediawiki-1.21.1
  • A MediaWiki CSS file; just generate it with the Makefile target all.css
  • other extensions as needed (Math (including texvc), Scribunto, ...)

Adjust MWPhpParser.php to your needs (especially add the extensions you need).

Hints:

  • Some articles need quite a lot of stack; ulimit -s 16384 helps.
  • This parser is not multithreaded and is quite slow; the first 400000 articles of the German wiki take about 23h on my i5-2520, so the whole wiki will take about 1 week.
  • If you link against a custom-built PHP (as prepared in the Makefile), you need to set LD_LIBRARY_PATH accordingly.
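
The two environment hints above can be combined in a small wrapper. This is only a sketch under assumptions: the function name run_with_php_env and the default PHP_LIB_DIR path are made up for illustration, so point PHP_LIB_DIR at wherever your custom PHP build actually put its shared library.

```shell
#!/bin/sh
# Hypothetical location of the custom PHP build's library directory;
# override it from the environment to match your own Makefile setup.
PHP_LIB_DIR="${PHP_LIB_DIR:-$PWD/php/lib}"

# Run a command with a larger stack and the custom PHP library on the
# dynamic loader path. The subshell keeps both settings from leaking
# into the calling shell.
run_with_php_env() {
    (
        # 16384 KiB of stack, as suggested in the hints; only warn if
        # the limit cannot be changed.
        ulimit -s 16384 2>/dev/null ||
            echo "warning: could not set the stack limit" >&2
        # Prepend the PHP lib dir, keeping any existing loader path.
        LD_LIBRARY_PATH="$PHP_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        export LD_LIBRARY_PATH
        "$@"
    )
}
```

With this in place, a conversion run would look like: run_with_php_env ./xmldump2zim dump.xml output.zim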

KiwiParser

The KiwiParser uses the Kiwi parser to do the HTML conversion, see https://github.com/aboutus/kiwi.

This is not really working at the moment; it hangs for some pages.

Usage

./xmldump2zim dump.xml output.zim [maximal number of articles]

Advanced

./xmldump2zim dump.xml index: creates an index file for the given dump.xml, containing the beginning and length of each page tag.

The index file created by xmldump2zim is then used for two things:

  • If present, it is loaded at startup and used for template lookup. You can speed this up by keeping only the necessary lines: grep 'Vorlage\|Modul' index > index-small; mv index-small index (for the German wiki, where Vorlage is the name of the Template namespace).
  • You can use the lines to create a file containing only a selection of pages, which can be supplied instead of the [maximal number of articles] argument; e.g. grep Bezirk index > bezirk-pages and then ./xmldump2zim dump.xml output.zim bezirk-pages
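
To illustrate what "beginning and length of each page tag" means, here is a rough sketch that derives similar information with standard tools. It is not the tool's own implementation, and the output layout (title, byte offset, byte length, tab-separated) is an assumption made for this example; the real index file may be laid out differently. It relies on GNU grep, where -b combined with -o prints the byte offset of each match, and it assumes that page and title tags appear without attributes.

```shell
#!/bin/sh
# Print one line per <page> element: title, byte offset of "<page>",
# and byte length up to and including "</page>" (tab-separated).
# The function name and output format are hypothetical.
build_page_index() {
    grep -b -o -E '</?page>|<title>[^<]*</title>' "$1" | awk '
    {
        # grep prints "offset:match"; split at the first colon only,
        # because titles such as "Vorlage:Foo" contain colons too.
        sep = index($0, ":")
        off = substr($0, 1, sep - 1)
        tag = substr($0, sep + 1)
        if (tag == "<page>") {
            start = off
        } else if (tag == "</page>") {
            # "</page>" is 7 bytes long.
            printf "%s\t%d\t%d\n", title, start, off + 7 - start
        } else {
            # Strip "<title>" (7 chars) and "</title>" (8 chars).
            title = substr(tag, 8, length(tag) - 15)
        }
    }'
}
```

Calling build_page_index dump.xml prints the resulting lines to standard output, which makes it easy to check a few offsets by hand against the dump.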
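
The filtering step described above overwrites the index in place. A slightly more defensive variant is sketched below; the function name filter_index and the .full backup suffix are choices made for this example only. It keeps a copy of the full index and refuses to replace the index when nothing matches, so a mistyped pattern does not leave you with an empty file.

```shell
#!/bin/sh
# Keep only the index lines matching an extended regular expression.
# The original index is preserved as "<index>.full".
filter_index() {
    pattern=$1
    index_file=$2
    tmp_file=$(mktemp) || return 1
    if grep -E -- "$pattern" "$index_file" > "$tmp_file"; then
        # At least one line matched: back up, then swap in the result.
        cp -- "$index_file" "$index_file.full" &&
            mv -- "$tmp_file" "$index_file"
    else
        # No match (or grep failed): leave the index untouched.
        rm -f -- "$tmp_file"
        echo "no lines matched '$pattern'; $index_file unchanged" >&2
        return 1
    fi
}
```

For the German wiki this would be called as filter_index 'Vorlage|Modul' index before starting the conversion.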
