
mediawiki_to_git_md's Introduction

This is a work-in-progress quick hack, written in Python.

This script migrates content in MediaWiki to Markdown, preserving the edit history as git commits, for display on GitHub pages using Jekyll:

https://help.github.com/articles/using-jekyll-with-pages/

The idea here is to first prepare a MediaWiki XML dump of the current wiki contents (including revisions of the current pages), and turn each revision into a separate git commit of the page converted into markdown using pandoc.
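As a rough illustration of the first step, the sketch below reads a tiny MediaWiki-style XML dump and lists every revision oldest first. It is not the code from convert.py: the sample XML, the helper names and the tuple layout are assumptions made for this example, and the namespace URI in a real dump depends on the MediaWiki version (which is why the tags are matched by local name only).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed dump used only for this example.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>SourceCode</title>
    <revision>
      <timestamp>2010-05-02T10:00:00Z</timestamp>
      <contributor><username>AnOther</username></contributor>
      <comment>Second edit</comment>
      <text>Use git.</text>
    </revision>
    <revision>
      <timestamp>2009-01-15T08:30:00Z</timestamp>
      <contributor><username>Peter</username></contributor>
      <text>Use CVS.</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child_text(element, name, default=""):
    """Text of the first direct child called `name`, ignoring namespaces."""
    for child in element:
        if local_name(child.tag) == name:
            return child.text or default
    return default


def collect_revisions(xml_text):
    """Return (timestamp, title, username, comment, text) tuples, oldest first."""
    revisions = []
    root = ET.fromstring(xml_text)
    for page in root:
        if local_name(page.tag) != "page":
            continue
        title = child_text(page, "title")
        for rev in page:
            if local_name(rev.tag) != "revision":
                continue
            username = ""
            for child in rev:
                if local_name(child.tag) == "contributor":
                    username = child_text(child, "username")
            revisions.append((
                child_text(rev, "timestamp"),
                title,
                username,
                child_text(rev, "comment"),
                child_text(rev, "text"),
            ))
    # Timestamps in this ISO 8601 UTC form sort chronologically as strings.
    revisions.sort(key=lambda rev: rev[0])
    return revisions


revisions = collect_revisions(SAMPLE_DUMP)
for timestamp, title, username, comment, _ in revisions:
    print(timestamp, title, username, comment)
```

Each tuple would then be converted with pandoc and committed in order. A real dump can be large, so a streaming parser may be a better fit than loading everything at once.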

This uses a crude Python script (calling pandoc and git). It assumes it is running in the base folder of a git repository on a suitable branch to which it will commit back-dated changes as markdown files.
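One possible way to build a back-dated commit is sketched below. It only assembles the git command lines and environment, and is not necessarily how convert.py does it. The function name and the example author are hypothetical. The --author and --date options and the GIT_COMMITTER_DATE variable are standard git features: --date sets the author date, while the environment variable sets the committer date.

```python
import os


def build_commit(filename, author, timestamp, message):
    """Return (add_cmd, commit_cmd, env) for one back-dated commit.

    `author` is a "Name <email>" string and `timestamp` an ISO 8601 date.
    """
    add_cmd = ["git", "add", "--", filename]
    commit_cmd = [
        "git", "commit",
        "--author=" + author,
        "--date=" + timestamp,
        "-m", message,
        "--", filename,
    ]
    # Copy the current environment and override only the committer date.
    env = dict(os.environ, GIT_COMMITTER_DATE=timestamp)
    return add_cmd, commit_cmd, env


add_cmd, commit_cmd, env = build_commit(
    "wiki/SourceCode.md",
    "A. N. Other <a.n.other@example.org>",  # hypothetical author
    "2009-01-15T08:30:00Z",
    "Initial import",
)
print(add_cmd)
print(commit_cmd)
```

The lists can be passed to subprocess.run(..., env=env, check=True) from the base folder of the repository.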

Pages which were deleted on the wiki (e.g. spam) are not wanted (and appear to be excluded from the export XML file).

The user should provide a manual mapping table of MediaWiki usernames (column one) to names and email address as used for their GitHub accounts (column two), e.g.:

AnOther (tab) A. N. Other <[email protected]>

Spam revisions (and non-rollback reverts) are also not wanted, and can be ignored via a blacklist file of usernames (one per line). This means not every wiki revision will become a git commit. See helper script extract_blocklist.py for pulling the names of blocked users from an HTML download of the wiki's Special:BlockList page.
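A minimal sketch of this filtering, assuming revisions are held as (username, title) pairs. The function names and sample data are made up for illustration and are not taken from the script.

```python
import io


def load_blocklist(handle):
    """Read one username per line, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}


def drop_blocked(revisions, blocked):
    """Keep only (username, title) revisions whose author is not blocked."""
    return [rev for rev in revisions if rev[0] not in blocked]


# io.StringIO stands in for an open blocklist file here.
blocklist_file = io.StringIO("Spammer1\nSpammer2\n\n")
blocked = load_blocklist(blocklist_file)

revisions = [
    ("Peter", "SourceCode"),
    ("Spammer1", "SourceCode"),
    ("AnOther", "Git"),
]
kept = drop_blocked(revisions, blocked)
print(kept)
```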

It is also worth double checking the first run's output for any reverts which could indicate additional accounts to block:

$ git log | grep "Reverted edits by "

Also, some revisions making minor changes to the wiki formatting may result in no changes to the converted markdown, and therefore ideally will not result in a git commit.
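One simple way to detect such no-op revisions is to remember the last markdown written for each file and skip a revision when the new output is identical. The sketch below assumes the converted revisions arrive in chronological order as (filename, markdown) pairs; the names are hypothetical.

```python
def changed_revisions(converted):
    """Yield (filename, markdown) pairs that differ from the previous
    markdown seen for the same filename."""
    last_seen = {}
    for filename, markdown in converted:
        if last_seen.get(filename) == markdown:
            continue  # Wiki markup changed, but the markdown did not.
        last_seen[filename] = markdown
        yield filename, markdown


converted = [
    ("Git.md", "a\n"),
    ("Git.md", "a\n"),         # formatting-only wiki edit
    ("SourceCode.md", "b\n"),
    ("Git.md", "c\n"),
]
to_commit = list(changed_revisions(converted))
print(to_commit)
```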

History

An early version of the script was used for the BioJava wiki http://biojava.org which is now hosted at https://github.com/biojava/biojava.github.io

A later version of the script (with support for slashes in wiki page names) was used for the Biopython wiki http://biopython.org which is now hosted at https://github.com/biopython/biopython.github.io

TODO

  • Start using formal version numbers
  • Add a proper command line API exposing options
  • Cope with unicode in the title / filename, e.g. BioPerl
  • Squash quick series of git commits from single author to a single page (with same or no comment)?
  • Skip git commits where there was no change in the markdown
  • Post-process pandoc output to fix wiki-links?

MediaWiki Export to XML

We use the dumpBackup.php script; the manual for this is online at https://www.mediawiki.org/wiki/Manual:dumpBackup.php

First, log into your MediaWiki instance and find the PHP file .../maintenance/dumpBackup.php and your .../LocalSettings.php file. Then try:

$ cd ~
$ php .../maintenance/dumpBackup.php --conf .../LocalSettings.php --full --include-files --uploads > mediawiki_dump.xml

Note the inclusion of --include-files --uploads to ensure the dump includes all the images etc.

Assuming you are running the conversion into Markdown locally, zip up the XML dump and scp it back to your machine.

MediaWiki Block List

You can save the HTML page of your wiki's Special:BlockList page and parse it with:

$ curl -o blocklist.html "http://example.org/w/index.php/Special:BlockList?wpTarget=&limit=500"

Then run the script from this repository to pull out the user names:

$ ../mediawiki_to_git_md/extract_blocklist.py blocklist.html
Parse saved HTML file of wiki/Special:BlockList into simple text file
Extracted 50 users from 'blocklist.html' into 'user_blocklist.txt'

Usernames mapping

You will need to fill this in. Try the conversion once to see which names to focus on collecting:

$ emacs usernames.txt

This is a simple two column tab separated table, mapping MediaWiki usernames (column one) to names and email address as used for their GitHub accounts (column two), e.g.:

AnOther (tab) A. N. Other <[email protected]>
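A minimal sketch of reading such a table and reporting unmapped users is shown below. The function names and the example.org addresses are placeholders for illustration and are not taken from convert.py.

```python
import io


def load_usernames(handle):
    """Map MediaWiki usernames to 'Name <email>' strings from a
    two column, tab separated table."""
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        wiki_name, _, git_author = line.partition("\t")
        mapping[wiki_name.strip()] = git_author.strip()
    return mapping


def missing_usernames(seen, mapping):
    """Sorted wiki usernames that appear in revisions but not in the table."""
    return sorted({name for name in seen if name not in mapping})


# io.StringIO stands in for an open usernames.txt file here.
usernames_file = io.StringIO(
    "AnOther\tA. N. Other <a.n.other@example.org>\n"
    "Peter\tPeter Example <peter@example.org>\n"
)
mapping = load_usernames(usernames_file)
missing = missing_usernames(["Peter", "Davidw", "AnOther", "Davidw"], mapping)
print(mapping["AnOther"])
print(missing)
```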

MediaWiki Conversion

Now run the conversion in your GitHub Pages repository, where git is already on the right branch and ready for new commits to be made:

$ ../mediawiki_to_git_md/convert.py mediawiki_dump.xml
============================================================
Parsing XML and saving revisions by page.
============================================================
Sorting changes by revision date...
...

If it works, it will print a summary of the missing usernames, which you should probably add to usernames.txt. Then reset your branches and retry the conversion, e.g.:

$ git checkout pre_auto_import && git branch -D master && git checkout -b master
Switched to branch 'pre_auto_import'
Deleted branch master (was a348cc5).
Switched to a new branch 'master'

Jekyll Setup

By default most converted pages are assigned the Jekyll layout wiki which assumes you have defined _layouts/wiki.html as a template. This can be changed, e.g. to None to use the default layout.

However, Category:XXX pages are instead mapped to layout tagpage, and given tag XXX. This assumes you have defined _layouts/tagpage.html which will add the automatic listing of all pages with the tag XXX. We use tags since Jekyll does not allow multiple categories per page like MediaWiki.

See Biopython's wiki template and tagpage template for examples. Note the latter includes automatically generated links to all the pages with that tag.

mediawiki_to_git_md's People

Contributors

andreasprlic, huacayacauh, peterjc, scheibel, vincentdavis

mediawiki_to_git_md's Issues

Handle case changes in article titles

The script appears to not correctly handle case changes in article titles. It aborts with a git error about unknown files due to git add and git commit referencing a filename with incorrect case.

Image scaling during MediaWiki to Markdown conversion

Reported by @vincentdavis on issue #1

Some images are scaled, e.g. in http://biopython.org/wiki/Phylo or its MediaWiki equivalent https://github.com/peterjc/peterjc.github.io/blob/master/wiki/Phylo.mediawiki

[[File:phylo-draw-apaf1.png|256px|thumb|right|Rooted phylogram, via Phylo.draw]]

which becomes after conversion to GFM markdown with pandoc https://github.com/peterjc/peterjc.github.io/blob/master/wiki/Phylo.md

![Rooted phylogram, via Phylo.draw](phylo-draw-apaf1.png "fig:Rooted phylogram, via Phylo.draw")

On the actual wiki the image URL is:
http://biopython.org/w/images/thumb/0/04/Phylo-draw-apaf1.png/256px-Phylo-draw-apaf1.png

If you want to see the full-size image:
http://biopython.org/w/images/0/04/Phylo-draw-apaf1.png

@vincentdavis found a document that suggests code like this can be used for scaling images.

[[ http://url.to/image.png | height = 100px ]]
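One option, sketched below, is to rewrite [[File:...]] links into HTML img tags before the text reaches pandoc, so the requested width survives. This is only an illustration of the idea and not what the script currently does. The regular expressions, function names and the set of layout keywords are assumptions, and captions that contain nested wiki links are not handled.

```python
import html
import re

# Matches [[File:name|option|option|...]]; nested links are not supported.
FILE_LINK = re.compile(r"\[\[(?:File|Image):([^|\]]+)((?:\|[^|\]]*)*)\]\]")
WIDTH = re.compile(r"^(\d+)px$")
# Options treated as layout hints rather than captions (not exhaustive).
LAYOUT_KEYWORDS = {"thumb", "frame", "left", "right", "center", "none"}


def file_link_to_html(match):
    """Turn one matched [[File:...]] link into an <img> tag."""
    filename = match.group(1).strip()
    options = [part.strip() for part in match.group(2).split("|")[1:]]
    width = None
    caption = ""
    for option in options:
        width_match = WIDTH.match(option)
        if width_match:
            width = width_match.group(1)
        elif option and option not in LAYOUT_KEYWORDS:
            caption = option
    attributes = ['src="%s"' % html.escape(filename)]
    if width:
        attributes.append('width="%s"' % width)
    attributes.append('alt="%s"' % html.escape(caption))
    return "<img %s>" % " ".join(attributes)


def convert_file_links(wikitext):
    """Replace every [[File:...]] link in the text."""
    return FILE_LINK.sub(file_link_to_html, wikitext)


wikitext = ("[[File:phylo-draw-apaf1.png|256px|thumb|right|"
            "Rooted phylogram, via Phylo.draw]]")
converted = convert_file_links(wikitext)
print(converted)
```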

MediaWiki templates

Can we deal with these? e.g. Render them prior to passing to pandoc? Replace with a Jekyll templating mechanism?

Avoiding colons in filenames (for working on Windows)

See biopython/biopython.github.io#21 filed by @MarkusPiotrowski

MediaWiki uses Category:XXX and User:XXX etc in its URLs, which means currently we end up with files on disk named Category:XXX.md and so on. This is fine on Mac/Linux (although you do sometimes need to escape the : as \: at the command line), but breaks on Windows.

Can we use a "safe" filename like Category_XXX.md and exploit some Jekyll setting in the header to specify the URL for the page with a colon in it?
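A sketch of that idea follows, using Jekyll's permalink front matter variable to keep the colon in the URL while the file on disk has a safe name. The function names and the /wiki/ base URL are assumptions for this example, and whether GitHub Pages serves such a permalink correctly has not been checked here.

```python
import json


def safe_filename(title):
    """Replace colons so the file can exist on Windows."""
    return title.replace(":", "_") + ".md"


def front_matter(title, base_url="/wiki/"):
    """Jekyll header keeping the original title and URL.

    json.dumps gives double quoted strings, which YAML also accepts,
    so the colon inside the values is not read as a key separator.
    """
    lines = [
        "---",
        "title: " + json.dumps(title),
        "permalink: " + json.dumps(base_url + title),
        "---",
        "",
    ]
    return "\n".join(lines)


filename = safe_filename("Category:Cookbook")
header = front_matter("Category:Cookbook")
print(filename)
print(header)
```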

On a related point, we already have to escape the colon as %3A for relative URLs to these pages, e.g. 9aa53af and 3682565 although long term pandoc might take care of that - jgm/pandoc#2849

MediaWiki redirects --> Jekyll's redirection

See https://help.github.com/articles/redirects-on-github-pages/

Consider a MediaWiki page like this from http://biopython.org/wiki/Git which redirects to http://biopython.org/wiki/SourceCode

#REDIRECT [[SourceCode]]

The pandoc GitHub Flavored Markdown output results in a placeholder page Git.md with a link to the new page.

What we seem to need to do instead is to add a redirect_from line to the Jekyll header in SourceCode.md, e.g.


---
... existing meta data like title ...
redirect_from: "/wiki/Git"
---

Update: Even easier, we can edit Git.md with the redirect_to tag instead. For instance,


---
title: Git
redirect_to: /wiki/SourceCode
---
This page should redirect you to [SourceCode](SourceCode "wikilink").
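The redirect_to approach could be automated roughly as follows: detect a #REDIRECT line in the wiki text and emit the matching Jekyll header. The regular expression and function name are assumptions for this sketch, and the redirect_to key relies on the jekyll-redirect-from plugin being enabled.

```python
import re

# Captures the target page of '#REDIRECT [[Target]]', ignoring any
# '|label' or '#section' part.
REDIRECT = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_front_matter(title, wikitext, base_url="/wiki/"):
    """Return a Jekyll header for a redirect page, or None otherwise."""
    match = REDIRECT.match(wikitext)
    if match is None:
        return None
    # MediaWiki URLs use underscores where titles have spaces.
    target = match.group(1).strip().replace(" ", "_")
    return "---\ntitle: %s\nredirect_to: %s%s\n---\n" % (title, base_url, target)


header = redirect_front_matter("Git", "#REDIRECT [[SourceCode]]")
print(header)
```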

Handle MediaWiki categories

e.g. [[Category:Cookbook]] (generally as the last line in a page) flags the page for inclusion in the special wiki page .../wiki/Category:Cookbook.

Can we map these to Jekyll tags in the markdown header?
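A possible approach is to pull the category links out of the wiki text before conversion and write them as tags in the header. The sketch below shows only the extraction step; the regular expression and function name are assumptions for illustration.

```python
import json
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]], plus any
# whitespace that follows. Links such as [[:Category:Name]] are ordinary
# links and are deliberately left alone.
CATEGORY = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]\s*",
                      re.IGNORECASE)


def split_categories(wikitext):
    """Return (tags, body) with the category links removed from the body."""
    tags = [name.strip() for name in CATEGORY.findall(wikitext)]
    body = CATEGORY.sub("", wikitext).rstrip() + "\n"
    return tags, body


tags, body = split_categories("Some recipe.\n\n[[Category:Cookbook]]\n")
# A JSON list is also a valid YAML flow sequence for the Jekyll header.
print("tags: " + json.dumps(tags))
print(body)
```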

Exception in subprocess when pandoc not installed

Obviously pandoc is necessary, but when it isn't present the error doesn't indicate what went wrong.
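A lightweight alternative to adding a dependency is to check for the executables up front with the standard library and fail with a clear message. The exception class and function name below are hypothetical.

```python
import shutil


class MissingToolError(RuntimeError):
    """Raised when a required command line tool cannot be found."""


def require_tool(name):
    """Return the full path to `name`, or raise a readable error."""
    path = shutil.which(name)
    if path is None:
        raise MissingToolError(
            "Could not find '%s' on the PATH. "
            "Please install it and try again." % name
        )
    return path


for tool in ("pandoc", "git"):
    try:
        print(tool, "found at", require_tool(tool))
    except MissingToolError as error:
        print(error)
```

Calling require_tool("pandoc") once at startup would turn the opaque subprocess exception into a single explanatory line.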

Somewhat related, I've raised a bug in sarge as they are providing a nicer subprocess manager; maybe they can solve this.

Another way to approach this problem is to depend on another package which ensures that pandoc has been installed.

Ubuntu only provides one Python wrapper, pandocfilters, for wily, and also pypandoc since yakkety.

pandocfilters doesn't even check that pandoc is installed (https://github.com/jgm/pandocfilters/blob/master/setup.py) - and it appears to have some compatibility issues with different versions of pandoc (https://github.com/jgm/pandocfilters#compatibility)

The two Python wrappers that are in Fedora are pandocfilters and pypandoc.

On Fedora, the pypandoc package had a huge extra dependency tree, mostly of texlive addons, i.e. in addition to the main pandoc package (which has lots of texlive dependencies).

$ sudo dnf install python2-pypandoc python3-pypandoc
....
...
Installed:
  python2-pypandoc.noarch 1.1.3-1.fc24                                  python3-pypandoc.noarch 1.1.3-1.fc24                                                           
  texlive-avantgar.noarch 5:svn31835.0-24.fc24.1                        texlive-bookman.noarch 5:svn31835.0-24.fc24.1                                                  
  texlive-charter.noarch 5:svn15878.0-24.fc24.1                         texlive-cm-super.noarch 5:svn15878.0-24.fc24.1                                                 
  texlive-cmextra.noarch 5:svn32831.0-24.fc24.1                         texlive-collection-fontsrecommended.noarch 5:svn35830.0-24.20150728_r37987.fc24.1              
  texlive-courier.noarch 5:svn35058.0-24.fc24.1                         texlive-euro.noarch 5:svn22191.1.1-24.fc24.1                                                   
  texlive-eurosym.noarch 5:svn17265.1.4_subrfix-24.fc24.1               texlive-fpl.noarch 5:svn15878.1.002-24.fc24.1                                                  
  texlive-helvetic.noarch 5:svn31835.0-24.fc24.1                        texlive-lm-math.noarch 5:svn36915.1.959-24.fc24.1                                              
  texlive-manfnt-font.noarch 5:svn35799.0-24.fc24.1                     texlive-mathpazo.noarch 5:svn15878.1.003-24.fc24.1                                             
  texlive-mflogo-font.noarch 5:svn36898.1.002-24.fc24.1                 texlive-ncntrsbk.noarch 5:svn31835.0-24.fc24.1                                                 
  texlive-palatino.noarch 5:svn31835.0-24.fc24.1                        texlive-pxfonts.noarch 5:svn15878.0-24.fc24.1                                                  
  texlive-rsfs.noarch 5:svn15878.0-24.fc24.1                            texlive-scheme-basic.noarch 5:svn25923.0-24.20150728_r37987.fc24.1                             
  texlive-symbol.noarch 5:svn31835.0-24.fc24.1                          texlive-tex-gyre.noarch 5:svn18651.2.004-24.fc24.1                                             
  texlive-tex-gyre-math.noarch 5:svn36916.0-24.fc24.1                   texlive-times.noarch 5:svn35058.0-24.fc24.1                                                    
  texlive-txfonts.noarch 5:svn15878.0-24.fc24.1                         texlive-utopia.noarch 5:svn15878.0-24.fc24.1                                                   
  texlive-wasy.noarch 5:svn35831.0-24.fc24.1                            texlive-wasy2-ps.noarch 5:svn35830.0-24.fc24.1                                                 
  texlive-wasysym.noarch 5:svn15878.2.0-24.fc24.1                       texlive-zapfchan.noarch 5:svn31835.0-24.fc24.1          

For other OS (especially Windows), or if someone wants to install pypandoc manually (or via pip), it will also install pandoc!

There are other packages which might be a simpler wrapper with verification that the pandoc binary is installed.

Quotes within article titles cause problems with the git command

The process gets stuck or is killed by a git error when the converter has to deal with articles which contain quotation marks in their titles. The converter creates the files as expected, but the git command exits with an error.

I worked around this issue by:

  1. create a copy of the generated .md and .mediawiki files
  2. issue git add wiki/ && git commit -m "fix quotation"
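If the failure comes from building the git command as a single shell string, passing the arguments as a list avoids shell quoting altogether. The sketch below demonstrates this with a child Python process instead of git, so it can run anywhere; the title is a made-up example and the helper name is hypothetical.

```python
import subprocess
import sys


def echo_argument(value):
    """Run a child process with `value` as one argument and return
    exactly what the child received."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.argv[1])", value],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Quotes, spaces and an ampersand would all need escaping in a shell string.
title = 'He said "hello" & it\'s fine.md'
received = echo_argument(title)
print(received == title)
```

The same pattern applies to git, e.g. subprocess.run(["git", "add", "--", filename], check=True), where filename may contain quotes.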

Fix inter-wiki links

e.g. [bp:SeqIO] used on Biopython wiki to link to SeqIO page on the sister BioPerl wiki, via a MediaWiki plugin SpecialInterwiki.php

Handling MediaWiki revisions which pandoc cannot parse

I've seen this with malformed tables (where the user generally fixed the page in a subsequent revision).

Currently these problem revisions are simply ignored (which is fine if the user fixed the issue themselves in the next revision, as we end up with a combined commit).

Errors when converting a dump with a localization other than English

With German instances of MediaWiki, some namespace names that the script hard-codes in English appear in the dump in their localized form, which causes errors and eventually breaks the process.

E.g. 'Kategorie' instead of 'Category'.

I worked around this issue by search-replacing these throughout the dump. Not beautiful, but it works.
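A slightly safer variant of that workaround is to rewrite the localized names only where they open a wiki link, so ordinary prose containing the same word is untouched. The alias table below lists a few German namespace names as examples and is not complete; the function name is hypothetical.

```python
import re

# Example German namespace names mapped to the English ones (not exhaustive).
NAMESPACE_ALIASES = {
    "Kategorie": "Category",
    "Datei": "File",
    "Benutzer": "User",
}


def normalise_namespaces(wikitext, aliases=NAMESPACE_ALIASES):
    """Replace localized namespace prefixes at the start of [[...]] links."""
    names = "|".join(re.escape(name) for name in aliases)
    # The optional ':' keeps links like [[:Kategorie:X]] working too.
    pattern = re.compile(r"\[\[(:?)(%s):" % names)

    def replace(match):
        return "[[%s%s:" % (match.group(1), aliases[match.group(2)])

    return pattern.sub(replace, wikitext)


text = "[[Kategorie:Kochbuch]] und Kategorie im Text"
normalised = normalise_namespaces(text)
print(normalised)
```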

Fix wiki links

May have to add .html to the links/filenames (or .md for use on GitHub rendering), but ideally would preserve the extension-less URLs to match the source MediaWiki website.
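Pandoc marks converted wiki links with the title "wikilink", which makes them easy to find in the markdown output. The sketch below post-processes those links, optionally appending a suffix and percent-encoding characters such as the colon. The function name and the suffix parameter are assumptions for illustration.

```python
import re
from urllib.parse import quote

# Matches the '](target "wikilink")' part of a pandoc wiki link.
WIKILINK = re.compile(r'\]\(([^)\s]+) "wikilink"\)')


def fix_wikilinks(markdown, suffix=""):
    """Escape wiki link targets and optionally add a file suffix."""
    def replace(match):
        page, hash_mark, anchor = match.group(1).partition("#")
        if page:
            # Keep '/' and existing '%' escapes, encode the rest (e.g. ':').
            page = quote(page, safe="/%") + suffix
        return '](%s%s%s "wikilink")' % (page, hash_mark, anchor)

    return WIKILINK.sub(replace, markdown)


plain = fix_wikilinks('See [SourceCode](SourceCode "wikilink").', ".html")
category = fix_wikilinks('[Cookbook](Category:Cookbook "wikilink")')
print(plain)
print(category)
```

With the default empty suffix the extension-less URLs are preserved and only the escaping is applied.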

Deal with MediaWiki attachments nicely, e.g. images

A reprise of #1, see 8ed3dbd

When dumpBackup.php is run with --include-files --uploads then you get the images and other attachments in the XML, base64 encoded - including their revisions. This opens the door to a self contained solution to grabbing the images (although the workaround in #1 could be useful if the dump was done without this, and the wiki is still live).

However, the XML parser needs to consider <revision> vs <upload> tags, and the script may need refactoring to handle the file contents vs a page's text contents.

Another python tag quirk when given id

Example of broken MediaWiki from my naive pre-fixer,

<python id="recipe">
from Bio import SeqIO
...
</source>

Taken from Biopython page wiki/User:Davidw/cookbook.mediawiki

Convert usernames in commit comments

Sometimes the MediaWiki comments contain usernames, which we can map to @ style GitHub usernames, e.g.

Reverted edits by [[Special:Contributions/xxx|xxx]] ([[User talk:xxx|talk]]) to last revision by [[User:Peter|Peter]]

-->

Reverted edits by @xxx to last revision by @peterjc
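A sketch of this rewrite using regular expressions is given below. The mapping from wiki usernames to GitHub handles is a hypothetical table that would need to be supplied separately; names without an entry keep their wiki username.

```python
import re

CONTRIBUTIONS = re.compile(r"\[\[Special:Contributions/([^|\]]+)\|[^\]]*\]\]")
USER_TALK = re.compile(r"\s*\(\[\[User talk:[^\]]*\]\]\)")
USER = re.compile(r"\[\[User:([^|\]]+)(?:\|[^\]]*)?\]\]")


def shorten_comment(comment, github_handles):
    """Replace wiki user links in a revision comment with @handles."""
    def mention(match):
        wiki_name = match.group(1).strip()
        return "@" + github_handles.get(wiki_name, wiki_name)

    comment = USER_TALK.sub("", comment)  # drop the '(talk)' link
    comment = CONTRIBUTIONS.sub(mention, comment)
    comment = USER.sub(mention, comment)
    return comment


github_handles = {"Peter": "peterjc"}  # hypothetical mapping table
original = ("Reverted edits by [[Special:Contributions/xxx|xxx]] "
            "([[User talk:xxx|talk]]) to last revision by [[User:Peter|Peter]]")
shortened = shorten_comment(original, github_handles)
print(shortened)
```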

As a bonus this is much shorter so the logs look nicer.
