peterjc / mediawiki_to_git_md

Convert a MediaWiki export XML file into MarkDown as a series of git commits

License: MIT License

Python 100.00%

mediawiki_to_git_md's Issues

MediaWiki redirects --> Jekyll's redirection

See https://help.github.com/articles/redirects-on-github-pages/

Consider a MediaWiki page like this from http://biopython.org/wiki/Git which redirects to http://biopython.org/wiki/SourceCode

#REDIRECT [[SourceCode]]

The pandoc GitHub Flavored Markdown output results in a placeholder page Git.md with a link to the new page.

What we seem to need to do instead is to add a redirect_from line to the Jekyll header in SourceCode.md, e.g.


---
... existing meta data like title ...
redirect_from: "/wiki/Git"

---

Update: Even easier, we can edit Git.md with the redirect_to tag instead. For instance,


---
title: Git
redirect_to: /wiki/SourceCode

---
This page should redirect you to [SourceCode](SourceCode "wikilink").
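
A minimal sketch of how the redirect_to approach above could be automated. The function name, the regular expression, and the /wiki/ URL prefix are assumptions for illustration and are not taken from the script itself.

```python
import re

# Matches a MediaWiki redirect such as "#REDIRECT [[SourceCode]]".
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_front_matter(title, mediawiki_text, url_prefix="/wiki/"):
    """Return Jekyll front matter for a redirect page, or None otherwise."""
    match = REDIRECT_RE.match(mediawiki_text)
    if not match:
        return None
    # MediaWiki URLs use underscores where page titles have spaces.
    target = match.group(1).strip().replace(" ", "_")
    return f"---\ntitle: {title}\nredirect_to: {url_prefix}{target}\n---\n"
```

For the Git page above, redirect_front_matter("Git", "#REDIRECT [[SourceCode]]") gives a header with redirect_to: /wiki/SourceCode. Titles containing colons or quotes would need YAML quoting, which this sketch does not attempt.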

Exception in subprocess when pandoc not installed

Obviously pandoc is necessary, but when it isn't present the error doesn't indicate what went wrong.
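
One possible way to fail early with a readable message, sketched with the standard library only. The function name find_pandoc and the wording of the message are hypothetical.

```python
import shutil
import sys


def find_pandoc(executable="pandoc"):
    """Return the full path to pandoc, or exit with a readable message."""
    path = shutil.which(executable)
    if path is None:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"ERROR: Could not find '{executable}' on the PATH; please install pandoc.")
    return path
```

Calling this once at start-up, before any subprocess is launched, would avoid the unhelpful exception.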

Somewhat related, I've raised a bug in sarge as they are providing a nicer subprocess manager; maybe they can solve this.

Another way to approach this problem is to depend on another package which ensures that pandoc has been installed.

Ubuntu provides only one Python wrapper, pandocfilters, for wily, and also pypandoc since yakkety.

pandocfilters doesn't even check that pandoc is installed (https://github.com/jgm/pandocfilters/blob/master/setup.py) - and it appears to have some compatibility issues with different versions of pandoc (https://github.com/jgm/pandocfilters#compatibility).

The two Python wrappers that are in Fedora are pandocfilters and pypandoc.

On Fedora, the pypandoc package has a huge extra dependency tree, mostly texlive add-ons, i.e. in addition to the main pandoc package (which itself has lots of texlive dependencies).

$ sudo dnf install python2-pypandoc python3-pypandoc
....
...
Installed:
  python2-pypandoc.noarch 1.1.3-1.fc24                                  python3-pypandoc.noarch 1.1.3-1.fc24                                                           
  texlive-avantgar.noarch 5:svn31835.0-24.fc24.1                        texlive-bookman.noarch 5:svn31835.0-24.fc24.1                                                  
  texlive-charter.noarch 5:svn15878.0-24.fc24.1                         texlive-cm-super.noarch 5:svn15878.0-24.fc24.1                                                 
  texlive-cmextra.noarch 5:svn32831.0-24.fc24.1                         texlive-collection-fontsrecommended.noarch 5:svn35830.0-24.20150728_r37987.fc24.1              
  texlive-courier.noarch 5:svn35058.0-24.fc24.1                         texlive-euro.noarch 5:svn22191.1.1-24.fc24.1                                                   
  texlive-eurosym.noarch 5:svn17265.1.4_subrfix-24.fc24.1               texlive-fpl.noarch 5:svn15878.1.002-24.fc24.1                                                  
  texlive-helvetic.noarch 5:svn31835.0-24.fc24.1                        texlive-lm-math.noarch 5:svn36915.1.959-24.fc24.1                                              
  texlive-manfnt-font.noarch 5:svn35799.0-24.fc24.1                     texlive-mathpazo.noarch 5:svn15878.1.003-24.fc24.1                                             
  texlive-mflogo-font.noarch 5:svn36898.1.002-24.fc24.1                 texlive-ncntrsbk.noarch 5:svn31835.0-24.fc24.1                                                 
  texlive-palatino.noarch 5:svn31835.0-24.fc24.1                        texlive-pxfonts.noarch 5:svn15878.0-24.fc24.1                                                  
  texlive-rsfs.noarch 5:svn15878.0-24.fc24.1                            texlive-scheme-basic.noarch 5:svn25923.0-24.20150728_r37987.fc24.1                             
  texlive-symbol.noarch 5:svn31835.0-24.fc24.1                          texlive-tex-gyre.noarch 5:svn18651.2.004-24.fc24.1                                             
  texlive-tex-gyre-math.noarch 5:svn36916.0-24.fc24.1                   texlive-times.noarch 5:svn35058.0-24.fc24.1                                                    
  texlive-txfonts.noarch 5:svn15878.0-24.fc24.1                         texlive-utopia.noarch 5:svn15878.0-24.fc24.1                                                   
  texlive-wasy.noarch 5:svn35831.0-24.fc24.1                            texlive-wasy2-ps.noarch 5:svn35830.0-24.fc24.1                                                 
  texlive-wasysym.noarch 5:svn15878.2.0-24.fc24.1                       texlive-zapfchan.noarch 5:svn31835.0-24.fc24.1

For other OSes (especially Windows), or if someone wants to install pypandoc manually (or via pip), it will also install pandoc!

There are other packages which might provide a simpler wrapper that verifies the pandoc binary is installed.

Closing Python tags not always at end of line

Currently the script assumes </python> will be on a line of its own, but it need not be. See also #5 and 4514824
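
A sketch of a pre-fixer that does not depend on line boundaries, assuming the goal is to rewrite the non-standard python tags as source tags before calling pandoc. The names are hypothetical, and any attributes on the opening tag (such as an id) are simply dropped.

```python
import re

# Opening tag with optional attributes (e.g. id="recipe"), anywhere in a line.
OPEN_TAG = re.compile(r"<python(\s[^>]*)?>", re.IGNORECASE)
# Closing tag, anywhere in a line.
CLOSE_TAG = re.compile(r"</python\s*>", re.IGNORECASE)


def fix_python_tags(text):
    """Rewrite <python> blocks as <source lang="python"> blocks."""
    text = OPEN_TAG.sub('<source lang="python">', text)
    return CLOSE_TAG.sub("</source>", text)
```

Because both patterns are applied to the whole text rather than line by line, a closing tag followed by more text on the same line is handled too.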

Fix wiki links

May have to add .html to the links/filenames (or .md for use on GitHub rendering), but ideally would preserve the extension-less URLs to match the source MediaWiki website.

Another python tag quirk when given id

Example of broken MediaWiki from my naive pre-fixer,

<python id="recipe">
from Bio import SeqIO
...
</source>

Taken from Biopython page wiki/User:Davidw/cookbook.mediawiki

Handle MediaWiki categories

e.g. [[Category:Cookbook]] (generally the last line in a page) flags the page for inclusion in the special wiki page .../wiki/Category:Cookbook.

Can we map these to Jekyll tags in the markdown header?
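
As a rough illustration, the category links could be pulled out of the MediaWiki text before conversion and then written into the header. This is only a sketch: the function name is hypothetical, and how the resulting list should appear in the Jekyll header (tags, categories, or something else) is left open.

```python
import re

# Case-insensitive, so lower case variants like [[category:Cookbook]] match.
# A leading colon, as in [[:Category:Cookbook]], is a plain link and is ignored.
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def extract_categories(mediawiki_text):
    """Return (text without category links, list of category names)."""
    categories = [m.group(1) for m in CATEGORY_RE.finditer(mediawiki_text)]
    cleaned = CATEGORY_RE.sub("", mediawiki_text).rstrip() + "\n"
    return cleaned, categories
```

The cleaned text would go to pandoc, and the names could be emitted as a YAML list in the front matter.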

ERROR: Unexpected input obf_mediawiki_dump_2024-02-05.xml

Hi,

This is somewhat of a follow-up to #38, the wish to see mediawiki_to_md.py working.

I have #39 applied, and I'm using Python 3.

I executed these commands:

git init issue40
cd issue40/
wget -O obf_mediawiki_dump_2024-02-05.xml.bz2 https://www.dropbox.com/s/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2?dl=0
bunzip2 obf_mediawiki_dump_2024-02-05.xml.bz2 
ls -lh
head obf_mediawiki_dump_2024-02-05.xml 
../mediawiki_to_md.py obf_mediawiki_dump_2024-02-05.xml 
../mediawiki_to_md.py --input obf_mediawiki_dump_2024-02-05.xml

I got this output:

stappers@laptop:~/src/github/mediawiki_to_git_md
$ git init issue40
Initialized empty Git repository in /home/gs0604/src/github/mediawiki_to_git_md/issue40/.git/
stappers@laptop:~/src/github/mediawiki_to_git_md
$ cd issue40/
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ wget -O obf_mediawiki_dump_2024-02-05.xml.bz2 https://www.dropbox.com/s/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2?dl=0
--2024-07-15 22:19:25--  https://www.dropbox.com/s/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2 [following]
--2024-07-15 22:19:27--  https://www.dropbox.com/s/raw/edp7ukdhg2onls7/obf_mediawiki_dump_2024-02-05.xml.bz2
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com/cd/0/inline/CWzcE3O79ouVHBEP5Wa5jo_nWVRzuXfjkD43_yUNzfQQsdRdSBhiUOWMT9RjwOmOJy9XNHPw7k_0s9YYj910YlWlqqW3fQmlozGpycMaaIv2eSk8Xbot0gfZuKB_uK9q7Lg/file# [following]
--2024-07-15 22:19:27--  https://uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com/cd/0/inline/CWzcE3O79ouVHBEP5Wa5jo_nWVRzuXfjkD43_yUNzfQQsdRdSBhiUOWMT9RjwOmOJy9XNHPw7k_0s9YYj910YlWlqqW3fQmlozGpycMaaIv2eSk8Xbot0gfZuKB_uK9q7Lg/file
Resolving uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com (uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com)... 162.125.65.15, 2620:100:6021:15::a27d:410f
Connecting to uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com (uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com)|162.125.65.15|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /cd/0/inline2/CWw4RiqFmRxq166Jy7YLmZ6hm4uUWtwa4y9eUvLDwUFOIbKGDhxW0H7XRdMClFme5YAHFDkB9vBr33qcxyb5V3k-91VhD7-lSe3eC9up1kRml3roKOfwIdKzj6T8uvu4ULHv4DjD-KhI1VjehSK_YDr6Ol_UttTcENVvayodo5tIOVY4ZVuOqBGTIMfxrCFYcrUvdynwCP8xjy1mjgsUOfnTL1eKvZE1xqfCw0wu723uy6PXYV0g-e5252y7LO6oHyvzJ_4iucbM2YqWEHma2LpgCPgxp77CL5V4WxgfDmGzPYco23IHWAXMBKTRMy6Vl6ckV0wM98qEsQLhMjQ_axp3aGcuOzuS3DqFP38f1l_A3A/file [following]
--2024-07-15 22:19:29--  https://uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com/cd/0/inline2/CWw4RiqFmRxq166Jy7YLmZ6hm4uUWtwa4y9eUvLDwUFOIbKGDhxW0H7XRdMClFme5YAHFDkB9vBr33qcxyb5V3k-91VhD7-lSe3eC9up1kRml3roKOfwIdKzj6T8uvu4ULHv4DjD-KhI1VjehSK_YDr6Ol_UttTcENVvayodo5tIOVY4ZVuOqBGTIMfxrCFYcrUvdynwCP8xjy1mjgsUOfnTL1eKvZE1xqfCw0wu723uy6PXYV0g-e5252y7LO6oHyvzJ_4iucbM2YqWEHma2LpgCPgxp77CL5V4WxgfDmGzPYco23IHWAXMBKTRMy6Vl6ckV0wM98qEsQLhMjQ_axp3aGcuOzuS3DqFP38f1l_A3A/file
Reusing existing connection to uc3b64c3744fba36caf5b205fa69.dl.dropboxusercontent.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 126581686 (121M) [application/octet-stream]
Saving to: 'obf_mediawiki_dump_2024-02-05.xml.bz2'

obf_mediawiki_dump_2024-02-05.xml.bz2              100%[=============================================================>] 120.72M  2.25MB/s    in 58s     

2024-07-15 22:20:28 (2.07 MB/s) - 'obf_mediawiki_dump_2024-02-05.xml.bz2' saved [126581686/126581686]

stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ bunzip2 obf_mediawiki_dump_2024-02-05.xml.bz2 
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ ls -lh
totaal 221M
-rw-r--r-- 1 stappers stappers 221M 15 jul 22:20 obf_mediawiki_dump_2024-02-05.xml
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ head obf_mediawiki_dump_2024-02-05.xml 
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Open Bioinformatics Foundation</sitename>
    <dbname>obfiztkb_mw688</dbname>
    <base>https://www.open-bio.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.3</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ ../mediawiki_to_md.py obf_mediawiki_dump_2024-02-05.xml 
usage: mediawiki_to_md.py [-h] -i NAMES [NAMES ...] [-p PREFIX] [--mediawiki-ext EXT] [--markdown-ext EXT]
mediawiki_to_md.py: error: the following arguments are required: -i/--input
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$ ../mediawiki_to_md.py --input obf_mediawiki_dump_2024-02-05.xml 
Will be using pandoc 2.17.1.1
ERROR: Unexpected input obf_mediawiki_dump_2024-02-05.xml
stappers@laptop:~/src/github/mediawiki_to_git_md/issue40
$

How to get beyond the ERROR: Unexpected input obf_mediawiki_dump_2024-02-05.xml?

For what it is worth:

$ python3 --version
Python 3.11.2
$ python3
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
$

Convert usernames in commit comments

Sometimes the mediawiki comments have usernames, which we can map to @ style GitHub usernames. e.g.

Reverted edits by [[Special:Contributions/xxx|xxx]] ([[User talk:xxx|talk]]) to last revision by [[User:Peter|Peter]]

-->

Reverted edits by @xxx to last revision by @peterjc

As a bonus this is much shorter so the logs look nicer.
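
The rewrite above could be done with two regular expressions, sketched here. USER_MAP is a hypothetical lookup table from wiki usernames to GitHub handles; names missing from it are passed through unchanged, which may not be desirable if an unrelated GitHub account happens to have that name.

```python
import re

# Hypothetical mapping from wiki usernames to GitHub handles.
USER_MAP = {"Peter": "peterjc"}

# [[Special:Contributions/xxx|xxx]] optionally followed by ([[User talk:xxx|talk]])
CONTRIB_RE = re.compile(
    r"\[\[Special:Contributions/([^|\]]+)\|[^\]]*\]\]"
    r"(?:\s*\(\[\[User talk:[^\]]*\]\]\))?"
)
# [[User:Peter|Peter]] or [[User:Peter]]
USER_RE = re.compile(r"\[\[User:([^|\]]+)(?:\|[^\]]*)?\]\]")


def github_handle(match):
    """Turn a matched wiki username into an @ style handle."""
    name = match.group(1).strip()
    return "@" + USER_MAP.get(name, name)


def tidy_comment(comment):
    """Replace wiki user links in a revision comment with @ handles."""
    comment = CONTRIB_RE.sub(github_handle, comment)
    return USER_RE.sub(github_handle, comment)
```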

Deal with MediaWiki attachments nicely, e.g. images

A reprise of #1, see 8ed3dbd

When dumpBackup.php is run with --include-files --uploads then you get the images and other attachments in the XML, base64 encoded - including their revisions. This opens the door to a self contained solution to grabbing the images (although the workaround in #1 could be useful if the dump was done without this, and the wiki is still live).

However, the XML parser needs to consider <revision> vs <upload> tags, and the script may need refactoring to handle the file contents vs a page's text contents.
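
A sketch of reading the embedded files with the standard library. It assumes each upload element carries filename, timestamp and base64 encoded contents child elements, as in the export schema when files are included; the function names are hypothetical and the layout should be checked against an actual dump.

```python
import base64
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}upload' -> 'upload'."""
    return tag.rsplit("}", 1)[-1]


def iter_uploads(xml_source):
    """Yield (filename, timestamp, data) for each <upload> element."""
    for _event, element in ET.iterparse(xml_source, events=("end",)):
        if local_name(element.tag) != "upload":
            continue
        fields = {local_name(child.tag): (child.text or "") for child in element}
        yield (
            fields.get("filename", ""),
            fields.get("timestamp", ""),
            base64.b64decode(fields.get("contents", "")),
        )
        element.clear()  # free memory, since dumps can be large
```

Page text would still come from the revision elements; this generator only covers the file contents side.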

Image scaling during MediaWiki to Markdown conversion

Reported by @vincentdavis on issue #1

Some images are scaled, e.g. in http://biopython.org/wiki/Phylo or its MediaWiki equivalent https://github.com/peterjc/peterjc.github.io/blob/master/wiki/Phylo.mediawiki

[[File:phylo-draw-apaf1.png|256px|thumb|right|Rooted phylogram, via Phylo.draw]]

which becomes after conversion to GFM markdown with pandoc https://github.com/peterjc/peterjc.github.io/blob/master/wiki/Phylo.md

![Rooted phylogram, via Phylo.draw](phylo-draw-apaf1.png "fig:Rooted phylogram, via Phylo.draw")

On the actual wiki the image URL is:
http://biopython.org/w/images/thumb/0/04/Phylo-draw-apaf1.png/256px-Phylo-draw-apaf1.png

If you want to see the full size image:
http://biopython.org/w/images/0/04/Phylo-draw-apaf1.png

@vincentdavis found a document that suggests code like this can be used for scaling images.

[[ http://url.to/image.png | height = 100px ]]

RuntimeError: Proxy error(ArgumentOutOfRangeException): Year, Month, and Day parameters describe an un-representable DateTime.

why?

MediaWiki templates

Can we deal with these? e.g. Render them prior to passing to pandoc? Replace with a Jekyll templating mechanism?

Errors when converting a dump with a localization other than English

In German instances of MediaWiki, some tags that are hard-coded in the script rather than localized can cause errors and eventually break the process.

E.g. 'Kategorie' and 'Category'.

I worked around this issue by search-replacing these throughout the dump. Not beautiful, but it works.

Deal with MediaWiki attachments, e.g. images

Avoiding colons in filenames (for working on Windows)

See biopython/biopython.github.io#21 filed by @MarkusPiotrowski

MediaWiki uses Categories:XXX and User:XXX etc in its URLs, which means currently we end up with files on disk named Categories:XXX.md and so on. This is fine on Mac/Linux (although you do sometimes need to escape the : as \: at the command line), but breaks on Windows.

Can we use a "safe" filename like Categories_XXX.md and exploit some Jekyll setting in the header to specify the URL for the page with a colon in it?

On a related point, we already have to escape the colon as %3A for relative URLs to these pages, e.g. 9aa53af and 3682565 although long term pandoc might take care of that - jgm/pandoc#2849
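
One way this might look, assuming the Jekyll permalink setting in the page header is enough to restore the URL with the colon. The function name and the wiki/ prefix are hypothetical, and titles containing double quotes would need extra YAML escaping.

```python
def safe_page_paths(title, prefix="wiki/", ext=".md"):
    """Return (filename on disk, Jekyll front matter) for a page title."""
    name = title.replace(" ", "_")
    # Colons are not allowed in Windows filenames, so swap them on disk only.
    filename = prefix + name.replace(":", "_") + ext
    # Quote the values, since a bare colon is significant in YAML.
    front_matter = f'---\ntitle: "{title}"\npermalink: "/{prefix}{name}"\n---\n'
    return filename, front_matter
```

Note that two different titles could map to the same safe filename (for instance one with a colon and one with an underscore in the same position), so a real implementation would need a collision check.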

Convert <bash> tags as another <source> variant.

e.g. the Intergenic_regions page on the Biopython wiki, see also biopython/biopython.github.io#39

The converter is not robust against markup errors within the MediaWiki dump

When processing a dump that contains a markup error such as [[Category: Something] instead of [[Category: Something]] the script gets stuck.

Deal with MediaWiki's File: and Media: links

There are case variants and optional leading colons, but focusing on the more commonly used title cases we have things like:

Treasurer's 2011 report: [[File:2011-OBF-Treasurers-Report_v1.pdf]] 

Click '''[[Media:BOSC2009_program_20090601.pdf | here]]''' to download
full program in PDF format (2 MB).

[[Image:Bioperl-Heidelberg-1999-2.jpg|frame|During ISMB, an introduction
meeting and planning session was held with some of the volunteers.]]

MediaWiki "File:" takes you to a page at wiki/File:... about the file with its history etc. For a converted wiki, that could be turned into a link to the file on GitHub.

MediaWiki "Media:" takes you directly to the file, typically under /w/... rather than /wiki/... but we don't have to follow that convention.

MediaWiki "Image:" shows the image in situ with caption below. Typically hosted under /w/... and clickable to the "File:" page under /wiki/File:...

Pandoc will strip the "Image:" prefix leaving a relative URL, and therefore leaving the images under /wiki/ (or however configured) next to the converted Markdown files makes sense.

However Pandoc currently leaves the "File:" and "Media:" prefixes in place.

Option 1

Could generate File:<name>.md and Media:<name>.md files (with colon escaping like the category pages), which can show images in situ and/or give links to the raw file, and link to the GitHub history page for the file? Seems of limited value.

Again, actual placement of the files in the git repository could be configurable.

Option 2

Pre/post process these to give a useful URL. Simplest would be to treat them like images and leave them in the same directory as the converted MarkDown files (and use relative links).

Might make this configurable, e.g. put them under "media/" with Markdown under "prefix/" via the existing prefix option?

Auto-squash flurries of revisions from same author into one commit?

Often see several revisions to the same page (with same/no comment) by the same user. Should these be collapsed into a single squashed git commit?
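
If the answer is yes, the grouping could be sketched with itertools.groupby. The revision dictionaries and their keys are a hypothetical structure for this example, not the script's actual data model.

```python
from itertools import groupby


def squash_flurries(revisions):
    """Collapse consecutive revisions sharing page, author and comment.

    Each revision is assumed to be a dict with 'title', 'user', 'comment'
    and 'text' keys, listed in chronological order.
    """
    def key(rev):
        return rev["title"], rev["user"], rev["comment"]

    squashed = []
    for _key, group in groupby(revisions, key=key):
        # The dump stores full snapshots, so the last one supersedes the rest.
        squashed.append(list(group)[-1])
    return squashed
```

Only adjacent revisions are merged, so an intervening edit by someone else (or to another page) starts a new commit. A time window between revisions might also be worth considering.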

Skip empty commits (e.g. reverts after a skipped spammer's edit)

What to do with sub-folders (slashes in MediaWiki page names)?

Sometimes MediaWiki pages have slashes in the names, which become child folders after migration.

See OBF/OBF.github.io#1 where we have /wiki/BOSC_2015.md but also the folder /wiki/BOSC_2015/ which means the Jekyll website links to .../wiki/BOSC_2015 break.

See biopython/biopython.github.io#8 where we have /wiki/BioSQL.md and /wiki/BioSQL/Windows.md which means the Jekyll website links to .../wiki/BioSQL break.

Spot case variation in category tags

MediaWiki seems to accept case variants like this, which convert.py did not spot, e.g.:

 [[category:Cookbook]]

Escape colons in permalinks and redirect_from

See e.g. biopython/biopython.github.io@3af6e87 for biopython/biopython.github.io#96

That was fixed in the Biopython wiki some time after this script was used.

Fix inter-wiki links

e.g. [bp:SeqIO] used on Biopython wiki to link to SeqIO page on the sister BioPerl wiki, via a MediaWiki plugin SpecialInterwiki.php

Quotes within article titles cause problems with the git command

The process gets stuck or is killed by a git error when the converter has to deal with articles which contain quotation marks within their titles. The converter creates the files as expected, but the git command exits with an error.

I worked around this issue by:

create a copy of the generated .md and .mediawiki files
issue git add wiki/ && git commit -m "fix quotation"
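
If the failure comes from building the git command as a single shell string (an assumption, since the cause is not confirmed here), passing the arguments as a list avoids quoting problems altogether. A sketch with hypothetical function names:

```python
import subprocess


def git_commands(filenames, message, author, date):
    """Build argument lists for 'git add' and 'git commit'."""
    return [
        # "--" stops git treating an odd filename as an option.
        ["git", "add", "--", *filenames],
        ["git", "commit", f"--author={author}", f"--date={date}", "-m", message],
    ]


def run_git(commands, cwd="."):
    """Run each command without a shell; raise if git reports an error."""
    for argv in commands:
        # No shell is involved, so quotes in filenames need no escaping.
        subprocess.run(argv, cwd=cwd, check=True)
```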

Handle case changes in article titles

The script appears to not correctly handle case changes in article titles. It aborts with a git error about unknown files due to git add and git commit referencing a filename with incorrect case.

Wish: mediawiki_export.xml in this git repo

Hi,

My attempts to convert a MediaWiki XML export to a git repository failed. (Most likely due to an ancient MediaWiki engine on my side.)

I think it would be a good thing, if this git repository contains a known good mediawiki XML export. Good for this project for having a reference XML, good for new users like me for having a "See, it works!".

Please add a mediawiki_export.xml to this git repository.

Regards
Geert Stappers

Handle MediaWiki User:XXX pages

This will mean dealing with many of the same colon issues as with categories, #6.

Do not force capitalise upload images etc

Uploaded image filenames can be entirely lower case.

Ignoring revisions by username (e.g. spam) can taint following revisions

The MediaWiki dump records each revision of a page in full as a snapshot. Git records each revision as a diff from the previous version. This becomes an issue if we try to skip unwanted revisions...

e.g. this is fine

Spam from A
Revert from B

If A is on the user block list, we skip the spam revision, and then have an empty commit from B which v1 of the script later drops. Likewise this kind of example is also fine:

Spam from A
More spam from A or A2
Revert all from B

This is also fine if there are intermediate revisions to other pages.

However,

Spam from A
Innocent edit elsewhere in the page from C
Revert from B

This time we'd record a combined commit from C which sadly includes the spam from A. Not good.
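
One possible way to at least detect the problem case is to track whether a page is "tainted" by a skipped revision until someone reverts to a snapshot already known to be good. This is only a sketch of that idea with hypothetical names; what to do with a tainted revision (skip it, or flag it for manual review) is left open.

```python
import hashlib


def classify_revisions(revisions, blocked_users):
    """Label one page's revisions as 'skip', 'tainted' or 'ok'.

    revisions: (user, text) pairs for a single page, oldest first.
    """
    seen_good = set()  # hashes of page texts believed to be spam free
    tainted = False
    labels = []
    for user, text in revisions:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if user in blocked_users:
            tainted = True
            labels.append("skip")
            continue
        if digest in seen_good:
            tainted = False  # a revert to a known good snapshot
        if tainted:
            labels.append("tainted")
        else:
            seen_good.add(digest)
            labels.append("ok")
    return labels
```

In the third example above, the edit from C would be labelled tainted and the revert from B would be ok. A page whose first revision came from a blocked user never has a known good snapshot, so it would stay tainted under this scheme.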

Update for Python 3

See e.g. #30 although I will probably do it a bit differently.

Handling MediaWiki revisions which pandoc cannot parse

I've seen this with mal-formed tables (where the user generally fixed the page in a subsequent revision).

Currently these problem revisions are simply ignored, which is fine if the user fixed the issue themselves in the next revision: we end up with a combined commit.
