Comments (9)
Switched from macOS to Linux: over 3 million revisions were parsed in ~15 mins, but then it hit a 32 GB memory limit. It occurs to me that the SQLite database currently has no indexing - I'd never pushed the script to such a large example.
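As a rough illustration of the indexing idea, here is a minimal sketch using Python's standard `sqlite3` module. The table name `revisions`, the column `date_time`, and the index name are assumptions for illustration only; the script's real schema may differ, and whether an index would help with the memory limit is untested.

```python
import sqlite3

# Hypothetical schema: the real script's table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (title TEXT, date_time TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO revisions VALUES (?, ?, ?)",
    [
        ("Page B", "2023-09-20T10:00:00Z", "second"),
        ("Page A", "2023-09-19T08:30:00Z", "first"),
    ],
)

# Index the column used for sorting, so an ORDER BY on it has the
# option of walking the index rather than sorting every row.
conn.execute("CREATE INDEX idx_revisions_date_time ON revisions (date_time)")

# Confirm the index was created by looking it up in SQLite's catalogue.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
]
titles = [
    row[0]
    for row in conn.execute("SELECT title FROM revisions ORDER BY date_time")
]
print(index_names, titles)
```

Running `EXPLAIN QUERY PLAN` on the sorting query would show whether SQLite actually chooses the index for a given dump.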
from mediawiki_to_git_md.
@mathieujobin Currently the script goes from the XML dump to SQLite, then to MediaWiki files on disk, then to Markdown files on disk, which get tracked in git.
The SQLite intermediate is there to sort the changes so that the git log is chronological. Looking back, the earlier version checked in had this, even before it dealt with uploaded files - so perhaps the XML is sorted by page first?
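To show why an intermediate store helps here, the following sketch loads revisions in page order (as the XML might supply them) and reads them back in timestamp order. The function name `chronological`, the table layout, and the sample data are all hypothetical and not taken from the script.

```python
import sqlite3


def chronological(revisions):
    """Yield (title, timestamp) pairs sorted by timestamp via a SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revisions (title TEXT, date_time TEXT)")
    conn.executemany("INSERT INTO revisions VALUES (?, ?)", revisions)
    # ISO 8601 UTC timestamps of the same format sort correctly as strings.
    query = "SELECT title, date_time FROM revisions ORDER BY date_time"
    for row in conn.execute(query):
        yield row
    conn.close()


# Sample input grouped by page, not by time (made-up data).
page_ordered = [
    ("Page A", "2020-01-01T00:00:00Z"),
    ("Page A", "2022-06-01T12:00:00Z"),
    ("Page B", "2021-03-15T09:30:00Z"),
]
history = list(chronological(page_ordered))
for title, stamp in history:
    print(stamp, title)
```

Each row read back this way could become one git commit, giving a log that interleaves pages in time order.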
My guess is an invalid date (e.g. month and day mixed up from US style somewhere, or something strange like 29 February in a non-leap year).
I would start by adding some exception handling to print out some debug information.
Are you willing and able to share the Wiki dump with me by email (assuming it is not overly large)?
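The exception handling suggested above might look something like this sketch. It assumes the timestamps use the `YYYY-MM-DDTHH:MM:SSZ` form found in MediaWiki XML exports; the helper name `parse_timestamp` and its `title` argument are hypothetical, and the actual failure in the script may be somewhere else entirely.

```python
import sys
from datetime import datetime


def parse_timestamp(raw, title="?"):
    """Parse a timestamp, returning None and printing debug info if invalid."""
    try:
        return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError as err:
        # Report which page and which raw value caused the problem.
        sys.stderr.write(f"Bad timestamp {raw!r} on page {title!r}: {err}\n")
        return None


ok = parse_timestamp("2023-09-20T10:00:00Z", "Example")
# 29 February in a non-leap year is rejected by strptime.
bad = parse_timestamp("2023-02-29T10:00:00Z", "Example")
```

Printing the page title alongside the raw value makes it easier to find the offending revision in a large dump.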
https://dumps.wikimedia.org/zhwiki/20230920/#:~:text=zhwiki%2D20230920%2Dpages%2Darticles.xml.bz2
You can try this one, @peterjc.
Larger than I was expecting, assuming this is the URL you meant: zhwiki-20230920-pages-articles.xml.bz2 (2.5 GB)
I need to have a clean out - this machine's drive is fuller than I thought!
Do you still have the full traceback? I wanted to check where in the code this RuntimeError was triggered.
[The size of the Chinese wiki example makes testing this harder]
This script is not really suitable for a wiki dump this big! I killed it after 30 minutes, by which point Python was apparently using 18 GB of RAM and had only recorded 1.8 million revisions in SQLite (taking 3.8 GB).
Update: The file has over 4 million revisions, so I got less than halfway:
$ cat zhwiki-20230920-pages-articles.xml.bz2 | bzip2 -d | grep "<revision>" -c
4339799
[I'm trying this on Python 3 with some modifications, I assume you are using it on Python 2 - see issue #33]
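For anyone without `bzip2` and `grep` to hand, the revision count above can be reproduced in Python with the standard `bz2` module, reading the dump line by line so memory use stays small. This is only a sketch: the function name `count_revisions` is made up, and the demo uses a tiny generated file rather than the real dump.

```python
import bz2
import os
import tempfile


def count_revisions(path):
    """Count lines containing <revision> in a bz2 file, like grep -c."""
    count = 0
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if "<revision>" in line:
                count += 1
    return count


# Demo on a tiny made-up file; substitute the real dump's path to count it.
sample = "<page>\n<revision>\n</revision>\n<revision>\n</revision>\n</page>\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sample.xml.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write(sample)
    demo_count = count_revisions(demo_path)
print(demo_count)
```

Like `grep -c`, this counts matching lines rather than individual occurrences, so two tags on one line would be counted once.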
I use Ubuntu 20.04 and Python 3.8.
I'm curious whether it would be possible to migrate straight from MySQL to Markdown/Git without the SQLite intermediate DB?