Comments (9)

peterjc commented on July 18, 2024

I switched from macOS to Linux: over 3 million revisions were parsed in about 15 minutes, but the run then hit a 32GB memory limit. It occurs to me that the SQLite database currently has no indexes. I had never pushed the script on such a large example.
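
As a rough illustration of the indexing point, here is a minimal sketch using Python's standard `sqlite3` module. The table name `revisions`, its columns, and the index name `idx_revisions_date` are assumptions for illustration, not the script's actual schema. An index on the timestamp column can let SQLite read rows in date order without a separate sort step; whether that would help with the memory limit described above is not established here.

```python
import sqlite3

# Hypothetical schema: table, column, and index names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (title TEXT, date TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO revisions VALUES (?, ?, ?)",
    [
        ("Page_B", "2023-09-20T10:00:00Z", "second edit"),
        ("Page_A", "2021-01-05T08:30:00Z", "first edit"),
    ],
)

# Index the column used for ordering the revisions.
conn.execute("CREATE INDEX IF NOT EXISTS idx_revisions_date ON revisions (date)")
conn.commit()

# List the indexes SQLite now knows about.
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(index_names)

# Ask SQLite how it would run the ordered query.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT title FROM revisions ORDER BY date"):
    print(row)
```

The `EXPLAIN QUERY PLAN` output is a quick way to check whether a given query would actually use the index before committing to a multi-hour run.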

from mediawiki_to_git_md.

peterjc commented on July 18, 2024

@mathieujobin Currently the script converts the XML dump to SQLite, then to MediaWiki files on disk, then to Markdown files on disk, which get tracked in git.

The SQLite intermediate is there to sort the changes so that the git log is chronological. Looking back, the earlier checked-in version already had this, even before it dealt with uploaded files, so perhaps the XML is sorted by page first?
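
To make the sorting step concrete, here is a minimal sketch, assuming the dump lists revisions grouped by page (which the comment above only suspects) and that timestamps are ISO 8601 UTC strings, which sort correctly as plain text. The function name `chronological` and the two-column table are hypothetical, not the script's real code.

```python
import sqlite3


def chronological(revisions):
    """Return (title, timestamp) pairs ordered by timestamp, oldest first."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-column table, just enough to show the ordering.
    conn.execute("CREATE TABLE revisions (title TEXT, date TEXT)")
    conn.executemany("INSERT INTO revisions VALUES (?, ?)", revisions)
    rows = conn.execute(
        "SELECT title, date FROM revisions ORDER BY date, title"
    ).fetchall()
    conn.close()
    return rows


# Input grouped by page, as the XML dump is suspected to be.
page_grouped = [
    ("Page_A", "2020-01-01T00:00:00Z"),
    ("Page_A", "2022-06-01T12:00:00Z"),
    ("Page_B", "2021-03-15T09:30:00Z"),
]

# Output interleaves pages so commits could be made in date order.
for title, date in chronological(page_grouped):
    print(date, title)
```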

peterjc commented on July 18, 2024

My guess is an invalid date (e.g. month and day mixed up from US style somewhere, or something strange like 29 February in a non-leap year).

I would start by adding some exception handling to print out some debug information.
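
A minimal sketch of that kind of exception handling, assuming the timestamps use the `YYYY-MM-DDTHH:MM:SSZ` form. The helper name `parse_timestamp` and its `context` argument are hypothetical and not taken from the script.

```python
from datetime import datetime


def parse_timestamp(text, context=""):
    """Parse a timestamp, printing debug information instead of crashing."""
    try:
        return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError as err:
        # Report which value failed and where it came from, then carry on.
        print(f"Could not parse timestamp {text!r} ({context}): {err}")
        return None


good = parse_timestamp("2023-09-20T01:02:03Z", context="Page_A")
# 29 February in a non-leap year is rejected by strptime.
bad = parse_timestamp("2023-02-29T00:00:00Z", context="Page_B")
```

Printing the offending value along with the page it came from should be enough to tell whether the date itself is malformed or whether the problem lies elsewhere.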

Are you willing and able to share the Wiki dump with me by email (assuming it is not overly large)?

459737087 commented on July 18, 2024

https://dumps.wikimedia.org/zhwiki/20230920/#:~:text=zhwiki%2D20230920%2Dpages%2Darticles.xml.bz2

You can try this one. @peterjc

peterjc commented on July 18, 2024

Larger than I was expecting, assuming this is the file you meant: zhwiki-20230920-pages-articles.xml.bz2 (2.5 GB)

I need to have a clean out - this machine's drive is fuller than I thought!

peterjc commented on July 18, 2024

Do you still have the full traceback? I wanted to check where in the code this RuntimeError was triggered.

[The size of the Chinese wiki example makes testing this harder]

peterjc commented on July 18, 2024

This script is not really suitable for a wiki dump this big! It ran for 30 minutes before I killed it, by which point Python was apparently using 18GB of RAM and had only recorded 1.8 million revisions in SQLite (taking 3.8GB).

Update: The file has over 4 million revisions, so I got less than halfway:

$ cat zhwiki-20230920-pages-articles.xml.bz2 | bzip2 -d | grep "<revision>" -c
4339799
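
The same count can be sketched in Python with the standard `bz2` module, streaming the file line by line so memory use stays small. The function name `count_revisions` is hypothetical. Like `grep -c`, it counts lines containing the tag rather than individual occurrences.

```python
import bz2


def count_revisions(path):
    """Count lines containing "<revision>" in a bz2-compressed XML dump."""
    count = 0
    # Text mode decompresses on the fly, so the whole dump is never in memory.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if "<revision>" in line:
                count += 1
    return count


# Example call (path taken from the command above):
# count_revisions("zhwiki-20230920-pages-articles.xml.bz2")
```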

[I'm trying this on Python 3 with some modifications. I assume you are using it on Python 2, see issue #33]

459737087 commented on July 18, 2024

I use Ubuntu 20.04 and Python 3.8.

mathieujobin commented on July 18, 2024

I'm curious whether it would be possible to migrate straight from MySQL to Markdown/Git, without the SQLite intermediate DB?
