tietew / mediawiki-xml2sql
Dead project -- feel free to fork and update!
License: Other
xml2sql fails on recent Wikipedia dumps because keywords has no definition for the "redirect" element.
$ xml2sql swwiki-20090821-pages-articles.xml
unexpected element <redirect>
xml2sql: parsing aborted at line 209 pos 16.
This issue can be solved by adding a definition for redirect to keywords:
--- a/keywords
+++ b/keywords
@@ -20,6 +20,7 @@ enum element {
el_minor,
el_comment,
el_text,
+ el_redirect,
};
%}
and re-run gperf as explained in mediawiki.c.
babilen
I am trying to run the example from the README, but this file does not exist: http://download.wikimedia.org/enwiki/pages-meta-current.xml.bz2
Instead, I downloaded this one: https://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pages-meta-current.xml.bz2, but xml2sql fails to import it. Is this file a valid input for your program?
$ bunzip2 -c enwiki-20160305-pages-meta-current.xml.bz2 | xml2sql -m
unexpected element <dbname>
xml2sql: parsing aborted at line 4 pos 12.
It works (it creates page.sql, revision.sql and text.sql) if I remove some of the tags as follows:
$ cat enwiki-20160305-pages-meta-current1.xml-p000000010p000030303 | egrep -v "<dbname>|<ns>|<redirect|<parentid>|<model>|<format>|<sha1>" | xml2sql -m
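For anyone who prefers a script to the egrep pipeline, the same filter can be sketched in Python. This is only an illustration of the workaround above, not part of xml2sql. The name drop_unsupported is made up for this example, and the pattern is copied from the egrep command. Like egrep -v, it drops whole lines, so it assumes each of these elements sits on its own line in the dump.

```python
import re

# Element patterns copied from the egrep workaround above.
UNSUPPORTED = re.compile(
    r"<dbname>|<ns>|<redirect|<parentid>|<model>|<format>|<sha1>"
)


def drop_unsupported(lines):
    """Yield only the lines that match none of the unsupported elements.

    Hypothetical helper for illustration; it works line by line,
    exactly like `egrep -v`, and does not parse the XML.
    """
    for line in lines:
        if not UNSUPPORTED.search(line):
            yield line
```

To use it as a pipe stage, write `sys.stdout.writelines(drop_unsupported(sys.stdin))` in a small script and place that script between bunzip2 and xml2sql.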
Does this mean that the Wikipedia dump format has evolved and mediawiki-xml2sql needs to be updated? Or is there an alternative tool that achieves the same thing?
Also, the three generated SQL files have INSERT INTO statements, but the CREATE TABLE statements are missing. Can you please tell me the required CREATE TABLE statements?
If I run xml2sql, I get the following output for revision:
$ head -n 6 revision.sql
-- xml2sql - MediaWiki XML to SQL converter
-- Table revision for PostgreSQL
COPY "revision" FROM STDIN;
287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 2009-08-19T03:20:58Z 1 0
32885 10 32885 +jamii 160 Baba Tabita 2007-02-02T10:14:01Z 1 0
If I try to import this, I get the following error:
psql -h localhost -d test2 -U babilen -W < revision.sql
Password for user babilen:
ERROR: missing data for column "rev_len"
CONTEXT: COPY revision, line 1: "287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 20090819032058 1 0"
The issue can be fixed by applying the following awk one-liner to the txt files (or to the lines within revision.sql that hold values):
awk -v e=10 'BEGIN { FS = OFS = "\t"; e++ } NF > e { exit(1) } { t = ""; for (a = 0; a < e - NF; a++) t = t "\t\\N"; printf "%s%s\n", $0, t }'
The difference in output is:
287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 2009-08-19T03:20:58Z 1 0
287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 2009-08-19T03:20:58Z 1 0 \N \N
The \N entries are placeholders for NULL values, which is fine because the respective columns (rev_len and rev_parent_id) are defined as nullable:
CREATE TABLE revision (
....
rev_len INTEGER NULL,
rev_parent_id INTEGER NULL
);
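The padding step done by the awk one-liner can also be sketched in Python, which may be easier to adapt. The function name pad_row is made up for this example. The default width of 11 is an assumption taken from the awk script, which sets e=10 and then increments it once. As with awk, rows that already have more fields than expected are treated as an error.

```python
NULL = r"\N"  # NULL marker in PostgreSQL's COPY text format


def pad_row(line, width=11):
    """Pad one tab-separated COPY row with NULL markers up to `width` fields.

    Hypothetical helper for illustration; `width=11` mirrors the awk
    one-liner above (e=10, incremented once in BEGIN).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) > width:
        raise ValueError(
            f"row has {len(fields)} fields, expected at most {width}"
        )
    # Append one NULL marker for every missing trailing column.
    return "\t".join(fields + [NULL] * (width - len(fields))) + "\n"
```

Apply it only to the data rows. The comment header and the COPY line at the top of revision.sql should be passed through unchanged.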
thanks
babilen