ids-mannheim / wikipedia-corpus-builder
Builds Wikipedia corpora in I5 (a TEI-based format)
License: GNU General Public License v3.0
The element <classCode> in <idsHeader> lists the categories (category links) assigned to a wiki page. This element is only relevant for article pages. Although category links may appear in talk pages, in practice they do not identify the categories of the page.
The @scheme attribute of <classCode> should refer to the category content page (e.g. https://en.wikipedia.org/wiki/Category:Contents for the English Wiki and https://de.wikipedia.org/wiki/Kategorie:!Hauptkategorie for the German Wiki).
The list of category links must be unique. Redundant links should be filtered.
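The filtering described above could be sketched as follows. This is a minimal illustration, not the converter's actual code: the function name is hypothetical, and it assumes the links are available as plain strings and that the comparison should be case-sensitive.

```python
def unique_category_links(links):
    """Return the category links without duplicates, keeping first-seen order."""
    # dict keys are unique and keep insertion order; the comparison is
    # case-sensitive, so links differing only in case are both kept.
    return list(dict.fromkeys(links))


if __name__ == "__main__":
    links = [
        "https://de.wikipedia.org/wiki/Kategorie:Fiktive_Person",
        "https://de.wikipedia.org/wiki/Kategorie:Fiktive_Person",
    ]
    print(unique_category_links(links))
```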
This issue was reported by Harald Lüngen.
Postings should include an attribute xml:id="i.idTalkPage_n_m"
Example: xml:id="i.68161_2_4"
where
Some wiki pages yield empty plain text after conversion. WikiXML documents containing fewer than 2 tokens should be omitted.
WUD17/A96/90243
https://de.wikipedia.org/wiki/Benutzer_Diskussion:AnjjaBaumann
Contains only ...
WUD17/G52/88846
https://de.wikipedia.org/wiki/Benutzer_Diskussion:Greekstar
The document is empty because it contains only a template ("Dieser Benutzer wurde gesperrt.", i.e. "This user has been blocked."), which is converted to a gap, so no plain text remains.
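A check for such near-empty documents might look like the sketch below. It assumes whitespace tokenization and a well-formed WikiXML string; the function name and the sample element names are hypothetical, and the real converter may count tokens differently.

```python
import xml.etree.ElementTree as ET

MIN_TOKENS = 2  # threshold named in the issue: fewer than 2 tokens means omit


def has_enough_text(wikixml: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Return True if the plain text of the document has at least min_tokens tokens."""
    root = ET.fromstring(wikixml)
    # itertext() yields all text content and skips the markup itself,
    # so an element such as <gap/> contributes nothing.
    plain_text = "".join(root.itertext())
    return len(plain_text.split()) >= min_tokens


if __name__ == "__main__":
    print(has_enough_text("<page><gap/></page>"))
```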
This issue was reported by Nils Diewald.
Emojis or emoticons are encoded as templates in wiki markup. In I5, they should be represented by the element <figure>.
Wikitext:
{{S|:)}}
I5
<figure type="emoji" creation="template"> <desc type="template">[_EMOJI:{{S|:)}}_]</desc> </figure>
Several problems, including a deadlock, have occurred in the WikiXML to I5 conversion because headers appear within paragraph-level elements such as <li> or <ref>.
The problematic WikiXML structures are probably caused by improper parsing of wikitext to WikiXML by WikiXMLConverter.
Nevertheless, to make WikiI5Converter more robust, we should probably simplify the WikiXML to I5 conversion by using xsl:value-of in these situations instead of applying further templates.
The Stax-utils library relies on the JSR173-ri library, which is not available for download from the Maven repository. It does not seem to be downloadable from the original source either.
Stax-utils is a fairly old library and no longer seems to be well maintained. It provides an XMLStreamWriter implementation that can indent the XML output. We should look for another library that offers this.
Signatures should contain a user link represented as a <ref> element.
<signed type="signed"> <ref target="https://de.wikipedia.org/wiki/Benutzer:Neun-x"><name>Neun-x</name> </ref> <date>09:31, 1. Sep. 2013 (CEST)</date> </signed>
*Updated the format: <name> should be inside <ref>.
The category links of article pages should be added to their talk pages as follows.
<textClass> <classCode scheme="https://de.wikipedia.org/wiki/Kategorie:!Hauptkategorie"> <ref target="https://de.wikipedia.org/wiki/Kategorie%3AFiktive_Person"> Kategorie:Fiktive Person</ref> </classCode> </textClass>
The TagSoup parser automatically restructures text that starts with an empty element so that the element contains the plain text that follows it.
WikiXML
<autoSignature type="signed"></autoSignature> wurde bereits nachdrücklich ...
TagSoup parsing result
<autoSignature type="signed">wurde bereits nachdrücklich ...</autoSignature>
This should not happen.
An out-of-memory error occurred during the WikiXML to I5 conversion. The problem seems to come from net.sf.saxon.event.StreamWriterToReceiver: in Saxon 10.6 it has a namespace stack that accumulates a huge number of objects.
StreamWriterToReceiver is used to write the final I5 file. The class seems to always push objects onto the stack and never pop them.
Ideally, multiple signatures should be annotated as multiple elements.
Wikitext
Sie lebt in jedem Geschöpf.--[[Benutzer:BALD|BALD]] 22:05, 19. Feb 2006 (CET) <small>Unterschrift nachgetragen--[[Benutzer:Chef|Pangloss]] [[Benutzer Diskussion:Chef|Diskussion]] 23:04, 19. Feb 2006 (CET)</small>
Currently, WikiXMLConverter annotates only the last user link as a signature; the first user link is simply rendered as a normal link.
Ideally, an unsigned template should include only one username and timestamp, as described in Template:Unsigned. In practice, however, there are often multiple usernames/IPs and timestamps.
English markup is often used in wikis of other languages, e.g. in user links and unsigned templates. To improve posting segmentation, the English markup used for posting segmentation should always be included when processing wikis of other languages.
The Wikipedia page id is used to build docSigle and textSigle. Shorter Wikipedia page ids should be normalized to 9 digits by left-padding them with zeros. Of the 9 digits, 4 should be reserved for docSigle and 5 for textSigle.
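The normalization can be illustrated with a short sketch. The function name is hypothetical, and the split is an assumption: the issue only says that 4 digits go to docSigle and 5 to textSigle, so taking the first 4 and the last 5 is a guess, as is rejecting ids longer than 9 digits.

```python
def split_page_id(page_id: int) -> tuple:
    """Pad a page id to 9 digits and split it into a 4-digit and a 5-digit part."""
    padded = f"{page_id:09d}"  # left-pad with zeros to 9 digits
    if page_id < 0 or len(padded) > 9:
        raise ValueError(f"page id out of range: {page_id}")
    # Assumption: the first 4 digits feed docSigle, the last 5 feed textSigle.
    return padded[:4], padded[4:]


if __name__ == "__main__":
    print(split_page_id(68161))
```

For the page id 68161, the padded form is 000068161, which this sketch splits into 0000 and 68161.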
The English and French language links of the associated articles should be added to German talk pages.
<biblStruct> <relatedItem type="langlink"> <ref target="https://en.wikipedia.org/wiki/Alan_Smithee" xml:lang="en">Alan Smithee </ref> </relatedItem> <relatedItem type="langlink"> <ref target="https://fr.wikipedia.org/wiki/Alan_Smithee" xml:lang="fr">Alan Smithee </ref> </relatedItem> </biblStruct>
Likewise, for English talk pages, the German and French language links of their associated articles should be added.
MariaDB treats VARCHAR values case-insensitively, so the following category links are considered identical, although they denote two different categories:
https://de.wikipedia.org/wiki/Kategorie:FVp-Mitglied
https://de.wikipedia.org/wiki/Kategorie:FVP-Mitglied
Both category links often belong to the same article. This is problematic because they must be stored under a unique constraint in combination with the article id.
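The effect of the collation on such a unique constraint can be reproduced in a small, self-contained sketch. It uses SQLite through Python's sqlite3 module purely for illustration, not MariaDB: SQLite's NOCASE collation stands in for a case-insensitive comparison and BINARY for a case-sensitive one. The table and column names are hypothetical.

```python
import sqlite3

LINKS = [
    "https://de.wikipedia.org/wiki/Kategorie:FVp-Mitglied",
    "https://de.wikipedia.org/wiki/Kategorie:FVP-Mitglied",
]


def count_stored(collation: str) -> int:
    """Insert both links for one article and return how many rows were accepted."""
    if collation not in ("NOCASE", "BINARY"):
        raise ValueError("unsupported collation")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE category ("
        "article_id INTEGER, "
        f"link TEXT COLLATE {collation}, "
        "UNIQUE (article_id, link))"
    )
    stored = 0
    for link in LINKS:
        try:
            conn.execute("INSERT INTO category VALUES (?, ?)", (1, link))
            stored += 1
        except sqlite3.IntegrityError:
            # The unique constraint rejected the link as a duplicate.
            pass
    conn.close()
    return stored


if __name__ == "__main__":
    print(count_stored("NOCASE"), count_stored("BINARY"))
```

With the case-insensitive collation only one of the two links is stored; with the case-sensitive one both are. In MariaDB, the analogous remedy would presumably be a case-sensitive or binary collation on the column, which should be verified against the actual schema.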
All text after a signature, e.g. a postscript, is parsed separately and thus loses its original markup/styling. For example, the postscript below should be included in the list.
Wikitext
*{{neutral}}, weil selbst überarbeitet --[[Benutzer:CHK|CHK]] 09:18, 1. Jan 2006 (CET) PS: Könnte vielleicht irgend jemand ein Mal endlich die Ladungen bei den Reaktionsgleichungen hochstellen!?
WikiXML
<posting indentLevel="0" who="WU00000001" synch="t00000000"> <ul> <li><span class="template"/>, weil selbst überarbeitet --<autoSignature type="signed"><timestamp>09:18, 1. Jan 2006 (CET)</timestamp></autoSignature> </li> </ul> <seg type="postscript"> <pre> PS: Könnte vielleicht irgend jemand ein Mal endlich die Ladungen bei den Reaktionsgleichungen hochstellen!?</pre> </seg> </posting>
Solution
<posting indentLevel="0" who="WU00000001" synch="t00000000"> <ul> <li><span class="template"/>, weil selbst überarbeitet --<autoSignature type="signed"><timestamp>09:18, 1. Jan 2006 (CET)</timestamp></autoSignature> <seg type="postscript"> PS: Könnte vielleicht irgend jemand ein Mal endlich die Ladungen bei den Reaktionsgleichungen hochstellen!?</seg></li> </ul> </posting>
User links should not be generated for IP addresses in the talk-user list created by WikiXMLConverter.
<person xml:id="WU00000017"> <persName>213.148.129.70</persName> <signatureContent> <ref target="https://de.wikipedia.org/wiki/Benutzer:213.148.129.70"> 213.148.129.70</ref> </signatureContent> </person>
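One way to recognize such entries is to test whether the user name parses as an IP address. The sketch below uses Python's standard ipaddress module; the function name is hypothetical, and how WikiXMLConverter actually identifies IP users is not stated in the issue.

```python
import ipaddress


def is_ip_address(username: str) -> bool:
    """Return True if the user name is a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for name in ("213.148.129.70", "Neun-x"):
        print(name, is_ip_address(name))
```

A user link would then only be emitted when is_ip_address returns False.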
The test suite of WikiI5Converter uses a MySQL database that must be set up on the local machine. We should find another way to test against a database, e.g. a self-contained database such as SQLite that can be embedded for testing.
Wikipedia pages are grouped into documents represented by docSigle. The grouping is based on the alphanumeric first character of the page title and on the maximum number of pages per group. However, some Wikipedia pages have titles starting with a non-alphanumeric character, e.g. <title>Diskussion:.460 S&W Magnum</title>.
Solution: take the next alphanumeric character.
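The proposed solution could be sketched like this. The function name and the handling of the namespace prefix are assumptions made for the example; note also that Python's str.isalnum accepts non-ASCII letters and digits, which may or may not match the converter's notion of alphanumeric.

```python
from typing import Optional


def grouping_character(title: str, namespace: str = "Diskussion:") -> Optional[str]:
    """Return the first alphanumeric character of a page title, or None if there is none."""
    # Assumption: the namespace prefix is not part of the title used for grouping.
    if title.startswith(namespace):
        title = title[len(namespace):]
    for char in title:
        if char.isalnum():
            return char
    return None


if __name__ == "__main__":
    print(grouping_character("Diskussion:.460 S&W Magnum"))
```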
A @when-iso attribute should be added to posting elements, indicating the posting timestamp in ISO 8601 format.
<posting indentLevel="0" when-iso="2011-03-09T21:33+01">
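A conversion from a German signature timestamp to this format might look like the following sketch. The regular expression, the month table, and the function name are assumptions based on the timestamp examples in these issues (e.g. "09:31, 1. Sep. 2013 (CEST)"); only the CET and CEST zones are handled, and the real converter may work differently.

```python
import re
from datetime import datetime
from typing import Optional

# First three letters of the German month names, lowercased.
MONTHS = {"jan": 1, "feb": 2, "mär": 3, "apr": 4, "mai": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dez": 12}
OFFSETS = {"CET": "+01", "CEST": "+02"}
TIMESTAMP = re.compile(
    r"(\d{1,2}):(\d{2}), (\d{1,2})\. (\w+)\.? (\d{4}) \((CES?T)\)"
)


def to_when_iso(timestamp: str) -> Optional[str]:
    """Convert e.g. '09:31, 1. Sep. 2013 (CEST)' to '2013-09-01T09:31+02'."""
    match = TIMESTAMP.search(timestamp)
    if match is None:
        return None
    hour, minute, day, month_name, year, zone = match.groups()
    month = MONTHS.get(month_name[:3].lower())
    if month is None:
        return None
    try:
        # datetime validates the calendar date and the time of day.
        moment = datetime(int(year), month, int(day), int(hour), int(minute))
    except ValueError:
        return None
    return moment.strftime("%Y-%m-%dT%H:%M") + OFFSETS[zone]


if __name__ == "__main__":
    print(to_when_iso("09:31, 1. Sep. 2013 (CEST)"))
```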
For the German Wiki, the English category links corresponding to the German category links should be added to <textClass>.
<textClass> <classCode scheme="https://de.wikipedia.org/wiki/Kategorie:!Hauptkategorie"> <ref target="https://de.wikipedia.org/wiki/Kategorie%3AFiktive_Person"> Kategorie:Fiktive Person</ref> </classCode> <classCode scheme="https://en.wikipedia.org/wiki/Category:Contents"> <ref target="https://en.wikipedia.org/wiki/Category:Fictional_characters"> Category:Fictional characters</ref> </classCode> </textClass>