ids-mannheim / wikipedia-corpus-builder

Builds Wikipedia corpora in I5 (a TEI-based format)

License: GNU General Public License v3.0

Shell 1.92% Java 83.42% XSLT 14.66%

wikipedia corpus-builder wikipedia-corpus xml tei

Wikipedia Corpus Builder

The Leibniz-Institut für Deutsche Sprache (IDS) develops a corpus builder for Wikipedia that converts Wikipedia pages from their native text format, Wikitext, into our target corpus format, I5. I5 is the IDS text model used in Das Deutsche Referenzkorpus (DeReKo). It is a customized TEI format based on XCES, enriched with metadata at different levels of the corpus structure (Lüngen and Sperberg-McQueen, 2012). As part of DeReKo, Wikipedia corpora built with this tool are accessible through the Corpus Search, Management and Analysis System II (COSMAS II) and the Corpus Analysis Platform (KorAP).

The corpus builder works in two conversion stages (Margaretha and Lüngen, 2014). In the first stage, WikiXMLConverter converts Wikitext into WikiXML using the Sweble parser (Dohrn and Riehle, 2011) and generates one WikiXML file for each wiki page within a Wikipedia namespace, for instance articles. In the second stage, WikiI5Converter converts each WikiXML file into I5 using XSLT stylesheets and assembles the results into a single corpus file, as required for DeReKo.

The corpus builder is also designed for building Computer-Mediated Communication (CMC) corpora from Wikipedia talk or discussion pages, such as those in the Talk and User talk namespaces. A talk corpus is structured by postings and threads, following the TEI schema for CMC corpora (Beißwenger et al., 2012). Posting segmentation is done heuristically in WikiXMLConverter.

The corpus builder supports parsing Wikipedias in multiple languages. It has been tested for the following languages: English, French, Hungarian, Norwegian, Spanish, Croatian, Italian, Polish and Romanian. We also provide Wikipedia corpora of these languages in WikiXML and I5 formats for download.

References

Beißwenger, M., Ermakova, M., Geyken, A., Lemnitzer, L., and Storrer, A. (2012). A TEI Schema for the Representation of Computer-mediated Communication. Journal of the Text Encoding Initiative [Online], 3. URL: https://doi.org/10.4000/jtei.476; DOI: 10.4000/jtei.476

Dohrn, H., and Riehle, D. (2011). Design and implementation of the Sweble Wikitext parser: unlocking the structured data of Wikipedia. Proceedings of the 7th International Symposium on Wikis and Open Collaboration. DOI: 10.1145/2038558.2038571

Margaretha, E., and Lüngen, H. (2014). Building linguistic corpora from Wikipedia articles and discussions. Journal for Language Technology and Computational Linguistics (JLCL), 2/2014. URL: http://www.jlcl.org/2014_Heft2/3MargarethaLuengen.pdf

Lüngen, H., and Sperberg-McQueen, C. M. (2012). A TEI P5 Document Grammar for the IDS Text Model. Journal of the Text Encoding Initiative [Online], 3. URL: http://jtei.revues.org/508; DOI: 10.4000/jtei.508

wikipedia-corpus-builder's Issues

Case sensitive category links

MariaDB compares VARCHAR values case-insensitively, so the following category links are considered identical although they denote two different categories:

https://de.wikipedia.org/wiki/Kategorie:FVp-Mitglied
https://de.wikipedia.org/wiki/Kategorie:FVP-Mitglied

Both category links often belong to the same article. This is problematic because category links must be stored under a unique constraint in combination with the article id.

Headers within paragraph level elements

A few problems, including a deadlock, have occurred in the WikiXML to I5 conversion because headers appear within paragraph-level elements such as <li> or <ref>.

The problematic WikiXML structures probably result from improper parsing of wikitext to WikiXML by WikiXMLConverter.

Nevertheless, to make WikiI5Converter more robust, we should probably simplify the WikiXML to I5 conversion by using xsl:value-of in these situations instead of applying templates further.

Join text after signature

All text after a signature, e.g. a postscript, is parsed separately and thus loses its original markup/styling. For instance, the postscript in the example below should be included in the list.

Wikitext

*{{neutral}}, weil selbst überarbeitet --[[Benutzer:CHK|CHK]] 09:18, 1. Jan 2006 (CET)
    PS: Könnte vielleicht irgend jemand ein Mal endlich die Ladungen bei den
 Reaktionsgleichungen hochstellen!?

WikiXML

<posting indentLevel="0" who="WU00000001" synch="t00000000">
    <ul>
        <li><span class="template"/>, weil selbst überarbeitet --<autoSignature
         type="signed"><timestamp>09:18, 1. Jan 2006 (CET)</timestamp></autoSignature>
        </li>
    </ul>
    <seg type="postscript">    <pre>    PS: Könnte vielleicht irgend jemand ein Mal
    endlich die Ladungen bei den Reaktionsgleichungen hochstellen!?</pre> </seg>
</posting>

Solution

<posting indentLevel="0" who="WU00000001" synch="t00000000">
    <ul>
        <li><span class="template"/>, weil selbst überarbeitet --<autoSignature
        type="signed"><timestamp>09:18, 1. Jan 2006 (CET)</timestamp></autoSignature>
        <seg type="postscript"> PS: Könnte vielleicht irgend jemand ein Mal endlich
        die Ladungen bei den Reaktionsgleichungen hochstellen!?</seg></li>
    </ul>
</posting>

Emoji or emoticon

Emojis or emoticons are encoded as templates in wiki markup. In I5, they should be represented by a <figure> element.

Wikitext:

{{S|:)}}

I5:

<figure type="emoji" creation="template">
  <desc type="template">[_EMOJI:{{S|:)}}_]</desc>
</figure>

See https://de.wikipedia.org/wiki/Vorlage:Smiley
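
The mapping above can be sketched in Java as follows. This is only an illustration, not the converter's actual code: the class `EmojiTemplates` and its methods are hypothetical, and it assumes that a smiley is written either as `{{S|...}}` or as `{{Smiley|...}}` and that the raw template call is copied into `<desc>` after XML escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of WikiXMLConverter or WikiI5Converter.
final class EmojiTemplates {
    // Assumption: smileys are written as {{S|...}} or {{Smiley|...}}.
    private static final Pattern SMILEY =
            Pattern.compile("\\{\\{\\s*(?:S|Smiley)\\s*\\|[^{}]*\\}\\}");

    // True if the whole string is one smiley template call.
    static boolean isSmiley(String template) {
        return SMILEY.matcher(template).matches();
    }

    // Wraps the raw template call in the <figure> structure shown above.
    static String toFigure(String template) {
        if (!isSmiley(template)) {
            throw new IllegalArgumentException("Not a smiley template: " + template);
        }
        return "<figure type=\"emoji\" creation=\"template\">\n"
             + "  <desc type=\"template\">[_EMOJI:" + escapeXml(template) + "_]</desc>\n"
             + "</figure>";
    }

    // Escapes the characters that are not allowed in XML element content.
    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Escaping matters here because smiley codes may themselves contain characters such as `<`.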

Cleaning empty pages

Some wiki pages yield empty plain text after conversion. WikiXML files containing fewer than 2 tokens should be omitted.

WUD17/A96/90243
https://de.wikipedia.org/wiki/Benutzer_Diskussion:AnjjaBaumann
Contains only ...

WUD17/G52/88846
https://de.wikipedia.org/wiki/Benutzer_Diskussion:Greekstar

An empty document: it contains only a template (Dieser Benutzer wurde gesperrt, i.e. "This user has been blocked.") that is converted to a gap, so no plain text remains.

This issue was reported by Nils Diewald.
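
A minimal sketch of such a filter, assuming that the plain text of a page has already been extracted and that a token is simply a whitespace-separated string (the issue does not define tokens more precisely). The class `EmptyPageFilter` is hypothetical.

```java
// Hypothetical helper; tokenization by whitespace is an assumption.
final class EmptyPageFilter {
    // Counts whitespace-separated tokens in the extracted plain text.
    static int tokenCount(String plainText) {
        String trimmed = plainText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Pages with fewer than 2 tokens are to be omitted.
    static boolean shouldOmit(String plainText) {
        return tokenCount(plainText) < 2;
    }
}
```

With this definition, a page whose text is empty or consists of a single string such as `...` is omitted, while a page with two or more words is kept.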

Multiple Signatures

Ideally, multiple signatures should be annotated by multiple elements.

Wikitext

Sie lebt in jedem Geschöpf.--[[Benutzer:BALD|BALD]] 22:05, 19. Feb 2006 (CET) 
<small>Unterschrift nachgetragen--[[Benutzer:Chef|Pangloss]] [[Benutzer 
Diskussion:Chef|Diskussion]] 23:04, 19. Feb 2006 (CET)</small>

Currently, WikiXMLConverter annotates only the last user link as a signature; the first user link is simply rendered as a normal link.
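
One way to detect every signature in a posting, rather than only the last one, is to search for all user links that are followed by a timestamp. The sketch below is not the heuristic used by WikiXMLConverter. It assumes German markup only (`Benutzer:` links, an optional `Benutzer Diskussion:` link, and CET/CEST timestamps); the class `SignatureFinder` and its regular expression are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; covers only the German signature shape shown above.
final class SignatureFinder {
    static final class Signature {
        final String user;
        final String timestamp;

        Signature(String user, String timestamp) {
            this.user = user;
            this.timestamp = timestamp;
        }
    }

    // User link, optional user talk link, then a timestamp such as
    // "22:05, 19. Feb 2006 (CET)".
    private static final Pattern SIGNATURE = Pattern.compile(
            "\\[\\[Benutzer:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]"
            + "(?:\\s*\\[\\[Benutzer[\\s_]Diskussion:[^\\]]*\\]\\])?"
            + "\\s*(\\d{1,2}:\\d{2}, \\d{1,2}\\. \\p{L}+\\.? \\d{4} \\(CES?T\\))");

    // Returns all signatures in document order.
    static List<Signature> findAll(String wikitext) {
        List<Signature> result = new ArrayList<>();
        Matcher m = SIGNATURE.matcher(wikitext);
        while (m.find()) {
            result.add(new Signature(m.group(1), m.group(2)));
        }
        return result;
    }
}
```

Applied to the wikitext above, this finds two signatures, one for `BALD` and one for `Chef`, each of which could then be annotated by its own element.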

English markups in other Wikis

English markup, e.g. user links and unsigned templates, is often used in wikis of other languages. To improve posting segmentation, the English markup relevant to posting segmentation should always be included when processing wikis of other languages.

Iso Timestamp

A @when-iso attribute should be added to posting elements, giving the posting timestamp in ISO 8601 format.

<posting indentLevel="0" when-iso="2011-03-09T21:33+01">
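
As an illustration, a German signature timestamp such as `09:18, 1. Jan 2006 (CET)` could be converted with `java.time` roughly as follows. The class `PostingTimestamps` is hypothetical, and the sketch assumes German month names and the zone labels CET (+01) and CEST (+02) only; other languages and time zones would need their own tables.

```java
import java.time.DateTimeException;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; handles German CET/CEST timestamps only.
final class PostingTimestamps {
    private static final Pattern SIGNATURE_TIME = Pattern.compile(
            "(\\d{1,2}):(\\d{2}), (\\d{1,2})\\. (\\p{L}+)\\.? (\\d{4}) \\((CES?T)\\)");

    // Lower-case prefixes of the German month names, January to December.
    private static final String[] MONTH_PREFIXES = {
        "jan", "feb", "m\u00e4r", "apr", "mai", "jun",
        "jul", "aug", "sep", "okt", "nov", "dez"};

    // "x" prints the offset as hours only (+01) when the minutes are zero.
    private static final DateTimeFormatter ISO_MINUTES =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mmx");

    // Returns the ISO 8601 form, or empty if no valid timestamp is found.
    static Optional<String> toIso(String text) {
        Matcher m = SIGNATURE_TIME.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        int month = monthNumber(m.group(4));
        if (month == 0) {
            return Optional.empty();
        }
        ZoneOffset offset = ZoneOffset.ofHours(m.group(6).equals("CEST") ? 2 : 1);
        try {
            OffsetDateTime time = OffsetDateTime.of(
                    Integer.parseInt(m.group(5)), month, Integer.parseInt(m.group(3)),
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    0, 0, offset);
            return Optional.of(ISO_MINUTES.format(time));
        } catch (DateTimeException e) {
            return Optional.empty(); // e.g. hour 25 or day 32
        }
    }

    // Maps a German month name or abbreviation to 1..12, or 0 if unknown.
    private static int monthNumber(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (int i = 0; i < MONTH_PREFIXES.length; i++) {
            if (lower.startsWith(MONTH_PREFIXES[i])) {
                return i + 1;
            }
        }
        return 0;
    }
}
```

For example, `PostingTimestamps.toIso("09:18, 1. Jan 2006 (CET)")` yields `2006-01-01T09:18+01`, which has the same shape as the attribute value shown above.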

Unsigned templates with multiple usernames/IPs and timestamps

Ideally, an unsigned template should include only one username and timestamp, as described in Template:Unsigned. In practice, however, it often contains multiple usernames/IPs and timestamps.

ClassCode Element

The element <classCode> in <idsHeader> lists the categories (category links) assigned to a wiki page. This element is only relevant for article pages. Although category links may appear on talk pages, they do not in practice identify the categories of the page.

The @scheme attribute of <classCode> should refer to the category content page (e.g. https://en.wikipedia.org/wiki/Category:Contents for the English Wiki and https://de.wikipedia.org/wiki/Kategorie:!Hauptkategorie for the German Wiki).

The list of category links must be unique. Redundant links should be filtered.

This issue was reported by Harald Lüngen.
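
Filtering redundant links can be sketched in Java with a `LinkedHashSet`, which removes exact duplicates and keeps the order of first occurrence. The class `CategoryLinks` is hypothetical. Java string comparison is case-sensitive, so links that differ only in case, such as Kategorie:FVp-Mitglied and Kategorie:FVP-Mitglied, remain distinct.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper for de-duplicating the category links of one page.
final class CategoryLinks {
    // Removes exact (case-sensitive) duplicates, keeping first occurrences.
    static List<String> unique(List<String> links) {
        return new ArrayList<>(new LinkedHashSet<>(links));
    }
}
```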

Add the category-links from article pages to the talk pages

The category links from article pages should be added to the corresponding talk pages as follows.

<textClass>
  <classCode scheme="https://de.wikipedia.org/wiki/Kategorie:!Hauptkategorie">
    <ref target="https://de.wikipedia.org/wiki/Kategorie%3AFiktive_Person">
    Kategorie:Fiktive Person</ref>
  </classCode>
</textClass>

Add English category links corresponding to German category-links

For the German Wiki, English category links corresponding to the German category-links should be added to <textClass>.

<textClass>
  <classCode scheme="https://de.wikipedia.org/wiki/Kategorie:!Hauptkategorie">
    <ref target="https://de.wikipedia.org/wiki/Kategorie%3AFiktive_Person">
    Kategorie:Fiktive Person</ref>
  </classCode>
  <classCode scheme="https://en.wikipedia.org/wiki/Category:Contents">
    <ref target="https://en.wikipedia.org/wiki/Category:Fictional_characters">
    Category:Fictional characters</ref>
  </classCode>
</textClass>

Database testing

The test suite of WikiI5Converter uses a MySQL database that has to be set up on the local machine. We should implement another way of testing with a database, e.g. a self-contained database such as SQLite that can be embedded in the tests.

Person from IP address

User links should not be generated for IP addresses in the talk-user list created by WikiXMLConverter.

<person xml:id="WU00000017">
   <persName>213.148.129.70</persName>
   <signatureContent>
      <ref target="https://de.wikipedia.org/wiki/Benutzer:213.148.129.70"> 
      213.148.129.70</ref>
   </signatureContent>
</person>
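
A simple check for whether a user name is an IPv4 address, so that no user link is generated for it, might look like this. The class `TalkUsers` is hypothetical, and IPv6 addresses are deliberately not covered by this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical helper; recognizes dotted IPv4 addresses only.
final class TalkUsers {
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";
    private static final Pattern IPV4 =
            Pattern.compile(OCTET + "(?:\\." + OCTET + "){3}");

    // True if the whole user name is an IPv4 address such as 213.148.129.70.
    static boolean isIpv4(String userName) {
        return IPV4.matcher(userName).matches();
    }
}
```

For a user name such as `213.148.129.70` the check returns true, and the `<signatureContent>` with its `<ref>` could then be left out of the `<person>` entry.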

Page titles starting with non alphanumeric characters

Wikipedia pages are grouped into documents represented by docSigle. The grouping is based on alphanumeric characters identified from the first character of the page titles, and the maximum number of pages per group. However, some Wikipedia pages have titles starting with a non-alphanumeric character, e.g. <title>Diskussion:.460 S&W Magnum</title> .

Solution: take the next alphanumeric character.
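
A minimal sketch of this solution, assuming the namespace prefix (e.g. `Diskussion:`) has already been stripped from the title. The class `DocSigleGrouping` is hypothetical.

```java
import java.util.Optional;

// Hypothetical helper; expects a title without its namespace prefix.
final class DocSigleGrouping {
    // Returns the first letter or digit of the title, if there is one.
    static Optional<Character> groupingCharacter(String title) {
        for (int i = 0; i < title.length(); i++) {
            char c = title.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                return Optional.of(c);
            }
        }
        return Optional.empty(); // title has no alphanumeric character at all
    }
}
```

For the title `.460 S&W Magnum` this returns `4`. Titles without any alphanumeric character still need a fallback group, which the sketch leaves to the caller.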

Out of Memory Error with SaxonEE 10.6

There was an out-of-memory error during the WikiXML to I5 conversion. The problem seems to come from net.sf.saxon.event.StreamWriterToReceiver. In Saxon 10.6, it has a namespace stack that holds a huge number of objects.
StreamWriterToReceiver is used to write the final I5 file. It seems that the class always adds objects to the stack and never pops them.

Wikipedia page id

The Wikipedia page id is used to build docSigle and textSigle. Shorter page ids should be normalized to 9 digits by padding them with leading zeros. Of the 9 digits, 4 are reserved for docSigle and 5 for textSigle.
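
The normalization can be sketched as follows. The issue does not say which of the 9 digits go where, so the split used here (leading 4 digits for docSigle, trailing 5 for textSigle) is an assumption, as is the class `PageIds` itself.

```java
// Hypothetical helper; the 4/5 split position is an assumption.
final class PageIds {
    // Pads a page id with leading zeros to exactly 9 digits.
    static String normalize(long pageId) {
        if (pageId < 0 || pageId > 999_999_999L) {
            throw new IllegalArgumentException("Page id out of range: " + pageId);
        }
        return String.format("%09d", pageId);
    }

    // Leading 4 digits, assumed to be the part reserved for docSigle.
    static String docPart(String normalized) {
        return normalized.substring(0, 4);
    }

    // Trailing 5 digits, assumed to be the part reserved for textSigle.
    static String textPart(String normalized) {
        return normalized.substring(4);
    }
}
```

For page id 68161, `normalize` returns `000068161`, which splits into `0000` and `68161`.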

Add user link to signatures

Signatures should contain a user link represented as a <ref> element.

<signed type="signed">
    <ref target="https://de.wikipedia.org/wiki/Benutzer:Neun-x"><name>Neun-x</name>
    </ref>    
    <date>09:31, 1. Sep. 2013 (CEST)</date>
</signed>

*Updated the format: <name> should be inside <ref>.

Add other language links to talk pages

English and French language links of their associated articles should be added to the German talk pages.

<biblStruct>
  <relatedItem type="langlink">
    <ref target="https://en.wikipedia.org/wiki/Alan_Smithee" xml:lang="en">Alan Smithee
    </ref>
  </relatedItem>
  <relatedItem type="langlink">
    <ref target="https://fr.wikipedia.org/wiki/Alan_Smithee" xml:lang="fr">Alan Smithee
    </ref>
  </relatedItem>
</biblStruct>

Likewise for English talk pages, German and French language links of their associated articles should be added.

Posting Id

Postings should include an attribute xml:id="i.idTalkPage_n_m"

n for thread numbering
m for post numbering in the thread i.e. the numbering starts at 1 at the beginning of each thread

Example: xml:id="i.68161_2_4" where

68161 is the Wikipedia ID of the page.
2 is the number of the thread on the page (i.e. the second thread)
4 is the number of the post, i.e. the fourth post in the second thread
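
The numbering scheme can be sketched with a small counter that is advanced once per thread and once per posting. The class `PostingIdGenerator` is hypothetical, and it assumes that threads, like postings, are numbered from 1.

```java
// Hypothetical helper that produces ids of the form i.<pageId>_<n>_<m>.
final class PostingIdGenerator {
    private final long pageId;
    private int thread = 0; // n: number of the current thread on the page
    private int post = 0;   // m: number of the current posting in the thread

    PostingIdGenerator(long pageId) {
        this.pageId = pageId;
    }

    // Call at the start of each thread; posting numbering restarts at 1.
    void startThread() {
        thread++;
        post = 0;
    }

    // Returns the id for the next posting in the current thread.
    String nextPostingId() {
        if (thread == 0) {
            throw new IllegalStateException("startThread() must be called first");
        }
        post++;
        return "i." + pageId + "_" + thread + "_" + post;
    }
}
```

Calling `startThread()` twice and then `nextPostingId()` four times on a generator for page 68161 ends with `i.68161_2_4`.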

An alternative for indenting XML outputs

The stax-utils library relies on the JSR173-ri library, which is not available for download from the Maven repository. It also no longer seems to be downloadable from the original source.

The stax-utils library is fairly old and no longer seems to be maintained. It provides an XMLStreamWriter implementation that can indent the XML output. We should look for another library that can do this.
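
One candidate that needs no extra dependency is the JDK's own `javax.xml.transform` API, sketched below. Unlike stax-utils, it is not an XMLStreamWriter: it re-serializes finished XML in a separate pass, so it may not fit the converter's streaming design. The class `XmlIndenter` is hypothetical, and the `indent-amount` property is specific to the transformer bundled with the JDK; other implementations such as Saxon may handle it differently.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper; indents XML with the JDK identity transformer.
final class XmlIndenter {
    static String indent(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Implementation-specific property of the JDK's built-in transformer.
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Could not indent XML", e);
        }
    }
}
```

Indentation inserts whitespace, which can matter in mixed content such as running text with inline elements, so the effect on corpus text should be checked before adopting any such approach.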

Handling empty elements at the start of a text

The TagSoup parser automatically restructures a text that starts with an empty element so that the element contains the plain text that follows it.

WikiXML

<autoSignature type="signed"></autoSignature> wurde bereits nachdrücklich ...

TagSoup parsing result

<autoSignature type="signed">wurde bereits nachdrücklich ...</autoSignature>

This should not happen.