EmpiriST corpus

Introduction

The EmpiriST corpus is a manually annotated corpus consisting of German web pages and German computer-mediated communication (CMC), i.e. written discourse. Examples of CMC genres are monologic and dialogic tweets, social and professional chats, threads from Wikipedia talk pages, WhatsApp interactions and blog comments. Here is an overview of the sizes of the corpus and its subsets in tokens:

          CMC     Web     Total
Training  5,109   4,944   10,053
Test      5,237   7,568   12,805
Total     10,346  12,512  22,858

The dataset was originally created by Beißwenger et al. (2016) for the EmpiriST 2015 shared task and featured manual tokenization and part-of-speech tagging. Subsequently, Rehbein et al. (2018) incorporated the dataset into their harmonised testsuite for POS tagging of German social media data, manually added sentence boundaries and automatically mapped the part-of-speech tags to UD POS tags. In our own annotation efforts (Proisl et al., 2020), we manually normalized and lemmatized the data and converted the corpus into a “vertical” format suitable for importing into the Open Corpus Workbench, CQPweb, SketchEngine, or similar corpus tools. In addition, we manually annotated the corpus with USAS semantic tags.

Annotation

Here is a one-sentence posting illustrating the corpus format. The seven columns are: Word form, STTS IBK tag, UD POS tag, USAS tag, normalized form, surface-oriented lemma, normalized lemma.

<posting id="cmc_train_003_099" author="quaki" origid="1-114">
<s>
die       ART     DET     Z5           die       der       der
viecha    NN      NOUN    L2           Viecher   Viech     Viech
reissen   VVFIN   VERB    A1.1.2/MWU:7 reißen    reissen   reißen
imma      ADV     ADV     N6           immer     imma      immer
die       ART     DET     Z5           die       der       der
müllsäcke NN      NOUN    O2           Müllsäcke Müllsack  Müllsack
auf       PTKVZ   PART    A1.1.2/MWU:3 auf       auf       auf
hmmmm     ITJ     INTJ    Z4           hm        hmmmm     hm
</s>
</posting>
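
For readers who want to process the format programmatically, the following is a minimal Python sketch of a reader for it. It is not part of the corpus distribution. The names `Token` and `read_sentences` are hypothetical, and the sketch assumes that every token line has exactly seven fields without internal whitespace, that fields are separated by tabs in the corpus file (falling back to runs of spaces as in the aligned display above), and that any line of the form `<name ...>` or `</name>` is structural markup.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """One token line; field order follows the seven columns listed above."""
    word: str
    stts: str
    upos: str
    usas: str
    norm: str
    lemma: str
    norm_lemma: str


# Assumption: structural lines look like <name ...> or </name>.
TAG = re.compile(r"</?(\w+)(\s[^>]*)?>")
POSTING_ID = re.compile(r'\bid="([^"]*)"')


def read_sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[Optional[str], List[Token]]]:
    """Yield (posting id, tokens) for every <s>...</s> region."""
    posting_id: Optional[str] = None
    sentence: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        tag = TAG.fullmatch(line)
        if tag:
            name = tag.group(1)
            if line.startswith("</"):
                if name == "s" and sentence:
                    yield posting_id, sentence
                    sentence = []  # start a fresh list for the next sentence
                elif name == "posting":
                    posting_id = None
            elif name == "posting":
                match = POSTING_ID.search(line)
                posting_id = match.group(1) if match else None
            continue
        # Prefer tabs; fall back to whitespace for space-aligned displays.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
        sentence.append(Token(*fields))


SAMPLE = """\
<posting id="cmc_train_003_099" author="quaki" origid="1-114">
<s>
die       ART     DET     Z5           die       der       der
viecha    NN      NOUN    L2           Viecher   Viech     Viech
reissen   VVFIN   VERB    A1.1.2/MWU:7 reißen    reissen   reißen
imma      ADV     ADV     N6           immer     imma      immer
die       ART     DET     Z5           die       der       der
müllsäcke NN      NOUN    O2           Müllsäcke Müllsack  Müllsack
auf       PTKVZ   PART    A1.1.2/MWU:3 auf       auf       auf
hmmmm     ITJ     INTJ    Z4           hm        hmmmm     hm
</s>
</posting>
"""

for posting_id, tokens in read_sentences(SAMPLE.splitlines()):
    print(posting_id, " ".join(token.norm for token in tokens))
```

Yielding whole sentences keeps the posting id attached to each sentence and makes it straightforward to compare the surface forms with the normalized forms or lemmas.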

The following subsections give a bit of additional information about the annotation process.

Tokenization and part-of-speech tagging

Beißwenger et al. (2016: 47) describe the annotation process as follows:

All data sets were manually tokenized and PoS tagged by multiple annotators, based on the official tokenization […] and tagging guidelines […]. Cases of disagreement were then adjudicated by the task organizers to produce the final gold standard.

Sentence splitting

Rehbein et al. (2018: 20) used the following rules to guide the segmentation:

  • Hashtags and URLs at the beginning or the end of the tweet that are not integrated in the sentence are separated and form their own unit […].
  • Emoticons are treated as non-verbal comments to the text and are thus integrated in the utterance.
  • Interjections (Aaahh), inflectives (*grins*), fillers (ähm) and acronyms typical for CMC (lol, OMG) are also not separated but considered as part of the message.

Normalization and lemmatization

The data were individually normalized and lemmatized by four student annotators according to the lemmatization guidelines. Unclear cases were decided in group meetings with the team leaders.

Semantic tagging

A preliminary version of the semantic tags for German has been added to the corpus file, with heuristics to represent typical CMC phenomena. Each token may have several tags, separated by slashes. Expressions such as idioms or particle verbs are treated as multiword units (MWU). If no further information is given, the MWU consists of the entire sequence of consecutive tokens marked as MWU. Discontinuous multiword units are marked by one or more numbers, separated by colons, that point to the line numbers of all other tokens forming part of the expression (e.g. MWU:15 on line 8 indicates that tokens 8 and 15 form a unit).
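
The tag conventions described above can be unpacked with a few lines of Python. This is only a sketch: `parse_usas_field` and `mwu_members` are hypothetical helper names, the spelling of a marker without numbers is assumed to be the bare string `MWU`, and the numbers are assumed to be 1-based token positions within the sentence, which is consistent with the example posting (`reissen` at position 3 points to 7, `auf` at position 7 points to 3).

```python
from typing import List, Optional, Tuple


def parse_usas_field(field: str) -> Tuple[List[str], Optional[List[int]]]:
    """Split a USAS column value into semantic tags and MWU references.

    Returns (tags, refs). refs is None when the token carries no MWU marker,
    an empty list for a marker without numbers (assumed to be a bare "MWU"),
    and a list of positions for a discontinuous unit such as "MWU:7".
    """
    tags: List[str] = []
    refs: Optional[List[int]] = None
    for part in field.split("/"):
        if part == "MWU" or part.startswith("MWU:"):
            refs = [int(number) for number in part.split(":")[1:]]
        else:
            tags.append(part)
    return tags, refs


def mwu_members(usas_fields: List[str], position: int) -> List[int]:
    """Return the positions of all tokens in the discontinuous MWU that
    includes the token at `position` (1-based, an assumption), or an empty
    list if that token carries no numbered MWU marker."""
    _, refs = parse_usas_field(usas_fields[position - 1])
    if not refs:
        return []
    return sorted({position, *refs})


# USAS column of the example posting, in token order.
usas = ["Z5", "L2", "A1.1.2/MWU:7", "N6", "Z5", "O2", "A1.1.2/MWU:3", "Z4"]
print(parse_usas_field(usas[2]))  # (['A1.1.2'], [7])
print(mwu_members(usas, 3))       # [3, 7]
```

Units marked without numbers are not resolved here, since recovering them requires grouping runs of consecutive marked tokens.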

Authors

The corpus data was collected, tokenized and part-of-speech tagged by the organizers of the EmpiriST 2015 shared task: Michael Beißwenger, Sabine Bartsch, Stefan Evert and Kay-Michael Würzner.

Ines Rehbein, Josef Ruppenhofer and Victor Zimmermann added sentence boundaries and automatically mapped the STTS POS tags to UD POS tags.

Thomas Proisl, Natalie Dykes, Philipp Heinrich, Besim Kabashi and Stefan Evert added normalization and lemmatization.

References

  • Beißwenger, Michael, Sabine Bartsch, Stefan Evert, and Kay-Michael Würzner. 2016. “EmpiriST 2015: A shared task on the automatic linguistic annotation of computer-mediated communication and web corpora.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 44–56, Berlin. Association for Computational Linguistics.
  • Proisl, Thomas, Natalie Dykes, Philipp Heinrich, Besim Kabashi, Andreas Blombach, and Stefan Evert. 2020. “EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus.” In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 6142–6148, Marseille. European Language Resources Association.
  • Rehbein, Ines, Josef Ruppenhofer, and Victor Zimmermann. 2018. “A harmonised testsuite for POS tagging of German social media data.” In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018), 18–28, Wien.
