GerDraCor

Corpus Description

This is the German Drama Corpus (GerDraCor), a collection of TEI P5-encoded German-language plays from the 1500s to the 1940s. The corpus is released under the Creative Commons Zero copyright waiver (CC0).

If you want to cite the corpus, please use this publication:

  • Fischer, Frank, et al. (2019). Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama. In Proceedings of DH2019: "Complexities", Utrecht University, doi:10.5281/zenodo.4284002.

We started to build the corpus by extracting all plays from the TextGrid Repository (TGRep). The source for the versions in TGRep was zeno.org's text collection. However, TGRep's conversion from zeno.org's proprietary XML to TEI introduced some bugs and inconsistencies, which we fixed for GerDraCor in a longer process between 2017 and 2019. All our fixes, including enhancements, are documented on GerDraCor's Wiki. After this clean-up process, GerDraCor is now in a position to grow by taking on new plays from sources such as Deutsches Textarchiv, Project Gutenberg, Projekt Gutenberg-DE, Wikisource, or Google Books.

GerDraCor is an autonomous corpus and will be maintained independently. Yet it is also integrated into the dracor.org website, the showcase for our newly introduced "Programmable Corpora" concept.

If you just want to download the corpus in its current state as TEI XML, a shallow clone is enough (the files are in the tei folder):

git clone --depth 1 https://github.com/dracor-org/gerdracor.git

Character Relations

Character relations encode the information provided in the dramatis personae and make it machine-readable. This is mainly about family and power relations.

The following relations have been annotated (by Nathalie Wiedmer et al.):

| Relation label  | Directed/Undirected | Description                                        |
|-----------------|---------------------|----------------------------------------------------|
| parent_of       | directed            | One character is a parent of the other             |
| lover_of        | directed            | For lovers                                         |
| related_with    | directed            | Other family relations (e.g., uncles)              |
| associated_with | directed            | For clearly associated characters (e.g., butlers)  |
| siblings        | undirected          | Characters that have at least one parent in common |
| spouses         | undirected          | Characters in marriage (or engaged)                |
| friends         | undirected          | Characters marked as being friends                 |

All relations are encoded in XML as <relation> elements in a <listRelation> inside the <listPerson> element. Directed relations are encoded with an active and a passive attribute, where the active part is always the one in front of the relation if it is expressed as a sentence. E.g., for "Odoardo is parent of Emilia", Odoardo's ID goes into active and Emilia's ID into passive.

Undirected relations use the mutual attribute to collect all IDs that are part of a relationship.

The label from the table above is contained in the name attribute.
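
To make the encoding concrete, here is a minimal Python sketch that reads such relations into an edge list. The XML fragment is hypothetical: the character IDs (odoardo, emilia, claudia) and the exact nesting are assumptions based on the description above, not copied from a corpus file, and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment; the character IDs are illustrative, not taken from a corpus file.
SAMPLE = """\
<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <listRelation>
    <relation name="parent_of" active="#odoardo" passive="#emilia"/>
    <relation name="spouses" mutual="#odoardo #claudia"/>
  </listRelation>
</listPerson>
"""


def ids(pointer_list):
    """Split a whitespace-separated list of '#id' pointers into bare IDs."""
    return [p.lstrip("#") for p in (pointer_list or "").split()]


def read_relations(xml_text):
    """Return (label, source, target) edges; undirected pairs appear once, sorted by ID."""
    root = ET.fromstring(xml_text)
    edges = []
    for rel in root.iterfind(".//tei:relation", TEI):
        label = rel.get("name")
        if rel.get("mutual"):
            members = sorted(ids(rel.get("mutual")))
            # One edge for every unordered pair within the mutual group
            edges += [(label, a, b) for i, a in enumerate(members) for b in members[i + 1:]]
        else:
            # Directed: every active ID points to every passive ID
            edges += [(label, a, p) for a in ids(rel.get("active")) for p in ids(rel.get("passive"))]
    return edges


print(read_relations(SAMPLE))
```

Sorting the members of a mutual group is just one way to avoid emitting the same undirected pair twice; a network library with undirected edges would make that step unnecessary.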

API

An easy way to download the network data (instead of the actual TEI files) is to use our API (documentation here). If you have jq installed, it would work like this:

# List all play names in the corpus, then fetch each play's network data as CSV
for play in $(curl -s 'https://dracor.org/api/corpora/ger' | jq -r '.dramas[].name'); do
    wget -O "$play.csv" "https://dracor.org/api/corpora/ger/play/$play/networkdata/csv"
done
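
If jq is not available, the same loop can be sketched in Python with the standard library only. This assumes the corpus listing is a JSON object with a "dramas" array whose entries carry a "name" field, as the jq filter above implies; the function names are made up for this example, and download_all() is defined but not called because it needs network access.

```python
import json
import urllib.request

API = "https://dracor.org/api/corpora/ger"


def network_csv_urls(corpus):
    """Map each play name in a parsed corpus listing to its network-data CSV URL."""
    return {d["name"]: f"{API}/play/{d['name']}/networkdata/csv" for d in corpus["dramas"]}


def download_all():
    """Fetch the corpus listing and save one <play>.csv per play (needs network access)."""
    with urllib.request.urlopen(API) as resp:
        corpus = json.load(resp)
    for name, url in network_csv_urls(corpus).items():
        urllib.request.urlretrieve(url, f"{name}.csv")


# Offline check with a made-up listing; "example-play" is not a real play name.
print(network_csv_urls({"dramas": [{"name": "example-play"}]}))
```

Calling download_all() in an empty directory would then write the CSV files next to the script.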

The API info page is at https://dracor.org/api/info. It also tells you which version of eXist-db we're running on dracor.org.

Simple Visualisation with R

To take a first look at the distribution of the number of speakers per play over time, you could feed the metadata table into R:

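library(data.table)
library(ggplot2)
# Read the corpus metadata table straight from the API
gerdracor <- fread("https://dracor.org/api/corpora/ger/metadata/csv")
# One point per play: normalized year vs. number of speakers
ggplot(gerdracor, aes(x = yearNormalized, y = numOfSpeakers)) + geom_point()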

Result:

[Figure: number of speakers per play over time]

Here is a barplot showing the number of plays per decade (outdated, not containing most recent changes):

[Figure: number of plays per decade]
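
The counts behind such a barplot can be recomputed from the metadata table at any time. The following Python sketch assumes the yearNormalized column used in the R snippet above; the sample rows and play names are made up, and this is not necessarily how the original plot was produced.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; real input would be the metadata CSV from the API.
SAMPLE_CSV = """name,yearNormalized
play-a,1772
play-b,1779
play-c,1808
play-d,
"""


def plays_per_decade(csv_text):
    """Count plays per decade, skipping rows without a normalized year."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = (row.get("yearNormalized") or "").strip()
        if year:
            # Integer division floors the year to the start of its decade
            counts[int(year) // 10 * 10] += 1
    return dict(sorted(counts.items()))


print(plays_per_decade(SAMPLE_CSV))
```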

A Bit of History

Until we rebuilt our working corpus under its new name GerDraCor, we had been working with an intermediary format to conduct our research. This format only held structural information, not the texts themselves. Back then, our research group called itself DLINA (digitally-enabled literary network analysis). Since our focus has broadened, we have stopped using this name. Our future endeavours will sail under the Programmable Corpora flag.

(README last updated on December 21, 2022.)

Contributors

cmil, heinp, ingoboerner, leaduempelmann, lehkost, mathias-goebel, melandresen, nevmenandr, pagelj, peertrilcke, sreyfe

gerdracor's Issues

teiHeader consistencies

  • embed schema https://dracor.org/schema.rng
  • TEI root element: add xml:lang="de"
  • transform DLINA IDs to dracor IDs: <idno type="DLINA-ID">123</idno><idno type="dracor" xml:base="https://dracor.org/id/">ger000123</idno>
  • wrap <docTitle> around <titlePart […]> elements
  • change licence to CC0: <ab>CC0</ab><ref target="https://creativecommons.org/publicdomain/zero/1.0/">Licence</ref>
  • in xml:base="https://www.wikidata.org/wiki/", change /wiki/ to /entity/
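
Two of the items above are mechanical string rewrites, sketched here in Python. The zero-padding to six digits and the "ger" prefix are inferred from the single example in the list (123 becomes ger000123), so treat both as assumptions; the function names are made up for this example.

```python
def dlina_to_dracor_id(dlina_id, prefix="ger", width=6):
    """Turn a numeric DLINA ID into a DraCor-style ID, e.g. 123 -> ger000123 (padding assumed)."""
    return f"{prefix}{int(dlina_id):0{width}d}"


def wikidata_entity_base(url):
    """Rewrite a Wikidata /wiki/ base URL to the /entity/ form."""
    return url.replace("https://www.wikidata.org/wiki/", "https://www.wikidata.org/entity/")


print(dlina_to_dracor_id("123"))
print(wikidata_entity_base("https://www.wikidata.org/wiki/"))
```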

Textgrid ids

Thanks for putting this corpus together, this is super cool.

One suggestion/question, however:
The TextGrid TEI files contain an <idno type="TextGridUri">textgrid:rksp.0</idno> element. It would be very useful if you kept this field. Or is there a particular reason why it's not in your corpus?

Thanks!

Goethe's "Faust I" (1808) doesn't contain entire text

The file only contains "Vorspiel auf dem Theater" so far; the missing parts have to be reintegrated. (The text has to be stitched together from the TG repo, since for some reason it was divided into four parts after conversion.) The character list in <particDesc> seems to be complete, though.
