wibarab / featuredb

WIBARAB is a project in the field of Arabic dialectology. It consists of various regional sub-projects (four PhD projects) and a large database about bedouin-type dialects of Arabic. The Feature Database will be the main point of integrating the results of the sub-projects. In this repository we collect the primary data of the database in TEI/XML.

License: Other

XSLT 1.45% HTML 93.43% CSS 0.02% Jupyter Notebook 5.11%
acdh-ch arabic-dialects linguistics

featuredb's Introduction

WIBARAB feature database

About WIBARAB

WIBARAB is a project in the field of Arabic dialectology. It consists of various regional sub-projects (four PhD projects) and a large database about bedouin-type dialects of Arabic.

The Feature Database will be the main point of integrating the results of the sub-projects. In this repository we collect the primary data of the database in TEI/XML.

Principal Investigator: Stephan Procházka (University of Vienna)
National Cooperation Partner: Charly Mörth (Austrian Academy of Sciences)

See https://wibarab.acdh.oeaw.ac.at/ for more information

Contact us at [email protected] or follow us on Twitter.

Status of the data

THIS IS PRELIMINARY DATA AND COPYRIGHTED MATERIAL!

If you want to use any material in this repository please contact us at [email protected]

This will change at the end of the project.

Directory Structure

Directory | Content | Remarks
001_src | Original sources | Any external source data coming to the project
082_scripts_xsl | XSLT scripts | Various XSLT scripts to convert the data
102_derived_TEI | TEI-XML documents | TEI documents derived from an automated conversion process (from 001_src or elsewhere)
010_manannot | Manually annotated TEI-XML documents | TEI documents which are manually annotated / curated / edited. Automated processes are not expected to write into this directory. We want to make sure that a human curator has validated the data in this directory and that nothing manually curated is overwritten by a script.
802_tei_odd | TEI customization (ODD) | This is the source of truth for the WIBARAB FeatureDB Schema and the HTML documentation generated from it.
804_xsd | XML Schemas | These are derived from the ODD in 802_tei_odd. Each version of the schema should bear its number in the file name.
850_docs | Documentation | Further data documentation, encoding guidelines etc.

Schema Development

At this point, the model of the WIBARAB Feature Database schema is still evolving to a certain extent while new data is being added and existing data is being curated. To make sure that the transition from one version of the schema to the next happens in a structured manner, we have set up the following rules:

  • Any development of the schema is done in 802_tei_odd/featuredb.odd. This file might also contain unpublished, unfinished, backwards-incompatible changes not reflected in any derived schema or documentation.
  • Naming conventions: We follow the Semantic Versioning Best Practices 2.0.0 which - applied to our case - boil down to the following principles:
    • If a change potentially makes documents invalid which were previously valid, it is a new MAJOR version (i.e. increment the first number)
    • If a change does not break validity of existing documents (e.g. in that it only adds optional elements or attributes or adds a significant portion of prose to the documentation) it is a new MINOR version (i.e. increment second number)
    • If a change in the schema is merely a bug fix (typo etc.) or a minor addition to the documentation (change in wording, added examples etc.) this constitutes a PATCH version (i.e. the third number is incremented).
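
The versioning rules above can be sketched in a few lines of Python. This is only an illustration: the function name `bump` and the change labels are made up for this example, and pre-release suffixes such as the `b` in `2.1.3b` are deliberately not handled.

```python
def bump(version: str, change: str) -> str:
    """Return the next schema version for a given kind of change.

    `change` is "major", "minor" or "patch" (labels chosen for this sketch).
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":  # previously valid documents may become invalid
        return f"{major + 1}.0.0"
    if change == "minor":  # only optional additions; existing documents stay valid
        return f"{major}.{minor + 1}.0"
    if change == "patch":  # bug fix or small documentation change
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown kind of change: {change!r}")
```

For example, `bump("2.1.3", "minor")` gives `"2.2.0"`: as in Semantic Versioning, the lower-order numbers are reset when a higher-order number is incremented.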

Schema release workflow

When a new version of the schema is to be released:

  • In the ODD document:
    • update @n on <edition> so that it contains only the exact version number (e.g. 2.1.3b).
    • change the content of <edition> to include the version number. This content is treated only as a label and can thus include human-readable additions (e.g. Version 2.1.3 Beta).
    • add a <change> element with your editor ID and the current date, setting @status="published". Ideally add a <list> with all the changes you did in the ODD.
    • Do not change the filename of the ODD document.
  • In oXygen:
    • Generate the XSD schema from the ODD by right-clicking on 802_tei_odd/featuredb.odd and selecting Transform > Transform with > TEI ODD to XML Schema. The resulting files are placed into a new directory 802_tei_odd/out.
    • create a new subfolder named {versionnumber} in 804_xsd/, e.g. 804_xsd/2.1.3b/ and move the files from 802_tei_odd/out to that folder.
  • Generate the HTML documentation and place it under 850_docs/featuredb_{versionnumber}.html
  • Afterwards delete 802_tei_odd/out.
  • Write a conversion script to transform documents from the previous schema version to the current one.
    • Important: make sure that the conversion script updates the @xsi:schemaLocation in the migrated document instance.
    • Place the XSLT script under 082_scripts_xsl/migrations and name it migrate_to_{versionnumber}.xsl (e.g. migrations/migrate_to_1.0.0b.xsl).
  • Run the conversion script on the oddtest.xml document in 802_tei_odd and check that it produces the expected results.
  • Apply the conversion script to the files in 010_manannot. They should be output to 102_derived_TEI.
  • Commit all changes to git and add a tag named after the schema version number.
  • Curators have to check the converted TEI documents and move them from 102_derived_TEI to 010_manannot to approve the change.
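
The file locations named in the workflow above all follow from the version number. The following Python sketch collects them in one place; the function name `release_paths` and the dictionary keys are invented for this illustration, while the paths themselves are the ones listed in the steps.

```python
def release_paths(version: str) -> dict:
    """Collect the paths the release workflow refers to for one schema version."""
    return {
        # The ODD keeps its file name across versions.
        "odd": "802_tei_odd/featuredb.odd",
        # Temporary output directory, deleted after the files are moved.
        "temporary_output": "802_tei_odd/out",
        "xsd_dir": f"804_xsd/{version}/",
        "html_docs": f"850_docs/featuredb_{version}.html",
        "migration_script": f"082_scripts_xsl/migrations/migrate_to_{version}.xsl",
        # The git tag is named after the schema version number.
        "git_tag": version,
    }
```

Calling `release_paths("2.1.3b")` can serve as a checklist of what should exist in the repository before the tag is created.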

About this file

This README file has a long-wound and dark history of editing. If you dare, you can check it out here.

featuredb's People

Contributors

antonellat73, charlymo, claudialaaber, dasch124, github-actions[bot], gundak95, hessabi3108, iriartedia89, johdop, kisram, likeanga, mariarebecca, prochas8, simar0at, terlan712, veronikaengler


featuredb's Issues

Add document status "in progress"

meeting 2024-01-11:

Currently, validation is done only on documents marked as "done". For feature documents which are based on fieldwork, it will take some time until they reach this status, yet we might want at least parts of them to be validated.
We could consider introducing a third document status "in progress", where validation errors of fvos with status != "done" are dropped so they don't bloat the status list.
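
A minimal Python sketch of the proposed behaviour, assuming each validation error is available as a dictionary that records the status of the fvo it belongs to. The function name and the keys `fvo_status` and `message` are hypothetical and do not describe the existing validation code.

```python
def reportable_errors(errors, document_status):
    """Decide which validation errors end up in the status list.

    errors: list of dicts with the hypothetical keys "fvo_status" and "message".
    """
    if document_status == "done":
        # Current behaviour: finished documents are validated in full.
        return list(errors)
    if document_status == "in progress":
        # Proposed behaviour: keep only errors from fvos that are already done.
        return [error for error in errors if error["fvo_status"] == "done"]
    # Any other document status is not validated at all.
    return []
```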

Regarding the Sociolinguistic constraint again

We discussed again briefly the difference between the sociolinguistic constraints and the PersonGroup, and we came to the conclusion that a simple note element within the sociolinguistic constraints section would suit our purposes just fine, basically as it is now but in the transformation it would show as 'Sociolinguistic constraint'. The PersonGroup would include what we discussed.

Validation error - Fieldwork

'fieldwork' violates enumeration constraint of 'publication personalCommunication campaign'.
The attribute 'type' with value 'fieldwork' failed to parse.

multiple values in one fvo or in separate fvos?

I think we discussed this before but unfortunately we're not sure anymore what we landed on: if for one feature and one dialect we have several realisations, is it better to create separate fvos or put both/all realisations in one fvo?

Zotero export: entries without biblid

2024-01-08T10:40:38.4676328Z 2024-01-08 10:40:38,467 - 5U3YWIMG no biblid
2024-01-08T10:40:38.4679487Z 2024-01-08 10:40:38,467 - TYKGGJEB no biblid
2024-01-08T10:40:38.4684605Z 2024-01-08 10:40:38,468 - LG2SHTMB no biblid
2024-01-08T10:40:38.4686009Z 2024-01-08 10:40:38,468 - 6TNYZUA8 no biblid
2024-01-08T10:40:38.4687022Z 2024-01-08 10:40:38,468 - QCPMWAYN no biblid
2024-01-08T10:40:38.4687968Z 2024-01-08 10:40:38,468 - P4WYQADG no biblid
2024-01-08T10:40:38.4689154Z 2024-01-08 10:40:38,468 - TZUT6CRI no biblid
2024-01-08T10:40:38.4690108Z 2024-01-08 10:40:38,468 - VBMVMQE8 no biblid
2024-01-08T10:40:38.4691026Z 2024-01-08 10:40:38,468 - HHA62AUL no biblid
2024-01-08T10:40:38.4692023Z 2024-01-08 10:40:38,468 - DYHVZN2P no biblid
2024-01-08T10:40:38.4692912Z 2024-01-08 10:40:38,468 - JULCPNGK no biblid
2024-01-08T10:40:38.4693925Z 2024-01-08 10:40:38,468 - 8F46VZCI no biblid
2024-01-08T10:40:38.4694859Z 2024-01-08 10:40:38,468 - XP62YEX8 no biblid
2024-01-08T10:40:38.4695691Z 2024-01-08 10:40:38,468 - EEKF92L3 no biblid
2024-01-08T10:40:38.4696728Z 2024-01-08 10:40:38,468 - VZWM5K3W no biblid
2024-01-08T10:40:38.4697584Z 2024-01-08 10:40:38,468 - 2SVX5GW7 no biblid
2024-01-08T10:40:38.4698484Z 2024-01-08 10:40:38,468 - EQQCQX4I no biblid
2024-01-08T10:40:38.4699860Z 2024-01-08 10:40:38,469 - Y4VTSSEN malformed biblid: (biblid:āl_1968_2357)
2024-01-08T10:40:38.4701257Z 2024-01-08 10:40:38,469 - RDCRA9ZI malformed biblid: biblid:ouldbaba_2023_9273)
2024-01-08T10:40:38.4703118Z 2024-01-08 10:40:38,469 - APTEQYR4 malformed biblid: biblid:danna_2023_9272)
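
The two kinds of messages in this log can be reproduced with a small check on the text of Zotero's extra field. The sketch below is not the export script itself. It assumes that a well-formed entry is a line of the form `biblid:<name>_<year>_<four digits>` with nothing else on it, which is inferred from the examples in the log, and `classify_extra` is a made-up name.

```python
import re

# Assumed shape of a well-formed biblid line (inferred from the log above).
BIBLID_LINE = re.compile(r"biblid:(\w+?_\d{4}_\d{4})")


def classify_extra(extra: str):
    """Return ("ok", id), ("malformed biblid", line) or ("no biblid", None)."""
    for line in extra.splitlines():
        line = line.strip()
        if "biblid:" not in line:
            continue
        match = BIBLID_LINE.fullmatch(line)
        if match:
            return ("ok", match.group(1))
        # Stray characters such as parentheses make the whole line malformed.
        return ("malformed biblid", line)
    return ("no biblid", None)
```

Under this assumption, all three malformed entries in the log are rejected because of their stray parentheses.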

introduce a controlled vocabulary for tribe names

We want to make sure that the tribe names are consistent across our data, so we should add the list both to the ODD / Schema and to the tei_enricher.

There are several ways of implementing that:

Option 1: source from language profiles

Each tribe is represented in a language profile; so we could extract the list out of those profiles describing a tribe (leaving out others).

Pro:

  • tribes and langProfiles will be consistent.
  • no duplication of information

Con: Technically probably a bit more complicated:

  • tei_enricher will need one file with a list, so this will need to be generated programmatically every time a new tribe is added
  • also, the ODD and the schema will have to be re-generated
  • it is questionable whether / when we will have language profiles for each tribe

Option 2: dedicated list of tribes

Actually, there is already a stub of a list of tribes at 010_manannot/wibarab_tribes.xml

Pro:

  • easy to edit / consume
  • could use schematron rule to

Con:

  • duplication of sources (some tribes will also have a language profile containing overlapping information)

define curation workflow

We need to define a curation workflow so that its stages can be implemented as values of the @status attribute.
Here's what has been proposed so far in our meeting on 2023-01-12:

  • Draft (default status) - data gathering is still ongoing.
  • Done (WIBARAB marks it) - the bulk of data gathering is done (minus some fieldwork and open doubts). The document is ready to be validated.
  • Validated (ACDH-CH marks it) - the first round of validation is finished and the ACDH-CH team requires no changes.
  • Needs revision (ACDH-CH marks it) - the ACDH-CH team needs some changes from the WIBARAB team before a second, final round of validation.
  • Revised (WIBARAB marks it) - changes have been made after the first validation and the document needs to be validated again.
  • Completed (ACDH-CH and WIBARAB need to agree) - final version of the document, ready to publish.
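
One way to pin the proposal down is to write the statuses as a small state machine. In the Python sketch below, the statuses and the team that sets each one come from the list above. The allowed transitions are only one possible reading of that list, not an agreed rule, and the spelling of the status values is hypothetical.

```python
# Who sets each status, as proposed in the meeting.
MARKED_BY = {
    "draft": None,  # default status, nobody has to set it
    "done": "WIBARAB",
    "validated": "ACDH-CH",
    "needs revision": "ACDH-CH",
    "revised": "WIBARAB",
    "completed": "ACDH-CH and WIBARAB",
}

# Assumed transitions: an interpretation of the list, not a decision.
NEXT_STATUSES = {
    "draft": {"done"},
    "done": {"validated", "needs revision"},
    "needs revision": {"revised"},
    "revised": {"validated", "needs revision"},
    "validated": {"completed"},
    "completed": set(),
}


def can_move(current: str, new: str) -> bool:
    """Check whether a document may go from one status to another."""
    return new in NEXT_STATUSES[current]
```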

fix xml:ids in Zotero export

Description

Currently, the xml:ids in 010_manannot/vicav_biblio_tei_zotero.xml are generated by the Zotero client and referenced from the individual feature documents. However, these IDs are not reliably stable and can change as entries are added (e.g. adding another publication by the same author from the same year will result in both records' xml:ids being updated to "lastName2023a" and "lastName2023b").

Solution

To avoid this, we have introduced the "biblid" values in Zotero's extra field, over which we have full control.
We now just need to add a post-processing step to the 080_scripts_generic/vicav_zotero/fetch_generated_tei_and_process.ipynb notebook.
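
The core of such a post-processing step is a table from the generated xml:ids to the stable biblid values. The Python sketch below builds that table from plain pairs. How the pairs are read from the TEI export is left open, and `stable_id_map` is a made-up name rather than part of the notebook.

```python
def stable_id_map(entries):
    """Map each generated xml:id to the biblid of the same record.

    entries: iterable of (generated_xml_id, biblid) pairs (hypothetical input shape).
    Raises ValueError if two different records claim the same biblid.
    """
    mapping = {}
    owner = {}
    for generated, biblid in entries:
        if biblid in owner and owner[biblid] != generated:
            raise ValueError(f"biblid {biblid!r} is used by more than one record")
        owner[biblid] = generated
        mapping[generated] = biblid
    return mapping
```

The same table could then be used to rewrite the references in the feature documents, so that they keep pointing at the right record.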

Authorship attribution for feature descriptions.

As discussed in our meeting on 2023-12-21, we want to attribute authorship to the descriptive part of a feature document, potentially also for external contributors. For this, we should …

  • add <byline> to the ODD and make it mandatory within <div type="description">
  • add a to 010_manannot/wibarab_dmp.xml where the <person> elements for external contributors can be listed
  • make @resp mandatory on <div type="description">

Further, we should decide whether the author of the feature description should also be mentioned in the <titleStmt> (IMHO s*he should), and how (<author> ? <respStmt> with a dedicated <resp> ?)

introduce divGen to indicate location of featureValues list

We want to allow editors to decide where the list of possible feature values should be placed in the description part of a feature value document. We could use divGen for that purpose.

<divGen type="featureValues"/>
  • add to ODD / Schema
  • implement in html preview transformation

develop expansion XSLT script

The feature documents are made up of references to various external documents. For full validation and for querying the data, these references need to be resolved and the referenced data included in a "full" feature document.

remove comment "potentially ambiguous references"

Description / Background

In the past, I've added XML comments to bibliographic references which were potentially ambiguous so curators could systematically check them and set @status on the <bibl> element to OK (cf. ODD). The issues in the data should have been resolved by now; however, in many cases curators only changed the value of @status but did not remove the XML comment.

What's to be done

Remove XML comments reading "potentially ambiguous references" inside of <bibl> elements with @status="OK"
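
A possible way to do this with Python's standard library is sketched below. It assumes that the documents use the TEI namespace and that the text following a removed comment must be preserved. The function names are invented for this sketch, and serialisation details (attribute quoting, namespace declarations) may differ from the input files.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
MARKER = "potentially ambiguous references"

# Serialise TEI elements without a generated ns0: prefix.
ET.register_namespace("", TEI_NS)


def _remove_keeping_tail(parent, child):
    """Remove child but keep the text that follows it."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def strip_resolved_comments(xml_text: str) -> str:
    # Comments are dropped by default, so ask the tree builder to keep them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    for bibl in list(root.iter(f"{{{TEI_NS}}}bibl")):
        if bibl.get("status") != "OK":
            continue
        for parent in list(bibl.iter()):
            for child in list(parent):
                if child.tag is ET.Comment and MARKER in (child.text or ""):
                    _remove_keeping_tail(parent, child)
    return ET.tostring(root, encoding="unicode")
```

Comments inside <bibl> elements with any other status are left untouched, so unresolved cases stay visible to curators.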

Introduce new publication subtypes

Currently, the ODD allows several values for @subtype on <bibl> (based on what's in the VICAV Zotero Library):

   <attDef ident="subtype" mode="add">
      <!-- This is extracted by running distinct-values(//biblStruct/@type) on the TEI export of the VICAV bibliography. -->
      <valList type="closed">
         <valItem ident="conferencePaper"/>
         <valItem ident="bookSection"/>
         <valItem ident="journalArticle"/>
         <valItem ident="book"/>
         <valItem ident="encyclopediaArticle"/>
         <valItem ident="thesis"/>
         <valItem ident="magazineArticle"/>
         <valItem ident="manuscript"/>
      </valList>
   </attDef>

It would be great if those would show up in the tei_enricher.

Open/view @target in editor

To access the profiles directly from the editor, the editor must be able to open/view files from the values of target attributes.

replace xml:base="{docPath}" with some other encoding

two problems:

  • @xml:base contains a URI, and { is an invalid character there
  • resolving URIs won't work out of the box as expected any more

Since the purpose of this construct was specific to the Enricher, a processing instruction would probably be the most suitable solution.

New personGroup Role

Two particular tribes are not tribes in the traditional sense of the word: they are groups which have come together for various reasons, such as work, have mingled with each other, and have created their own tribal group and linguistic variety. We would like to call them something along the lines of TribalGroup. The only problem is that they do not have a defined relation to the other groups, so it would be good if they existed outside of the predefined hierarchy which applies to clan - tribe - confederation.

Validation: relate validation errors to editors

Description

As usual, the various levels of validation only report errors by file name and line number.
To make it easier to manage the resolution of errors in the feature documents, each error should be assigned to the editor of the feature value observation element in which the error occurred.
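
Since the validators report line numbers, one approach is to record which lines each feature value observation spans and look the error line up in that table. The Python sketch below does this with the standard expat parser. It assumes that the editor is given in @resp and that the element appears with the prefix `wib:` as in the feature documents. The function names are hypothetical.

```python
import xml.parsers.expat

FVO = "wib:featureValueObservation"


def fvo_line_ranges(xml_text: str):
    """Return (first_line, last_line, resp) for every feature value observation."""
    ranges = []
    open_fvos = []
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        if name == FVO:
            open_fvos.append((parser.CurrentLineNumber, attrs.get("resp")))

    def end(name):
        if name == FVO:
            first_line, resp = open_fvos.pop()
            ranges.append((first_line, parser.CurrentLineNumber, resp))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return ranges


def editor_for_line(ranges, line: int):
    """Return the @resp of the observation containing the line, if any."""
    for first_line, last_line, resp in ranges:
        if first_line <= line <= last_line:
            return resp
    return None
```

Errors outside any observation (for example in the header) yield `None` and would still need to be assigned by hand.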

Zotero to TEI: represent date of data collection

Cf. #42: Each bibliographic entry used for feature value observations will have a tag for the decade of data collection and the level of certainty in Zotero. This will come out as any other "normal" Zotero tag as <note type="tag"> in the TEI export, however we probably want to make it more expressive.

<biblStruct> itself does not offer many choices. Either

<biblStruct>
     …
      <note type="dataCollection">The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date> and <date type="dataCollection" cert="low" notBefore-iso="1960-01-01" notAfter="1969-12-31">1960s</date>.
        </note>
</biblStruct>

OR: we can attach this to an @ana attribute on <biblStruct>:

<biblStruct type="conferencePaper" xml:id="Agius_26291991_9868" corresp="http://zotero.org/groups/2165756/items/EJHJT3CB" n="Agius1991a" ana="dataCollected:d1950s dataCollected:d1960s">
      …
</biblStruct>

and at the end of the document:

<interp xml:id="d1950s"><desc>The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date></desc></interp>

Neither of which I find very convincing, honestly. ... Other ideas, @charlymo @kisram @VeronikaEngler ?
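
Whichever encoding is chosen, the date boundaries can be derived mechanically from the decade tag. A minimal Python sketch, assuming the Zotero tag has the form "1950s" (the function name `decade_bounds` is made up):

```python
import re


def decade_bounds(tag: str):
    """Turn a decade tag such as "1950s" into first and last day of the decade."""
    match = re.fullmatch(r"(\d{3})0s", tag)
    if match is None:
        raise ValueError(f"not a decade tag: {tag!r}")
    first_year = int(match.group(1)) * 10
    return (f"{first_year}-01-01", f"{first_year + 9}-12-31")
```

`decade_bounds("1950s")` returns `("1950-01-01", "1959-12-31")`, the values used for the notBefore and notAfter boundaries in the examples above.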

reference language profile using `<lang>`

Currently, we point to the language profile related to a feature value observation using a simple <ptr> element.
It would probably be more expressive to use <lang corresp="../profiles/profile.xml"/>.

Zotero-TEI export is broken

  • many invalid xml:ids (spaces, unescaped single quotation marks, brackets etc., e.g. belnap r. kirk_2009_3013)
  • same xml:id is used for different entries (e.g. behnstedt_1994_0001 is used for @corresp='http://zotero.org/groups/2165756/items/EEI7S3N8' and @corresp='http://zotero.org/groups/2165756/items/7TMLL9NZ')
  • <extent> must be at the end of the entry (currently it directly follows the <title>)
  • <title> missing in <monogr> of an analytic publication (e.g. http://zotero.org/groups/2165756/items/5SWT5LJ7)
  • HTML-Elements inside of TEI (<h2>, <i>)
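
The first two problems in the list can be detected with a simple check over all xml:ids in the export. The Python sketch below uses a deliberately simplified, ASCII-only pattern. Real xml:id values may also contain many non-ASCII letters, so this is stricter than the XML rules. The function names are invented for this illustration.

```python
import re
from collections import Counter

# Simplified stand-in for the XML name rules (ASCII only; an assumption).
SIMPLE_ID = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def invalid_ids(ids):
    """Return the ids containing spaces, quotes, brackets or other stray characters."""
    return [value for value in ids if not SIMPLE_ID.fullmatch(value)]


def duplicate_ids(ids):
    """Return the ids that occur more than once, in sorted order."""
    return sorted(value for value, count in Counter(ids).items() if count > 1)
```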

Introduce controlled vocabulary for names of religions

Feature value observations which are attested for a specific religious group contain a <personGrp> element:

<personGrp type="religousGroup">
      <name>Christians</name>
</personGrp>

We want to limit the possible values of <name> to one of the following:

  • Christians
  • Jews
  • Muslim
  • Ibadi
  • Malikite
  • Sunni
  • Shiite
  • Druze
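
Until the restriction is part of the schema, the data can be checked against the list with a short script. The Python sketch below assumes that the documents use the TEI namespace and keeps the @type value exactly as it is spelled in the example above. `unknown_religions` is a made-up name.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
RELIGIONS = {
    "Christians", "Jews", "Muslim", "Ibadi",
    "Malikite", "Sunni", "Shiite", "Druze",
}


def unknown_religions(xml_text: str):
    """Return every religious group name that is not in the controlled list."""
    root = ET.fromstring(xml_text)
    unknown = []
    for group in root.iter(f"{TEI}personGrp"):
        # @type value spelled as in the example above.
        if group.get("type") != "religousGroup":
            continue
        for name in group.findall(f"{TEI}name"):
            value = (name.text or "").strip()
            if value not in RELIGIONS:
                unknown.append(value)
    return unknown
```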

TODO

wrong relative paths pointing from feature document to profiles

E.g. in 010_manannot/features/features_djim.xml:

    <wib:featureValueObservation cert="unknown" status="draft" xml:id="fr_0000_new_moon_please" resp="dmp:???">
         ...
               <ptr target="profiles\vicav_profile_LBN-ABI.xml"/>
          ...
    </wib:featureValueObservation>

Since the profiles are located under 010_manannot/profiles, the @target attribute on <ptr> should read ..\profiles\vicav_profile_LBN-ABI.xml. Changing this in the data isn't a problem, but would this break something in the tei_enricher, @charlymo ?
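
The difference between the two targets can be checked by resolving them against the location of the feature document. A small Python sketch, assuming repository-relative paths and treating backslashes as path separators (`resolve_target` is a hypothetical helper, not part of the tei_enricher):

```python
import posixpath


def resolve_target(document_path: str, target: str) -> str:
    """Resolve a @target value relative to the document that contains it."""
    target = target.replace("\\", "/")  # the data uses Windows-style separators
    base_dir = posixpath.dirname(document_path)
    return posixpath.normpath(posixpath.join(base_dir, target))
```

For 010_manannot/features/features_djim.xml, the current value resolves to 010_manannot/features/profiles/vicav_profile_LBN-ABI.xml, whereas the value with the leading ..\ resolves to 010_manannot/profiles/vicav_profile_LBN-ABI.xml.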

zotero: id unrelated to bibliographic data

In 010_manannot/vicav_biblio_tei_zotero.xml the entry http://zotero.org/groups/2165756/items/F4V22ECD has xml:id "prochazka_2016_3167" - which is strange because neither editor nor year are related to the actual bibliographic data. Moreover, in Zotero, the entry has a different, more plausible bibl:id in the extra field (biblid:HerinZammit_2016_3447). We should investigate if there exist other cases like that and provide a fix so that the information in the extra field always matches the xml:id in the TEI export.

label (@n) for taxonomy

Added n="sedentismType" to taxonomy. Look for better term under which bedouin (=nomadic?), sedentary, mixed can be grouped.
