wibarab / featuredb

WIBARAB is a project in the field of Arabic dialectology. It consists of various regional sub-projects (four PhD projects) and a large database about bedouin-type dialects of Arabic. The Feature Database will be the main point of integrating the results of the sub-projects. In this repository we collect the primary data of the database in TEI/XML.

License: Other

XSLT 1.45% HTML 93.43% CSS 0.02% Jupyter Notebook 5.11%

acdh-ch arabic-dialects linguistics

featuredb's Introduction

WIBARAB feature database

About WIBARAB

WIBARAB is a project in the field of Arabic dialectology. It consists of various regional sub-projects (four PhD projects) and a large database about bedouin-type dialects of Arabic.

The Feature Database will be the main point of integrating the results of the sub-projects. In this repository we collect the primary data of the database in TEI/XML.

Principal Investigator: Stephan Procházka (University of Vienna)
National Cooperation Partner: Charly Mörth (Austrian Academy of Sciences)

See https://wibarab.acdh.oeaw.ac.at/ for more information

Status of the data

THIS IS PRELIMINARY DATA AND COPYRIGHTED MATERIAL!

If you want to use any material in this repository please contact us at [email protected]

This will change at the end of the project.

Directory Structure

Directory	Content	Remarks
`001_src`	Original sources	Any external source data coming to the project
`082_scripts_xsl`	XSLT scripts	Various XSLT scripts to convert the data
`102_derived_TEI`	TEI-XML documents	TEI documents derived from an automated conversion process (from `001_src` or elsewhere)
`010_manannot`	Manually annotated TEI-XML documents	TEI documents which are manually annotated / curated / edited. Automated processes are not expected to write into this directory. We want to make sure that a human curator has validated the data in this directory and that nothing manually curated is overwritten by some script.
`802_tei_odd`	TEI customization (ODD)	This is the source of truth for the WIBARAB FeatureDB Schema and the HTML documentation generated from it.
`804_xsd`	XML Schemas	These are derived from the ODD in `802_tei_odd`. Each version of the schema should bear its number in the file name.
`850_docs`	Documentation	Further data documentation, encoding guidelines etc.

Schema Development

At this point, the model of the WIBARAB Feature Database schema is still evolving to a certain extent while new and existing data are being curated. To make sure that the transition from one version of the schema to the next happens in a structured manner, we set up the following rules:

Any development of the schema is done in 802_tei_odd/featuredb.odd. This file might also contain unpublished, unfinished, backwards-incompatible changes not reflected in any derived schema or documentation.
Naming conventions: We follow the Semantic Versioning Best Practices 2.0.0 which - applied to our case - boil down to the following principles:
- If a change potentially makes documents invalid which were previously valid, it is a new MAJOR version (i.e. increment the first number)
- If a change does not break validity of existing documents (e.g. in that it only adds optional elements or attributes or adds a significant portion of prose to the documentation) it is a new MINOR version (i.e. increment second number)
- If a change in the schema is merely a bug fix (typo etc.) or a minor addition to the documentation (change in wording, added examples etc.) this constitutes a PATCH version (i.e. the third number is incremented).
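
The three rules above can be sketched as a small helper. This is only an illustration: the function name `bump` and the change labels are made up for this example, and pre-release suffixes such as the `b` in `2.1.3b` are deliberately not handled.

```python
import re

# Plain MAJOR.MINOR.PATCH only; suffixes like the "b" in "2.1.3b" are out of scope here.
VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump(version: str, change: str) -> str:
    """Return the next schema version for a 'major', 'minor' or 'patch' change."""
    match = VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if change == "major":   # previously valid documents may become invalid
        return f"{major + 1}.0.0"
    if change == "minor":   # backwards-compatible additions
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # bug fixes, small documentation changes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

For instance, making a previously optional attribute mandatory would be a `"major"` change, while adding a new optional element would be a `"minor"` one.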

Schema release workflow

When a new version of the schema is to be released:

In the ODD document:
- update @n on <edition> to only contain the exact version number (e.g. 2.1.3b).
- change <edition> to include the version number. These elements are treated only as labels and can thus include human-readable additions (e.g. Version 2.1.3 Beta)
- add a <change> element with your editor ID and the current date, setting @status="published". Ideally add a <list> with all the changes you did in the ODD.
- Do not change the filename of the ODD document.
In oXygen:
- Generate the XSD schema from the ODD by right-clicking on 802_tei_odd/featuredb.odd and selecting Transform > Transform with > TEI ODD to XML Schema. The resulting files are placed into a new directory 802_tei_odd/out.
- create a new subfolder named {versionnumber} in 804_xsd/, e.g. 804_xsd/2.1.3b/ and move the files from 802_tei_odd/out to that folder.
Generate the HTML documentation and place it under 850_docs/featuredb_{versionnumber}.html
Afterwards delete 802_tei_odd/out.
Write a conversion script to transform documents from the previous schema version to the current one.
- Important: make sure that the conversion script updates the @xsi:schemaLocation in the migrated document instance.
- Place the XSLT script under 082_scripts_xsl/migrations and name it migrate_to_{versionnumber}.xsl (e.g. migrations/migrate_to_1.0.0b.xsl).
Run the conversion script on the oddtest.xml document in 802_tei_odd and check that it produces the expected results.
Apply the conversion script to the files in 010_manannot. They should be output to 102_derived_TEI.
Commit all changes to git and add a tag named after the schema version number.
Curators have to check the converted TEI documents and move them from 102_derived_TEI to 010_manannot to approve the change.
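
As a quick reference, the file locations named in the workflow above can be derived from the version number. The sketch below only restates those naming conventions; the function name `release_paths` and the dictionary keys are invented for this example, and it assumes the git tag is exactly the version string.

```python
from pathlib import PurePosixPath


def release_paths(version: str) -> dict:
    """Collect the repository paths touched when releasing a schema version."""
    return {
        # the ODD keeps its file name across versions
        "odd": "802_tei_odd/featuredb.odd",
        # generated XSD files are moved into a per-version folder
        "xsd_dir": str(PurePosixPath("804_xsd") / version),
        # generated HTML documentation
        "html_docs": str(PurePosixPath("850_docs") / f"featuredb_{version}.html"),
        # migration script from the previous schema version
        "migration": str(PurePosixPath("082_scripts_xsl/migrations") / f"migrate_to_{version}.xsl"),
        # assumption: the tag is the bare version number
        "git_tag": version,
    }
```

Calling `release_paths("2.1.3b")` gives, among others, `804_xsd/2.1.3b` and `850_docs/featuredb_2.1.3b.html`.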

About this file

This README file has a long-winded and dark history of editing. If you dare, you can check it out here.

featuredb's Issues

investigate ways of converting LAMETA to TEI

https://github.com/onset/lameta/blob/master/sample%20data/Edolo%20sample/Sessions/ETR009/ETR009_Careful.mp3.meta

Add document status "in progress"

meeting 2024-01-11:

Currently, validation is done only on documents marked as "done". For feature documents which are based on fieldwork, it will take some time until they reach this status, yet we might want at least parts of them to be validated.
We could consider introducing a third document status "in progress", where validation errors of fvos with status != "done" are dropped so that they don't bloat the status list.
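
A minimal sketch of the proposed filtering, assuming each validation error is available as a dictionary that carries the status of the fvo it belongs to. The key name `fvo_status` and the function name are hypothetical and do not reflect the actual validation output.

```python
def errors_to_report(document_status: str, errors: list) -> list:
    """Decide which validation errors end up in the status list."""
    if document_status == "done":
        # current behaviour: finished documents are validated in full
        return list(errors)
    if document_status == "in progress":
        # proposed behaviour: only report errors in fvos that are themselves "done"
        return [error for error in errors if error["fvo_status"] == "done"]
    # any other document status is not validated at all
    return []
```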

Regarding the Sociolinguistic constraint again

We discussed again briefly the difference between the sociolinguistic constraints and the PersonGroup, and we came to the conclusion that a simple note element within the sociolinguistic constraints section would suit our purposes just fine, basically as it is now but in the transformation it would show as 'Sociolinguistic constraint'. The PersonGroup would include what we discussed.

Validation error - Fieldwork

'fieldwork' violates enumeration constraint of 'publication personalCommunication campaign'.
The attribute 'type' with value 'fieldwork' failed to parse.

multiple values in one fvo or in separate fvos?

I think we discussed this before but unfortunately we're not sure anymore what we landed on: if for one feature and one dialect we have several realisations, is it better to create separate fvos or put both/all realisations in one fvo?

Zotero export: entries without biblid

2024-01-08T10:40:38.4676328Z 2024-01-08 10:40:38,467 - 5U3YWIMG no biblid
2024-01-08T10:40:38.4679487Z 2024-01-08 10:40:38,467 - TYKGGJEB no biblid
2024-01-08T10:40:38.4684605Z 2024-01-08 10:40:38,468 - LG2SHTMB no biblid
2024-01-08T10:40:38.4686009Z 2024-01-08 10:40:38,468 - 6TNYZUA8 no biblid
2024-01-08T10:40:38.4687022Z 2024-01-08 10:40:38,468 - QCPMWAYN no biblid
2024-01-08T10:40:38.4687968Z 2024-01-08 10:40:38,468 - P4WYQADG no biblid
2024-01-08T10:40:38.4689154Z 2024-01-08 10:40:38,468 - TZUT6CRI no biblid
2024-01-08T10:40:38.4690108Z 2024-01-08 10:40:38,468 - VBMVMQE8 no biblid
2024-01-08T10:40:38.4691026Z 2024-01-08 10:40:38,468 - HHA62AUL no biblid
2024-01-08T10:40:38.4692023Z 2024-01-08 10:40:38,468 - DYHVZN2P no biblid
2024-01-08T10:40:38.4692912Z 2024-01-08 10:40:38,468 - JULCPNGK no biblid
2024-01-08T10:40:38.4693925Z 2024-01-08 10:40:38,468 - 8F46VZCI no biblid
2024-01-08T10:40:38.4694859Z 2024-01-08 10:40:38,468 - XP62YEX8 no biblid
2024-01-08T10:40:38.4695691Z 2024-01-08 10:40:38,468 - EEKF92L3 no biblid
2024-01-08T10:40:38.4696728Z 2024-01-08 10:40:38,468 - VZWM5K3W no biblid
2024-01-08T10:40:38.4697584Z 2024-01-08 10:40:38,468 - 2SVX5GW7 no biblid
2024-01-08T10:40:38.4698484Z 2024-01-08 10:40:38,468 - EQQCQX4I no biblid
2024-01-08T10:40:38.4699860Z 2024-01-08 10:40:38,469 - Y4VTSSEN malformed biblid: (biblid:āl_1968_2357)
2024-01-08T10:40:38.4701257Z 2024-01-08 10:40:38,469 - RDCRA9ZI malformed biblid: biblid:ouldbaba_2023_9273)
2024-01-08T10:40:38.4703118Z 2024-01-08 10:40:38,469 - APTEQYR4 malformed biblid: biblid:danna_2023_9272)
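
The two kinds of log message above could come from a check along the following lines. This is a sketch, not the export script itself: it assumes the biblid sits on its own line of Zotero's extra field as `biblid:<id>` and that only a simple ASCII subset of characters is accepted, which would explain why entries with stray parentheses or non-ASCII letters are reported as malformed.

```python
import re

# Assumed shape of a well-formed entry: "biblid:" followed by an ASCII-only identifier.
BIBLID = re.compile(r"biblid:([A-Za-z_][A-Za-z0-9_.-]*)")


def check_extra_field(extra: str):
    """Classify the content of an extra field as ok, missing or malformed."""
    candidates = [line.strip() for line in extra.splitlines() if "biblid" in line]
    if not candidates:
        return ("no biblid", None)
    match = BIBLID.fullmatch(candidates[0])
    if match is None:
        return ("malformed biblid", candidates[0])
    return ("ok", match.group(1))
```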

introduce a controlled vocabulary for tribe names

Controlled vocabulary for tribe names

We want to make sure that the tribe names are consistent across our data, so we should add the list both to the ODD / Schema and to the tei_enricher.

There are several ways of implementing that:

Option 1: source from language profiles

Each tribe is represented in a language profile; so we could extract the list out of those profiles describing a tribe (leaving out others).

Pro:

tribes and langProfiles will be consistent.
no duplication of information

Con: Technically probably a bit more complicated:

tei_enricher will need one file with a list, so this will need to be generated programmatically every time a new tribe is added
also, the ODD and the schema will have to be re-generated
it is questionable whether / when we will have language profiles for each tribe

Option 2: dedicated list of tribes

Actually, there is already a stub of a list of tribes at 010_manannot/wibarab_tribes.xml

Pro:

easy to edit / consume
could use schematron rule to

Con:

duplication of sources (some tribes will also have a language profile containing overlapping information)

simplify titles of feature documents

define curation workflow

Define curation workflow

Before the workflow can be encoded as values of the @status attribute, we need to define a curation workflow.
Here's what has been proposed so far in our meeting on 2023-01-12:

Draft (default status): data gathering is still ongoing.
Done (WIBARAB marks it): the major bulk of data gathering is done (minus some fieldwork and doubts). The document is ready to be validated.
Validated (ACDH-CH marks it): the first round of validation is finished and no changes are required by the ACDH-CH team.
Needs revision (ACDH-CH marks it): the ACDH-CH team needs some changes from the WIBARAB team before a second, final round of validation.
Revised (WIBARAB marks it): changes have been made after the first validation and the document needs to be validated again.
Completed (ACDH-CH and WIBARAB need to agree): final version of the document, ready to publish.
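
One possible reading of this proposal as a transition table is sketched below. The allowed transitions and the lower-case status spellings are assumptions made for illustration; the meeting notes do not fix either.

```python
# Assumed transitions between curation statuses (not an agreed specification).
TRANSITIONS = {
    "draft": {"done"},
    "done": {"validated", "needs revision"},
    "needs revision": {"revised"},
    "revised": {"validated"},
    "validated": {"completed"},
    "completed": set(),
}


def can_move(current: str, target: str) -> bool:
    """Check whether a document may move from one status to another."""
    return target in TRANSITIONS.get(current, set())
```

A table like this could later back a validation rule that rejects, for example, a jump from "draft" straight to "completed".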

implement script to reorder FVO content to conform to the schema

Currently the ODD requires the order to be:

name
bibl
placeName
lang
date

afterwards optional elements in any number or order:

personGrp
cit
note
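
A minimal sketch of such a reordering with Python's standard library, assuming the fvo has already been parsed into an `ElementTree` element. Elements are matched by local name so the sketch does not depend on a particular namespace prefix; the function names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Order required by the ODD; everything else follows in its original order.
REQUIRED_ORDER = ["name", "bibl", "placeName", "lang", "date"]


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def reorder_fvo(fvo: ET.Element) -> ET.Element:
    """Reorder the children of one fvo element in place."""
    rank = {name: index for index, name in enumerate(REQUIRED_ORDER)}
    # sorted() is stable, so optional elements keep their relative order
    fvo[:] = sorted(fvo, key=lambda child: rank.get(local_name(child.tag), len(REQUIRED_ORDER)))
    return fvo
```

Note that the default parser drops XML comments and that whitespace between elements moves along with the element it follows, so a real script would probably need to re-indent its output.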

fix xml:ids in Zotero export

Description

Currently, the xml:ids in 010_manannot/vicav_biblio_tei_zotero.xml are generated by the Zotero client and referenced from the single feature documents. However, these IDs are not reliably stable and can change as entries are added (e.g. adding another publication by the same author from the same year will result in both records' xml:ids being updated to "lastName2023a" and "lastName2023b").

Solution

To avoid this, we have introduced the "biblid" values in Zotero's extra field which we have full control over.
We now just need to add a post-processing step to the 080_scripts_generic/vicav_zotero/fetch_generated_tei_and_process.ipynb

Authorship attribution for feature descriptions.

As discussed in our meeting on 2023-12-21, we want to attribute authorship to the descriptive part of a feature document, potentially also for external contributors. For this, we should …

add <byline> to the ODD and make it mandatory within <div type="description">
add a to 010_manannot/wibarab_dmp.xml where the <person> elements for external contributors can be listed
make @resp mandatory on <div type="description">

Further, we should decide whether the author of the feature description should also be mentioned in the <titleStmt> (IMHO s*he should), and how (<author> ? <respStmt> with a dedicated <resp> ?)

move common lists into dedicated git repository

Several "TEI-encoded lists" are used (and potentially edited) in parallel by different projects: these files should be kept in a central place and thus be moved into a dedicated git repository (e.g. coined vicav-commons ?) which can be included as a submodule in the project-specific git repositories.

Candidates are:

fLib.xml
vicav_geodata.xml
vicav_biblio_tei_zotero.xml (i.e. VICAV Zotero dump)

introduce divGen to indicate location of featureValues list

We want to allow editors to decide where the list of possible feature values should be placed in the description part of a feature value document. We could use divGen for that purpose.

<divGen type="featureValues"/>

add to ODD / Schema
implement in html preview transformation

socioLinguisticConstraints are not rendered in HTML preview

develop expansion XSLT script

Develop expansion XSLT script

The feature documents are made up of references to various external documents. For full validation and for querying the data, these references need to be resolved and the referenced data included in a "full" feature document.

personGrp for religions

Changed personGrp/@type religiousGroup to religiousAffiliation.
Not sure what the right term would be. Discuss.

remove comment "potentially ambiguous references"

Description / Background

In the past, I've added XML comments to bibliographic references which were potentially ambiguous so curators could systematically check them and set @status on the <bibl> element to OK (cf. ODD). The issues in the data should have been resolved by now; however, in many cases curators only changed the value of @status but did not remove the XML comment.

What's to be done

Remove XML comments reading "potentially ambiguous references" inside of <bibl> elements with @status="OK"
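
The clean-up could be sketched as follows with Python's standard library (Python 3.8 or later, because comments are only kept when the tree builder is asked to). It only looks at comments that are direct children of a `<bibl>` with `status="OK"`; the function names are invented for this example.

```python
import xml.etree.ElementTree as ET

MARKER = "potentially ambiguous references"


def parse_keeping_comments(xml_text: str) -> ET.Element:
    """Parse XML without discarding comment nodes."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(xml_text, parser=parser)


def strip_resolved_comments(root: ET.Element) -> int:
    """Remove marker comments from approved <bibl> elements; return how many were removed."""
    removed = 0
    for bibl in list(root.iter()):
        if not isinstance(bibl.tag, str):  # skip comment nodes
            continue
        if bibl.tag.rsplit("}", 1)[-1] != "bibl" or bibl.get("status") != "OK":
            continue
        previous = None
        for child in list(bibl):
            if child.tag is ET.Comment and MARKER in (child.text or ""):
                # keep the text that followed the comment
                tail = child.tail or ""
                if previous is None:
                    bibl.text = (bibl.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                bibl.remove(child)
                removed += 1
            else:
                previous = child
    return removed
```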

Introduce new publication subtypes

Currently, the ODD allows several values for @subtype on <bibl> (based on what's in the VICAV Zotero Library):

   <attDef ident="subtype" mode="add">
      <!-- This is extracted by running distinct-values(//biblStruct/@type) on the TEI export of the VICAV bibliography. -->
      <valList type="closed">
         <valItem ident="conferencePaper"/>
         <valItem ident="bookSection"/>
         <valItem ident="journalArticle"/>
         <valItem ident="book"/>
         <valItem ident="encyclopediaArticle"/>
         <valItem ident="thesis"/>
         <valItem ident="magazineArticle"/>
         <valItem ident="manuscript"/>
      </valList>
   </attDef>

It would be great if those would show up in the tei_enricher.

adapt XSLT for on-the-fly display of profiles

The existing XSLT is VICAV-specific and only creates a div-snippet.

Open/view @target in editor

To access the profiles directly from the editor, the editor must be able to open/view files from the values of target attributes.

Commit number ce0cd788aaddfe7a15c8dc091da96f4e42eb261f - New locations in the Galilee by Ana (August 8th)

Commit ce0cd78 (New locations in the Galilee by Ana - August 8th) was pushed but does not appear on TEI enricher. The modified files though are present in the backup folder. Since many other modifications have been done after that, how shall I proceed?

encode author of feature description section

replace xml:base="{docPath}" with some other encoding

two problems:

@xml:base contains a URI, { is an invalid character there
resolving uris won't work any more as expected out of the box

since the purpose of this construct was specific to the Enricher, probably a processing instruction would be the most suitable solution

New personGroup Role

Two particular tribes are not tribes in the traditional sense of the word: they are groups which have come together for multiple reasons, such as work, have mingled with each other and created their own tribal group and linguistic variety. We would like to call them something along the lines of TribalGroup. The only problem is that they do not have a defined relation with the others, and as such it would be good if they existed outside of the predefined hierarchy which applies to clan - tribe - confederation.

Validation: relate validation errors to editors

Description

As usual, the various levels of validation only report errors for file names + line numbers.
To make it easier to manage the resolution of errors in the feature documents, each error should be assigned to the editor of the feature value observation element in which the error occurred.
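
One way to relate a reported line number to an editor is sketched below. It is a simplification that scans the raw text instead of parsing it, and it assumes that the opening `<wib:featureValueObservation>` tag and its @resp attribute sit on a single line; the function name is invented for this example.

```python
import re

# Assumption: opening tag and @resp are on the same line of the file.
OPEN_TAG = re.compile(r'<wib:featureValueObservation\b[^>]*\bresp="([^"]*)"')
CLOSE_TAG = "</wib:featureValueObservation>"


def editor_for_line(xml_text: str, line_no: int):
    """Return the @resp of the fvo that encloses the given 1-based line, if any."""
    current = None
    for number, line in enumerate(xml_text.splitlines(), start=1):
        match = OPEN_TAG.search(line)
        if match:
            current = match.group(1)
        if number == line_no:
            return current
        if CLOSE_TAG in line:
            current = None
    return None
```

Errors outside of any fvo come back as `None` and would still have to be assigned by hand.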

validation errors q-file "chapter"

In the "chapter" of the q-file are two recurring errors that come up in validation:
The attribute 'type' of the element '{http://www.tei-c.org/ns/1.0}graphic' is not defined in the DTD/schema.
The attribute 'type' of the element '{http://www.tei-c.org/ns/1.0}num' is not defined in the DTD/schema.

Zotero to TEI: represent date of data collection

Cf. #42: Each bibliographic entry used for feature value observations will have a tag for the decade of data collection and the level of certainty in Zotero. This will come out as any other "normal" Zotero tag as <note type="tag"> in the TEI export, however we probably want to make it more expressive.

<biblStruct> itself does not offer many choices. Either

<biblStruct>
     …
      <note type="dataCollection">The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date> and <date type="dataCollection" cert="low" notBefore-iso="1960-01-01" notAfter="1969-12-31">1960s</date>.
        </note>
</biblStruct>

OR: we can attach this to an @ana attribute on <biblStruct>:

<biblStruct type="conferencePaper" xml:id="Agius_26291991_9868" corresp="http://zotero.org/groups/2165756/items/EJHJT3CB" n="Agius1991a" ana="dataCollected:d1950s dataCollected:d1960s">
      …
</biblStruct>

and at the end of the document:

<interp xml:id="d1950s"><desc>The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date></desc></interp>

Neither of which I find very convincing, honestly. ... Other ideas, @charlymo @kisram @VeronikaEngler ?
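
Whichever encoding is chosen, the date boundaries can be derived mechanically from the decade tag. The sketch below assumes the Zotero tag has the form "1950s"; the actual tag format is not fixed yet, and the function name is invented for this example.

```python
import re


def decade_range(tag: str):
    """Turn a decade tag such as '1950s' into (first day, last day) as ISO dates."""
    match = re.fullmatch(r"(\d{3})0s", tag)
    if match is None:
        raise ValueError(f"not a decade tag: {tag!r}")
    start_year = int(match.group(1)) * 10
    return (f"{start_year}-01-01", f"{start_year + 9}-12-31")
```

The two strings correspond to the lower and upper date boundaries used on `<date>` in the examples above.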

translate dialect list into VICAV language profiles

we want to retire the dialect list and transform the information to stubs of language profiles

note type should be rendered

It would be helpful to indicate the @type on <note> also in the html rendering

convert featurestructure-elements to FeatureValueObservation-Elements in q-type-file

reference language profile using `<lang>`

Currently, we point to the language profile related to a feature value observation using a simple <ptr> element.
Probably it would be more expressive to use <lang corresp="../profiles/profile.xml"/>

Zotero-TEI export is broken

many invalid xml:ids (spaces, unescaped single quotation marks, brackets etc., e.g. belnap r. kirk_2009_3013)
same xml:id is used for different entries (e.g. behnstedt_1994_0001 is used for @corresp='http://zotero.org/groups/2165756/items/EEI7S3N8' and @corresp='http://zotero.org/groups/2165756/items/7TMLL9NZ')
<extent> must be at the end of the entry (currently it's directly following the <title>)
<title> missing in <monogr> of an analytic publication (e.g. http://zotero.org/groups/2165756/items/5SWT5LJ7)
HTML-Elements inside of TEI (<h2>, <i>)
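
The first two problems in this list can be detected with a short check over all xml:ids of the export. The sketch uses a deliberately simplified, ASCII-only approximation of what XML allows in an ID (the real NCName rule also permits many non-ASCII letters); the function name is invented for this example.

```python
import re
from collections import Counter

# Simplified stand-in for the XML NCName rule: no spaces, quotes or brackets.
SIMPLE_ID = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def id_problems(ids):
    """Return (invalid ids, ids used more than once), both sorted."""
    invalid = sorted({value for value in ids if SIMPLE_ID.fullmatch(value) is None})
    duplicated = sorted(value for value, count in Counter(ids).items() if count > 1)
    return invalid, duplicated
```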

Introduce controlled vocabulary for names of religions

Feature value observations which are attested for a specific religious group contain a <personGrp> element:

<personGrp type="religiousGroup">
      <name>Christians</name>
</personGrp>

We want to limit the possible values of <name> to one of the following:

Christians
Jews
Muslim
Ibadi
Malikite
Sunni
Shiite
Druze
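
Until the list is part of the ODD, a check against it could look like the following sketch. The value of @type is passed in as a parameter because its final name is still under discussion; the function name is invented for this example.

```python
import xml.etree.ElementTree as ET

# The closed list proposed in this issue.
RELIGIONS = {"Christians", "Jews", "Muslim", "Ibadi", "Malikite", "Sunni", "Shiite", "Druze"}


def unknown_group_names(xml_text: str, group_type: str, allowed=RELIGIONS):
    """List <name> values of matching <personGrp> elements that are not in the vocabulary."""
    unknown = []
    for element in ET.fromstring(xml_text).iter():
        if element.tag.rsplit("}", 1)[-1] != "personGrp" or element.get("type") != group_type:
            continue
        for child in element:
            if child.tag.rsplit("}", 1)[-1] == "name":
                value = (child.text or "").strip()
                if value not in allowed:
                    unknown.append(value)
    return unknown
```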

TODO

add to ODD (@dasch124 )
introduce to tei_enricher (@charlymo) - related to https://gitlab.oeaw.ac.at/acdh-ch/object-pascal/tei-enricher/-/issues/4

How to group features?

ident on lang; corresp on personGrp

Zotero: add decade of data collection

To each publication in the VICAV bibliography, we want to add the information when the data was collected

automatically import tribes from dialect list in excel to the tei-tribes-file as discussed in our meeting

validation: make sure that fvo ids are globally unique

Theoretically, all fvo elements should have a globally unique xml:id by prefixing them with the document ids. Since this is beyond the current document-internal validation we've implemented so far, we need to add this.

Currently there are some fvo ids where the "document id" part of the fvo id reads "ft_template": https://github.com/search?q=repo%3Awibarab%2Ffeaturedb+xml%3Aid%3D%22ft_template&type=code
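
A cross-document check could be sketched as follows, assuming the documents are available as strings keyed by file name. Elements are matched by the local name `featureValueObservation`; the function name is invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def shared_fvo_ids(documents: dict) -> dict:
    """Map every fvo xml:id that occurs more than once to the files it occurs in."""
    seen = defaultdict(list)
    for file_name, xml_text in documents.items():
        for element in ET.fromstring(xml_text).iter():
            if element.tag.rsplit("}", 1)[-1] == "featureValueObservation" and XML_ID in element.attrib:
                seen[element.get(XML_ID)].append(file_name)
    return {fvo_id: files for fvo_id, files in seen.items() if len(files) > 1}
```

IDs that still carry the template prefix would show up here as soon as two documents were created from the same template without renaming.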

wrong relative paths pointing from feature document to profiles

E.g. in 010_manannot/features/features_djim.xml:

    <wib:featureValueObservation cert="unknown" status="draft" xml:id="fr_0000_new_moon_please" resp="dmp:???">
         ...
               <ptr target="profiles\vicav_profile_LBN-ABI.xml"/>
          ...
    </wib:featureValueObservation>

Since the profiles are located under 010_manannot/profiles, the @target attribute on <ptr> should read ..\profiles\vicav_profile_LBN-ABI.xml. Changing this in the data isn't a problem, but would this break something in the tei_enricher, @charlymo?
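
The expected value can be computed from the two repository paths, which may help when fixing the pointers in bulk. Note that this sketch emits forward slashes, as is usual in URIs, whereas the data currently uses backslashes; whether the tei_enricher accepts forward slashes is an open assumption. The function name is invented for this example.

```python
import posixpath


def relative_target(from_document: str, to_document: str) -> str:
    """Compute the relative path from one document in the repository to another."""
    return posixpath.relpath(to_document, start=posixpath.dirname(from_document))
```

For the example above, `relative_target("010_manannot/features/features_djim.xml", "010_manannot/profiles/vicav_profile_LBN-ABI.xml")` yields `../profiles/vicav_profile_LBN-ABI.xml`.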

zotero: id unrelated to bibliographic data

In 010_manannot/vicav_biblio_tei_zotero.xml the entry http://zotero.org/groups/2165756/items/F4V22ECD has xml:id "prochazka_2016_3167", which is strange because neither editor nor year are related to the actual bibliographic data. Moreover, in Zotero, the entry has a different, more plausible biblid in the extra field (biblid:HerinZammit_2016_3447). We should investigate whether other cases like that exist and provide a fix so that the information in the extra field always matches the xml:id in the TEI export.

label (@n) for taxonomy

Added n="sedentismType" to taxonomy. Look for better term under which bedouin (=nomadic?), sedentary, mixed can be grouped.