iljackb / mixtepec_mixtec

Mostly XML (TEI) markup of Mixtepec-Mixtec Language resources

HTML 89.54% XSLT 9.75% XQuery 0.46% CSS 0.26% Shell 0.01%

lexicon-construction lexical-analysis lexicography lexical-resource language-documentation language-processing mixtec

mixtepec_mixtec's Introduction

Mixtepec_Mixtec

Mostly XML (TEI) markup of Mixtepec-Mixtec Language resources. Speech files are available on Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BF2VNK

mixtepec_mixtec's Issues

Add <w @type> wrapper to component <w>'s in SIL and non () data

Where Mixtec content contains (contiguous) multi-word expressions, compounds, or phrases whose components are currently tagged and annotated as individual <w>'s, an additional <w @type> wrapper should be added in order to enable direct search retrievability in the final output.

This can be done automatically with XSLT by identifying the contents whose annotation and/or translation points to 2 or more consecutive <w>'s.

Note: this process may need to be revised to account for possible false positives, and it may need to wait for grammatical annotations in addition to translations before being carried out. This issue should be updated as progress is made on the preliminary steps (i.e. translations and annotations) that must be completed before the script and transformation can be run.

The example below shows the item "ntasaa xeen" (@xml:id's "d1e832" & "d1e834" respectively), which is not merely an adverbial adjectival phrase, as its meaning is not the sum of its parts (however, whether to classify this as a compound or a multi-word expression is not yet decided).

Example:

               <seg xml:id="L147-07-06" type="S">
                  <w xml:id="d1e820">Sara</w>
                  <w xml:id="d1e822">ntiva'u</w>
                  <w xml:id="d1e824">ka</w>
                  <w xml:id="d1e826">tuchii</w><!-- with force -->
                  <w xml:id="d1e828" cert="low">niki'via</w><!--  -->
                  <w xml:id="d1e830">ra</w>
                  <w xml:id="d1e832">ntasaa</w><!-- ntasaa xeen - very angry, (JB) furious -->
                  <w xml:id="d1e834">xeen</w>
                  <w xml:id="d1e836">tina</w>
                  <w xml:id="d1e838">ka</w>
                .....
                  <pc>.</pc>
               </seg>
               <spanGrp type="translation">
                .....
                  <span target="#d1e832 #d1e834" xml:lang="en">furious</span>
                  <span target="#d1e832 #d1e834" xml:lang="es">furioso</span>
                .....
               </spanGrp>

Target transformation should result in:
.... <w xml:id="d1e831"> <w xml:id="d1e832">ntasaa</w> <w xml:id="d1e834">xeen</w> </w> .....

Caveat
However, there will be no way of automatically assigning the value of @type="(compound | MWE | phrase)", so this will have to be done manually by searching and editing all occurrences of //w/w in the data.
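The wrapping step could be prototyped outside XSLT as follows. This is only a sketch using Python's standard library: the function name `wrap_multiword_targets` is hypothetical, the sample is a cut-down, un-namespaced version of the example above, and the new wrapper is left without @xml:id and @type, which still have to be supplied.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Cut-down version of the example above (no TEI namespace, for brevity).
SAMPLE = """<div>
  <seg xml:id="L147-07-06" type="S">
    <w xml:id="d1e830">ra</w>
    <w xml:id="d1e832">ntasaa</w>
    <w xml:id="d1e834">xeen</w>
    <w xml:id="d1e836">tina</w>
  </seg>
  <spanGrp type="translation">
    <span target="#d1e832 #d1e834" xml:lang="en">furious</span>
    <span target="#d1e832 #d1e834" xml:lang="es">furioso</span>
  </spanGrp>
</div>"""


def wrap_multiword_targets(root):
    """Wrap each run of adjacent <w> siblings that a single <span> targets."""
    # The en and es spans repeat the same target list, so collect it once.
    groups = sorted({
        tuple(ref.lstrip("#") for ref in span.get("target", "").split())
        for span in root.iter("span")
    })
    groups = [group for group in groups if len(group) >= 2]
    for parent in list(root.iter()):
        for group in groups:
            children = list(parent)
            ids = [child.get(XML_ID) for child in children]
            if group[0] not in ids:
                continue
            start = ids.index(group[0])
            run = children[start:start + len(group)]
            consecutive = tuple(ids[start:start + len(group)]) == group
            # Skip targets that are not consecutive <w> siblings.
            if not consecutive or any(child.tag != "w" for child in run):
                continue
            wrapper = ET.Element("w")  # @xml:id and @type are added later
            wrapper.tail = run[-1].tail
            run[-1].tail = None
            for child in run:
                parent.remove(child)
                wrapper.append(child)
            parent.insert(start, wrapper)
    return root


root = wrap_multiword_targets(ET.fromstring(SAMPLE))
print(ET.tostring(root.find("seg"), encoding="unicode"))
```

Targets that are not adjacent siblings are skipped rather than wrapped, which is one way of limiting the false positives mentioned in the note above.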

Fix all normalized instances of "nuu" (back from "nu")

In most files I encoded, I normalized the non-literal instances of "nuu" ('face') to "nu". Given that neither the SIL publications nor any unpublished documents write this item with a single "u", I would rather match what everyone else is using.

Further encode numbers in L162 in embedded <w>'s

Currently the numbers (which are compounds) in document L162 are encoded as follows:
<w xml:id="L162-01-32">oko utsi uvi</w>

In practice everywhere else, the components of compound numbers are each wrapped in a <w>, and these are in turn wrapped in a common <w>, i.e.
<w xml:id="L162-01-32"><w>oko</w> <w>utsi</w> <w>uvi</w></w>
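A minimal sketch of that conversion, assuming the components are separated by plain whitespace (the function name `split_compound` is hypothetical, and the fragment is un-namespaced):

```python
import xml.etree.ElementTree as ET


def split_compound(w):
    """Turn <w>oko utsi uvi</w> into <w><w>oko</w> <w>utsi</w> <w>uvi</w></w>."""
    tokens = (w.text or "").split()
    if len(w) or len(tokens) < 2:
        return w  # already has child elements, or is a single token
    w.text = None
    for position, token in enumerate(tokens):
        part = ET.SubElement(w, "w")
        part.text = token
        if position < len(tokens) - 1:
            part.tail = " "  # keep one space between the components
    return w


number = ET.fromstring('<w xml:id="L162-01-32">oko utsi uvi</w>')
print(ET.tostring(split_compound(number), encoding="unicode"))
```

The outer <w> keeps its @xml:id, so existing annotations that point to it stay valid.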

Create CSS Stylesheet for SIL "Las_aves-mix.xml"

The document image & the conceptual content of this document are the basis for the vocabulary. Thus it would be useful to be able to display it with potential consumers of the data, namely speakers, in mind.
This stylesheet should display the following in a user friendly layout:

Original content from SIL:
- image of bird
- Mixtec orthographic form
- Spanish orthographic form

I should add:
- English translations added by me
- link to URI's for semantics which contain:
  - (external) multilingual equivalents
    - (particularly) Latin (+ optional additional language user can choose as desired)
  - (external) images (photos, etc)
  - etc...
- Embedded sound files for any recorded instances of any of the bird vocabulary contained

Merge normalized <w>'s into a single <w> and express the previous separated state using spaces in @orig

Previously, in order to preserve the original content of the written Mixtec sources, forms that were incorrectly (or according to antiquated conventions) separated into multiple tokens were normalized and encoded as:

           ```
           <w xml:id="d1e3186" norm="kunkutu'u">
              <w xml:id="d1e3186a">ku</w>
              <w xml:id="d1e3188">nkutu'u</w>
           </w>
           ```

these should now be encoded as:

             ` <w xml:id="d1e3186" orig="ku nkutu'u">kunkutu'u</w>`

This is so that search programs retrieving the Mixtec lexical content do not have to be modified and further complicated by a rule that, if an item has @norm, the value of that attribute must be returned as a string.

Thus, those already encoded in the former way must be updated to match the latter.
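The update from the former structure to the latter could look roughly like this. It is a sketch only: `merge_normalized` is a hypothetical name, and it assumes that no annotation still points to the @xml:id's of the inner <w>'s, since those elements are discarded.

```python
import xml.etree.ElementTree as ET

OLD = """<w xml:id="d1e3186" norm="kunkutu'u">
   <w xml:id="d1e3186a">ku</w>
   <w xml:id="d1e3188">nkutu'u</w>
</w>"""


def merge_normalized(w):
    """Collapse <w norm="X"><w>a</w><w>b</w></w> into <w orig="a b">X</w>."""
    norm = w.get("norm")
    parts = [child.text or "" for child in w.findall("w")]
    if norm is None or not parts:
        return w  # nothing to merge
    for child in list(w):
        w.remove(child)
    del w.attrib["norm"]
    w.set("orig", " ".join(parts))  # spaces record the former token boundaries
    w.text = norm
    return w


merged = merge_normalized(ET.fromstring(OLD))
print(ET.tostring(merged, encoding="unicode"))
```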

Add punctuation to transcribed sentences

As of now, most sentences that were transcribed do not include normal punctuation and capitalization in the TEI output. This needs to be added for all complete sentences.

Add @xml:lang to utterance files

Currently the vocabulary from the utterance files does not show up in the multi-language script searches because there are no language tags on them.

Add @xml:lang="mix" to all in each file

Tagging emphatic pronouns

In tagging gram feature subtypes (e.g. ‘emphatic pronoun’, as in "mee"), should I use “#PRON #EMPH” or “#PRON-EMPH”?

To consider: ‘emphatic’ doesn’t exist independently of pronoun in old ISOcats;

Currently using #PRON #EMPH (can always change using XSLT)

Should the tagging for where these are reflexive be “#PRON #EMPH #RFLX” or “#PRON #RFLX”?

(Wiktionary English categorizes them as #PRON #EMPH & #PRON #RFLX)

Automatically add translations for indefinite articles

In most sentences indefinite articles are not being translated. After the grammatical tagging is done, this can be automated by identifying the <w>'s which have the following characteristics:

           <s>
              <!-- other content above -->
              <w xml:id="d1e257">ian</w>
              <w xml:id="d1e260">kuntuta'in</w>
              <w xml:id="d1e262" orig="iin">in</w>
              <w xml:id="d1e264">ntixi</w>
              <pc>.</pc>
           </s>
        
        <spanGrp type="gram">
           <span type="pos" target="#d1e262" ana="#ARTCL #INDEF"/>
           <span type="number" target="#d1e262" ana="#SG"/>
        </spanGrp>

That is:

- where an @xml:id is tagged with "#ARTCL #INDEF" on <span type="pos">, and
- where that same @xml:id is tagged with "#SG" on <span type="number">,

then for each such item in an <s> or <seg>, add to <spanGrp type="translation"> one <span> for each of @xml:lang="en" & @xml:lang="es", containing respectively:
a
un
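The rule could be sketched as below. Assumptions: the fragment is un-namespaced, the function name `add_article_translations` is hypothetical, an empty <spanGrp type="translation"> already exists next to the gram group, and the Spanish form is always "un" as in the issue text (gender agreement is not handled).

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ARTICLES = {"en": "a", "es": "un"}

SAMPLE = """<div>
  <s>
    <w xml:id="d1e262" orig="iin">in</w>
    <w xml:id="d1e264">ntixi</w>
  </s>
  <spanGrp type="gram">
    <span type="pos" target="#d1e262" ana="#ARTCL #INDEF"/>
    <span type="number" target="#d1e262" ana="#SG"/>
  </spanGrp>
  <spanGrp type="translation"/>
</div>"""


def add_article_translations(root):
    """Add en/es translation spans for items tagged #ARTCL #INDEF and #SG."""
    gram = root.find("spanGrp[@type='gram']")
    trans = root.find("spanGrp[@type='translation']")

    def targets(span_type, ana):
        spans = gram.findall("span[@type='%s']" % span_type)
        return {span.get("target") for span in spans if span.get("ana") == ana}

    hits = targets("pos", "#ARTCL #INDEF") & targets("number", "#SG")
    for target in sorted(hits):
        for lang, text in ARTICLES.items():
            already = any(
                span.get("target") == target and span.get(XML_LANG) == lang
                for span in trans.findall("span")
            )
            if not already:  # never duplicate an existing translation
                span = ET.SubElement(trans, "span", {"target": target, XML_LANG: lang})
                span.text = text
    return root


root = add_article_translations(ET.fromstring(SAMPLE))
print(ET.tostring(root.find("spanGrp[@type='translation']"), encoding="unicode"))
```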

Add speakers and pointer to speakers list to file headers

Currently in the TEI utterance files derived from the Praat TextGrid output, the only marker of the speaker is at the end of the source file name, which is included as the value of @target on <ptr> in the header.

The initials of the speaker are always the last part of the filename string and always precede the file extension. (See below):

<teiHeader>
    <fileDesc>
      <titleStmt>
          <title>Mixtepec-Mixtec TEI output for Praat TextGrid Transcriptions</title>
          <respStmt>
             <name>Jack T. Bowers</name>
             <resp>Transcription</resp>
             <resp>Data Modeling</resp>
             <resp>Speaker Consultation</resp>
          </respStmt>
       </titleStmt>
     .......
       <sourceDesc>
          <p>This file was converted from the source file <ptr target="file:/Users/jackbowers/Desktop/test/SOC_good_bye_01_02_03_spkrTS.txt"/> which was extracted from the Praat TextGrid transcriptions of the speech file <ptr target="file:/Users/jackbowers/Desktop/test/SOC_good_bye_01_02_03_spkrTS.wav"/>
          </p>
       </sourceDesc>
    </fileDesc>
 </teiHeader>

Three changes should be made via XSLT script to improve this (for all files not containing the speech of more than one person):

- Add the speaker's initials (w/ hashtag) to each <u> as the value of @who
- Add a <respStmt> with the speaker's name and role (<resp>speaker</resp>) to the header, declaring the speaker's initials (the ones hashtagged in @who) as the value of @xml:id on <name> (e.g. <name xml:id="TS">)
- Add a pointer in the header (specific format to be determined after consultation with other experts) to the document: Mixtepec_Mixtec/MIX-Project-Personography-TEI.xml
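The first of these changes could be sketched as follows. Two things here are assumptions rather than project facts: that the initials always follow a "_spkr" prefix (as in the one file name shown above), and that @who belongs on <u>. The names `speaker_initials` and `add_who` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def speaker_initials(path):
    """Return the initials between '_spkr' and the file extension, if any."""
    match = re.search(r"_spkr([A-Za-z]+)\.\w+$", path)
    return match.group(1) if match else None


def add_who(root, initials):
    """Set who="#INITIALS" on every <u> (single-speaker files only)."""
    for utterance in root.iter("u"):
        utterance.set("who", "#" + initials)
    return root


source = "file:/Users/jackbowers/Desktop/test/SOC_good_bye_01_02_03_spkrTS.txt"
body = ET.fromstring('<body><u n="1"/><u n="2"/></body>')
add_who(body, speaker_initials(source))
print(ET.tostring(body, encoding="unicode"))
```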

Fix contents of <title> in SIL documents!!

Many <title>'s contain unsegmented content because erroneous scripts were accidentally applied to that portion. Need to go through and fix. Only further segment titles if they contain lexical content that is not found anywhere else in the data collection.

Fully encode legacy publication, rights and licensing info in header from SIL Docs

Currently the metadata, bibliographic and rights information that the SIL documents declare about themselves is included in the TEI docs, but this content is not yet fully encoded in TEI.

Need to:

hand encode one
write template or program to automatically do this for the rest of the files

Decide basic translation principles (Indefinite noun phrases)

Need to make decision about how and whether to:

Translate indefinite noun phrases as a single translation
Translate just the noun and label the indefinite article just in grammar
Translate both: one for the noun & another for the noun with “in”

(Note: this one might be needed for at least the first instance of a noun in a document and where there wouldn’t otherwise be a single translation of just that)

Refine semantic relations categories and systematize use of <usg type="domain">

Thus far I have been (ab)using <usg type="domain">, employing it in cases where a more specific and systematic lexical relation type could be used instead.

Adopting a system/inventory of semantic and lexical relations as the primary semantic annotation (beyond sense that is) would form the basis for the creation of an expandable and searchable system which would be a necessary baseline for the creation of a wordnet like system in MIX.

Domain could be relegated to a more general usage in line with encyclopedic theme or subject matter.

Translation of Mixtec modals into English and Spanish

Ask Mille how to translate the “modals” that are future subjunctive in meaning into both Spanish and English, given that there is no future subjunctive in either language

Write XSLT to convert new Praat textgrid format to TEI

An altered version of the file: Mixtepec_Mixtec/stylesheets-scripts/xslt/parse-textgrid-output-altered-b02-inProgress-LR.xslt is needed to convert the files in: Mixtepec_Mixtec/media/speech-mix/new-unprocessed/tisu-vienna/annotated/ as well as any future annotated speech files (as they will all from now on adhere to the Praat TextGrid annotation format used therein).

Specifically, the tiers to be parsed are: Tokens; Gloss; Pron; Orth.

- Each Token should be converted to: <annotationBlock> (note: unlike the older transcriptions, the tokens in these are not to be assumed to be duplicates);
- Each Gloss should create: two <seg>'s (one for Orth and one for Pron, each with the needed @notation attribute-value pair) and a <spanGrp type="translation"> whose content is split at the comma:
  - 1st content (before the comma), e.g. "pulga" in Gloss "pulga, flea";
  - 2nd content (after the comma), e.g. "flea" in Gloss "pulga, flea";
  - (if there is no comma, the contents should just be put into one translation element and an empty one should be created to fill in later)
- Each Pron should be converted to values of <w>'s within the <seg notation="ipa">
- Each Orth should be converted to values of <w>'s within the <seg notation="orth">

Come up with final means of encoding documents with blank spaces

L095, L095, cruxigramas, L144, L160, L162 have blank spaces which are supposed to provide space for users to provide answers to the questions.

In final output, we need to:

- find a systematic way (according to community recommendations) to encode this in TEI;
- create an XSLT (convert to XHTML) and possibly CSS stylesheets to allow this to be done online;
- provide answers to check against (need to figure out where to put and how to encode)

Add speaker consultant in <respStmt> for SIL files

For each SIL file where the glossing was assisted by a speaker, a <respStmt> should be added to the header. Currently this is done in: L099, L147, L157, L163 & Cruxigramas:
e.g.

<respStmt>
   <resp>Glossing, Consultation</resp>
   <name>Juan "Tisu'ma" Salazar</name>
</respStmt>

Need to normalize values as well.

Enable source document return in results of English and Spanish searches in "query-mix-translations.xq"

Currently in the script "query-mix-translations.xq" only searches from Mixtec return the results with the source documents listed. The script needs to be modified to allow this in searches from English and Spanish

Wrap all original Mixtec content being annotated in <annotationBlock>

Add <annotationBlock> around all Mixtec content so that the original and annotation content are wrapped in one common element.

Change language tags of Colonial Mixtec to "nds-x-colmix"

Currently in the language content from the colonial sources (Alvarado currently only one), the @xml:lang values are "mix-alv" which is not valid as it doesn't conform to BCP 47. There is no language tag for Colonial Mixtec thus the ISO 639 tag must begin with "nds" (for languages w/o tags) and be extended as follows: "nds-x-colmix". This will stand for "colonial mixtec"

Combine components of IPA complex phones in single <c>?

Mixtec has numerous complex phones e.g. (“nd”, “kw”, “ndʒ”, "nts", etc.) which are phonologically a single unit. In the old Praat annotation I separated all phonetic units by manner of articulation and thus split up many.

e.g.

                <w xml:id="d1e81" synch="#T6">
                     <c>n</c>
                     <c>ts</c>
                     <c>a</c>
                     <c function="tone">T</c>
                     <c>ʔ</c>
                     <c>n</c>
                     <c>i</c>
                     <c function="tone">H</c>
                  </w>
                  <w xml:id="d1e98" synch="#T14">
                     <c>ɲ</c>
                     <c>a</c>
                     <c function="tone">F</c>
                  </w>
                  <w xml:id="d1e105" synch="#T17">
                     <c>n</c>
                     <c>ts</c>
                     <c>ḭ</c>
                     <c function="tone">T</c>
                     <c>ʔ</c>
                     <c>ḭ</c>
                     <c function="tone">T</c>
                  </w>

This makes searching for phonological strings a pain, and it means that despite my overly meticulous Praat annotation the encoding of the phonology is not complete. These strings are predictable in context and will not occur in sequence when not part of the given complex phone, thus automatic combining should not be too complex or risk incorrect grouping.
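A rough sketch of the automatic combining, in Python rather than XSLT. The inventory `COMPLEX_PHONES` is an assumption about how each complex phone was split (only "n" + "ts" and "k" + "w" are attested in the examples in these issues), and `merge_complex_phones` is a hypothetical name. Tone <c>'s are never merged.

```python
import xml.etree.ElementTree as ET

# Assumed splits of the complex phones; extend after checking the data.
COMPLEX_PHONES = {("n", "ts"), ("n", "d"), ("n", "dʒ"), ("k", "w")}

SAMPLE = """<w xml:id="d1e81" synch="#T6">
  <c>n</c><c>ts</c><c>a</c><c function="tone">T</c>
  <c>ʔ</c><c>n</c><c>i</c><c function="tone">H</c>
</w>"""


def merge_complex_phones(w):
    """Join adjacent <c> pairs that together form one complex phone."""
    children = list(w)
    i = 0
    while i < len(children) - 1:
        first, second = children[i], children[i + 1]
        plain = not first.get("function") and not second.get("function")
        if plain and (first.text, second.text) in COMPLEX_PHONES:
            first.text += second.text
            first.tail = second.tail
            w.remove(second)  # any @xml:id on the second <c> is lost here
            children = list(w)
        i += 1
    return w


word = merge_complex_phones(ET.fromstring(SAMPLE))
print([c.text for c in word])
```

In the sample the sequence "ʔ", "n" is left alone because it is not in the inventory, which is the behaviour the paragraph above relies on.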

Replace <w>____</w> for blank spaces with other element

Need to change from <w> to another element to avoid false, useless results

Add genre to each Mixtec TEI file in the header

Need to review the previous list of genres and add:

Elicited speech
Diary (or something related)
(OTHERS)

Add "Resumen" to the SIL documents where present in original

(more details as needed)

Convert Tone Transcriptions

Currently due to legacy transcription format from Praat textgrid in which tones were transcribed on separate tiers and originally in ZSampa, the majority of the phonetic transcription contents in the utterance files have the tones labeled as a capital letter e.g. (L, M, H, R, F, R_F, F_R, T, etc.). These need to be changed to unicode IPA.

Certain key theoretical descriptions of tone distinctions are not yet settled; thus some of these transformations are themselves meant as placeholder markup until more study and consultation with other researchers can be done.

The main issue in question is:

Whether there is a phonological distinction between the tone height levels in falling and rising tones.

Thus, until this is settled, those corresponding to a general rising or falling will just be transcribed as a global rise or fall (i.e. ↗↘)

Note: the replacement for the "unknown tone" tag (ZSampa) "T" or (antiquated variant) "E" will be: bullet operator "∙" (U+2219, UTF-8: E2 88 99)

The inventory of unique values found in //c[@function="tone"] is given below (left); the target values to convert to are on the right, separated by the rightward arrow "→".

H → ˥
F → ↘
F_R → ↘↗
M → ˧
R → ↗
L → ˩
T → ∙
R_F → ↗↘
R_F! → ↗↘ꜜ
H_F → ˥↘
F_R! → ↘↗ꜜ
M^ → ˧ꜛ
H! → ˥ꜜ
L^ → ˩ꜛ
F^ → ↘ꜛ
R! → ↗ꜜ
F! → ↘ꜜ
H^ → ˥ꜛ
H_L → ˥˩
H_F^ → ˥↘ꜛ
M! → ˧ꜜ
F_R_F → ↘↗↘
M_L → ˧˩
E → ∙
H_M → ˥˧
F_R_F! → ↘↗↘ꜜ
FF → ↘
HF → ˥↘
L_R → ˩↗
L_H → ˩˥
H_R → ˥↗
HL → ˥˩
LM → ˩˧
HH → ˥ꜛ
Low → ˩
H_M^ → ˥˧ꜛ
F_L → ↘˩
LH → ˩˥
F_R^ → ↘↗ꜛ
M_H → ˧˥
F or M → ∙

Note: these converted transcriptions are also meant as temporary placeholders until further study can be done and a more normalized tone marking can be devised (either using a superscripted number system or potentially combining diacritics over the vowels).
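The table above can be applied mechanically. The sketch below copies the 41 mappings into a Python dictionary and rewrites the text of every tone <c>; the names `TONE_MAP` and `convert_tones` are hypothetical, and any label not in the table is reported back instead of being guessed at.

```python
import xml.etree.ElementTree as ET

# The 41 legacy labels and their targets, copied from the list above.
TONE_MAP = {
    "H": "˥", "M": "˧", "L": "˩", "F": "↘", "R": "↗", "T": "∙", "E": "∙",
    "F_R": "↘↗", "R_F": "↗↘", "H_F": "˥↘", "H_L": "˥˩", "H_M": "˥˧",
    "H_R": "˥↗", "M_L": "˧˩", "M_H": "˧˥", "L_R": "˩↗", "L_H": "˩˥",
    "F_L": "↘˩", "F_R_F": "↘↗↘",
    "H!": "˥ꜜ", "M!": "˧ꜜ", "F!": "↘ꜜ", "R!": "↗ꜜ",
    "R_F!": "↗↘ꜜ", "F_R!": "↘↗ꜜ", "F_R_F!": "↘↗↘ꜜ",
    "H^": "˥ꜛ", "M^": "˧ꜛ", "L^": "˩ꜛ", "F^": "↘ꜛ",
    "H_F^": "˥↘ꜛ", "H_M^": "˥˧ꜛ", "F_R^": "↘↗ꜛ",
    "FF": "↘", "HF": "˥↘", "HL": "˥˩", "LM": "˩˧", "LH": "˩˥", "HH": "˥ꜛ",
    "Low": "˩", "F or M": "∙",
}
CONVERTED = set(TONE_MAP.values())


def convert_tones(root):
    """Replace legacy tone labels in place; return the labels not in the table."""
    unknown = set()
    for c in root.iter("c"):
        if c.get("function") != "tone":
            continue
        label = (c.text or "").strip()
        if label in TONE_MAP:
            c.text = TONE_MAP[label]
        elif label not in CONVERTED:  # already-converted values are left alone
            unknown.add(label)
    return unknown


word = ET.fromstring('<w><c>ɲ</c><c>a</c><c function="tone">F</c></w>')
leftover = convert_tones(word)
print(ET.tostring(word, encoding="unicode"), leftover)
```

Keeping the table as data means a later switch to superscript numbers or combining diacritics would only require changing the dictionary values.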

Write stylesheet to automatically create with all <w> xml:id

Create import/conversion script for Mixtec verb charts

Currently the Mixtec verb conjugation chart is being kept and edited in a Google docs spreadsheet in the following location https://drive.google.com/open?id=1KTdL3q5fvGA2PWSyEMT_kgC1hE0YZQn4l11LPMTuaaU (because of the convenience and constant changes).

An XSLT script needs to be written to convert an export of these contents to TEI dictionary structures, with exports made as needed.

In doing this, there will need to be a way to address several difficult issues that are both theoretical and practical in nature:

Identify the headword at column 1 and line 1 and regularly (+16 lines) thereafter;
- these should be output to <form type="lemma">
Allow for the possibility of phonetic forms to the right of the headword prior to "-", these are all in IPA style brackets "[ ]";
To the right of the dash "-" is the gloss (need to normalize where there are multiple)
How to deal with pronouns?
How to deal with variant pronouns?
How to deal with compound verbs that have whitespace? i.e. How to distinguish from pronoun?
(e.g. "tsinu ini yu")
Need to systematize whether to use accented or non-accented orthographic forms;
Can we allow for both? (if so make systematic)
Figure out what to do (normalize) about high tone on formal feminine inflections (these are often minimal pairs with 3s.inf, thus it is important to distinguish them; currently some are written with an IPA vowel to the right of the form, e.g. "nikitsai [í]")

(Add details of desired output structures here in updates to this issue)

(Add sample of desired output from a sample table)

Check, merge and/or fork "parse-textgrid-output..." XSLT file variants

The following files are all variants of the same one:

Mixtepec_Mixtec/stylesheets-scripts/xslt/parse-textgrid-output-altered-b02-inProgress-LR.xslt is the most recent and the one used in the major transformation scenario used in original Praat to TEI conversions. This should be renamed to something more simple and practical and the changes should be reflected in the Oxygen transformation scenario that uses this file.

The others are the source of the aforementioned and/or possibly altered versions of it.

Need to examine, keep most recent valid one, and review whether the others should be kept. If yes, they should be renamed and differences documented in the XSL docs themselves and the commits. Future changes should be done via forks so they can be properly tracked and recovered if needed.

File "MIX-External-LR_Corpus.xml" removed, replace contents individually

The file had too many different messy formats so it was removed from the contents. When ready, add each document individually so that it can be integrated into the common resources. Importantly, use normalization attributes. @norm @orig

Add @xml:lang tag to all utterance files on

Add @xml:lang tag to all utterance files on (they currently have no language tag!!!)

Fix <w> structure in Aves document

Currently in Las_Aves-mix.xml the non-compounds are wrapped in the same nested <w> structure as compounds (only without the @type="compound"). This is a leftover from when the structure was //seg/w for every item.
e.g.
<w xml:id="d1e587" xml:lang="es"> <w xml:id="d1e588">cuitlacoche</w> </w>

The annotations already point to the highest level and thus don't have to be changed.

Need to make script that:

- finds //w[not(@type="compound")]/w
- copies the content from //w/w/text()
- moves it to the value of //w
- deletes the second /w element
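Those steps could be sketched as follows (Python standard library, un-namespaced fragment, hypothetical function name `unwrap_single_w`). As an extra safeguard that the issue does not ask for, the sketch only unwraps an outer <w> that has exactly one inner <w>.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<seg>
  <w xml:id="d1e587" xml:lang="es"> <w xml:id="d1e588">cuitlacoche</w> </w>
</seg>"""


def unwrap_single_w(root):
    """Move the text of a lone inner <w> up into its non-compound parent <w>."""
    for outer in list(root.iter("w")):
        inner = list(outer)
        if outer.get("type") == "compound":
            continue  # real compounds keep their nested structure
        if len(inner) != 1 or inner[0].tag != "w":
            continue
        outer.text = inner[0].text  # copy //w/w/text() to the outer <w>
        outer.remove(inner[0])      # delete the second-level <w>
    return root


seg = unwrap_single_w(ET.fromstring(SAMPLE))
print(ET.tostring(seg, encoding="unicode"))
```

The outer @xml:id is kept, which matches the statement above that the annotations already point to the highest level.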

Tagging & Annotating Yes-No questions

Should I tag yes-no questions as "#Q #YN" or "#Q-YN"?

> Currently using #Q-YN

Combining them is more efficient but conflates categories: the utterance type is interrogative (i.e. "Q"), but what category is "Y-N"?

Also, what category (@type) on <span> should be used for utterance type?

Currently using "phrase" and pointing to the <seg> (sentence level)
e.g.

               <seg xml:id="d1e43"
                    function="utterance"
                    notation="orth"
                    type="interrogative">
                  <pc>¿</pc>
                  <w xml:id="d1e44" synch="#T1">A</w>
                  <w xml:id="d1e46" synch="#T3">kuun</w>
                  <w xml:id="d1e48" synch="#T6">savi</w>
                  <pc>?</pc>
               </seg>
               .......
            <spanGrp type="gram">
               <span type="phrase" target="#d1e43 ..." ana="#Y-N"/>
              ....
            </spanGrp>

test

@laurentromary

Modify translation xquery script "query-mix-translations.xq" to allow extraction of speaker in output

Ideally any searched Mixtec content should also be able to retrieve the speaker (or author) who spoke/wrote/created the content. This is especially important due to possible sub-dialectal variations that may be inherent in the language of the Yucunani speakers and the others. This may also be due to specific socio-linguistic reasons as well such as the residence of the speakers and frequency of usage of Mixtec with other native speakers.

To do this, every file needs to have the speaker/author in the header under //respStmt, and to make this searchable, the script must be able to search the values of //respStmt/resp and //respStmt/name.

The possible values that may occur in <resp> must be systematized before this can be applied.

Add <back> section to SIL docs with extra vocabulary

Often when speakers are consulted on the translation of SIL documents, they give glosses or terms related to the original content that do not fit into the annotation process but which are relevant to building a maximal vocabulary. Thus we can add a section in which this information can be included at a later step.

Write script to retrieve all targets of "unknown" cert

All unknown lexical items are tagged in: //spanGrp[@type="translation"]/span[@cert="unknown"];
write script to:
a) retrieve all the items pointed to in @target: //span[@cert="unknown" and @target];
b) retrieve the context sentences in which they occur: ancestor::*[self::seg or self::s]
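Both retrieval steps could be sketched as below. The sample fragment and its @xml:id values ("s1", "w1", "w2") are invented for illustration, the function name `unknown_items` is hypothetical, and the translation group is assumed to carry type="translation" as in the other examples in these issues.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented fragment: ids and the cert="unknown" span are illustrative only.
SAMPLE = """<div>
  <seg xml:id="s1">
    <w xml:id="w1">Sara</w>
    <w xml:id="w2">niki'via</w>
  </seg>
  <spanGrp type="translation">
    <span target="#w2" cert="unknown"/>
    <span target="#w1" xml:lang="en">Sara</span>
  </spanGrp>
</div>"""


def unknown_items(root):
    """Return (item text, context sentence) for every cert="unknown" target."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    results = []
    for group in root.iter("spanGrp"):
        if group.get("type") != "translation":
            continue
        for span in group.findall("span"):
            if span.get("cert") != "unknown":
                continue
            for ref in span.get("target", "").split():
                item = by_id.get(ref.lstrip("#"))
                if item is None:
                    continue  # dangling pointer
                # Climb to the nearest <seg> or <s> ancestor.
                context = item
                while context is not None and context.tag not in ("seg", "s"):
                    context = parent_of.get(context)
                sentence = None
                if context is not None:
                    sentence = " ".join("".join(context.itertext()).split())
                results.append(("".join(item.itertext()), sentence))
    return results


print(unknown_items(ET.fromstring(SAMPLE)))
```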

Change the old orthography, Add (@orig) & Add <encodingDesc> in header with description

(I)
In the encoding of the SIL documents, it is necessary to update the orthography in the older documents to match the updated spelling. This is being done by changing the text and putting the original (antiquated) version in @orig on <w>.
(II)
This needs to be recorded & declared in the header. This should be done as the following:

<encodingDesc>
         <editorialDecl>
            <normalization>
               <ab>
                  <date>2018-02-05</date>Spelling of lexical items which in the old orthography were homographs and which have been changed according to updated orthographic principles as per <ref>../Mille/GU%C3%8DA%20ORTOGR%C3%81FICA%20PRELIMINAR%20Short%20version.xml</ref>
               </ab>
            </normalization>
         </editorialDecl>

<!-- other content here as needed-->
</encodingDesc>

(III)
In final output, in <revisionDesc>, generate a <change> for each of the specific words changed (including both the old and revised form)

Tag morphological units/inflections with <m>?

A systematic use of <m> (with @xml:id) in addition to <w> in the encoding schema would enable a more specific and detailed representation of the content.

eg.
nuu <w>nuu</w> is inflected as: nui <w>nu<m>i</m></w>

However, implementing this would create a significant problem in searching for exact strings as they would be split up between two (or potentially more) element segments.

My priority (early on) is to be able to achieve a corpus from which I can easily retrieve all instances of a given piece of data.

So is it possible to search for these strings without significant modification/burden?

What is the best approach?
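One option worth testing is to search on the string value of each <w> rather than on its direct text node, since the string value concatenates all descendant text and therefore ignores the <m> boundary (in XPath/XQuery this is what string() returns for an element). A small sketch with Python's standard library, where `word_forms` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

SAMPLE = "<seg><w>nuu</w> <w>nu<m>i</m></w></seg>"


def word_forms(root):
    """Return the full string value of every <w>, ignoring inner markup."""
    return ["".join(w.itertext()) for w in root.iter("w")]


forms = word_forms(ET.fromstring(SAMPLE))
print(forms)
print("nui" in forms)
```

Note that with nested <w>'s (compounds) this returns both the outer and the inner forms, so a real query would need to decide which level it wants.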

Write bigram and trigram retrieval scripts

Create xslt script which searches and retrieves consecutive elements, if they have translations and/or are part of a larger phrase sequence, return the full phrase and translations
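Before committing to XSLT, the retrieval logic could be prototyped as below. This is a sketch with hypothetical names (`ngrams_with_translations`), run on a cut-down, un-namespaced version of the "ntasaa xeen" example from another issue. It treats the direct <w> children of each <seg> or <s> as the token sequence and attaches a translation only when a <span> targets exactly that sequence of @xml:id's.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<div>
  <seg xml:id="L147-07-06" type="S">
    <w xml:id="d1e830">ra</w>
    <w xml:id="d1e832">ntasaa</w>
    <w xml:id="d1e834">xeen</w>
    <w xml:id="d1e836">tina</w>
  </seg>
  <spanGrp type="translation">
    <span target="#d1e832 #d1e834" xml:lang="en">furious</span>
    <span target="#d1e832 #d1e834" xml:lang="es">furioso</span>
  </spanGrp>
</div>"""


def ngrams_with_translations(root, n):
    """Return (forms, translations) for each run of n consecutive <w>'s."""
    translations = {}
    for span in root.iter("span"):
        if span.text:  # gram spans carry no text and are skipped
            ids = tuple(ref.lstrip("#") for ref in span.get("target", "").split())
            translations.setdefault(ids, []).append(span.text)
    results = []
    for unit in root.iter():
        if unit.tag not in ("seg", "s"):
            continue
        words = unit.findall("w")
        for start in range(len(words) - n + 1):
            gram = words[start:start + n]
            ids = tuple(w.get(XML_ID) for w in gram)
            forms = tuple("".join(w.itertext()).strip() for w in gram)
            results.append((forms, translations.get(ids, [])))
    return results


for forms, glosses in ngrams_with_translations(ET.fromstring(SAMPLE), 2):
    print(forms, glosses)
```

Calling the function with n=3 gives the trigram variant.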

Normalize annotations of Utterances (remove pointers to phonetic forms)

In order to make the annotations of the spoken language content (<u>) files consistent with those of the text sources, it is necessary to:

- remove double pointers in utterance files, pointing only to orthographic forms;
- add @sameAs on phonetic transcription <seg>'s and <w>'s

     <u xml:id="d1e37" n="2" start="0" end="0.68">
        <seg xml:id="d1e38" function="utterance" notation="orth" type="phrase" sameAs="#d1e41">
           <w xml:id="d1e39" synch="#T1">kui'<c xml:id="d1e40">i̠</c></w>
        </seg>
        <seg xml:id="d1e41" function="utterance" notation="ipa" type="phrase" sameAs="#d1e37">
           <w xml:id="d1e42" synch="#T1">
              <c>k</c>
              <c>w</c>
              <c>ḭ</c>
              <c function="tone">H</c>
              <c>ʔ</c>
              <c xml:id="d1e45">iː</c>
              <c function="tone" xml:id="d1e46" synch="#d1e45">R_F</c>
           </w>
        </seg>
     </u>

However, a major question to be resolved is that:

this will cause a loss of information in terms of connecting tone to the grammatical & morpho-semantic info it expresses, e.g. the ability to point to @d1e40 (i.e. "i̠") and @d1e45 (i.e. "iːR_F") in order to label these subsegments as 1st person singular:

    <u xml:id="d1e37" n="2" start="0" end="0.68">
       <seg xml:id="d1e38" function="utterance" notation="orth" type="phrase">
          <w xml:id="d1e39" synch="#T1">kui'<c xml:id="d1e40">i̠</c></w>
       </seg>
       <seg xml:id="d1e41" function="utterance" notation="ipa" type="phrase">
          <w xml:id="d1e42" synch="#T1">
             <c>k</c>
             <c>w</c>
             <c>ḭ</c>
             <c function="tone">H</c>
             <c>ʔ</c>
             <c xml:id="d1e45">iː</c>
             <c function="tone" xml:id="d1e46" synch="#d1e45">R_F</c>
          </w>
       </seg>
    </u>
    <spanGrp type="gram">
       <span type="phrase" target="#d1e38 #d1e41" ana="#NP #POSS"/>
       <span type="pos" target="#d1e39 #d1e42" ana="#N"/>
       <span type="morph" target="#d1e40 #d1e46" ana="#TONE"/>
       <span type="person" target="#d1e40 #d1e46" ana="#1PERS"/>
       <span type="number" target="#d1e40 #d1e46" ana="#SG"/>
    </spanGrp>

Finish updating project feature structures according to MAF, merge duplicates

The inventories of tags with which the data is being annotated, and which in the larger picture represent the grammatical and lexical characteristics asserted (and/or accepted) about the MIX language by this study, are stored in the following feature structure files:

(1) & (2) contain the general morphological, semantic, and other theoretical lexical features and (3) contains the phonetic and phonological features. All of these files for the MIX data are currently incomplete, and the first two documents should be merged into one and the other deleted.

All the contents in these files should be modified to be in accordance with the standards defined in: ISO 24611:2012 Language resource management -- Morpho-syntactic annotation framework (MAF).
ISO_FDIS_24611_(E)_MAF.pdf

All of the contents should be formatted as follows:

             <fs>
               <f name="place-of-articulation">
                  <vAlt>
                     <symbol value="bilabial" xml:id="bilab"/>
                     <symbol value="linguo-labial" xml:id="linguo-lab"/>
                     <symbol value="labio-dental" xml:id="lab-dent"/>
                     <symbol value="dental" xml:id="dent"/>
                     <symbol value="alveo-dental" xml:id="alv-dent"/>
                     <symbol value="alveolar" xml:id="alv"/>
                     <symbol value="post-alveolar" xml:id="post-alv"/>
                     <symbol value="retroflex" xml:id="retro-flx"/> <!-- not in MIX -->
                     <symbol value="palatal" xml:id="palat"/>
                     <symbol value="velar" xml:id="velar"/>
                     <!-- shares similar acoustic and articulatory properties with velar (tongue position,location) and bi-labials (rounding);  eventually point to, or refine feature definition by connecting to acoustic  (and articulatory features) eg. 'velar pinch' F3 path...etc... -->
                     <symbol value="uvular" xml:id="uvlr"/>
                     <symbol value="pharyngeal" xml:id="phryngl"/><!-- not in MIX -->
                     <symbol value="glottal" xml:id="glot"/>
                     <symbol value="epiglottal" xml:id="epiglot"/> <!-- not in MIX -->
                  </vAlt>
               </f>
            </fs>

The values of @xml:id are the tags that are used in the annotation of the entirety of the Mixtec corpus.

The final output will contain both the declarations for all features used and prose descriptions of the features with examples from the data. This prose will serve as a canonical source for the most basic description of the language's features and can eventually be accompanied by a CSS to make it human readable and schemas which can extract the contents for re-use in prose.

Decide on linguistic level for default transitivity in <gramGrp>

In assigning transitivity to a lemma, it needs to be decided (and thus declared) whether this refers to semantic or syntactic transitivity.
<gramGrp> <gram type="transitivity">trans</gram> </gramGrp>

This is important because often many verbs which are semantically transitive can be used syntactically as intransitives (i.e. without an object/undergoer argument).
e.g.
I'm drinking water (syntactically and semantically transitive)
vs
I'm drinking (syntactically intransitive and semantically transitive)

Since it is tagged in <gramGrp>, it would imply syntactic transitivity; if the verb can also occur as syntactically intransitive (or vice versa), this would have to be encoded in sense, which poses two problems:

a) an alternate syntactic transitivity doesn't necessarily change the sense (e.g. there is no difference in the core meaning of the word "drink" in the two sentences above)

b) which tag to use to encode this? (?)

Write script to retrieve all targets of "unknown"

Enhance 'Bichos' document with semantic links to KB about the spider species described

Connect English and Spanish translations to Wiktionary entries for given lemmata

(expand on idea with details)

Translation of Mixtec incompletive into Spanish

Should the present (incompletive) in Mixtec always be translated to Spanish as both the gerund and present indicative?

Decide on how to classify translation components

In Mixtec, complex phrases comprising multiple verbs ("I want to eat", "I have to go", etc.) have the personal argument on both verbs, e.g.
kuni yu kusu yu
want 1s sleep 1s
<w xml:id="d1e1667">kuni</w> <w xml:id="d1e1669">ta</w> <w xml:id="d1e1671">yu</w> <w xml:id="d1e1673">kusu</w> <w xml:id="d1e1675">yu</w>
When translating, I have been rendering whole phrases with their translation equivalents, but I have also been translating the components separately. This adds to my inventory in case there are no other instances of the given verb, or none with the particular person or in the given tense, etc.

The question is:

Should I use a @type on these translations (e.g. type="literal")?

How to encode compounds, multi-word expressions, and inflected verbs (with separate pronoun)

In annotating compounds, as well as multi-word expressions (in which there is whitespace in between the components) I am not clear on what principles to use in the encoding & standoff annotation.

Thus far I have been doing a mixture of leaving the standoff in certain cases and wrapping in an extra <w> in others.

Up to this point, I have been consistently wrapping certain compounds, including numbers, proper nouns, and other items which I think are lexicalized concepts even though they are written separately, e.g.:

oko in 'twenty one'
<w xml:id="d1e148"><w>oko</w> <w>in</w></w>

Ñu'u Ncha'i '(planet) Earth'
<w xml:id="d1e535"><w>Ñu'u</w> <w>Ncha'i</w></w>

In annotating the translations for these, I point to the @xml:id of the highest level <w>

```
<w xml:id="d1e535"><w>Ñu'u</w> <w>Ncha'i</w></w>
<span target="#d1e535" xml:lang="en">Earth</span>
<span target="#d1e535" xml:lang="es">Tierra</span>
```

While these wrappers clearly mark the structure of a compound (a single lexical unit) and allow for further annotation of its components if desired, they also make searching more complicated in XQuery (as in issue #44, with the decision as to whether to use <m> within <w>).
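To make the search cost concrete, here is a minimal Python sketch (not the project's XQuery) of what a lookup has to do once compounds are nested. The `<seg>` wrapper and the helper names `surface` and `find_tokens` are assumptions for illustration, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: one wrapped compound and one plain token.
SAMPLE = """<seg>
  <w xml:id="d1e535"><w>Ñu'u</w> <w>Ncha'i</w></w>
  <w xml:id="d1e4229">Tsa'a</w>
</seg>"""

def surface(w):
    """Return the text of a <w>, including nested <w> children, with whitespace normalised."""
    return " ".join("".join(w.itertext()).split())

def find_tokens(seg, query):
    """Return xml:id values of identified <w> elements whose full surface form equals query."""
    return [w.get(XML_ID) for w in seg.iter("w")
            if w.get(XML_ID) is not None and surface(w) == query]

seg = ET.fromstring(SAMPLE)
print(find_tokens(seg, "Ñu'u Ncha'i"))
```

The point of the sketch is that a plain comparison against an element's own text no longer works for the wrapped compound: the search has to rebuild the surface string from the descendants first.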

However, for no specific reason, I have not wrapped less concrete items which are also lexicalized, such as:

tsa'a ña 'because' (which is a combination of tsa'a 'foot' + ña 'that/which,'...)
thus this remains:

            `<w xml:id="d1e4229">Tsa'a</w>
              <w xml:id="d1e4231">ña</w>`

...and in the standoff annotation I translate them by pointing at both parts:

              `
                <span target="#d1e4229 #d1e4231" xml:lang="en">because</span>
                <span target="#d1e4229 #d1e4231" xml:lang="es">por causa de</span>
              `
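For comparison, a minimal Python sketch of how a multi-target span like this could be resolved back to its source tokens. The `<div>` and `<spanGrp>` containers and the function name `resolve_spans` are assumptions about layout made for the example, not a description of the actual files.

```python
import xml.etree.ElementTree as ET

# Expanded names that ElementTree uses for xml:id and xml:lang.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical layout: tokens and their standoff spans under one parent.
SAMPLE = """<div>
  <seg>
    <w xml:id="d1e4229">Tsa'a</w>
    <w xml:id="d1e4231">ña</w>
  </seg>
  <spanGrp>
    <span target="#d1e4229 #d1e4231" xml:lang="en">because</span>
    <span target="#d1e4229 #d1e4231" xml:lang="es">por causa de</span>
  </spanGrp>
</div>"""

def resolve_spans(root):
    """Return (source text, language, translation) for every standoff <span>."""
    by_id = {w.get(XML_ID): w for w in root.iter("w") if w.get(XML_ID)}
    rows = []
    for span in root.iter("span"):
        # A target may list several pointers separated by whitespace.
        ids = [t.lstrip("#") for t in span.get("target", "").split()]
        source = " ".join("".join(by_id[i].itertext()).strip() for i in ids if i in by_id)
        rows.append((source, span.get(XML_LANG), span.text))
    return rows

rows = resolve_spans(ET.fromstring(SAMPLE))
```

With this approach the tokens stay flat and searchable, and the grouping lives only in the `target` list, at the cost of one lookup per pointer.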

Additionally, there are items for which it is not clear whether or not they are lexicalized. Given that ambiguity, a policy of wrapping everything I think of as a single lexical item (compound or multi-word expression) creates the possibility for inconsistency, e.g.:

The word:
chi kuchi 'north' (also chi ninu 'south')

currently encoded as:
<w xml:id="d1e252"><w>chi</w> <w>kuchi</w></w>

is also somewhat problematic, since in a sentence we have:
chi kuchi tsi ninu '...north and south'

Here the sentence refers to north and south, but the "chi" found in both items is split up. (I don't think "chi" has meaning out of context, though it has some directional association.) It occurs naturally in front of "kuchi" (north) but is separated from "ninu" (south), so if I were to group with <w>, the wrapper could only cover the first word and not the second.
e.g.

                     .....
                     <w xml:id="d1e252"><w>chi</w>
                     <w>kuchi</w></w>
                     <w xml:id="d1e254">tsi</w>
                     <w xml:id="d1e256">ninu</w><!-- should be: "chi ninu" -->

The fact that they (SIL, in the booklets) split up the portions this way raises the question of whether the degree of lexicalization is really that advanced; perhaps it shouldn't be considered a fully lexicalized compound. This makes deciding on a supplementary encoding difficult.

Concluding thoughts

The conflict arises from the following factors:

the desire to be able to search for a word with a simple string; vs

the desire to group compounds in a single `<w>` for easier translation and extraction; vs

the need to avoid grouping components of what for sure is translated as a single item but which may not be lexicalized in the minds of the speakers; vs

the need to be consistent!

Therefore, the question is: what should my <w> encoding policy be?

→ Should I be wrapping all, potentially only certain, or none of these in a common <w>?

→ If yes (to wrapping any) should I consider (in order to facilitate easier string searches) converting all instances of <w><w> into a single <w>?

e.g.:
<w xml:id="d1e248">Ñu'u Ncha'i</w>
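If the conversion were done, it could look roughly like the following Python sketch (an XSLT version would be the more natural fit for this repo; Python is used here only to keep the example short and runnable). The function name `flatten_nested_w` and the `<seg>` wrapper are assumptions.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def flatten_nested_w(root):
    """Collapse every <w> that contains child <w> elements into a single text-only <w>."""
    # Materialise the list first: the loop removes children while walking the tree.
    for w in list(root.iter("w")):
        if w.find("w") is not None:
            text = " ".join("".join(w.itertext()).split())
            for child in list(w):
                w.remove(child)
            w.text = text
    return root

# Hypothetical input: the wrapped form of the compound shown above.
seg = ET.fromstring("""<seg><w xml:id="d1e248"><w>Ñu'u</w> <w>Ncha'i</w></w></seg>""")
flat = flatten_nested_w(seg).find("w")
print(flat.text)  # Ñu'u Ncha'i
```

Note that flattening this way discards any @xml:id or annotation carried by the inner <w> elements, so it trades component-level annotation for simpler string searches.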

Complete translations for all Mixtec content

The first priority moving forward in the annotation should be completing the translations. This gives the next stages a better basis for completion and provides a significant benchmark in the work sooner than trying to complete everything at once.

(finish)
Stages:

Utterances
SIL Documents
Other