In annotating compounds, as well as multi-word expressions (in which the components are separated by whitespace), I am not clear on what principles to follow in the encoding and standoff annotation.
Thus far I have been doing a mixture: relying on the standoff annotation alone in certain cases, and wrapping the words in an extra <w> in others.
Up to this point, I have been consistently wrapping certain compounds, including numbers, proper nouns, and other items that I consider lexicalized concepts even though they are written as separate words, e.g.:
oko in 'twenty one'
<w xml:id="d1e148"><w>oko</w> <w>in</w></w>
Ñu'u Ncha'i '(planet) Earth'
<w xml:id="d1e535"><w>Ñu'u</w> <w>Ncha'i</w></w>
In annotating the translations for these, I point to the @xml:id of the highest-level <w>:
`<w xml:id="d1e535"><w>Ñu'u</w> <w>Ncha'i</w></w>`
```
<span target="#d1e535" xml:lang="en">Earth</span>
<span target="#d1e535" xml:lang="es">Tierra</span>
```
While this is good for making the structure of a compound (a single lexical unit) explicit, and allows for further annotation of its components if desired, it also makes searching more complicated in XQuery (as in issue #44, with the decision as to whether to use <m> within <w>).
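To make the search complication concrete, here is a minimal sketch in Python (standing in for XQuery purely for illustration). It assumes a simplified fragment with no TEI namespace and a hypothetical `<s>` wrapper; `word_text` and `find_words` are illustrative helper names, not part of any existing tooling.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: one wrapped compound and two unwrapped words.
SAMPLE = """<s>
  <w xml:id="d1e535"><w>Ñu'u</w> <w>Ncha'i</w></w>
  <w xml:id="d1e4229">Tsa'a</w>
  <w xml:id="d1e4231">ña</w>
</s>"""


def word_text(w):
    """Text of a <w>, including any nested <w>, with whitespace collapsed."""
    return " ".join("".join(w.itertext()).split())


def find_words(root, query):
    """Return the xml:id of each top-level <w> whose full text equals query."""
    return [w.get(XML_ID) for w in root.findall("w") if word_text(w) == query]


root = ET.fromstring(SAMPLE)
hits = find_words(root, "Ñu'u Ncha'i")   # ['d1e535']
misses = find_words(root, "Tsa'a ña")    # [] because the two words are not wrapped
```

The wrapped compound is found only because the search concatenates the text of the nested `<w>` elements first; the unwrapped expression is not found by a search over single `<w>` elements at all.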
However, for no specific reason, I have not wrapped less concrete lexical items that are also lexicalized, such as:
tsa'a ña 'because' (which is a combination of tsa'a 'foot' + ña 'that/which,'...)
thus this remains:
`<w xml:id="d1e4229">Tsa'a</w>
<w xml:id="d1e4231">ña</w>`
...and in the standoff annotation I translate them by pointing at both parts, as follows:
```
<span target="#d1e4229 #d1e4231" xml:lang="en">because</span>
<span target="#d1e4229 #d1e4231" xml:lang="es">por causa de</span>
```
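Both pointing styles (one target on a wrapping `<w>`, or several targets on separate `<w>` elements) can be resolved the same way when extracting translations. The following is a rough sketch under assumptions: the `<doc>`, `<s>` and `<spanGrp>` wrappers are hypothetical, the TEI namespace is omitted, and `resolve_spans` is an illustrative name.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"
XML_ID = XML_NS + "id"
XML_LANG = XML_NS + "lang"

# Hypothetical document combining the two cases described above.
SAMPLE = """<doc>
  <s>
    <w xml:id="d1e535"><w>Ñu'u</w> <w>Ncha'i</w></w>
    <w xml:id="d1e4229">Tsa'a</w>
    <w xml:id="d1e4231">ña</w>
  </s>
  <spanGrp>
    <span target="#d1e535" xml:lang="en">Earth</span>
    <span target="#d1e4229 #d1e4231" xml:lang="en">because</span>
  </spanGrp>
</doc>"""


def resolve_spans(root, lang):
    """Pair each <span> in the given language with the text of its target <w>s."""
    by_id = {w.get(XML_ID): w for w in root.iter("w") if w.get(XML_ID)}
    pairs = []
    for span in root.iter("span"):
        if span.get(XML_LANG) != lang:
            continue
        # @target is a whitespace-separated list of "#id" pointers.
        ids = [t.lstrip("#") for t in span.get("target", "").split()]
        texts = [" ".join("".join(by_id[i].itertext()).split()) for i in ids]
        pairs.append((" ".join(texts), span.text))
    return pairs


pairs = resolve_spans(ET.fromstring(SAMPLE), "en")
# [("Ñu'u Ncha'i", 'Earth'), ("Tsa'a ña", 'because')]
```

Under this kind of extraction, the choice between wrapping and multi-target pointing makes no difference to the translation output; the difference shows up only in searching and in what the encoding claims about lexicalization.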
Additionally, there is the issue of items for which it is not clear whether or not they are lexicalized. Given that ambiguity, adopting a policy of wrapping everything I think of as a single lexical item (compound or multi-word expression) creates the possibility of inconsistency, e.g.:
The word:
chi kuchi 'north' (also chi ninu 'south')
currently encoded as:
<w xml:id="d1e252"><w>chi</w> <w>kuchi</w></w>
is also somewhat problematic, since in one sentence we have:
chi kuchi tsi ninu '...north and south'
This sentence refers to north and south, but the "chi" that belongs to both items (which I don't think has meaning out of context, though it has some directional association) appears only once: it occurs naturally in front of "kuchi" (north) but is separated from "ninu" (south). So if I were to group the <w> elements, I would only be able to wrap the first item but not the second.
e.g.
.....
<w xml:id="d1e252"><w>chi</w>
<w>kuchi</w></w>
<w xml:id="d1e254">tsi</w>
<w xml:id="d1e256">ninu</w><!-- should be: "chi ninu" -->
The fact that they (SIL, in the booklets) do this (i.e. split up the parts) raises the question of whether the degree of lexicalization is really high enough for this to be considered a fully lexicalized compound. This makes deciding on a supplementary encoding difficult.
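For comparison, the multi-target pointing already used for tsa'a ña can express the shared "chi" where wrapping cannot. This is only a sketch of one possibility, not a recommendation: it assumes every token is left as a flat `<w>`, and the ids `w1` to `w4`, the wrapper elements, and the `span_report` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical flat encoding of "chi kuchi tsi ninu" with made-up ids.
SAMPLE = """<doc>
  <s>
    <w xml:id="w1">chi</w>
    <w xml:id="w2">kuchi</w>
    <w xml:id="w3">tsi</w>
    <w xml:id="w4">ninu</w>
  </s>
  <spanGrp>
    <span target="#w1 #w2" xml:lang="en">north</span>
    <span target="#w1 #w4" xml:lang="en">south</span>
  </spanGrp>
</doc>"""


def span_report(root):
    """For each <span>, give (translation, source text, targets are adjacent?)."""
    words = list(root.iter("w"))
    position = {w.get(XML_ID): i for i, w in enumerate(words)}
    report = []
    for span in root.iter("span"):
        ids = [t.lstrip("#") for t in span.get("target", "").split()]
        idx = [position[i] for i in ids]
        # Adjacent targets could also be wrapped in one <w>; others could not.
        contiguous = idx == list(range(idx[0], idx[0] + len(idx)))
        report.append((span.text, " ".join(words[i].text for i in idx), contiguous))
    return report


report = span_report(ET.fromstring(SAMPLE))
# [('north', 'chi kuchi', True), ('south', 'chi ninu', False)]
```

The second span is the case that a wrapping `<w>` cannot represent, because its two targets are not adjacent in the text.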
Concluding thoughts
The conflict stems from the following factors:
- the desire to be able to search for a word with a simple string; _vs_
- the desire to group compounds in a single `<w>` for easier translation and extraction; _vs_
- the need to avoid grouping components of what is certainly translated as a single item but which may not be lexicalized in the minds of the speakers;
- the need to be consistent!
Therefore, the question is: what should my <w> encoding policy be?
→ Should I be wrapping all, only certain, or none of these in a common <w>?
→ If yes (to wrapping any), should I consider converting all instances of <w><w> into a single <w>, in order to facilitate easier string searches?
e.g.:
<w xml:id="d1e248">Ñu'u Ncha'i</w>
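If the conversion route were chosen, it could be done mechanically. Below is a minimal sketch, assuming no TEI namespace and a hypothetical `<s>` wrapper; `flatten_nested_w` is an illustrative name, and the approach discards any attributes on the inner `<w>` elements.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment with one nested compound and one plain word.
SAMPLE = (
    '<s><w xml:id="d1e535"><w>Ñu\'u</w> <w>Ncha\'i</w></w> '
    '<w xml:id="d1e4229">Tsa\'a</w></s>'
)


def flatten_nested_w(root):
    """Replace the nested <w> children of each wrapping <w> with plain text."""
    # Collect the wrappers first so the tree is not modified while iterating.
    wrappers = [w for w in root.iter("w") if w.find("w") is not None]
    for w in wrappers:
        text = " ".join("".join(w.itertext()).split())
        for child in list(w):
            w.remove(child)
        w.text = text
    return root


root = flatten_nested_w(ET.fromstring(SAMPLE))
first = root.find("w")
# first.text is now "Ñu'u Ncha'i" and first has no child elements
```

The outer `@xml:id` survives, so existing standoff pointers would keep working, but the ability to annotate the individual components is lost, which is the trade-off behind the question above.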