gregorio-project / hyphen-la

Latin ecclesiastic hyphenation patterns and resources

Home Page: http://gregorio-project.github.io/hyphen-la/

License: MIT License

Python 2.70% Makefile 0.27% TeX 58.42% Lua 37.26% Shell 1.36%

hyphenation latin

hyphen-la's Issues

obscurus

In my general review of words beginning with ob, I have indicated the hyphenation ob-scurus (plus derivatives and compounds).

There is a dilemma here about whether etymology or phonetics should take priority. The case is different from that of the preservation of the group sc + e (and others), as in di-scedo, for which we gave priority to phonetics. Here, in fact, the sound is hard and we therefore pronounce obs-curus regardless of the place of the hyphen.

So it might be better to correct patterns?

abstulit - ábstulit

abstulit is correctly hyphenated as abs-tu-lit, but the accented word is hyphenated as ábstu-lit, while Claudio's file, line 197, gives the correct hyphenation.
How can we solve the problem? Just copy the accented word into one of the 'Solesmes' accented files? Or should we group all the words in a single file to track duplicates?

pseudo and eu

A revision of the patterns around the diphthong eu, especially in words beginning with pseudo is necessary.
For now, the tests are wrong on the following words:

differences in proofreading-Solesmes-3.txt:
leucaspis 		% leu-cas-pis  (not le-u-cas-pis)
neurospastos 		% neu-ros-pas-tos  (not ne-u-ros-pas-tos)
pheuxaspidion 		% pheu-xas-pi-di-on  (not phe-u-xas-pi-di-on)

differences in proofreading-Solesmes-3-accents.txt:
result without accent (correct) / result with accent (incorrect)
pseudósphex 		% pseu-do-sphex  (not pse-u-dó-sphex)

I will complete the list with all the words beginning with pseudo. But searching for all the words containing the diphthong eu would take too long. The words already present will suffice for the moment, unless someone has a list…
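A minimal sketch of how the report lines quoted above could be parsed to pick out the cases where the diphthong eu is split. The line shape (word, `%`, expected form, then `(not …)`) is inferred from the excerpts only; the function names are hypothetical and not part of the repository.

```python
import re

# Assumed line shape, taken from the report excerpts above:
#   word <whitespace> % expected  (not actual)
DIFF_LINE = re.compile(
    r"^(?P<word>\S+)\s*%\s*(?P<expected>\S+)\s+\(not\s+(?P<actual>\S+)\)\s*$"
)

def parse_diff_line(line):
    """Return (word, expected, actual), or None if the line is not a diff entry."""
    match = DIFF_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("word"), match.group("expected"), match.group("actual")

def splits_eu(expected, actual):
    """True if the expected form keeps 'eu' together but the actual one has 'e-u'."""
    return "eu" in expected and "e-u" in actual
```

Header lines such as "differences in proofreading-Solesmes-3.txt:" do not match and yield `None`, so a report can be filtered line by line.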

Ephraim, Ephraïm

Ephraim syllabifies to: E()ph()raim() or E()ph()ra()ïm() when given with a diaeresis over the i.

"ph" cannot be a syllable by itself because it has no vowel. I'm not sure whether the syllabification should be E()phra()im or Eph()ra()im

End of word hyphen char in syllabifier

It would be incredibly useful for Gregorio users if there was an option in the syllabifier to add the hyphen char to the end of every word, and not just between the syllables inside of words. If one is prepping a text for gabc, then one needs the () to appear after every syllable, not just inside words.
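To illustrate the requested option, here is a small sketch that works on words already split into syllables. The function name and its `end_of_word` parameter are hypothetical, and this is not the syllabifier's actual interface.

```python
def mark_syllables(words, hyphen_char="()", end_of_word=True):
    """Join pre-split syllables; optionally add the hyphen char after the last one.

    `words` is a list of syllable lists, e.g. [["ti", "bi"], ["rex"]].
    """
    marked = []
    for syllables in words:
        joined = hyphen_char.join(syllables)
        if end_of_word:
            # The requested behaviour: also close the final syllable of each word.
            joined += hyphen_char
        marked.append(joined)
    return " ".join(marked)
```

With `end_of_word=False` the sketch reproduces the behaviour described as current (markers only inside words).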

abiens

abiens derives from ab + ire. It should be hyphenated ab-i-ens.

prostrato

The word prostrato (and the accented prostráto) is hyphenated pro-s-tra-to instead of pro-stra-to.

accented vs non-accented words

Several words are hyphenated differently if they are accented or not:

per-é-git vs pe-re-git
dé-pe-rit vs de-per-it
ás-ti-tit vs a-sti-tit
ré-da-met vs red-a-met

In the first case, it is the version with accent that is good, in the other three cases, it is the version without accent. It will also be necessary to see other forms of words.
I will explore those differences as soon as possible.
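One way to find such pairs automatically is to strip the acute accent from the accented result and compare it with the unaccented one. The sketch below assumes both results are available as hyphenated strings; the helper names are hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove combining acute accents (U+0301) and recompose the string."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", stripped)

def same_breaks(accented, plain):
    """True if the accented hyphenation matches the plain one once accents are removed."""
    return strip_accents(accented) == plain
```

For example, `same_breaks("per-é-git", "pe-re-git")` is false, which flags the pair for manual review; the check cannot say which of the two versions is the right one.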

Iulus, Iuleus

The proper noun Iulus and the adjective Iuleus have a vocalic I. They should be hyphenated I-u-lus and I-u-le-us with the liturgical patterns.

su + vowel

My Latin grammar states that an unstressed u preceded by s and followed by a vowel is semivocalic, e.g. Suēbi. Another grammar that I have consulted does not mention this rule but states that u is semivocalic as an exception in the words suāvis, suādeō, suēscō, Suēbi.

liturgical-hyphenation.md does not mention these cases and the current liturgical patterns treat u as a full vowel in all of these words. This should probably be changed.

a few patterns to test

Some words are wrongly hyphenated in the non-liturgical patterns; they should be tested with the liturgical patterns too:

Arachne
cedrus
corruptrix
currus
Daphne
genuit
longaevus
neglego
nescio
animadverto
Ioseph
rhythmicus
rhythmus

potuere

The word potuere (from "possum") should be hyphenated like this:

pot-u-e-re

but for now it's

po-tu-e-re.

I'm not very good at modifying the patterns to obtain the correct result. @eroux, can you tell me how to do this?

Broken Case

Laus tibi, Domine, rex aeternae gloriae.

Goes to:

Laus() ti()bi,() Do()mi()ne,() rex() a()e()ter()na()e() glo()ri()a()e.()

Of course, in Aeternae and gloriae, there are additional syllables.

Compounds of iacio

There are two orthographic variants of the compounds of iacio: a classical one ending in -icio and a medieval one ending in -jicio. A variant ending in -iicio is not in use.
It should be taken care that the liturgical patterns as well as the test files wordlist-liturgical.txt and wordlist-liturgical-accents.txt take into account both variants of words like abicio, adicio, eicio, inicio, obicio, subicio etc.
Currently, the patterns are fine, but the test files contain many forms with the wrong -iicio spelling.
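A rough heuristic for locating the suspect entries: flag every entry whose spelling contains the letter sequence iic once hyphens are removed. This is only a sketch; it assumes one (possibly hyphenated) word per entry, and any hit still needs to be checked by hand, since the sequence is not guaranteed to occur only in compounds of iacio.

```python
def suspect_iicio(entries):
    """Return entries whose spelling (hyphens removed) contains 'iic'."""
    return [entry for entry in entries
            if "iic" in entry.replace("-", "").lower()]
```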

arguet => ar-guet

I just noticed that arguet is getting syllabified as two syllables: ar-guet instead of ar-gu-et.

some errors

In the last corrections, I added two words for the examples which are, for the moment, badly hyphenated.
For memory:
redarius % re-da-ri-us (not red-a-ri-us)
redauspico % red-au-spi-co (not red-aus-pi-co)

I'll take care of it soon.

hyphen-la demo gives caelis three syllables. The Latin dictionary says it's just two.

I would assume caelis (of all the words to not get right) is correct, but the dictionary shows the pronunciation having two syllables. hyphen-la shows three.

Punctuation

So, in using the syllabifier I've noticed that punctuation ends up on the wrong side of the hyphen character when the end of word option is used. I.e. puer. becomes pu-er-. This is less than desirable (at least for me) and so I'm looking to fix it.

In exploring how to fix this I first found the regex module which defines a character class based on the unicode punctuation category: \p{P} In my admittedly limited testing using import regex as re and wordregex = re.compile(r'\b[^\W\d_]+\b\p{P}*') does fix the problem. However, I've also discovered that regex is not installed by default on all systems (indeed, I had to install it myself in order to do this testing). (I should also note that this introduces an approximately 25% performance hit from the current state.)

Having asked on StackExchange, I've gotten two solutions that will stick with using re.

The first involves building the \p{P} class manually using the unicodedata module (which is part of the standard library). This involves a performance hit, because of the iteration over the entire Unicode character map, but my testing says that this hit, while large in relative terms (~6x slower than regex, ~8x slower than current), is small in absolute terms (less than half a second to define the class).

The second approximates \p{P} with [^\w\s]. I don't think these are entirely convertible as, for one, I think symbols (+=$ etc.) are not punctuation but are neither word nor space characters either (i.e. they are not in \p{P} but are in [^\w\s]). Such characters, however, are probably fairly rare in chant lyrics, aren't they? (Performance wise this is indistinguishable from current behavior.)
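The difference between the two re-based options can be sketched as follows. The helper name is hypothetical and the code is an illustration, not the version proposed on StackExchange: it builds a character class from every code point whose Unicode general category starts with P, and compares it with the [^\w\s] approximation.

```python
import re
import sys
import unicodedata

def punctuation_class():
    """Build a regex character class matching every Unicode punctuation character."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1)
             if unicodedata.category(chr(cp)).startswith("P"))
    # re.escape protects characters such as ']' or '-' inside the class.
    return "[" + "".join(re.escape(c) for c in chars) + "]"

PUNCT = re.compile(punctuation_class())  # manual equivalent of \p{P}
APPROX = re.compile(r"[^\w\s]")          # approximation

# Symbols are where the two differ: '+' has category Sm, not P.
plus_is_punct = PUNCT.fullmatch("+") is not None
plus_is_approx = APPROX.fullmatch("+") is not None
```

Here `plus_is_punct` is false while `plus_is_approx` is true, which is the divergence described above; for ordinary marks such as `.` or `,` both classes agree.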

So, I have a question: which solution should I pursue:

1. use the not-installed-by-default module (regex),
2. take the performance hit, or
3. approximate the punctuation category?

I should also point out that I can use try ... except to combine (1) with either (2) or (3). In that case regex would be used if available but its absence would not be fatal. Doing so does not noticeably affect performance beyond what the individual solution that ends up being used does.
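The try ... except combination could look roughly like this. It is a sketch that pairs the regex module with the [^\w\s] approximation as the fallback; `add_end_marker` is a hypothetical helper, not a function of the syllabifier.

```python
try:
    import regex as re          # third-party; not installed by default
    PUNCT = r"\p{P}"            # true Unicode punctuation category
except ImportError:
    import re                   # standard library fallback
    PUNCT = r"[^\w\s]"          # approximation: also matches symbols like + = $

# A word (letters only) followed by any trailing punctuation.
WORD_RE = re.compile(r"\b[^\W\d_]+\b" + PUNCT + "*")

def add_end_marker(text, marker="-"):
    """Append the marker after each word, keeping trailing punctuation before it."""
    return WORD_RE.sub(lambda m: m.group(0) + marker, text)
```

Because the trailing punctuation is part of the match, the marker lands after it: `add_end_marker("puer. Laus tibi,", "()")` gives `puer.() Laus() tibi,()` with either module.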

Format-specific tools

If someone wants to build a format-specific hyphenator, e.g. for lilypond or gabc, do you prefer to put it here in the one package, or to make hyphen-la an installable Python library and make the format-specific tools separate applications that depend on it?

obsoletus/obsolesco/obsoleficio

These words derive from ob(s) and alo. The hyphenation should be obs-o-le-tus/obs-o-les-co/obs-o-le-fi-ci-o.

ngu

Is there any rule for distinguishing vocalic and semivocalic u in words with ngu + vowel? liturgical-hyphenation.md states that “langueo” and “languesco” have a semivocalic u, but “langui” and “languere” a vocalic one.
The patterns yield “lan-gue-o”, “lan-gues-co”, “lan-gue-re” and “lan-gu-i”.

inuleus

inuleus is a variant of hinnuleus. As in is no prefix here, the hyphenation should be i-nu-le-us instead of in-u-le-us.

perendie

The etymology of perendie is uncertain. So it should be hyphenated according to the general rules: pe-ren-di-e.

abs-

The following hyphenation points are missing when using the liturgical patterns:
ab-stare, ab-sto, abs-tentus (from abstineo), ab-stiti (from absisto)

The following hyphenation points are incorrect:
absu-mpsi, absu-msi, absu-mptus (all from absumere)

tenuia/tenuior

If the general rule applies, tenuia (neuter plural) and tenuior (comparative), both derived from tenuis, have a semi-vocalic i. So the hyphenations should be te-nu-ia and te-nu-ior instead of te-nu-i-a and te-nu-i-or.

Hyphenation of Greek compounds

The documentation of the liturgical patterns should make clearer how Greek compounds are to be hyphenated.
From the example a-pos-to-los one might guess that Greek prefixes (in this case apo-) are ignored. In accordance with this, the patterns yield e-pis-co-pus (Greek prefix epi-) and the test file wordlist-liturgical.txt contains the following examples ignoring a Greek prefix:

e-pis-tra-te-gi-a
e-pis-tra-te-gus
a-pos-to-lo-rum
a-pos-phra-gis-ma
pseu-de-ne-drus
pseu-di-so-do-mos

On the other hand, the same test file contains the following entries, which only make sense from an etymological point of view:

e-pi-stro-phe
a-po-stro-pha
a-po-stro-phe
a-po-stro-phos
a-po-sple-nos
a-na-stro-phe
pseud-an-chu-sa
pseud-a-pos-to-lus
pseu-do-sma-rag-dus
pseu-do-sphex

In my opinion, this is not consistent.

indusium/indusio

According to the information I found in etymologic dictionaries, it is doubtful that indusium is derived from ind-uo. More likely is the derivation from Greek ἔν-δυσις/έν-δύω.
So it should be hyphenated in-du-si-um/in-du-si-o.

puleium, Gaius

According to the general rule for semi-consonantic i stated in liturgical-hyphenation.md, the i in puleium and Gaius is not a full vowel. The current liturgical patterns yield pu-le-i-um and Ga-i-us, but it should be pu-le-ium and Ga-ius. The result for Pompeius is already correct: Pom-pe-ius.

super-

Compound words should be proofread with the prefix super to check the place of the hyphenation before or after the r.

differences in proofreading-Solesmes-3.txt:
superaspergere 		% su-per-a-sper-ge-re  (not su-pe-ra-sper-ge-re)
superaspergo 		% su-per-a-sper-go  (not su-pe-ra-sper-go)

-ex-

The prefix ex is sometimes broken. A revision seems necessary especially when an additional prefix is added.

differences in proofreading-Solesmes-3.txt:
præexspectare 		% præ-ex-spec-ta-re  (not præ-e-x-spec-ta-re)
præexspecto 		% præ-ex-spec-to  (not præ-e-x-spec-to)
redexspectare 		% red-ex-spec-ta-re  (not re-de-x-spec-ta-re)
redexspecto 		% red-ex-spec-to  (not re-de-x-spec-to)

perire

Several forms beginning with per- have to be proofread in detail, as the work on accented and unaccented words (#38) shows for:

perieam
perieamus
perieant
perieas
perieatis
periee
perieim
perieimus
perieint
perieis
perieit
perieitis
perieo
perieunt
periur

Along the same lines, we have to review line 206 of the documentation about a doubtful homograph: perire.

Appropinquant

Appropinquant syllabifies as ap-pro-pin-qu-ant rather than as ap-pro-pin-quant

Prǽstet

Prǽstet ought to hyphenate as Prǽ-stet, the same way that Præstet hyphenates as Præ-stet.

However, it is currently hyphenating as Prǽs-tet

Idomeneus

Idomeneus should be hyphenated I-do-me-neus (so far: Id-o-me-ne-us). eu is a diphthong here and I see no prefix id in this name.

induo

The etymology of induo/induere is unclear. I recommend replacing ind-u-o with in-du-o, as there is no verb uo/uere. Even if induo is derived from ind, the following uo has to be considered as some kind of ending and not as the second part of a compound.

seditio/seditiosus

seditio is a compound with the elements se(d) and eo/itum/ire. The hyphenation should be sed-i-ti-o/sed-i-ti-o-sus.

prodeunt - pródeunt

The accented word pródeunt is hyphenated pró-de-unt instead of pród-e-unt, as the non-accented word which is already correctly hyphenated.

sp block

While working on question #23, I touched on a larger and more complex problem, that of the break around the sp block. I have done a vast search through the dictionary (Goelzer) of all the Latin words that contain this block. I only listed the input forms of the dictionary, except for the verbs for which I added the infinitive. That's what it gives:

about 930 words are concerned (the number of accented words is around 980)
approximately 500 words begin with sp (purely indicative since they are not concerned by hyphenation)
the number of errors in the tests with the complete list of words (accented or not) added is about 170
there are currently 52 patterns that concern sp

I propose to review all the patterns affecting the sp block to get the least possible errors. A search allowed me to estimate circa 70 patterns will be necessary instead of the 52 existing ones to get the most reliable results possible.
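Counting the patterns that concern a given block can be sketched like this, assuming TeX-style patterns in which digits mark break priorities between letters. The sample patterns in the usage note are made up for illustration and are not taken from the pattern files.

```python
def patterns_touching(patterns, block="sp"):
    """Select patterns whose letters (digits removed) contain `block`."""
    selected = []
    for pattern in patterns:
        # Drop the priority digits so that '1s2p' is seen as 'sp'.
        letters = "".join(ch for ch in pattern if not ch.isdigit())
        if block in letters:
            selected.append(pattern)
    return selected
```

For instance, from the invented list `["1s2p", "a1b", ".ab3s2t", "2s3ph"]` the function keeps `1s2p` and `2s3ph`.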

Existing resources

In order to avoid reinventing the wheel, it would be useful to have a list of existing resources (preferably open source) which implement Latin dictionary and/or hyphenation rules. Even if these aren't usable out of the box with gabc files, they might provide a data source that can be exploited.

The one I know about is the OpenOffice project which does have a Latin spell check dictionary with hyphenation patterns: http://extensions.openoffice.org/en/project/latin-spelling-and-hyphenation-dictionaries

TeXLive / Babel

It would be good to update the liturgical patterns in TeXLive and to make sure that the babel package actually recognizes them, but I do not know how to do either.

antidea

antid in antid-ea is not a compound of ante and id, but an old form of ante. The hyphenation should be an-tid-e-a.

dispereo

The word dispereo, which should be hyphenated dis-per-e-o, is currently hyphenated incorrectly.
All pereo derivatives could be examined in the correction work.

Word lists: verba delenda

According to the Gaffiot, abstare is an intransitive verb; so there should be no passive forms except impersonal ones (third person singular).
The following words should be removed from the word lists:

abstemini
abstentur
abster
abstere
absteris

potens/potentia

You find pot-ens and pot-en-ti-a in the liturgical books, but I consider these hyphenations to be based on a faulty overgeneralization.
It is correct to hyphenate pot-es, pot-est and the like, because in this case the first element is a shortened potis and the second element a form of esse.
But potens is not a compound of potis and ens. The participle ens of esse was introduced artificially in late antiquity to translate the corresponding Greek participle. It was not known in classical times.
potens is older than ens and derives from a lost verb potere. It should be hyphenated po-tens just as pæ-ni-tens, the participle of pænitere. The same holds for po-ten-ti-a.

Word lists

There is one duplicate in wordlist-liturgical.txt: eucharistia.
As far as I can see, the entry a-bi-ens is wrong. This is a participle of ab-ire.
There are eight accented forms in the file, all of them beginning with trans.

There is also one duplicate in wordlist-liturgical-accents.txt: transpadáneus.
There are some entries with three or more syllables not having an accent in the file:
abstergo
adieuntis (strange form, does it exist?)
altertra
astasti
astastis
compinguescerent
cumscribillo
discrepentia
displuvia
distrivsti
distrivstis
elanguescam
elanguescerent
epithalamus
eschato
euhias
exsanguescerent
horreus
impinguescerent
impinguescetis
inexstinguibilis
inexstinguibiliter
interstinguant
languescerent
languoris
linguacioris
linguosioris
longivus
obfidire
obiex
perieam
perieamus
perieant
perieas
perieatis
periee (also strange)
perieim
perieimus
perieint
perieis
perieit
perieitis
perieo
perieunt
periur
perunguerent
pinguescerent
præaudio
præiens
præobturans
præstruo
relanguescerent
respire
sanguinolentus
satisaccipere
semetipsum
sorbuis
substinguitur
superescit
superimpedens
suscribere
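Both checks (duplicates, and longer words without an accent) lend themselves to a small script. The sketch below assumes that each entry is a hyphenated word, so that the syllable count is the number of hyphens plus one, and that accents are acute accents; the function names are hypothetical.

```python
import unicodedata
from collections import Counter

def duplicates(entries):
    """Return the words (hyphens removed) that occur more than once."""
    counts = Counter(entry.replace("-", "") for entry in entries)
    return sorted(word for word, n in counts.items() if n > 1)

def has_acute(word):
    """True if any letter carries an acute accent (U+0301 after decomposition)."""
    return "\u0301" in unicodedata.normalize("NFD", word)

def missing_accent(entries, min_syllables=3):
    """Return entries with at least `min_syllables` syllables and no acute accent."""
    return [entry for entry in entries
            if entry.count("-") + 1 >= min_syllables and not has_acute(entry)]
```

Running `duplicates` on the plain list and `missing_accent` on the accented list would regenerate reports like the ones above whenever the files change.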

File encoding (revisiting #10)

RE: #10.

So, after doing some research and testing, it appears that at least as far as stdin and stdout are concerned, setting PYTHONIOENCODING="utf8" as an environment variable will enable the syllabifier to work (i.e. use UTF-8 encoding) under a restricted shell where it would otherwise not (i.e. use US-ASCII).

However, this does not apply to files. For that the "fix" appears to be to specify the encoding argument for open when accessing the file.
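For files, the fix amounts to something like the following sketch, in which `read_lines` is a hypothetical helper. Passing `encoding` explicitly makes the result independent of the locale that the shell provides.

```python
def read_lines(path):
    """Read a text file as UTF-8 regardless of the shell's locale settings."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

Without the `encoding` argument, `open` falls back to the locale's preferred encoding, which is what fails under a restricted shell that reports US-ASCII.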

For my own use case, fixing stdin and stdout with the environment variable is sufficient. If I'm trying to operate on files, then I'm working from my normal shell environment where Python can pick up the necessary encoding automatically. Inside the restricted shell where the problem occurs I'm always using stdin and stdout.

So, unless there are others out there who are having a file encoding problem, I'm not going to worry about this any more. I'm opening this issue to provide a place for individuals with a file encoding problem to post about their problem and let me know that I need to revisit this decision.

But the ones beginning with prodess- do work:

prod-es-se
prod-es-sem
prod-es-ses
prod-es-set
prod-es-se-mus
prod-es-se-tis
prod-es-sent