gregorio-project / hyphen-la
Latin ecclesiastic hyphenation patterns and resources
Home Page: http://gregorio-project.github.io/hyphen-la/
License: MIT License
In my general review of words beginning with ob, I have indicated the hyphenation ob-scurus (plus derivatives and compounds).
There is a dilemma here about whether etymology or phonetics should take priority. The case is different from that of the preservation of the group sc + e (and others), as in di-scedo, for which we favored phonetics. Here, in fact, the sound is hard, and we therefore pronounce obs-curus regardless of where the hyphen is placed.
So it might be better to correct the patterns?
abstulit is correctly hyphenated as abs-tu-lit, but the accented word is hyphenated as ábstu-lit, while Claudio's file, line 197, gives the correct hyphenation.
How to solve the problem? Just copy the accented word into one of the 'Solesmes' accented files? Or should we manage to group all the words in a single file to track duplicates?
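One way to spot this kind of mismatch automatically is to strip the acute accent from the accented result and compare it with the unaccented one. The following is only a minimal sketch: the function names are hypothetical and not part of hyphen-la, and it assumes that accents are encoded as acute accents (precomposed or combining) and that hyphenation points are marked with "-".

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def strip_acute(text):
    """Remove acute accents only; other marks (e.g. the diaeresis) are kept."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_ACUTE, ""))


def same_break_points(plain_hyphenated, accented_hyphenated):
    """True if both hyphenations split the word at the same places."""
    return strip_acute(plain_hyphenated) == strip_acute(accented_hyphenated)


# Hypothetical usage with the two results quoted above:
print(same_break_points("abs-tu-lit", "ábstu-lit"))
```

Run over pairs of results, a check like this would flag every word whose accented and unaccented forms break differently, without requiring all words to live in a single file.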
A revision of the patterns around the diphthong eu, especially in words beginning with pseudo, is necessary.
For now, the tests are wrong on the following words:
differences in proofreading-Solesmes-3.txt:
leucaspis % leu-cas-pis (not le-u-cas-pis)
neurospastos % neu-ros-pas-tos (not ne-u-ros-pas-tos)
pheuxaspidion % pheu-xas-pi-di-on (not phe-u-xas-pi-di-on)
differences in proofreading-Solesmes-3-accents.txt:
result without accent (correct) / result with accent (incorrect)
pseudósphex % pseu-do-sphex (not pse-u-dó-sphex)
I will complete the list with all the words beginning with pseudo. But searching for all the words containing the diphthong eu would take too long. The words already present will suffice for the moment, unless someone has a list…
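The difference lines quoted above follow a regular shape (word, then "%", then the expected hyphenation, then the actual one in parentheses), so they can be collected mechanically. This is a minimal sketch that assumes exactly that shape; the function name and the returned dictionary keys are hypothetical.

```python
import re

# Assumed line shape, as in the reports above:
#   word % expected-hyphenation (not actual-hyphenation)
DIFF_LINE = re.compile(r"^(\S+) % (\S+) \(not (\S+)\)$")


def parse_difference(line):
    """Return the word, expected and actual hyphenation, or None for other lines."""
    match = DIFF_LINE.match(line.strip())
    if match is None:
        return None
    word, expected, actual = match.groups()
    return {"word": word, "expected": expected, "actual": actual}


print(parse_difference("leucaspis % leu-cas-pis (not le-u-cas-pis)"))
```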
Ephraim syllabifies to: E()ph()raim() or E()ph()ra()ïm() when given with a diaeresis over the i.
"ph" cannot be a syllable by itself because it has no vowel. I'm not sure whether the syllabification should be E()phra()im or Eph()ra()im
It would be incredibly useful for Gregorio users if there were an option in the syllabifier to add the hyphen char to the end of every word, and not just between the syllables inside of words. If one is prepping a text for gabc, then one needs the () to appear after every syllable, not just inside words.
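As an illustration of the requested behavior (not the syllabifier's actual code), here is a minimal sketch that post-processes text which has already been hyphenated with "-". The function name and the assumption that words are separated by whitespace are mine.

```python
def mark_every_syllable(hyphenated_text, marker="()"):
    """Put the marker after every syllable, including the last one of each word.

    Assumes syllables are separated by "-" and words by whitespace.
    Punctuation attached to a word stays in front of the final marker.
    """
    words = hyphenated_text.split()
    return " ".join(word.replace("-", marker) + marker for word in words)


print(mark_every_syllable("Laus ti-bi, Do-mi-ne"))
# prints: Laus() ti()bi,() Do()mi()ne()
```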
abiens derives from ab + ire. It should be hyphenated ab-i-ens.
The word prostrato (and the accented prostráto) is hyphenated pro-s-tra-to instead of pro-stra-to.
Several words are hyphenated differently if they are accented or not:
In the first case, the accented version is the correct one; in the other three cases, it is the unaccented version. Other forms of these words will also need to be checked.
I will explore those differences as soon as possible.
The proper noun Iulus and the adjective Iuleus have a vocalic I. They should be hyphenated I-u-lus and I-u-le-us with the liturgical patterns.
My Latin grammar states that an unstressed u preceded by s and followed by a vowel is semivocalic, e.g. Suēbi. Another grammar that I have consulted does not mention this rule but states that u is semivocalic as an exception in the words suāvis, suādeō, suēscō, Suēbi.
liturgical-hyphenation.md does not mention these cases and the current liturgical patterns treat u as a full vowel in all of these words. This should probably be changed.
Some words are wrongly hyphenated in the non-liturgical patterns; they should be tested with the liturgical patterns too:
Arachne
cedrus
corruptrix
currus
Daphne
genuit
longaevus
neglego
nescio
animadverto
Ioseph
rhythmicus
rhythmus
The word potuere (from "possum") should be hyphenated like this:
pot-u-e-re
but for now it is
po-tu-e-re.
I'm not very good at modifying the patterns to obtain the right result. @eroux, can you tell me how to do this?
Laus tibi, Domine, rex aeternae gloriae.
Goes to:
Laus() ti()bi,() Do()mi()ne,() rex() a()e()ter()na()e() glo()ri()a()e.()
Of course, in Aeternae and gloriae, there are additional syllables.
There are two orthographic variants of the compounds of iacio: a classical one ending in -icio and a medieval one ending in -jicio. A variant ending in -iicio is not in use.
Care should be taken that the liturgical patterns as well as the test files wordlist-liturgical.txt and wordlist-liturgical-accents.txt take into account both variants of words like abicio, adicio, eicio, inicio, obicio, subicio etc.
Currently, the patterns are fine, but the test files contain many forms with the wrong -iicio spelling.
I just noticed that arguet is getting syllabified as two syllables: ar-guet instead of ar-gu-et.
In the last corrections, I added two words for the examples which are, for the moment, badly hyphenated.
For memory:
redarius % re-da-ri-us (not red-a-ri-us)
redauspico % red-au-spi-co (not red-aus-pi-co)
I'll take care of it soon.
I would assume caelis (of all the words to not get right) is correct, but the dictionary shows the pronunciation having two syllables. [email protected] hyphen-la shows 3.
So, in using the syllabifier I've noticed that punctuation ends up on the wrong side of the hyphen character when the end-of-word option is used, i.e. puer. becomes pu-er-.
This is less than desirable (at least for me) and so I'm looking to fix it.
In exploring how to fix this, I first found the regex module, which defines a character class based on the Unicode punctuation category: \p{P}
In my admittedly limited testing, using import regex as re and wordregex = re.compile(r'\b[^\W\d_]+\b\p{P}*') does fix the problem. However, I've also discovered that regex is not installed by default on all systems (indeed, I had to install it myself in order to do this testing). (I should also note that this introduces an approximately 25% performance hit from the current state.)
Having asked on StackExchange, I've gotten two solutions that stick with re.
The first involves building the \p{P} class manually using the unicodedata module (which is part of the standard library). This involves a performance hit, because of the iteration over the entire Unicode character map, but my testing says this hit, while large in relative terms (~6x slower than regex, ~8x slower than current), is small in absolute terms (less than half a second to define the class).
The second approximates \p{P} with [^\w\s]. I don't think these are entirely interchangeable: for one, I think symbols (+=$ etc.) are not punctuation but are neither word nor space characters either (i.e. they are not in \p{P} but are in [^\w\s]). Such characters, however, are probably fairly rare in chant lyrics, aren't they? (Performance-wise this is indistinguishable from current behavior.)
So, I have a question: which solution should I pursue: (1) the regex module, (2) the class built manually with unicodedata, or (3) the [^\w\s] approximation?
I should also point out that I can use try ... except to combine (1) with either (2) or (3). In that case regex would be used if available, but its absence would not be fatal. Doing so does not noticeably affect performance beyond what the individual solution that ends up being used does.
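The combination described above (prefer regex, fall back to a class built with unicodedata) might look roughly like the following. This is a sketch under assumptions, not the syllabifier's actual code: the names build_punctuation_class and WORD_REGEX are hypothetical, and only the word pattern itself is taken from the post.

```python
import re
import sys
import unicodedata


def build_punctuation_class():
    """Approximate \\p{P} for the standard re module.

    Scans every code point and keeps those whose Unicode general
    category starts with "P" (the punctuation categories).
    """
    chars = [
        chr(code_point)
        for code_point in range(sys.maxunicode + 1)
        if unicodedata.category(chr(code_point)).startswith("P")
    ]
    return "[" + "".join(re.escape(char) for char in chars) + "]"


try:
    # Third-party module; supports \p{P} directly.
    import regex
    WORD_REGEX = regex.compile(r"\b[^\W\d_]+\b\p{P}*")
except ImportError:
    # Fallback: standard library only, with the manually built class.
    WORD_REGEX = re.compile(r"\b[^\W\d_]+\b" + build_punctuation_class() + "*")

print(WORD_REGEX.findall("Laus tibi, Domine."))
```

With either branch, trailing punctuation is matched together with the word, so the end-of-word hyphen character can be placed after it.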
If someone wants to build a format-specific hyphenator, e.g. for lilypond or gabc, do you prefer to put it here in the one package, or to make hyphen-la an installable Python library and make the format-specific tools separate applications that depend on it?
These words derive from ob(s) and alo. The hyphenation should be obs-o-le-tus/obs-o-les-co/obs-o-le-fi-ci-o.
Is there any rule for distinguishing vocalic and semivocalic u in words with ngu + vowel? liturgical-hyphenation.md states that “langueo” and “languesco” have a semivocalic u, but “langui” and “languere” a vocalic one.
The patterns yield “lan-gue-o”, “lan-gues-co”, “lan-gue-re” and “lan-gu-i”.
inuleus is a variant of hinnuleus. As in is not a prefix here, the hyphenation should be i-nu-le-us instead of in-u-le-us.
The etymology of perendie is uncertain. So it should be hyphenated according to the general rules: pe-ren-di-e.
The following hyphenation points are missing when using the liturgical patterns:
ab-stare, ab-sto, abs-tentus (from abstineo), ab-stiti (from absisto)
The following hyphenation points are incorrect:
absu-mpsi, absu-msi, absu-mptus (all from absumere)
If the general rule applies, tenuia (neuter plural) and tenuior (comparative), both derived from tenuis, have a semi-vocalic i. So the hyphenations should be te-nu-ia and te-nu-ior instead of te-nu-i-a and te-nu-i-or.
The documentation of the liturgical patterns should make clearer how Greek compounds are to be hyphenated.
From the example a-pos-to-los one might guess that Greek prefixes (in this case apo-) are ignored. In accordance with this, the patterns yield e-pis-co-pus (Greek prefix epi-), and the test file wordlist-liturgical.txt contains the following examples ignoring a Greek prefix:
e-pis-tra-te-gi-a
e-pis-tra-te-gus
a-pos-to-lo-rum
a-pos-phra-gis-ma
pseu-de-ne-drus
pseu-di-so-do-mos
On the other hand, the same test file contains the following entries, which only make sense from an etymological point of view:
e-pi-stro-phe
a-po-stro-pha
a-po-stro-phe
a-po-stro-phos
a-po-sple-nos
a-na-stro-phe
pseud-an-chu-sa
pseud-a-pos-to-lus
pseu-do-sma-rag-dus
pseu-do-sphex
In my opinion, this is not consistent.
According to the information I found in etymological dictionaries, it is doubtful that indusium is derived from ind-uo. A derivation from Greek ἔν-δυσις/έν-δύω is more likely.
So it should be hyphenated in-du-si-um/in-du-si-o.
According to the general rule for semi-consonantic i stated in liturgical-hyphenation.md, the i in puleium and Gaius is not a full vowel. The current liturgical patterns yield pu-le-i-um and Ga-i-us, but it should be pu-le-ium and Ga-ius. The result for Pompeius is already correct: Pom-pe-ius.
Compound words with the prefix super should be proofread to check whether the hyphenation point falls before or after the r.
differences in proofreading-Solesmes-3.txt:
superaspergere % su-per-a-sper-ge-re (not su-pe-ra-sper-ge-re)
superaspergo % su-per-a-sper-go (not su-pe-ra-sper-go)
The prefix ex is sometimes broken. A revision seems necessary, especially when an additional prefix is added.
differences in proofreading-Solesmes-3.txt:
præexspectare % præ-ex-spec-ta-re (not præ-e-x-spec-ta-re)
præexspecto % præ-ex-spec-to (not præ-e-x-spec-to)
redexspectare % red-ex-spec-ta-re (not re-de-x-spec-ta-re)
redexspecto % red-ex-spec-to (not re-de-x-spec-to)
Several forms beginning with per- have to be proofread in detail, as the work and unaccented words #38 show for:
In the same vein, we have to review line 206 of the doc about a doubtful homograph: perire.
Appropinquant syllabifies as ap-pro-pin-qu-ant rather than as ap-pro-pin-quant
Prǽstet ought to hyphenate as Prǽ-stet, the same way that Præstet hyphenates as Præ-stet.
However, it currently hyphenates as Prǽs-tet.
Idomeneus should be hyphenated I-do-me-neus (so far: Id-o-me-ne-us). eu is a diphthong here and I see no prefix id in this name.
The etymology of induo/induere is unclear. I recommend replacing ind-u-o with in-du-o, as there is no verb uo/uere. Even if induo is derived from ind, the following uo has to be considered as some kind of ending and not as the second part of a compound.
seditio is a compound with the elements se(d) and eo/itum/ire. The hyphenation should be sed-i-ti-o/sed-i-ti-o-sus.
The accented word pródeunt is hyphenated pró-de-unt instead of pród-e-unt, unlike the unaccented word, which is already correctly hyphenated.
While working on question #23, I touched on a larger and more complex problem, that of the break around the sp block. I have done a vast search through the dictionary (Goelzer) for all the Latin words that contain this block. I only listed the entry forms of the dictionary, except for the verbs, for which I added the infinitive. Here is the result:
I propose to review all the patterns affecting the sp block to get the fewest possible errors. A search allowed me to estimate that about 70 patterns will be necessary, instead of the 52 existing ones, to get the most reliable results possible.
In order to avoid reinventing the wheel, it would be useful to have a list of existing resources (preferably open source) which implement Latin dictionary and/or hyphenation rules. Even if these aren't usable out of the box with gabc files, they might provide a data source that can be exploited.
The one I know about is the OpenOffice project which does have a Latin spell check dictionary with hyphenation patterns: http://extensions.openoffice.org/en/project/latin-spelling-and-hyphenation-dictionaries
It would be a good thing to update the liturgical patterns in TeXLive and to make sure that the babel package really recognizes these liturgical patterns, but I do not know how to do either of these two operations.
antid in antid-ea is not a compound of ante and id, but an old form of ante. The hyphenation should be an-tid-e-a.
The word dispereo, which should be hyphenated dis-per-e-o, is incorrectly hyphenated.
All pereo derivatives could be examined in the correction work.
According to the Gaffiot, abstare is an intransitive verb; so there should be no passive forms except impersonal ones (third person singular).
The following words should be removed from the word lists:
You find pot-ens and pot-en-ti-a in the liturgical books, but I consider these hyphenations to be based on a faulty overgeneralization.
It is correct to hyphenate pot-es, pot-est and the like, because in this case the first element is a shortened potis and the second element a form of esse.
But potens is not a compound of potis and ens. The participle ens of esse was introduced artificially in late antiquity to translate the corresponding Greek participle. It was not known in classical times.
potens is older than ens and derives from a lost verb potere. It should be hyphenated po-tens just as pæ-ni-tens, the participle of pænitere. The same holds for po-ten-ti-a.
There is one duplicate in wordlist-liturgical.txt: eucharistia.
As far as I can see, the entry a-bi-ens is wrong. This is a participle of ab-ire.
There are eight accented forms in the file, all of them beginning with trans.
There is also one duplicate in wordlist-liturgical-accents.txt: transpadáneus.
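Duplicates like these could be found with a small script. The sketch below assumes that the entries have already been read into a list and that each entry is a hyphenated word with "-" as the separator; the actual layout of the word list files may differ, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicates(entries):
    """Map each word that occurs more than once to all of its hyphenations.

    Entries are compared after removing the "-" separators, so two
    different hyphenations of the same word are also reported.
    """
    seen = defaultdict(list)
    for entry in entries:
        seen[entry.replace("-", "")].append(entry)
    return {word: forms for word, forms in seen.items() if len(forms) > 1}


# Hypothetical usage with a few entries:
print(find_duplicates(["a-bi-ens", "ab-i-ens", "pseu-do-sphex"]))
```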
There are some entries with three or more syllables not having an accent in the file:
abstergo
adieuntis (strange form, does it exist?)
altertra
astasti
astastis
compinguescerent
cumscribillo
discrepentia
displuvia
distrivsti
distrivstis
elanguescam
elanguescerent
epithalamus
eschato
euhias
exsanguescerent
horreus
impinguescerent
impinguescetis
inexstinguibilis
inexstinguibiliter
interstinguant
languescerent
languoris
linguacioris
linguosioris
longivus
obfidire
obiex
perieam
perieamus
perieant
perieas
perieatis
periee (also strange)
perieim
perieimus
perieint
perieis
perieit
perieitis
perieo
perieunt
periur
perunguerent
pinguescerent
præaudio
præiens
præobturans
præstruo
relanguescerent
respire
sanguinolentus
satisaccipere
semetipsum
sorbuis
substinguitur
superescit
superimpedens
suscribere
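A list like the one above could be regenerated with a check along these lines. It is only a sketch: it assumes the entries are available in hyphenated form (syllables separated by "-") and that the accent is an acute accent, precomposed or combining; the function name is hypothetical.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def entries_missing_accent(hyphenated_entries):
    """Return entries of three or more syllables that carry no acute accent."""
    missing = []
    for entry in hyphenated_entries:
        syllable_count = len(entry.split("-"))
        has_accent = COMBINING_ACUTE in unicodedata.normalize("NFD", entry)
        if syllable_count >= 3 and not has_accent:
            missing.append(entry)
    return missing
```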
RE: #10.
So, after doing some research and testing, it appears that, at least as far as stdin and stdout are concerned, setting PYTHONIOENCODING="utf8" as an environment variable will enable the syllabifier to work (i.e. use UTF-8 encoding) under a restricted shell where it would otherwise not (i.e. use US-ASCII).
However, this does not apply to files. For that, the "fix" appears to be to specify the encoding argument for open when accessing the file.
For my own use case, fixing stdin and stdout with the environment variable is sufficient. If I'm trying to operate on files, then I'm working from my normal shell environment, where Python can pick up the necessary encoding automatically. Inside the restricted shell where the problem occurs, I'm always using stdin and stdout.
So, unless there are others out there who are having a file encoding problem, I'm not going to worry about this any more. I'm opening this issue to provide a place for individuals with a file encoding problem to post about their problem and let me know that I need to revisit this decision.
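For anyone who does run into the file case, the fix mentioned above amounts to passing the encoding explicitly instead of relying on the environment. A minimal sketch, with hypothetical helper names that are not part of the syllabifier:

```python
def read_entries(path):
    """Read a text file as UTF-8, whatever the shell's locale says."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def write_entries(path, entries):
    """Write one entry per line, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(entry + "\n")
```

For stdin and stdout, the environment variable is set when launching the script, for example PYTHONIOENCODING="utf8" followed by the usual command.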
quotiescumque syllabifies as quot-ies-cum-que rather than as five syllables
As far as I can see, posterus and posterior are not compound words. They should be hyphenated as pos-te-rus and pos-te-ri-or in liturgical Latin.
compotes derives from compos. The hyphenation should be com-po-tes instead of com-pot-es.
I noticed that prodest was syllabifying as pro-dest, so I checked all the forms of prosum that begin with prod- and most of them don't keep the D with the first syllable:
pro-des
pro-dest
pro-des-tis
pro-des-tis
pro-de-ram
pro-de-ras
pro-de-rat
pro-de-ra-mus
pro-de-ra-tis
pro-de-rant
pro-de-ro
pro-de-ris
pro-de-re
pro-de-rit
pro-de-ri-mus
pro-de-ri-tis
pro-de-runt
pro-des-te
pro-des-to
pro-des-to-te
But the ones beginning with prodess- do work:
prod-es-se
prod-es-sem
prod-es-ses
prod-es-set
prod-es-se-mus
prod-es-se-tis
prod-es-sent