So, in using the syllabifier I've noticed that punctuation ends up on the wrong side o

Punctuation, about gregorio-project/hyphen-la

Comments (11)

eroux commented on May 29, 2024

Well, my first remark would be that hyphen-la deals solely with the Latin language, so I'm not sure handling Chinese or Tibetan punctuation is a priority... so I would go for 3. Maybe replace the last part with something like (untested):

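import re

# hyphenator, args, input and output are assumed to be set up earlier in the script.
def hyphenate_one_word(word, punctuation):
    # Hyphenate the bare word, then re-attach its trailing punctuation.
    r = hyphenator.inserted(word, args.hyphenchar)
    if args.endofword:
        r += args.hyphenchar
    r += punctuation
    return r

# Group 1: a run of letters (no digits, no underscore) ending at a word boundary.
# Group 2: an optional trailing punctuation character (empty if there is none).
wordregex = re.compile(r'\b([^\W\d_]+)\b([-.!?;,]?)')

for line in input:
    line = line.strip()
    hyphenline = wordregex.sub(lambda match: hyphenate_one_word(match.group(1), match.group(2)), line)
    output.write(hyphenline + '\n')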

from hyphen-la.

rpspringuel commented on May 29, 2024

That's a very good point. I may have been overthinking the problem. I'll grab a copy of the Graduale and start assembling a list of punctuation.

Programmatically, my approach is a bit different and exploits the fact that the hyphenator doesn't hiccup on non-word characters. All I really do is modify wordregex so that it includes postfixed punctuation without creating a second matching group.
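A minimal sketch of the single-group idea described above, for illustration only. The punctuation set and the name `word_with_punctuation` are assumptions, not the pattern from the actual pull request; the sketch only shows that one pattern with no capture groups can pick up a word together with its trailing punctuation. Whether the hyphenator then copes with the attached punctuation is taken from the comment above and is not demonstrated here.

```python
import re

# Assumed punctuation set for illustration; the list discussed in this
# thread is longer.
TRAILING_PUNCTUATION = "-.!?;,"

# One pattern, no capture groups: a run of letters (no digits or underscore),
# a word boundary, then any trailing punctuation from the set above.
word_with_punctuation = re.compile(
    r"\b[^\W\d_]+\b[" + re.escape(TRAILING_PUNCTUATION) + r"]*"
)

line = "Puer natus est nobis, et filius datus est nobis."
matches = word_with_punctuation.findall(line)
print(matches)
```

Each match could then be handed to the hyphenation function as a whole, so the callback only ever needs `match.group(0)`.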

rpspringuel commented on May 29, 2024

Okay, so I'm going to list the punctuation I'm planning on including here as I find it. I'm going to split them into rough groups in order to make it easier to see what's already present. Feel free to edit this list and add to it:

connectors: _
dashes: - ‐ ‑ ‒ – —
closers: ) ] } 〉
finals: » ’ ” ›
initials: (none)
other: ! " ' * , . : ; ? † ⁓ ⁕ ⁜
openers: (none)

Notes:

I included the underscore character for those who might be trying to simulate a lyric extender. I know those aren't usual for chant, but I figure someone more used to modern music might try to put them in. (Turns out ＿ raises errors about character range, so I'm excluding it.)

I really don't know whether all those variants on the dashes are needed. I've already been somewhat selective, but I think the set could be trimmed down further.

For closers I've basically included all types of braces in their basic form, though I'm not sure if they are really necessary at all.

Closing quotation marks (finals) are another case where I don't know exactly what to include. Latin, of course, wouldn't have used them in the first place, but I've come across Latin texts which do have them. Those texts being US generated, they follow US conventions, but I've included the guillemets too on the assumption that Latin texts generated in a place which uses them (like France) would use them if they used finals at all.

Since I'm worried about ends of words, I've elected not to include anything from initials or openers, both of which would normally occur at starts of words, not ends.

For other, besides the obvious, I've also picked out a few of the look-alikes that closely resemble some of the non-letter signs that you might encounter in chant. Along those lines there are some other candidates for inclusion from other categories:

math: ~ +
other: ✝ ✠

rpspringuel commented on May 29, 2024

Here are some pages which list the various characters in each class:

Connector Punctuation
Dash Punctuation
Close Punctuation
Final Punctuation
Initial Punctuation
Other Punctuation
Open Punctuation

I'm going to reorganize the list above to sort things in this manner, but only include the ones I think are pertinent.
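For anyone who wants to check where a character falls, here is a small sketch using Python's standard `unicodedata` module. The class names in the list correspond to the Unicode general categories Pc (connector), Pd (dash), Pe (close), Pf (final), Pi (initial), Po (other) and Ps (open); Sm is the math symbol category. `CANDIDATES` is just a sample of the characters listed above, and `group_by_category` is a helper name made up for this example.

```python
import unicodedata
from collections import defaultdict

# A sample of the characters from the list above (not the complete set).
CANDIDATES = "_-‐–—)]}»’”!\"'*,.:;?†~+"

def group_by_category(chars):
    """Group characters by their two-letter Unicode general category."""
    groups = defaultdict(list)
    for ch in chars:
        groups[unicodedata.category(ch)].append(ch)
    return dict(groups)

groups = group_by_category(CANDIDATES)
for category in sorted(groups):
    print(category, " ".join(groups[category]))
```

Sorting by the two-letter code gives the same order as the list above: connectors, dashes, closers, finals, other, then math.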

rpspringuel commented on May 29, 2024

Reorganization done. Comments welcome, especially in response to the potential issues I've pointed out in my notes.

rpspringuel commented on May 29, 2024

So I'm working on implementation and have noticed something:

The underscore (_) and digits \d are word characters (i.e. in the set \w and not in the set \W). So [^\W\d_] is all word characters which are not also digits or the underscore.

\b is a word boundary, i.e. a zero-width position between characters which has \w on one side and \W (or the start or end of the string) on the other. Because digits and the underscore are on the word side of the boundary, \b[^\W\d_]+\b means that any string beginning with, ending with, or containing a digit or an underscore is not a match. For digits this isn't much of a concern, but for the underscore it might be. I can understand excluding what would otherwise be words if they have an underscore at the beginning or in the middle, but what about at the end?

For example the syllabifier will not work on puer_. For this string the word boundaries are at the beginning of the string and at the end because all of the characters are word characters. However, because of the underscore, it also does not match [^\W\d_]+ so there is no match at all in this string. Is that intentional? Should an underscore (or a digit, for that matter) automatically disqualify anything it's attached to from being a word?
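The matching behavior described above can be checked directly with Python's `re` module. This is only a sketch of the pattern as quoted in this comment; `letters_only` and `first_word` are names made up for the example.

```python
import re

# Letters only (word characters minus digits and underscore),
# bounded by word boundaries on both sides.
letters_only = re.compile(r"\b[^\W\d_]+\b")

def first_word(text):
    """Return the first match in text, or None if nothing matches."""
    match = letters_only.search(text)
    return match.group(0) if match else None

for sample in ["puer", "puer.", "puer_", "_puer", "puer2", "puer._"]:
    print(repr(sample), "->", repr(first_word(sample)))
```

The strings with an underscore or digit attached directly to the letters produce no match at all, while `puer.` and `puer._` still match `puer` because the period is a non-word character and so creates a boundary.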

eroux commented on May 29, 2024

Well, the underscore isn't really a Latin character, so it doesn't really matter; do whatever is most practical.

rpspringuel commented on May 29, 2024

I've played around with it and it's not all that hard for me to do either. I just need to know which I should do.

Basically it boils down to this: current behavior gives the following transformations (with --end-of-word active):

1. puer → pu-er-
2. _puer → _puer
3. puer_puer → puer_puer
4. puer_ → puer_
5. puer. → pu-er-.
6. _puer. → _puer.
7. puer_puer. → puer_puer.
8. puer_. → puer_.
9. puer._ → pu-er-._

Is this the desired set of transformations? If not, what should the transformations be? I realize that the underscore is not a normal Latin character and thus this behavior has probably not been thought through before, that's why I'm asking about it: to make sure I have a proper working base before I start making my changes.

The new behavior I'm working on will affect how the . gets treated (i.e. which side of the - it ends up on, affecting primarily the fifth and ninth cases above), but before I do that, I need to know if the base behavior is as intended.
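The table of current transformations can be reproduced with a short, self-contained sketch. Two things here are assumptions: that the current script substitutes over `\b[^\W\d_]+\b` and appends the hyphen character after each matched word, and the tiny lookup table `STAND_IN_HYPHENATION`, which stands in for the real hyphenation dictionary so the example runs without it.

```python
import re

# Assumed form of the current pattern: letters only, between word boundaries.
WORD_RE = re.compile(r"\b[^\W\d_]+\b")

# Hypothetical stand-in for the real hyphenator: one hard-coded word.
STAND_IN_HYPHENATION = {"puer": "pu-er"}

def hyphenate_one_word(word, hyphenchar="-", end_of_word=True):
    """Hyphenate one word and, with end_of_word, append a trailing hyphen."""
    result = STAND_IN_HYPHENATION.get(word, word)
    if end_of_word:
        result += hyphenchar
    return result

def hyphenate_line(line):
    """Replace every matched word in the line with its hyphenated form."""
    return WORD_RE.sub(lambda m: hyphenate_one_word(m.group(0)), line)

cases = ["puer", "_puer", "puer_puer", "puer_", "puer.",
         "_puer.", "puer_puer.", "puer_.", "puer._"]
for case in cases:
    print(case, "→", hyphenate_line(case))
```

Under these assumptions the output matches the nine cases listed above: only the strings where `puer` is delimited by non-word characters (or the string edges) are touched, and the period lands after the trailing hyphen.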

eroux commented on May 29, 2024

I've never used the script for my personal use so feel free to implement a behavior that suits your needs

eroux commented on May 29, 2024

does the pull request close the issue?

rpspringuel commented on May 29, 2024

Yes.
