Comments (11)

eroux commented on May 29, 2024

Well, my first remark would be that hyphen-la deals solely with the Latin language, so I'm not sure handling Chinese or Tibetan punctuation is a priority. So I would go for 3. Maybe replace the last part with something like this (untested):

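def hyphenate_one_word(word, punctuation):
    # hyphenator and args are module-level objects set up earlier in the script
    r = hyphenator.inserted(word, args.hyphenchar)
    if args.endofword:
        r += args.hyphenchar
    # punctuation goes after the end-of-word hyphen
    r += punctuation
    return r

# group 1: a run of letters (no digits, no underscore)
# group 2: at most one trailing punctuation character (may be empty)
wordregex = re.compile(r'\b([^\W\d_]+)\b([-.!?;,]?)')

for line in input:
    line = line.strip()
    hyphenline = wordregex.sub(lambda match: hyphenate_one_word(match.group(1), match.group(2)), line)
    output.write(hyphenline + '\n')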

from hyphen-la.

rpspringuel commented on May 29, 2024

That's a very good point. I may have been overthinking the problem. I'll grab a copy of the Graduale and start assembling a list of punctuation.

Programmatically, my approach is a bit different and exploits the fact that the hyphenator doesn't hiccup on non-word characters. All I really do is modify wordregex so that it includes postfixed punctuation without creating a second matching group.
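
A minimal sketch of that single-group idea, not the actual change: `toy_inserted` is a hypothetical stand-in for `hyphenator.inserted` that only knows the word "puer", and the punctuation set in the pattern is an assumed placeholder. The whole match, punctuation included, is handed to the hyphenator, so no second group is needed.

```python
import re

HYPHEN = "-"

def toy_inserted(text, hyphen=HYPHEN):
    # Hypothetical stand-in for hyphenator.inserted(): it only knows "puer"
    # and passes every other character (punctuation included) through untouched.
    return text.replace("puer", "pu" + hyphen + "er")

# No capture groups: a run of letters plus any punctuation attached to its end.
# The punctuation set here is a placeholder, not the final list.
wordregex = re.compile(r"\b[^\W\d_]+\b[.,;:!?]*")

def hyphenate_line(line, end_of_word=False):
    def repl(match):
        r = toy_inserted(match.group(0))
        if end_of_word:
            # With this approach the end-of-word hyphen lands after the punctuation.
            r += HYPHEN
        return r
    return wordregex.sub(repl, line)

print(hyphenate_line("puer, puer."))
print(hyphenate_line("puer, puer.", end_of_word=True))
```

Note that with the end-of-word option the hyphen ends up after the punctuation here, whereas the two-group version above places it before.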

rpspringuel commented on May 29, 2024

Okay, so I'm going to list the punctuation I'm planning on including here as I find it. I'm going to split them into rough groups in order to make it easier to see what's already present. Feel free to edit this list and add to it:

  • connectors: _
  • dashes: - ‐ ‑ ‒ – —
  • closers: ) ] } 〉
  • finals: » ’ ” ›
  • initials: (none)
  • other: ! " ' * , . : ; ? † ⁓ ⁕ ⁜
  • openers: (none)

Notes:

I included the underscore character for those who might be trying to simulate a lyric extender. I know those aren't usual for chant, but I figure someone more used to modern music might try to put them in. (Turns out _ raises errors about character range, so I'm excluding it.)

I really don't know whether all those variants on the dashes are needed. I've already been somewhat selective, but I think the set could be trimmed down further.

For closers I've basically included all types of braces in their basic form, though I'm not sure if they are really necessary at all.

Closing quotation marks (finals) are another category where I don't know exactly what to include. Latin, of course, wouldn't have used them in the first place, but I've come across Latin texts which do have them. Those texts being US-generated, they follow US conventions, but I've included the guillemets too, on the assumption that Latin texts produced in a place that uses them (like France) would use them if they used finals at all.

Since I'm worried about ends of words, I've elected not to include anything from initials or openers, both of which would normally occur at starts of words, not ends.

For other, besides the obvious, I've also picked out a few of the look-alikes that closely resemble some of the non-letter signs that you might encounter in chant. Along those lines there are some other candidates for inclusion in other categories:

  • math: ~ +
  • other: ✝ ✠
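
As a rough sketch of how a list like the one above could be turned into a pattern (the variable names are made up for illustration, and this is not necessarily how the script does it): `re.escape` protects characters such as `-` and `]` that have special meaning inside a character class. An unescaped `-` is one common source of "bad character range" errors, though whether that was the cause of the underscore problem is a guess.

```python
import re

# Punctuation copied from the list above (underscore left out, as noted).
DASHES = "-‐‑‒–—"
CLOSERS = ")]}〉"
FINALS = "»’”›"
OTHER = "!\"'*,.:;?†⁓⁕⁜"

trailing = DASHES + CLOSERS + FINALS + OTHER

# re.escape() keeps "-" and "]" from being read as character-class syntax.
punct_class = "[" + re.escape(trailing) + "]"

# A run of letters followed by any number of the listed punctuation marks.
wordregex = re.compile(r"\b[^\W\d_]+\b" + punct_class + "*")

print(wordregex.findall("puer. (puer) «puer» puer—"))
```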

rpspringuel commented on May 29, 2024

Here are some pages which list the various characters in each class:

Connector Punctuation
Dash Punctuation
Close Punctuation
Final Punctuation
Initial Punctuation
Other Punctuation
Open Punctuation

I'm going to reorganize the list above to sort things in this manner, but only include the ones I think are pertinent.
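
For anyone who wants to check which class a given character belongs to without the lists, here is a small sketch using Python's standard `unicodedata` module. The two-letter codes (Pc, Pd, Pe, Pf, Pi, Po, Ps) are the Unicode general categories behind the seven classes above; the helper name and the dictionary are just for illustration.

```python
import unicodedata

# Unicode general categories for the seven punctuation classes.
CATEGORY_NAMES = {
    "Pc": "connector",
    "Pd": "dash",
    "Pe": "close",
    "Pf": "final",
    "Pi": "initial",
    "Po": "other",
    "Ps": "open",
}

def punctuation_class(ch):
    """Return the punctuation class name for ch, or None if ch is not punctuation."""
    return CATEGORY_NAMES.get(unicodedata.category(ch))

for ch in "_-)»«.(+":
    print(repr(ch), unicodedata.category(ch), punctuation_class(ch))
```

Characters such as `+` fall outside the punctuation categories altogether (it is a math symbol, Sm), which is why the helper returns None for it.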

rpspringuel commented on May 29, 2024

Reorganization done. Comments welcome, especially in response to the potential issues I've pointed out in my notes.

rpspringuel commented on May 29, 2024

So I'm working on implementation and have noticed something:

The underscore (_) and digits \d are word characters (i.e. in the set \w and not in the set \W). So [^\W\d_] is all word characters which are not also digits or the underscore.

\b is a word boundary, i.e. a zero-width position between two characters which has \w on one side and \W (or the start or end of the string) on the other. Because digits and the underscore are on the word side of the boundary, \b[^\W\d_]+\b means that any string beginning with, ending with, or containing a digit or an underscore is not a match. For digits this isn't much of a concern, but for the underscore it might be. I can understand excluding what would otherwise be words if they have an underscore at the beginning or in the middle, but what about at the end?

For example the syllabifier will not work on puer_. For this string the word boundaries are at the beginning of the string and at the end because all of the characters are word characters. However, because of the underscore, it also does not match [^\W\d_]+ so there is no match at all in this string. Is that intentional? Should an underscore (or a digit, for that matter) automatically disqualify anything it's attached to from being a word?
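
The matching behavior described above can be checked directly with the pattern as quoted, \b[^\W\d_]+\b, using only the standard `re` module (the sample strings are illustrative):

```python
import re

wordregex = re.compile(r"\b[^\W\d_]+\b")

samples = ["puer", "puer.", "puer_", "_puer", "puer_puer", "puer2"]
for s in samples:
    # An empty list means the pattern found no word at all in the string.
    print(s, wordregex.findall(s))
```

Only the first two strings yield a match. In the others an underscore or digit sits directly against the letters, so there is no word boundary where the pattern needs one.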

eroux commented on May 29, 2024

Well, since the underscore isn't really a Latin character, it doesn't really matter. You can do whatever is most practical.

rpspringuel commented on May 29, 2024

I've played around with it and it's not all that hard for me to do either. I just need to know which I should do.

Basically it boils down to this: current behavior gives the following transformations (with --end-of-word active):

  1. puer → pu-er-
  2. _puer → _puer
  3. puer_puer → puer_puer
  4. puer_ → puer_
  5. puer. → pu-er-.
  6. _puer. → _puer.
  7. puer_puer. → puer_puer.
  8. puer_. → puer_.
  9. puer._ → pu-er-._

Is this the desired set of transformations? If not, what should the transformations be? I realize that the underscore is not a normal Latin character and thus this behavior has probably not been thought through before. That's why I'm asking about it: to make sure I have a proper working base before I start making my changes.

The new behavior I'm working on will affect how the . gets treated (i.e. which side of the - it ends up on, affecting primarily 5 and 9), but before I do that, I need to know if the base behavior is as intended.
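
A sketch that reproduces the nine cases, assuming the current pattern is \b[^\W\d_]+\b as quoted earlier. `toy_inserted` is a hypothetical stand-in for `hyphenator.inserted` that only knows "puer", so the real script's dictionary is not involved.

```python
import re

def toy_inserted(word, hyphen="-"):
    # Hypothetical stand-in for hyphenator.inserted(); unknown words pass through.
    return {"puer": "pu" + hyphen + "er"}.get(word, word)

wordregex = re.compile(r"\b[^\W\d_]+\b")

def current_behavior(line, hyphen="-"):
    # Hyphenate each matched word, then append the end-of-word hyphen.
    return wordregex.sub(lambda m: toy_inserted(m.group(0), hyphen) + hyphen, line)

cases = ["puer", "_puer", "puer_puer", "puer_", "puer.",
         "_puer.", "puer_puer.", "puer_.", "puer._"]
for case in cases:
    print(case, "→", current_behavior(case))
```

Only cases 1, 5 and 9 change, because those are the only strings in which "puer" has a word boundary on both sides.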

eroux commented on May 29, 2024

I've never used the script for my personal use so feel free to implement a behavior that suits your needs

eroux commented on May 29, 2024

does the pull request close the issue?

rpspringuel commented on May 29, 2024

Yes.
