Comments (11)
Well, my first remark would be that hyphen-la deals solely with the Latin language, so I'm not sure handling Chinese or Tibetan punctuation is a priority... so I would go for 3. Maybe replace the last part with something like this (untested):
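import re

def hyphenate_one_word(word, punctuation):
    # hyphenator and args are module-level objects set up elsewhere in the script
    r = hyphenator.inserted(word, args.hyphenchar)
    if args.endofword:
        r += args.hyphenchar
    r += punctuation
    return r

# Letters only, optionally followed by one punctuation mark. The word boundary
# sits between the word and the punctuation; a \b placed after the punctuation
# would only match when a word character follows it immediately.
wordregex = re.compile(r'\b([^\W\d_]+)\b([-.!?;,]?)')
for line in input:
    line = line.strip()
    hyphenline = wordregex.sub(lambda match: hyphenate_one_word(match.group(1), match.group(2)), line)
    output.write(hyphenline + '\n')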
from hyphen-la.
That's a very good point. I may have been overthinking the problem. I'll grab a copy of the Graduale and start assembling a list of punctuation.
Programmatically, my approach is a bit different and exploits the fact that the hyphenator doesn't hiccup on non-word characters. All I really do is modify wordregex so that it includes postfixed punctuation without creating a second matching group.
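A minimal sketch of that single-pattern idea, under stated assumptions: the punctuation set TRAILING is a placeholder, fake_inserted is a hypothetical stand-in for hyphenator.inserted that only knows the word "puer", and where the end-of-word hyphen lands relative to the punctuation is a guess, not the behavior of the actual pull request.

```python
import re

# Placeholder punctuation set (an assumption; the list in this thread is longer).
TRAILING = ".,;:!?"

# One pattern, no capture groups: a run of letters plus any trailing punctuation.
wordregex = re.compile(r"\b[^\W\d_]+\b[" + re.escape(TRAILING) + r"]*")

def fake_inserted(text, hyphen="-"):
    """Hypothetical stand-in for hyphenator.inserted().

    It only knows how to split "puer" and passes every other character,
    including punctuation, through unchanged.
    """
    return text.replace("puer", "pu" + hyphen + "er")

def hyphenate_line(line, hyphen="-", end_of_word=False):
    """Hyphenate each word-plus-punctuation match as a single unit."""
    def repl(match):
        result = fake_inserted(match.group(0), hyphen)
        return result + hyphen if end_of_word else result
    return wordregex.sub(repl, line)

print(hyphenate_line("puer, puer."))
print(hyphenate_line("puer, puer.", end_of_word=True))
```

Because the punctuation is part of the match, the replacement callback never has to reassemble two groups; the trade-off is that the hyphenator sees the punctuation and must tolerate it.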
from hyphen-la.
Okay, so I'm going to list the punctuation I'm planning on including here as I find it. I'm going to split them into rough groups in order to make it easier to see what's already present. Feel free to edit this list and add to it:
- connectors:
_
- dashes:
- ‐ ‑ ‒ – —
- closers:
) ] } 〉
- finals:
» ’ ” ›
- initials: (none)
- other:
! " ' * , . : ; ? † ⁓ ⁕ ⁜
- openers: (none)
Notes:
I included the underscore character for those who might be trying to simulate a lyric extender. I know those aren't usual for chant, but I figure someone more used to modern music might try to put them in. (Turns out _ raises errors about character range, so I'm excluding it.)
I really don't know whether all those variants on the dashes are needed. I've already been somewhat selective, but I think the set could be trimmed down further.
For closers I've basically included all types of braces in their basic form, though I'm not sure if they are really necessary at all.
Closing quotation marks (finals) are another case where I don't know exactly what to include. Latin, of course, wouldn't have used them in the first place, but I've come across Latin texts which do have them. Those texts were produced in the US and follow US conventions, but I've included the guillemets too, on the assumption that Latin texts produced somewhere that uses them (like France) would use them if they used finals at all.
Since I'm worried about ends of words, I've elected not to include anything from initials or openers, both of which would normally occur at starts of words, not ends.
For other, besides the obvious, I've also picked out a few of the look-alikes that closely resemble some of the non-letter signs that you might encounter in chant. Along those lines, there are some other candidates for inclusion in other categories:
- math:
~ +
- other:
✝ ✠
from hyphen-la.
Here are some pages which list the various characters in each class:
Connector Punctuation
Dash Punctuation
Close Punctuation
Final Punctuation
Initial Punctuation
Other Punctuation
Open Punctuation
I'm going to reorganize the list above to sort things in this manner, but only include the ones I think are pertinent.
from hyphen-la.
Reorganization done. Comments welcome, especially in response to the potential issues I've pointed out in my notes.
from hyphen-la.
So I'm working on implementation and have noticed something:
The underscore (_) and digits (\d) are word characters (i.e. in the set \w and not in the set \W). So [^\W\d_] is all word characters which are not also digits or the underscore.
\b is a word boundary, i.e. a zero-width position between two characters which has \w on one side and \W on the other (the start and end of the string count as the \W side). Because digits and the underscore are on the word side of the boundary, \b[^\W\d_]+\b means that any run of word characters beginning with, ending with, or containing a digit or an underscore is not a match. For digits this isn't much of a concern, but for the underscore it might be. I can understand excluding what would otherwise be words if they have an underscore at the beginning or in the middle, but what about at the end?
For example, the syllabifier will not work on puer_. For this string the word boundaries are at the beginning of the string and at the end, because all of the characters are word characters. However, because of the underscore, the full run does not match [^\W\d_]+, so there is no match at all in this string. Is that intentional? Should an underscore (or a digit, for that matter) automatically disqualify anything it's attached to from being a word?
from hyphen-la.
Well, since the underscore isn't really a Latin character, it doesn't really matter. You can do whatever is most practical.
from hyphen-la.
I've played around with it and it's not all that hard for me to do either. I just need to know which I should do.
Basically it boils down to this: current behavior gives the following transformations (with --end-of-word active):
1. puer → pu-er-
2. _puer → _puer
3. puer_puer → puer_puer
4. puer_ → puer_
5. puer. → pu-er-.
6. _puer. → _puer.
7. puer_puer. → puer_puer.
8. puer_. → puer_.
9. puer._ → pu-er-._
Is this the desired set of transformations? If not, what should the transformations be? I realize that the underscore is not a normal Latin character and thus this behavior has probably not been thought through before, that's why I'm asking about it: to make sure I have a proper working base before I start making my changes.
The new behavior I'm working on will affect how the . gets treated (i.e. which side of the - it ends up on, affecting primarily 5 and 9), but before I do that, I need to know if the base behavior is as intended.
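The nine transformations listed in this comment can be reproduced with a small sketch. Two things here are assumptions: that the script's current pattern is \b[^\W\d_]+\b (inferred from the earlier comment), and that stub_inserted, a hypothetical stand-in that only knows "puer", behaves like the real hyphenator for this one word.

```python
import re

# Assumed current pattern, inferred from the discussion in this thread.
WORD = re.compile(r"\b[^\W\d_]+\b")

def stub_inserted(word, hyphen="-"):
    # Hypothetical stand-in for the real hyphenator; it only knows "puer".
    return word.replace("puer", "pu" + hyphen + "er")

def current_behavior(text, hyphen="-"):
    # Hyphenate each match and append the end-of-word hyphen after it.
    return WORD.sub(lambda m: stub_inserted(m.group(0), hyphen) + hyphen, text)

# Input/output pairs copied from the list in this comment.
CASES = [
    ("puer", "pu-er-"),
    ("_puer", "_puer"),
    ("puer_puer", "puer_puer"),
    ("puer_", "puer_"),
    ("puer.", "pu-er-."),
    ("_puer.", "_puer."),
    ("puer_puer.", "puer_puer."),
    ("puer_.", "puer_."),
    ("puer._", "pu-er-._"),
]

for source, expected in CASES:
    result = current_behavior(source)
    status = "ok" if result == expected else "DIFFERS"
    print(f"{source:12} -> {result:12} {status}")
```

Keeping the pairs in a list like this would also make it easy to restate the table for a new behavior and compare the two side by side.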
from hyphen-la.
I've never used the script for my personal use so feel free to implement a behavior that suits your needs
from hyphen-la.
Does the pull request close the issue?
from hyphen-la.
Yes.
from hyphen-la.