professoro / esperanta-analizisto

An Esperanto parser.

License: MIT License

HTML 100.00%

esperanta-analizisto's Issues

Algorithm for dealing with text

This is a ticket to hash out our algorithm for turning a block of Esperanto text into useful information. I don't have time to start coding this tonight, but I need to write down what I have in my head.

Here's my first pass at an algorithm:

1. Split the text into words.
2. Look for tiny words first, like prepositions or other two- or three-letter words.
3. For the words that are left, start by looking for a vowel at the end to dictate part of speech.
4. Then start chopping off roots that we know about, like we discussed.
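
The first three steps above can be sketched roughly as follows. This is only an illustration, not the project's code: the names `SMALL_WORDS`, `ENDINGS`, `splitWords`, and `classify` are made up for the example, the small-word list is a tiny sample, and finite verb endings (-as, -is, -os) are deliberately left out, so such words come back as "unknown".

```javascript
// Hypothetical sample of short function words; a real list would be much longer.
const SMALL_WORDS = new Set(['la', 'kaj', 'en', 'de', 'al', 'ĝis', 'por']);

// Final vowel -> part of speech (only the four basic vowel endings).
const ENDINGS = { o: 'noun', a: 'adjective', e: 'adverb', i: 'verb (infinitive)' };

// Step 1: split on anything that is not a letter (Unicode-aware, so ĉ, ĝ, ŭ survive).
function splitWords(text) {
  return text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
}

// Steps 2 and 3: check the small-word list, then look at the final vowel.
function classify(word) {
  if (SMALL_WORDS.has(word)) return { word, kind: 'small word' };
  // Strip plural -j and accusative -n before looking at the vowel.
  const stem = word.replace(/j?n?$/, '');
  const pos = ENDINGS[stem.slice(-1)];
  return { word, kind: pos || 'unknown', stem: pos ? stem.slice(0, -1) : word };
}

console.log(splitWords('La hundo kuras.').map(classify));
```

Step 4 (chopping off known roots) would then run on the `stem` that is left over.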

Keep track of properly parsed words (maybe click)

This would allow for regression testing when trying new parsing / selection algorithms.
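
One possible shape for this, as a sketch only: keep a map from each confirmed word to its accepted parse, and run any candidate parser over that map. The names `confirmed`, `confirmParse`, and `findRegressions` are hypothetical, and the parser is assumed to be a function from a word to an array of roots.

```javascript
// Hypothetical store of parses a human has confirmed as correct.
const confirmed = new Map();

function confirmParse(word, roots) {
  confirmed.set(word, roots.join('-'));
}

// Run a candidate parser (word -> array of roots) over every confirmed word
// and report the words whose output no longer matches.
function findRegressions(parse) {
  const failures = [];
  for (const [word, expected] of confirmed) {
    const actual = parse(word).join('-');
    if (actual !== expected) failures.push({ word, expected, actual });
  }
  return failures;
}

confirmParse('enuiĝis', ['enu', 'iĝ', 'is']);
console.log(findRegressions(() => ['enu', 'i', 'ĝis']));
```

An empty result would mean the new algorithm still agrees with everything that has been confirmed so far.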

Light blue color looks like selection highlighting

The light blue color (first in the list) is too similar to the default color used by most browsers for highlighting the selection (at least under Mac OS X). Is this a big enough deal to warrant a change?

-o-n vs -on- (e.g., dekonaĵon)

The former is the noun ending plus the accusative (direct object) ending, while the latter is a suffix meaning "fraction" (e.g., dek = ten, dekono = a tenth). So "dekonaĵon" (Genesis 14:20, tithe) is a challenge: it should parse as dek-on-aĵ-o-n. The first -on- is a single root, while the "on" at the end is actually two roots, -o + -n.

In theory, we could add entries that have -on after all the numbers (duon, trion, kvaron, etc.). However, those aren't roots; each is two parts. There seem to be interesting constraints on where roots can appear: I think -on- can only appear in the middle of a word, while -o and -n can appear at the end.

Do we want to mark certain roots with some restrictions as to where they can appear? Only after numbers, only in the middle of a word, etc.
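
As a rough sketch of what such restrictions might look like, each root could carry a list of allowed positions, and a candidate parse would be rejected if any root is out of place. Everything here is an assumption for illustration: the `POSITION_RULES` table, the three position names, and the positions given for dek and aĵ. Only the guesses for -on-, -o, and -n come from the discussion above.

```javascript
// Hypothetical rules: each root lists the positions where it may appear.
// 'start' = first piece, 'end' = last piece, 'middle' = anything in between.
const POSITION_RULES = new Map([
  ['dek', ['start', 'middle']],
  ['on', ['middle']],          // fraction suffix
  ['aĵ', ['middle']],
  ['o', ['middle', 'end']],    // noun ending
  ['n', ['end']],              // accusative ending
]);

function positionOf(index, total) {
  if (index === 0) return 'start';
  return index === total - 1 ? 'end' : 'middle';
}

// A parse is acceptable only if every root is known and sits in an allowed position.
function respectsPositions(roots) {
  return roots.every((root, i) => {
    const allowed = POSITION_RULES.get(root);
    return allowed !== undefined && allowed.includes(positionOf(i, roots.length));
  });
}

console.log(respectsPositions(['dek', 'on', 'aĵ', 'o', 'n']));  // accepted
console.log(respectsPositions(['dek', 'o', 'n', 'aĵ', 'on']));  // rejected: -n is not last
```

A rule like "only after numbers" would need more than a position name, for example a check on the category of the preceding root.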

Greedy prefix method sometimes fails for complex words

The word "unuenaskitoj" (firstborns) should parse as unu-e-nask-it-o-j, but because we grab the longest root that matches the front of the word, we instead get unu-en-as-ki-{toj} (where "toj" is an unknown root, which luckily doesn't mean anything in Esperanto, AFAIK). It looks like we're going to have to go with something like "if the greedy prefix method leaves some un-parseable sections, try something else (say, greedy suffix, or iterating through all possible parsings, which would be far slower)."
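
Here is a small sketch of that fallback idea. The mini-lexicon `ROOTS` is hypothetical and contains just enough to reproduce the failure. The fallback shown is a backtracking search that stops at the first complete parse, which is a cheaper stand-in for iterating through all possible parsings and is not necessarily what we will end up using.

```javascript
// Hypothetical mini-lexicon, just enough to reproduce the failure described above.
const ROOTS = new Set(['unu', 'e', 'en', 'nask', 'as', 'it', 'ki', 'o', 'j']);
const MAX_LEN = Math.max(...[...ROOTS].map((r) => r.length));

// Greedy prefix: always take the longest known root at the current position.
// Returns null when it gets stuck on an unknown remainder.
function greedyPrefix(word) {
  const parts = [];
  let pos = 0;
  while (pos < word.length) {
    let len = Math.min(MAX_LEN, word.length - pos);
    while (len > 0 && !ROOTS.has(word.slice(pos, pos + len))) len--;
    if (len === 0) return null;
    parts.push(word.slice(pos, pos + len));
    pos += len;
  }
  return parts;
}

// Fallback: depth-first search that tries longer roots first and backs up
// whenever the rest of the word cannot be parsed.
function backtrack(word, pos = 0) {
  if (pos === word.length) return [];
  for (let len = Math.min(MAX_LEN, word.length - pos); len > 0; len--) {
    const piece = word.slice(pos, pos + len);
    if (!ROOTS.has(piece)) continue;
    const rest = backtrack(word, pos + len);
    if (rest !== null) return [piece, ...rest];
  }
  return null;
}

function parseWord(word) {
  return greedyPrefix(word) || backtrack(word);
}

console.log(greedyPrefix('unuenaskitoj'));        // null: stuck on "toj"
console.log(parseWord('unuenaskitoj').join('-')); // unu-e-nask-it-o-j
```

With this lexicon the greedy pass gets as far as unu-en-as-ki and then fails on "toj", so the fallback kicks in.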

Output list of unique roots that appear in the passage

Perhaps highlighting them in the root list.

Parsing challenge: enuiĝis

Should parse as enu-iĝ-is, but with the longest-root-first (LRF?) algorithm, it parses as enu-i-ĝis. That's because ĝis is a root: the preposition "until." I'd love to remove that lonely "i" as a root, since those single-letter roots cause problems, but it's the infinitive verb ending. Perhaps gathering all the possible parsings and discarding any with single-letter roots if there are other alternatives would take care of this problem (and perhaps others).
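
A minimal sketch of that idea, assuming a hypothetical mini-lexicon `KNOWN` that only covers this one word. The function names are made up for the example.

```javascript
// Hypothetical mini-lexicon for this one word.
const KNOWN = new Set(['enu', 'i', 'iĝ', 'is', 'ĝis']);

// Enumerate every way to split `word` into known roots.
function allParsings(word) {
  if (word === '') return [[]];
  const results = [];
  for (let len = 1; len <= word.length; len++) {
    const head = word.slice(0, len);
    if (!KNOWN.has(head)) continue;
    for (const tail of allParsings(word.slice(len))) {
      results.push([head, ...tail]);
    }
  }
  return results;
}

// Drop parsings that contain single-letter roots, but keep everything
// if that would leave no candidates at all.
function preferMultiLetter(parsings) {
  const filtered = parsings.filter((p) => p.every((root) => root.length > 1));
  return filtered.length > 0 ? filtered : parsings;
}

const candidates = allParsings('enuiĝis');
console.log(candidates.map((p) => p.join('-')));                    // [ 'enu-i-ĝis', 'enu-iĝ-is' ]
console.log(preferMultiLetter(candidates).map((p) => p.join('-'))); // [ 'enu-iĝ-is' ]
```

Note that the filter only applies when an alternative exists, so a word like unuenaskitoj, whose correct parse needs the single-letter roots e, o, and j, would still get through.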

Parsing challenge: deven-

Should be de-ven-, but parses as dev-en- due to the greedy prefix algorithm. None of the other approaches I've proposed would fix this. Maybe it's time to wander into multiple parsings per word, with some interface for popping up a list of alternatives to the one the algorithm has chosen to show, and perhaps a way to indicate that a word has multiple parse options.

Select ES6 testing framework

Jasmine, Karma, something else?

Don't allow addition of duplicate roots

Try longest suffix first (instead of longest prefix first)

Polymer component to highlight extracted roots

Think carefully about architecture so it's properly modular. Use data binding.

Alternate methods to input Esperanto special characters

Add buttons near the textarea to insert each of the 6 characters with diacriticals: ĉ ĝ ĥ ĵ ŝ ŭ. Oh, and capitalized ones, too. Though ŭ only comes after another vowel (just like j).

Translate from (and to?) alternate "encodings" (ux, ch, j^, others?)

Since 7-bit ASCII (and the common 8-bit Latin-1 encoding) has no Esperanto characters with diacriticals, early texts use several 7-bit ASCII "encodings". Common ones I've found are:

Adding an x after the character with a diacritical, so ĉ is represented as cx. This is by far the most common one.
The older system of adding an h after the character with a diacritical, so
Alternate characters: use w instead of ŭ (since w isn't used in Esperanto).
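
For the x-system, the conversion could look something like the sketch below. The names `X_SYSTEM`, `fromXSystem`, and `toXSystem` are hypothetical, and the code only handles the x convention. It relies on x not being an Esperanto letter, so foreign words that happen to contain one of these pairs (for example "Linux") would be converted incorrectly.

```javascript
// x-system digraphs and their Unicode equivalents.
const X_SYSTEM = { cx: 'ĉ', gx: 'ĝ', hx: 'ĥ', jx: 'ĵ', sx: 'ŝ', ux: 'ŭ' };

// Convert x-system text to Unicode.
function fromXSystem(text) {
  return text.replace(/[cghjsu]x/gi, (pair) => {
    const letter = X_SYSTEM[pair.toLowerCase()];
    // Keep the capitalisation of the base letter (Cx or CX -> Ĉ).
    return pair[0] === pair[0].toUpperCase() ? letter.toUpperCase() : letter;
  });
}

// Convert Unicode text back to the x-system.
function toXSystem(text) {
  return text.replace(/[ĉĝĥĵŝŭĈĜĤĴŜŬ]/g, (ch) => {
    const lower = ch.toLowerCase();
    const base = Object.keys(X_SYSTEM).find((k) => X_SYSTEM[k] === lower)[0];
    return (ch === lower ? base : base.toUpperCase()) + 'x';
  });
}

console.log(fromXSystem('ehxosxangxo cxiujxauxde')); // eĥoŝanĝo ĉiuĵaŭde
console.log(toXSystem('Ĉu'));                        // Cxu
```

The w-for-ŭ convention would be a single extra replacement, but it should probably be opt-in, since it would also rewrite any w in foreign names.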

...