An Esperanto parser.
professoro / esperanta-analizisto
License: MIT License
The former is a noun + accusative (direct object) ending, while the latter means "fraction" (e.g., dek = ten, dekono = a tenth). So "dekonaĵon" (Genesis 14:20, "tithe") is a challenge: the first -on- is a single root, while the -on at the end is actually two roots, -o + -n.
In theory, we could add entries that have -on after all the numbers (duon, trion, kvaron, etc.). However, those aren't roots; each is two parts. There seem to be constraints on where roots can appear: I think -on- can only appear in the middle of a word, while -o and -n can appear at the end.
Do we want to mark certain roots with some restrictions as to where they can appear? Only after numbers, only in the middle of a word, etc.
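To make the question concrete, here is a minimal sketch of what position restrictions on roots might look like. The field names (kind, after, final) and the isValidParse helper are hypothetical, invented for illustration, and the restrictions on -on- and -n encode the guesses above rather than settled grammar.

```javascript
// Hypothetical lexicon: each root may carry optional placement rules.
//   kind:  a category other roots can refer to (e.g. "number")
//   after: the root is only allowed directly after a root of this kind
//   final: true = only at the end of a word, false = never at the end
const lexicon = new Map([
  ["dek", { kind: "number" }],
  ["on", { after: "number", final: false }],
  ["aĵ", {}],
  ["o", {}],
  ["n", { final: true }],
]);

// Returns true when every part of a candidate parsing is a known root
// and satisfies its placement rules.
function isValidParse(parts, lexicon) {
  return parts.every((part, i) => {
    const entry = lexicon.get(part);
    if (!entry) return false;
    const isLast = i === parts.length - 1;
    if (entry.final === true && !isLast) return false;
    if (entry.final === false && isLast) return false;
    if (entry.after) {
      const prev = i > 0 ? lexicon.get(parts[i - 1]) : undefined;
      if (!prev || prev.kind !== entry.after) return false;
    }
    return true;
  });
}
```

With rules like these, dek-on-aĵ-o-n passes, while dek-o-n-aĵ-o-n is rejected because -n appears in the middle of the word.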
Jasmine, Karma, something else?
Add buttons near the textarea to insert each of the 6 characters with diacritics: ĉ ĝ ĥ ĵ ŝ ŭ. Oh, and capitalized ones, too. Though ŭ only comes after another vowel (just like j).
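One possible way to keep the insertion logic testable is to separate it from the DOM. The sketch below is only an illustration; ESPERANTO_LETTERS and insertAtCursor are hypothetical names, not existing project code.

```javascript
// The six letters with diacritics, lowercase then uppercase.
const ESPERANTO_LETTERS = [
  "ĉ", "ĝ", "ĥ", "ĵ", "ŝ", "ŭ",
  "Ĉ", "Ĝ", "Ĥ", "Ĵ", "Ŝ", "Ŭ",
];

// Pure helper: replaces the current selection (or inserts at the cursor
// when start === end) and reports where the cursor should land afterwards.
function insertAtCursor(value, selectionStart, selectionEnd, letter) {
  const newValue =
    value.slice(0, selectionStart) + letter + value.slice(selectionEnd);
  return { value: newValue, cursor: selectionStart + letter.length };
}
```

In the browser, a button's click handler could pass the textarea's value, selectionStart, and selectionEnd to this helper, write the returned value back, and restore the cursor with setSelectionRange.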
Should be de-ven-, but parses as dev-en- due to the greedy prefix algorithm. None of the other approaches I've proposed would fix this. Maybe it's time to wander into multiple parsings per word, with some interface for popping up a list of alternatives to the one the algorithm has chosen to show, and perhaps a way to indicate that a word has multiple parse options.
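A rough sketch of what "multiple parsings per word" could mean in code: enumerate every way to cut the word into known roots. The function name allParsings is hypothetical, and the word "devenas" and its tiny root list are chosen here purely for illustration, since the original word is not given above.

```javascript
// Returns every segmentation of `word` into roots from `rootList`,
// as an array of arrays of strings. Results per start index are memoized
// so shared tails are only computed once.
function allParsings(word, rootList) {
  const roots = new Set(rootList);
  const memo = new Map();

  function from(i) {
    if (i === word.length) return [[]]; // one way to parse nothing
    if (memo.has(i)) return memo.get(i);
    const results = [];
    for (let j = i + 1; j <= word.length; j++) {
      const head = word.slice(i, j);
      if (!roots.has(head)) continue;
      for (const tail of from(j)) results.push([head, ...tail]);
    }
    memo.set(i, results);
    return results;
  }

  return from(0);
}

// Illustrative data only.
const parsings = allParsings("devenas", ["de", "dev", "ven", "en", "as"]);
```

Here parsings holds both de-ven-as and dev-en-as, which is the list an "alternatives" popup would need. The number of parsings can grow quickly for long words with many short roots, so this is likely slower than the greedy approach.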
This is a ticket to hash out our algorithm for turning a block of Esperanto text into useful information. I don't have time to start coding this tonight, but I need to write down what I have in my head.
Here's my first pass at an algorithm:
I need to investigate integrating Polymer into the project and play with data bindings to save data to Firebase or localStorage.
Initial draft of function to split words: function splitWords(word, rootList), returns array of roots (strings). The rootList will eventually be a richer data storage object, but will initially be just an array of strings.
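As a starting point for discussion, here is a minimal sketch of splitWords using the longest-match-at-the-front approach, with rootList as a plain array of strings. Two details are assumptions made for this sketch: when nothing matches, the rest of the word is returned as a single unknown chunk, and that chunk is wrapped in braces.

```javascript
// Greedy prefix splitter: repeatedly take the longest known root that
// matches the front of what is left of the word.
function splitWords(word, rootList) {
  const roots = new Set(rootList);
  const maxLen = Math.max(0, ...rootList.map((r) => r.length));
  const parts = [];
  let i = 0;
  while (i < word.length) {
    let match = null;
    // Try the longest candidate first, then shrink.
    for (let len = Math.min(maxLen, word.length - i); len > 0; len--) {
      const candidate = word.slice(i, i + len);
      if (roots.has(candidate)) {
        match = candidate;
        break;
      }
    }
    if (match === null) {
      // Assumption: give up and mark the remainder as unknown.
      parts.push("{" + word.slice(i) + "}");
      break;
    }
    parts.push(match);
    i += match.length;
  }
  return parts;
}
```

For example, splitWords("hundoj", ["hund", "o", "j"]) returns ["hund", "o", "j"].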
Since there's no 7-bit (or even standard 8-bit) ASCII encoding for the Esperanto characters with diacriticals, early texts use several 7-bit ASCII "encodings". Common ones I've found are:
...
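Whatever the final list of conventions turns out to be, each one could probably be handled by a small normalization step before parsing. As an illustration only, the sketch below converts the widely used "x-system" (cx, gx, hx, jx, sx, ux), which is assumed here as an example and is not necessarily one of the conventions listed above. The names X_SYSTEM and fromXSystem are hypothetical.

```javascript
// Maps the base letter of an x-system digraph to its accented form.
const X_SYSTEM = {
  c: "ĉ", g: "ĝ", h: "ĥ", j: "ĵ", s: "ŝ", u: "ŭ",
  C: "Ĉ", G: "Ĝ", H: "Ĥ", J: "Ĵ", S: "Ŝ", U: "Ŭ",
};

// Replaces every digraph such as "cx" or "GX" with the accented letter.
// The case of the result follows the base letter, not the x.
function fromXSystem(text) {
  return text.replace(/([cghjsuCGHJSU])[xX]/g, (_, letter) => X_SYSTEM[letter]);
}
```

For example, fromXSystem("ehxosxangxo cxiujxauxde") gives "eĥoŝanĝo ĉiuĵaŭde". Other conventions would need their own tables, and some are ambiguous enough that a plain search-and-replace may not be safe.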
Think carefully about architecture so it's properly modular. Use data binding.
The word "unuenaskitoj" (firstborns) should parse as unu-e-nask-it-o-j, but because we grab the longest root that matches the front of the word, we instead get unu-en-as-ki-{toj} (where "toj" is an unknown root, which luckily doesn't mean anything in Esperanto, AFAIK). It looks like we're going to have to go with something like "if the greedy prefix method leaves some un-parseable sections, try something else (say, greedy suffix, or iterating through all possible parsings, which would be far slower)."
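The "try something else" idea could be sketched as follows: run the greedy scan from the front, and only if it leaves something unparsed, run the same scan from the back. The names greedy and parseWithFallback are hypothetical, and the root list in the usage note is a tiny illustrative one, not the real data.

```javascript
// Greedy scan in one direction. Returns the roots found, whatever was
// left over, and a flag telling whether the whole word was consumed.
function greedy(word, rootList, fromEnd) {
  const roots = new Set(rootList);
  const maxLen = Math.max(0, ...rootList.map((r) => r.length));
  const parts = [];
  let rest = word;
  while (rest.length > 0) {
    let match = null;
    for (let len = Math.min(maxLen, rest.length); len > 0; len--) {
      const candidate = fromEnd ? rest.slice(-len) : rest.slice(0, len);
      if (roots.has(candidate)) {
        match = candidate;
        break;
      }
    }
    if (match === null) break;
    if (fromEnd) {
      parts.unshift(match);
      rest = rest.slice(0, rest.length - match.length);
    } else {
      parts.push(match);
      rest = rest.slice(match.length);
    }
  }
  return { parts, rest, complete: rest.length === 0 };
}

// Greedy prefix first; greedy suffix only if the prefix pass got stuck.
function parseWithFallback(word, rootList) {
  const prefix = greedy(word, rootList, false);
  if (prefix.complete) return prefix.parts;
  const suffix = greedy(word, rootList, true);
  if (suffix.complete) return suffix.parts;
  // Neither direction worked: keep the prefix result, leftover in braces.
  return [...prefix.parts, "{" + prefix.rest + "}"];
}
```

With the roots unu, e, en, nask, as, ki, it, o, and j, the prefix pass gets stuck at "toj", and the suffix pass then yields unu-e-nask-it-o-j. Greedy suffix will have its own failure cases, so this is a cheap first fallback rather than a full fix.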
Should parse as enu-iĝ-is, but with the longest-root-first (LRF?) algorithm, it parses as enu-i-ĝis. That's because ĝis is a root: the preposition "until." I'd love to remove that lonely "i" as a root, since those single-letter roots cause problems, but it's the infinitive verb ending. Perhaps gathering all the possible parsings and discarding any with single-letter roots if there are other alternatives would take care of this problem (and perhaps others).
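The discarding step might look something like this sketch, which takes a list of candidate parsings (however they were gathered) and drops those containing single-letter roots, but only when at least one alternative remains. The name preferMultiLetterRoots is hypothetical.

```javascript
// parsings: array of parsings, each an array of root strings.
// Keeps only parsings without single-letter roots, unless that would
// leave nothing, in which case all parsings are returned unchanged.
function preferMultiLetterRoots(parsings) {
  const withoutSingles = parsings.filter((parts) =>
    parts.every((root) => root.length > 1)
  );
  return withoutSingles.length > 0 ? withoutSingles : parsings;
}

const candidates = [
  ["enu", "i", "ĝis"],
  ["enu", "iĝ", "is"],
];
const kept = preferMultiLetterRoots(candidates);
```

Here kept contains only enu-iĝ-is. Since ordinary endings such as -o are themselves single letters, many words will fall through to the "return everything" branch, so a softer rule (for example, preferring the parsing with the fewest single-letter roots) may be worth comparing.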
The light blue color (first in the list) is too similar to the default color used by most browsers for highlighting the selection (at least under Mac OS X). Is this a big enough deal to warrant a change?
Perhaps highlighting them in the root list.
This would allow for regression testing when trying new parsing / selection algorithms.