esperanta-analizisto's Introduction

esperanta-analizisto

An Esperanto parser.

esperanta-analizisto's People

Contributors

professoro, stephenwade

esperanta-analizisto's Issues

-o-n vs -on- (e.g., dekonaĵon)

The former is the noun ending -o followed by the accusative (direct object) ending -n, while the latter is the suffix meaning "fraction" (e.g., dek = ten, dekono = a tenth). So "dekonaĵon" (Genesis 14:20, tithe) is a challenge: it breaks down as dek-on-aĵ-o-n. The first -on- is a single root, while the "on" at the end is actually two roots, -o + -n.

In theory, we could add entries that have -on after all the numbers (duon, trion, kvaron, and so on). However, those aren't roots; each one is two parts. There also seem to be rules about where roots can appear: I think -on- can only appear in the middle of a word, while -o and -n can appear at the end.

Do we want to mark certain roots with some restrictions as to where they can appear? Only after numbers, only in the middle of a word, etc.
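
One way to picture such restrictions is a position flag on each root entry. This is only a sketch: the `position` field, its values ("any", "nonFinal", "final"), and the helper `isValidParse` are hypothetical names, not something the project defines. It handles word position only, not the "only after numbers" case.

```javascript
// Hypothetical root entries; `position` is an assumed field for illustration.
const roots = {
  dek: { position: "any" },
  on: { position: "nonFinal" }, // fraction suffix: assumed never word-final
  "aĵ": { position: "nonFinal" },
  o: { position: "any" },       // noun ending; may be followed by -j or -n
  n: { position: "final" },     // accusative: assumed to be the last part only
};

// Check that every part is a known root and sits where its entry allows.
function isValidParse(parts, roots) {
  return parts.every((part, i) => {
    const entry = roots[part];
    if (!entry) return false;
    const isLast = i === parts.length - 1;
    if (entry.position === "final") return isLast;
    if (entry.position === "nonFinal") return !isLast;
    return true;
  });
}

console.log(isValidParse(["dek", "on", "aĵ", "o", "n"], roots)); // true
console.log(isValidParse(["dek", "on", "aĵ", "on"], roots));     // false
```

With a flag like this, a parse that ends in -on- is rejected, so the final "on" of dekonaĵon has to be read as -o + -n.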

Parsing challenge: deven-

Should be de-ven-, but parses as dev-en- due to the greedy prefix algorithm. None of the other fixes I've proposed would solve this. Maybe it's time to wander into multiple parsings per word, with some interface for popping up a list of alternatives to the one the algorithm has chosen to show, and perhaps a way to indicate that a word has multiple parse options.
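
As a rough illustration of "multiple parsings per word", the sketch below enumerates every way to cover a word with known roots. The function name `allParsings`, the tiny root list, and the example word "devenas" are assumptions made for the example, not part of the project.

```javascript
// Hypothetical mini root list; the real project would supply its own.
const roots = new Set(["de", "dev", "ven", "en", "as"]);

// Return every way to cover `word` completely with known roots.
function allParsings(word, roots) {
  if (word === "") return [[]]; // exactly one way to parse nothing
  const results = [];
  for (let len = 1; len <= word.length; len++) {
    const head = word.slice(0, len);
    if (!roots.has(head)) continue;
    // Combine this head with every parsing of the remainder.
    for (const tail of allParsings(word.slice(len), roots)) {
      results.push([head, ...tail]);
    }
  }
  return results;
}

// Yields both candidates: de-ven-as and dev-en-as.
console.log(allParsings("devenas", roots).map((p) => p.join("-")));
```

An interface could show the first candidate and offer the rest as alternatives. Note that this enumeration can grow exponentially with word length, so it is likely only suitable as a per-word fallback.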

Algorithm for dealing with text

This is a ticket to hash out our algorithm for turning a block of Esperanto text into useful information. I don't have time to start coding this tonight, but I need to write down what I have in my head.

Here's my first pass at an algorithm:

  • Split the text by words
  • Look for tiny words first, like prepositions or other two or three letter words
  • For the words that are left, start by looking for a vowel at the end to dictate part of speech
  • Then start chopping off roots that we know about, like we discussed
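
The first three steps could be sketched roughly as follows. The small-word list, the ending-to-part-of-speech table, and the name `analyzeText` are illustrative assumptions; root chopping (the last step) is left out, and plural or accusative endings are not handled.

```javascript
// Assumed, deliberately tiny list of short function words.
const smallWords = new Set(["la", "kaj", "de", "en", "al", "ĝis"]);

// Final vowel -> part of speech (only the basic endings are covered here).
const endings = { o: "noun", a: "adjective", e: "adverb", i: "infinitive verb" };

function analyzeText(text) {
  // Step 1: split the text into words (runs of letters, Unicode-aware).
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];
  return words.map((word) => {
    // Step 2: check the small words first.
    if (smallWords.has(word)) return { word, kind: "small word" };
    // Step 3: use the final vowel to guess the part of speech.
    const kind = endings[word[word.length - 1]] || "unknown";
    return { word, kind };
  });
}

console.log(analyzeText("La bela hundo kuras rapide."));
```

Here "kuras" comes back as "unknown" because it ends in a consonant; verb tense endings would need their own rule.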

Investigate Polymer and Firebase

I need to investigate integrating Polymer into the project and play with data bindings to save data to Firebase or localStorage.

Split words

Initial draft of a function to split words: function splitWords(word, rootList), which returns an array of roots (strings). The rootList will eventually be a richer data storage object, but will initially be just an array of strings.
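
A minimal sketch of what such a function might look like, assuming a longest-match-from-the-front strategy and assuming that any unparseable remainder is returned as a single chunk in braces (following the {toj} notation used in another issue). Only the signature comes from this ticket; the body is a guess.

```javascript
function splitWords(word, rootList) {
  // Try longer roots first so the longest match wins at each position.
  const byLength = [...rootList].sort((a, b) => b.length - a.length);
  const parts = [];
  let rest = word;
  while (rest.length > 0) {
    const match = byLength.find(
      (root) => root.length > 0 && rest.startsWith(root)
    );
    if (match === undefined) {
      // Assumption: mark whatever is left as one unknown chunk and stop.
      parts.push(`{${rest}}`);
      break;
    }
    parts.push(match);
    rest = rest.slice(match.length);
  }
  return parts;
}

console.log(splitWords("hundoj", ["o", "j", "hund"])); // hund, o, j
```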

Translate from (and to?) alternate "encodings" (ux, ch, j^, others?)

Since there's no 7-bit (or even standard 8-bit) ASCII encoding for the Esperanto characters with diacriticals, early texts use several 7-bit ASCII "encodings". Common ones I've found are:

  1. Adding an x after the character with a diacritical, so ĉ is represented as cx. This is by far the most common one.
  2. The older system of adding an h after the character with a diacritical, so
  3. Alternate characters: use w instead of ŭ (since w isn't used in Esperanto).

...
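
For the x-system (the first convention above), a conversion to the proper Unicode letters might look like the sketch below. The function name `fromXSystem` is hypothetical, and the sketch assumes that an "x" after c, g, h, j, s, or u always marks a diacritic, which holds for Esperanto words but not for foreign names embedded in the text.

```javascript
// Base letter -> accented letter for the x-system.
const xMap = { c: "ĉ", g: "ĝ", h: "ĥ", j: "ĵ", s: "ŝ", u: "ŭ" };

function fromXSystem(text) {
  return text.replace(/([cghjsu])x/gi, (match, letter) => {
    const accented = xMap[letter.toLowerCase()];
    // Preserve the case of the base letter (Cx and CX both become Ĉ).
    return letter === letter.toLowerCase() ? accented : accented.toUpperCase();
  });
}

console.log(fromXSystem("ehxosxangxo cxiujxauxde")); // eĥoŝanĝo ĉiuĵaŭde
```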

Greedy prefix method sometimes fails for complex words

The word "unuenaskitoj" (firstborns) should parse as unu-e-nask-it-o-j, but because we grab the longest root that matches the front of the word, we instead get unu-en-as-ki-{toj} (where "toj" is an unknown root, which luckily doesn't mean anything in Esperanto, AFAIK). It looks like we're going to have to go with something like "if the greedy prefix method leaves some un-parseable sections, try something else (say, greedy suffix, or iterating through all possible parsings, which would be far slower)."
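
The "try something else" idea could be sketched like this, using greedy suffix as the fallback. The root list is a hypothetical minimum chosen to reproduce the example, and the function names are made up for illustration; the brace notation for unknown chunks follows the {toj} example above.

```javascript
// Hypothetical root list, just enough to reproduce the example.
const roots = ["unu", "e", "en", "nask", "as", "ki", "it", "o", "j"];

// Longest root matching the start (or end) of `rest`, or undefined.
function longestMatch(rest, roots, fromEnd) {
  return roots
    .filter((r) => r.length > 0 && (fromEnd ? rest.endsWith(r) : rest.startsWith(r)))
    .sort((a, b) => b.length - a.length)[0];
}

function greedyPrefix(word, roots) {
  const parts = [];
  let rest = word;
  while (rest.length > 0) {
    const match = longestMatch(rest, roots, false);
    if (!match) { parts.push(`{${rest}}`); break; } // unknown remainder
    parts.push(match);
    rest = rest.slice(match.length);
  }
  return parts;
}

function greedySuffix(word, roots) {
  const parts = [];
  let rest = word;
  while (rest.length > 0) {
    const match = longestMatch(rest, roots, true);
    if (!match) { parts.unshift(`{${rest}}`); break; } // unknown remainder
    parts.unshift(match);
    rest = rest.slice(0, rest.length - match.length);
  }
  return parts;
}

// Use greedy prefix, and only fall back if it leaves an unknown chunk.
function splitWithFallback(word, roots) {
  const first = greedyPrefix(word, roots);
  return first.some((p) => p.startsWith("{")) ? greedySuffix(word, roots) : first;
}

console.log(greedyPrefix("unuenaskitoj", roots).join("-"));      // unu-en-as-ki-{toj}
console.log(splitWithFallback("unuenaskitoj", roots).join("-")); // unu-e-nask-it-o-j
```

Greedy suffix happens to work for this word, but it can fail in the mirror-image way for others, so it is no substitute for enumerating all parsings when both directions leave unknown chunks.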

Parsing challenge: enuiĝis

Should parse as enu-iĝ-is, but with longest-root-first (LRF?) algorithm, it parses as enu-i-ĝis. That's because ĝis is a root—the preposition "until." I'd love to remove that lonely "i" as a root, since those single-letter roots cause problems, but it's the infinitive verb ending. Perhaps gathering all the possible parsings and discarding any with single-letter roots if there are other alternatives would take care of this (and perhaps other) problems.
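
The filtering rule proposed here could be sketched as below, assuming the candidate parsings have already been gathered by some other routine. The name `preferMultiLetter` is hypothetical. Letters are counted by code point after NFC normalization so that "ĝ" counts as one letter even if it arrives as "g" plus a combining circumflex.

```javascript
// True if the part is a single letter (counted by code point after NFC).
function isSingleLetter(part) {
  return [...part.normalize("NFC")].length === 1;
}

// Drop parsings that contain single-letter roots, but only when at
// least one parsing without them exists; otherwise keep everything.
function preferMultiLetter(parsings) {
  const clean = parsings.filter((parts) => !parts.some(isSingleLetter));
  return clean.length > 0 ? clean : parsings;
}

const candidates = [
  ["enu", "i", "ĝis"],
  ["enu", "iĝ", "is"],
];
console.log(preferMultiLetter(candidates).map((p) => p.join("-"))); // enu-iĝ-is only
```

When every candidate contains a single-letter root (as with a plain noun ending in -o), nothing is discarded, so the rule never leaves a word without a parsing.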

Light blue color looks like selection highlighting

The light blue color (first in the list) is too similar to the default color used by most browsers for highlighting the selection (at least under Mac OS X). Is this a big enough deal to warrant a change?
