
Comments (4)

mjansche commented on May 3, 2024

Not yet in the form of a Java library function. However, the functionality exists, with a command-line interface in C++. (This could be hooked into Java via JNI.) Here is how to use it:

$ bazel build -c opt //my:detect-charset
INFO: Found 1 target...
Target //my:detect-charset up-to-date:
  bazel-bin/my/detect-charset
INFO: Elapsed time: 0.193s, Critical Path: 0.00s
$ echo ဧဝရက္ေတာင္ထိပ္  | bazel-bin/my/detect-charset
ဧဝရက္ေတာင္ထိပ္  zawgyi.fst  32.5596 unicode.fst 62.6379
$ 

The last command produces tab-separated output, with the name of the most likely model in the second column. As the example shows, the string was correctly classified as Zawgyi. This approach is based on statistical language models and makes occasional mistakes. The models are checked in, so you can run the above example just by compiling the runtime code with bazel build as shown above.
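If you want to consume that output from a script, a minimal sketch in Python might look like the following. It assumes each line has the layout seen in the example above: the input text, then alternating model-name and score fields, all separated by tabs, with the most likely model listed first. The function name parse_detect_output is hypothetical and not part of the tool.

```python
from typing import Dict, Tuple


def parse_detect_output(line: str) -> Tuple[str, str, Dict[str, float]]:
    """Split one output line into (text, most likely model, scores).

    Assumed layout (inferred from the example, not from the tool's docs):
    text <TAB> model <TAB> score <TAB> model <TAB> score ...
    """
    fields = line.rstrip("\n").split("\t")
    text, rest = fields[0], fields[1:]
    if len(rest) < 2 or len(rest) % 2 != 0:
        raise ValueError("expected alternating model/score fields after the text")
    # Pair each model name with the score that follows it.
    scores = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    # The second column of the line names the most likely model.
    best_model = rest[0]
    return text, best_model, scores


if __name__ == "__main__":
    sample = "ဧဝရက္ေတာင္ထိပ္\tzawgyi.fst\t32.5596\tunicode.fst\t62.6379\n"
    text, best_model, scores = parse_detect_output(sample)
    print(best_model, scores)
```

Reading the model name from the second column, rather than re-deriving it from the scores, keeps the script in agreement with whatever decision rule the tool itself applies.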

This approach still needs to be tweaked a bit in order to better handle strings that are ambiguous (e.g. ပါ is the same in Zawgyi and in Unicode). It also won't work if you give it Unicode input that is not in the Burmese language (e.g. Shan text).

from language-resources.

brawer commented on May 3, 2024

By the way, just in case you might find it useful: for detecting Zawgyi, there’s a regexp in the ICU codebase. As far as I know (which might easily be wrong), there’s currently no public API for calling it, but you might find it interesting. For converting Zawgyi to Unicode, CLDR v30 contains a transformation with the BCP47-T identifier my-t-my-s0-zawgyi. CLDR v30 is currently in alpha, but the final release should happen very soon.

Not sure whether there are any quality differences between CLDR/ICU and the code in this repo. If you notice anything, please don’t hesitate to tell us.


4tee commented on May 3, 2024

@mjansche, the statistical model is very nice. Is 32.5596 a percentage likelihood? unicode.fst has 62.6379, so shouldn't the text carry a higher weight for the Unicode encoding? But I really like the idea of a statistical model.

Thanks for the pointers, @brawer. It is really interesting and I will give it a try.


mjansche commented on May 3, 2024

For completeness: the scores are negative log-probabilities assigned by each model, so a lower score means a higher probability, and the decision is made based on the difference in log-likelihood (i.e. the likelihood ratio). In this example, the text is much more likely under the Zawgyi model (32.5596) than under the Unicode model (62.6379).
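To make the arithmetic concrete, here is a minimal sketch in Python of a decision based on negative log-probabilities. It is not the tool's actual code: the function name choose_model is hypothetical, and converting the margin to a likelihood ratio with exp assumes the scores use natural logarithms, which the thread does not state.

```python
import math
from typing import Dict, Tuple


def choose_model(neg_log_probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the model with the lowest negative log-probability and its
    margin over the runner-up (the log-likelihood difference)."""
    if len(neg_log_probs) < 2:
        raise ValueError("need scores from at least two models")
    # Lower negative log-probability means higher probability.
    ranked = sorted(neg_log_probs.items(), key=lambda kv: kv[1])
    (best, best_cost), (_, runner_up_cost) = ranked[0], ranked[1]
    return best, runner_up_cost - best_cost


if __name__ == "__main__":
    # Scores taken from the example output earlier in the thread.
    scores = {"zawgyi.fst": 32.5596, "unicode.fst": 62.6379}
    best, margin = choose_model(scores)
    print(best, round(margin, 4))  # zawgyi.fst 30.0783
    # ASSUMPTION: natural-log scores; under a different log base the
    # ratio changes, but the winning model does not.
    print(f"likelihood ratio ~ {math.exp(margin):.3g}")
```

A margin of about 30 in the log domain corresponds to an enormous likelihood ratio, which is why the Zawgyi classification in the example is not a close call.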
