Hi, Thanks for this library. I am wondering if there is a function t

Detection of font encoding about language-resources HOT 4 CLOSED

google commented on May 3, 2024

Detection of font encoding

from language-resources.

Comments (4)

mjansche commented on May 3, 2024

Not yet in the form of a Java library function. However, the functionality exists, with a command-line interface in C++. (This could be hooked into Java via JNI.) Here is how to use it:

$ bazel build -c opt //my:detect-charset
INFO: Found 1 target...
Target //my:detect-charset up-to-date:
  bazel-bin/my/detect-charset
INFO: Elapsed time: 0.193s, Critical Path: 0.00s
$ echo ဧဝရက္ေတာင္ထိပ္  | bazel-bin/my/detect-charset
ဧဝရက္ေတာင္ထိပ္  zawgyi.fst  32.5596 unicode.fst 62.6379
$

The last command produces tab-separated output, with the name of the most likely model in the second column. As the example shows, that string was correctly classified as Zawgyi. This approach is based on statistical language models and makes occasional mistakes. The models are checked in, so you can run the above example just by compiling the runtime code with bazel build as shown above.

This approach still needs some tuning to better handle strings that are ambiguous (e.g. ပါ is the same in Zawgyi and in Unicode). It also won't work on Unicode input that is not in the Burmese language (e.g. Shan text).
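If you want to consume that output from a script, here is a minimal Python sketch. It assumes each line is the input text followed by tab-separated (model name, score) pairs, as in the example above. The names `Detection` and `parse_detect_line` are made up for illustration and are not part of this repository.

```python
from typing import Dict, NamedTuple


class Detection(NamedTuple):
    text: str
    best_model: str           # model named in the second column
    scores: Dict[str, float]  # model name -> score as printed by the tool


def parse_detect_line(line: str) -> Detection:
    """Parse one tab-separated output line: text, then (model, score) pairs."""
    fields = line.rstrip("\n").split("\t")
    text, rest = fields[0], fields[1:]
    if len(rest) < 2 or len(rest) % 2 != 0:
        raise ValueError(f"expected text plus (model, score) pairs: {line!r}")
    scores = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    # Per the description above, the second column names the most likely model.
    return Detection(text=text, best_model=rest[0], scores=scores)


if __name__ == "__main__":
    sample = "ဧဝရက္ေတာင္ထိပ္\tzawgyi.fst\t32.5596\tunicode.fst\t62.6379\n"
    result = parse_detect_line(sample)
    print(result.best_model)
```

In practice you would feed it the lines read from the standard output of bazel-bin/my/detect-charset rather than a hard-coded sample.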


brawer commented on May 3, 2024

By the way, just in case you might find it useful: for detecting Zawgyi, there’s a regexp in the ICU codebase. As far as I know (which might easily be wrong), there’s currently no public API for calling it, but you might find it interesting. For converting Zawgyi to Unicode, CLDR v30 contains a transformation with BCP47-T identifier my-t-my-s0-zawgyi. CLDR v30 is currently in alpha, but the final release should happen very soon.

Not sure whether there are any quality differences between CLDR/ICU and the code in this repo. If you notice anything, please don’t hesitate to tell us.


4tee commented on May 3, 2024

@mjansche, the statistical model is very nice. Is 32.5596 a percentage likelihood? unicode.fst has 62.6379, so shouldn't the text get a higher weight for the Unicode encoding? But I really like the idea of a statistical model.

Thanks for the pointers, @brawer. It is really interesting and I will give it a try.


mjansche commented on May 3, 2024

For completeness: the scores are negative log-probabilities assigned by each model, so a lower score means a more likely model. The decision is based on the difference in log-likelihood (i.e. the likelihood ratio). In this example, the Zawgyi model scores 32.5596 against 62.6379 for the Unicode model, so Zawgyi is by far the more likely encoding.
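To make the arithmetic concrete, here is a small Python sketch of a decision based on that difference. It assumes the scores are natural-log negative log-probabilities and that both encodings are equally likely a priori. Neither assumption is confirmed in this thread, and the function names are made up for illustration.

```python
import math


def log_likelihood_ratio(nll_a: float, nll_b: float) -> float:
    """Return log(P_a / P_b) given the negative log-probabilities of a and b."""
    return nll_b - nll_a


def posterior_a(nll_a: float, nll_b: float) -> float:
    """P(model a | text) under equal priors (an assumption), computed stably."""
    llr = log_likelihood_ratio(nll_a, nll_b)
    if llr >= 0:
        return 1.0 / (1.0 + math.exp(-llr))
    e = math.exp(llr)  # llr < 0, so this cannot overflow
    return e / (1.0 + e)


if __name__ == "__main__":
    zawgyi, unicode_ = 32.5596, 62.6379  # scores from the example above
    llr = log_likelihood_ratio(zawgyi, unicode_)
    print(f"log-likelihood ratio in favour of Zawgyi: {llr:.4f}")
    print(f"posterior for Zawgyi under equal priors: {posterior_a(zawgyi, unicode_):.6f}")
```

A positive log-likelihood ratio favours the first model. The larger its magnitude, the more confident the decision, which is why a string such as ပါ, which both models score about equally, is hard to classify.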
