Comments (4)
Not yet in the form of a Java library function. However, the functionality exists, with a command-line interface in C++. (This could be hooked into Java via JNI.) Here is how to use it:
$ bazel build -c opt //my:detect-charset
INFO: Found 1 target...
Target //my:detect-charset up-to-date:
bazel-bin/my/detect-charset
INFO: Elapsed time: 0.193s, Critical Path: 0.00s
$ echo ဧဝရက္ေတာင္ထိပ္ | bazel-bin/my/detect-charset
ဧဝရက္ေတာင္ထိပ္ zawgyi.fst 32.5596 unicode.fst 62.6379
$
The last command produces tab-separated output, with the name of the most likely model in the second column. As you can see in the above example, that string was correctly classified as Zawgyi. This approach is based on statistical language models and makes occasional mistakes. The models are checked in, so you can run the above example just by compiling the runtime code with bazel build as shown above.
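If you want to consume that output from a script, a minimal sketch could look like the following. It assumes each output line is the input text followed by (model name, score) pairs, all tab-separated, as in the example above; the names `Detection`, `parse_detection` and `best_model` are made up for illustration and are not part of the tool.

```python
from typing import Dict, NamedTuple


class Detection(NamedTuple):
    """One parsed output line (hypothetical container, not part of the tool)."""
    text: str
    scores: Dict[str, float]  # model name -> score, in output order


def parse_detection(line: str) -> Detection:
    """Parse one tab-separated line: text, then (model, score) pairs.

    The field layout is an assumption based on the single example line
    shown in this thread.
    """
    fields = line.rstrip("\n").split("\t")
    # Expect the text plus at least one complete (model, score) pair.
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError("unexpected number of fields: %d" % len(fields))
    scores = {fields[i]: float(fields[i + 1]) for i in range(1, len(fields), 2)}
    return Detection(fields[0], scores)


def best_model(detection: Detection) -> str:
    """Return the first listed model, i.e. the second column of the line."""
    return next(iter(detection.scores))
```

For the line shown above, `best_model(parse_detection(line))` would give `zawgyi.fst`, provided the columns really are separated by tabs.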
This approach still needs to be tweaked a bit in order to better handle strings that are ambiguous (e.g. ပါ is the same in Zawgyi and in Unicode). Also, it won't work if you give it Unicode input that's not in the Burmese language (e.g. Shan text).
from language-resources.
By the way, just in case you might find it useful: for detecting Zawgyi, there’s a regexp in the ICU codebase. As far as I know (which might easily be wrong), there’s currently no public API for calling it, but you might find it interesting. For converting Zawgyi to Unicode, CLDR v30 contains a transformation with BCP47-T identifier my-t-my-s0-zawgyi. CLDR v30 is currently in alpha, but the final release should happen very soon.
Not sure whether there are any quality differences between CLDR/ICU and the code in this repo. If you notice anything, please don’t hesitate to let us know.
@mjansche, the statistical model is very nice. Is 32.5596 a percentage likelihood? unicode.fst has 62.6379; doesn't the higher weight point to the Unicode encoding? But I really like the idea of a statistical model.
Thanks for the pointers, @brawer. It is really interesting and I will give it a try.
For completeness: the scores are negative log-probabilities assigned by each model, so a lower score means a more likely model. The decision is made based on the difference in log-likelihood (i.e. the likelihood ratio). In this example, the Zawgyi model is much more likely than the Unicode model.
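To make the arithmetic concrete, here is a small sketch of that decision rule applied to the two scores from the example. The function names and the `min_margin` threshold are hypothetical; the threshold is only an illustration of how near-ties might be flagged and is not something the tool is known to do.

```python
def log_likelihood_ratio(nll_zawgyi: float, nll_unicode: float) -> float:
    """Log-likelihood of Zawgyi minus log-likelihood of Unicode.

    Inputs are negative log-probabilities, so
    log P_zawgyi - log P_unicode = nll_unicode - nll_zawgyi.
    """
    return nll_unicode - nll_zawgyi


def classify(nll_zawgyi: float, nll_unicode: float, min_margin: float = 0.0) -> str:
    """Pick the more likely model, or report a near-tie.

    min_margin is a hypothetical cutoff: if the two scores differ by no
    more than this amount, the string is reported as "ambiguous".
    """
    llr = log_likelihood_ratio(nll_zawgyi, nll_unicode)
    if abs(llr) <= min_margin:
        return "ambiguous"
    return "zawgyi" if llr > 0 else "unicode"


# Scores from the example: zawgyi.fst 32.5596, unicode.fst 62.6379.
print(classify(32.5596, 62.6379))  # prints "zawgyi"
```

With these numbers the difference is 62.6379 - 32.5596 = 30.0783 in favor of Zawgyi, which is why the string is classified as Zawgyi even though the Unicode score is the larger number.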