Comments (7)
The Tesseract documentation is inconsistent about whether ISO 639-2 or ISO 639-3 is used:
doc/combine_lang_model.1.asc: Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
doc/tesseract.1.asc: Tesseract uses 3-character ISO 639-2 language codes
include/tesseract/baseapi.h: * The language is (usually) an ISO 639-3 string or nullptr will default to
src/api/baseapi.cpp: * The language is (usually) an ISO 639-3 string or nullptr will default to eng.
I don't mind fixing this and using ISO 639-3 everywhere, but would like to get more feedback here from people who are affected by a renaming of the models for Chinese.
from tessdata_fast.
So the suggestion is that the OCR models starting with chi_ be renamed to zho_ instead.
Comments from Chinese users of Tesseract are welcome.
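To make the proposed rename concrete, here is a minimal Python sketch of a dry run. It only computes the new file names and does not touch the file system. The function names (proposed_name, plan_renames) and the example file name in the comment are hypothetical and only for illustration; nothing here is part of Tesseract itself.

```python
from pathlib import Path

OLD_PREFIX = "chi_"  # prefix currently used by the Chinese models (per this thread)
NEW_PREFIX = "zho_"  # prefix proposed in this thread


def proposed_name(filename: str) -> str:
    """Return the file name with a leading chi_ replaced by zho_.

    Names that do not start with chi_ or are not .traineddata files
    are returned unchanged.
    """
    if filename.startswith(OLD_PREFIX) and filename.endswith(".traineddata"):
        return NEW_PREFIX + filename[len(OLD_PREFIX):]
    return filename


def plan_renames(tessdata_dir: Path) -> list[tuple[Path, Path]]:
    """List (old, new) path pairs for a directory; nothing is renamed."""
    plan = []
    for path in sorted(tessdata_dir.glob("chi_*.traineddata")):
        plan.append((path, path.with_name(proposed_name(path.name))))
    return plan


# Example with an illustrative file name:
# proposed_name("chi_sim.traineddata") gives "zho_sim.traineddata"
```

Anyone who passes the language name to Tesseract explicitly would also have to change that argument, which is why feedback from affected users matters.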
AFAIK chi is (also) a valid code for Chinese:
https://www.w3.org/WAI/ER/IG/ert/iso639.htm
https://www.loc.gov/standards/iso639-2/php/code_list.php
https://iso639-3.sil.org/code/chi
Never rely on one source (especially if anybody can change it).
Here is the official site; I should probably have linked to that instead of Wikipedia in the first place.
While chi is indeed also a valid code for Chinese, it is the ISO 639-2/B code (as can also be seen on the official site, which you linked to as well).
All other languages use the ISO 639-3 codes, however. For example, Czech is ces.traineddata (ISO 639-3) and not cze.traineddata (ISO 639-2/B).
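The difference between the two code sets can be sketched as a small lookup table in Python. The table is only an illustrative subset of the languages whose ISO 639-2/B code differs from the ISO 639-3 code, and the names B_TO_ISO639_3 and to_iso639_3 are hypothetical; this is not something Tesseract provides.

```python
# Illustrative subset: ISO 639-2/B code -> ISO 639-3 code.
# Only languages where the two codes differ are listed.
B_TO_ISO639_3 = {
    "chi": "zho",  # Chinese
    "cze": "ces",  # Czech
    "ger": "deu",  # German
    "fre": "fra",  # French
    "dut": "nld",  # Dutch
}


def to_iso639_3(code: str) -> str:
    """Map a known ISO 639-2/B code to ISO 639-3; pass other codes through."""
    code = code.lower()
    return B_TO_ISO639_3.get(code, code)


# to_iso639_3("cze") gives "ces", while "ces" and "eng" are returned unchanged.
```

Under this mapping, chi is the only entry in the table that would still need to change to follow the pattern described above.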
It's really unfortunate that Tesseract itself isn't even consistent about which standard it uses.
Note the (usually) 😆
There exist more ISO 639-3 codes for Chinese variants (at least cmn, yue, nan).
Yes, those are for variants of the language, although I'd argue that the model is applicable to the macrolanguage Chinese, hence zho fits best. Of course, it would also be possible to have multiple models for the individual Chinese languages.