Comments (35)
As mentioned here before, Armenian is supported.
You need Tesseract 4.0 or newer, and you need to download hye.traineddata.
from langdata.
Vahe, please add the following info:
- Which language code: arm or hye
- Modern Armenian or Classical Armenian
- Sources for primary texts in Unicode in the Armenian language to use for training
- Freely available Unicode fonts to render the text
langdata has https://github.com/tesseract-ocr/langdata/blob/master/Armenian.unicharset
but no folder for the Armenian language.
@theraysmith Is this one of the new languages included in your current training?
I had closed an earlier issue - #51
http://crubadan.org/languages/hy
(zip file has word frequency lists, unigrams, bigrams etc)
https://en.wikipedia.org/wiki/Armenian_language
https://en.wikipedia.org/wiki/Eastern_Armenian
https://en.wikipedia.org/wiki/Western_Armenian
https://en.wikipedia.org/wiki/Classical_Armenian_orthography
https://en.wikipedia.org/wiki/Armenian_orthography_reform
https://en.wikipedia.org/wiki/Armenian_alphabet
Thanks for all the comments (sorry for the late response):
Language code is: arm
Modern Armenian: Eastern_Armenian
For fonts please refer to this link: http://armunicode.com/en/fonts/unicode/
Regarding this one:
Sources for primary texts in Unicode in the Armenian language to use for training
Do you need any Armenian text pages?
Yes, there is an Armenian Wikipedia; this is the link:
https://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D5%A7%D5%BB
I will try to get some Unicode text resources and share them with you.
Thank you once again.
I attached a text file with Armenian Unicode text; I hope it helps. If you need more, please let me know.
Thanks, I will give it a try and let you know.
Attached is a zip file with arm.traineddata for use with --oem 0, i.e. the legacy engine only, for testing. Please give it a try; I have not done any evaluation of it.
I did training using the following command:
training/tesstrain.sh \
--fonts_dir /mnt/c/Windows/Fonts \
--lang arm \
--exposures "0" \
--langdata_dir ../langdata \
--tessdata_dir ../tessdata \
--output_dir ~/tesstutorial/arm \
--fontlist "Arial" \
"Consolas" \
"Courier New" \
"DejaVu Sans" \
"DejaVu Sans Mono" \
"DejaVu Serif" \
"FreeMono" \
"FreeSans" \
"FreeSerif" \
"Microsoft Sans Serif" \
"Segoe UI" \
"Sylfaen" \
"Tahoma" \
"Times New Roman," \
"Trebuchet MS" \
"Verdana" \
"Verdana Bold" \
"Verdana Bold Italic" \
"Verdana Italic"
Attached is an eval report using one of the training text images - arm.Sylfaen.exp0.txt
CER 2.91
WER 5.02
WER (order independent) 4.63
Thanks a lot for the files. Could you please tell me exactly what to do for the next step, and what we are missing?
Thank you very much once again.
I did some tests. For the first one I got:
Error in pixGenHalftoneMask: pix too small: w = 270, h = 97
But the output overall is not bad (attaching the original and the output); some characters are wrong.
armeniantext.txt
The next test was better, no errors.
second.txt
Waiting for your suggestions.
Please see attached zip file.
It has a newer arm.traineddata as well as the training_text, fonts list, etc. that I used. You can test to see if this is better than the earlier version. Use --oem 0, since it does not have LSTM traineddata.
You can do training by modifying training text etc.
You will need to add arm as a valid language code in
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L21
and also add a line similar to https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L921 for arm.
Thank you very much once again.
I will try to do the test on Monday and post the result. I tested the new arm-2.zip and got the same output, with no big difference.
Could you please help me with this issue:
training/./tesstrain.sh \
--fonts_dir /root/ocr/training/Fonts \
--lang arm \
--exposures "0" \
--langdata_dir ../langdata \
--tessdata_dir ../tessdata \
--output_dir /root/ocr/training_output \
--fontlist "Aramian Normal" "Arial AM"
=== Starting training for language 'arm'
ERROR: Error: arm is not a valid language code
Thank you once again.
Thanks, Ray.
However, hye is marked as an unusable language code. Also, there is no folder for hye in langdata.
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L36
@vahenr Please see earlier comment at #67 (comment)
You will need to add arm as a valid language code in
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L21
and also add a line similar to https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L921 for arm.
Or as suggested by Ray, use hye as the language code.
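The "not a valid language code" error comes from checking the requested code against a list of known codes. Below is a minimal sketch of that kind of check. The variable name VALID_LANGUAGE_CODES and the abbreviated list are assumptions for illustration, not a copy of language-specific.sh, so check the linked lines for the real definitions.

```shell
#!/bin/sh
# Illustrative stand-in for the list in language-specific.sh
# (abbreviated; the variable name is an assumption).
VALID_LANGUAGE_CODES="afr amh ara eng"

# Return 0 if $1 appears in the space-separated list, 1 otherwise.
is_valid_language() {
  for code in $VALID_LANGUAGE_CODES; do
    if [ "$code" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# Before the change, arm is rejected.
is_valid_language arm || echo "ERROR: arm is not a valid language code"

# Adding the code to the list is the kind of change suggested above.
VALID_LANGUAGE_CODES="$VALID_LANGUAGE_CODES arm"
is_valid_language arm && echo "arm is now accepted"
```

The second suggested change, a per-language line further down in the same script, is not shown here because its exact form depends on the script version.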
What do I need to put in this file: arm.training_text? This is for the option --langdata_dir ../langdata
https://github.com/tesseract-ocr/langdata/files/923560/arm-2.zip
The above zip file has the files that I used. Put them in a folder named arm under langdata. The training text I used has the text from the doc file you sent, Unicode text for UDHR, and some text copied from Wikipedia.
The wordlist is taken from the crubadan site; the link is given in an earlier comment in this thread.
These will be sufficient for legacy training. My attempts at LSTM training were not successful. Hopefully Ray will provide new traineddata for Armenian soon.
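The folder layout described above can be sketched as follows. The helper name install_langdata and the example paths are hypothetical. The file names inside the zip are not listed in this thread, so the sketch simply copies everything from the extracted directory.

```shell
#!/bin/sh
# Copy extracted training files into <langdata_dir>/<lang>, which is where
# tesstrain.sh is expected to look when given --langdata_dir and --lang.
# Hypothetical helper: $1 = extracted zip dir, $2 = langdata dir, $3 = language code.
install_langdata() {
  src="$1"
  langdata="$2"
  lang="$3"
  if [ ! -d "$src" ]; then
    echo "source directory not found: $src" >&2
    return 1
  fi
  mkdir -p "$langdata/$lang"
  # "src/." copies the directory contents, including hidden files.
  cp -R "$src"/. "$langdata/$lang"/
}

# Example call (paths are assumptions):
#   install_langdata ./arm-2 ../langdata arm
```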
Also download other required files from langdata repo. Read the readme file for requirements or just clone the whole repo.
See https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh for info on training.
Thank you @Shreeshrii for your help in adding Armenian to Tesseract!
https://github.com/tesseract-ocr/tessdata/tree/master/best
Armenian.traineddata
hye.traineddata
@vahenr @gelinger777 Please test Armenian support with the newly posted best traineddata for use with the LSTM engine
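A minimal sketch for such a test, assuming hye.traineddata has been saved into a local directory and that you have a scanned Armenian page of your own. The helper name run_armenian_ocr and the example paths are hypothetical. The options -l, --oem and --tessdata-dir are standard tesseract command-line options, and --oem 1 selects the LSTM engine.

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on one image with the hye LSTM model.
# $1 = input image, $2 = output base name (tesseract appends .txt),
# $3 = directory that contains hye.traineddata.
run_armenian_ocr() {
  image="$1"
  outbase="$2"
  tessdata="$3"
  if [ ! -f "$tessdata/hye.traineddata" ]; then
    echo "hye.traineddata not found in $tessdata" >&2
    return 1
  fi
  if ! command -v tesseract >/dev/null 2>&1; then
    echo "tesseract is not installed or not on PATH" >&2
    return 1
  fi
  tesseract "$image" "$outbase" --tessdata-dir "$tessdata" -l hye --oem 1
}

# Example call (paths are assumptions):
#   run_armenian_ocr page.png page_out ./tessdata_best
```

The legacy-only arm.traineddata posted earlier in the thread would instead need -l arm --oem 0.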
Is there any progress on this ticket?
Are there any updates?
@Shreeshrii, you opened this issue in 2017. I think you can close it now.