tesseract-ocr / langdata Goto Github PK

View Code? Open in Web Editor NEW

813.0 813.0 888.0 400.77 MB

Source training data for Tesseract for lots of languages

License: Apache License 2.0

langdata's People

Contributors

Stargazers

Watchers

Forkers

autotagamerica gylns teresasun521 markatithinkbest linghushaoxia hotice222 yze123 myrddin669 domen13 sigurross attozk banban minthanthtoo atuyosi chiahungtai pierfio kinter5201314 odiawikimedia srglxxf langdead lowmaster iamsee gyfrozen duguruiyuan lenbin pkdevbox anongithum huuquocha1990 nickjwhite magicsen jayantanth tungtt93 mars-rover huangzongwu nicecai ramazanaktolu adityavs zhaog hotelzululima brucezhang80 vigneshv59 baituhuangyu eric013 pipi1226 transformersprimeabcxyz asdlei00 ryanfb wzpsgit siddhartha161 tyronewo illusioniststory amani-lei rahulmod wjacker youmoula eisenhiemex knightmstr teguhsatria92 ahkscripter champ10 gabrielyun modulexcite defe989 sagartn sindney roswell53 erkin-m shirley0806 zettacristiano leobarone mks786 codeguesser vikramkumariiit devashishd12 mahadevbobhate jemmasters piyalmaduranga rspandah nidhiyanandh phoenix256 sohaibx zutuyu cj1988 7ac0b95 superxiaoqiang shahsankets auspost garyhui86 kenshido sarthakmittal neonaldo riverqh ianblenke coolthejackal kingwenchen songsofwindfish josempersichini tguichin xuxhtest kedroon

langdata's Issues

Add U+02BC to Devanagari.unicharset

Some languages of India make use of U+02BC “ ’ ” modifier letter apostrophe, either as a tone mark or as a length mark in their texts written in Devanagari script.

eg. ख’ल्ल
ित’लकना
दख’ना
खर’
कत’ पड़ा’ गेल’?

kan.unicharambigs seems to be copy of eng.unicharambigs

Reviving an old issue - see RaghavBhardwaj/tesseract-ocr#801

I am shocked to notice that kan.unicharambigs contains only English script
instead of kannada script . I had expected kan.unicharambigs will contain
kannada script only.

Suggest building of https://github.com/tesseract-ocr/langdata/blob/master/kan/kan.unicharambigs based on https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-801/comment-6/kan.DangAmbigs.txt and review by someone who knows kannada.

Pending Issues from googlecode are under Pull Requests

https://github.com/tesseract-ocr/langdata/pulls?q=is%3Aopen+is%3Apr

These should be moved to Open Issues and resolved.

Thanks!

Inuktitut Resources

Inuktitut Unicode Fonts

http://www.inuktitutcomputing.ca/IO/Fonts/info.php?lang=en
http://www.pirurvik.ca/en/productions/iu-computing/font-download

iku - iu - Inuktitut - http://crubadan.org/languages/iu

Ref: tesseract-ocr/tesseract#654 (comment)

Add Arabic-Indic numerals to Arabic

Please see tesseract-ocr/tesseract#858

include both 0-9 and ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩) for Arabic.

Neither deu/desired_characters nor deu/deu.training_text or deu/deu.wordlist include the paragraph character (§). That character is used very often especially in German legal texts, also in the form §§ (plural form, meaning paragraphs). Instead of the paragraph character, Tesseract typically detects a dollar sign (but also other confusions). The paragraph character is missing in other languages, too.

Also missing is the long S character (ſ) which was used in texts from the 18th century, not only in German but also in Latin, French and Spanish.

For Tesseract 4, Fraktur typefaces are currently unsupported. They were used often until 1945 (rarely used after that).

Langdata for 3.02

I am looking for langdata (Tel, Kan, Guj). How can I get the langdata for 3.02, currently repository is saying langdata is for 3.04

CC: @jimregan

khmer - not working with --oem 1

Ref: tesseract-ocr/tesseract#654 (comment)

Attached box/tiff pairs created using text2image / tesstrain.sh rendering proces.

Fonts used (installed in Windows)

KHMER_FONTS=( \
    "Leelawadee UI Bold" \
    "Leelawadee UI" \
    "Noto Sans Khmer" \
    "Noto Serif Khmer" \
    "Noto Sans Khmer Bold" \
    "Noto Serif Khmer Bold" \
    "Noto Sans Khmer UI Bold" \
    "Noto Sans Khmer UI" \
    )

command used:

training/tesstrain.sh --fonts_dir /mnt/c/Windows/Fonts --lang khm  \
  --linedata_only --noextract_font_properties \
   --langdata_dir ../langdata --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/khm

European 18th century texts

Many European texts from the 18th century use modern types with some special properties. OCR for such texts is currently only partially supported by Tesseract, notably by enm, frm, ita_old and spa_old (see wiki) which are the only models including the long s.

Support is missing for Latin texts (used very often at that time) or German texts, maybe others, too.

French 'ï' is recognized as 'i'

Originally reported by @jbarlow83 in the tesseract-dev forum.

Umlaut on the letter in French ï seems to be read as a regular i, e.g. the word "ovoïde".

Bigrams file not in sync with training text

@theraysmith

Ray, I am using a modified version of Sanskrit training text with Vedic accents, rupee sign etc.

The font properties generation in the processing by tesstrain.sh script uses the ngrams file which seems to be based on the bigrams file

Is the bigrams file generated based on the training text? If so, is there a utility to create bigrams from training text.

Will use of a new training text which is not in sync with bigrams file cause errors during training?

Please Tag for 3.04.00 release

Thanks, Zdenko.

training small fonts- Chinese

Hi, where can I go for information on how to train small Chinese fonts. Do I just make my PDF have much smaller fonts? My target is the 100% Chinese font in Chrome as seen in Google News.

Myanmar Resources

Ref: tesseract-ocr/tesseract#654 (comment)

Myanmar Unicode Fonts

http://www.myanmarlanguage.org/unicode/myanmar-fonts-which-follow-unicode-rules
http://www.mymfont.com/

mya - my - Myanmar - http://crubadan.org/languages/my

Add rupee sign to Devanagari.unicharset

₹ 0 60,69,203,215,84,106,1,24,97,129 Common 1260 4 1260 ₹ # ₹ [20b9 ]

The glyph metrics need to be updated.

MICR fonts

copied from #59 (comment)

@Shreeshrii 3. Traineddata for MICR

@theraysmith Beyond the scope of this change.

Tibetan Unicharset

Tibetan Unicharset is in a different format than the unicharsets for other script sets. It was uploaded as part of 3.04 langdata and seems to be be based on syllables rather than characters (similar to unicharsets that are generated during training are based on training_text).

http://unicode.org/charts/PDF/U0F00.pdf shows all the Tibetan characters in unicode but the Tibetan unicharset https://github.com/tesseract-ocr/langdata/blob/master/Tibetan.unicharset is based on combination of characters.

For other scripts eg. Devanagari, the unicharset and unicode chart follow similar format. This data was uploaded during the initial creation of repo and reflects 3.02.
http://unicode.org/charts/PDF/U0900.pdf
https://github.com/tesseract-ocr/langdata/blob/master/Devanagari.unicharset

@theraysmith Which is the correct format of unicharsets going forward? for 4.0.0-alpha?

If it is being limited to the character combinations as in Tibetan for 3.04, then it could cause a problem if a new training text is used which has some new syllable combinations.

About Uyghur(Uighur) langdata

Hi,
I am native Uyghur.
I found some error characters(not Uyghur characters in "uig.training_text"). I fixed the errors. please update.

I make uig.frequent_words_list file. Words sorted by frequency.
the "all_character_forms.txt" contains all uyghur chracters and their all forms. It can be added to uig.training_text file.

uig.zip
all_character_forms.txt

add vowel diacritics characters in Arabic charset

Hi,
Please refer to this discussion:
tesseract-ocr/tesseract#552
and I request you to add these characters to the Arabic charset (total 8chs, each one in a line):
ّ
َ
ً
ُ
ٌ
ِ
ٍ
ْ
these َ ً is similar to ِ ٍ but former is above letter and the latter used below the letter.

Thanks and Happy New Year

The lstmtraining program is a multi-purpose tool for training neural networks. The following table describes its command-line options:

Flag	Type	Default	Explanation
U	string	none	Path to the unicharset for the character set.

and also

Fine tuning is the process of training an existing model on new data without changing anything else like the character set or any part of the network. Doesn't need a unicharset, script_dir, or net_spec, as they all come from the existing model.

and

Fine tuning is OK if you don't want to change the character set, but what if you want to train for Klingon? You are unlikely to have much training data and it is unlike anything else, so what do you do? You can try removing some of the top layers of an existing network model, replace some of them with new randomized layers, and train with your data. The command-line is mostly the same as Training from scratch, as you have to supply a unicharset and net_spec, and you also have to provide a model to --continue_from and --append_index.

@theraysmith Ray, please clarify what is the character set referred to here? Thanks!

add « and » characters to French

the « and » characters are unrecognized in French

See

Request for deu-frak data

Enough said I imagine.

langdata/pol/pol.wordlist duplicated entries

Hello!

In https://raw.githubusercontent.com/tesseract-ocr/langdata/master/pol/pol.wordlist (05ec588 on 25 Jun 2015) is a great list of Polish words.

Somehow, though, even as I am Polish, using Polish keyboard and Polish Windows 8.1 with Polish fonts, I see "ﬆ" (or "?" in notepad2) in many lines - this is not a character you encounter at all in Polish language.

After a few seconds, I realised that every line with the mysterious ﬆ sign follows a line with "st" bigram - for which the ﬆ substitutes.

There is no "st" digraph (there are no strict bigraphs in Polish at all - they are always made with two separate letters, "rz" is just "r" and "z", "ch" is just "c" and "h", the same goes for "sz", "cz", "dz", "dż" and "dź") in Polish.

So, basically, the list has a duplicated entry for every word with "st" bigraph.
There are currently 658822 full lines + 1 newline in the raw file; after I made a quick regexp in notepad2 to remove the duplicates, I ended up with 608933 full lines + 1 newline - an 8% reduction in line count.

Now, if there is a legitimate reason for the duplicates with a non-existent characters (maybe it's easier to OCR with such redundancy? I don't think I know the topic well enough to even guess) then great, this issue is moot and invalid. But if there is no such reason, then the Polish wordlist can be automatically pruned.

Use a different set of fonts for Persian

(moved from tesseract-ocr/tesseract#294)

First of all, thanks for adding support to tesseract finally. From quickly inspecting Persian related codes on tesseract I reached to https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L520 which I can say speculatively is not a good set of fonts for training Persian printed text and can result in poor performance of OCR quality as most Persian fonts don't have the style these fonts have. On "Font recognition using Variogram fractal dimension", a good set of Persian fonts is introduced (second page, at the bottom) which as you can see there also, it is different from favorites Arabic language fonts (even the fact both are using Arabic script). So for training Persian OCR for tesseract I suggest adding or replacing current fonts with these free fonts, Nazli (i.e. Nazanin as indicated on that article) and Titr from Debian fonts-farsiweb package and also XB Zar and XB Yaghut from OFL licensed xfonts. Thank you.

Add vulgar fraction for 1/2

@theraysmith

Please see tesseract-ocr/tesseract#841 (comment)

Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desited_characters file:

https://cloud.githubusercontent.com/assets/1194896/25436113/477a23b6-2a60-11e7-967f-c4b97b21e3a9.png

I could not find any font which has 1/2 in this vertical format with straight line between 1 and 2.

Add Devanagari-extended and Vedic extensions to Devanagari.unicharset

See tesseract-ocr/tesseract#545 for details

at a minimum add support for U+0951, U+0952, U+A8F3, U+1CDA in Devanagari unicharset otherwise LSTM training/evaluation fails with messages such as

Can't encode transcription: शान्तिः॒ शान्तिः॑ ॥ ॐ पूर्ण॒मदः॒ पूर्ण॒मिदं॒ पूर्णा॒त्पूर्ण॒मुद॒च्यते । पूर्ण॒स्य
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffff82 ffffffe0 fffff
fa5 ffffff91 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffff
ffa4 ffffffaf ffffffe0 ffffffa5 ffffff91 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffff
a4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa4 ffffffad ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffff82 ffffffe0 ffffffa5
 ffffff91 20 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff92 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8c ffffffe0 ffffffa4 ffffffad ffffffe0 ffffffa5 ffffff91 ffffffe0 ffff
ffa4 ffffff97 ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa5 ffffff91 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffff
a4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 20 ffffffe0 ffffffa5 ffffffa5 20 ffffffe0 ffffffa5 ffffffac ffffffe0 ffffffa5 ffffffa5
Can't encode transcription: त॒नुवं॑ पि॒प्रय॑स्वा॒स्मभ्यं॑ च॒ सौभ॑ग॒माय॑जस्व ॥ ६॥
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff91 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffaf 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ff
ffffa5 ffffff83 ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffffb7 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa0 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff92 ffffffe0 ffff
ffa4 ffffffad ffffffe0 ffffffa4 ffffffbf 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffff82 ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffffb8 ffffffe0 fff
fffa4 ffffffbe ffffffe0 ffffffa5 ffffff91 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa5 ffffff92 20 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff88 ffffffe0 ff
ffffa4 ffffffb7 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa3 ffffffe0 ffffffa5 ffffff91 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff80 ffffffe0 ffffffa4 ffffff82 20 ffffffe0 f
fffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffff95 20 ffffffe0 ffffffa4 ffffff87 ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffffb9 20 ffffff
e0 ffffffa4 ffffffae ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa5 ffffff91 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff8d ffffffe0
 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff8d 20 ffffffe0 ffffffa5 ffffffa5 20 ffffffe0 ffffffa5 ffffffad ffffffe0 ffffffa5 ffffffa5
Can't encode transcription: नाक॑स्य पृ॒ष्ठम॒भि सं॒वसा॑नो॒ वैष्ण॑वीं लो॒क इ॒ह मा॑दयन्ताम् ॥ ७॥
At iteration 0, stage 0, Eval Char error rate=3.5082962, Word error rate=13.780345

Dzongkha Resources

dzo - dz - Dzongkha - http://crubadan.org/languages/dz

Ref: tesseract-ocr/tesseract#654 (comment)

Add superscripts , subscripts, symbols, more nukta text for Hindi etc

@theraysmith

Please add superscripts (0-9) and subscripts (0-9) in training text for Hindi, Sanskrit and other Indic languages.

Please also include symbols that looks similar to the following:

Langdata for per and fas

As per
http://www-01.sil.org/iso639-3/documentation.asp?id=per
and
http://www-01.sil.org/iso639-3/documentation.asp?id=fas
Persian and Farsi are Equivalent.

So are two separate sets of langdata required for these?

Superscripts & subscripts

Copied from 59:

@Shreeshrii commented

Just checking whether this new training will also address:

Correct handling of superscripts

@theraysmith commented

Correct handling of superscripts

Beyond the scope of this change.
Sub/superscript are much harder to deal with, as they have to be trained,
and that means incorporating them correctly into the training path, and how
to pass the information back out of the line recognizer to the output.
At the moment it seems the iterator supports discovery of sub/super, but
there is no output renderer that handles it. (Not even hocr?)

Question:
For which languages/scripts is is desirable to support sub/super?

Shreeshrii commented

Regarding superscripts/subscripts etc, I can point out three cases based on the languages I know.

a. English - books, thesis etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in Latin script. Usually this will be at end of words.

b. Tamil - Sanskrit texts transliterated in Tamil scripts use superscripts/subscripts 2,3,4 (sometimes 1 also) to distinguish between different sounds (to support sanskrit alphabet which does not have direct mapping in Tamil script). These can actually be in middle of Tamil words.

c. Hindi, Sanskrit and other Indian languages - Hindi books, thesis etc use superscripts for referring to footnotes (similar to English above). The difference is that in some cases these will be using the Latin alphabet 0-9 and in some cases using Devanagari digits (in case of Hindi, Sanskrit etc). Unicode has superscripts 0-9 for Latin script but not for Devanagari script. I would suggest support for the Latin script superscript numbers.

Scanned pages with devanagari superscripts should also be mapped to the Latin script superscript numbers. Similarly for other Indian languages.

@stweil commented

English - books, thesis etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in Latin script. Usually this will be at end of words.

At least it applies to German. There are also superscripts after punctuation characters at the end of sentences.

Should all superscripts be handled in the same way, or do we need a different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³.

Shreeshrii commented

See page 3 in http://sanskritdocuments.org/doc_ganesha/gaNanAyak8-ta.pdf for superscripts usage in Tamil.

Unicode has subscripted and superscripted versions of a number of characters including a full set of Arabic numerals.

The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.

Shreeshrii commented

Should all superscripts be handled in the same way, or do we need a different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³.

All superscripts have a special UTF-8 code, though in different ranges. Not all fonts have support for all superscripts and subscripts.

Arabic and Hebrew unicharsets seems to be corrupted

Arabic unicharset has character descriptions in two formats - seems to be corrupted. See the following for example:

؀ 0 92,92,148,148,152,152,5,5,161,161 Arabic 1 5 1 ؀	# ؀ [600 ]
؁ 0 67,67,135,135,306,306,1,1,312,312 Arabic 2 5 2 ؁	# ؁ [601 ]
؂ 0 93,93,141,141,152,152,5,5,163,163 Arabic 3 5 3 ؂	# ؂ [602 ]
؃ 0 93,93,155,155,171,171,5,5,182,182 Arabic 4 5 4 ؃	# ؃ [603 ]

؋ 0 36,36,195,195,54,54,5,5,64,64 Arabic 5 13 5 ؋	# ؋ [60b ]
؍ 10 106,106,148,148,47,47,7,7,61,61 Arabic 6 13 6 ؍	# ؍ [60d ]p

؎ 0 111,111,156,156,138,138,5,5,150,150 Arabic 7 10 7 ؎	# ؎ [60e ]
؏ 0 10,10,174,174,99,99,1,1,113,113 Arabic 8 10 8 ؏	# ؏ [60f ]
ؐ 0 213,213,251,251,47,47,35,35,0,0 Arabic 9 17 9 ؐ	# ؐ [610 ]
ؑ 0 209,209,244,244,47,47,31,31,0,0 Arabic 10 17 10 ؑ	# ؑ [611 ]
ؒ 0 213,213,255,255,68,68,26,26,0,0 Arabic 11 17 11 ؒ	# ؒ [612 ]
ؓ 0 213,213,255,255,66,66,26,26,0,0 Arabic 12 17 12 ؓ	# ؓ [613 ]
ؔ 0 227,240,255,255,87,98,0,1,0,0 Arabic 13 17 13 ؔ	# ؔ [614 ]
ؕ 0 228,248,255,255,50,78,8,49,0,179 Arabic 14 17 14 ؕ	# ؕ [615 ]

؞ 10 96,99,146,148,49,51,5,6,59,64 Arabic 15 13 15 ؞	# ؞ [61e ]p
ء 1 32,112,129,220,43,163,3,67,62,199 Arabic 16 13 16 ء	# ء [621 ]x
آ 1 26,117,230,255,36,161,0,58,33,198 Arabic 17 13 17 آ	# آ [622 ]x
أ 1 26,117,248,255,29,148,0,67,33,193 Arabic 18 13 18 أ	# أ [623 ]x
ؤ 1 0,68,190,255,70,290,0,27,62,266 Arabic 19 13 19 ؤ	# ؤ [624 ]x
إ 1 0,59,200,255,29,181,0,67,33,222 Arabic 20 13 20 إ	# إ [625 ]x
ئ 1 0,100,185,255,95,431,0,45,103,467 Arabic 21 13 21 ئ	# ئ [626 ]x
ا 1 26,117,200,255,11,181,7,82,33,222 Arabic 22 13 22 ا	# ا [627 ]x
ب 1 0,71,140,224,113,339,0,50,123,378 Arabic 23 13 23 ب	# ب [628 ]x
ة 1 55,123,190,255,40,181,0,60,48,222 Arabic 24 13 24 ة	# ة [629 ]x
ت 1 58,123,170,255,113,339,2,50,123,378 Arabic 25 13 25 ت	# ت [62a ]x
ث 1 58,121,192,255,113,339,2,50,123,378 Arabic 26 13 26 ث	# ث [62b ]x

Maldivian - Dhivehi - Thaana

https://en.wikipedia.org/wiki/Maldivian_writing_systems

https://www.ethnologue.com/language/div

http://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Thaa

http://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Qa61

Ref: https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L31

Suggest 'deva' for Devanagari

With LSTM training the dictionary dawg files have become optional. In light of this, I want to suggest an additional traineddata file for Devanagari script, which can cater to all main languages written in it.

The reason for suggesting this is, when I tested OCR on a Marathi text, a lot of words with rakaara were not recognised correctly. However, same page OCRed with Sanskrit recognised them correctly, but some others were incorrect.

So, in addition to the multiple traineddata for various languages written in Devaन

Seven Segment Display fonts

copied from #59 (comment)

@Shreeshrii 4. Traineddata for Seven Segment (or 14 segment) Display

@theraysmith Beyond the scope of this change.

cherokee resources

In response to tesseract-ocr/tesseract#654 (comment),

@theraysmith

An Crúbadán edited by Scannell, Kevin
is licensed under a Creative Commons Attribution 4.0 International License .

The zip files linked from the above pages have word lists as well as the list of URLs scrubbed from vast quantities of text freely available on the web used for building corpora for languages with small numbers of speakers and/or limited computational resources.

chr - Cherokee - http://crubadan.org/languages/chr

Cherokee Unicode Fonts

http://www.cherokee.org/AboutTheNation/Language/CherokeeFont.aspx
http://www.languagegeek.com/font/fontdownload.html

Because there are too many Chinese characters, how to quickly train the required data？

Would like to help for Burmese/Myanmar language training?

Hello,
I would like to help. I've already cloned all repository. How do I start?

German Fraktur

From tesseract-ocr/tesseract#40

@stweil commented

Are there also new data files planned for old German (deu_frak)? I was
surprised that the default English model with LSTM could recognize some
words.

@theraysmith commented

I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand. With English at least, the language was different in the days of Fraktur (Ye Olde shoppe). I know German continued to be written in Fraktur
until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Old Shoppe for English

stweil commented

Fraktur was used for an important German newspaper (Reichsanzeiger) until 1945. I'd like to try some pages from that newspaper with Tesseract LSTM. Surprisingly even with the English data Tesseract was able to recognize at least some words written in Fraktur.

There is an Old High German (similar to Old English), but the German translation of the New Testament by Martin Luther (1521) was one of the first major printed books in German, and basically it started the modern German language (High German) which is used until today.

@jbaiter commented

I have a decent amount of corpus data for Fraktur from scanned books at hand, about 500k lines in hOCR files (~50GB with TIF images). I've yet to publish it, but if you have somewhere where I could send/upload it, I'd be glad to.

theraysmith commented

The md file documents the training process in tutorial detail, but line boxes and transcriptions sounds perfect!

300k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first.
For now it might be best to hang on for the instructions.

jbaiter commented

The text is CC0 and the images are CC-BY-NC, so that shouldn't be an issue :-) They're going to be public anyway once I've prepped the dataset for publication.

Related:
tesseract-ocr/tessdata#49

Bihari training text not representative

Please see pull request #11

Language pack request: Accented Russian

https://code.google.com/p/tesseract-ocr/issues/detail?id=1352

Rephrased, this is a request for a special purpose language pack, for scanning dictionaries and pedagogical material.

Add support for Armenian

copied from: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/zn4Xd-8wKe8/B6VpQkuZAwAJ

Dear all,

I am trying tesseart recently and it is really a very good product. I would like to ask if there is any tutorial or steps about how we can add a new language support to the package? for example Armenian language.

Thank you in advance.

Regards,
Vahe

Correct handling of TM sign

Copied from 59

[reply to @Shreeshrii]

@theraysmith commented

TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output.

Vietnamese

Forwarding below some feedback re Vietnamese traineddata for 4.00.00

Vietnamese lang data for tess 4.00 seems to have better accuracy, but still sometimes mixes up between acute and hook above marks when they appear on top of circumflex mark (stack diacritics).

frk langdata needs to be changed to Frankish language

Please see tesseract-ocr/tessdata#49
and other discussion regarding German Frakatur.

frk langdata needs to be changed for Frankish language

deu_frak needs to setup under langdata for German texts using Fraktur/Blackletter fonts.

Æ missing from the Norwegian language training data

The Norwegian character Æ is missing from the language files for Norwegian. Performing OCR where the Æ is present results in either /E or AE.

Some samples:
Ærfuglveien 44 er adressen jeg bor på.
Min adresse er Ærfuglgaten 73.
Ærlighet varer lengst.
Ærfuglen er den største andearten i vårt land.
Ærekrenkelse er en handling som består i å krenke en annens æresfølelse, eller opptre på en måte som er egnet til å skade en annens gode navn og rykte eller til å utsette ham for hat, ringeakt eller tap av den for hans stilling eller næring fornødne tillit.
Æsene lå i kamp med en annen gudeslekt, vanene.
Ærgjerrighet har vært viktig for mange av oss og da vi var småjenter, skjønte vi at det er viktig å arbeide hardt og bli til noe.
Det var Æsene som var snille.