tesseract-ocr / langdata_lstm

Data used for LSTM model training

License: Apache License 2.0

langdata_lstm's Issues

Missed letter in the hye.traineddata

In hye.traineddata the letter և is not included. It is replaced by the letter ն. The two letters do look very similar, but they do not have the same meaning. I found that the old arm.traineddata does not have this problem.

error related to script data during training

02cc8f0
moved all script related data to script subfolder.

This leads to errors and warnings during training, e.g.:

Wrote unicharset file /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset
[Wed Jun 19 18:46:20 UTC 2019] /usr/local/bin/set_unicharset_properties -U /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset -O /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset -X /tmp/chi_sim-2019-06-19.fKD/chi_sim.xheights --script_dir=/home/ubuntu/langdata_lstm
Loaded unicharset of size 5090 from file /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Latin.unicharset
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Han.unicharset
Warning: properties incomplete for index 3 = “
Warning: properties incomplete for index 4 = 《
Warning: properties incomplete for index 5 = 副

I do not know how important these properties are for LSTM and Legacy tesseract training.

@stweil What do you suggest to do in this case?

Missing support for Coptic script

Training of Tesseract with tesstrain and a text containing ϯ creates a unicharset file which includes this line:

ϯ 3 0,255,0,255,0,0,0,0,0,0 Coptic 273 0 273 ϯ	# ϯ [3ef ]a

lstmtrain complains about a missing file:

Failed to load script unicharset from:data/Coptic.unicharset
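
A rough way to see in advance which script files lstmtrain will look for is to collect the script names from the unicharset and check them against the script directory. The sketch below is only an illustration: it assumes the script name is the fourth whitespace-separated field, as in the Coptic line quoted above, and that the files are named `<Script>.unicharset`, as in the error message. The function names are hypothetical.

```python
from pathlib import Path


def scripts_in_unicharset(lines):
    """Collect script names from unicharset entry lines.

    Assumed layout (taken from the quoted line): char, properties,
    comma-separated metrics, script, then further fields.
    """
    scripts = set()
    for line in lines:
        fields = line.split()
        # Skip the size header and short special entries such as "NULL 0 Common 0".
        if len(fields) >= 8 and "," in fields[2]:
            scripts.add(fields[3])
    return scripts


def missing_script_files(lines, script_dir):
    """Return script names that have no <Script>.unicharset in script_dir."""
    script_dir = Path(script_dir)
    return sorted(
        name
        for name in scripts_in_unicharset(lines)
        if name != "NULL" and not (script_dir / f"{name}.unicharset").exists()
    )
```

Running `missing_script_files(open("my.unicharset", encoding="utf-8"), "data")` on the unicharset from this report would be expected to list Coptic, matching the lstmtrain message.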

Wordlists and training texts contain lots of errors

A short test with codespell (which only finds the most common typos for English) found more than 1000 errors in eng.wordlist.

The German wordlist deu.wordlist contains the well-known B / ß confusion as well as other errors.

The training texts also contain similar errors. In addition, I noticed many foreign (Turkish?) words in the German text.

Are such errors critical for the trained model which is based on that data?
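
One way to get a first impression of the B / ß confusion is a simple heuristic: a capital B directly after a lowercase letter is unusual in German words. The sketch below is only that heuristic, not a spell checker; the function name is hypothetical, and legitimate mixed-case names would be flagged too.

```python
import re

# A capital "B" right after a lowercase letter, as in "StraBe" or "groB".
SUSPECT_B = re.compile(r"[a-zäöü]B")


def suspect_eszett_words(words):
    """Return the words that look like they contain B where ß was meant."""
    return [word for word in words if SUSPECT_B.search(word)]
```

Applied to the lines of deu.wordlist (one word per line, stripped), this gives a candidate list for manual review rather than an automatic fix.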

θ in Greek book font rendered as swash form

OCR result: ϑεοὶ γὰρ οὔποτ᾽,

This is an ordinary book font used by editions of classical texts. Because of the design of its theta, however, this letter is frequently OCR’ed as the swash form and requires manual correction, as it stands out from the rest of the text when rendered in other (esp. sans) Greek fonts.
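
Until the model itself is changed, the manual correction can be automated as a post-processing step. A minimal sketch, assuming the swash form in the OCR output is U+03D1 (GREEK THETA SYMBOL) and the wanted letter is U+03B8 (GREEK SMALL LETTER THETA); the function name is hypothetical.

```python
# Map the theta symbol (U+03D1) to the ordinary small theta (U+03B8).
THETA_FIX = str.maketrans({"\u03d1": "\u03b8"})


def normalize_theta(text):
    """Replace every swash theta in OCR output with the plain letter."""
    return text.translate(THETA_FIX)
```

This is only appropriate for texts where the theta symbol is never intended as such, for example running Greek prose rather than mathematical notation.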

tesseract 4.00: training_text iteration failed to respond at page 3402

Training data is created using tesstrain.sh:
./src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "WenQuanYi Zen Hei Medium" \
  --output_dir ~/tesstutorial/trainplusminus
The training_text iteration failed to respond at page 3402.
Log: Rendered page 3402 to file /tmp/tmp.UtQvClRYtq/chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.tif
Version: 4.0.0

Armenian letter և missing in hye language - confirmation

I checked the Armenian hye.training_text and hye.wordlist and I confirm that the և sign is not included in those files. The և sign is replaced by the եւ sign. Both have the same meaning, but in Armenian a printed document will use the և sign and not the եւ sign.
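
A check like this can be repeated for any language file with a few lines of Python. The sketch below assumes the files are UTF-8 text; the function names are hypothetical. Note that և (U+0587) is a single code point, while եւ is the two-letter sequence U+0565 U+0582, so a plain membership test distinguishes them.

```python
from pathlib import Path


def missing_characters(text, wanted):
    """Return the characters from `wanted` that never occur in `text`."""
    present = set(text)
    return [ch for ch in wanted if ch not in present]


def missing_in_file(path, wanted, encoding="utf-8"):
    """Same check, reading the text from a langdata file."""
    return missing_characters(Path(path).read_text(encoding=encoding), wanted)
```

For example, `missing_in_file("hye/hye.training_text", "\u0587")` would return a one-element list if the ligature is absent from that file.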

Adding additional language Denjongke (sikkimese bhutia) to tesseract language dataset

We are currently trying to add the Sikkimese Bhutia language to the OCR engine. The letters and words are similar to those of Dzongkha (dzo), which is already present in the current dataset. However, there are additional letters and words which are not included in the Dzongkha dataset. How can we contribute?

Is it possible to add few pre-1918 Russian characters to RUS language files?

In 1917--1918, the Russian language was reformed in many ways including but not limited to the banning of four letters: I-decimal (now known as "Byelorussian-Ukrainian I"), Yat, Fita, and Izhitsa. The necessity to OCR the texts published in Russia from 1708 through 1918 (and somewhat later) is widely recognised among scholars but they are largely unfamiliar with the ways tesseract can be trained to recognise these missing characters (and, I have to confess, the vast majority of ordinary people will be absolutely unable to train tesseract even if they read the instructions [ https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ]). See also: https://en.wikipedia.org/wiki/Russian_alphabet#Letters_eliminated_in_1918

Is there a possibility to include in the desired characters list for Russian ( langdata_lstm/rus/desired_characters ) the following glyphs:

§ : Section sign ; Unicode number: U+00A7

І : Cyrillic Capital Letter Byelorussian-Ukrainian I ; Unicode number: U+0406
і : Cyrillic Small Letter Byelorussian-Ukrainian I ; Unicode number: U+0456
Ѣ : Cyrillic Capital Letter Yat ; Unicode number: U+0462
ѣ : Cyrillic Small Letter Yat ; Unicode number: U+0463
Ѳ : Cyrillic Capital Letter Fita ; Unicode number: U+0472
ѳ : Cyrillic Small Letter Fita ; Unicode number: U+0473
Ѵ : Cyrillic Capital Letter Izhitsa ; Unicode number: U+0474
ѵ : Cyrillic Small Letter Izhitsa ; Unicode number: U+0475
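
The code points in this list can be cross-checked against the Unicode character database that ships with Python. This is only a verification sketch; the list name and helper are hypothetical.

```python
import unicodedata

# Code points requested in this issue for rus/desired_characters.
REQUESTED = [0x00A7, 0x0406, 0x0456, 0x0462, 0x0463, 0x0472, 0x0473, 0x0474, 0x0475]


def describe(code_points):
    """Return (U+XXXX, character, official Unicode name) for each code point."""
    return [
        (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp)))
        for cp in code_points
    ]


for row in describe(REQUESTED):
    print(*row)
```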

What else should be provided to add these few characters? A list of words containing these letters? How long should that list be? I am currently working on a project that processes many geographic names in pre-1918 Russian (and some other texts), so I can provide at least a word list of considerable length. For now, I have to OCR the pre-1918 text as post-1918 text and insert the four missing characters manually (mostly two of them, as Fita and especially Izhitsa were much less frequent).

Or would this rather require a much larger effort, such as creating a special rus-old model?

Training data should include bullet-like characters

Modern texts, especially business documents, contain bullet-like symbols, e.g. for lists. The middle dot is also used with some frequency. While the recognition results for eng and deu are nearly perfect for ordinary text, the results for these symbols are "random".

For the next release of trained models, the training data should be improved in this direction, and perhaps for other symbols as well.

Test image:

Tesseract result with -l eng:

List of vehicles:
* Trucks
* vans
* bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrrader

Result with -l deu:

List of vehicles:
« Trucks
« vans
+ bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrräder

How to train with these files to get .traineddata

Hi.
I want to train a new font and character images for the fin language. I want to train characters with noise and at an angle.
How can I use these files:
desired_characters
fin.numbers
fin.punc
fin.singles_text
fin.training_text
fin.unicharambigs
fin.unicharset
fin.wordlist
okfonts.txt

to get .traineddata files like those in tessdata_best?
Should I use tesstrain ( https://github.com/tesseract-ocr/tesstrain ), or use text2image to create box files and then train ( https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html )?

Alternative way to download langdata_lstm master file instead from github

I'm trying to download langdata_lstm from a work laptop. However, I couldn't download it from GitHub because of a firewall block that I have no control over. Is there another site I can download it from?

Thank you

Danish traineddata file doesn't include the "@" character

Environment

Tesseract Version: 5.00 Downloaded from: https://github.com/UB-Mannheim/tesseract/wiki
Platform: Windows 10 64bit

Current Behavior: Danish traineddata file doesn't include the "@" character

Expected Behavior: Danish traineddata file should include the "@" character

Suggested Fix: Danish traineddata file should include the "@" character

File to run OCR on:

To reproduce, I have a zip file that I can send so you can run a very basic test comparing the results of the eng and dan traineddata. Whoever looks into the issue, please contact me to receive it.

This is quite a pressing issue, so any response is appreciated.

Support for New Reiwa Era Character ㋿ in Japanese

With the new Japanese Reiwa Era, there's a new character introduced ㋿ (U+32FF). Support for this character is required.

Current Behavior: Other Characters are being identified 砒後徘朔御菓
Expected Behavior: ㋿ should be identified for the given input image
Suggested Fix: Train and Update the current jpn.traineddata file with the new jpn character.

Reference:
Wiki Page

Attached:
The input file I used.
The character in 6 different fonts for training.
Reiwa.docx

Duplicate fonts names in okfonts

for example in eng:

Arial Regular
Arial Regular
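
Duplicates like this are easy to list mechanically. A minimal sketch, assuming okfonts.txt holds one font name per line as in the example above; the function name is hypothetical.

```python
from collections import Counter


def duplicate_lines(lines):
    """Return {font name: count} for every name listed more than once."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return {name: n for name, n in counts.items() if n > 1}
```

With an open file, `duplicate_lines(open("eng/okfonts.txt", encoding="utf-8"))` would report each repeated name together with how often it occurs.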

Please use more fonts for training Uyghur

Please consider tesseract-ocr/langdata#149 (comment)

Missing some Thai numbers in Thai language (tha)

I found that some Thai numbers are missing.
The missing numbers are ๔, ๖, ๗, ๘ and ๙.
The missing numbers don't exist in tha.training_text and tha.unicharset files.

I am not sure how to add the missing numbers to the model without training it from scratch. When I try to combine the fine-tuned model with the old model, the unicharset size does not match that of the new model. I also tried the --old_traineddata parameter, but it did not work.

Thank you.

Add support for Shan language (shn)

Could someone help me to add the Shan language in tesseract?

Shan language = https://en.wikipedia.org/wiki/Shan_language
Language code = shn
Shan Wiki = https://shn.wikipedia.org
All Shan words (including IPA) = jsonfile
Websites that are using Shan scripts = https://shannews.org/ , http://shanunicode.com/
Font = https://saosu-mp.github.io/font/PangLong/PangLong.ttf
Shan syllable break = https://github.com/kwarm/syllable-break

Some Shan characters such as င သ တ ထ ပ မ ယ ရ လ ဝ ႉ း ွ ု ူ ိ ီ ် ၊ ။ are similar to Myanmar (Burmese).

Thanks in advance

English traineddata file does not contain the '±' character?

Environment
Tesseract Version: 5.00 Downloaded from: https://github.com/UB-Mannheim/tesseract/wiki
Platform: Windows 10 64bit

I am trying to OCR using the English dictionary file found:
https://tesseract-ocr.github.io/tessdoc/Data-Files
I notice the character is not included here either:
https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset

Are there any plans to add it? Are there any language files that successfully OCR this character?

Many thanks to whoever can assist here. I am attaching the file I used to test this behavior for this character here: (https://github.com/tesseract-ocr/langdata_lstm/files/9870674/Special.Symbols.pdf)

Tesseract fails to detect letters Å and å in Finnish language.

I am testing Tesseract on Finnish texts containing the "Swedish o" (å). It does not seem to detect Å and å correctly. I have also tried the fin+swe model, but usually the fin model's version of the text is selected.

Are the previous training files available somewhere? Probably the training data does not contain enough Å/å cases, or the letter is not included at all even though it is an official letter.

Armenian.traineddata contains the missing character, so I suggest to try that model.

Armenian.traineddata contains the missing character, so I suggest to try that model.

Originally posted by @stweil in #49 (comment)

Should we update swe.training_text if new characters are added to desired_characters?

Recently I made a pull request to update the Swedish desired_characters file with new characters.
Now I see that swe.training_text does not contain all of the newly added desired_characters.
Do we have to update swe.training_text and add these new desired_characters in order for tesseract to recognize them?

grc letters with dot below

This is relevant specifically to grc. Because modern editions of Ancient Greek often have to mark uncertain letters in ancient sources, letters with dot below are a common occurrence, but they are at present not recognised by tesseract.

A fairly complete list of letters with dot below (except for the lunate sigma ϲ̣) can be found here: https://titus.uni-frankfurt.de/unicode/unicsel/grkkadd.htm

I wonder if recognising dot below shouldn’t be a feature behind a flag to be manually turned on because it might also pick up stains in older books (which however tend not to have such dots & so don’t require this feature). But this could make it difficult to deploy the feature in downstream projects like Internet Archive.
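
Even without such a flag in the recogniser, a downstream project could switch the feature off after the fact by stripping the dots from the output. The sketch below assumes the dot is encoded as U+0323 (COMBINING DOT BELOW); the function name is hypothetical. It decomposes the text, drops the combining dot, and recomposes, so other diacritics survive.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW


def strip_dot_below(text):
    """Remove every combining dot below while keeping other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    without_dots = decomposed.replace(DOT_BELOW, "")
    return unicodedata.normalize("NFC", without_dots)
```

One side effect to be aware of: the final NFC step also normalises canonically equivalent characters elsewhere in the text, so the output may differ from the input in code points even where no dot was present.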

No fas.unicharset and fas.xheights files for Persian language

There are no fas.xheights and fas.unicharset files for the Persian language. Without these data, how can we train tesseract with LSTM for Persian? Could you please add them, or explain how we can create them?

Trailing spaces on line 27 of eng.punc

I've not yet worked out whether eng.punc is used by the LSTM mode of tesseract, but I discovered that there are two trailing spaces on line 27 of this file, which might cause the occasional problem.
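
Trailing whitespace of this kind can be located across any of the langdata text files with a short script. A minimal sketch; the function name is hypothetical, and only spaces and tabs are treated as trailing whitespace.

```python
def lines_with_trailing_whitespace(text):
    """Return the 1-based numbers of lines that end in a space or tab."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if line != line.rstrip(" \t"):
            hits.append(number)
    return hits
```

For the file in this report, the call would be `lines_with_trailing_whitespace(open("eng/eng.punc", encoding="utf-8").read())`.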

Slight modification in Bodhi for incorporating a few unique characters in Drenjongke

All the training data present in Bodhi and Dzongkha applies very well to Drenjongke, with the exception of the two points below:
1) Since the size of our corpus is not large, we could have typed all the data, but we opted for using the OCR method instead. Testing the OCR method was beneficial because we found that the OCR-ed texts contained errors due to the “tsha-lag” ◌༹ marker, which is used to mark the pronunciation of [bj] in Drenjongke. The use of this marker is unique to Drenjongke because Tibetan (Bodhi) does not have the sound [bj].
2) For tokenization, space was set as a delimiter. Drenjongke script is marked by a syllable marker called “tsheg” ་, and has a space between potential morpheme or word boundaries. The use of space in the orthography is specific to Drenjongke, as other Tibetan languages do not utilize spacing in a sentence.

Since these two are minor issues, we decided not to train Drenjongke entirely from scratch, but instead to first add the character required to solve problem 1.
Also, in the previous issue, amitdo (https://github.com/amitdo) mentioned that the training should be done on our side. Since our expertise lies in language and not in programming, what we understood is that we should clone the entire tesseract-ocr repo locally, make the changes, and then train it. Or is it done some other way? Any help would be highly appreciated.

Please add description for repo - Suggested Text:

Source LSTM training data for Tesseract (4.0.0alpha) for lots of languages

Missing GREEK LUNATE SIGMA SYMBOL in grc and script/Greek models

Current Behavior

A lunate sigma (ϲ, U+03F2) is recognised under language ‘grc’ but is being output as a normal sigma (σς).

Expected Behavior

Outputting it as U+03F2.

tesseract -v

5.3.0-6-g76ae

Arabic training text is only 80 lines

The training text in langdata_lstm/ara is only 80 lines or so.

Apparently Lao\Lao.unicharset Has Uncommitted Changes

I've just installed Git Desktop on Win10 and started cloning Tesseract-ocr. When the process finished the desktop showed that the above-referenced file had been changed, sh...owed the changed lines

-163
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
85
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
...
and many, many more

Further, the desktop offered me the opportunity to Commit the changes.

It appears that lines 1-4 have been changed, and many lines have been added; but since it's not my code, far be it from me to actually commit the changes.

May I ask for any recommendations?
TIA

wrong default mapping of some Romanian diacritics

Environment

Debian Linux

Tesseract Version: tesseract 4.00.00alpha
Platform: Linux 4.15.0 SMP PREEMPT 2018 x86_64 GNU/Linux

Current Behavior:

With the ron (Romanian) language option, the Romanian diacritics șȘțȚ are mapped to the wrong Unicode code points, namely:
Ș -> Ş=U+015E
ș -> ş=U+015F
Ț -> Ţ=U+0162
ț -> ţ=U+0163

Expected Behavior:

Ș -> Ș=U+0218
ș -> ș=U+0219
Ț -> Ț=U+021A
ț -> ț=U+021B

Suggested Fix:

Edit the map accordingly.
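
Until the model is corrected, the mapping can be applied to the OCR output as a workaround. The sketch below uses exactly the four code point pairs listed in this report; the function name is hypothetical.

```python
# Cedilla forms (wrong for Romanian) -> comma-below forms (expected).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # Ş -> Ș
    "\u015f": "\u0219",  # ş -> ș
    "\u0162": "\u021a",  # Ţ -> Ț
    "\u0163": "\u021b",  # ţ -> ț
})


def fix_romanian_diacritics(text):
    """Replace cedilla S/T with the comma-below letters used in Romanian."""
    return text.translate(CEDILLA_TO_COMMA)
```

This should only be applied to Romanian text, since the cedilla forms of S are the correct letters in other languages such as Turkish.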

Missing many special characters in desired_characters file (Swedish)

The desired_characters file does not contain many important special characters, such as "@".
All special characters used in English are also important for the Swedish language.
Legal documents contain the section sign §. Please add this as well.
