langdata_lstm's People

Contributors

aslamy, furtifk, jbreiden2, poizan42, shreeshrii, stweil, timilehin, wincentbalin, zdenop

langdata_lstm's Issues

Missed letter in the hye.traineddata

The letter և is not included in hye.traineddata; it is replaced by the letter ն. The two letters look very similar, but they do not have the same meaning. The old arm.traineddata does not have this problem.

error related to script data during training

Commit 02cc8f0 moved all script-related data to the script subfolder.

This leads to errors and warnings during training, e.g.:

Wrote unicharset file /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset
[Wed Jun 19 18:46:20 UTC 2019] /usr/local/bin/set_unicharset_properties -U /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset -O /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset -X /tmp/chi_sim-2019-06-19.fKD/chi_sim.xheights --script_dir=/home/ubuntu/langdata_lstm
Loaded unicharset of size 5090 from file /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Latin.unicharset
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Han.unicharset
Warning: properties incomplete for index 3 = “
Warning: properties incomplete for index 4 = 《
Warning: properties incomplete for index 5 = 副

I do not know how important these properties are for LSTM and Legacy tesseract training.

@stweil What do you suggest to do in this case?
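
To see where the script unicharset files actually sit in a given checkout, a small check like the following could help. It is only a sketch: `find_script_unicharsets` is a hypothetical helper, not part of the tesseract training tools, and the only assumption about the layout is that the files are named `<Script>.unicharset`, as in the log above.

```python
from pathlib import Path


def find_script_unicharsets(langdata_dir, scripts):
    """Map each script name to the <Script>.unicharset files found under langdata_dir."""
    root = Path(langdata_dir)
    found = {}
    for script in scripts:
        # Search the whole tree, because the files may be at the top level
        # or inside a subfolder, depending on the checkout.
        found[script] = sorted(root.rglob(f"{script}.unicharset"))
    return found
```

Calling `find_script_unicharsets("/home/ubuntu/langdata_lstm", ["Latin", "Han"])` returns an empty list for any script whose file is missing; a non-empty list shows which directory would have to be passed as `--script_dir`.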

Missing support for Coptic script

Training of Tesseract with tesstrain and a text containing ϯ creates a unicharset file which includes this line:

ϯ 3 0,255,0,255,0,0,0,0,0,0 Coptic 273 0 273 ϯ	# ϯ [3ef ]a

lstmtrain complains about a missing file:

Failed to load script unicharset from:data/Coptic.unicharset
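
To list which scripts a generated unicharset refers to, the script column can be pulled out of each entry. The following is a heuristic sketch based only on the sample lines quoted on this page, not on a specification of the file format; `scripts_in_unicharset` is a hypothetical helper.

```python
def scripts_in_unicharset(lines):
    """Collect script names from unicharset entry lines (heuristic)."""
    scripts = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and "," in fields[2]:
            # Long form: symbol, properties, metrics, script, ...
            script = fields[3]
        elif len(fields) == 4:
            # Short form such as "NULL 0 Common 0"
            script = fields[2]
        else:
            # Header line with the entry count, or an unrecognised line
            continue
        if script != "NULL":
            scripts.add(script)
    return sorted(scripts)
```

For the line quoted above, the result is `["Coptic"]`, which matches the `Coptic.unicharset` file name in the error message.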

Wordlists and training texts contain lots of errors

A short test with codespell (which only finds the most common typos for English) found more than 1000 errors in eng.wordlist.

The German wordlist deu.wordlist contains the well-known B / ß confusion and also other errors.

The training texts also contain similar errors. In addition, I noticed many foreign (Turkish?) words in the German text.

Are such errors critical for the trained model which is based on that data?

θ in Greek book font rendered as swash form

IMG_1991

OCR result: ϑεοὶ γὰρ οὔποτ᾽,

This is an ordinary book font used by editions of classical texts. Because of the design of its theta, however, this letter is frequently OCR’ed as a swash form and requires manual correction, as it stands out from the rest of the text when rendered in other (esp. sans) Greek fonts.
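
Until the model itself is changed, the manual correction could be scripted as a post-processing step. This is a workaround sketch, not a fix in tesseract: it assumes the swash form in the OCR output is U+03D1 (GREEK THETA SYMBOL) and that every occurrence should become U+03B8 (GREEK SMALL LETTER THETA). `normalize_theta` is a hypothetical helper name.

```python
# Map the theta symbol (U+03D1) to the ordinary small theta (U+03B8).
SWASH_TO_PLAIN = str.maketrans({"\u03d1": "\u03b8"})


def normalize_theta(text):
    """Replace every U+03D1 in OCR output with U+03B8."""
    return text.translate(SWASH_TO_PLAIN)
```

The replacement is unconditional, so it would also alter a ϑ that was really printed as a symbol, for example in a mathematical text.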

tesseract 4.00: training_text iteration stops responding at page 3402

Training data is created using tesstrain.sh:
./src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "WenQuanYi Zen Hei Medium" \
  --output_dir ~/tesstutorial/trainplusminus
The training_text iteration stops responding at page 3402.
Log: Rendered page 3402 to file /tmp/tmp.UtQvClRYtq/chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.tif
Version: 4.0.0

Armenian letter և missing in hye language - confirmation

I checked the Armenian hye.training_text and hye.wordlist, and I confirm that the և sign is not included in those files. The և sign is replaced by the եւ sign. Both have the same meaning, but in Armenian a printed document will use the և sign and not the եւ sign.
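
The two spellings can be counted in a text with a few lines of Python. The sketch below assumes that և is U+0587 and that եւ is the sequence U+0565 U+0582; `count_forms` is a hypothetical helper. The issue does not say why the ligature is absent. The last function only illustrates that Unicode NFKC normalization turns the ligature into the two-letter sequence, which may or may not be related.

```python
import unicodedata

LIGATURE = "\u0587"        # ARMENIAN SMALL LIGATURE ECH YIWN
SEQUENCE = "\u0565\u0582"  # the two-letter spelling


def count_forms(text):
    """Count how often each spelling occurs in text."""
    return {"ligature": text.count(LIGATURE), "sequence": text.count(SEQUENCE)}


def nfkc_splits_ligature():
    """True if NFKC normalization decomposes the ligature into the sequence."""
    return unicodedata.normalize("NFKC", LIGATURE) == SEQUENCE
```

Running `count_forms` on the contents of hye.training_text would show whether only the two-letter spelling is present.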

Is it possible to add few pre-1918 Russian characters to RUS language files?

In 1917-1918, the Russian language was reformed in many ways, including but not limited to the banning of four letters: I-decimal (now known as "Byelorussian-Ukrainian I"), Yat, Fita, and Izhitsa. Scholars widely recognise the need to OCR texts published in Russia from 1708 through 1918 (and somewhat later), but they are largely unfamiliar with how tesseract can be trained to recognise these missing characters. I have to confess that the vast majority of ordinary people will be unable to train tesseract even if they read the instructions [ https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ]. See also: https://en.wikipedia.org/wiki/Russian_alphabet#Letters_eliminated_in_1918

Is there a possibility to include in the desired characters list for Russian ( langdata_lstm/rus/desired_characters ) the following glyphs:

§ : Section sign ; Unicode number: U+00A7

І : Cyrillic Capital Letter Byelorussian-Ukrainian I ; Unicode number: U+0406
і : Cyrillic Small Letter Byelorussian-Ukrainian I ; Unicode number: U+0456
Ѣ : Cyrillic Capital Letter Yat ; Unicode number: U+0462
ѣ : Cyrillic Small Letter Yat ; Unicode number: U+0463
Ѳ : Cyrillic Capital Letter Fita ; Unicode number: U+0472
ѳ : Cyrillic Small Letter Fita ; Unicode number: U+0473
Ѵ : Cyrillic Capital Letter Izhitsa ; Unicode number: U+0474
ѵ : Cyrillic Small Letter Izhitsa ; Unicode number: U+0475
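
The code points in this list can be cross-checked against the Unicode character database that ships with Python. A minimal sketch, with `describe` as a hypothetical helper name:

```python
import unicodedata

# The nine requested characters, written as escapes to avoid look-alike mix-ups.
REQUESTED = "\u00a7\u0406\u0456\u0462\u0463\u0472\u0473\u0474\u0475"


def describe(chars):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars]


if __name__ == "__main__":
    for ch, code_point, name in describe(REQUESTED):
        print(ch, code_point, name)
```

Escapes are used on purpose: the Cyrillic І (U+0406) is visually identical to the Latin I, so a pasted list can silently contain the wrong character.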

What else should be provided to add these few characters? A list of words containing these letters? How long should that list be? I am currently working on a project which processes lots of geographic names in pre-1918 Russian (and some other texts), so I can provide at least a list of words of considerable length. For now, I have to OCR the pre-1918 text as if it were post-1918 and insert the four missing characters manually (mostly two of them, as Fita and especially Izhitsa were much less frequent).

Or would this require a much larger effort, such as creating a special rus-old model?

Training data should include bullet-like characters

Modern texts, especially business documents, contain bullet-like symbols, e.g. for lists. The middle dot is also used with some frequency. While the recognition results for eng and deu are otherwise nearly perfect, the results for these symbols are "random".

For the next release of trained models, the training data should be improved to cover these symbols, and perhaps other symbols as well.

Test image:

bullets

Tesseract result with -l eng:

List of vehicles:
* Trucks
* vans
* bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrrader

Result with -l deu:

List of vehicles:
« Trucks
« vans
+ bicycles
Liste von Fahrzeugen:
e Lastwagen
e Transporter
e Fahrräder

How to train with these files to get .traineddata

Hi.
I want to train a new font and character images for the fin language. I want to train characters with noise and rotation.
How can I use these files:
desired_characters
fin.numbers
fin.punc
fin.singles_text
fin.training_text
fin.unicharambigs
fin.unicharset
fin.wordlist
okfonts.txt

to get .traineddata files like those in tessdata_best?
Should I use tesstrain ( https://github.com/tesseract-ocr/tesstrain ), or use text2image, create box files, and then train ( https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html )?

Danish traineddata file doesn't include the "@" character

Environment

Current Behavior: Danish traineddata file doesn't include the "@" character

Expected Behavior: Danish traineddata file should include the "@" character

Suggested Fix: Danish traineddata file should include the "@" character

File to run OCR on:
Screenshot_572

To reproduce the problem, I have a zip file I can send so that you can run a VERY basic test which compares the results of the eng and dan traineddata files. Whoever looks into the issue, please contact me to receive it.

This is quite a pressing issue, so any response is appreciated.

Support for New Reiwa Era Character ㋿ in Japanese

With the new Japanese Reiwa Era, there's a new character introduced ㋿ (U+32FF). Support for this character is required.

Current Behavior: Other characters are identified instead: 砒後徘朔御菓
Expected Behavior: ㋿ should be identified for the given input image
Suggested Fix: Train and Update the current jpn.traineddata file with the new jpn character.

Reference:
Wiki Page

Attached:
The input file I used.
The character in 6 different fonts for training.
Reiwa.docx
Reiwa

Missing some Thai numbers in Thai language (tha)

I found that some Thai numbers are missing.
The missing numbers are ๔, ๖, ๗, ๘ and ๙.
The missing numbers don't exist in tha.training_text and tha.unicharset files.
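
A check of this kind can be reproduced with a short script. The sketch below assumes the Thai digits are the ten code points U+0E50 to U+0E59; `missing_chars` is a hypothetical helper that works on any text, for example the contents of tha.training_text read as UTF-8.

```python
# Thai digits zero to nine, U+0E50 .. U+0E59
THAI_DIGITS = [chr(cp) for cp in range(0x0E50, 0x0E5A)]


def missing_chars(text, wanted):
    """Return the characters from wanted that never occur in text, in order."""
    present = set(text)
    return [ch for ch in wanted if ch not in present]
```

Usage could look like `missing_chars(open("tha.training_text", encoding="utf-8").read(), THAI_DIGITS)`, where the file path is whatever the local checkout uses.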

I am not sure how to add the missing numbers to the model without training it from scratch. When I try to combine the fine-tuned model with the old model, the size of its unicharset does not match that of the new model. I also tried the --old_traineddata parameter, but it did not work.

Thank you.

Add support for Shan language (shn)

Could someone help me add the Shan language to tesseract?

Shan language = https://en.wikipedia.org/wiki/Shan_language
Language code = shn
Shan Wiki = https://shn.wikipedia.org
All Shan words (including IPA) = jsonfile
Websites that are using Shan scripts = https://shannews.org/ , http://shanunicode.com/
Font = https://saosu-mp.github.io/font/PangLong/PangLong.ttf
Shan syllable break = https://github.com/kwarm/syllable-break

Some Shan characters such as င သ တ ထ ပ မ ယ ရ လ ဝ ႉ း ွ ု ူ ိ ီ ် ၊ ။ are similar to Myanmar (Burmese).

Thanks in advance

English traineddata file does not contain the '±' character?

Environment
Tesseract Version: 5.00 Downloaded from: https://github.com/UB-Mannheim/tesseract/wiki
Platform: Windows 10 64bit

I am trying to OCR using the English dictionary file found at:
https://tesseract-ocr.github.io/tessdoc/Data-Files
I notice the character is not included here either:
https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset

Are there any plans to add it? Are there any language files that successfully OCR this character?
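
Whether a symbol is listed in a unicharset file can be checked by reading the first field of each entry. This is a heuristic sketch based on the unicharset lines quoted on this page, not on a format specification; `unicharset_symbols` and `has_symbol` are hypothetical helpers.

```python
def unicharset_symbols(lines):
    """Return the set of first fields of all entry lines (heuristic)."""
    symbols = set()
    for line in lines:
        fields = line.split()
        # The first line holds only the entry count and is skipped.
        if len(fields) >= 2:
            symbols.add(fields[0])
    return symbols


def has_symbol(path, symbol, encoding="utf-8"):
    """True if symbol appears as an entry in the unicharset file at path."""
    with open(path, encoding=encoding) as fh:
        return symbol in unicharset_symbols(fh)
```

For a local copy of the file, `has_symbol("eng/eng.unicharset", "\u00b1")` would answer the question for ±. The same check works for any other single character.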

Many thanks to whoever can assist here. I am attaching the file I used to test this behavior for this character here: (https://github.com/tesseract-ocr/langdata_lstm/files/9870674/Special.Symbols.pdf)

Tesseract fails to detect letters Å and å in Finnish language.

I am testing Tesseract on Finnish texts containing the "Swedish o", å. It seems that it cannot detect Å and å correctly. I have also tried the fin+swe model, but the fin model's version of the text is usually selected.

Are the previous training files available somewhere? Probably the training data does not have enough Åå cases, or the letter is not included at all, even though it is an official letter.

grc letters with dot below

This is relevant specifically to grc. Because modern books of Ancient Greek often have to mark out uncertain letters in ancient sources, letters with dot below are a common occurrence, but they are at present not recognised by tesseract.

A fairly complete list of letters with dot below (except for the lunate sigma ϲ̣) can be found here: https://titus.uni-frankfurt.de/unicode/unicsel/grkkadd.htm
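
Such letters can also be generated programmatically by appending U+0323 (COMBINING DOT BELOW) to each base letter. The sketch below only builds the strings and assumes the lowercase range U+03B1 to U+03C9; whether and how they would go into training text is outside its scope. `with_dot_below` is a hypothetical helper.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW

# Greek small letters alpha to omega, including final sigma (U+03C2)
GREEK_LOWER = [chr(cp) for cp in range(0x03B1, 0x03CA)]


def with_dot_below(letters):
    """Append a combining dot below to each letter and normalize to NFC."""
    return [unicodedata.normalize("NFC", ch + DOT_BELOW) for ch in letters]
```

The lunate sigma mentioned above can be added by passing `GREEK_LOWER + ["\u03f2"]`.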

I wonder if recognising dot below shouldn’t be a feature behind a flag to be manually turned on because it might also pick up stains in older books (which however tend not to have such dots & so don’t require this feature). But this could make it difficult to deploy the feature in downstream projects like Internet Archive.

Trailing spaces on line 27 of eng.punc

I've not yet worked out whether eng.punc is used by the LSTM mode of tesseract, but I discovered that there are two trailing spaces on line 27 of this file, which might cause the occasional problem.
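
A quick way to find such lines in any of the data files is sketched below. `trailing_whitespace_lines` is a hypothetical helper; it reports the 1-based line number and the number of trailing spaces or tabs.

```python
def trailing_whitespace_lines(path, encoding="utf-8"):
    """Return (line number, count) for every line ending in spaces or tabs."""
    hits = []
    # newline="" keeps the original line endings so they can be stripped explicitly.
    with open(path, encoding=encoding, newline="") as fh:
        for number, line in enumerate(fh, start=1):
            body = line.rstrip("\r\n")
            stripped = body.rstrip(" \t")
            if body != stripped:
                hits.append((number, len(body) - len(stripped)))
    return hits
```

Note that a line consisting only of spaces is also reported, which may be intentional in some files.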

Slight modification in Bodhi for incorporating a few unique characters in Drenjongke

All the training data present in bodhi and dzongkha applies very much to Drenjongke, with the exception of the two points below.
1) Since the size of our corpus is not large, we could have typed all the data, but we opted for using the OCR method instead. Testing the OCR method was beneficial because we found that the OCR-ed texts contained errors due to the “tsha-lag” ◌༹ marker, which is used to mark the pronunciation of [bj] in Drenjongke. The use of this marker is unique to Drenjongke because Tibetan (bodhi) does not have the sound [bj].
2) For tokenization, space was set as a delimiter. Drenjongke script is marked by a syllable marker called “tsheg” ་, and has a space between potential morpheme or word boundaries. The use of space in the orthography is specific to Drenjongke as other Tibetan languages do not utilize spacing in a sentence.

Since these two are minor issues, we decided not to train Drenjongke entirely from scratch, but instead to add the required character to solve problem 1 first.
Also, in the previous issue, amitdo (https://github.com/amitdo) mentioned that the training should be done on our side. Since our expertise lies in language and not in programming, our understanding is that we should clone the entire tesseract-ocr repo locally, make the changes, and then train it. Or is it done some other way? Any help would be highly appreciated.

Missing GREEK LUNATE SIGMA SYMBOL in grc and script/Greek models

Current Behavior

A lunate sigma (ϲ, U+03F2) is recognised under language ‘grc’ but is being output as a normal sigma (σς).

Expected Behavior

Outputting it as U+03F2.

tesseract -v

5.3.0-6-g76ae

Apparently Lao\Lao.unicharset Has Uncommitted Changes

I've just installed Git Desktop on Win10 and started cloning Tesseract-ocr. When the process finished, the desktop showed that the above-referenced file had been changed, and showed the changed lines:

-163
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
85
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
...
and many, many more

Further, the desktop offered me the opportunity to Commit the changes.

It appears that lines 1-4 have been changed, and many lines have been added; but since it's not my code, far be it from me to actually commit the changes.

May I ask for any recommendations?
TIA

wrong default mapping of some Romanian diacritics

Environment

Debian Linux

  • Tesseract Version: tesseract 4.00.00alpha

  • Platform: Linux 4.15.0 SMP PREEMPT 2018 x86_64 GNU/Linux

Current Behavior:

using the ron option (Romanian):

Romanian diacritics șȘțȚ are mapped to the wrong Unicode code points, namely:
Ș -> Ş=U+015E
ș -> ş=U+015F
Ț -> Ţ=U+0162
ț -> ţ=U+0163

Expected Behavior:

Ș -> Ș=U+0218
ș -> ș=U+0219
Ț -> Ț=U+021A
ț -> ț=U+021B

Suggested Fix:

Edit the map accordingly.
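
Until the mapping is corrected in the model, the OCR output can be fixed in a post-processing step. The sketch below is a workaround, not the suggested change to the training data; it translates the four cedilla code points listed above to their comma-below counterparts. `fix_romanian` is a hypothetical helper name.

```python
# Cedilla forms (wrong for Romanian) mapped to comma-below forms.
CEDILLA_TO_COMMA_BELOW = str.maketrans({
    "\u015e": "\u0218",  # capital S
    "\u015f": "\u0219",  # small s
    "\u0162": "\u021a",  # capital T
    "\u0163": "\u021b",  # small t
})


def fix_romanian(text):
    """Replace S/T with cedilla by S/T with comma below."""
    return text.translate(CEDILLA_TO_COMMA_BELOW)
```

This is only safe for text known to be Romanian, because U+015E and U+015F are the correct letters in other languages such as Turkish.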
