tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models

License: Apache License 2.0

ocr tesseract

tessdata's Issues

4.0 alpha: Add config to Sanskrit traineddata for improved accuracy

Adding a config file to san.traineddata (and also replacing the tesseract model for Sanskrit with the one for English) seems to improve accuracy and also reduces the size of the traineddata.

san - 4.0 traineddata
sa4 - modified 4.0 traineddata - https://github.com/Shreeshrii/tessdata4alpha/blob/master/sa4.traineddata

For the sample used:

san
CER	5.05
WER	4.70
WER (order independent)	4.24

sa4
CER	1.86
WER	1.44
WER (order independent)	1.12
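
For readers unfamiliar with these metrics, here is a minimal sketch of how character and word error rates are commonly defined: the edit distance between the OCR output and the ground truth, divided by the ground-truth length, as a percentage. This is a generic Levenshtein-based definition and not necessarily the method used to produce the reports linked in this issue; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_text):
    """Character error rate in percent (assumes non-empty ground truth)."""
    return 100.0 * edit_distance(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth, ocr_text):
    """Word error rate in percent, splitting on whitespace."""
    ref_words = ground_truth.split()
    return 100.0 * edit_distance(ref_words, ocr_text.split()) / len(ref_words)
```

Both functions take the ground-truth text first and the OCR output second, so a lower value means the OCR output is closer to the ground truth.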

Accuracy reports are at

https://github.com/Shreeshrii/tessdata4alpha/blob/master/san_report.html
https://github.com/Shreeshrii/tessdata4alpha/blob/master/sa4_report.html

Groundtruth and images are at:
https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages

Multiple files for Arabic and Hindi

There are multiple files for ara and hin. Just downloading the traineddata for these languages is not enough.

Downloading multiple files on Windows is difficult.

I suggest that all required files for these two languages also be provided as a single zip file, one for each language.

Thanks!
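
The request above could be sketched roughly as follows, assuming every file that belongs to a language is named with the language code as its first dot-separated component (ara.*, hin.*). The function name and the directory layout are hypothetical, not part of this repository.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def zip_per_language(tessdata_dir, out_dir, languages=("ara", "hin")):
    """Bundle every file named '<lang>.*' into '<lang>.zip'; return the archives."""
    tessdata_dir, out_dir = Path(tessdata_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group files by the part of the name before the first dot.
    groups = defaultdict(list)
    for path in sorted(tessdata_dir.iterdir()):
        lang = path.name.split(".", 1)[0]
        if path.is_file() and lang in languages:
            groups[lang].append(path)

    created = []
    for lang, files in groups.items():
        archive = out_dir / f"{lang}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in files:
                zf.write(f, arcname=f.name)  # store without directory prefix
        created.append(archive)
    return created
```

A Windows user would then download one archive per language and unpack it into the tessdata folder.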

Persian support in Tesseract

Would you be willing to add support for the Persian language?

There is an ongoing project at https://github.com/reza1615/PersianOcr and I tested it with "per.traineddata"; it works well on my end.

Thank you

ara.traineddata init failed

I can't init TessBaseAPI with ara.traineddata. Other traineddata files are OK. Does anyone else have the same issue?

Missing hin.cube.size

According to the documentation (https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube), lang.cube.size is required for the cube recognition engine.

Either Hindi (hin) should provide all files needed for the cube recognition engine or none of them (or the documentation is wrong and cube works without lang.cube.size - is there more up to date documentation?).

unpack file traineddata

I use Windows 7 and have installed 7-Zip,
and I cannot unpack the file ara.traineddata.

Add version info to traineddata files

Since the format of traineddata files is different between 3.xx and 4.xx, it would be helpful to include version info in traineddata files and to also check for compatibility when running tesseract.

Please tag a release

While it is good to have this data up here on github, it would be helpful for Linux distributions if you could tag a release, so that there is a definite set of files that distributions can distribute. Our mirror infrastructure should not mirror "the latest version of whatever is there", and for security reasons we would also like to check by hashing that what we distribute to users tomorrow is the same that we distributed yesterday.
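
As an illustration of the hashing check described above, a distribution could record a SHA-256 digest per file at release time and compare against it later. This is only a sketch: the `known_hashes` mapping and the function names are hypothetical, and nothing here reflects how any particular distribution actually does it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(tessdata_dir, known_hashes):
    """Names of *.traineddata files whose digest is missing from or differs
    from known_hashes (a dict mapping file name to expected hex digest)."""
    changed = []
    for path in sorted(Path(tessdata_dir).glob("*.traineddata")):
        if known_hashes.get(path.name) != sha256_of(path):
            changed.append(path.name)
    return changed
```

With a tagged release, the recorded digests stay valid for that tag, which is what makes this kind of check meaningful.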

Portuguese Language Module?

Hi, I just downloaded FreeOCR (version 5.41, Tesseract v3), but when I tried the Portuguese language module for Tesseract OCR available on this site, it caused a problem with the OCR: "tesseract.exe has stopped working." I tried versions 4 and 3.04 of the language module (the program recommends above 2), but both cause the same problem. The built-in English language OCR works with no problem (except, of course, that it makes mistakes because it doesn't recognise the language). When I opened the language folder of the software, I noticed that por.traineddata is much larger (20,956 KB) than any of the pre-installed languages (2,000-3,000 KB). Any suggestions?

Thanks!

Compiling tesseract 4.0 doesn't work for me

I followed the steps in https://github.com/tesseract-ocr/tesseract/wiki/Compiling#linux
I got an error, so I installed autoconf-archive, and to resolve the error about "Leptonica 1.74" I ran:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
export LIBLEPT_HEADERSDIR=/usr/local/include
export TESSDATA_PREFIX='/usr/local/share/'

./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make

but I receive this error:

In file included from rect.h:28:0,
from coutln.h:26,
from stepblob.h:23,
from werd.h:28,
from blobbox.h:25,
from blobbox.cpp:25:
blobbox.h: In member function 'void TO_BLOCK::print_rows()':
blobbox.h:737:61: error: expected ')' before 'PRId32'
tprintf("Row range (%g,%g), para_c=%g, blobcount=%" PRId32 "\n",
^
../ccutil/tprintf.h:31:39: note: in definition of macro 'tprintf'
#define tprintf(...) tprintf_internal(__VA_ARGS__)
^
make[2]: *** [blobbox.lo] Error 1
make[2]: Leaving directory `/home/webuser/src/tesseract-4.0/ccstruct'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/webuser/src/tesseract-4.0'
make: *** [all] Error 2

Can you help me?
thanks

Please do split-tarball releases for each language

We used to have per-language tarballs like

tesseract-ocr-3.02.ind.tar.gz
tesseract-ocr-3.02.hrv.tar.gz

etc.

This is crucial for packaging these files in Linux distributions, so please release per-language traineddata tarballs again.

Avoid Tesseract download data

Hello,

I am using the JavaScript version of Tesseract to create a simple web page that lets users extract text from an image. As I am working on a really slow connection, is it possible to somehow avoid having Tesseract download the traineddata?

Thanks in advance.

ita.special-words missing

https://github.com/tesseract-ocr/langdata/tree/master/ita

https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.config

user_words_suffix special-words

https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.special-words
We probably need to put this file in this (tessdata) repo and tell the users of ita.traineddata that they also need this file.

https://groups.google.com/forum/#!topic/tesseract-ocr/dkwdLLMheFs

Files Error

These files give an error with the Tesseract 2.0 JNI library.

san default psm mode skipping text

$ tesseract --tessdata-dir /usr/share -l san san-019.png san-019
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Detected 18 diacritics

ज्यंग्लत्रुखीसह्स्रनामलोत्रम्- नामावळिट्.
दुर्गासहस्रनामस्तीत्रम्- १ नामाक्ळि
दुर्गासहस्रनत्मस्तीत्र्दुं'म्- २ नामावळिऽ
द्बुगसिद्द्स्रनत्मरत्तोत्रम्दकारादि (३) नामावळि

पार्वतीं ह्यो) सहम्रनम्परतोत्रम्- नामावळिऽ’

फुलकुर्व्यसहस्रनत्मक्तोत्रम्-क्ताचम्-नत्माचळिऽ

गम्यत्रीसह्स्रनत्मक्तोत्रम्-नग्मग्वळिऽ(१)

191
,213

238

300
329
355
360

365

394

397-

432

457

488

491

493

517

531

Low Accuracy for san.traineddata, alternate traineddata files

The Sanskrit traineddata file provided here has low accuracy according to the OCR evaluation reports. Please see 'https://github.com/Shreeshrii/imagessan/tree/master/ocr-san' for the OCRed text and accuracy results.

Training for Sanskrit using the tesstrain.sh process has yielded Sanskrit traineddata files that provide better accuracy. Please see the following for the OCRed text and accuracy report summaries.

The traineddata files for the above are at https://github.com/Shreeshrii/imagessan/tree/master/tessdata

The images and ground truth files used for the evaluation are at 'https://github.com/Shreeshrii/imagessan'

With the current sample for testing, s30.traineddata seems to give the best result.

About Japanese Language File

There seems to be only one file for Japanese, while there are about 7-8 files each for English, French, etc. When I add the single Japanese file to the tessdata folder and run, the app crashes, saying there is no file for Japanese. The same program works fine for English. My app is made for iOS, armv7/arm64.

There is no LSTM-based traineddata for 'kur'

BTW, according to Wikipedia there is also a Latin-based Kurdish script.

Which data file should I use to detect credit card numbers?

Brew gets stuck cloning.

I'm not sure what the cause is.
brew install tesseract --HEAD
Everything goes well until when it tries to clone https://github.com/tesseract-ocr/tessdata.git.
remote: Counting objects: 111, done.
remote: Compressing objects: 100% (111/111), done.
Receiving objects: 10% (12/111), 79.01 MiB | 11.95 MiB/s
Then it gets stuck.
Thank you for the help in advance!

4.0 alpha: traineddata size

Refer to the discussion at tesseract-ocr/tesseract#40

"For Hindi cube+tesseract has half the error rate of either on their own. ... if the new LSTM engine is better, then yes, cube is likely to get the chop for 4.00,"

"Tests complete. Decision made. Cube is going away in 4.00. ... Note in the above table that LSTM is faster than Tess 3.04 (without adding cube) in both wall time and CPU time! For wall time by a factor of 2."

Also see: tesseract-ocr/tesseract#521

"Tesseract will currently refuse to initialize with just the lstm model in it, even if you use OEM_LSTM_ONLY as the OCR engine mode. For now, you can make it run using any existing traineddata file and adding your new lstm model and (optionally) the lstm dawgs."

The traineddata uploaded for 4.0 alpha for the Devanagari script languages san and hin, and probably also for mar and nep, contains files for both the tess model and the LSTM model. Since tests show that LSTM is better in both speed and accuracy for Devanagari (complex scripts), I suggest the following:

- Delete the various dawg files related to the tess model from the traineddata file to reduce its size.
- Since the inttemp, pffmtable, shapetable and normproto files are currently required, replace the inttemp, pffmtable and shapetable files with dummy files. I tested this by replacing them with the files from eng.traineddata, which is smaller, and it worked fine. Normproto is checked against the unicharset and gives error messages if changed, so it can be left as is until this requirement is removed.
- Add a config file that selects the LSTM engine for the traineddata.

By doing the above, I was able to reduce the traineddata size for san from 40+ MB to 20+ MB.

san - same word recognized differently in same page

Notice the word नामावलिः at the end of each line - it is recognized incorrectly, and in a different way on each line. (psm 6)

विर्व्य 16
ज्यग्लत्रुखीसहूस्रनामरतोव्रम्- नामाव'ळिऽळू 191
दुर्गासहस्रनामस्तीत्रम्- १ नामांक्ळिन्नू ॰213
द्रुर्गासहस्रनत्मस्तीन्रम्- २ नामावळिऽ 238
द्दुगसिद्द्स्रनत्मक्तोत्रम्दकाराद्दि(३) नामाव'ळिऽ 263
ट्टुगसिहस्रनामक्तोत्रम्- ४ नामावळिइं 300
पार्वतीं ह्यो) सहस्रनामातोत्रम्- नामावळिऽ’ 329
द्दुर्गानवाक्षरीन्निशतींनत्माव'क्ति 355
द्बुर्गाष्टोत्तरङ्प्तनत्मरतोव्रम्- नामावक्ति 360
र्व्यत्मामस्वोत्रम्- नामाक्ळिऽ 363
अन्नपूण्स्सिहस्रनत्मस्तीत्रम्- नामावक्ति 365
अन्नघूर्गाष्टोत्तस्यातनामस्तीन्रम्- नामावक्ति 394
क्रुलकुर्व्यसहस्रनत्मक्तोत्रम्- कवचम्… नामावळिथ् 397-
कुमारींसहृस्रनामक्तोन्नम्- नामावळिय् 432
गङ्ग’म्यासद्वृस्रनप्मक्तोव्रम्- नाम।वक्ति` 457
गङ्ग’म्याष्टोत्तराप्तनामप्तोत्रम्- नामावळिऽ 488
गङ्गादातनप्तास्तोत्रम्- नामावक्ति 491
यमुनासहस्रनामरतोव्रम्- नम्पावळिय् 493
'शिवगङ्गासद्दृस्रनत्माव'ळि 517
गम्पत्रीसह्स्रनत्मक्तोत्रम्- नाम।व'ळिऽ (१) 531

font detection broken -> probably bug in traineddata

Trying to recognize font attributes I found that the font is always set to "Georgia" using eng.traineddata. I attached a synthetic image with different font attributes and simple code to demonstrate the problem. Using config var "tessedit_debug_fonts" I found the score of Georgia is off by something that looks suspiciously like a factor 10. So this is probably about a missed period somewhere.
broken_font_detection_sample.zip
Update: creating my own traineddata using tesstrain.sh and just Liberation Sans/Serif (Bold/Regular/Italic), I found that again just a single font is matched. I could not identify any connection between the training process and the single matching font. If I exclude the matching font from training, there is another font that always matches.

fra file generates large number of errors

After downloading fra.traineddata and putting it in /usr/local/share/tessdata/, I get the following errors:

ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line ConvNL<84>^P
ParamsModel::Incomplete line ܵ<9B><B6><C4><C4>?<CF>^XH<B8><%<D1>?S{E`q<E4><D6>?
<D8>m<AF><BF>ف<D3>?7^Y^Nt<C3><EE><B4>?<8E><D9><C5>u^]<E9>俯^C<A4><B3>0<U+07FB>
<BF><D4>٧ҙ<94><U+063F><D1><C6>S^Sk<9E>տy<EB>^V^?<FA>o<B7><BF>^Y
ParamsModel::Incomplete line '<B6>ĿZ(

Version:

% tesseract  -v 
tesseract 3.04.01
 leptonica-1.73
  libjpeg 8d : libpng 1.6.26 : libtiff 4.0.7 : zlib 1.2.8

Running on macos Sierra Darwin 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64

Installed with brew install tesseract.

Arabic Language output is reversed

Hello,

I'm new to tesseract-ocr version 3. I'm using the command line to OCR Arabic TIFF files, but all word positions are reversed.

Any help you can provide on fixing the reversed ara output in tesseract version 3 would be appreciated.
I'm using cygwin to run tesstrain.sh, but so far the OCR output is still reversed.

Many Thanks in Advance.

Please add licensing information

Hi,

Since tessdata is being maintained as a separate repository from the code, would it be possible to add licensing information here as well?

This would clarify that the language files can also be distributed along with the code/binaries.

Thanks

The result of sin.traineddata is not very accurate. How can I improve it?

13 Checked.pdf

Which data file should I use to detect digital-watch-style digits?

I am using the data file for English. It detects normal digits, but it does not detect digits like those on a digital watch. I have an image of a digital watch and I want to read its digits. Is there any way, or any .traineddata file, to detect these digits? How can I create a .traineddata file for this requirement?

languages not loaded

Hello,
At
https://github.com/tesseract-ocr/tesseract/wiki

I saw there is a windows installer for Tesseract 3.05-dev from Tesseract at UB Mannheim at
https://github.com/UB-Mannheim/tesseract/wiki

I installed it. It works with English. On the last page there is a link to "download the appropriate training data" for another language at
https://github.com/tesseract-ocr/tessdata

On this page it says "These language data files only work with Tesseract 4. Get language data files for Tesseract 3.04 or 3.05 from the 3.04 tree.". The new link is:
https://github.com/tesseract-ocr/tessdata/tree/3.04.00

I downloaded bul.traineddata and deu.traineddata from there and copied them to
. . . \Tesseract-OCR\tessdata
Now tesseract --list-langs shows
bul
deu
eng
osd

But it really works only with English. For Bulgarian and for German the message is:
Failed loading language 'deu'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Any advice - what could be wrong?
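
One thing that may be worth ruling out, purely as an assumption since nothing in the report confirms it, is that the browser saved a web page for the file instead of the binary traineddata itself. A minimal sketch of such a check follows; the function name is hypothetical and the test is a heuristic on the first bytes only.

```python
def looks_like_html(path, probe_bytes=512):
    """Heuristic: True if the file starts like an HTML document rather than
    binary data. Only the first probe_bytes bytes are inspected."""
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype", b"<html"))
```

If this returns True for deu.traineddata, the file would need to be downloaded again as a raw file; if it returns False, the cause lies elsewhere.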

ces-traineddata causes crash

When using ces-traineddata (downloaded recently from GitHub) in tesseract-ocr through FreeOCR, tesseract-ocr crashes. I don't know the version of tesseract-ocr, but it comes with the recent FreeOCR 5.41. Other languages work (English and Swedish tested). Tested on Windows 8.

Dictionary update problem

When I downloaded the dictionary from https://github.com/tesseract-ocr/tessdata/spa.traineddata and replaced the 2 MB dictionary file with the 24 MB one, it crashes when I try the OCR. Perhaps Tesseract needs to be updated.

Duplicate and incomplete data for German fraktur

Both deu_frak.traineddata and frk.traineddata try to support German fraktur.

deu_frak is not part of the official tesseract-ocr/langdata, but comes from paalberti/tesseract-dan-fraktur. It does not support the new LSTM recognizer introduced by Tesseract 4, but currently gives better results for fraktur texts than frk (which supports LSTM).

frk can be improved a lot by adding missing characters (primarily the long s, but also paragraph and dollar sign and maybe more) and based on latest corrections for langdata. With an improved frk, deu_frak would no longer be needed.

It is unclear who invented the name frk for Frankish. Maybe it should be renamed.

How to use tesseract for Persian/Farsi language?

Hi everyone!
I am a newbie and would like to use tesseract for the Persian language. How should I do that? Is there a simple and straightforward tutorial for new users?
I have installed tesseract as described in this tutorial: https://github.com/tesseract-ocr/tesseract/wiki/Compiling, but I don't know how to use it.

Where could I find a previous version, say 3.03?

I tried running an app on my device and it did not work.

Can we make a podspec for this?

Can we make a podspec for this please?

With a podspec, we can always get the latest training data.

When using chi_sim.traineddata 3.04, I get an error.

File "3test.py", line 6, in <module>
print image_to_string(image,lang="chi_sim").decode("utf8")
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 164, in image_to_string
raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'read_params_file: parameter not found: allow_blob_division')
zyw@arbipher:~/ocr$

micr tessdata

An mcr.traineddata file is already available. Would it be worth including it, from the link below, in the listed traineddata? It would be useful.

I downloaded it from 'https://groups.google.com/forum/#!msg/tesseract-ocr/obWI4cz8rXg/9glPuTdRX5EJ'

Which tessdata file is the German data? How can I tell which tessdata file can be used for a specific language?

I am getting unmatched text. How do I set variables for different font sizes?

Remove cube from within trained data files

Hindi and a few other languages have cube related files within traineddata. These files should be replaced with new versions without cube, reducing their file size.

/sbin/ldconfig.real: /usr/local/lib is not a known library type

Hi
I got these error messages after trying to install tesseract in two ways (from source and from the Debian command line):

/sbin/ldconfig.real: /usr/local/lib is not a known library type
/sbin/ldconfig.real: /usr/local/lib/pkgconfig is not a known library type

What's the problem and how can I fix it?

tessedit_load_sublangs

In language file spr_latn.tessdata (Serbian Latin) there is a line:
tessedit_load_sublangs srp
which means that tesseract loads the srp (Serbian Cyrillic) language file.

As a result, some of the text is recognized as Cyrillic, even if the original text contains no Cyrillic script at all!

Can this option be disabled in any way, or new language files provided without the "load sublangs" part?

Test image is provided

(Additional notes:
-Older version of this language file did not have this line.
-"tessedit_load_sublangs" is present in another file, srp_latn.
-in the future, if more than one language is needed, one can select it by typing: -l srp+srp_latn)

Thank you.
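
As a possible workaround for the issue above, assuming the config component has already been unpacked from the traineddata into a plain text file (for example with the combine_tessdata training tool) and will be packed back afterwards, the offending line could be stripped like this. The function name is hypothetical and the sketch only handles the text; unpacking and repacking are out of its scope.

```python
def drop_config_variable(config_text, name="tessedit_load_sublangs"):
    """Return config_text without lines whose first token is `name`.
    Other lines, including blank ones, are kept in order."""
    kept = [line for line in config_text.splitlines()
            if line.split(None, 1)[:1] != [name]]
    return "\n".join(kept) + "\n"
```

Matching on the first whitespace-separated token avoids removing other variables that merely contain the same text.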

German dataset seems to have issues

Calling tesseract with the option -l deu results in the following error messages on the console.

ParamsModel::Incomplete line
ParamsModel::Incomplete line ConvNL„�
ParamsModel::Incomplete line ?êBN¸H�Ñ?s�d¯ü�Å?|5LÆ¾�•?LNHçþîå¿ˆ!žÉË-±¿wPß¸Îœ?”H�ôW;Æ?é�‰áé«Â?'Ð=›�V«?æ�Üêqå½?;2ë,£sÓ?©á›¦•˜Ð?S×
ParamsModel::Incomplete line ó§¿$j�¬0�Ä¿`èy�l�¿¿Å�çv³2»?³\ý:c²È?uÓ%Qgâ¿¨Ì!ÉF¯„?êÉ²�é“?�q'ÙÕ9b?�

With the default eng language option or other languages (e.g. dan) this does not seem to happen.

The program still finishes and produces output, but the error messages prevent using tesseract with wrappers.

Full console output:
output.txt
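
Until the data file itself is fixed, a wrapper could separate these specific messages from the rest of stderr before deciding whether a run failed. The sketch below is a stopgap under that assumption, not a fix: it matches only lines that start with the prefix shown in the report, so continuation lines of binary junk would still end up in the "other" list. The names are illustrative.

```python
NOISE_PREFIX = "ParamsModel::Incomplete line"


def split_stderr(stderr_text):
    """Split stderr into (noise_lines, other_lines) by the known prefix.
    Decode raw bytes first, e.g. raw.decode("utf-8", errors="replace")."""
    noise, other = [], []
    for line in stderr_text.splitlines():
        (noise if line.startswith(NOISE_PREFIX) else other).append(line)
    return noise, other
```

A wrapper would then treat the run as failed only if the second list contains something it does not expect.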

Spa, Ara cube data is not working. Throws AccessViolation error

Hi,
I am using the Spanish and Arabic trained data in tesseract, but the TesseractEngine statement throws an AccessViolation error.

One more thing: I have an urgent requirement for accurate text extraction, so we are going to use the TesseractAndCube engine mode. For this engine mode we need the Cube.* trained data. Please provide the Cube trained data for other languages such as Portuguese, or kindly give me the steps to train the Cube data file for other languages.

Thanks and Regards,
Merlin

Can't read digits with Arabic traineddata

I'm using tesseract 3.05 for Android, and I have one problem with the traineddata files: the Arabic traineddata can't read digits (0123456789). I think the problem is in LTR and RTL handling. Any suggestions?

About Persian(fas) Language

Hello,
I am new to this nice project and I need to use it for my language, but the only thing I found for the Persian (fas) language was a single file, and the result after testing was really disappointing. I guess it needs more data or training.
Could you please help me understand how to make it more accurate?
Thank you.

Please review osd and eng traineddata PKGBUILD for msys2

I have updated the pkgbuild for OSD and ENG traineddata for msys2 - please see https://github.com/Alexpux/MINGW-packages/pull/673

and make any required changes

Indonesian special char

Hi! Thank you so much for this great app!

For the Indonesian data, I think punctuation and special characters (@, !, ., etc.) are not included.

If I don't specify a language, my email is transcribed perfectly,
but after I specify the Indonesian language, @ is seen as Q, and . is not read at all.

How to fix/recreate traineddata file? How can I help?