tessdata's People

Contributors

howardjones, jbreiden2, jharia, shreeshrii, stweil, theraysmith, zdenop

tessdata's Issues

4.0 alpha: Add config to Sanskrit traineddata for improved accuracy

Adding a config file to san.traineddata (and replacing the tesseract model for Sanskrit with the one for English) seems to improve accuracy and also reduces the size of the traineddata.

san - 4.0 traineddata
sa4 - modified 4.0 traineddata - https://github.com/Shreeshrii/tessdata4alpha/blob/master/sa4.traineddata

For the sample used:

san
CER	5.05
WER	4.70
WER (order independent)	4.24

sa4
CER	1.86
WER	1.44
WER (order independent)	1.12

Accuracy reports are at

https://github.com/Shreeshrii/tessdata4alpha/blob/master/san_report.html
https://github.com/Shreeshrii/tessdata4alpha/blob/master/sa4_report.html

Groundtruth and images are at:
https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages
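
The report does not say how the CER and WER figures were computed. As a rough illustration only, the sketch below assumes the common definition: edit distance between ground truth and OCR output, divided by the length of the ground truth, as a percentage. The function names are hypothetical and are not taken from the evaluation tool used for the reports.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ref_text, hyp_text):
    # Assumption: characters are Unicode code points, not grapheme clusters.
    return error_rate(ref_text, hyp_text)


def wer(ref_text, hyp_text):
    # Assumption: words are separated by whitespace.
    return error_rate(ref_text.split(), hyp_text.split())
```

For Devanagari text a code-point based CER counts each vowel sign and virama separately, so a tool that works on grapheme clusters would report different numbers.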

Multiple files for Arabic and Hindi

There are multiple files for ara and hin. Downloading just the traineddata file for these languages is not enough.

Downloading multiple files on Windows is difficult.

I suggest also providing all required files for these two languages as a single zip file, one for each language.

Thanks!
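
Until such archives exist, a per-language zip can be built locally from a checkout. This is a minimal sketch, assuming every file belonging to a language is named `<lang>.<something>` in one directory; `bundle_language` and the paths in the usage line are hypothetical.

```python
import zipfile
from pathlib import Path


def bundle_language(tessdata_dir, lang, out_dir):
    """Zip every file named '<lang>.<something>' into out_dir/<lang>.zip."""
    tessdata_dir = Path(tessdata_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    members = sorted(
        p for p in tessdata_dir.iterdir()
        if p.is_file()
        and p.name.startswith(lang + ".")
        and p.suffix != ".zip"  # never pack an earlier archive into itself
    )
    if not members:
        raise FileNotFoundError(f"no files for language {lang!r} in {tessdata_dir}")
    archive = out_dir / f"{lang}.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in members:
            zf.write(path, arcname=path.name)  # store flat, without directories
    return archive
```

Usage would look like `bundle_language("tessdata", "ara", "dist")`, repeated for `"hin"`.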

ara.traineddata init failed

I can't init TessBaseAPI with ara.traineddata. Other traineddata files work fine. Does anyone else have the same issue?

Add version info to traineddata files

Since the format of traineddata files is different between 3.xx and 4.xx, it would be helpful to include version info in traineddata files and to also check for compatibility when running tesseract.

Please tag a release

While it is good to have this data up here on github, it would be helpful for Linux distributions if you could tag a release, so that there is a definite set of files that distributions can distribute. Our mirror infrastructure should not mirror "the latest version of whatever is there", and for security reasons we would also like to check by hashing that what we distribute to users tomorrow is the same that we distributed yesterday.
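
The hash check described above can be sketched as follows. This is only an illustration of the idea, assuming SHA-256 and a directory of `*.traineddata` files; the function names are hypothetical and do not describe any distribution's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large traineddata files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, pattern="*.traineddata"):
    """Map file name -> SHA-256 hex digest for every matching file."""
    return {p.name: sha256_of(p) for p in sorted(Path(directory).glob(pattern))}


def changed_files(old_manifest, new_manifest):
    """Names that were added, removed, or whose digest differs."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))
```

With a tagged release, a manifest built once from the tag can be compared against what a mirror serves later; `changed_files` returning an empty list means the two sets are identical.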

Portuguese Language Module?

Hi, I just downloaded FreeOCR (version 5.41, Tesseract v3), but when I tried the Portuguese language module for Tesseract OCR available on this site, it caused a problem with the OCR: "tesseract.exe has stopped working." I tried versions 4 and 3.04 of the language module (the program recommends above 2), but both cause the same problem. The built-in English language OCR works without problems (except, of course, that it makes mistakes because it doesn't recognise the language). When I opened the software's language folder, I noticed that por.traineddata is much larger (20,956 KB) than any of the pre-installed languages (2,000-3,000 KB). Any suggestions?

Thanks!

Compiling tesseract 4.0 doesn't work for me

I followed the steps in https://github.com/tesseract-ocr/tesseract/wiki/Compiling#linux
I got an error, so I installed autoconf-archive. To resolve the error with "Leptonica 1.74" I ran:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
export LIBLEPT_HEADERSDIR=/usr/local/include
export TESSDATA_PREFIX='/usr/local/share/'

./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make

but I receive this error:

In file included from rect.h:28:0,
from coutln.h:26,
from stepblob.h:23,
from werd.h:28,
from blobbox.h:25,
from blobbox.cpp:25:
blobbox.h: In member function 'void TO_BLOCK::print_rows()':
blobbox.h:737:61: error: expected ')' before 'PRId32'
tprintf("Row range (%g,%g), para_c=%g, blobcount=%" PRId32 "\n",
^
../ccutil/tprintf.h:31:39: note: in definition of macro 'tprintf'
#define tprintf(...) tprintf_internal(__VA_ARGS__)
^
make[2]: *** [blobbox.lo] Error 1
make[2]: Leaving directory `/home/webuser/src/tesseract-4.0/ccstruct'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/webuser/src/tesseract-4.0'
make: *** [all] Error 2

Can you help me?
thanks

Please do split-tarball releases for each language

We used to have per-language tarballs like

tesseract-ocr-3.02.ind.tar.gz
tesseract-ocr-3.02.hrv.tar.gz

etc.

These are crucial for packaging the files in Linux distributions, so please release per-language traineddata tarballs again.

Avoid Tesseract downloading data

Hello,

I am using the JavaScript version of Tesseract to create a simple web page that lets users extract text from an image. As I am working on a really slow connection, is it possible to somehow stop Tesseract from downloading the traineddata?

Thanks in advance.

Files Error

These files give an error with the Tesseract 2.0 JNI library.

san default psm mode skipping text

san-019


$ tesseract --tessdata-dir /usr/share -l san san-019.png san-019
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Detected 18 diacritics


ज्यंग्लत्रुखीसह्स्रनामलोत्रम्- नामावळिट्.
दुर्गासहस्रनामस्तीत्रम्- १ नामाक्ळि
दुर्गासहस्रनत्मस्तीत्र्दुं'म्- २ नामावळिऽ
द्बुगसिद्द्स्रनत्मरत्तोत्रम्दकारादि (३) नामावळि

पार्वतीं ह्यो) सहम्रनम्परतोत्रम्- नामावळिऽ’

फुलकुर्व्यसहस्रनत्मक्तोत्रम्-क्ताचम्-नत्माचळिऽ

गम्यत्रीसह्स्रनत्मक्तोत्रम्-नग्मग्वळिऽ(१)

191
,213

238

300
329
355
360

365

394

397-

432

457

488

491

493

517

531

Low Accuracy for san.traineddata, alternate traineddata files

The Sanskrit traineddata file provided here has low accuracy according to the OCR evaluation reports. Please see 'https://github.com/Shreeshrii/imagessan/tree/master/ocr-san' for the OCRed text and accuracy results.

Training for Sanskrit using the tesstrain.sh process has yielded Sanskrit traineddata files which provide better accuracy. Please see the following for the OCRed text and accuracy report summaries.

The traineddata files for the above are at https://github.com/Shreeshrii/imagessan/tree/master/tessdata

The images and ground truth files used for the evaluation are at 'https://github.com/Shreeshrii/imagessan'

With the current sample for testing, s30.traineddata seems to give the best result.

About Japanese Language File

There seems to be only one file for the Japanese language, while there are about 7-8 files each for English, French, etc. When I add the only Japanese file to the tessdata folder and run, the app crashes, saying there is no file for Japanese. The same program works fine for English. My app is made for iOS, armv7/arm64.

Brew gets stuck cloning.

I'm not sure what the cause is.
brew install tesseract --HEAD
Everything goes well until when it tries to clone https://github.com/tesseract-ocr/tessdata.git.
remote: Counting objects: 111, done.
remote: Compressing objects: 100% (111/111), done.
Receiving objects: 10% (12/111), 79.01 MiB | 11.95 MiB/s
Then it gets stuck.
Thank you for the help in advance!

4.0 alpha: traineddata size

Refer to the discussion in tesseract-ocr/tesseract#40

"For Hindi cube+tesseract has half the error rate of either on their own. ... if the new LSTM engine is better, then yes, cube is likely to get the chop for 4.00,"

"Tests complete. Decision made. Cube is going away in 4.00. ... Note in the above table that LSTM is faster than Tess 3.04 (without adding cube) in both wall time and CPU time! For wall time by a factor of 2."

Also see: tesseract-ocr/tesseract#521

"Tesseract will currently refuse to initialize with just the lstm model in it, even if you use OEM_LSTM_ONLY as the OCR engine mode. For now, you can make it run using any existing traineddata file and adding your new lstm model and (optionally) the lstm dawgs."

The traineddata uploaded for 4.0 alpha for Devanagari script languages san, hin and probably also for mar and nep has files for both tess model and LSTM model. Since tests show that LSTM is better from both speed and accuracy points of view for Devanagari (complex scripts), I suggest that

  1. The various dawg files related to tess model be deleted from the traineddata file to reduce size.

  2. Since inttemp, pffmtable, shapetable and normproto files are currently required, I suggest replacing the inttemp, pffmtable, shapetable files with dummy files - I tested by replacing them with files from eng.traineddata which is smaller and it worked fine. Normproto is tested against the unicharset and gives error messages if changed and so can be left, till this requirement is removed.

  3. Add a config file choosing the LSTM engine to be used for the traineddata.

By doing the above, I was able to reduce the traineddata size for san from 40+ MB to 20+ MB.

san - same word recognized differently in same page

Notice the word नामावलिः at the end of each line: it is recognized incorrectly, and in a different way, in each line. (psm 6)


विर्व्य 16
ज्यग्लत्रुखीसहूस्रनामरतोव्रम्- नामाव'ळिऽळू 191
दुर्गासहस्रनामस्तीत्रम्- १ नामांक्ळिन्नू ॰213
द्रुर्गासहस्रनत्मस्तीन्रम्- २ नामावळिऽ 238
द्दुगसिद्द्स्रनत्मक्तोत्रम्दकाराद्दि(३) नामाव'ळिऽ 263
ट्टुगसिहस्रनामक्तोत्रम्- ४ नामावळिइं 300
पार्वतीं ह्यो) सहस्रनामातोत्रम्- नामावळिऽ’ 329
द्दुर्गानवाक्षरीन्निशतींनत्माव'क्ति 355
द्बुर्गाष्टोत्तरङ्प्तनत्मरतोव्रम्- नामावक्ति 360
र्व्यत्मामस्वोत्रम्- नामाक्ळिऽ 363
अन्नपूण्स्सिहस्रनत्मस्तीत्रम्- नामावक्ति 365
अन्नघूर्गाष्टोत्तस्यातनामस्तीन्रम्- नामावक्ति 394
क्रुलकुर्व्यसहस्रनत्मक्तोत्रम्- कवचम्… नामावळिथ् 397-
कुमारींसहृस्रनामक्तोन्नम्- नामावळिय् 432
गङ्ग’म्यासद्वृस्रनप्मक्तोव्रम्- नाम।वक्ति` 457
गङ्ग’म्याष्टोत्तराप्तनामप्तोत्रम्- नामावळिऽ 488
गङ्गादातनप्तास्तोत्रम्- नामावक्ति 491
यमुनासहस्रनामरतोव्रम्- नम्पावळिय् 493
'शिवगङ्गासद्दृस्रनत्माव'ळि 517
गम्पत्रीसह्स्रनत्मक्तोत्रम्- नाम।व'ळिऽ (१) 531

san-019

font detection broken -> probably bug in traineddata

Trying to recognize font attributes I found that the font is always set to "Georgia" using eng.traineddata. I attached a synthetic image with different font attributes and simple code to demonstrate the problem. Using config var "tessedit_debug_fonts" I found the score of Georgia is off by something that looks suspiciously like a factor 10. So this is probably about a missed period somewhere.
broken_font_detection_sample.zip
Update: Creating my own traineddata using tesstrain.sh and just Liberation Sans/Serif (Bold/Regular/Italic), I found that again just a single font is matched. I could not identify any connection between the training process and the single matching font. If I exclude the matching font from training, another font always matches instead.

fra file generates large number of errors

After downloading fra.tessdata and putting it in /usr/local/share/tessdata/, I get the following errors:

ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line ConvNL<84>^P
ParamsModel::Incomplete line ܵ<9B><B6><C4><C4>?<CF>^XH<B8><%<D1>?S{E`q<E4><D6>?
<D8>m<AF><BF>ف<D3>?7^Y^Nt<C3><EE><B4>?<8E><D9><C5>u^]<E9>俯^C<A4><B3>0<U+07FB>
<BF><D4>٧ҙ<94><U+063F><D1><C6>S^Sk<9E>տy<EB>^V^?<FA>o<B7><BF>^Y
ParamsModel::Incomplete line '<B6>ĿZ(

Version:

% tesseract  -v 
tesseract 3.04.01
 leptonica-1.73
  libjpeg 8d : libpng 1.6.26 : libtiff 4.0.7 : zlib 1.2.8

Running on macos Sierra Darwin 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64

Installed with brew install tesseract.

Arabic Language output is reversed

Hello,

I'm new to tesseract-ocr x3. I'm using the command line to OCR an Arabic TIFF, but all word positions are reversed (opposite).

Can you provide any help with correcting the reversed output for the ara language in tesseract x3?
I'm using cygwin to run tesstrain.sh, but so far the OCR output is still reversed.

Many Thanks in Advance.

Please add licensing information

Hi,

Since tessdata is being maintained as a separate repository from the code, would it be possible to add licensing information here as well?

This would clarify that the language files can also be distributed along with the code/binaries.

Thanks

Which data file should I use to detect digital-watch-style digits?

I am using the data file for English. It detects normal digits, but it does not detect digits like those on a digital watch. I have an image of a digital watch and I want to read the digits on it. Is there any way, or any .traineddata file, to detect these digits? How can I create a .traineddata file for this requirement?

languages not loaded

Hello,
At
https://github.com/tesseract-ocr/tesseract/wiki

I saw there is a windows installer for Tesseract 3.05-dev from Tesseract at UB Mannheim at
https://github.com/UB-Mannheim/tesseract/wiki

I installed it. It works with English. On the last page there is a link to "download the appropriate training data" for another language at
https://github.com/tesseract-ocr/tessdata

On this page it says "These language data files only work with Tesseract 4. Get language data files for Tesseract 3.04 or 3.05 from the 3.04 tree.". The new link is:
https://github.com/tesseract-ocr/tessdata/tree/3.04.00

I downloaded bul.traineddata and deu.traineddata from there and copied them to
. . . \Tesseract-OCR\tessdata
Now tesseract --list-langs shows
bul
deu
eng
osd

But it really works only with English. For Bulgarian and for German the message is:
Failed loading language 'deu'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Any advice - what could be wrong?

ces-traineddata causes crash

When using ces-traineddata (downloaded recently from GitHub) in tesseract-ocr through FreeOCR, tesseract-ocr crashes. I don't know the version of tesseract-ocr, but it comes with the recent FreeOCR 5.41. Other languages work (English and Swedish tested). Tested on Windows 8.

Duplicate and incomplete data for German fraktur

Both deu_frak.traineddata and frk.traineddata try to support German fraktur.

deu_frak is not part of the official tesseract-ocr/langdata, but comes from paalberti/tesseract-dan-fraktur. It does not support the new LSTM recognizer introduced by Tesseract 4, but currently gives better results for fraktur texts than frk (which supports LSTM).

frk can be improved a lot by adding missing characters (primarily the long s, but also paragraph and dollar sign and maybe more) and based on latest corrections for langdata. With an improved frk, deu_frak would no longer be needed.

It is unclear who invented the name frk for Frankish. Maybe it should be renamed.

When I use chi_sim.traineddata 3.04, I get an error.

File "3test.py", line 6, in <module>
print image_to_string(image,lang="chi_sim").decode("utf8")
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 164, in image_to_string
raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'read_params_file: parameter not found: allow_blob_division')
zyw@arbipher:~/ocr$

/sbin/ldconfig.real: /usr/local/lib is not a known library type

Hi
I got these error messages after trying to install tesseract in two ways (from source and from the Debian command line):

/sbin/ldconfig.real: /usr/local/lib is not a known library type
/sbin/ldconfig.real: /usr/local/lib/pkgconfig is not a known library type

What's the problem and how can I fix it?

tessedit_load_sublangs

In language file spr_latn.tessdata (Serbian Latin) there is a line:
tessedit_load_sublangs srp
which means that tesseract loads srp (Serbian Cyrillic) language file.

As a result, some of the text is recognized as Cyrillic, even if the original text contains no Cyrillic script at all!

Can this option be disabled in any way, or new language files provided without the "load sublangs" part?

Test image is provided
test

(Additional notes:
-Older version of this language file did not have this line.
-"tessedit_load_sublangs" is present in another file, srp_latn.
-in the future, if more than one language is needed, one can select it by typing: -l srp+srp_latn)

Thank you.
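
The explicit selection mentioned in the notes (`-l srp+srp_latn`) can be assembled programmatically. The sketch below only builds the argument list and does not address the built-in `tessedit_load_sublangs` line; `tesseract_command` and the file names in the usage line are hypothetical.

```python
def tesseract_command(image, output_base, langs):
    """Build a tesseract argument list with languages joined by '+'."""
    if not langs:
        raise ValueError("at least one language is required")
    return ["tesseract", str(image), str(output_base), "-l", "+".join(langs)]
```

For example, `tesseract_command("test.png", "out", ["srp", "srp_latn"])` yields a list that can be passed to `subprocess.run`, while a single-element list such as `["srp_latn"]` requests only the Latin data on the command line.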

German dataset seems to have issues

Calling tesseract with the option -l deu results in the following error messages on the console.

ParamsModel::Incomplete line
ParamsModel::Incomplete line ConvNL„�
ParamsModel::Incomplete line ?êBN¸H�Ñ?s�d¯ü�Å?|5Lƾ�•?LNHçþî忈!žÉË-±¿wP߸Μ?”H�ôW;Æ?é�‰áé«Â?'Ð=›�V«?æ�Üêqå½?;2ë,£sÓ?©á›¦•˜Ð?S×
ParamsModel::Incomplete line ó§¿$j�¬0�Ä¿`èy�l�¿¿Å�çv³2»?³\ý:c²È?uÓ%Qg⿨Ì!ÉF¯„?êɲ�
é“?�q'ÙÕ9b?�

With the default eng language option or other languages (e.g. dan) this does not seem to happen.

The program still finishes and produces an output, but the error messages prevent using tesseract with wrappers.

Full console output:
output.txt
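
One possible workaround on the wrapper side is to judge success by the exit status instead of by whether anything was written to stderr. This is a minimal sketch under the assumption that tesseract exits with status 0 in this situation (the report only says it finishes and produces output); `run_and_capture` is a hypothetical helper, not part of any existing wrapper.

```python
import subprocess


def run_and_capture(cmd):
    """Run cmd; fail only on a non-zero exit status and return stderr as text."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Decode leniently: messages like the ones above contain bytes that are
    # not valid UTF-8 and would otherwise raise UnicodeDecodeError.
    stderr_text = proc.stderr.decode("utf-8", errors="replace")
    if proc.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with status {proc.returncode}: {stderr_text}"
        )
    return proc.stdout, stderr_text
```

The returned `stderr_text` can still be logged, so the ParamsModel warnings are not lost.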

Spa, Ara cube data is not working. Throws AccessViolation error

Hi,
I am using Spanish and Arabic trained data in tesseract, but the TesseractEngine statement throws an AccessViolation error.
accessviolationerror

One more request: I have an urgent requirement for accurate text extraction, so we are going to use the TesseractAndCube engine mode. For this engine mode we need Cube.* trained data. Please provide the Cube trained data for other languages such as Portuguese, or kindly give me the steps to train the Cube data file for other languages.

Thanks and Regards,
Merlin

Can't read digits with Arabic traineddata

I'm using tesseract 3.05 for Android. I have one problem with the traineddata files: the Arabic traineddata can't read digits (0123456789). I think the problem is related to LTR and RTL. Any suggestions?

About Persian(fas) Language

Hello,
I am new to this nice project and I need to use it for my language, but the only thing I found for the Persian (fas) language was a single file, and the result after testing was really disappointing. I guess it needs more data or more training.
Would you please help me understand how to make it more accurate?
Thank you.

Indonesian special char

Hi! Thank you so much for this great app!

For the Indonesian data, I think punctuation and special characters (@, !, ., etc.) are not included.

If I don't specify a language, my email address is transcribed perfectly,
but after I specify the Indonesian language, @ is read as Q, and . is not read at all.

How can the traineddata file be fixed or recreated? How can I help?
