tesseract-ocr / tessdoc

Tesseract documentation

Home Page: https://tesseract-ocr.github.io/tessdoc/

HTML 100.00%

tessdoc's People

zuphilip rakeshkumaraka stweil stanislasguinel sejas vinkowei ptichica bartes1973 kasparvonbeelen kmiyazaki-dice saipklvs 1016312364 ammumadhu liuyang1 rajiv2806 tabamotch svenbit mysuithost hyphinwang guoshunhao jasperosy tactlabs mfyproject ricardojra vanyog thomaswang525 monperrus zealzheng mrjj sushil-ds aridaamaliarosa ayakushev1 chaozhang323 ironkage jackcplusplus sahandgithub tasunn nilanjanadas labadimate ompsingh amitdo chenkandy alekyasaladi lukaspal97 gaarahan wangpl504 jay8ee clarkding sshwy thadguidry romcy redbeard3 varun5001 chernloon brothketball adminspx21 admariner ucodai kartset bruceleon0805 cuteofdragon wangrang zyuanfeng1 yibaiershikexin tlff dodn-tint sts0mrg0 rotuna shreeshrii 66i88 red-canoe feitianno3 chldbwls8974 zhangshiguang jroedel evertonoliveira031 syntaxwork boyizhea overgroove mrlisun gerhobbelt gnanaravindhan alexanderp mixintu iexviz scott0123 aftab685 noticeable hunsra tfmorris jravur1308 leotrisciuzzi robyer ximli kensun117 lgztx jpqw celestialized kjarrio tuanha1305

tessdoc's Issues

TrainingTesseract-5.md ScrollView & hallucination 404 link

While reading through the TrainingTesseract-5.md page, I came across a few dead links:

#building-the-training-tools
ScrollView.jar link goes to 404
Maybe supposed to link to https://tesseract-ocr.github.io/tessdoc/ViewerDebugging ?

#debug-interval-and-visual-debugging
Another ScrollView.jar link goes to 404

#the-hallucination-effect
Then read the hallucination topic. link goes to 404
Maybe supposed to link to https://tesseract-ocr.github.io/tessdoc/tess4/The-Hallucination-Effect.html ?

Is it possible to extract only the text area of an image in Tesseract? It's a non-text location.

I need the position of the text.
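
A possible starting point, sketched under assumptions rather than taken from the maintainers: Tesseract can write its results as TSV (for example `tesseract image.png out tsv`), and each row carries `left`, `top`, `width` and `height` columns next to the recognized `text`. The helper name `word_boxes` and the sample rows below are made up for illustration; only the column layout follows the TSV output.

```python
import csv
import io

COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
           "word_num", "left", "top", "width", "height", "conf", "text"]


def word_boxes(tsv_text):
    """Return (text, left, top, width, height) for every word row in TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    boxes = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows describe single words; lower levels are pages, blocks, lines.
        if row.get("level") != "5" or not text:
            continue
        boxes.append((text, int(row["left"]), int(row["top"]),
                      int(row["width"]), int(row["height"])))
    return boxes


# Hypothetical sample in the TSV column layout, not real OCR output.
SAMPLE_TSV = "\n".join([
    "\t".join(COLUMNS),
    "\t".join(["1", "1", "0", "0", "0", "0", "0", "0", "640", "480", "-1", ""]),
    "\t".join(["5", "1", "1", "1", "1", "1", "36", "92", "60", "24", "96", "Hello"]),
    "\t".join(["5", "1", "1", "1", "1", "2", "104", "92", "70", "24", "95", "world"]),
])

for box in word_boxes(SAMPLE_TSV):
    print(box)
```

Regions of the image that produce no word rows would then be the non-text areas.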

Most recent PR broke links

Hi - the most recent PR to Tessdocs broke a number of links in the documentation.

When we navigate to this page and click "User manual", we are no longer directed to the right page and get an error instead.

https://tesseract-ocr.github.io/

I think this has something to do with GitHub Pages.

(Not sure if this is the right place to report this issue but seems right).

Failed to find library "tesseract41.dll" for platform x86

I've tried everything I can think of, but nothing works.
I also installed "C++ 2015 Redistributable (x86) - 14.0.24215" and "C++ 2005 Redistributable".
I tried to register tesseract41.dll with "regsvr32", but the installation failed.

It runs fine on my computer (Windows 10), but on the other one (Windows XP) the above error occurs.

Can I run the training in AMD CPU

I want to run this on an AMD CPU server. Will it run normally? Thank you.

Do CUDA GPUs improve Tesseract 4.x LSTM performance?

Hello,

According to this article, it's possible to significantly improve LSTM performance using CUDA GPUs. Since Tesseract 4.x uses a new LSTM-based core, is it true that it should perform better with CUDA-powered GPUs?

If yes, it would be helpful to have a description (or a note at least) in documentation.

Searchable pdf output: Broken PDF file

The Searchable pdf output section of the Command Line Usage page ends with a PDF file which displays broken (just the file name and sometimes a file icon), whether in Google Chrome 91 or in Mozilla Firefox:

IPv6 usage problem: unable to OCR when accessing with an IPv6 address

Fix, in work.js:
function(t, e, r) {
var n, i;
void 0 === (i = "function" == typeof(n = function() {
return function() {
var t = arguments.length;
if (0 === t) throw new Error("resolveUrl requires at least one argument; got none.");
var e = document.createElement("base");
if (e.href = arguments[0], 1 === t) return e.href;
var r = document.getElementsByTagName("head")[0];
r.insertBefore(e, r.firstChild);
for (var n, i = document.createElement("a"), o = 1; o < t; o++) i.href = arguments[o],
n = i.href,
e.href = n;
return r.removeChild(e),
n
}
}) ? n.call(e, r, e, t) : n) || (t.exports = i)
}
Change it to:
function(t, e, r) {
var n, i;
void 0 === (i = "function" == typeof(n = function() {
return function() {
var t = arguments.length;
if (0 === t) throw new Error("resolveUrl requires at least one argument; got none.");
return arguments[0];
}
}) ? n.call(e, r, e, t) : n) || (t.exports = i)
}
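That is all that is needed. The cause is a problem with how the URL is resolved for IPv6 addresses; accessing IPv6 through a domain name does not have this problem.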

Install on MacOS with homebrew

Hello, I understand you only have Windows and Linux, but since macOS is Linux based and has repos on Homebrew, I installed version 5.1.0 with Homebrew using the command brew install tesseract-lang, and I got the message: This formula contains only the "eng", "osd", and "snum" language data files. If you need any other supported languages, run brew install tesseract-lang. Even though it works fine for English, I cannot install the other languages. Would anyone know the command line to install all the languages? The one I used clearly does not install them all, although it works perfectly for English. Many thanks!

Transferring trained data from Windows to Linux

Is it possible to transfer the learning data from Windows to Linux and if so, is it enough to just copy the files? I tried to find something about this in the documentation, but unfortunately it is not to be found or I am blind.

Can not retrain Tesseract after editing boxes

I have implemented the following things:
Manually adjusted the boxes of characters generated by tesseract. Generated the tesseract training file. Generated the unicharset file. But when I am using the 'shapeclustering' command to generate shape table it is just stuck. The command it is stuck on:
shapeclustering.exe -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

Tried with following Tesseract versions:
tesseract-ocr-w32-setup-v4.1.0.20190314
tesseract-ocr-w64-setup-v4.1.0.20190314
tesseract-ocr-w32-setup-v4.0.0.20181030
tesseract-ocr-w64-setup-v4.1.0-elag2019

The unicharset command does nothing with the 64 bit versions of tesseract. And the other commands generate a file named 'unicharset' while it should generate a file named 'font_name.unicharset' according to these tutorials:
https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6
https://stackoverflow.com/questions/55036633/how-to-create-traineddata-file-for-tesseract-4-1-0

It is stuck doing nothing as shown in the following image:

If you need any more information then please do let me know. I have tried different approaches and searching for the solution of the problem but could not find anything useful.

Kind Regards,
Ajwad

[Documentation] Recommend one or two GUIs for people new to tesseract

Heya. I use mostly linux, but oddly enough I have had great results via tesseract on windows if I remember correctly.

I have some old documents (semi-old, paper print out only, office bills and such) that I have to scan. I scanned
quite a lot already. The next step is to OCR them (they are in german).

I was thinking of using tesseract again (been some months...) but I think I prefer a GUI. The documentation
refers to this link:

https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html

However there are about 20 entries.

Would it be possible to recommend one or two? It does not matter on which preference this happens, just
any recommendation by any tesseract dev may be useful. I refer mostly to a simple GUI that works
and produces the desired results. It does not have to be perfect but it should work. Right now I have
to pick among 20 entries without really knowing which one to prefer or at the least try first. So perhaps
1 or 2 could be mentioned briefly e. g. "for simple GUI, try abc or def" - something like that to the
documentation README.

OCR not working with zoomed in windows

I have a 1440p laptop, and in settings I have changed the zoom level to 175%. Unfortunately, this means that when I try to use Tesseract OCR, the image capture reference is offset from the selector. I would really appreciate help with this, as I am a novice to C# and this program captured my interest.

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Programing Languages\\Python\\Tesseract-OCR\\eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'D:\Programing Languages\Python\Tesseract-OCR\tesseract.exe'
from PIL import Image

img = Image.open('textHand2.png')
text = pytesseract.image_to_string(img)

print(text)

error:

return {
File "D:\Programing Languages\Python\lib\site-packages\pytesseract\pytesseract.py", line 357, in <lambda>
Output.STRING: lambda: run_and_get_output(*args),
File "D:\Programing Languages\Python\lib\site-packages\pytesseract\pytesseract.py", line 266, in run_and_get_output
run_tesseract(**kwargs)
File "D:\Programing Languages\Python\lib\site-packages\pytesseract\pytesseract.py", line 242, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \Programing Languages\Python\Tesseract-OCR\eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')
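
The message shows Tesseract looking for eng.traineddata directly inside the Tesseract-OCR folder, so one thing worth checking is which directory actually contains that file. Below is a minimal sketch, assuming the language files live in a `tessdata` subfolder of the install (that location is a guess, and the helper names `find_tessdata_dir` and `configure_tessdata` are hypothetical). It sets `TESSDATA_PREFIX` for the current process only.

```python
import os
from pathlib import Path


def find_tessdata_dir(candidates, lang="eng"):
    """Return the first candidate directory containing <lang>.traineddata, else None."""
    for candidate in candidates:
        directory = Path(candidate)
        if (directory / f"{lang}.traineddata").is_file():
            return directory
    return None


def configure_tessdata(candidates, lang="eng"):
    """Point TESSDATA_PREFIX at a directory that really holds the language file."""
    found = find_tessdata_dir(candidates, lang)
    if found is None:
        raise FileNotFoundError(f"{lang}.traineddata not found in: {list(candidates)}")
    # Child processes started afterwards inherit this environment variable.
    os.environ["TESSDATA_PREFIX"] = str(found)
    return found
```

It would be called before `pytesseract.image_to_string`, for example with `configure_tessdata([r"D:\Programing Languages\Python\Tesseract-OCR\tessdata"])`, where that path is assumed from the traceback above.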

can not build training tools

You can not build training tools because of missing dependency.
Check configure output for details.

Instructions for building on MacOS

it might be a good idea to add this to the instructions to save someone ten minutes on stack exchange

Unable to tesseract text from cropped image

Here we have cropped an image from the main source. I am not able to extract the text and numbers properly. I have checked with Tesseract 5 and 4, increased the image resolution, and applied pre-processing steps, but I cannot get an accurate OCR result (inst.no and off rec 0719 page 0530 are not extracted).

simple question

Hello Tesseract !

Wow what a pkg you made, it is so awesome.

I looked at the documentation and at various links and FAQs, but did not find answers to the following:

What is the list of functions in this package?
What is the bbox output actually telling me? (My guess is the location of each word.)
Are there examples of combining the output with NLP functions or packages?

Thank you.

Tabular Data - Read Column By Column

Is there any way to leverage Tesseract for table extraction? Below is an example of a table I would need to extract Chinese Characters (Simplified) from.

Error on install

Hello ,

I am trying to install tesseract-ocr on Debian 9 with gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516.

I run with the below order, according to this https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling-%E2%80%93-GitInstallation.md
./autogen.sh
./configure
make

Now when I run make I get the following error:

root@genius1062:/usr/local/tesseract# make
Making all in .
make[1]: Entering directory '/usr/local/tesseract'
CXX src/api/libtesseract_la-baseapi.lo
In file included from ./src/ccutil/elst.h:24:0,
from ./src/ccstruct/points.h:22,
from ./src/ccstruct/rect.h:22,
from ./src/ccstruct/boxword.h:22,
from src/api/baseapi.cpp:26:
./src/ccutil/serialis.h: In member function ‘bool tesseract::TFile::DeSerialize(std::vector&)’:
./src/ccutil/serialis.h:107:17: error: expected primary-expression before ‘constexpr’
} else if ( constexpr (std::is_same_v<T, std::string>)) {
^~~~~~~~~
./src/ccutil/serialis.h:107:17: error: expected ‘)’ before ‘constexpr’
./src/ccutil/serialis.h:116:15: error: expected ‘(’ before ‘constexpr’
} else if constexpr (std::is_class_v) {
^~~~~~~~~
./src/ccutil/serialis.h:151:3: error: expected ‘}’ at end of input
}
^
./src/ccutil/serialis.h: In member function ‘bool tesseract::TFile::Serialize(const std::vector&)’:
./src/ccutil/serialis.h:166:15: error: expected ‘(’ before ‘constexpr’
} else if constexpr (std::is_same_v<T, std::string>) {
^~~~~~~~~
./src/ccutil/serialis.h:198:3: error: expected ‘}’ at end of input
}
^
In file included from ./src/ccutil/genericvector.h:22:0,
from ./src/ccstruct/fontinfo.h:25,
from ./src/ccstruct/ratngs.h:29,
from ./src/dict/dawg.h:33,
from ./src/dict/dawg_cache.h:23,
from src/api/baseapi.cpp:28:
./src/ccutil/helpers.h: In function ‘bool tesseract::Serialize(FILE*, const std::vector&)’:
./src/ccutil/helpers.h:252:13: error: expected ‘(’ before ‘constexpr’
} else if constexpr (std::is_class_v) {
^~~~~~~~~
src/api/baseapi.cpp:2380:1: error: expected ‘}’ at end of input
} // namespace tesseract
^
src/api/baseapi.cpp: At global scope:
src/api/baseapi.cpp:2380:1: error: expected ‘}’ at end of input
Makefile:4696: recipe for target 'src/api/libtesseract_la-baseapi.lo' failed
make[1]: *** [src/api/libtesseract_la-baseapi.lo] Error 1
make[1]: Leaving directory '/usr/local/tesseract'
Makefile:7801: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1

Debian repository notesalexp.org down?

Our container builds are failing to install tesseract - and it looks like notesalexp.org site is down - is anyone aware of an alternative package repository?

Inconsistency in APIExample

Moved from tesseract-ocr/tesseract#2832.

Current Behavior:

On the wiki page APIExample, the Python example passes the api handle to the TessBaseAPIDelete function if the api fails to initialize, whereas the C example does not.

python:

rc = tesseract.TessBaseAPIInit3(api, TESSDATA_PREFIX, lang)
if (rc):
    tesseract.TessBaseAPIDelete(api)
    print("Could not initialize tesseract.\n")
    exit(3)

C:

if(TessBaseAPIInit3(handle, NULL, "eng") != 0)
    die("Error initializing tesseract\n");

Expected Behavior:

Either the python example doesn't destroy the api handle or the api handle is destroyed in the c example.

Suggested Fix:

Change either of the examples so they both do the same thing. I do not know tesseract enough to be able to tell which is the correct one. However, I feel like the correct one would be to add TessBaseAPIDelete to the C example as it seems to be an omission.

Prerequisites and configuration needed to do OCR

We are using Tesseract 4.0.0.
When we run OCR through the Linux command "tesseract pan.jpg stdout", we get good results. But when we integrate Tesseract in a Java application, it does not give proper results. The same project works fine on a Windows machine. We have already set the TESSDATA_PREFIX environment variable, and both environments have only the latest eng.traineddata. Please find the sample code below.

try {
    Tesseract instance = new Tesseract();
    instance.setDatapath("/usr/share/tesseract/");
    File file = new File("/home/projectr/pan.jpg");
    instance.setLanguage("eng");

    String result = instance.doOCR(file);
    System.out.println(result);
} catch (Exception e) {
    e.printStackTrace();
}

If possible, please send a sample Java project that runs in a Linux environment, along with the prerequisites for the Linux machine and any changes needed in config files.

We are using Linux version 3.10.0-693.el7.x86_64.

Below are the Tesseract version details on the Linux machine.
tesseract 4.0.0
leptonica-1.77.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7

Missing example images in ImproveQuality

The Dilation and Erosion section seems very self explanatory, I believe I understand the principle and its practice. It also lists a before-and-after image set that is not in the repo.

Can these images be added to the repo and properly integrated into the documentation? I think they'd just be more helpful.

[typo] url is inaccessible

https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the `tessedit_char_whitelist` [configuration variable](ControlParams). See the [FAQ for an example](FAQ-Old#how-do-i-recognize-only-digits).

ControlParams cannot be accessed.

cannot compile tesseract 5.0 on mac os 10.14 (mojave)

I was able to go through all of the steps installing tesseract 5.0 with Homebrew as listed here

Packages which are always needed.

brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica

Packages required for training tools.

brew install pango

Optional packages for extra features.

brew install libarchive

Optional package for builds using g++.

brew install gcc

I then completed the following steps:

git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
mkdir build
cd build

I then ran into problems:

(venv3) Admins-MacBook-Pro-4:build kylefoley$ sudo make install
make: *** No rule to make target `install'. Stop.
(venv3) Admins-MacBook-Pro-4:build kylefoley$ make training
make: *** No rule to make target `training'. Stop.
(venv3) Admins-MacBook-Pro-4:build kylefoley$ sudo make training-install
make: *** No rule to make target `training-install'. Stop.
(venv3) Admins-MacBook-Pro-4:build kylefoley$ ../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... configure: error: in `/volumes/googledrive/my drive/laptop/documents/pcode/tesseract/build':
configure: error: cannot run C++ compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details
(venv3) Admins-MacBook-Pro-4:build kylefoley$ configure --disable-shared 'CXXFLAGS=-g -O2 -Wall' PKG_CONFIG_PATH=$(brew --prefix)/opt/icu4c/lib/pkgconfig:$(brew --prefix)/opt/libarchive/lib/pkgconfig:$(brew --prefix)/Library/Homebrew/os/mac/pkgconfig/11
-bash: configure: command not found
(venv3) Admins-MacBook-Pro-4:build kylefoley$ ../configure --disable-shared 'CXXFLAGS=-g -O2 -Wall' PKG_CONFIG_PATH=$(brew --prefix)/opt/icu4c/lib/pkgconfig:$(brew --prefix)/opt/libarchive/lib/pkgconfig:$(brew --prefix)/Library/Homebrew/os/mac/pkgconfig/11
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... configure: error: in `/volumes/googledrive/my drive/laptop/documents/pcode/tesseract/build':
configure: error: cannot run C++ compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details

In the above I used a different order but that was after I had tried them in the recommended order. I also tried the solution mentioned here to no avail:

#65

TesseractStudio.Net is not available anymore

I suggest removing TesseractStudio.Net from "User-Projects-–-3rdParty.md" because the project seems abandoned and no installers, executables, or source code can be found anywhere.

Unlisted GUI

https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html
I have one that does support tesseract
https://github.com/GM-Script-Writer-62850/PHP-Scanner-Server/issues
In this project tesseract is used to convert scanned documents to text files

android support for 4.0

I have been trying to use the Tesseract 4.0 LSTM and OEM version on Android, but I cannot find out whether the LSTM version is available on Android. If it is, can I get a suggested repo link?

Thank you

Provide more information on LANG_TYPE in the documentation

Is there a lookup table of LANG_TYPE for all the languages that Tesseract supports?

Compilation guide for various platforms - Windows - Build training tools

The Windows instructions under 'Build Training Tools' do not work:

http://tesseract-ocr.github.io/tessdoc/Compiling.html#windows

I think it is due to the https://cppan.org website/network being discontinued, and being replaced with Software Network / SW.

How can I obtain the Training Tools for windows (32-bit and 64-bit) ? (Better build them myself).

Also can the documentation be updated for this issue ?

-l Script Usage example

Hello,
There doesn't seem to be an example on how to use the -l script option in tesseract.

I have referred to this, and it mentions that we can specify different scripts.

Actually, I have an image in the Fulani language, which uses Latin script.
I tried this:
tesseract fulani.tiff stdout -l latin
But that seems to fail with this error:

G:\GitHub\Things to copyin\qdata\to clean and add>tesseract fulani.tiff stdout -l latin
Error opening data file C:\Program Files\Tesseract-OCR/tessdata/latin.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'latin'
Tesseract couldn't load any languages!
Could not initialize tesseract.

I am not sure what's the right way to use the -l script option, if you could provide me an example that would be great help.

Thanks
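
A note on what the error above suggests: the value given to -l is turned into a file name under the tessdata directory, so `-l latin` looks for latin.traineddata. As far as I can tell, the script models in the tessdata repositories are shipped as files such as script/Latin.traineddata, so the value would be `script/Latin`, provided that file is installed. That install layout is an assumption, and the helper `expected_traineddata` below is hypothetical; it only illustrates the name-to-path mapping.

```python
from pathlib import PurePosixPath


def expected_traineddata(tessdata_dir, lang_option):
    """List the files implied by a -l value such as 'eng+script/Latin'."""
    base = PurePosixPath(tessdata_dir)
    # Several models can be combined with '+', one traineddata file per name.
    return [str(base / f"{name}.traineddata")
            for name in lang_option.split("+") if name]


print(expected_traineddata("tessdata", "latin"))
print(expected_traineddata("tessdata", "script/Latin"))
```

Running `tesseract --list-langs` shows which names are available in the installed tessdata directory.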

error in make

Makefile:1930: recipe for target 'libtesseract.la' failed
make[2]: *** [libtesseract.la] Error 1
make[2]: Leaving directory '/home/get/tesseract'
Makefile:4121: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/get/tesseract'
Makefile:1356: recipe for target 'all' failed
make: *** [all] Error 2

Broken links in docs

I was looking at the docs page https://tesseract-ocr.github.io/tessdoc/ and there is broken link:

External Projects / AddOns navigates to https://tesseract-ocr.github.io/AddOns.md, which doesn't exist. It should be https://tesseract-ocr.github.io/tessdoc/AddOns.html (change to .html and add the missing tessdoc part).

Similar situation is on the AddOns page on first link:

User Projects - 3rd Party navigates to https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.md but it should be https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html (change to .html)

I haven't checked other pages, so it's possible there are similar issues with these *.md instead of *.html
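
To check other pages for the same pattern, the rewrite described above can be sketched as a small function. This is only an illustration of the two corrections reported here (.md to .html, and the missing /tessdoc/ prefix); the function name `fix_doc_link` and its `site_prefix` parameter are hypothetical, not part of the site's build.

```python
from urllib.parse import urlsplit, urlunsplit


def fix_doc_link(url, site_prefix="/tessdoc/"):
    """Rewrite a published docs link: .md becomes .html, and the site prefix is added."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith(".md"):
        path = path[:-3] + ".html"
    if not path.startswith(site_prefix):
        path = site_prefix + path.lstrip("/")
    return urlunsplit(parts._replace(path=path))


print(fix_doc_link("https://tesseract-ocr.github.io/AddOns.md"))
```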

How to question: OCR only part of scanned image

Hello

I have only started using tesseract with ocrmypdf.

I issue a command like this ocrmypdf input_pdf_or_image output_pdf

This is not a ocrmypdf question.

Question
Is there any way to mask or draw a bounding box around part of a scanned image so that Tesseract interprets it as a region to ignore? Some of my scanned images have graphics or tables in them that get OCR'd, and although that is appreciated, I would rather exclude these because the resulting PDF will contain selectable text, which is unwanted.

Any ideas or potential solutions would be very much appreciated as I have been trying to find workarounds for days now.

➜ ~ tesseract --version
tesseract 4.1.1
leptonica-1.80.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
Found AVX
Found SSE

Kind regards
—Alex

language selection not available and error while processing

There are two issues I am facing.
First: I cannot select language for ocr (check image 1)
Second: After clicking OK button I am getting an error. (check image 2)
I think second issue is occurring due to the first issue. Am not sure.
I need your assistance.
Please help me.
I have attached both the images.

Creating training data using tesstrain.sh

It is not clear, when creating training data using tesstrain.sh for the LSTM model,
whether I should use --langdata_dir langdata_lstm or --langdata_dir langdata.

It affects which eng.training_text file will be used to generate the training data.

Which should I use?

Can not generate .box file following the docs.

hi,

I was trying to train a new model using tesseract using my own image and labeled txt.
I have been following this part of the tesseract-4.0 doc:
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#making-box-files

It says:

For example, tesseract image.png image lstmbox will generate a box file with name image.box for the image in the current directory.

However, it failed like:


➜  data git:(master) tesseract -v
tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE
➜  data git:(master) tesseract 2020-03-14-02-41-41-captcha.png 2020-03-14-02-41-41-captcha lstmbox
read_params_file: Can't open lstmbox
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
➜  data git:(master) ✗ ls
2020-03-14-02-41-41-captcha.png  2020-03-14-02-43-05-captcha.png  2020-03-14-02-43-23-result.txt
2020-03-14-02-41-41-captcha.txt  2020-03-14-02-43-05-result.txt   2020-03-14-02-43-39-captcha.png
2020-03-14-02-41-41-result.txt   2020-03-14-02-43-23-captcha.png  2020-03-14-02-43-39-result.txt

On the other hand, I found the following script:
https://github.com/tesseract-ocr/tesstrain/blob/master/generate_line_box.py

which works well with

python3 tesstrain/generate_line_box.py --image=data/2020-03-14-02-41-41-captcha.png --txt=data/2020-03-14-02-41-41-captcha.txt > data/2020-03-14-02-41-41-captcha.box

Maybe the current doc about how to train tesseract is outdated now?

can not update tesseract-ocr-devel/ubuntu.

I tried to install and update tesseract-ocr-devel/ubuntu, but it failed.
Is anyone aware of an alternative package repository?

sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel

sudo apt update

Ign:4 http://ppa.launchpad.net/alex-p/tesseract-ocr-devel/ubuntu kinetic InRelease
Err:5 http://ppa.launchpad.net/alex-p/tesseract-ocr-devel/ubuntu kinetic Release
  404  Not Found [IP: 185.125.190.52 80]
Reading package lists... Done
E: The repository 'http://ppa.launchpad.net/alex-p/tesseract-ocr-devel/ubuntu kinetic Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Add normcap to 3rd party GUI list

Add normcap to 3rd party GUI list.

OCR powered screen-capture tool to capture information instead of images.

https://github.com/dynobo/normcap

Documentation/Link from here to Ubuntu PPA only points to development release PPA

Following the documentation and links here for a Ubuntu binary only leads to, https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel?field.series_filter=jammy which appears to be the PPA for development releases and makes no mention of, https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5?field.series_filter=jammy which appears to be the stable release PPA.

I would suggest that both PPAs be mentioned in the documentation.

tessdoc FAQ - table of contents out of date

On this page, some of the sections below don't have an appropriate Table of Contents link. For instance, if you ctrl+f 'improve ocr results', you can note it has a section but the table of contents doesn't have the appropriate anchor hyperlink.

404 for tessdoc

Environment

Tesseract Version: Latest version of tessdoc on 2021-09-13
Commit Number: NA
Platform: Win10, Chrome v91.0.4472.164 (Official Build) (64-bit)

Current Behavior:

404 error on Tesseract documentation site (https://tesseract-ocr.github.io/tessdoc/)

Expected Behavior:

The Tesseract documentation site shows detailed information about Tesseract.

Suggested Fix:

About fonts: do I need to generate a training dataset for each font and train them together, or can I train folder by folder for each font?

Clarify language support quality status

The README.md says tesseract "supports over 100 languages out of the box". But - which languages? And what quality is the support for different languages known to be, out of the box?

It would be helpful if a separate file (or wiki page) would detail, to the extent possible, this information.

Could macOS compile tesseract 5.1.0?

I hope to use the LSTM files to train on my data, but it seems there are only tutorials for Linux.
Could macOS compile tesseract 5.1.0?

Looking to extract data by coordinates

Hello,

I was wondering if someone can point me in the right direction. I want to extract data based on coordinates like
$fields = [
'field1'=>[0,0,100,50],
'fieldn' =>[x1,y1,x2,y2]
];

Thank you
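
One possible direction, offered as a sketch rather than a confirmed recipe: Tesseract's API can restrict recognition to a rectangle given as left, top, width, height (`SetRectangle` in the C++ API), so corner pairs like the ones above first need converting. The field names and the helper `to_rectangle` are hypothetical, and the example assumes each field is (x1, y1, x2, y2) in pixels.

```python
def to_rectangle(box):
    """Convert (x1, y1, x2, y2) corners into (left, top, width, height)."""
    x1, y1, x2, y2 = box
    # Sorting makes the result independent of which corner was listed first.
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return left, top, right - left, bottom - top


# Hypothetical field definitions in the same shape as the question.
fields = {"field1": (0, 0, 100, 50), "field2": (120, 40, 20, 90)}
rectangles = {name: to_rectangle(box) for name, box in fields.items()}

for name, rect in rectangles.items():
    print(name, rect)
```

Each rectangle could then be passed to the API before recognizing, or used to crop the image and OCR each crop separately.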

Dead Link

Hello there,

Just noticed that the link is dead in this sentence: "Please read the Implementation introduction before delving too deeply into the training process, and the same note as for training Tesseract 3.04 applies:" (Under the ReadMe section, "Before You Start").

This is the current link: https://github.com/tesseract-ocr/tessdoc/blob/main/tess4/NeuralNetsInTesseract4.00

And I believe it should be this: https://github.com/tesseract-ocr/tessdoc/blob/main/tess4/NeuralNetsInTesseract4.00.md

Document the hacks to build training tools on mac

Environment

Tesseract Version: 5.1.0
Platform: Mac

Current Behavior:

The documentation here https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#hardware-software-requirements says that

"At time of writing, training only works on Linux. (macOS almost works; it requires minor hacks to the shell scripts to account for the older version of bash it provides and differences in mktemp.)"

Expected Behavior:

It would be beneficial for macOS users to document these minor hacks to help them build the training tools on mac.

Suggested Fix:

Document the hacks to build training tools on mac
