gregjurman / tesserwrap
Python bindings to the Tesseract API
Home Page: https://tesserwrap.readthedocs.org/en/latest/
License: Other
Tesserwrap - Basic Tesseract API Wrapper for Python
Tesserwrap is a project that provides simple bindings to Tesseract's API, so the application does not have to be executed manually each time.
Docs: https://tesserwrap.readthedocs.org/en/latest/
IRC: #tesserwrap on Freenode
➜ tesserwrap git:(zxdev) python setup.py install
ld: library not found for -ltesseract_api
running install
Checking .pth file support in /Library/Python/2.7/site-packages/
error: can't create or remove files in install directory
The following error occurred while trying to add or remove files in the
installation directory:
[Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-13493.pth'
The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:
/Library/Python/2.7/site-packages/
Perhaps your account does not have write access to this directory? If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account. If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.
For information on other options, you may wish to consult the
documentation at:
https://pythonhosted.org/setuptools/easy_install.html
Please make the appropriate changes for your system and try again.
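The permission error above is separate from the missing-library message: the installer is trying to write into the system-wide site-packages directory. A minimal sketch, using only the standard library, that reports whether that directory is writable and where a per-user install (for example via the installer's `--user` option) would go instead. The function name `install_target_report` is made up for illustration.

```python
import os
import site
import sysconfig


def install_target_report():
    """Return (system_site_dir, is_writable, user_site_dir)."""
    # Directory that a plain "setup.py install" would normally target.
    system_dir = sysconfig.get_paths()["purelib"]
    # True only if the current account may create files there.
    writable = os.access(system_dir, os.W_OK)
    # Per-user site-packages directory, which needs no administrator rights.
    user_dir = site.getusersitepackages()
    return system_dir, writable, user_dir


if __name__ == "__main__":
    system_dir, writable, user_dir = install_target_report()
    print("site-packages:", system_dir, "(writable)" if writable else "(not writable)")
    print("per-user alternative:", user_dir)
```

If the first directory is reported as not writable, installing into the per-user directory or into a virtual environment avoids the need for root.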
I printed out the command and have been trying variations of it, but haven't found what tesseract is named on Debian. Will submit a pull request if I can find it and fix it.
Howdy,
I'm currently working on an OCR-PDF solution for visually impaired people.
https://github.com/chrys87/ocrpdf
It is still at an early stage, but while working on it I noticed that set_page_seg_mode is not respected correctly.
I have a multi-column layout here as an example:
https://crivatec.de/page/uploads/images/ocrTransformed1.png
I tried the following:
tess = tesseract("/usr/share", self._languageCode) #language is "deu" in this case
tess.set_page_seg_mode(tesserwrap.PageSegMode.PSM_AUTO )
print( tess.get_page_seg_mode() ) #prints 3
self._OCRText[Page_p] = tess.ocr_image(self._modifiedImg[Page_p]) # the image above as pillow image
It seems that the mode is set correctly, because the print gives the correct value, but OCR still proceeds as PSM_SINGLE_BLOCK (6),
so the columns are not recognized.
If I run tesseract from the command line:
tesseract ocrTransformed1.png ocrTransformed1 -l deu -psm 3
it works great. The result is much better: the correct PSM is used and the columns are recognized.
Could you take a look into this?
I'm running a current Arch Linux with the latest tesseract, pillow and python.
It seems that years ago a similar problem existed in tesseract itself:
https://code.google.com/p/tesseract-ocr/issues/detail?id=394
By the way, I really enjoy your Python API. Damn cool stuff :).
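Until the binding honours the segmentation mode, one possible stopgap is to call the command-line program that is reported to work. The sketch below only wraps the command quoted above; the helper names `build_tesseract_command` and `ocr_via_cli` are hypothetical, and it assumes the `tesseract` executable is on PATH and writes its result to `<outputbase>.txt`.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="deu", psm=3):
    """Build the argument list for the CLI call quoted in the report."""
    # "-psm" matches the invocation in the report; newer Tesseract
    # releases spell this flag "--psm".
    return ["tesseract", image_path, output_base, "-l", lang, "-psm", str(psm)]


def ocr_via_cli(image_path, output_base, lang="deu", psm=3):
    """Run the tesseract CLI and return the recognized text."""
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang, psm),
        check=True,  # raise CalledProcessError if tesseract fails
    )
    # The CLI writes plain-text output next to the given output base.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

This gives up the main benefit of the wrapper (no process launch per image), so it is only a workaround while the PSM handling is investigated.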
pip install tesserwrap
Collecting tesserwrap
Downloading tesserwrap-0.1.6.tar.gz
Complete output from command python setup.py egg_info:
ld: library not found for -lcrt1.o
ld: library not found for -lcrt1.o
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/yp/mhyppj_15z15cl2bf1_cn7jm0000gn/T/pip-build-3opbjwz5/tesserwrap/setup.py", line 45, in <module>
extra_lib_paths)
File "/private/var/folders/yp/mhyppj_15z15cl2bf1_cn7jm0000gn/T/pip-build-3opbjwz5/tesserwrap/setup.py", line 30, in find_closest_libname
"Cannot find Tesseract via ldconfig, confirm it is installed.")
Exception: Cannot find Tesseract via ldconfig, confirm it is installed.
I installed tesseract with brew, and it does appear to be installed:
tesseract
Usage:
tesseract --help | --help-psm | --help-oem | --version
tesseract --list-langs [--tessdata-dir PATH]
tesseract --print-parameters [options...] [configfile...]
tesseract imagename|stdin outputbase|stdout [options...] [configfile...]
OCR options:
--tessdata-dir PATH Specify the location of tessdata path.
--user-words PATH Specify the location of user words file.
--user-patterns PATH Specify the location of user patterns file.
-l LANG[+LANG] Specify language(s) used for OCR.
-c VAR=VALUE Set value for config variables.
Multiple -c arguments are allowed.
--psm NUM Specify page segmentation mode.
--oem NUM Specify OCR Engine mode.
NOTE: These options must occur before any configfile.
...
Some light digging on the internet and uninstalling/reinstalling tesseract have turned up nothing. Any idea what's going on here?
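A working `tesseract` command only shows that the executable is installed; the build needs the shared library. A minimal sketch for checking whether Python itself can locate that library, using `ctypes.util.find_library`. The candidate names are assumptions taken from the error messages on this page (`tesseract`, `tesseract_api`), and a `None` result does not prove the library is absent, only that it is not on the default search path.

```python
from ctypes.util import find_library


def locate_tesseract_library(candidates=("tesseract", "tesseract_api")):
    """Return (name, path) for the first candidate library found, else None."""
    for name in candidates:
        # find_library takes the bare name, without "lib" prefix or suffix.
        path = find_library(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    print(locate_tesseract_library())
```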
This is with Python 2.7.3 on a standard Ubuntu 12.04:
In [1]: from tesserwrap import Tesseract
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/home/baeuml/<ipython-input-1-b91b678fa26c> in <module>()
----> 1 from tesserwrap import Tesseract
/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/__init__.py in <module>()
----> 1 from .core import tr
2 from ctypes import c_ulonglong, byref
3 import sys
4 import warnings
5
/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/core.py in <module>()
68
69
---> 70 tr = load_library('libtesserwrap', os.path.dirname(__file__))
71
72
/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/core.py in load_library(libname, loader_path)
45
46 """
---> 47 so_ext = get_shared_lib_extension()
48 libname_ext = [libname + so_ext]
49 if sys.version[:3] >= '3.2':
/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/core.py in get_shared_lib_extension(is_python_ext)
28
29 """
---> 30 so_ext = distutils.sysconfig.get_config_var('SO') or ''
31 # fix long extension for Python >=3.2, see PEP 3149.
32 if not is_python_ext and 'SOABI' in distutils.sysconfig.get_config_vars():
AttributeError: 'module' object has no attribute 'sysconfig'
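The traceback is consistent with `core.py` having imported only the `distutils` package: a plain `import distutils` does not load the `distutils.sysconfig` submodule, so the attribute lookup fails unless something else imported it first. Importing the submodule explicitly is the usual fix. As an alternative, here is a sketch that uses the standard `sysconfig` module instead; it is a simplified stand-in, not the project's actual function, and it returns the suffix for Python extension modules, which can be longer than a plain `.so`.

```python
import sysconfig


def get_shared_lib_extension():
    """Return the file suffix Python uses for extension modules."""
    # EXT_SUFFIX is set on Python 3; 'SO' is the older Python 2.7 name.
    return (
        sysconfig.get_config_var("EXT_SUFFIX")
        or sysconfig.get_config_var("SO")
        or ""
    )


if __name__ == "__main__":
    print(get_shared_lib_extension())
```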
Installed tesseract:
C:\Python34\Scripts>pip install tesseract
Collecting tesseract
Downloading https://files.pythonhosted.org/packages/8d/b7/c4fae9af5842f69d9c45bf1195a94aec090628535c102894552a7a7dbe6c/tesseract-0.1.3.tar.gz (45.6MB)
100% |################################| 45.6MB 79kB/s
Installing collected packages: tesseract
Running setup.py install for tesseract ... done
Successfully installed tesseract-0.1.3
Then, trying to install tesserwrap I get this error:
C:\Python34\Scripts>pip install tesserwrap
Collecting tesserwrap
Downloading https://files.pythonhosted.org/packages/04/92/4c2134fc465d576c05d4426bc2f1ba7871652d78d3d913bec0bffe0afe8b/tesserwrap-0.1.6.tar.gz
Complete output from command python setup.py egg_info:
'ld' is not recognized as an internal or external command,
operable program or batch file.
'ld' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\XXN\AppData\Local\Temp\pip-install-ep4an8qi\tesserwrap\setup.py", line 45, in <module>
extra_lib_paths)
File "C:\Users\XXN\AppData\Local\Temp\pip-install-ep4an8qi\tesserwrap\setup.py", line 30, in find_closest_libname
"Cannot find Tesseract via ldconfig, confirm it is installed.")
Exception: Cannot find Tesseract via ldconfig, confirm it is installed.
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\XXN\AppData\Local\Temp\pip-install-ep4an8qi\tesserwrap\
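The log shows the setup script shelling out to `ld`, which a stock Windows command prompt does not provide, so the library search fails before Tesseract is ever looked for. A small sketch of a pre-flight check that makes this explicit; the function names are hypothetical and nothing here is part of tesserwrap's `setup.py`.

```python
import platform
import shutil


def linker_probe_available():
    """True when an 'ld' executable can be found on PATH."""
    return shutil.which("ld") is not None


def describe_build_environment():
    """Explain whether an ld-based library search can run at all."""
    if not linker_probe_available():
        return "no 'ld' on PATH (%s); an ld-based library search cannot run" % (
            platform.system(),
        )
    return "'ld' found at %s" % shutil.which("ld")


if __name__ == "__main__":
    print(describe_build_environment())
```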
A unit test should be devised that tests the bindings as well as confirms Tesseract is passing data properly into python.
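One possible shape for such a test, as a sketch only. It assumes the `Tesseract()` constructor and `ocr_image()` method that appear in other reports on this page, assumes Pillow for building the test image, and assumes that `ocr_image()` returns text (`str`) on Python 3. The test is skipped when either package is missing.

```python
import importlib.util
import unittest

# Only run the smoke test when both packages can be imported.
HAVE_DEPS = all(
    importlib.util.find_spec(name) is not None for name in ("tesserwrap", "PIL")
)


@unittest.skipUnless(HAVE_DEPS, "tesserwrap and Pillow are required")
class BindingSmokeTest(unittest.TestCase):
    def test_ocr_returns_text(self):
        from PIL import Image
        from tesserwrap import Tesseract

        # A blank white greyscale image: OCR should finish without
        # crashing and hand back a text value, even if it is empty.
        img = Image.new("L", (200, 50), 255)
        tr = Tesseract()
        result = tr.ocr_image(img)
        self.assertIsInstance(result, str)
```

Saved as a module, this can be run with `python -m unittest`. A stronger test would render known text into the image and compare the OCR output, at the cost of depending on installed language data.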
Look at the feasibility of moving the bindings to the native python.h API rather than relying on Boost::Python.
When a new image is passed into the binding the old image is not destroyed. I have tried:
// In class def:
unsigned char* picture;
// In setimage()
if (picture != NULL) {
delete [] picture;
picture = NULL;
}
However the application segfaults when that line of code is in the function. I must be doing something wrong or need to do something non-standard.
Original title: Not an issue, but can you let me know how to make the lib work on Windows?
I skimmed through the code and am confident that it works well: it doesn't add extra errors, since it acts as a perfect wrapper for the original tesseract.
Thanks.
I get the following error when I try to install tesserwrap using the python setup.py install
command. How do I fix this?
gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -O2 -DNDEBUG -g -O3 -I/usr/local/include -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c tesserwrap/cpp/tesseract_ext.cpp -o build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_ext.o
gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -O2 -DNDEBUG -g -O3 -I/usr/local/include -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c tesserwrap/cpp/tesseract_wrap.cpp -o build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_wrap.o
c++ -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_ext.o build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_wrap.o -L/usr/local/lib -ltesseract-ocr -o build/lib.macosx-10.6-intel-2.7/libtesserwrap.so
ld: library not found for -ltesseract-ocr
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'c++' failed with exit status 1
Tesserwrap used to work fine for me until that fateful day I updated my Tesseract from 3.02 to 3.03, just because an ALPR package recommended 3.03 and I also wanted to make use of the training materials.
Upon calling:
tr = Tesseract()
tr.ocr_image(img)
or get_text() or get_utf8_text(), tesserwrap calls the __del__ function in __init__.py. When self.handle and core
are true, it calls tr.Tesserwrap_Destroy(self.handle),
which simply exits the program with no errors logged (I know this from debugging).
All code after that function call never runs, because the main program has exited.
I commented out tr.Tesserwrap_Destroy(self.handle)
and now it works fine again.
I can use tesserwrap on Tesseract 3.03, no worries here.
However, I don't believe that hack is the best solution.
Change /usr/local/lib/python2.7/dist-packages/tesserwrap/__init__.py, line 93:
--- return self.get_text().decode()
+++ return self.get_text().decode(encoding='UTF-8')
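The reason the explicit encoding matters: on Python 2, calling `.decode()` with no argument uses the ASCII codec, so any recognized text containing non-ASCII characters (German umlauts, for instance) raises `UnicodeDecodeError`. A short illustration, independent of tesserwrap; `decode_ocr_bytes` is a made-up helper name.

```python
def decode_ocr_bytes(raw):
    """Decode UTF-8 bytes, as an OCR engine might return them, into text."""
    return raw.decode("utf-8")


# UTF-8 bytes for the German word "Größe".
sample = b"Gr\xc3\xb6\xc3\x9fe"

try:
    # This is what an implicit ASCII decode does with the same bytes.
    sample.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii decode failed:", exc.reason)

text = decode_ocr_bytes(sample)
print(len(text))
```

On Python 3, `bytes.decode()` already defaults to UTF-8, so passing the encoding explicitly mainly helps Python 2 and documents the intent.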
I get this when building tesserwrap from PyPI and from git against Tesseract 2.04, Boost 1.46.1, and Python 2.7.2:
$ python setup.py install
running install
running bdist_egg
running egg_info
creating tesserwrap.egg-info
writing tesserwrap.egg-info/PKG-INFO
writing top-level names to tesserwrap.egg-info/top_level.txt
writing dependency_links to tesserwrap.egg-info/dependency_links.txt
writing manifest file 'tesserwrap.egg-info/SOURCES.txt'
reading manifest file 'tesserwrap.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: manifest_maker: MANIFEST.in, line 1: 'recursive-include' expects
writing manifest file 'tesserwrap.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/tesserwrap
copying tesserwrap/tesseract.py -> build/lib.linux-x86_64-2.7/tesserwrap
copying tesserwrap/__init__.py -> build/lib.linux-x86_64-2.7/tesserwrap
running build_ext
building 'libtesserwrap' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/tesserwrap
creating build/temp.linux-x86_64-2.7/tesserwrap/cpp
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserwrap/cpp/tesseract_wrap.cpp -o build/temp.linux-x86_64-2.7/tesserwrap/cpp/tesseract_wrap.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
In file included from tesserwrap/cpp/tesseract_wrap.cpp:1:0:
tesserwrap/cpp/tesseract_wrap.h:7:17: error: ‘tesseract’ is not a namespace-name
tesserwrap/cpp/tesseract_wrap.h:7:26: error: expected namespace-name before ‘;’ token
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Left()’:
tesserwrap/cpp/tesseract_wrap.h:20:42: error: ‘class TessBaseAPIExt’ has no member named ‘rect_left_’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Top()’:
tesserwrap/cpp/tesseract_wrap.h:21:41: error: ‘class TessBaseAPIExt’ has no member named ‘rect_top_’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Width()’:
tesserwrap/cpp/tesseract_wrap.h:22:43: error: ‘class TessBaseAPIExt’ has no member named ‘rect_width_’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Height()’:
tesserwrap/cpp/tesseract_wrap.h:23:44: error: ‘class TessBaseAPIExt’ has no member named ‘rect_height_’
tesserwrap/cpp/tesseract_wrap.h: At global scope:
tesserwrap/cpp/tesseract_wrap.h:35:25: error: ‘PageSegMode’ has not been declared
tesserwrap/cpp/tesseract_wrap.h:36:5: error: ‘PageSegMode’ does not name a type
tesserwrap/cpp/tesseract_wrap.cpp: In constructor ‘Tesserwrap::Tesserwrap(const char*, const char*)’:
tesserwrap/cpp/tesseract_wrap.cpp:11:26: error: no matching function for call to ‘TessBaseAPIExt::Init(const char*&, const char*&)’
tesserwrap/cpp/tesseract_wrap.cpp:11:26: note: candidate is:
/usr/include/tesseract/baseapi.h:62:14: note: static int TessBaseAPI::Init(const char*, const char*, const char*, bool, int, char**)
/usr/include/tesseract/baseapi.h:62:14: note: candidate expects 6 arguments, 2 provided
tesserwrap/cpp/tesseract_wrap.cpp: At global scope:
tesserwrap/cpp/tesseract_wrap.cpp:29:1: error: ‘PageSegMode’ does not name a type
tesserwrap/cpp/tesseract_wrap.cpp:34:33: error: variable or field ‘SetPageSegMode’ declared void
tesserwrap/cpp/tesseract_wrap.cpp:34:33: error: ‘PageSegMode’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘void Tesserwrap::SetImage(std::string, int, int)’:
tesserwrap/cpp/tesseract_wrap.cpp:45:8: error: ‘class TessBaseAPIExt’ has no member named ‘SetImage’
tesserwrap/cpp/tesseract_wrap.cpp:46:8: error: ‘class TessBaseAPIExt’ has no member named ‘SetRectangle’
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘void Tesserwrap::Clear()’:
tesserwrap/cpp/tesseract_wrap.cpp:51:8: error: ‘class TessBaseAPIExt’ has no member named ‘Clear’
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘void Tesserwrap::SetRectangle(int, int, int, int)’:
tesserwrap/cpp/tesseract_wrap.cpp:55:8: error: ‘class TessBaseAPIExt’ has no member named ‘SetRectangle’
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘std::string Tesserwrap::GetUTF8Text()’:
tesserwrap/cpp/tesseract_wrap.cpp:60:22: error: ‘class TessBaseAPIExt’ has no member named ‘GetUTF8Text’
tesserwrap/cpp/tesseract_wrap.cpp: In function ‘void init_module_libtesserwrap()’:
tesserwrap/cpp/tesseract_wrap.cpp:72:11: error: ‘PageSegMode’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:72:22: error: template argument 1 is invalid
tesserwrap/cpp/tesseract_wrap.cpp:73:24: error: ‘PSM_AUTO’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:74:33: error: ‘PSM_SINGLE_COLUMN’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:75:32: error: ‘PSM_SINGLE_BLOCK’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:76:31: error: ‘PSM_SINGLE_LINE’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:77:31: error: ‘PSM_SINGLE_WORD’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:78:31: error: ‘PSM_SINGLE_CHAR’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:79:25: error: ‘PSM_COUNT’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:88:36: error: ‘GetPageSegMode’ is not a member of ‘Tesserwrap’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Height()’:
tesserwrap/cpp/tesseract_wrap.h:23:57: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Width()’:
tesserwrap/cpp/tesseract_wrap.h:22:55: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Top()’:
tesserwrap/cpp/tesseract_wrap.h:21:51: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Left()’:
tesserwrap/cpp/tesseract_wrap.h:20:53: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘std::string Tesserwrap::GetUTF8Text()’:
tesserwrap/cpp/tesseract_wrap.cpp:61:1: warning: control reaches end of non-void function [-Wreturn-type]
error: command 'gcc' failed with exit status 1
Maybe some incompatibility in this toolchain?
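The errors (`'tesseract' is not a namespace-name`, missing `PageSegMode`, a six-argument `Init`) suggest the installed headers predate the API the wrapper was written against. A sketch of a version gate a build script could apply before compiling. The minimum of 3.0 is an assumption inferred from this log, not a documented requirement, and the banner format ("tesseract 3.03") is what `tesseract --version` prints on releases that support that option.

```python
import re

# Assumption: the wrapper needs the 3.x C++ API (namespace, PageSegMode).
MINIMUM_VERSION = (3, 0)


def parse_tesseract_version(banner):
    """Extract (major, minor) from a version banner, or None if absent."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(banner, minimum=MINIMUM_VERSION):
    """True when the banner names a version at or above the minimum."""
    version = parse_tesseract_version(banner)
    return version is not None and version >= minimum
```

For example, `is_supported("tesseract 2.04")` is false while `is_supported("tesseract 3.02.02")` is true, so the build could stop early with a clear message instead of a page of compiler errors.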
Hey Greg,
When I try to compile tesserwrap on OS X 10.8, I get the error below. /usr/local/lib contains the library files for tesseract, libtesseract.a and libtesseract.dylib. I think it is actually finding those files but then running into trouble when trying to do the binding. Any suggestions would be greatly appreciated!
Thanks so much,
Jason
python setup.py install
-L/usr/local/lib
ld: library not found for -ltesseract_api
-L/usr/local/lib
ld: warning: -arch not specified
ld: symbol dyld_stub_binder not found (normally in libSystem.dylib). Needed to perform lazy binding to function _main for inferred architecture x86_64
Traceback (most recent call last):
File "setup.py", line 39, in <module>
extra_lib_paths)
File "setup.py", line 24, in find_closest_libname
"Cannot find Tesseract via ldconfig, confirm it is installed.")
Exception: Cannot find Tesseract via ldconfig, confirm it is installed.
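Since the probe asks the linker for `-ltesseract_api` while the files on disk are named `libtesseract.*`, one alternative is to look for the shared library by filename in the known directories rather than through `ld`. A minimal sketch; the function name and glob patterns are illustrative assumptions, not what tesserwrap's `find_closest_libname` does.

```python
import glob
import os


def find_tesseract_libraries(
    search_dirs, patterns=("libtesseract*.dylib", "libtesseract*.so*")
):
    """Return shared-library files matching the patterns, in directory order."""
    found = []
    for directory in search_dirs:
        for pattern in patterns:
            # Static archives (.a) are skipped on purpose: only shared
            # libraries can be loaded or linked dynamically.
            found.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return found


if __name__ == "__main__":
    print(find_tesseract_libraries(["/usr/local/lib", "/usr/lib"]))
```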
Need to write some docs. The API isn't documented with docstrings.