
tesserwrap's Introduction

Tesserwrap - Basic Tesseract API Wrapper for Python

Tesserwrap provides simple Python bindings to Tesseract's API, so the tesseract application does not have to be executed manually each time.

Docs: https://tesserwrap.readthedocs.org/en/latest/
IRC: #tesserwrap on Freenode

tesserwrap's People

Contributors

baali, beli-sk, gregjurman, tax, tonyseek, tydus

tesserwrap's Issues

ld: library not found for -ltesseract_api

➜  tesserwrap git:(zxdev) python setup.py install             
ld: library not found for -ltesseract_api
running install
Checking .pth file support in /Library/Python/2.7/site-packages/
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-13493.pth'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /Library/Python/2.7/site-packages/

Perhaps your account does not have write access to this directory?  If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account.  If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.

For information on other options, you may wish to consult the
documentation at:

  https://pythonhosted.org/setuptools/easy_install.html

Please make the appropriate changes for your system and try again.

find_closest_libnames fails on debian wheezy

I printed out the command and have been trying variations of it, but haven't found what the tesseract library is named on Debian. Will submit a pull request if I can find it and fix it.

set_page_seg_mode is not respected correctly

Howdy,

I'm currently working on an OCR-to-PDF solution for visually impaired people:
https://github.com/chrys87/ocrpdf
It's at an early stage, but while working on it I noticed that set_page_seg_mode is not respected correctly.
Here is an example with a multi-column layout:
https://crivatec.de/page/uploads/images/ocrTransformed1.png
I tried the following:

tess = tesseract("/usr/share", self._languageCode) #language is "deu" in this case
tess.set_page_seg_mode(tesserwrap.PageSegMode.PSM_AUTO )

tess.set_variable('tessedit_pageseg_mode', '3')  # also tried this

print( tess.get_page_seg_mode() ) #prints 3
self._OCRText[Page_p] = tess.ocr_image(self._modifiedImg[Page_p]) # the image above as pillow image

The mode appears to be set correctly, because the print shows the expected value, but recognition still proceeds as PSM_SINGLE_BLOCK (6),
so the columns are not recognized.

If I run tesseract from the command line:
tesseract ocrTransformed1.png ocrTransformed1 -l deu -psm 3
it works well. The result is much better: the correct PSM is used and the columns are recognized.

Could you take a look into this?

I'm running a current Arch Linux with the latest tesseract, pillow and python.
It seems that a similar problem existed in tesseract itself years ago:
https://code.google.com/p/tesseract-ocr/issues/detail?id=394

By the way, I really enjoy your Python API. Damn cool stuff :).
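
As a stopgap while the binding does not honour the mode, the command-line call quoted in this issue can be scripted. The sketch below is only a workaround under assumptions, not a fix for tesserwrap: the helper names build_tesseract_cmd and run_tesseract are hypothetical, the "-psm" spelling is the one from the command above, and "--psm" is the spelling that newer usage text prints.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="deu", psm=3, legacy_flag=True):
    """Return the argv list for one Tesseract command-line run.

    legacy_flag=True emits "-psm" (as in the command quoted in this issue);
    legacy_flag=False emits "--psm", the spelling newer builds document.
    """
    psm_flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, "-l", lang, psm_flag, str(psm)]


def run_tesseract(image_path, output_base, **kwargs):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.check_call(build_tesseract_cmd(image_path, output_base, **kwargs))
    with open(output_base + ".txt", "rb") as handle:
        return handle.read().decode("utf-8")
```

Calling run_tesseract("ocrTransformed1.png", "ocrTransformed1") would reproduce the command-line run with page segmentation mode 3, at the cost of starting a new process per image.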

Can't install on osx sierra

pip install tesserwrap
Collecting tesserwrap
  Downloading tesserwrap-0.1.6.tar.gz
    Complete output from command python setup.py egg_info:
    ld: library not found for -lcrt1.o
    ld: library not found for -lcrt1.o
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/yp/mhyppj_15z15cl2bf1_cn7jm0000gn/T/pip-build-3opbjwz5/tesserwrap/setup.py", line 45, in <module>
        extra_lib_paths)
      File "/private/var/folders/yp/mhyppj_15z15cl2bf1_cn7jm0000gn/T/pip-build-3opbjwz5/tesserwrap/setup.py", line 30, in find_closest_libname
        "Cannot find Tesseract via ldconfig, confirm it is installed.")
    Exception: Cannot find Tesseract via ldconfig, confirm it is installed.

I installed tesseract with brew, and it does appear to be installed:

tesseract
Usage:
  tesseract --help | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.
...

Some light digging on the internet and uninstalling/reinstalling tesseract have turned up nothing. Any idea what's going on here?
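
One way to check whether Python can locate the Tesseract shared library at all, independently of setup.py, is the standard library's ctypes.util.find_library. This is a diagnostic sketch, not what setup.py does; the function name find_tesseract_library is hypothetical, and the candidate names are the ones that appear in the linker messages in these issues.

```python
import ctypes.util


def find_tesseract_library(candidates=("tesseract", "tesseract_api", "tesseract-ocr")):
    """Return the first Tesseract library the platform can resolve, else None."""
    for name in candidates:
        found = ctypes.util.find_library(name)
        if found:
            return found
    return None


if __name__ == "__main__":
    # Prints a library name or path when one is found, otherwise None.
    print(find_tesseract_library())
```

If this prints None, the library is probably installed in a location outside the default search paths, and an install-time lookup is likely to fail for the same reason.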

AttributeError: 'module' object has no attribute 'sysconfig'

This is with Python 2.7.3 on a standard Ubuntu 12.04:

In [1]: from tesserwrap import Tesseract
---------------------------------------------------------------------------                                                                              
AttributeError                            Traceback (most recent call last)                                                                              
/home/baeuml/<ipython-input-1-b91b678fa26c> in <module>()                                                                                                
----> 1 from tesserwrap import Tesseract                                                                                                                 

/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/__init__.py in <module>()                                      
----> 1 from .core import tr                                                                                                                             
      2 from ctypes import c_ulonglong, byref                                                                                                            
      3 import sys                                                                                                                                       
      4 import warnings                                                                                                                                  
      5                                                                                                                                                  

/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/core.py in <module>()                                          
     68                                                                                                                                                  
     69                                                                                                                                                  
---> 70 tr = load_library('libtesserwrap', os.path.dirname(__file__))                                                                                    
     71                                                                                                                                                  
     72                                                                                                                                                  

/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/core.py in load_library(libname, loader_path)                  
     45                                                                                                                                                  
     46     """                                                                                                                                          
---> 47     so_ext = get_shared_lib_extension()                                                                                                          
     48     libname_ext = [libname + so_ext]                                                                                                             
     49     if sys.version[:3] >= '3.2':                                                                                                                 

/usr/local/lib/python2.7/dist-packages/tesserwrap-0.1.1-py2.7-linux-x86_64.egg/tesserwrap/core.py in get_shared_lib_extension(is_python_ext)             
     28                                                                                                                                                  
     29     """                                                                                                                                          
---> 30     so_ext = distutils.sysconfig.get_config_var('SO') or ''                                                                                      
     31     # fix long extension for Python >=3.2, see PEP 3149.                                                                                         

     32     if not is_python_ext and 'SOABI' in distutils.sysconfig.get_config_vars():                                                                   

AttributeError: 'module' object has no attribute 'sysconfig'
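
The traceback is consistent with core.py running import distutils without importing the distutils.sysconfig submodule: a bare package import does not load its submodules, so the attribute only exists if some other module happened to import it first. A minimal sketch of the lookup using the standalone sysconfig module instead is shown below. The function name shared_lib_suffix is hypothetical, and the sketch covers only the suffix lookup, not the SOABI handling on the following lines of core.py.

```python
import sysconfig


def shared_lib_suffix():
    """Return the interpreter's extension-module suffix, or '' if unknown."""
    # EXT_SUFFIX is the current key; SO is the older key used in the traceback.
    return (
        sysconfig.get_config_var("EXT_SUFFIX")
        or sysconfig.get_config_var("SO")
        or ""
    )


if __name__ == "__main__":
    print(shared_lib_suffix())
```

Keeping distutils and adding an explicit import distutils.sysconfig at the top of core.py would address the same error with a smaller change.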

Cannot find Tesseract via ldconfig, confirm it is installed

Installed tesseract:
C:\Python34\Scripts>pip install tesseract
Collecting tesseract
Downloading https://files.pythonhosted.org/packages/8d/b7/c4fae9af5842f69d9c45bf1195a94aec090628535c102894552a7a7dbe6c/tesseract-0.1.3.tar.gz (45.6MB)
100% |################################| 45.6MB 79kB/s
Installing collected packages: tesseract
Running setup.py install for tesseract ... done
Successfully installed tesseract-0.1.3

Then, trying to install tesserwrap I get this error:

C:\Python34\Scripts>pip install tesserwrap
Collecting tesserwrap
Downloading https://files.pythonhosted.org/packages/04/92/4c2134fc465d576c05d4426bc2f1ba7871652d78d3d913bec0bffe0afe8b/tesserwrap-0.1.6.tar.gz
Complete output from command python setup.py egg_info:
'ld' is not recognized as an internal or external command,
operable program or batch file.
'ld' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\XXN\AppData\Local\Temp\pip-install-ep4an8qi\tesserwrap\setup.py", line 45, in <module>
extra_lib_paths)
File "C:\Users\XXN\AppData\Local\Temp\pip-install-ep4an8qi\tesserwrap\setup.py", line 30, in find_closest_libname
"Cannot find Tesseract via ldconfig, confirm it is installed.")
Exception: Cannot find Tesseract via ldconfig, confirm it is installed.

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in C:\Users\XXN\AppData\Local\Temp\pip-install-ep4an8qi\tesserwrap\

Write proper unit-test

A unit test should be devised that exercises the bindings and confirms that Tesseract passes data into Python correctly.
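
A possible starting point is a smoke test that pushes one image through the binding and checks that text comes back. This is a sketch under assumptions: it uses only Tesseract() and ocr_image(), the calls that appear in the issues on this page, assumes that ocr_image() returns decoded text, and skips itself when tesserwrap or Pillow cannot be imported. The class name BindingSmokeTest is hypothetical.

```python
import unittest

try:
    from PIL import Image
    from tesserwrap import Tesseract
    HAVE_DEPS = True
except Exception:  # ImportError, or a loader failure inside tesserwrap
    HAVE_DEPS = False


@unittest.skipUnless(HAVE_DEPS, "tesserwrap and Pillow are required")
class BindingSmokeTest(unittest.TestCase):
    def test_ocr_image_returns_text(self):
        # A blank white greyscale page: the content of the result does not
        # matter here, only that the round trip yields a text object.
        img = Image.new("L", (200, 60), 255)
        tr = Tesseract()
        text = tr.ocr_image(img)
        self.assertIsInstance(text, type(u""))
```

Saved in a test module, this can be run with python -m unittest. A fuller suite would add an image with known text and compare the recognised string against it.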

Native python.h

Look at the feasibility of moving the bindings to the native python.h API rather than relying on Boost::Python.

Memory Leak in set_image()

When a new image is passed into the binding the old image is not destroyed. I have tried:

// In class def:
unsigned char* picture;
// In setimage()
if (picture != NULL) { 
     delete [] picture;
     picture = NULL;
}

However the application segfaults when that line of code is in the function. I must be doing something wrong or need to do something non-standard.

Windows compiling

Original title: not an issue, but can you let me know how to make the lib work on Windows?
I skimmed through the code and am confident that it works well: it doesn't introduce extra errors, since it acts as a thin wrapper around the original tesseract.

Thanks.

Does not install

I get the following error when I try to install tesserwrap using the python setup.py install command. How do I fix this?

gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -O2 -DNDEBUG -g -O3 -I/usr/local/include -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c tesserwrap/cpp/tesseract_ext.cpp -o build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_ext.o
gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -O2 -DNDEBUG -g -O3 -I/usr/local/include -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c tesserwrap/cpp/tesseract_wrap.cpp -o build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_wrap.o
c++ -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_ext.o build/temp.macosx-10.6-intel-2.7/tesserwrap/cpp/tesseract_wrap.o -L/usr/local/lib -ltesseract-ocr -o build/lib.macosx-10.6-intel-2.7/libtesserwrap.so
ld: library not found for -ltesseract-ocr
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'c++' failed with exit status 1

Tesserwrap exits after calling methods involving get_text()

Tesserwrap used to work fine for me until I updated Tesseract from 3.02 to 3.03, because an ALPR package recommended 3.03 and I also wanted to make use of the training materials.
Upon calling:

    tr = Tesseract()
    tr.ocr_image(img)

or get_text() or get_utf8_text(), tesserwrap calls the __del__ function in __init__.py. When self.handle and core are both true, tr.Tesserwrap_Destroy(self.handle) simply exits the program with no errors logged (I know this from debugging).

No code after that call runs, because the main program has exited.

I commented out tr.Tesserwrap_Destroy(self.handle) and now it works again.
I can use tesserwrap with Tesseract 3.03 without problems.
However, I don't believe that hack is the best solution.

encoding

Change line 93 of /usr/local/lib/python2.7/dist-packages/tesserwrap/__init__.py:
--- return self.get_text().decode()
+++ return self.get_text().decode(encoding='UTF-8')
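
The reason the explicit codec matters: on Python 2 a bare .decode() falls back to the interpreter's default encoding, which is normally ASCII, so any non-ASCII character in the OCR output raises UnicodeDecodeError. The sketch below illustrates the difference with a stand-in byte string. The helper name decode_ocr_bytes is hypothetical, and it assumes get_text() returns UTF-8 encoded bytes.

```python
# UTF-8 bytes for "Größe", standing in for what get_text() might return.
raw = u"Gr\u00f6\u00dfe".encode("utf-8")


def decode_ocr_bytes(data):
    """Decode OCR output with an explicit codec, not the interpreter default."""
    return data.decode("UTF-8")


text = decode_ocr_bytes(raw)

try:
    raw.decode("ascii")  # what a bare .decode() amounts to on Python 2
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Here text holds the five-character string, while the ASCII attempt fails and leaves ascii_ok set to False.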

Build error on ubuntu

I get this when building tesserwrap, both from PyPI and from git, against tesseract 2.04, boost 1.46.1, and python 2.7.2:

$ python setup.py install
running install
running bdist_egg
running egg_info
creating tesserwrap.egg-info
writing tesserwrap.egg-info/PKG-INFO
writing top-level names to tesserwrap.egg-info/top_level.txt
writing dependency_links to tesserwrap.egg-info/dependency_links.txt
writing manifest file 'tesserwrap.egg-info/SOURCES.txt'
reading manifest file 'tesserwrap.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: manifest_maker: MANIFEST.in, line 1: 'recursive-include' expects

...

writing manifest file 'tesserwrap.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/tesserwrap
copying tesserwrap/tesseract.py -> build/lib.linux-x86_64-2.7/tesserwrap
copying tesserwrap/__init__.py -> build/lib.linux-x86_64-2.7/tesserwrap
running build_ext
building 'libtesserwrap' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/tesserwrap
creating build/temp.linux-x86_64-2.7/tesserwrap/cpp
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserwrap/cpp/tesseract_wrap.cpp -o build/temp.linux-x86_64-2.7/tesserwrap/cpp/tesseract_wrap.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
In file included from tesserwrap/cpp/tesseract_wrap.cpp:1:0:
tesserwrap/cpp/tesseract_wrap.h:7:17: error: ‘tesseract’ is not a namespace-name
tesserwrap/cpp/tesseract_wrap.h:7:26: error: expected namespace-name before ‘;’ token
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Left()’:
tesserwrap/cpp/tesseract_wrap.h:20:42: error: ‘class TessBaseAPIExt’ has no member named ‘rect_left_’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Top()’:
tesserwrap/cpp/tesseract_wrap.h:21:41: error: ‘class TessBaseAPIExt’ has no member named ‘rect_top_’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Width()’:
tesserwrap/cpp/tesseract_wrap.h:22:43: error: ‘class TessBaseAPIExt’ has no member named ‘rect_width_’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Height()’:
tesserwrap/cpp/tesseract_wrap.h:23:44: error: ‘class TessBaseAPIExt’ has no member named ‘rect_height_’
tesserwrap/cpp/tesseract_wrap.h: At global scope:
tesserwrap/cpp/tesseract_wrap.h:35:25: error: ‘PageSegMode’ has not been declared
tesserwrap/cpp/tesseract_wrap.h:36:5: error: ‘PageSegMode’ does not name a type
tesserwrap/cpp/tesseract_wrap.cpp: In constructor ‘Tesserwrap::Tesserwrap(const char*, const char*)’:
tesserwrap/cpp/tesseract_wrap.cpp:11:26: error: no matching function for call to ‘TessBaseAPIExt::Init(const char*&, const char*&)’
tesserwrap/cpp/tesseract_wrap.cpp:11:26: note: candidate is:
/usr/include/tesseract/baseapi.h:62:14: note: static int TessBaseAPI::Init(const char*, const char*, const char*, bool, int, char**)
/usr/include/tesseract/baseapi.h:62:14: note: candidate expects 6 arguments, 2 provided
tesserwrap/cpp/tesseract_wrap.cpp: At global scope:
tesserwrap/cpp/tesseract_wrap.cpp:29:1: error: ‘PageSegMode’ does not name a type
tesserwrap/cpp/tesseract_wrap.cpp:34:33: error: variable or field ‘SetPageSegMode’ declared void
tesserwrap/cpp/tesseract_wrap.cpp:34:33: error: ‘PageSegMode’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘void Tesserwrap::SetImage(std::string, int, int)’:
tesserwrap/cpp/tesseract_wrap.cpp:45:8: error: ‘class TessBaseAPIExt’ has no member named ‘SetImage’
tesserwrap/cpp/tesseract_wrap.cpp:46:8: error: ‘class TessBaseAPIExt’ has no member named ‘SetRectangle’
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘void Tesserwrap::Clear()’:
tesserwrap/cpp/tesseract_wrap.cpp:51:8: error: ‘class TessBaseAPIExt’ has no member named ‘Clear’
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘void Tesserwrap::SetRectangle(int, int, int, int)’:
tesserwrap/cpp/tesseract_wrap.cpp:55:8: error: ‘class TessBaseAPIExt’ has no member named ‘SetRectangle’
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘std::string Tesserwrap::GetUTF8Text()’:
tesserwrap/cpp/tesseract_wrap.cpp:60:22: error: ‘class TessBaseAPIExt’ has no member named ‘GetUTF8Text’
tesserwrap/cpp/tesseract_wrap.cpp: In function ‘void init_module_libtesserwrap()’:
tesserwrap/cpp/tesseract_wrap.cpp:72:11: error: ‘PageSegMode’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:72:22: error: template argument 1 is invalid
tesserwrap/cpp/tesseract_wrap.cpp:73:24: error: ‘PSM_AUTO’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:74:33: error: ‘PSM_SINGLE_COLUMN’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:75:32: error: ‘PSM_SINGLE_BLOCK’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:76:31: error: ‘PSM_SINGLE_LINE’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:77:31: error: ‘PSM_SINGLE_WORD’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:78:31: error: ‘PSM_SINGLE_CHAR’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:79:25: error: ‘PSM_COUNT’ was not declared in this scope
tesserwrap/cpp/tesseract_wrap.cpp:88:36: error: ‘GetPageSegMode’ is not a member of ‘Tesserwrap’
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Height()’:
tesserwrap/cpp/tesseract_wrap.h:23:57: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Width()’:
tesserwrap/cpp/tesseract_wrap.h:22:55: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Top()’:
tesserwrap/cpp/tesseract_wrap.h:21:51: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.h: In member function ‘int TessBaseAPIExt::Get_Rect_Left()’:
tesserwrap/cpp/tesseract_wrap.h:20:53: warning: control reaches end of non-void function [-Wreturn-type]
tesserwrap/cpp/tesseract_wrap.cpp: In member function ‘std::string Tesserwrap::GetUTF8Text()’:
tesserwrap/cpp/tesseract_wrap.cpp:61:1: warning: control reaches end of non-void function [-Wreturn-type]
error: command 'gcc' failed with exit status 1

Maybe some incompatibility in this toolchain?

Compile Error on OSX 10.8

Hey Greg,

When I try to compile tesserwrap on OSX10.8, I get the error below. /usr/local/lib contains the library files for tesseract libtesseract.a and libtesseract.dylib. I think that it is actually finding those files but then hitting trouble when trying to do the binding. Any suggestions would be greatly appreciated!

Thanks so much,
Jason

python setup.py install

-L/usr/local/lib
ld: library not found for -ltesseract_api
-L/usr/local/lib
ld: warning: -arch not specified
ld: symbol dyld_stub_binder not found (normally in libSystem.dylib). Needed to perform lazy binding to function _main for inferred architecture x86_64
Traceback (most recent call last):
File "setup.py", line 39, in <module>
extra_lib_paths)
File "setup.py", line 24, in find_closest_libname
"Cannot find Tesseract via ldconfig, confirm it is installed.")
Exception: Cannot find Tesseract via ldconfig, confirm it is installed.

Docs

Need to write some docs. The API isn't documented with docstrings.
