alfresco-simple-ocr's Introduction

Alfresco Simple OCR Action

This addon provides an action to extract OCR text from images or plain PDFs in Alfresco.

License: The plugin is licensed under the LGPL v3.0.

State: Current addon release is 2.3.1.

Compatibility: The current version has been developed using Alfresco 5.2 and Alfresco SDK 3.0.2, although it should also run on Alfresco 5.1, 5.0 and 4.2 (as it is developed using Alfresco SDK 3.0).

Browser compatibility: 100% supported

Supported OCR software:

Languages: Currently the Share action interface is provided in English, and the behaviour interface in English, Spanish, Brazilian Portuguese, German and Italian. The catalog of supported OCR languages depends directly on the selected OCR software (Tesseract OCR or Windows.Media.OCR).

No original Alfresco resources have been overwritten

BeeCon 2016

This addon was presented at BeeCon 2016. You can find additional details at Integrating a simple OCR in Alfresco.

Downloading the ready-to-deploy-plugin

The binary distribution is made of two jar files to be deployed in Alfresco as modules:

You can install them by putting each jar file in the corresponding modules folder:

  • Copy repo JAR to /opt/alfresco/modules/platform (create the directory if it does not exist)
  • Copy share JAR to /opt/alfresco/modules/share

Re-start Alfresco after copying the files.

Building the artifacts

If you are new to Alfresco and the Alfresco Maven SDK, you should start by reading Jeff Potts' tutorial on the subject.

You can build the artifacts from source code using Maven:

$ mvn clean package

Installation

OCR software for Linux depends on programs like gs or ImageMagick, which are also dependencies of Alfresco. To avoid problems, it is recommended to install Alfresco from scratch and let the OS handle the installation of these packages.

You can find detailed instructions to perform Alfresco installation from scratch at Alfresco Documentation.

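If you are using Linux and your Alfresco was installed with the default installer wizard, pay attention to the execution environment of programs launched from inside the JVM: you may need to adjust library versions and path precedence so that the OCR tools pick up the system libraries rather than the ones bundled with Alfresco.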

You can find more options to solve this problem at the FAQ page.
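
The path precedence issue can be illustrated with a small Python sketch. It assumes that the installer exposes its bundled libraries to child processes through LD_LIBRARY_PATH and that they live under a directory such as /opt/alfresco-community (both are assumptions about a given installation). The helper name strip_bundled_libs is hypothetical and not part of the addon.

```python
import os


def strip_bundled_libs(env, bundled_prefix):
    """Return a copy of env whose LD_LIBRARY_PATH has no entry under bundled_prefix."""
    cleaned = dict(env)
    prefix = bundled_prefix.rstrip("/")
    entries = cleaned.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    kept = [
        entry for entry in entries
        if entry
        and entry.rstrip("/") != prefix
        and not entry.startswith(prefix + "/")
    ]
    if kept:
        cleaned["LD_LIBRARY_PATH"] = os.pathsep.join(kept)
    else:
        # Nothing left: remove the variable so the system defaults apply.
        cleaned.pop("LD_LIBRARY_PATH", None)
    return cleaned


if __name__ == "__main__":
    # The prefix below is an assumed installation directory.
    env = strip_bundled_libs(os.environ, "/opt/alfresco-community")
    print(env.get("LD_LIBRARY_PATH", "(unset)"))
```

A wrapper around the OCR tool could pass the cleaned mapping as the env argument of subprocess.run, so that the tool resolves its shared libraries from the system paths.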

Configuration

After installation, the following properties must be included in alfresco-global.properties:

  • If you are using pdfsandwich
ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang spa+eng+fra
ocr.server.os=linux

  • If you are using OCRmyPDF
ocr.command=/usr/local/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=

ocr.extra.commands=--verbose 1 --force-ocr -l spa+eng+fra
ocr.server.os=linux

  • If you are using Windows.OCR
ocr.url=http://localhost:60064/api/OCR/
ocr.output.verbose=true

ocr.extra.commands=Spanish
ocr.server.os=windows
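
Judging only from the command lines that appear in the log output quoted in the issues on this page, the command-line properties seem to be combined in this order: command, extra commands, source file, output prefix (if any), target file. The Python sketch below reproduces that ordering for illustration. It is inferred from log output rather than taken from the addon's source, build_ocr_command is a hypothetical name, and the Windows.OCR variant (which uses ocr.url) is not covered.

```python
import shlex


def build_ocr_command(props, source, target):
    """Assemble an argument list in the order seen in the logged commands."""
    args = [props["ocr.command"]]
    args += shlex.split(props.get("ocr.extra.commands", ""))
    args.append(source)
    prefix = props.get("ocr.output.file.prefix.command", "").strip()
    if prefix:
        # pdfsandwich needs "-o" before the target; OCRmyPDF takes it positionally.
        args.append(prefix)
    args.append(target)
    return args


if __name__ == "__main__":
    pdfsandwich = {
        "ocr.command": "/usr/bin/pdfsandwich",
        "ocr.output.file.prefix.command": "-o",
        "ocr.extra.commands": "-verbose -lang spa+eng+fra",
    }
    # prints: /usr/bin/pdfsandwich -verbose -lang spa+eng+fra in.pdf -o in_ocr.pdf
    print(" ".join(build_ocr_command(pdfsandwich, "in.pdf", "in_ocr.pdf")))
```

Running the assembled command by hand, as the same OS user that runs Alfresco, is a quick way to check a configuration before enabling the rule.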

Usage of rule

  • Add a rule to a folder and select the Extract OCR action
  • Every image dropped into this folder will be sent to the OCR software to produce a searchable PDF file.
  • To perform this operation asynchronously, use the checkbox Alfresco provides in the rule configuration.
  • To let Alfresco keep operating when OCR fails, tick the rule checkbox Continue on error

Usage of action

  • Press the OCR action in the document browser or on the document details page
  • The action runs asynchronously, so the result becomes available after some time

Known issues

  • When using WebDAV to upload documents, only asynchronous rule execution is allowed

alfresco-simple-ocr's Issues

Problem during ocr

When I put some PDF files in the Alfresco folder configured with the rule (ocr-extraction), Alfresco creates a new version of the file without performing OCR correctly.

When this happens, it writes this to alfresco.log:
`Version: ImageMagick 6.9.1-10 Q16 x86_64 2015-08-12 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules
Delegates (built-in): freetype jng jpeg ltdl png tiff wmf

Checking for unpaper:
unpaper -version
*** error: Unknown parameter '-version'.
Try 'unpaper --help' for options.
Checking for tesseract:
tesseract -v
Checking for gs:
gs -v
GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc. All rights reserved.
Input file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf"
Output file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf"
Number of pages in inputfile: 1
Processing page 1.
identify -format "%w\n%h\n" "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf[0]"
convert -type Bilevel -density 300x300 "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf[0]" /tmp/pdfsandwichf66a6b.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwichf66a6b.pbm /tmp/pdfsandwich5838df_unpaper.pbm
Processing sheet: /tmp/pdfsandwichf66a6b.pbm -> /tmp/pdfsandwich5838df_unpaper.pbm
tesseract /tmp/pdfsandwich5838df_unpaper.pbm /tmp/pdfsandwich0ca5f3 -l spa+eng+fra pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=595 -dDEVICEHEIGHTPOINTS=842 -dPDFFitPage -o /tmp/pdfsandwich5264db.pdf /tmp/pdfsandwich0ca5f3.pdf
OCR done. Writing "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf"
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf" /tmp/pdfsandwich5264db.pdf

Done.

2017-01-30 15:01:17,837 INFO [es.keensoft.alfresco.ocr.OCRTransformWorker] [http-apr-8080-exec-9] STDERR: tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /usr/local/lib/liblept.so.4)
tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /lib64/libtiff.so.5)
tesseract 3.04.01
leptonica-1.72
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.3

tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /usr/local/lib/liblept.so.4)
tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /lib64/libtiff.so.5)
Tesseract Open Source OCR Engine v3.04.01 with Leptonica`

I have noticed this error:

STDERR: tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /usr/local/lib/liblept.so.4)

Could this be causing the problem? How can I fix it?

Thanks in advance

ocrmypdf doesn't work using alfresco-simple-ocr

Hi,

I have an issue using alfresco-simple-ocr: I get this error when I try to OCR a PDF:

Exception in thread "defaultAsyncAction1" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 10220020 Failed to perform OCR transformation:
Execution result:
  os:         Linux
  command:    /usr/bin/ocrmypdf --verbose 1 --force-ocr -l eng /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_6969309335739725478.pdf /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_6969309335739725478_ocr.pdf
  succeeded:  false
  exit code:  1
  out:
  err:        Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/lib/python3.5/site-packages/ocrmypdf/__main__.py", line 53, in <module>
    _unicodefun._verify_python3_env
       at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
       at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
       at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
       at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
       at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464)
       at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
       at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
       at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 10220020 Failed to perform OCR transformation:
Execution result:
  os:         Linux
  command:    /usr/bin/ocrmypdf --verbose 1 --force-ocr -l eng /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_6969309335739725478.pdf /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_6969309335739725478_ocr.pdf
  succeeded:  false
  exit code:  1
  out:
  err:        Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/lib/python3.5/site-packages/ocrmypdf/__main__.py", line 53, in <module>
    _unicodefun._verify_python3_env
       at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
       at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
       ... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 10220020 Failed to perform OCR transformation:
Execution result:
  os:         Linux
  command:    /usr/bin/ocrmypdf --verbose 1 --force-ocr -l eng /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_6969309335739725478.pdf /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_6969309335739725478_ocr.pdf
  succeeded:  false
  exit code:  1
  out:
  err:        Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/lib/python3.5/site-packages/ocrmypdf/__main__.py", line 53, in <module>
    _unicodefun._verify_python3_env
       at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
       ... 11 more

My config file:

### OCR config ###
ocr.command=/usr/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=
ocr.extra.commands=--verbose 1 --force-ocr -l eng
ocr.server.os=linux

If I run this command directly, it works and the document is created:

ocrmypdf --verbose 1 --force-ocr -l eng /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5341287848260715795.pdf /alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5341287848260715795_ocr.pdf

I'm using CentOS and Alfresco with Docker.

thanks :)
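
The traceback in this report stops inside _verify_python3_env, a check from the click library that aborts when Python 3 starts under an ASCII locale. If that is what happens here (an assumption, since the error message itself is cut off), the command could succeed in an interactive shell with a UTF-8 locale and still fail under a Tomcat process started without one. A minimal Python sketch of detecting and patching that condition; both helper names are hypothetical, and en_US.UTF-8 is only an example that must actually exist in the container:

```python
def effective_ctype_locale(env):
    """Resolve the character-set locale using the POSIX precedence order."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # empty values count as unset
            return value
    return "C"


def needs_utf8_locale(env):
    """True when the effective locale does not name a UTF-8 character set."""
    locale_name = effective_ctype_locale(env).lower()
    return "utf-8" not in locale_name and "utf8" not in locale_name


def with_utf8_locale(env, locale_name="en_US.UTF-8"):
    """Return a copy of env with LANG and LC_ALL set to locale_name."""
    patched = dict(env)
    patched["LANG"] = locale_name
    patched["LC_ALL"] = locale_name
    return patched


if __name__ == "__main__":
    import os

    print("needs a UTF-8 locale:", needs_utf8_locale(os.environ))
```

The same effect can be had by exporting LANG and LC_ALL in whatever script starts Tomcat, provided the chosen locale is installed in the image.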

Spaces between characters in ocr'ed pdfs

Hi,

Every PDF file I ocr in Alfresco contains spaces between each character.

Example : the word "client" becomes "c l i e n t".

Maybe it's a pdfsandwich issue, but as it is called from alfresco-simple-ocr I thought I would ask here. Is there anything I can do to solve this?

System is Ubuntu 16.06 x64, running Alfresco 5.2.0 (re21f2be5-b22)

Regards,

David

Zero byte pdf created when the ocr extract process fails

I've noticed incorrect behaviour in the OCR extract process.

When it can't extract the text (at least when using OCRmyPDF), failing for any reason, the action writes a broken pdf file with 0 byte length into the new version.

Tested using Ubuntu 14.04 with OCRmyPDF installed using pip3
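
One way to avoid storing an empty version would be to validate the OCR result before it is written back. The Python sketch below illustrates such a guard. It is not the addon's code, ocr_output_is_usable is a hypothetical helper, and the "%PDF-" header check is an extra assumption about what counts as a usable result.

```python
import os
import tempfile


def ocr_output_is_usable(path, exit_code):
    """Accept the OCR result only if the tool succeeded and left a non-empty PDF."""
    if exit_code != 0:
        return False
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as handle:
        # PDF files begin with the marker "%PDF-".
        return handle.read(5) == b"%PDF-"


if __name__ == "__main__":
    # A freshly created temporary file is empty, so it is rejected.
    with tempfile.NamedTemporaryFile(suffix="_ocr.pdf") as empty:
        print(ocr_output_is_usable(empty.name, 0))
```

With a guard like this, a failed run would leave the current version of the document untouched instead of replacing it with a 0-byte file.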

Can you manually resubmit a document for OCR?

Rather than deleting it and re-adding it, is there a way to request a document to be OCRed again?

What happens if Alfresco crashes or there is an error while OCRing the document, does it get resubmitted for OCR or is it added to the library without OCR?

Files were not modified

Hello,

I've been encountering an issue recently.
When I attempt to use the OCR on a PDF file, a message appears saying "the document will be available in minutes", but I can't find any converted file, and the original file was not modified.

I want to know if someone has seen this issue before or can help me get the PDF converted.

Thank you in advance.

How to restrict OCR (PDFSandwich) for Searchable Documents (PDF)?

BUG: OCR (PDFSandwich) is getting executed for Searchable Documents (PDF) as well.

Expected behavior: OCR should not process documents already containing text or searchable file.

Actual behavior: OCR is getting executed for Searchable Documents as well.

Steps to reproduce the behavior: Upload PDF files that already contain text; they are processed by OCR as well.

Please help me on this.

Tell us about your environment: Linux
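
A possible pre-check for this request is to extract the text of the PDF first and skip OCR when enough text comes back. The Python sketch below assumes the pdftotext tool from poppler-utils is installed. The helper names and the 20-character threshold are hypothetical, and nothing here is part of the addon.

```python
import subprocess


def looks_searchable(text, min_chars=20):
    """True when the text holds at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def extract_text(pdf_path):
    """Run pdftotext (assumed to be on PATH) and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def should_run_ocr(pdf_path):
    """Decide whether a PDF still needs OCR, based on its existing text layer."""
    return not looks_searchable(extract_text(pdf_path))
```

For OCRmyPDF users, the tool's own --skip-text option serves a similar purpose, whereas --force-ocr re-processes pages that already contain text.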

TIFF images ingestion

When accepting TIFF images to OCR, include link in resulting PDF for original TIFF file

libtiff.so.5 & liblept.so.5 problems on latest Alfresco 5.2 and Ubuntu 16.04

Hi,
I have an issue after installing latest simple-ocr 2.3.1 with pdfsandwich ocr engine, Alfresco 5.2.0 and Ubuntu 16.04 LTS.
All supporting apps installed with apt-get / dpkg method of installation as follows:

  1. TESSERACT:
# apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-ind
# tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
  2. PDFSANDWICH:
# dpkg -i /media/sf_Downloads/Alfresco/Addons/simple-ocr/apps-supported/pdfsandwich_0.1.6_amd64.deb
# apt-get -fy install
# pdfsandwich -version
pdfsandwich version 0.1.6

Error when trying to OCR a document by clicking the OCR button on the document page:

Exception in thread "defaultAsyncAction5" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 02290026 Failed to perform OCR transformation:
Execution result:
  os:         Linux
  command:    /opt/alfresco-community/bin/bw-pdfsandwich.sh -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166_ocr.pdf
  succeeded:  false
  exit code:  2
  out:        pdfsandwich version 0.1.6
Checking for convert:
convert -version
Version: ImageMagick 7.0.5-2 Q16 x86_64 2017-04-04 http://www.imagemagick.org
Copyright: © 1999-2017 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Featur
  err:        tesseract: /opt/alfresco-community/common/lib/libtiff.so.5: no version information available (required by /usr/lib/liblept.so.5)
tesseract 3.04.01
leptonica-1.73
 libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.56 : libtiff 4.0.7 : zli
       at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
       at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
       at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
       at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
       at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464)
       at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
       at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
       at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 02290026 Failed to perform OCR transformation:
Execution result:
  os:         Linux
  command:    /opt/alfresco-community/bin/bw-pdfsandwich.sh -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166_ocr.pdf
  succeeded:  false
  exit code:  2
  out:        pdfsandwich version 0.1.6
Checking for convert:
convert -version
Version: ImageMagick 7.0.5-2 Q16 x86_64 2017-04-04 http://www.imagemagick.org
Copyright: © 1999-2017 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Featur
  err:        tesseract: /opt/alfresco-community/common/lib/libtiff.so.5: no version information available (required by /usr/lib/liblept.so.5)
tesseract 3.04.01
leptonica-1.73
 libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.56 : libtiff 4.0.7 : zli
       at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
       at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
       ... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 02290026 Failed to perform OCR transformation:
Execution result:
  os:         Linux
  command:    /opt/alfresco-community/bin/bw-pdfsandwich.sh -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166_ocr.pdf
  succeeded:  false
  exit code:  2
  out:        pdfsandwich version 0.1.6
Checking for convert:
convert -version
Version: ImageMagick 7.0.5-2 Q16 x86_64 2017-04-04 http://www.imagemagick.org
Copyright: © 1999-2017 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Featur
  err:        tesseract: /opt/alfresco-community/common/lib/libtiff.so.5: no version information available (required by /usr/lib/liblept.so.5)
tesseract 3.04.01
leptonica-1.73
 libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.56 : libtiff 4.0.7 : zli
       at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
       ... 11 more

I found the following libtiff files on the server:

root@alflab:/usr# ls -l /usr/lib/x86_64-linux-gnu/libtiff.*
lrwxrwxrwx 1 root root     16 Mar 20 23:42 /usr/lib/x86_64-linux-gnu/libtiff.so.5 -> libtiff.so.5.2.4
-rw-r--r-- 1 root root 475496 Mar 20 23:42 /usr/lib/x86_64-linux-gnu/libtiff.so.5.2.4
root@alflab:/usr# ls -l /opt/alfresco-community/common/lib/libtiff.*
-rw-r--r-- 1 root root 781854 Jun 16  2017 /opt/alfresco-community/common/lib/libtiff.a
-rwxr-xr-x 1 root root   1099 Jun 16  2017 /opt/alfresco-community/common/lib/libtiff.la
lrwxrwxrwx 1 root root     16 Mar 24 22:54 /opt/alfresco-community/common/lib/libtiff.so -> libtiff.so.5.2.5
lrwxrwxrwx 1 root root     16 Mar 24 22:54 /opt/alfresco-community/common/lib/libtiff.so.5 -> libtiff.so.5.2.5
-rwxr-xr-x 1 root root 525016 Jun 16  2017 /opt/alfresco-community/common/lib/libtiff.so.5.2.5

It seems Ubuntu's leptonica does not match Alfresco's libtiff version. Correct me if I'm wrong.

How to fix this error?

Thank you,
[bayu]

Problem writing OCRed file

Hello,
very nice tool to have but I cannot seem to make it work.
It produces the OCR file (through pdfsandwich in my case) but cannot write the final result and place it in the Alfresco repository.
Below is an extract from the alfresco.log file; maybe you can help?
Thanks in advance
Mat

2016-04-07 17:08:56,234 WARN [es.keensoft.alfresco.ocr.OCRExtractAction] [defaultAsyncAction3] org.alfresco.service.cmr.repository.ContentIOException: 03070084 Failed to copy reader content to writer:
writer: ContentAccessor[ contentUrl=store://2016/4/7/17/8/da25d86b-dbc1-4b95-ba10-4d2daaef7b17.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=it]
source reader: ContentAccessor[ contentUrl=store://2016/4/7/17/6/4a37ad34-b135-49cb-b316-9c7386eaf5bd.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=en_US]
org.alfresco.service.cmr.repository.ContentIOException: 03070084 Failed to copy reader content to writer:
writer: ContentAccessor[ contentUrl=store://2016/4/7/17/8/da25d86b-dbc1-4b95-ba10-4d2daaef7b17.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=it]
source reader: ContentAccessor[ contentUrl=store://2016/4/7/17/6/4a37ad34-b135-49cb-b316-9c7386eaf5bd.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=en_US]

OCRmyPDF creates error when run from alfresco-simple-ocr

Alfresco Repository: 5.2.0 (r135134-b14) Community-Edition
Simple-OCR: 1.1.1
OS: Ubuntu Linux 16.04 (Kernel 4.4.0-71-generic)
OCRmyPDF: 4.5.3
Python 3: 3.5.2

I installed simple-ocr-repo.amp on my Alfresco Server to work with OCRmyPDF. The alfresco-global.properties contains the following lines:

### Simple OCR Action Properties
ocr.command=/usr/local/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=
ocr.extra.commands=-l deu -c -d
ocr.server.os=linux

The folder rule is working and I can see running tasks from ocrmypdf, unpaper and tesseract.

A file .../tomcat/temp/Alfresco/OCRTransformWorker_source_1267824845391978282.pdf is created, but no .../tomcat/temp/Alfresco/OCRTransformWorker_source_1267824845391978282_ocr.pdf.

alfresco.log contains:

Execution result: 
   os:         Linux
   command:    /usr/local/bin/ocrmypdf -l deu -c -d /home/andreas/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1267824845391978282.pdf /home/andreas/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1267824845391978282_ocr.pdf
   succeeded:  true
   exit code:  15
   out:        
   err:        /usr/bin/python3: /home/andreas/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/bin/python3)
WARNING -    1: [tesseract] lots of diacritics - possibly poor OCR
  ERROR - Unrecoverable error: rangecheck in .
	at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:72)

Running the command directly in the shell creates the target file without any problems. Is there a problem with the environment? Replacing libz.so.1 with the OS version did not help and caused another, similar error in python3.

Unable to convert .tif (multipages .tif file) to pdf using pdfsandwich on Ubuntu

Hi,

I have tried the OCR Action, and I have also run the same command in the Ubuntu terminal, both with no luck.

I get the error below both when I use the OCR Action and when I run the command below directly from the terminal on Ubuntu to convert a .tif (multipage TIFF file) to a .pdf file.

Can you please help on this?

$ /usr/bin/pdfsandwich -verbose -lang spa+eng+fra Sample_3_Multi_page.tif -o Sample_3_Multi_page.pdf
pdfsandwich version 0.1.4
Checking for convert:
convert -version
Version: ImageMagick 6.8.9-9 Q16 x86_64 2018-07-10 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC Modules OpenMP
Delegates: bzlib cairo djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png rsvg tiff wmf x xml zlib

Checking for unpaper:
unpaper -version
6.1
Checking for tesseract:
tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

Checking for gs:
gs -v
GPL Ghostscript 9.18 (2015-10-05)
Copyright (C) 2015 Artifex Software, Inc. All rights reserved.
Input file: "Sample_3_Multi_page.tif"
Output file: "Sample_3_Multi_page.pdf"
Fatal error: exception Failure("Error: Could not determine number of pages of file Sample_3_Multi_page.tif")

Thanks.

Failed to load on Alfresco 5.2

Hi,
I've a problem when installing version 1.1.1 on Alfresco Community 5.2.
Following are the logs:

 2017-05-10 19:32:50,891  INFO  [management.subsystems.ChildApplicationContextFactory] [localhost-startStop-1] Startup of 'Authentication' subsystem, ID: [Authentication, managed, alfrescoNtlm1] complete
 2017-05-10 19:32:52,217  WARN  [context.support.XmlWebApplicationContext] [localhost-startStop-1] Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ocr-extract' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Cannot resolve reference to bean 'transformer.worker.OCR' while setting bean property 'ocrTransformWorker'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'transformer.worker.OCR' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Initialization of bean failed; nested exception is org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'boolean' for property 'verbose'; nested exception is java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
 2017-05-10 19:32:52,752  INFO  [management.subsystems.ChildApplicationContextFactory] [localhost-startStop-1] Stopping 'Authentication' subsystem, ID: [Authentication, managed, alfrescoNtlm1]
 2017-05-10 19:32:52,752  INFO  [management.subsystems.ChildApplicationContextFactory] [localhost-startStop-1] Stopped 'Authentication' subsystem, ID: [Authentication, managed, alfrescoNtlm1]
 2017-05-10 19:32:52,757  ERROR [web.context.ContextLoader] [localhost-startStop-1] Context initialization failed
 org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ocr-extract' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Cannot resolve reference to bean 'transformer.worker.OCR' while setting bean property 'ocrTransformWorker'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'transformer.worker.OCR' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Initialization of bean failed; nested exception is org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'boolean' for property 'verbose'; nested exception is java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:334)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:108)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1419)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1160)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:458)
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:293)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:223)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:290)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:191)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:636)
        at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:938)
        at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:479)
        at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:410)
        at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:306)
        at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:112)
        at org.alfresco.web.app.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:70)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5016)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5524)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:649)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:672)
        at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1859)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'transformer.worker.OCR' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Initialization of bean failed; nested exception is org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'boolean' for property 'verbose'; nested exception is java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:529)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:458)
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:293)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:223)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:290)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:191)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:328)
        ... 29 more
Caused by: org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'boolean' for property 'verbose'; nested exception is java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.BeanWrapperImpl.convertIfNecessary(BeanWrapperImpl.java:469)
        at org.springframework.beans.BeanWrapperImpl.convertForProperty(BeanWrapperImpl.java:495)
        at org.springframework.beans.BeanWrapperImpl.convertForProperty(BeanWrapperImpl.java:489)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.convertForProperty(AbstractAutowireCapableBeanFactory.java:1465)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1424)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1160)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
        ... 35 more
Caused by: java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.propertyeditors.CustomBooleanEditor.setAsText(CustomBooleanEditor.java:125)
        at org.springframework.beans.TypeConverterDelegate.doConvertTextValue(TypeConverterDelegate.java:455)
        at org.springframework.beans.TypeConverterDelegate.doConvertValue(TypeConverterDelegate.java:427)
        at org.springframework.beans.TypeConverterDelegate.convertIfNecessary(TypeConverterDelegate.java:181)
        at org.springframework.beans.BeanWrapperImpl.convertIfNecessary(BeanWrapperImpl.java:449)
        ... 41 more
May 10, 2017 7:32:52 PM org.apache.catalina.core.StandardContext listenerStart
SEVERE: Exception sending context initialized event to listener instance of class org.alfresco.web.app.ContextLoaderListener
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ocr-extract' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Cannot resolve reference to bean 'transformer.worker.OCR' while setting bean property 'ocrTransformWorker'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'transformer.worker.OCR' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Initialization of bean failed; nested exception is org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'boolean' for property 'verbose'; nested exception is java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:334)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:108)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1419)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1160)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:458)
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:293)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:223)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:290)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:191)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:636)
        at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:938)
        at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:479)
        at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:410)
        at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:306)
        at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:112)
        at org.alfresco.web.app.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:70)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5016)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5524)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:649)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:672)
        at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1859)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'transformer.worker.OCR' defined in class path resource [alfresco/module/simple-ocr-repo/context/service-context.xml]: Initialization of bean failed; nested exception is org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'boolean' for property 'verbose'; nested exception is java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:529)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:458)
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:293)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:223)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:290)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:191)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:328)
        ... 29 more
Caused by: org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'boolean' for property 'verbose'; nested exception is java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.BeanWrapperImpl.convertIfNecessary(BeanWrapperImpl.java:469)
        at org.springframework.beans.BeanWrapperImpl.convertForProperty(BeanWrapperImpl.java:495)
        at org.springframework.beans.BeanWrapperImpl.convertForProperty(BeanWrapperImpl.java:489)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.convertForProperty(AbstractAutowireCapableBeanFactory.java:1465)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1424)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1160)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:519)
        ... 35 more
Caused by: java.lang.IllegalArgumentException: Invalid boolean value [${ocr.output.verbose}]
        at org.springframework.beans.propertyeditors.CustomBooleanEditor.setAsText(CustomBooleanEditor.java:125)
        at org.springframework.beans.TypeConverterDelegate.doConvertTextValue(TypeConverterDelegate.java:455)
        at org.springframework.beans.TypeConverterDelegate.doConvertValue(TypeConverterDelegate.java:427)
        at org.springframework.beans.TypeConverterDelegate.convertIfNecessary(TypeConverterDelegate.java:181)
        at org.springframework.beans.BeanWrapperImpl.convertIfNecessary(BeanWrapperImpl.java:449)
        ... 41 more

OS Linux Ubuntu 16.04.
How can I solve this issue?

Thanks,
[bayu]
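
The message `Invalid boolean value [${ocr.output.verbose}]` shows that Spring received the literal placeholder text, so the placeholder was never replaced by a value. One possible cause, offered here as an assumption rather than a confirmed diagnosis, is that `ocr.output.verbose` is not defined in `alfresco-global.properties`. The sketch below lists which OCR keys are missing from a properties file. The key names are the ones shown in the "OCR Config" block of another report on this page; the function name and the demo file are hypothetical.

```shell
#!/usr/bin/env bash
# Print every OCR key that is NOT defined in the given properties file.
missing_ocr_keys() {
  local file="$1" key
  for key in ocr.command ocr.output.verbose ocr.output.file.prefix.command \
             ocr.extra.commands ocr.server.os; do
    # A key counts as defined when a non-comment line starts with "key=".
    grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file" || echo "$key"
  done
}

# Demo on a throwaway file that defines only two of the keys (hypothetical content).
demo=$(mktemp)
printf 'ocr.command=/usr/bin/pdfsandwich\nocr.server.os=linux\n' > "$demo"
missing=$(missing_ocr_keys "$demo")
echo "$missing"
rm -f "$demo"
```

To check a real installation, call `missing_ocr_keys` with the path to your own `alfresco-global.properties`; the location depends on how Alfresco was installed.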

OCR not working properly on tiff and jpeg files

I am trying to perform OCR on TIFF and JPEG files, but the log shows "Couldn't find trailer dictionary", "Couldn't read xref table" and exception Failure("Error: pdfinfo could not determine number of pages. Check the pdf input file.\n"). The transformation from JPEG or TIFF files to PDF files works properly, however, and the PDF file is visible on the Alfresco Share page.

getContent from OCR'd file

Hi,
I successfully implemented simple-ocr with Tesseract in Alfresco on Linux/Ubuntu and everything works fine. I can OCR a document and search via the Live Search and the advanced search. Now I want to get the content of the PDF document, read some information from it and add it to the properties. When I call "document.getContent()" on a file (with and without OCR applied) I only get the encoded PDF content. I thought that, since the file was OCR'd, there would be a second stream inside the PDF containing just plain text, but that's not the case.
Is there a way to extract or simply get the OCR plain-text layer out of the PDF? If so, how can one do that?

I know I can transform the file into .TXT format, extract the content I need, add it to the original PDF file and delete/store the .TXT file, but that's a lot of effort just to add a value to the properties.

Async error

Hello,
My installation works fine with your plugin when I add PDF files with drag & drop.
But when I click the "OCR" button in Share, Alfresco shows a popup saying "some error happened..." and catalina.out says:
Caused by: org.alfresco.error.AlfrescoRuntimeException: 08070071 Invalid parameter asynchronous for action/condition ocr-extract
at org.alfresco.repo.web.scripts.rule.AbstractRuleWebScript.parseJsonParameterValues(AbstractRuleWebScript.java:363)
at org.alfresco.repo.web.scripts.rule.AbstractRuleWebScript.parseJsonAction(AbstractRuleWebScript.java:255)
at org.alfresco.repo.web.scripts.rule.ActionQueuePost.executeImpl(ActionQueuePost.java:75)
at org.springframework.extensions.webscripts.DeclarativeWebScript.execute(DeclarativeWebScript.java:64)

Can you help me ?

Olivier

Ok for ALF CE 5.1 with -force !

Hi all,
It works for me after many tests; I had to run this:

java/bin/java -jar bin/alfresco-mmt.jar install amps/simple-ocr-repo-1.1.0.amp tomcat/webapps/alfresco.war -preview -force

Cordially

AttributeError when calling ocrmypdf from script

When dropping a PDF-file into a folder that has a rule to perform ocr the following errors show up in alfresco.log and catalina.log:

2016-09-14 10:55:02,076 INFO  [es.keensoft.alfresco.ocr.OCRTransformWorker] [defaultAsyncAction6] EXIT VALUE: 1
2016-09-14 10:55:02,076 INFO  [es.keensoft.alfresco.ocr.OCRTransformWorker] [defaultAsyncAction6] STDOUT:
2016-09-14 10:55:02,076 INFO  [es.keensoft.alfresco.ocr.OCRTransformWorker] [defaultAsyncAction6] STDERR: Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/lib/python3.4/site-packages/ocrmypdf/__main__.py", line 60, in <module>
    if tesseract.version() < MINIMUM_TESS_VERSION:
  File "/usr/lib64/python3.4/functools.py", line 472, in wrapper
    result = user_function(*args, **kwds)
  File "/usr/lib/python3.4/site-packages/ocrmypdf/tesseract.py", line 57, in version
    tesseract_version = re.match(r'tesseract\s(.+)', versions).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

2016-09-14 10:55:02,076 WARN  [es.keensoft.alfresco.ocr.OCRExtractAction] [defaultAsyncAction6] workspace://SpacesStore/36917a25-024c-4f40-a1d1-fd1d48b23ac3: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 08140008 Failed to perform OCR transformation:
Execution result:
   os:         Linux
   command:    /opt/alfresco-community/scripts/ocr.sh /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6082099407754684458.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6082099407754684458_ocr.pdf
   succeeded:  false
   exit code:  1
   out:
   err:        Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/lib/python3.4/site-packages/ocrmypdf/__main__.py", line 60, in <module>
    if tesseract.version() < MINIMU

The exit code 1 indicates that there is a problem with attributes

We use a script that executes ocrmypdf

#!/bin/bash
export PATH=/usr/bin/:$PATH
ocrmypdf --verbose 1 -l deu+eng "$@"

If the script or ocrmypdf itself is executed manually, everything works fine and a PDF with ocr is generated.
By the way, if the command in the script is executed with "sudo", it runs into another error:
err: sudo: sorry, you must have a tty to run sudo

ocr-section in alfresco-global.properties looks like this:

# OCR Config
ocr.command=/opt/alfresco-community/scripts/ocr.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=
ocr.extra.commands=
ocr.server.os=linux

List of installed amps:

[root@alfresco alfresco-community]# java -jar ./bin/alfresco-mmt.jar list tomcat/webapps/alfresco
Module 'simple-ocr-repo' installed in 'tomcat/webapps/alfresco'
   -    Title:        simple-ocr-repo Repository AMP project
   -    Version:      1.1.0.1607271657
   -    Install Date: Wed Sep 14 10:38:07 CEST 2016
   -    Description:   Manages the lifecycle of the simple-ocr-repo Repository AMP (Alfresco Module Package)
Module 'org.alfresco.integrations.google.docs' installed in 'tomcat/webapps/alfresco'
   -    Title:        Alfresco / Google Docs Integration
   -    Version:      3.0.3
   -    Install Date: Mon Sep 12 11:05:32 CEST 2016
   -    Description:   The Repository side artifacts of the Alfresco / Google Docs Integration.
Module 'alfresco-aos-module' installed in 'tomcat/webapps/alfresco'
   -    Title:        Alfresco Office Services Module
   -    Version:      1.1
   -    Install Date: Mon Sep 12 11:05:29 CEST 2016
   -    Description:   Allows applications that can talk to a SharePoint server to talk to your Alfresco installation
Module 'alfresco-share-services' installed in 'tomcat/webapps/alfresco'
   -    Title:        Alfresco Share Services AMP
   -    Version:      5.2.0
   -    Install Date: Mon Sep 12 11:05:27 CEST 2016
   -    Description:   Module to be applied to alfresco.war, containing APIs for Alfresco Share

System:
centos7.2
Alfresco Community 5.1
ocrmypdf 4.2.4
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0

Any idea how to solve this?
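
The traceback shows `re.match(r'tesseract\s(.+)', versions)` returning `None`, which means ocrmypdf did not receive text starting with `tesseract <version>` when it queried Tesseract. Since the script works when run by hand, one thing worth comparing is what the script sees when Alfresco launches it. The following is a minimal diagnostic sketch, not a fix; the helper name is hypothetical. It merges stderr into the captured output because another log on this page shows Tesseract 3.04 printing its version banner on STDERR.

```shell
#!/usr/bin/env bash
# Succeeds when the text starts with "tesseract <something>", which is the
# shape the regular expression in the traceback expects.
looks_like_tesseract_version() {
  case "$1" in
    tesseract[[:space:]]?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print what this environment resolves, so that an interactive shell can be
# compared with the environment Alfresco starts the script in.
echo "PATH=$PATH"
echo "tesseract binary: $(command -v tesseract || echo 'not found')"

# Capture stdout and stderr together; do not abort if tesseract is missing.
version_output=$(tesseract --version 2>&1 || true)
if looks_like_tesseract_version "$version_output"; then
  echo "version line recognised: $(printf '%s\n' "$version_output" | head -n 1)"
else
  echo "no 'tesseract <version>' line found; output was: $version_output"
fi
```

Temporarily adding these lines to `ocr.sh`, before the `ocrmypdf` call, would put the result in the same log as the traceback (with `ocr.output.verbose=true`, as configured above).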

Custom indicator on Share for OCRd PDFs

How about creating a custom indicator for OCRd PDFs?
I think that would make it easier for users to see when the process has completed successfully.

I can work on it if you agree with the idea.

Alfresco and pdfsandwich on 2 different servers

Hi, I am trying to use this add-on on a Windows installation of Alfresco with pdfsandwich on a Linux server.
So I took the Linux script provided here https://angelborroy.wordpress.com/2017/01/19/alfresco-installing-ocr-as-an-external-service/ and adapted it for Windows; the script works fine on its own. But I can't make it work with Alfresco and the add-on. Here's my global.properties file:
`ocr.command=E:\test.bat
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang eng+fra
ocr.server.os=linux`

At first I got an error saying that the host does not exist. I was using a PuTTY saved session, so I tried without the saved session and now get a read error (408).

If I replace ocr.server.os=linux with ocr.server.os=windows, I get Java errors such as an invalid URI, because it tries to connect to a web service that doesn't exist (I use pdfsandwich on the Linux server); it doesn't try to use the .bat script.

Has anyone already tried something like this? Is it even possible?

Error setting versionService after the "Action is not asynchronous by design anymore" changes

I've updated my source code, compiled it and tried it on the same server I was testing on before. Now, when the server starts, I see the error message below and Alfresco doesn't work after that:


Caused by: org.springframework.beans.NotWritablePropertyException: Invalid property 'versionService' of bean class [es.keensoft.alfresco.ocr.OCRExtractAction]: Bean property 'versionService' is not writable or has an invalid setter method. Does the parameter type of the setter match the return type of the getter?
    at org.springframework.beans.BeanWrapperImpl.setPropertyValue(BeanWrapperImpl.java:1044)
    at org.springframework.beans.BeanWrapperImpl.setPropertyValue(BeanWrapperImpl.java:904)
    at org.springframework.beans.AbstractPropertyAccessor.setPropertyValues(AbstractPropertyAccessor.java:75)
    at org.springframework.beans.AbstractPropertyAccessor.setPropertyValues(AbstractPropertyAccessor.java:57)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1452)

Wizard install workaround script in FAQ is buggy

Thanks for writing the plugin, it works well and will let me digitize all my personal papers! A ton of space saved in the drawers!

I just wanted to point out that the script for the workaround if you've installed with the wizard is buggy.

Here is a script that does the same thing but more sanely:

#!/usr/bin/env bash
# set -o xtrace # Uncomment for debugging/troubleshooting

sudo /usr/bin/pdfsandwich "$@"

By quoting your variable, you avoid word splitting. For example, if you passed several options as in -verbose -lang eng+fra via ocr.extra.commands, pdfsandwich would only run with -verbose and without -lang or anything after if $@ wasn't quoted.

Also, sudo is more sane than su because it sanitizes arguments.

Setting the xtrace option gives more verbose output in the logs when the command fails. It's quite helpful when troubleshooting.
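
The effect of quoting `"$@"` can be seen in a small, self-contained sketch. The function names and the file name with a space are hypothetical and serve only to make the word splitting visible: with quotes, every argument is forwarded unchanged, while without them any argument containing whitespace is split into several.

```shell
#!/usr/bin/env bash
# Print how many arguments were received.
count_args() { echo "$#"; }

# Forward the arguments with and without quoting.
forward_quoted()   { count_args "$@"; }
forward_unquoted() { count_args $@; }   # intentionally unquoted to show word splitting

# Three arguments, the last one containing a space.
quoted=$(forward_quoted -lang eng+fra "my scan.pdf")
unquoted=$(forward_unquoted -lang eng+fra "my scan.pdf")
echo "quoted: $quoted, unquoted: $unquoted"
```

With the default `IFS` this prints `quoted: 3, unquoted: 4`, because the unquoted form turns `my scan.pdf` into two separate words.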

Plugin is not working with multi-tenant environment

BUG

OCR error on Multi-tenant Environment.

Expected behavior

When I upload a PDF file (galon.pdf) in the tenant ([email protected]) --> click OCR in the menu --> the OCR result file is created.

Actual behavior

When I upload a PDF file (galon.pdf) in the tenant ([email protected]) --> click OCR in the menu --> the OCR result file is not created.

Steps to reproduce the behavior

  • Install Alfresco, Tesseract, pdfsandwich, alfresco-simple-ocr
  • Create tenant
  • Upload PDF test file then OCR

Additional details (analysis so far, log statements, references, etc.)

Error log in the attached file

Tell us about your environment

CentOS 7 64-bit + Tesseract 3.05 or 4.0 + pdfsandwich 0.1.6, alfresco-simple-ocr orig
ocr_error.txt
jar file

action not visible

Hi,

I cannot see the action in the folder actions / rules. I did the following to install:

  • setup OCRmyPDF
  • `# java -jar alfresco-mmt.jar install /home/andre/simple-ocr-repo.amp /opt/alfresco/addons/war/share.war -verbose
    Installing AMP '/home/andre/simple-ocr-repo.amp' into WAR '/opt/alfresco/addons/war/share.war'
    Backing up WAR file...
    WAR has been backed up to '/opt/alfresco/addons/war/share.war-1460411326904.bak'
    Adding files relating to version '1.0.0.1604041237' of module 'simple-ocr-repo'
    • File '/WEB-INF/lib/simple-ocr-repo.jar' added to war from amp
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/context/bootstrap-context.xml' added to war from amp
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/context/model-context.xml' added to war from amp
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/context/service-context.xml' added to war from amp
    • Directory '/WEB-INF/classes/alfresco/module/simple-ocr-repo/context' added to war
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/messages/simple-ocr-repo.properties' added to war from amp
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/messages/simple-ocr-repo_es.properties' added to war from amp
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/messages/simple-ocr-repo_pt.properties' added to war from amp
    • Directory '/WEB-INF/classes/alfresco/module/simple-ocr-repo/messages' added to war
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/model/ocr-model.xml' added to war from amp
    • Directory '/WEB-INF/classes/alfresco/module/simple-ocr-repo/model' added to war
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/alfresco-global.properties' added to war from amp
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/log4j.properties' added to war from amp
    • File '/WEB-INF/classes/alfresco/module/simple-ocr-repo/module-context.xml' added to war from amp
    • Directory '/WEB-INF/classes/alfresco/module/simple-ocr-repo' added to war
      root@alfresco:/opt/alfresco/addons# service alfresco restart
      `
  • added the code to alfresco-global.properties

Did I miss something? Version is CE 5.1

thank you!

[SOLVED - CPU Usage Improvement] Unable to pass the -tesseract command line option to pdfsandwich

Attempting to control the processor priority of tesseract with nice. I don't want to limit tesseract, I just don't want it to bring the rest of Alfresco to a halt.

Method attempted:
alfresco-global.properties:
ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-tesseract "nice -n 10 tesseract" -verbose -lang eng
ocr.server.os=linux

Log after attempt:
Rule applied to folder to trigger OCR of the document added, upon uploading PDF to the folder the logs are as follows:
os: Linux
command: /usr/bin/pdfsandwich -tesseract "nice -n 10 tesseract" -verbose -lang eng /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_7526164203075376866.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_7526164203075376866_ocr.pdf
succeeded: false
exit code: 2
out:
err: /usr/bin/pdfsandwich: unknown option `-n'.
USAGE: pdfsandwich [options] inputfile.pdf

If I copy and use the line after "command:" at prompt, it successfully executes the command and "top" reflects the nice priority. How can I pass the "nice -n 10 tesseract" to limit the priority?

What I've tried so far:
I've moved the (-tesseract "nice -n 10 tesseract") option to the "ocr.command" line and also attempted using a "\-n" to allow it.

What else can I do to make this work?
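
The error ``unknown option `-n'`` is consistent with the configured command line being split on whitespace without honouring the double quotes, so that `-n` reaches pdfsandwich as a separate argument. One possible workaround, sketched here under that assumption and not necessarily the solution the title refers to, is to point `ocr.command` at a small wrapper script that applies `nice` itself. Child processes such as tesseract inherit the niceness of the process that starts them. The variable names and the idea of saving this as a separate script are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper to be referenced from ocr.command instead of
# /usr/bin/pdfsandwich. Both settings can be overridden via the environment.
OCR_BIN="${OCR_BIN:-/usr/bin/pdfsandwich}"
OCR_NICENESS="${OCR_NICENESS:-10}"

# Run the OCR binary at lower priority, forwarding all arguments unchanged.
run_niced() {
  nice -n "$OCR_NICENESS" "$OCR_BIN" "$@"
}

# Only run when arguments are supplied, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  run_niced "$@"
fi
```

With such a wrapper in place, `ocr.extra.commands` would hold only `-verbose -lang eng`, with no quoted `-tesseract` option.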

PDF scanned with simple OCR, but it is not searchable. NO error message.

`2017-07-16 12:05:54,696 WARN [es.keensoft.alfresco.ocr.OCRExtractAction] [http-apr-8080-exec-9] workspace://SpacesStore/8baf93bd-102e-4404-b687-2707d0e91a55: Node does not exist: workspace://SpacesStore/8baf93bd-102e-4404-b687-2707d0e91a55 (status:null)
2017-07-16 12:06:11,341 INFO [es.keensoft.alfresco.ocr.OCRTransformWorker] [http-apr-8080-exec-9] EXIT VALUE: 0
2017-07-16 12:06:11,344 INFO [es.keensoft.alfresco.ocr.OCRTransformWorker] [http-apr-8080-exec-9] STDOUT: pdfsandwich version 0.1.4
Checking for convert:
convert -version
Version: ImageMagick 6.8.9-9 Q16 x86_64 2016-04-18 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC Modules OpenMP
Delegates: bzlib djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png tiff wmf x xml zlib

Checking for unpaper:
unpaper -version
6.1
Checking for tesseract:
tesseract -v
Checking for gs:
gs -v
GPL Ghostscript 9.18 (2015-10-05)
Copyright (C) 2015 Artifex Software, Inc. All rights reserved.
Input file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656.pdf"
Output file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656_ocr.pdf"
Number of pages in inputfile: 2
More threads than pages. Using 2 threads instead.

Parallel processing with 2 threads started.
Processing page order may differ from original page order.

Processing page 1.
identify -format "%w\n%h\n" "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656.pdf[0]"
Processing page 2.
identify -format "%w\n%h\n" "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656.pdf[1]"
convert -type Bilevel -density 300x300 "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656.pdf[0]" /tmp/pdfsandwich5ec85c.pbm
convert -type Bilevel -density 300x300 "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656.pdf[1]" /tmp/pdfsandwich257aef.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich257aef.pbm /tmp/pdfsandwicha53d7a_unpaper.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich5ec85c.pbm /tmp/pdfsandwich0ec278_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich257aef.pbm -> /tmp/pdfsandwicha53d7a_unpaper.pbm
tesseract /tmp/pdfsandwicha53d7a_unpaper.pbm /tmp/pdfsandwichc79f63 -l eng pdf
Processing sheet #1: /tmp/pdfsandwich5ec85c.pbm -> /tmp/pdfsandwich0ec278_unpaper.pbm
tesseract /tmp/pdfsandwich0ec278_unpaper.pbm /tmp/pdfsandwich36d95f -l eng pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=613 -dDEVICEHEIGHTPOINTS=790 -dPDFFitPage -o /tmp/pdfsandwich1b3ea1.pdf /tmp/pdfsandwich36d95f.pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=613 -dDEVICEHEIGHTPOINTS=790 -dPDFFitPage -o /tmp/pdfsandwich449b3d.pdf /tmp/pdfsandwichc79f63.pdf
OCR done. Writing "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656_ocr.pdf"
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1103660786151459656_ocr.pdf" /tmp/pdfsandwich1b3ea1.pdf /tmp/pdfsandwich449b3d.pdf

Done.

2017-07-16 12:06:11,344 INFO [es.keensoft.alfresco.ocr.OCRTransformWorker] [http-apr-8080-exec-9] STDERR: tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.51 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

[image2 @ 0x12b8600] Encoder did not produce proper pts, making some up.
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
[image2 @ 0x2363600] Encoder did not produce proper pts, making some up.
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
OSD: Weak margin (2.69) for 395 blob text block, but using orientation anyway: 0
OSD: Weak margin (6.59) for 413 blob text block, but using orientation anyway: 0
Detected 8 diacritics

`

No bean named 'ocr-extract' is defined

BUG

Can't use OCR

Expected behavior

Use OCR

Actual behavior

Message "Some error happened when processing your request, OCR has not been applied to the document" when clicking the OCR button.

Steps to reproduce the behavior

Open a scanned PDF and click on the OCR button

Additional details (analysis so far, log statements, references, etc.)

2017-11-30 13:34:43,548 ERROR [org.springframework.extensions.webscripts.AbstractRuntime] [http-apr-8080-exec-2] Exception from executeScript: 10300006 Wrapped Exception (with status template): No bean named 'ocr-extract' is defined
org.springframework.extensions.webscripts.WebScriptException: 10300006 Wrapped Exception (with status template): No bean named 'ocr-extract' is defined
        at org.springframework.extensions.webscripts.AbstractWebScript.createStatusException(AbstractWebScript.java:1138)
        at org.springframework.extensions.webscripts.DeclarativeWebScript.execute(DeclarativeWebScript.java:171)
        at org.alfresco.repo.web.scripts.RepositoryContainer$3.execute(RepositoryContainer.java:519)
        at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464)
        at org.alfresco.repo.web.scripts.RepositoryContainer.transactionedExecute(RepositoryContainer.java:587)
        at org.alfresco.repo.web.scripts.RepositoryContainer.transactionedExecuteAs(RepositoryContainer.java:656)
        at org.alfresco.repo.web.scripts.RepositoryContainer.executeScriptInternal(RepositoryContainer.java:428)
        at org.alfresco.repo.web.scripts.RepositoryContainer.executeScript(RepositoryContainer.java:308)
        at org.springframework.extensions.webscripts.AbstractRuntime.executeScript(AbstractRuntime.java:399)
        at org.springframework.extensions.webscripts.AbstractRuntime.executeScript(AbstractRuntime.java:210)
        at org.springframework.extensions.webscripts.servlet.WebScriptServlet.service(WebScriptServlet.java:132)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.alfresco.module.aosmodule.service.ContextRootFilter.doFilter(ContextRootFilter.java:93)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.alfresco.web.app.servlet.GlobalLocalizationFilter.doFilter(GlobalLocalizationFilter.java:68)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:218)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:506)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:962)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1115)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
        at org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run(AprEndpoint.java:2486)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.springframework.beans.factory.NoSuchBeanDefinitionException: No bean named 'ocr-extract' is defined
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBeanDefinition(DefaultListableBeanFactory.java:575)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getMergedLocalBeanDefinition(AbstractBeanFactory.java:1111)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:276)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:191)
        at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1123)
        at org.alfresco.repo.action.ActionServiceImpl.getActionDefinition(ActionServiceImpl.java:286)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
        at org.alfresco.repo.security.permissions.impl.AlwaysProceedMethodInterceptor.invoke(AlwaysProceedMethodInterceptor.java:41)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.alfresco.repo.security.permissions.impl.ExceptionTranslatorMethodInterceptor.invoke(ExceptionTranslatorMethodInterceptor.java:53)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.alfresco.repo.audit.AuditMethodInterceptor.invoke(AuditMethodInterceptor.java:166)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:96)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:260)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:94)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
        at com.sun.proxy.$Proxy47.getActionDefinition(Unknown Source)
        at org.alfresco.repo.web.scripts.rule.AbstractRuleWebScript.parseJsonParameterValues(AbstractRuleWebScript.java:343)
        at org.alfresco.repo.web.scripts.rule.AbstractRuleWebScript.parseJsonAction(AbstractRuleWebScript.java:255)
        at org.alfresco.repo.web.scripts.rule.ActionQueuePost.executeImpl(ActionQueuePost.java:75)
        at org.springframework.extensions.webscripts.DeclarativeWebScript.execute(DeclarativeWebScript.java:64)
        ... 36 more

Tell us about your environment

Ubuntu 16.04 x64

Hello,

I have installed alfresco-simple-ocr by placing simple-ocr-repo-2.3.1.jar and simple-ocr-share-2.3.1.jar in /opt/alfresco-community/modules/share.

I have also edited alfresco-global.properties and added the following lines:

ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang spa+eng+fra
ocr.server.os=linux

Could you please help?

Regards,

David
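
Going by the addon's README, the two JARs go to different module folders: the repo JAR under modules/platform and only the share JAR under modules/share. With both JARs in modules/share the repository side would not load the action, which would be consistent with the "No bean named 'ocr-extract' is defined" error in the trace above. A minimal sketch of the copy step, assuming the install root from this report (/opt/alfresco-community); ALF_HOME, module_dir_for and install_jar are hypothetical names used only for illustration:

```shell
#!/bin/sh
# Assumed install root, taken from the report above; adjust to your system.
ALF_HOME="${ALF_HOME:-/opt/alfresco-community}"

# Print the module folder a Simple OCR JAR belongs in (layout per the README).
# Hypothetical helper: it only looks at the "-repo-" / "-share-" part of the name.
module_dir_for() {
  case "$(basename "$1")" in
    *-repo-*.jar)  printf '%s/modules/platform\n' "$ALF_HOME" ;;
    *-share-*.jar) printf '%s/modules/share\n' "$ALF_HOME" ;;
    *) return 1 ;;
  esac
}

# Copy one JAR into its module folder, creating the folder if it is missing.
install_jar() {
  dir=$(module_dir_for "$1") || { echo "unrecognized JAR: $1" >&2; return 1; }
  mkdir -p "$dir" && cp "$1" "$dir/"
}

# Does nothing unless JAR paths are passed on the command line.
for jar in "$@"; do
  install_jar "$jar" || exit 1
done
```

Called with simple-ocr-repo-2.3.1.jar and simple-ocr-share-2.3.1.jar as arguments, this places one JAR in each folder. Alfresco still has to be restarted afterwards, as the README notes.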

PDFs contain spaces between each character in words so search function won't work

Hi,
I'm quite new to the Alfresco world. What I wanted to achieve:

  • Scan a document (as .tiff)
  • Upload it to Alfresco
  • Let the magic happen and have the document converted to a .pdf
  • Search for the document by text

What works:

  • All except the search

In the document, the text is selectable but contains a space character between every letter in the words so searching is not working. Any idea how to fix this? I'm using ocrmypdf with --verbose 1 --force-ocr -l deu+eng

Thank you,
Michael

Upload new version of document doesn't execute OCR process

Hi,
I have a rule that triggers Extract OCR when a document is uploaded and when it is updated (two conditions in a single rule).

When a new document (plain PDF) is uploaded, the version is updated to 1.1 (OCR runs normally).
But when I then upload a new version, the OCR process does not run.

What is the problem, and where is it?

Thank you,

Run in background is a RAM + CPU hog

I am playing around with Simple OCR while uploading a large number of documents.

I noticed that if I don't enable the rule to run in background, some of the document uploads time out (408). If I run the task in the background then Alfresco will spawn as many concurrent jobs as there are documents.

This can be a problem when uploading 10+ documents: it consumes huge amounts of RAM by running every job in parallel (regardless of how many CPUs/cores are available). This slows the server down to a grind, and eventually the Linux OOM reaper starts killing processes.

Is there any way to set the plugin so that it only runs a specific number of jobs in parallel? Preferably a number that matches the number of CPUs on the system?
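
One possible workaround outside the addon is to point ocr.command at a small wrapper that lets only a fixed number of OCR processes run at once and makes the others wait. The sketch below is not a feature of the addon; it assumes the addon simply executes whatever ocr.command names, and OCR_BIN, OCR_SLOTS, OCR_LOCK_DIR, slot_dir and run_limited are hypothetical names. It relies on mkdir being atomic to claim a slot:

```shell
#!/bin/sh
# Hypothetical settings: the real OCR program, the number of parallel jobs
# allowed, and a directory that holds one sub-directory per busy slot.
OCR_BIN="${OCR_BIN:-/usr/bin/pdfsandwich}"
OCR_SLOTS="${OCR_SLOTS:-2}"
OCR_LOCK_DIR="${OCR_LOCK_DIR:-/tmp/ocr-slots}"

# Print the lock directory for slot number $1.
slot_dir() {
  printf '%s/slot-%s\n' "$OCR_LOCK_DIR" "$1"
}

# Run the OCR program with the given arguments once a slot is free.
run_limited() {
  mkdir -p "$OCR_LOCK_DIR"
  while :; do
    i=1
    while [ "$i" -le "$OCR_SLOTS" ]; do
      slot=$(slot_dir "$i")
      # mkdir is atomic: it succeeds for exactly one caller per free slot.
      if mkdir "$slot" 2>/dev/null; then
        "$OCR_BIN" "$@"
        status=$?
        rmdir "$slot"        # release the slot
        return "$status"
      fi
      i=$((i + 1))
    done
    sleep 1                  # every slot is busy: wait and try again
  done
}

# Act as the OCR command only when arguments were passed.
if [ "$#" -gt 0 ]; then
  run_limited "$@"
  exit $?
fi
```

With the script saved under a path of your choice and set as ocr.command, OCR_SLOTS could be set to the number of cores. Two caveats: waiting jobs still occupy Alfresco's background threads, and if the wrapper is killed before it reaches rmdir, the slot directory has to be removed by hand.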

Using windows OCR

Hello Angel.

Can you give some more info on using the Windows OCR?

Reading the MS docs, I don't see how http://localhost:60064/api/OCR/ is installed and used.

Is there a specific keensoft install for this? I think I saw something on the keensoft blog, but my Spanish is very limited.

Regards

Bertrand

alfresco crash on action with multiple files

Hi,
Thank you for alfresco-simple-ocr! It works very well with Alfresco 5.2.
If I run the rule on folders with a few files everything works fine, but if I run it on a folder with a lot of PDFs, Alfresco crashes: it starts many pdfsandwich processes, which fill the RAM and push the processors to 100%; the machine then starts swapping and Alfresco crashes without completing the rule.
Is there any way to specify how many processes run at once?
Thank you very much, and excuse my bad English :-)

Document is changed by pdfsandwich but not searchable

Hi,

I just need to get OCR running before I can put Alfresco into production for private/non-profit use. However, even after a day of trial and error I can't get past one issue.

The documents are OCR'ed when I put them into the folder with the rule. I see the convert.bin and tesseract threads running in top, and afterwards the document gets a new version number by "OCRd". But I can't search it. When I download the new PDF and open it in Acrobat, I still can't search. If I let Acrobat run its OCR on it and re-upload it as a new version, Alfresco finds the content too.

Does anyone have an idea what's going on here? Any help is highly appreciated :-)

`
Checking for convert:
convert -version
Version: ImageMagick 6.9.1-10 Q16 x86_64 2015-08-12 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules
Delegates (built-in): freetype jng jpeg ltdl png tiff wmf

Checking for unpaper:
unpaper -version
6.1
Checking for tesseract:
tesseract -v
Checking for gs:
gs -v
GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc. All rights reserved.
Input file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391.pdf"
Output file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391_ocr.pdf"
Number of pages in inputfile: 1
More threads than pages. Using 1 threads instead.
Processing page 1.
identify -format "%w\n%h\n" "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391.pdf[0]"
convert -type Bilevel -density 300x300 "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391.pdf[0]" /tmp/pdfsandwich4e4c4b.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich4e4c4b.pbm /tmp/pdfsandwich28b7bf_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich4e4c4b.pbm -> /tmp/pdfsandwich28b7bf_unpaper.pbm
tesseract /tmp/pdfsandwich28b7bf_unpaper.pbm /tmp/pdfsandwich81f2bb -l deu pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=595 -dDEVICEHEIGHTPOINTS=816 -dPDFFitPage -o /tmp/pdfsandwichb52248.pdf /tmp/pdfsandwich81f2bb.pdf
OCR done. Writing "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391_ocr.pdf"
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391_ocr.pdf" /tmp/pdfsandwichb52248.pdf

Done.

2017-05-26 19:28:55,904 INFO [es.keensoft.alfresco.ocr.OCRTransformWorker] [defaultAsyncAction1] STDERR: unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavformat-ffmpeg.so.56)
unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavcodec-ffmpeg.so.56)
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.3 : libwebp 0.4.4 : libopenjp2 2.1.0

unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavformat-ffmpeg.so.56)
unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavcodec-ffmpeg.so.56)
[image2 @ 0x1018900] Encoder did not produce proper pts, making some up.
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
`

Unable to include a rule on a folder by selecting ocr-extract action

Alfresco community has been installed on Ubuntu 17.10 in /home/administrator/alfresco-community, pdfsandwich has been installed (version command works), changes have been made in alfresco-global.properties as indicated in the readme and jar files have been copied to /home/administrator/alfresco-community/modules/share and /repo. I created a Test OCR folder in my site and attempted to add a rule as directed. The ocr-extract action is not available for me to select.

/usr/bin/pdfsandwich -version
pdfsandwich version 0.1.6

PDFSandwich OCR Settings

ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang eng+spa+fra
ocr.server.os=linux

simple OCR Integration With Alfresco on windows

Hi,
I am trying to integrate the Simple OCR addon with Alfresco on a Windows desktop. I couldn't find a way to install/configure Windows.Media.OCR as a local service. Please tell me how I can do that. Also, I need to run the addon on either Windows 7 Pro or Windows 2012 R2; may I know whether they are supported?

Inquiry about Alfresco V5.2

Hello Angel.
I need to know if there is an installer / support for Alfresco V5.2, or if one is close to being released.
Thank you.
Regards.

Pablo

Could not find program tesseract

Hi,

I am seeing intermittent failures where this addon fails to process documents. I run Extract OCR from a folder rule, when content enters that folder.

See the parts of log below:

err:        Fatal error: exception Failure("Could not find program tesseract. Make sure this program exists and can be found in your search path.
Use command line options to specify a custom binary.")

        at es.keensoft.alfresco.ocr.OCRExtractAction.executeImpl(OCRExtractAction.java:143)
        at org.alfresco.repo.action.executer.ActionExecuterAbstractBase.execute(ActionExecuterAbstractBase.java:267)
        at org.alfresco.repo.action.ActionServiceImpl.directActionExecution(ActionServiceImpl.java:839)
        at org.alfresco.repo.action.executer.CompositeActionExecuter.executeImpl(CompositeActionExecuter.java:66)
        at org.alfresco.repo.action.executer.ActionExecuterAbstractBase.execute(ActionExecuterAbstractBase.java:267)
        at org.alfresco.repo.action.ActionServiceImpl.directActionExecution(ActionServiceImpl.java:839)
        at org.alfresco.repo.action.ActionServiceImpl.executeActionImpl(ActionServiceImpl.java:740)
        at org.alfresco.repo.action.AsynchronousActionExecutionQueueImpl$ActionExecutionWrapper$1$1.execute(AsynchronousActionExecutionQueueImpl.java:423)
        at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:457)
        at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:326)
        at org.alfresco.repo.action.AsynchronousActionExecutionQueueImpl$ActionExecutionWrapper$1.doWork(AsynchronousActionExecutionQueueImpl.java:432)
        at org.alfresco.repo.tenant.TenantUtil.runAsWork(TenantUtil.java:119)
        at org.alfresco.repo.tenant.TenantUtil.runAsTenant(TenantUtil.java:88)
        at org.alfresco.repo.tenant.TenantUtil$1.doWork(TenantUtil.java:62)
        at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:548)
        at org.alfresco.repo.tenant.TenantUtil.runAsUserTenant(TenantUtil.java:58)
        at org.alfresco.repo.action.AsynchronousActionExecutionQueueImpl$ActionExecutionWrapper.run(AsynchronousActionExecutionQueueImpl.java:435)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 09100114 Failed to perform OCR transformation:
Execution result:
   os:         Linux
   command:    /usr/local/bin/pdfsandwich -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6093412488594768960.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6093412488594768960_ocr.pdf
   succeeded:  false
   exit code:  2
   out:
   err:        Fatal error: exception Failure("Could not find program tesseract. Make sure this program exists and can be found in your search path.
Use command line options to specify a custom binary.")

I can run tesseract normally from command line:

root@dmsdev:/opt/alfresco-community# tesseract -v
tesseract 3.05.00dev
 leptonica-1.73
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

When I tried to run it manually from the shell, it works:

root@dmsdev:/opt/alfresco-community/tomcat/shared/classes# /usr/local/bin/pdfsandwich -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1877691809857980597.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1877691809857980597_ocr.pdf
...<skip>
OCR done. Writing "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1877691809857980597_ocr.pdf"
pdfunite /tmp/pdfsandwich3c120a.pdf /tmp/pdfsandwich3601ee.pdf /tmp/pdfsandwich_outputafffcf.pdf

/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1877691809857980597_ocr.pdf generated.

Done.

The problem goes away when I restart Alfresco. What's the problem?

Thanks,
[bayu]
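
A command that works from a root shell but fails when Alfresco launches it often points at a different environment inside the JVM; the README's installation notes warn about path precedence for programs launched that way. One way to rule this out is a wrapper that fixes PATH before calling pdfsandwich. This is only a sketch: it assumes tesseract lives in /usr/local/bin next to pdfsandwich, which the report does not confirm, and EXTRA_PATH, OCR_BIN, prepend_path and run_with_path are hypothetical names:

```shell
#!/bin/sh
# Assumed locations; check them with "command -v tesseract" on your server.
EXTRA_PATH="${EXTRA_PATH:-/usr/local/bin:/usr/bin:/bin}"
OCR_BIN="${OCR_BIN:-/usr/local/bin/pdfsandwich}"

# Print directory list $1 placed in front of the existing search path $2.
prepend_path() {
  if [ -n "$2" ]; then
    printf '%s:%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Run the OCR program with a PATH in which the helper programs can be found.
run_with_path() {
  PATH=$(prepend_path "$EXTRA_PATH" "$PATH")
  export PATH
  "$OCR_BIN" "$@"
}

# Act as the OCR command only when arguments were passed.
if [ "$#" -gt 0 ]; then
  run_with_path "$@"
  exit $?
fi
```

Setting ocr.command to this wrapper instead of /usr/local/bin/pdfsandwich keeps the rest of the configuration unchanged. If the error persists with the wrapper in place, the search path is probably not the cause.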

no bean named 'ocr-extract' is defined

I used Alfresco CE 201702 with simple-ocr 1.1.1 and everything worked fine.

Now I want to switch to Alfresco CE 201707 with simple-ocr 2.3.1 and get an error using the Action in the Document Browser. alfresco.log says:

ERROR [org.springframework.extensions.webscripts.AbstractRuntime] [http-apr-8080-exec-7] Exception from executeScript: 08160000 Wrapped Exception (with status template): No bean named 'ocr-extract' is defined

share.log says:

simple-ocr-repo Platform Jar Module - SDK 3, 2.3.1, Platform JAR Module (to be included in the alfresco.war) - SDK 3
simple-ocr-share Share Jar Module - SDK 3, 2.3.1, Share JAR Module (to be included in the share.war) - SDK 3

My alfresco-global.properties includes:

###Simple OCR Action Properties
#local	ocr program
#ocr.command uses a Shell-Script, to get the right Python environment.
ocr.command=/usr/local/bin/alfocr
ocr.output.verbose=true
ocr.output.file.prefix.command=
#rotating, cleaning, languages…
ocr.extra.commands=--verbose 1 --force-ocr -l deu -c -d
ocr.server.os=linux

/usr/local/bin/alfocr:

#!/usr/bin/env bash
#set -o xtrace # Uncomment for debugging/troubleshooting

sudo -u andreas /usr/bin/ocrmypdf "$@"

Are there any other dependencies or what is going wrong?

How to install alfresco-simple-ocr in Alfresco 5.1 on Windows 10

I'm new to Alfresco and I need OCR. I found out how to install the AMP file and did so, and now I cannot log in to Alfresco. I get this error:

Your authentication details have not been recognized or Alfresco may not be available at this time.

I use Windows 10 and Alfresco 5.1

Is there a complete user manual somewhere on how to set up OCR in Alfresco? The settings differ depending on whether pdfsandwich, OCRmyPDF or Windows.OCR is used, and I have not been able to install any of them.

I apologize if I ask a lot of questions, I am new in everything and exploring opportunities

Can't login anymore after installing the AMP

Hi,

I'm new with Alfresco and this is the first amp I install.

It's running on an Ubuntu 16.04 server and I'm connecting from a remote windows 7 machine. I installed the amp using the following command:

java -jar alfresco-mmt.jar install /var/dms/alfresco/simple-ocr-repo-deu.amp /var/alfresco-community/tomcat/webapps/alfresco.war -verbose

(It's called simple-ocr-repo-deu.amp since I renamed it. It's the most recent German version you uploaded. I tried with the regular simple-ocr-repo-1.1.0.amp too, to no avail.)

I get a whole lot of "File '/WEB-INF/classes/....' added to war from amp" messages and no error. But when I restart the alfresco.sh, I'm not seeing the module as active.

However when I use the /bin/apply_amps.sh script, it installs 7 modules ("simple-ocr-repo", "org_alfresco_module_wcmquickstart", "org.alfresco.integrations.google.docs", "alfresco-share-services", "alfresco-aos-module", "org.alfresco.integrations.share.google.doc" and "org_alfresco_module_wcmquickstartshare") to alfresco.war

And if I do that, the login on /share no longer works.

Control exceptions for OCR Share action

When an exception is thrown during OCR command execution, the action in Share keeps showing its in-progress status message forever.

Exception handling in the repo action probably needs a review.

Cannot find ocr file

I copied the two JARs into the Bitnami virtual machine folder and configured alfresco-global.properties to use pdfsandwich.
I launch the OCR action from the action menu and I can see that the OCRTransformWorker PDF file is created in Alfresco's /tmp folder, but I cannot see it in the web interface. Am I missing something? Where can I read the log to see the problem?

ocrpdf with docker

Hello Angel,
I'm running alfresco under Centos7 and I have installed ocrmypdf with docker.
In alfresco-global.properties I have this configuration:

ocr.command=/opt/alfresco-community/scriptsme/ocrpdf
ocr.output.verbose=true
ocr.output.file.prefix.command=
ocr.extra.commands=
ocr.server.os=linux

and in ocrpdf I have this:

#!/bin/sh

docker run -v /opt/alfresco-community/tomcat/temp/Alfresco:/home/docker ocrmypdf --verbose 1 --force-ocr -l spa+eng+fra $1 $2

Passing parameters to ocrpdf fails because the addon passes absolute paths, whereas inside the container I would need only the file names of the source and the destination.
Is there any way to change this?

Greetings and thanks for everything.
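
One way to handle this inside the wrapper itself is to rewrite each absolute host path to the matching path under the container's mount point before calling docker. The sketch below reuses the image name, options and mount from the script above and assumes that the addon passes the source and destination files as the first two arguments and that both files sit directly in the mounted temp folder; HOST_DIR, CONTAINER_DIR, to_container_path and run_ocr are hypothetical names:

```shell
#!/bin/sh
# Host folder that Alfresco writes its temp files to, and where it is mounted
# inside the container (both taken from the script in the report above).
HOST_DIR="${HOST_DIR:-/opt/alfresco-community/tomcat/temp/Alfresco}"
CONTAINER_DIR="${CONTAINER_DIR:-/home/docker}"

# Map an absolute host path to the same file name under the container mount.
to_container_path() {
  printf '%s/%s\n' "$CONTAINER_DIR" "$(basename "$1")"
}

# Run ocrmypdf in the container on source $1, writing destination $2.
run_ocr() {
  docker run --rm -v "$HOST_DIR:$CONTAINER_DIR" ocrmypdf \
    --verbose 1 --force-ocr -l spa+eng+fra \
    "$(to_container_path "$1")" "$(to_container_path "$2")"
}

# Act as the OCR command only when source and destination were passed.
if [ "$#" -ge 2 ]; then
  run_ocr "$1" "$2"
  exit $?
fi
```

Because the output is written into the mounted folder, it appears on the host at the absolute path Alfresco asked for. The --rm option is an addition that removes each container after it exits.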

Custom language parameter

Hi,
According to the log, the tesseract language parameter (-l) is set to the eng+spa+fra trained data.
Can we change it to another language pack? If so, how?
Thanks,
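
The language list appears to come from ocr.extra.commands in alfresco-global.properties: the pdfsandwich configurations quoted in other reports on this page use -verbose -lang spa+eng+fra there. Changing that value, for example to -verbose -lang deu, and restarting Alfresco should therefore select a different language, provided the matching tesseract trained data is installed. A small check for that last condition, as a sketch; missing_langs is a hypothetical helper built on tesseract --list-langs:

```shell
#!/bin/sh
# Print every language code from a spec such as "deu+eng" that tesseract
# does not report as installed. Prints nothing when all are available.
missing_langs() {
  # Older tesseract versions write the list to stderr, so capture both streams.
  installed=$(tesseract --list-langs 2>&1)
  for lang in $(printf '%s' "$1" | tr '+' ' '); do
    # -x matches whole lines only, so "eng" does not match a header line.
    printf '%s\n' "$installed" | grep -qx "$lang" || printf '%s\n' "$lang"
  done
}
```

Running missing_langs deu+eng on the Alfresco server lists the codes whose trained data still has to be installed before they can be used in ocr.extra.commands.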
