bjjocr

Script to automatically perform zonal OCR on a PDF and rename the PDF according to the results. It uses tesseract, imagemagick and pdftk.

This script scratches a specific itch of our HR department: processing thousands of uniform PDFs twice a month. It is therefore geared towards many identical-looking PDFs that arrive in batches of several hundred pages per PDF. We need each page as a single sheet, identifiable by an ID printed on the page. So we first burst the PDF into single-page PDFs, then run OCR over a zone on each single-page PDF and write the result to a temporary file. Finally we read the temporary file and rename the single-page PDF accordingly.
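The workflow above can be sketched roughly as follows. This is a minimal sketch, not the actual script: the function name process_pdf, its arguments and the page_%04d naming are assumptions made for illustration, and it assumes pdftk, ImageMagick's convert and a tesseract 3.x binary are on the PATH (tesseract 4 and later spell the option --psm).

```shell
# process_pdf INPUT.pdf ZONE.uzn LANG OUTDIR   (hypothetical helper)
process_pdf() {
    input=$1 zone=$2 lang=$3 outdir=$4
    workdir=$(mktemp -d) || return 1
    # 1) Burst the PDF into single-page PDFs.
    pdftk "$input" burst output "$workdir/page_%04d.pdf" || return 1
    for page in "$workdir"/page_*.pdf; do
        base=${page%.pdf}
        # 2) Render the page as a TIFF and place the zone file next to it
        #    under the same base name.
        convert -depth 8 -density 300 -strip -trim "$page" "$base.tiff" || return 1
        cp "$zone" "$base.uzn" || return 1
        # 3) OCR the zone into a temporary file, "$base.txt".
        tesseract "$base.tiff" "$base" -l "$lang" -psm 4 || return 1
        # 4) Read the temporary file, keep only characters that are safe in
        #    a file name, and rename the page after the resulting ID.
        id=$(tr -cd 'A-Za-z0-9_-' < "$base.txt")
        if [ -n "$id" ]; then
            mv "$page" "$outdir/$id.pdf"
        else
            # Keep pages without a readable ID for manual inspection.
            mv "$page" "$outdir/unreadable_$(basename "$page")"
        fi
    done
    rm -rf "$workdir"
}
```

A call such as process_pdf batch.pdf id.uzn [yourlanguage] sorted/ would then leave one renamed PDF per page in sorted/. Two pages with the same ID would overwrite each other in this sketch.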

In principle we can do this with as many zones as we like but we must initiate a new tesseract run for every zone.
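One way to handle several zones is to loop over them and start one tesseract run per zone. The sketch below assumes a hypothetical zones list, a plain text file with one "x y width height identifier" line per zone. The function name ocr_zones and the output naming are likewise assumptions, not part of the script.

```shell
# ocr_zones IMAGE.tiff LANG ZONES   (hypothetical helper)
ocr_zones() {
    image=$1 lang=$2 zones=$3
    base=${image%.*}
    while read -r x y w h label; do
        [ -n "$label" ] || continue    # skip blank or incomplete lines
        # Write a zone file that contains exactly this one zone.
        printf '%s %s %s %s %s\n' "$x" "$y" "$w" "$h" "$label" > "$base.uzn"
        # One tesseract run per zone; the result lands in "$base.$label.txt".
        # stdin is redirected so tesseract cannot swallow the zones list.
        tesseract "$image" "$base.$label" -l "$lang" -psm 4 < /dev/null || return 1
    done < "$zones"
}
```

Each run overwrites the .uzn file next to the image, so the runs for one image have to happen one after another.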

Quickstart:

1) Install required packages

We run a Debian 7 installation:

$ apt-get install tesseract-ocr tesseract-ocr-[yourlanguage] imagemagick pdftk

2) Convert your source file to a TIFF image.

The TIFF must fulfill the following requirements:

  • 300dpi
  • 8-bit color depth

I recommend stripping and trimming, too. Use ImageMagick's convert to create the TIFF file:

$ convert -depth 8 -density 300 -strip -trim input.pdf output.tiff

3) Use zonal OCR with tesseract.

Tesseract can perform zonal OCR if one or more appropriate zone files are provided. A zone file must use the extension .uzn and the same base name as the input file, e.g. if the input file is input.tiff, the zone file must be named input.uzn.

The .uzn file format is:

x-coordinate y-coordinate width height identifier

  • The x- and y-coordinates define the top left corner of a rectangle.
  • Width and height define the dimensions of the rectangle.
  • The identifier is unused and currently only helps the user in remembering the defined zone.
  • All parameters must be one space apart.
  • You can define several zones in one file, but only the first one is used.
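A zone file following these rules can be generated rather than typed by hand. The helper below is a sketch under assumptions: the name write_uzn is made up, the coordinates are taken to be non-negative whole numbers (presumably pixels in the TIFF), and the identifier is restricted to a single word so that the fields stay exactly one space apart.

```shell
# write_uzn FILE X Y WIDTH HEIGHT LABEL   (hypothetical helper)
write_uzn() {
    file=$1
    shift
    if [ "$#" -ne 5 ]; then
        echo "usage: write_uzn FILE X Y WIDTH HEIGHT LABEL" >&2
        return 2
    fi
    # The four geometry fields must be non-negative integers.
    for n in "$1" "$2" "$3" "$4"; do
        case $n in
            ''|*[!0-9]*) echo "not a non-negative integer: $n" >&2; return 1 ;;
        esac
    done
    # Assumption: a label with spaces would break the one-space-apart rule.
    case $5 in
        ''|*' '*) echo "label must be a single word: $5" >&2; return 1 ;;
    esac
    # Exactly one zone, fields separated by single spaces.
    printf '%s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5" > "$file"
}
```

For example, write_uzn input.uzn 100 200 600 80 employee_id creates the zone file for input.tiff; the numbers here are placeholders, not measured coordinates.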

$ tesseract input.tiff - -l [yourlanguage] -psm 4

Here - sends the result to stdout. Instead of stdout you can give an output name of your choice; tesseract appends the extension .txt to it.
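Once the OCR result is in a .txt file, the rename step could look like the sketch below. The function name rename_by_ocr is an assumption, as is the policy of using the first non-empty line, keeping only letters, digits, '-' and '_', and appending a counter when the target name is already taken.

```shell
# rename_by_ocr PAGE.pdf RESULT.txt   (hypothetical helper)
rename_by_ocr() {
    page=$1 result=$2
    dir=$(dirname "$page")
    # First non-empty line of the OCR output, reduced to file-name-safe characters.
    id=$(sed -n '/[^[:space:]]/{p;q;}' "$result" | tr -cd 'A-Za-z0-9_-')
    if [ -z "$id" ]; then
        echo "no usable ID in $result" >&2
        return 1
    fi
    # Never overwrite an existing file: append _2, _3, ... on collisions.
    target=$dir/$id.pdf
    n=1
    while [ -e "$target" ]; do
        n=$((n + 1))
        target=$dir/${id}_$n.pdf
    done
    # Print the new path so callers can log or post-process it.
    mv "$page" "$target" && printf '%s\n' "$target"
}
```

The function prints the new path, so the output can be collected into a log of which page ended up under which ID.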

Future plans:

If we can't get the commercial solution to meet our demands, we will:

  1. implement parallel processing of many PDFs.

  2. add a folder watchdog (inotify or dnotify) to run the script whenever a PDF is dropped into the watched folder.

  3. learn GitHub's markup.

Contributors: bjjanssen, bjoern-janssen-studitemps
