mocrin

Overview

Mocrin coordinates multiple OCR engines to create a uniform workflow and folder structure. It is part of the Aktienführer-Datenarchiv work process, but can also be used independently.

Mocrin is a command-line driven processing tool for multiple OCR engines.
Its main purpose is to handle multiple OCR engines through one interface for a cleaner and more uniform workflow. Another purpose is to serve as part of a self-configuration process that extracts the best settings for the different OCR engines. Currently, you can store multiple configuration files for each OCR engine. It can also be used to cut out areas with user-set characteristics from images, which can then be used as training datasets for NN models.
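
The idea of one interface for several engines can be illustrated with a small dispatch table. This is only a sketch and not mocrin's actual implementation: the names `OcrJob`, `ENGINES` and `build_commands` and the per-engine subfolders are assumptions made for illustration, and it builds command lines for the `tesseract` and `ocropus-nlbin` executables instead of calling any engine API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class OcrJob:
    """Hypothetical job description: one image and one output root."""
    image: Path
    out_dir: Path


def tesseract_cmd(job: OcrJob) -> List[str]:
    # "tesseract IMAGE OUTPUTBASE hocr" writes OUTPUTBASE.hocr
    outbase = job.out_dir / "tess" / job.image.stem
    return ["tesseract", str(job.image), str(outbase), "hocr"]


def ocropy_cmd(job: OcrJob) -> List[str]:
    # Only the first ocropy step (binarization) is shown here;
    # segmentation and recognition would follow in a full pipeline.
    return ["ocropus-nlbin", str(job.image), "-o", str(job.out_dir / "ocropy")]


# One table maps engine names to command builders (names are assumptions).
ENGINES: Dict[str, Callable[[OcrJob], List[str]]] = {
    "tess": tesseract_cmd,
    "ocropy": ocropy_cmd,
}


def build_commands(job: OcrJob, engines: List[str]) -> Dict[str, List[str]]:
    """Return one command line per requested engine, without running it."""
    return {name: ENGINES[name](job) for name in engines}


if __name__ == "__main__":
    job = OcrJob(Path("scan_01.png"), Path("out"))
    for name, cmd in build_commands(job, ["tess", "ocropy"]).items():
        print(name, " ".join(cmd))
```

Each command could then be passed to `subprocess.run`, so that adding another engine only means adding one more builder function to the table.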

Ocromore further parses the different OCR output files into a SQLite database. The purpose of this database is to serve as an exchange and storage platform, with pandas as the handler. Combining pandas and the dataframe-objectifier enables a wide range of performant use cases such as msa. To evaluate the results you can either use the common standard ISRI tool to generate an accuracy report or do a visual comparison with diff tools (default "meld").
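
For a quick check without the ISRI tool or meld, a comparison against a ground-truth text can be sketched with Python's standard library. This is not the ISRI accuracy algorithm, only a rough approximation built on `difflib`; the function names `char_accuracy` and `show_diff` are assumptions made for this example.

```python
import difflib


def char_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters found in matching blocks (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    matcher = difflib.SequenceMatcher(None, ground_truth, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ground_truth)


def show_diff(ground_truth: str, ocr_text: str) -> str:
    """Line-based unified diff, a text-only stand-in for a visual diff tool."""
    return "\n".join(
        difflib.unified_diff(
            ground_truth.splitlines(),
            ocr_text.splitlines(),
            fromfile="ground_truth",
            tofile="ocr",
            lineterm="",
        )
    )


if __name__ == "__main__":
    truth = "Aktienführer\nKurs 120"
    ocr = "Aktienfuhrer\nKurs 120"
    print(f"character accuracy: {char_accuracy(truth, ocr):.3f}")
    print(show_diff(truth, ocr))
```

The ratio only counts characters that `difflib` aligns, so it should be read as a rough indicator and not as a replacement for a proper accuracy report.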

Note that the automatic processing will sometimes need some manual adjustments.

Current State

✓ Talk to tesseract
✓ Talk to ocropus
✓ Talk to abbyy
✓ Configuration files (for the whole process and every single ocr-engine)
✓ Implement cut method
✓ Create uniform output structure
✓ Create hocr-output
✓ Create logs with settings information

Output fileformats

✓ hocr (with confidences)
✓ abbyy-xml (with confidences "ASCII")
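
As an illustration of what the confidences in the hocr output look like, the following sketch reads word texts and their `x_wconf` values from `ocrx_word` elements, using only the standard library. The class and function names are assumptions made for this example and are not part of mocrin; the sample markup follows the hOCR conventions used by Tesseract.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Word confidence as stored in the hOCR "title" attribute, e.g. "x_wconf 95"
WCONF = re.compile(r"x_wconf\s+(\d+(?:\.\d+)?)")


class HocrWordParser(HTMLParser):
    """Collects (text, confidence) pairs for every ocrx_word element."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Optional[float]]] = []
        self._depth = 0          # > 0 while inside an ocrx_word element
        self._conf: Optional[float] = None
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._depth:          # nested tag inside a word, e.g. <strong>
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = WCONF.search(attr.get("title") or "")
            self._conf = float(match.group(1)) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:  # the word element itself was closed
                self.words.append(("".join(self._chunks).strip(), self._conf))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def parse_hocr_words(hocr: str) -> List[Tuple[str, Optional[float]]]:
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 36 92 580 116'>"
        "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Aktien</span> "
        "<span class='ocrx_word' title='bbox 109 92 173 116; x_wconf 62'>Kurs</span>"
        "</span>"
    )
    for word, conf in parse_hocr_words(sample):
        print(word, conf)
```

Words whose confidence falls below a chosen threshold could be flagged for the manual adjustments mentioned above.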

Installation

Requirements

Install:
Alternative: Docker (recommended for Windows):

Build:

docker build -t mocrin .

Run:

docker run -it -v "$PWD":/home/developer/coding/mocrin mocrin

Then run the CLI commands (see the example under "Running").

To run the scripts for visual results in your OS (not available in the docker image):

  • install Python and the requirements

Info

The project was written in PyCharm 2017.3 (CE),
so if you are a developer, it is recommended to use it.

Python 3.6.3 (default, Oct 6 2017, 08:44:35)
GCC 5.4.0 20160609 on linux
Tested on: Ubuntu 17.10

Building instructions

Dependencies can be installed into a Python Virtual Environment:

    $ virtualenv mocrin_venv/
    $ source mocrin_venv/bin/activate
    $ pip install -r requirements.txt

Process steps

Overview

First of all, you have to adjust the config files. The main config files are located in "./profiles/":

  • cli_args
    • path to the OCR files (e.g. hocr)
    • parameter for parsing hocr to db
      • naming etc.
  • ocropy
    • path to db
    • parameter for combining the information from the ocr-files
  • tess
    • path to db
    • parameter for combining the information from the ocr-files

The parameters needed to run the examples are set as defaults,
so you can just run the following commands.

At the current stage it is recommended to use PyCharm to perform the next steps.

Running

Example

Run OCR with multiple engines and store the results in a unified folder structure:

# All parameters can be set in the configs
# Config files are stored in the profiles folder.

$ python3 ./mocrin.py

The results are stored in ./Testfiles/tableparser_output/

Copyright and License

Copyright (c) 2017 Universitätsbibliothek Mannheim

Author: Jan Kamlah

mocrin is Free Software. You may use it under the terms of the Apache 2.0 License. See LICENSE for details.

Acknowledgements

The tools depend on some third-party libraries:

  • tesserocr - Wrapper for Tesseract-API (MIT-License)
