
acdc_train's Introduction

Automatic Collation for Diversifying Corpora (ACDC)

This package provides code for producing training data for optical character recognition (OCR) and handwritten text recognition (HTR) by aligning the output of an initial model on a collection of images with a collection of digital editions of similar texts.

For background and a walkthrough of using these tools, see the video tutorial.

First, install passim. Then install kraken. If you want to start with PDF files of books rather than page images, use the pdf option:

pip install --user kraken[pdf]

After this is complete, the programs seriatim, kraken, and ketos should be in your PATH and available on the command line.
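As a quick check of the step above, the following minimal sketch reports which of the three programs cannot be found on your PATH. The function name `missing_commands` is a hypothetical helper, not part of this package.

```python
import shutil

def missing_commands(commands):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]

if __name__ == "__main__":
    # The three programs named in the installation instructions.
    missing = missing_commands(["seriatim", "kraken", "ketos"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All commands found.")
```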

Install the scripts in this package with:

pip install --user .

We use make to manage OCR of a potentially large number of input pages. Create a directory for your work, go into that directory, and link to the Makefile in this package:

ln -s <path to src>/acdc_train/etc/Makefile

If you're starting with PDF files, put them in a subdirectory named pdf. If you're starting with individual page image files instead, create a directory named images with subdirectories each containing the page image files for a book.
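To confirm that a working directory follows the layout described above, a small sketch like the one below can list what it finds. The helper name `find_inputs` is hypothetical, and skipping files whose names start with an underscore is an assumption made so that marker files such as `_SUCCESS` are not counted as page images.

```python
from pathlib import Path

def find_inputs(workdir):
    """Return (pdf file names, {book directory name: [page image file names]})."""
    root = Path(workdir)
    pdf_dir = root / "pdf"
    image_root = root / "images"

    # PDF input: one file per book in the pdf subdirectory.
    pdfs = sorted(p.name for p in pdf_dir.glob("*.pdf")) if pdf_dir.is_dir() else []

    # Image input: one subdirectory per book under images.
    books = {}
    if image_root.is_dir():
        for book_dir in sorted(d for d in image_root.iterdir() if d.is_dir()):
            books[book_dir.name] = sorted(
                f.name
                for f in book_dir.iterdir()
                # Assumption: names starting with "_" are markers, not pages.
                if f.is_file() and not f.name.startswith("_")
            )
    return pdfs, books

if __name__ == "__main__":
    pdfs, books = find_inputs(".")
    print("PDF files:", pdfs)
    for name, pages in books.items():
        print(f"{name}: {len(pages)} page image(s)")
```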

If you put plain text files in a directory named electronic_texts, they will be interpreted as OpenITI markdown. If you prefer, you can put JSONL-formatted input in electronic_texts.json instead. This follows the passim conventions: an id field holding a unique document identifier and a text field holding the text, potentially with escaped newlines.
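The JSONL format can be illustrated with a minimal sketch that writes one JSON object per line, each with an id and a text field. The function name `texts_to_jsonl`, the source directory name `my_texts`, and the choice of the file name as the id are assumptions for illustration only.

```python
import json
from pathlib import Path

def texts_to_jsonl(text_dir, out_path):
    """Write one JSON object per file in `text_dir`; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(text_dir).iterdir()):
            if not path.is_file():
                continue
            # Assumption: the file name is unique and serves as the id.
            record = {"id": path.name, "text": path.read_text(encoding="utf-8")}
            # json.dumps escapes newlines as \n, so each record stays on one line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count

if __name__ == "__main__":
    # "my_texts" is a hypothetical directory of UTF-8 plain text files.
    if Path("my_texts").is_dir():
        n = texts_to_jsonl("my_texts", "electronic_texts.json")
        print(f"Wrote {n} record(s)")
```

Setting `ensure_ascii=False` keeps non-Latin text such as Arabic readable in the output file rather than writing it as escape sequences.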

In the paper, we bootstrapped training starting from page segmentation and transcription models trained on printed Arabic-script books. You can change the segment and ocr variables in the Makefile to train from a different starting model.

You should then be able to run the experiments, which perform three rounds of OCR on the pages in pdf or images followed by retraining, with this make command:

make all

If you have a GPU that works with kraken, uncomment the line near the top of the Makefile to use that device with kraken:

KRAKEN_DEVICE=-d cuda:0

If any step in the pipeline complains about running out of memory, edit the line near the top of the Makefile to give Spark more than 4 GB of memory:

export SPARK_SUBMIT_ARGS=--executor-memory 4G --driver-memory 4G

acdc_train's People

Contributors

dasmiq


acdc_train's Issues

pdf_images.py: command not found

When trying to run acdc with a PDF instead of images, I immediately get the following error:

(mkdir -p images/0049Gotha.OrA1521; cd images/0049Gotha.OrA1521; bash -c "pdf_images.py ../../pdf/0049Gotha.OrA1521.pdf") && touch images/0049Gotha.OrA1521/_SUCCESS
bash: pdf_images.py: command not found
make: *** [Makefile:64: images/0049Gotha.OrA1521/_SUCCESS] Error 127

The script creates a folder for the images extracted from the PDF, but then the shell cannot find pdf_images.py, even though it is in the bin folder of the acdc_train folder.
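One possible cause, not confirmed in this report, is that the directory where pip install --user places scripts is not on PATH. The sketch below checks for that case. The helper names are hypothetical, and the bin subdirectory of the user base is an assumption that holds on POSIX systems but not on Windows.

```python
import os
import shutil
import site

def user_script_dir():
    """Directory where `pip install --user` puts scripts (POSIX assumption)."""
    return os.path.join(site.getuserbase(), "bin")

def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    target = os.path.normpath(directory)
    entries = [os.path.normpath(e) for e in path_value.split(os.pathsep) if e]
    return target in entries

if __name__ == "__main__":
    print("pdf_images.py found at:", shutil.which("pdf_images.py"))
    script_dir = user_script_dir()
    on_path = dir_on_path(script_dir, os.environ.get("PATH", ""))
    print(f"{script_dir} on PATH: {on_path}")
```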

Modification I made to the Makefile to get this working

Thank you for making this available!

While setting it up for experiments, I had to make a few changes.

I had to replace the input to ocr --model in line 92 with $(ocr) so that I could load my base model for the task.

$(call slurm_run,cat $< | batkraken.sh gen2-print --alto $(KRAKEN_DEVICE) segment --model $(segment) -d horizontal-rl --baseline ocr --model gen1-print-n7m5.out/alto-nall/ft_best.mlmodel --base-dir R >& $@.err)

Similarly, I had to replace print-n7m5.out/alto-nall/ft_best.mlmodel in line 79 with $(ocr):

$(MAKE) ocr=print-n7m5.out/alto-nall/ft_best.mlmodel $(patsubst splits/%,gen1-print/jobs/_%,$(wildcard splits/x*))

As I was working with an LTR script for my tests, I also had to change -d to horizontal-lr and --base-dir to L instead of horizontal-rl and R.

Would it be useful to change -d and --base-dir to auto?

Changing the values of arguments N and M in the Makefile leads to an error when running 'make all'

Hi,
when I change the values of arguments N and M in the Makefile, I get the following error when executing 'make all':

make: *** No rule to make target 'gen2-print-n7m5.out/alto-union/ft_best.mlmodel', needed by 'all'. Stop.

I think this is due to the fact that some N and M values are hardcoded in the last line of the Makefile:

all: gen2-print-n7m5.out/alto-union/ft_best.mlmodel

Shouldn't we have the following line instead?

all: gen2-print-n$(N)m$(M).out/alto-union/ft_best.mlmodel
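To see why a hardcoded target breaks, the sketch below builds the target name from N and M using the naming pattern inferred from the error message. The function `all_target` is hypothetical and only mirrors that pattern; it is not part of the Makefile.

```python
def all_target(n, m, generation=2):
    """Build the model path that `all` would depend on for the given N and M.

    Assumption: the naming pattern is inferred from the error message above.
    """
    return f"gen{generation}-print-n{n}m{m}.out/alto-union/ft_best.mlmodel"

if __name__ == "__main__":
    # With the default values this matches the hardcoded target.
    print(all_target(7, 5))
    # With other values the name differs, so the hardcoded target has no rule.
    print(all_target(5, 3))
```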

dependencies for PDF extraction: pyvips, click, Pillow

If you use PDFs instead of images, the script uses pyvips to extract the images from the PDF.

To install pyvips on Ubuntu (see https://github.com/libvips/pyvips), first install libvips (https://github.com/libvips/libvips/wiki/Build-for-Ubuntu):

sudo apt install libvips
sudo apt install libvips-tools
sudo apt install libvips-dev

Then install pyvips:

pip install pyvips

The PDF extraction script also depends on click and Pillow:

pip install click
pip install Pillow
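After installing, a short check like the following sketch can confirm that the three Python modules are importable. The helper name `missing_modules` is hypothetical. Note that Pillow is imported under the module name `PIL`.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    # Pillow installs the `PIL` module, so that is the name to look up.
    missing = missing_modules(["pyvips", "click", "PIL"])
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    else:
        print("All PDF extraction dependencies are importable.")
```

This only confirms that the Python packages can be located. It does not verify that the libvips system library itself loads correctly.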
