TessTools: Tools for the use of Tesseract OCR in R

Interface to the Tesseract OCR command-line tool (version 4), with parsing functions for the analysis of historical newspaper archives. This package is under development.

Installation

Make sure the tesseract command-line program is installed and available on your PATH. You can either install Tesseract from a pre-built binary package or build it from source. Running it without arguments should print its usage:

$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]
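
The same check can be made from within R. The following is a minimal sketch using base R only; it is not part of TessTools, and the variable names are illustrative.

```r
# Locate the tesseract executable on PATH; Sys.which() returns "" if absent.
tesseract_path <- unname(Sys.which("tesseract"))

if (nzchar(tesseract_path)) {
  # Ask the executable for its version. Both output streams are captured,
  # since the stream used for this message may differ between releases.
  version_info <- system2(tesseract_path, "--version", stdout = TRUE, stderr = TRUE)
  message(version_info[1])
} else {
  message("tesseract was not found on PATH; install it before using TessTools.")
}
```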

You can install the development version of TessTools from GitHub with:

# install.packages("devtools")
devtools::install_github("OlivierBinette/TessTools")

Example

Download the first issue (1905) of the Duke Chronicle newspaper.

library(TessTools)

issueID = chronicle_meta[1, "local_id"]
zipfile = download_chronicle(issueID, outputdir="data-raw")

Run Tesseract OCR on the newspaper scans and extract text paragraphs together with their bounding boxes.

hocrfiles = hocr_from_zip(zipfile, outputdir="data-raw/hocr", exdir="data-raw/img")
#> Running tesseract-OCR on 4 image files.

# Extract paragraph text
text = paragraphs(hocrfiles)
text[[1]][9:11, ] # Some paragraphs on the first page
bbox1  bbox2  bbox3  bbox4  text
  481   1251   1099   1314  HESPERIAN VS. COLUMBIAN.
  361   1394   1225   1554  Sixteenth Annual Inter-Society Debate —Won By the Hesperian.
  424   1592    822   1653  A great debate!

Visualize the result using hocrjs:

webpages = visualize_html(hocrfiles, outputdir="data-raw/html") # webpage is at data-raw/html/dchnp71001-html
browseURL(webpages[[1]]) # Note: bring up the hocrjs menu and select "show background image"

Ground truth

Paragraphs of the first issue have been annotated according to the article to which they belong.

# Ground truth for first page
vol1_paragraphs_truth[[1]][9:11, ]
    bbox1  bbox2  bbox3  bbox4  text                                                          articleID  category  note
 9    481   1251   1099   1314  HESPERIAN VS. COLUMBIAN.                                              1  title
10    361   1394   1225   1554  Sixteenth Annual Inter-Society Debate —Won By the Hesperian.          1  title
11    424   1592    822   1653  A great debate!                                                       1  text
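
As a sketch of how the annotations might be used, the snippet below groups annotated paragraphs by articleID. So that it runs without the package, the data frame is rebuilt by hand from two of the rows shown in the ground-truth excerpt (paragraphs 9 and 11); with TessTools loaded, vol1_paragraphs_truth[[1]] would presumably take its place. The variable names are illustrative.

```r
# Two rows copied from the ground-truth excerpt (paragraphs 9 and 11).
truth <- data.frame(
  text      = c("HESPERIAN VS. COLUMBIAN.", "A great debate!"),
  articleID = c(1, 1),
  category  = c("title", "text"),
  stringsAsFactors = FALSE
)

# Collect the paragraphs of each article into one data frame per articleID.
by_article <- split(truth[, c("category", "text")], truth$articleID)

# Paragraphs annotated as belonging to article 1.
by_article[["1"]]
```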

Issues

Missing CITATION.cff file for repository

It would be useful for visitors & researchers if this repository had a CITATION.cff file, so that we can know how to properly cite it. Would you mind adding one, please?

In case it's useful, there is a handy online CITATION.cff file creation tool here: https://citation-file-format.github.io/cff-initializer-javascript/#/

For further motivation: the Zotero browser plugin knows how to read CITATION.cff files in GitHub repositories, making it easy for Zotero users to add an entry for the repository to their Zotero bibliography database and thus more likely that they will cite it in their work.

Formalize API

Formalize the TessTools API.

Currently:

  • Functions that generate hOCR and HTML files return a list of paths to those files.
  • Functions that manipulate these files take a list of paths and return data loaded into memory.

Write "hocr_*" class of functions

Write a class of functions of the form hocr_* to operate on hocr files.

We have a few functions of the form hocr_from_* to generate hocr files already.

Example additional functionality from hocr-tools:

  • hocr_parse or hocr_words: return the complete hocr file as a tidy table
  • hocr_paragraphs: return hocr as a tidy table of paragraphs, ignoring line and word-specific information
  • hocr_lines: same as above, but for lines
  • hocr_combine to combine multiple hocr files into one
  • hocr_split to split hocr files into individual pages
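
Any hocr_* function that returns a tidy table has to recover coordinates from the hOCR title attribute, which stores a box as "bbox x0 y0 x1 y1", possibly followed by other properties. The sketch below shows one way to do that in base R. parse_bbox is a hypothetical helper, not an existing TessTools function, and the column names bbox1 to bbox4 are assumed from the output shown in the README example.

```r
# Hypothetical helper (not part of TessTools): extract the four bounding-box
# coordinates from hOCR title attributes such as "bbox 481 1251 1099 1314".
parse_bbox <- function(title) {
  pattern <- "bbox ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)"
  matches <- regmatches(title, regexec(pattern, title))
  # Each successful match holds the full match plus four capture groups;
  # titles without a bounding box yield a row of NA values.
  coords <- vapply(matches, function(m) {
    if (length(m) == 5) as.numeric(m[2:5]) else rep(NA_real_, 4)
  }, numeric(4))
  coords <- as.data.frame(t(coords))
  names(coords) <- paste0("bbox", 1:4)
  coords
}

# Title strings built from two bounding boxes shown in the README example.
titles <- c("bbox 481 1251 1099 1314", "bbox 424 1592 822 1653")
text   <- c("HESPERIAN VS. COLUMBIAN.", "A great debate!")

# One row per paragraph: coordinates first, text last.
paragraph_table <- parse_bbox(titles)
paragraph_table$text <- text
paragraph_table
```

A full hocr_paragraphs or hocr_words would also need an HTML parser to pull the title attributes and text out of each file; that step is left out here.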
