Git Product home page Git Product logo

tidy-toc's Introduction

Command line tool to clean OCRed ToC text

About

A command line tool to clean text data derived by OCRing a table of contents (typically of a book). It removes meaningless strings that are output by OCR engine such as

# example of meaningless strings
Introduction ......... 00... eeeeeeee 1
....
Matrices and Maps............. 060 cece eee cece eee te nee eee 328

and detects strange rows and interactively removes them.

Banach Spaces
§1. Elementary Properties and Examples 63
§2. Linear Operators on Normed Spaces 67
xiv Contents <- strange !
§3. Finite Dimensional Normed Spaces 69
§4. Quotients and Products of Normed Spaces 70

It also finds and interactively removes unwanted new lines such as

7.5.2 Approximating the Power Function and the Expected Sample
Size 334

Moreover, it points out and interactively fix rows having problematic page number such as

Exercises 37
Chapter 2 <-
Utility and Loss 40
2.1 Introduction 39 <-

Demo

--clean --select

The first demo shows how the process of interactive cleaning of unwanted characters and rows goes. The input and output text file are at sample (clean & select) and at output sample (clean & select).

--merge --page

The second demo shows the process of merging rows and correcting page numbers. The input and output text file are at sample (merge & correct) and at output sample (merge & correct).

The sample ToCs used here are OCRed (by tesseract) table of contents of books Monte Carlo Statistical Methods and Statistical Decision Theory, respectively.

Install

  • Tested Environment
    • Windows 10 + WSL2 + Ubuntu 20.04
    • python 3.10.5 (pyenv 2.3.2) + poetry (1.1.11)

Follow a typical routine of setting up a virtual environment by pyenv + poetry.

git clone https://github.com/Shena4746/tidy-toc.git
cd ./tidy-toc
pyenv local 3.10.5

The last line fails if you have not downloaded python 3.10.5. Run the following to download it, and try the previous pyenv command again.

pyenv install 3.10.5

Locate the python interpreter at {project-top}/.venv. Then let poetry perform a local installation of dependency.

python3 -m venv .venv
poetry install

Make sure that poetry uses the intended python interpreter.

poetry run which python
poetry run python --version

Usage

The following command shows help.

/path/to/{project-top}/.venv/bin/python3 /path/to/tidy-toc.py --help

# Usage: tidy-toc.py [OPTIONS] TEXT_FILE

#   clean OCRed ToC text data.

# Options:
#   -c, --clean             remove useless characters such as the long series of
#                           periods between text and page number, e.g.,
#                           'Introduction [.........,..,,.,..] 1'
#   -s, --select            suggest unnecessary rows, such as 'xii', 'iv
#                           Contents', and ask user to select from them
#                           interactively.
#   -m, --merge             suggest neighboring two rows that might be
#                           originally in a same row, such as '1.1 hello'
#                           followed by 'world 5.'
#   -p, --page              point out badly page-numbered rows and correct them
#                           interactively.
#   -l, --maxline INTEGER   the number of suggested rows displayed at once in
#                           the --select process. the default uses 10. will be
#                           ignored unless --select option is enabled.
#   -d, --dirout DIRECTORY  directory where output text file is saved. the
#                           default uses the same place as the input text file.
#   --pre TEXT              prefix for the stem-name of the output text file.
#                           the default is ''. see --join option.
#   --suf TEXT              suffix for the stem-name of the output text file.
#                           the default is '_cleaned'. see --join option.
#   -j, --join TEXT         the character with which prefix + (text file name) +
#                           suffix are combined. e.g., prefix='pre', text file
#                           name='sample.txt', suffix='suf', join='_' -> output
#                           text file name='pre_sample_suf.txt'
#   -o, --overwrite         overwrite input file with output. If enabled, all
#                           options for output file name such as --pre and
#                           --join are ignored.
#   --help                  Show this message and exit.

As in Demo, it is often convenient to set an alias for the /path/to/interpreter /path/to/tidy-toc.py.

tidy-toc's People

Contributors

shenapse avatar

Stargazers

Ken Enda avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.