Blackstack

A machine learning approach to table and figure extraction. Uses scikit-learn's support vector machines and a custom annotator to build a model for entity extraction.

Whereas other approaches to table and figure reading depend on the content being well structured, Blackstack sets aside the problem of extracting data from tables and figures (see Fonduer) and instead takes a format-agnostic approach: it extracts entities as images that can then be used for later analysis or querying.

Installation

Blackstack relies on a few libraries that you will probably need to install, including ghostscript and tesseract. If you are using macOS and Homebrew, you can install them as follows:

brew install ghostscript parallel tesseract

Postgres is also required. If you are using macOS, Postgres.app is suggested.

Once Postgres is installed, set up the database:

createdb blackstack
psql blackstack < setup/schema.sql

You might need to supply your Postgres credentials in the above commands (e.g. createdb -U me blackstack).

You will also need to update config.py with your Postgres credentials. First, make a copy of config.py.example and name it config.py:

cp config.py.example config.py

and then update config.py with your Postgres credentials.

Finally, install the required Python packages:

pip install -r requirements.txt

Getting started

Preprocessing

Before a model can be trained, documents to use as training data must be selected. If you are attempting to extract entities from a specific journal or publisher it is recommended that your training data also come from that journal or publisher. If you are trying to create a general classifier you should have a good sample of documents from across disciplines, publishers, etc.

The number of documents you choose is up to you and will influence the accuracy of the model.

To process a document for use in the annotator run the following:

./preprocess.sh training /path/to/your/document.pdf

This will create a folder ./docs/training/<document> containing a PNG for each page of the PDF and an HTML file with the Tesseract output for each page. It also moves the original PDF into that folder, computes statistics on the Tesseract output, and stores them in Postgres.
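To sanity-check a preprocessed document, a small script along these lines can count the files described above. It is only a sketch: the function names are hypothetical, and it searches recursively because the exact subfolder names are not assumed.

```python
from pathlib import Path


def count_outputs(doc_dir):
    """Count the PNG, HTML, and PDF files found anywhere under doc_dir."""
    root = Path(doc_dir)
    return {
        "png": len(list(root.rglob("*.png"))),
        "html": len(list(root.rglob("*.html"))),
        "pdf": len(list(root.rglob("*.pdf"))),
    }


def looks_complete(doc_dir):
    """True if there is at least one page, one HTML file per PNG, and a PDF."""
    counts = count_outputs(doc_dir)
    return (
        counts["png"] > 0
        and counts["png"] == counts["html"]
        and counts["pdf"] >= 1
    )
```

For example, looks_complete("./docs/training/<document>") should return True once preprocessing has finished for that document. It does not check the statistics stored in Postgres.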

Creating a model

Once your training documents have been processed you can annotate them using the Flask application in /annotator. To start it, cd annotator and run python server.py. You can then navigate to http://localhost:5555 in your web browser to begin tagging.

You will be presented with a page from one of the training documents and a box around a section of the page. Click the button on the left that best describes that area. If you don't know, or the area is ambiguous, select "Other".

The table on the left side of the window will display the model's current probability for each category for the given area. Stopping and restarting the tagging application will update the model with the training data you have produced.

If you'd like to try Blackstack without creating your own training data, you can load around 5k example labels from the provided example data, which were produced from heterogeneous geoscience articles:

psql blackstack < setup/example_data.sql

Extracting entities from a PDF

Once you have created training data for the machine learning model you can run it against a PDF. To do so, you must first prepare the PDF for processing, much as it is prepared for use as a training document:

./preprocess.sh classified /path/to/a/<document>.pdf

This will convert the PDF to PNGs and run Tesseract on them. Once that is done you can run the extractor on it as follows:

python extract.py ./docs/classified/<document>

The extracted entities can then be found in ./docs/classified/<document>/tables

FAQ

Why is the figure/table/map/etc cut off or otherwise incomplete?
Blackstack depends on Tesseract's concept of "areas" and its ability to accurately identify them. While Tesseract is generally good at identifying blocks of text, it is less consistent when text and graphics are combined. Blackstack attempts to resolve some of these issues by merging adjacent "areas" that have a high probability of being related, yet issues still remain.
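The idea of merging adjacent areas can be sketched as follows. This is not Blackstack's actual merging logic: it stands in a simple rule (same label, and within a fixed pixel gap) for the probability-based criteria described above, and the Area class, the gap parameter, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    """A labeled bounding box in pixel coordinates (hypothetical structure)."""
    x1: int
    y1: int
    x2: int
    y2: int
    label: str


def are_adjacent(a, b, gap=10):
    """True if the boxes overlap or lie within `gap` pixels of each other."""
    return (a.x1 - gap <= b.x2 and b.x1 - gap <= a.x2
            and a.y1 - gap <= b.y2 and b.y1 - gap <= a.y2)


def merge_areas(areas, gap=10):
    """Merge same-label adjacent boxes until no further merge is possible."""
    merged = list(areas)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if a.label == b.label and are_adjacent(a, b, gap):
                    # Replace the pair with their common bounding box.
                    merged[i] = Area(min(a.x1, b.x1), min(a.y1, b.y1),
                                     max(a.x2, b.x2), max(a.y2, b.y2), a.label)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

A fixed gap like this shows why cut-off figures still occur: a caption or legend that sits farther away than the threshold, or that was given a different label, is left out of the merged box.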

Why is something that is clearly not a graphic extracted as one?
Tesseract can occasionally be confused by certain page layouts, especially cover pages. Since Blackstack depends on Tesseract to accurately identify "areas" it can only be as accurate as Tesseract.

Why is the accuracy so bad on a document that isn't an academic paper?
Blackstack was developed for table, figure, graphic, and map extraction from published academic literature. If you would like to extend it to other types of documents you will need to update the database schema and heuristics.py to reflect the nature of that type of document.

Funding

Development supported by NSF ICER 1343760

License

MIT

Other Info

preprocess.sh

Usage: ./preprocess.sh <training|classified> ~/Downloads/document.pdf
Purpose: Does the following:

  • Takes a given PDF and runs it first through ghostscript to create PNGs of each page
  • Runs each PNG through tesseract to create HTML files
  • Creates a folder within docs named after the PDF, and a folder for the tesseract and PNG output.
  • Moves the original document to the new folder and renames it orig.pdf
  • Runs the output of tesseract through the script summarize.py

process.sh

Purpose: Runs tesseract on a given page. Not called directly; used by preprocess.sh

summarize.py

Takes a given document, generates document-wide statistics with helpers.summarize_document, and runs each area in the document through all the labeling functions in heuristics.py. The result is stored in the postgres table areas.
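Before areas can be labeled they have to be read out of the Tesseract HTML. The sketch below assumes that output is in hOCR format, where each area is an element with class ocr_carea whose title attribute holds a "bbox x1 y1 x2 y2" entry. It is not the code used by summarize.py, and AreaParser and parse_areas are hypothetical names.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. title="bbox 36 92 580 122".
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class AreaParser(HTMLParser):
    """Collect the bounding box of every element with class 'ocr_carea'."""

    def __init__(self):
        super().__init__()
        self.areas = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX.search(attrs.get("title") or "")
        if "ocr_carea" in classes and match:
            self.areas.append(tuple(int(v) for v in match.groups()))


def parse_areas(html_text):
    """Return a list of (x1, y1, x2, y2) tuples, one per area on the page."""
    parser = AreaParser()
    parser.feed(html_text)
    return parser.areas
```

Each tuple can then be turned into per-area measurements (width, height, position on the page) that document-wide statistics and labeling functions build on.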

heuristics.py

Labeling functions for areas. Most return a boolean; some return an integer.
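As an illustration of what a labeling function might look like, here is a minimal sketch with one boolean and one integer function. These are hypothetical examples, not functions taken from heuristics.py, and the area is assumed to be a dict with a "text" key.

```python
import re

# Matches text that opens like a caption, e.g. "Table 2" or "Fig. 3".
CAPTION = re.compile(r"^\s*(table|fig(ure)?\.?)\s+\d+", re.IGNORECASE)


def starts_like_caption(area):
    """Boolean: does the area's text begin like 'Table 2' or 'Fig. 3'?"""
    return bool(CAPTION.match(area["text"]))


def word_count(area):
    """Integer: number of whitespace-separated words in the area."""
    return len(area["text"].split())


LABELING_FUNCTIONS = [starts_like_caption, word_count]


def label_area(area):
    """Run every labeling function and key each result by function name."""
    return {fn.__name__: fn(area) for fn in LABELING_FUNCTIONS}
```

Keeping the functions in a list means a new heuristic for a different type of document only has to be written and appended. In Blackstack itself the results are stored in Postgres, so a matching schema change would also be needed.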

classifier.py

Generates a model using scikit-learn that can be used to classify areas. Contains two methods:

  • create() - queries all areas and associated labels created by annotator and returns a model
  • classify(<pages>, <doc_stats>) - takes as input a list of pages and document-level stats generated by helpers.summarize_document and returns those pages with all areas tagged with their classification.
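The general pattern of training a scikit-learn support vector machine and reading per-category probabilities can be sketched as below. This is an assumption-laden illustration, not the contents of classifier.py: train_area_model and area_probabilities are hypothetical names, and the kernel choice and the shape of the feature rows are assumptions.

```python
from sklearn.svm import SVC


def train_area_model(features, labels):
    """Fit an SVM that can also report per-class probabilities."""
    # probability=True enables predict_proba at the cost of slower training.
    model = SVC(kernel="rbf", probability=True, random_state=0)
    model.fit(features, labels)
    return model


def area_probabilities(model, feature_row):
    """Map each class name to the model's probability for one area."""
    probs = model.predict_proba([feature_row])[0]
    return {str(c): float(p) for c, p in zip(model.classes_, probs)}
```

Here features would be one row of numeric heuristic results per annotated area and labels the category chosen in the annotator. With very little training data the probabilities from predict_proba can be unreliable and may even disagree with predict, which is one reason to annotate more areas before trusting them.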

annotator

A Flask application that runs on port 5555 and can be used for creating training data.

Contributors

jczaplew