A Benchmark & Evaluation for Text Extraction from PDF

This project benchmarks and evaluates existing PDF extraction tools on their semantic ability to extract the body text from PDF documents, especially scientific articles. It provides (1) a benchmark generator, (2) a ready-to-use benchmark, and (3) an extensive evaluation with meaningful evaluation criteria.

The Benchmark Generator

constructs high-quality benchmarks from TeX source files.
identifies the following 16 logical text blocks: title, author(s), affiliation(s), date, abstract, headings, paragraphs of the body text, formulas, figures, tables, captions, listing-items, footnotes, acknowledgements, references, appendices.
serializes desired logical text blocks to plain text, XML or JSON format.

For more details and usage, see benchmark-generator/.
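
To illustrate what serializing selected logical text blocks to plain text, JSON, or XML can look like, here is a minimal sketch. It is not the generator's actual code or output schema: the `(role, text)` representation, the role names, the function names, and the `document`/`block` element names are all assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical in-memory representation: (role, text) pairs in reading order.
blocks = [
    ("title", "A Sample Article"),
    ("heading", "Introduction"),
    ("paragraph", "PDF is a layout-based format."),
    ("formula", "E = mc^2"),
]


def select(blocks, roles):
    """Keep only the blocks whose role is in the desired set."""
    return [(role, text) for role, text in blocks if role in roles]


def to_plain_text(blocks):
    """Join block texts, separated by blank lines."""
    return "\n\n".join(text for _, text in blocks)


def to_json(blocks):
    """Serialize blocks as a JSON array of {role, text} objects."""
    return json.dumps([{"role": r, "text": t} for r, t in blocks], indent=2)


def to_xml(blocks):
    """Serialize blocks as <block role="..."> elements under <document>."""
    root = ET.Element("document")
    for role, text in blocks:
        element = ET.SubElement(root, "block", role=role)
        element.text = text
    return ET.tostring(root, encoding="unicode")


# Select the same block types the ready-to-use benchmark keeps.
wanted = select(blocks, {"title", "heading", "paragraph"})
print(to_plain_text(wanted))
print(to_json(wanted))
print(to_xml(wanted))
```

For the real command-line options and output formats, refer to benchmark-generator/.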

The Benchmark

consists of 12,099 ground truth files and 12,099 PDF files of scientific articles, randomly selected from arXiv.org. Each ground truth file contains the title, the headings and the body text paragraphs of a particular scientific article.
was generated using the benchmark generator described above.

For more details, see benchmark/.
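
Since every PDF file has a matching ground truth file, a consumer of the benchmark typically needs to pair the two. The sketch below shows one way to do that by file name. The directory layout, the `.txt` suffix for ground truth files, and the `pair_files` helper are assumptions made for illustration; check benchmark/ for the actual layout.

```python
from pathlib import Path


def pair_files(root, truth_suffix=".txt"):
    """Pair each PDF under `root` with the ground truth file of the same name.

    Returns (pairs, unpaired): a list of (pdf_path, truth_path) tuples and a
    sorted list of relative names that have only one of the two files.
    The ground truth suffix is an assumption, not the benchmark's convention.
    """
    root = Path(root)
    pdfs = {p.with_suffix("").relative_to(root): p for p in root.rglob("*.pdf")}
    truths = {
        p.with_suffix("").relative_to(root): p
        for p in root.rglob("*" + truth_suffix)
    }
    pairs = [(pdfs[key], truths[key]) for key in sorted(pdfs.keys() & truths.keys())]
    unpaired = sorted(pdfs.keys() ^ truths.keys())
    return pairs, unpaired
```

Reporting the unpaired names separately makes an incomplete download easy to spot before running an evaluation.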

The Evaluation

assesses the following 13 PDF extraction tools: pdftotext, pdftohtml, pdf2xml (Xerox), pdf2xml (Tiedemann), PdfBox, ParsCit, LA-PdfText, PdfMiner, pdfXtk, pdf-extract, PDFExtract, Grobid, Icecite.
provides meaningful evaluation criteria to assess a tool's semantic ability to identify (1) words, (2) the reading order, (3) paragraph boundaries, and (4) the semantic roles of text elements in PDF files.

For more details, see evaluation/.
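
As a rough illustration of a word-level comparison between a ground truth file and a tool's output, the sketch below computes precision and recall from an in-order word alignment. This is not the evaluation's actual method or criteria: it substitutes Python's standard `difflib.SequenceMatcher` for whatever alignment the project uses, and the `word_scores` name is hypothetical.

```python
from difflib import SequenceMatcher


def word_scores(expected: str, actual: str):
    """Return (precision, recall) of the words in `actual` against `expected`.

    Words are aligned in order, so words that a tool extracts in the wrong
    reading order are not all counted as matches.
    """
    expected_words = expected.split()
    actual_words = actual.split()
    matcher = SequenceMatcher(a=expected_words, b=actual_words, autojunk=False)
    # Sum the sizes of all aligned runs of identical words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(actual_words) if actual_words else 0.0
    recall = matched / len(expected_words) if expected_words else 0.0
    return precision, recall


# "brown" is missing and "jumps" is spurious: 3 of 4 words match each way.
print(word_scores("the quick brown fox", "the quick fox jumps"))  # (0.75, 0.75)
```

A measure like this covers only criteria (1) and, loosely, (2); paragraph boundaries and semantic roles need structure-aware comparisons, which are described in evaluation/.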

Repository: ckorzen/pdf-text-extraction-benchmark on GitHub
