wzbsocialsciencecenter / pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Home Page: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/

License: Apache License 2.0

Languages: Python 99.70%, Makefile 0.30%
Topics: pdf, data-mining, python, image-processing, tables, ocr

pdftabextract's Introduction

pdftabextract - A set of tools for data mining (OCR-processed) PDFs

July 2016 / Feb. 2017, Markus Konrad [email protected] / Berlin Social Science Center

This project is currently not maintained.

IMPORTANT INITIAL NOTES

From time to time I receive emails from people trying to extract tabular data from PDFs. I'm fine with that and I'm glad to help. However, some people think that pdftabextract is a kind of magic wand that automatically extracts the data they want simply by running one of the provided examples on their documents. In most cases, this won't work. I want to clear up a few things that you should consider before using this software and before writing an email to me:

  1. pdftabextract is not OCR (optical character recognition) software. It requires scanned pages with OCR information, i.e. a "sandwich PDF" that contains both the scanned images and the recognized text. You need software like tesseract or ABBYY FineReader for OCR. To check whether you have a "sandwich PDF", open your PDF and press "select all". This usually reveals the OCR-processed text information.
  2. pdftabextract is something of a last resort when everything else fails to extract tabular data from PDFs. Before trying it out, you should ask yourself the following questions:
  • Is there really no other way / no other format for which the data is available?
  • Can specialized OCR software like ABBYY FineReader detect and extract the tables? (You need to try this with a large sample of pages -- I found the table recognition in FineReader often unreliable.)
  • Is it possible to extract the recognized text as-is from the PDFs and parse it? Try the pdftotext tool from poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts: pdftotext -layout yourdocument.pdf. This creates a file yourdocument.txt containing the recognized text (from the OCR) with a layout that hopefully resembles your tables. Often, this output can be parsed directly, e.g. with a Python script using regular expressions (see the sketch after this list). If it can't be parsed (e.g. because the columns are not well separated in the text, the tables differ too much from page to page to allow a common parsing structure, or the pages are too skewed or rotated), then pdftabextract is the right software for you.
  3. pdftabextract is a set of tools. As such, it contains functions that are suitable for certain documents but not for others, and many functions require you to set parameters that depend on the layout, scan quality, etc. of your documents. You can't just use the example scripts blindly with your data. You will need to adjust the parameters so that they work well with your documents. Below are some hints and explanations regarding those tools and their parameters.
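
As a quick illustration of the pdftotext route mentioned in the list above, here is a minimal sketch that parses the -layout output with regular expressions. The file name and the column pattern are hypothetical placeholders; adapt them to your own table layout.

import re

# Hypothetical sketch: parse the output of `pdftotext -layout yourdocument.pdf`.
# The pattern assumes rows of "<label>  <number>  <number>" with columns
# separated by two or more spaces -- adjust it to your own documents.
row_pattern = re.compile(r'^(?P<label>\S.*?)\s{2,}(?P<val1>[\d.,]+)\s{2,}(?P<val2>[\d.,]+)\s*$')

rows = []
with open('yourdocument.txt', encoding='utf-8') as f:
    for line in f:
        m = row_pattern.match(line.rstrip('\n'))
        if m:
            rows.append((m.group('label'), m.group('val1'), m.group('val2')))

print(rows[:5])   # inspect the first few parsed rows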

Introduction

This repository contains a set of tools written in Python 3 with the aim of extracting tabular data from (OCR-processed) PDF files. Before these files can be processed, they need to be converted to XML files in pdf2xml format. This is very simple -- see the section below for instructions.

Module overview

After conversion you can view the extracted text boxes with the pdf2xml-viewer tool if you like. The pdf2xml format can be loaded and parsed with functions in the common submodule. Lines can be detected in the scanned images using the imgproc module. If the pages are skewed or rotated, this can be detected and fixed with methods from imgproc and functions in textboxes. Line or text box positions can be clustered in order to detect table columns and rows using the clustering module. Once columns and rows have been detected, they can be converted to a page grid with the extract module and their contents can be extracted with fit_texts_into_grid in the same module. extract also allows you to export the data as a pandas DataFrame.
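
A minimal sketch of this pipeline, loosely based on the example scripts (the file names, threshold values and row positions below are placeholders, and exact function arguments may differ between versions):

from pdftabextract.common import read_xml, parse_pages
from pdftabextract import imgproc
from pdftabextract.clustering import find_clusters_1d_break_dist, calc_cluster_centers_1d
from pdftabextract.extract import make_grid_from_positions, fit_texts_into_grid

# load and parse the pdf2xml file created with pdftohtml (see below)
xmltree, xmlroot = read_xml('output.xml')
pages = parse_pages(xmlroot)
p = pages[1]                                # work with page 1

# detect lines in the scanned page image (the image file is referenced in the XML)
iproc_obj = imgproc.ImageProc('output-1_1.png')
lines = iproc_obj.detect_lines(canny_kernel_size=3, canny_gauss_size=3,
                               hough_rho_res=1, hough_theta_res=3.141592 / 500,
                               hough_votes_thresh=round(0.2 * iproc_obj.img_w))

# cluster the detected vertical lines to find column positions
# (positions are in image space and may need scaling to page space)
vertical_clusters = iproc_obj.find_clusters(imgproc.DIRECTION_VERTICAL,
                                            find_clusters_1d_break_dist,
                                            dist_thresh=25)        # placeholder threshold
col_positions = calc_cluster_centers_1d(vertical_clusters)

# row positions would be derived similarly (e.g. from clustered text box positions);
# plain placeholder values are used here
row_positions = [100, 150, 200, 250]

# build the page grid and fit the page's text boxes into it
grid = make_grid_from_positions(col_positions, row_positions)
datatable = fit_texts_into_grid(p['texts'], grid)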

If your scanned pages are double pages, you will need to pre-process them with splitpages.

Examples and tutorials

An extensive tutorial was posted on the WZB Data Science blog (see the home page link above) and is derived from the Jupyter Notebook contained in the examples. There are more use cases and demonstrations in the examples directory.

Features

  • load and parse files in pdf2xml format (common module)
  • split scanned double pages (splitpages module)
  • detect lines in scanned pages via image processing (imgproc module)
  • detect page rotation or skew and fix it (imgproc and textboxes module)
  • detect clusters in detected lines or text box positions in order to find column and row positions (clustering module)
  • extract tabular data and convert it to a pandas DataFrame (which allows export to CSV, Excel, etc.) (extract module; see the export sketch below)
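
For the last point, a short sketch of the export step (assuming a datatable produced by fit_texts_into_grid as in the pipeline sketch above; the file names are placeholders):

from pdftabextract.extract import datatable_to_dataframe

df = datatable_to_dataframe(datatable)     # datatable comes from fit_texts_into_grid
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)    # requires an Excel writer such as openpyxl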

Installation

This package is available on PyPI and can be installed via pip: pip install pdftabextract

Requirements

The requirements are listed in requirements.txt and are installed automatically if you use pip.

Only Python 3 -- No Python 2 support.

Converting PDF files to XML files with pdf2xml format

You need to convert your PDFs using poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts. From this package we need the command pdftohtml, with which we can create an XML file in pdf2xml format from the terminal as follows:

pdftohtml -c -hidden -xml input.pdf output.xml

The arguments input.pdf and output.xml are your input PDF file and the created XML file in pdf2xml format, respectively. It is important to specify the -hidden parameter when you're dealing with OCR-processed ("sandwich") PDFs. You can furthermore add the parameters -f n and -l n to convert only a range of pages (first and last page to convert, respectively), as in the example below.
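
For example, to convert only the first five pages:

pdftohtml -c -hidden -xml -f 1 -l 5 input.pdf output.xml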

Usage and examples

For usage and background information, please read my series of blog posts about data mining PDFs.

See the following images of the example input/output (the figures themselves are not reproduced here):

  • Original page
  • Generated (and skewed) pdf2xml file viewed with pdf2xml-viewer
  • Detected lines
  • Detected clusters of vertical lines (columns)
  • Generated page grid viewed in pdf2xml-viewer
  • Excerpt of the extracted data

License

Apache License 2.0. See LICENSE file.

pdftabextract's People

Contributors

internaut, stweil, timgates42


pdftabextract's Issues

Not able to create vertical lines and recognize clusters

I have run catalog_30s.py on one of my PDFs, which has some text at the top and bottom and a table with 2 columns in the center, as shown in the screenshot (not reproduced here).

I changed these parameters in the script:

N_COL_BORDERS = 3
MIN_COL_WIDTH = 687

The output was:

page 1: detecting lines in image file 'data/sample.pdf-1_1.png'...

found 38 lines
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines-orig.png'
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines.png'
WARNING:root:no vertical lines found
no page rotation / skew found
found 0 clusters
Traceback (most recent call last):
File "sample.py", line 140, in
img_w_clusters = iproc_obj.draw_line_clusters(imgproc.DIRECTION_VERTICAL, vertical_clusters)
File "build/bdist.macosx-10.12-intel/egg/pdftabextract/imgproc.py", line 395, in draw_line_clusters
ZeroDivisionError: integer division or modulo by zero

Why is the script not able to recognise the vertical lines? What could be the issue?

Data Sources

Hello, my graduation thesis is also related to document image recognition. Can you give me your data source?

pdftohtml not generating image tag in XML file

When generating an XML file via pdftohtml like
pdftohtml -c -hidden -xml input.pdf output.xml
there is no image tag in the XML file (also, this command only generates the XML file; no PNGs are created).
I have followed all the steps mentioned in your blog post, but the code does not execute properly because at
imgflebasename = p['image'][:p['image'].rindex('.')]
no images are found, so the lookup of p['image'] and the call to rindex fail.

No text boxes in the output

Hi,
when I run pdftohtml -c -hidden -xml a.pdf a.pdf.xml on this file, I get no text boxes in the output, only the information below.

Is this normal? What's wrong with my command?

Thank you

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.41.0">
<page number="1" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-1_1.jpg"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-2_1.jpg"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-3_1.jpg"/>
</page>
<page number="4" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-4_1.jpg"/>
</page>
<page number="5" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-5_1.jpg"/>
</page>
</pdf2xml>

Output is not coming

I'm running the schoollist_1.py script on my own documents, but nothing shows up in the output files: the code runs, yet the .csv and .xlsx files are empty when I open them. I'm applying it to company invoices, which are scanned documents, and I use them as you describe: first creating the .xml file, then placing it in the data folder and processing it. In the output folder all the files are created -- the xml, json and png files are all there -- but no data appears in the .csv or .xlsx files.
Please help

`Poppler` installation on windows

I've been trying to install Poppler to execute the first command, pdftohtml. I tried "pip install python-poppler-qt5" and also a conda installation, but both failed. Adding the source files to my anaconda/lib/site-packages directory failed as well. Could someone please tell me how to get Poppler up and running on Windows?

Logger file missing

It seems the logger module is missing:

from pdftabextract import logger

in clustering.py

jpeg8.dll does not exist

Hi,
when I run pdftohtml I get an error because jpeg8.dll is not present on my system (Windows 10, 64-bit).

How can I solve this problem?

Thank you

pdftohtml does not create any scanned page with formats png and jpg

I am trying to extract table data from PDF files; the first step in the process is to generate the XML file and the page images.

Unfortunately, when following the tutorial on the WZB Social Science Center data mining blog, pdftohtml does not create any image files in the data/ directory.

The command that fails to create the images:

pdftohtml -c -xml -hidden TradingIEX.pdf TradingIEX.pdf.xml

How can I create such page images for non-scanned PDF files?

Thanks.

pdftohtml -c -hidden -xml input.pdf output.xml

This gives us an error about the parameters:

pdftohtml version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftohtml [options]
-f : first page to convert
-l : last page to convert
-z : initial zoom level (1.0 means 72dpi)
-r : resolution, in DPI (default is 150)
-skipinvisible : do not draw invisible text
-allinvisible : treat all text as invisible
-opw : owner password (for encrypted files)
-upw : user password (for encrypted files)
-q : don't print any messages or errors
-cfg : configuration file to use in place of .xpdfrc
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information

Is there an update to the package? I installed it using:

pip install pdftabextract

A question

Can this project run on Windows? And can it recognize numbers in images?
