Git Product home page Git Product logo

tabula-py's Introduction

tabula-py

Build Status

tabula-py is a simple Python wrapper of tabula-java, which can read table of PDF. You can read tables from PDF and convert into pandas's DataFrame.

Requirements

  • Java
    • Confirmed working with Java 7, 8
  • pandas

Usage

Install

pip install tabula-py

Example

tabula-py enables you to extract table from PDF into DataFrame and JSON. It also can extract tables from PDF and save file as CSV, TSV or JSON.

import tabula

# Read pdf into DataFrame
df = tabula.read_pdf("test.pdf", options)

# Read remote pdf into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")

See example notebook

Options

  • pages (str, int, list of int, optional)
    • An optional values specifying pages to extract from. It allows str, int, list of int.
    • Example: 1, '1-2,3', 'all' or [1,2]. Default is 1
  • guess (bool, optional):
    • Guess the portion of the page to analyze per page.
  • area (list of float, optional):
    • Portion of the page to analyze(top,left,bottom,right).
    • Example: [269.875, 12.75, 790.5, 561]. Default is entire page
  • spreadsheet (bool, optional):
    • Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
  • nospreadsheet (bool, optional):
    • Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
  • password (bool, optional):
    • Password to decrypt document. Default is empty
  • silent (bool, optional):
    • Suppress all stderr output.
  • columns (list, optional):
    • X coordinates of column boundaries.
    • Example: [10.1, 20.2, 30.3]
  • format (str, optional):
    • Format for output file or extracted object. (CSV, TSV, JSON)
  • output_path (str, optional):
    • Output file path. File format of it is depends on format.
    • Same as --outfile option of tabula-java.

FAQ

Can I use option xxx?

Yes. You can use options argument as following. The format is same as cli of tabula-java.

read_pdf_table(file_path, options="--columns 10.1,20.2,30.3")

How can I ignore useless area?

In short, you can extract with area and spreadsheet option.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use area option

According to tabula-java wiki, there is a explain how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

For example, using macOS's preview, I got area information of this PDF:

image

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

given

Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width

I confirmed with tabula-java:

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Without -r(same as --spreadsheet) option, it does not work properly.

tabula-py's People

Contributors

chezou avatar kirkobyte avatar

Watchers

Nicolás Pérez Iguarán avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.