chezou / tabula-py

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

License: MIT License

Python 100.00%

tabula-py's Introduction

tabula-py

tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.

You can see the example notebook and try it on Google Colab. We also highly recommend reading our documentation, especially the FAQ section.

Requirements

Java 8+
Python 3.8+

OS

I confirmed that it works on macOS and Ubuntu, and some people have confirmed that it works on Windows 10. See also the documentation for detailed installation instructions for Windows 10.

Usage

Documentation
- The FAQ is a good first stop if you run into an issue
Example notebook on Google Colaboratory

Install

Ensure you have a Java runtime and set the PATH for it.

pip install tabula-py

If you want faster execution with jpype, install with the jpype extra.

pip install tabula-py[jpype]
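
Because tabula-py launches Java behind the scenes, it can help to confirm that a `java` executable is reachable before going further. The following is a minimal sketch using only the standard library; the helper names `find_java` and `describe_java` are hypothetical and not part of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def describe_java():
    """Return the text printed by `java -version`, or a hint if java is missing."""
    java_path = find_java()
    if java_path is None:
        return "java was not found on PATH; install a Java runtime and update PATH"
    # `java -version` conventionally writes its report to stderr.
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, errors="replace"
    )
    return (result.stderr or result.stdout).strip()


print(describe_java())
```

If the hint is printed even though Java is installed, the PATH seen by Python is probably different from the one in your shell.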

Example

tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')

# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

See an example notebook for more details. I also recommend reading the tutorial article written by @aegis4048, and another tutorial written by @tdpetrou.

Contributing

Interested in helping out? I'd love to have your help!

You can help by:

Reporting a bug.
Adding or editing documentation.
Contributing code via a Pull Request. See also the contribution guide.
Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.

Other ways to support

You can also support our continued work on tabula-py with a donation on GitHub Sponsors or Patreon.

tabula-py's Issues

how can I extract all tables from a 30 pages pdf file

Summary of your issue

Environment

Write and check your environment.

python --version: ?
java -version: ?
OS and its version: ?
Your PDF URL:

What did you do when you faced the problem?

//write here

Example code:

paste your core code

Output:

paste your output

What did you intend to be?
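
For the question in this issue's title, a possible starting point is to pass `pages="all"`, as the README example does. The sketch below wraps that call; the wrapper name `read_all_tables` and its `reader` parameter are hypothetical and exist only so the call can be exercised without Java installed.

```python
def read_all_tables(pdf_path, reader=None):
    """Read the tables from every page of a PDF.

    `reader` defaults to tabula.read_pdf; pass a stand-in to test without Java.
    """
    if reader is None:
        import tabula  # requires tabula-py and a Java runtime

        reader = tabula.read_pdf
    # pages="all" asks for every page instead of only the first one.
    return reader(pdf_path, pages="all", multiple_tables=True)
```

Calling `read_all_tables("report.pdf")` (a placeholder file name) should return a list of DataFrames, one per detected table, which can then be inspected or concatenated.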

issue while using tabula-py

I need urgent help with tabula automation. I read your article at the link below and installed pip as well as tabula-py:

https://github.com/chezou/tabula-py

But how do I proceed after that? When I try to execute the lines below in a Python script, I get an error. Kindly help.

#!/usr/bin/python
#!/usr/bin/perl
#!/usr/bin/perl -d:ptkdb

import fileinput, sys, os ,subprocess, io

from tabula import read_pdf_table
df=read_pdf_table("TAJ.pdf")

Unable to read merged column headers

This is a follow up of earlier issue#43 which was closed. All the details provided in #43 are the same, here is the updated information:

I upgraded tabula.py to use the latest jar (tabula-1.0.1-jar-with-dependencies.jar) and while it reduced these warnings, I still get some.

Aug 31, 2017 11:42:02 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
WARNING: Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
Aug 31, 2017 11:42:03 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
WARNING: Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
Aug 31, 2017 11:42:03 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
WARNING: Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'

The main issue is that common cell headers do not get read in, and I am not sure whether the warnings are related. Please find the PDF file here: ufile.io/5xuti
You will see that the common cell header in the table on page 1, for instance ('Three Months Ended March 31'), gets dropped.

Given a document, how do I ignore the header and set the columns of a table?

I am working with a PDF very similar to this document:

As you can see the above document has a header, when I try to use tabula-py to extract it, I am getting everything merged in a single column:

In:

df = read_pdf_table('file.pdf')

Out:

Thus, my question is: how can I ignore the header and get the content of the table? I also tried with the options:

In:

df = read_pdf_table('file.pdf', columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

Out:


---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-4-33ed930c5d2a> in <module>()
      6 
      7 df = read_pdf_table('file.pdf',
----> 8                    columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
          9 
         10 #df = read_pdf_table('/Users/user/Downloads/table.pdf')

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, **kwargs)
     45     args = ["java", "-jar", jar_path] + options + [input_path]
     46 
---> 47     output = subprocess.check_output(args)
     48 
     49     if len(output) == 0:

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    624 
    625     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 626                **kwargs).stdout
    627 
    628 

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    706         if check and retcode:
    707             raise CalledProcessError(retcode, process.args,
--> 708                                      output=stdout, stderr=stderr)
    709     return CompletedProcess(process.args, retcode, stdout, stderr)
    710 

CalledProcessError: Command '['java', '-jar', '/usr/local/lib/python3.5/site-packages/tabula/tabula-0.9.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '--columns',

Nevertheless, it did not work.

updated version?

Summary of your issue

It looks like the current version of tabula-py is not compatible with the latest tabula-java. What needs to be done to upgrade tabula-py so that it includes the updates from https://github.com/tabulapdf/tabula/releases/tag/v1.1.1 ?

Environment

Write and check your environment.

python --version: 3.6
java -version: 1.8.0_45-b14
OS and its version: CentOS Linux release 7.2.1511
Your PDF URL: any

What did you do when you faced the problem?

It fails to parse tables fully, throwing these warnings:
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold

It appears that tabula-java v1.1.1 has the fix for this issue.

Example code:

read_pdf_table(in_file, pages=i)

Output:

Jul 19, 2017 11:05:35 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman
Jul 19, 2017 11:05:37 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:37 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC

What did you intend to be?

tabula-py giving different result than tabula gui(Java)

Summary of your issue

I'm trying to extract a portion of a PDF page using the tabula-py API, but the last character of the last cell is missing when I run the code in Python. I tried parsing it using the jar file provided on the tabula-java page, and it extracted all the characters in every cell correctly.

Environment

Write and check your environment.

python --version: Python 2.7.12
java -version: openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
OS and its version: Ubuntu 16.04
Your PDF URL: https://services.skyteam.com/Timetable/Skyteam_Timetable_NA_EU.pdf
Page 15

What did you do when you faced the problem?

I tried parsing the page in the tabula-java GUI. When it used the lattice extraction method, the character was missing. But when I selected the stream method of extraction, the page was parsed correctly.

Example code:

df = tabula.read_pdf("Skyteam_Timetable_NA_EU.pdf",pages="15",area= (55.781,306.797,75.119,580.497),spreadsheet=True)
list(df.head().keys())

Output:

[u'FROM:', u'Detroit, USA', u'DT']

What did you intend to be?

[u'FROM:', u'Detroit, USA', u'DTW']

Handle multiple table

When tables on the same page have different numbers of columns, tabula-py crashes in pd.read_csv.

While using area and column attributes, guess has to be set to False.

When using the area and columns options, guess has to be set to False.
Otherwise, the area and columns settings will not take effect.
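
The rule above can be sketched as a small wrapper that always passes `guess=False` together with `area` and `columns`. The wrapper name `read_fixed_region` and its `reader` parameter are hypothetical; the coordinate values in the usage note are placeholders, with the area taken from another report on this page.

```python
def read_fixed_region(pdf_path, area, columns, page=1, reader=None):
    """Read one page region using explicit column boundaries.

    area    -- (top, left, bottom, right) of the region, in points
    columns -- x coordinates of the column boundaries, in points
    """
    if reader is None:
        import tabula  # requires tabula-py and a Java runtime

        reader = tabula.read_pdf
    # guess=False stops automatic table detection from overriding
    # the region and boundaries supplied by the caller.
    return reader(pdf_path, pages=page, area=area, columns=columns, guess=False)
```

A call might look like `read_fixed_region("file.pdf", area=(55.781, 306.797, 75.119, 580.497), columns=[350.0, 450.0])`, where the column positions are made up for illustration.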

cannot import name 'read_pdf'

Summary of your issue

Hello! I tried using tabula package to import some data in pdf but I got an error. I tried the command - from tabula import read_pdf. This gave me an error - cannot import name 'read_pdf'

I installed tabula-py package.

Environment

python 3.6.1
macOS Sierra 10.12.5

What did you do when you faced the problem?

Example code:

from tabula import read_pdf

Output:

ImportError                               Traceback (most recent call last)
<ipython-input-30-59770b1fd371> in <module>()
----> 1 from tabula import read_pdf

ImportError: cannot import name 'read_pdf'

What did you intend to be?

Can't read the embedded font

I have a problem when I try to read one of my PDFs. Can you take a look and tell me where I went wrong?

python --version: 2.7
java:
Your PDF URL: https://drive.google.com/file/d/0B0MZAdjMKP0Sbjcyc3Y3RDVMNlk/view?usp=sharing
OS and its version: ? windows

from tabula import convert_into
convert_into("data\test1.pdf", "data\test1.csv", output_format="csv")

Output:

May 18, 2017 2:56:53 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Can't read the embedded font Arial-BoldMT
May 18, 2017 2:56:53 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Using font Arial Bold instead
May 18, 2017 2:56:53 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Can't read the embedded font ArialMT
May 18, 2017 2:56:53 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Using font Arial instead

Cannot Install tabula-py

My Java version is 1.8.0_101, and pandas is installed in an Anaconda environment. I tried to install it on both Python 2.7.12 and Python 3.5 :: Anaconda 4.1.1 (64-bit).

I executed "pip install tabula-py" in Anaconda as well; the output is:
Collecting tabula-py
Could not find a version that satisfies the requirement tabula-py (from versions: )
No matching distribution found for tabula-py

Are there any specific requirements other than Java and pandas? Thank you

Register pypi

pandas.io.common.CParserError: Error tokenizing data.

Hi,

We get the following error parsing a certain pdf file from a URL.
This is using latest tabula-py from git.

url is https://resource.holdan.co.uk/Holdan/gbp/BMD.pdf

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    df = tabula.read_pdf(url, pages="all")
  File "/usr/local/lib/python2.7/dist-packages/tabula/wrapper.py", line 69, in read_pdf_table
    return pd.read_csv(io.BytesIO(output), encoding = encoding)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:9977)
  File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10235)
  File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:10963)
  File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10834)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:25978)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 7 fields in line 270, saw 10

tabula-py: how to handle columns consisting entirely of NULL values?

tabula-py has done a mostly fantastic job doing what I need it to do so far. However, I have encountered issues caused by a table column that consists entirely of NULLs. For example, I have a table split across 5 different pages. For one page, the partitioned table in question has an empty column; this shifts the entire dataframe to the left, so that when it is merged with the partitioned tables from the other pages, the columns don't line up correctly. Is there a way around this?

Error if a row has an "extra" cell

I'm getting a pandas parse error using the file:

btcc-example.pdf

presumably because one row has more cells than the ones preceding it.

----> 1 df = tabula.read_pdf("data/2017/btcc-round1.pdf", pages=62, stream=True,lattice=False)

/usr/local/lib/python3.6/site-packages/tabula/wrapper.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)
     95         pandas_options['encoding'] = pandas_options.get('encoding', encoding)
     96 
---> 97         return pd.read_csv(io.BytesIO(output), **pandas_options)
     98 
     99

...
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)()

ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 7

As a stopgap, should the pandas dataframe be sized to the maximum number of items in a row (the max length of the row lists in the JSON representation), padding short rows with nulls, and then let the user fudge the result? Alternatively, set the columns from the column width and try to populate any long row(s) correctly?
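
The padding idea proposed above can be illustrated with plain Python lists. This is only a sketch of the suggestion, not how tabula-py behaves; the helper name `pad_rows` and the sample CSV text are hypothetical.

```python
import csv
import io


def pad_rows(rows, fill=None):
    """Pad every row with `fill` up to the length of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Hypothetical CSV text in which the last row has one extra cell.
raw = "a,b,c\n1,2,3\n4,5,6,7\n"
rows = list(csv.reader(io.StringIO(raw)))
padded = pad_rows(rows)
print(padded)
```

Every row in `padded` has the same length, so the result could be handed to a DataFrame constructor and cleaned up by the user afterwards.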

The yellow highlight area is also giving some issues - e.g. with the extraction only extracting the highlighted areas under some circumstances - but I assume that's a tabula-java issue?

Outputting Format w/ Multi Page PDF

Summary of your issue

Hi there,

I was curious if I'm missing something with regard to convert_into. I pulled down the repo and ran convert_into on the example.pdf you provide as a sanity check, and it's only converting the first page of the PDF to a CSV. Is this normal behavior? Am I missing something important here?

Also, does the format option actually work? I've been trying to use it to get a parsed output object, to no avail.

Write and check your environment.

python --version: 2.7
java -version: 8
OS and its version: OS X El Capitan 10.11.6
Your PDF URL: The example PDF

Relevant Code

Here's the issue with just the first page being output

Here's the format options with `csv`

Handle Java option

To set Java options such as -Xmx, read_pdf() should have an option for it.
ref: #27
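
To show where such an option would have to go, here is a sketch of building the command line, loosely modeled on the `["java", "-jar", jar_path] + options + [input_path]` line visible in a traceback elsewhere on this page. The function `build_java_command` is hypothetical and is not tabula-py's implementation.

```python
import shlex


def build_java_command(jar_path, input_path, tabula_options=(), java_options=None):
    """Build the argument list for launching tabula-java.

    JVM flags such as -Xmx must come before -jar; anything after the jar
    path is passed to the application rather than to the JVM.
    """
    if java_options is None:
        jvm_flags = []
    elif isinstance(java_options, str):
        jvm_flags = shlex.split(java_options)
    else:
        jvm_flags = list(java_options)
    return ["java", *jvm_flags, "-jar", jar_path, *tabula_options, input_path]


print(build_java_command("tabula.jar", "test.pdf", ["--pages", "1"], "-Xmx256m"))
```

The `read_pdf` signature quoted in a later traceback on this page lists a `java_options` parameter, which suggests the request was eventually addressed in some form.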

Some tables' numbers are extracted wrong from pdf

Summary of your issue

Some tables' numbers are extracted wrong from pdf

Environment

problem.pdf

Steps to reproduce:

Read "problem.pdf" file.
Some numbers are wrong.
e.g: Line 6 : 3,797 -> 3,777, 5,317 -> 5,337 and there are some more...

Example code:

df = read_pdf('problem.pdf')

Other info:

Also reproduces on:
tabula-java v1.0.1 with command:
java -jar tabula-1.0.1-jar-with-dependencies.jar problem.pdf -o problem.csv
tabula v1.1.1

read_pdf reads only the first page

read_pdf reads only one page; I cannot set pages="all".
If I set pages="all", it throws an error.

windows 10

Write and check your environment.

python --version: 3.6
java -version: 1.8.0_144
OS and its version: windows 10
Your PDF URL:test.pdf

What did you do when you faced the problem?

I tried to print all table data with pages='all', but it throws an error.

Example code:

from tabula import read_pdf
dfs=read_pdf('test.pdf', encoding='cp1254', output_format='csv', pages='all')
print(dfs)

Output:

Oct 03, 2017 7:17:20 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2
INFO: OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
Oct 03, 2017 7:17:21 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2
INFO: OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
Oct 03, 2017 7:17:22 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2
INFO: OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
Oct 03, 2017 7:17:24 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2
INFO: OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
Oct 03, 2017 7:17:24 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2
INFO: OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
Traceback (most recent call last):
File "D:/pycharm/cdsco/tabula_csv.py", line 2, in <module>
dfs=read_pdf('test.pdf', encoding='cp1254', output_format='csv', pages='all')
File "C:\Users\amal\AppData\Local\Programs\Python\Python36-32\lib\site-packages\tabula\wrapper.py", line 97, in read_pdf
return pd.read_csv(io.BytesIO(output), **pandas_options)
File "C:\Users\amal\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\amal\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 411, in _read
data = parser.read(nrows)
File "C:\Users\amal\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "C:\Users\amal\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas_libs\parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas_libs\parsers.c:10862)
File "pandas_libs\parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas_libs\parsers.c:11138)
File "pandas_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas_libs\parsers.c:11884)
File "pandas_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas_libs\parsers.c:11755)
File "pandas_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 49, saw 5

Process finished with exit code 1

If I remove pages='all', it prints only the table on the first page.

What did you intend to be?

I want to read the data from all pages.

Unable to read Hyphen ( -) properly

Summary of your issue

The text value X410-SATA-S28 in a PDF is converted to X410?SATA?S28 in the CSV. This issue applies to both the Python and Java versions of tabula.

Environment

windows/linux
Write and check your environment.

python --version: ? 2.7
java -version: ? 1.7
OS and its version: ? windows/linux
Your PDF URL: just pdf with one cell and the value X410-SATA-S28

What did you do when you faced the problem?

I will replace ? with a hyphen in code temporarily.

Example code:

java -Xmx4080m -jar C:\Python27\lib\site-packages\tabula\tabula-0.9.2-jar-with-dependencies.jar --pages all --guess --format CSV --outfile C:\Meher\pricelistoutput.csv --spreadsheet C:\Meher\pricelist.pdf

Output:

X410?SATA?S28

What did you intend to be?

X410-SATA-S28
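
One possible explanation, which the report does not confirm, is that the PDF stores a hyphen-like character outside ASCII, and that character is replaced by `?` when the text is written in an encoding that lacks it. The sketch below reproduces that effect in pure Python and shows a workaround similar to the temporary one mentioned above; the character chosen (U+2011) and the helper `normalize_hyphens` are assumptions.

```python
# U+2011 NON-BREAKING HYPHEN looks like "-" but is not an ASCII character.
text = "X410\u2011SATA\u2011S28"

# Encoding to a charset that lacks the character substitutes "?".
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)

# A few hyphen-like code points to fold back to the ASCII hyphen-minus.
HYPHEN_LIKE = "\u2010\u2011\u2012\u2013\u2212"


def normalize_hyphens(value):
    """Replace hyphen-like characters with a plain ASCII hyphen."""
    return value.translate({ord(ch): "-" for ch in HYPHEN_LIKE})


print(normalize_hyphens(text))
```

Normalizing before the lossy encoding step keeps the hyphen, whereas replacing `?` afterwards would also rewrite genuine question marks.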

wrapper.py and tabula jar are missing

Summary of your issue

I can import the library tabula, but the functions are still inaccessible. I checked the directory \site-packages\tabula. The wrapper.py and tabula jar file are missing.

Environment

Write and check your environment.

python --version: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
java -version: java version "1.8.0_131"
OS and its version: Windows 10 64 bit
Your PDF URL:

What did you do when you faced the problem?

I tried to manually place them in the directory and run again. But it still doesn't work.

Example code:

df = tb.read_pdf("D:\\pdf table extract\\clarkfilterdcccrossref.pdf")

Output:

AttributeError                            Traceback (most recent call last)
<ipython-input-2-df8599025f3b> in <module>()
      1 #df = pd.DataFrame()
----> 2 df = tb.read_pdf("D:\\pdf table extract\\clarkfilterdcccrossref.pdf")
      3 tb.convert_into("D:\\pdf table extract\\clarkfilterdcccrossref.pdf","output.csv",output_format="csv")

AttributeError: module 'tabula' has no attribute 'read_pdf'

What did you intend to be?

Read the pdf file.

Cannot execute read_pdf

Summary of your issue

Cannot read PDFs from a directory. It was working before.

Environment

Python 2.7

Example code:

print read_pdf(r"C:\Users\riley\Desktop\Bank Statements\53591.pdf")

Output:

Traceback (most recent call last):
  File "C:/Users/riley/PycharmProjects/Payroll/PayrollParsePDF.py", line 126, in <module>
    print read_pdf(r"C:\Users\riley\Desktop\Bank Statements\53591.pdf")
  File "C:\Python27\lib\site-packages\tabula\wrapper.py", line 54, in read_pdf_table
    output = subprocess.check_output(args)
  File "C:\Python27\lib\subprocess.py", line 212, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "C:\Python27\lib\subprocess.py", line 390, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 640, in _execute_child
    startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

module 'tabula' has no attribute 'read_pdf'

Dear tabula Developers,

I just installed tabula on Windows 10 x64, Anaconda Python 3.6 with the following command:
>c:\Programs\Anaconda\Scripts\pip.exe install tabula-py

After that I restarted my python kernel and imported tabula:

import tabula
df = tabula.read_pdf('my_pdf')

But I get the following error message:

AttributeError: module 'tabula' has no attribute 'read_pdf'

I checked the module folder in Anaconda (c:/Programs/Anaconda/Lib/site-packages/tabula/) and found the 2 jar files and the 2 py files, and wrapper.py contains the read_pdf function.

Can you help me understand why I'm not able to load this function? I tried to check the installed version, but I got:

AttributeError: module 'tabula' has no attribute 'version'

By the way pip returned with a success message after install:

Installing collected packages: tabula-py
Successfully installed tabula-py-0.8.0

I use several external modules in Anaconda and have never had an issue like this...

Thank you!
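
One thing worth ruling out in a report like this (an assumption; nothing above confirms it) is that `import tabula` resolves to something other than tabula-py, for example a different package that also uses the name `tabula` or a local file called `tabula.py`. A small standard-library sketch for checking which file would be loaded, with the hypothetical helper name `locate_module`:

```python
import importlib.util


def locate_module(name):
    """Return the file that `import name` would load, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print(locate_module("tabula"))
```

If the printed path points outside the tabula-py installation in site-packages, the import is being shadowed.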

Maybe a simple answer..

Summary of your issue

It is just not working

Environment

Write and check your environment.

python --version: 3.6
java -version: 8
OS and its version: macOS Sierra
Your PDF URL: http://estaticog1.globo.com/2016/02/02/fuv2016_chamada_1.PDF

What did you do when you faced the problem?

I tried to google it. I did not find anything similar

Example code:

from tabula import read_pdf
df = read_pdf('Fuvest/1.pdf')

Output:

Exception in thread "main" java.lang.UnsupportedClassVersionError: technology/tabula/CommandLineApp : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Traceback (most recent call last):
File "/Users/brunoandradeono/Documents/LiClipse Workspace/Process_imaging/Uni4aDay.py", line 36, in <module>
df = read_pdf('/users/Brunoandradeono/desktop/Fuvest/1.pdf')
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tabula/wrapper.py", line 75, in read_pdf
output = subprocess.check_output(args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-jar', '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tabula/tabula-1.0.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '/users/Brunoandradeono/desktop/Fuvest/1.pdf']' returned non-zero exit status 1.

What did you intend to be?

Get all columns without the header and pass to data frame
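
The `Unsupported major.minor version 51.0` message in the output above refers to the class-file format: major version 51 corresponds to Java 7, so the JVM that actually ran was older than that, even though Java 8 is listed in the environment. The sketch below decodes the number; the helper names are hypothetical.

```python
import re


def java_release_for_class_major(major):
    """Map a class-file major version to the Java release that introduced it.

    Valid for Java 5 (major 49) and later: 49 -> 5, 50 -> 6, 51 -> 7, 52 -> 8.
    """
    if major < 49:
        raise ValueError("mapping only covers class-file major versions >= 49")
    return major - 44


def required_java_release(error_message):
    """Extract the minimum Java release from an UnsupportedClassVersionError text."""
    match = re.search(r"major\.minor version (\d+)\.\d+", error_message)
    if match is None:
        return None
    return java_release_for_class_major(int(match.group(1)))


message = "Unsupported major.minor version 51.0"
print(required_java_release(message))
```

Running `java -version` from the same environment that launches Python would show which runtime is first on the PATH.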

Columns getting merged

Summary of your issue

I have a PDF with a table extending to multiple pages. For some rows, the value in last two (or second last two) columns is getting merged into a single one. Please note that I have removed some of the sensitive information.

Environment

python --version: Python 2.7.6
java -version: java version "1.8.0_144"
OS and its version: Ubuntu 14.04
Your PDF URL: I have attached a screenshot of the table

What did you do when you faced the problem?

Tried lattice=True option but it is not even reading the table

Example code:

all_values = tabula.read_pdf("sample.pdf", pages='all', pandas_options={'header': None, 'error_bad_lines': False, 'warn_bad_lines': False})
all_values = all_values.values.tolist()

for val in all_values:
	print val

Output:

[u'22/06/17', u'IMPS-7-RAHUL-HDFC-XXXXXXXX', u'00007', u'22/06/17', nan, u'1,000.00 14,904.08', nan]
[nan, u'8-', nan, nan, nan, nan, nan]
[u'23/06/17', u'NEFT CHGS INCL ST & CESS 170617', u'000000000000000', u'23/06/17', u'2.88', u'14,901.20', nan]
[u'23/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'23/06/17', nan, u'579.00 15,480.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'9', nan, nan, nan, nan, nan]
[u'23/06/17', u'ACH D- AUFINANCIERS-9', u'0000002', u'23/06/17', u'14,400.00', u'1,080.20', nan]
[u'24/06/17', u'IMPS-7-ANI TECHNOLOGIES PRI-H', u'00007', u'24/06/17', nan, u'2,378.00 3,458.20', nan]
[nan, u'DFC-XXXXXXXXXXX-2017-06', nan, nan, nan, nan, nan]
[nan, u'-24-PAYMENT', nan, nan, nan, nan, nan]
[u'27/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'27/06/17', nan, u'445.00 3,903.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'8', nan, nan, nan, nan, nan]
[u'28/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'28/06/17', nan, u'2,279.00 6,182.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'0', nan, nan, nan, nan, nan]
[u'29/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'29/06/17', nan, u'1,165.00 7,347.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'5', nan, nan, nan, nan, nan]

What did you intend to be?

[u'22/06/17', u'IMPS-7-RAHUL-HDFC-XXXXXXXX', u'00007', u'22/06/17', nan, u'1,000.00', u'14,904.08']
[nan, u'8-', nan, nan, nan, nan, nan]
[u'23/06/17', u'NEFT CHGS INCL ST & CESS 170617', u'000000000000000', u'23/06/17', u'2.88', nan, u'14,901.20']
[u'23/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'23/06/17', nan, u'579.00', u'15,480.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'9', nan, nan, nan, nan, nan]
[u'23/06/17', u'ACH D- AUFINANCIERS-9', u'0000002', u'23/06/17', u'14,400.00', nan, u'1,080.20']
[u'24/06/17', u'IMPS-7-ANI TECHNOLOGIES PRI-H', u'00007', u'24/06/17', nan, u'2,378.00', u'3,458.20']
[nan, u'DFC-XXXXXXXXXXX-2017-06', nan, nan, nan, nan, nan]
[nan, u'-24-PAYMENT', nan, nan, nan, nan, nan]
[u'27/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'27/06/17', nan, u'445.00', u'3,903.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'8', nan, nan, nan, nan, nan]
[u'28/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'28/06/17', nan, u'2,279.00', u'6,182.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'0', nan, nan, nan, nan, nan]
[u'29/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'29/06/17', nan, u'1,165.00', u'7,347.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'5', nan, nan, nan, nan, nan]
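
As a post-processing workaround for the merged cells shown above, a row could be repaired after extraction. This is a sketch under the assumption that a merged cell always contains exactly two whitespace-separated values and that the neighboring cell is empty; the helper names are hypothetical.

```python
def is_missing(value):
    """True for None and for float NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


def split_merged_cell(row, merged_index, empty_index):
    """Move the second of two merged values into the empty neighboring cell."""
    cell = row[merged_index]
    if isinstance(cell, str) and is_missing(row[empty_index]):
        parts = cell.split()
        if len(parts) == 2:
            fixed = list(row)
            fixed[merged_index], fixed[empty_index] = parts
            return fixed
    return list(row)


nan = float("nan")
row = ["22/06/17", "IMPS-7-RAHUL-HDFC-XXXXXXXX", "00007", "22/06/17",
       nan, "1,000.00 14,904.08", nan]
print(split_merged_cell(row, merged_index=5, empty_index=6))
```

This only handles the merged case. Rows where a single balance landed one column too far left, as in the `NEFT CHGS` line, would need a separate rule.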

Repeating headers

Summary of your issue

I have the same table continuing on multiple pages and on each page the header is written again.
This makes sense in a pdf format, but not in a csv.

Is there a way so that it ignores the headers and only converts the table?
For example, convert_without_header=True
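
Until such an option exists, repeated headers can be filtered out after extraction. A minimal sketch on plain row lists, assuming the header text is identical on every page; the helper name `drop_repeated_headers` and the sample rows are hypothetical.

```python
def drop_repeated_headers(rows):
    """Keep the first row as the header and drop later rows identical to it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


# Rows as they might look after concatenating two pages of the same table.
rows = [
    ["Date", "Amount"],
    ["01/01", "10.00"],
    ["Date", "Amount"],  # header repeated at the top of page 2
    ["02/01", "20.00"],
]
print(drop_repeated_headers(rows))
```

The same comparison could be applied to a concatenated DataFrame by dropping rows whose values equal the column names.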

Error tokenizing data. C error

Summary of your issue

The following error appears: 'Error tokenizing data. C error: Expected 4 fields in line 44, saw 5'

Environment

Write and check your environment.

python --version: 2.7
java -version: 8
OS and its version: ?
Your PDF URL:

I cannot post a url to the PDF because it is a bank statement, and thus contains confidential information

What did you do when you faced the problem?

I am using the read_pdf function to extract tables from PDFs. A try/except block is used to catch the errors for PDFs that tabula cannot read.

Example code:

 df = read_pdf(files[i], pages='all', encoding='ISO-8859-1', error_bad_lines=False)

Output:

C:/Users/riley/Desktop/BankStatements/50139.pdf Error tokenizing data. C error: Expected 4 fields in line 44, saw 5
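
The message says that one row has more fields than the parser expected. A minimal sketch for locating such rows, assuming the extracted table is available as CSV text (for example, written by convert_into); `find_ragged_rows` is a hypothetical helper and the sample data is made up:

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (row_number, field_count) for rows whose width differs from row 1."""
    reader = csv.reader(io.StringIO(csv_text))
    expected = None
    ragged = []
    for row_number, row in enumerate(reader, start=1):
        if expected is None:
            expected = len(row)  # the first row sets the expected width
        elif len(row) != expected:
            ragged.append((row_number, len(row)))
    return ragged


sample = "a,b,c,d\n1,2,3,4\n1,2,3,4,5\n"
print(find_ragged_rows(sample))
```

Knowing which rows are wider than the rest makes it easier to see whether two tables of different widths were merged into one output.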

Error tokenizing data

Summary of your issue

Error tokenizing data. C error: Expected 10 fields in line 18, saw 11

Environment

Jupyter Notebook- Anaconda
Write and check your environment.

python --version: >3
java -version: Version 8 update 111: 1.8.0_111
OS and its version: Mac OS Sierra 10.12.4
Your PDF URL: http://www.wrldc.in/9_reportNew/dailydata_01082017.pdf

What did you do when you faced the problem?

I used read_pdf on above url and I received this error:
"Error tokenizing data. C error: Expected 10 fields in line 18, saw 11"

Can 'convert_into()' pdf file to json but executing 'read_pdf()' as json gives UTF-8 encoding error.

Summary of your issue

Can 'convert_into()' pdf file to json, but executing 'read_pdf()' as json gives UTF-8 encoding error.

Environment

Write and check your environment.

python --version: ? 3.6.1.final.0, jupyter notebook 5.0.0
java -version: ?
OS and its version: ? win64 anaconda 4.3.22
Your PDF URL: https://www.dropbox.com/s/rg11o0iitia4zua/QA-17H104161-2017-09-22-DO.pdf?dl=0

What did you do when you faced the problem?

I don't understand why the convert_into function works fine with this pdf, but passing the same pdf into read_pdf() yields an encoding error. Shouldn't the default options for both functions be identical?

Example code:

from tabula import read_pdf
from tabula import convert_into
import pandas
file = 'T:/baysestuaries/Data/WDFT-Coastal/db_archive/QA/QA-17H104161-2017-09-22-DO.pdf'
convert_into(file,"test.json", output_format='json')
df = read_pdf(file, output_format='json')

Output:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-208-fc7babef8e03> in <module>()
----> 1 df = read_pdf(file, output_format='json')

C:\Users\ETurner\AppData\Local\Continuum\Anaconda3\lib\site-packages\tabula\wrapper.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)
     90 
     91         else:
---> 92             return json.loads(output.decode(encoding))
     93 
     94     else:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5134: invalid start byte

What did you intend to be?

Ideally, the behavior of both functions should be identical. I am actually trying to read this pdf as a pandas dataframe, but it is very messy. Just reading it as a json works for me so I can parse out the items I need. However, I don't want to have to convert files first and waste disk space.
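
The traceback shows read_pdf decoding the Java output with its `encoding` argument, and byte 0xb0 is the degree sign in Latin-1/cp1252. That the output here is in a Windows code page is an assumption. The sketch below illustrates the failure and a fallback decode; `loads_with_fallback` is a hypothetical helper, not tabula-py code.

```python
import json


def loads_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn and parse the first successful decode as JSON."""
    last_error = None
    for encoding in encodings:
        try:
            return json.loads(raw.decode(encoding))
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error


# Made-up stand-in for tabula-java output containing a cp1252 degree sign.
raw = b'[{"text": "25\xb0C"}]'
parsed = loads_with_fallback(raw)
print(parsed)
```

If the assumption holds, passing a matching `encoding` value to read_pdf may be enough, since that parameter appears in the signature shown in the traceback.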

area has no effect when page >1

hi,

Summary of your issue

area has no effect when page > 1

Environment

Write and check your environment.

python --version: 3.4
java -version: 8
OS and its version: windows 10

df = tabula.read_pdf(self.filename,
                     pages=1,
                     area=[470.58, 29.388, 745.116, 580.692],
                     columns=[30.876, 56.916, 267.468, 361.212, 454.956, 481.74],
                     encoding="latin-1",
                     pandas_options={"encoding": "latin-1"})

KO :no effect

df = tabula.read_pdf(self.filename,
                     pages=2,
                     area=[470.58, 29.388, 745.116, 580.692],
                     columns=[30.876, 56.916, 267.468, 361.212, 454.956, 481.74],
                     encoding="latin-1",
                     pandas_options={"encoding": "latin-1"})

tested with 3 pdf files

thanks a lot

tabula java requirements v6 or v7.

Please add to the tabula-py documentation that tabula requires Java 6 or Java 7 only!

I spent many hours trying to diagnose why I could not get tabula-py to run. I had java but it was not the correct version.

From the tabula readme

Using Tabula

First, make sure you have a recent copy of Java installed. You can
download Java at https://www.java.com/download/ . Tabula requires
a Java Runtime Environment compatible with Java 6 or Java 7.

To work around the other Java versions installed, I prepend the path to the Java 7 runtime I installed:

import sys
b = sys.path
sys.path = ['/opt/java/jre1.7.0_79/bin'] + b

And can now extract the tables. This would have saved a lot of time if requirements said
java 6 or java 7
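
A note on the snippet above: `sys.path` is Python's module search path, while child processes such as `java` are located through the `PATH` environment variable. A minimal sketch of prepending a directory to `PATH` instead, using the JRE location quoted in this report; `prepend_to_path` is a hypothetical helper:

```python
import os


def prepend_to_path(directory, environ=None):
    """Return a copy of the environment with `directory` first on PATH."""
    env = dict(os.environ if environ is None else environ)
    current = env.get("PATH", "")
    env["PATH"] = directory + (os.pathsep + current if current else "")
    return env


# A small made-up environment is used here so the result is easy to inspect.
env = prepend_to_path("/opt/java/jre1.7.0_79/bin", {"PATH": "/usr/bin"})
print(env["PATH"])
```

To affect the current process, the new value could be assigned to `os.environ["PATH"]` before calling tabula.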

Rename spreadsheet/no-spreadsheet options

After tabula-java 0.9.2, spreadsheet/no-spreadsheet options are renamed to lattice/stream.

tabulapdf/tabula-java@6e1b540

non-zero exit status

Hi, I tried to run this code on Mac OSX with Anaconda and Java 8:

import tabula
df = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
print(df)

I got this error message:
Exception in thread "main" java.lang.NoSuchMethodError: java.lang.Integer.compare(II)I
    at technology.tabula.TextChunk.isLtrDominant(TextChunk.java:179)
    at technology.tabula.TextElement.mergeWords(TextElement.java:266)
    at technology.tabula.TextElement.mergeWords(TextElement.java:105)
    at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:178)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:161)
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    df = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
  File "/Users/username/anaconda/lib/python3.5/site-packages/tabula/wrapper.py", line 54, in read_pdf_table
    output = subprocess.check_output(args)
  File "/Users/username/anaconda/lib/python3.5/subprocess.py", line 626, in check_output
    **kwargs).stdout
  File "/Users/username/anaconda/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-jar', '/Users/username/anaconda/lib/python3.5/site-packages/tabula/tabula-0.9.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '4931.pdf']' returned non-zero exit status 1

Any idea what might cause this? thanks!
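
`Integer.compare` was added in Java 7, so a `NoSuchMethodError` for it suggests that the `java` command that actually ran is older than the Java 8 mentioned above. This is an assumption, since the report does not include `java -version` output. A small sketch for reading the major version from that output; `java_major_version` is a hypothetical helper:

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Releases before Java 9 report themselves as 1.x (for example 1.8.0_111).
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


print(java_major_version('java version "1.8.0_111"'))
print(java_major_version('java version "1.6.0_65"'))
```

The text to parse would come from running `java -version`, which writes to stderr rather than stdout.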

Column header

Summary of your issue

I am extracting data from a table using the "guess=True" option. Unfortunately, the first row is imported as the column header. (The guess is not really wrong, since the typeface is bold and there is a line below it, see Example.) I didn't find a way to tell read_pdf_table not to treat that particular first line as the column header.

Would it be possible to add a "header=None" option?

Add test

ImportError: cannot import name 'read_pdf'

Summary of your issue

Hi, when I tried the example notebook, the Anaconda IDE (Spyder) gave me an error message:

from tabula import read_pdf
Traceback (most recent call last):

  File "<ipython-input-25-59770b1fd371>", line 1, in <module>
    from tabula import read_pdf

ImportError: cannot import name 'read_pdf'

However, the following command works
import tabula

Environment

anaconda (python 3.5) + windows10

Bump up tabula-java to 1.0.1

Trouble converting or extracting multiple pages

Summary of your issue

In my case, converting a PDF file with multiple pages failed. I tried both converting it into a DataFrame and converting it to CSV, and both failed. It worked when converting without any page argument; however, in that case only the first page is converted.

Environment

Write and check your environment.

python --version: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
java -version: 1.8.0_131
OS and its version: Windows 10
Your PDF URL: http://www.cea.nic.in/reports/monthly/generation/2017/May/tentative/opm_16.pdf

Example code:

from tabula import read_pdf, convert_into
convert_into("opm_16.pdf", "test.csv", output_format="csv", page=1)

Output:

CalledProcessError                        Traceback (most recent call last)
<ipython-input-29-bde26bfa1bde> in <module>()
----> 1 convert_into("opm_16.pdf", "test.csv", output_format="csv", page=1)

C:\Users\{user}\Anaconda3\lib\site-packages\tabula\wrapper.py in convert_into(input_path, output_path, output_format, java_options, **kwargs)
    138 
    139     try:
--> 140         subprocess.check_output(args)
    141     finally:
    142         if is_url:

C:\Users\{user}\Anaconda3\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
    334 
    335     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 336                **kwargs).stdout
    337 
    338 

C:\Users\{user}\Anaconda3\lib\subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    416         if check and retcode:
    417             raise CalledProcessError(retcode, process.args,
--> 418                                      output=stdout, stderr=stderr)
    419     return CompletedProcess(process.args, retcode, stdout, stderr)
    420 

CalledProcessError: Command '['java', '-jar', 'C:\\Users\\{user}\\Anaconda3\\lib\\site-packages\\tabula\\tabula-0.9.2-jar-with-dependencies.jar', '--pages', '1', '--guess', '--format', 'CSV', '--outfile', 'test.csv', 'opm_16.pdf']' returned non-zero exit status 1.

Java issue with linux 64-bit using any java version

I am trying to run a simple test to scrape a table from a PDF using
df = read_pdf_table("./test.pdf", guess=False)
But I get
CalledProcessError: Command '['java', '-jar', '/home/duduvz14/anaconda3/lib/python3.6/site-packages/tabula/tabula-0.9.2-jar-with-dependencies.jar', '--pages', '11', './test.pdf']' returned non-zero exit status 1.
I am using 64-bit Ubuntu Linux and have already tried Java 6, 7 and 8, but the error persists.
Thanks in advance for the support
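
`CalledProcessError` only reports the exit status, so the Java error message is needed to diagnose this. A generic debugging sketch that runs a command and returns its stderr instead of raising; `run_and_report` is a hypothetical helper, and the command used here is a stand-in rather than the real `java -jar ...` call:

```python
import subprocess
import sys


def run_and_report(args):
    """Run a command; return (exit_code, stderr_text) instead of raising."""
    completed = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return completed.returncode, completed.stderr.decode("utf-8", errors="replace")


# Stand-in command that fails the way a broken `java -jar ...` call would.
code, err = run_and_report(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"]
)
print(code, err)
```

Passing the argument list from the error message above to the same helper would show what tabula-java printed before exiting.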

Unable to execute my script on client system

Summary of your issue

I have developed a script using tabula-py and created an executable using pyinstaller which works fine on my PC but when I run it on my client's PC it gives me the error

'The system cannot find the file specified tabula/wrapper.py', although the file is already present in the executable folder at the correct path. Does the client need to have Java installed to run the script?

Environment

python --version: 3.5
java -version: 8
OS and its version: win8
Your PDF URL:

What did you do when you faced the problem?

I have created an executable file using pyinstaller.

What did you intend to be?

I want to run my executable file on a windows PC without java and python installed. I guess I need to install java on PC to run my executable.

Unable to extract Japanese characters

Reported in this issue tabulapdf/tabula-java#114 (comment)

It depends on tabulapdf/tabula-java#52

AttributeError: 'module' object has no attribute 'read_pdf'

Summary of your issue

When importing the read_pdf method from tabula-py using
from tabula import read_pdf
as the example demonstrated
It shows the following error message
AttributeError: 'module' object has no attribute 'read_pdf'

Environment

anaconda python 2.1.12 + tabula 0.9.0
Write and check your environment.

python --version: ? anaconda python 2.1.12
java -version: ? java 1.8.0_111
OS and its version: ? windows 10
Your PDF URL:
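
One possible cause of this kind of error is that Python is importing something other than tabula-py under the name `tabula`, for example a different installed package or a local file called tabula.py. This is an assumption; the report does not confirm it. A sketch for checking where a module was loaded from, with `describe_module` as a hypothetical helper:

```python
import importlib


def describe_module(module_name, attribute):
    """Return the file a module was loaded from and whether it has `attribute`."""
    module = importlib.import_module(module_name)
    return getattr(module, "__file__", None), hasattr(module, attribute)


# Demonstrated on a standard-library module; for this report the call would be
# describe_module("tabula", "read_pdf").
location, has_attribute = describe_module("json", "loads")
print(location, has_attribute)
```

If the reported location is not inside the tabula-py installation, the import is being shadowed.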

CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3

I am trying to extract the tables from a number of pdf documents:

In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="all")

Out:


---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-31-c86da9ee0350> in <module>()
      1 from tabula import read_pdf_table
----> 2 pdf_table = read_pdf_table("../file.pdf", pages="all")
      3 type(pdf_table)

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, options, pages, guess, area, spreadsheet, password, nospreadsheet, silent)
    100         return
    101 
--> 102     return pd.read_csv(io.BytesIO(output))

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    398         return parser
    399 
--> 400     data = parser.read()
    401     parser.close()
    402     return data

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
    936                 raise ValueError('skipfooter not supported for iteration')
    937 
--> 938         ret = self._engine.read(nrows)
    939 
    940         if self.options.get('as_recarray'):

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1503     def read(self, nrows=None):
   1504         try:
-> 1505             data = self._reader.read(nrows)
   1506         except StopIteration:
   1507             if self._first_chunk:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:9884)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:25878)()

CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3

I tried setting the sep parameter to \t. Nevertheless, it did not work. What can I do?

Converting selected pages

Hello all, is there any option for converting selected pages of a PDF into Excel/CSV? For example, if I want to convert only the tables on pages 12 to 20 of the PDF, what options do I have? Regards
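
The README example passes a `pages` keyword (`pages='all'`). To my understanding, it also accepts a range string such as "12-20", but treat that as an assumption to check against the documentation. A minimal sketch, with hypothetical helper names and file paths:

```python
def page_range(first, last):
    """Build a page-range string such as "12-20"."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return "{}-{}".format(first, last)


def convert_pages(pdf_path, csv_path, first, last):
    """Convert only the given pages; needs tabula-py and Java at call time."""
    import tabula  # imported lazily so page_range works without it

    tabula.convert_into(pdf_path, csv_path, output_format="csv",
                        pages=page_range(first, last))


print(page_range(12, 20))
```

A call such as `convert_pages("input.pdf", "output.csv", 12, 20)` would then cover the case described above.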

python Tabula : FileNotFoundError: [WinError 2] The system cannot find the file specified

Summary of your issue

I'm getting an error while reading a pdf file via tabula

Environment

Write and check your environment.

python --version:3 ?
java -version: 8?
OS and its version: Win7 32bit ?
Your PDF URL:

What did you do when you faced the problem?

Below is the code used.

Example code:

import tabula
df = tabula.read_pdf("D:/Users/rag/Documents/GE_Confidential/Projects/GE_Health_Care/pdf/test.pdf")

Output:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-11-1c72e9de1c11> in <module>()
----> 1 df = tabula.read_pdf("D:/Users/rag/Documents/GE_Confidential/Projects/GE_Health_Care/pdf/test.pdf")

D:\Users\rag\AppData\Local\Continuum\Anaconda3\lib\site-packages\tabula\wrapper.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)
     73 
     74     try:
---> 75         output = subprocess.check_output(args)
     76     finally:
     77         if is_url:

D:\Users\rag\AppData\Local\Continuum\Anaconda3\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
    334 
    335     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 336                **kwargs).stdout
    337 
    338 

D:\Users\rag\AppData\Local\Continuum\Anaconda3\lib\subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    401         kwargs['stdin'] = PIPE
    402 
--> 403     with Popen(*popenargs, **kwargs) as process:
    404         try:
    405             stdout, stderr = process.communicate(input, timeout=timeout)

D:\Users\rag\AppData\Local\Continuum\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    705                                 c2pread, c2pwrite,
    706                                 errread, errwrite,
--> 707                                 restore_signals, start_new_session)
    708         except:
    709             # Cleanup if the child failed starting.

D:\Users\rag\AppData\Local\Continuum\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
    988                                          env,
    989                                          cwd,
--> 990                                          startupinfo)
    991             finally:
    992                 # Child is launched. Close the parent's copy of those pipe

FileNotFoundError: [WinError 2] The system cannot find the file specified

What did you intend to be?

I want to read a PDF table and convert it to a DataFrame for further analysis.
If there is any alternative, please let me know how to do it.

Many thanks in advance.
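
The traceback ends inside `Popen`, which usually means the program being launched could not be found, not the PDF. Since tabula-py launches `java`, one thing to check is whether `java` is on `PATH`, as the README asks. A minimal sketch using the standard library; `find_java` is a hypothetical helper:

```python
import shutil


def find_java(command="java"):
    """Return the full path of the Java launcher, or None if it is not on PATH."""
    return shutil.which(command)


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; install a Java runtime or fix PATH")
else:
    print("java found at", java_path)
```

`shutil.which` is available from Python 3.3 onward.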

Why use both -g and -a flags?

Why use both -g and -a flags in options for extracting a specific area? Isn't -g overriding the -a flag? You mention that it does not work without the -r option. I just tried without -r and -g and it works well.

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Multiple pages in convert_into()

Summary of your issue

convert_into() converts only the first page.

Environment

python --version: Python 2.7.6
java -version: java version "1.8.0_144"
OS and its version: Ubuntu 14.04
Your PDF URL: Not required

Example code:

tabula.convert_into("test.pdf", "output.csv", output_format="csv")

Output:

CSV with table from the first page only.

What did you intend to be?

CSV with tables from all the pages.

java.lang.OutOfMemoryError: GC overhead limit exceeded

Summary of your issue

My input PDF file is very large, around 9000 pages. (It works fine if I select only a few pages.)

Environment

Trying both in windows and linux
Write and check your environment.

python --version: 2.7.13
java -version: ? 1.7 (tried only python)
OS and its version: ? windows10 and linux (tried both)
Your PDF URL: the solution works fine for a few pages but not for 9000 pages.

What did you do when you faced the problem?

Example code:

tabula.convert_into("C:\Meher\pricelist.pdf", "C:\Meher\pricelistoutput.csv", spreadsheet=True,output_format="csv", pages="all")

Output:

tabula.convert_into("C:\Meher\pricelist.pdf", "C:\Meher\pricelistoutput.csv", spreadsheet=True,output_format="csv", pages="all")
May 16, 2017 6:12:14 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at technology.tabula.ObjectExtractor.processTextPosition(ObjectExtractor.java:329)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:504)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:56)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
    at technology.tabula.ObjectExtractor.drawPage(ObjectExtractor.java:153)
    at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:10
    at technology.tabula.PageIterator.next(PageIterator.java:29)
    at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:160)
    at technology.tabula.CommandLineApp.extractFileInto(CommandLineApp.java:
    at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:128)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:10
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python27\lib\site-packages\tabula\wrapper.py", line 114, in convert_into
    subprocess.check_output(args)
  File "C:\Python27\lib\subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['java', '-jar', 'C:\Python27\lib\site-packages\tabula\tabula-0.9.2-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'CSV', '--outfile', 'C:\Meher\pricelistoutput.csv', '--spreadsheet', 'C:\Meher\pricelist.pdf']' returned non-zero exit status 1
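
Since a few pages at a time work, one possible approach is to convert the file in page-range chunks and give the JVM more heap through `java_options`, a parameter that appears in the convert_into signature quoted elsewhere on this page. The sketch below makes several assumptions: that `pages` accepts "start-end" strings, that the chunk size and the `-Xmx2g` heap value suit the file, and that one CSV per chunk is acceptable. The helper names are hypothetical.

```python
def page_chunks(total_pages, chunk_size):
    """Split 1..total_pages into "start-end" strings of at most chunk_size pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    chunks = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        chunks.append("{}-{}".format(start, end))
    return chunks


def convert_in_chunks(pdf_path, out_prefix, total_pages, chunk_size=500):
    """Write one CSV per chunk; needs tabula-py and Java at call time."""
    import tabula  # imported lazily so page_chunks works without it

    for index, pages in enumerate(page_chunks(total_pages, chunk_size)):
        tabula.convert_into(pdf_path, "{}_{}.csv".format(out_prefix, index),
                            output_format="csv", pages=pages,
                            java_options=["-Xmx2g"])  # heap size is a guess


print(page_chunks(9000, 4000))
```

Each call starts a fresh Java process, so memory used for one chunk is released before the next chunk begins.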

Columns keyword?

In the tabula-java command line utility there is a --columns option where the x-coordinates of the columns can be given. This is really useful for poorly formatted tables. Would it be possible to add it here?

AttributeError when running read_pdf_table

File "/Users/nestordeharo/Documents/Python/scripts_automatizar/venv/lib/python2.7/site-packages/tabula/wrapper.py", line 74, in read_pdf_table
result = subprocess.run(args, stdout=subprocess.PIPE)
AttributeError: 'module' object has no attribute 'run'
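
`subprocess.run` was added in Python 3.5, and the path in the traceback shows Python 2.7, which would explain the error. A sketch of a fallback that also works on older interpreters; `capture_stdout` is a hypothetical helper and not how tabula-py itself handles this:

```python
import subprocess
import sys


def capture_stdout(args):
    """Return a command's stdout as bytes, with or without subprocess.run."""
    if hasattr(subprocess, "run"):  # Python 3.5 and later
        return subprocess.run(args, stdout=subprocess.PIPE).stdout
    # Older interpreters: check_output also raises on a non-zero exit status.
    return subprocess.check_output(args)


output = capture_stdout([sys.executable, "-c", "print('ok')"])
print(output)
```

The two branches differ in one respect: `check_output` raises `CalledProcessError` on failure, while `subprocess.run` without `check=True` does not.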
