ashima / pdf-table-extract

Extract tables from PDF pages.

License: MIT License

Python 100.00%

pdf-table-extract's Issues

Error reporting needs an option to dump the full traceback

The current exception catching tends to hide the source of the error, making debugging hard.
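One way such an option could look, as a minimal sketch only: the helper name `run_with_error_report` and the `show_traceback` flag are hypothetical and not part of pdf-table-extract. The short message stays the default, and the full traceback is printed on request.

```python
import sys
import traceback


def run_with_error_report(func, show_traceback=False, stream=None):
    """Call func() and return (ok, result); report any exception to stream.

    Hypothetical helper for illustration, not part of pdf-table-extract.
    """
    stream = stream if stream is not None else sys.stderr
    try:
        return True, func()
    except Exception as exc:
        if show_traceback:
            # Full traceback, so the failing file and line are visible.
            traceback.print_exc(file=stream)
        else:
            # Short form: exception type and message only.
            stream.write("{0}: {1}\n".format(type(exc).__name__, exc))
        return False, None
```

A command-line tool could map a debug flag onto `show_traceback` and keep the short message otherwise.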

Windows 7 problem

(copied by chris from 895500a#commitcomment-6903404)
@amccarren commented on 895500a 9 hours ago

When I run this example I get 
CalledProcessError: Command '['pdftoppm', '-h']' returned non-zero exit status 99

I am working on Windows 7 and have added the directory containing pdftoppm.exe to the path.
Any suggestions?

Andrew
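A CalledProcessError means the program was found and started but exited with a non-zero status. Whether status 99 from `pdftoppm -h` is expected in the reporter's version is not stated here. As a sketch under that uncertainty, a probe that only checks whether the executable can be launched, and ignores the exit status, might look like this (`can_launch` is a hypothetical name, not part of pdf-table-extract):

```python
import subprocess


def can_launch(command):
    """Return True if the command starts at all, whatever its exit status.

    Hypothetical helper for illustration, not part of pdf-table-extract.
    """
    try:
        # No check=True: a non-zero exit status is not treated as an error.
        subprocess.run(command,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    except OSError:
        # Raised when the executable is missing or cannot be executed.
        return False
    return True
```

For example, `can_launch(["pdftoppm", "-h"])` would report whether the executable is reachable at all.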

Consider merging with Pandas

Consider adding pdf-table-extract as a data source in the pandas library. pandas already supports a number of external libraries for data I/O.

For example see: pandas-dev/pandas#4556

Made Appropriate Changes

For Python 3.x, the following files need changes:

1. In src/pdftableextract/core.py:
   a. In the try: ... except: ... blocks, lines written as "except Exception, e" should be "except Exception as e".
   b. In the same file, the getCell() function should not use double parentheses.
2. In src/pdftableextract/__init__.py, "from core ..." should be "from .core import ...".
3. In src/pdftableextract/extracttab.py, point 1b should be applied as well.
4. In src/pdftableextract/pnm.py, line #31 should use print() for Python 3.x.

Thanks!
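The Python 3 forms mentioned in this list can be illustrated with a small, self-contained sketch. The function `parse_int` is a made-up example, not code from the repository. The relative import is shown only as a comment because it works only inside a package.

```python
# Inside a package's __init__.py, Python 3 needs an explicit relative import:
#     from .core import process_page
# (Python 2 also accepted the implicit form "from core import ...".)


def parse_int(text):
    """Return int(text), or None if the text cannot be parsed."""
    try:
        return int(text)
    except Exception as e:  # Python 2 also allowed "except Exception, e"
        # print is a function in Python 3, so the parentheses are required.
        print("could not parse {0!r}: {1}".format(text, e))
        return None
```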

Does it support Chinese characters in PDF documents?

I'm trying to extract a table that contains Chinese characters, but it does not work (no characters are extracted). The output is as follows.

                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    

[30 rows x 26 columns]

How can I make it support Chinese?

Examples

Hi,
Would it be possible to provide an example that uses an ST Micro datasheet? Cheers! Rob

Consider using pdfminer as library / alternative to pdftoppm

Consider using pdfminer as library / alternative to pdftoppm.

pdfminer is a pure python implementation.

See: https://github.com/euske/pdfminer/

Hi,
I'm trying to use pdf-table-extract on a slightly different PDF table, but its entire first column is treated as a header cell. Where can I find more documentation about process_page()? Parameter descriptions, examples... anything would be welcome.
Thanks in advance

Command failed: pdftoppm -h

Delete not defined

The function delete in core.py is not defined.

I get the following error whenever this code is executed.
  File "test_to_pandas.py", line 6, in <module>
    cells = [pdf.process_page("pg_0001.pdf",p) for p in pages]
  File "/usr/local/lib/python2.7/dist-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/core.py", line 179, in process_page
    vd = delete(vd,i)
NameError: global name 'delete' is not defined

This code section comes from core.py

  j = 0
  while j < len(hd):
    if hd[j+1]-hd[j] > maxdiv :
      hd = delete(hd,j)
      hd = delete(hd,j)
    else:
      j=j+2

I have attached an image of the PDF I was trying to parse.

Example not working, no docs available

I'm trying to set things up so I can convert PDFs to tables.

I'm using Fedora 20 and I have run:

python setup.py install
pip install pandas

Both packages pdftoppm and pdftotext are installed.

Running this:

cd example
python test_to_pandas.py

fails with

Traceback (most recent call last):
  File "test_to_pandas.py", line 5, in <module>
    cells = [pdf.process_page("a.pdf",p) for p in pages]
  File "/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/core.py", line 89, in process_page
    (maxval, width, height, data) = readPNM(p.stdout)
  File "/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/pnm.py", line 24, in readPNM
    raise IOError("Expected 2 elements from parsing PNM file, got {0}: {1}".format(ls, name))
IOError: Expected 2 elements from parsing PNM file, got 0: <pipe>

PS: I've renamed the PDF from example.pdf to a.pdf
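The message "got 0: <pipe>" suggests that nothing arrived on the pipe from the converter. As a sketch only, a header reader could report that case explicitly. `read_pnm_header` is a hypothetical function, not the project's `readPNM`. It assumes a binary PNM stream with the magic number on its own line and "width height" on the next non-comment line.

```python
def read_pnm_header(stream):
    """Read (magic, width, height) from a binary PNM stream.

    Hypothetical helper for illustration, not the project's readPNM.
    """
    magic = stream.readline().strip()
    if not magic:
        # Nothing was read, e.g. the converter produced no output.
        raise IOError("no image data received; did the converter produce output?")
    if magic not in (b"P4", b"P5", b"P6"):
        raise IOError("unexpected PNM magic number: {0!r}".format(magic))
    line = stream.readline()
    while line.startswith(b"#"):
        # Skip comment lines between the magic number and the dimensions.
        line = stream.readline()
    parts = line.split()
    if len(parts) != 2:
        raise IOError("expected width and height, got {0!r}".format(line))
    return magic.decode("ascii"), int(parts[0]), int(parts[1])
```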

Unclear failure mode if pdftoppm not installed

When pdftoppm is not installed, pdf-table-extract fails:

----> 1 pdf.process_page("foo.pdf", "1")

/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/core.pyc in process_page(infile, pgs, outfilename, greyscale_threshold, page, crop, line_length, bitmap_resolution, name, pad, white, black, bitmap, checkcrop, checklines, checkdivs, checkcell)
     54 # image load secion.
     55
---> 56   (maxval, width, height, data) = readPNM(p.stdout)
     57
     58   pad = int(pad)

/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/pnm.pyc in readPNM(fd)
     20   data = fd.read()
     21
---> 22   xs, ys = s.split()
     23   width = int(xs)
     24   height = int(ys)

ValueError: need more than 0 values to unpack
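A clearer failure mode could come from checking for the external programs before any page is processed. This is a minimal sketch, assuming Python 3.3 or later for `shutil.which`. `require_tools` is a hypothetical name, not an existing function in pdf-table-extract.

```python
import shutil


def require_tools(names):
    """Raise a clear error if any named executable is missing from PATH.

    Hypothetical helper for illustration, not part of pdf-table-extract.
    """
    missing = [name for name in names if shutil.which(name) is None]
    if missing:
        raise RuntimeError(
            "required program(s) not found on PATH: " + ", ".join(missing)
        )
```

Calling `require_tools(["pdftoppm", "pdftotext"])` up front would name the missing program, so the failure would no longer surface as an unpacking error in the PNM reader.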

Output to html is giving self not defined.

$ pdf-table-extract -i A1.pdf -o a1.html -p 1 -t table_html
global name 'self' is not defined

Error on crop: slice indices must be integers or None or have an __index__ method

I get "slice indices must be integers or None or have an __index__ method" when I try to use the -c option:

$ pdf-table-extract -i test.pdf -o test.txt -p 1
[works okay]
$ pdf-table-extract -i test.pdf -o test.txt -p 1 -c 0:0:3:3
slice indices must be integers or None or have an __index__ method
$
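Python raises this error when a slice bound is a string or a float instead of an integer. Whether that is what happens inside the tool is an assumption, as is the meaning of the four colon-separated fields. Under those assumptions, converting the crop specification to integers before slicing might be sketched as follows (`parse_crop` is a hypothetical name):

```python
def parse_crop(spec):
    """Turn a string such as "0:0:3:3" into a tuple of four ints.

    Hypothetical helper for illustration; the field meanings are not assumed.
    """
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(
            "expected four colon-separated values, got {0!r}".format(spec)
        )
    # int(round(float(...))) accepts both "3" and "3.0" and always yields an int.
    return tuple(int(round(float(part))) for part in parts)
```

The resulting integers can be used directly as slice bounds.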

NameError: global name 'hd' is not defined

Running the following command:

python extracttab.py  -i ~/Downloads/HIST-PRICES.pdf -o ~/Downloads/HIST-PRICES.csv -p 1 -t table_csv

using the file found here: http://www.ico.org/historical/2010-19/PDF/HIST-PRICES.pdf leads to:

Traceback (most recent call last):
  File "extracttab.py", line 487, in <module>
    } [ args.t ](cells,args.page)
  File "extracttab.py", line 431, in o_table_csv
    ] for x in range(len(pgs))
NameError: global name 'hd' is not defined

IndexError when last page is blank

If the last page is completely blank, an IndexError occurs.

Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "scrape_pdf2.py", line 159, in test
    pdflist = get_table_pages(pages)
  File "scrape_pdf2.py", line 95, in get_table_pages
    cells = [pdf.process_page("example.pdf",p) for p in pages]
  File "/Users/jonathan/.virtualenvs/elance/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/core.py", line 211, in process_page
    if vd[i+1]-vd[i] > maxdiv :
IndexError: index out of bounds
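Until the library handles blank pages itself, a caller could process pages one at a time and record the ones that fail rather than losing the whole run. This is a sketch of a caller-side workaround, not a fix for the underlying bug. `process_pages` and its `process` argument are hypothetical names.

```python
def process_pages(process, pages):
    """Apply process(page) to each page and return (results, failures).

    Pages that raise IndexError (for example blank pages) are collected
    in failures as (page, exception) pairs and do not abort the run.
    """
    results, failures = [], []
    for page in pages:
        try:
            results.append(process(page))
        except IndexError as exc:
            failures.append((page, exc))
    return results, failures
```

With the call from the traceback, this could be used as `process_pages(lambda p: pdf.process_page("example.pdf", p), pages)`.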
