
pdf-table-extract's People

Contributors

boblannon, jbzdak

pdf-table-extract's Issues

Windows 7 problem

(copied by chris from 895500a#commitcomment-6903404)
@amccarren commented on 895500a 9 hours ago

When I run this example I get 
CalledProcessError: Command '['pdftoppm', '-h']' returned non-zero exit status 99

I am working on Windows 7 and have set the path to the directory where pdftoppm.exe is stored.
Any suggestions?

Andrew
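
A CalledProcessError is only raised after the command has actually started and exited with a non-zero status, so the message above suggests pdftoppm was found and ran. A minimal sketch for checking this outside the library follows; the helper name help_exit_status is hypothetical and not part of pdf-table-extract.

```python
import subprocess


def help_exit_status(tool="pdftoppm"):
    """Run `tool -h` and return its exit status, or None if it cannot be started.

    Hypothetical helper, not part of pdf-table-extract.
    """
    try:
        # No check=True here: a non-zero status is returned, not raised.
        result = subprocess.run(
            [tool, "-h"], stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
    except FileNotFoundError:
        # The executable is not reachable through PATH.
        return None
    return result.returncode


if __name__ == "__main__":
    status = help_exit_status()
    if status is None:
        print("pdftoppm could not be started; check PATH")
    else:
        print("pdftoppm started and exited with status", status)
```

If this prints a non-zero status, the tool is installed and the failure comes from treating the help flag's exit code as an error.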

Made Appropriate Changes

For Python 3.x
The following files need changes:

  1. In src/pdftableextract/core.py
    a. In the try: ... except: ... blocks, each Exception, e should be written Exception as e.
    b. In the same file, the double parentheses in the getCell() function should be removed.

  2. In src/pdftableextract/__init__.py
    Instead of from core ....
    it should be from .core import .......

  3. In src/pdftableextract/extracttab.py
    Point 1b should be applied here as well.

  4. In src/pdftableextract/pnm.py
    The print statement on line #31 should be a print() call for Python 3.x.

Thanks!
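
The changes listed above correspond to standard Python 2 to Python 3 syntax differences. The sketch below shows each one in isolation. The function names parse_int and cell_area are made up for illustration, and reading the "double parentheses" as tuple parameters in the function signature is an assumption, since the original getCell() code is not quoted here.

```python
def parse_int(text):
    """Show the Python 3 exception and print syntax."""
    try:
        return int(text)
    except Exception as e:  # Python 2 also accepted: except Exception, e:
        # print is a function in Python 3, so it needs parentheses.
        print("could not parse {0!r}: {1}".format(text, e))
        return None


# Assumption: the "double parentheses" are tuple parameters, which Python 3
# removed. Python 2 allowed: def cell_area((x, y, w, h)): ...
def cell_area(cell):
    x, y, w, h = cell  # unpack inside the body instead
    return w * h


# Inside a package's __init__.py the import must be explicitly relative:
#   from .core import <names>     (Python 2 allowed: from core import <names>)
```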

Does it support Chinese characters in the pdf documents?

I'm trying to extract a table with Chinese characters, but it does not work (no characters are extracted), and the output is as follows.

                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    
                     ...                    

[30 rows x 26 columns]

How can I make it support Chinese?
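
One thing worth ruling out is the text encoding used when reading pdftotext output. The sketch below asks pdftotext for UTF-8 explicitly and decodes the bytes as UTF-8. Whether this resolves the problem is not known: how pdf-table-extract invokes pdftotext is not shown here, and the PDF itself may lack extractable text. The helper names are hypothetical.

```python
import subprocess


def pdftotext_cmd(infile, page):
    """Build a pdftotext command that writes one page as UTF-8 to stdout."""
    return [
        "pdftotext",
        "-enc", "UTF-8",    # output encoding
        "-f", str(page),    # first page to convert
        "-l", str(page),    # last page to convert
        infile,
        "-",                # write to stdout instead of a file
    ]


def read_page_text(infile, page):
    """Run pdftotext and decode its output (requires pdftotext on PATH)."""
    raw = subprocess.run(
        pdftotext_cmd(infile, page), stdout=subprocess.PIPE, check=True
    ).stdout
    return raw.decode("utf-8")
```

If read_page_text returns the expected characters for the page, the PDF's text is extractable and the loss happens later in the pipeline.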

Examples

Hi,
Would it be possible to provide an example that uses one of ST Micro's datasheets? Cheers! Rob

numpy.ndarray has no pop

I tried out master, and it seems I have a numpy install issue.

python extracttab.py -i pdf.pdf  -p 1
Traceback (most recent call last):
  File "extracttab.py", line 482, in <module>
    cells.extend(process_page(pgs))
  File "extracttab.py", line 270, in process_page
    vd.pop(i)
AttributeError: 'numpy.ndarray' object has no attribute 'pop'

Any ideas how to fix this?
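
The traceback shows a list method being called on a NumPy array. Arrays have no pop(), but np.delete returns a copy with the given index removed. A small sketch of the two usual alternatives, with made-up values standing in for vd:

```python
import numpy as np

vd = np.array([10, 12, 40, 42])  # illustrative values only
i = 1

# Option 1: np.delete returns a new array without index i.
vd = np.delete(vd, i)

# Option 2: convert to a plain list once and keep list semantics.
as_list = np.array([10, 12, 40, 42]).tolist()
removed = as_list.pop(i)
```

Note that np.delete does not modify the array in place, so the result has to be assigned back.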

Documentation

Hi:
I'm trying to use pdf-table-extract with a slightly different pdf table, but its entire first column is treated as a header cell. Where can I find more documentation about process_page()? Some parameter descriptions, examples... anything would be welcome.
Thanks in advance

Delete not defined

The function delete in core.py is not defined.

I get the following error whenever this code is executed.
  File "test_to_pandas.py", line 6, in <module>
    cells = [pdf.process_page("pg_0001.pdf",p) for p in pages]
  File "/usr/local/lib/python2.7/dist-packages/pdf_table_extract-0.1-py2.7.egg!/pdftableextract/core.py", line 179, in process_page
    vd = delete(vd,i)
NameError: global name 'delete' is not defined

This code section comes from core.py

  j = 0
  while j < len(hd):
    if hd[j+1]-hd[j] > maxdiv :
      hd = delete(hd,j)
      hd = delete(hd,j)
    else:
      j=j+2

I have attached an image of the PDF I was trying to parse.
screenshot from 2013-10-09 17 55 27
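
The quoted loop calls delete without importing it. Assuming it was meant to be NumPy's delete (an assumption, though the arrays involved appear to be NumPy arrays), the loop could be sketched as a standalone function like this. The function name drop_wide_pairs is hypothetical, and the sketch assumes hd has an even number of entries, as the pairwise stepping implies.

```python
from numpy import array, delete  # the import the quoted code appears to lack


def drop_wide_pairs(hd, maxdiv):
    """Remove consecutive (start, end) pairs that are more than maxdiv apart.

    Assumes len(hd) is even, so hd[j + 1] always exists when j < len(hd).
    """
    j = 0
    while j < len(hd):
        if hd[j + 1] - hd[j] > maxdiv:
            # Delete both members of the pair; after the first delete the
            # second member has moved to index j.
            hd = delete(hd, j)
            hd = delete(hd, j)
        else:
            j = j + 2
    return hd
```

For example, drop_wide_pairs(array([0, 2, 10, 50, 60, 62]), 10) drops the pair (10, 50) and keeps the other two.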

Example not working, no docs available

I'm trying to set things up so I can convert pdfs to tables.

I'm using Fedora 20 and I have run:

python setup.py install
pip install pandas

Both packages pdftoppm and pdftotext are installed.

Running this:

cd example
python test_to_pandas.py

fails with

Traceback (most recent call last):
  File "test_to_pandas.py", line 5, in <module>
    cells = [pdf.process_page("a.pdf",p) for p in pages]
  File "/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/core.py", line 89, in process_page
    (maxval, width, height, data) = readPNM(p.stdout)
  File "/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/pnm.py", line 24, in readPNM
    raise IOError("Expected 2 elements from parsing PNM file, got {0}: {1}".format(ls, name))
IOError: Expected 2 elements from parsing PNM file, got 0: <pipe>

PS: I've renamed the pdf from example.pdf to a.pdf
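
The error says the PNM parser received no data on the pipe, which points at the image conversion step rather than at the parser. A simplified sketch of a header reader that reports this case explicitly is shown below. It is not the library's readPNM: the name read_pnm_header is hypothetical, and comment lines in the header are not handled.

```python
import io


def read_pnm_header(fd, name="<pipe>"):
    """Read the magic number and dimensions from a binary PNM stream.

    Simplified illustration: header comment lines are not supported.
    """
    magic = fd.readline()
    if not magic:
        # An empty stream usually means the converter produced no output.
        raise IOError("no image data received from {0}".format(name))
    dims = fd.readline().split()
    if len(dims) != 2:
        raise IOError(
            "expected width and height in {0}, got {1} field(s)".format(
                name, len(dims)
            )
        )
    return magic.strip(), int(dims[0]), int(dims[1])


if __name__ == "__main__":
    print(read_pnm_header(io.BytesIO(b"P4\n3 2\n")))
```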

Unclear failure mode if pdftoppm not installed

When pdftoppm is not installed, pdf-table-extract fails:

----> 1 pdf.process_page("foo.pdf", "1")

/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/core.pyc in process_page(infile, pgs, outfilename, greyscale_threshold, page, crop, line_length, bitmap_resolution, name, pad, white, black, bitmap, checkcrop, checklines, checkdivs, checkcell)
     54 # image load secion.
     55
---> 56   (maxval, width, height, data) = readPNM(p.stdout)
     57
     58   pad = int(pad)

/usr/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/pnm.pyc in readPNM(fd)
     20   data = fd.read()
     21
---> 22   xs, ys = s.split()
     23   width = int(xs)
     24   height = int(ys)

ValueError: need more than 0 values to unpack
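
A clearer failure could come from checking for the external programs before any page is processed. A minimal sketch using the standard library follows; the function names are hypothetical, and the tool list is taken from the programs mentioned in these reports.

```python
import shutil


def missing_tools(tools=("pdftoppm", "pdftotext")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


def require_tools(tools=("pdftoppm", "pdftotext")):
    """Raise a descriptive error if any required program is missing."""
    missing = missing_tools(tools)
    if missing:
        raise RuntimeError(
            "required program(s) not found on PATH: " + ", ".join(missing)
        )
```

Calling require_tools() once at startup turns the unpacking error above into a message that names the missing program.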

NameError: global name 'hd' is not defined

Running the following command:

python extracttab.py  -i ~/Downloads/HIST-PRICES.pdf -o ~/Downloads/HIST-PRICES.csv -p 1 -t table_csv

using the file found here: http://www.ico.org/historical/2010-19/PDF/HIST-PRICES.pdf leads to:

Traceback (most recent call last):
  File "extracttab.py", line 487, in <module>
    } [ args.t ](cells,args.page)
  File "extracttab.py", line 431, in o_table_csv
    ] for x in range(len(pgs))
NameError: global name 'hd' is not defined

IndexError when last page is blank

If the last page is completely blank, an IndexError occurs.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scrape_pdf2.py", line 159, in test
    pdflist = get_table_pages(pages)
  File "scrape_pdf2.py", line 95, in get_table_pages
    cells = [pdf.process_page("example.pdf",p) for p in pages]
  File "/Users/jonathan/.virtualenvs/elance/lib/python2.7/site-packages/pdf_table_extract-0.1-py2.7.egg/pdftableextract/core.py", line 211, in process_page
    if vd[i+1]-vd[i] > maxdiv :
IndexError: index out of bounds
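
Until the library handles blank pages itself, a caller-side workaround is to process pages one at a time and skip those that raise IndexError. The sketch below takes the page-processing function as a parameter so it does not depend on the library's internals; collect_cells is a hypothetical name. Note that it would also hide an IndexError caused by something other than a blank page.

```python
def collect_cells(process_page, infile, pages):
    """Call process_page(infile, p) for each page, skipping IndexError pages.

    Returns (cells, skipped): per-page results and the pages that failed.
    """
    cells, skipped = [], []
    for p in pages:
        try:
            cells.append(process_page(infile, p))
        except IndexError:
            # Seen in the report above when a page is completely blank.
            skipped.append(p)
    return cells, skipped
```

With the library imported as pdf, as in the traceback, this would be called as collect_cells(pdf.process_page, "example.pdf", pages).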
