camelot-dev / camelot

A Python library to extract tabular data from PDFs

Home Page: https://camelot-py.readthedocs.io

License: MIT License

Python 100.00%

camelot's Issues

ValueError: max() arg is an empty sequence

Running Camelot on this document (https://www.qao.qld.gov.au/sites/qao/files/annual-reports/annual_report_2016-17.pdf) throws the following ValueError when it reaches page 4:

import camelot
camelot.read_pdf(path, pages='3', flavor='stream')

Traceback (most recent call last):
File "", line 2, in
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\io.py", line 117, in read_pdf
**kwargs
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\handlers.py", line 172, in parse
p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 458, in extract_tables
cols, rows = self._generate_columns_and_rows(table_idx, tk)
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 349, in _generate_columns_and_rows
ncols = max(set(elements), key=elements.count)
ValueError: max() arg is an empty sequence

This is easy enough to catch with a try/except, but I thought I would raise it here to let you know.
Thanks for writing this package, excellent work!
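The failing line in the traceback computes the most common element with max(), which raises on an empty list. A minimal sketch of a guarded version follows; the helper name `most_common` and its `default` parameter are hypothetical and are not part of Camelot.

```python
def most_common(elements, default=None):
    """Return the most frequent item in `elements`, or `default` if it is empty.

    Mirrors the expression from the traceback,
    max(set(elements), key=elements.count), with an explicit empty-input guard.
    """
    if not elements:
        return default
    return max(set(elements), key=elements.count)


# Caller-side workaround mentioned in the report (requires camelot installed):
# try:
#     tables = camelot.read_pdf(path, pages='3', flavor='stream')
# except ValueError:
#     tables = []
```

Returning a default keeps the decision about what an "empty page" means with the caller instead of hiding it inside the helper.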

Getting word coordinates in each cell

I am able to get bounding boxes for each cell using table.cells. Any pointers on getting bounding boxes for each word in each cell?

Lines not being included in user specified table area

Link to PDF: https://stackoverflow.com/questions/53203779/headers-are-not-getting-extracted-from-pdf-while-extracting-the-table-data-from

Steps to reproduce: camelot -p 3 lattice -T 60,770,520,400 -plot grid 007.pdf

Camelot takes 3 minutes on small PDF in lattice mode

With Camelot 0.7.3 (installed from source), I ran the following command:

camelot --output entries.csv --format csv lattice original.pdf

on the PDF https://rawpowerlifting.com/wp-content/uploads/2019/01/2019-Tioga-Downs-Results.pdf

It executed for about 3 minutes, producing the output:

2019-10-13T11:37:46 - INFO - Processing page-1
Found 2 tables

why module 'camelot' has no attribute 'read_pdf'

import camelot

tables = camelot.read_pdf("foo.pdf")
tables[0].df
tables.export("foo.csv", f="csv", compress=True)
tables[0].to_csv("foo.csv")

Why?

File "C:/Users/jiuyang.wei/Desktop/ocr/another/camelot.py", line 8, in
import camelot

File "C:\Users\jiuyang.wei\Desktop\ocr\another\camelot.py", line 10, in
tables=camelot.read_pdf("foo.pdf")

AttributeError: module 'camelot' has no attribute 'read_pdf'
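The traceback shows that the script itself is saved as camelot.py, so `import camelot` most likely resolves to that script rather than to the installed package. A small sketch for checking this kind of name shadowing is shown below; `shadowing_file` is a hypothetical helper, not a Camelot function.

```python
import os


def shadowing_file(module_name, script_dir):
    """Return the path of a local file or package that would shadow
    `module_name` when a script in `script_dir` is run, or None."""
    module_file = os.path.abspath(os.path.join(script_dir, module_name + ".py"))
    if os.path.isfile(module_file):
        return module_file
    package_dir = os.path.abspath(os.path.join(script_dir, module_name))
    if os.path.isfile(os.path.join(package_dir, "__init__.py")):
        return package_dir
    return None


# Example: shadowing_file("camelot", r"C:\Users\...\another") would return the
# path to camelot.py if the script carries the same name as the library.
```

If the helper returns a path, renaming that file (and deleting any stale camelot.pyc next to it) is the usual remedy.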

Reduce file reads in camelot.handlers._save_page

@niazangels

camelot.handlers._save_page is called as many times as there are pages passed to camelot.read_pdf. Each time this function is invoked, the source PDF is read from disk, parsed using PdfFileReader, and decrypted. These repeated reads add significantly to the running time and can be avoided.

One way to avoid this is to accept a list of pages instead of a single page and run a _save_pages function only once. The PdfFileReader object can be created once, and we can loop over the pages to save each one separately.

I already have this working on a private fork, with one hiccup: for certain files, the PdfFileReader object gets modified after successfully looping over and extracting ~80 pages in some of my sample PDFs. I create a copy of the original object to work around this, but it's still a whole lot faster than the current approach, as it completely avoids the 80+ file reads.

Let me know if this is something you'd like to incorporate, and I'd be happy to raise a pull request.

Cheers, and thanks for all the great work!

There's an associated PR atlanhq/camelot#311
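The proposal in this issue can be sketched independently of any PDF library: open the source once, then loop over the requested pages. Both callables below (`open_reader`, `write_page`) are hypothetical stand-ins for "create and decrypt a PdfFileReader" and "write one page to its own file"; this is not the code from the linked PR.

```python
def save_pages(filepath, pages, open_reader, write_page):
    """Save each requested page separately while opening the source only once.

    open_reader(filepath) -> reader object (created and decrypted a single time)
    write_page(reader, page_number) -> path of the single-page file written
    """
    reader = open_reader(filepath)  # one disk read instead of one per page
    saved = []
    for page in pages:
        saved.append(write_page(reader, page))
    return saved
```

The per-file copy workaround described above would live inside `write_page`, so the loop itself stays unchanged.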

access violation writing 0x076ED670

rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x076ED670

32-bit python
latest ghostscript

Add FAQ section in docs

How is accuracy calculated? (needs an image for better understanding)

Remove dependency on ghostscript and opencv

Something to think about for the future:

~~OpenCV: maybe implement morph transform within the library itself/vendorize the code (not sure about dependency on C extensions)?~~

tk: Required for matplotlib.

ghostscript: maybe use some Python library to convert PDF to image (same quality as ghostscript).

Some questions:
[1] Can pdftoppm be an alternative to ghostscript?
[2] Are poppler-utils more widely available (pre-installed) than ghostscript?

@tkelman wrote:

Could the matplotlib dependency be made optional? The plotting features here look like not a lot of code, and it's a pretty complicated dependency to pull in.

Similarly might pillow be a viable smaller alternative to the use of opencv here?

Hello @tkelman! I think making matplotlib optional makes sense. Let me look into it as I go on to adding more tests for the plotting code atlanhq/camelot#127.

Camelot uses adaptive threshold and morphological transformations from opencv. I haven't worked with pillow in the past but a quick google search got me this morph transform equivalent in pillow. I think removing opencv as a dependency would mean replacing the current image processing code with a combination of pillow + adaptive threshold / morph transform implementations. Let me explore this a bit further. Meanwhile if you have any other alternatives or suggestions on how we could do this, would love if you could share them on this thread!

matplotlib is now an optional requirement!

@sweco-sekrsv wrote:

I'm not exactly sure what you are using Ghostscript for, but I switched to pdftoppm for rasterizing PDFs to images. I'm using the CLI tool and calling it from Python.
For my scenarios, it's stable and generates images quicker than Ghostscript. I have had better success with fonts using pdftoppm as well.

I'm on Windows and am using the latest binaries from here:
http://blog.alivate.com.au/poppler-windows

On a side note, it can also fix "broken" PDF files, such as the ones in this ticket:
atlanhq/camelot#306
Resaving them with pdftocairo in the poppler tools makes the file load OK with pdf-miner.

On another side note, I tried making Ghostscript run using multiprocessing (to speed things up), but that did not seem to work very well. I'm not sure Ghostscript is designed to run using several threads.

Print table to stdout

@crotoc

Hi there,
Thanks very much for your great project; it saved me a lot of time! Now I want to build a pipeline using camelot and need to know how to print the output to stdout. Please let me know if there is a way!

Thanks,
Rui

Great library, but dependencies ??!!

Note: This is not an issue, but there is no better place to discuss it.

Stats below are pulled from PyPI downloads.
Given that Camelot does a better job than the others, what do you think explains its lower usage?

Useful stats in parsing_report

Use-case:

Help the user drop tables in an ETL workflow based on parsing accuracy, whitespace in table cells.

More stat ideas:

A boolean that tells if there might be encoding errors in the output.
Distribution of font sizes?

Camelot working on 3.7 not specified on installation page

Index page badge specifies 3.7 (also specified on setup.py) but installation page does not specify 3.7. Minor doc fix.

Drop Python 2 support?

Should we set a date and add our project here? https://python3statement.org/

Add opencollective link to documentation and README

Make PDFHandler more efficient

Every time read_pdf is called, a new PDFHandler object is created and parse (which splits a PDF into multiple single-page PDFs) is run. This is inefficient. Instead:

Split and store single page PDFs into a temp directory named after the md5 hash of the master PDF file. And then calculate the actual new single page PDFs that are needed, based on the page numbers provided by the user (which can change).
In Lattice, convert a single page PDF into an image, if and only if the PNG doesn't exist?
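The first idea above (a cache directory keyed by the md5 hash of the master PDF, plus a check for which single-page files still need to be created) can be sketched with the standard library alone. The directory prefix and the page-N.pdf naming are assumptions made for illustration, not Camelot's actual layout.

```python
import hashlib
import os
import tempfile


def cache_dir_for(pdf_path, root=None):
    """Return (and create) a cache directory named after the PDF's md5 hash."""
    digest = hashlib.md5()
    with open(pdf_path, "rb") as f:
        # Hash in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    root = root or tempfile.gettempdir()
    path = os.path.join(root, "camelot-" + digest.hexdigest())
    os.makedirs(path, exist_ok=True)
    return path


def missing_pages(cache_dir, pages):
    """Return the page numbers whose single-page PDF is not cached yet."""
    return [
        p for p in pages
        if not os.path.exists(os.path.join(cache_dir, "page-%d.pdf" % p))
    ]
```

The same existence check would cover the second idea: render a page to PNG only when the corresponding image file is absent from the cache directory.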

All values are getting displayed in single row using Camelot

Refer atlanhq/camelot#339

min_height/width to filter out tables?

Refer atlanhq/camelot#357.

Error: openpyxl.utils.exceptions.IllegalCharacterError

ERROR:root:
Traceback (most recent call last):
  File "/home/myusername/.local/lib/python3.7/site-packages/excalibur/tasks.py", line 123, in extract
    tables.export(f_datapath, f=f, compress=True)
  File "/home/myusername/.local/lib/python3.7/site-packages/camelot/core.py", line 745, in export
    table.df.to_excel(writer, sheet_name=sheet_name, encoding="utf-8")
  File "/home/myusername/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 2257, in to_excel
    engine=engine,
  File "/home/myusername/.local/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 739, in write
    freeze_panes=freeze_panes,
  File "/home/myusername/.local/lib/python3.7/site-packages/pandas/io/excel/_openpyxl.py", line 416, in write_cells
    xcell.value, fmt = self._value_with_fmt(cell.val)
  File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 252, in value
    self._bind_value(value)
  File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 205, in _bind_value
    value = self.check_string(value)
  File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 169, in check_string
    raise IllegalCharacterError
openpyxl.utils.exceptions.IllegalCharacterError
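One possible workaround is to strip control characters from cell text before exporting to Excel. The sketch below assumes, to the best of my understanding, that openpyxl rejects ASCII control characters other than tab, newline and carriage return; the helper name `strip_illegal_characters` is hypothetical.

```python
import re

# ASCII control characters except \t (0x09), \n (0x0a) and \r (0x0d).
_ILLEGAL_CHARACTERS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_characters(value):
    """Remove control characters from strings; return other values unchanged."""
    if isinstance(value, str):
        return _ILLEGAL_CHARACTERS.sub("", value)
    return value


# Intended use (requires pandas): apply the function to every cell of
# table.df before calling tables.export(..., f="excel").
```

Stripping silently loses information about where the bad characters were, so logging the affected cells first may be worthwhile.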

By the way, which is the official issue tracker? This one or https://github.com/atlanhq/camelot/issues

Conversion to csv for Hindi pdf is not correct

I am using Camelot to convert this document to CSV. The CSV file is created, but its contents are not correct.

As can be seen from the shared original file and the converted csv file, the Hindi characters are not converted properly.

import camelot
tables = camelot.read_pdf('Demand_ Estimate.pdf', flavor='stream')
tables[0].to_csv('demand_estimate.csv')

This is my code.

Optimize memory usage for long PDFs

Using Camelot for some very long PDFs (>500 pages), I noticed that memory usage can grow significantly (in my experience, it can reach 30 GB and more).

I don't know if I'm doing something wrong.

Anyway, I found this solution: divide the extraction into chunks (for example, chunks of 50 pages) and save the data to disk at the end of every chunk.
Doing so, I managed to limit memory usage to a maximum of 4 GB, even for a PDF of about 3000 pages.

@vinayak-mehta: what do you think about this approach? Could it be useful? Are there better ways to limit memory usage?

(obviously, if the data saved on disk, later are all loaded into memory, the problem persists)
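The chunking approach described in this issue can be sketched as a generator of page-range strings in the form accepted by the `pages` argument (for example '1-50'). The helper name `page_chunks` is hypothetical, and the total page count has to be obtained separately.

```python
def page_chunks(total_pages, chunk_size=50):
    """Yield page ranges such as '1-50', '51-100', ... covering all pages."""
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        yield "%d-%d" % (start, end)


# Intended use (requires camelot; file names are placeholders):
# for i, pages in enumerate(page_chunks(3000, 50)):
#     tables = camelot.read_pdf("long.pdf", pages=pages)
#     tables.export("chunk-%d.csv" % i, f="csv")
#     del tables  # let the chunk be garbage-collected before the next one
```

Only one chunk's tables are alive at a time, which is what bounds the peak memory in the report above.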

Automatically choose flavor based on type of table in PDF

Continuing the conversation from #102.

@imri:

When you say that lattice should work perfectly - I sort of wish to create a generic way to detect and extract tables without having to know which detection method (lattice / stream) is best for a given document - I want to decouple them as much as possible.

@vinayak-mehta

I get your use-case and it is not possible currently through the library itself. But I see two possibilities which can be implemented (both heuristics):

As far as I can tell from NurminenDetectionAlgorithm.java, Tabula first filters out all Lattice-type tables from the document and then looks for Stream-type tables, till it cannot find any more tables. Similarly, we can "couple" both flavors into a single one inside Camelot.

We can create a flavor called guess which automatically chooses between Lattice and Stream.
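A very rough sketch of the "guess" idea: try Lattice first and fall back to Stream when no tables are found. This is a simplification for illustration, not Tabula's algorithm and not an existing Camelot flavor; the function name `read_pdf_guess` and the injectable `reader` parameter are hypothetical.

```python
def read_pdf_guess(path, pages="1", reader=None):
    """Return (flavor_used, tables), preferring lattice over stream."""
    if reader is None:
        import camelot  # imported lazily so the sketch can be tested without it
        reader = camelot.read_pdf
    tables = reader(path, pages=pages, flavor="lattice")
    if len(tables) > 0:
        return "lattice", tables
    return "stream", reader(path, pages=pages, flavor="stream")
```

A fuller heuristic would also have to handle pages that contain both kinds of table, which this fallback does not attempt.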

Replace pdf-table-extract in the intro page with pdfplumber

Preserve table captions and other relevant info

Right now, it only parses text that is present within a table's boundary. Important information like table captions and footnotes should also be parsed.

Use Appveyor for Windows support

A lot of people face issues with ghostscript on Windows. Should we use Appveyor till we can remove ghostscript altogether?

Update pypi page for camelot

@vinayak-mehta Now that the main repo has changed to this one, the https://pypi.org/project/camelot-py/ page should also be updated.

Assuming whole page as one table in stream flavour

Camelot treats the whole page as one table even though there is sufficient space before and after the table.
The only setting I could find is column_tol, which defaults to zero. It doesn't make any difference.
Is there any other setting for this?
And please answer one more question.
How are your coordinates different from pdfplumber?

pdf

Unify table_area input with output from Adobe PDF Viewer

Based on the @CartierPierre's issue here atlanhq/camelot#367.

Add OCR support

An experimental version exists in the history before commit 9753889. It uses Tesseract (via pyocr). ocropy looked promising the last time I checked. Opening this issue for discussion and experiments around OCR.

Link to python-tk package shows error.

The link to the tk package in the README shows an error. We need to replace it with this link: https://packages.ubuntu.com/bionic/python/python-tk

/cc: @vinayak-mehta

winreg fails to find dll for ghostscript in 64bit Windows 10

I got the following error using camelot.read_pdf('some.pdf'). Tracing back the RuntimeError, it looks like winreg is looking at the wrong view of the registry:

I don't know if my setup is odd (my system path may be a bit of a mess), but it may be good to detect which version of Windows is in use and then add the flags access=winreg.KEY_READ | winreg.KEY_WOW64_64KEY.

Error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-42c019f21b87> in <module>
----> 1 tables = camelot.read_pdf('PROCES-RM003D (Diagnostic Objects).pdf', pages='47-49', flavor='lattice')

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
    115             suppress_stdout=suppress_stdout,
    116             layout_kwargs=layout_kwargs,
--> 117             **kwargs
    118         )
    119         return tables

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
    170             for p in pages:
    171                 t = parser.extract_tables(
--> 172                     p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
    173                 )
    174                 tables.extend(t)

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\parsers\lattice.py in extract_tables(self, filename, suppress_stdout, layout_kwargs)
    401             return []
    402 
--> 403         self._generate_image()
    404         self._generate_table_bbox()
    405 

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\parsers\lattice.py in _generate_image(self)
    210 
    211     def _generate_image(self):
--> 212         from ..ext.ghostscript import Ghostscript
    213 
    214         self.imagename = "".join([self.rootname, ".png"])

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py in <module>
     22 #
     23 
---> 24 from . import _gsprint as gs
     25 
     26 

c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py in <module>
    245     libgs = __win32_finddll()
    246     if not libgs:
--> 247         raise RuntimeError("Please make sure that Ghostscript is installed")
    248     libgs = windll.LoadLibrary(libgs)
    249 else:

RuntimeError: Please make sure that Ghostscript is installed
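The suggestion above (opening the registry with `winreg.KEY_READ | winreg.KEY_WOW64_64KEY`) might look like the sketch below. The registry layout it assumes (a product key under HKEY_LOCAL_MACHINE\SOFTWARE with one subkey per version holding a GS_DLL value) is an assumption, and `find_ghostscript_dll` is a hypothetical helper, not Camelot's lookup code.

```python
import sys


def find_ghostscript_dll(product="GPL Ghostscript"):
    """Look up the Ghostscript DLL path in the 64-bit registry view.

    Returns None on non-Windows platforms or when no entry is found.
    """
    if sys.platform != "win32":
        return None
    import winreg

    # KEY_WOW64_64KEY asks for the 64-bit view even from a 32-bit process.
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    try:
        with winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE, "SOFTWARE\\" + product, 0, access
        ) as product_key:
            subkey_count = winreg.QueryInfoKey(product_key)[0]
            for index in range(subkey_count):
                version = winreg.EnumKey(product_key, index)
                try:
                    with winreg.OpenKey(product_key, version, 0, access) as key:
                        return winreg.QueryValueEx(key, "GS_DLL")[0]
                except OSError:
                    continue  # this version subkey has no GS_DLL value
    except OSError:
        return None
    return None
```

One caveat: the paths in the traceback point to a 32-bit Python (python37-32), and a 32-bit process cannot load a 64-bit DLL, so finding the 64-bit entry only helps when the interpreter is 64-bit as well.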

Add more pdf-to-image engines?

Ghostscript currently does this job, but it is a pain to install and debug and does not have a friendly license. Before we can do #13, does it make sense to use python-pdfbox? Then again, it downloads the pdfbox jar file and would need Java to be installed on user systems.

atlanhq/camelot#346

Use multiprocessing to parallely process PDF pages

>>> camelot.read_pdf('filename.pdf', pages='all', parallel=True)

We could try and use all cores present on the machine using multiprocessing. More ideas are welcome.
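As a starting point for discussion, per-page work could be fanned out with concurrent.futures. In this sketch `parse_page` is a hypothetical callable that handles one page; the `parallel=True` argument shown above is a proposal and is not assumed to exist.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parse_pages_in_parallel(parse_page, pages,
                            executor_cls=ProcessPoolExecutor,
                            max_workers=None):
    """Run parse_page(page) for every page and return results in page order.

    With ProcessPoolExecutor (the default), parse_page must be picklable,
    i.e. a module-level function. ThreadPoolExecutor can be passed instead
    when the work is dominated by external tools or I/O.
    """
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map preserves the order of `pages` in its results.
        return list(executor.map(parse_page, pages))
```

With max_workers left as None, the process pool defaults to the number of processors on the machine, which matches the "use all cores" goal.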

Add option to modify the ToUnicode map

https://stackoverflow.com/questions/31876415/parsing-a-pdfdevanagari-script-using-pdfminer-gives-incorrect-output

Concat multi page tables

Would be nice to have a way to merge tables which span multiple pages.

Add link to installation of external deps in the contributor's guide

Specifically in the "Setting up a development environment" section.

AttributeError from PDFMiner

@igormp

Although I can't upload the bad PDF due to NDA reasons, this issue is well documented here, along with some solutions to it, and there's even a PR in place to fix that, but there seems to be no maintainer available to merge it.

I'm not sure how this should be handled, since it's a PDFMiner problem, which seems to be unmaintained, and that reflects directly on camelot.

TableList error in Camelot and in Excalibur

Occurs when running the camelot example (or when uploading a pdf in excalibur):

Traceback (most recent call last):
File "camelottest.py", line 1, in
import camelot
File "C:\Program Files\Python37\lib\site-packages\camelot\__init__.py", line 6, in
from .io import read_pdf
File "C:\Program Files\Python37\lib\site-packages\camelot\io.py", line 5, in
from .handlers import PDFHandler
File "C:\Program Files\Python37\lib\site-packages\camelot\handlers.py", line 8, in
from .core import TableList
ImportError: cannot import name 'TableList' from 'camelot.core' (C:\Program Files\Python37\lib\site-packages\camelot\core\__init__.py)

ZeroDivisionError: float division by zero - table_regions

Raised by @shivohmgupta:

When I use the table_regions parameter, the above error is raised. Here is my code:
tables = camelot.read_pdf("a.pdf", flavor='stream', table_regions=['49,217,568,403'])
Without the table_regions parameter, it works fine.

Hybrid flavor combining lattice and stream

Shift text up based on the presence of horizontal lines and some metric based on blank rows. If the vertical lines are not present then, Stream generated columns/user given separators should be used.

For example:

Duplicate strings assigned to the same cell

Check out this birdisland.pdf output here.

Negative value as accuracy of table.

While testing, I came across a case where table.accuracy is a negative number.

PDF:page-3.pdf
Code:

tables = camelot.read_pdf('/Users/skatipomu/Table_Extraction_Camelot/page3.pdf', pages="all")
[table.accuracy for table in tables]

Output:
[99.99999999999997, -20.852716930856104]

I think the reason is that the compute_accuracy method in utils.py subtracts each error percentage from 1, so it expects errors in the range [0.0, 1.0]. The errors passed to this method, which in turn come from the get_table_index method, are percentages in the range [0, 100]. Dividing each error by 100 solved the issue for me.

def compute_accuracy(error_weights):
    """Calculates a score based on weights assigned to various
    parameters and their error percentages.

    Parameters
    ----------
    error_weights : list
        Two-dimensional list of the form [[p1, e1], [p2, e2], ...]
        where pn is the weight assigned to list of errors en.
        Sum of pn should be equal to 100.

    Returns
    -------
    score : float

    """
    SCORE_VAL = 100
    try:
        score = 0
        if sum([ew[0] for ew in error_weights]) != SCORE_VAL:
            raise ValueError("Sum of weights should be equal to 100.")
        for ew in error_weights:
            weight = ew[0] / len(ew[1])
            for error_percentage in ew[1]:
                score += weight * (1 - error_percentage)  # the line in question
    except ZeroDivisionError:
        score = 0
    return score

The change is from score += weight * (1 - error_percentage) to score += weight * (1 - error_percentage/100.0).
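For reference, here is the function with the reporter's one-line change applied. It assumes, as the report claims, that every error value arrives as a percentage in [0, 100]; if some callers already pass fractions in [0, 1], this change would overstate their accuracy.

```python
def compute_accuracy(error_weights):
    """Score in [0, 100] from [[weight, [error_percentages]], ...].

    Weights must sum to 100; each error is assumed to be in [0, 100].
    """
    SCORE_VAL = 100
    try:
        score = 0
        if sum(ew[0] for ew in error_weights) != SCORE_VAL:
            raise ValueError("Sum of weights should be equal to 100.")
        for weight_total, errors in error_weights:
            weight = weight_total / len(errors)  # ZeroDivisionError if empty
            for error_percentage in errors:
                # Convert the percentage to a fraction before subtracting.
                score += weight * (1 - error_percentage / 100.0)
    except ZeroDivisionError:
        score = 0
    return score
```

With two equally weighted parameters, one with a 50% error and one with none, this gives 50 * 0.5 + 50 * 1.0 = 75.0, whereas the unscaled formula would give 50 * (1 - 50) + 50 = -2400.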

import camelot

table = camelot.read_pdf('data/esign.pdf', suppress_stdout=True)
print(table)

Page splitting is very slow for some PDFs

The function that checks for page rotation is the culprit. pdfminer's layout analysis takes a long time for such pdfs. Examples: the RNTB pdfs from un-sdg.

Adding a kwarg that lets the user specify the rotation is a minor optimization that can fix this.

edge_tol skipped in read_pdf

@CartierPierre

Referring to
https://github.com/socialcopsdev/camelot/blob/7cf409aa08f937edd24d6ac14d8daa56e614bb6d/camelot/parsers/stream.py#L48
it is possible to change edge_tol on the parser, but not through the read_pdf kwargs, because it is filtered out here:
https://github.com/socialcopsdev/camelot/blob/7cf409aa08f937edd24d6ac14d8daa56e614bb6d/camelot/io.py#L104