camelot-dev / camelot
A Python library to extract tabular data from PDFs
Home Page: https://camelot-py.readthedocs.io
License: MIT License
Running Camelot on this document (https://www.qao.qld.gov.au/sites/qao/files/annual-reports/annual_report_2016-17.pdf) throws the following ValueError when it reaches page 4:
import camelot
camelot.read_pdf(path, pages='3', flavor='stream')
Traceback (most recent call last):
File "", line 2, in
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\io.py", line 117, in read_pdf
**kwargs
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\handlers.py", line 172, in parse
p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 458, in extract_tables
cols, rows = self._generate_columns_and_rows(table_idx, tk)
File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 349, in _generate_columns_and_rows
ncols = max(set(elements), key=elements.count)
ValueError: max() arg is an empty sequence
Easy enough to capture with a try/except but thought I would pop it up here to let you know
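A minimal sketch of the try/except workaround mentioned above, processing one page at a time so that a single failing page does not abort the whole run. The helper name read_pages_safely and the injected read_fn argument are illustrative and not part of Camelot; in real use read_fn would be camelot.read_pdf.

```python
def read_pages_safely(read_fn, path, pages, flavor="stream"):
    """Call read_fn once per page; collect results and ValueErrors separately."""
    tables_by_page = {}
    failed = {}
    for page in pages:
        try:
            tables_by_page[page] = read_fn(path, pages=str(page), flavor=flavor)
        except ValueError as exc:
            # e.g. "max() arg is an empty sequence" from the stream parser
            failed[page] = str(exc)
    return tables_by_page, failed


# Assumed usage with Camelot installed:
#   import camelot
#   ok, failed = read_pages_safely(camelot.read_pdf, path, range(1, 10))
```

Keeping the failed pages in their own dictionary makes it easy to report them or retry them with a different flavor.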
Thanks for writing this package, excellent work!
I am able to get bounding boxes for each cell using table.cells. Any pointers on getting bounding boxes for each word in each cell?
Steps to reproduce: camelot -p 3 lattice -T 60,770,520,400 -plot grid 007.pdf
With Camelot 0.7.3 (installed from source), I ran the following command:
camelot --output entries.csv --format csv lattice original.pdf
on the PDF https://rawpowerlifting.com/wp-content/uploads/2019/01/2019-Tioga-Downs-Results.pdf
It executed for about 3 minutes, producing the output:
2019-10-13T11:37:46 - INFO - Processing page-1
Found 2 tables
"""""""""""""""""""
import camelot
tables=camelot.read_pdf("foo.pdf")
tables[0].df
tables.export("foo.csv",f="csv",compress=True)
tables[0].to_csv("foo.csv")
"""""""""""""""""
Why?
File "C:/Users/jiuyang.wei/Desktop/ocr/another/camelot.py", line 8, in <module>
import camelot
File "C:\Users\jiuyang.wei\Desktop\ocr\another\camelot.py", line 10, in <module>
tables=camelot.read_pdf("foo.pdf")
AttributeError: module 'camelot' has no attribute 'read_pdf'
Why does module 'camelot' have no attribute 'read_pdf'?
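The traceback shows that the script itself is named camelot.py, so import camelot most likely resolves to that script rather than to the installed package. A small check using only the standard library shows where a module name resolves; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module name resolves to, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If this prints a path inside your own project (for example .../another/camelot.py)
# instead of a site-packages directory, rename the script and remove stale .pyc files.
print(module_origin("camelot"))
```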
camelot.handler._save_page is called as many times as there are pages passed to camelot.read_pdf. Each time this function is invoked, the source PDF is read from disk, parsed using PdfFileReader and decrypted. This repeated work contributes significantly to run time and can be reduced.
A good way to avoid this is to accept a list of pages instead of a single page and to run the _save_pages function only once. The PdfFileReader object can be created once, and we can loop over the pages to save them separately.
I already have this working on a private fork, with one hiccup: for certain files, the PdfFileReader object gets modified after successfully looping over and extracting ~80 pages in some of my sample PDFs. I create a copy of the original object to work around this, but it's still a whole lot faster than the current approach as it completely avoids the 80+ file reads.
Let me know if this is something you'd like to incorporate, and I'd be happy to raise a pull request.
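The open-once idea can be sketched without any PDF library, assuming the reader and the per-page save step are passed in as callables. The names save_pages, open_reader and save_page are illustrative and do not match Camelot's internals; the sketch also ignores the hiccup described above, where the reader object may need to be copied for some files.

```python
def save_pages(open_reader, path, pages, save_page):
    """Open the source once, then reuse the same reader for every page."""
    reader = open_reader(path)  # read, parse and decrypt a single time
    return [save_page(reader, page) for page in pages]


# Stand-in reader that records how often the file is opened.
open_calls = []


def fake_open(path):
    open_calls.append(path)
    return {"path": path}


def fake_save(reader, page):
    return "{}-page-{}".format(reader["path"], page)


saved = save_pages(fake_open, "foo.pdf", [1, 2, 3], fake_save)
print(saved, "opened", len(open_calls), "time(s)")
```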
Cheers, and thanks for all the great work!
There's an associated PR atlanhq/camelot#311
rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x076ED670
32-bit python
latest ghostscript
Something to think about for the future:
- OpenCV: maybe implement morph transform within the library itself/vendorize the code (not sure about dependency on C extensions)?
- tk: Required for matplotlib.
- ghostscript: maybe use some Python library to convert PDF to image (same quality as ghostscript).
Some questions:
[1] Can pdftoppm be an alternative to ghostscript?
[2] Are poppler-utils more widely available (pre-installed) than ghostscript?
@tkelman wrote:
Could the matplotlib dependency be made optional? The plotting features here look like not a lot of code, and it's a pretty complicated dependency to pull in.
Similarly might pillow be a viable smaller alternative to the use of opencv here?
Hello @tkelman! I think making matplotlib optional makes sense. Let me look into it as I go on to adding more tests for the plotting code atlanhq/camelot#127.
Camelot uses adaptive threshold and morphological transformations from opencv. I haven't worked with pillow in the past but a quick google search got me this morph transform equivalent in pillow. I think removing opencv as a dependency would mean replacing the current image processing code with a combination of pillow + adaptive threshold / morph transform implementations. Let me explore this a bit further. Meanwhile if you have any other alternatives or suggestions on how we could do this, would love if you could share them on this thread!
matplotlib is now an optional requirement!
@sweco-sekrsv wrote:
I'm not exactly sure what you are using Ghostscript for, but I switched to pdftoppm for rasterizing PDFs to images. I'm using the CLI tool and calling it from Python.
For my scenarios, it's stable and generates images quicker than Ghostscript. I have had better success with fonts using pdftoppm as well.
I'm on Windows and am using the latest binaries from here:
http://blog.alivate.com.au/poppler-windows
On a side note, it can also fix "broken" PDF files, such as the ones in this ticket:
atlanhq/camelot#306
Resaving them with pdftocairo from the poppler tools makes the files load fine with pdf-miner.
On another side note, I tried running Ghostscript with multiprocessing (to speed things up), but that did not work very well. I'm not sure Ghostscript is designed to run in several threads.
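Calling pdftoppm from Python, as described above, might look like the following sketch. It assumes the pdftoppm binary from poppler is on the PATH; the function names and the default of 300 DPI are illustrative choices, not something taken from Camelot or from the comment.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, page, dpi=300):
    """Build a pdftoppm command that renders a single page to PNG."""
    return [
        "pdftoppm",
        "-png",            # write PNG files instead of the default PPM
        "-r", str(dpi),    # resolution in DPI
        "-f", str(page),   # first page to render
        "-l", str(page),   # last page to render
        pdf_path,
        out_prefix,        # output files are named from this prefix
    ]


def rasterize_page(pdf_path, out_prefix, page, dpi=300):
    """Run pdftoppm; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, page, dpi), check=True)
```

Separating the command construction from the subprocess call keeps the arguments easy to inspect and log before anything is executed.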
Hi there,
Thanks very much for your great project; it has saved me a lot of time! Now I want to build a pipeline using camelot and need to know how to print the output to stdout. Please let me know if there is a way!
Thanks,
Rui
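One possible approach, sketched here as an assumption rather than an official feature: each extracted table exposes a pandas DataFrame as table.df, and DataFrame.to_csv accepts a file-like object, so the tables can be written to sys.stdout. The helper name write_tables_csv is hypothetical.

```python
import sys


def write_tables_csv(tables, stream=None):
    """Write every table's DataFrame to the given stream (stdout by default) as CSV."""
    if stream is None:
        stream = sys.stdout
    for table in tables:
        table.df.to_csv(stream, index=False)


# Assumed usage with Camelot installed:
#   import camelot
#   write_tables_csv(camelot.read_pdf("foo.pdf"))
```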
Use-case:
Help the user drop tables in an ETL workflow based on parsing accuracy, whitespace in table cells.
More stat ideas:
Index page badge specifies 3.7 (also specified on setup.py) but installation page does not specify 3.7. Minor doc fix.
Should we set a date and add our project here? https://python3statement.org/
Every time read_pdf is called, a new PDFHandler object is created and parse (which splits a PDF into multiple single-page PDFs) is called. This is inefficient. Instead:
Refer atlanhq/camelot#339
Refer atlanhq/camelot#357.
ERROR:root:
Traceback (most recent call last):
File "/home/myusername/.local/lib/python3.7/site-packages/excalibur/tasks.py", line 123, in extract
tables.export(f_datapath, f=f, compress=True)
File "/home/myusername/.local/lib/python3.7/site-packages/camelot/core.py", line 745, in export
table.df.to_excel(writer, sheet_name=sheet_name, encoding="utf-8")
File "/home/myusername/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 2257, in to_excel
engine=engine,
File "/home/myusername/.local/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 739, in write
freeze_panes=freeze_panes,
File "/home/myusername/.local/lib/python3.7/site-packages/pandas/io/excel/_openpyxl.py", line 416, in write_cells
xcell.value, fmt = self._value_with_fmt(cell.val)
File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 252, in value
self._bind_value(value)
File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 205, in _bind_value
value = self.check_string(value)
File "/home/myusername/.local/lib/python3.7/site-packages/openpyxl/cell/cell.py", line 169, in check_string
raise IllegalCharacterError
openpyxl.utils.exceptions.IllegalCharacterError
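A possible workaround, sketched under assumptions: strip the control characters that XLSX cells cannot hold before exporting. The character ranges below are believed to match the ones openpyxl rejects, but that should be checked against the installed openpyxl version; strip_illegal_chars is a hypothetical helper.

```python
import re

# ASCII control characters other than tab, newline and carriage return.
ILLEGAL_XLSX_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_chars(value):
    """Remove control characters from strings; leave other values untouched."""
    if isinstance(value, str):
        return ILLEGAL_XLSX_CHARS.sub("", value)
    return value


# Assumed usage on a table's DataFrame before writing Excel output:
#   cleaned = table.df.applymap(strip_illegal_chars)
#   cleaned.to_excel("foo.xlsx")
```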
By the way, which is the official issue tracker? This one or https://github.com/atlanhq/camelot/issues
I am using Camelot to convert this document to CSV. The CSV file is created, but its contents are not correct.
As can be seen from the shared original file and the converted CSV file, the Hindi characters are not converted properly.
import camelot
tables = camelot.read_pdf('Demand_ Estimate.pdf', flavor='stream')
tables[0].to_csv('demand_estimate.csv')
This is my code.
Using Camelot for some very long PDFs (>500 pages), I noticed that memory usage can grow significantly (in my experience, it can reach 30 GB and more).
I don't know if I'm doing something wrong.
Anyway, I found this solution: divide the extraction into chunks (for example, chunks of 50 pages) and save the data to disk at the end of every chunk.
Doing so, I succeeded in limiting memory usage to a maximum of 4 GB, even for PDFs of about 3000 pages.
@vinayak-mehta: what do you think about this approach? Could it be useful? Are there better ways to limit memory usage?
(Obviously, if the data saved on disk are later all loaded into memory, the problem persists.)
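The chunking approach above can be sketched as a small generator of page-range strings in the "start-end" form that read_pdf's pages argument accepts. The name page_chunks and the default chunk size of 50 are illustrative.

```python
def page_chunks(first_page, last_page, chunk_size=50):
    """Yield page-range strings such as '1-50', '51-100', ..."""
    start = first_page
    while start <= last_page:
        end = min(start + chunk_size - 1, last_page)
        yield "{}-{}".format(start, end)
        start = end + 1


# Assumed usage with Camelot installed; each chunk is exported and then released:
#   for pages in page_chunks(1, 3000):
#       tables = camelot.read_pdf(path, pages=pages)
#       tables.export("out_{}.csv".format(pages), f="csv")
#       del tables
```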
Continuing the conversation from #102.
When you say that lattice should work perfectly - I sort of wish to create a generic way to detect and extract tables without having to know which detection method (lattice / stream) is best for a given document - I want to decouple them as much as possible.
I get your use-case and it is not possible currently through the library itself. But I see two possibilities which can be implemented (both heuristics):
- As far as I can tell from NurminenDetectionAlgorithm.java, Tabula first filters out all Lattice-type tables from the document and then looks for Stream-type tables, till it cannot find any more tables. Similarly, we can "couple" both flavors into a single one inside Camelot.
- We can create a flavor called guess which automatically chooses between Lattice and Stream.
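A much simpler heuristic than either option above, shown only to make the idea concrete: try Lattice first and fall back to Stream when no tables are found. This works per call rather than per table region, and it is not an existing Camelot feature; read_lattice_then_stream and the injected read_fn are hypothetical.

```python
def read_lattice_then_stream(read_fn, path, pages="1"):
    """Return (flavor_used, tables), preferring lattice when it finds anything."""
    tables = read_fn(path, pages=pages, flavor="lattice")
    if len(tables) > 0:
        return "lattice", tables
    return "stream", read_fn(path, pages=pages, flavor="stream")


# Assumed usage with Camelot installed:
#   import camelot
#   flavor, tables = read_lattice_then_stream(camelot.read_pdf, "foo.pdf")
```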
A lot of people face issues with ghostscript on Windows. Should we use Appveyor till we can remove ghostscript altogether?
@vinayak-mehta Now that the main repo has changed to this one, the https://pypi.org/project/camelot-py/ page should be also updated.
Camelot treats the whole page as one table even though there is sufficient space before and after the table.
The only setting I could find is column_tol, which defaults to zero. Changing it doesn't make any difference.
Is there any other setting for this?
And please answer one more question.
How are your coordinates different from pdfplumber?
Based on the @CartierPierre's issue here atlanhq/camelot#367.
The experimental version exists before this commit 9753889. It uses Tesseract (using pyocr). ocropy looked promising the last time I checked, opening this issue for discussion and experiments around OCR.
The link to the tk package in the README shows an error. We need to replace it with this link: https://packages.ubuntu.com/bionic/python/python-tk
/cc: @vinayak-mehta
I got the following error using camelot.read_pdf('some.pdf')
Tracing back the RuntimeError, it looks like winreg is looking at the wrong view of the registry:
I don't know if my setup is odd (my system path may be a bit of a mess), but it may be good to detect which version of Windows is in use and then add the flags access=winreg.KEY_READ | winreg.KEY_WOW64_64KEY.
Error below:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-42c019f21b87> in <module>
----> 1 tables = camelot.read_pdf('PROCES-RM003D (Diagnostic Objects).pdf', pages='47-49', flavor='lattice')
c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
115 suppress_stdout=suppress_stdout,
116 layout_kwargs=layout_kwargs,
--> 117 **kwargs
118 )
119 return tables
c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
170 for p in pages:
171 t = parser.extract_tables(
--> 172 p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
173 )
174 tables.extend(t)
c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\parsers\lattice.py in extract_tables(self, filename, suppress_stdout, layout_kwargs)
401 return []
402
--> 403 self._generate_image()
404 self._generate_table_bbox()
405
c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\parsers\lattice.py in _generate_image(self)
210
211 def _generate_image(self):
--> 212 from ..ext.ghostscript import Ghostscript
213
214 self.imagename = "".join([self.rootname, ".png"])
c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py in <module>
22 #
23
---> 24 from . import _gsprint as gs
25
26
c:\users\andre\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py in <module>
245 libgs = __win32_finddll()
246 if not libgs:
--> 247 raise RuntimeError("Please make sure that Ghostscript is installed")
248 libgs = windll.LoadLibrary(libgs)
249 else:
RuntimeError: Please make sure that Ghostscript is installed
Ghostscript does this job currently, but it is a pain to install and debug and does not have a friendly license. Before we can do #13, does it make sense to use python-pdfbox? Then again, it downloads the pdfbox jar file and would need Java to be installed on user systems.
>>> camelot.read_pdf('filename.pdf', pages='all', parallel=True)
We could try and use all cores present on the machine using multiprocessing. More ideas are welcome.
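One way the idea could be sketched with the standard library, assuming a per-page worker function that can be pickled (a requirement of process pools). The parallel=True argument shown above is a proposal, and extract_pages_in_parallel is a hypothetical helper, not Camelot's API.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def extract_pages_in_parallel(extract_page, pages,
                              executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run extract_page over all pages in a pool and map each page to its result."""
    pages = list(pages)
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs pages with their results.
        return dict(zip(pages, pool.map(extract_page, pages)))


# With the default ProcessPoolExecutor, call this from under
# `if __name__ == "__main__":` in a script. ThreadPoolExecutor can be passed
# as executor_cls for quick experiments, although it will not use all cores
# for CPU-bound work.
```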
Would be nice to have a way to merge tables which span multiple pages.
Specifically in the "Setting up a development environment" section.
Although I can't upload the bad PDF due to NDA reasons, this issue is well documented here, along with some solutions to it, and there's even a PR in place to fix that, but there seems to be no maintainer available to merge it.
I'm not sure how this should be handled, since it's a PDFMiner problem, which seems to be unmaintained, and that reflects directly on camelot.
Occurs when running the camelot example (or when uploading a pdf in excalibur):
Traceback (most recent call last):
File "camelottest.py", line 1, in <module>
import camelot
File "C:\Program Files\Python37\lib\site-packages\camelot\__init__.py", line 6, in <module>
from .io import read_pdf
File "C:\Program Files\Python37\lib\site-packages\camelot\io.py", line 5, in <module>
from .handlers import PDFHandler
File "C:\Program Files\Python37\lib\site-packages\camelot\handlers.py", line 8, in <module>
from .core import TableList
ImportError: cannot import name 'TableList' from 'camelot.core' (C:\Program Files\Python37\lib\site-packages\camelot\core\__init__.py)
Raised by @shivohmgupta:
When I'm using table_regions parameter, above error is coming. Here is my code
tables = camelot.read_pdf("a.pdf", flavor='stream', table_regions=['49,217,568,403'])
Without the table_regions parameter, it is working fine
Check out this birdisland.pdf output here.
While testing, I came across a case where table.accuracy is a negative number.
PDF:page-3.pdf
Code:
tables=camelot.read_pdf('/Users/skatipomu/Table_Extraction_Camelot/page3.pdf',pages="all")
[table.accuracy for table in tables]
Output:
[99.99999999999997, -20.852716930856104]
I think the reason is that the compute_accuracy method in utils.py subtracts each error percentage from 1 when calculating accuracy. The error is supposed to be in the range [0.0, 1.0], but the errors passed to this method are percentages in the range [0, 100], which in turn come from the get_table_index method. Dividing the error by 100 solved the issue for me.
def compute_accuracy(error_weights):
    """Calculates a score based on weights assigned to various
    parameters and their error percentages.

    Parameters
    ----------
    error_weights : list
        Two-dimensional list of the form [[p1, e1], [p2, e2], ...]
        where pn is the weight assigned to list of errors en.
        Sum of pn should be equal to 100.

    Returns
    -------
    score : float

    """
    SCORE_VAL = 100
    try:
        score = 0
        if sum([ew[0] for ew in error_weights]) != SCORE_VAL:
            raise ValueError("Sum of weights should be equal to 100.")
        for ew in error_weights:
            weight = ew[0] / len(ew[1])
            for error_percentage in ew[1]:
                score += weight * (1 - error_percentage)
    except ZeroDivisionError:
        score = 0
    return score
from score += weight * (1 - error_percentage)
to score += weight * (1 - error_percentage/100.0)
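A small self-contained comparison of the two formulas, using a made-up input in which a single error of 1.5 is interpreted as a percentage. The errors_are_percentages flag is added here only so both variants can be run side by side; it is not part of Camelot.

```python
def compute_accuracy(error_weights, errors_are_percentages=False):
    """Weighted score out of 100; optionally rescale errors from [0, 100] to [0, 1]."""
    if sum(ew[0] for ew in error_weights) != 100:
        raise ValueError("Sum of weights should be equal to 100.")
    divisor = 100.0 if errors_are_percentages else 1.0
    score = 0
    try:
        for total_weight, errors in error_weights:
            weight = total_weight / len(errors)
            for error in errors:
                score += weight * (1 - error / divisor)
    except ZeroDivisionError:
        # An empty error list gives no information, so the score falls back to 0.
        score = 0
    return score


sample = [[100, [1.5]]]  # hypothetical input: one error of 1.5 percent
original = compute_accuracy(sample)                             # negative score
rescaled = compute_accuracy(sample, errors_are_percentages=True)  # about 98.5
print(original, rescaled)
```

With the original formula, any error above 1 contributes a negative term, which is how a score such as the -20.85 reported above can arise.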
Currently lots of tests call read_pdf. This wastes time converting the PDF to raster images. This could be avoided.
Steps to reproduce: camelot -p 3 lattice -plot contour 007.pdf
We can improve our TravisCI configuration so it runs on multiple operating systems, specifically having a Windows setup on CI could catch issues people are experiencing.
Here's the documentation and sample config how to achieve this:
https://docs.travis-ci.com/user/languages/python/#running-python-tests-on-multiple-operating-systems
I got the error "from .pdftypes import PDFObjectNotFound" when I ran the piece of code below.
I have installed both dependencies, ghostscript and tkinter, but I have no idea why it's throwing an import error.
import camelot
table = camelot.read_pdf('data/esign.pdf', suppress_stdout=True)
print(table)
The function that checks for page rotation is the culprit. pdfminer's layout analysis takes a long time for such pdfs. Examples: the RNTB pdfs from un-sdg.
Adding a kwarg that lets the user specify the rotation is a minor optimization that can fix this.
According to
https://github.com/socialcopsdev/camelot/blob/7cf409aa08f937edd24d6ac14d8daa56e614bb6d/camelot/parsers/stream.py#L48
it is possible to change edge_tol, but it cannot be set through the read_pdf kwargs because it is filtered out here:
https://github.com/socialcopsdev/camelot/blob/7cf409aa08f937edd24d6ac14d8daa56e614bb6d/camelot/io.py#L104