Comments (15)
Columns containing only NaN get removed automatically by read_pdf, but I want to keep them. Is there any way to prevent this?
I want all 3 columns to stay in their original order, but this is what I am getting instead
from tabula-py.
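One possible post-processing workaround, sketched here under assumptions: if the expected column names are known in advance and the surviving columns still carry their correct names, pandas can put the dropped all-NaN columns back with `DataFrame.reindex`. The column names and sample values below are hypothetical, and this does not help when tabula has merged two columns into one.

```python
import pandas as pd

# Hypothetical column layout; replace with the headers of your own table.
EXPECTED = ["Date", "Withdrawal", "Deposit"]

def restore_columns(df, expected):
    """Return df with exactly the expected columns, in order.

    Columns missing from df are created and filled with NaN.
    """
    return df.reindex(columns=expected)

# Stand-in for a parsed frame in which the empty "Withdrawal" column was dropped.
parsed = pd.DataFrame({"Date": ["01-01", "02-01"], "Deposit": [100.0, 50.0]})
restored = restore_columns(parsed, EXPECTED)
print(restored)
```

This only restores the shape of the frame after parsing; it does not change what tabula-java extracts.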
Ok. I'll raise an issue at tabula-java.
I received the same output with stream=True.
I think it's a tabula-java option problem, so it would be best to ask in the tabula-java issue tracker.
How about using the stream option?
tabula.read_pdf('target.pdf', pages='all', stream=True, guess=False)
The same problem occurs in tabula-py.
If a column is empty on any page, tabula-py writes another column's data into it.
I am also having the same issue, @konfi.
Please help us!
@samkit-jain Were you able to raise this issue again? Please let me know the steps if it's resolved.
Issue - If a column is empty on any page, tabula-py writes another column's data into it.
I tried reading the PDF file using tabula's read_pdf in Python.
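Code
from tabula import read_pdf
from tabulate import tabulate
# Pass real booleans: the strings 'True' and 'False' are both truthy in Python.
df = read_pdf(pdfFile, pages='1', stream=True, guess=False)
df = df.dropna(axis='rows')
print(tabulate(df))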
As you can see in the output screenshot, the Withdrawal and Deposit columns got merged into a single column.
If I use the Tabula app (https://tabula.technology/) locally, it works fine.
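A side note on the stream and guess arguments in calls like this one: Python treats any non-empty string as true, so a quoted 'False' does not behave like False wherever a flag is checked by truthiness. The sketch below illustrates this with plain Python; `as_bool` is a hypothetical helper, not part of tabula-py.

```python
def as_bool(value):
    """Convert common textual flags to a real bool (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise ValueError(f"not a boolean flag: {value!r}")

# A non-empty string is truthy, whatever it spells.
quoted_is_truthy = bool("False")

# Normalising the flags before handing them to a parser call.
options = {"stream": as_bool("True"), "guess": as_bool("False")}
print(quoted_is_truthy, options)
```

Passing `stream=True, guess=False` directly avoids the need for any conversion.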
@kvopencode Haven't been able to resolve this
I tried with this PDF created by Google Spreadsheet, but couldn't reproduce the issue. Could someone provide a minimal reproducible PDF for me?
test.pdf
In [1]: import tabula
In [2]: path = "test.pdf"
In [23]: df= tabula.read_pdf(path)
In [24]: df
Out[24]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0
In [25]: tabula.read_pdf(path, guess=False)
Out[25]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0
In [26]: tabula.read_pdf(path, guess=False, stream=True)
Out[26]:
   ColA  Unnamed: 1  ColB  Unnamed: 3  ColC  Unnamed: 5
0   NaN         1.0   NaN         NaN   NaN         6.0
1   NaN         NaN   NaN         2.0   NaN         7.0
2   NaN         3.0   NaN         NaN   NaN         8.0
3   NaN         NaN   NaN         4.0   NaN         NaN
4   NaN         NaN   NaN         5.0   NaN         9.0
In [36]: tabula.read_pdf(path, guess=False, lattice=True)
Out[36]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0
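In the stream=True output above, each named column is empty and its values sit in the "Unnamed" column to its right. A minimal sketch of one way to fold such pairs back together with pandas follows. It assumes the columns strictly alternate as named/unnamed, which matches this particular output but is not guaranteed for other PDFs; `coalesce_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The frame printed for stream=True in the session above.
raw = pd.DataFrame({
    "ColA": [nan] * 5,
    "Unnamed: 1": [1.0, nan, 3.0, nan, nan],
    "ColB": [nan] * 5,
    "Unnamed: 3": [nan, 2.0, nan, 4.0, 5.0],
    "ColC": [nan] * 5,
    "Unnamed: 5": [6.0, 7.0, 8.0, nan, 9.0],
})

def coalesce_pairs(df):
    """Merge each named column with the spill-over column that follows it."""
    cols = list(df.columns)
    merged = {}
    for name, spill in zip(cols[0::2], cols[1::2]):
        # Keep values from the named column, fill its gaps from the spill column.
        merged[name] = df[name].combine_first(df[spill])
    return pd.DataFrame(merged, index=df.index)

fixed = coalesce_pairs(raw)
print(fixed)
```

Note that the stream output has five rows rather than six, so the all-NaN row seen in the other modes is not recovered by this step.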
@chezou Could you please share your email address? I will send the PDF over.
@kvopencode Unfortunately, I don't want to provide support privately via e-mail, because:
- I suspect this is a tabula-java related issue. If so, the PDF should be shared with the tabula-java team.
- tabula-py is a personal project, which means I develop and maintain it in my spare time. I don't have enough resources to provide support on my own.
- Personally, I have had really awful experiences with e-mail-based requests. Some were impolite, and some tended to overuse my limited resources. I don't think you're one of them, but it'd be nice if we could have "minimum reproducible data" for tackling this issue, since tabula-py is an open-source project.
For people facing this issue:
Please provide the PDF in future comments on this issue. Otherwise, I might lock this issue, since I can't reproduce it and can't determine whether it is a tabula-py issue or a tabula-java one.
I have the same issue. The debit and credit columns should be separate, but for some reason they are not.
MIB ADJ02262021.pdf
Was anyone able to resolve the issue of merged columns?
A possible workaround for me is to convert the PDF file to HTML.
Then read the table from the HTML content:
https://datascientyst.com/extract-table-from-pdf-with-python-pandas/
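As a rough illustration of why the HTML route can preserve empty columns: an empty cell is still an explicit `<td></td>` in the markup, so it keeps its position in the row. The sketch below uses only the standard library and a made-up HTML fragment; it assumes the PDF-to-HTML converter emits a real `<table>`, which is not guaranteed, and it is not the method from the linked article. `pandas.read_html` is another option if an HTML parser backend is installed.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of every <tr>, keeping empty cells as ''."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up HTML standing in for a converted PDF page.
html = """
<table>
  <tr><th>Date</th><th>Withdrawal</th><th>Deposit</th></tr>
  <tr><td>01-01</td><td></td><td>100.00</td></tr>
  <tr><td>02-01</td><td>25.00</td><td></td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
rows = parser.rows
print(rows)
```

Every row keeps three cells, so the Withdrawal and Deposit values stay in their own columns even when one of them is blank.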
@samkit-jain - Did you find a fix for this?
@Aman0509 Not with tabula. I switched over to pdfplumber.
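For anyone wanting to try the same switch, a minimal sketch with pdfplumber might look like the following. The file name in the usage comment and the `rows_to_records` helper are hypothetical, and whether pdfplumber separates the columns correctly for a given PDF still has to be checked case by case. `extract_table()` returns a list of rows with `None` for empty cells, or `None` when no table is found on the page.

```python
def extract_first_table(pdf_path, page_index=0):
    """Return the first table pdfplumber finds on one page, as a list of rows."""
    import pdfplumber  # third-party; imported lazily so the helper below runs without it

    with pdfplumber.open(pdf_path) as pdf:
        # May be None if pdfplumber detects no table on this page.
        return pdf.pages[page_index].extract_table()

def rows_to_records(rows):
    """Turn [header, row, row, ...] into dicts; empty cells stay None."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

# Usage (hypothetical file name):
#   rows = extract_first_table("statement.pdf")
#   records = rows_to_records(rows) if rows else []

# Stand-in rows shaped like extract_table() output, to show the conversion.
sample = [["Date", "Withdrawal", "Deposit"], ["01-01", None, "100.00"]]
records = rows_to_records(sample)
print(records)
```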