
Comments (15)

peppapoppy commented on June 19, 2024

Columns containing only NaN are removed automatically by read_pdf, but I want them to stay. Is there any way to prevent this?
sample
I want all 3 columns kept in their original order, but I am getting this:
sample2
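
If the expected column names are known in advance, one possible workaround is to restore the dropped columns after extraction. The following is a minimal sketch using pandas `DataFrame.reindex`. The column names and values are made up for illustration, and the small DataFrame stands in for whatever read_pdf returned. It only helps when the surviving columns still carry the right header names, which is an assumption here.

```python
import pandas as pd

# Hypothetical header names that the table is expected to have.
expected = ["ColA", "ColB", "ColC"]

# Stand-in for an extraction result in which the all-NaN "ColB" was dropped.
df = pd.DataFrame({"ColA": [1.0, 3.0], "ColC": [6.0, 8.0]})

# reindex adds any missing column (filled with NaN) and enforces the order.
restored = df.reindex(columns=expected)
print(restored)
```

Columns that are already present keep their values; only the missing ones are created.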

from tabula-py.

samkit-jain commented on June 19, 2024

Ok. I'll raise an issue at tabula-java.

I received the same output with stream=True.

chezou commented on June 19, 2024

I think this is a tabula-java option problem, so it would be best to ask in the tabula-java issue tracker.

How about using the stream option?

tabula.read_pdf('target.pdf', pages='all', stream=True, guess=False)

ayubansal1998 commented on June 19, 2024

The same problem occurs in tabula-py.
If a column is empty on any page, tabula-py writes another column's data into it.

thiencuong14 commented on June 19, 2024

I am also having the same issue, @konfi

[input]
image

[output]
image

Please help us!

kvopencode commented on June 19, 2024

@samkit-jain Were you able to raise this issue again? Please let me know the steps if it's resolved.

Issue: if a column is empty on any page, tabula-py writes another column's data into it.

I tried reading the PDF file using tabula's read_pdf in Python.
Code:

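from tabula import read_pdf
from tabulate import tabulate
# Pass booleans, not strings: the string 'False' is truthy in Python.
df = read_pdf(pdfFile, pages='1', stream=True, guess=False)
df = df.dropna(axis='rows')
print(tabulate(df))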
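
One thing worth checking when passing options such as stream and guess: any non-empty string, including 'False', is truthy in Python, so a flag passed as a string may not behave like the boolean it spells. A minimal sketch, assuming the receiving code tests the flag with an ordinary truthiness check (that assumption is not verified against tabula-py's source here; `describe_flag` is a hypothetical helper for illustration only):

```python
def describe_flag(value):
    """Return 'on' or 'off' the way a plain truthiness check would see the value."""
    return "on" if value else "off"


print(describe_flag(False))    # prints "off": the boolean False is falsy
print(describe_flag("False"))  # prints "on": a non-empty string is truthy
```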

As you can see in the output screenshot, the Withdrawal and Deposit columns got merged into a single column.

If I use the Tabula app (https://tabula.technology/) locally, it works fine.

Sample source pdf
Screenshot 2019-10-30 at 14 32 22

Python dataframe output
pythonDataframeOutput

samkit-jain commented on June 19, 2024

@kvopencode Haven't been able to resolve this

chezou commented on June 19, 2024

I tried with this PDF, created with Google Sheets, but couldn't reproduce the issue. Could someone provide a minimal reproducible PDF for me?
test.pdf

In [1]: import tabula
In [2]: path = "test.pdf"

In [23]: df = tabula.read_pdf(path)

In [24]: df
Out[24]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

In [25]: tabula.read_pdf(path, guess=False)
Out[25]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

In [26]: tabula.read_pdf(path, guess=False, stream=True)
Out[26]:
   ColA  Unnamed: 1  ColB  Unnamed: 3  ColC  Unnamed: 5
0   NaN         1.0   NaN         NaN   NaN         6.0
1   NaN         NaN   NaN         2.0   NaN         7.0
2   NaN         3.0   NaN         NaN   NaN         8.0
3   NaN         NaN   NaN         4.0   NaN         NaN
4   NaN         NaN   NaN         5.0   NaN         9.0

In [36]: tabula.read_pdf(path, guess=False, lattice=True)
Out[36]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0
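
In the stream=True output above, each header is followed by an "Unnamed: N" column that holds the actual values. A minimal sketch of folding those columns back into the named column to their left, using pandas `Series.combine_first`. The DataFrame is retyped by hand from the Out[26] output above, and `fold_unnamed` is a hypothetical helper, not part of tabula-py. It assumes every "Unnamed" column belongs to the nearest named column on its left, which may not hold for other PDFs.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hand-copied from the stream=True output shown above.
df = pd.DataFrame({
    "ColA": [nan] * 5,
    "Unnamed: 1": [1.0, nan, 3.0, nan, nan],
    "ColB": [nan] * 5,
    "Unnamed: 3": [nan, 2.0, nan, 4.0, 5.0],
    "ColC": [nan] * 5,
    "Unnamed: 5": [6.0, 7.0, 8.0, nan, 9.0],
})


def fold_unnamed(frame):
    """Merge each 'Unnamed: N' column into the nearest named column to its left."""
    out = {}
    current = None
    for name in frame.columns:
        if str(name).startswith("Unnamed") and current is not None:
            # Fill the gaps in the named column with the Unnamed column's values.
            out[current] = out[current].combine_first(frame[name])
        else:
            current = name
            out[name] = frame[name]
    return pd.DataFrame(out)


folded = fold_unnamed(df)
print(folded)
```

This cannot bring back the all-NaN row that stream mode dropped, since that row is no longer in the data.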

kvopencode commented on June 19, 2024

@chezou Could you please share your email address? I will send the PDF to you.

chezou commented on June 19, 2024

@kvopencode Unfortunately, I don't want to provide support privately via email, because:

  • I suspect this is a tabula-java issue. If so, the PDF should be shared with the tabula-java team.
  • tabula-py is a personal project, which means I develop and maintain it in my spare time. I don't have enough resources to provide support on my own.
  • Personally, I have had really awful experiences with email-based requests. Some were impolite, and some tended to overuse my limited resources. I don't think you're one of those people, but it would be nice to have "minimum reproducible data" for tackling this issue, since tabula-py is an open-source project.

For people facing this issue:
Please provide the PDF in future comments on this issue. Otherwise, I might lock this issue, since I can't reproduce it and can't determine whether it is a tabula-py issue or a tabula-java one.

egodalle commented on June 19, 2024

I have the same issue. The debit and credit values should be in separate columns, but for some reason they are not.
MIB ADJ02262021.pdf

Ayushi-Garg-1 commented on June 19, 2024

Was anyone able to resolve the issue of merged columns?

softhints commented on June 19, 2024

A possible workaround for me is to convert the PDF file to HTML.
Then read the table from the HTML content:

https://datascientyst.com/extract-table-from-pdf-with-python-pandas/

Aman0509 commented on June 19, 2024

@samkit-jain - Found any fix for this?

samkit-jain commented on June 19, 2024

@Aman0509 Not with tabula. I switched over to pdfplumber.
