Hey, I was browsing through the simpleGui examples and I was wonderi

Feature request: a multi-column to single-column converter (pymupdf-utilities, closed)

pymupdf commented on August 11, 2024

Feature request: a multi-column to single-column converter

from pymupdf-utilities.

Comments (6)

JorjMcKie commented on August 11, 2024

Yeah, I have been experimenting with this kind of thing. E.g. a table extractor using wxPython.

The problem with all of this is the complex logic you mentioned as well: how to recognize automatically what a human understands instantly at a glance. Two columns? Or three? Headlines or images breaking an established rule ...

The cropbox won't help here. It is equal to page.rect (if it is present in the PDF definition). It can be changed, but that leads back to the previous questions.
This is really a job for an AI. I remember once seeing a project devoted exclusively to recognizing tables in a PDF.

angea commented on August 11, 2024

Well, I don't want it to be automated.
I just want it to take a few seconds per page; most pages would use the same layout.

Why can't the cropbox be used? The page contents would be left unchanged; we just need new pages with a cropbox over the previous contents.

JorjMcKie commented on August 11, 2024

The /CropBox, if present, is equal to the page rectangle presented by (Py-)MuPDF. If absent, the /MediaBox is used.
There are more boxes available as per PDF definition, but let's forget about them.

If you want to "use" the cropbox for your purpose, you must first know how to change it, right? And that leads to the issues you and I have sketched above.

In order to extract text from a PDF organized in 2 columns, you just have to know this fact beforehand. Then extract text from 2 rectangles for display or, if you want to use an existing GUI-based doc viewer, redefine the /CropBox twice per page before you display the current GUI window.
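As a rough illustration of the "extract text from 2 rectangles" idea, here is a minimal sketch. It assumes a reasonably recent PyMuPDF in which page.get_text accepts a clip argument. The names column_rects and extract_columns are hypothetical helpers, not part of the library, and the equal-width split is only an assumption that would need adjusting per layout.

```python
def column_rects(page_rect, n_cols=2, gutter=0.0):
    """Split a page rectangle (x0, y0, x1, y1) into equal-width columns.

    Hypothetical helper: returns plain tuples, ordered left to right.
    `gutter` is the assumed horizontal gap between neighbouring columns.
    """
    x0, y0, x1, y1 = page_rect
    width = (x1 - x0 - gutter * (n_cols - 1)) / n_cols
    rects = []
    for i in range(n_cols):
        left = x0 + i * (width + gutter)
        rects.append((left, y0, left + width, y1))
    return rects


def extract_columns(pdf_path, n_cols=2):
    """Return the text of every page, read column by column."""
    import fitz  # PyMuPDF; imported here so column_rects works without it

    chunks = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for rect in column_rects(tuple(page.rect), n_cols):
                # Restrict extraction to the current column rectangle.
                chunks.append(page.get_text("text", clip=fitz.Rect(rect)))
    finally:
        doc.close()
    return "\n".join(chunks)
```

Since most pages of a document tend to share one layout, the column count (and any gutter) could be chosen once per document and overridden only for the pages that differ.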

The doc viewer in that repo already does something superior, I think: it can do a (simple) zoom into each page, which is logically divided into 9 (3 x 3) sub-rectangles for display.
In addition, it is usable for all document types, not just PDF (the only type which knows what a /CropBox is).
If you modify this a bit to show a different logical page subdivision, you can achieve what you want, e.g. 2 x 1 rectangles, 2 x 2, etc.

JorjMcKie commented on August 11, 2024

The basis for all those viewers is an image of the page, created via page.getPixmap. This method supports a parameter clip to specify a subarea of the full page. It is the basis for the zooming logic sketched in the previous post.
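A minimal sketch of that clip-based zoom follows. It assumes a recent PyMuPDF, where the method is spelled page.get_pixmap (the post uses the older camelCase name). The helpers cell_rect and render_cell are hypothetical names for illustration, not code from the viewer in the repo.

```python
def cell_rect(page_rect, row, col, rows=3, cols=3):
    """Return the sub-rectangle (x0, y0, x1, y1) of one grid cell.

    Hypothetical helper: the page is divided into rows x cols equal cells,
    with row and col counted from 0 starting at the top-left corner.
    """
    x0, y0, x1, y1 = page_rect
    cell_w = (x1 - x0) / cols
    cell_h = (y1 - y0) / rows
    return (
        x0 + col * cell_w,
        y0 + row * cell_h,
        x0 + (col + 1) * cell_w,
        y0 + (row + 1) * cell_h,
    )


def render_cell(pdf_path, page_number, row, col, rows=3, cols=3, zoom=2.0):
    """Render one grid cell of a page and return it as PNG bytes."""
    import fitz  # PyMuPDF; imported here so cell_rect works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        clip = fitz.Rect(cell_rect(tuple(page.rect), row, col, rows, cols))
        # The matrix scales the output; clip limits it to the chosen cell.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
        return pix.tobytes("png")
    finally:
        doc.close()
```

Calling render_cell with rows=1 and cols=2 would give the left and right halves of a page, which is the 2 x 1 subdivision mentioned earlier in the thread.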

nandhakumarvs56 commented on August 11, 2024

I am trying your code to extract text from PDF documents, but I am getting an attribute error: the fitz module has no attribute open. How can I resolve this?

JorjMcKie commented on August 11, 2024

fitz.open is a synonym for fitz.Document, defined in __init__.py. So your error means that __init__.py is not being executed.
There can be several reasons for this: the installation went wrong, you are importing from within the directory where __init__.py lives, etc.
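One way to narrow this down might be to check which file Python actually loads under the name fitz. The sketch below uses only the standard library; diagnose_module is a hypothetical helper name, and it only reports facts about the import without fixing anything.

```python
import importlib
import os


def diagnose_module(name="fitz"):
    """Report whether a module imports, where it was loaded from,
    and whether it exposes an `open` attribute."""
    try:
        mod = importlib.import_module(name)
    except ImportError as exc:
        return {"importable": False, "error": str(exc)}
    return {
        "importable": True,
        # Path of the file that was actually imported (may be None).
        "file": getattr(mod, "__file__", None),
        "has_open": hasattr(mod, "open"),
        # The current directory matters: a local fitz folder can shadow
        # the installed package.
        "cwd": os.getcwd(),
    }


if __name__ == "__main__":
    print(diagnose_module("fitz"))
```

If the reported file does not belong to the installed PyMuPDF package (for example, it points into the current working directory), that would be consistent with the shadowing case described above.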
