
Comments (6)

JorjMcKie commented on August 11, 2024

Yeah, I have been experimenting with this kind of thing. E.g. a table extractor using wxPython.

The problem with all of this is the complex logic you mentioned, too: how to intelligently recognize what a human understands instantly at a glance. 2 columns? Or 3? Headlines or images breaking an established rule ...

The cropbox won't help here. If the PDF defines one, it is simply equal to page.rect. It can be changed, but that leads back to the previous questions.
This is actually a job for an AI. I remember once seeing a project devoted exclusively to recognizing tables in a PDF.

from pymupdf-utilities.

angea commented on August 11, 2024

Well, I don't want it to be automated.
I just want it to take a few seconds per page; most pages would use the same layout.

Why can't the cropbox be used? The page contents would be left unchanged; we just need new pages with a cropbox over the previous contents.


JorjMcKie commented on August 11, 2024

The /Cropbox, if present, is equal to the page rectangle presented by (Py)MuPDF. If absent, the /Mediabox is used.
There are more boxes available as per PDF definition, but let's forget about them.

If you want to "use" the cropbox for your purpose, you must first know how to change it, right? And that leads to the issues you and I have sketched above.

In order to extract text from a PDF organized in 2 columns, you just have to know this fact beforehand. Extract text from 2 rectangles for display - or, if you want to use some existing GUI-based doc viewer, redefine the /Cropbox 2 times per page before you display the current GUI window.

The doc viewer in that repo already does something superior, I think: it is able to do a (simple) zoom into each page, which is logically divided into 9 (3 * 3) sub-rectangles for display.
In addition, it is usable for all document types - not just PDF (the only type which knows what a /Cropbox is).
If you modify this a bit to show a different logical page sub-division, you can achieve what you want, e.g. 2 * 1 rectangles, 2 * 2, etc.
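
The logical sub-division described above can be sketched in plain Python. This is only an illustration, not the viewer's actual code: the helper name `grid_rects` and the `(x0, y0, x1, y1)` tuple layout are assumptions made for this example.

```python
from typing import List, Tuple

# Assumed rectangle layout for this sketch: (x0, y0, x1, y1).
Rect = Tuple[float, float, float, float]


def grid_rects(page_rect: Rect, cols: int, rows: int) -> List[Rect]:
    """Split page_rect into cols * rows equal cells, row by row."""
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    x0, y0, x1, y1 = page_rect
    cell_w = (x1 - x0) / cols
    cell_h = (y1 - y0) / rows
    cells = []
    for r in range(rows):          # top to bottom
        for c in range(cols):      # left to right
            cells.append((
                x0 + c * cell_w,
                y0 + r * cell_h,
                x0 + (c + 1) * cell_w,
                y0 + (r + 1) * cell_h,
            ))
    return cells


# A two-column layout is the 2 * 1 case; the viewer's zoom is the 3 * 3 case.
two_columns = grid_rects((0, 0, 600, 800), cols=2, rows=1)
nine_cells = grid_rects((0, 0, 600, 800), cols=3, rows=3)
```

Each resulting tuple could then serve as the sub-area to display or to extract text from, provided the layout (number of columns and rows) is known beforehand.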


JorjMcKie commented on August 11, 2024

The basis for all those viewers is an image of the page, created via page.getPixmap. This method supports a parameter clip to specify a subarea of the full page. It is the basis for the zooming logic sketched in the previous post.
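
A minimal sketch of rendering only a subarea through the clip parameter follows. It assumes that newer PyMuPDF releases spell the method `get_pixmap` while older ones, as in this thread, use `getPixmap`. The helper name `render_region` is hypothetical, and the code only relies on the page object offering one of those two methods.

```python
def render_region(page, clip):
    """Render only the part of `page` inside `clip`.

    `page` is expected to be a PyMuPDF page object and `clip` a
    rectangle-like value such as (x0, y0, x1, y1).
    """
    # Prefer the newer method name, fall back to the older spelling.
    render = getattr(page, "get_pixmap", None) or getattr(page, "getPixmap")
    return render(clip=clip)


# Possible usage, assuming PyMuPDF is installed and "input.pdf"
# (a placeholder name) exists:
#
#   import fitz
#   doc = fitz.open("input.pdf")
#   page = doc[0]
#   r = page.rect
#   left_half = (r.x0, r.y0, (r.x0 + r.x1) / 2, r.y1)
#   pix = render_region(page, left_half)
```

Looking the method up with `getattr` keeps the sketch independent of the installed version; in real code, calling the method of your installed release directly is simpler.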


nandhakumarvs56 commented on August 11, 2024

I am trying your code to extract text from PDF documents, but I am getting an AttributeError: the fitz module has no attribute open. How can I resolve this?


JorjMcKie commented on August 11, 2024

fitz.open is a synonym for fitz.Document, defined in __init__.py. So your error means that __init__.py is not being executed.
There can be many reasons for this: the installation went wrong, you are importing from within the directory where __init__.py lives, etc.
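
One way to narrow this down is to ask Python which file it would load for the name fitz, without importing it. The sketch below uses only the standard library; the helper name `describe_module` is made up for this example.

```python
import importlib.util


def describe_module(name):
    """Report where Python would load the top-level module `name` from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "not found"
    if spec.origin is None:
        # A bare directory without __init__.py is treated as a namespace
        # package, so nothing defined in an __init__.py is available.
        return "namespace package: " + ", ".join(spec.submodule_search_locations)
    return spec.origin


# Shows the path Python resolves for "fitz" in the current environment.
print(describe_module("fitz"))
```

If the reported path points into your working directory rather than into site-packages, or the result is a namespace package, a local folder or file named fitz is probably shadowing the installed package.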
