Comments (6)
Yeah, I have been experimenting with this kind of thing, e.g. a table extractor using wxPython.
The problem with all of this is the complex logic you mentioned, too: how to intelligently recognize what a human instantly understands when looking at a page: 2 columns? Or 3? Headlines or images breaking an established rule ...
The cropbox won't help here. If present in the PDF definition, it is equal to page.rect. It can be changed, but that leads back to the previous questions.
This is actually a job for an AI. I remember once seeing a project devoted exclusively to recognizing tables in a PDF.
from pymupdf-utilities.
Well I don't want it to be automated.
I just want it to take a few seconds per page - most pages would use the same layout.
Why can't the cropbox be used? The page contents would be left unchanged - we just need new pages with a cropbox over the previous contents.
from pymupdf-utilities.
The /Cropbox, if present, is equal to the page rectangle presented by (Py)MuPDF. If absent, the /Mediabox is used.
There are more boxes available as per PDF definition, but let's forget about them.
If you want to "use" the cropbox for your purpose, you must first know how to change it, right? And that leads to the issues you and I have sketched above.
In order to extract text from a PDF organized in 2 columns, you just have to know this fact beforehand. Extract text from 2 rectangles for display - or, if you want to use some existing GUI-based doc viewer, redefine the /Cropbox 2 times per page before you display the current GUI window.
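A minimal sketch of the "extract text from 2 rectangles" idea, assuming you already know the number of columns and that they are of equal width. The names column_rects and extract_columns are made up for illustration. The call page.get_text("text", clip=...) is the snake_case spelling used by recent PyMuPDF releases; older releases use camelCase names and may not accept clip for text extraction, so check your version.

```python
def column_rects(page_rect, n_cols=2):
    """Split a page rectangle (x0, y0, x1, y1) into n_cols equal vertical strips.

    Assumes equal-width columns, which is a simplification.
    """
    x0, y0, x1, y1 = page_rect
    width = (x1 - x0) / n_cols
    return [(x0 + i * width, y0, x0 + (i + 1) * width, y1) for i in range(n_cols)]


def extract_columns(path, n_cols=2):
    """Return, for every page, a list with the text of each column (left to right).

    Needs PyMuPDF; imported lazily so column_rects works without it.
    """
    import fitz

    result = []
    with fitz.open(path) as doc:
        for page in doc:
            strips = column_rects(tuple(page.rect), n_cols)
            result.append(
                [page.get_text("text", clip=fitz.Rect(s)) for s in strips]
            )
    return result
```

For a 2-column layout, extract_columns("input.pdf") (file name hypothetical) would give one [left_text, right_text] pair per page. Headlines or images spanning both columns are not handled.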
The doc viewer in that repo already does something superior, I think: it can do a (simple) zoom into each page, which is logically divided into 9 (3 * 3) sub-rectangles for display.
In addition, it is usable for all document types - not just PDF (the only type which knows what a /Cropbox is).
If you modify this a bit to show a different logical page sub-division, you can achieve what you want, e.g. 2 * 1 rectangles, 2 * 2, etc.
from pymupdf-utilities.
The basis for all those viewers is an image of the page, created via page.getPixmap. This method supports a parameter clip to specify a subarea of the full page. It is the basis for the zooming logic sketched in the previous post.
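As a rough sketch of that sub-division logic (not the viewer's actual code): grid_rects and render_tile are hypothetical helper names, and the rendering call uses page.get_pixmap, the current spelling of getPixmap, with its matrix and clip parameters.

```python
def grid_rects(rect, cols, rows):
    """Divide rect (x0, y0, x1, y1) into cols * rows tiles, listed row by row."""
    x0, y0, x1, y1 = rect
    w = (x1 - x0) / cols
    h = (y1 - y0) / rows
    return [
        (x0 + c * w, y0 + r * h, x0 + (c + 1) * w, y0 + (r + 1) * h)
        for r in range(rows)
        for c in range(cols)
    ]


def render_tile(page, tile, zoom=2.0):
    """Render only one tile of the page as a pixmap, enlarged by zoom.

    Needs PyMuPDF; imported lazily so grid_rects works without it.
    """
    import fitz

    return page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(tile))
```

With cols=3, rows=3 this gives the 9 sub-rectangles mentioned above; cols=2, rows=1 gives a left and a right half, which is what a 2-column page needs.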
from pymupdf-utilities.
I am trying your code to extract text from PDF documents, but I am getting an AttributeError: the fitz module has no attribute open. How can I resolve this?
from pymupdf-utilities.
fitz.open is a synonym for fitz.Document, defined in __init__.py. So your error means that the __init__.py is not executed.
The reason for this in turn may be manifold: installation went wrong, trying to import from within the directory where the __init__.py lives, etc.
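One way to narrow this down is to ask Python where it resolves the name from. The sketch below uses only the standard library; describe_import is a hypothetical helper name. If the reported location is not the installed package, or there is no __init__.py behind it at all, that would be consistent with the causes listed above.

```python
import importlib
import importlib.util


def describe_import(name, attr):
    """Report where module `name` is loaded from and whether it has `attr`."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not installed"
    module = importlib.import_module(name)
    # A spec without an origin usually means a plain directory was picked up
    # as a namespace package, so no __init__.py was executed.
    where = spec.origin or "a namespace package (directory without __init__.py)"
    status = "has" if hasattr(module, attr) else "is missing"
    return f"{name} loaded from {where} {status} attribute {attr!r}"
```

Running print(describe_import("fitz", "open")) from the same directory and interpreter that produced the error shows which fitz is actually being imported.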
from pymupdf-utilities.