Comments (21)
I'm the pdf2pdfocr main developer. Glad to know the project is being helpful for others! Please let me know if you need improvements on pdf2pdfocr for better integration with paperless.
from paperless.
Yup, see also https://github.com/jbarlow83/OCRmyPDF ... Adding the OCR text inside the PDF is awesome because it then becomes searchable with pdftotext | grep or something more like a db and indexer.
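As a rough illustration of the pdftotext | grep idea, the sketch below shells out to pdftotext (from poppler-utils, assumed to be installed) and filters the extracted lines in Python. The function names and the file name in the usage note are made up for this example.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a PDF by calling `pdftotext <file> -`."""
    # The trailing "-" tells pdftotext to write to stdout instead of a file.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def grep_lines(text, needle):
    """Case-insensitive substring match, one result per matching line."""
    needle = needle.lower()
    return [line for line in text.splitlines() if needle in line.lower()]
```

Something like grep_lines(extract_text("scan.pdf"), "invoice") would then list the matching lines of a PDF that already carries an OCR text layer.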
from paperless.
@danielquinn No problem. I know everyone has an opinion on optional non-foss dependencies, so that's fine. I'll tackle #271 first, which will open the door to custom pre-scripts that can OCR.
from paperless.
This script could be useful: https://github.com/LeoFCardoso/pdf2pdfocr
Uses only OSS and has seemed to work on a few files I've tried it out on.
from paperless.
Integration of OCRFeeder is unlikely to ever happen because there's just too much in there that we don't need (GUI, scanner communication), and much of what it does do is already done in Paperless (parsing the doc and OCRing the text). The one part that sounds (mildly) appealing is recreating the imported documents as PDFs with a plaintext metadata layer. I've heard that this is possible, but the means of doing this isn't apparent from that URL or the code I found in their git repo.
So I'm going to close this one as WONTFIX. I appreciate the interest, but I've no intention of moving Paperless in this direction.
from paperless.
I don't mean to integrate the whole ocrfeeder, but pick the interesting parts. For example the Layout analysis and the Combined PDF creation.
from paperless.
Ah alright, thanks for pointing me to the right place for the combined PDF stuff. It looks potentially doable, but I won't prioritise it right now given the other stuff on my plate (API, proper front-end, image support, mail support, documenting all of it, etc.) So I'll re-open this issue and modify the title to reference combined PDFs.
As for layout analysis, I don't see its value at present. I set this thing up to make indexing and finding paper easier, but honestly don't want to go too deep into prioritising some parts of the text over others for the purposes of search, especially when there's no elaborate search engine system on the other side.
from paperless.
https://github.com/virantha/pypdfocr adds OCR to the PDF and also uses tesseract. Maybe this is an option
from paperless.
You can do it too with cairo + pango, but I guess it may add dependencies on some X11 libraries :/
from paperless.
That should be doable with Tesseract itself:
Tesseract FAQ
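For reference, a minimal sketch of what "doable with Tesseract itself" could look like from Python: Tesseract's command line accepts config names after the output base, and the pdf config asks it to write a PDF with a text layer. The function names and the default language are assumptions for this example, and a Tesseract build with PDF output support is assumed to be installed.

```python
import subprocess


def tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Build the argv for `tesseract <image> <outputbase> -l <lang> pdf`."""
    # The trailing "pdf" is a config name; Tesseract writes <output_base>.pdf.
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def image_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the PDF it writes."""
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"
```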
from paperless.
I think it does not position the text directly "behind" the image. The ocrfeeder-code builds a pdf that is searchable and where text is selectable.
from paperless.
I think the package 'pyocr' (which Paperless uses) doesn't support PDF export. Maybe we could create an image_to_pdf function for pyocr (like the image_to_string function already in use). For that we would have to make a PdfBuilder class with pdf as the value for tesseract_configs. Compare the DigitBuilder already in use, which uses tesseract's digits config file.
The run_tesseract function has an arg for tesseract options. tesseract would use the pdf config file to produce searchable PDF output.
For multi-page PDFs, Paperless can combine the per-page PDFs with pdftk.
As a workaround (with a post-consumption script) you can use pdfsandwich.
I know: OCR runs twice.
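The builder idea above might look roughly like this. The classes below are stand-ins written for this example and do not import pyocr; only the names PdfBuilder, DigitBuilder and tesseract_configs come from the comment, and the helper functions are hypothetical. The pdftk call uses its cat operation to concatenate the per-page files.

```python
class TextBuilder:
    """Stand-in for a pyocr-style builder (hypothetical, not pyocr's own class)."""
    tesseract_configs = []


class DigitBuilder(TextBuilder):
    # Mirrors the builder named in the comment: selects the "digits" config.
    tesseract_configs = ["digits"]


class PdfBuilder(TextBuilder):
    # Proposed builder: selects the "pdf" config for searchable PDF output.
    tesseract_configs = ["pdf"]


def tesseract_args(image_path, output_base, builder):
    """Append the builder's config names after the input and output base."""
    return ["tesseract", image_path, output_base] + list(builder.tesseract_configs)


def pdftk_merge_command(page_pdfs, combined_pdf):
    """Build `pdftk p1.pdf p2.pdf ... cat output combined.pdf`."""
    return ["pdftk"] + list(page_pdfs) + ["cat", "output", combined_pdf]
```

Each page image would get its own tesseract_args(..., PdfBuilder) call, and the resulting per-page PDFs would then be passed to pdftk_merge_command in page order.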
from paperless.
If someone wants to take this one on, I'm happy to work with them to merge it into Paperless, but I'm afraid my plate is just too full to commit to it.
from paperless.
FYI I'm going to take a stab at this, but by using ABBYY FineReader. I know, I know, it's not FOSS. However, I've found it to be just way better than tesseract for OCR. It also generated the nice "hybrid" PDFs with selectable text. Timeframe is probably months though.
from paperless.
@bkanuka I think FR is reasonably priced with around 100-200$/€/... so that sounds like a reasonable way to go given its impressive OCR performance.
However, which edition did you intend to use? They have this CLI version that is priced by throughput and then the Windows versions of which the corporate edition has some feature comparable to inotify...
from paperless.
I'll target the "ABBYY FineReader Engine 11 CLI for Linux". Frankly, I think if someone was looking for corporate features, they would probably be looking at a different document manager altogether.
I will also consider targeting a cloud-based OCR. There's a few that are relatively free (as in beer). ABBYY's Cloud OCR is free for 100 pages/month. There's probably others out there.
from paperless.
Hi @bkanuka it's great that you're considering writing this, but I just want to point out that I won't merge any code that depends on non-FOSS software. You're more than welcome to take advantage of the script hooks to write your own code to talk to Paperless, but on principle I don't want to have any code in this project that requires non-free code -- even optionally.
Just thought I'd give you a heads up before you burned a few months working on it only to find out then :-)
from paperless.
@bkanuka Sounds good and most straight-forward, just looked up the prices again and 1000 pages per month should probably be enough. Said 1k pages is their 200$ offer, see here: http://www.ocr4linux.com/en:pricing:start
from paperless.
Just out of curiosity, what are your use cases for proprietary or cloud OCR?
from paperless.
@jflesch In my case, tabular layouts are preserved better, the overall OCR precision is better, and the embedding of text into the PDF is better too.
from paperless.
That's a pretty impressive project @jat255. I think the best process to follow for this sort of thing is to pre-process PDF documents before sending them to Paperless. Now that Paperless has support for skipping the OCR step if the file already has embedded text, the file will be consumed quick & easy without having to do a big Tesseracting.
I'm not keen on integrating this into the consumption process though, as doing so will modify the original rather than just indexing & storing it. It's probably best done by making use of the pre-consume-script hook, which is triggered right before consumption, so I've documented that process here.
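A pre-consume script along those lines could be sketched as follows. This is an illustration, not the documented script: it assumes the hook passes the document path as the first argument, that pdftotext and ocrmypdf are installed, and all function names are made up for the example.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def has_text_layer(extracted_text):
    """Treat a PDF as already OCRed if pdftotext returned any non-blank text."""
    return bool(extracted_text.strip())


def ocr_command(src_pdf, dst_pdf):
    """Build an OCRmyPDF call that writes an OCRed copy of src_pdf to dst_pdf."""
    return ["ocrmypdf", src_pdf, dst_pdf]


def main(argv):
    pdf_path = argv[1]  # assumption: the hook passes the file path as argv[1]
    if not pdf_path.lower().endswith(".pdf"):
        return 0  # only PDFs are handled in this sketch
    text = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    if has_text_layer(text):
        return 0  # already searchable; leave the file alone
    # OCR into a scratch directory, then swap the result into place.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_pdf = os.path.join(tmp_dir, "ocr.pdf")
        subprocess.run(ocr_command(pdf_path, tmp_pdf), check=True)
        shutil.move(tmp_pdf, pdf_path)
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Because the file then arrives with embedded text, Paperless can skip its own OCR step, so the OCR only runs once. Note that this replaces the incoming file, so keep a copy elsewhere if the untouched scan matters.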
Now that we have a couple non-free and one Free means of solving this, I'm thinking it's safe to close this one so I'm going to do so.
from paperless.