Hybridise PDFs with combined OCR'd text about paperless HOT 21 CLOSED

janLo commented on May 18, 2024

Hybridise PDFs with combined OCR'd text

from paperless.

Comments (21)

LeoFCardoso commented on May 18, 2024 3

I'm the pdf2pdfocr main developer. Glad to know the project is being helpful for others! Please let me know if you need improvements on pdf2pdfocr for better integration with paperless.

mnp commented on May 18, 2024 1

Yup, see also https://github.com/jbarlow83/OCRmyPDF ... Adding the OCR text inside the PDF is awesome because it then becomes searchable with pdftotext | grep or something more like a db and indexer.
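For anyone who wants to try the `pdftotext | grep` idea from a script, here is a minimal Python sketch. It assumes the `pdftotext` tool from poppler-utils is on the PATH; the function names are made up for illustration and are not part of Paperless or OCRmyPDF.

```python
import subprocess


def pdftotext_command(pdf_path):
    # "-" as the output file tells pdftotext to write the text to stdout.
    return ["pdftotext", pdf_path, "-"]


def matching_lines(text, needle):
    # Case-insensitive substring match, roughly what `grep -i` would do.
    needle = needle.lower()
    return [line for line in text.splitlines() if needle in line.lower()]


def search_pdf(pdf_path, needle):
    # Extract the embedded text layer and return the lines that match.
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return matching_lines(result.stdout, needle)
```

Calling `search_pdf("scan.pdf", "invoice")` would only find anything if the PDF already carries a text layer, which is the point of embedding the OCR output in the first place.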

bkanuka commented on May 18, 2024 1

@danielquinn No problem. I know everyone has an opinion on optional non-foss dependencies, so that's fine. I'll tackle #271 first, which will open the door to custom pre-scripts that can OCR.

jat255 commented on May 18, 2024 1

This script could be useful: https://github.com/LeoFCardoso/pdf2pdfocr

Uses only OSS and has seemed to work on a few files I've tried it out on.

danielquinn commented on May 18, 2024

Integration of OCRFeeder is unlikely to ever happen because there's just too much in there that we don't need (GUI, scanner communication), and much of what it does do is already done in Paperless (parsing the doc and OCRing the text). The one part that sounds (mildly) appealing is recreating the imported documents as PDFs with a plaintext metadata layer. I've heard that this is possible, but the means of doing this aren't apparent from that URL or the code I found in their git repo.

So I'm going to close this one as WONTFIX. I appreciate the interest, but I've no intention of moving Paperless in this direction.

janLo commented on May 18, 2024

I don't mean to integrate the whole ocrfeeder, but pick the interesting parts. For example the Layout analysis and the Combined PDF creation.

danielquinn commented on May 18, 2024

Ah alright, thanks for pointing me to the right place for the combined PDF stuff. It looks potentially doable, but I won't prioritise it right now given the other stuff on my plate (API, proper front-end, image support, mail support, documenting all of it, etc.) So I'll re-open this issue and modify the title to reference combined PDFs.

As for layout analysis, I don't see its value at present. I set this thing up to make indexing and finding paper easier, but honestly don't want to go too deep into prioritising some parts of the text over others for the purposes of search, especially when there's no elaborate search engine system on the other side.

senk commented on May 18, 2024

https://github.com/virantha/pypdfocr adds OCR to the PDF and also uses tesseract. Maybe this is an option

jflesch commented on May 18, 2024

You can do it too with cairo + pango, but I guess it may add dependencies on some X11 libraries :/

17Halbe commented on May 18, 2024

That should be doable with Tesseract itself:
Tesseract FAQ
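A rough sketch of what that looks like when driven from Python: Tesseract's command line takes an image, an output base name, and optional config names, and the `pdf` config asks for a searchable PDF. The helper names below are hypothetical, and this assumes a Tesseract build recent enough to ship the `pdf` config.

```python
import subprocess


def tesseract_pdf_command(image_path, output_base, lang="eng"):
    # Usage pattern: tesseract <image> <outputbase> [options] [configfile]
    # The trailing "pdf" config makes Tesseract write <outputbase>.pdf.
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_image_to_pdf(image_path, output_base, lang="eng"):
    # Runs Tesseract and returns the path of the PDF it should have written.
    subprocess.run(
        tesseract_pdf_command(image_path, output_base, lang),
        check=True,
    )
    return output_base + ".pdf"
```

For example, `ocr_image_to_pdf("page-001.png", "page-001")` would be expected to leave a `page-001.pdf` next to the image, one PDF per input page.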

janLo commented on May 18, 2024

I think it does not position the text directly "behind" the image. The ocrfeeder-code builds a pdf that is searchable and where text is selectable.

maphy-psd commented on May 18, 2024

I think the 'pyocr' package (which paperless uses) doesn't support PDF export. Maybe we could create an image_to_pdf function for pyocr (like the image_to_string function already in use). For that we'd need a PdfBuilder class with pdf as the value for tesseract_configs. Compare the existing DigitBuilder, which uses tesseract's digits config file.

The run_tesseract function has an arg for tesseract options. Tesseract would use the pdf config file to produce searchable PDF output.

For multipage PDFs, paperless can combine the per-page PDFs with pdftk.

As a workaround (with a post-consumption script) you can use pdfsandwich.

I know: OCR runs twice.
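The pdftk step for multipage documents could be sketched like this, assuming each page has already been turned into its own searchable PDF and that `pdftk` is installed. The function names are hypothetical and nothing here is existing Paperless or pyocr code.

```python
import subprocess


def pdftk_cat_command(page_pdfs, output_pdf):
    # pdftk <in1> <in2> ... cat output <out> concatenates the inputs in order.
    if not page_pdfs:
        raise ValueError("need at least one page PDF")
    return ["pdftk", *page_pdfs, "cat", "output", output_pdf]


def combine_pages(page_pdfs, output_pdf):
    # Sort so that page-001.pdf, page-002.pdf, ... end up in reading order.
    subprocess.run(pdftk_cat_command(sorted(page_pdfs), output_pdf), check=True)
    return output_pdf
```

Sorting by file name only gives the right order if the page files are zero-padded, which is an assumption about how the per-page files were named.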

danielquinn commented on May 18, 2024

If someone wants to take this one on, I'm happy to work with them to merge it into Paperless, but I'm afraid my plate is just too full to commit to it.

bkanuka commented on May 18, 2024

FYI I'm going to take a stab at this, but by using ABBYY FineReader. I know, I know, it's not FOSS. However, I've found it to be just way better than tesseract for OCR. It also generates the nice "hybrid" PDFs with selectable text. Timeframe is probably months though.

lenucksi commented on May 18, 2024

@bkanuka I think FR is reasonably priced with around 100-200$/€/... so that sounds like a reasonable way to go given its impressive OCR performance.
However, which edition did you intend to use? They have this CLI version that is priced by throughput and then the Windows versions of which the corporate edition has some feature comparable to inotify...

bkanuka commented on May 18, 2024

I'll target the "ABBYY FineReader Engine 11 CLI for Linux". Frankly, I think if someone was looking for corporate features, they would probably be looking at a different document manager altogether.

I will also consider targeting a cloud-based OCR. There are a few that are relatively free (as in beer). ABBYY's Cloud OCR is free for 100 pages/month. There are probably others out there.

danielquinn commented on May 18, 2024

Hi @bkanuka it's great that you're considering writing this, but I just want to point out that I won't merge any code that depends on non-FOSS software. You're more than welcome to take advantage of the script hooks to write your own code to talk to Paperless, but on principle I don't want to have any code in this project that requires non-free code -- even optionally.

Just thought I'd give you a heads up before you burned a few months working on it only to find out then :-)

lenucksi commented on May 18, 2024

@bkanuka Sounds good and most straight-forward, just looked up the prices again and 1000 pages per month should probably be enough. Said 1k pages is their 200$ offer, see here: http://www.ocr4linux.com/en:pricing:start

jflesch commented on May 18, 2024

Just out of curiosity, what are your use cases for proprietary or cloud OCR?

lenucksi commented on May 18, 2024

@jflesch In my case, tabular layouts are preserved better, the overall OCR precision is better, and the embedding of text into the PDF is better too.

danielquinn commented on May 18, 2024

That's a pretty impressive project @jat255. I think the best process to follow for this sort of thing is to pre-process PDF documents before sending them to Paperless. Now that Paperless has support for skipping the OCR step if the file already has embedded text, the file will be consumed quick & easy without having to do a big Tesseracting.

I'm not keen on integrating this into the consumption process though, as doing so will modify the original rather than just indexing & storing it. It's probably best done by making use of the pre-consume-script hook, which is triggered right before consumption, so I've documented that process here.

Now that we have a couple non-free and one Free means of solving this, I'm thinking it's safe to close this one so I'm going to do so.
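A pre-processing step along those lines might look roughly like the sketch below. It assumes `pdftotext` and OCRmyPDF (mentioned earlier in the thread) are installed, and that the script is handed the path of the incoming PDF; how the pre-consume hook actually passes that path is an assumption here, so check the Paperless docs before wiring it up. All function names are hypothetical.

```python
import os
import subprocess


def has_embedded_text(extracted_text, min_chars=20):
    # Treat the PDF as already searchable once it yields a few real characters.
    # The 20-character threshold is an arbitrary illustrative choice.
    return len(extracted_text.strip()) >= min_chars


def ocrmypdf_command(src, dst):
    # Basic OCRmyPDF invocation: ocrmypdf <input.pdf> <output.pdf>
    return ["ocrmypdf", src, dst]


def preprocess(pdf_path):
    # Returns True if OCR was run, False if the file already had a text layer.
    text = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    if has_embedded_text(text):
        return False
    tmp_path = pdf_path + ".ocr.pdf"
    subprocess.run(ocrmypdf_command(pdf_path, tmp_path), check=True)
    # Replaces the incoming file, so keep your own copy if the original matters.
    os.replace(tmp_path, pdf_path)
    return True
```

Because the file is replaced before consumption, Paperless would then see a PDF that already has embedded text and could skip its own OCR step.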
