internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.

Home Page: https://archive-pdf-tools.readthedocs.io/en/latest/

License: GNU Affero General Public License v3.0

Python 87.76% Cython 11.74% Shell 0.49%
compression ocr pdf pdf-compression pdf-generation pdf-generator pdf-to-image python pdf-compressor

archive-pdf-tools's People

Contributors

cclauss, jrochkind, mara004, merlijnwajer, redsandro, stweil, tfmorris


archive-pdf-tools's Issues

PDF/UA improvements

VeraPDF now supports PDF/UA verification:

~/verapdf/verapdf --format xml --flavour ua1 /tmp/test.pdf  > /tmp/out.xml

We should fix the problems that it finds with our PDFs; I suspect that this will also help with the problems that Adobe finds.

This means at least:

  • Adding the primary language
  • Marking Figures as Artifacts
  • Adding alt text to Figures (we might not need to if we mark them as Artifacts)
  • Defining the language for text blocks
  • Potentially indicating the reading order?

Usefulness of MRC for decent quality compression of scanned book pages with illustrations

Opening a new issue as requested.

Here are some samples: https://mega.nz/folder/BRhChKob#xo-HHaJrD9VYN6YV3ur9WA

128.tif & 188.tif - the original cleaned-up 600 dpi scans
*-scantailor.tif - 600 dpi mixed output with bitonal text and color photos, as autodetected
*-scantailor-pdfbeads.pdf - the above .tif split into two layers, with the text layer JBIG2-encoded and the background layer JP2-encoded and downsampled to 150 dpi, everything assembled into a PDF using pdfbeads
*.jp2 - some compressed versions of the original; I forgot the settings. Page 128 is almost half the size of the PDFs, so I assume the PDF sizes can be slightly improved.

The folders have some residual files. ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.)

Can MRC output get anywhere comparable to these PDFs at the same or lower size? I'm also curious whether that can be achieved directly from the original cleaned-up scan, or whether the ScanTailor mixed-output step is still advised.

Use "linear" option from new pymupdf (if it doesn't break metadata writing)

This option could be used:

linear (bool) – Save a linearised version of the document. This option creates a file format for improved performance for Internet access. Excludes “incremental”.

Last time I tried to use it, it badly broke evince/poppler, so we might need to file bugs with them first.

pdfcomp: new tool, discussion, compression questions

The tool needs command line arguments much like recode_pdf (which we might want to rename), and those flags probably ought to be mostly shared.

Let's also use this to discuss issues of people testing pdfcomp now.

I don't understand this picture

(screenshot attached)

Why would we need so many colors smeared to the bottom right that are not behind the foreground mask?

Those could all be optimized away to facilitate Run Length Encoding.

Just some other errors with the current version: I can't get it to work with an hOCR file coming from pdftree to extract the existing searchable text from a PDF

I now work with an hOCR file coming from pdftree to extract the existing searchable text from a PDF, as suggested at the bottom of this issue:
ocropus/hocr-tools#117

recode_pdf --from-imagestack './2022-01-08*.tif' --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 2022-01-08a.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

Even if I leave out the hOCR file, in the hope that the searchable text already inside the input PDF would be used, there's still an error:
recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 628, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/usr/local/lib/python3.8/dist-packages/archive_hocr_tools-1.1.13-py3.8.egg/hocr/parse.py", line 42, in hocr_page_iterator
    fp.seek(0)
AttributeError: 'NoneType' object has no attribute 'seek'

I anonymized the hocr with :%s/>.*<\/span>/>bla<\/span>/
anonymized.zip

Some scans become inverted

I've noticed it twice before, and I thought it was a computer issue because I scanned too large at 600 dpi. But now I encounter it a third time, this time while scanning a small card at 300 dpi, so I'm beginning to think this might be a bug.

Original: Left. recode_pdf: Right.
image

My normal workflow:

ls -1 *.png > in.txt
tesseract -l nld+eng --dpi 300 in.txt out hocr
recode_pdf -v -m 2 --dpi 300 --from-imagestack "./*.png" --hocr-file out.hocr -o "out-recode.pdf"

Is this a known issue? Is there a known workaround? A quick search didn't turn up anything.
I'm not sure I can share the full-resolution card openly because it is copyrighted, but if this issue has never been seen before I am willing to email the full-resolution file for testing purposes.

$ recode_pdf --version
internetarchivepdf 1.4.14

A lot of fuzz in the background picture

Hi Merlijn,

I like this repo; it looks like the first serious open source MRC PDF solution I've found. I recently filed an issue with didjvu, which suffers from equally bad background fuzz. There, the fuzz was diminished by using the better DjVu algorithm from C44 instead of DjVuMake for removing the surrounding pixels from characters:

jwilk/didjvu#18

I guess when you try the picture over there you'll find a similarly fuzzy background with this MRC PDF compressor. It might be interesting to study the algorithm in the open source c44 to better separate the foreground from the background.

Support PDF generation without MRC

This used to work (currently called image mode 1), but it definitely doesn't work right now, so it would be nice to make it work again.

Support PDF generation/compression without hOCR files

This should be a no-brainer, but we need to deal with a few things:

  • We use hOCR files to determine the page size, based on the DPI encoded in the hOCR files (if present); otherwise we estimate it.
  • The code that generates the initial PDF with the text layer obviously relies on hOCR. As an alternative when we have no hOCR, we could just make a PDF with empty pages of the right size.

openjpeg is not working properly

Recommended actions discussed in this issue:

  • Remove -threads (or place flag last) for OpenJpeg (done: 31def81)
  • Allow threads to be specified (encoder agnostic e.g. -num_threads for Kakadu)
  • Merge debug messages from ROI build

(Original issue below.)


Using -J openjpeg results in lossless compression. Probably because it is the default:

$ opj_compress -h | grep -A 3 "Default encoding"
Default encoding options:
-------------------------

 * Lossless

My test image compresses 0.29 times (got 3 times bigger).

recode_pdf -v --dpi 300 \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf

recode_pdf should probably use some sane defaults with -J openjpeg.

Command line arguments don't work.

Unfortunately, manually setting the compression options doesn't work. According to opj_compress -h, the compression ratio can be adjusted with -q or -r:

-r <compression ratio>,<compression ratio>,...
    Different compression ratios for successive layers.
    The rate specified for each quality level is the desired
    compression factor (use 1 for lossless)
    Decreasing ratios required.
      Example: -r 20,10,1 means 
            quality layer 1: compress 20x, 
            quality layer 2: compress 10x 
            quality layer 3: compress lossless
    Options -r and -q cannot be used together.
-q <psnr value>,<psnr value>,<psnr value>,...
    Different psnr for successive layers (-q 30,40,50).
    Increasing PSNR values required, except 0 which can
    be used for the last layer to indicate it is lossless.
    Options -r and -q cannot be used together.

Yet the resulting files are identical size-wise regardless of compression-flags:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 20' \
  --bg-compression-flags ' -r 20' \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf
recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -q 5' \
  --bg-compression-flags ' -q 5' \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf

Am I missing something, or is this a bug with either recode_pdf or the documentation?

Testing openjpeg directly

$ opj_compress -r 750 -i in.png -o out.jp2

3.0 MB -> 34,3 kB

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

openjpeg version

$ opj_compress -h | grep openjp2
It has been compiled against openjp2 library v2.3.1.

Test scan to experiment with

test_1.png.zip

Add/implement regression tests for MRC

We can leverage the scripts in tools/ to perform the MRC compression separately, merge the final result, and create a diff between the original image and the MRC-compressed image. That way, given a database of images, we could improve the algorithm and see how it performs against known data/images.
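A minimal sketch of such a diff metric in plain NumPy (the metric choice is illustrative, not something the project currently uses):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of equal shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the original."""
    m = mse(a, b)
    return float("inf") if m == 0 else float(10.0 * np.log10(peak ** 2 / m))
```

A regression run would compute psnr(original, mrc_compressed) for every image in the database and fail if the score drops below a recorded per-image baseline.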

Bug in foreground/background separator choosing massive block instead of character outline.

A partly anonymized replay of my previous finding on compressing the bank statement with foreground downsampling, revealing a bug in the foreground binarizer/separator.

image

Add fg_downsample=12 in compress-pdf-images:

    mrc_gen = create_mrc_hocr_components(pil_image, hocr_word_data,
    #mrc_gen = create_mrc_hocr_components(pil_image, [],
            denoise_mask=DENOISE_FAST,
            bg_downsample=3,
            fg_downsample=12
            )

bankstatementgeknipt8noalphag.zip

ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatementgeknipt8noalphag.tiff outgeknipt8g.pdf
pdfcomp outgeknipt8g.pdf outgeknipt8-12g.pdf
outgeknipt8-12g.pdf
outgeknipt8g.pdf

Support pillow jpeg2000 writing

It's probably not as great as grok or kakadu, but it'd be nice to support it for folks who don't have the other programs installed.

Missing test suite?

It looks like archive-pdf-tools currently does not have an automated test suite.
I know that not all developers like to work this way, but I think providing a test suite can be very advantageous for quality assurance, to make sure the library works equally well on different platforms. It may also be helpful to verify changes for correctness and avoid regressions.
pytest is a popular choice as a test framework among open-source Python projects, for instance.

Add tests...

We have none, and it would be good to have some.

Small difference in compression ratio

See my post after the closed #30

There is a small compression-ratio difference between your setup and mine. Could that signal a memory leak, or some other difference in setup?

Error with hocr-files from Tesseract

When Tesseract generates this HOCR-file
img.zip

I get this error:

recode_pdf --from-imagestack ../210923-005.tif --hocr-file ~/img.hocr -o /tmp/outf.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 XOP
	 FMA4
	 FMA3
Creating text only PDF
Starting page generation at 2021-11-28T10:59:56.133494
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.9', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 262, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 1070, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 189, in create_tess_textonly_pdf
    imgfile = image_files[idx]
IndexError: list index out of range

For the first 5 pages there was no issue with the same command; it's only this page, so the hocr coming from Tesseract contains something not allowed.

Windows port

  • Add a pillow fallback for reading images (in case kakadu, openjpeg2000 or grok is not available)
  • os.remove causes sharing violations, since we remove files after opening them, which Windows doesn't allow
  • jbig2 encoding is not available right now

First recode_pdf test: 'numpy' has no attribute 'int'.

Just followed the install instructions, but the test recode_pdf --version gets a numpy-related error:

david@DESKTOP5:~/src/jbig2enc$ recode_pdf --version
Traceback (most recent call last):
  File "/home/david/.local/bin/recode_pdf", line 4, in <module>
    from internetarchivepdf.recode import recode
  File "/home/david/.local/lib/python3.10/site-packages/internetarchivepdf/__init__.py", line 2, in <module>
    from . import mrc
  File "/home/david/.local/lib/python3.10/site-packages/internetarchivepdf/mrc.py", line 36, in <module>
    from optimiser import optimise_gray, optimise_rgb, optimise_gray2, optimise_rgb2, fast_mask_denoise
  File "cython/optimiser.pyx", line 11, in init optimiser
  File "/home/david/.local/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Thanks.

Define scope of tooling and work to improve for that scope

Right now the tool naming is a bit confusing. The main tool is called "recode_pdf", but it doesn't really do PDF recoding; it does PDF creation, inserts text layers, and performs MRC compression.

Since I am working on adding a tool to actually recode existing PDFs (MRC compressing them, and not doing anything else for starters), it might make sense to think about renaming the tool names, but also define what the tools ought to do.

I think there are a few scenarios:

  • Given a set of images (and hOCR results), create a (compressed) PDF - like what ocrmypdf does.
  • Given an input PDF with just one image per page, do what the above step does.
  • Given an uncompressed PDF, compress (recode) the PDF. Optional features here are to (1) insert a text layer (2) make the PDF PDF/A compatible

Can others think of other scenarios?

I guess there could be a tool that also incorporates calling Tesseract, but that should probably be out of scope for this particular project (I am interested in building public tooling for this, just not in the scope of this repo).

pillow is not working properly

Using -J pillow results in terrible images. It looks like the image is resampled 4 to 1.

recode_pdf -v --dpi 300 \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow.pdf

Here is the -J pillow foreground layer:
pillow

For comparison, here is -J kakadu:
kakadu

The resulting files are approximately similar in size. Is pillow really this bad, or does it need different compression parameters? I wanted to try that out, but recode_pdf doesn't like the documented compression flags and throws an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

Test scan to experiment with

test_1.png.zip

Suggested actionables

  • Use sane defaults for pillow so quality is reasonable.
  • Show a clear, distinct error message so the user doesn't get an ambiguous ValueError when following the docs.
  • Update the documentation with Pillow compression flags.

License (in)compatibility

Hi, any progress on the license-incompatibility with OcrMyPDF (MPL-2.0)?

Would GScan2PDF (GPLv3) be a better fit? I'll try to study the differences...

pdfcomp: problems with inverted text that is often better in hocr.

This form https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf

First page saved to jpeg via this site: https://smallpdf.com

0001

Result of the left column is quite readable at the right screen-resolution.

ocrmypdf --pdfa-image-compression lossless -O0  0001.jpg formulierhocrjpg.pdf
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
Successfully converted to PDF, processing...
Scanning contents: 100%|████████████████████████| 1/1 [00:00<00:00, 73.93page/s]
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:09<00:00,  9.92s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 1/1 [00:00<00:00,  2.46page/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

pdfcomp formulierhocrjpg.pdf formulierhocrjpgkleiner.pdf
Compression factor: 9.617848822158944

formulierhocrjpgkleiner.pdf

It contains unreadable text on the left. The hocr contains "Toelichting 1.1", but in the PDF it is completely unreadable.

My patch for the inversion ratio makes it better readable:

formulierhocrjpgkleinerpatch.pdf

However, if you look up the mask picture, it doesn't contain this text in the left column at all.

So my patch isn't the only change needed for that routine.

Wrong resolution of mask image when foreground image is downsampled

I tried to use recode_pdf from an imagestack together with the option to downsample the foreground ("--fg-downsample 4").
The resulting pdf was unreadable.

I found out that the foreground (meaning the color layer) was resampled as expected.
When the pdf is written, however, the resolution of the mask layer (which should stay at the original size) is taken from the foreground and is therefore wrong.

As a solution I changed mrc.py to return the size of the mask and used those values from recode.py.

This works fine when encoding images to pdf.
I did not test it with the other modes.

Attached you find patches for mrc.py and recode.py.
patches.tar.gz

Add another font beyond the glyphless font to actually render fonts of the languages that are in use

There is an old branch here that implements the concept:

https://github.com/internetarchive/archive-pdf-tools/tree/show-text-on-selection

It looks a bit messy, and the code was older (wrt font sizes when I wrote it), but something was working back then:

image

image

This table is a set of fonts that we could expect to have around, I believe (system wide?):

Font Name   Installed Base Font   Comments
china-s     Heiti                 simplified Chinese
china-ss    Song                  simplified Chinese (serif)
china-t     Fangti                traditional Chinese
china-ts    Ming                  traditional Chinese (serif)
japan       Gothic                Japanese
japan-s     Mincho                Japanese (serif)
korea       Dotum                 Korean
korea-s     Batang                Korean (serif)

Then the question becomes: what do we do for Arabic fonts?

We will want to add the language to the word data as returned by archive-hocr-tools, and then on a per page basis insert the right font.


(Old bug: https://git.archive.org/merlijn/archive-pdf-tools/-/issues/4)

Run noise estimation on a part of the image

We probably only need to analyse part of the image to get a decent sense of camera (or other) noise. Running it on the whole image takes quite some time (it's currently the most costly operation).
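The idea can be sketched as: take a central crop and estimate noise only there. The estimator below (robust sigma from horizontal first differences) and the crop fraction are illustrative stand-ins, not the project's actual noise estimation:

```python
import numpy as np

def center_crop(img, frac=0.25):
    """Take a central crop covering `frac` of each dimension."""
    h, w = img.shape
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - ch) // 2, (w - cw) // 2
    return img[top:top + ch, left:left + cw]

def estimate_noise_sigma(img):
    """Robust sigma estimate from horizontal first differences.

    For i.i.d. Gaussian noise the differences have std sigma * sqrt(2);
    MAD / 0.6745 gives a robust estimate of that std.
    """
    d = np.diff(np.asarray(img, dtype=np.float64), axis=1)
    return float(np.median(np.abs(d)) / 0.6745 / np.sqrt(2.0))

# Estimating on a frac=0.25 crop touches ~1/16th of the pixels,
# which is usually close enough for a global noise decision.
```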

Use JBIG2 compression to determine if we want to blur or denoise before thresholding

We can threshold the original image, optimistically do the JBIG2 conversion, and only when the JBIG2 doesn't compress well either apply blur to the image and re-threshold, and/or denoise the threshold result (mask).

JBIG2 compression is fast, and our current noise estimation is not. Since our JBIG2 is lossless, good compression suggests that the image is not noisy.

This will help us speed up the PDF generation, since the Gaussian noise estimation is currently the most CPU-intensive part, which is kind of silly.
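The decision logic above can be sketched with a pluggable compressor. Here zlib stands in for the external JBIG2 encoder just to keep the sketch self-contained, and the ratio threshold is a made-up placeholder that would need tuning:

```python
import zlib

# Hypothetical threshold: a clean bitonal mask should compress far below this.
POOR_COMPRESSION_RATIO = 0.05

def mask_is_noisy(packed_mask: bytes, compress=zlib.compress,
                  threshold=POOR_COMPRESSION_RATIO) -> bool:
    """Heuristic: a losslessly-compressed mask that stays large is probably noisy.

    `compress` would be the lossless JBIG2 encoder in practice; zlib is
    only a stand-in so this sketch runs without external tools.
    """
    ratio = len(compress(packed_mask)) / len(packed_mask)
    return ratio > threshold

# Pipeline sketch: threshold first, encode optimistically, and only
# blur/re-threshold or denoise the mask when the heuristic fires.
```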

--jbig2 deprecated

I built the newest version of this tool, and it states I should use [--mask-compression {jbig2,ccitt}], so the main readme should be adapted accordingly.

Improve mask and background generation

There are a few things to improve in the mask generation:

  • The Sauvola binarisation currently uses fixed parameters, which is not ideal. We probably want to make some of those parameters dependent on the image DPI, and change the k value to 0.34 as the default.
  • We could look into better binarisation algorithms, like the multi-scale Sauvola mentioned here: tesseract-ocr/tesseract#3083 (comment)

The same applies to the hocr-specific mask generation.
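For reference, a self-contained NumPy sketch of the standard single-scale Sauvola threshold with the proposed k = 0.34 default; the project's actual implementation lives in the Cython code and may differ:

```python
import numpy as np

def sauvola_threshold(img, window=15, k=0.34, R=128.0):
    """Per-pixel Sauvola threshold: T = m * (1 + k * (s / R - 1)).

    m and s are the local mean and standard deviation over a
    window x window neighbourhood (window must be odd), computed
    via integral images so the cost is independent of window size.
    """
    img = np.asarray(img, dtype=np.float64)
    pad = window // 2
    p = np.pad(img, pad, mode="reflect")
    # Integral images (with a zero border) for sums and squared sums.
    ii = np.pad(p, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ii2 = np.pad(p ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    s = (ii[window:, window:] - ii[:-window, window:]
         - ii[window:, :-window] + ii[:-window, :-window])
    s2 = (ii2[window:, window:] - ii2[:-window, window:]
          - ii2[window:, :-window] + ii2[:-window, :-window])
    n = window * window
    mean = s / n
    std = np.sqrt(np.clip(s2 / n - mean ** 2, 0.0, None))
    return mean * (1.0 + k * (std / R - 1.0))

# Binarise: dark pixels (at or below the local threshold) become mask foreground.
# mask = img <= sauvola_threshold(img)
```

Making the window depend on the image DPI would then just be a matter of scaling `window` before calling this.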
