internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.

Home Page: https://archive-pdf-tools.readthedocs.io/en/latest/

License: GNU Affero General Public License v3.0

Python 87.76% Cython 11.74% Shell 0.49%
compression ocr pdf pdf-compression pdf-generation pdf-generator pdf-to-image python pdf-compressor

archive-pdf-tools's People

Contributors

cclauss, jrochkind, mara004, merlijnwajer, redsandro, stweil, tfmorris


archive-pdf-tools's Issues

PDF/UA improvements

VeraPDF now supports PDF/UA verification:

~/verapdf/verapdf --format xml --flavour ua1 /tmp/test.pdf  > /tmp/out.xml

We should fix the problems that it finds with our PDFs; I suspect that this will also help with the problems that Adobe finds.

This means at least:

  • Adding the primary language
  • Marking Figures as Artifacts
  • Adding alt text to Figures (we might not need to if we mark them as Artifacts)
  • Defining the language for text blocks
  • Potentially indicating the reading order?

Usefulness of MRC for decent quality compression of scanned book pages with illustrations

Opening a new issue as requested.

Here are some samples: https://mega.nz/folder/BRhChKob#xo-HHaJrD9VYN6YV3ur9WA

128.tif & 188.tif - the original cleaned-up 600 dpi scans
*-scantailor.tif - 600 dpi mixed output with bitonal text and color photos, as autodetected
*-scantailor-pdfbeads.pdf - the above .tif split into two layers, with the text layer JBIG2-encoded and the background layer JP2-encoded and downsampled to 150 dpi, everything assembled into a PDF using pdfbeads
*.jp2 - some compressed versions of the original; I forgot the settings. Page 128 is almost half the size of the PDFs, so I assume the PDF sizes can be slightly improved.

The folders have some residual files. ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.)

Can MRC output get anywhere comparable to these PDFs at the same or lower size? I'm also curious whether that can be achieved directly from the original cleaned-up scan, or whether the ScanTailor mixed-output step is still advised.

Use "linear" option from new pymupdf (if it doesn't break metadata writing)

This option could be used:

linear (bool) – Save a linearised version of the document. This option creates a file format for improved performance for Internet access. Excludes “incremental”.

Last time I tried to use it, it badly broke evince/poppler, so we might need to file bugs with them first.

pdfcomp: new tool, discussion, compression questions

The tool needs command line arguments much like recode_pdf (which we might want to rename), and those flags probably ought to be mostly shared.

Let's also use this to discuss issues of people testing pdfcomp now.

I don't understand this picture

(screenshot attached)

Why would we need so many colors smeared to the bottom right that are not behind the foreground mask?

Those could all be optimized away to facilitate Run Length Encoding.

Just some other errors with the current version: I can't get it to work with an hOCR file coming from pdftree to extract the existing searchable text from a PDF

I now work with an hOCR file coming from pdftree to extract the existing searchable text from a PDF, as suggested at the bottom of this issue:
ocropus/hocr-tools#117

recode_pdf --from-imagestack './2022-01-08*.tif' --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 2022-01-08a.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

Even if I leave out the hOCR file, in the hope that the searchable text already inside the input PDF would be used, there's still an error:
recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 628, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/usr/local/lib/python3.8/dist-packages/archive_hocr_tools-1.1.13-py3.8.egg/hocr/parse.py", line 42, in hocr_page_iterator
    fp.seek(0)
AttributeError: 'NoneType' object has no attribute 'seek'

I anonymized the hocr with :%s/>.*<\/span>/>bla<\/span>/
anonymized.zip

Some scans become inverted

I've noticed it twice before, and I thought it was a computer issue because I scanned too large at 600 dpi. But now I encounter it a third time, this time while scanning a small card at 300 dpi, so I'm beginning to think this might be a bug.

Original: Left. recode_pdf: Right.
image

My normal workflow:

ls -1 *.png > in.txt
tesseract -l nld+eng --dpi 300 in.txt out hocr
recode_pdf -v -m 2 --dpi 300 --from-imagestack "./*.png" --hocr-file out.hocr -o "out-recode.pdf"

Is this a known issue? Is there a known workaround? A quick search didn't turn up anything.
I'm not sure I can share the full-resolution card openly because it is copyrighted, but if this issue has never been seen before I am willing to email the full-resolution file for testing purposes.

$ recode_pdf --version
internetarchivepdf 1.4.14

A lot of fuzz in the background picture

Hi Merlijn,

I like this repo; it looks like the first serious open source MRC PDF solution I've found. I recently filed an issue with didjvu, which suffers from equally bad background fuzz. There, the fuzz was diminished by using the better DjVu algorithm from C44 instead of DjVuMake for removing the surrounding pixels from characters:

jwilk/didjvu#18

I guess when you try the picture over there you'll find a similarly fuzzy background with this MRC PDF compressor. It might be interesting to study the algorithm in the open source c44 to better separate the foreground from the background.

Support PDF generation without MRC

This used to work (currently called image mode 1), but it definitely doesn't work right now, so it would be nice to make it work again.

Support PDF generation/compression without hOCR files

This should be a no-brainer, but we need to deal with a few things:

  • We use hOCR files to determine the page size, based on the DPI encoded in the hOCR files (if present); otherwise we estimate it.
  • The code that generates the initial PDF with the text layer obviously relies on hOCR. As an alternative when we have no hOCR, we could just make a PDF with empty pages of the right size.

openjpeg is not working properly

Recommended actions discussed in this issue:

  • Remove -threads (or place flag last) for OpenJpeg (done: 31def81)
  • Allow threads to be specified (encoder agnostic e.g. -num_threads for Kakadu)
  • Merge debug messages from ROI build

(Original issue below.)


Using -J openjpeg results in lossless compression. Probably because it is the default:

$ opj_compress -h | grep -A 3 "Default encoding"
Default encoding options:
-------------------------

 * Lossless

My test image compresses 0.29 times (got 3 times bigger).

recode_pdf -v --dpi 300 \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf

recode_pdf should probably use some sane defaults with -J openjpeg.

Command line arguments don't work.

Unfortunately, manually setting the compression options doesn't work. According to opj_compress -h, the compression ratio can be adjusted with -q or -r:

-r <compression ratio>,<compression ratio>,...
    Different compression ratios for successive layers.
    The rate specified for each quality level is the desired
    compression factor (use 1 for lossless)
    Decreasing ratios required.
      Example: -r 20,10,1 means 
            quality layer 1: compress 20x, 
            quality layer 2: compress 10x 
            quality layer 3: compress lossless
    Options -r and -q cannot be used together.
-q <psnr value>,<psnr value>,<psnr value>,...
    Different psnr for successive layers (-q 30,40,50).
    Increasing PSNR values required, except 0 which can
    be used for the last layer to indicate it is lossless.
    Options -r and -q cannot be used together.

Yet the resulting files are identical size-wise regardless of compression-flags:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 20' \
  --bg-compression-flags ' -r 20' \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf
recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -q 5' \
  --bg-compression-flags ' -q 5' \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf

Am I missing something, or is this a bug with either recode_pdf or the documentation?

Testing openjpeg directly

$ opj_compress -r 750 -i in.png -o out.jp2

3.0 MB -> 34,3 kB

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

openjpeg version

$ opj_compress -h | grep openjp2
It has been compiled against openjp2 library v2.3.1.

Test scan to experiment with

test_1.png.zip

Add/implement regression tests for MRC

We can leverage the scripts in tools/ to perform the MRC compression separately, merge the final result, and create a diff between the original image and the MRC-compressed image. That way, given a database of images, we could improve the algorithm and see how it performs against known data/images.
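A minimal sketch of such a diff metric in plain NumPy (the metric choice is illustrative, not something the project currently uses):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of equal shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the original."""
    m = mse(a, b)
    return float("inf") if m == 0 else float(10.0 * np.log10(peak ** 2 / m))
```

A regression run would compute psnr(original, mrc_compressed) for every image in the database and fail if the score drops below a recorded per-image baseline.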

Bug in foreground/background separator choosing massive block instead of character outline.

A partly anonymized replay of my previous finding on compressing the bank statement with foreground downsampling, revealing a bug in the foreground binarizer/separator.

image

Add fg_downsample=12 in compress-pdf-images:

    mrc_gen = create_mrc_hocr_components(pil_image, hocr_word_data,
    #mrc_gen = create_mrc_hocr_components(pil_image, [],
            denoise_mask=DENOISE_FAST,
            bg_downsample=3,
            fg_downsample=12
            )

bankstatementgeknipt8noalphag.zip

ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatementgeknipt8noalphag.tiff outgeknipt8g.pdf
pdfcomp outgeknipt8g.pdf outgeknipt8-12g.pdf
outgeknipt8-12g.pdf
outgeknipt8g.pdf

Support pillow jpeg2000 writing

It's probably not as great as grok or kakadu, but it'd be nice to support it for folks who don't have the other programs installed.

Missing test suite?

It looks like archive-pdf-tools currently does not have an automated test suite.
I know that not all developers like to work this way, but I think providing a test suite can be very advantageous for quality assurance, to make sure the library works equally well on different platforms. It may also be helpful to verify changes for correctness and avoid regressions.
pytest is a popular choice as a test framework among open-source Python projects, for instance.

Add tests...

We have none, and it would be good to have some.

Small difference in compression ratio

See my post after the closed #30

There is a small compression-ratio difference between your setup and mine. Could that signal a memory leak, or some other difference in setup?

Error with hocr-files from Tesseract

When Tesseract generates this HOCR-file
img.zip

I get this error:

recode_pdf --from-imagestack ../210923-005.tif --hocr-file ~/img.hocr -o /tmp/outf.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 XOP
	 FMA4
	 FMA3
Creating text only PDF
Starting page generation at 2021-11-28T10:59:56.133494
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.9', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 262, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 1070, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 189, in create_tess_textonly_pdf
    imgfile = image_files[idx]
IndexError: list index out of range

For the first 5 pages there was no issue with the same command; it's only this page, so the hocr coming from Tesseract contains something not allowed.

Windows port

  • Add a pillow fallback for reading images (in case kakadu, openjpeg2000 or grok is not available)
  • os.remove causes sharing violations, since we remove files after opening them, which Windows doesn't allow
  • jbig2 encoding is not available right now

First recode_pdf test: 'numpy' has no attribute 'int'.

Just followed the install instructions, but the test recode_pdf --version gets a numpy-related error:

david@DESKTOP5:~/src/jbig2enc$ recode_pdf --version
Traceback (most recent call last):
  File "/home/david/.local/bin/recode_pdf", line 4, in <module>
    from internetarchivepdf.recode import recode
  File "/home/david/.local/lib/python3.10/site-packages/internetarchivepdf/__init__.py", line 2, in <module>
    from . import mrc
  File "/home/david/.local/lib/python3.10/site-packages/internetarchivepdf/mrc.py", line 36, in <module>
    from optimiser import optimise_gray, optimise_rgb, optimise_gray2, optimise_rgb2, fast_mask_denoise
  File "cython/optimiser.pyx", line 11, in init optimiser
  File "/home/david/.local/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Thanks.

Define scope of tooling and work to improve for that scope

Right now the tool naming is a bit confusing. The main tool is called "recode_pdf", but it doesn't really do PDF recoding; it does PDF creation, inserts text layers, and performs MRC compression.

Since I am working on adding a tool to actually recode existing PDFs (MRC compressing them, and not doing anything else for starters), it might make sense to think about renaming the tool names, but also define what the tools ought to do.

I think there are a few scenarios:

  • Given a set of images (and hOCR results), create a (compressed) PDF - like what ocrmypdf does.
  • Given an input PDF with just one image per page, do what the above step does.
  • Given an uncompressed PDF, compress (recode) the PDF. Optional features here are to (1) insert a text layer (2) make the PDF PDF/A compatible

Can others think of other scenarios?

I guess there could be a tool that also incorporates calling Tesseract, but that should probably be out of scope for this particular project (I am interested in building public tooling for this, just not in the scope of this repo).

pillow is not working properly

Using -J pillow results in terrible images. It looks like the image is resampled 4 to 1.

recode_pdf -v --dpi 300 \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow.pdf

Here is the -J pillow foreground layer:
pillow

For comparison, here is -J kakadu:
kakadu

The resulting files are approximately similar in size. Is pillow really this bad, or does it need different compression parameters? I wanted to try that out, but recode_pdf doesn't like the documented compression flags and throws an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

Test scan to experiment with

test_1.png.zip

Suggested actionables

  • Use sane defaults for pillow so quality is reasonable.
  • Show a clear, distinct error message so the user doesn't get an ambiguous ValueError when following the docs.
  • Update the documentation with Pillow compression flags.

License (in)compatibility

Hi, any progress on the license-incompatibility with OcrMyPDF (MPL-2.0)?

Would GScan2PDF (GPLv3) be a better fit? I'll try to study the differences...

pdfcomp: problems with inverted text that is often better in hocr.

This form https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf

First page saved to jpeg via this site: https://smallpdf.com

0001

Result of the left column is quite readable at the right screen-resolution.

ocrmypdf --pdfa-image-compression lossless -O0  0001.jpg formulierhocrjpg.pdf
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
Successfully converted to PDF, processing...
Scanning contents: 100%|████████████████████████| 1/1 [00:00<00:00, 73.93page/s]
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:09<00:00,  9.92s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 1/1 [00:00<00:00,  2.46page/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

pdfcomp formulierhocrjpg.pdf formulierhocrjpgkleiner.pdf
Compression factor: 9.617848822158944

formulierhocrjpgkleiner.pdf

It contains unreadable text on the left. The hocr contains "Toelichting 1.1", but in the PDF it is completely unreadable.

My patch for the inversion ratio makes it better readable:

formulierhocrjpgkleinerpatch.pdf

However, if you look up the mask picture, it doesn't contain this text in the left column at all.

So my patch isn't the only change needed for that routine.

Wrong resolution of mask image when foreground image is downsampled

I tried to use recode_pdf from an imagestack together with the option to downsample the foreground ("--fg-downsample 4").
The resulting pdf was unreadable.

I found out that the foreground (meaning the color layer) was resampled as expected.
When the pdf is written, however, the resolution of the mask layer (which should stay at the original size) is taken from the foreground and is therefore wrong.

As a solution I changed mrc.py to return the size of the mask and used those values from recode.py.

This works fine when encoding images to pdf.
I did not test it with the other modes.

Attached you find patches for mrc.py and recode.py.
patches.tar.gz

Add another font beyond the glyphless font to actually render fonts of the languages that are in use

There is an old branch here that implements the concept:

https://github.com/internetarchive/archive-pdf-tools/tree/show-text-on-selection

It looks a bit messy, and the code was older (wrt font sizes when I wrote it), but something was working back then:

image

image

This table is a set of fonts that we could expect to have around, I believe (system wide?):

Font Name   Installed Base Font   Comments
china-s     Heiti                 simplified Chinese
china-ss    Song                  simplified Chinese (serif)
china-t     Fangti                traditional Chinese
china-ts    Ming                  traditional Chinese (serif)
japan       Gothic                Japanese
japan-s     Mincho                Japanese (serif)
korea       Dotum                 Korean
korea-s     Batang                Korean (serif)

Then the question becomes: what do we do for Arabic fonts?

We will want to add the language to the word data as returned by archive-hocr-tools, and then on a per page basis insert the right font.


(Old bug: https://git.archive.org/merlijn/archive-pdf-tools/-/issues/4)

Run noise estimation on a part of the image

We probably only need to analyse part of the image to get a decent sense of camera (or other) noise. Running it on the whole image takes quite some time (it's currently the most costly operation).
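The idea can be sketched as: take a central crop and estimate noise only there. The estimator below (robust sigma from horizontal first differences) and the crop fraction are illustrative stand-ins, not the project's actual noise estimation:

```python
import numpy as np

def center_crop(img, frac=0.25):
    """Take a central crop covering `frac` of each dimension."""
    h, w = img.shape
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - ch) // 2, (w - cw) // 2
    return img[top:top + ch, left:left + cw]

def estimate_noise_sigma(img):
    """Robust sigma estimate from horizontal first differences.

    For i.i.d. Gaussian noise the differences have std sigma * sqrt(2);
    MAD / 0.6745 gives a robust estimate of that std.
    """
    d = np.diff(np.asarray(img, dtype=np.float64), axis=1)
    return float(np.median(np.abs(d)) / 0.6745 / np.sqrt(2.0))

# Estimating on a frac=0.25 crop touches ~1/16th of the pixels,
# which is usually close enough for a global noise decision.
```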

Use JBIG2 compression to determine if we want to blur or denoise before thresholding

We can threshold the original image, optimistically do the JBIG2 conversion, and only when the JBIG2 doesn't compress well either apply blur to the image and re-threshold, and/or denoise the threshold result (mask).

JBIG2 compression is fast, and our current noise estimation is not. Since our JBIG2 is lossless, good compression suggests that the image is not noisy.

This will help us speed up the PDF generation, since the Gaussian noise estimation is currently the most CPU-intensive part, which is kind of silly.
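The decision logic above can be sketched with a pluggable compressor. Here zlib stands in for the external JBIG2 encoder just to keep the sketch self-contained, and the ratio threshold is a made-up placeholder that would need tuning:

```python
import zlib

# Hypothetical threshold: a clean bitonal mask should compress far below this.
POOR_COMPRESSION_RATIO = 0.05

def mask_is_noisy(packed_mask: bytes, compress=zlib.compress,
                  threshold=POOR_COMPRESSION_RATIO) -> bool:
    """Heuristic: a losslessly-compressed mask that stays large is probably noisy.

    `compress` would be the lossless JBIG2 encoder in practice; zlib is
    only a stand-in so this sketch runs without external tools.
    """
    ratio = len(compress(packed_mask)) / len(packed_mask)
    return ratio > threshold

# Pipeline sketch: threshold first, encode optimistically, and only
# blur/re-threshold or denoise the mask when the heuristic fires.
```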

--jbig2 deprecated

I built the newest version of this tool, and it states I should use [--mask-compression {jbig2,ccitt}], so the main readme should be adapted accordingly.

Improve mask and background generation

There are a few things to improve in the mask generation:

  • The Sauvola binarisation currently uses fixed parameters, which is not ideal. We probably want to make some of those parameters dependent on the image DPI, and change the k value to 0.34 as the default.
  • We could look into better binarisation algorithms, like the multi-scale Sauvola mentioned here: tesseract-ocr/tesseract#3083 (comment)

The same applies to the hocr-specific mask generation.
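For reference, a self-contained NumPy sketch of the standard single-scale Sauvola threshold with the proposed k = 0.34 default; the project's actual implementation lives in the Cython code and may differ:

```python
import numpy as np

def sauvola_threshold(img, window=15, k=0.34, R=128.0):
    """Per-pixel Sauvola threshold: T = m * (1 + k * (s / R - 1)).

    m and s are the local mean and standard deviation over a
    window x window neighbourhood (window must be odd), computed
    via integral images so the cost is independent of window size.
    """
    img = np.asarray(img, dtype=np.float64)
    pad = window // 2
    p = np.pad(img, pad, mode="reflect")
    # Integral images (with a zero border) for sums and squared sums.
    ii = np.pad(p, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ii2 = np.pad(p ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    s = (ii[window:, window:] - ii[:-window, window:]
         - ii[window:, :-window] + ii[:-window, :-window])
    s2 = (ii2[window:, window:] - ii2[:-window, window:]
          - ii2[window:, :-window] + ii2[:-window, :-window])
    n = window * window
    mean = s / n
    std = np.sqrt(np.clip(s2 / n - mean ** 2, 0.0, None))
    return mean * (1.0 + k * (std / R - 1.0))

# Binarise: dark pixels (at or below the local threshold) become mask foreground.
# mask = img <= sauvola_threshold(img)
```

Making the window depend on the image DPI would then just be a matter of scaling `window` before calling this.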
