Comments (19)
The only thing that we need to fix for the Windows port to be initially functioning is a pattern I unfortunately use a lot: open a file, remove it, but keep the fd around. Windows doesn't allow this. See for example os.remove(tiff_in) in insert_images_mrc
from archive-pdf-tools.
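A minimal sketch of a Windows-friendly variant of that pattern: every handle is closed before the file is removed. The helper name roundtrip_through_tempfile and the ".tiff" suffix are made up for illustration; this is not the actual code of insert_images_mrc.

```python
import os
import tempfile


def roundtrip_through_tempfile(data: bytes) -> bytes:
    """Write data to a temporary file, read it back, then delete the file."""
    # mkstemp() returns an open OS-level descriptor plus the file's path.
    fd, path = tempfile.mkstemp(suffix=".tiff")
    try:
        # Each with-block closes its handle before execution continues.
        with os.fdopen(fd, "wb") as fp:
            fp.write(data)
        with open(path, "rb") as fp:
            result = fp.read()
    finally:
        # No handle is open any more, so the removal also works on Windows.
        os.remove(path)
    return result
```

The cost compared to the "remove early, keep the fd" approach is that the cleanup has to live in a finally block, so the file is still removed when an exception is raised in between.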
There is a kakadu release for Windows:
- Install with: msiexec /i /tmp/KDU805_Demo_Apps_for_Win64_200602.msi
- There is also jpegoptim.exe - I used https://github.com/XhmikosR/jpegoptim-windows/releases
We also need to solve this as long as we have no jbig2enc built for Windows:
Looking forward to this project working on Windows.
Isn't the binary here (jbig2.exe) sufficient? https://github.com/2m/image-to-jbig2-pdf
This is the project that it relies on: https://github.com/agl/jbig2enc
I haven't tried building it on Windows yet because it mentions Visual Studio, and I don't have that lying around. Perhaps the automake build could also work. Ideally someone would do something like the jpegoptim-windows maintainer did: provide a way to build the binary on GitHub Actions or some other CI.
I'm currently working on refactoring the code to make Windows mostly work (even with ccitt instead of jbig2 the results are pretty good). If you have a way to get a working jbig2enc binary for Windows, I'm happy to work with you to make sure this works too.
As of commit f072c89b40a061d55d272d6f128c1caec644925f the latest GitHub artifact should work on Windows. I'll issue a new release today or tomorrow. I tried this:
wine python.exe Scripts/recode_pdf --from-imagestack 'data/sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file data/sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata data/sim_english-illustrated-magazine_1884-12_2_15_scandata.xml --dpi 400 -m 2 -t 10 --mask-compression ccitt --denoise fast -v -o /tmp/out.pdf
(I didn't find a jbig2enc yet, so I'm using ccitt. The default JPEG2000 implementation is Pillow; using kakadu can definitely speed up the process some more, but Pillow seems like a sane default.)
Apologies for posting in a closed issue thread if that's out of the norm. I agree that a binary that can be automatically built would be ideal, but doing the following on Linux:
sudo apt-get install automake
sudo apt install libtool
sudo apt install libleptonica-dev
sudo apt install zlib1g-dev
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
gets me a jbig2 binary:
abc@host:~/Desktop/jbig2enc$ jbig2
No filename given
Usage: jbig2 [options] <input filenames...>
Options:
-b <basename>: output file root name when using symbol coding
-d --duplicate-line-removal: use TPGD in generic region coder
-p --pdf: produce PDF ready data
-s --symbol-mode: use text region, not generic coder
-t <threshold>: set classification threshold for symbol coder (def: 0.85)
-T <bw threshold>: set 1 bpp threshold (def: 188)
-r --refine: use refinement (requires -s: lossless)
-O <outfile>: dump thresholded image as PNG
-2: upsample 2x before thresholding
-4: upsample 4x before thresholding
-S: remove images from mixed input and save separately
-j --jpeg-output: write images from mixed input as JPEG
-a --auto-thresh: use automatic thresholding in symbol encoder
--no-hash: disables use of hash function for automatic thresholding
-V --version: version info
-v: be verbose
jbig2.exe from the above-linked GitHub repository seems to be a slightly older binary, but it still works nonetheless.
C:\out>jbig2
No filename given
Usage: jbig2 [options] <input filenames...>
Options:
-b <basename>: output file root name when using symbol coding
-d --duplicate-line-removal: use TPGD in generic region coder
-p --pdf: produce PDF ready data
-s --symbol-mode: use text region, not generic coder
-t <threshold>: set classification threshold for symbol coder (def: 0.85)
-T <bw threshold>: set 1 bpp threshold (def: 188)
-r --refine: use refinement (requires -s: lossless)
-O <outfile>: dump thresholded image as PNG
-2: upsample 2x before thresholding
-4: upsample 4x before thresholding
-S: remove images from mixed input and save separately
-j --jpeg-output: write images from mixed input as JPEG
-v: be verbose
I don't remember the exact provenance (Virustotal clears it), but here is an up-to-date binary (jbig2enc 0.28)
I can also try to compile it myself with VS later if that's needed. I'm still confused about "no jbig2enc built," though.
I think the binary is called jbig2 (and not jbig2enc), so that makes sense. I should have written "no jbig2enc build" instead of "built" perhaps.
What I mean is a clear and simple way for users to get a jbig2.exe file that they can trust. The point is mostly that I'm not psyched about having a link in the instructions to some online mega.co.nz or other download host that contains a jbig2.exe file. I think the file you linked on mega.co.nz might already just work if you set --mask-compression jbig2 and ensure it is in the PATH, but it's not something I'd like to link to in the README.
Having a way + some instructions on how to build it on Windows would be great. Then I can either build it myself or link to a known-good source (like your build). I have also built the jbig2 binary without problems on Linux, but I don't have a Windows system; I just test with wine.
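As a rough illustration of the "ensure it is in the PATH" part, a script could check for the binary and fall back to ccitt when it is missing. The helper pick_mask_compression is hypothetical and not part of archive-pdf-tools; only the two option values come from this thread.

```python
import shutil
from typing import Callable, Optional


def pick_mask_compression(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "jbig2" if a jbig2 executable is on PATH, else "ccitt"."""
    # On Windows, shutil.which() also tries the PATHEXT extensions,
    # so looking up "jbig2" finds jbig2.exe as well.
    return "jbig2" if which("jbig2") else "ccitt"
```

The result can then be passed as the value of --mask-compression. The which parameter only exists so the lookup can be replaced in tests.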
Thanks for the info/instructions btw, we can either use this issue or create a new one to sort out the jbig2.exe situation - either way is fine by me.
I think I found the source for the above binary: https://github.com/anotatta/jbig2enc/releases/tag/0.29 (perhaps that can be trusted a bit more).
This executable requires various libraries, though, like liblept-5.dll (leptonica), libgcc_s_seh-1.dll and libstdc++-6.dll, so those would also have to be packaged; ideally the binary is statically compiled.
Or we'd have to document getting leptonica and MinGW set up.
(Sorry, I'm not really a Windows user, so I am not sure what the usual sensible approach would be.)
I'm also looking at utilising pyinstaller to make a standalone .exe file with everything contained.
Ugh, you're right. I had a PATH environment variable set to 'C:\Program Files\Tesseract-OCR' so I overlooked the needed DLLs. Yes, ideally it should be statically compiled.
Some further info: anotatta's binary hosted on GitHub is 32-bit and requires a 32-bit version of libstdc++-6.dll (which also appears to be the version included in the Tesseract-OCR installation (source)).
Running ldd gives this result:
W:\jbig2enc-32bit>ldd jbig2.exe
ntdll.dll => /c/Windows/SYSTEM32/ntdll.dll (0x7ffe8b210000)
KERNEL32.DLL => /c/Windows/System32/KERNEL32.DLL (0x7ffe89130000)
KERNELBASE.dll => /c/Windows/System32/KERNELBASE.dll (0x7ffe87340000)
msvcrt.dll => /c/Windows/System32/msvcrt.dll (0x7ffe89090000)
WS2_32.dll => /c/Windows/System32/WS2_32.dll (0x7ffe88880000)
RPCRT4.dll => /c/Windows/System32/RPCRT4.dll (0x7ffe885a0000)
liblept-5.dll => /c/Program Files/Tesseract-OCR/liblept-5.dll (0x71040000)
libgcc_s_seh-1.dll => /c/Program Files/Tesseract-OCR/libgcc_s_seh-1.dll (0x61440000)
GDI32.dll => /c/Windows/System32/GDI32.dll (0x7ffe8b1b0000)
gdi32full.dll => /c/Windows/System32/gdi32full.dll (0x7ffe875e0000)
libstdc++-6.dll => /c/Program Files/Tesseract-OCR/libstdc++-6.dll (0x1050000)
msvcp_win.dll => /c/Windows/System32/msvcp_win.dll (0x7ffe880c0000)
ucrtbase.dll => /c/Windows/System32/ucrtbase.dll (0x7ffe881b0000)
USER32.dll => /c/Windows/System32/USER32.dll (0x7ffe88c20000)
libwinpthread-1.dll => /c/Program Files/Tesseract-OCR/libwinpthread-1.dll (0x64940000)
win32u.dll => /c/Windows/System32/win32u.dll (0x7ffe87320000)
libjpeg-8.dll => /c/Program Files/Tesseract-OCR/libjpeg-8.dll (0x6b800000)
libgif-7.dll => /c/Program Files/Tesseract-OCR/libgif-7.dll (0x65880000)
libopenjp2.dll => /c/Program Files/Tesseract-OCR/libopenjp2.dll (0x70b40000)
libpng16-16.dll => /c/Program Files/Tesseract-OCR/libpng16-16.dll (0x68b40000)
libtiff-5.dll => /c/Program Files/Tesseract-OCR/libtiff-5.dll (0x68ec0000)
zlib1.dll => /c/Program Files/Intel/WiFi/bin/zlib1.dll (0x73480000)
libwebp-7.dll => /c/Program Files/Tesseract-OCR/libwebp-7.dll (0x61940000)
libjbig-2.dll => /c/Program Files/Tesseract-OCR/libjbig-2.dll (0x64900000)
VCRUNTIME140.dll => /c/Windows/SYSTEM32/VCRUNTIME140.dll (0x7ffe74b30000)
liblzma-5.dll => /c/Program Files/Tesseract-OCR/liblzma-5.dll (0x63cc0000)
libstdc++-6.dll => /c/Program Files/Tesseract-OCR/libstdc++-6.dll (0x1050000)
The SYSTEM32 DLLs are a non-problem. zlib1.dll is also in the Tesseract-OCR installation folder.
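To see at a glance which DLLs would have to be shipped, the ldd listing can be filtered down to everything that does not resolve under the Windows directory. This is only a sketch: the function name non_system_dlls is made up, and it assumes MSYS2-style paths with Windows installed on drive C.

```python
import re

# One ldd line looks like: "name.dll => /path/to/name.dll (0x7ffe8b210000)"
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(.+?)\s+\(0x[0-9a-fA-F]+\)\s*$")


def non_system_dlls(ldd_output: str) -> dict:
    """Map DLL name -> resolved path for every DLL outside /c/Windows/."""
    found = {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match is None:
            continue  # prompt lines, blank lines, anything unexpected
        name, path = match.groups()
        if path.lower().startswith("/c/windows/"):
            continue  # ships with Windows, nothing to package
        found[name] = path  # a repeated DLL simply overwrites itself
    return found
```

Run on the listing above, the result would include zlib1.dll with the path /c/Program Files/Intel/WiFi/bin/zlib1.dll, which shows that on this machine it was picked up from the Intel directory rather than from the Tesseract-OCR folder.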
Redoing the 32-bit build is giving me a bit of trouble, but I managed to compile a 64-bit .exe with MSYS2 (MinGW x64) after applying this patch, and it requires a 64-bit version of libstdc++-6.dll.
W:\jbig2enc-64bit>ldd jbig2.exe
ntdll.dll => /c/Windows/SYSTEM32/ntdll.dll (0x7ffe8b210000)
KERNEL32.DLL => /c/Windows/System32/KERNEL32.DLL (0x7ffe89130000)
KERNELBASE.dll => /c/Windows/System32/KERNELBASE.dll (0x7ffe87340000)
ucrtbase.dll => /c/Windows/System32/ucrtbase.dll (0x7ffe881b0000)
WS2_32.dll => /c/Windows/System32/WS2_32.dll (0x7ffe88880000)
RPCRT4.dll => /c/Windows/System32/RPCRT4.dll (0x7ffe885a0000)
libstdc++-6.dll => /w/jbig2enc-64bit/libstdc++-6.dll (0x7ffe55580000)
libgcc_s_seh-1.dll => /c/Program Files/Tesseract-OCR/libgcc_s_seh-1.dll (0x61440000)
liblept-5.dll => /c/Program Files/Tesseract-OCR/liblept-5.dll (0x71040000)
msvcrt.dll => /c/Windows/System32/msvcrt.dll (0x7ffe89090000)
GDI32.dll => /c/Windows/System32/GDI32.dll (0x7ffe8b1b0000)
gdi32full.dll => /c/Windows/System32/gdi32full.dll (0x7ffe875e0000)
msvcp_win.dll => /c/Windows/System32/msvcp_win.dll (0x7ffe880c0000)
USER32.dll => /c/Windows/System32/USER32.dll (0x7ffe88c20000)
libwinpthread-1.dll => /c/Program Files/Tesseract-OCR/libwinpthread-1.dll (0x64940000)
win32u.dll => /c/Windows/System32/win32u.dll (0x7ffe87320000)
libgif-7.dll => /c/Program Files/Tesseract-OCR/libgif-7.dll (0x65880000)
libjpeg-8.dll => /c/Program Files/Tesseract-OCR/libjpeg-8.dll (0x6b800000)
libpng16-16.dll => /c/Program Files/Tesseract-OCR/libpng16-16.dll (0x68b40000)
libopenjp2.dll => /c/Program Files/Tesseract-OCR/libopenjp2.dll (0x70b40000)
libtiff-5.dll => /c/Program Files/Tesseract-OCR/libtiff-5.dll (0x68ec0000)
zlib1.dll => /c/Program Files/Intel/WiFi/bin/zlib1.dll (0x73480000)
libwebp-7.dll => /c/Program Files/Tesseract-OCR/libwebp-7.dll (0x61940000)
VCRUNTIME140.dll => /c/Windows/SYSTEM32/VCRUNTIME140.dll (0x7ffe74b30000)
libjbig-2.dll => /c/Program Files/Tesseract-OCR/libjbig-2.dll (0x64900000)
liblzma-5.dll => /c/Program Files/Tesseract-OCR/liblzma-5.dll (0x63cc0000)
When I tried to change settings to compile everything into a static binary, it errored saying that I don't have Leptonica installed (even though I do). I'm admittedly not very experienced with this.
Anyway,
1. Tesseract-OCR could be installed and then the 32-bit jbig2.exe simply dropped into the installation folder (may take a UAC prompt to approve moving the file into a folder in Program Files),
2. The Tesseract-OCR installation directory can be added to the PATH environment variable, and the 64-bit jbig2.exe & 64-bit libstdc++-6.dll can work elsewhere.
3. jbig2.exe gets distributed with all the DLLs?
4. Static build?
Suggesting to install Tesseract is a fine solution by me; I was planning to include that later in any case when I get to the OCR part of all of this (which runs before the PDF step).
Alternatively, I might try my hand at (4) at some point, but probably not in the next few weeks.
An additional note: after running pip install archive-pdf-tools the resulting scripts lack extensions, such that a command in Windows would have to be something like: python C:\Python38\Scripts\recode_pdf. Adding a .py extension to the file, if everything else is properly set up, should make recode_pdf available as a command. It's probably a matter of Python packaging, and it's unimportant other than that it should be documented for new users.
E.g. tif -> pdf would currently go something like this:
for %f IN (*.tif) do ( tesseract -l deu "%f" - hocr > "_out_%~nf.hocr" && python C:\Python38\Scripts\recode_pdf --from-imagestack "%f" --hocr-file "_out_%~nf.hocr" --dpi 600 -m 2 --hq-pages 1 --mask-compression jbig2 --denoise-mask fast --bg-downsample 3 -v -o "__out_%~nf.pdf" )
No matter how it's tweaked, I get not-so-great results for scanned book pages with photo illustrations so far. MRC doesn't seem to be an appropriate use case for that, though I recall (ABBYY-generated?) PDFs on archive.org typically looking a bit better.
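The batch loop above could also be driven from Python, which sidesteps the cmd quoting and the missing script extension. This is a sketch only: build_commands and convert are made-up names, the flags are copied unchanged from the loop, and the script path has to be supplied by the caller.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(tif: Path, recode_script: str, lang: str = "deu"):
    """Return (tesseract_cmd, recode_cmd, hocr_path) for one TIFF file."""
    hocr = tif.with_name(f"_out_{tif.stem}.hocr")
    pdf = tif.with_name(f"__out_{tif.stem}.pdf")
    # "-" makes tesseract write the hOCR to stdout, as in the loop above.
    tesseract_cmd = ["tesseract", "-l", lang, str(tif), "-", "hocr"]
    # The installed script has no extension, so run it through the interpreter.
    recode_cmd = [
        sys.executable, recode_script,
        "--from-imagestack", str(tif),
        "--hocr-file", str(hocr),
        "--dpi", "600", "-m", "2", "--hq-pages", "1",
        "--mask-compression", "jbig2",
        "--denoise-mask", "fast",
        "--bg-downsample", "3",
        "-v", "-o", str(pdf),
    ]
    return tesseract_cmd, recode_cmd, hocr


def convert(tif: Path, recode_script: str) -> None:
    """Run tesseract, then recode_pdf; stop at the first failure."""
    tesseract_cmd, recode_cmd, hocr = build_commands(tif, recode_script)
    with open(hocr, "wb") as out:
        subprocess.run(tesseract_cmd, stdout=out, check=True)
    subprocess.run(recode_cmd, check=True)
```

Calling convert(tif, r"C:\Python38\Scripts\recode_pdf") for each tif in sorted(Path(".").glob("*.tif")) would mirror the original loop. Passing argument lists instead of a shell string means file names with spaces need no extra quoting.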
An additional note: after running pip install archive-pdf-tools the resulting scripts lack extensions, such that a command in Windows would have to be something like: python C:\Python38\Scripts\recode_pdf. Adding a .py extension to the file, if everything else is properly set up, should make recode_pdf available as a command. It's probably a matter of python packaging, and it's unimportant other than it should be documented for new users.
Yeah, I also realised that happened, but I wasn't sure why it was like that. I guess Windows wants some extension, and it doesn't honour the shebang (doh). I suppose we could do that rename.
E.g. tif -> pdf would currently go something like this:
for %f IN (*.tif) do ( tesseract -l deu "%f" - hocr > "_out_%~nf.hocr" && python C:\Python38\Scripts\recode_pdf --from-imagestack "%f" --hocr-file "_out_%~nf.hocr" --dpi 600 -m 2 --hq-pages 1 --mask-compression jbig2 --denoise-mask fast --bg-downsample 3 -v -o "__out_%~nf.pdf" )
Just a side note: if you first do the tesseract calls, and then use hocr-combine-stream ( https://archive-hocr-tools.readthedocs.io/en/latest/#hocr-combine-stream ), you can pass a glob to --from-imagestack (and the combined hocr file).
No matter how it's tweaked, not great results for scanned book pages with photo illustrations so far. MRC doesn't seem appropriate use case for that, though I recall (ABBYY-generated?) PDFs on archive.org typically looking a bit better.
That could be, although I found it was usually quite similar to the Abbyy/LuraTech. If you can share an example we can take a look. There are a few variables to consider:
- Archive.org PDFs use kakadu for JPEG2000 compression; your command probably uses Pillow (OpenJPEG). I haven't done extensive quality testing with the OpenJPEG settings.
- You can sacrifice compression for quality. One thing to try is to not downsample the background as much. Alternatively, you could look at the --hq-pages argument, or just use the hq compression params for normal compression.
- The first 10 pages and the last 5 pages are compressed at high(er) quality, as opposed to the rest of the pages. This has been true for ~1 year now or so.
You can probably get decent compression ratios (in your case probably more if your input images are not JPEG2000) that way. The typical compression ratio for archive.org items is ~7-8x and ~2-3x if the entire thing is in high-quality mode. But we start with JPEG2000 images as input, which typically compress better than most other image formats (I know this is somewhat outdated).
If you want to fiddle with the --bg-compression-flags and --fg-compression-flags, look at the arguments that opj_compress takes (or grk_compress, or just the Pillow args). In any case, for the issue of quality with images/photos I would recommend opening a separate issue. NB: A colleague of mine is working on adding "ocr_photo" elements to the hOCR output of Tesseract, which would allow us to potentially special-case parts of the image that are considered to be a photo.