I am now working with an hOCR file produced by pdftree to extract the existing searchable text from a PDF, as suggested at the bottom of this issue:
ocropus/hocr-tools#117
recode_pdf --from-imagestack './2022-01-08*.tif' --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 2022-01-08a.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages
recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages
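One thing worth ruling out for the "cannot save with zero pages" error is an hOCR file that contains no page elements at all. The hOCR format marks pages with the class ocr_page; whether recode_pdf builds its output pages from those elements is an assumption here, not something confirmed by the traceback. The sketch below only counts such elements using the Python standard library. The names PageCounter and count_hocr_pages are made up for illustration.

```python
from html.parser import HTMLParser


class PageCounter(HTMLParser):
    """Counts start tags whose class attribute includes 'ocr_page'."""

    def __init__(self):
        super().__init__()
        self.pages = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.pages += 1


def count_hocr_pages(hocr_text):
    """Return the number of ocr_page elements found in an hOCR string."""
    counter = PageCounter()
    counter.feed(hocr_text)
    counter.close()
    return counter.pages


# Small inline sample with two pages (hypothetical content).
sample = (
    "<html><body>"
    "<div class='ocr_page' id='page_1'><span class='ocrx_word'>bla</span></div>"
    "<div class='ocr_page' id='page_2'></div>"
    "</body></html>"
)
print(count_hocr_pages(sample))
```

To run it against the real file, read the file into a string first, for example count_hocr_pages(open("anonymized.hocr", encoding="utf-8").read()). A result of 0 would point at the hOCR input rather than at the images or the input PDF.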
Even when I omit --hocr-file, hoping that the searchable text already inside the input PDF would be used, there is still an error:
recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 628, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/usr/local/lib/python3.8/dist-packages/archive_hocr_tools-1.1.13-py3.8.egg/hocr/parse.py", line 42, in hocr_page_iterator
    fp.seek(0)
AttributeError: 'NoneType' object has no attribute 'seek'
I anonymized the hOCR file with the substitution :%s/>.*<\/span>/>bla<\/span>/
anonymized.zip
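For anyone who wants to reproduce the anonymization step outside an editor, the same substitution can be sketched in Python. This is an illustration only, assuming the pattern is applied line by line as an editor substitution would be; the function name anonymize_hocr and the sample lines are made up.

```python
import re

# Same pattern as the editor substitution: from the first '>' on a line
# up to the last '</span>' on that line ('.' does not match newlines).
SPAN_TEXT = re.compile(r">.*</span>")


def anonymize_hocr(text, placeholder="bla"):
    """Replace everything between the first '>' and the last '</span>' per line."""
    replacement = ">" + placeholder + "</span>"
    # A function is used as the replacement so the placeholder is inserted literally.
    return SPAN_TEXT.sub(lambda match: replacement, text)


# Hypothetical two-line sample in the style of hOCR word spans.
sample = (
    "<span class='ocrx_word' title='bbox 1 2 3 4'>Secret</span>\n"
    "<span class='ocrx_word' title='bbox 5 6 7 8'>Name</span>"
)
print(anonymize_hocr(sample))
```

Because the match is greedy, a line that holds several nested spans is collapsed into a single placeholder span, so any word-level markup on that line is lost. That may matter when judging whether the anonymized file still has the same structure as the original.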