jrmuizel / pdf-extract

A rust library for extracting content from pdfs

pdf-extract's Introduction

pdf-extract

A rust library to extract content from PDF files.

let bytes = std::fs::read("tests/docs/simple.pdf").unwrap();
let out = pdf_extract::extract_text_from_mem(&bytes).unwrap();
assert!(out.contains("This is a small demonstration"));

pdf-extract's Issues

Making codebase more flexible and modularizing code for having a better software

Hey, I came across this package a few minutes ago and went through the code base. I see that there is a lot of room for code cleanup and for splitting the code into separate packages and files. I'm on it; I'll be making a PR in the next couple of days. Let me know if you have any comments ;)

how to compile + run examples/extract.rs from cargo?

Hi, thanks for building this; it looks useful for grabbing text from PDFs using Rust.
I'm new to Rust, so forgive my ignorance here.
I had to copy examples/extract.rs to lib/main.rs to be able to do cargo run some.pdf. Is there a way to compile examples/extract.rs directly?
I saw this error when trying:

» rustc examples/extract.rs 
error[E0463]: can't find crate for `pdf_extract`
 --> examples/extract.rs:1:1
  |
1 | extern crate pdf_extract;
  | ^^^^^^^^^^^^^^^^^^^^^^^^^ can't find crate

Handle other named CMaps

We need to embed the data somehow.

Panic on specific cases of "Separation"-type ColorSpace

I have a document that fails to parse: http://www.kozlonyok.hu/nkonline/MKPDF/hiteles/MK10200.pdf

It fails at this line:

pdf-extract/src/lib.rs

Line 1294 in 4dbdc35

 let alternate_space = pdf_to_utf8(cs[2].as_name().expect("second arg must be a name")); 

According to the specs, the second argument is either a name, or an actual color space object:

A Separation colour space is defined as follows:
[/Separation name alternateSpace tintTransform]
...
The alternateSpace parameter shall be an array or name object that identifies the alternate colour space, which may be any device or CIE-based colour space but may not be another special colour space (Pattern, Indexed, Separation, or DeviceN).

In my fork I just replaced the expect with unwrap_or, since I do not display anything.
To do this properly, I think the make_colorspace function would have to be refactored so that it can do the color space handling "twice".
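
A minimal sketch of accepting either form without panicking. It assumes a simplified stand-in `Obj` type rather than lopdf's real `Object`; `Obj`, `AlternateSpace` and `parse_alternate` are hypothetical names used only for illustration.

```rust
/// Simplified stand-in for a PDF object (hypothetical; not lopdf's `Object`).
#[derive(Debug, Clone, PartialEq)]
pub enum Obj {
    Name(String),
    Array(Vec<Obj>),
    Other,
}

/// The alternate colour space of a Separation space: either a bare name
/// such as /DeviceCMYK or an array such as [/ICCBased 12 0 R].
#[derive(Debug, PartialEq)]
pub enum AlternateSpace {
    Named(String),
    Array(Vec<Obj>),
}

/// Read the third element of `[/Separation name alternateSpace tintTransform]`
/// and report unexpected shapes as errors instead of panicking.
pub fn parse_alternate(cs: &[Obj]) -> Result<AlternateSpace, String> {
    match cs.get(2) {
        Some(Obj::Name(n)) => Ok(AlternateSpace::Named(n.clone())),
        Some(Obj::Array(a)) => Ok(AlternateSpace::Array(a.clone())),
        Some(other) => Err(format!(
            "alternate space must be a name or an array, got {:?}",
            other
        )),
        None => Err("Separation array is too short".to_string()),
    }
}
```

The array case could then be handed back to the same colour space parser, which is roughly the "twice" handling described above.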

panic : missing colorspace [67, 83, 112]

For some PDFs, I get the following panic: missing colorspace [67, 83, 112]

Is there a way to mitigate this?
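
One possible caller-side mitigation, sketched under assumptions: wrap the call in `std::panic::catch_unwind` and turn the panic into an error. `risky_extract` is a hypothetical stand-in for the library call, and this only works when the binary is built with the default `panic = "unwind"` setting.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for a library call that may panic on some inputs.
fn risky_extract(input: &[u8]) -> String {
    if input.is_empty() {
        panic!("missing colorspace");
    }
    String::from_utf8_lossy(input).into_owned()
}

/// Run the extractor and convert a panic into an `Err` carrying its message.
pub fn extract_or_error(input: &[u8]) -> Result<String, String> {
    panic::catch_unwind(AssertUnwindSafe(|| risky_extract(input))).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

The default panic hook still prints the message to stderr; this only keeps the process alive so the next file can be tried.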

unexpected smask type 168 0 R

I got this today with a PDF that I unfortunately can't share here:

thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:1230:24:
unexpected smask type 168 0 R

pdftotext -layout equivalent

I am wondering if there's something similar to pdftotext -layout where it tries to maintain the original layout. Cheers!

extract_text panics with an assertion error

It panics with 'assertion failed: xxxx'.
The failing assertion is at .cargo/git/checkouts/pdf-extract-1e3ad5dc34c14d18/5eca5d5/src/lib.rs:833:17, in code like this:

        let base_name = get_name_string(doc, font, b"BaseFont");
        let descendants = maybe_get_array(doc, font, b"DescendantFonts").expect("Descendant fonts required");
        let ciddict = maybe_deref(doc, &descendants[0]).as_dict().expect("should be CID dict");
        let encoding = maybe_get_obj(doc, font, b"Encoding").expect("Encoding required in type0 fonts");
        dlog!("base_name {} {:?}", base_name, font);

        match encoding {
            &Object::Name(ref name) => {
                let name = pdf_to_utf8(name);
                dlog!("encoding {:?}", name);
                assert!(name == "Identity-H");
            }
            &Object::Stream(ref stream) => {
                let contents = get_contents(stream);
                dlog!("Stream: {}", String::from_utf8(contents.clone()).unwrap());
            }
            _ => { panic!("unsupported encoding {:?}", encoding)}
        }

I guess the font encoding is not UTF-8?
Stack trace:

stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::begin_panic
   4: pdf_extract::make_font
   5: pdf_extract::Processor::process_stream
   6: pdf_extract::Processor::process_stream
   7: pdf_extract::output_doc
   8: pdf_extract::extract_text
   9: extract_text::text::Text::from_file
  10: extract_text::main
  11: std::rt::lang_start::{{closure}}
  12: std::panicking::try::do_call

Handle documents missing colorspace

The crate causes a panic


thread 'main' panicked at 'missing colorspace [67, 83, 112]', /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1248:85
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:61
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1028
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1412
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:65
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:50
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:188
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:205
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:464
  11: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:373
  12: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:328
  13: pdf_extract::make_colorspace::{{closure}}
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1248
  14: core::option::Option<T>::unwrap_or_else
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libcore/option.rs:419
  15: pdf_extract::make_colorspace
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1248
  16: pdf_extract::Processor::process_stream
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1369
  17: pdf_extract::output_doc
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1979
  18: dataset::main
             at src/main.rs:36
  19: std::rt::lang_start::{{closure}}
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/rt.rs:61
  20: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:48
  21: std::panicking::try::do_call
             at src/libstd/panicking.rs:287
  22: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:78
  23: std::panicking::try
             at src/libstd/panicking.rs:265
  24: std::panic::catch_unwind
             at src/libstd/panic.rs:396
  25: std::rt::lang_start_internal
             at src/libstd/rt.rs:47
  26: std::rt::lang_start
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/rt.rs:61
  27: main
  28: __libc_start_main
  29: _start
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.

Word spacing is not applied correctly

In the show_text code, word spacing is not applied correctly.

pdf-extract/src/lib.rs

Line 1158 in 38b1f15

let spacing = if is_space { ts.word_spacing } else { ts.character_spacing };

If you take a look at pdfminer's implementation, word spacing is added on top of character spacing:
https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfdevice.py#L171
https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfdevice.py#L183

The Rust implementation isn't exact in other ways either, because char spacing is added in a somewhat finicky way, but patching the code to ts.word_spacing + ts.character_spacing was good enough for me.
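
A small sketch of the advance computation as described above, following the PDF text-space formula tx = (w0 × Tfs + Tc + Tw) × Th with the TJ adjustment left out. The function and parameter names are illustrative and are not pdf-extract's own.

```rust
/// Horizontal advance for one glyph. Character spacing (Tc) always applies;
/// word spacing (Tw) is added on top of it when the glyph is a space.
pub fn glyph_advance(
    glyph_width: f64,        // w0, glyph width in text-space units
    font_size: f64,          // Tfs
    character_spacing: f64,  // Tc
    word_spacing: f64,       // Tw
    horizontal_scaling: f64, // Th, where 1.0 means 100%
    is_space: bool,
) -> f64 {
    let spacing = character_spacing + if is_space { word_spacing } else { 0.0 };
    (glyph_width * font_size + spacing) * horizontal_scaling
}
```

With a width of 0.5, font size 10, Tc = 1 and Tw = 2, a space advances by 8 units and any other glyph by 6.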

Spec violation TrueType without Encoding entry

This codepath in PdfSimpleFont::new() is not standard compliant

            None => {
                if let Some(type1_encoding) = type1_encoding {
                    let mut table = Vec::from(PDFDocEncoding);
                    dlog!("type1encoding");
                    for (code, name) in type1_encoding {
                        let unicode = glyphnames::name_to_unicode(&pdf_to_utf8(&name));
                        if let Some(unicode) = unicode {
                            table[code as usize] = unicode;
                        } else {
                            dlog!("unknown character {}", pdf_to_utf8(&name));
                        }
                    }
                    encoding_table = Some(table)
                } else if subtype == "TrueType" {
                    encoding_table = Some(encodings::WIN_ANSI_ENCODING.iter()
                        .map(|x| if let &Some(x) = x { glyphnames::name_to_unicode(x).unwrap() } else { 0 })
                        .collect());
                }
            }

p.267 PDF standard
"When the font has no Encoding entry, or the font descriptor’s Symbolic flag is set (in which case the Encoding
entry is ignored), this shall occur:
• If the font contains a (3, 0) subtable, the range of character codes shall be one of these: 0x0000 - 0x00FF,
0xF000 - 0xF0FF, 0xF100 - 0xF1FF, or 0xF200 - 0xF2FF. Depending on the range of codes, each byte
from the string shall be prepended with the high byte of the range, to form a two-byte character, which shall
be used to select the associated glyph description from the subtable.
• Otherwise, if the font contains a (1, 0) subtable, single bytes from the string shall be used to look up the
associated glyph descriptions from the subtable.
If a character cannot be mapped in any of the ways described previously, a conforming reader may supply a
mapping of its choosing."

On all documents I've tested, the encoding_table is never used when the font is TrueType without an encoding because the unicode_map is present, so supplying WIN_ANSI_ENCODING as a fallback makes no difference.
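
For illustration, the (3, 0) subtable rule quoted above could be sketched like this. The helper names are hypothetical, and trying each range in turn against a caller-supplied lookup is a simplification of "depending on the range of codes".

```rust
/// Candidate code ranges for a (3, 0) cmap subtable, per the quoted passage.
const RANGE_BASES: [u16; 4] = [0x0000, 0xF000, 0xF100, 0xF200];

/// Form the two-byte code by prepending the high byte of the range to `byte`.
pub fn symbolic_code(range_base: u16, byte: u8) -> u16 {
    (range_base & 0xFF00) | u16::from(byte)
}

/// Return the first candidate code for which the subtable has a glyph.
/// `has_glyph` stands in for a real cmap subtable lookup.
pub fn lookup_symbolic<F: Fn(u16) -> bool>(byte: u8, has_glyph: F) -> Option<u16> {
    RANGE_BASES
        .iter()
        .map(|&base| symbolic_code(base, byte))
        .find(|&code| has_glyph(code))
}
```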

unexpected smask type <</Type /Mask/S /Luminosity/G 13 0 R>>

Hi there, I'm trying to use this library to read a pdf, but for some reason it just doesn't work because of this mask.

Here is the error:

thread 'main' panicked at 'unexpected smask type <</Type /Mask/S /Luminosity/G 13 0 R>>', .cargo\registry\src\github.com-1ecc6299db9ec823\pdf-extract-0.6.2\src\lib.rs:1190:24
stack backtrace:
0: backtrace::backtrace::trace_unsynchronized
at .cargo\registry\src\github.com-1ecc6299db9ec823\backtrace-0.3.46\src\backtrace\mod.rs:66
1: std::sys_common::backtrace::print_fmt
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\sys_common\backtrace.rs:78
2: std::sys_common::backtrace::print::{{impl}}::fmt
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\sys_common\backtrace.rs:59
3: core::fmt::write
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libcore\fmt\mod.rs:1076
4: std::io::Write::write_fmtstd::sys::windows::stdio::Stderr
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\io\mod.rs:1537
5: std::sys_common::backtrace::print
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\sys_common\backtrace.rs:62
6: std::sys_common::backtrace::print
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\sys_common\backtrace.rs:49
7: std::panicking::default_hook::{{closure}}
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panicking.rs:198
8: std::panicking::default_hook
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panicking.rs:218
9: std::panicking::rust_panic_with_hook
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panicking.rs:486
10: std::panicking::begin_panic_handler
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panicking.rs:388
11: std::panicking::begin_panic_fmt
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panicking.rs:342
12: pdf_extract::apply_state
at C:\Users\rotc.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-extract-0.6.2\src\lib.rs:1190
13: pdf_extract::Processor::process_stream
at C:\Users\rotc.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-extract-0.6.2\src\lib.rs:1561
14: pdf_extract::output_doc
at C:\Users\rotc.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-extract-0.6.2\src\lib.rs:2028
15: pdf_extract::extract_textstd::path::PathBuf*
at C:\Users\rotc_.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-extract-0.6.2\src\lib.rs:1989
16: contador_palavras::read_pdf
at .\src\main.rs:32
17: contador_palavras::main
at .\src\main.rs:53
18: std::rt::lang_start::{{closure}}<core::result::Result<(), exitfailure::ExitFailure>>
at C:\Users\rotc_.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib\rustlib\src\rust\src\libstd\rt.rs:67
19: std::rt::lang_start_internal::{{closure}}
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\rt.rs:52
20: std::panicking::try::do_call
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panicking.rs:297
21: std::panicking::try
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panicking.rs:274
22: std::panic::catch_unwind
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\panic.rs:394
23: std::rt::lang_start_internal
at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src\libstd\rt.rs:51
24: std::rt::lang_start<core::result::Result<(), exitfailure::ExitFailure>>
at .rustup\toolchains\stable-x86_64-pc-windows-msvc\lib\rustlib\src\rust\src\libstd\rt.rs:67
25: main
26: invoke_main
at d:\A01_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78
27: __scrt_common_main_seh
at d:\A01_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
28: BaseThreadInitThunk
29: RtlUserThreadStart

Unsafe get and Missing char

When running examples/extract.rs on lockchain_for_deep_learning.pdf I get the following error:

thread 'main' panicked at 'no entry found for key', src/lib.rs:466:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The error happens on the following line (src/lib.rs, line 466):

dlog!("{} {}", code, unicode_map[&(code as u32)]);

When replaced with

dlog!("{} {}", code, unicode_map.get(&(code as u32)));

the error message changes to

thread 'main' panicked at 'missing char 2 in map {130: " ", 128: "•"} for <</BaseFont /QAJSTB+AdvPSSym/Encoding 1219 0 R/FirstChar 2/FontDescriptor 1221 0 R/LastChar 130/Subtype /Type1/ToUnicode 1202 0 R/Type /Font/Widths [791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 458 0 0]>>', src/lib.rs:873:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
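
A sketch of a lookup that does not panic on unmapped codes. Falling back to U+FFFD is an assumption made for this example, not what the crate does, and `decode_code` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Look up a character code without indexing into the map directly.
/// Unmapped codes fall back to U+FFFD REPLACEMENT CHARACTER.
pub fn decode_code(unicode_map: &HashMap<u32, String>, code: u32) -> String {
    unicode_map
        .get(&code)
        .cloned()
        .unwrap_or_else(|| "\u{FFFD}".to_string())
}
```

With the map from the message above ({130: " ", 128: "•"}), code 2 would yield the replacement character rather than a panic.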

Error handling - replace `.unwrap` and `panic` with `?`

Hi there! Thanks for creating and maintaining this :)

I like what this crate does, but I can't really use it as a library in my project because there are many (quite common) paths that will panic. I want to handle errors, not crash my app.

I suggest we replace all .unwrap, .expect and panic calls with ? and Err. We could add a custom Result type and implement From<&str> for Err for convenience.

UPDATE: I've opened a PR that changes most of the project to return results.
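
A rough sketch of what that could look like. `ExtractError`, `ExtractResult` and `first_descendant` are hypothetical names and are not taken from the PR.

```rust
use std::fmt;

/// Hypothetical error type for the crate.
#[derive(Debug, PartialEq)]
pub struct ExtractError(pub String);

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "pdf extraction failed: {}", self.0)
    }
}

impl std::error::Error for ExtractError {}

/// Lets `?` convert a plain `&str` message into an `ExtractError`.
impl From<&str> for ExtractError {
    fn from(msg: &str) -> Self {
        ExtractError(msg.to_string())
    }
}

/// Custom result alias, named so that it does not shadow `std::result::Result`.
pub type ExtractResult<T> = std::result::Result<T, ExtractError>;

/// Where the code previously used `.expect("Descendant fonts required")`,
/// the missing value becomes an error the caller can handle.
pub fn first_descendant(descendants: &[u32]) -> ExtractResult<u32> {
    let first = descendants.first().ok_or("Descendant fonts required")?;
    Ok(*first)
}
```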

Feature comparison with pdfminer?

DeviceN colorspace is not supported

Hi,

I got a panic parsing http://www.kozlonyok.hu/nkonline/mkpdf/hiteles/mk13190.pdf:
thread 'main' panicked at 'color_space [67, 83, 48] "DeviceN" [/DeviceN, [/Black], /DeviceCMYK, 1565 0 R, 1562 0 R]', src/lib.rs

missing char 48 in map

The crate is panicking when I try to extract text from pdf:
thread 'main' panicked at 'missing char 48 in map {40: "R", 7: "i", 31: "v", 59: "-", 57: "]", 18: "C", 26: "P", 28: "H", 4: "w", 37: "f", 5: "e", 51: "A", 50: "q", 43: "x", 46: "”", 25: "k", 27: ".", 60: "O", 34: "/", 52: "(", 17: "h", 11: "p", 30: "B", 10: " ", 2: "l", 24: "’", 14: "s", 35: "S", 3: "o", 29: "!", 45: "“", 44: "W", 41: "V", 15: "d", 19: "m", 47: "→", 13: "t", 20: "b", 53: ")", 16: "u", 12: "a", 58: "G", 9: "g", 38: "z", 55: ";", 56: "[", 1: "F", 39: "E", 42: "D", 49: "‘", 54: "J", 36: "I", 6: "r", 8: "n", 21: "c", 23: "y", 32: "j", 33: ",", 22: "T"}', /home/mykhailo/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.4/src/lib.rs:733:27

Missing LICENSE File

Hi, as you stated in Cargo.toml that this project is under the MIT license, would you like to add a LICENSE file to declare that you're publishing it under MIT?

Thank you

Handle characters without unicode information

Sometimes we don't have the information needed to turn glyphs into unicode. We can probably improve the situation by storing a mapping of a hash of the glyph data to the unicode value.
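
One way this idea could be sketched, assuming the raw glyph outline bytes are available. FNV-1a is used here only because it is simple and gives the same value on every run; `GlyphTable` and its methods are hypothetical.

```rust
use std::collections::HashMap;

/// 64-bit FNV-1a hash of a byte slice.
pub fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in data {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Maps a hash of glyph data to the character it is known to represent.
#[derive(Default)]
pub struct GlyphTable {
    by_hash: HashMap<u64, char>,
}

impl GlyphTable {
    /// Record that this glyph data renders `ch`.
    pub fn learn(&mut self, glyph_data: &[u8], ch: char) {
        self.by_hash.insert(fnv1a(glyph_data), ch);
    }

    /// Guess the character for glyph data seen in a font without Unicode info.
    pub fn guess(&self, glyph_data: &[u8]) -> Option<char> {
        self.by_hash.get(&fnv1a(glyph_data)).copied()
    }
}
```

This only matches glyphs whose data is byte-identical to a known one, so subsetted or re-hinted fonts may still miss.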

Literal out of range errors from overflowing_literals lint

In a future version of Rust the overflowing_literals lint will become deny by default. See rust-lang/rust#55632. When checking for breakage in crates uploaded to crates.io it was discovered that this crate will no longer compile thanks to this lint. The error produced is here.

Processing pollutes stderr

I see a ton of errors or metadata printed when running the example script. Using this in a CLI app makes the app not very useful to end users. Can the verbosity be made configurable?

remove some println! to be more CLI / TUI friendly

I'm using pdf-extract in a tui file manager, where I preview the files on demand.

Sometimes, when the pdf is corrupted, there's some println!("Unicode mismatch"); on the screen which I can't prevent.

Could those prints be replaced with logs instead, as proposed here?

I can count 5 other prints in lib.rs.

panic on unwrap on a None value

I'm getting a strange panic when reading an arbitrary PDF file

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /Users/me/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.6.5/src/lib.rs:955:37

I can't find any other issues that seem to be in this same area of the code; is the PDF malformed?

Empty text output

Not sure if you are collecting such examples, but here's one:

The PDF: Model-Card-Claude-2.pdf
pdftotext and pdf.js both work on this file.

Sanity Check - Unicode Mismatch

I have created a PDF search application that scours your folders in search of documents and allows you to find keywords in the document.

At first, I was not using this crate, but at some point it turned out that my app was not finding the right wording in the PDFs. https://github.com/piotroxp/pdfscan

I am learning Rust at the same time when solving my real life need, which is going over terabytes of scientific PDF articles and finding the keywords in them.

Since I want to build a warp drive xD and have a very admirable cache of papers, you can understand that it's critical for me to read all files regardless of encoding.

Today marks about 4 hours spent on looking at this error:

Unicode mismatch

For some PDF docs, it works. For others, mainly those downloaded from popular scientific publishers, I am hit with that log.

My repo is attached just so you can understand what I want to achieve.

Where is the issue? I am new to Rust. I'm pretty sure that Rust, being a systems programming language, does supply PDF libs that work regardless of encoding. I could be wrong about that.

How can I fix my code? Ideally, I would enjoy the ability to read in bytes raw, and only then transform that representation to utf8. Right now, I am unable to search through sci papers.

This ticket is created just because I find it amusing and mentally challenging to understand what I am doing wrong. Unless you are doing something wrong, which is also a learning experience.

Empty output file running extract example on a test pdf file

Hi, I'm trying to understand how to use your library, but I'm not able to run your example code correctly:

git clone https://github.com/jrmuizel/pdf-extract.git

cd pdf-extract

wget https://orimi.com/pdf-test.pdf

cargo run --example extract pdf-test.pdf

The output file is empty...

cat pdf-test.txt

Using pdftotext the output file is filled with text:

pdftotext -layout pdf-test.pdf

cat pdf-test.txt

PDF Test File

Congratulations, your computer is equipped with a PDF (Portable Document Format)
reader! You should be able to view any of the PDF documents and forms available on
our site. PDF forms are indicated by these icons:   or  .

Yukon Department of Education
Box 2703
Whitehorse,Yukon
Canada
Y1A 2C6

Please visit our website at: http://www.education.gov.yk.ca/

Thanks

RUSTSEC-2021-0017

Crate: postscript
Version: 0.11.1
Title: Read on uninitialized buffer may cause UB (impl Walue for Vec<u8>)
Date: 2021-01-30
ID: RUSTSEC-2021-0017
URL: https://rustsec.org/advisories/RUSTSEC-2021-0017
Solution: Upgrade to >=0.14.0
Dependency tree:
postscript 0.11.1
└── pdf-extract 0.6.4-alpha.0

thread 'main' panicked at 'missing char 33 in map

example code:

extern crate pdf_extract;
extern crate lopdf;
extern crate indicatif;

use std::env;
use std::fs;
use std::io::{self, Write};
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};
use indicatif::{ProgressBar, ProgressStyle};
use pdf_extract::*;
use lopdf::*;
use std::fs::File;
use std::panic::{self, AssertUnwindSafe};

fn main() {
let args: Vec<String> = env::args().collect();
if args.len() < 3 {
eprintln!("Usage: {} ", args[0]);
return;
}

let pdf_dir = Path::new(&args[1]);
let output_dir = Path::new(&args[2]);

if !output_dir.exists() {
    fs::create_dir_all(&output_dir).unwrap_or_else(|_| panic!("Could not create output directory: {:?}", output_dir));
}

process_directory(&pdf_dir, &output_dir);

}

fn process_directory(pdf_dir: &Path, output_dir: &Path) {
for entry in fs::read_dir(pdf_dir).unwrap() {
let entry = entry.unwrap();
let path = entry.path();

    if path.is_dir() {
        println!("Processed directory: {:?}", pdf_dir);
        println!("Next directory: {:?}", path);
        println!("Do you wish to proceed? (yes/no)");

        let mut input = String::new();
        io::stdin().read_line(&mut input).unwrap();

        if input.trim().eq_ignore_ascii_case("yes") {
            process_directory(&path, output_dir);
        } else {
            continue;
        }
    } else if path.extension().and_then(|s| s.to_str()) == Some("pdf") {
        // Wrap the call to process_pdf with catch_unwind to handle panics
        let result = panic::catch_unwind(AssertUnwindSafe(|| {
            process_pdf(&path, &output_dir);
        }));

        if let Err(e) = result {
            eprintln!("An error occurred while processing {:?}: {:?}", path, e);
        }
    }
}

}

fn process_pdf(pdf_path: &Path, output_dir: &Path) {
let pb = ProgressBar::new_spinner();
pb.set_style(ProgressStyle::default_spinner().template("{spinner:.green} {msg}"));
pb.enable_steady_tick(120);
pb.set_message("Processing PDF...");

let filename = pdf_path.file_stem().unwrap().to_str().unwrap();
let output_path = output_dir.join(filename).with_extension("txt");

let mut file = File::create(&output_path).expect("Could not create output file");

match Document::load(pdf_path) {
    Ok(doc) => {
        print_metadata(&doc);

        let mut output: Box<dyn OutputDev> = Box::new(PlainTextOutput::new(&mut file as &mut dyn Write));

        if let Err(e) = output_doc(&doc, output.as_mut()) {
            eprintln!("Error processing document {}: {}", pdf_path.display(), e);
        }
    }
    Err(e) => eprintln!("Failed to load document {}: {}", pdf_path.display(), e),
}

pb.finish_with_message("Done.");

let time = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
let mut log_file = fs::OpenOptions::new().append(true).create(true).open(output_dir.join(format!("processed_{}.log", time))).unwrap();

writeln!(log_file, "Processed PDF: {:?}", pdf_path).unwrap();

}

fn print_metadata(doc: &Document) {
_ = doc; // Simulate using the doc variable, or implement logic here
}

warning: fields name, alternate_space, and tint_transform are never read
--> src/lib.rs:1310:5
|
1309 | pub struct Separation {
| ---------- fields in this struct
1310 | name: String,
| ^^^^
1311 | alternate_space: AlternateColorSpace,
| ^^^^^^^^^^^^^^^
1312 | tint_transform: Box,
| ^^^^^^^^^^^^^^
|
= note: Separation has a derived impl for the trait Clone, but this is intentionally ignored during dead code analysis

warning: pdf-extract (lib) generated 7 warnings
Compiling pdf-extract v0.7.2 (/home/walter/programs/pdf-extract)
warning: unused Result that must be used
--> bin/extract.rs:72:5
|
72 | output_doc(&doc, output.as_mut());
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: this Result may be an Err variant, which should be handled
= note: #[warn(unused_must_use)] on by default
help: use let _ = ... to ignore the resulting value
|
72 | let _ = output_doc(&doc, output.as_mut());
| +++++++

warning: pdf-extract (bin "pdf-extract") generated 1 warning
Finished dev [unoptimized + debuginfo] target(s) in 1.09s
Running target/debug/pdf-extract /media/ /media/extract_pdfminer
⠁
Done.
Done.
⠚ Processing PDF...
thread 'main' panicked at 'missing char 33 in map {48: "∙", 34: "(", 36: ")"} for <</Type /Font/Subtype /TrueType/BaseFont /AAAAAI+CambriaMath/FontDescriptor 161 0 R/ToUnicode 90 0 R/FirstChar 33/LastChar 48/Widths [698 415 672 415 351 469 605 728 579 579 728 579 440 507 579 247]>>', src/lib.rs:750:27

Keep getting thread main panic.

Consider supporting ActualText

I have several PDFs with some very weird ToUnicode mappings. Some characters get extracted as lowercase instead of uppercase, even though the CID corresponds to the ASCII uppercase version. Unfortunately this breaks later processing steps for these documents.

For example I have the following: https://stickman.hu/junk/actualtext_example.pdf

Here, the line

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 EU rendeletek” szövegrész helyébe

Extracts as

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 eU rendeletek” szövegrész helyébe
                                                                             ^
                                                                             |
                                                                         Lowercase

Note that several PDF viewers (e.g. the Firefox built-in one) also copy the wrong text. Chrome, Okular, and poppler in general will capitalize the E in EU. pdftotext from the poppler suite also works OK.

Now why is this? For some reason, the CIDs for both E and e are mapped to the ASCII code point 101 (lowercase e) in the font.

Why is it handled OK by some extractors? Because this is what the actual operations look like around that part:

op: Operation { operator: "BDC", operands: [/Span, <</ActualText (��^@E)>>] }
op: Operation { operator: "Td", operands: [30.888, 0] }
op: Operation { operator: "Tj", operands: [(E)] }
op: Operation { operator: "EMC", operands: [] }

The ActualText thing here is described in the PDF standard "14.9.4 Replacement Text", and has a special code path in poppler: https://github.com/freedesktop/poppler/blob/315ab3006fb24bf47b595343e6a3e90995f2a588/poppler/Gfx.cc#L5052-L5059

As far as I see, handling this case would need some refactoring around show_text, and I'm really not sure how to do it. Probably a fully separate code path for the "simple" and the replacement text use-cases, both of which would call output_character in the end.
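
As a starting point, decoding the ActualText operand itself might look like the sketch below. In the dump above it is a UTF-16BE string (FE FF byte order mark, then 00 45 for "E"). Treating non-BOM strings as Latin-1 is a simplification of PDFDocEncoding, and `decode_text_string` is a hypothetical helper.

```rust
/// Byte order mark that introduces a UTF-16BE PDF text string.
const UTF16_BE_BOM: [u8; 2] = [0xFE, 0xFF];

/// Decode a PDF text string: UTF-16BE when it starts with FE FF, otherwise
/// byte-for-byte as Latin-1 (an approximation of PDFDocEncoding).
pub fn decode_text_string(bytes: &[u8]) -> String {
    if let Some(rest) = bytes.strip_prefix(&UTF16_BE_BOM) {
        // Pair up the remaining bytes as big-endian 16-bit code units;
        // a trailing odd byte is ignored.
        let units: Vec<u16> = rest
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        bytes.iter().map(|&b| char::from(b)).collect()
    }
}
```

The decoded string would then replace whatever the Tj operators between BDC and EMC produce.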

P.S. 1: It seems like this guy had a related issue back in the day: https://stackoverflow.com/questions/17737776/pdf-text-extraction-issue-font-capitalization-inconsistencies

P.S. 2: In the end, I might just expose the CID on the output_character interface and do the same workaround I did in python: https://github.com/badicsalex/hun_law_py/blob/master/hun_law/extractors/pdf.py#L88-L93

P.S. 3: Thanks for taking the time to fix some of the bugs I reported, I really appreciate it.

Performance: use nom_parser in lopdf instead of pom_parser

Apparently lopdf is also changing to nom_parser as a default, but since this is forced in pdf-extract's Cargo.toml, it should be modified there too.
The performance change is dramatic: for one of my sample files I get a 60%+ reduction in runtime.

See J-F-Liu/lopdf#157

/ToUnicode spec violation

get_unicode_map includes a check that if the /ToUnicode key is a name, that it must be /Identity-H.
I have searched the pdf spec and checked a bunch of personal pdfs to see whether this is justified.
It seems to be a violation of the standard. Is there an example of this occurring in practice?

Failure to extract text from AMD GPU ISA docs

Frankly, I have no clue whether the problem lies in pdf-extract or in one of its dependencies. Please redirect me if this issue is misplaced.

For the public AMD GPU ISA documentation, such as:
https://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf
pdf-extract extracts blank pages. Other extractors, such as PyPDF2, extract the text just fine.

RUSTSEC-2020-0144

Crate: time
Version: 0.1.44
Title: Potential segfault in the time crate
Date: 2020-11-18
ID: RUSTSEC-2020-0071
URL: https://rustsec.org/advisories/RUSTSEC-2020-0071
Solution: Upgrade to >=0.2.23
Dependency tree:
time 0.1.44

Crate: lzw
Version: 0.10.0
Warning: unmaintained
Title: lzw is unmaintained
Date: 2020-02-10
ID: RUSTSEC-2020-0144
URL: https://rustsec.org/advisories/RUSTSEC-2020-0144
Dependency tree:
lzw 0.10.0
└── lopdf 0.26.0
├── pdf-extract 0.6.4-alpha.0

panicked at 'attempt to add with overflow'

attempt to add with overflow

thread 'main' panicked at 'attempt to add with overflow', ~/.cargo/registry/src/github.com-1ecc6299db9ec823/adobe-cmap-parser-0.3.3/src/lib.rs:202:41

backtrace

stack backtrace:
   0: rust_begin_unwind
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/panicking.rs:111:5
   3: adobe_cmap_parser::get_unicode_map
             at /Users/wangli/.cargo/registry/src/github.com-1ecc6299db9ec823/adobe-cmap-parser-0.3.3/src/lib.rs:202:41
   4: pdf_extract::get_unicode_map
             at /Users/wangli/Repos/pdf-extract/src/lib.rs:815:24
   5: pdf_extract::PdfCIDFont::new
             at /Users/wangli/Repos/pdf-extract/src/lib.rs:881:27
   6: pdf_extract::make_font
             at /Users/wangli/Repos/pdf-extract/src/lib.rs:319:17
   7: pdf_extract::Processor::process_stream::{{closure}}
             at /Users/wangli/Repos/pdf-extract/src/lib.rs:1483:84
   8: std::collections::hash::map::Entry<K,V>::or_insert_with
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/std/src/collections/hash/map.rs:2559:43
   9: pdf_extract::Processor::process_stream
             at /Users/wangli/Repos/pdf-extract/src/lib.rs:1483:32
  10: pdf_extract::output_doc
             at /Users/wangli/Repos/pdf-extract/src/lib.rs:2044:9
  11: extract::main
             at ./extract.rs:39:5
  12: core::ops::function::FnOnce::call_once
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ops/function.rs:507:5

new line

Extraction does not output carriage returns. After looking at an output file in a hex editor, the reason is clear: line breaks are detected perfectly fine, but only the line feed character (0x0A) is inserted, not the carriage return + line feed pair (0x0D 0x0A) that a Windows text file expects.

A better behavior would be to replace each 0x0A in the output string with 0x0D 0x0A. This would definitely increase usability.
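Until the library offers this, the conversion can be done on the extracted string by the caller. The following is a minimal sketch, not part of pdf-extract; the helper name `to_crlf` is made up for illustration. It inserts 0x0D before every 0x0A that is not already preceded by one, so text that already uses CRLF is left unchanged.

```rust
/// Convert bare LF line endings to CRLF.
/// Hypothetical helper, not part of pdf-extract.
fn to_crlf(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + text.len() / 16);
    let mut prev = '\0';
    for ch in text.chars() {
        // Only add a carriage return when the LF is not already part of a CRLF pair.
        if ch == '\n' && prev != '\r' {
            out.push('\r');
        }
        out.push(ch);
        prev = ch;
    }
    out
}
```

Applying it to the string returned by the extraction call before writing the file should give output that Windows editors display with the expected line breaks.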

License?

Hi, I'm trying different PDF extraction libraries and I'd very much prefer to use one in Rust because that's what my program is written in, but I can't use a library without a license.

Could you please consider adding one?
If you don't care much about licenses, maybe the MIT one would be adequate. Otherwise GPL or LGPL is also fine with me.

Performance: glyphnames::name_to_unicode is very slow

I have a very accent-heavy Hungarian document I'm parsing, and 95% of the processing time was spent in name_to_unicode.

Please consider using a HashMap or, even better, a compile-time perfect hash function. Example patch here:
badicsalex@5cb9b67
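As a rough illustration of the HashMap suggestion (this is not the linked patch, and the real signature of name_to_unicode is not shown in this issue), a lazily built map could look like the sketch below. The table `GLYPHS` is a tiny hand-picked subset of the Adobe Glyph List, and `lookup_glyph` is a hypothetical name.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Tiny illustrative subset; the real glyph list has thousands of entries.
static GLYPHS: &[(&str, u16)] = &[
    ("aacute", 0x00E1),
    ("eacute", 0x00E9),
    ("odieresis", 0x00F6),
    ("ohungarumlaut", 0x0151),
    ("uhungarumlaut", 0x0171),
];

// Build the map once, on first use, and share it afterwards.
fn glyph_map() -> &'static HashMap<&'static str, u16> {
    static MAP: OnceLock<HashMap<&'static str, u16>> = OnceLock::new();
    MAP.get_or_init(|| GLYPHS.iter().copied().collect())
}

// Hypothetical replacement for a slow lookup: one hash probe per glyph name.
fn lookup_glyph(name: &str) -> Option<u16> {
    glyph_map().get(name).copied()
}
```

A compile-time perfect hash would avoid even the one-time construction cost, at the price of an extra build dependency.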

Panic on the SC command when using Pattern colorspace

Hi,

I have a minimal reproducing document of a panic: https://stickman.hu/junk/pdf_extract_repro_1.pdf
The panic occurs in the as_num function, because "/P0" is not a number.

This is caused by what I assume is a copy-paste mistake in the handler at https://github.com/jrmuizel/pdf-extract/blob/master/src/lib.rs#L1413, where stroke_colorspace should be used instead of fill_colorspace.

`pdf_extract::extract_text` returns an empty string for a non-empty PDF

pdf_extract::extract_text returns an empty string for a non-empty PDF. Are there PDF attributes that are known not to work that I should check for? Or is there anything else I can do to narrow down the issue? (Unfortunately, I don't have a reproducible case I can share.)

unexpected smask type 554 0 R

Hi,
First of all, thank you for a very useful crate! :)

I tried to convert a PDF file ( https://arxiv.org/pdf/2108.11950v1.pdf ) to text and got the following panic:

thread 'main' panicked at 'unexpected smask type 554 0 R', /home/user/.cargo/git/checkouts/pdf-extract-1e3ad5dc34c14d18/e03d663/src/lib.rs:1190:24

It seems that the problem is somewhere at page 7 (I have text output from pages 1-6).

Panic: Unexpected smask type <</Type /Mask/S /Luminosity/G 6 0 R>>

Parsing this PDF panics. Is there a reason that parsing can't return a Result?
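Until the library returns errors for such cases, a caller can contain the panic at the call site. The sketch below uses only the standard library; `catch_panic` is a hypothetical helper, and it only works when the program is built with the default panic = "unwind" setting. It is a workaround under those assumptions, not a substitute for proper error handling inside the parser.

```rust
use std::panic::{self, UnwindSafe};

/// Run `f` and turn a panic into an `Err` carrying the panic message.
/// Hypothetical helper; requires the default `panic = "unwind"` strategy.
fn catch_panic<T>(f: impl FnOnce() -> T + UnwindSafe) -> Result<T, String> {
    panic::catch_unwind(f).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}
```

The extraction call would go inside the closure passed to `catch_panic`. The default panic hook still prints the message to stderr unless it is replaced.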

Announcing Tauri 1.4.0 _ Tauri Apps.pdf

Upgrade lopdf to version 0.26 to resolve panic

Installed pdf-extract = "^0.6" and got the following error when trying to parse a PDF:

thread 'main' panicked at 'attempted to leave type `linked_hash_map::Node<std::vec::Vec<u8>, object::Object>` uninitialized, which is invalid', /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b/library/core/src/mem/mod.rs:663:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I searched and found a similar bug here that was solved by upgrading lopdf to 0.26.

Installing the latest version leads to new errors in pdf-extract because the Document was moved since lopdf 0.24:

output_doc(&doc, output.as_mut());
   |                ^^^^ expected struct `lopdf::document::Document`, found struct `lopdf::Document`

Sorry, total Rust noob here

Multiple panics on Arxiv.org PDFs

I'm attempting to extract the text from multiple PDFs from arxiv.org, and 15 out of the 20 I just attempted resulted in panics, many (but not all) apparently Unicode-related. Here are the links to the PDFs that failed:

Here are some of the errors:

For http://arxiv.org/pdf/2312.00064v1:

Unicode mismatch true fl "fl" Ok("ﬂ") [64258]
Unicode mismatch true fi "fi" Ok("ﬁ") [64257]
Unicode mismatch true fl "fl" Ok("ﬂ") [64258]
thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 16 in map {60: "\u{f8f2}", 208: "Γ", 218: "Ω", 65: "\u{f8f8}", 217: "Ψ", 210: "Θ", 213: "Π", 63: "\u{f8e6}", 50: "\u{f8ee}", 160: " ", 57: "\u{f8fc}", 64: "\u{f8ed}", 212: "Ξ", 55: "\u{f8fa}", 209: "∆", 66: "\u{f8ec}", 49: "\u{f8f6}", 59: "\u{f8fe}", 48: "\u{f8eb}", 67: "\u{f8f7}", 51: "\u{f8f9}", 61: "\u{f8fd}", 52: "\u{f8f0}", 62: "\u{f8f4}", 211: "Λ", 159: "√", 53: "\u{f8fb}", 215: "Υ", 58: "\u{f8f3}", 214: "Σ", 54: "\u{f8ef}", 56: "\u{f8f1}", 216: "Φ"} for <</Type /Font/Subtype /Type1/BaseFont /VSLKGG+CMEX10/FirstChar 0/FontDescriptor 4273 0 R/LastChar 125/ToUnicode 4304 0 R/Widths 4259 0 R>>

For http://arxiv.org/pdf/2312.00140v1:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 0 in map {50: "\u{f8ee}", 54: "\u{f8ef}", 67: "\u{f8f7}", 53: "\u{f8fb}", 48: "\u{f8eb}", 160: " ", 63: "\u{f8e6}", 215: "Υ", 61: "\u{f8fd}", 214: "Σ", 57: "\u{f8fc}", 66: "\u{f8ec}", 60: "\u{f8f2}", 64: "\u{f8ed}", 209: "∆", 65: "\u{f8f8}", 208: "Γ", 218: "Ω", 159: "√", 213: "Π", 211: "Λ", 49: "\u{f8f6}", 212: "Ξ", 58: "\u{f8f3}", 56: "\u{f8f1}", 51: "\u{f8f9}", 62: "\u{f8f4}", 210: "Θ", 217: "Ψ", 52: "\u{f8f0}", 55: "\u{f8fa}", 216: "Φ", 59: "\u{f8fe}"} for <</Type /Font/Subtype /Type1/BaseFont /BJKPRR+CMEX10/FirstChar 0/FontDescriptor 1313 0 R/LastChar 88/ToUnicode 1374 0 R/Widths 1287 0 R>>

For http://arxiv.org/pdf/2309.02511v2:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 44 in map {43: "⇁", 165: "Ξ", 13: "γ", 91: "♭", 184: "λ", 46: "▷", 74: "J", 85: "U", 78: "N", 121: "y", 111: "o", 28: "τ", 89: "Y", 101: "e", 176: "γ", 191: "τ", 162: "∆", 15: "ϵ", 5: "Π", 109: "m", 178: "ϵ", 177: "δ", 103: "g", 98: "b", 174: "α", 173: "Ω", 125: "℘", 194: "χ", 100: "d", 8: "Φ", 94: "⌣", 26: "ρ", 68: "D", 30: "ϕ", 12: "β", 75: "K", 54: "6", 70: "F", 175: "β", 181: "θ", 104: "h", 34: "ε", 4: "Ξ", 42: "⇀", 62: ">", 23: "ν", 119: "w", 38: "ς", 11: "α", 90: "Z", 195: "ψ", 193: "ϕ", 180: "η", 86: "V", 17: "η", 124: "ȷ", 35: "ϑ", 128: "ψ", 73: "I", 36: "ϖ", 166: "Π", 189: "ρ", 112: "p", 170: "Ψ", 107: "k", 77: "M", 120: "x", 99: "c", 76: "L", 93: "♯", 27: "σ", 64: "∂", 190: "σ", 50: "2", 29: "υ", 53: "5", 188: "π", 24: "ξ", 115: "s", 97: "a", 168: "Υ", 164: "Λ", 9: "Ψ", 39: "φ", 41: "↽", 25: "π", 118: "v", 66: "B", 67: "C", 187: "ξ", 81: "Q", 83: "S", 88: "X", 179: "ζ", 95: "⌢", 3: "Λ", 52: "4", 14: "δ", 122: "z", 31: "χ", 183: "κ", 22: "µ", 113: "q", 80: "P", 60: "<", 102: "f", 47: "◁", 82: "R", 32: "ψ", 6: "Σ", 110: "n", 169: "Φ", 84: "T", 123: "ı", 167: "Σ", 192: "υ", 87: "W", 161: "Γ", 106: "j", 37: "ϱ", 48: "0", 117: "u", 71: "G", 72: "H", 65: "A", 108: "l", 49: "1", 1: "∆", 96: "ℓ", 2: "Θ", 51: "3", 186: "ν", 59: ",", 63: "⋆", 16: "ζ", 105: "i", 92: "♮", 7: "Υ", 56: "8", 55: "7", 21: "λ", 160: " ", 33: "ω", 57: "9", 20: "κ", 58: ".", 69: "E", 116: "t", 18: "θ", 10: "Ω", 40: "↼", 114: "r", 19: "ι", 182: "ι", 0: "Γ", 185: "µ", 126: "\u{20d7}", 79: "O", 163: "Θ", 61: "/"} for <</Type /Font/Subtype /Type1/BaseFont /APPDUE+CMMI10/FirstChar 11/FontDescriptor 1143 0 R/LastChar 122/ToUnicode 1193 0 R/Widths 1129 0 R>>

For http://arxiv.org/pdf/2312.00735v1:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 118 in map {159: "√", 62: "\u{f8f4}", 57: "\u{f8fc}", 218: "Ω", 213: "Π", 63: "\u{f8e6}", 64: "\u{f8ed}", 50: "\u{f8ee}", 66: "\u{f8ec}", 212: "Ξ", 55: "\u{f8fa}", 65: "\u{f8f8}", 58: "\u{f8f3}", 49: "\u{f8f6}", 215: "Υ", 53: "\u{f8fb}", 56: "\u{f8f1}", 67: "\u{f8f7}", 208: "Γ", 59: "\u{f8fe}", 216: "Φ", 160: " ", 210: "Θ", 217: "Ψ", 211: "Λ", 51: "\u{f8f9}", 54: "\u{f8ef}", 52: "\u{f8f0}", 60: "\u{f8f2}", 214: "Σ", 48: "\u{f8eb}", 61: "\u{f8fd}", 209: "∆"} for <</Type /Font/Subtype /Type1/BaseFont /KFVYMG+CMEX10/FirstChar 16/FontDescriptor 638 0 R/LastChar 118/ToUnicode 671 0 R/Widths 617 0 R>>

Would you be able to make the fields of MediaBox public?

Hi,

First of all thanks for the great library. I'm just wondering if you would consider making the fields of MediaBox public?

(I'm creating my own little parser based on your extractor, but to do that I need the fields to be public, so that I can create my own OutputDev.)

One more question: do you know if there is a way for the OutputDev to terminate processing after, say, the first page?

I noticed there may be a println! that could have been left in accidentally too - https://github.com/jrmuizel/pdf-extract/blob/master/src/lib.rs#L1118

Thanks!

thread 'main' panicked at 'assertion failed: name == \"Identity-H\"

Hi, reporting this.

Error in compiling the example

While compiling the example I got the below errors and warnings:

   Compiling rust v0.1.0 (/Users/hasan/PycharmProjects/rust)
warning: trait objects without an explicit `dyn` are deprecated
  --> src/main.rs:26:25
   |
26 |     let mut output: Box<OutputDev> = match output_kind.as_ref() {
   |                         ^^^^^^^^^ help: use `dyn`: `dyn OutputDev`
   |
   = note: `#[warn(bare_trait_objects)]` on by default

warning: trait objects without an explicit `dyn` are deprecated
  --> src/main.rs:27:74
   |
27 |         "txt" => Box::new(PlainTextOutput::new(&mut output_file as (&mut std::io::Write))),
   |                                                                          ^^^^^^^^^^^^^^ help: use `dyn`: `dyn std::io::Write`

error[E0308]: mismatched types
  --> src/main.rs:24:20
   |
24 |     print_metadata(&doc);
   |                    ^^^^ expected struct `lopdf::document::Document`, found a different struct `lopdf::document::Document`
   |
   = note: expected type `&lopdf::document::Document` (struct `lopdf::document::Document`)
              found type `&lopdf::document::Document` (struct `lopdf::document::Document`)
note: Perhaps two different versions of crate `lopdf` are being used?
  --> src/main.rs:24:20
   |
24 |     print_metadata(&doc);
   |                    ^^^^

error[E0308]: mismatched types
  --> src/main.rs:33:16
   |
33 |     output_doc(&doc, output.as_mut());
   |                ^^^^ expected struct `lopdf::document::Document`, found a different struct `lopdf::document::Document`
   |
   = note: expected type `&lopdf::document::Document` (struct `lopdf::document::Document`)
              found type `&lopdf::document::Document` (struct `lopdf::document::Document`)
note: Perhaps two different versions of crate `lopdf` are being used?
  --> src/main.rs:33:16
   |
33 |     output_doc(&doc, output.as_mut());
   |                ^^^^

error: aborting due to 2 previous errors

Page attributes are not inherited

PDF allows pages to inherit their attributes from ancestor nodes in the page tree, but these inherited attributes are currently not picked up.
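The lookup rule can be sketched independently of any PDF library: start at the page and walk up through the parent nodes until one of them defines the attribute. In the PDF specification the inheritable page attributes are Resources, MediaBox, CropBox and Rotate. The `Node` type and `inherited_attr` function below are made up for illustration and model the tree as a flat slice with parent indices; they assume the parent chain contains no cycles.

```rust
use std::collections::HashMap;

// Simplified stand-in for a page-tree node (hypothetical type).
struct Node {
    parent: Option<usize>,
    attrs: HashMap<&'static str, &'static str>,
}

/// Look up `key` on node `idx`, falling back to its ancestors.
fn inherited_attr<'a>(nodes: &'a [Node], mut idx: usize, key: &str) -> Option<&'a str> {
    loop {
        let node = &nodes[idx];
        if let Some(v) = node.attrs.get(key) {
            return Some(*v);
        }
        // Stop with None once the root (no parent) has been checked.
        idx = node.parent?;
    }
}
```

A value set directly on the page, or on a closer ancestor, takes precedence over one set further up the tree.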

Extract text from string

Currently extract_text only accepts AsRef<Path>, but what if the user wants to read the input from a String? Why not take anything that implements Read instead?
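One possible shape for such an entry point is sketched below, using only the standard library. `read_pdf_bytes` is a hypothetical helper, not an existing pdf-extract function: it drains any `Read` source into a byte buffer, which could then be handed to a byte-based function such as the `extract_text_from_mem` call shown in the README.

```rust
use std::io::{self, Read};

/// Read an entire PDF from any `Read` source into memory.
/// Hypothetical helper, not part of pdf-extract.
fn read_pdf_bytes<R: Read>(mut source: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    source.read_to_end(&mut buf)?;
    Ok(buf)
}
```

Files, byte slices, `std::io::Cursor` values and network streams all implement `Read`, so a single generic function covers each of those inputs.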

extract_text_from_mem not found in `pdf_extract`

I am trying to get the example lines from the README to work, but without success.

This is my main.rs

extern crate pdf_extract;

use pdf_extract::*;

fn main() {
    println!("Hello, world!");

    let bytes = std::fs::read("tests/docs/simple.pdf").unwrap();
    let out = pdf_extract::extract_text_from_mem(&bytes).unwrap();
    assert!(out.contains("This is a small demonstration"));
}

This is my cargo.toml

[package]
name = "pdfreader"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
pdf-extract = "0.6.4"

This is the error I get:

error[E0425]: cannot find function `extract_text_from_mem` in crate `pdf_extract`
 --> src/main.rs:9:28
  |
9 |     let out = pdf_extract::extract_text_from_mem(&bytes).unwrap();
  |                            ^^^^^^^^^^^^^^^^^^^^^ not found in `pdf_extract`

Can anyone help me see what I am missing here?

Panic at FirstChar

I am using the most recent version of this crate, and am using it to extract text from old PDF documents. When dealing with PDF documents with PDF version 1.3, it consistently throws the following error:

thread 'main' panicked at 'FirstChar', pdf-extract/src/lib.rs:201:30

Not sure if the issue is actually due to the PDF version, but it seems to be a consistent factor across the PDFs that are causing this panic. For anything version 1.4 or newer it seems to have fewer or different issues.

Here is the backtrace as well:

   0:        0x1030e5c11 - std::backtrace_rs::backtrace::libunwind::trace::h0b624e35bf84187c
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:        0x1030e5c11 - std::backtrace_rs::backtrace::trace_unsynchronized::h435d9bd636904605
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x1030e5c11 - std::sys_common::backtrace::_print_fmt::h3ca407d645e7e73d
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x1030e5c11 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h4f26ffad025fdbe8
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:46:22
   4:        0x10310586b - core::fmt::write::h0a9937d83d3944c1
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/fmt/mod.rs:1168:17
   5:        0x1030e2a68 - std::io::Write::write_fmt::hfaf2e2e92eda8127
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/io/mod.rs:1660:15
   6:        0x1030e7e87 - std::sys_common::backtrace::_print::h11335bd900abe1ce
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:49:5
   7:        0x1030e7e87 - std::sys_common::backtrace::print::hdf5291c87f745042
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:36:9
   8:        0x1030e7e87 - std::panicking::default_hook::{{closure}}::hc11e9b8d348e68b0
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:211:50
   9:        0x1030e7a95 - std::panicking::default_hook::h1d26ec4d0d63be04
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:228:9
  10:        0x1030e8510 - std::panicking::rust_panic_with_hook::hef4f5e524db188b3
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:606:17
  11:        0x1030e823e - std::panicking::begin_panic_handler::{{closure}}::h6e8805ea2351af89
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:502:13
  12:        0x1030e6087 - std::sys_common::backtrace::__rust_end_short_backtrace::hd383ade987b76f63
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:139:18
  13:        0x1030e7f2a - rust_begin_unwind
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
  14:        0x1031152cf - core::panicking::panic_fmt::hb58956db718d5b79
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
  15:        0x103103c8b - core::panicking::panic_display::hbc9d28d62fda8ebd
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:72:5
  16:        0x103103c3c - core::panicking::panic_str::h157a3bd169616ebc
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:56:5
  17:        0x1031151d9 - core::option::expect_failed::h453cfa4fcdc0da1c
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/option.rs:1817:5
  18:        0x102fd79fc - pdf_extract_custom::get::h0049d79db907b50d
  19:        0x102fd9c9c - pdf_extract_custom::make_font::h069cf264ff12e0dd
  20:        0x102fe17f4 - std::collections::hash::map::Entry<K,V>::or_insert_with::h0c1bd9bdfb41b914
  21:        0x102fde982 - pdf_extract_custom::Processor::process_stream::h57aad9bde7a6ebbf
  22:        0x102fe066b - pdf_extract_custom::output_doc::h7674a2a38c26fe1f
  23:        0x102fd43b5 - pdf_extract_custom::extract_text::h893fb460cfdd03c8
  24:        0x102fcd0df - pdf_indexer::main::hb45e438a30676c8a
  25:        0x102fcd796 - std::sys_common::backtrace::__rust_begin_short_backtrace::hf8b6885c183ef9a5
  26:        0x102fcd78c - std::rt::lang_start::{{closure}}::h839ae8a5a873071a
  27:        0x1030e531e - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h1d1e9294d7151cb0
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:259:13
  28:        0x1030e531e - std::panicking::try::do_call::h315943602cc1e70c
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:406:40
  29:        0x1030e531e - std::panicking::try::h5be753f80fffd492
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:370:19
  30:        0x1030e531e - std::panic::catch_unwind::h9fdcb02c74b07e26
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panic.rs:133:14
  31:        0x1030e531e - std::rt::lang_start_internal::{{closure}}::h1558447834abc29f
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/rt.rs:128:48
  32:        0x1030e531e - std::panicking::try::do_call::h5721bf6e49d6926d
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:406:40
  33:        0x1030e531e - std::panicking::try::hee7cffb35a5e550d
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:370:19
  34:        0x1030e531e - std::panic::catch_unwind::hf45e91e6006ab16e
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panic.rs:133:14
  35:        0x1030e531e - std::rt::lang_start_internal::h64086fc6655bfbe8
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/rt.rs:128:20
  36:        0x102fcd759 - _main
