j-f-liu / lopdf

A Rust library for PDF document manipulation.

License: MIT License

Rust 100.00%

pdf-document rust rust-library

lopdf's People

crlf0710 simonsapin trizinix ndusart ctforks ykankaya jrmuizel stillson xanonid nfinger ashleygwilliams srinivest kornelski deciduously litttley linecode laeeth uuhan 0xflotus cjiajiazhuiqiu ideaplexus isgasho srijs georgemarshall divergentdave zerospam anuragtb jjpe paulyc geoffreyporto jeon-studio gabrielmajeri xychen9459 svenfoo eijebong kandykoma marcgrotheer zhutony anderejd hroi tschet1 rye carlos-marques z33ky boan-anbo emulator000 sbeckeriv makyos hfiguiere lrosenthol bearrrrr tako8ki andy-tsg hsahin heroickatora georgohneh kimamov jakeoshannessy champignoom jetasap placrosse rojer-98 niedzwiedzw enzingerm stephica balrog-rust psy-repos-rust donjayamanne solomatovdmitrij ajunlonglive standardgalactic ralpha msrd0 oyelowo zhengxiwan as207960 yossan rben01 goodpaul6 mattjurenka wiktor-k codefather-labs jumpdiffusion cactter laplacekorea mohamedtaoufik igi-111 emarsden ta-vroom bobrimperator baro77 brmmm3 yaminiu flatbartender eznj runyasak sww1235 tyrylu look-before-you-leap unpublished

lopdf's Issues

Support LZWDecode filters

https://github.com/yurydelendik/pdf.js/raw/5973d40afe5a1f82474438caae71c4039dc3ba84/test/pdfs/bug864847.pdf has a fun example of an LZWDecode filter being used on a ToUnicode CMap
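For reference, a minimal sketch of what an LZWDecode decoder involves, assuming the PDF defaults: codes of 9 to 12 bits packed most-significant-bit first, clear-table code 256, end-of-data code 257, and EarlyChange 1. The names `read_code` and `lzw_decode` are hypothetical and not part of lopdf, and predictor post-processing is not handled.

```rust
/// Reads `width` bits, most significant bit first, starting at bit offset `*pos`.
fn read_code(data: &[u8], pos: &mut usize, width: usize) -> Option<usize> {
    if *pos + width > data.len() * 8 {
        return None;
    }
    let mut code = 0usize;
    for _ in 0..width {
        let byte = data[*pos / 8];
        let bit = (byte >> (7 - *pos % 8)) & 1;
        code = (code << 1) | bit as usize;
        *pos += 1;
    }
    Some(code)
}

/// Decodes an LZW stream (hypothetical helper; assumes EarlyChange = 1).
fn lzw_decode(data: &[u8]) -> Vec<u8> {
    const CLEAR: usize = 256;
    const EOD: usize = 257;
    // Entries 0..=255 are single bytes; 256 and 257 are reserved placeholders.
    let fresh_table = || -> Vec<Vec<u8>> {
        let mut t: Vec<Vec<u8>> = (0..=255u8).map(|b| vec![b]).collect();
        t.push(Vec::new());
        t.push(Vec::new());
        t
    };
    let mut table = fresh_table();
    let mut width = 9;
    let mut prev: Option<Vec<u8>> = None;
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(code) = read_code(data, &mut pos, width) {
        if code == EOD {
            break;
        }
        if code == CLEAR {
            table = fresh_table();
            width = 9;
            prev = None;
            continue;
        }
        let entry = if code < table.len() {
            table[code].clone()
        } else if code == table.len() {
            // Code not yet in the table: previous entry plus its own first byte.
            match &prev {
                Some(p) => {
                    let mut e = p.clone();
                    e.push(p[0]);
                    e
                }
                None => break,
            }
        } else {
            break; // corrupt input
        };
        out.extend_from_slice(&entry);
        if let Some(mut p) = prev {
            if table.len() < 4096 {
                p.push(entry[0]);
                table.push(p);
            }
        }
        // EarlyChange = 1: widen one code earlier than strictly necessary.
        if width < 12 && table.len() + 1 >= (1 << width) {
            width += 1;
        }
        prev = Some(entry);
    }
    out
}

fn main() {
    // Nine-bit codes 256 45 258 258 65 259 66 257, packed by hand.
    let encoded = [0x80, 0x0B, 0x60, 0x50, 0x22, 0x0C, 0x0C, 0x85, 0x01];
    println!("{:?}", lzw_decode(&encoded));
}
```

Only the 9-bit path is exercised by the sample; the code-width growth follows the usual EarlyChange convention and should be checked against real files such as the one linked above.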

Compiler errors on README example

I tried to run the code on the README, and I got 2 errors: One complaining about the dictionary! macro, and the other complaining about an extra comma.

error: cannot find macro `dictionary!` in this scope

To fix this, I copied the dictionary! macro from object.rs. Is there a better way to do this?

error: no rules expected the token `,`
  --> src/main.rs:27:30
   |
27 |         "BaseFont" => "Courier",,
   |                                 ^

To fix this I just removed the extra comma.
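The second error follows from how `macro_rules!` repetitions treat separators: a pattern like `$( ... ),*` accepts exactly one comma between entries, so `,,` matches no rule. Below is a minimal sketch using a hypothetical `dict!` macro; it is not lopdf's `dictionary!` (which builds a lopdf dictionary, not a `HashMap`), and it additionally tolerates a single trailing comma. On the first error, macros exported by a crate are normally brought into scope with `#[macro_use] extern crate ...;` (or a `use` path on the 2018 edition) rather than copied; whether that is all the README of that version was missing is an assumption.

```rust
/// Hypothetical stand-in for a dictionary-building macro.
/// `$( ... ),*` allows one comma between pairs; `$(,)?` allows one trailing comma.
macro_rules! dict {
    ( $( $key:expr => $value:expr ),* $(,)? ) => {{
        let mut map = ::std::collections::HashMap::new();
        $( map.insert($key.to_string(), $value.to_string()); )*
        map
    }};
}

fn main() {
    let font = dict! {
        "Type" => "Font",
        "Subtype" => "Type1",
        "BaseFont" => "Courier",
    };
    println!("{} entries, BaseFont = {}", font.len(), font["BaseFont"]);
    // Writing `"BaseFont" => "Courier",,` would fail to compile with
    // "no rules expected the token `,`".
}
```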

Add a parser for pdf content streams

It would be very valuable if lopdf also supported parsing PDF content streams. I'm not sure how easy these are to parse or how well pom would deal with them, but it seems like an interesting challenge.
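To give a sense of the shape of the problem: a content stream is a sequence of operands followed by an operator, so a parser collects operands until it meets an operator token. The sketch below is hand-rolled with the standard library only (it does not use pom or lopdf), and the `Operand`, `Operation`, and `parse_content` names are hypothetical. It handles only numbers, names, and simple literal strings; nested parentheses, escapes, hex strings, arrays, dictionaries, and inline images are left out.

```rust
#[derive(Debug, PartialEq)]
enum Operand {
    Number(f64),
    Name(String),
    Text(String),
}

#[derive(Debug, PartialEq)]
struct Operation {
    operator: String,
    operands: Vec<Operand>,
}

/// Splits a (decompressed) content stream into operations.
fn parse_content(input: &str) -> Vec<Operation> {
    let mut ops = Vec::new();
    let mut operands = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c == '(' {
            chars.next(); // skip '('
            // Simplification: read up to the first ')', no nesting or escapes.
            let text: String = chars.by_ref().take_while(|&ch| ch != ')').collect();
            operands.push(Operand::Text(text));
        } else {
            let mut token = String::new();
            while let Some(&ch) = chars.peek() {
                if ch.is_whitespace() || ch == '(' {
                    break;
                }
                token.push(ch);
                chars.next();
            }
            if let Some(name) = token.strip_prefix('/') {
                operands.push(Operand::Name(name.to_string()));
            } else if let Ok(n) = token.parse::<f64>() {
                operands.push(Operand::Number(n));
            } else {
                // Anything else is treated as an operator that consumes
                // the operands gathered so far.
                ops.push(Operation {
                    operator: token,
                    operands: std::mem::take(&mut operands),
                });
            }
        }
    }
    ops
}

fn main() {
    for op in parse_content("BT /F1 12 Tf (Hello) Tj ET") {
        println!("{:?}", op);
    }
}
```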

Get Byte Offsets

In order to build out signature support, we need the ability to get the underlying bytestream of segments of the file around certain objects. It would be helpful if that was exposed in the library, even if just the ability to get the byte offset of an object in the pdf file.
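As an illustration of why offsets matter here: a PDF signature covers every byte of the file except the signature value itself, described by a `ByteRange` array of (offset, length) pairs. A minimal sketch, assuming the start and end offsets of the excluded span are already known; `byte_range` and `signed_bytes` are hypothetical helpers, not lopdf functions.

```rust
/// Builds a ByteRange-style array [0, gap_start, gap_end, len_after_gap]
/// for a file of `file_len` bytes with the span gap_start..gap_end excluded.
fn byte_range(file_len: usize, gap_start: usize, gap_end: usize) -> [usize; 4] {
    assert!(gap_start <= gap_end && gap_end <= file_len);
    [0, gap_start, gap_end, file_len - gap_end]
}

/// Concatenates the two covered segments; this is what would be hashed.
fn signed_bytes(data: &[u8], range: [usize; 4]) -> Vec<u8> {
    let mut out = Vec::with_capacity(range[1] + range[3]);
    out.extend_from_slice(&data[range[0]..range[0] + range[1]]);
    out.extend_from_slice(&data[range[2]..range[2] + range[3]]);
    out
}

fn main() {
    // Toy "file": the excluded span is the placeholder <0000>.
    let data = b"AAAA<0000>BBBB";
    let range = byte_range(data.len(), 4, 10);
    println!("{:?}", range);
    println!("{}", String::from_utf8_lossy(&signed_bytes(data, range)));
}
```

Producing `gap_start` and `gap_end` is exactly the part that needs the library to expose where an object or value sits in the file.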

http://www.faithaliveresources.org/Content/Site135/FilesSamples/105315400440pdf_00000009843.pdf does not load

It gives: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidInput, error: StringError("corrupt deflate stream") }', libcore/result.rs:945:5

Allow parsing a content stream from stream.decompressed_content().

This seems like a natural addition to the API.

Question: scale pdf

Is this library able to scale/resize PDF pages?
What I need is something that takes a PDF with multiple pages and resizes/scales them so that only A4 and A3 print sizes remain.
So everything under A4 goes to A4, anything between A4 and A3 goes to A3, and any page over A3 goes to A3.
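The sizing rule described above can be sketched independently of any PDF library. This assumes page dimensions in PDF points (1/72 inch), rounded A4 and A3 sizes of 595 x 842 and 842 x 1191 points, and uniform scaling that preserves the aspect ratio; `Target`, `choose_target`, and `scale_factor` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Target {
    A4,
    A3,
}

impl Target {
    /// (short side, long side) in points, rounded.
    fn size(self) -> (f64, f64) {
        match self {
            Target::A4 => (595.0, 842.0),
            Target::A3 => (842.0, 1191.0),
        }
    }
}

/// Returns (short side, long side) so orientation does not matter.
fn sorted(width: f64, height: f64) -> (f64, f64) {
    if width <= height { (width, height) } else { (height, width) }
}

/// Pages that fit within A4 go to A4; everything larger goes to A3.
fn choose_target(width: f64, height: f64) -> Target {
    let (short, long) = sorted(width, height);
    let (a4_short, a4_long) = Target::A4.size();
    if short <= a4_short && long <= a4_long { Target::A4 } else { Target::A3 }
}

/// Largest uniform factor that keeps the page inside the target size.
fn scale_factor(width: f64, height: f64, target: Target) -> f64 {
    let (short, long) = sorted(width, height);
    let (t_short, t_long) = target.size();
    (t_short / short).min(t_long / long)
}

fn main() {
    for &(w, h) in &[(420.0, 595.0), (700.0, 1000.0), (1191.0, 1684.0)] {
        let target = choose_target(w, h);
        println!("{} x {} -> {:?}, factor {:.3}", w, h, target, scale_factor(w, h, target));
    }
}
```

Applying the factor to an actual page would typically mean setting the page's MediaBox to the target size and prepending a scaling `cm` operator to its content stream; that part is not shown.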

Extract pages as images?

Is there a method to extract pages in a PDF document as images?

Add relaxed mode (ignores things like false byte offsets in xref table)

Found another error for http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/ksp-thesis/ksp-thesis.pdf which gives:

Custom { kind: InvalidData, error: StringError("Not a valid PDF file (xref_and_trailer).\nMismatch { message: \"expect repeat at least 1 times, found 0 times\", position: 267986 }") }
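One common way to implement such a relaxed mode is to ignore the stored offsets and rebuild the table by scanning the file for object headers. The sketch below assumes each `<id> <gen> obj` header sits on its own line; `rebuild_xref` is a hypothetical helper, not an existing lopdf function, and a real implementation would also have to avoid false matches inside binary stream data.

```rust
use std::collections::BTreeMap;

/// Scans raw PDF bytes for lines of the form `<id> <gen> obj` and records
/// the byte offset of each header. Later definitions overwrite earlier ones,
/// which matches how incremental updates supersede older objects.
fn rebuild_xref(data: &[u8]) -> BTreeMap<(u32, u16), usize> {
    let mut table = BTreeMap::new();
    let mut line_start = 0;
    for line in data.split(|&b| b == b'\n') {
        let indent = line.iter().take_while(|b| b.is_ascii_whitespace()).count();
        let text = String::from_utf8_lossy(line);
        let parts: Vec<&str> = text.split_whitespace().collect();
        if parts.len() >= 3 && parts[2] == "obj" {
            if let (Ok(id), Ok(generation)) =
                (parts[0].parse::<u32>(), parts[1].parse::<u16>())
            {
                table.insert((id, generation), line_start + indent);
            }
        }
        line_start += line.len() + 1; // +1 for the '\n' consumed by split
    }
    table
}

fn main() {
    let data = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n2 0 obj\n<<>>\nendobj\n";
    for ((id, generation), offset) in rebuild_xref(data) {
        println!("{} {} obj at byte {}", id, generation, offset);
    }
}
```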

Document::extract_text prints encoding list to stdout

I don't know if this was intentional or not, but when extracting the text from a document, I get the following output in the console:

{"F1": "WinAnsiEncoding", "F2": "WinAnsiEncoding", "F3": "WinAnsiEncoding", "F4": "WinAnsiEncoding"}
{"F1": "WinAnsiEncoding", "F2": "WinAnsiEncoding", "F3": "WinAnsiEncoding", "F4": "WinAnsiEncoding", "F5": "WinAnsiEncoding"}
{"F1": "WinAnsiEncoding", "F2": "WinAnsiEncoding", "F3": "WinAnsiEncoding", "F4": "WinAnsiEncoding", "F5": "WinAnsiEncoding"}

Here is the code doing the printing:

lopdf/src/processor.rs

Line 190 in fa3a198

println!("{:?}", encodings);

Trouble with https://web.archive.org/web/20140324020304/http://www.solv.ch/files/magazin/blw2014_nachwuchs.pdf

I get a bunch of the following:

Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 9137).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 9383 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 24036).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 24282 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 38935).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 39181 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 53834).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 54080 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 68733).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 68979 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 83632).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 83878 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 98531).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 98777 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 113430).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 113676 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 164220).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 164466 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 215010).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 215256 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 265755).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 265914 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 266025).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 266184 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 266970).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 267129 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 275356).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 275515 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 279461).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 279620 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 310100).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 310259 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 310855).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 311014 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 357054).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 357213 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 363298).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 363457 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 369716).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 369875 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 374713).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 374872 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 375742).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 375901 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 379986).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 380145 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 389773).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 389932 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 391382).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 391541 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 404469).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 404628 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 446244).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 446403 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 490353).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 490512 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 538405).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 538564 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 538913).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 539072 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 545760).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 545919 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 547091).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 547250 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 588499).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 588658 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 634527).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 634686 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 638297).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 638456 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 721012).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 721171 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 721633).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 721792 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 805803).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 805962 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 808239).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 808398 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 808676).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 808835 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 820750).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 820909 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 846137).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 846296 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 849787).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 849946 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 850280).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 850439 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 850918).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 851077 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 851558).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 851717 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 852197).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 852356 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 852675).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 852834 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 853163).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 853322 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 853657).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 853816 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 854201).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 854360 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 854680).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 854839 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 855264).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 855423 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 855774).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 855933 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 856291).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 856450 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 856873).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 857032 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 857465).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 857624 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 857948).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 858107 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 858384).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 858543 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 858902).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 859061 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 859556).\nMismatch { message: \"seq endobj expect: 101, found: 115\", position: 859616 }") }

and then a panic at:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [70, 109, 48, 95, 48, 95, 45, 52, 95, 49, 55, 95, 70, 108, 228, 99, 104, 101, 95, 79, 98, 106, 101, 107, 116, 95, 67, 61, 48, 95, 77, 61, 50, 51, 48, 95, 89, 61, 50, 51, 48, 95, 75, 61, 48], error: Utf8Error { valid_up_to: 14, error_len: Some(1) } }', libcore/result.rs:945:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:207
   3: std::panicking::default_hook
             at libstd/panicking.rs:223
   4: std::panicking::begin_panic
             at libstd/panicking.rs:402
   5: std::panicking::try::do_call
             at libstd/panicking.rs:349
   6: std::panicking::try::do_call
             at libstd/panicking.rs:325
   7: core::ptr::drop_in_place
             at libcore/panicking.rs:72
   8: core::result::unwrap_failed
             at /Users/travis/build/rust-lang/rust/src/libcore/macros.rs:26
   9: <core::result::Result<T, E>>::unwrap
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:782
  10: lopdf::parser::dictionary::{{closure}}::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/parser.rs:107
  11: core::iter::iterator::Iterator::fold::{{closure}}
             at /Users/travis/build/rust-lang/rust/src/libcore/iter/iterator.rs:1594
  12: core::iter::iterator::Iterator::try_fold
             at /Users/travis/build/rust-lang/rust/src/libcore/iter/iterator.rs:1481
  13: core::iter::iterator::Iterator::fold
             at /Users/travis/build/rust-lang/rust/src/libcore/iter/iterator.rs:1594
  14: lopdf::parser::dictionary::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/parser.rs:105
  15: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &'a F>::call_once
             at /Users/travis/build/rust-lang/rust/src/libcore/ops/function.rs:252
  16: <core::result::Result<T, E>>::map
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:468
  17: <pom::parser::Parser<'a, I, O>>::map::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:34
  18: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  19: <pom::parser::Parser<'a, I, O>>::map::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:34
  20: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  21: <pom::parser::Parser<'a, I, O> as core::ops::bit::BitOr>::bitor::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:520
  22: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  23: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  24: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  25: pom::parser::call::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:426
  26: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  27: <pom::parser::Parser<'a, I, O> as core::ops::arith::Add<pom::parser::Parser<'b, I, U>>>::add::{{closure}}::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  28: <core::result::Result<T, E>>::and_then
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:621
  29: <pom::parser::Parser<'a, I, O> as core::ops::arith::Add<pom::parser::Parser<'b, I, U>>>::add::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  30: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  31: <pom::parser::Parser<'a, I, O>>::repeat::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:129
  32: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  33: <pom::parser::Parser<'a, I, O> as core::ops::arith::Mul<pom::parser::Parser<'b, I, U>>>::mul::{{closure}}::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:485
  34: <core::result::Result<T, E>>::and_then
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:621
  35: <pom::parser::Parser<'a, I, O> as core::ops::arith::Mul<pom::parser::Parser<'b, I, U>>>::mul::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:485
  36: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  37: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  38: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  39: <pom::parser::Parser<'a, I, O>>::map::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:34
  40: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  41: <pom::parser::Parser<'a, I, O>>::map::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:34
  42: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  43: <pom::parser::Parser<'a, I, O> as core::ops::bit::BitOr>::bitor::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:520
  44: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  45: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  46: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  47: pom::parser::call::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:426
  48: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  49: <pom::parser::Parser<'a, I, O> as core::ops::arith::Add<pom::parser::Parser<'b, I, U>>>::add::{{closure}}::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  50: <core::result::Result<T, E>>::and_then
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:621
  51: <pom::parser::Parser<'a, I, O> as core::ops::arith::Add<pom::parser::Parser<'b, I, U>>>::add::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  52: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  53: <pom::parser::Parser<'a, I, O>>::repeat::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:129
  54: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  55: <pom::parser::Parser<'a, I, O> as core::ops::arith::Mul<pom::parser::Parser<'b, I, U>>>::mul::{{closure}}::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:485
  56: <core::result::Result<T, E>>::and_then
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:621
  57: <pom::parser::Parser<'a, I, O> as core::ops::arith::Mul<pom::parser::Parser<'b, I, U>>>::mul::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:485
  58: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  59: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  60: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  61: <pom::parser::Parser<'a, I, O>>::map::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:34
  62: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  63: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  64: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  65: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  66: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  67: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  68: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  69: <pom::parser::Parser<'a, I, O> as core::ops::bit::Shr<F>>::shr::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:501
  70: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  71: <pom::parser::Parser<'a, I, O>>::map::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:34
  72: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  73: <pom::parser::Parser<'a, I, O> as core::ops::bit::BitOr>::bitor::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:520
  74: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  75: <pom::parser::Parser<'a, I, O> as core::ops::bit::BitOr>::bitor::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:516
  76: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  77: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  78: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  79: <pom::parser::Parser<'a, I, O> as core::ops::arith::Add<pom::parser::Parser<'b, I, U>>>::add::{{closure}}::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  80: <core::result::Result<T, E>>::and_then
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:621
  81: <pom::parser::Parser<'a, I, O> as core::ops::arith::Add<pom::parser::Parser<'b, I, U>>>::add::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  82: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  83: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  84: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  85: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  86: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  87: <pom::parser::Parser<'a, I, O> as core::ops::arith::Sub<pom::parser::Parser<'b, I, U>>>::sub::{{closure}}
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  88: <pom::parser::Parser<'a, I, O>>::parse
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  89: lopdf::reader::Reader::read_object
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:139
  90: <pom::input::DataInput<'a, T> as pom::input::Input<T>>::position
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:91
  91: lopdf::reader::<impl lopdf::document::Document>::load_internal
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:38
  92: lopdf::reader::<impl lopdf::document::Document>::load
             at /Users/jrmuizel/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:19
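The bytes in the panic message are not valid UTF-8 because byte 228 (at index 14, matching `valid_up_to: 14`) is a Latin-1 "ä". A minimal sketch of a non-panicking alternative to `String::from_utf8(...).unwrap()`, assuming a Latin-1 fallback is acceptable for such names; `decode_name` is a hypothetical helper, not the fix adopted by lopdf.

```rust
/// Decodes bytes as UTF-8 when possible, otherwise treats each byte as a
/// Latin-1 code point (u8 -> char maps 0..=255 to U+0000..=U+00FF).
fn decode_name(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_string(),
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}

fn main() {
    // The exact bytes reported in the FromUtf8Error above.
    let bytes: [u8; 45] = [
        70, 109, 48, 95, 48, 95, 45, 52, 95, 49, 55, 95, 70, 108, 228, 99, 104, 101,
        95, 79, 98, 106, 101, 107, 116, 95, 67, 61, 48, 95, 77, 61, 50, 51, 48, 95,
        89, 61, 50, 51, 48, 95, 75, 61, 48,
    ];
    println!("{}", decode_name(&bytes));
}
```

Keeping names as raw bytes instead of `String` would avoid the question of encoding altogether.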

Cannot read pdf document

When I try to read this document: http://mirror.hmc.edu/ctan/macros/latex/contrib/iwhdp/Back_2015.pdf I get:

Not a valid PDF file (prev xref_and_trailer).
Mismatch { message: "expect repeat at least 1 times, found 0 times", position: 3117 }

Hopefully the last panic

This document https://www.uspsoig.gov/sites/default/files/document-library-files/2016/RARC-WP-16-001.pdf

gives:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidInput, error: StringError("corrupt deflate stream") }', libcore/result.rs:945:5

http://www.srl.inf.ethz.ch/workshop2016/Su.pdf does not Load

Gives StringError("Not a valid PDF file (read object at 0).\nMismatch { message: \"expect repeat at least 1 times, found 0 times\", position: 0 }")

Recursive parser with additional arguments

Hi, I've tried your library for creating a programming language parser. Everything seemed fine until I reached recursion. I have code like this (simplified):

fn expr<'a>(n: AST, c: Context) -> Combinator<impl Parser<'a, u8, Output = AST>> {
    (one_of(b"<>=|~,^#_$?@:'") + phrase(c.clone())).map(move |(v, e)| {...}) |
    empty().map(move |_| n.clone())
}
//
fn phrase<'a>(c: Context) -> Combinator<impl Parser<'a, u8, Output = AST>> {
    noun(c.clone()) >> move |n: AST| expr(n, c.clone())
}
//
fn prg<'a>(c: Context) -> Combinator<impl Parser<'a, u8, Output = AST>> { 
    phrase(c.clone()) - endp() 
}

The compiler says: error[E0275]: overflow evaluating the requirement `impl pom::Parser<u8>`...
I've found call() and comb(), but unfortunately they don't accept additional arguments (like Context and AST in my case).
Maybe I've missed something?
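
A likely cause of this overflow is that expr and phrase each return an opaque `impl Parser` type that contains the other, so the type never bottoms out. One common workaround is to return a boxed trait object and do the recursive call inside a closure, at parse time. The sketch below does not use pom at all; it is a standalone illustration, and `Parser`, `Ast`, `Context`, `noun`, `expr` and `phrase` are hypothetical names chosen to mirror the snippet above.

```rust
// A parser is a boxed closure: input in, optional (value, remaining input) out.
// Boxing gives every grammar rule the same finite type, however the rules nest.
type Parser<'a, T> = Box<dyn Fn(&'a [u8]) -> Option<(T, &'a [u8])> + 'a>;

#[derive(Clone, Debug, PartialEq)]
enum Ast {
    Num(u8),
    Op(u8, Box<Ast>, Box<Ast>),
}

// Hypothetical context that is passed down through every rule.
#[derive(Clone)]
struct Context {
    operators: &'static [u8],
}

// noun := a single ASCII digit
fn noun<'a>(_c: Context) -> Parser<'a, Ast> {
    Box::new(|input: &'a [u8]| match input.split_first() {
        Some((&d, rest)) if d.is_ascii_digit() => Some((Ast::Num(d - b'0'), rest)),
        _ => None,
    })
}

// expr := operator phrase | empty (which yields the left operand unchanged)
fn expr<'a>(left: Ast, c: Context) -> Parser<'a, Ast> {
    Box::new(move |input: &'a [u8]| match input.split_first() {
        Some((&op, rest)) if c.operators.contains(&op) => {
            // The recursive call is made while parsing, not while building the parser.
            let (right, rest) = phrase(c.clone())(rest)?;
            Some((Ast::Op(op, Box::new(left.clone()), Box::new(right)), rest))
        }
        _ => Some((left.clone(), input)),
    })
}

// phrase := noun expr, with the parsed noun handed to expr as an extra argument
fn phrase<'a>(c: Context) -> Parser<'a, Ast> {
    Box::new(move |input: &'a [u8]| {
        let (n, rest) = noun(c.clone())(input)?;
        expr(n, c.clone())(rest)
    })
}

fn main() {
    let ctx = Context { operators: b"+-" };
    let input: &[u8] = b"1+2-3";
    let (ast, rest) = phrase(ctx)(input).expect("input should parse");
    println!("{:?}, {} bytes left", ast, rest.len());
}
```

The cost is one allocation and one dynamic call per rule, which is usually acceptable for a grammar of this size.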

docs.rs failed to build documentation for lopdf

See https://docs.rs/crate/lopdf/. It looks like an issue building pom-0.6.1

[Question] Looking for a crate to parse and extract content from PDF

First of all: really nice work!
I am looking for a Rust crate to parse a PDF and extract the content from it.
For example, to extract each text line and metadata from the first page, to get which font style and/or font family belongs to the given line, etc.

I am asking whether this crate supports a good depth of extraction (for example, whether it already supports extracting the font family and font size for a single line or word, things like that).
Can you give me some information about that please?

Thanks a lot

IBWA05ModelCode_Mar2.pdf does not load

http://web.archive.org/web/20070317213312/http://www.bottledwater.org/public/pdf/IBWA05ModelCode_Mar2.pdf gives Error { repr: Custom(Custom { kind: InvalidData, error: StringError("Not a valid PDF file (xref_and_trailer).\nMismatch { message: "expect repeat at least 1 times, found 0 times", position: 116 }") }) }

Add changelog?

Hello,

It's difficult to identify what is changing in this library (and as there have been two API changes in as many weeks, there's quite a lot to follow).

Would it be possible to add a changelog (ideally for the past few versions)?

Panic when reading a pdf document

When I try to read http://ctan.math.washington.edu/tex-archive/macros/latex/contrib/multibibliography/tug-paper.pdf then I get a panic:

thread 'main' panicked at 'Stream Length should be an integer.', libcore/option.rs:989:5
note: Run with RUST_BACKTRACE=1 for a backtrace.

Support to add password protection

Thanks for lopdf. Does it support adding password protection?

Add support for non-mutable decompression

The current decompression functionality requires mutating the document. It would be nice to have an API that supports decompression without mutation, for consumers that just want to read the document.

Another panic

Another panic for http://ctan.math.washington.edu/tex-archive/macros/latex/contrib/bg/description.pdf

thread 'main' panicked at 'called Result::unwrap() on an Err value: FromUtf8Error { bytes: [139], error: Utf8Error { valid_up_to: 0, error_len: Some(1) } }', libcore/result.rs:945:5

replace_text not working for PDF generated from a LaTeX file

I use tectonic to generate a PDF from LaTeX.
The replace_text method does not seem to work on that generated document.

I recreated the bug here: https://github.com/J-F-Liu/lopdf/compare/master...efx:text-not-replaced?expand=1
If you clone that repo / pull in the branch just run:

cargo run --example replace_text

I am new to rust but can help fix this with some guidance.

Help: Copy all page content to a new pdf file

Hi,
I want to merge pages from two PDF files into one. I already have such a tool that I wrote in Go, but I want to rewrite it in Rust using lopdf.

This is the state of where I am: https://gist.github.com/bn3t/1508f3526bc4ca894f818182bf23e602. It tries to copy page 1 of the input document to the doc document but still produces a white page.

Would you be so kind as to indicate the steps needed to achieve this?

5014.CIDFont_Spec.pdf fails to load

Trying to load http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf gives Not a valid PDF file (xref_and_trailer).

Corrupt PDF when using the default example [0.6.0]

Hi, I made a completely new project and copied the example code.

I had to switch to rust nightly because #![feature(field_init_shorthand)] is not allowed on stable. I compiled and ran the get started guide. What I got was a corrupted PDF (I'll attach it if I can).

I do not know why the PDF is corrupted, sadly. It would be nice if someone could look into it.


Add remove password protection?

I had a quick look but couldn't find anything related to password protection. Does the library support it?

Object::as_datetime() has issues

While working on tests for rsvg-convert (a tool that's built with librsvg), I found several issues with the Object::as_datetime() implementation. Maybe I'm just using this incorrectly or it's some other misunderstanding, but I would like to raise the issues here.

Timezone handling

The implementation uses chrono::Local.datetime_from_str(). This method will return an error (ParseError::Impossible) if the timezone offset in the string doesn't match the local timezone offset. So while this may work for PDF files that are created in the local timezone, it is likely to fail quite often.

The chrono crate offers another method which is DateTime::parse_from_str():
https://docs.rs/chrono/0.3.0/chrono/datetime/struct.DateTime.html#method.parse_from_str
This method seems more appropriate as it can handle different timezone offsets.

In my opinion it would also make sense to consider changing the return value of Object::as_datetime() to return a DateTime in the UTC offset instead of using Local.

Parsing of incomplete dates

I've run into problems because the PDF I tested did not specify the minutes of the timezone offset. So the CreationDate string looked like this:

D:20200211085039+00'

Instead of the proper

D:20200211085039+00'00'

While this was due to a bug in the library that created the PDF, it still seems valid according to the PDF spec. According to the spec, all fields after the year are optional. However, the code in Object::as_datetime() will raise ParseError(TooShort) unless the complete datetime string is given. So I think the parser should be changed to handle missing optional fields gracefully.
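
To make the optional-field handling concrete, here is a minimal standalone sketch that parses a PDF date string field by field and stops at the first missing one. It does not use chrono or lopdf; `PdfDate`, `take_digits` and `parse_pdf_date` are hypothetical names. It assumes the spec defaults (month and day default to 1, time fields to 0), treats a missing offset the same as `Z`, and does not range-check the values.

```rust
/// Hypothetical plain-data result type; a real implementation might return a chrono type.
#[derive(Debug, PartialEq)]
struct PdfDate {
    year: u16,
    month: u8,
    day: u8,
    hour: u8,
    minute: u8,
    second: u8,
    /// Offset from UT in minutes; 0 when absent or `Z` (an assumption of this sketch).
    offset_minutes: i16,
}

/// Splits `len` leading ASCII digits off `s` and returns their value plus the remainder.
fn take_digits(s: &str, len: usize) -> Option<(u16, &str)> {
    let bytes = s.as_bytes();
    if bytes.len() < len || !bytes[..len].iter().all(u8::is_ascii_digit) {
        return None;
    }
    let (head, tail) = s.split_at(len);
    Some((head.parse::<u16>().ok()?, tail))
}

fn parse_pdf_date(text: &str) -> Option<PdfDate> {
    let s = text.strip_prefix("D:").unwrap_or(text);
    let (year, mut rest) = take_digits(s, 4)?;

    // month, day, hour, minute, second, pre-filled with their defaults
    let mut fields = [1u16, 1, 0, 0, 0];
    for field in fields.iter_mut() {
        match take_digits(rest, 2) {
            Some((value, tail)) => {
                *field = value;
                rest = tail;
            }
            None => break, // this field and everything after it is absent
        }
    }

    let sign: i16 = match rest.chars().next() {
        None | Some('Z') => 0,
        Some('+') => 1,
        Some('-') => -1,
        Some(_) => return None,
    };
    let mut offset_minutes = 0;
    if sign != 0 {
        let (hours, tail) = take_digits(&rest[1..], 2)?;
        let tail = tail.strip_prefix('\'').unwrap_or(tail);
        // The minutes part of the offset is optional, as in D:20200211085039+00'
        let minutes = take_digits(tail, 2).map_or(0, |(m, _)| m);
        offset_minutes = sign * (hours * 60 + minutes) as i16;
    }

    Some(PdfDate {
        year,
        month: fields[0] as u8,
        day: fields[1] as u8,
        hour: fields[2] as u8,
        minute: fields[3] as u8,
        second: fields[4] as u8,
        offset_minutes,
    })
}

fn main() {
    println!("{:?}", parse_pdf_date("D:20200211085039+00'"));
    println!("{:?}", parse_pdf_date("D:2020"));
}
```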

Getting empty contents when reading PDF

When I read in a large PDF (34 pages), load a page, and then try to iterate over the contents, I get an array with a single unexpected operation:
Content { operations: [Operation { operator: "x", operands: [] }] }
Could you shed some light on why this may be happening?

unicode does not show correctly

#[macro_use]
extern crate lopdf;
use lopdf::content::{Content, Operation};
use lopdf::{Document, Object, Stream};

fn main() {
	let mut doc = Document::with_version("1.5");
	let pages_id = doc.new_object_id();
	let font_id = doc.add_object(dictionary! {
		"Type" => "Font",
		"Subtype" => "Type1",
		"BaseFont" => "Courier",
	});
	let resources_id = doc.add_object(dictionary! {
		"Font" => dictionary! {
			"F1" => font_id,
		},
	});
	let content = Content {
		operations: vec![
			Operation::new("BT", vec![]),
			Operation::new("Tf", vec!["F1".into(), 48.into()]),
			Operation::new("Td", vec![100.into(), 600.into()]),
			
			//change text to unicode (arabic)
			Operation::new("Tj", vec![Object::string_literal("مرحبا بالعالم!")]),
			Operation::new("ET", vec![]),
		],
	};
	let content_id = doc.add_object(Stream::new(dictionary! {}, content.encode().unwrap()));
	let page_id = doc.add_object(dictionary! {
		"Type" => "Page",
		"Parent" => pages_id,
		"Contents" => content_id,
	});
	let pages = dictionary! {
		"Type" => "Pages",
		"Kids" => vec![page_id.into()],
		"Count" => 1,
		"Resources" => resources_id,
		"MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],
	};
	doc.objects.insert(pages_id, Object::Dictionary(pages));
	let catalog_id = doc.add_object(dictionary! {
		"Type" => "Catalog",
		"Pages" => pages_id,
	});
	doc.trailer.set("Root", catalog_id);
	doc.compress();
	doc.save("example.pdf").unwrap();
}


and the result is:

[Screenshot "Screen Shot 2019-11-16 at 1 36 12 PM": https://user-images.githubusercontent.com/169691/68992001-40876680-0876-11ea-8c05-a1f20fbd824a.png]

Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 0).\nMismatch { message: \"expect repeat at least 1 times, found 0 times\", position: 0 }") }

Using https://github.com/isocpp/CppCoreGuidelines/raw/master/docs/Lifetimes%20I%20and%20II%20-%20v0.9.1.pdf

target/debug/pdfutil print_streams -i "Lifetimes I and II - v0.9.1.pdf"
Open Lifetimes I and II - v0.9.1.pdf
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 0).\nMismatch { message: "expect repeat at least 1 times, found 0 times", position: 0 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 0).\nMismatch { message: "expect repeat at least 1 times, found 0 times", position: 0 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 0).\nMismatch { message: "expect repeat at least 1 times, found 0 times", position: 0 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 0).\nMismatch { message: "expect repeat at least 1 times, found 0 times", position: 0 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 0).\nMismatch { message: "expect repeat at least 1 times, found 0 times", position: 0 }") }
Custom { kind: InvalidData, error: StringError("Not a valid PDF file (read object at 0).\nMismatch { message: "expect repeat at least 1 times, found 0 times", position: 0 }") }

Consider using rust log instead of println

This would be more robust. Output can be controlled by users of lopdf crate or by users of crates which use lopdf.

https://crates.io/crates/log

There are nice println!-like macros:

trace!
debug!
info!
warn!
error!

Printing to stdout can get in the way, for example when lopdf is used in command-line utilities. Suppose someone writes a program that prints the number of pages to stdout and then uses it in a bash script (or as input to another CLI program) that does something with that number. This would hardly work for PDF files that trigger a lopdf warning.

It also looks nicer in code: println!("Warning: {}", err) becomes warn!("{}", err).

Or at least print to stderr using eprintln!, or use conditional compilation and add a feature to turn off printing to stdout.
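
As a rough illustration of keeping diagnostics out of stdout using only the standard library, the sketch below lets the caller choose where warnings go. `page_count` and its toy input format are hypothetical stand-ins, not lopdf functions; with the log crate, the `writeln!` call would be a `warn!` instead.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a library routine that may emit warnings.
/// Warnings go to the caller-supplied sink, never to stdout.
fn page_count(raw: &str, diagnostics: &mut impl Write) -> io::Result<usize> {
    let mut count = 0;
    for line in raw.lines() {
        if line == "page" {
            count += 1;
        } else {
            writeln!(diagnostics, "Warning: unexpected entry {:?}", line)?;
        }
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // Only the number reaches stdout, so a shell script can consume it safely.
    let count = page_count("page\nbogus\npage", &mut io::stderr())?;
    println!("{}", count);
    Ok(())
}
```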

How to parse a pdf file and get the text

Support arrays of filters

The Filter entry for streams can be an array of filters, e.g. /Filter [/FlateDecode] instead of /Filter /FlateDecode.
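
One way to support both forms is to normalize the entry into a list of filter names before decoding. The sketch below is standalone; `Obj` is a hypothetical, cut-down object type and `filter_names` is not an existing lopdf function.

```rust
/// Hypothetical, cut-down object type with just the two variants needed here.
#[derive(Debug, Clone, PartialEq)]
enum Obj {
    Name(String),
    Array(Vec<Obj>),
}

/// Returns the filter names in application order, whether the entry is a
/// single name or an array of names. Any other shape yields `None`.
fn filter_names(filter: &Obj) -> Option<Vec<&str>> {
    match filter {
        Obj::Name(name) => Some(vec![name.as_str()]),
        Obj::Array(items) => items
            .iter()
            .map(|item| match item {
                Obj::Name(name) => Some(name.as_str()),
                _ => None,
            })
            .collect(),
    }
}

fn main() {
    let single = Obj::Name("FlateDecode".to_string());
    let array = Obj::Array(vec![
        Obj::Name("ASCII85Decode".to_string()),
        Obj::Name("FlateDecode".to_string()),
    ]);
    println!("{:?}", filter_names(&single));
    println!("{:?}", filter_names(&array));
}
```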

Support Name Tree and Number Tree

A name tree, according to the PDF 32000-1:2008 specification (7.9.6 Name Trees), is like a dictionary but it may be arbitrarily large, the keys are strings (not name objects) and are ordered, and there are various criteria on the values.

e.g. (from the spec):

<</Limits [(Xenon) (Zirconium)]
  /Names [(Xenon) 129 0 R
          (Ytterbium) 130 0 R
          (Yttrium) 131 0 R
          (Zinc) 132 0 R
          (Zirconium) 133 0 R
         ]
>>

A number tree, according to the PDF 32000-1:2008 specification (7.9.7 Number Trees), is 'similar to a name tree' except the keys are integers, sorted in ascending numerical order.

/Nums [ 0 << /S /r >>
        4 << /S /D >>
        7 << /S /D
             /P (A-)
             /St 8
          >>
      ]

They'll probably need an extension to the Object enum.

Should the library support this?

edit: whoops, accidentally submitted the issue part way through writing it.
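
For discussion, here is a small standalone sketch of how a number-tree lookup could work once the tree is in memory: intermediate nodes pick the kid whose limits contain the key, and leaves binary-search their sorted /Nums pairs. `NumberTree` and `Kid` are hypothetical types, not part of lopdf.

```rust
/// Hypothetical in-memory number tree. A leaf holds sorted (key, value) pairs
/// (the /Nums array); an intermediate node holds kids with their /Limits.
enum NumberTree<V> {
    Leaf(Vec<(i64, V)>),
    Node(Vec<Kid<V>>),
}

struct Kid<V> {
    lower: i64,
    upper: i64,
    tree: NumberTree<V>,
}

impl<V> NumberTree<V> {
    fn get(&self, key: i64) -> Option<&V> {
        match self {
            // Keys are sorted in ascending order, so binary search applies.
            NumberTree::Leaf(nums) => nums
                .binary_search_by_key(&key, |entry| entry.0)
                .ok()
                .map(|index| &nums[index].1),
            // Descend only into the kid whose limits contain the key.
            NumberTree::Node(kids) => kids
                .iter()
                .find(|kid| kid.lower <= key && key <= kid.upper)
                .and_then(|kid| kid.tree.get(key)),
        }
    }
}

fn main() {
    let tree = NumberTree::Node(vec![
        Kid { lower: 0, upper: 4, tree: NumberTree::Leaf(vec![(0, "r"), (4, "D")]) },
        Kid { lower: 7, upper: 7, tree: NumberTree::Leaf(vec![(7, "D, prefix A-, start 8")]) },
    ]);
    println!("{:?}", tree.get(4));
    println!("{:?}", tree.get(5));
}
```

A name tree could follow the same shape with byte-string keys in place of `i64`.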

Parsing of all objects on load is slow

I've done a bit of comparison benchmarking for extracting URLs from PDF files.

Test file: PDF 1.7 specification https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

(This file is encrypted with an empty password, that is another feature I would like to bring to lopdf. My code using lopdf cannot yet unscramble strings.)

Using PyPDF2, the process takes 2.7s while just loading the file takes 43s using lopdf. PyPDF2 has some stability / looping issues that are a no-go for me. I would like to improve lopdf performance by a lot.

On load, most of the time is spent in the pom parser in parser.rs.

I see three approaches that would be workable:

1. Make the parser a LOT faster somehow. This keeps the current API, and objects can be accessed immutably in parallel code.
2. Do on-demand parsing of objects. Somewhat complex, can introduce mutability issues, might break the API.
3. Parallelize the parsing using rayon. Would bring a performance gain, but would still consume the same amount of CPU time. Are you open to adding rayon as a dependency?

From a design perspective, which approaches make the most sense to you? I have some time to spend on improving lopdf over the coming weeks.
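
To show how on-demand parsing could avoid the mutability issue, here is a minimal standalone sketch that caches each parsed object in a `std::sync::OnceLock` (Rust 1.70 or later), so lookups work through a shared reference. `LazyDocument`, `LazyObject` and `parse_at` are hypothetical; `parse_at` just reads one line and stands in for the real object parser.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// An object known only by its byte offset until someone asks for it.
struct LazyObject {
    offset: usize,
    parsed: OnceLock<String>,
}

struct LazyDocument {
    data: Vec<u8>,
    objects: BTreeMap<u32, LazyObject>,
    /// Counts real parses, only to make the caching visible in this sketch.
    parse_calls: AtomicUsize,
}

impl LazyDocument {
    fn new(data: Vec<u8>, offsets: &[(u32, usize)]) -> Self {
        let objects = offsets
            .iter()
            .map(|&(id, offset)| (id, LazyObject { offset, parsed: OnceLock::new() }))
            .collect();
        LazyDocument { data, objects, parse_calls: AtomicUsize::new(0) }
    }

    /// Takes `&self`: the first call parses, later calls return the cached value.
    fn get(&self, id: u32) -> Option<&str> {
        let object = self.objects.get(&id)?;
        let text = object.parsed.get_or_init(|| {
            self.parse_calls.fetch_add(1, Ordering::Relaxed);
            parse_at(&self.data, object.offset)
        });
        Some(text.as_str())
    }
}

/// Stand-in for the real parser: returns the line starting at `offset`.
fn parse_at(data: &[u8], offset: usize) -> String {
    let rest = data.get(offset..).unwrap_or(&[]);
    let end = rest.iter().position(|&b| b == b'\n').unwrap_or(rest.len());
    String::from_utf8_lossy(&rest[..end]).into_owned()
}

fn main() {
    let doc = LazyDocument::new(b"1 0 obj\n2 0 obj\n".to_vec(), &[(1, 0), (2, 8)]);
    println!("{:?}", doc.get(2));
    println!("parses so far: {}", doc.parse_calls.load(Ordering::Relaxed));
}
```

Because `OnceLock` is `Sync`, the same structure can be shared across threads, which leaves the door open for combining approaches 2 and 3.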

Findings from fuzzing

I recently started fuzzing this crate and have found a few crashes. The first two issues I've found are a stack overflow and a panic from subtraction overflow; examples are attached. As I continue to find new issues, I'll add them as comments on this issue.
stackoverflow.pdf
subtractoverflow.pdf

The test I'm using to open these files and cause the crash is simply:


use lopdf::Document;

#[test]
fn f1(){
       let _ = Document::load("stackoverflow.pdf");
}

Stream content does not get written when referenced directly, only via reference

Stream content (the Vec<u8>) does not get written if it is inside another lopdf::Object.
The length gets calculated and written (but the length is wrong, maybe it overflows a certain buffer?), but the following stream corrupts the PDF document and doesn't get handled correctly.

Example - this works correctly (stream as reference):

use std::iter::FromIterator;
use lopdf::{Document as LoDocument, Dictionary as LoDictionary,  Object as LoObject, Stream as LoStream};
use lopdf::Object::*;

let problematic_text = "<xml>test</xml>".to_string();
let mut doc = LoDocument::new();
let stream = Stream(LoStream::new(LoDictionary::from_iter(vec![
    ("Type", "Metadata".into()),
    ("Subtype", "XML".into()), ]),
    problematic_text.as_bytes().to_vec() ));

let catalog = Dictionary(LoDictionary::from_iter(vec![
                      ("Metadata", Reference(doc.add_object(stream)) )]));
doc.add_object(catalog);
doc.save("working.pdf");

The above works and the length gets calculated correctly.
Now let's try putting the stream into the catalog as a direct object instead of a reference:

let mut doc = LoDocument::new();

let catalog = Dictionary(LoDictionary::from_iter(vec![
    ("Metadata", Stream(LoStream::new(
        LoDictionary::from_iter(vec![
            ("Type", "Metadata".into()),
            ("Subtype", "XML".into()), ]),
            problematic_text.as_bytes().to_vec() 
         ))
    )]));

This will corrupt the PDF (something with the / and \ is off). It will not create a string and calculate a wrong "Length" for the stream.

I am looking into why this happens. It's merely inconvenient, but I can't explain why the content doesn't get written.

For reference, here is the full context in which I discovered the issue: https://github.com/sharazam/printpdf/blob/master/src/api/types/pdf_document.rs#L156-L184

It's a bit too big to put it directly in an issue (the repository compiles). xmp_metadata is basically this file with the necessary fields filled out (yes, I know, UTF-8 weirdness). The important part is Line 156. If you copy-paste the stream inside there, the PDF will get written without error, but it will be corrupt.

Convert pdf to markdown?

Hi there, how can I use lopdf to convert a PDF document to Markdown format?

Content decoding does not handle inline images

Example pdf file: bi.pdf

Content stream contains:

100 0 0 100 0 0 cm
BI /W 4 /H 4 /CS /RGB /BPC 8
ID
00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz
EI

Chapter 4.8.6 of the PDF Reference covers inline images.

extern crate lopdf;

fn main()
{
	let doc = lopdf::Document::load("bi.pdf").unwrap();
	let cont = doc.get_and_decode_page_content(doc.get_pages()[&1]);
	println!("{:#?}", cont);
}

Ok(
    Content {
        operations: [
            Operation {
                operator: "cm",
                operands: [
                    100,
                    0,
                    0,
                    100,
                    0,
                    0,
                ],
            },
            Operation {
                operator: "BI",
                operands: [],
            },
            Operation {
                operator: "ID",
                operands: [
                    /W,
                    4,
                    /H,
                    4,
                    /CS,
                    /RGB,
                    /BPC,
                    8,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzzEI",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "aazazaazzzaazazzzazzz",
                operands: [],
            },
            Operation {
                operator: "EI",
                operands: [],
            },
        ],
    },
)

To handle this properly, the size of the decoded image data needs to be calculated from parameters such as width, height, bits per component, and color space, and the data then decoded using the filters (note the "EI " byte sequence in the middle of the image data; any byte sequence can occur there). Unfortunately, there is no required "Length" key that could be used to skip the stream data, as there is in normal PDF streams.

This also affects other lopdf functionality that depends on content decoding, such as text extraction. For example, there can be a false-positive "Tj" inside an image. In some circumstances lopdf could also return an error, for instance when a byte sequence in the image data is not a valid UTF-8 string.
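
A minimal sketch of the size calculation, assuming unfiltered image data, rows padded to a whole byte, and only the three device color spaces; `components` and `inline_image_len` are hypothetical helpers, not lopdf functions. For the sample above (/W 4 /H 4 /CS /RGB /BPC 8) it gives 48 bytes, which matches the 48 characters on the data line, "EI " included.

```rust
/// Number of color components for a few inline-image color space names
/// (abbreviated and full forms). Other color spaces are not handled here.
fn components(color_space: &str) -> Option<u64> {
    match color_space {
        "G" | "DeviceGray" => Some(1),
        "RGB" | "DeviceRGB" => Some(3),
        "CMYK" | "DeviceCMYK" => Some(4),
        _ => None,
    }
}

/// Byte length of unfiltered image data. Each row is padded to a whole byte.
/// Returns `None` on arithmetic overflow, since the inputs come from the file.
fn inline_image_len(width: u64, height: u64, components: u64, bits_per_component: u64) -> Option<u64> {
    let bits_per_row = width
        .checked_mul(components)?
        .checked_mul(bits_per_component)?;
    let bytes_per_row = bits_per_row.checked_add(7)? / 8;
    bytes_per_row.checked_mul(height)
}

fn main() {
    // Parameters from the sample: /W 4 /H 4 /CS /RGB /BPC 8
    let n = components("RGB").expect("known color space");
    println!("{:?}", inline_image_len(4, 4, n, 8));
}
```

With the length known, a decoder can skip exactly that many bytes after ID and only then look for EI. Filtered inline images would still need a different strategy.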

How to get page numbers and IDs

I tried the below:

use lopdf::{Document, Object, Stream};

fn main() {
    println!("Hello, world!");
    let doc = Document::load("HasanResume.pdf").unwrap();
    println!("version: {:#?}, trailer: {:#?}", &doc.version, &doc.trailer);
    let pages = doc.get_pages().iter()
        .map(|page| println!("pagenum: {:#?}, pageid: {:#?}",
                          page.0, page.1)).count();
    println!("{}", pages);
}

But got the below:

    Finished dev [unoptimized + debuginfo] target(s) in 1.13s
     Running `target/debug/read-pdf`
Hello, world!
version: "1.7", trailer: <</Info 92 0 R/Root 1 0 R/Size 93>>
0

Reading a PDF file saved from Chrome gives garbled text

    let mut doc = Document::load("bbb.pdf").unwrap();

    for page_id in doc.page_iter() {
        let x = doc.get_and_decode_page_content(page_id).unwrap();
        let y = x.operations;
        for i in y.iter() {
            if i.operator == "Tj".to_string() {
                let i2 = &i.operands[0];
                println!("{:?}", i2);
            } else {
//                println!("{:?}", i.operator);
            }
        }
    }

The printed values are garbled. How can I convert them to normal characters? They look like this:

(�N)
(�N)
(�Q)
(��)
(�Y)
(�Q)
(�T)
(�N)
(�F)
(��)
(��)
(��)
(�_)

Add tests from Caradoc

Caradoc is a parser and validator of PDF files written in OCaml. It provides many commands to analyze PDFs, as well as an interactive console user interface. The project has an interesting set of PDF files that could make good test cases for lopdf.

The test files are here: https://github.com/caradoc-org/caradoc/tree/master/test_files

See more information in their presentation:

thread 'main' panicked at 'Stream Length should be an integer.', libcore/option.rs:914:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:205
   3: std::panicking::default_hook
             at libstd/panicking.rs:221
   4: <std::panicking::begin_panic::PanicPayload<A> as core::panic::BoxMeUp>::get
             at libstd/panicking.rs:457
   5: std::panicking::try::do_call
             at libstd/panicking.rs:344
   6: std::panicking::try::do_call
             at libstd/panicking.rs:322
   7: <&'a T as core::fmt::Display>::fmt
             at libcore/panicking.rs:71
   8: core::ptr::drop_in_place
             at libcore/option.rs:914
   9: alloc::raw_vec::alloc_guard
             at /Users/travis/build/rust-lang/rust/src/libcore/option.rs:302
  10: lopdf::parser::stream::{{closure}}
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/parser.rs:114
  11: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:501
  12: lopdf::encodings::bytes_to_string::{{closure}}
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:621
  13: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:501
  14: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  15: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:34
  16: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  17: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:520
  18: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  19: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:516
  20: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  21: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  22: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  23: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  24: lopdf::encodings::bytes_to_string::{{closure}}
             at /Users/travis/build/rust-lang/rust/src/libcore/result.rs:621
  25: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:453
  26: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  27: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  28: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  29: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  30: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  31: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:469
  32: <pdf_word_count::WordCount as core::default::Default>::default
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/pom-1.1.0/src/parser.rs:23
  33: lopdf::reader::Reader::read_object
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:139
  34: <std::collections::hash::map::RandomState as core::hash::BuildHasher>::build_hasher
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:91
  35: lopdf::reader::<impl lopdf::document::Document>::load_internal
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:38
  36: lopdf::reader::<impl lopdf::document::Document>::load_from
             at /Users/ek/.cargo/registry/src/github.com-1ecc6299db9ec823/lopdf-0.15.1/src/reader.rs:26
  37: pdf_word_count::Collector::process_document
             at ./src/lib.rs:27
  38: pdf_wc::main
             at src/main.rs:23
  39: std::rt::lang_start::{{closure}}
             at /Users/travis/build/rust-lang/rust/src/libstd/rt.rs:74
  40: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:304
  41: panic_unwind::dwarf::eh::read_encoded_pointer
             at libpanic_unwind/lib.rs:105
  42: std::sys_common::cleanup
             at libstd/panicking.rs:283
             at libstd/panic.rs:361
             at libstd/rt.rs:58
  43: std::rt::lang_start
             at /Users/travis/build/rust-lang/rust/src/libstd/rt.rs:74
  44: pdf_wc::main

Please publish a new version of lopdf

I can't publish my own library if it depends on a github repository (required by cargo, because this repo could be deleted). Please publish a new version. Thanks.