Comments (6)

packdat commented on September 22, 2024

The PDF has some interesting properties...

There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0).
This indirect object is not stored as a regular object, but as an entry in an object-stream.

This is an issue because of the way the library loads the objects stored in a PDF:

  1. It reads all regular objects (objects stored directly in the PDF)
  2. It reads all objects stored in object-streams

The exception is thrown in phase 1:
The mentioned stream-object is a regular object.
When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.
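
The failure mode can be reproduced with a toy model in Python. This is only a sketch and not PDFsharp code: `load_two_phase`, the `length_ref` key, and the dictionary layout are hypothetical names chosen for illustration.

```python
# Toy model of a two-phase load; all names are hypothetical.
# file_level:     {object_id: dict} for objects stored directly in the file
# object_streams: {object_id: dict} for objects packed inside object streams

def load_two_phase(file_level, object_streams):
    loaded = {}
    # Phase 1: regular objects. A /Length reference has to resolve right now,
    # and at this point only file-level objects are reachable.
    for obj_id, obj in file_level.items():
        ref = obj.get("length_ref")
        if ref is not None and ref not in file_level:
            raise KeyError(f"Invalid object ID: {ref}")
        loaded[obj_id] = obj
    # Phase 2: objects from object streams become visible only here,
    # which is too late for the stream object handled in phase 1.
    loaded.update(object_streams)
    return loaded
```

Calling it with a stream object `(5, 0)` whose length lives in `(12, 0)` inside an object stream raises, although `(12, 0)` is defined in the file.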

While the PDF-spec states that objects representing the /Length of object-streams shall not be stored in an object-stream, it says nothing about objects representing the /Length of ordinary stream-objects.
So at least regarding the spec, the PDF seems perfectly fine.

That said, it's the first time I have encountered a PDF doing this, and I've seen a lot of PDFs over the last 10+ years!

Because it does not seem to be common practice to store objects like that, a fix handling this specific case should be straightforward.
As there are many PDFs out there with incorrectly specified /Length entries for stream-objects, the case mentioned here should be treated as a special case of an incorrectly specified stream-length.
The library should use a fallback in these cases and should attempt to locate the endstream-keyword manually.
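
A minimal sketch of such a fallback, written in Python for illustration (this is not the code from the PR; `find_stream_length` is a hypothetical helper). It assumes the stream data does not itself contain the bytes `endstream`, which a naive scan cannot rule out for binary data, so a real implementation would need extra checks (for example, that `endobj` follows).

```python
def find_stream_length(data: bytes, stream_start: int) -> int:
    """Return the payload length of the stream beginning at stream_start,
    determined by scanning forward for the 'endstream' keyword."""
    end = data.find(b"endstream", stream_start)
    if end == -1:
        raise ValueError("no endstream keyword found")
    # An end-of-line marker before 'endstream' is not part of the stream data.
    if end - 2 >= stream_start and data[end - 2:end] == b"\r\n":
        end -= 2
    elif end - 1 >= stream_start and data[end - 1:end] in (b"\n", b"\r"):
        end -= 1
    return end - stream_start
```

The returned value can then stand in for an unresolvable or incorrect /Length when reading the stream bytes.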

@ThomasHoevel I already have code doing this, tested successfully with the attached PDF.
I could open a PR (unless you would rather treat PDFs with incorrect or missing stream-lengths as invalid).

In the long run, the library should be adapted to read objects in a way that can locate referenced objects regardless of their location, i.e. whether they are stored on the file-level or in object-streams.
(think of Font-objects storing their Descriptor inside object-streams...)
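
One way to picture location-independent loading is a resolver that consults the cross-reference table on demand. The Python sketch below is an assumption-laden illustration, not PDFsharp's design: `LazyResolver`, the tuple layout of the xref entries, and both callbacks are hypothetical.

```python
class LazyResolver:
    """Resolve objects on demand, wherever they are stored.

    xref maps an object number to one of (hypothetical layout):
      ("file", byte_offset)                  object stored directly in the file
      ("objstm", container_number, index)    object stored in an object stream
    """

    def __init__(self, xref, read_at_offset, read_object_stream):
        self.xref = xref
        self.read_at_offset = read_at_offset          # callback: offset -> object
        self.read_object_stream = read_object_stream  # callback: container -> list of objects
        self.cache = {}

    def resolve(self, num):
        if num in self.cache:
            return self.cache[num]
        entry = self.xref.get(num)
        if entry is None:
            obj = None  # no definition in the file at all
        elif entry[0] == "file":
            obj = self.read_at_offset(entry[1])
        else:
            # Load the containing object stream first, then pick the entry.
            container = self.resolve(entry[1])
            obj = self.read_object_stream(container)[entry[2]]
        self.cache[num] = obj
        return obj
```

With this shape, a /Length reference resolves the same way whether the target sits at file level or inside an object stream, and nothing is parsed before it is needed.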

from pdfsharp.

TH-Soft commented on September 22, 2024

@packdat Thanks for your feedback and analysis.
PDFsharp 1.5 did not read object streams at all. Support for object streams was added to an old architecture and there are some issues coming from that.
Stefan has to decide about the PR. We will look at it if you create it.

andresdbv commented on September 22, 2024

Hi @packdat, thanks for your help.

I have had trouble with similar files that are scanned. PDFs converted from .dwg files also throw errors when I try to open them with PDFSharp, but right now I don't have an example of those.

Is there any tip you can give me to relay to the people who are scanning these documents, so that we can avoid this problem and have a clean and safe PDF to use with PDFSharp?

packdat commented on September 22, 2024

Hi @andresdbv

The metadata of the PDF states: <pdf:Producer>PDFlib 8.0.0 (Win32)</pdf:Producer>
This seems to be the library used to create the PDF.
I would start by asking them if they could update this library, try different parameters when creating the PDF, or use a different library altogether.
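
To check which producer wrote a given file, the XMP packet can often be searched directly, since it is usually stored uncompressed. A rough Python sketch under that assumption (`find_xmp_producer` is a hypothetical helper; it returns None when the packet is compressed, encrypted, or absent):

```python
import re

def find_xmp_producer(pdf_bytes: bytes):
    """Return the text of the first <pdf:Producer> element found, or None."""
    match = re.search(rb"<pdf:Producer>(.*?)</pdf:Producer>", pdf_bytes, re.DOTALL)
    if match is None:
        return None
    return match.group(1).decode("utf-8", errors="replace").strip()
```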

If all they do is convert images obtained from a scanner to PDF, you could write your own little tool based on PDFsharp that does the conversion and let them use that.
That should definitely create compatible PDFs 😉

Audionysos commented on September 22, 2024

@packdat

> There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0). This indirect object is not stored as a regular object, but as an entry in an object-stream.
>
> This is an issue because of the way the library loads the objects stored in a PDF:
>
>   1. It reads all regular objects (objects stored directly in the PDF)
>   2. It reads all objects stored in object-streams
>
> The exception is thrown in phase 1: The mentioned stream-object is a regular object. When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.

Hi, I downloaded the recent ISO spec 2 days ago and I'm slowly reading it... I'm completely new to this so I may be mistaken, but I believe you shouldn't throw in this case. In 7.3.10 it's stated:

> An indirect reference to an undefined object shall not be considered an error by a PDF processor; it shall be treated as a reference to the null object.
>
> EXAMPLE 2: If a file contains the indirect reference 17 0 R but does not contain the corresponding definition then the indirect reference is considered to refer to the null object.

So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.

And about your point (step) 2: I believe the PDF was specifically designed so there is no need to load the whole thing into memory. In version 1.0 at https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf it's mentioned several times that the format should allow reading only a single page (from a big document) and should be suitable for devices with low memory...
I believe only top objects should be loaded at the beginning and the streams should be extracted as you go. A PDF writer should write the file in such a way that the indirect object should be already extracted when the reference to it is used but before that it's just null reference.

Does this make sense? Again, I didn't even go through half of the specs, so sorry if I made some confusion, but I'm also curious about this.

packdat commented on September 22, 2024

> I'm completely new to this so I may be mistaken but I believe you shouldn't throw in this case.

I'm not throwing anything; the PR attempts to avoid that.

> So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.

Correct.
But even if the library returned a null object, that would also be wrong, because the object is not missing, it's just not known yet.
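
The distinction between "undefined" and "not loaded yet" can be made explicit. A small Python sketch with hypothetical names (`classify_reference`, `xref_numbers`, `loaded`), assuming the reader already knows from the cross-reference data which object numbers the file defines:

```python
def classify_reference(obj_num, xref_numbers, loaded):
    """Tell apart three situations for an indirect reference."""
    if obj_num in loaded:
        return "loaded"
    if obj_num in xref_numbers:
        return "pending"  # defined in the file, just not parsed yet
    return "null"         # no definition anywhere: treat as the null object
```

Only the last case matches the rule in 7.3.10; a "pending" reference has to be resolved later rather than replaced by null.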

> I believe the PDF was specifically designed so there is no need to load the whole thing into memory.

Are we talking about PDF in general or about the specific PDF mentioned here?
If we're talking about the latter, I have to disagree.
From an efficiency standpoint, this comes close to "worst case" IMO.
Why bury a stream-length in an object-stream that needs to be located, unpacked, and parsed to extract a single integer value when you could store that value as a direct object in the /Length entry?

> I believe only top objects should be loaded at the beginning and the streams should be extracted as you go.

That's exactly what #85 attempts to do.

> A PDF writer should write the file in such a way that the indirect object should be already extracted when the reference to it is used

In a perfect world, that would be the case.
But PDF writers are free to store their objects wherever they please, as long as they obey the spec.

> sorry if I made some confusion

No worries. PDF (and its "flavors") can be a confusing matter at times.
