I have been using 1.50.5147.0 for years to open PDFs and add some text to them. The th

System.AggregateException: 'Invalid object ID.' when trying to open some pdfs

andresdbv commented on September 22, 2024

System.AggregateException: 'Invalid object ID.' when trying to open some pdfs

from pdfsharp.

Comments (6)

packdat commented on September 22, 2024

The PDF has some interesting properties...

There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0).
This indirect object is not stored as a regular object, but as an entry in an object-stream.

This is an issue because of the way the library loads the objects stored in a PDF:

1. It reads all regular objects (objects stored directly in the PDF).
2. It reads all objects stored in object-streams.

The exception is thrown in phase 1:
The mentioned stream-object is a regular object.
When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.
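
The failure mode described above can be reproduced with a toy model. This is only a sketch in Python, not PDFsharp's C# code: `load_two_phase`, `InvalidObjectId`, and the dictionary layout are hypothetical names chosen for illustration, and real parsing is replaced by plain dictionaries.

```python
class InvalidObjectId(Exception):
    """Stands in for the 'Invalid object ID.' error from the report."""


def load_two_phase(file_objects):
    """Toy two-phase loader: regular objects first, object streams second."""
    table = {}
    # Phase 1: regular objects only. A stream needs its length right away,
    # so the /Length reference is looked up among the regular objects.
    for obj_id, obj in file_objects.items():
        if obj["kind"] == "stream":
            target = file_objects.get(obj["length_ref"])
            if target is None or target["kind"] != "int":
                raise InvalidObjectId(f"Invalid object ID {obj['length_ref']}")
            table[obj_id] = ("stream", target["value"])
        elif obj["kind"] == "int":
            table[obj_id] = obj["value"]
    # Phase 2: objects packed inside object streams become visible only now.
    for obj in file_objects.values():
        if obj["kind"] == "objstm":
            table.update(obj["members"])
    return table


# Stream 5 0 takes its /Length from 12 0, which lives inside object stream 7 0.
PROBLEM_FILE = {
    (5, 0): {"kind": "stream", "length_ref": (12, 0)},
    (7, 0): {"kind": "objstm", "members": {(12, 0): 42}},
}

# Same stream, but 12 0 is stored as a regular object.
REGULAR_FILE = {
    (5, 0): {"kind": "stream", "length_ref": (12, 0)},
    (12, 0): {"kind": "int", "value": 42},
}
```

With `PROBLEM_FILE` the loader raises in phase 1, because object 12 0 only becomes known in phase 2; with `REGULAR_FILE` it loads without error.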

While the PDF-spec states that objects representing the /Length of object-streams shall not be stored in an object-stream, it says nothing about objects representing the /Length of ordinary stream-objects.
So at least regarding the spec, the PDF seems perfectly fine.

That said, it's the first time I have encountered a PDF doing this, and I've seen a lot of PDFs over the last 10+ years!

Because it does not seem to be common practice to store objects like that, a fix handling this specific case should be straightforward.
As there are many PDFs out there with incorrectly specified /Length entries for stream-objects, the case mentioned here should be treated as a special case of an incorrectly specified stream-length.
The library should use a fallback in these cases and should attempt to locate the endstream-keyword manually.
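
A minimal sketch of such a fallback, written in Python rather than PDFsharp's C# and not taken from the PR. `read_stream_data` and its parameters are hypothetical names. The sketch trusts the declared length only when the `endstream` keyword actually follows the data at that position, and otherwise scans forward for the keyword. A plain scan can stop early if the binary data happens to contain the bytes `endstream`, so this is an illustration, not a robust parser.

```python
from typing import Optional

EOLS = (b"\r\n", b"\n", b"\r")


def _followed_by_endstream(buf: bytes, pos: int) -> bool:
    """True if 'endstream' starts at pos, optionally after one end-of-line."""
    return any(buf.startswith(eol + b"endstream", pos) for eol in (b"",) + EOLS)


def read_stream_data(buf: bytes, data_start: int,
                     declared_length: Optional[int]) -> bytes:
    """Return the stream bytes that begin at data_start.

    declared_length is the resolved /Length value, or None if it could not
    be resolved (for example because it lives in an unread object stream).
    """
    if declared_length is not None and declared_length >= 0:
        end = data_start + declared_length
        if end <= len(buf) and _followed_by_endstream(buf, end):
            return buf[data_start:end]
    # Fallback: the length is missing or wrong, so search for the keyword.
    idx = buf.find(b"endstream", data_start)
    if idx < 0:
        raise ValueError("no endstream keyword found")
    end = idx
    # Drop the single end-of-line marker that precedes the keyword, if any.
    for eol in EOLS:
        if buf.endswith(eol, data_start, end):
            end -= len(eol)
            break
    return buf[data_start:end]


SAMPLE = b"5 0 obj\n<< /Length 12 0 R >>\nstream\nHello PDF!\nendstream\nendobj\n"
SAMPLE_START = SAMPLE.index(b"stream\n") + len(b"stream\n")
```

Calling `read_stream_data(SAMPLE, SAMPLE_START, None)` exercises the fallback path, while passing the correct length of 10 takes the fast path; both return the same bytes.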

@ThomasHoevel I already have code doing this, tested successfully with the attached PDF.
I could open a PR (unless you want to consider incorrect/missing stream-lengths as invalid PDF)

In the long run, the library should be adapted to read objects in a way that can locate referenced objects regardless of their location, i.e. whether they are stored on the file-level or in object-streams.
(think of Font-objects storing their Descriptor inside object-streams...)
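
One way to picture location-independent lookup is a resolver that consults the cross-reference information on demand and unpacks an object stream only when something inside it is requested. The following Python sketch is an assumption-laden illustration, not PDFsharp's design: `LazyResolver`, its constructor arguments, and the tuple-based xref entries are all hypothetical, and dictionaries stand in for real parsing and decompression.

```python
class LazyResolver:
    """Resolves object IDs on demand, wherever the object is stored."""

    def __init__(self, xref, file_objects, object_streams):
        # xref: id -> ("file",) or ("objstm", container_id)
        self._xref = xref
        # Stands in for parsing an object at a byte offset.
        self._file_objects = file_objects
        # Stands in for decoding an object stream: container_id -> {id: value}
        self._object_streams = object_streams
        self._cache = {}
        self.unpacked = []  # which object streams were unpacked, in order

    def resolve(self, obj_id):
        if obj_id in self._cache:
            return self._cache[obj_id]
        entry = self._xref.get(obj_id)
        if entry is None:
            return None  # no xref entry at all: treated as the null object
        if entry[0] == "file":
            value = self._file_objects[obj_id]
        else:
            container_id = entry[1]
            # Unpack the container once and cache all of its members.
            self.unpacked.append(container_id)
            members = self._object_streams[container_id]
            self._cache.update(members)
            value = members[obj_id]
        self._cache[obj_id] = value
        return value

    def stream_length(self, stream_dict):
        """Return /Length, following an indirect reference if needed."""
        length = stream_dict["Length"]
        if isinstance(length, tuple):
            length = self.resolve(length)
        return length


resolver = LazyResolver(
    xref={(5, 0): ("file",),
          (12, 0): ("objstm", (7, 0)),
          (13, 0): ("objstm", (7, 0))},
    file_objects={(5, 0): {"Length": (12, 0)}},
    object_streams={(7, 0): {(12, 0): 42, (13, 0): "FontDescriptor"}},
)
```

Here `resolver.stream_length(resolver.resolve((5, 0)))` follows the reference into object stream 7 0 and returns the length, and a later lookup of 13 0 is served from the cache without unpacking the container again.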

TH-Soft commented on September 22, 2024

@packdat Thanks for your feedback and analysis.
PDFsharp 1.5 did not read object streams at all. Support for object streams was added to an old architecture and there are some issues coming from that.
Stefan has to decide about the PR. We will look at it if you create it.

andresdbv commented on September 22, 2024

Hi @packdat, thanks for your help.

I have had trouble with similar files that were scanned. PDFs converted from .dwg files also throw errors when I try to open them with PDFsharp, but right now I don't have an example of those.

Is there any tip you could give me to relay to the people who scan these documents, so that we can avoid this problem and get clean, safe PDFs to use with PDFsharp?

packdat commented on September 22, 2024

Hi @andresdbv

The metadata of the PDF states: <pdf:Producer>PDFlib 8.0.0 (Win32)</pdf:Producer>
This seems to be the library used to create the PDF.
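
To check which producer created other problematic files, the XMP packet can often be scanned directly, since XMP metadata is frequently stored uncompressed. A rough Python sketch under that assumption; `find_xmp_producer` is a hypothetical helper, and it finds nothing if the metadata stream is compressed or the producer is recorded only in the document information dictionary.

```python
import re
from typing import Optional

PRODUCER_RE = re.compile(rb"<pdf:Producer>(.*?)</pdf:Producer>", re.DOTALL)


def find_xmp_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the text of the first <pdf:Producer> element, or None."""
    match = PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("utf-8", errors="replace").strip()


SAMPLE_PDF = (b"%PDF-1.5\n<x:xmpmeta>"
              b"<pdf:Producer>PDFlib 8.0.0 (Win32)</pdf:Producer>"
              b"</x:xmpmeta>\n%%EOF\n")
```

In practice the bytes would come from reading the file in binary mode, for example `open(path, "rb").read()`.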
I would start by asking them if they could update this library, try different parameters when creating the PDF, or use a different library altogether.

If all they do is convert images obtained from a scanner to PDF, you could write your own little tool based on PDFsharp that does the conversion and let them use that.
That should definitely create compatible PDFs 😉

Audionysos commented on September 22, 2024

@packdat

> There is a stream-object with a /Length entry referring to an indirect object with the ID (12 0). This indirect object is not stored as a regular object, but as an entry in an object-stream.
>
> This is an issue because of the way the library loads the objects stored in a PDF:
>
> 1. It reads all regular objects (objects stored directly in the PDF)
> 2. It reads all objects stored in object-streams
>
> The exception is thrown in phase 1: The mentioned stream-object is a regular object. When trying to locate the object referred to by the /Length entry, the library cannot find it (because it is stored in an object-stream that is not yet processed) and throws the exception.

Hi, I downloaded the recent ISO spec just 2 days ago and I'm slowly reading it... I'm completely new to this so I may be mistaken but I believe you shouldn't throw in this case. In 7.3.10 it's stated:

> An indirect reference to an undefined object shall not be considered an error by a PDF processor; it shall be treated as a reference to the null object.
>
> EXAMPLE 2
> If a file contains the indirect reference 17 0 R but does not contain the corresponding definition then the indirect reference is considered to refer to the null object.

So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.
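
The quoted rule can be sketched as follows. This is a Python illustration with hypothetical names (`RefState`, `classify`, `resolve_or_null`), not library code. It assumes the loader knows which object IDs the file defines (`defined_ids`) separately from which objects it has already read (`loaded`), so that "defined but not read yet" is kept apart from "undefined"; only the latter falls under the null-object rule.

```python
from enum import Enum


class RefState(Enum):
    LOADED = "loaded"        # defined and already read
    PENDING = "pending"      # defined in the file, but not read yet
    UNDEFINED = "undefined"  # no definition anywhere in the file


def classify(ref, defined_ids, loaded):
    """Classify an indirect reference such as (17, 0)."""
    if ref in loaded:
        return RefState.LOADED
    if ref in defined_ids:
        return RefState.PENDING
    return RefState.UNDEFINED


def resolve_or_null(ref, defined_ids, loaded):
    """Return the object, or None (the null object) if it is undefined."""
    state = classify(ref, defined_ids, loaded)
    if state is RefState.UNDEFINED:
        return None  # per 7.3.10: not an error, refers to the null object
    if state is RefState.PENDING:
        # The object exists; the caller has to load it before using it.
        raise LookupError(f"object {ref} is defined but not loaded yet")
    return loaded[ref]


DEFINED_IDS = {(5, 0), (12, 0)}
LOADED = {(5, 0): "stream object"}
```

With these sample values, the reference 17 0 resolves to `None`, while 12 0 is reported as pending rather than silently replaced by null.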

And about your point (step) 2: I believe the PDF was specifically designed so there is no need to load whole thing into memory. In version 1.0 at https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf it's mentioned several times that the format should allow reading only a single page (from a big document) and should be suitable for devices with low memory.
I believe only top objects should be loaded at the beginning and the streams should be extracted as you go. A PDF writer should write the file in such a way that the indirect object should be already extracted when the reference to it is used; before that, it's just a null reference.

Does this make sense? Again, I didn't even go through half of the specs, so sorry if I made some confusion, but I'm also curious about this.

packdat commented on September 22, 2024

> I'm completely new to this so I may be mistaken but I believe you shouldn't throw in this case.

I'm not throwing anything; the PR attempts to avoid that.

> So the System.AggregateException: 'Invalid object ID.' should not happen according to spec.

Correct.
But even if the library returned a null object, that would also be wrong, because the object is not missing; it's just not known yet.

> I believe the PDF was specifically designed so there is no need to load whole thing into memory.

Are we talking about PDF in general or about the specific PDF mentioned here?
If we're talking about the latter, I have to disagree.
From an efficiency standpoint, this comes close to the "worst case", IMO.
Why bury a stream-length in an object-stream that needs to be located, unpacked, and parsed to extract a single integer value, when you could store that value as a direct object in the /Length property?

> I believe only top objects should be loaded at the beginning and the streams should be extracted as you go.

That's exactly what #85 attempts to do.

> A PDF writer should write the file in such a way that the indirect object should be already extracted when the reference to it is used

In a perfect world, that would be the case.
But PDF writers are free to store their objects wherever they please, as long as they obey the spec.

> sorry if I made some confusion

No worries. PDF (and "flavors" thereof) can be a confusing matter at times.
