Comments (14)

mintern commented on September 23, 2024

The error is happening in the constructor: new PdfReader(inputStream).

I'm locating the file now to see if it can be shared.

from openpdf.

riccardo-noviello commented on September 23, 2024

OK, I have done a fair bit of reading. Interestingly, Acrobat Reader 9 came up several times in my search, though I'm not 100% sure it is the cause of the problem.

The file TRAILER in a PDF contains information such as the number of objects in the document. As I understand it, it is mandatory. However, it looks like some files do not have a TRAILER keyword followed by a TRAILER dictionary; instead, this information lives in the XREF stream dictionary, which technically still counts as a valid trailer.

Unfortunately, I was unable to write a unit test because I don't have a PDF with the structure described above. Your suggestion to return an empty PdfDictionary instead of null makes sense, but it cannot be justified as a fix without proper testing.

Since you are unable to share the malformed PDF document, I'd suggest you edit the PdfReader class: make the protected PdfDictionary readXrefSection() method return an empty PdfDictionary instead of null, just as you suggested. Then see if you can read your PDF. If you can, a fix should be possible; otherwise the file is probably corrupted.

Hope this information helps you.

mkl-public commented on September 23, 2024

One problem: If the startxref occurs partially in one read packet and partially in another, it won't be found. This can be resolved by having the packets overlap, e.g. by initializing pos with

int pos = file.length() - 8;

and reading a packet with

String str = readString(step + 8);
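
The overlap idea can be sketched in isolation. This is only an illustration under stated assumptions: it works on a byte array rather than on the tokeniser's file object, and `StartxrefSearch` and `findLast` are hypothetical names, not part of OpenPDF. Instead of shifting the initial `pos`, this variant extends each window by 8 bytes (the keyword length minus one) and clamps it to the end of the data.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not OpenPDF API: backward keyword search with overlapping windows. */
final class StartxrefSearch {
    private static final byte[] KEYWORD = "startxref".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of the last "startxref" in data, or -1 if it does not occur. */
    static int findLast(byte[] data, int step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        // A keyword that starts in one window may spill up to 8 bytes into the next one.
        int overlap = KEYWORD.length - 1;
        int pos = data.length;
        while (pos > 0) {
            pos = Math.max(0, pos - step);
            int end = Math.min(data.length, pos + step + overlap);
            // Scan the window [pos, end) from right to left so the last match wins.
            for (int i = end - KEYWORD.length; i >= pos; i--) {
                if (matchesAt(data, i)) {
                    return i;
                }
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int k = 0; k < KEYWORD.length; k++) {
            if (data[offset + k] != KEYWORD[k]) {
                return false;
            }
        }
        return true;
    }
}
```

Because each window is responsible for the match positions in `[pos, pos + step)` and can read 8 bytes beyond that range, a keyword straddling a window boundary is still found, whatever the step size.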

One aside, though, I don't concur with

> But actually it does not matter, because OpenPDF library should be able to work with any PDF. Users do not care if the original file was produced not according to the standards.

While users indeed often don't care whether a PDF was produced not according to the standards and expect any file Acrobat Reader can display to be processable by one's software, OpenPDF (or any other PDF library) should not simply give in and allow any deviation from the standard that some PDF viewer somewhere in the wild supports.

One reason those users should understand is that deviations from the standard are open to different interpretations by different software, so something broken may be displayed one way by a tolerant viewer and processed a different way by tolerant software that processes the file further. Thus, what the user wanted to send into some process (which might in the end involve forwarding the resulting PDF to thousands or millions of customers) might differ from what is actually processed (and so is eventually read by those thousands or millions of customers in the example). Other reasons are related to attack vectors for forgery of signed data, etc.

So one probably should create some flag (e.g. in some static variable) and only allow such deviations if that flag is set.
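
Such a switch could look roughly like the following. This is a minimal sketch, not existing OpenPDF code: the class `PdfLeniency` and its methods are hypothetical names, and a real implementation would presumably throw the library's own exception type rather than `IllegalStateException`.

```java
/** Hypothetical sketch, not OpenPDF API: a global opt-in switch for tolerant parsing. */
final class PdfLeniency {
    // Strict by default; callers must opt in to accepting non-conforming files.
    private static volatile boolean lenient = false;

    private PdfLeniency() {
    }

    static void setLenient(boolean value) {
        lenient = value;
    }

    static boolean isLenient() {
        return lenient;
    }

    /** Call at each point where the parser detects a deviation from the standard. */
    static void checkDeviation(String description) {
        if (!lenient) {
            throw new IllegalStateException("Non-conforming PDF: " + description);
        }
        // In lenient mode the parser carries on; a warning could be logged here.
    }
}
```

A parser would call `PdfLeniency.checkDeviation("startxref not within the last 1024 bytes")` before falling back to a wider search, so the tolerant path only runs when the application has explicitly enabled it.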

riccardo-noviello commented on September 23, 2024

Do you have any more information on how to reproduce this error, please? Could you attach the file you are attempting to read and/or a snippet of the code you use to read it, or, even better, a unit test? Many thanks.

mintern commented on September 23, 2024

Unfortunately, I cannot publish the file, and I don't have enough expertise or time to share a sanitized version. Sorry about that.

riccardo-noviello commented on September 23, 2024

No worries, I'll take a look at this as soon as I can anyway.

daviddurand commented on September 23, 2024

I've looked at this before (thought, in fact, I'd fixed it, once upon a time, but there may be more than one trailer check/reference).

Trailers are no longer required for PDF 1.5 and greater. I'll try to make an example, and take a look at the crash.

The trailer isn't really used for much, since it's just an optimization for finding the directory by looking at the end of a file.

mspnr commented on September 23, 2024

I get a NullPointerException in readPages in PdfReader.java because trailer is null.

  protected void readPages() throws IOException {
    catalog = trailer.getAsDict(PdfName.ROOT);            // <-- NullPointerException

This is pretty critical, because it happens at the very beginning and nothing can be done with the document.

The problem appeared as mentioned before because startxref can't be read correctly. I also can't share my PDF. But here is a screenshot of its last bytes. I scrolled to the very end, lines are wrapped.

[screenshot: the last bytes of the PDF file]

It turned out that the PDF has ~2050 bytes between startxref and the end of the file.

But getStartxref in PRTokeniser.java reads a fixed 1024 bytes from the end of the document, and this is not enough:

    public int getStartxref() throws IOException {
        int size = Math.min(1024, file.length());         // <-- hardcoded 1024
        int pos = file.length() - size;
        file.seek(pos);
        String str = readString(1024);                    // <-- once again
        int idx = str.lastIndexOf("startxref");
        if (idx < 0)
            throw new InvalidPdfException(MessageLocalization.getComposedMessage("pdf.startxref.not.found"));
        return pos + idx;
    }

I experimented with increasing the 1024 limit, and it worked well.

I would suggest replacing 1024 with a larger number, or searching for startxref dynamically from the end of the file.

I am not a big pro in PDFs, but here I've found a mentioning, that the requirement, that %%EOF should appear at the end of file was dropped from ISO, whatever that means.

mkl-public commented on September 23, 2024

> I am not a big pro in PDFs, but here I've found a mentioning, that the requirement, that %%EOF should appear at the end of file was dropped from ISO, whatever that means.

I'm afraid you misinterpreted that. What's meant there is that the relaxed implementation note by Adobe (which only requires the %%EOF to appear somewhere within the last 1024 bytes of the file) has been dropped when PDF became an ISO standard.

The ISO standard requires: The last line of the file shall contain only the end-of-file marker, %%EOF. And both OpenPDF and iText accept somewhat invalid PDFs with a certain amount of extra trash thereafter.
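
To see how far a given file deviates from that requirement, one can count the bytes that follow the last %%EOF marker. The sketch below is only an illustration: `EofCheck` and `bytesAfterEof` are hypothetical names, not OpenPDF API, and it assumes the whole file is available as a byte array and that a single trailing end-of-line after the marker is acceptable.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not OpenPDF API: measures the extra data after the last %%EOF. */
final class EofCheck {
    private static final byte[] MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the number of bytes after the last %%EOF marker, not counting one
     * optional end-of-line (CR, LF, or CR LF), or -1 if there is no marker at all.
     */
    static int bytesAfterEof(byte[] data) {
        for (int i = data.length - MARKER.length; i >= 0; i--) {
            if (matchesAt(data, i)) {
                int rest = i + MARKER.length;
                // Skip a single end-of-line directly after the marker.
                if (rest < data.length && data[rest] == '\r') {
                    rest++;
                }
                if (rest < data.length && data[rest] == '\n') {
                    rest++;
                }
                return data.length - rest;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int k = 0; k < MARKER.length; k++) {
            if (data[offset + k] != MARKER[k]) {
                return false;
            }
        }
        return true;
    }
}
```

A result of 0 means the marker ends the file; a positive result is the amount of trailing data a tolerant reader has to skip over.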

mkl-public commented on September 23, 2024

That being said, your screenshot looks like the extra trash after the %%EOF is actually some PDF content (the endstream ... stream segment is very PDF'ish).

This has the smell of PDFs generated by buggy software that stores PDFs in files inappropriately opened for writing (with Open instead of Create) or in memory streams with some previous content without making sure to eventually only use the newly written part.

mspnr commented on September 23, 2024

@mkl-public thank you for your feedback!

Now I understand why 1024 is hardcoded here.

I replaced the function mentioned above with this code:

    public int getStartxref() throws IOException {
        int step = 1024;
        int pos = file.length();
        int idx;
        do {
            pos = Math.max(0, pos - step);
            file.seek(pos);
            String str = readString(step);
            idx = str.lastIndexOf("startxref");
        } while (pos > 0 && idx < 0);
        if (idx < 0)
            throw new InvalidPdfException(MessageLocalization.getComposedMessage("pdf.startxref.not.found"));
        return pos + idx;
    }

It works for me both for ordinary files and for files with "trash" at the end. I would like to propose a PR to include this code in the main repository.

The PDF file mentioned above was most likely produced by Bluebeam Revu. But actually it does not matter, because OpenPDF library should be able to work with any PDF. Users do not care if the original file was produced not according to the standards.

asturio commented on September 23, 2024

Pull requests are welcome!
@mkl-public: are you fine with the suggested code?

mspnr commented on September 23, 2024

> One problem: If the startxref occurs partially in one read packet and partially in another, it won't be found. This can be resolved by having the packets overlap, e.g. by initializing pos with

@mkl-public thank you for pointing out the case of overlapping packets. I have updated the code in PR #505.

mspnr commented on September 23, 2024

> So one probably should create some flag (e.g. in some static variable) and only allow such deviations if that flag is set.

@mkl-public I understand your point.

I think that as long as the deviations are still comprehensible and the document can be opened, rendered, etc., it is OK to show a notice such as "your document is not built according to the standard". If, on the other hand, OpenPDF chooses to show such a message to users while failing to handle such documents, it can be proud of following the standards, but users and other apps will turn away from OpenPDF.

So I think the truth is somewhere in the middle: you can notify the user about the potential problem and still do whatever you can to keep things working normally for them.
