Comments (14)
The error is happening in the constructor: new PdfReader(inputStream).
I'm locating the file now to see if it can be shared.
from openpdf.
OK, so I have done a fair bit of reading, and interestingly my search turned up Acrobat Reader 9 a few times, though I'm not 100% sure it is the cause of the problem.
A file trailer in a PDF contains information such as the number of objects in the document. This is, in my understanding, mandatory. However, it looks like some files do not have a trailer keyword followed by a trailer dictionary; instead this information lives in the XREF stream dictionary, which technically is still a valid trailer.
Unfortunately I was unable to create a unit test because I don't have a PDF with the structure described above. Regarding your suggestion to provide an empty PdfDictionary instead of null: it makes sense, but it cannot be considered a justified fix without proper testing.
Given that you are unable to share the malformed PDF document, I'd suggest you edit the PdfReader class, specifically making the protected PdfDictionary readXrefSection() method return an empty PdfDictionary instead of null, just as you suggested. Then see if you can read your PDF. If you can, then I guess a fix should be possible; otherwise it is probably a corrupted file.
Hope this information helps you.
from openpdf.
One problem: If the startxref occurs partially in one read packet and partially in another, it won't be found. This can be resolved by having the packets overlap, e.g. by initializing pos with
int pos = file.length() - 8;
and reading a packet with
String str = readString(step + 8);
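To make the overlap idea concrete, here is a minimal stand-alone sketch. It is not OpenPDF's PRTokeniser: the class StartxrefSearch and the method lastStartxref are hypothetical names, and a byte array stands in for the random-access file so the logic can be tried in isolation. Each packet is extended by 8 bytes (the keyword length minus one) into the packet that was read before it, so a keyword straddling a packet boundary is still seen whole.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of a backward search with overlapping packets; not OpenPDF code. */
final class StartxrefSearch {
    private static final String KEYWORD = "startxref";
    // keyword length - 1 bytes of overlap are enough to see a keyword split across two packets
    private static final int OVERLAP = KEYWORD.length() - 1;

    /** Returns the offset of the last "startxref" in data, or -1 if there is none. */
    static int lastStartxref(byte[] data, int step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        int pos = data.length;
        while (pos > 0) {
            pos = Math.max(0, pos - step);
            // read one packet plus the overlap into the previously read packet
            int end = Math.min(data.length, pos + step + OVERLAP);
            // ISO-8859-1 maps bytes to chars one-to-one, so string indexes are byte offsets
            String packet = new String(data, pos, end - pos, StandardCharsets.ISO_8859_1);
            int idx = packet.lastIndexOf(KEYWORD);
            if (idx >= 0) {
                return pos + idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // keyword placed so that it straddles the boundary between the last two 16-byte packets
        byte[] data = new byte[40];
        Arrays.fill(data, (byte) 'x');
        byte[] keyword = KEYWORD.getBytes(StandardCharsets.ISO_8859_1);
        System.arraycopy(keyword, 0, data, 20, keyword.length);
        System.out.println(lastStartxref(data, 16));
    }
}
```

Because the overlap is one byte shorter than the keyword, a match can never lie entirely inside the overlap, so the first hit found while walking backwards is also the last occurrence in the file.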
One aside, though, I don't concur with
But actually it does not matter, because OpenPDF library should be able to work with any PDF. Users do not care if the original file was produced not according to the standards.
While users indeed often don't care whether a PDF was produced not according to the standards and expect any file Acrobat Reader can display to be processable by one's software, OpenPDF (or any other PDF library) should not simply give in and allow any deviation from the standard that some PDF viewer somewhere in the wild supports.
One reason those users should understand is that deviations from the standard are open to different interpretations by different software, so something broken may be displayed one way by a tolerant viewer and processed a different way by a tolerant software further processing the file. Thus, what the user wanted to send into some process (which might in the end involve forwarding the resulting PDF out to thousands or millions of customers) might be something different than what is actually processed (and so eventually is read by those thousands or millions of customers in the example). Other reasons are related to attack vectors for forgery of signed data, etc.
So one probably should create some flag (e.g. in some static variable) and only allow such deviations if that flag is set.
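A rough sketch of what such a flag could look like follows. Everything in it is an assumption for illustration: the class LenientParsing, its setLenient/isLenient accessors and the findStartxref helper are hypothetical and are not part of OpenPDF. By default only the last 1024 bytes are accepted as a location for startxref; the extended search is used only when the caller has opted in.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical opt-in switch for tolerating non-conforming files; not an OpenPDF API. */
final class LenientParsing {
    // window in which the strict mode expects to find startxref
    private static final int STRICT_WINDOW = 1024;
    // strict by default; callers must opt in to tolerant behaviour
    private static volatile boolean lenient = false;

    static void setLenient(boolean value) {
        lenient = value;
    }

    static boolean isLenient() {
        return lenient;
    }

    /** Returns the offset of the last "startxref", honouring the leniency flag. */
    static int findStartxref(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int idx = text.lastIndexOf("startxref");
        boolean insideWindow = idx >= 0 && idx >= data.length - STRICT_WINDOW;
        if (idx >= 0 && (insideWindow || lenient)) {
            return idx;
        }
        throw new IllegalStateException("startxref not found within the allowed range");
    }

    public static void main(String[] args) {
        // startxref followed by more than 1024 bytes of padding
        byte[] data = new byte[9 + 2050];
        java.util.Arrays.fill(data, (byte) ' ');
        System.arraycopy("startxref".getBytes(StandardCharsets.ISO_8859_1), 0, data, 0, 9);
        setLenient(true);
        System.out.println(findStartxref(data));
    }
}
```

A static variable is the simplest form of the idea; a per-reader setting would avoid global state, but that is a design choice beyond this sketch.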
from openpdf.
Do you have any more information on how to reproduce this error, please? Maybe you could attach the file you are attempting to read and/or a snippet of the code you use to read it, or, even better, a unit test? Many thanks
from openpdf.
Unfortunately, I cannot publish the file, and I don't have enough expertise or time to share a sanitized version. Sorry about that.
from openpdf.
no worries, I'll take a look at this as soon as I can anyway
from openpdf.
I've looked at this before (thought, in fact, I'd fixed it, once upon a time, but there may be more than one trailer check/reference).
Trailers are no longer required for PDF 1.5 and greater. I'll try to make an example, and take a look at the crash.
The trailer isn't really used for much, since it's just an optimization for finding the directory by looking at the end of a file.
from openpdf.
I get a NullPointerException in PdfReader.java in readPages because trailer is null.
protected void readPages() throws IOException {
    catalog = trailer.getAsDict(PdfName.ROOT); // << NullPointerException
This is pretty critical, because it happens at the very beginning and nothing can be done with the document.
As mentioned before, the problem appears because startxref can't be read correctly. I also can't share my PDF, but here is a screenshot of its last bytes. I scrolled to the very end; lines are wrapped.
It turned out that the PDF has ~2050 bytes between startxref and the end of the file.
But in PRTokeniser.java in getStartxref, a fixed 1024 bytes are read from the end of the document, and this is not enough:
public int getStartxref() throws IOException {
    int size = Math.min(1024, file.length()); // << hardcoded 1024
    int pos = file.length() - size;
    file.seek(pos);
    String str = readString(1024); // << once again
    int idx = str.lastIndexOf("startxref");
    if (idx < 0)
        throw new InvalidPdfException(MessageLocalization.getComposedMessage("pdf.startxref.not.found"));
    return pos + idx;
}
I experimented with increasing 1024 and it worked out well.
I would suggest replacing 1024 with a greater number or just searching for startxref dynamically from the end of the file.
I am not an expert in PDFs, but here I've found a mention that the requirement that %%EOF should appear at the end of the file was dropped from ISO, whatever that means.
from openpdf.
I am not an expert in PDFs, but here I've found a mention that the requirement that %%EOF should appear at the end of the file was dropped from ISO, whatever that means.
I'm afraid you misinterpreted that. What's meant there is that the relaxed implementation note by Adobe (which only requires the %%EOF to appear somewhere within the last 1024 bytes of the file) has been dropped when PDF became an ISO standard.
The ISO standard requires: The last line of the file shall contain only the end-of-file marker, %%EOF. And both OpenPDF and iText accept somewhat invalid PDFs with a certain amount of extra trash thereafter.
from openpdf.
That being said, your screenshot looks like the extra trash after the %%EOF is actually some PDF content (the endstream ... stream segment is very PDF'ish).
This has the smell of PDFs generated by buggy software that stores PDFs in files inappropriately opened for writing (with Open instead of Create) or in memory streams with some previous content without making sure to eventually only use the newly written part.
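The following sketch reproduces that failure mode with plain java.nio; it only illustrates the mechanism and makes no claim about what the producing software actually did. The class StaleTailDemo and its two helper methods are hypothetical names. Opening an existing file for writing without TRUNCATE_EXISTING overwrites from offset 0 but leaves the old bytes beyond the new content in place, so a shorter new document ends up followed by the tail of the previous one.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo of stale trailing bytes left by a non-truncating write. */
final class StaleTailDemo {

    /** Overwrites from the start of the file but keeps old bytes past the new content. */
    static void writeWithoutTruncate(Path file, byte[] content) throws IOException {
        try (OutputStream out = Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            out.write(content);
        }
    }

    /** Discards any previous content before writing. */
    static void writeWithTruncate(Path file, byte[] content) throws IOException {
        try (OutputStream out = Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            out.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("stale-tail", ".bin");
        try {
            Files.write(file, new byte[3000]);          // previous, longer content
            writeWithoutTruncate(file, new byte[100]);  // new, shorter content
            System.out.println("size without truncate: " + Files.size(file));
            writeWithTruncate(file, new byte[100]);
            System.out.println("size with truncate: " + Files.size(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```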
from openpdf.
@mkl-public thank you for your feedback!
Now I understand why 1024 is hardcoded here.
I replaced the function mentioned above with this code:
public int getStartxref() throws IOException {
    int step = 1024;
    int pos = file.length();
    int idx;
    do {
        pos = Math.max(0, pos - step);
        file.seek(pos);
        String str = readString(step);
        idx = str.lastIndexOf("startxref");
    } while (pos > 0 && idx < 0);
    if (idx < 0)
        throw new InvalidPdfException(MessageLocalization.getComposedMessage("pdf.startxref.not.found"));
    return pos + idx;
}
It works for me both for the usual files and for files with "trash" at the end. I would like to propose a PR to include the code in the main repository.
The PDF file mentioned above was most likely produced by Bluebeam Revu. But actually it does not matter, because OpenPDF library should be able to work with any PDF. Users do not care if the original file was produced not according to the standards.
from openpdf.
Pull requests are welcome!
@mkl-public : Are you fine with the suggested code?
from openpdf.
One problem: If the startxref occurs partially in one read packet and partially in another, it won't be found. This can be resolved by having the packets overlap, e.g. by initializing pos with
@mkl-public thank you for pointing out the case of a keyword split across packets. I have updated the code in PR #505
from openpdf.
So one probably should create some flag (e.g. in some static variable) and only allow such deviations if that flag is set.
@mkl-public I understand your point.
I think that as long as deviations are still comprehensible and the document can be opened/rendered/etc., it is OK to show some notice like "your document is not built according to the standard". If, on the other hand, OpenPDF chooses to show such a message to users while failing to process such documents, it can be proud of following the standards, but users and other apps will turn away from OpenPDF.
So I think the truth is somewhere in the middle: you can notify the user about the potential problem and do whatever you can to provide normal operation for the users.
from openpdf.