librepdf / openpdf

OpenPDF is a free Java library for creating and editing PDF files, with an LGPL and MPL open source license. OpenPDF is based on a fork of iText. We welcome contributions from other developers. Please feel free to submit pull requests and bug reports to this GitHub repository.

License: Other

Java 99.79% HTML 0.18% Shell 0.03%

hacktoberfest itext java openpdf pdf pdf-generation

openpdf's Issues

docs?

Limited functionality under Google App Engine

Hi, is there a way to get rid of the dependency on java.awt.Color? It's not whitelisted on GAE, so setting cell colors, etc. will not work. This library is amazing otherwise. I've tried PDFBox and it is years behind where this library is.

PAdES signatures support

PAdES support in OpenPDF would be nice to verify the authenticity of PDF documents such as invoices.

Search for OpenPDF here:
https://ec.europa.eu/cefdigital/DSS/webapp-demo/doc/dss-documentation.html

https://en.wikipedia.org/wiki/PAdES

http://www.etsi.org

https://librepdf.github.io/OpenPDF/docs-1-1-0/com/lowagie/text/pdf/PdfStamper.html

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

https://developers.itextpdf.com/examples/security-itext5/digital-signatures-white-paper/digital-signatures-chapter-2

This is highly relevant: PAdES, with an LGPL license:
https://github.com/esig/dss
https://github.com/esig/dss/tree/master/dss-pades

Ongoing work on OpenPDF integration in DSS:
https://github.com/esig/dss/tree/openpdf-integration

Nullpointer on Font create

in class: BaseFont
Line: 856

You always get null if you do not cache the font (cache = false).
The lines before this one put the font into the cache.
You just have to return the variable here.

This does not work when my PDF starts with "%PDF-1.4". "%PDF-" should be checked

OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/PRTokeniser.java

Line 212 in 251761c

int idx = str.indexOf("%FDF-1.2");
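
One way to make this check independent of the version number, sketched here under the assumption that the header only needs to be located rather than fully validated, is to search for the `%PDF-` (or `%FDF-`) prefix instead of a literal version string. `HeaderCheck` and `findHeader` are hypothetical names and are not part of `PRTokeniser`.

```java
// Hypothetical helper, not part of OpenPDF: locates a PDF or FDF header by prefix.
final class HeaderCheck {
    private HeaderCheck() {}

    /**
     * Returns the index of the first "%PDF-" or "%FDF-" marker in the given
     * leading chunk of the file, or -1 if neither marker is present.
     */
    static int findHeader(String firstBytes) {
        int pdf = firstBytes.indexOf("%PDF-");
        int fdf = firstBytes.indexOf("%FDF-");
        if (pdf < 0) return fdf;
        if (fdf < 0) return pdf;
        return Math.min(pdf, fdf);
    }

    public static void main(String[] args) {
        System.out.println(findHeader("%PDF-1.4\n"));     // prints 0
        System.out.println(findHeader("junk%FDF-1.2\n")); // prints 4
        System.out.println(findHeader("not a pdf"));      // prints -1
    }
}
```

With a prefix check like this, a file starting with `%PDF-1.4` is accepted in the same way as one starting with `%PDF-1.2`.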

POM dependency is incorrect

Cannot use pdf-xml 1.0.3.

Broken retro-compatibility

Commit ae40ae2 changed the signature of PdfSignatureAppearance.setCrypto removing some arrays.
Unfortunately the diff can't be seen on GitHub because the file was dropped and re-added, but here's a part of the diff:

--- PdfSignatureAppearance-fdb76b2.java 2018-03-29 18:15:23.827579000 +0200
+++ PdfSignatureAppearance-ae40ae2.java 2018-03-29 18:15:30.675425000 +0200
@@ -247,36 +271,65 @@
     
     /**
      * Sets the cryptographic parameters.
-     * @param privKey the private key
+   * 
-     * @param certChain the certificate chain
+   * @param privKey
+   *          the private key
+   * @param certificate
+   *          the certificate
+   * @param crl
-     * @param crlList the certificate revocation list. It may be <CODE>null</CODE>
+   *          the certificate revocation list. It may be <CODE>null</CODE>
-     * @param filter the crytographic filter type. It can be SELF_SIGNED, VERISIGN_SIGNED or WINCER_SIGNED
+   * @param filter
+   *          the cryptographic filter type. It can be SELF_SIGNED,
+   *          VERISIGN_SIGNED or WINCER_SIGNED
      */    
-    public void setCrypto(PrivateKey privKey, Certificate[] certChain, CRL[] crlList, PdfName filter) {
+  public void setCrypto(PrivateKey privKey, X509Certificate certificate,
+      CRL crl, PdfName filter) {
         this.privKey = privKey;
-        this.certChain = certChain;
+    this.certificate = certificate;
-        this.crlList = crlList;
+    this.crl = crl;
         this.filter = filter;
     }

As far as the change goes I guess it's fine, because the array was never used for anything more than [0] but it breaks binary compatibility with itext-4.2.0 for no real reason.
I would suggest adding a method such as:

  /**
   * Sets the cryptographic parameters.
   * @deprecated use {@link #setCrypto(PrivateKey, X509Certificate, CRL, PdfName)}
   */
  public void setCrypto(PrivateKey privKey, Certificate[] certChain, CRL[] crlList, PdfName filter) {
    setCrypto(privKey, (X509Certificate) certChain[0], crlList != null ? crlList[0] : null, filter);
  }

Tell me if you'd like a PR for that.

NullPointerException due to missing trailer (on bad startxref?)

In 1.0.5, we got a NullPointerException with the following stacktrace while trying to read a PDF:

PdfReader.java:1112 - com.lowagie.text.pdf.PdfReader.readPages
PdfReader.java:622 - com.lowagie.text.pdf.PdfReader.readPdf
PdfReader.java:282 - com.lowagie.text.pdf.PdfReader.<init>
PdfReader.java:295 - com.lowagie.text.pdf.PdfReader.<init>

Based on the line numbers, trailer must be null. Tracing through the execution, this can happen in the following sequence of events:

1. readPdf calls readXref, which is supposed to set trailer (among other things).
2. readXref doesn't find a valid startxref and throws an exception, or readXrefSection throws an exception due to an invalid xref.
3. readPdf catches the exception and calls rebuildXref.
4. That method tries to set the trailer, too, but it can return without actually setting it.
5. readPdf proceeds to readPages, trailer is unset, and we get an NPE.

I'm not sure what the proper fix would be, though. Should one of the caught exceptions instead bubble out of readPdf? Should rebuildXref set trailer to an empty PdfDictionary if it doesn't find the actual trailer?

Sorry, I'm pretty ignorant about the PDF format in general. This report is just based on working through this exception's execution path.
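
One of the options raised above, failing with a descriptive exception before the null trailer reaches readPages, can be sketched in isolation. This is only an illustration and does not claim to be the right fix for PdfReader: `TrailerGuard` and `requireTrailer` are hypothetical names, and a plain `Map` stands in for `PdfDictionary` so the example compiles without OpenPDF.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard, not part of OpenPDF. A Map stands in for PdfDictionary.
final class TrailerGuard {
    private TrailerGuard() {}

    /** Returns the trailer unchanged, or throws a descriptive IOException if it is missing. */
    static Map<String, Object> requireTrailer(Map<String, Object> trailer) throws IOException {
        if (trailer == null) {
            throw new IOException("PDF trailer not found; the xref table could not be rebuilt");
        }
        return trailer;
    }

    public static void main(String[] args) {
        try {
            requireTrailer(new HashMap<>()); // a present trailer passes through
            requireTrailer(null);            // a missing trailer is reported clearly
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Callers would then see an I/O error that names the cause instead of a NullPointerException from deep inside page reading.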

DocumentException should be unchecked

DocumentException extends Exception, which makes it a checked exception.

Forcing callers to catch it serves little purpose and makes the library harder to use (e.g. in lambda expressions).
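
The lambda problem can be shown without OpenPDF on the classpath. In the sketch below, `CheckedDocException` is a hypothetical stand-in for a checked exception such as DocumentException, and `addParagraph` is an invented method that declares it. Because `Consumer.accept` declares no checked exceptions, the lambda has to catch the exception and rethrow it unchecked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CheckedInLambda {
    /** Hypothetical stand-in for a checked exception like DocumentException. */
    static class CheckedDocException extends Exception {
        CheckedDocException(String message) { super(message); }
    }

    private final List<String> added = new ArrayList<>();

    /** Invented method that declares the checked exception. */
    void addParagraph(String text) throws CheckedDocException {
        if (text.isEmpty()) throw new CheckedDocException("empty paragraph");
        added.add(text);
    }

    /** Adds every text; the try/catch inside the lambda is the boilerplate at issue. */
    List<String> addAll(List<String> texts) {
        texts.forEach(t -> {
            try {
                addParagraph(t);
            } catch (CheckedDocException e) {
                throw new IllegalStateException(e);
            }
        });
        return added;
    }

    public static void main(String[] args) {
        System.out.println(new CheckedInLambda().addAll(Arrays.asList("a", "b")).size()); // prints 2
    }
}
```

If the exception were unchecked, the lambda body would shrink to a method reference such as `this::addParagraph`.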

Duplicate entry

Hi,

It looks like XmlDomWriter in the openpdf dependency is the same as XmlDomWriter in the pdf-xml dependency.
Both have the same package name and the same content.
I have a dependency on the pdf-html sources in my Android project:
compile 'com.github.librepdf:pdf-html:1.0.1'
When I try to build my project I get:

Error:Execution failed for task ':touchPoApp:transformClassesWithJarMergingForTstingDebug'.
> com.android.build.api.transform.TransformException: java.util.zip.ZipException: duplicate entry: com/lowagie/text/xml/XmlDomWriter.class

Make Travis use JDK 7.

This includes testing that it works as intended in regards to Java 7 code and dependency compatibility.

Update pom.xml and create a new Maven Release to reflect organizational ownership.

Bouncy Castle is not optional

A simple test that only instantiates a PdfReader on an empty PDF requires Bouncy Castle on the classpath. The pom declares it optional in the manifest.
This problem does not happen with the iText version at ymasory/iText-4.2.0.

Problem signing a PDF with an external signing service (eSign)

Hi,

We want to sign a PDF with an external signature provided by an eSign service. We are using OpenPDF 1.0.1.
The problem is that we are unable to calculate the exclusion size of the signature appearance before calling preClose.
Please find the code below:

	// Sign into memory first; the result is written to destFile at the end.
	PdfReader reader = new PdfReader(signingHelper.getSrc());
	File destFile = new File(signingHelper.getDest());
	ByteArrayOutputStream arrayOutputStream = new ByteArrayOutputStream();

	PdfStamper pdfStamper = PdfStamper.createSignature(reader, arrayOutputStream, '\0', null, true);
	PdfSignatureAppearance pdfSigApp = pdfStamper.getSignatureAppearance();

	// Name of the existing signature field
	pdfSigApp.setVisibleSignature("SignatureField1[0]");

	SimpleDateFormat dt = new SimpleDateFormat("dd-MMM-yyyy");
	String formattedDate = dt.format(new Date());

	pdfSigApp.setLayer2Text("Digitally Signed" + "\nReason: " + signingHelper.getReason()
			+ "\nDate: " + formattedDate + "\nLocation: " + signingHelper.getLocation());
	Font font = new Font();
	font.setSize(9);
	pdfSigApp.setLayer2Font(font);
	pdfSigApp.setLocation(signingHelper.getLocation());

	PdfSignature sigDic = new PdfSignature(PdfName.ADOBE_PPKMS, PdfName.ADBE_PKCS7_DETACHED);
	sigDic.put(PdfName.FT, PdfName.SIG);
	sigDic.setReason(signingHelper.getReason());
	sigDic.setLocation(signingHelper.getLocation());
	pdfSigApp.setCryptoDictionary(sigDic);

	HashMap<PdfName, Integer> exclusions = new HashMap<>();
	exclusions.put(PdfName.CONTENTS, Integer.valueOf(7622)); // <== Here is the problem: how to calculate this value?
	pdfSigApp.preClose(exclusions);
	LOGGER.info("exclusions:" + exclusions);

	String hashPdf = generateSha256HashInHexForPdf(pdfSigApp.getRangeStream());

	// hashPdf is sent to the external service for signing;
	// the response contains the PKCS#7 signature (pkcs7Signature).

	byte[] signeddata = Base64.getDecoder().decode(pkcs7Signature);
	byte[] out = new byte[signeddata.length];
	System.arraycopy(signeddata, 0, out, 0, signeddata.length);
	PdfDictionary updates = new PdfDictionary();
	updates.put(PdfName.CONTENTS, new PdfString(out).setHexWriting(true));
	pdfSigApp.close(updates);

	reader.close();
	FileOutputStream fileOutputStream = new FileOutputStream(destFile);
	fileOutputStream.write(arrayOutputStream.toByteArray());
	fileOutputStream.close();

Please provide suggestions or help with the above code.

Thanks,
Arjun
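
On how to arrive at the exclusion value: in the iText-style signing examples this kind of code usually follows, the number passed for `PdfName.CONTENTS` is derived from an estimated maximum signature size rather than measured. `/Contents` is written as a hex string, so each reserved byte takes two characters, plus two for the enclosing angle brackets. A minimal sketch of that arithmetic follows; `SignatureSpace`, `exclusionSize`, and `pad` are hypothetical names, and the 3810-byte estimate is chosen only because it reproduces the 7622 used above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenPDF: arithmetic for reserving /Contents space.
final class SignatureSpace {
    private SignatureSpace() {}

    /** Characters to reserve for /Contents: two hex digits per byte plus '<' and '>'. */
    static int exclusionSize(int estimatedSignatureBytes) {
        return estimatedSignatureBytes * 2 + 2;
    }

    /** Pads the signature with trailing zero bytes up to the reserved size. */
    static byte[] pad(byte[] signature, int estimatedSignatureBytes) {
        if (signature.length > estimatedSignatureBytes) {
            throw new IllegalArgumentException("signature is " + signature.length
                    + " bytes, but only " + estimatedSignatureBytes + " were reserved");
        }
        return Arrays.copyOf(signature, estimatedSignatureBytes);
    }

    public static void main(String[] args) {
        System.out.println(exclusionSize(3810));             // prints 7622
        System.out.println(pad(new byte[] {1, 2}, 4).length); // prints 4
    }
}
```

The estimate has to be generous enough for the PKCS#7 container that the external service returns, certificate chain included, because the reserved space cannot grow once preClose has been called.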

Language glyphs and diacritics

Hi,

Have you been able to figure out why language diacritics are not rendering properly in OpenPDF? :)

Recent changes to PdfArray broke the Kids field

Recent changes in PdfArray.java broke PDF creation. PdfPages.writePageTree creates an empty "Kids" field. Other parts of the code may be affected as well. Reverting getArrayList (and getElements) to return the internal list fixes the issue.
Pull request #80 (partially) addresses the same issue.
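
The failure mode described here is the general hazard of changing a getter from returning the backing list to returning a copy: callers that add to the returned list keep compiling, but their additions are silently lost. A minimal sketch, using a hypothetical `ArrayHolder` class rather than the real PdfArray:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; it does not reproduce PdfArray's API.
final class ArrayHolder {
    private final List<String> items = new ArrayList<>();

    /** Old-style getter: exposes the backing list, so caller mutations are visible. */
    List<String> getInternalList() { return items; }

    /** New-style getter: returns a copy, so caller mutations are discarded. */
    List<String> getCopy() { return new ArrayList<>(items); }

    int size() { return items.size(); }

    public static void main(String[] args) {
        ArrayHolder kids = new ArrayHolder();
        kids.getCopy().add("page 1");         // lost: added to a throwaway copy
        System.out.println(kids.size());      // prints 0
        kids.getInternalList().add("page 1"); // kept: added to the backing list
        System.out.println(kids.size());      // prints 1
    }
}
```

Code that relied on the old behaviour either needs the getter reverted, as suggested above, or has to be changed to call an explicit add method on the owning object.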

Compilation error on Java 7

The master branch of OpenPDF doesn't compile with Java 7. If Java 8 now is a requirement, then it would be nice if the README could be updated to show that OpenPDF now requires Java 8.

This is the compilation error I get with Java 7:

[ERROR] /C:/OpenPDF-master/pdf-html/src/test/java/com/lowagie/text/html/simpleparser/FactoryPropertiesTest.java:[24,19] cannot access java.util.stream.Stream
class file for java.util.stream.Stream not found
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] OpenPDF - Free and Open PDF ........................ SUCCESS [ 1.030 s]
[INFO] openpdf ............................................ SUCCESS [ 31.981 s]
[INFO] pdf-xml ............................................ SUCCESS [ 2.155 s]
[INFO] pdf-rtf ............................................ SUCCESS [ 7.805 s]
[INFO] pdf-html ........................................... FAILURE [ 0.687 s]
[INFO] pdf-swing .......................................... SKIPPED
[INFO] pdf-toolbox ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 45.628 s
[INFO] Finished at: 2017-10-24T12:38:45+02:00
[INFO] Final Memory: 41M/613M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:testCompile (default-testCompile) on project pdf-html: Compilation failure
[ERROR] /C:/OpenPDF-master/pdf-html/src/test/java/com/lowagie/text/html/simpleparser/FactoryPropertiesTest.java:[24,19] cannot access java.util.stream.Stream
[ERROR] class file for java.util.stream.Stream not found

Add PDF 2.0 support

Add PDF 2.0 support to OpenPDF:
https://www.pdfa.org/publication/iso-32000-2-pdf-2-0/
https://www.pdfa.org/what-will-pdf-2-0-bring/

https://issues.apache.org/jira/browse/PDFBOX-3892

Update to Bouncy Castle 1.59

Update to Bouncy Castle 1.59
https://www.bouncycastle.org/releasenotes.html

Improper handling of line-height by HTML to PDF parser

Hi,

I've prepared the test example showing an issue: https://gist.github.com/syakovyn/6ead2da4f00716b25a4803e36b64bb90

The attached PDF (test.pdf) shows paragraphs with a workaround that fixes the paragraph leading, alongside paragraphs without the fix.

The issue is in com.lowagie.text.html.simpleparser.FactoryProperties#insertStyle(java.util.HashMap, com.lowagie.text.html.simpleparser.ChainedProperties), which doesn't account for the case where line-height is a unitless number, e.g. "line-height:1.3".

Serhiy
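
For reference, CSS defines a unitless line-height as a multiplier of the element's font size. The sketch below shows one way such values could be turned into a leading; it is not the code in FactoryProperties. `LineHeight` and `toLeading` are hypothetical names, and only unitless numbers, percentages, and `pt` lengths are handled as an assumption to keep the example short.

```java
import java.util.Locale;

// Hypothetical helper, not part of OpenPDF: converts a CSS line-height value to a leading in points.
final class LineHeight {
    private LineHeight() {}

    /**
     * Supports "1.3" (multiplier), "130%" (percentage of the font size) and "18pt".
     * Any other value makes Float.parseFloat throw NumberFormatException.
     */
    static float toLeading(String value, float fontSize) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.endsWith("%")) {
            return fontSize * Float.parseFloat(v.substring(0, v.length() - 1)) / 100f;
        }
        if (v.endsWith("pt")) {
            return Float.parseFloat(v.substring(0, v.length() - 2));
        }
        // Unitless number: a multiplier of the font size, e.g. "1.3".
        return fontSize * Float.parseFloat(v);
    }

    public static void main(String[] args) {
        System.out.println(toLeading("130%", 10f)); // prints 13.0
        System.out.println(toLeading("18pt", 10f)); // prints 18.0
    }
}
```

With a 10pt font, "1.3" and "130%" both give a leading of about 13pt, while "18pt" is taken as an absolute value.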

Bouncy Castle maven dependency moved

Bouncy Castle is a dependency of OpenPDF. Their maven artifact seems to have moved, according to this merged pull request in the flyingsaucer project: flyingsaucerproject/flyingsaucer#115

Perhaps a similar change should be done in OpenPDF, I'm not sure yet.

Manual is missing

I want to evaluate this fork, but I miss a manual and simple demo applications. I tried searching for iText tutorials, but most of them do not compile with OpenPDF. Demos or a wiki would be nice for newcomers.

By the way, is there a chat for discussions with the OpenPDF developers?

Unicode Characters

Hi everyone,

I'm new to your repo. I'm trying to print a special Unicode character, \u25b2, which should be a triangle in the Times New Roman font.

Unfortunately, it is ignored when I look at the PDF:
private static final Font NORMALFONT = new Font(Font.TIMES_ROMAN, 7, Font.NORMAL, Color.black);

And the result:

There should be this triangle before every number.

Am I doing something wrong here?

misc_licenses.txt

New behaviour of how text is extracted from a page

Hi everyone

I just updated from the original iText 4.2.0 (https://github.com/ymasory/iText-4.2.0) to your OpenPDF 1.0.5. So far it works fine, but I noticed a change in how text is extracted from a PDF.
With the previous version, PdfTextExtractor.getTextFromPage(i) returned plain text; now every word is surrounded by markup tags.

For example:
before:
Hello

after:
<br class='t-pdf' /><span class="t-word" style="bottom: 81.79%; left: 56.18%; width: 17.45%; height: 0.83%;" id="word7">Hello</span>

I found that this change was introduced by the following fork, specifically by this commit:
kulatamicuda/iText-4.2.0@7d7c218#diff-b2e0f949a7f5d2e581f63cedf5f30922

Is there a way to get the old behaviour without using the old "SimpleTextExtractingPdfContentRenderListener" class? I don't want to integrate old code because of maintainability...

Thanks in advance!
M.T.

P.S.: I know this change was made in another repository, but the original repository has not been updated for at least 3 years...

SHA-1 is unsafe / deprecated

SHA-1 is unsafe / deprecated. How should we handle this in OpenPDF?

https://en.wikipedia.org/wiki/SHA-1
https://itextpdf.com/blog/are-pdf-signatures-shattered

https://github.com/LibrePDF/OpenPDF/search?q=sha-1

Possibility to customize "producer"-Flag in PDF-Metadata

I would like to change the producer-flag in the meta-data of a generated PDF-file.
Which way of doing that would you prefer?

Exposing an API for Metadata-Manipulation?
A protected Method for metadata-processing that can be overridden in a custom class?
Any other option?
Please let me know which way you would prefer from an architectural perspective.
I can then try to implement it and create a pull request.

Thanks in advance!

How to use OpenPDF in VBScript and C#

How can the OpenPDF object be used from VBScript and C#? The customer has installed the OpenPDF software to view and edit PDF files.

Issue with subsetting on OTF/CFF fonts

I use OpenPDF in flying saucer to generate PDFs from HTML and I've run into a problem that I cannot use fonts such as NotoSansCJKjp (an OTF/CFF font) because CFF font subsetting does not work correctly.

The subsequent PDF output is broken in Acrobat, stating that the embedded font cannot be extracted. It does work in other readers, but I believe that is because they are more lenient than Acrobat on this issue, but unfortunately using another reader is not an option.

I created a fork of OpenPDF and turned off font subsetting entirely, and the output works fine. Obviously this is not a real solution either, because these fonts can be quite large: this particular font results in a minimum PDF size of 12MB, and the problem compounds when using multiple fonts.

I inspected the PDF output with PDFBox's preflight and it errors with "Font DICT invalid without "Private" entry", which does indeed seem to point again to the subsetting being broken, not including a private section in each font dict, which would explain why Acrobat is falling over as well.

I did my best to try and fix this myself, but I've not made much headway so I thought I would reach out to the community and see if anyone has the necessary experience with CFF font subsetting in order to fix this issue.

Thanks

OpenPDF makes no distinctions between reading password vs editing password

I have a PDF that is password protected for editing (PDF/A compatible PDF), but can be read without password. If you open Acrobat Reader, go to File -> Properties -> Security -> Show Details.., you can see that there are actually two passwords possible and only for editing it is enabled. Acrobat can even force the PDF to be editable without password, losing the PDF/A compatibility in the process. So either way, a password protected PDF should be viewable in OpenPDF.

OpenPDF detects that the document is encrypted, but since I don't have a password it fails the following check:

public final boolean isOpenedWithFullPermissions() {
  return !encrypted || ownerPasswordUsed;
}

I can open the PDF in other readers just fine as long as I don't enable editing mode. In OpenPDF I would expect something like the following check instead:

public final boolean isOpenedWithFullPermissions() {
  return !encrypted || ownerPasswordUsed || (!pdfRequiresReadingPassword && readonlyMode);
}

If I force this method to return true, it actually is able to read the PDF without issues (this is my current workaround, unfortunately).

Update to Bouncy Castle version 1.58

https://www.bouncycastle.org/

Update README

Update links to refer to this new repository location (LibrePDF/OpenPDF)
Document recent changes

PDF Metadata producer is always "OpenPDF 1.0.0-SNAPSHOT"

I have found this code in class "com.lowagie.text.Document" :

private static final String OPENPDF = "OpenPDF";
private static final String RELEASE = "1.0.0-SNAPSHOT";
private static final String OPENPDF_VERSION = OPENPDF + " " + RELEASE;

It would be great if the version were read from a properties file that is updated automatically during the Maven build through resource filtering.
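
A minimal sketch of that idea is shown below, under stated assumptions: the resource name `openpdf-version.properties`, the key `version`, and the class `VersionInfo` are all hypothetical, and the file is assumed to contain a line such as `version=${project.version}` that Maven replaces when resource filtering is enabled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical helper, not part of OpenPDF: reads a build-time version from a properties resource.
final class VersionInfo {
    static final String FALLBACK = "unknown";

    private VersionInfo() {}

    /** Reads the "version" key; falls back if the stream is missing or the placeholder was not filtered. */
    static String fromStream(InputStream in) {
        if (in == null) return FALLBACK;
        try {
            Properties props = new Properties();
            props.load(in);
            String version = props.getProperty("version", FALLBACK).trim();
            return version.isEmpty() || version.startsWith("${") ? FALLBACK : version;
        } catch (IOException e) {
            return FALLBACK;
        }
    }

    /** Looks the resource up on the classpath and delegates to fromStream. */
    static String fromClasspath(String resourceName) {
        try (InputStream in = VersionInfo.class.getClassLoader().getResourceAsStream(resourceName)) {
            return fromStream(in);
        } catch (IOException e) {
            return FALLBACK;
        }
    }

    public static void main(String[] args) {
        System.out.println("OpenPDF " + fromClasspath("openpdf-version.properties"));
    }
}
```

The producer string could then be assembled once at class-load time from the value returned here, instead of from a hard-coded constant.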

Image.getInstance: mono PNG with color ICC profile displays wrong

Although it doesn't make a lot of sense to me, a monochrome PNG (1 component) might have a color ICC profile (3 components). One way to create a file like that is to use GhostScript and ImageMagick:

Start with a black-and-white text PDF
Use GhostScript to render to color PNG: gs -sDEVICE=png256 -o test.png test.pdf
Use ImageMagick to convert to B&W PNG: convert test.png test2.png

Here's such an image. It displays fine in any browser:
test2.png

But when importing it into a PDF with Image.getInstance, it displays incorrectly, because the raster is 1-component but the /ColorSpace is 3-component:
out.pdf

This problem occurs with PNGs, but not with TIFFs (the TiffImage class ignores an ICC if its getNumComponents() doesn't match).
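
The TIFF behaviour mentioned above amounts to a component-count guard. The sketch below shows such a guard on its own using the JDK's `java.awt.color.ICC_Profile`; `IccCheck`, `componentsMatch`, and `usableProfile` are hypothetical names, and this is not the code in TiffImage or the PNG reader.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;

// Hypothetical helper, not part of OpenPDF: drops ICC profiles that do not fit the raster.
final class IccCheck {
    private IccCheck() {}

    /** True when the profile describes as many components as the raster has. */
    static boolean componentsMatch(int profileComponents, int rasterComponents) {
        return profileComponents == rasterComponents;
    }

    /** Returns the profile if it fits the raster, otherwise null so the caller ignores it. */
    static ICC_Profile usableProfile(ICC_Profile profile, int rasterComponents) {
        if (profile == null) return null;
        return componentsMatch(profile.getNumComponents(), rasterComponents) ? profile : null;
    }

    public static void main(String[] args) {
        ICC_Profile rgb = ICC_Profile.getInstance(ColorSpace.CS_sRGB); // a 3-component profile
        System.out.println(usableProfile(rgb, 1) == null); // monochrome raster: profile is dropped
        System.out.println(usableProfile(rgb, 3) == rgb);  // RGB raster: profile is kept
    }
}
```

Applying the same test when a PNG is loaded would leave a 1-component raster with a gray colour space instead of a mismatched 3-component one.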

I made a quick logo, is it okay?

@andreasrosdal I made a logo for LibrePDF. I didn't spend much time on it. I just wanted the account to look a little better than the default generated image.

Is it okay? I'm happy to remove it or make minor adjustments if not.

Release of OpenPDF 1.0.2

Perhaps it is time to release a new version of OpenPDF.

@bengolder @nwinkler Perhaps one of you could please create and publish the release?

https://github.com/LibrePDF/OpenPDF/wiki/Release-Process

FontAwesome icons bundled in openpdf

Hi, I got this idea from Vaadin:

Could FontAwesome icons also be added to OpenPDF in a similar way?

https://github.com/vaadin/framework/blob/7.7/scripts/generateFontAwesomeEnum.sh
https://github.com/vaadin/framework/blob/7.7/server/src/main/java/com/vaadin/server/FontAwesome.java
https://github.com/vaadin/framework/blob/7.7/server/src/main/java/com/vaadin/server/FontAwesome.java#L773
https://github.com/vaadin/framework/blob/7.7/server/src/main/java/com/vaadin/server/GenericFontIcon.java#L94

So if OpenPDF supports HTML evaluation, this should be doable?

Some years ago I used the commercial library (iText) with XHTML/CSS pipelines, but I don't know exactly how HTML is converted to PDF with OpenPDF, or whether it is possible at all.

ref: http://fontawesome.io/

svg file support in OpenPDF

Hi,
We want to use SVG files to add icons in PDF. Is there any standard way to do that using OpenPDF?

Unable to add Group3/Group4 TIFFs into Version 1.2.0

Hello, when trying to add a Group3 or Group4 TIFF image into a PDF in release 1.2.0, there is an exception thrown by the underlying sanselan library for "unknown compression":

ExceptionConverter: org.apache.sanselan.ImageReadException: Tiff: unknown compression: 4

	at org.apache.sanselan.formats.tiff.datareaders.DataReader.decompress(DataReader.java:135)
	at org.apache.sanselan.formats.tiff.datareaders.DataReaderStrips.readImageData(DataReaderStrips.java:96)
	at org.apache.sanselan.formats.tiff.TiffImageParser.getBufferedImage(TiffImageParser.java:505)
	at org.apache.sanselan.formats.tiff.TiffDirectory.getTiffImage(TiffDirectory.java:163)
	at org.apache.sanselan.formats.tiff.TiffImageParser.getBufferedImage(TiffImageParser.java:441)
	at com.lowagie.text.ImageLoader.getTiffImage(ImageLoader.java:163)
	at com.lowagie.text.Image.getInstance(Image.java:363)

This was working previously in version 1.0.5, I presume because the method by which TIFFs were read has changed.

My initial research indicates that Group3/Group4 support was added in a later version of the sanselan/commons-imaging project, but I am unsure how stable those releases are.

Provide CSS resolver

If the HTML contains <style> tags in the head, HTMLWorker cannot parse them; instead, it writes the contents of the style tags into the PDF file. Can we support inline styles like iText 5+?


        OutputStream outputStream = new FileOutputStream(optionalPath);        
        Document document = new Document(PageSize.A4, 30, 30, 30, 30);
        PdfWriter w = PdfWriter.getInstance(document, outputStream);
        HTMLWorker worker = new HTMLWorker(document);
        document.open();
        worker.parse(new StringReader(HTMLUtil.getLongContent()));

        worker.close();
        document.close();
        w.close();

Unable to parse HTML table with whitespace inside it

Document doc1 = new Document();
doc1.open();
HtmlParser.parse(doc1, new StringReader("<table><tr><td>test</td></tr></table>")); // succeeds

Document doc2 = new Document();
doc2.open();
HtmlParser.parse(doc2, new StringReader("<table> <tr><td>test</td></tr></table>")); // fails

The last line throws this exception:

Exception in thread "main" java.lang.ClassCastException: com.lowagie.text.Table cannot be cast to com.lowagie.text.TextElementArray
	at com.lowagie.text.xml.SAXiTextHandler.handleStartingTags(SAXiTextHandler.java:229)
	at com.lowagie.text.html.SAXmyHtmlHandler.startElement(SAXmyHtmlHandler.java:206)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1359)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2784)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
	at com.lowagie.text.html.HtmlParser.go(HtmlParser.java:85)
	at com.lowagie.text.html.HtmlParser.parse(HtmlParser.java:190)
	at com.example.PDF.main(PDF.java:17)

pom.xml

<dependency>
	<groupId>com.github.librepdf</groupId>
	<artifactId>openpdf</artifactId>
	<version>1.0.5</version>
</dependency>
<dependency>
	<groupId>com.github.librepdf</groupId>
	<artifactId>pdf-html</artifactId>
	<version>1.0.5</version>
</dependency>
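
Until the parser tolerates this input, a possible workaround is to remove whitespace-only runs between tags before parsing. This rests on the assumption, suggested by the two cases above but not verified against HtmlParser, that the whitespace text node directly inside the table element is what triggers the cast. `HtmlWhitespace` and `stripBetweenTags` are hypothetical names.

```java
// Hypothetical pre-processing helper, not part of OpenPDF.
final class HtmlWhitespace {
    private HtmlWhitespace() {}

    /** Removes whitespace-only runs between a closing '>' and the next '<'. */
    static String stripBetweenTags(String html) {
        return html.replaceAll(">\\s+<", "><");
    }

    public static void main(String[] args) {
        // prints <table><tr><td>test</td></tr></table>
        System.out.println(stripBetweenTags("<table> <tr><td>test</td></tr></table>"));
    }
}
```

Note that this also removes meaningful spaces between adjacent inline elements, such as the space in `<b>a</b> <i>b</i>`, so it is only suitable for markup where that does not matter.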

Calls to String.toLowerCase(), and friends should be checked for proper use of locales

Case folding is used for various pieces of data and metadata in PDF syntax (perhaps also in user content). Those transformations are based on Adobe's general assumption of US-ASCII/ISO Latin-1 encoding. As noted in the comment on pull request #76, this fails for (at least) Turkish locales, where capital I can fold to a non-Latin-1 lowercase dotless i (ı).

Where PDF syntax is being processed, the ROOT (no-language) locale should be used. Each case has to be examined, at least superficially, to determine whether:

The system locale should be used (e.g. for filenames).
The ROOT locale should be used, as discussed.
A locale defined in the PDF itself needs to be used, as transformations are being performed on the content streams. (I do not know for sure that there are any such cases at this point.)
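
The Turkish case is easy to reproduce with the JDK alone. The sketch below contrasts locale-sensitive folding with `Locale.ROOT`; `CaseFolding` and `foldForSyntax` are hypothetical names used only for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not part of OpenPDF: shows locale-sensitive vs locale-neutral case folding.
final class CaseFolding {
    private CaseFolding() {}

    /** Locale-neutral folding, suitable for PDF syntax such as names and keywords. */
    static String foldForSyntax(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("TITLE".toLowerCase(turkish)); // prints "tıtle", with dotless ı (U+0131)
        System.out.println(foldForSyntax("TITLE"));       // prints "title"
    }
}
```

The no-argument `toLowerCase()` uses the JVM's default locale, so on a machine configured for Turkish it behaves like the first call.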

Error while retrieving text from pdf with an empty page

While reading the text of a PDF page by page, I ran into a problem when the document contained a blank page. All other content was loaded just fine.

java.lang.NullPointerException at com.lowagie.text.pdf.parser.PdfTextExtractor.getContentBytesFromContentObject(PdfTextExtractor.java:157)
  at com.lowagie.text.pdf.parser.PdfTextExtractor.getContentBytesForPage(PdfTextExtractor.java:138)
  at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:223)
  at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:199)

Set up Travis CI automatic tests

We should set up automatic builds on Travis CI for this project.

(This requires organization access to the LibrePDF organization)

Develop version - SNAPSHOT?

I noticed that all of the code in the master branch has 1.0 as a version number in the pom.xml files.

Shouldn't we follow the Maven standard of using SNAPSHOT versions, and use the full version number only for the released version? That would mean changing the 1.0 in the pom files to 1.1-SNAPSHOT or 1.0.1-SNAPSHOT.

Since I'm preparing a couple of pull requests for this project, I'd like to understand how this is handled in OpenPDF. Happy to also create a PR for adjusting the version numbers - just let me know.

Change package name to com.github.librepdf

Since the Maven groupId and the Java package name of the library have nothing in common now, wouldn't it be better to move all classes from com.lowagie to com.github.librepdf / com.github.librepdf.openpdf?

Correct the license

This repo still has the incorrectly changed license from rtfarte/OpenPDF. It needs to pull in the latest changes from rtfarte/OpenPDF#17

Some input files use or override a deprecated API

I get these warnings of "Some input files use or override a deprecated API" when compiling the latest version of OpenPDF:

[INFO] --- maven-compiler-plugin:3.2:compile (default-compile) @ openpdf ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 344 source files to C:\Users\andreas\librepdf\OpenPDF\openpdf\target\classes
[INFO] /C:/Users/andreas/librepdf/OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/PdfPKCS7.java: Some input files use or override a deprecated API.
[INFO] /C:/Users/andreas/librepdf/OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/PdfPKCS7.java: Recompile with -Xlint:deprecation for details.
[INFO] /C:/Users/andreas/librepdf/OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/PdfReader.java: Some input files use unchecked or unsafe operations.
[INFO] /C:/Users/andreas/librepdf/OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/PdfReader.java: Recompile with -Xlint:unchecked for details.

Assertj dependency should have test scope

Openpdf version 1.0.4 has assertj-core as a compile scope dependency (inherited from openpdf-parent). I believe this dependency should have test scope.

By the way: Thanks for providing OpenPDF. 👍

Issue closed: LGPL license

The readme says that this fork is based on iText 4, but to be precise:

iText 2.1.7 (7 Jul 2009) was the last MPL/LGPL release by iText Software.
4.2.0 was an internal SVN tag, used to sync up versions between iText (Java) and iTextSharp (.NET). The latter was at 4.1.6 at that point. However, iText Software never released a build based on the 4.2.0 tag. It was a mid-development construct and the software wasn't guaranteed to be stable at that point. When iText migrated from SVN to Git, some technical constructs were cleaned up (by me personally, see full disclosure below), including the internal 4.2.0 tag. As far as iText Software is concerned, there never was a release of "iText 4".
- 2.1.7-59-g935969371a (27 November 2009) was the last pre-AGPL commit and corresponds with the former 4.2.0 tag in SVN.
- 2.1.7-60-gf69dd81b2e (1 December 2009) is where the AGPL headers were added
iText 5.0.0 (8 December 2009) was the first AGPL release by iText Software.
On 31 August 2010, GitHub user ymasory uploaded a version of iText "MPL/LGPL" to GitHub. It is unclear if this was based on 2.1.7 or on 2.1.7-59-g935969371a. They did not accept pull requests or do any other development.
On 19 September 2012, a now-defunct New York software startup called InProTopia Corporation (as far as I can tell, founded by a student of the Columbia University) took ymasory's repo and used that to upload a Maven build of "iText 4.2.0" and "iText 4.2.1" to Maven Central. However, they used (or hijacked?) com.lowagie as GroupId, which they were not allowed to do according to Apache's Guide to uploading artifacts to the Central Repository. This is explained in a blog post on iText's website: http://itextpdf.com/maven-update-problem-with-itext-4.2.2. See also this Stack Overflow answer: http://stackoverflow.com/a/14213851/766786
For clarity of this overview, I skipped some of the intermediate forks.

Conclusion: this project should do its due diligence and make absolutely sure what it is based upon: is it iText 2.1.7 or is it iText 2.1.7-59-g935969371a? I don't want to sound like I'm spreading FUD, but you need to make absolutely sure that your users aren't in uncharted territory.

As a side note (maybe this should be a separate issue?), if you ever plan to upload to Maven Central, then you need to change every reference to com.lowagie to something else, as described in the link above.

I recommend that you contact Software Freedom Conservancy for legal and technical advice. They have extensive experience with community developers taking over an Open Source project after a license change.

Full disclosure: I am QA & Release Engineer at iText Software, but I've been an Open Source user & advocate since decades before I joined iText Software. From a personal point of view, I wish this project good luck because it is 8 years behind in development. The fact that I took considerable time to do my research for this issue, should give you a clue that I don't want to intentionally harm this project. From a professional point of view, I welcome the competition. It keeps us on edge. :-)