dotemacs / pdfboxing

Nice wrapper of PDFBox in Clojure

License: BSD 3-Clause "New" or "Revised" License

Clojure 98.79% HTML 1.21%

pdf clojure pdfbox pdf-forms

pdfboxing's Introduction

`pdfboxing`

Clojure PDF manipulation library & wrapper for PDFBox.

Usage

Extract text

(require '[pdfboxing.text :as text])
(text/extract "test/pdfs/hello.pdf")

Merge multiple PDFs

(require '[pdfboxing.merge :as pdf])
(pdf/merge-pdfs :input ["test/pdfs/clojure-1.pdf" "test/pdfs/clojure-2.pdf"] :output "foo.pdf")

Merge multiple images into single PDF

Use merge-images-from-path to provide the images as a vector of string paths, or merge-images-from-byte-array to provide them as a vector of byte arrays. Each image is placed on its own page.

(require '[pdfboxing.merge :as pdf])
(pdf/merge-images-from-path ["image1.png" "image2.png"] "output.pdf")

Split a PDF into multiple PDDocuments

 (require '[pdfboxing.split :as pdf])

List of PDDocument pages 1 through 8

 (pdf/split-pdf :input "test/pdfs/multi-page.pdf" :start 1 :end 8)

Splits the PDF into single pages, returned as a list of PDDocuments

 (pdf/split-pdf :input "test/pdfs/multi-page.pdf")

Splits the PDF in half and writes them to disk as multi-page-1.pdf and multi-page-2.pdf

 (pdf/split-pdf-at :input "test/pdfs/multi-page.pdf")

Splits into two PDFs, the first with 5 pages and the second with the rest

 (pdf/split-pdf-at :input "test/pdfs/multi-page.pdf" :split 5)

List form fields of a PDF

To list fields and values:

(require '[pdfboxing.form :as form])
(form/get-fields "test/pdfs/interactiveform.pdf")
{"Emergency_Phone" "", "ZIP" "", "COLLEGE NO DEGREE" "", ...}

Fill in PDF forms

To fill in a form's fields, supply a hash map of field names and desired values. This creates a copy of fillable.pdf as new.pdf with the fields filled in:

(require '[pdfboxing.form :as form])
(form/set-fields "test/pdfs/fillable.pdf" "test/pdfs/new.pdf" {"Text10" "My first name"})

Rename form fields of a PDF

To rename PDF form fields, supply a hash map where the keys are the current names and the values are the new names:

(require '[pdfboxing.form :as form])
(form/rename-fields "test/pdfs/interactiveform.pdf" "test/pdfs/addr1.pdf" {"Address_1" "NewAddr"})

Get page count of a PDF document

(require '[pdfboxing.info :as info])
(info/page-number "test/pdfs/interactiveform.pdf")

Get info about a PDF document

Details such as the title, author, subject, keywords, creator and producer:

(require '[pdfboxing.info :as info])
(info/about-doc "test/pdfs/interactiveform.pdf")

Draw lines on a PDF document

Supply the input PDF document, a name for the output PDF document, and the coordinates of the line, including the number of the page on which it should be drawn:

(require '[pdfboxing.draw :as draw])
(draw/draw-line :input-pdf "test/pdfs/clojure-1.pdf"
                :output-pdf "ninja.pdf"
                :coordinates {:page-number 0
                              :x 0
                              :y 160
                              :x1 650
                              :y1 160})

Convert a PDF document to a very simple HTML document

Supply a PDF document's name; a simple HTML file is created in the root folder:

(require '[pdfboxing.tools :as tools])
(tools/pdf-to-html "myFile.pdf")

Compatibility with PDFBox's PDDocuments

The following functions referenced above have direct compatibility with PDFBox's internal PDDocument type:

text/extract
pdf/split-pdf
form/get-fields
form/set-fields
form/rename-fields
info/page-number
draw/draw-line

This lets you pass a PDDocument in place of the file path that each of these functions takes as input. For example, you can split a PDF into pages and then extract the text from only the third page:

(require '[pdfboxing.text :as text])
(require '[pdfboxing.split :as split])
(-> (split/split-pdf :input "test/pdfs/multi-page.pdf")
    (nth 2)
    text/extract)

pdfboxing's Issues

Update to PDFBox 2.0.12 due to security issue in 2.0.11

Details are here

Images

I have a PDF file with a form field that's an image. Is there anything in this library to put an image in that field?

Can't load PDF URLs

Though PDFBox's PDDocument/load can handle PDFs given as a URL, pdfboxing cannot. This is because load-pdf wraps PDDocument/load with is-pdf?, which in turn relies on FileDataSource. Joshua Miller delineates an approach that has no such restriction:

  (with-open [pd (PDDocument/load (URL. url))]
    (let [stripper (PDFTextStripper.)]
      (.getText stripper pd)))

I'm not sure how you want to approach this restriction so I don't have a PR at this time. Cheers!
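
A possible workaround, sketched under the assumption of PDFBox 2.x (where PDDocument/load accepts an InputStream), is to open the URL as a stream and bypass load-pdf entirely. The name extract-text-from-url is hypothetical and is not part of pdfboxing:

```clojure
(import '(java.net URL)
        '(org.apache.pdfbox.pdmodel PDDocument)
        '(org.apache.pdfbox.text PDFTextStripper))

(defn extract-text-from-url
  "Hypothetical helper, not part of pdfboxing.
  Opens `url` (a string) as a stream, loads it with PDFBox 2.x and
  returns the document text. Both the stream and the document are
  closed when the body finishes."
  [url]
  (with-open [in  (.openStream (URL. url))
              doc (PDDocument/load in)]
    (.getText (PDFTextStripper.) doc)))
```

Because the text is read before with-open closes the document, the returned string stays usable after the call. No is-pdf? style check is performed here, so a non-PDF URL surfaces as an exception from PDFBox.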

Export to image support

First of all - kudos for this library! It proves to be very useful to our project in Magnet.
However, we need the export-to-image functionality that Apache's PDFBox provides. We thought that it would be nice if your library had it as well.

We'd be happy to make a PR with this.

warning on PDF merge

Hi there,

I've been messing around a little with this library, and I noticed that PDFBox generates a warning about not closing the file when using the merge example you give on the readme:

Apr 02, 2016 12:30:17 PM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document

Is this expected behavior? I'm not sure how PDFBox works under the hood, so it's not clear to me whether this is a problem or just a silly warning that should be suppressed. (When working with large collections of documents, does this perhaps suggest that they won't be garbage collected?)

COSStream has been closed and cannot be read.

Hi,

This code snippet:

(-> (split/split-pdf :input "test/pdfs/multi-page.pdf")
    (nth <some-idx>)
    text/extract)

often throws:
Execution error (IOException) at org.apache.pdfbox.cos.COSStream/checkClosed (COSStream.java:83). COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?

Thanks for lib :)

Less strict `merge-pdfs`

Expected Behavior

I have a couple of input-streams containing PDFs that I want to merge into an output-stream.

The current merge-pdfs function requires that all inputs are files, but PDFBox supports adding InputStreams.

Same with the output, merge-pdfs requires that the output is a filename, but PDFBox supports both filenames and OutputStreams.

Potential Solution

To solve this in my codebase I've updated merge-pdfs to be less strict:

(defn merge-pdfs [& {:keys [input output]}]
  (let [merger (PDFMergerUtility.)]
    (doseq [source input]
      (.addSource merger source))
    (cond
      (instance? java.io.OutputStream output)
      (.setDestinationStream merger output)

      :else
      (.setDestinationFileName merger output))
    (.mergeDocuments merger)))

But I wanted your feedback before submitting a PR.
Is this something you'd be interested in?

How to embed a signature or an image of a signature?

I need to fill some PDF forms and render a signature on top of the form. Is this possible?

Issues with `merge-pddocuments`

Opening a new issue after discussion in #26

(pdf-split/merge-pddocuments
  :docs (pdf-split/split-pdf :input path :start 1 :end 4)
  :output "test.pdf")

Unhandled java.io.IOException
   COSStream has been closed and cannot be read. Perhaps its enclosing
   PDDocument has been closed?

Please check the mentioned issue for further details

Working with split pdfs

A few questions

What is the best way to save the output of split-pdf as a pdf?
Is java interop necessary for that, or another clojure library?
for instance, if I want to turn "/sample/pdf-title.pdf" into "sample/pdf-title-pages/1.pdf" "sample/pdf-title-pages/2.pdf"
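
One way this could be done with plain Java interop, assuming (as the README states) that split-pdf returns a list of PDDocument objects. PDDocument's own save method writes each one to disk. The helper name save-pages! is hypothetical and is not part of pdfboxing, and the sketch has not been checked against the "COSStream has been closed" errors reported elsewhere in this tracker:

```clojure
(require '[clojure.java.io :as io])
(import '(org.apache.pdfbox.pdmodel PDDocument))

(defn save-pages!
  "Hypothetical helper, not part of pdfboxing.
  Saves each PDDocument in `docs` to `out-dir` as 1.pdf, 2.pdf, ...
  Closes every document after saving it and returns the written paths."
  [docs out-dir]
  (.mkdirs (io/file out-dir))
  (doall
   (map-indexed
    (fn [i ^PDDocument doc]
      (let [path (str (io/file out-dir (str (inc i) ".pdf")))]
        ;; with-open closes the document once it has been written
        (with-open [d doc]
          (.save d path))
        path))
    docs)))
```

With pdfboxing.split required as split, a call might look like (save-pages! (split/split-pdf :input "sample/pdf-title.pdf") "sample/pdf-title-pages").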

Please deploy 0.1.6 to Clojars

I'd like to use the new version without vendoring this lib. Thank you!

License

Hi, what is the license for this project? Could you add one to the repo?

README wrongly documents split-pdf-at

In readme we can read:
Splits the PDF in half and writes them to disk as clojure-1.pdf and clojure-2.pdf

The example below it really does split the input file in half, but there is no way to specify what the output files are.

ClassNotFoundException when using Java 9

OS: Linux Mint (Sarah)

Using this project.clj

(defproject myproject "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [proto-repl "0.3.1"]
                 [pdfboxing "0.1.14-SNAPSHOT"]])

I get the following error under Java 9:

➜ myproject> lein -U repl
nREPL server started on port 33000 on host 127.0.0.1 - nrepl://127.0.0.1:33000
REPL-y 0.3.7, nREPL 0.2.12
Clojure 1.9.0
Java HotSpot(TM) 64-Bit Server VM 9.0.4+11
Docs: (doc function-name-here)
(find-doc "part-of-name-here")
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e

user=> (require '[pdfboxing.text :as text] :verbose)
(clojure.core/load "/pdfboxing/text")
(clojure.core/load "/pdfboxing/common")

CompilerException java.lang.ClassNotFoundException: javax.activation.FileDataSource, compiling:(pdfboxing/common.clj:1:1)
(clojure.core/in-ns 'pdfboxing.common)
(clojure.core/alias 'io 'clojure.java.io)

The issue seems to be that the java.activation module is deprecated in Java 9; see this issue.

I briefly tried downloading the JavaBeans Activation Framework and using the --add-modules argument to :jvm-opts in my profile, but gave up. Instead, when using pdfboxing I reverted to Java 8 via

➜ myproject> sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

Selection  Path                                            Priority  Status
0          /usr/lib/jvm/java-9-oracle/bin/java             1091      auto mode
1          /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java  1081      manual mode
2          /usr/lib/jvm/java-9-oracle/bin/java             1091      manual mode

Press <enter> to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode

Now, starting a REPL and bringing in pdfboxing works:

➜ myproject> lein -U repl
nREPL server started on port 39468 on host 127.0.0.1 - nrepl://127.0.0.1:39468
REPL-y 0.3.7, nREPL 0.2.12
Clojure 1.9.0
OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
Docs: (doc function-name-here)
(find-doc "part-of-name-here")
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e

user=> (require '[pdfboxing.text :as text] :verbose)
(clojure.core/load "/pdfboxing/text")
(clojure.core/load "/pdfboxing/common")
(clojure.core/in-ns 'pdfboxing.common)
(clojure.core/alias 'io 'clojure.java.io)
(clojure.core/in-ns 'pdfboxing.text)
(clojure.core/alias 'common 'pdfboxing.common)
(clojure.core/in-ns 'user)
(clojure.core/alias 'text 'pdfboxing.text)
nil
user=>

IOException COSStream has been closed and cannot be read

I'm using the example code from the readme to try and extract the text from the second page of a pdf, but I get the following error:

IOException COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed? org.apache.pdfbox.cos.COSStream.checkClosed (COSStream.java:77)

This is the code I'm running:

(text/extract (nth (split/split-pdf :input "resources/my-file.pdf") 1))

Extract text from pdf area

For text extraction, pdfboxing currently uses org.apache.pdfbox.text.PDFTextStripper which works on the entire document. However, any document structure is removed during text extraction, so the more data the pdf contains, the harder it becomes to sort it out.

As an alternative, there's also org.apache.pdfbox.text.PDFTextStripperByArea, which allows you to specify a rectangle to extract text from with pretty good results in PDF files with (visually) structured content.

I have prepared a rough prototype that seems to work:

(ns my-ns
  (:require [pdfboxing.common :as common])
  (:import (org.apache.pdfbox.text PDFTextStripperByArea)
           (java.awt Rectangle)))

(defn extract-by-area
  "get text from a specified area of a PDF document"
  [pdfdoc x y w h page]
  (with-open [doc (common/obtain-document pdfdoc)]
    (let [rectangle       (Rectangle. x y w h)
          pdpage          (.getPage doc (inc page))
          pdftextstripper (doto (PDFTextStripperByArea.)
                            (.addRegion "region" rectangle)
                            (.extractRegions pdpage))]
      (.getTextForRegion pdftextstripper "region"))))

@dotemacs would you (or anyone else around, for that matter) be interested in this functionality?

If so let me know and I'll put some time into making a proper PR.

note: the unit of measurement when defining the rectangle coordinates is a pt (~0.035cm or ~0.0139in)
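
Since java.awt.Rectangle takes integer coordinates in points, a small conversion step may help when the region is measured in centimetres or inches. The sketch below only does the arithmetic (1 in = 72 pt, 1 in = 2.54 cm); the names cm->pt, in->pt and area-cm->pt are hypothetical and are not part of pdfboxing or PDFBox:

```clojure
(defn in->pt
  "Inches to PDF points (72 pt per inch)."
  [inches]
  (* 72.0 inches))

(defn cm->pt
  "Centimetres to PDF points (2.54 cm per inch, 72 pt per inch)."
  [cm]
  (* 72.0 (/ cm 2.54)))

(defn area-cm->pt
  "Converts an x, y, width, height region given in centimetres to
  whole points, rounded to the nearest integer."
  [x y w h]
  (mapv #(Math/round (double (cm->pt %))) [x y w h]))

(cm->pt 2.54) ; → 72.0
```

The resulting vector could then be spread into the prototype above, for example (apply extract-by-area "file.pdf" (conj (area-cm->pt 2 3 10 5) 0)), where the file name and the region are made up for illustration.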

Expected Behavior

Update the code to align with the major changes in pdfbox v3

Actual Behavior

Tests pass after upgrade

Steps to Reproduce the Problem

Specifications

Version:
Platform:
Subsystem:

Anything else that you think is relevant

I have made the updates in a fork

Let me know if you will accept a PR

[ I ask because I see that there are several open PRs and I don't want to waste your time ]
