pdfboxing's Introduction

pdfboxing

Clojure PDF manipulation library & wrapper for PDFBox.

Usage

Extract text

(require '[pdfboxing.text :as text])
(text/extract "test/pdfs/hello.pdf")

Merge multiple PDFs

(require '[pdfboxing.merge :as pdf])
(pdf/merge-pdfs :input ["test/pdfs/clojure-1.pdf" "test/pdfs/clojure-2.pdf"] :output "foo.pdf")

Merge multiple images into single PDF

Use merge-images-from-path to supply the images as a vector of path strings, or merge-images-from-byte-array to supply them as a vector of byte arrays. Each image is placed on its own page.

(require '[pdfboxing.merge :as pdf])
(pdf/merge-images-from-path ["image1.png" "image2.png"] "output.pdf")
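
If the images live on disk but you want to try the byte-array variant, the conversion could look like the sketch below. file->bytes is a hypothetical helper, not part of pdfboxing, and the argument order of merge-images-from-byte-array is assumed to mirror merge-images-from-path; check the library source before relying on it.

```clojure
(import '(java.nio.file Files Paths))

(defn file->bytes
  "Reads the whole file at `path` into a byte array.
  Hypothetical helper, not part of pdfboxing."
  [path]
  (Files/readAllBytes (Paths/get path (make-array String 0))))

(comment
  ;; Assumes pdfboxing is on the classpath and that the byte-array variant
  ;; takes the same (images, output) arguments as merge-images-from-path.
  (require '[pdfboxing.merge :as pdf])
  (pdf/merge-images-from-byte-array
    (mapv file->bytes ["image1.png" "image2.png"])
    "output.pdf"))
```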

Split a PDF into multiple PDDocuments

 (require '[pdfboxing.split :as pdf])

Returns pages 1 through 8 as a list of PDDocuments

 (pdf/split-pdf :input "test/pdfs/multi-page.pdf" :start 1 :end 8)

Splits the PDF into single pages, returned as a list of PDDocuments

 (pdf/split-pdf :input "test/pdfs/multi-page.pdf")

Splits the PDF in half and writes the halves to disk as multi-page-1.pdf and multi-page-2.pdf

 (pdf/split-pdf-at :input "test/pdfs/multi-page.pdf")

Splits into two PDFs, the first with 5 pages and the second with the rest

 (pdf/split-pdf-at :input "test/pdfs/multi-page.pdf" :split 5)

List form fields of a PDF

To list fields and values:

(require '[pdfboxing.form :as form])
(form/get-fields "test/pdfs/interactiveform.pdf")
{"Emergency_Phone" "", "ZIP" "", "COLLEGE NO DEGREE" "", ...}
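
Since the result shown above is an ordinary map of field name to value, regular map functions apply to it. As a minimal sketch, the hypothetical helper blank-fields (not part of pdfboxing) lists the fields that are still empty; the sample data is made up to match the shape of the output above.

```clojure
(require '[clojure.string :as str])

(defn blank-fields
  "Returns the sorted names of fields whose value is blank.
  Hypothetical helper, not part of pdfboxing."
  [fields]
  (->> fields
       (filter (fn [[_ value]] (str/blank? (str value))))
       (map key)
       sort
       vec))

;; Sample data shaped like the get-fields output; "Jane" is invented.
(blank-fields {"Emergency_Phone" "" "ZIP" "" "Name" "Jane"})
;; => ["Emergency_Phone" "ZIP"]
```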

Fill in PDF forms

To fill in a form's fields, supply a hash map of field names and desired values. The example below creates a copy of fillable.pdf as new.pdf with the fields filled in:

(require '[pdfboxing.form :as form])
(form/set-fields "test/pdfs/fillable.pdf" "test/pdfs/new.pdf" {"Text10" "My first name"})

Rename form fields of a PDF

To rename PDF form fields, supply a hash map where the keys are the current names and the values are the new names:

(require '[pdfboxing.form :as form])
(form/rename-fields "test/pdfs/interactiveform.pdf" "test/pdfs/addr1.pdf" {"Address_1" "NewAddr"})
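
When many fields need the same kind of change, the rename map can be built programmatically. The sketch below uses a hypothetical helper, prefix-renames, which is not part of pdfboxing; the prefix and field names are only examples.

```clojure
(defn prefix-renames
  "Builds a map of current name -> prefixed name for each name in `names`.
  Hypothetical helper, not part of pdfboxing."
  [prefix names]
  (into {} (map (fn [n] [n (str prefix n)])) names))

(prefix-renames "home_" ["Address_1" "ZIP"])
;; => {"Address_1" "home_Address_1", "ZIP" "home_ZIP"}
```

The resulting map has the same shape as the third argument to form/rename-fields shown above.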

Get page count of a PDF document

(require '[pdfboxing.info :as info])
(info/page-number "test/pdfs/interactiveform.pdf")

Get info about a PDF document

The info covers fields such as title, author, subject, keywords, creator and producer.

(require '[pdfboxing.info :as info])
(info/about-doc "test/pdfs/interactiveform.pdf")

Draw lines on a PDF document

Supply the input PDF document, a name for the output PDF document, and the coordinates of the line, including the page number on which it should be drawn.

(require '[pdfboxing.draw :as draw])
(draw/draw-line :input-pdf "test/pdfs/clojure-1.pdf"
                :output-pdf "ninja.pdf"
                :coordinates {:page-number 0
                              :x 0
                              :y 160
                              :x1 650
                              :y1 160})
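
The coordinates map can be built by a small function when several lines are needed. This is a sketch only: horizontal-line is a hypothetical helper, and it assumes that :x/:y is the start point and :x1/:y1 the end point, as the example above suggests.

```clojure
(defn horizontal-line
  "Returns a :coordinates map for a horizontal line at height `y`,
  running from `x-start` to `x-end` on page `page-number`.
  Hypothetical helper, not part of pdfboxing."
  [page-number y x-start x-end]
  {:page-number page-number :x x-start :y y :x1 x-end :y1 y})

(horizontal-line 0 160 0 650)
;; => {:page-number 0, :x 0, :y 160, :x1 650, :y1 160}
```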

Convert a PDF document to a very simple HTML document

Supply the PDF document's name; a simple HTML file is created in the root folder.

(require '[pdfboxing.tools :as tools])
(tools/pdf-to-html "myFile.pdf")

Compatibility with PDFBox's PDDocuments

The following functions referenced above have direct compatibility with PDFBox's internal PDDocument type:

  • text/extract
  • pdf/split-pdf
  • form/get-fields
  • form/set-fields
  • form/rename-fields
  • info/page-number
  • draw/draw-line

This lets you pass a PDDocument wherever these functions take a file path as input. For example, you can split a PDF into pages and then extract the text from only the 3rd page:

(require '[pdfboxing.text :as text])
(require '[pdfboxing.split :as split])
(-> (split/split-pdf :input "test/pdfs/multi-page.pdf")
    (nth 2)
    text/extract)

pdfboxing's People

Contributors

akeboshiwind, bitdeli-chef, cguckes, charignon, danielglauser, dotemacs, dsapala, jokimaki, kkazuo, lucasdf, mpenttila, mrichards42, nukecoder, p5764, pmensik, raymcdermott, stankovicmarko, tirkarthi

pdfboxing's Issues

Images

I have a PDF file with a form field that's an image. Is there anything in this library to put an image in that field?

Can't load PDF URLs

Though PDFBox's PDDocument/load can handle PDFs given as a URL, pdfboxing cannot. This is because load-pdf wraps PDDocument/load with is-pdf?, which in turn relies on FileDataSource. Joshua Miller delineates an approach that has no such restriction:

  (with-open [pd (PDDocument/load (URL. url))]
    (let [stripper (PDFTextStripper.)]
      (.getText stripper pd)))

I'm not sure how you want to approach this restriction so I don't have a PR at this time. Cheers!
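
One possible workaround, sketched here under assumptions rather than taken from the library, is to download the PDF to a temporary file and hand that path to pdfboxing. url->temp-pdf is a hypothetical helper; the .pdf suffix is kept on the assumption that the file check looks at the file name, and the caller is responsible for deleting the temporary file.

```clojure
(require '[clojure.java.io :as io])

(defn url->temp-pdf
  "Copies the resource at `url` into a temporary file ending in .pdf and
  returns that file's path. Hypothetical helper, not part of pdfboxing."
  [url]
  (let [tmp (java.io.File/createTempFile "download-" ".pdf")]
    (with-open [in (io/input-stream (io/as-url url))]
      (io/copy in tmp))
    (.getPath tmp)))

(comment
  ;; Assumes pdfboxing is on the classpath; the URL is a placeholder.
  (require '[pdfboxing.text :as text])
  (text/extract (url->temp-pdf "https://example.com/some.pdf")))
```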

Export to image support

First of all - kudos for this library! It proves to be very useful to our project in Magnet.
However, we need the export-to-image functionality that Apache PDFBox provides. We thought it would be nice if your library had it as well.

We'd be happy to make a PR with this.

warning on PDF merge

Hi there,

I've been messing around a little with this library, and I noticed that PDFBox generates a warning about not closing the file when using the merge example you give on the readme:

Apr 02, 2016 12:30:17 PM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document

Is this expected behavior? I'm not sure how PDFBox works under the hood, so it's not clear to me whether this is a problem or just a silly warning that should be suppressed. (When working with large collections of documents, does this perhaps suggest that they won't be garbage collected?)

COSStream has been closed and cannot be read.

Hi,

This code snippet:

(-> (split/split-pdf :input "test/pdfs/multi-page.pdf")
    (nth <some-idx>)
    text/extract)

often throws:
Execution error (IOException) at org.apache.pdfbox.cos.COSStream/checkClosed (COSStream.java:83). COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?

Thanks for lib :)

Less strict `merge-pdfs`

Expected Behavior

I have a couple of input-streams containing PDFs that I want to merge into an output-stream.

The current merge-pdfs function requires that all inputs are files, but PDFBox supports adding InputStreams.

Same with the output, merge-pdfs requires that the output is a filename, but PDFBox supports both filenames and OutputStreams.

Potential Solution

To solve this in my codebase I've updated merge-pdfs to be less strict:

(defn merge-pdfs [& {:keys [input output]}]
  (let [merger (PDFMergerUtility.)]
    (doseq [source input]
      (.addSource merger source))
    (cond
      (instance? java.io.OutputStream output)
      (.setDestinationStream merger output)

      :else
      (.setDestinationFileName merger output))
    (.mergeDocuments merger)))

But I wanted your feedback before submitting a PR.
Is this something you'd be interested in?

Issues with `merge-pddocuments`

Opening a new issue after discussion in #26

(pdf-split/merge-pddocuments
  :docs (pdf-split/split-pdf :input path :start 1 :end 4)
  :output "test.pdf")

Unhandled java.io.IOException
   COSStream has been closed and cannot be read. Perhaps its enclosing
   PDDocument has been closed?

Please check the mentioned issue for further details

Working with split pdfs

A few questions

  1. What is the best way to save the output of split-pdf as a pdf?
    Is java interop necessary for that, or another clojure library?
    for instance, if I want to turn "/sample/pdf-title.pdf" into "sample/pdf-title-pages/1.pdf" "sample/pdf-title-pages/2.pdf"
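
A possible approach, sketched under assumptions, is to compute one output path per page and save each PDDocument through Java interop. page-paths is a hypothetical helper; .save and .close are PDFBox methods on PDDocument. The saving part assumes the documents returned by split-pdf are still open, which may not hold given the COSStream errors reported in other issues.

```clojure
(require '[clojure.string :as str])

(defn page-paths
  "Returns one output path per page for `source-path`, numbered from 1.
  Hypothetical helper, not part of pdfboxing."
  [source-path page-count]
  (let [base (str/replace source-path #"\.pdf$" "")]
    (mapv #(str base "-pages/" % ".pdf") (range 1 (inc page-count)))))

(page-paths "sample/pdf-title.pdf" 2)
;; => ["sample/pdf-title-pages/1.pdf" "sample/pdf-title-pages/2.pdf"]

(comment
  ;; Assumes pdfboxing is on the classpath and that split-pdf returns
  ;; open PDDocument objects.
  (require '[pdfboxing.split :as split]
           '[clojure.java.io :as io])
  (let [source "sample/pdf-title.pdf"
        docs   (split/split-pdf :input source)]
    (doseq [[doc path] (map vector docs (page-paths source (count docs)))]
      (io/make-parents path)
      (.save doc path)
      (.close doc))))
```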

License

Hi, what is the license for this project? Could you add one to the repo?

README wrongly documents split-pdf-at

In the README we can read:
Splits the PDF in half and writes them to disk as clojure-1.pdf and clojure-2.pdf

The example below that text does split its input file in half, but there's no way to specify what the output files are.

ClassNotFoundException when using Java 9

OS: Linux Mint (Sarah)

Using this project.clj

(defproject myproject "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [proto-repl "0.3.1"]
                 [pdfboxing "0.1.14-SNAPSHOT"]])

I get the following error under Java 9:

➜ myproject> lein -U repl
nREPL server started on port 33000 on host 127.0.0.1 - nrepl://127.0.0.1:33000
REPL-y 0.3.7, nREPL 0.2.12
Clojure 1.9.0
Java HotSpot(TM) 64-Bit Server VM 9.0.4+11
Docs: (doc function-name-here)
(find-doc "part-of-name-here")
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e

user=> (require '[pdfboxing.text :as text] :verbose)
(clojure.core/load "/pdfboxing/text")
(clojure.core/load "/pdfboxing/common")

CompilerException java.lang.ClassNotFoundException: javax.activation.FileDataSource, compiling:(pdfboxing/common.clj:1:1)
(clojure.core/in-ns 'pdfboxing.common)
(clojure.core/alias 'io 'clojure.java.io)

The issue seems to be that java.activation is deprecated in Java 9, see this issue.

I briefly tried downloading the JavaBeans Activation Framework and using the --add-modules argument to :jvm-opts in my profile, but gave up. Instead, when using pdfboxing I reverted to Java 8 via

➜ myproject> sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                             Priority   Status
* 0            /usr/lib/jvm/java-9-oracle/bin/java               1091      auto mode
  1            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java    1081      manual mode
  2            /usr/lib/jvm/java-9-oracle/bin/java               1091      manual mode

Press <enter> to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode

Now, starting a REPL and bringing in pdfboxing works:

➜ myproject> lein -U repl
nREPL server started on port 39468 on host 127.0.0.1 - nrepl://127.0.0.1:39468
REPL-y 0.3.7, nREPL 0.2.12
Clojure 1.9.0
OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
Docs: (doc function-name-here)
(find-doc "part-of-name-here")
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e

user=> (require '[pdfboxing.text :as text] :verbose)
(clojure.core/load "/pdfboxing/text")
(clojure.core/load "/pdfboxing/common")
(clojure.core/in-ns 'pdfboxing.common)
(clojure.core/alias 'io 'clojure.java.io)
(clojure.core/in-ns 'pdfboxing.text)
(clojure.core/alias 'common 'pdfboxing.common)
(clojure.core/in-ns 'user)
(clojure.core/alias 'text 'pdfboxing.text)
nil
user=>
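
An alternative to downgrading, untested against this setup, is to put the missing class on the classpath explicitly. The sketch below reuses the project.clj from this report and adds the standalone JavaBeans Activation Framework artifact; that this artifact resolves the error for pdfboxing is an assumption.

```clojure
;; project.clj sketch; project name and versions are taken from the report above.
(defproject myproject "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [proto-repl "0.3.1"]
                 [pdfboxing "0.1.14-SNAPSHOT"]
                 ;; Assumption: this artifact supplies
                 ;; javax.activation.FileDataSource on Java 9 and later.
                 [javax.activation/activation "1.1.1"]])
```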

IOException COSStream has been closed and cannot be read

Hi

I'm using the example code from the readme to try and extract the text from the second page of a pdf, but I get the following error:

IOException COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed? org.apache.pdfbox.cos.COSStream.checkClosed (COSStream.java:77)

This is the code I'm running:

(text/extract (nth (split/split-pdf :input "resources/my-file.pdf") 1))

Extract text from pdf area

For text extraction, pdfboxing currently uses org.apache.pdfbox.text.PDFTextStripper which works on the entire document. However, any document structure is removed during text extraction, so the more data the pdf contains, the harder it becomes to sort it out.

As an alternative, there's also org.apache.pdfbox.text.PDFTextStripperByArea, which allows you to specify a rectangle to extract text from with pretty good results in PDF files with (visually) structured content.

I have prepared a rough prototype that seems to work:

(ns my-ns
  (:require [pdfboxing.common :as common])
  (:import (org.apache.pdfbox.text PDFTextStripperByArea)
           (java.awt Rectangle)))

(defn extract-by-area
  "get text from a specified area of a PDF document"
  [pdfdoc x y w h page]
  (with-open [doc (common/obtain-document pdfdoc)]
    (let [rectangle       (Rectangle. x y w h)
          pdpage          (.getPage doc (inc page))
          pdftextstripper (doto (PDFTextStripperByArea.)
                            (.addRegion "region" rectangle)
                            (.extractRegions pdpage))]
      (.getTextForRegion pdftextstripper "region"))))

@dotemacs would you (or anyone else around, for that matter) be interested in this functionality?

If so let me know and I'll put some time into making a proper PR.

note: the unit of measurement when defining the rectangle coordinates is a pt (~0.035cm or ~0.0139in)
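
Since a point is 1/72 of an inch, rectangle sizes given in inches or centimetres can be converted with a little arithmetic. The helper names below are hypothetical; rounding is shown because java.awt.Rectangle takes integer arguments.

```clojure
(def points-per-inch 72.0)
(def cm-per-inch 2.54)

(defn in->pt
  "Converts inches to PDF points."
  [inches]
  (* inches points-per-inch))

(defn cm->pt
  "Converts centimetres to PDF points."
  [cm]
  (* (/ cm cm-per-inch) points-per-inch))

(in->pt 1)                  ;; => 72.0
(Math/round (cm->pt 1.0))   ;; => 28
```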

Exception is thrown during lines drawing

java.lang.IllegalArgumentException: No matching field found: getAllPages for class org.apache.pdfbox.pdmodel.PDDocumentCatalog is raised when trying to draw a line. I will prepare PR for this simple fix :)

Upgrade to pdfbox 3

Expected Behavior

Update the code to align with the major changes in pdfbox v3

Actual Behavior

Tests pass after upgrade

Steps to Reproduce the Problem

Specifications

  • Version:
  • Platform:
  • Subsystem:

Anything else that you think is relevant

I have made the updates in a fork

Let me know if you will accept a PR

[ I ask because I see that there are several open PRs and I don't want to waste your time ]
