miniPDF

A Python library for building PDF files at a very low level.

The legendary minipdf Python library reaches GitHub. This is a cleaner version of the old micro lib used in more than 10 PDF-related exploits.

Features

It supports only the most basic file structure (PDF32000:7.5.1), that is, a file without incremental updates or linearization, made of four parts:

  • A one-line header identifying the version of the PDF file
  • A body containing the objects that make up the document contained in the file
  • A cross-reference table containing information about the indirect objects in the file
  • A trailer dictionary giving the location of the cross-reference table and of other special objects within the body of the file
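To make the four parts concrete, here is a minimal sketch in plain Python that writes them by hand, without minipdf. The function name `build_pdf` and its parameters are made up for this illustration and are not part of the library; object bodies are passed in as already-serialized ASCII strings.

```python
def build_pdf(objects, root_num=1, version="1.5"):
    """Serialize pre-rendered object bodies into a single-revision PDF file.

    Hypothetical helper for illustration only; not minipdf's API.
    """
    out = bytearray()

    # 1. Header: one line identifying the PDF version.
    out += ("%%PDF-%s\n" % version).encode("ascii")

    # 2. Body: indirect objects numbered from 1; remember each byte offset.
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += ("%d 0 obj\n%s\nendobj\n" % (num, body)).encode("ascii")

    # 3. Cross-reference table: one 20-byte entry per object, plus entry 0,
    #    which is the head of the free list.
    xref_offset = len(out)
    size = len(objects) + 1
    out += ("xref\n0 %d\n" % size).encode("ascii")
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += ("%010d 00000 n \n" % off).encode("ascii")

    # 4. Trailer: names the root object and says where the xref table starts.
    out += ("trailer\n<</Root %d 0 R /Size %d >>\n" % (root_num, size)).encode("ascii")
    out += ("startxref\n%d\n%%%%EOF\n" % xref_offset).encode("ascii")
    return bytes(out)


pdf = build_pdf(["<</Type /Catalog >>"])
```

The only bookkeeping is the byte offsets: each object's offset goes into the cross-reference table, and the table's own offset goes after `startxref`.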

It also supports all the basic PDF types: null, references, strings, numbers, arrays and dictionaries.
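As a rough guide to how these types look once written to a file, the sketch below maps Python values to PDF syntax. `Name`, `Ref` and `to_pdf` are hypothetical stand-ins invented for this example; minipdf has its own classes (`PDFName`, `PDFRef`, and so on), whose internals may differ.

```python
class Name(str):
    """Hypothetical stand-in for a PDF name such as /Catalog."""


class Ref(object):
    """Hypothetical stand-in for an indirect reference such as '3 0 R'."""

    def __init__(self, num, gen=0):
        self.num, self.gen = num, gen


def to_pdf(value):
    """Render a Python value in PDF syntax (illustrative, not exhaustive)."""
    if value is None:
        return "null"
    if isinstance(value, Ref):
        return "%d %d R" % (value.num, value.gen)
    if isinstance(value, Name):          # must be tested before plain str
        return "/" + value
    if isinstance(value, bool):          # must be tested before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Literal strings escape the backslash and both parentheses.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        return "<<" + " ".join("/%s %s" % (k, to_pdf(v)) for k, v in value.items()) + ">>"
    raise TypeError("no PDF representation for %r" % (value,))
```

For instance, `to_pdf([0, 0, 612, 792])` gives `[0 0 612 792]` and `to_pdf({"Type": Name("Catalog")})` gives `<</Type /Catalog>>`.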

Example: A minimal text-displaying PDF

As an example, let's create a minimal text-displaying PDF file in Python using minipdf. The following graph outlines the simplest possible structure:
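As a textual stand-in for that structure, the sketch below records which object links to which, using the objects built in the rest of this example, and checks that everything is reachable from the Catalog. The `links` table and the `reachable` helper are illustrative only and do not use minipdf.

```python
# Each key is an object; each value lists the objects it links to.
links = {
    "Catalog": ["Pages"],                          # /Pages
    "Pages": ["Page"],                             # /Kids
    "Page": ["Pages", "Contents", "Resources"],    # /Parent, /Contents, /Resources
    "Contents": [],
    "Resources": ["Font"],                         # /Font, under the name /F1
    "Font": [],
}


def reachable(links, root):
    """Return every object that can be reached by following links from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(links[node])
    return seen


# Objects that nothing links to would be dead weight in the file.
orphans = set(links) - reachable(links, "Catalog")
print(sorted(orphans))  # prints []
```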

The python script

First we import the lib and create a PDFDoc object representing a document in memory. We also import sys, because the text to display is read from the command line later on.

import sys
from minipdf import *
doc = PDFDoc()

As shown in the previous figure, the main object is the Catalog. The next four lines build a Catalog dictionary object, add it to the document and set it as the root object.

catalog = PDFDict()
catalog['Type'] = PDFName('Catalog')
doc += catalog
doc.setRoot(catalog)

At this point we don’t even have a valid PDF, but if we output the incomplete document, this is what it looks like:

%PDF-1.5
%���
1 0 obj
<</Type /Catalog >>
endobj
xref
0 2
0000000000 65535 f 
0000000015 00000 n 
trailer
<</Root 1 0 R /Size 2 >>
startxref
50
%%EOF
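The numbers in this listing are byte offsets: `0000000015` is where `1 0 obj` starts and `50` is where `xref` starts. The following sketch checks those pointers in plain Python. `check_xref` is a hypothetical helper, and the sample assumes that the binary comment on the second line holds four bytes after the `%` (shown garbled above); the actual byte values are unknown, so placeholders are used.

```python
import re


def check_xref(data):
    """Verify the xref pointers of a simple PDF; return {object number: offset}."""
    # startxref holds the byte offset of the 'xref' keyword.
    xref_at = int(re.search(rb"startxref\s+(\d+)", data).group(1))
    assert data[xref_at:xref_at + 4] == b"xref"

    # Subsection header: first object number and number of entries.
    header = re.match(rb"xref\s+(\d+)\s+(\d+)\s+", data[xref_at:])
    first, count = int(header.group(1)), int(header.group(2))
    entries_at = xref_at + header.end()

    found = {}
    for i in range(count):
        # Every entry is exactly 20 bytes long.
        entry = data[entries_at + 20 * i: entries_at + 20 * (i + 1)]
        offset, gen, kind = entry.split()
        if kind == b"n":  # in-use entry: the offset must land on 'N G obj'
            marker = ("%d %d obj" % (first + i, int(gen))).encode("ascii")
            assert data[int(offset):int(offset) + len(marker)] == marker
            found[first + i] = int(offset)
    return found


sample = (
    b"%PDF-1.5\n"
    b"%\xff\xff\xff\xff\n"  # placeholder bytes (assumption, see above)
    b"1 0 obj\n"
    b"<</Type /Catalog >>\n"
    b"endobj\n"
    b"xref\n"
    b"0 2\n"
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"trailer\n"
    b"<</Root 1 0 R /Size 2 >>\n"
    b"startxref\n"
    b"50\n"
    b"%%EOF\n"
)

offsets = check_xref(sample)
```

Under that assumption the arithmetic works out: the header is 9 bytes and the comment line 6, so the object starts at 15; the object itself is 35 bytes, so the table starts at 50.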

As you can see, it's only a matter of adding all the different PDF objects and linking them together from the Catalog. The library lets you add them in almost any order. Let’s try to follow the basic tree structure. To add a page, we first need a Pages dictionary.

pages = PDFDict()
pages['Type'] = PDFName('Pages')
doc += pages

Which should be linked from the Catalog.

catalog['Pages'] = PDFRef(pages)

Then a page.

#page
page = PDFDict()
page['Type'] = PDFName('Page')
page['MediaBox'] = PDFArray([0, 0, 612, 792])
doc += page

#add parent reference in page
page['Parent'] = PDFRef(pages)

Which should be linked from the pages dictionary.

pages['Kids'] = PDFArray([PDFRef(page)])
pages['Count'] = PDFNum(1)

Now we add some content to the page. This is called a content stream.

contents = PDFStream('''BT 
/F1 24 Tf 0 700 Td 
%s Tj 
ET
'''%PDFString(sys.argv[1]))
doc += contents
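For readers unfamiliar with content streams, here is a standalone sketch of what the operators mean and of how a stream sits in the file with its `/Length`. The helper names are invented for this illustration, and how minipdf's `PDFStream` and `PDFString` serialize their data is not covered here.

```python
def escape_pdf_string(text):
    """Backslash-escape the characters that are special inside ( ... )."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_content_stream(text, font="F1", size=24, x=0, y=700):
    """Build a content stream that shows one line of text."""
    lines = [
        "BT",                                  # begin a text object
        "/%s %d Tf" % (font, size),            # select a font resource and size
        "%d %d Td" % (x, y),                   # move to the text position
        "(%s) Tj" % escape_pdf_string(text),   # show the string
        "ET",                                  # end the text object
    ]
    return "\n".join(lines) + "\n"


def stream_object(data):
    """Wrap stream data in a stream dictionary carrying its byte length."""
    raw = data.encode("latin-1")
    head = ("<</Length %d >>\nstream\n" % len(raw)).encode("ascii")
    # The end-of-line before 'endstream' is not counted in /Length.
    return head + raw + b"\nendstream"


stream = text_content_stream("Hi (there)")
obj = stream_object(stream)
```

Escaping matters because the text comes from the command line: an unbalanced parenthesis in it would otherwise end the string early.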

The content stream is linked from the page.

page['Contents'] = PDFRef(contents)

Note that in the content stream we are referencing a font name /F1. We shall define this font.

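font = PDFDict()
# /Type /Font is listed as required for font dictionaries in the PDF spec
font['Type'] = PDFName('Font')
font['Name'] = PDFName('F1')
font['Subtype'] = PDFName('Type1')
font['BaseFont'] = PDFName('Helvetica')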

Associate each defined font with a name in a font map.

fontname = PDFDict()
fontname['F1'] = font

Then link the font map from the /Font field of the resources dictionary.

#resources
resources = PDFDict()
resources['Font'] = fontname
doc += resources

Then link the resources to its page under the Resources field.

page['Resources'] = PDFRef(resources)
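The name `/F1` in the content stream only means something because the page's resources define it. A quick way to catch a mismatch is sketched below in plain Python; `fonts_used` and `missing_fonts` are hypothetical helpers working on plain strings and dicts rather than on minipdf objects, and the regular expression only covers simple `Tf` usage like the one in this example.

```python
import re


def fonts_used(content_stream):
    """Return the font resource names selected with the Tf operator."""
    return set(re.findall(r"/(\S+)\s+[\d.]+\s+Tf", content_stream))


def missing_fonts(content_stream, resources):
    """Return the font names the stream uses but the resources do not define."""
    return fonts_used(content_stream) - set(resources.get("Font", {}))


stream = "BT\n/F1 24 Tf 0 700 Td\n(Hello) Tj\nET\n"
resources = {"Font": {"F1": {"Subtype": "Type1", "BaseFont": "Helvetica"}}}

unresolved = missing_fonts(stream, resources)  # empty when every font is defined
```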

We are done! Just print the resulting document.

print(doc)
