mfenniak / pypdf

Pure-Python PDF library. This repository is no longer maintained; please see https://github.com/knowah/PyPDF2/ instead.

Home Page: https://github.com/knowah/PyPDF2/

License: Other

Python 100.00%

pypdf's Introduction

Example:

    from pyPdf import PdfFileWriter, PdfFileReader

    output = PdfFileWriter()
    input1 = PdfFileReader(file("document1.pdf", "rb"))

    # add page 1 from input1 to output document, unchanged
    output.addPage(input1.getPage(0))

    # add page 2 from input1, but rotated clockwise 90 degrees
    output.addPage(input1.getPage(1).rotateClockwise(90))

    # add page 3 from input1, rotated the other way:
    output.addPage(input1.getPage(2).rotateCounterClockwise(90))
    # alt: output.addPage(input1.getPage(2).rotateClockwise(270))

    # add page 4 from input1, but first add a watermark from another pdf:
    page4 = input1.getPage(3)
    watermark = PdfFileReader(file("watermark.pdf", "rb"))
    page4.mergePage(watermark.getPage(0))
    output.addPage(page4)

    # add page 5 from input1, but crop it to half size:
    page5 = input1.getPage(4)
    page5.mediaBox.upperRight = (
        page5.mediaBox.getUpperRight_x() / 2,
        page5.mediaBox.getUpperRight_y() / 2
    )
    output.addPage(page5)

    # print how many pages input1 has:
    print "document1.pdf has %s pages." % input1.getNumPages()

    # finally, write "output" to document-output.pdf
    outputStream = file("document-output.pdf", "wb")
    output.write(outputStream)
    outputStream.close()

pypdf's Issues

Error 110 when open a pdf file

Use pyPdf to read and write the attached file.
Adobe Reader will then report error 110.

Possible freeze / bug when reading nameObject

Adding "" to delimiterCharacters fixes the issue (end of stream).

Found it when reading a portfolio.

Ability to embed Javascript

Seems like this fellow has added the ability to insert Javascript snippets to pyPdf:
http://blog.rsmoorthy.net/2012/01/add-javascript-to-existing-pdf-files.html

Suspect he used this code here:
http://blog.didierstevens.com/programs/pdf-tools/#make-pdf

Seems like a useful addition to pyPdf!

setup.py needs to be updated to 1.13

new release listed on home page, but setup.py still refers to 1.12

get page number outline references

I have been looking at the documentation and code for pyPdf and I cannot figure out how to go from the outline to the page it links to. Is there a way to iterate over the outline and get the page number it references so that it can be passed to the getPage() method? I am trying to split a large pdf file into smaller ones based on the outline.

Bug fix and improvement in RectangleObject getWidth() and getHeight()

In generic.py, line 727 currently reads:

def getHeight(self):
    return self.getUpperRight_y() - self.getLowerLeft_x()

It should read as follows (the final "x" changed to "y"):

def getHeight(self):
    return self.getUpperRight_y() - self.getLowerLeft_y()

The output of both getWidth() and getHeight() should also be wrapped in abs():

def getHeight(self):
    return abs(self.getUpperRight_y() - self.getLowerLeft_y())

We could also add properties to make the API more pleasant:

@property
def width(self):
    return self.getWidth()

@property
def height(self):
    return self.getHeight()
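
The three suggestions in this issue can be tried out together in a small stand-alone sketch. The `Rect` class below is a hypothetical stand-in written for illustration, not pyPdf's `RectangleObject`; only the method names are taken from the issue text, and the `[llx, lly, urx, ury]` coordinate layout is an assumption.

```python
class Rect(object):
    """Hypothetical stand-in for RectangleObject, stored as [llx, lly, urx, ury]."""

    def __init__(self, llx, lly, urx, ury):
        self.coords = [llx, lly, urx, ury]

    def getLowerLeft_x(self):
        return self.coords[0]

    def getLowerLeft_y(self):
        return self.coords[1]

    def getUpperRight_x(self):
        return self.coords[2]

    def getUpperRight_y(self):
        return self.coords[3]

    def getWidth(self):
        # abs() keeps the result positive even if the corners are swapped
        return abs(self.getUpperRight_x() - self.getLowerLeft_x())

    def getHeight(self):
        # uses the y coordinate of both corners (the fix proposed above)
        return abs(self.getUpperRight_y() - self.getLowerLeft_y())

    # read-only properties that simply delegate to the getters
    width = property(getWidth)
    height = property(getHeight)
```

With this sketch, `Rect(0, 100, 612, 892).height` and `Rect(612, 892, 0, 100).height` give the same value, which is the point of wrapping the difference in `abs()`.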

Infinite loop on empty input

Create an empty StringIO and call the pdf reader on it. It will loop in the readNextEndLine calls before the %%EOF check in read.
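
Until the reader itself is fixed, a caller could guard against this case before handing the stream to the reader. The helper below is a minimal sketch; `ensure_not_empty` is a hypothetical name, not part of pyPdf, and it assumes the stream is seekable.

```python
import io


def ensure_not_empty(stream):
    """Return the stream size in bytes, raising ValueError for an empty stream.

    The current position is restored before returning.
    """
    position = stream.tell()
    stream.seek(0, 2)          # 2 = seek relative to the end of the stream
    size = stream.tell()
    stream.seek(position, 0)   # put the stream back where it was
    if size == 0:
        raise ValueError("cannot read an empty PDF stream")
    return size


# Usage: check first, then pass the same stream on to the PDF reader.
ensure_not_empty(io.BytesIO(b"%PDF-1.4"))
```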

release new version for Python 2

even if there were not that many commits since 1.12, can you please release a new version? That'll make it much easier getting the updates in Fedora. Otherwise I have to work with a git checkout which I try to avoid.

pyPdf.pdf.PageObject.extractText() incorrectly concatenates words across line breaks

At least with the pdf I'm looking at, the TD operator is used to move from the end of one line to the start of another. This is ignored by extractText(), so if one line ends with the last letter of a word, and the next line begins with the first letter of a word, then these two characters are also immediately adjacent in the resulting text, producing a new "word" that is not present in the document.

A specific case I'm seeing is a line ending with "phase" is followed by a line beginning with "insufficiency", so what is included at that point in the resulting string is "phaseinsufficiency", a non-word that does not, in fact, occur in the document. I'm using the result in full text search, so this is problematic, in that a search for "phase" or for "insufficiency", or, in fact, for "phase insufficiency", will fail.

I have a patch (if needed) which adds "TD" to the operators extractText() processes, which checks to see if the y operand (operands[1]) is non-zero, whether text is non-empty, and whether text ends with a non-whitespace character. If all this is true, a newline gets appended to text. This works, and is sufficient to my needs.

Since this is a change in behavior, I have also added an argument to extractText() called split_on_y_change with a default value of False, making the default behavior the old behavior. One could do something similar for x changes and vertical languages, but I don't know enough about such languages to propose the details.
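
The rule described above can be illustrated on a toy model of a content stream. This is a sketch of the logic only, not the patch itself: the function name `join_text_operations` and the representation of operations as `(operands, operator)` pairs are assumptions made for the example; only `TD`, `operands[1]`, and `split_on_y_change` come from the description.

```python
def join_text_operations(operations, split_on_y_change=False):
    """Concatenate the text of Tj operations from (operands, operator) pairs.

    When split_on_y_change is true, a TD operation with a non-zero y operand
    appends a newline, provided text has been collected and it does not
    already end in whitespace.
    """
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            text += operands[0]
        elif operator == "TD" and split_on_y_change:
            y_moved = operands[1] != 0
            if y_moved and text and not text[-1].isspace():
                text += "\n"
    return text


# Toy stream: two lines of text separated by a move to the next line.
ops = [(["phase"], "Tj"), ([0, -12], "TD"), (["insufficiency"], "Tj")]
old_result = join_text_operations(ops)
new_result = join_text_operations(ops, split_on_y_change=True)
```

With the default argument the words run together as before; with `split_on_y_change=True` they are separated by a newline, so a full-text search for either word can succeed.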

Let me know if you want my patch, and whether you can accept a unified diff somewhere, or whether you need a pull request.

Bill

merging a cropped page reveals the region cropped away

Hi,
Thanks for the great function additions in 1.13, especially the "merge page" functions. I'm however having difficulty using them with cropped files.

Suppose I want to remove unwanted text on one side of a page, and so apply a crop:
page.mediaBox.upperRight = (page.mediaBox.upperRight[0]/2,
                            page.mediaBox.upperRight[1])

Now, let us merge this page onto a newly created blank page using:
newblank = PageObject.createBlankPage(None, 612, 792)
newblank.mergePage(page)

Unfortunately, in the merged page "newblank", the whole region cropped away from "page" is displayed again, i.e. the mergePage function does not honor the mediaBox/cropBox of the page being merged. Can this be fixed?

Thanks!
Soum

little bug in the ASCII85Decode class

Hi! I'm Biszak Előd, a Hungarian developer. I've been using pyPdf and found a bug in the decode function of the ASCII85Decode class. When c == 'z', the variable x is not incremented, so the function stays in an infinite loop.

elif c == 'z':
    assert len(group) == 0
    retval += '\x00\x00\x00\x00'
    continue

should be:

elif c == 'z':
    assert len(group) == 0
    retval += '\x00\x00\x00\x00'
    x += 1
    continue
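
One way to rule out this kind of bug is to iterate over the characters directly, so that there is no index to forget. The decoder below is an independent minimal sketch, not pyPdf's `ASCII85Decode`; the name `ascii85_decode` is hypothetical, and it assumes the input is a text string without a leading `<~` marker.

```python
def ascii85_decode(data):
    """Decode an ASCII85 string, treating 'z' as four zero bytes."""
    out = bytearray()
    group = []
    for c in data:                 # no manual index, so 'z' cannot stall the loop
        if c.isspace():
            continue
        if c == "~":               # start of the "~>" end-of-data marker
            break
        if c == "z":
            assert len(group) == 0
            out += b"\x00\x00\x00\x00"
            continue
        group.append(ord(c) - 33)
        if len(group) == 5:
            value = 0
            for digit in group:
                value = value * 85 + digit
            out += value.to_bytes(4, "big")
            group = []
    if group:
        # partial final group: pad with the highest digit, keep n - 1 bytes
        n = len(group)
        group += [84] * (5 - n)
        value = 0
        for digit in group:
            value = value * 85 + digit
        out += value.to_bytes(4, "big")[:n - 1]
    return bytes(out)
```

The sketch uses `int.to_bytes`, so it targets Python 3 rather than the Python 2 code quoted in the issue.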

Seemingly infinite loop in PdfFileReader().getPage().extractText() on certain files. Workaround included in post.

I don't know if anyone else has run into this, but extractText() seems to loop infinitely on certain files, and even then only on certain pages of those files. Even when left running over a 3-day weekend, it remains stuck. I've attached a short sample script illustrating a workaround for whoever comes after me in search of a solution. It runs each extraction in a multiprocessing Process and passes a timeout to join().

#this is a workaround for an infinite loop bug in pyPdf
from pyPdf import PdfFileReader
from multiprocessing import Process, Queue

def get_highest_page_number(pdf_path):
    pdf_handle = file(pdf_path, "rb")
    pdf_file = PdfFileReader(pdf_handle)
    if pdf_file.getIsEncrypted():
        pdf_file.decrypt("")
    highest_page_number = pdf_file.getNumPages()
    pdf_handle.close()
    return highest_page_number

def get_page_text(pdf_path, page, que):
    pdf_handle = file(pdf_path, "rb")
    pdf_file = PdfFileReader(pdf_handle)
    if pdf_file.getIsEncrypted():
        pdf_file.decrypt("")
    pdf_page = pdf_file.getPage(page)
    page_text = pdf_page.extractText()
    pdf_handle.close()
    que.put(page_text)

def read_pdf(pdf_path):
    pages_top_limit = get_highest_page_number(pdf_path)
    for page in range(0, pages_top_limit):
        page_text_que = Queue()
        page_text_process = Process(target = get_page_text, args = (pdf_path, page, page_text_que))
        page_text_process.start()
        page_text_process.join(10)
        if page_text_process.is_alive():
            page_text_process.terminate()
            raise RuntimeError
        else:
            page_text = page_text_que.get()

def main():
    pdf_path = "file.pdf"
    read_pdf(pdf_path)

if __name__ == "__main__":
    main()

I don't like having to re-open the handle for every page, but I really don't see another option at present.

Trailing spaces and NUL characters in PDF cause failure identifying EOF

I have a collection of PDFs that contain a line of NUL and space characters on the line after the %%EOF marker. The current technique for identifying the %%EOF fails on these PDFs because the 'while not line' check on line 704 of pdf.py (the start of the read() method on PdfFileReader) isn't sufficient to identify this line of NUL and spaces as something worth ignoring.
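
A tolerant check could discard trailing NUL and whitespace bytes before looking for the marker. The function below is a minimal sketch of that idea on an in-memory byte string; `last_meaningful_line` is a hypothetical helper, not the code in pdf.py, which reads backwards through a stream instead.

```python
def last_meaningful_line(data):
    """Return the last line of data after dropping trailing NUL/whitespace bytes."""
    trimmed = data.rstrip(b"\x00 \t\r\n\x0c")
    if not trimmed:
        return b""
    return trimmed.splitlines()[-1]


# A file tail like the ones described: %%EOF followed by a line of NULs and spaces.
tail = b"startxref\n1234\n%%EOF\n\x00\x00   \x00\n"
has_eof_marker = last_meaningful_line(tail) == b"%%EOF"
```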

More pythonic API

What about a PEP 8 compliant API (under_scores instead of camelCase, etc.)?

Fail to read a text object

readStringFromStream() fails to create a string object
when it is given a text object like the one below.

BT 1 0 0 1 0 1.9 Tm /F3+0 8.6 Tf 10.5 TL (\376\377 ) Tj T* ET

readStringFromStream() decodes (\376\377 ) to the string '\xfe\xff\x20'.

createStringObject() checks the first 2 bytes of the string
and attempts to decode it as UTF-16.
An exception is then raised because a lone '\x20' is not valid UTF-16.

Apparently, the text "\376\377" should not be treated as a BOM here.

The BOM check conforms to the "Text Strings" section of the PDF Reference,
but it should be applied only to items of the "text string" type specified in the PDF Reference.
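
The issue asks for the BOM check to be limited to "text string" items. A different and cruder mitigation, sketched below, is to keep the check but fall back to the raw bytes when UTF-16 decoding fails. `decode_pdf_string` is a hypothetical function written for this example, not pyPdf's `createStringObject()`, and it uses Python 3 bytes.

```python
def decode_pdf_string(raw):
    """Decode raw as UTF-16BE if it starts with a BOM and decodes cleanly.

    Otherwise the bytes are returned unchanged, so a content-stream string
    that merely happens to start with \\xfe\\xff no longer raises.
    """
    if raw.startswith(b"\xfe\xff"):
        try:
            return raw[2:].decode("utf-16-be")
        except UnicodeDecodeError:
            pass  # not really UTF-16: treat it as a plain byte string
    return raw


# The string from the text object above: BOM-like prefix plus one space byte.
result = decode_pdf_string(b"\xfe\xff\x20")
```

Here `result` is the original three bytes, because a single trailing byte cannot be decoded as UTF-16.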

Error for files with Layers

I get an error when reading a PDF with layers:
Traceback:
File "G:\python-education\pdfinfo.py", line 16, in <module>
print name, inFile.getNumPages()
File "build\bdist.win-amd64\egg\pyPdf\pdf.py", line 431, in getNumPages
File "build\bdist.win-amd64\egg\pyPdf\pdf.py", line 607, in _flatten
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 165, in getObject
File "build\bdist.win-amd64\egg\pyPdf\pdf.py", line 649, in getObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 67, in readObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 531, in readFromStream

File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 58, in readObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 153, in readFromStream

File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 67, in readObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 531, in readFromStream

File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 67, in readObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 531, in readFromStream

File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 58, in readObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 153, in readFromStream

File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 67, in readObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 531, in readFromStream

File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 52, in readObject
File "build\bdist.win-amd64\egg\pyPdf\generic.py", line 339, in readStringFromStream
pyPdf.utils.PdfReadError: Unexpected escaped string

If I merge the layers in that PDF, everything works.

Interface to set document meta data via PdfFileWriter

AFAIK there is no interface in PdfFileWriter to set the document meta data (like pdf title). This is a problem for example in the example on your home page (http://pybrary.net/pyPdf/) where the final pdf does not have a title anymore.

generic.NameObject causes infinite loops

I have a PDF document that seems to get stuck in an infinite loop in the "while True" clause of generic.NameObject.

I added the empty string "" to the tuple of NameObject.delimiterCharacters to fix this issue. Don't know if it's the right solution, but it seems to break the infinite loop perfectly.
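
The loop is easy to reproduce in isolation: at end of stream, `read(1)` returns an empty string, which is neither whitespace nor (without the fix) a delimiter, so nothing ever breaks the loop. The sketch below handles that case with an explicit check instead of adding "" to the delimiter tuple. `read_name` and `DELIMITERS` are hypothetical names for this example, and the code uses Python 3 bytes rather than pyPdf's own `NameObject.readFromStream`.

```python
import io

# PDF delimiter characters, as single-byte strings
DELIMITERS = (b"(", b")", b"<", b">", b"[", b"]", b"{", b"}", b"/", b"%")


def read_name(stream):
    """Read a PDF name such as /DeviceGray from a binary stream."""
    name = stream.read(1)
    if name != b"/":
        raise ValueError("name must start with '/'")
    while True:
        tok = stream.read(1)
        if tok == b"":
            # end of stream: stop here, there is nothing to seek back over
            break
        if tok.isspace() or tok in DELIMITERS:
            stream.seek(-1, 1)   # leave the delimiter for the next reader
            break
        name += tok
    return name


# A name that runs right up to the end of the stream no longer hangs.
name_at_eof = read_name(io.BytesIO(b"/DeviceGray"))
```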

problem in NameObject.readFromStream when stream.read(1) does not advance

I'm in way over my head here...kind of feel like the blind pig that found an acorn. Anyway, I'm trying to process a PDF that contains the following items:

10 0 obj
/DeviceGray
endobj

The problem is that when the line "/DeviceGray" is read, tok = stream.read(1) does not seem to advance the file pointer. (I checked by looking at the value of stream.tell() before and after the stream.read())

I don't know why the pointer does not get advanced, but making the code look like this fixes the problem, and things seem to move along just fine.

    while True:
        pre_read = stream.tell() # new
        tok = stream.read(1)
        if tok.isspace() or tok in NameObject.delimiterCharacters or stream.tell() == pre_read:
            stream.seek(-1, 1)
            break
        name += tok
    return NameObject(name)

I can provide a copy of the PDF to someone if they want an example. (Note to self: this is 98421_SupLegal 2008-02 Stmt_p83_r8.pdf)

Render PDF page to an image using PIL?

Is there any way to render a single PDF page to an image using PIL with pyPdf? Thanks

pypy compatibility

Hi,
Today I've installed pyPdf 1.13 for PyPy 1.6 using easy_install.
It doesn't work, but the bug fix is incredibly simple. Just change line 200 of pyPdf/generic.py.

original one:
int.__init__(value)

bug fix:
super(int, self).__init__(value)

Sorry for not directly contributing patch, but I'm new to github.
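
For background, an `int` subclass normally receives its value in `__new__`, because `int` instances are immutable. The sketch below shows that general Python pattern as an alternative to calling `int.__init__`; `NumberSketch` is a hypothetical class for illustration and is not pyPdf's `NumberObject` or the fix proposed above.

```python
class NumberSketch(int):
    """Hypothetical int subclass that sets its value in __new__."""

    def __new__(cls, value):
        # int is immutable, so the value has to be supplied at creation time
        return int.__new__(cls, int(value))

    def as_token(self):
        # text form of the number, e.g. for writing into a file
        return repr(int(self))


# Works for both numeric and string input, as a parser would supply.
number = NumberSketch("42")
```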

BTW, the error that I got was:

Traceback (most recent call last):
File "app_main.py", line 53, in run_toplevel
File "crack_passwd.py", line 11, in <module>
reader = PdfFileReader(file('ZAJECIA5-PRZYROWNANIE_SEKWENCJI.pdf', 'rb'))
File "/Users/tomek/pypy-1.6/site-packages/pyPdf/pdf.py", line 374, in __init__
self.read(stream)
File "/Users/tomek/pypy-1.6/site-packages/pyPdf/pdf.py", line 732, in read
num = readObject(stream, self)
File "/Users/tomek/pypy-1.6/site-packages/pyPdf/generic.py", line 87, in readObject
return NumberObject.readFromStream(stream)
File "/Users/tomek/pypy-1.6/site-packages/pyPdf/generic.py", line 236, in readFromStream
return NumberObject(name)
File "/Users/tomek/pypy-1.6/site-packages/pyPdf/generic.py", line 220, in __init__
int.__init__(value)

Now it's fixed!!!

Cheers,
paparazzo

internal links not preserved when encrypted

Steps to Duplicate:

Obtain a PDF with internal links (you click the link and it takes you to another page in the PDF)
Encrypt the PDF with the function below
Open the PDF and see that the links no longer work.


def encrypt(in_stream, out_stream, user_password, owner_password=None):
    """
    Encrypt an existing PDF file (stream)

    `in_stream`         stream with pdf data
                        open(filename, 'rb')
    `out_stream`        stream where output will be written
                        open(filename, 'wb')
    `user_password`     the password used for limited access
    `owner_password`    the password used for full access (defaults to user_password)

    I copied this from /sm/script/encryptPdf.py
    """
    reader = PdfFileReader(in_stream)
    writer = PdfFileWriter()
    for i in range(reader.getNumPages()):
        writer.addPage(reader.getPage(i))
    writer.encrypt(user_password, owner_password)
    writer.write(out_stream)

Microsoft Reporting Service workaround

hey folks :)

on some files generated by Microsoft Reporting Service i get one of the following errors using this script:

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("infile.pdf", "rb"))

output.addPage(input1.getPage(0))

outputStream = file("outfile.pdf", "wb")

output.write(outputStream)

Traceback (most recent call last):
File "/backup/print/municipality stara zagora/110228/Aitos_1/test.py", line 20, in
output.write(outputStream)
.....
File "/usr/local/lib/python2.6/site-packages/pyPdf/generic.py", line 232, in readFromStream
return NumberObject(name)
ValueError: invalid literal for int() with base 10: ''

or using another approach (loading pages in array and then saving them):

Traceback (most recent call last):
File "/backup/print/municipality stara zagora/110228/municipality stara zagora pdf combine 110228 start.py", line 60, in
outpdf.write(outfile)
.....
File "/usr/local/lib/python2.6/site-packages/pyPdf/pdf.py", line 545, in getObject
self.stream.seek(start, 0)
ValueError: I/O operation on closed file

where the file is (of course) not closed

i worked around it by re-saving the file with pdftk like this:

from pyPdf import PdfFileWriter, PdfFileReader

import shlex, subprocess
pdftkcommand = 'pdftk infile.pdf cat output fixed_infile.pdf'
args = shlex.split(pdftkcommand)
subprocess.call(args)

output = PdfFileWriter()
input1 = PdfFileReader(file("fixed_infile.pdf", "rb"))

output.addPage(input1.getPage(0))

outputStream = file("outfile.pdf", "wb")

output.write(outputStream)

but only when using the latest pdftk version (1.44; 1.41 produces a blank pdf) - i guess this is what the pdftk guys have fixed:
1.43 - September 30, 2010
Fixed a stream parsing bug that was causing page content to disappear after merge of PDFs generated by Microsoft Reporting Services PDF Rendering Extension 10.0.0.0.

unfortunately i can't provide the broken file as contents are confidential

hope this helps :)

georgi

py3: global name 'RectangleObject' is not defined

I tried the example from the README in Python 3.1. There are two issues:

file() should be replaced with open()
pdf.py needs RectangleObject, but it is not imported.

Here is a diff that solves the second issue:

```
--- ../../old_pdf.py/pdf.py 2009-10-15 10:56:54.000000000 +0200
+++ pdf.py 2010-05-12 18:19:45.000000000 +0200
@@ -47,7 +47,7 @@
 from .generic import (readObject, DictionaryObject, DecodedStreamObject,
    NameObject, NumberObject, ArrayObject, IndirectObject,
    ByteStringObject, StreamObject, NullObject, TextStringObject,
-   createStringObject, BooleanObject)
+   createStringObject, BooleanObject, RectangleObject)
 from .utils import (readNonWhitespace, readUntilWhitespace,
    ConvertFunctionsToVirtualList, PdfReadError, RC4_encrypt)
```

thanks so far,
david wiesner

pyPdf: parsing not robust to whitespace

Some pyPdf users have noticed problems with whitespace handling; see, for example, http://bugs.debian.org/563443. I (the pyPdf maintainer in Debian) am including the patch proposed in that bug, but clearly deeper recoding is needed.

getNumPages fails on encrypted PDF

I'm not an expert on the PDF file format, but I think a PDF file contains a "/Page" entry for each page in it, and this is visible even if the file is protected.

Also, the "/Type /Pages" entry gives a "/Count" of the number of pages in the document, and this is visible even in a protected file.

So why is the getNumPages method so complicated? What am I missing?

Issue about the _sweepIndirectReferences function

I think there's a small problem in the _sweepIndirectReferences function of the PdfFileWriter class. There's a list called self.stack where the indirect references we've already seen are stored. I suppose it is used so that we don't sweep the same indirect reference over and over again. However, once a reference has been swept, the function removes it from self.stack, and I don't see the point of that. If many objects reference the same object (for example, if we copy the Logical Structure of the pdf as well, many objects reference the same page object, which is quite expensive to sweep), keeping it in self.stack could mean a significant improvement in time.

if data.pdf == self:
    if data.idnum in self.stack:
        return data
    else:
        self.stack.append(data.idnum)
        realdata = self.getObject(data)
        self._sweepIndirectReferences(externMap, realdata)
        self.stack.pop()
        return data

I think it should be:

if data.pdf == self:
    if data.idnum in self.stack:
        return data
    else:
        self.stack.append(data.idnum)
        realdata = self.getObject(data)
        self._sweepIndirectReferences(externMap, realdata)
        return data
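
The cost difference between the two variants can be seen on a toy reference graph. The sketch below is a simplified model, not pyPdf's writer: `count_sweeps`, the dictionary-based graph, and the `forget_after_visit` flag are all assumptions made for the example. Whether keeping the ids is safe in the real writer depends on details of `_sweepIndirectReferences` that this model does not capture.

```python
def count_sweeps(graph, root, forget_after_visit):
    """Count how often each node is swept in a graph of id -> referenced ids.

    forget_after_visit=True mimics popping the id after the recursive call;
    False mimics leaving it in the list of already seen ids.
    """
    seen = []
    counts = {}

    def sweep(node):
        if node in seen:
            return
        seen.append(node)
        counts[node] = counts.get(node, 0) + 1
        for child in graph.get(node, ()):
            sweep(child)
        if forget_after_visit:
            seen.pop()   # removes this node again, so it can be re-swept later

    sweep(root)
    return counts


# Three objects that all reference the same (expensive) page object.
graph = {"root": ["a", "b", "c"], "a": ["page"], "b": ["page"], "c": ["page"], "page": []}
with_pop = count_sweeps(graph, "root", forget_after_visit=True)
without_pop = count_sweeps(graph, "root", forget_after_visit=False)
```

In this model the shared page is swept once per referencing object when ids are popped, and only once when they are kept.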

pdf package changed to all caps?

I needed to change pdf to all caps in __init__.py.

pyPdf.utils.PdfReadError: multiple definitions in dictionary

I have some code:

import pyPdf

def getPDFContent():
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(pathToPdf, 'rb'))
    # Iterate pages
    print pdf.documentInfo
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + " \n"
    # Collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

f = open(pathToTxt,'w+')
f.write(getPDFContent())
f.close()

where pathToPdf and pathToTxt are absolute paths to the files.
But I got this error:
Traceback (most recent call last):
File "C:/Users/will/Desktop/coding/mytest.py", line 21, in <module>
print pdf.getPage(14)
File "C:\Python\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
self._flatten()
File "C:\Python\lib\site-packages\pyPdf\pdf.py", line 607, in _flatten
self._flatten(page.getObject(), inherit, **addt)
File "C:\Python\lib\site-packages\pyPdf\generic.py", line 165, in getObject
return self.pdf.getObject(self).getObject()
File "C:\Python\lib\site-packages\pyPdf\pdf.py", line 649, in getObject
retval = readObject(self.stream, self)
File "C:\Python\lib\site-packages\pyPdf\generic.py", line 67, in readObject
return DictionaryObject.readFromStream(stream, pdf)
File "C:\Python\lib\site-packages\pyPdf\generic.py", line 531, in readFromStream
value = readObject(stream, pdf)
File "C:\Python\lib\site-packages\pyPdf\generic.py", line 67, in readObject
return DictionaryObject.readFromStream(stream, pdf)
File "C:\Python\lib\site-packages\pyPdf\generic.py", line 531, in readFromStream
value = readObject(stream, pdf)
File "C:\Python\lib\site-packages\pyPdf\generic.py", line 67, in readObject
return DictionaryObject.readFromStream(stream, pdf)
File "C:\Python\lib\site-packages\pyPdf\generic.py", line 534, in readFromStream
raise utils.PdfReadError, "multiple definitions in dictionary"
pyPdf.utils.PdfReadError: multiple definitions in dictionary

A much faster mergePage function

The mergePage function is slow. Needing more speed, I have written a modified version, mergePage3, which is much faster when you merge pages from the same file (up to 200x faster) and also faster when you merge pages from different files.
The basic idea: mergePage uses ContentStream to get the content of a page. But this class always runs the parseContentStream function, even when this is not needed, and this function is time consuming.
mergePage3 parses the content only when really needed. The result:

On a test file of 55 pages, if I put two pages on a sheet (booklet), it takes 34 seconds with mergePage and 0.4 seconds with mergePage3. (I consider here only the time needed for mergePage, not the generation of the output file.)

If you are interested, I can share the code.
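
Since the mergePage3 code is not included in the issue, the following is only a generic sketch of the "parse only when needed" idea it describes. `LazyContent` and its `parser` argument are hypothetical and unrelated to pyPdf's actual classes.

```python
class LazyContent(object):
    """Hold raw content bytes and parse them only on first access."""

    def __init__(self, raw, parser):
        self._raw = raw
        self._parser = parser      # callable that turns raw bytes into operations
        self._operations = None
        self.parse_count = 0       # for illustration: how often the parser ran

    def raw_bytes(self):
        # cheap path: callers that only copy the content never trigger parsing
        return self._raw

    @property
    def operations(self):
        if self._operations is None:
            self._operations = self._parser(self._raw)
            self.parse_count += 1
        return self._operations


# Toy parser: split the content on whitespace.
content = LazyContent(b"q 1 0 0 1 0 0 cm Q", lambda raw: raw.split())
copied = content.raw_bytes()           # no parsing happens here
parses_after_copy = content.parse_count
first_operator = content.operations[0]  # parsing happens once, on demand
```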

RectangleObject import missing & file handling workaround

I use pyPdf with Python 3.2 on my Windows machine and ran into some errors that I was able to resolve:

The first one was opening a PDF file with PdfFileReader. I used this code:

 file = open("PATH_WITH_FILE_AND_EXTENSION", "rb")
 doc = PdfFileReader(file)

The second thing I discovered was that pdf.py is missing the RectangleObject import from generic.py. So just add it.

Unsupported filter /LZWDecode

I got the above Error while trying to extractText() by iterating through the pages in a PDF document created with Acrobat Distiller.

Traceback:
Original Traceback (most recent call last):
File "/Users/ulo/.virtualenvs/zrbackend/src/django/django/template/debug.py", line 71, in render_node
result = node.render(context)
File "/Users/ulo/.virtualenvs/zrbackend/src/django/django/template/defaulttags.py", line 155, in render
nodelist.append(node.render(context))
File "/Users/ulo/.virtualenvs/zrbackend/src/django/django/template/debug.py", line 87, in render
output = force_unicode(self.filter_expression.resolve(context))
File "/Users/ulo/.virtualenvs/zrbackend/src/django/django/template/__init__.py", line 546, in resolve
obj = self.var.resolve(context)
File "/Users/ulo/.virtualenvs/zrbackend/src/django/django/template/__init__.py", line 687, in resolve
value = self._resolve_lookup(context)
File "/Users/ulo/.virtualenvs/zrbackend/src/django/django/template/__init__.py", line 722, in _resolve_lookup
current = current()
File "/Users/ulo/.virtualenvs/zrbackend/lib/python2.5/site-packages/pyPdf/pdf.py", line 1035, in extractText
content = ContentStream(content, self.pdf)
File "/Users/ulo/.virtualenvs/zrbackend/lib/python2.5/site-packages/pyPdf/pdf.py", line 1117, in __init__
stream = StringIO(stream.getData())
File "/Users/ulo/.virtualenvs/zrbackend/lib/python2.5/site-packages/pyPdf/generic.py", line 636, in getData
decoded._data = filters.decodeStreamData(self)
File "/Users/ulo/.virtualenvs/zrbackend/lib/python2.5/site-packages/pyPdf/filters.py", line 237, in decodeStreamData
raise NotImplementedError("unsupported filter %s" % filterType)
NotImplementedError: unsupported filter /LZWDecode

PdfFileWriter.write() won't return

Example code:

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("1.pdf", "rb"))

for i in range(input1.getNumPages()):
    output.addPage(input1.getPage(1))

outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()

Example pdf file can be found here (it's a paper named Sequential hashing: A flexible approach for unveiling significant patterns in high speed networks).

Long loop when reading a special pdf

I have attached an invalid pdf file.
When I tried to open that file,
reading it took a very long time (more than 1 hour).
This seems to be a bug in PdfFileReader.
Here is my test to reproduce the bug with pypdf2==1.26.0 and Python 3.6:

from PyPDF2 import PdfFileReader
f = open('file1.pdf', 'rb')
p = PdfFileReader(f) # this line takes a very long time

file1.pdf

Proposing a PR to fix a few small typos

Issue Type

[x] Bug (Typo)

Steps to Replicate and Expected Behaviour

Examine pyPdf/xmp.py and observe signifigance, however expect to see significance.
Examine pyPdf/pdf.py and observe preceeding, however expect to see preceding.
Examine pyPdf/pdf.py and observe optionnal, however expect to see optional.
Examine pyPdf/xmp.py and observe keywrods, however expect to see keywords.
Examine pyPdf/pdf.py and observe matricies, however expect to see matrices.
Examine pyPdf/pdf.py and observe heigth, however expect to see height.
Examine pyPdf/pdf.py and observe enviroment, however expect to see environment.
Examine pyPdf/pdf.py and observe dimentions, however expect to see dimensions.
Examine pyPdf/pdf.py and observe dictionnary, however expect to see dictionary.

Notes

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR. Alternatively if the fix is undesired please
close the issue with a small comment about the reasoning.

https://github.com/timgates42/pyPdf/pull/new/bugfix_typos

Thanks.
