rsc / pdf

PDF reader

License: BSD 3-Clause "New" or "Revised" License

Go 100.00%

pdf's Introduction

go get rsc.io/pdf

http://godoc.org/rsc.io/pdf

pdf's Issues

Skipping spaces?

Correct me if I am using this library incorrectly, but the text (string) output I get for a PDF page does not include any space characters.

// ParsePDF concatenates the text chunks of every page in the file.
func ParsePDF() (text string) {
	fileName := "./someFolder/testPDF.pdf"
	reader, err := pdf.Open(fileName)
	if err != nil {
		// log the error
		return "" // reader is unusable when Open fails
	}
	// Pages are 1-indexed; stop at the first null page.
	for pageNum := 1; ; pageNum++ {
		page := reader.Page(pageNum)
		if page.V.IsNull() {
			break
		}
		for _, v := range page.Content().Text {
			text += v.S
		}
	}
	return text
}

When I call this function, which wraps the library's code, the result contains the correct text but no space characters. I believe this is related to: https://github.com/rsc/pdf/blob/master/page.go#L422

Is there a particular reason spaces are being ignored? Am I just using the library incorrectly?
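
One possible workaround, sketched here under assumptions rather than taken from the library: if each text chunk carries a position and width (the X, Y, W, and S fields used elsewhere on this page), word gaps can be guessed from the horizontal distance between consecutive chunks. The `glyph` type, the `joinWithSpaces` name, and both thresholds are hypothetical and would need tuning per document.

```go
package main

import (
	"math"
	"strings"
)

// glyph is a hypothetical stand-in for a positioned text chunk. It mirrors
// only the X, Y, W, and S fields so the sketch compiles on its own.
type glyph struct {
	X, Y, W float64
	S       string
}

// joinWithSpaces concatenates glyphs in the order given. It writes a newline
// when the baseline (Y) moves by more than lineTol, and a space when the
// horizontal gap to the previous glyph is wider than gap.
func joinWithSpaces(glyphs []glyph, gap, lineTol float64) string {
	var b strings.Builder
	for i, g := range glyphs {
		if i > 0 {
			prev := glyphs[i-1]
			switch {
			case math.Abs(g.Y-prev.Y) > lineTol:
				b.WriteByte('\n')
			case g.X-(prev.X+prev.W) > gap:
				b.WriteByte(' ')
			}
		}
		b.WriteString(g.S)
	}
	return b.String()
}
```

The sketch assumes the chunks arrive in reading order. It will not help for fonts where every character reports the same coordinates.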

big-endian UCS-2 decoder not implemented yet?

Is there a workaround for the missing big-endian UCS-2 decoder?
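
For reference, big-endian UCS-2 can be decoded with the Go standard library alone, since UTF-16 is a superset of UCS-2. The sketch below shows only the decoding step. The function name `decodeUTF16BE` is made up for illustration, and where such a decoder would hook into rsc.io/pdf is not covered here.

```go
package main

import (
	"encoding/binary"
	"unicode/utf16"
)

// decodeUTF16BE converts big-endian UTF-16 bytes to a Go string.
// A leading byte-order mark (0xFE 0xFF) is dropped and a trailing
// odd byte is ignored.
func decodeUTF16BE(b []byte) string {
	if len(b) >= 2 && b[0] == 0xFE && b[1] == 0xFF {
		b = b[2:]
	}
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.BigEndian.Uint16(b[i:]))
	}
	return string(utf16.Decode(units))
}
```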

new pdf version with new encryption?

panic: unsupported PDF: encryption version V=4

Decryption with PKCS

I came across a PDF with PKCS protection, which I had to decrypt with a PFX certificate. I hope you can add a feature to decrypt such files.

valid PDF file getting "not a pdf file missing %%EOF" while reading

I have a valid PDF that I can open, and I can even parse it with Node.js. When I try to read it with this library, I get this:

not a PDF file: missing %%EOF

Any idea?
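
A quick way to narrow this down is to check where the `%%EOF` marker sits relative to the end of the file, since other readers may tolerate trailing bytes that a strict parser rejects. The helper below is a diagnostic sketch only. The name `trailingAfterEOF` and the idea of a search window are assumptions, not the library's actual check.

```go
package main

import "bytes"

// trailingAfterEOF looks for the last "%%EOF" marker within the final
// window bytes of data. It returns the number of bytes that follow the
// marker, or -1 if the marker does not appear in that window.
func trailingAfterEOF(data []byte, window int) int {
	tail := data
	if len(tail) > window {
		tail = tail[len(tail)-window:]
	}
	marker := []byte("%%EOF")
	i := bytes.LastIndex(tail, marker)
	if i < 0 {
		return -1
	}
	return len(tail) - (i + len(marker))
}
```

A result well above zero suggests the file has extra data after the marker. A result of -1 with a large window suggests the file is truncated or ends differently.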

Text position for certain font might not work

Documenting the issue at least, so that people with similar goals know it exists.
When text in certain fonts is decoded, it is analyzed character by character, but all of those characters end up with the same position coordinates. See screenshot below.
I might dig into it and try to fix it. We'll see.

panic on some PDFs + suspect memory leak

I have the following Go program that uses this library:

package main

import (
	"fmt"
	"os"
	"strconv"
	"rsc.io/pdf"
)

func main() {
	if len(os.Args) < 2 || os.Args[1] == "-h" || os.Args[1] == "--help" {
		fmt.Println("usage: pdfpage file.pdf [pnum]")
		os.Exit(1)
	}
	reader, err := pdf.Open(os.Args[1])
	if err != nil {
		fmt.Println(err)
		os.Exit(2)
	}
	if len(os.Args) == 3 {
		var pnum int
		var err error
		if pnum, err = strconv.Atoi(os.Args[2]); err != nil {
			pnum = 1
		}
		fmt.Printf("PAGE %d\n", pnum)
		printPage(reader, pnum)
	} else {
		for pnum := 1; pnum <= reader.NumPage(); pnum++ {
			fmt.Printf("PAGE %d\n", pnum)
			printPage(reader, pnum)
			fmt.Println("")
		}
	}
}

func printPage(reader *pdf.Reader, pnum int) {
	page := reader.Page(pnum)
	if page.V.IsNull() {
		fmt.Printf("failed to read page %d\n", pnum)
		os.Exit(3)
	}
	for _, chunk := range page.Content().Text {
		fmt.Printf("x=%06.2f y=%06.2f w=%06.2f %q %s %.1fpt\n",
			chunk.X, chunk.Y, chunk.W, chunk.S, chunk.Font,
			chunk.FontSize)
	}
}

This builds and runs fine, and for many PDFs it gives the expected output (although it is rather slow).
However, I have a few PDFs that produce a panic:

PAGE 1
panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
rsc.io/pdf.(*buffer).errorf(0xc4200d3948, 0x507f70, 0x27, 0xc4200d36d0, 0x2, 0x2)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:82 +0x74
rsc.io/pdf.(*buffer).reload(0xc4200d3948, 0x8)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:95 +0x193
rsc.io/pdf.(*buffer).readByte(0xc4200d3948, 0x599da0)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:71 +0x69
rsc.io/pdf.(*buffer).readToken(0xc4200d3948, 0xc42000aca0, 0x1000)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:135 +0x4a
rsc.io/pdf.Interpret(0xc42006e060, 0x37, 0x4d78a0, 0xc42000ab60, 0xc4200d3b08)
	/home/mark/app/go/src/rsc.io/pdf/ps.go:64 +0x1c6
rsc.io/pdf.Page.Content(0xc42006e060, 0x37, 0x4db2e0, 0xc420014810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/mark/app/go/src/rsc.io/pdf/page.go:613 +0x326
main.printPage(0xc42006e060, 0x1)
	/home/mark/app/go/src/pdfpage2/main.go:47 +0xa8
main.main()
	/home/mark/app/go/src/pdfpage2/main.go:35 +0x25d

I also have a 647 page PDF for which the program outputs the first 22 pages, then outputs PAGE 23 and then just sits there eating memory and using ~25% CPU. That particular page has some Japanese characters but I don't know if they are Unicode text or paths.
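
Until the panic itself is addressed, a caller can keep one bad page from ending the whole run by converting the panic into an error with `recover`. This is a general Go pattern sketched with a hypothetical helper named `safely`. It does nothing for the hang on page 23, which never panics.

```go
package main

import "fmt"

// safely runs fn and turns any panic raised inside it into an error,
// so the caller can report the failure and move on.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}
```

In the program above, the call could become `err := safely(func() { printPage(reader, pnum) })`, with a non-nil error logged before continuing with the next page.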

The text from Page.Content() doesn't contain any space characters

From the code at https://github.com/rsc/pdf/blob/master/page.go#L421

I see that you skip all space characters. Why is that? I may not have understood the idea, but when reading all the text from a PDF, the content is unusable without spaces and the result is not very useful.

go get error: x509: certificate signed by unknown authority

Hi,

It seems rsc.io/pdf cannot be retrieved with go get (short of using the -insecure flag):

$ go get -v -u rsc.io/pdf
Fetching https://rsc.io/pdf?go-get=1
https fetch failed: Get https://rsc.io/pdf?go-get=1: x509: certificate signed by unknown authority
package rsc.io/pdf: unrecognized import path "rsc.io/pdf" (https fetch: Get https://rsc.io/pdf?go-get=1: x509: certificate signed by unknown authority)

could this be fixed?

-s

Reader.Page() - Is it 0-indexed or 1-indexed?

https://github.com/rsc/pdf/blob/master/page.go#L22

The GoDoc for this function says that it's 1-indexed, but the comment on L22 says it's 0-indexed. When calling the method as such:

r.Page(0)

we land in an infinite loop, because the initial num-- adjustment puts num at -1, and therefore we never find a page. Maybe an error condition should be returned in case a 0 is passed as an argument?
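
Until the library rejects such input itself, callers can validate the page number before calling `Page`. The sketch below assumes 1-indexed pages, as the GoDoc states, and a total taken from something like `reader.NumPage()`. The helper name `checkPageNum` is made up for illustration.

```go
package main

import "fmt"

// checkPageNum returns an error unless num is a valid 1-indexed page
// number for a document with total pages.
func checkPageNum(num, total int) error {
	if num < 1 || num > total {
		return fmt.Errorf("page %d out of range [1, %d]", num, total)
	}
	return nil
}
```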

Can I convert PDF to JPG?

Can't read Chinese characters

As the title says, I can't use this library to read PDFs that contain Chinese characters.

Plans for the future

Hi @rsc. First off: Cool library, thanks for making it!

Do you have any plans to support more PDF versions, or improve on any of the bugs listed on godoc? I would love to have a proper library in Go for parsing PDF documents instead of having to rely on Python's PDFMiner.

I would also love to help out, but I am not sure where to start. I do not know anything about the black magic that seems to be the inner workings of PDF documents.

How to avoid loops when traversing the graph?

Greetings,

Given a PDF file that has a loop (e.g., pages found from "Kids" have entries for "Parent" pointing back), how do I traverse the graph without getting stuck?

I tried saving the Values that I have visited, but they are not == to the new Values.
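
The usual way around this is to key the visited set on a stable identifier instead of comparing values directly. The sketch below shows the pattern on a toy graph. The `node` type and its `ID` field are hypothetical, and whether rsc.io/pdf exposes anything usable as such a key is not established here.

```go
package main

// node is a hypothetical stand-in for a PDF object. ID plays the role of
// a stable object identifier; Kids and Parent are references to others.
type node struct {
	ID     int
	Kids   []*node
	Parent *node
}

// walk visits every node reachable from root exactly once, following both
// Kids and Parent links, and returns the IDs in visit order.
func walk(root *node) []int {
	seen := map[int]bool{}
	var order []int
	var visit func(n *node)
	visit = func(n *node) {
		if n == nil || seen[n.ID] {
			return
		}
		seen[n.ID] = true
		order = append(order, n.ID)
		for _, k := range n.Kids {
			visit(k)
		}
		visit(n.Parent)
	}
	visit(root)
	return order
}
```

If no stable key is available, a simpler alternative is to follow only "Kids" and never "Parent" when walking the page tree, optionally with a depth limit as a safety net.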

trouble decoding PDF with space at end of line

I received a PDF from clippercard.com that has "%PDF-1.3 " as the first line - note the space character after the "3". This causes rsc.io/pdf to fail to parse the PDF file. The PDF file in question has a number of space characters at the end of lines which cause the PDF library to alternately return these errors, depending on which space characters you fix:

not a PDF file: invalid header
malformed PDF: cross-reference table not found: ref
malformed PDF file: missing final startxref

Chrome, Mozilla Firefox and Preview.app have no problem displaying the PDF in question.
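
A more tolerant header check could trim trailing whitespace before comparing, which is roughly what the more lenient viewers appear to do. The sketch below only illustrates that idea for the first line. The function name `looksLikePDFHeader` is hypothetical, and this is not the library's implementation.

```go
package main

import "bytes"

// looksLikePDFHeader reports whether line is a "%PDF-M.m" header,
// ignoring trailing spaces, tabs, and line terminators.
func looksLikePDFHeader(line []byte) bool {
	line = bytes.TrimRight(line, " \t\r\n")
	const prefix = "%PDF-"
	if len(line) != len(prefix)+3 || !bytes.HasPrefix(line, []byte(prefix)) {
		return false
	}
	v := line[len(prefix):]
	return isDigit(v[0]) && v[1] == '.' && isDigit(v[2])
}

// isDigit reports whether c is an ASCII decimal digit.
func isDigit(c byte) bool { return c >= '0' && c <= '9' }
```

The same trimming would have to be applied wherever the parser matches the `xref` and `startxref` keywords for the other two errors to go away.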