This issue is a master issue/epic and can lead to subissues that will be referenced from here.
Proposal
The extractor package will have the capability to extract vectorized text and objects (with position and dimensions).
Goal: Extract a list of graphics objects from each PDF page.
There are three types of graphics objects:
- text
- path (a PDF path that has been stroked or filled)
- image
Each of these objects has a
- bounding box in device coordinates
- color
- rendering mode (fill, stroke, clip or some combination of these)
- content (e.g. text)
- optionally other properties
- transparency?
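As a rough sketch of how the shared properties listed above might be held in Go, the block below uses a bit set for the rendering mode so that combinations such as fill plus stroke can be expressed. All names here (`RenderMode`, `BBox`, `Kind`, `Common`) are hypothetical and not part of the extractor package, and the RGB array is only a placeholder for whatever color type is eventually chosen.

```go
package main

import "fmt"

// RenderMode is a bit set so that fill, stroke and clip can be combined.
// Hypothetical type, for illustration only.
type RenderMode uint8

const (
	RenderFill RenderMode = 1 << iota
	RenderStroke
	RenderClip
)

// Has reports whether every bit in want is set in m.
func (m RenderMode) Has(want RenderMode) bool { return m&want == want }

// BBox is an axis-aligned rectangle in device coordinates.
type BBox struct {
	Llx, Lly, Urx, Ury float64
}

// Kind identifies which of the three graphics object types a value describes.
type Kind int

const (
	KindText Kind = iota
	KindPath
	KindImage
)

// Common holds the properties shared by text, path and image objects.
type Common struct {
	Kind  Kind
	BBox  BBox
	Color [3]float64 // Placeholder RGB; an assumption, not the proposed color type.
	Mode  RenderMode
}

func main() {
	obj := Common{
		Kind:  KindPath,
		BBox:  BBox{Llx: 10, Lly: 20, Urx: 110, Ury: 70},
		Color: [3]float64{1, 0, 0},
		Mode:  RenderFill | RenderStroke,
	}
	fmt.Println(obj.Mode.Has(RenderFill), obj.Mode.Has(RenderClip)) // true false
}
```

Content and the optional properties are left out on purpose, since they differ per object type.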
This is not a rendering system but we hope to design it in a way that will allow it to be extended to become a renderer. Initial versions of the renderer could convert the lists of graphics objects to PDF or PostScript pages. This would provide closed-loop tests.
Definitions
- text: Text objects and operators. The text operators specify the glyphs to be painted, represented by string objects whose values shall be interpreted as sequences of character codes. A text object encloses a sequence of text operators and associated parameters. (page 237)
- Paragraph fragments are the largest substrings in text paragraphs that are rendered contiguously on a PDF page. If a paragraph is split between pages or columns then the parts of the paragraph that appear at the end of the first page / column and the start of the second page / column are paragraph fragments. When a paragraph fits entirely within a single column and page, the entire paragraph is a paragraph fragment.
There are at least three levels of text objects. Each level above the first is composed of objects from the level below it (the lower-numbered level in the following list).
1. Text elements emitted by the renderer as a result of PDF text operators like Tj.
a. A text element’s properties include the text content, location and size in device coordinates, font, etc.
b. Text elements can be used to recreate the text as it appears on the page.
2. Paragraph fragments are created from the text elements on a page. Each paragraph fragment occupies a contiguous region on a single page.
a. Paragraph fragments include the start of a paragraph that is completed on the following page / column, captions, form field labels, footnotes, etc.
b. The paragraph fragments in a page can be used to make inferences about the page.
3. Paragraphs are created from the paragraph fragments.
a. Paragraphs can be used to extract the text of a PDF in plain text format.
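One possible first step from text elements towards paragraph fragments is to group elements that share a baseline into lines. The sketch below is only an illustration of that idea: `TextElement`, `groupIntoLines` and the tolerance parameter are hypothetical, and it assumes a coordinate system in which y increases up the page.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// TextElement is a simplified, hypothetical text element: content plus
// the position of its baseline origin.
type TextElement struct {
	Text string
	X, Y float64
}

// groupIntoLines groups elements whose baselines lie within tol of the
// first element of a line, then orders each line left to right.
// Lines are returned top to bottom (assumes y increases upward).
func groupIntoLines(elems []TextElement, tol float64) []string {
	sorted := append([]TextElement(nil), elems...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines [][]TextElement
	for _, e := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].Y-e.Y) <= tol {
			lines[n-1] = append(lines[n-1], e)
			continue
		}
		lines = append(lines, []TextElement{e})
	}

	out := make([]string, 0, len(lines))
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
		words := make([]string, len(line))
		for i, e := range line {
			words[i] = e.Text
		}
		out = append(out, strings.Join(words, " "))
	}
	return out
}

func main() {
	elems := []TextElement{
		{Text: "world", X: 60, Y: 700},
		{Text: "Hello", X: 10, Y: 700.5},
		{Text: "Second", X: 10, Y: 686},
	}
	for _, line := range groupIntoLines(elems, 2) {
		fmt.Println(line)
	}
}
```

Joining with a single space discards the original spacing, so a real implementation would probably keep the elements themselves rather than only the joined string.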
- path: A path is made up of one or more disconnected subpaths, each comprising a sequence of connected segments. (page 131)
Initially we will only concern ourselves with stroked and filled paths and ignore clipping paths.
// Path can define shapes, trajectories and regions of all sorts. Used to draw lines and define shapes of filled areas.
type Path struct {
	segments []lineSegment
}
// Only export if deemed necessary for outside access.
// For connected subpaths (segments), the x1, y1 coordinate will start at x2, y2 coordinate of the previous segment.
type lineSegment struct {
	isCurved    bool // Bezier curve if true, otherwise line
	x1, y1      float64
	x2, y2      float64
	cx, cy      float64 // Control point (if curved)
	isNoop      bool    // Path ended without filling/stroking.
	isStroked   bool
	strokeColor model.PdfColor
	isFilled    bool
	fillColor   model.PdfColor
	fillRule    windingNumberRule
}
type windingNumberRule int
const (
	nonZeroWindingNumberRule windingNumberRule = iota
	evenOddWindingNumberRule
)
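Since every graphics object is meant to carry a bounding box, here is a minimal sketch of deriving one from a path's segments. It uses a trimmed-down copy of `lineSegment` (only the geometry fields) and hypothetical `bbox` and `pathBounds` names. It relies on the fact that a Bezier curve lies within the convex hull of its end points and control points, so including the control point gives a conservative box rather than a tight one. No transformation to device coordinates is applied here.

```go
package main

import (
	"fmt"
	"math"
)

// lineSegment mirrors only the geometry fields of the proposed struct.
type lineSegment struct {
	isCurved bool
	x1, y1   float64
	x2, y2   float64
	cx, cy   float64 // Control point (if curved)
}

// bbox is a hypothetical axis-aligned bounding box.
type bbox struct {
	llx, lly, urx, ury float64
}

// pathBounds returns a box that encloses all segments. The second result
// is false when there are no segments. For curved segments the control
// point is included, which may overestimate the true extent of the curve.
func pathBounds(segments []lineSegment) (bbox, bool) {
	if len(segments) == 0 {
		return bbox{}, false
	}
	b := bbox{llx: math.Inf(1), lly: math.Inf(1), urx: math.Inf(-1), ury: math.Inf(-1)}
	add := func(x, y float64) {
		b.llx = math.Min(b.llx, x)
		b.lly = math.Min(b.lly, y)
		b.urx = math.Max(b.urx, x)
		b.ury = math.Max(b.ury, y)
	}
	for _, s := range segments {
		add(s.x1, s.y1)
		add(s.x2, s.y2)
		if s.isCurved {
			add(s.cx, s.cy)
		}
	}
	return b, true
}

func main() {
	segs := []lineSegment{
		{x1: 0, y1: 0, x2: 10, y2: 0},
		{isCurved: true, x1: 10, y1: 0, x2: 10, y2: 10, cx: 15, cy: 5},
	}
	b, ok := pathBounds(segs)
	fmt.Println(b, ok)
}
```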
- image: A sampled image (or just image for short) is a rectangular array of sample values, each representing a colour. (page 203)
This should include inline images, XObject images, possibly some shadings etc. UniDoc already has a pretty good framework for this.
API ideas
func (e *Extractor) GraphicsObjects() []GraphicsObject
type GraphicsObject interface {
// What do graphics objects have in common, or what common operations can be applied to them?
// Possibly make into a struct rather than an interface and convert to an interface if we think it makes sense.
}
- Rendering Interface Ideas
Renderers may need access to the graphics context to render each graphics object.
Imagine a callback to emit graphics objects to a renderer (or other caller).
func render(o GraphicsObject, gs GraphicsState)
The rendering would be over all graphics objects on a page in the order they occur. This would be driven by a single processor.AddHandler() that could be configured to emit any combination of text, shape, and image objects.
func renderCore(doText, doShapes, doImages bool, render Renderer)
or rendering context/state rather than doX...
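To make the callback idea concrete, the following is a minimal sketch under several assumptions: `GraphicsObject` is a struct rather than an interface, `GraphicsState` is a placeholder, the `doX` flags are bundled into a hypothetical `RenderOptions` struct, and a plain slice stands in for the content stream processor that would actually drive the calls.

```go
package main

import "fmt"

// ObjectKind distinguishes the three kinds of graphics objects.
type ObjectKind int

const (
	TextObject ObjectKind = iota
	ShapeObject
	ImageObject
)

// GraphicsObject and GraphicsState are simplified stand-ins.
type GraphicsObject struct {
	Kind ObjectKind
	Desc string
}

type GraphicsState struct {
	CTM [6]float64 // Current transformation matrix (placeholder).
}

// Renderer is the callback that receives each emitted object.
type Renderer func(o GraphicsObject, gs GraphicsState)

// RenderOptions selects which kinds of objects are emitted.
type RenderOptions struct {
	Text, Shapes, Images bool
}

// renderCore emits the selected objects to render in page order.
func renderCore(objs []GraphicsObject, gs GraphicsState, opts RenderOptions, render Renderer) {
	for _, o := range objs {
		switch o.Kind {
		case TextObject:
			if !opts.Text {
				continue
			}
		case ShapeObject:
			if !opts.Shapes {
				continue
			}
		case ImageObject:
			if !opts.Images {
				continue
			}
		}
		render(o, gs)
	}
}

func main() {
	page := []GraphicsObject{
		{Kind: TextObject, Desc: "heading"},
		{Kind: ShapeObject, Desc: "rule"},
		{Kind: ImageObject, Desc: "logo"},
	}
	gs := GraphicsState{CTM: [6]float64{1, 0, 0, 1, 0, 0}}
	renderCore(page, gs, RenderOptions{Text: true, Images: true}, func(o GraphicsObject, _ GraphicsState) {
		fmt.Println(o.Desc)
	})
}
```

In a real implementation the graphics state would change between objects, so it would be passed per object by the processor rather than once per page as here.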
Use cases
Potential use cases that should be possible to base on this implementation:
- Find text/shapes/images within a specified area.
- Remove/redact text/shapes/images within a specified area.
- Characterize headings, normal text.
- Detect tables and inner contents.
- Detect mathematical formulas.
- PDF to Markdown conversion: Requires basic heading detection, text style, tables.
- PDF to Word/Excel: Requires advanced detection of detailed features to reproduce in oxml.
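The first two use cases reduce to a rectangle test against each object's bounding box. A minimal sketch, with hypothetical `Rect`, `Object` and `findInArea` names, and a flag to choose between "touches the area" and "lies entirely inside the area":

```go
package main

import "fmt"

// Rect is an axis-aligned rectangle in device coordinates.
type Rect struct {
	Llx, Lly, Urx, Ury float64
}

// Intersects reports whether r and o overlap with non-zero area.
func (r Rect) Intersects(o Rect) bool {
	return r.Llx < o.Urx && o.Llx < r.Urx && r.Lly < o.Ury && o.Lly < r.Ury
}

// Contains reports whether o lies entirely within r.
func (r Rect) Contains(o Rect) bool {
	return o.Llx >= r.Llx && o.Lly >= r.Lly && o.Urx <= r.Urx && o.Ury <= r.Ury
}

// Object is a stand-in for any graphics object with a bounding box.
type Object struct {
	Name string
	BBox Rect
}

// findInArea returns the objects that intersect area, or only those fully
// inside it when fullyInside is true. Input order is preserved.
func findInArea(objs []Object, area Rect, fullyInside bool) []Object {
	var found []Object
	for _, o := range objs {
		if fullyInside && area.Contains(o.BBox) {
			found = append(found, o)
		} else if !fullyInside && area.Intersects(o.BBox) {
			found = append(found, o)
		}
	}
	return found
}

func main() {
	objs := []Object{
		{Name: "a", BBox: Rect{0, 0, 10, 10}},
		{Name: "b", BBox: Rect{5, 5, 20, 20}},
		{Name: "c", BBox: Rect{50, 50, 60, 60}},
	}
	area := Rect{0, 0, 12, 12}
	fmt.Println(len(findInArea(objs, area, false)), len(findInArea(objs, area, true)))
}
```

For redaction, the choice between the two modes matters: removing only fully contained objects can leave partially covered text behind, while removing every intersecting object can remove more than intended.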
Going from the primitive content stream operands to a higher-level representation, there is a need to keep a connection from the higher-level representation back to the lower level. For example, when removing content, we may need to filter at the higher level but follow the connection down to the primitive operands to actually remove them.
There may be a cascade/sequence of processing operations, initially on the primitive operands, for example grouping.
It should be clear whether those processes are lossy or lossless. Lossless means that they can reproduce exactly the same operands as the original, and therefore the same appearance. Lossy means that some aspect is lost; for example, when grouping text together, character spacing/kerning information could be lost.
Preferably all processing would have the capability to be lossless, but it remains to be seen whether that is practical.
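One way to keep the link between levels is for each higher-level object to record the indexes of the primitive operations it was built from. The sketch below illustrates this with hypothetical `Operation` and `TextGroup` types: consecutive `Tj` operations are grouped into one text run, and removing a group removes exactly its source operations while leaving all others untouched. Operands are simplified to plain strings here.

```go
package main

import "fmt"

// Operation is a simplified content stream operation.
type Operation struct {
	Operator string
	Operands []string
}

// TextGroup is a higher-level text run that remembers which operations
// it came from, so edits can be mapped back to the primitive level.
type TextGroup struct {
	Text      string
	OpIndexes []int
}

// groupText merges consecutive Tj operations into groups.
// Any other operator ends the current group.
func groupText(ops []Operation) []TextGroup {
	var groups []TextGroup
	open := false
	for i, op := range ops {
		if op.Operator != "Tj" || len(op.Operands) == 0 {
			open = false
			continue
		}
		if !open {
			groups = append(groups, TextGroup{})
			open = true
		}
		g := &groups[len(groups)-1]
		g.Text += op.Operands[0]
		g.OpIndexes = append(g.OpIndexes, i)
	}
	return groups
}

// removeGroup returns ops without the operations that produced g.
// All remaining operations are kept unchanged and in order.
func removeGroup(ops []Operation, g TextGroup) []Operation {
	drop := make(map[int]bool, len(g.OpIndexes))
	for _, i := range g.OpIndexes {
		drop[i] = true
	}
	kept := make([]Operation, 0, len(ops))
	for i, op := range ops {
		if !drop[i] {
			kept = append(kept, op)
		}
	}
	return kept
}

func main() {
	ops := []Operation{
		{Operator: "BT"},
		{Operator: "Tj", Operands: []string{"Hello"}},
		{Operator: "Tj", Operands: []string{" world"}},
		{Operator: "ET"},
		{Operator: "BT"},
		{Operator: "Tj", Operands: []string{"Secret"}},
		{Operator: "ET"},
	}
	groups := groupText(ops)
	for _, g := range groups {
		fmt.Printf("%q from ops %v\n", g.Text, g.OpIndexes)
	}
	fmt.Println(len(removeGroup(ops, groups[1])))
}
```

The grouping itself is lossy in the sense described above (the joined string says nothing about spacing between the pieces), but because the source indexes are kept, the original operations can always be recovered or selectively removed.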