kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format

Home Page: http://kba.github.io/hocr-spec/1.2/

hocr-spec's Introduction

hocr-spec

About

This repository contains the hOCR format specification originally written by Thomas Breuel.

Versions

1.0 English
- Google Doc: the original text by @tmbdev
- Last substantial edit in May 2010
1.1 English, 中文 (Chinese)
- Port of the Google Doc
- Cleaning obvious errata (duplicate content)
- More fine-grained heading structure
- Table of contents
- Chinese translation provided by @littlePP24 and @wanghaisheng
- Last substantial edit in September 2016
1.2 English
- Create a WHATWG-like spec using bikeshed
- Add issues where appropriate
- Semantically backwards-compatible with both 1.0 and 1.1

Contribute

There is no formal body. Feel free to use the GitHub issues for discussion and questions. Pull requests are very welcome.

For quick questions you can use the hocr-spec gitter channel.

Building the spec

To build the spec, you will need to have installed:

GNU make
One of the following programs installed:
- bikeshed
- docker
Python 3

To install the Python requirements:

pip3 install --user -r requirements.txt

The Makefile first looks for a local bikeshed installation and falls back to Docker, using the bikeshed Docker container to build the spec.

To change the spec, adapt

<VERSION>/spec.md to change the body of the spec
<VERSION>/spec.before.html to change
- the bikeshed metadata
- the references to terms from other specs
<VERSION>/spec.after.html to change
- JavaScript to run in the generated spec document
<VERSION>/defs.yml to change the definition lists for elements and properties

Then run make VERSION=<VERSION> to build that spec.

Examples:

To build the 1.2 version: make VERSION=1.2 or simply make
To build the 1.2-zh version: make VERSION=1.2-zh

Open Tasks

The goal of this project is to make the hOCR specification more accessible and easier to maintain.

Cross-reference other specs
Harmonize style
Add samples
...

hocr-spec's Issues

Under which conditions may bounding boxes overlap?

In many cases that should not happen (words, lines). For floats, it's inevitable.

hocr-check actually checks for this, but it is not spelled out anywhere AFAIK.
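
A minimal sketch of such an overlap test, assuming boxes are given as (x0, y0, x1, y1) tuples taken from the bbox property. This is not hocr-check's implementation; `bboxes_overlap` is a hypothetical helper, and treating boxes that only touch along an edge as non-overlapping is an assumption the spec would have to confirm:

```python
from typing import Tuple

# (x0, y0, x1, y1), in the order used by "bbox x0 y0 x1 y1"
BBox = Tuple[int, int, int, int]


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """Return True if the two rectangles share a region of positive area.

    Assumption: boxes that merely touch along an edge do not overlap.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # The rectangles overlap only if they overlap on both axes.
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1
```

A checker could apply this pairwise to sibling words or lines and skip floats, where overlap is expected.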

"Recommended" / "Optional" / "Non-Recommended"

These shouldn't be headings but part of the description of the individual properties / classes or a special table. Having them as headings leads to bbox appearing twice http://kba.github.io/hocr-spec/1.2/#bbox and http://kba.github.io/hocr-spec/1.2/#bbox-typesetting (the latter section could be removed completely IMHO).

Delete x_cost deprecated?

The property x_cost is only mentioned as an alternative to nlp in a single place. Why use x_cost over nlp?

Define abstract algorithm for parsing 'title'

There are a few ambiguities in the algorithm. It would be helpful to define the abstract steps necessary to get from the string in a title= attribute to a key-value structure.
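
As a starting point for discussion, here is a naive sketch of such a parser. It assumes that properties are separated by semicolons and that each property is a name followed by whitespace-separated values. It deliberately ignores quoted values containing spaces or semicolons, which is one of the ambiguities a real algorithm would have to settle. `parse_title` is a hypothetical name, not part of any existing tool:

```python
from typing import Dict, List


def parse_title(title: str) -> Dict[str, List[str]]:
    """Split a title= string into {property name: [values as strings]}.

    Assumptions: ';' separates properties, whitespace separates a name
    from its values. Quoted values with ';' or spaces are NOT handled.
    """
    props: Dict[str, List[str]] = {}
    for chunk in title.split(";"):
        tokens = chunk.split()
        if not tokens:
            continue  # skip empty chunks, e.g. after a trailing ';'
        name, values = tokens[0], tokens[1:]
        props[name] = values  # a repeated name overwrites the earlier one
    return props
```

For example, `parse_title("bbox 0 0 100 100; x_wconf 95")` yields a mapping with the keys `bbox` and `x_wconf`, with all values left as strings.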

Uniform ellipsis character

Consistently use ... (three dots) instead of … (\u2026).

Suggestion: Group all properties under one section

Group all properties alphabetically under one section (without their current headings).

Each class should reference its possible properties.

Old paper about hOCR

Written by Thomas Breuel
https://www.researchgate.net/publication/232632963_The_hOCR_Microformat_for_OCR_Workflow_and_Results_PDF
Maybe you want to try to get the full text...

how to express uncertainty of special ocr result

in section ¶21.1 Levels of Certainty

The certainty element is designed to encode the following sorts:

    a given tag may or may not correctly apply (e.g. a given word may be a personal name, or perhaps not)
    the precise point at which an element begins or ends is uncertain
    the value given for an attribute is uncertain
    the content given for an element is unreliable for any reason.

How can such information be expressed in hOCR?

Future of hOCR

hOCR is easy to implement because it's based on HTML but it can hardly be called a standard while there are living standards for OCR like ALTO.

hOCR is used by Open Source engines like tesseract, ocropy, kraken, cuneiform. Is their output spec-conformant and uniform? Would it not be better to enhance them to support ALTO if they do not already?

I like hOCR's approach for extensibility and microformat-like simplicity but it has not been updated for several years and I think it should not be used for new implementations unless there are very compelling reasons not to use ALTO.

That being said, there is software around that produces hOCR and related tools that expect hOCR (or some dialect of it).

What I think needs to be done in any case:

Reduce the specs to the parts that are in actual use
Restructure it to make it more coherent and provide more examples
Produce a new major version indicating those changes and removals.

That new version should either be developed/refined further (e.g. by standardizing x* properties/classes) or contain a prominent deprecation notice that recommends another format like ALTO.

CC @tmbdev @mittagessen @zdenop @amitdo @zuphilip @cneud @stweil

Logical Tags/classes

I don't understand how the logical tags in hOCR should be used. Moreover, I see potential conflicts with other nested tags from the layout. AFAIK ocropus itself does not use any logical tags and tesseract only supports ocr_par. For most hocr logical classes there are equivalent html tags and therefore I don't see any advantage to add special logical hocr classes there.

Some more specific questions about the logical hocr classes:

Is the ocr_document the same as the html document or can there be multiple ocr_documents in the same html document?
"The standard HTML tags given in brackets specify the preferred HTML tags to use with those logical structuring elements." How exactly are these elements used? Are they just marking the beginning of something new or should they be nested into each other? What happens, for example, if the page break happens inside the abstract, i.e. the abstract is spread across two images?
Should ocr_authors be used to indicate some "byline" area or should there be some metadata about the authors given there?
What is ocr_display?
Is ocr_linear a special case of ocr_par or why is it inside this subsection?

What do you think?

The 'ocr_page' capability must always be present

Since every hOCR document must have an ocr_page (http://kba.github.io/hocr-spec/1.2/#ocr_page) and must list every ocr_* tag it might use, ocr-capabilities must always contain ocr_page.
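
A small sketch of how a validator could enforce this, assuming the capabilities are available as the whitespace-separated content string of the ocr-capabilities meta tag. `missing_required_capabilities` is a hypothetical helper, not an existing hocr-tools function:

```python
from typing import Set

# Capabilities that, per the argument above, every document must declare.
REQUIRED_CAPABILITIES = {"ocr_page"}


def missing_required_capabilities(content: str) -> Set[str]:
    """Return the required capabilities absent from the meta content string."""
    declared = set(content.split())
    return REQUIRED_CAPABILITIES - declared
```

An empty result means the declaration is acceptable in this respect; a non-empty result lists what is missing.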

ocr_line vs. ocrx_line

@zdenop asked on the hOCR mailing list in 2012 and did not receive an answer:

I need clarification of ocr_line vs. ocrx_line

hOCR spec define ocrx_line as:

any kind of "line" returned by an OCR system that differs from the
standard ocr_line above

might be some kind of "logical" line

hocr-tools provide this example of ocr_line[1]:
 Alice was beginning to get very tired of sitting by her sister on the bank,
And tesseract-ocr (r729) produce this hocr output:
 
 Alice
 was
 ...
 bank,
 
Does tesseract-ocr ocr_line meets criteria of "standard ocr_line" or
should it use ocrx_line?

Rendering of internal links

Is it possible to render the internal links in the HTML without the paragraph signs and section numbers? I feel that readability suffers from this.

Would you agree?

How to handle hyphens?

There is a hardbreak property and there are references to soft hyphens in the spec, but no explicit recommendation on how hyphens should be handled.

For example, ALTO has a <HYP CONTENT="-"/> element for hyphens.

I see two options:

A: Encode the hyphen as a minus sign (-) that is part of the word it hyphenates:

<span class="ocr_line">
  <span class="ocrx_word">what-</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

B: Encode the hyphen as &shy; (soft hyphen) or an inline span:

<span class="ocr_line">
  <span class="ocrx_word">what&shy;</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

Personally, I prefer option A because that is more in line with the pragmatic nature of hOCR and makes the hOCR output more uniform for post-processing tools.

On the other hand, when converting to hOCR from ALTO, the information that a minus sign is actually a hyphen will be lost.

How about non-hyphen dashes? Should the spec offer guidance on how to encode these?

correct MIME type for hOCR?

I'm publishing some hOCR, but uncertain what MIME type to give. I'm using text/html but that seems incomplete. Is there a standard way to convey that a file is hOCR?

Suggestion: <TAG> -> <tag>

UPPERCASE to lowercase...

Anchoring and flow-around properties for floating elements

Issue in the original spec:

There is currently no way of indicating anchoring or flow-around properties for floating elements; properties need to be defined for this.

A typo

Clases for Character Information

What exactly is `baseline` in @title?

https://github.com/kba/hocr-spec/blob/master/hocr-spec.md#baseline:

baseline pn pn-1 ... p0 - a polynomial describing the baseline of a line of text
the polynomial is in the coordinate system of the line, with the bottom left of the bounding box as the origin

If I understand correctly, this will be a tuple x y for all rectangular areas (with bbox)?
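
To make the quoted definition concrete, here is a sketch that evaluates such a polynomial at a given x. It assumes the coefficients are listed from the highest degree down to the constant term, as the order pn pn-1 ... p0 suggests. `baseline_y` is a hypothetical name:

```python
from typing import Sequence


def baseline_y(coeffs: Sequence[float], x: float) -> float:
    """Evaluate pn*x**n + ... + p1*x + p0 using Horner's method.

    coeffs are in the order they follow 'baseline': pn ... p0.
    x and the result are in the line's own coordinate system, with the
    bottom left of the bounding box as the origin (per the definition).
    """
    y = 0.0
    for c in coeffs:
        y = y * x + c
    return y
```

Under this reading, a baseline with two values is a straight line: the first value is the slope and the second is the offset at the left edge of the bounding box.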

cflow: the paragraph before this section

The following property relates the flow...

Should be moved to cflow section

This property relates the flow...

Also, in the cflow section

... must be present on ocrcarea

ocrcarea -> ocr_carea

bbox x0 y0 x1 y1

Where is the detailed description of bbox x0 y0 x1 y1?
How are x0, y0, x1 and y1 computed?
Are they relative values or not?

'ocr_separator' appears twice

...

Reduce number of top-level headings

There are a lot of top-level sections that could be either merged or organized into subsections.

XML namespace for hOCR HTML?

This is important for XSLT transformations.

2.0: Replace title= props with data-ocr-* attributes

Reusing the title= attribute of HTML elements for OCR-specific values is bad practice. It's understandable since at the time of hOCR's initial development, there were few mechanisms to extend HTML, but in HTML5, there are quite a few.

In a (possible) next major revision of the standard, we could use data-ocr-* attributes for that purpose.

<span id="line1" class="ocr_line" title="bbox 0 0 100 100">...</span>

could be expressed as

<span id="line1" data-ocr-tag="line" data-ocr-bbox="[0,0,100,100]"> ... </span>

This is more verbose but it would make it much easier to specify behavior and work with the content, i.e. in Javascript, you could do:

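// Look up the line element and parse its bounding box from the data attribute
var line = document.querySelector("#line1");
var bbox = JSON.parse(line.dataset.ocrBbox);
// bbox is [x0, y0, x1, y1], so the width is x1 - x0
var width = bbox[2] - bbox[0];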

'Other superscripts and subscripts...'

Superscript and Subscript

Other superscripts and subscripts must be represented using the HTML <sup> and <sub> tags

'Other' seems to refer to superscripts and subscripts within 'ocr_chem' and 'ocr_math'. If that's the case, this section should come after 'ocr_math'.

Merge capabilities into metadata

What's the purpose of ocrx_cinfo?

Spec says

  * ocrx_cinfo should nest inside ocrx_line
  * ocrx_cinfo should contain only x_confs, x_bboxes, and cuts attributes

but not what ocrx_cinfo actually is.

Flags are not rendered

It seems that if you use the flags :gb: or :cn: within link context, then they are not rendered as expected.

License?

If you can reach Thomas Breuel, ask him to license his spec under an open-source license, preferably CC0.

Move a paragraph from the 'hardbreak' section

hardbreak

Any special characters representing the desired end-of-line...

should be moved to another section.

What DOCTYPE for hOCR HTML?

Merge Profiles and HTML markup

Document spec build process

ocr_carea vs ocrx_block

When should engines output the latter?

`ocr_glyphs` is occurring twice

How to treat bounding boxes that contradict reading order?

Due to its line segmentation, ocropus inserts ocr_line at the wrong position in the flow of elements, i.e. in the middle of another paragraph. From the bounding box it is clear that these should not be at this position.

Can we find some rules for bounding box - reading order dependency to catch such obvious(?) mistakes while still allowing complex layouts?

Related to #23
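
One candidate rule, sketched here purely as an assumption and not as something the spec states: an element's bounding box should lie within the bounding box of the element it is nested in. A line inserted into the wrong paragraph would often violate this. `is_contained` is a hypothetical helper:

```python
from typing import Tuple

# (x0, y0, x1, y1), in the order used by "bbox x0 y0 x1 y1"
BBox = Tuple[int, int, int, int]


def is_contained(child: BBox, parent: BBox) -> bool:
    """Return True if child lies entirely within parent (edges may coincide)."""
    cx0, cy0, cx1, cy1 = child
    px0, py0, px1, py1 = parent
    return px0 <= cx0 and py0 <= cy0 and cx1 <= px1 and cy1 <= py1
```

This check would not help if an engine computes a paragraph's box as the union of whatever lines it placed inside it, and floats would need an exemption, so it is at best a partial heuristic.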

x_boxes should be x_bboxes

Use of property `presence`

There's a title property for it, defined as

presence presence must be declared in the document meta data

Move sections to the appendix

Revision History
IANA considerations
sample usage

Add option to build spec via bikeshed web service

curl https://api.csswg.org/bikeshed/ -F [email protected]  > index.html

c.f. #30

Usage of id

(Started in #17)

id= is part of HTML/XML; every element can have one id=, and its value must be unique within the document.

Shall we add to the specs that all elements SHOULD have an id or MUST have one?

Should the ids follow a certain syntactical form?
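
Whatever the answer, uniqueness is easy to check mechanically. The following sketch uses Python's standard html.parser to list id values that occur more than once; `duplicate_ids` is a hypothetical helper and says nothing about which syntactical form ids should take:

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List


class _IdCollector(HTMLParser):
    """Collect the value of every id= attribute seen in start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids.append(value)


def duplicate_ids(html: str) -> List[str]:
    """Return the sorted list of id values used more than once."""
    collector = _IdCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(collector.ids)
    return sorted(i for i, n in counts.items() if n > 1)
```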

Classes for Inline Representation

6 Inline Representations

6.1 Classes for Inline Representation
    6.1.1 ocr_glyph
    6.1.2 ocr_glyphs
    6.1.3 ocr_dropcap
    6.1.4 ocr_chem
    6.1.5 ocr_math
    6.1.6 Non-breaking space
    6.1.7 Non-default spaces
    6.1.8 Hyphenation
    6.1.9 Superscript and Subscript
    6.1.10 Ruby characters

'classes' => have class="..." attribute.
So,

    6.1.6 Non-breaking space
    6.1.7 Non-default spaces
    6.1.8 Hyphenation
    6.1.9 Superscript and Subscript
    6.1.10 Ruby characters

should not be under 'Classes for Inline Representation'.

If necessary, the markup may use the following non-standard tags:

<nobr> to indicate that line breaking is not permitted for the enclosed content

This is indeed non-standard and was never part of any HTML spec. Does anyone have an example of when it would be necessary to indicate that line breaking is not allowed? Why not use the CSS declaration white-space: nowrap; instead?

Drop support for polygons?

Polygons are obviously more flexible than rectangles but make the specs more complicated, e.g. #15

Are there any engines with ocrp_poly capability? Are there any examples in the wild?

ocr_carea: Used to be called ocr_column

Used to be called ~~ocr_column~~

What is the desired stylistic effect here?

~~ocr_column~~
?

Merge "Alternative Segmentations / Readings" and "Grouped Elements and Multiple Hierarchies"

"or alternatively can be indicated as properties on elements"

Language and writing direction should be indicated using the HTML standard
attributes lang= and dir=, or alternatively can be indicated as properties on
elements.

Not clear what that last clause means. Title properties dir and lang?

I would delete it, one standard mechanism for dir/lang is better.