
hocr-spec's Introduction

hocr-spec

Join the chat at https://gitter.im/kba/hocr-spec

The hOCR Embedded OCR Workflow and Output Format

About

This repository contains the hOCR format specification originally written by Thomas Breuel.

Versions

  • 1.0 English
    • Google Doc: the original text by @tmbdev
    • Last substantial edit in May 2010
  • 1.1 English, 中文 (Chinese)
    • Port of the Google Doc
    • Cleaning obvious errata (duplicate content)
    • More fine-grained heading structure
    • Table of contents
    • Chinese translation provided by @littlePP24 and @wanghaisheng
    • Last substantial edit in September 2016
  • 1.2 English
    • Create a WHATWG-like spec using bikeshed
    • Add issues where appropriate
    • Semantically backwards-compatible with both 1.0 and 1.1

Contribute

There is no formal body. Feel free to use the GitHub issues for discussion and questions. Pull requests are very welcome.

For quick questions you can use the hocr-spec gitter channel.

Building the spec

To build the spec, you will need:

  • GNU make
  • Either a local bikeshed installation or Docker
  • Python 3

To install the python requirements:

pip3 install --user -r requirements.txt

The Makefile first looks for a local bikeshed installation. If none is found, it falls back to Docker and builds the spec with the bikeshed Docker container.

To change the spec, adapt

  • <VERSION>/spec.md to change the body of the spec
  • <VERSION>/spec.before.html to change
  • <VERSION>/spec.after.html to change
    • Javascript to run in the generated spec document
  • <VERSION>/defs.yml to change the definition lists for elements and properties

Then run make VERSION=<VERSION> to build that spec.

Examples:

  • To build the 1.2 version: make VERSION=1.2 or simply make
  • To build the 1.2-zh version: make VERSION=1.2-zh

Open Tasks

The goal of this project is to make the hOCR specification more accessible and easier to maintain.

  • Cross-reference other specs
  • Harmonize style
  • Add samples
  • ...

hocr-spec's People

Contributors

amitdo, gitter-badger, jbaiter, kba, stefre, stweil, tmbdev, wanghaisheng, wollmers, zuphilip

hocr-spec's Issues

Delete x_cost deprecated?

The property x_cost is only mentioned as an alternative to nlp in a single place. Why use x_cost over nlp?

how to express uncertainty of special ocr result

in section ¶21.1 Levels of Certainty

The certainty element is designed to encode the following sorts:

    a given tag may or may not correctly apply (e.g. a given word may be a personal name, or perhaps not)
    the precise point at which an element begins or ends is uncertain
    the value given for an attribute is uncertain
    the content given for an element is unreliable for any reason.

How can such information be expressed in hOCR?

Future of hOCR

hOCR is easy to implement because it's based on HTML but it can hardly be called a standard while there are living standards for OCR like ALTO.

hOCR is used by Open Source engines like tesseract, ocropy, kraken, cuneiform. Is their output spec-conformant and uniform? Would it not be better to enhance them to support ALTO if they do not already?

I like hOCR's approach for extensibility and microformat-like simplicity but it has not been updated for several years and I think it should not be used for new implementations unless there are very compelling reasons not to use ALTO.

That being said, there is software around that produces hOCR and related tools that expect hOCR (or some dialect of it).

What I think needs to be done in any case:

  1. Reduce the specs to the parts that are in actual use
  2. Restructure it to make it more coherent and provide more examples
  3. Produce a new major version indicating those changes and removals.

That new version should either be developed/refined further (e.g. by standardizing x* properties/classes) or contain a prominent deprecation notice that recommends another format like ALTO.

CC @tmbdev @mittagessen @zdenop @amitdo @zuphilip @cneud @stweil

Logical Tags/classes

I don't understand how the logical tags in hOCR should be used. Moreover, I see potential conflicts with other nested tags from the layout. AFAIK ocropus itself does not use any logical tags and tesseract only supports ocr_par. For most hOCR logical classes there are equivalent HTML tags, so I don't see any advantage in adding special logical hOCR classes there.

Some more specific questions about the logical hocr classes:

  • Is the ocr_document the same as the html document or can there be multiple ocr_documents in the same html document?
  • "The standard HTML tags given in brackets specify the preferred HTML tags to use with those logical structuring elements." How exactly are these elements used? Do they just mark the beginning of something new, or should they be nested inside each other? What happens, for example, if a page break falls inside the abstract, i.e. the abstract is spread across two images?
  • Should ocr_authors be used to indicate some "byline" area or should there be some metadata about the authors given there?
  • What is ocr_display?
  • Is ocr_linear a special case of ocr_par or why is it inside this subsection?

What do you think?

ocr_line vs. ocrx_line

@zdenop asked on the hOCR mailing list in 2012 and did not get an answer:

I need clarification of ocr_line vs. ocrx_line

The hOCR spec defines ocrx_line as:

  • any kind of "line" returned by an OCR system that differs from the
    standard ocr_line above
  • might be some kind of "logical" line

hocr-tools provides this example of ocr_line[1]:

 <span class='ocr_line' title='bbox 461 648 2077 707'>Alice was beginning to get very tired of sitting by her sister on the bank,</span>

And tesseract-ocr (r729) produces this hOCR output:

  <span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
      <span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
      <span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
       ...
      <span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
  </span>

Does the ocr_line produced by tesseract-ocr meet the criteria of a "standard ocr_line", or
should it use ocrx_line?
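Whichever class name is used, a consumer reads both kinds of output the same way: it looks up the bbox in the title attribute. The following is a minimal sketch of that, using only the Python standard library. The names parse_title and BBoxCollector are made up for this illustration, and the parser assumes that properties are separated by semicolons and that values contain no quoted strings with spaces.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict.

    Simplification: values are split on whitespace, so quoted values
    that contain spaces or semicolons are not handled.
    """
    props = {}
    for clause in title.split(";"):
        parts = clause.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


class BBoxCollector(HTMLParser):
    """Collect (class, id, bbox) for every element that carries a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        props = parse_title(attrs.get("title") or "")
        if "bbox" in props:
            bbox = tuple(int(v) for v in props["bbox"])
            self.boxes.append((attrs.get("class"), attrs.get("id"), bbox))


# Shortened version of the tesseract output quoted above.
SAMPLE = """
<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
  <span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
  <span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
</span>
"""

collector = BBoxCollector()
collector.feed(SAMPLE)
for css_class, element_id, bbox in collector.boxes:
    print(css_class, element_id, bbox)
```

The sketch treats ocr_line and ocrx_line identically, which is roughly what a tool has to do as long as the distinction is not specified more precisely.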

Rendering of internal links

Is it possible to render the internal links in the html without the paragraph signs and section numbers? I feel that the readability suffers from this, e.g.

(screenshot: hocr-spec-links)

Would you agree?

How to handle hyphens?

The spec has a hardbreak property and mentions soft hyphens, but it gives no explicit recommendation on how hyphens should be handled.

For example, ALTO has a <HYP CONTENT="-"/> element for hyphens.

I see two options:

A: Encode the hyphen as a minus sign (-) that is part of the word it hyphenates:

<span class="ocr_line">
  <span class="ocrx_word">what-</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

B: Encode the hyphen as &shy; or an inline span:

<span class="ocr_line">
  <span class="ocrx_word">what&shy;</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

Personally, I prefer option A because that is more in line with the pragmatic nature of hOCR and makes the hOCR output more uniform for post-processing tools.

On the other hand, when converting to hOCR from ALTO, the information that a minus sign is actually a hyphen will be lost.

How about non-hyphen dashes? Should the spec offer guidance on how to encode these?
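To make the trade-off between the two options concrete, here is a rough sketch of how a post-processing tool might rejoin words that were split across lines. The function join_hyphenated and its markers parameter are hypothetical, and the input is assumed to be a list of lines, each a list of word strings already extracted from the hOCR.

```python
SOFT_HYPHEN = "\u00ad"  # the character that &shy; decodes to


def join_hyphenated(lines, markers=("-", SOFT_HYPHEN)):
    """Merge a line-final word ending in a marker with the next line's first word."""
    words = []
    carry = None  # line-final word (marker included) waiting for its continuation
    for line in lines:
        for index, word in enumerate(line):
            if index == 0 and carry is not None:
                word = carry[:-1] + word  # drop the marker, then join
                carry = None
            is_last = index == len(line) - 1
            if is_last and len(word) > 1 and word.endswith(markers):
                carry = word
            else:
                words.append(word)
    if carry is not None:
        words.append(carry)  # no continuation found, keep the word unchanged
    return words


# Option A: the hyphen is a plain minus sign.
print(join_hyphenated([["what-"], ["ever"]]))
# Option B: the hyphen is a soft hyphen.
print(join_hyphenated([["what" + SOFT_HYPHEN], ["ever"]]))
# With option A, a genuine hyphen at a line end is merged away as well.
print(join_hyphenated([["a", "well-"], ["known", "fact"]]))
# Treating only the soft hyphen as a marker leaves it alone.
print(join_hyphenated([["a", "well-"], ["known", "fact"]], markers=(SOFT_HYPHEN,)))
```

Under option A the tool cannot tell a line-break hyphen from a hyphen that belongs to the word, so "well-known" loses its hyphen. Under option B the marker is unambiguous, at the cost of less uniform output.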

correct MIME type for hOCR?

I'm publishing some hOCR, but I'm uncertain which MIME type to use. I'm using text/html, but that seems incomplete. Is there a standard way to convey that a file is hOCR?

A typo

Clases for Character Information

cflow: the paragraph before this section

The following property relates the flow...

Should be moved to cflow section

=>

This property relates the flow...

Also, in the cflow section

... must be present on ocrcarea

ocrcarea -> ocr_carea

bbox x0 y0 x1 y1

Where is the detailed description of bbox x0 y0 x1 y1?
How are x0, y0, x1 and y1 computed?
Are they relative values or not?

2.0: Replace title= props with data-ocr-* attributes

Reusing the title= attribute of HTML elements for OCR-specific values is bad practice. It's understandable since at the time of hOCR's initial development, there were few mechanisms to extend HTML, but in HTML5, there are quite a few.

In a (possible) next major revision of the standard, we could use data-ocr-* attributes for that purpose.

<span id="line1" class="ocr_line" title="bbox 0 0 100 100">...</span>

could be expressed as

<span id="line1" data-ocr-tag="line" data-ocr-bbox="[0,0,100,100]"> ... </span>

This is more verbose, but it would make it much easier to specify behavior and to work with the content. In JavaScript, for example, you could do:

var line = document.querySelector("#line1");
// dataset exposes data-ocr-bbox as ocrBbox; the value is a JSON array
var bbox = JSON.parse(line.dataset.ocrBbox);
var width = bbox[2] - bbox[0];
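A migration from title= properties to data-ocr-* attributes could be largely mechanical. The sketch below shows one possible mapping. The attribute names follow the proposal in this issue, and everything else (the function title_to_data_attrs, turning underscores into dashes, encoding every value list as JSON) is an assumption made for illustration, not something the spec defines.

```python
import json


def _coerce(value):
    """Turn a property value into int or float where possible, else keep the string."""
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value


def title_to_data_attrs(css_class, title):
    """Map an hOCR class and title string to hypothetical data-ocr-* attributes."""
    # "ocr_line" -> "line"; note that this drops the ocr_/ocrx_ distinction.
    attrs = {"data-ocr-tag": css_class.split("_", 1)[1]}
    for clause in title.split(";"):
        parts = clause.split()
        if not parts:
            continue
        name = parts[0].replace("_", "-")
        values = [_coerce(v) for v in parts[1:]]
        attrs["data-ocr-" + name] = json.dumps(values, separators=(",", ":"))
    return attrs


print(title_to_data_attrs("ocr_line", "bbox 0 0 100 100"))
```

A real converter would also have to decide how to represent quoted string values and single-valued properties, which this sketch simply wraps in a JSON array.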

'Other superscripts and subscripts...'

Superscript and Subscript

Other superscripts and subscripts must be represented using the HTML <sup> and <sub> tags

'Other' seems to refer to superscripts and subscripts within 'ocr_chem' and 'ocr_math'. If that's the case, this section should come after 'ocr_math'.

What's the purpose of ocrx_cinfo?

Spec says

  * ocrx_cinfo should nest inside ocrx_line
  * ocrx_cinfo should contain only x_confs, x_bboxes, and cuts attributes

but not what ocrx_cinfo actually is.

Flags are not rendered

It seems that if you use the flags :gb: or :cn: within link context, then they are not rendered as expected.

License?

If you can reach Thomas Breuel, ask him to license the spec under an open source license, preferably CC0.

How to treat bounding boxes that contradict reading order?

Due to its line segmentation, ocropus inserts ocr_line at the wrong position in the flow of elements, i.e. in the middle of another paragraph. From the bounding box it is clear that these should not be at this position.

Can we find some rules for bounding box - reading order dependency to catch such obvious(?) mistakes while still allowing complex layouts?
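One candidate rule, offered only as a starting point for discussion: flag any child element whose bounding box is not contained in the bounding box of its parent. The sketch below assumes that bboxes are (x0, y0, x1, y1) tuples in one shared coordinate system and that the parent carries a bbox of its own. The function names are hypothetical.

```python
def contains(outer, inner):
    """Return True if bbox `inner` lies completely inside bbox `outer`."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def misplaced_children(parent_bbox, child_bboxes):
    """Return the indexes of children whose bbox leaves the parent's bbox."""
    return [
        index
        for index, bbox in enumerate(child_bboxes)
        if not contains(parent_bbox, bbox)
    ]


paragraph = (100, 100, 900, 400)
lines = [
    (100, 100, 900, 150),  # inside the paragraph
    (100, 160, 880, 210),  # inside the paragraph
    (100, 800, 900, 850),  # far below: probably belongs to another paragraph
]
print(misplaced_children(paragraph, lines))
```

This check only catches lines that leave their parent entirely. It says nothing about the order of siblings, so multi-column or otherwise complex layouts are not restricted by it.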

Related to #23

Use of property `presence`

There's a title property for it, defined as

presence: presence must be declared in the document metadata

Usage of id

(Started in #17)

id= is part of HTML/XML. Every element can have one id=, and its value must be unique within the document.

Shall we add to the specs that all elements SHOULD have an id or MUST have one?

Should the ids follow a certain syntactical form?

Classes for Inline Representation

6 Inline Representations

6.1 Classes for Inline Representation
    6.1.1 ocr_glyph
    6.1.2 ocr_glyphs
    6.1.3 ocr_dropcap
    6.1.4 ocr_chem
    6.1.5 ocr_math
    6.1.6 Non-breaking space
    6.1.7 Non-default spaces
    6.1.8 Hyphenation
    6.1.9 Superscript and Subscript
    6.1.10 Ruby characters

'classes' => have class="..." attribute.
So,

    6.1.6 Non-breaking space
    6.1.7 Non-default spaces
    6.1.8 Hyphenation
    6.1.9 Superscript and Subscript
    6.1.10 Ruby characters

should not be under 'Classes for Inline Representation'.

Specify that class must be a single value

It is not stated explicitly, but there seems to be consensus among implementations that the special classes like ocr_page, ocrx_word etc. must be the one and only class= of an HTML element.
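If the spec adopted this rule, a validator could check it with a few lines of code. The following sketch uses only the Python standard library. It assumes that hOCR classes can be recognized by the prefixes ocr_ and ocrx_, and the class name SingleClassChecker is made up for this example.

```python
from html.parser import HTMLParser

HOCR_PREFIXES = ("ocr_", "ocrx_")  # assumption: these prefixes mark hOCR classes


class SingleClassChecker(HTMLParser):
    """Record elements that combine an hOCR class with any other class."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        has_hocr_class = any(c.startswith(HOCR_PREFIXES) for c in classes)
        if has_hocr_class and len(classes) > 1:
            self.violations.append((tag, classes))


checker = SingleClassChecker()
checker.feed(
    '<div class="ocr_page">'
    '<span class="ocr_line highlight">text</span>'
    '<span class="note">not an hOCR element</span>'
    "</div>"
)
print(checker.violations)
```

Elements without any hOCR class are ignored, so ordinary styling classes elsewhere in the document do not trigger a violation.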

nobr

If necessary, the markup may use the following non-standard tags:

  • <nobr> to indicate that line breaking is not permitted for the enclosed content

This is indeed non-standard and was never part of any HTML spec. Does anyone have an example where it would be necessary to indicate that line breaking is not allowed? Why not use the CSS declaration white-space: nowrap; instead?

Drop support for polygons?

Polygons are obviously more flexible than rectangles but make the specs more complicated, e.g. #15

Are there any engines with ocrp_poly capability? Are there any examples in the wild?

"or alternatively can be indicated as properties on elements"

Language and writing direction should be indicated using the HTML standard
attributes lang= and dir=, or alternatively can be indicated as properties on
elements.

Not clear what that last clause means. Title properties dir and lang?

I would delete it; a single standard mechanism for dir/lang is better.
