ocr-d / gt-guidelines

OCR-D guidelines for Ground Truth production

Home Page: https://ocr-d.de/en/gt-guidelines/trans/

License: Creative Commons Attribution Share Alike 4.0 International

CSS 0.51% XSLT 2.99% HTML 96.48% Makefile 0.03%

gt-guidelines's People

stweil kba bertsky

gt-guidelines's Issues

Oxygen project files contain duplicate scenarios?

The <ditaScenario> seems to exist twice. Do they differ?

Can you describe how the documentation is built at an abstract level, independent of any IDE? Quite a few flags are set here whose meaning I don't understand without running Eclipse with that DITA setting.

new build

Guidelines vs Guiedelines

There are two XPR files:

OCR-D_GT_Guidelines.xpr
OCR-D_GT_Guiedelines.xpr

git history:

* 1e42c0b (5 hours ago) Matthias Boenig Update |
|  OCR-D_GT_Guidelines.xpr  | 392 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
|  OCR-D_GT_Guiedelines.xpr | 101 ++----------------------------------
|  2 files changed, 372 insertions(+), 121 deletions(-)

* 2b43f78 (3 days ago) Matthias Boenig update
   OCR-D_GT_Guidelines.xpr  | 340 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   OCR-D_GT_Guiedelines.xpr | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   2 files changed, 656 insertions(+)

@tboenig Which one is the up-to-date one?

To add to the confusion, the correctly spelled file refers to the misspelled one:

$ ag guiede
OCR-D_GT_Guidelines.xpr
4:        <filters directoryPatterns="" filePatterns="OCR-D_GT_Guiedelines.xpr" positiveFilePatterns="" showHiddenFiles="false"/>

show example images of font families

trans/lySchriftarten lists the major historically adequate fontFamily clusters identified by the Leipzig/Erlangen-Nürnberg/Mainz module project:

textura

rotunda

bastarda

antiqua

greek

hebrew

italic

fraktur

It would be great if an example for each could be added to the page, like figure 3 from http://digitalia.sbn.it/article/view/2630/1837.

clarify level 2 rules for Greek ligatures and abbreviation glyphs

The current formulation is quite vague on Greek, but the corresponding precise ruleset for Latin (vocalic yes, consonantal no) suggests that we should indeed try to "regularize" or "canonicalize" the ligatures and abbreviation glyphs used pervasively in Hellenistic times (at least those that survived until the 19th century and thus received a Unicode codepoint).

In particular,

ȣ → ου
ϛ → στ (i.e. not ς!)
Ϛ → ΣΤ
...
ϗ → καί
Ϗ → Καί
...

or is this only relevant on level 1?
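If the answer is that these glyphs should be expanded, the rule could be sketched as a simple translation table. This is only an illustration built from the five pairs listed above; the names `LIGATURE_EXPANSIONS` and `expand_greek_ligatures` are made up here, and the elided entries ("...") are deliberately left out.

```python
# Hypothetical normalisation table: only the pairs named in this issue.
LIGATURE_EXPANSIONS = {
    "\u0223": "ου",   # ȣ LATIN SMALL LETTER OU
    "\u03DB": "στ",   # ϛ GREEK SMALL LETTER STIGMA (not final sigma ς)
    "\u03DA": "ΣΤ",   # Ϛ GREEK LETTER STIGMA
    "\u03D7": "καί",  # ϗ GREEK KAI SYMBOL
    "\u03CF": "Καί",  # Ϗ GREEK CAPITAL KAI SYMBOL
}

# str.maketrans accepts a dict mapping single characters to replacement strings.
_TABLE = str.maketrans(LIGATURE_EXPANSIONS)


def expand_greek_ligatures(text: str) -> str:
    """Replace each listed ligature/abbreviation glyph by its expansion."""
    return text.translate(_TABLE)
```

Characters that are not in the table, including the ordinary final sigma ς, pass through unchanged.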

2 trans pages missing in output

The following two topics are missing in the generated output (both en and de), because they are not mentioned in ocrd_ocrd.ditamap:

trans/lyBesonderheiten.dita
trans/trApostrophe.dita
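A check like this could be automated. The following is a rough sketch, assuming that every topic is referenced through an `href` attribute somewhere in the map, that hrefs are relative to the map file and carry no fragment, and that the topic directory lies below the map's directory. The function names are invented for illustration.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def referenced_topics(ditamap: Path) -> set[str]:
    """Collect every href attribute found on any element of the map."""
    tree = ET.parse(ditamap)
    return {el.get("href") for el in tree.iter() if el.get("href")}


def unreferenced_topics(ditamap: Path, topic_dir: Path) -> list[str]:
    """List .dita files under topic_dir that the map never mentions.

    Assumes topic_dir is located below the directory containing the map.
    """
    refs = referenced_topics(ditamap)
    base = ditamap.parent
    topics = (p.relative_to(base).as_posix() for p in topic_dir.rglob("*.dita"))
    return sorted(t for t in topics if t not in refs)
```

Called with the path to ocrd_ocrd.ditamap and the trans directory, it would return the relative paths of topics that exist on disk but are absent from the map.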

img vs images

Is documentation/img for the PAGE docs and documentation/images for the gt guidelines?

clarify quotation rules for level 1

I can only find this short statement about quotation in level 1:

Quotation marks are transferred to today's use and are not differentiated

Considering the many differentiations offered by today's use in Unicode, also listed in full by the spec, I wonder what that means.

Today could mean only the ubiquitous ASCII reduction, i.e. only " and '. That would mean the differentiation is reserved for levels 2 and 3.

But it could also be a subset (e.g. no "low-9" or "high-reversed-9" or no angular quotes).

Could someone please clarify here and update the specs accordingly?

How is the PAGE documentation generated?

How to encode mathematical fractions?

While Unicode does have codepoints for the most common fractions (¼, ½, ¾ etc.), this does not scale, because of course not all possible numerator/denominator combinations are available. So it might be best to encode fractions as just "numerator fraction-slash denominator" (with regular numbers or super/subscript numbers?) or even produce LaTeX syntax.
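To make the options concrete, here is a small sketch of the three candidate encodings for a fraction such as 3/16: plain digits around U+2044 FRACTION SLASH, superscript/subscript digits around it, and LaTeX. None of this is prescribed by the guidelines; the function names and the `stacked` flag are illustrative, and non-negative integers are assumed.

```python
FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH

# Digit 0-9 to superscript / subscript digit translation tables.
SUPERSCRIPTS = str.maketrans(
    "0123456789", "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
)
SUBSCRIPTS = str.maketrans(
    "0123456789", "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
)


def encode_fraction(numerator: int, denominator: int, *, stacked: bool = False) -> str:
    """Encode as 'numerator fraction-slash denominator'.

    With stacked=True the numerator uses superscript digits and the
    denominator subscript digits; otherwise regular digits are kept.
    """
    num, den = str(numerator), str(denominator)
    if stacked:
        num, den = num.translate(SUPERSCRIPTS), den.translate(SUBSCRIPTS)
    return f"{num}{FRACTION_SLASH}{den}"


def to_latex(numerator: int, denominator: int) -> str:
    """Alternative: emit LaTeX syntax instead of Unicode."""
    return rf"\frac{{{numerator}}}{{{denominator}}}"
```

Whichever variant is chosen, the fraction slash keeps numerator and denominator recoverable by a simple split, which the precomposed codepoints do not allow.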

Translation

https://ocr-d.github.io/gt//trans_documentation/trSchreibungR.html

versioning / releases

Idea from @cneud and @kba in call: start semantic versioning here and GH releases.

(Probably even: PDF export for individual releases?)

temp folder

Can we purge the documentation/temp folder from the Git history?

Same for the documentation/out folder.

Aren't those generated?

trans/trFremdsprache is broken in en

It seems the table in trans/trFremdsprache.dita was mangled during translation from de: only one column survived, which collapses the 3 GT levels and the comments.

Why all the content on Google

E.g. https://lh4.googleusercontent.com/1HdqrPTGjE0nHay-YhnX16ipvQzmWW44oiWjsL4x9lh-aRq_J-flmV1oNOgflJFu5T-F9OEvKeQW8H1i_7gR7EMcF36wq1E8ktKO2fBWqLw2NDylG81YNE-Yt6DK8P599sZXajvP and many more.

Branch pruned of all generated/temporary output

Branch https://github.com/OCR-D/gt-guidelines/tree/purged is a rewrite of the git history, undoing all commits to

documentation/out
documentation/temp
page_documentation/out
documentation/img/pagecontent*

which reduces the size of the repo and index by about 400 MB.

Since this changes the git history, it cannot be merged; we need to rebase master on it (hence no PR, hence this issue).

documentation vs page_documentation

There's a lot of redundancy with respect to images generated from XML (how are they created?), for example:

097c6501fa303d27bdf2042ddea7a9f5  ./documentation/img/pagecontent_xsd_Complex_Type_pc_RegionRefType.jpeg
097c6501fa303d27bdf2042ddea7a9f5  ./pagexml_dokumentation/img/pagecontent_xsd_Complex_Type_pc_RegionRefType.jpeg
097c6501fa303d27bdf2042ddea7a9f5  ./pagexml_dokumentation/out/webhelp-responsive/img/pagecontent_xsd_Complex_Type_pc_RegionRefType.jpeg
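A listing like the one above can be produced for the whole tree by grouping files on their MD5 checksum. The following is a minimal sketch, not the command actually used for this issue; the function names and the default `*.jpeg` pattern are assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(root: Path, pattern: str = "*.jpeg") -> dict[str, list[Path]]:
    """Map each checksum to the files sharing it, keeping only real duplicates."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            by_hash[md5_of(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Running `duplicate_groups(Path("."))` from the repository root would return one entry per checksum that occurs more than once, with the paths of all copies.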

structural concordance: collect more (possible) pairs

In en/trans/structurmets2page.dita, we could add the Page/@type types with their DFG Strukturdatenset counterparts (some of which are already covered in en/trans/structur_gtpageformat.dita, perhaps because they were also in the Zot format already):

cover_back: back-cover
cover_front: front-cover
binding / endsheet / spine / paste_down / colour_checker: empty
title_page: title
table_of_contents: table-of-contents
index: index

Also, why is mets:div/@type=table likened to pc:TextRegion/@type=heading and not @type=caption (or pc:TableRegion directly)? (Same probably goes for mets:div/@type=map vs. caption / pc:MapRegion, as well as mets:div/@type=musical_notation vs. caption or pc:MusicRegion.)

Also, where is illustration?

Next, I would have expected that pc:GraphicRegion/@type gets mapped, too:

annotation: handwritten-annotation
stamp: stamp
printers-mark: decoration
...?

Furthermore, IIUC it seems plausible to also suggest mapping some of the mets:div types to pc:ReadingOrder types:

text: paragraph
illustration: figure
article: article
section / chapter / part: div

Generally, it would also help to strictly differentiate between structural types (what ENMAP calls contentUnit) and layout types (what ENMAP calls contentItem).

Lastly, how about also collecting concordance between mets:div types and alto:LayoutTags and alto:StructureTags? I can see many similar entities in the official documentation. Perhaps a full discussion of this would also need to include the various ENMAP profiles...