ocr-d / gt-guidelines
OCR-D guidelines for Ground Truth production
Home Page: https://ocr-d.de/en/gt-guidelines/trans/
License: Creative Commons Attribution Share Alike 4.0 International
The <ditaScenario> seems to exist twice. Do they differ?
Can you describe how the documentation is built on an abstract level, regardless of IDE? There are quite a few flags set here whose meaning I don't understand without running Eclipse with that DITA setting.
There are two XPR files:
OCR-D_GT_Guidelines.xpr
OCR-D_GT_Guiedelines.xpr
git history:
* 1e42c0b (5 hours ago) Matthias Boenig Update |
| OCR-D_GT_Guidelines.xpr | 392 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
| OCR-D_GT_Guiedelines.xpr | 101 ++----------------------------------
| 2 files changed, 372 insertions(+), 121 deletions(-)
* 2b43f78 (3 days ago) Matthias Boenig update
OCR-D_GT_Guidelines.xpr | 340 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
OCR-D_GT_Guiedelines.xpr | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 656 insertions(+)
@tboenig Which one is the up-to-date one?
To add to the confusion, the correctly spelled one refers to the misspelled one:
$ ag guiede
OCR-D_GT_Guidelines.xpr
4: <filters directoryPatterns="" filePatterns="OCR-D_GT_Guiedelines.xpr" positiveFilePatterns="" showHiddenFiles="false"/>
In trans/lySchriftarten, the major historically adequate fontFamily clusters are listed that have been identified by the Leipzig/Erlangen-Nürnberg/Mainz module project:
- textura
- rotura
- bastarda
- antiqua
- greek
- hebrew
- italic
- fraktur
It would be great if an example for each could be added to the page, like figure 3 from http://digitalia.sbn.it/article/view/2630/1837.
The current formulation is quite vague on Greek, but the corresponding precise ruleset for Latin (vocalic yes, consonantal no) suggests that we should indeed try to "regularize" or "canonicalize" the ligatures and abbreviation glyphs used pervasively in Hellenistic times (at least those that survived until the 19th century and thus received a Unicode codepoint).
In particular,
- ȣ → ου
- ϛ → στ (i.e. not ς!)
- Ϛ → ΣΤ
- ϗ → καί
- Ϗ → Καί
or is this only relevant on level 1?
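If such a canonicalization were adopted, it could be sketched roughly as follows. This is only an illustration under assumptions: the name canonicalize_greek and the table GREEK_LIGATURES are hypothetical, and the table contains just the five replacements listed above, not an agreed OCR-D ruleset.

```python
# Hypothetical replacement table built only from the examples in this issue;
# it is not an official OCR-D mapping.
GREEK_LIGATURES = {
    "\u0223": "\u03bf\u03c5",        # ȣ (Latin small letter ou) -> ου
    "\u03db": "\u03c3\u03c4",        # ϛ (small stigma) -> στ, not final sigma ς
    "\u03da": "\u03a3\u03a4",        # Ϛ (capital stigma) -> ΣΤ
    "\u03d7": "\u03ba\u03b1\u03af",  # ϗ (kai symbol) -> καί
    "\u03cf": "\u039a\u03b1\u03af",  # Ϗ (capital kai symbol) -> Καί
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(GREEK_LIGATURES)


def canonicalize_greek(text: str) -> str:
    """Expand the listed ligature/abbreviation codepoints; leave all else as is."""
    return text.translate(_TABLE)
```

Whether this belongs to level 1 only, or to all levels, is exactly the open question; the sketch just makes the proposed replacements concrete.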
The following two topics are missing in the generated output (both en and de), because they are not mentioned in ocrd_ocrd.ditamap:
Is documentation/img for the PAGE docs and documentation/images for the gt guidelines?
I can only find this short statement about quotation in level 1:
Quotation marks are transferred to today's use and are not differentiated
Considering the many differentiations offered by today's use in Unicode, also listed in full by the spec, I wonder what that means.
"Today" could mean only the ubiquitous ASCII reduction, i.e. only " and '. That would mean the differentiation is reserved for levels 2 and 3.
But it could also be a subset (e.g. no "low-9" or "high-reversed-9" or no angular quotes).
Could someone please clarify here and update the specs accordingly?
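To make the first reading concrete, here is a minimal sketch of what an "ASCII reduction" of quotation marks might look like. The function name reduce_quotes and the choice of codepoints in ASCII_QUOTES are assumptions for illustration; the guidelines do not currently define such a table.

```python
# Hypothetical table for the "ASCII reduction" reading of level 1.
# Double-style quotes collapse to ", single-style quotes collapse to '.
ASCII_QUOTES = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u201f": '"',  # double high-reversed-9 quotation mark
    "\u00ab": '"',  # left-pointing double angle quotation mark
    "\u00bb": '"',  # right-pointing double angle quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201a": "'",  # single low-9 quotation mark
    "\u201b": "'",  # single high-reversed-9 quotation mark
    "\u2039": "'",  # single left-pointing angle quotation mark
    "\u203a": "'",  # single right-pointing angle quotation mark
}

_TABLE = str.maketrans(ASCII_QUOTES)


def reduce_quotes(text: str) -> str:
    """Replace typographic quotation marks with their ASCII counterparts."""
    return text.translate(_TABLE)
```

Under the second reading (a subset), the table would simply contain fewer entries, which is why the spec needs to say which marks are meant.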
While Unicode does have codepoints for the most common fractions (¼, ½, ¾ etc.), this does not scale, because of course not all possible numerator/denominator combinations are available. So it might be best to encode fractions as just "numerator fraction-slash denominator" (with regular numbers or super/subscript numbers?) or even produce LaTeX syntax.
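The "numerator fraction-slash denominator" option can be illustrated with Python's standard unicodedata module, since the precomposed fraction characters carry a `<fraction>` compatibility decomposition in the Unicode database. This is a sketch of one possible encoding (regular digits around U+2044 FRACTION SLASH); the function names are hypothetical and the choice between regular and super/subscript digits remains open.

```python
import unicodedata


def expand_fraction_char(ch: str) -> str:
    """Expand one precomposed fraction character, e.g. ½ -> 1⁄2 (with U+2044)."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<fraction>"):
        # The decomposition looks like "<fraction> 0031 2044 0032".
        codepoints = decomposition.split()[1:]
        return "".join(chr(int(cp, 16)) for cp in codepoints)
    return ch


def expand_fractions(text: str) -> str:
    """Expand every precomposed fraction in text; other characters are kept."""
    return "".join(expand_fraction_char(ch) for ch in text)
```

Only characters tagged `<fraction>` are touched, so superscripts, ligatures and other compatibility characters stay as they are. Note that a mixed number such as 2¾ would come out as 23⁄4 with this sketch, so the integer part would need separate handling.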
Can we purge the documentation/temp folder from the Git history?
Same for the documentation/out folder.
Aren't those generated?
It seems the table in trans/trFremdsprache.dita was mangled during translation from German: only one column survived, which collapses the 3 GT levels and the comments.
Branch https://github.com/OCR-D/gt-guidelines/tree/purged is a rewrite of the git history, undoing all commits to
which reduces the size of repo and index by about 400 MB.
Since this changes the git history, it cannot be merged; we need to rebase master on it (hence no PR, hence this issue).
There's a lot of redundancy with respect to images generated from XML (how are they created?), for example:
097c6501fa303d27bdf2042ddea7a9f5 ./documentation/img/pagecontent_xsd_Complex_Type_pc_RegionRefType.jpeg
097c6501fa303d27bdf2042ddea7a9f5 ./pagexml_dokumentation/img/pagecontent_xsd_Complex_Type_pc_RegionRefType.jpeg
097c6501fa303d27bdf2042ddea7a9f5 ./pagexml_dokumentation/out/webhelp-responsive/img/pagecontent_xsd_Complex_Type_pc_RegionRefType.jpeg
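A listing like the one above can be reproduced by grouping files by checksum. The following is a minimal sketch, not the command that was actually used; the function name find_duplicates is hypothetical, and MD5 is chosen only because the listing shows MD5 digests.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root):
    """Return {md5 hex digest: [paths]} for files under root with identical content."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for images of moderate size.
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests that occur more than once.
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}


if __name__ == "__main__":
    for digest, paths in find_duplicates(".").items():
        for path in paths:
            print(digest, path)
```

Run from the repository root, this prints one line per redundant copy, which helps decide which of the image directories could be dropped or generated on demand.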
In en/trans/structurmets2page.dita, we could add the Page/@type types with their DFG Strukturdatenset counterparts (some of which are already covered in en/trans/structur_gtpageformat.dita, perhaps because they were also in the Zot format already):
Also, why is mets:div/@type=table likened to pc:TextRegion/@type=heading and not @type=caption (or pc:TableRegion directly)? (Same probably goes for mets:div/@type=map vs. caption / pc:MapRegion, as well as mets:div/@type=musical_notation vs. caption or pc:MusicRegion.)
Also, where is illustration?
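To make the alternative raised in these questions concrete, the direct mapping could be written down as a small lookup table. This is purely a sketch of the suggestion, not the mapping currently documented in the guidelines; the names SUGGESTED_REGION_FOR_DIV_TYPE and suggested_region are hypothetical, and only the three pairs mentioned above are included.

```python
from typing import Optional

# Hypothetical table: mets:div/@type -> PAGE region element, covering only
# the three pairs proposed in the questions above.
SUGGESTED_REGION_FOR_DIV_TYPE = {
    "table": "TableRegion",
    "map": "MapRegion",
    "musical_notation": "MusicRegion",
}


def suggested_region(div_type: str) -> Optional[str]:
    """Return the proposed PAGE region for a mets:div type, or None if unmapped."""
    return SUGGESTED_REGION_FOR_DIV_TYPE.get(div_type)
```

A type such as illustration yields None here, which mirrors the gap pointed out above.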
Next, I would have expected that pc:GraphicRegion/@type gets mapped, too:
Furthermore, IIUC it seems plausible to also suggest mapping some of the mets:div types to pc:ReadingOrder types:
Generally, it would also help to strictly differentiate between structural types (what ENMAP calls contentUnit) and layout types (what ENMAP calls contentItem).
Lastly, how about also collecting concordance between mets:div types and alto:LayoutTags and alto:StructureTags? I can see many similar entities in the official documentation. Perhaps a full discussion of this would also need to include the various ENMAP profiles...