
gt-mufilevelrules's Introduction

gt-MufiLevelRules

Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by MUFI: The Medieval Unicode Font Initiative.

The resulting OCR-D level rules conform to the OCR-D specification. These rules can be used for substitutions or level checks, among other things.

Note:

  • Not every character has a definition on every level, especially on level 1.
  • OCR-D will try to fill these gaps manually or automatically. The automated completion is based on the unicruft program.
  • For this reason, using the rules to normalize characters automatically from level 3 or level 2 to level 1 is not currently recommended until the corresponding rules have been checked and corrected manually.

Download the Rules

🚦 You can download the set of rules here. 🚦

Recreation of the rules

  1. Copy or clone the repository:

    git clone https://github.com/tboenig/gt-MufiLevelRules.git

  2. Install Saxon for XSL Transformations v3.0. Then simply run with:

    java -jar saxon-he-XX.jar -xsl:scripts/MufiGTLevelRules2.xsl -s:scripts/MufiGTLevelRules.xsl output=characters merge=yes

Parameters:

  • output=characters -> creates the rules; all rules are saved in the directory [directory]/rules/characters
  • merge=yes -> creates the megarules (all rules in one file); the megarules are saved in the directory [directory]/rules

The result of the conversion can be found in the directory: [directory]/rules/characters.

  • Output Format:
    • xml
    • json

The script uses:

  1. the MUFI rules (new version) and the MUFI rules (old version)

  2. a summary of the following additional rules from the OCR-D Ground-Truth Transcription Guide, which take precedence over the MUFI rules where applicable:

Description of the rules

JSON Format

All JSON files (both the pure MUFI rules and the final result) follow the same schema.

Example:

 {"ruleset":[
       ...
       {"rule": ["ä", "", ""], "type": "level"}
       ...
]}

  • Each rule has the key rule, whose value is a list.
  • The list defines the character representation on each of the 3 transcription levels:
    • Level 1 is in the first position
    • Level 2 is in the second position
    • Level 3 is in the third position
  • Additional key-value combinations: ...
  • A character value can be empty, which signifies that there is no definition (representation) on that level.
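
The positional layout described above can be turned into a substitution table. The following is a minimal sketch, not part of this repository: the function names `build_substitutions` and `normalize` are hypothetical, and the two sample entries are made up for illustration rather than taken from the generated rules. It treats an empty value as "no definition" and skips such rules instead of deleting characters. Keep the note above in mind: normalizing to level 1 is only advisable once the rules have been checked manually.

```python
import json

# Inline sample in the documented shape. The entries are illustrative
# assumptions, not copied from the generated rule files.
SAMPLE = '''
{"ruleset": [
    {"rule": ["s", "ſ", "ſ"], "type": "level"},
    {"rule": ["", "ꝛ", "ꝛ"], "type": "level"}
]}
'''


def build_substitutions(ruleset, source_level, target_level):
    """Map each source-level string to its target-level string.

    Levels are 1-based, matching the positions in the "rule" list.
    Rules with an empty value on either level are skipped, because an
    empty string means "no definition", not "delete this character".
    """
    table = {}
    for entry in ruleset:
        levels = entry["rule"]
        source = levels[source_level - 1]
        target = levels[target_level - 1]
        if source and target and source != target:
            table[source] = target
    return table


def normalize(text, table):
    """Apply the substitutions, longest source string first."""
    for source in sorted(table, key=len, reverse=True):
        text = text.replace(source, table[source])
    return text


rules = json.loads(SAMPLE)["ruleset"]
to_level_1 = build_substitutions(rules, source_level=2, target_level=1)
print(normalize("Waſſer", to_level_1))
```

Sorting the keys by length before replacing is a design choice for rules whose source is a multi-character sequence, so that a longer sequence is not broken up by a shorter rule that matches its prefix.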

XML Format

<levelrules>
    <ruleset>
        <range>AlphPresForm</range>
        <desc>LATIN SMALL LIGATURE FF</desc>
        <rule>ff</rule>
        <rule>ff</rule>
        <rule>ff</rule>
        <type>level</type>
    </ruleset>
</levelrules>

  • Elements
    • <levelrules> = root element of a gt-MufiLevelRules dataset
      • <ruleset> = root element of a ruleset
        • <range> = category of characters
        • <desc> = general description of the sign or symbol
        • <rule> = character representation, one element per level
          • Level 1: rule[position() = 1]
          • Level 2: rule[position() = 2]
          • Level 3: rule[position() = 3]

The category of characters <range> and the general description of the sign or symbol <desc> were imported from the MUFI dataset.
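
Because the level of a `<rule>` element is given only by its position, a reader of the XML has to preserve document order. The sketch below shows one way to do that with Python's standard `xml.etree.ElementTree`; the function name `read_rulesets` and the dictionary keys are hypothetical, and the sample is the XML example shown above.

```python
import xml.etree.ElementTree as ET

# The XML example from this section, used as inline test data.
SAMPLE_XML = """
<levelrules>
    <ruleset>
        <range>AlphPresForm</range>
        <desc>LATIN SMALL LIGATURE FF</desc>
        <rule>ff</rule>
        <rule>ff</rule>
        <rule>ff</rule>
        <type>level</type>
    </ruleset>
</levelrules>
"""


def read_rulesets(xml_text):
    """Return one dict per <ruleset>, with <rule> values in document order."""
    root = ET.fromstring(xml_text.strip())
    rulesets = []
    for node in root.iter("ruleset"):
        rulesets.append({
            "range": node.findtext("range", default=""),
            "desc": node.findtext("desc", default=""),
            # findall() keeps document order: index 0 is level 1,
            # index 2 is level 3. An empty <rule/> has text None,
            # which is normalized to "" (no definition).
            "levels": [rule.text or "" for rule in node.findall("rule")],
            "type": node.findtext("type", default=""),
        })
    return rulesets


for entry in read_rulesets(SAMPLE_XML):
    level_1, level_2, level_3 = entry["levels"]
    print(entry["desc"], "->", level_1, level_2, level_3)
```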

The JSONPaths are:

  • range : $['..']['range']
  • desc : $['..']['description']

See Also

gt-mufilevelrules's People

Contributors

tboenig, github-actions[bot], bertsky


gt-mufilevelrules's Issues

invalid entries in MUFI import

I found two rules with more than 3 levels:

  • ['ꝟꝟ', 'ꝟ', 'ꝟ', 'ꝟꝟ'] in ./rules/characters/LatExtD.json
  • ['q;q;', 'q;', 'q;', ''] in ./rules/characters/PUA-5.json
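
Entries like these can be found with a small length check. This is a rough sketch only, assuming the rules are wrapped in the JSON schema described above; the function name `find_malformed_rules` is hypothetical, and the `"type": "level"` keys in the sample are assumed, since only the rule lists themselves were reported.

```python
def find_malformed_rules(ruleset, expected_levels=3):
    """Return (index, rule) pairs whose "rule" list has the wrong length."""
    malformed = []
    for index, entry in enumerate(ruleset):
        rule = entry.get("rule", [])
        if len(rule) != expected_levels:
            malformed.append((index, rule))
    return malformed


# One well-formed rule plus the two rule lists reported above.
sample = [
    {"rule": ["ä", "", ""], "type": "level"},
    {"rule": ["ꝟꝟ", "ꝟ", "ꝟ", "ꝟꝟ"], "type": "level"},
    {"rule": ["q;q;", "q;", "q;", ""], "type": "level"},
]

for index, rule in find_malformed_rules(sample):
    print(f"rule {index} has {len(rule)} levels: {rule}")
```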
