page2tei's Introduction

page2tei

PAGE2TEI was created and is maintained by Dario Kampkaspar and is licensed under the MIT license.

How to use

Apply page2tei-0.xsl to the METS file:

java -jar saxon9he.jar -xsl:page2tei-0.xsl -s:mets.xml -o:[your tei file].xml

Additional stylesheets can be applied to the output created by the basic transformation:

  • combine-continued.xsl (or set parameter combine=true()) — try to combine entities that are split over a line break into one element
  • simplify-coordinates.xsl (parameter bounding-rectangles=true() by default) — convert polygons into bounding rectangles
  • tokenize.xsl (or set parameter tokenize=true()) — perform (very basic!) whitespace tokenization
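
The follow-up stylesheets take the TEI output of the basic transformation as their source. A minimal sketch of chaining them from a POSIX shell, assuming the Saxon jar and the stylesheets sit in the current directory; the function names `apply_xsl` and `page2tei_chain`, the `JAVA` and `SAXON_JAR` variables, the intermediate file names and the order of the follow-up steps are all assumptions made for illustration, not part of page2tei:

```shell
# Hypothetical helper: apply one stylesheet to one source file with Saxon.
# JAVA and SAXON_JAR are illustrative overrides, not part of page2tei.
apply_xsl() {
  # $1 = stylesheet, $2 = source file, $3 = output file
  "${JAVA:-java}" -jar "${SAXON_JAR:-saxon9he.jar}" "-xsl:$1" "-s:$2" "-o:$3"
}

# Basic transformation, then two of the optional follow-up stylesheets.
# The order of the follow-up steps is an assumption.
page2tei_chain() {
  # $1 = METS file, $2 = final TEI file
  apply_xsl page2tei-0.xsl "$1" "$2.base" &&
  apply_xsl combine-continued.xsl "$2.base" "$2.combined" &&
  apply_xsl tokenize.xsl "$2.combined" "$2"
}
```

Calling `page2tei_chain mets.xml tei.xml` would leave the intermediate results in `tei.xml.base` and `tei.xml.combined`, which can be inspected or deleted afterwards.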

Parameters

You can set the following parameters when calling page2tei-0.xsl (via the command line or an oXygen scenario; in oXygen, the parameters should be marked as “XPath”):

  • rs (default: true()): If true(), person/place/org tags are exported as tei:rs[@type]; if false(), they are exported as tei:persName etc.
  • tokenize (default: false()): Whether to run white space tokenization
  • combine (default: false()): Whether to combine entities over line breaks
  • ab (default: false()): If false(), region types that correspond to valid TEI elements will be returned as this element; types that do not correspond to a TEI element will be returned as tei:ab[@type]. If set to true(), all region types (except for paragraph, heading) will be returned as tei:ab.
  • word-coordinates (default: false()): If true(), export the (estimated) word coordinates to the facsimile section.
  • bounding-rectangles (default: true()): Whether to create bounding rectangles from polygons
  • withoutBaseline (default: false()): Whether to export lines without a baseline
  • withoutTextline (default: false()): Whether to export regions without text lines
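
On the Saxon command line, stylesheet parameters are given as trailing `name=value` arguments, and a leading `?` tells Saxon to evaluate the value as an XPath expression, which corresponds to marking a parameter as “XPath” in oXygen. The sketch below wraps the basic transformation so that parameters can be appended; the function name `page2tei` and the `JAVA` and `SAXON_JAR` variables are hypothetical, and the jar name is taken from the usage example above:

```shell
# Hypothetical wrapper around the basic transformation.
# JAVA and SAXON_JAR are illustrative overrides, not part of page2tei.
page2tei() {
  # $1 = METS file, $2 = TEI output; further arguments are stylesheet parameters
  mets=$1
  out=$2
  shift 2
  "${JAVA:-java}" -jar "${SAXON_JAR:-saxon9he.jar}" \
    -xsl:page2tei-0.xsl "-s:$mets" "-o:$out" "$@"
}

# Example call; the single quotes keep the shell from interpreting
# the question mark and the parentheses:
#   page2tei mets.xml tei.xml '?tokenize=true()' '?combine=true()' '?rs=false()'
```

Passing the values as XPath expressions is probably the safer choice here: a plain `tokenize=false` would arrive as the string "false", which may not behave like the boolean `false()` inside the stylesheet.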

Contributors

  • @tboenig
  • @peterstadler
  • @tillgrallert

Some contributions to this software were created within the scope of a project funded by the German BMBF, project ID 16TOA015A.

page2tei's Issues

Document usage of @custom attribute

The script makes assumptions about the conventional usage of @custom in PAGE XML, apparently based on how Transkribus uses it (?). It would be helpful to document these assumptions, as well as the supported @type values.

creation of seriesStmt should probably be publicationStmt

Running the parameterize branch on my files, the default output for <fileDesc> looks like:

      <fileDesc>
         <titleStmt><title type="main">GB-LET-GB-FB-1921-01</title></titleStmt>
         <seriesStmt><title>Evolving Hands</title></seriesStmt>
         <sourceDesc>
            <bibl>
                <title type="main">GB-LET-GB-FB-1921-01</title>
               <idno type="Transkribus">1172123</idno>
            </bibl>
         </sourceDesc>
      </fileDesc>

but because <publicationStmt> is a mandatory TEI element, it should probably look something like:

    <fileDesc>
         <titleStmt><title type="main">GB-LET-GB-FB-1921-01</title></titleStmt>
         <publicationStmt>
           <p>Created from Evolving Hands Transkribus Collection</p>
         </publicationStmt>
         <sourceDesc>
            <bibl>
                <title type="main">GB-LET-GB-FB-1921-01</title>
               <idno type="Transkribus">1172123</idno>
            </bibl>
         </sourceDesc>
      </fileDesc>

(though that <p> could also be a <publisher>). Since people tend to put the project name in the field that Transkribus outputs, turning the seriesStmt into a publicationStmt that contains the 'collection' information, such as 'Created from ColName Transkribus Collection', would probably be a reasonable way to make the teiHeader valid.
This would be at:
https://github.com/dariok/page2tei/blob/4d9e66b6a8e7bf852dd9ef1cf9639962c42887e2/page2tei-0.xsl#L135C28-L137 and following.

export of TableCell suppressed

My output from Transkribus looks like

<TableRegion id="Table_1572948924800_4" custom="readingOrder {index:0;}">
    <Coords points="85,74 3981,55 4041,4922 93,4939"/>
    <TableCell row="0" col="0" rowSpan="1" colSpan="1" leftBorderVisible="true" rightBorderVisible="true" topBorderVisible="true" bottomBorderVisible="true" orientation="0.0" id="TableCell_1572948967045_27">
        <Coords points="86,91 83,494 484,483 485,81"/>
        <TextLine id="TableCell_1572948967045_27l1" custom="readingOrder {index:0;}">
            <Coords points="93,212 161,219 186,221 212,222 237,223 263,222 288,220 314,218 339,216 365,213 379,229 443,257 443,172 390,126 365,128 339,131 314,133 288,135 263,137 237,138 212,137 186,136 164,95 96,88"/>

which, when transformed with page2tei-0.xsl, misses the information about the TableCell, i.e. no corresponding <tei:zone> element is created.

To me, this looks like a bug and I'm happy to send in a PR, yet several lines, e.g.

<xsl:when test="local-name(parent::*) = 'TableCell'">TableCell</xsl:when>

and
<xsl:apply-templates select="p:TableCell//p:TextLine" mode="facsimile" />
seem to indicate that this was done on purpose?

Support tei:l

Currently, export to <l>...</l> is an option.
While not generally correct (<l> is supposed to be used for verse only), it should probably be supported, both for correct uses and for backwards compatibility.

Use <ab> instead of <p>

By wrapping each page in a <p> element, this stylesheet assumes a semantic model that just doesn't correspond to the input. It would be better to use <ab> ("anonymous block" in TEI parlance) if one absolutely wants to wrap each page in a single element.

page2tei produces wrong output when the input contains more than one consecutive abbrev

Example C_0001:
TEI export by Transkribus (also not correct):

<lb facs="#facs_1_r1035" n="N001"/>Die Classe beruft sich dagegen auf den von <choice><expan>Seiner</expan><abbr>S.<choice><expan>kaiserlichen</expan><abbr/></choice></abbr></choice><choice><expan>kaiserlichen</expan><abbr>k.<choice><expan>Hoheit</expan><abbr/></choice></abbr></choice><choice><expan>Hoheit</expan><abbr>H.</abbr></choice> dem Durchlauchtig-

Output of page2tei-0.xsl:

<lb facs="#facs_1_r1035" n="N001"/>Die Classe beruft sich dagegen auf den von <choice><expan>Seiner</expan><abbr>S.<choice><expan>kaiserlichen</expan><abbr/></choice></abbr></choice><choice><expan>kaiserlichen</expan><abbr>k.

saxon9he.jar download

Hello,

Is there a source or link from which saxon9he.jar can be downloaded?

Many thanks in advance

Graphic to <figure>

Is it possible for regions marked in the Transkribus metadata as "selected element type: graphic" to appear in the <body> as <figure>?

Empty Output

Hello all,

I'm having some trouble running this both in oXygen and from the command line. In both cases, when I run a command to transform based on the METS file I get a blank TEI 'skeleton' as output like so:

<?xml version="1.0" encoding="UTF-8"?><TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      <fileDesc>
         <titleStmt/>
         <seriesStmt/>
         <sourceDesc>
            <bibl/>
         </sourceDesc>
      </fileDesc>
      <profileDesc>
      </profileDesc>
   </teiHeader>
   <facsimile>
   </facsimile>
   <text>
      <body>
      </body>
   </text>
</TEI>

I'm using the following command to transform in Zsh: java -jar ~/saxon/saxon-he-12.4.jar -xsl:page2tei/page2tei-0.xsl -s:m1-8-1/METS.xml -o:m1-8-1/tei-m1-8-1.xml and the PAGE files are generated automatically using eScriptorium, if that makes a difference. I've not got much experience working with PAGE XML but it seems to validate okay...

You can find the METS file I'm trying to transform here.

Thanks in advance,
Joshua

missing attributes

<TextLine id="r1l3" custom="readingOrder {index:2;} datum {offset:4; length:13;datum:1696-02-15;} persoonsnaam {offset:33; length:12; continued:true;}">

results in the following line, without the property/attribute 'datum':

<lb facs="#facs_2_r1l3" n="N003"/>die <datum>15 febr. 1696</datum> gehuwd was met <persoonsnaam>Aletta Catha</persoonsnaam>

Image names with space

If we take the image name as identifier and it contains a space - like in this example
<pb facs='#facs_1' xml:id='JRL MS_551_1_p01.jpg' n='1'/>
we get an invalid TEI XML. Should we use an artificial identifier instead, e.g. IMG_1, IMG_2,...?

use of several parameters

Parameter name, description and default:

  • bounding-rectangles (default: true()): Export polygons as bounding rectangles.
  • rs (default: true()): Export the person, place and org tags as tei:rs[@type] (true) or as tei:persName etc.?
  • tokenize (default: false()): Perform whitespace tokenization?
  • combine (default: false()): Merge continued tags (@continued)?
  • ab (default: false()): If true, all regions are output as tei:ab with @type (except paragraph → tei:p and heading → tei:head); if false, regions whose type corresponds to a TEI element are translated into that element.
  • word-coordinates (default: false()): If true, the word coordinates are copied into tei:facsimile//tei:zone.
