malamatenia / eutyches

This repository hosts all the documents necessary for the evaluation of my M2 dissertation in Digital Humanities at the ENC.

License: Apache License 2.0

XSLT 0.38% Jupyter Notebook 87.18% HTML 10.67% TeX 1.77%

dataset distant-reading gloss htr kraken latin manuscripts marginalia eutyches yaltai


eutyches's Issues

thoughts on lemma-gloss attribution for words that break across two lines

Since I haven't tagged my lemmas locally, I depend on an external document to identify and tag them (probably not the best practice).

In the case where a word is freestanding, identifying the lemma should not pose any problems with a given "list of lemmas" and a tokenized text.

BUT

what happens with words that break onto the following line?
If I join all the lines into one continuous text (by removing the ¬) and thus restore the broken words, I lose line-break information that exists inherently and purposefully in my ALTO files.

I should consider a better practice for lemma-gloss attribution: inserting "lemma markers" locally while transcribing, as suggested by Prof. Clérice, and then writing a clean_text script to restore the ground truth.
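One possible middle ground, sketched below, is to rejoin the broken words while recording which line(s) each token came from, so that the line information is not lost. This is only an illustration: the function names `tokens_with_lines` and `find_lemmas` are hypothetical, and it assumes the transcription is available as a list of plain-text lines in which a line-final ¬ marks a broken word.

```python
# Hypothetical helpers (not part of the repository): rejoin words split by a
# line-final "¬" while remembering the line(s) each token came from.
HYPHEN = "¬"


def tokens_with_lines(lines):
    """Return (token, [line_numbers]) pairs; line numbers are 1-based."""
    result = []
    pending = None  # (partial_word, [line_numbers]) carried over from a broken line
    for number, line in enumerate(lines, start=1):
        words = line.split()
        for position, word in enumerate(words):
            if pending is not None and position == 0:
                # The first word of this line completes the broken word.
                word = pending[0] + word
                origin = pending[1] + [number]
                pending = None
            else:
                origin = [number]
            is_last = position == len(words) - 1
            if is_last and word.endswith(HYPHEN):
                # Line-final break marker: hold the fragment for the next line.
                pending = (word[:-1], origin)
            else:
                result.append((word, origin))
    if pending is not None:
        # Dangling break at the end of the input: keep the fragment as it is.
        result.append(pending)
    return result


def find_lemmas(lines, lemmas):
    """Return the tokens that appear in the lemma list, with their line numbers."""
    wanted = set(lemmas)
    return [(tok, origin) for tok, origin in tokens_with_lines(lines) if tok in wanted]
```

With this approach a lemma that straddles two lines is reported with both line numbers, so the attribution can still point back to the original ALTO lines.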

check_Ontology_strings

Following today's attempt to check manually whether all of my messy lines are characterized, I need a check_string script that verifies every line is characterized and reports the name of the .XML file in which any omission occurs.

Otherwise, text_extraction depending on SegmOnto doesn't give the whole text(s).
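A minimal sketch of such a check follows. It assumes the ALTO files declare their line types in a `Tags` section (elements such as `OtherTag` with `ID` and `LABEL` attributes) and that each `TextLine` points to one of them through `TAGREFS`; if the files are structured differently, the lookup needs adapting. The names `untyped_lines` and `check_folder` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(element):
    """Strip the XML namespace so the check works across ALTO versions."""
    return element.tag.rsplit("}", 1)[-1]


def untyped_lines(root):
    """Return the IDs of TextLine elements whose TAGREFS resolve to no label."""
    # Assumption: tag declarations are elements named *Tag with ID and LABEL.
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local_name(el).endswith("Tag") and el.get("ID")
    }
    missing = []
    for el in root.iter():
        if local_name(el) != "TextLine":
            continue
        refs = (el.get("TAGREFS") or "").split()
        if not any(labels.get(ref) for ref in refs):
            missing.append(el.get("ID", "<no ID>"))
    return missing


def check_folder(folder):
    """Map each file name to its untyped line IDs; files without omissions are skipped."""
    report = {}
    # Note: the pattern is case-sensitive on most systems; adjust for ".XML".
    for path in sorted(Path(folder).glob("*.xml")):
        missing = untyped_lines(ET.parse(path).getroot())
        if missing:
            report[path.name] = missing
    return report
```

Calling `check_folder` on the folder that holds the ALTO files and printing the returned dictionary gives one entry per file with an omission, which is the message the issue asks for.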

first_commit

Making my first commit with the non-finalized data.
To do (a lot):
-finalize transcription norms (desinences, punctus elevatus, space/dot ambiguity, maybe comma?)
-complete the Excel sheet with hand characterization and granularity
-pipeline modeling
-Python scripts for ALTO manipulation
-test automatic gloss-lemma attribution
-LaTeX documentation.

difficulties to consider in my MainZone(s) DefaultLine and InterlinearLine extraction

Need to define the best compromise between Line and Zone extraction in order to better annotate my manuscript(s).

Three of my pages pose a significant layout problem: the order of the text is disturbed, and blocks of text need to be read vertically instead of horizontally.

Currently, I assign a new MainZone#n to each block of text (such as a column) whose reading order I want to establish or redirect. It is important to keep this information, as it can be useful for studying the transmission of the text.

Examples:
Leyden, VossLat041, folios 4v and 5r

The problem occurs when I try to extract the main text, which lies in the DefaultLines of the multiple ordered MainZones, together with the glosses that correspond to each line of each zone.
I need the text to appear in that specific order, because my fiche de récollement gives, for each gloss, the exact line (after ordering) of the lemma associated with it.

I need a text_extraction script that extracts the DefaultLines of every MainZone in order (MainZone, then MainZone#1-6) and then does the same for the InterlinearLines, so that (at some point, somehow) I can automatically tag and associate the lemmas and the glosses following the information in my fiche.
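The ordered extraction described above could be sketched as follows. The sketch assumes that zone and line types are declared in the ALTO `Tags` section and referenced through `TAGREFS`, that zone labels are exactly `MainZone` or `MainZone#n`, and that line labels are exactly `DefaultLine` or `InterlinearLine`. The function names `zone_rank` and `extract_lines` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip the XML namespace from an element's tag."""
    return element.tag.rsplit("}", 1)[-1]


def zone_rank(label):
    """'MainZone' -> 0, 'MainZone#3' -> 3; None for any other label."""
    match = re.fullmatch(r"MainZone(?:#(\d+))?", label or "")
    if match is None:
        return None
    return int(match.group(1) or 0)


def extract_lines(root, line_type):
    """Return (zone_rank, text) pairs for lines of one type, ordered by zone."""
    # Assumption: tag declarations are elements named *Tag with ID and LABEL.
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local_name(el).endswith("Tag") and el.get("ID")
    }

    def label_of(element):
        for ref in (element.get("TAGREFS") or "").split():
            if ref in labels:
                return labels[ref]
        return None

    zones = []
    for block in root.iter():
        if local_name(block) != "TextBlock":
            continue
        rank = zone_rank(label_of(block))
        if rank is None:
            continue  # not a MainZone (e.g. a margin zone)
        texts = []
        for line in block:
            if local_name(line) == "TextLine" and label_of(line) == line_type:
                words = [s.get("CONTENT", "") for s in line if local_name(s) == "String"]
                texts.append(" ".join(words))
        zones.append((rank, texts))
    # Stable sort: zones are reordered by rank, lines keep their file order.
    zones.sort(key=lambda zone: zone[0])
    return [(rank, text) for rank, texts in zones for text in texts]
```

Running `extract_lines(root, "DefaultLine")` and then `extract_lines(root, "InterlinearLine")` on the same parsed file yields two lists in the same zone order, which could then be matched against the line numbers recorded in the fiche.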
