malamatenia / eutyches
This repository hosts all the documents necessary for the evaluation of my M2 dissertation in Digital Humanities at the ENC.
License: Apache License 2.0
Since I haven't tagged my lemmas locally, I depend on an external document to identify and tag them (probably not the best practice).
When a word is freestanding, identifying the lemma should not pose any problem, given a "list of lemmas" and a tokenized text.
BUT
what happens with words that break onto the following line?
If I join all the lines into a continuous text (by removing the ¬) and thus restore the broken words, I lose the line-break information that my ALTO files deliberately record.
I should consider a better practice for the lemma-gloss attribution: inserting "lemma markers" locally while transcribing, as suggested by Prof. Clérice, and then writing a clean_text script to restore the ground truth.
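One possible way around the dilemma above is to rejoin the broken words while remembering which lines each word came from. The sketch below is only an illustration: the function name `join_broken_words` and the input format (a plain list of transcribed lines in reading order) are assumptions, not part of the existing scripts.

```python
HYPHEN = "¬"  # line-break marker used in the transcription


def join_broken_words(lines):
    """Return (word, [line numbers]) pairs, rejoining words split by HYPHEN.

    `lines` is a list of transcribed line strings in reading order
    (hypothetical input format). Line numbers are 1-based.
    """
    tokens = []
    pending = None  # (partial word, lines it spans) waiting for its continuation
    for line_no, line in enumerate(lines, start=1):
        words = line.split()
        for i, word in enumerate(words):
            if pending is not None:
                # The first word of this line completes the broken word.
                word = pending[0] + word
                span = pending[1] + [line_no]
                pending = None
            else:
                span = [line_no]
            is_last = i == len(words) - 1
            if is_last and word.endswith(HYPHEN):
                # Strip the marker and wait for the next line.
                pending = (word[: -len(HYPHEN)], span)
            else:
                tokens.append((word, span))
    if pending is not None:
        # Dangling marker on the final line: keep the fragment as it is.
        tokens.append(pending)
    return tokens


if __name__ == "__main__":
    for word, span in join_broken_words(["arma uirum¬", "que cano"]):
        print(word, span)
```

Each restored word keeps the list of lines it spans, so a lemma found in the continuous text can still be traced back to its original line(s) in the ALTO file.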
Following today's attempt to check manually whether all of my messy lines are characterized, I need a check_string script that verifies that every line is characterized and prints the name of the .XML file in which an omission occurs.
Otherwise, text_extraction, which depends on the SegmOnto types, doesn't return the whole text(s).
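A minimal sketch of such a check is given below. It assumes that the ALTO files store line types as `OtherTag` elements referenced through the `TAGREFS` attribute of each `TextLine` (as in eScriptorium-style exports); if the files encode types differently, the lookup has to be adapted. The names `untyped_lines` and `check_files` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def untyped_lines(root):
    """Return the IDs of TextLine elements that reference no declared tag."""
    # IDs of all tags declared in the <Tags> section (assumed layout).
    known = {
        el.get("ID") for el in root.iter() if local_name(el.tag) == "OtherTag"
    }
    missing = []
    for el in root.iter():
        if local_name(el.tag) != "TextLine":
            continue
        refs = (el.get("TAGREFS") or "").split()
        if not any(ref in known for ref in refs):
            missing.append(el.get("ID", "?"))
    return missing


def check_files(paths):
    """Print one message per file that contains uncharacterized lines."""
    for path in paths:
        missing = untyped_lines(ET.parse(path).getroot())
        if missing:
            print(f"{path}: {len(missing)} untyped line(s): {', '.join(missing)}")


if __name__ == "__main__":
    import sys

    check_files(sys.argv[1:])
```

Run on a folder of exports (for instance by passing the .xml paths on the command line), it prints nothing when every line carries a type.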
Making my first commit with the non-finalized data.
To do (a lot):
-finalize transcription norms (desinences, punctus elevatus, space/dot ambiguity -maybe comma?-)
-complete excel with hand characterization and granularity
-pipeline modeling
-Python scripts for ALTO manipulation
-test for automatic gloss-lemma attribution
-LaTeX documentation.
Need to define the best compromise between Line and Zone extraction in order to better annotate my manuscript(s).
Three of my pages pose a significant layout problem: the order of the text is disrupted and blocks of text need to be read vertically instead of horizontally.
Currently, I am assigning a new MainZone#n to portions of text, such as columns, whose reading order I want to establish or redirect. It is important to keep this information, as it can be useful for the transmission of the text.
Examples:
Leyden, VossLat041, folios 4v and 5r
The problem occurs when I try to extract the main text that lies in the DefaultLines of the multiple ordered MainZones and the glosses that correspond to each line of each zone.
I need the text to appear in that specific order, because in my fiche de récollement I record, for each gloss, the exact line (after ordering) of the lemma associated with it.
I need a text_extraction code that extracts the DefaultLines of every MainZone in an ordered fashion (MainZone, MainZone#1-6) and then does the same for the InterlinearLines, in order to (at some point, somehow) tag automatically and associate the lemmas and the glosses following the info in my fiche.
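The ordered extraction described above could be sketched as follows. This is a sketch under assumptions, not a finished text_extraction script: it supposes that zone and line types are declared as `OtherTag` elements with a `LABEL` such as `MainZone`, `MainZone#1` or `DefaultLine`, that `TextBlock` and `TextLine` point to them through `TAGREFS`, and that the text sits in the `CONTENT` attribute of `String` elements. The names `zone_rank` and `extract_ordered` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def zone_rank(label):
    """MainZone -> 0, MainZone#1 -> 1, ...; None for any other label."""
    match = re.fullmatch(r"MainZone(?:#(\d+))?", label)
    if match is None:
        return None
    return int(match.group(1) or 0)


def extract_ordered(root, line_type="DefaultLine"):
    """Return the lines of type `line_type`, zone by zone, in MainZone order."""
    # Map tag IDs to their labels (assumed <Tags>/<OtherTag> layout).
    labels = {
        el.get("ID"): el.get("LABEL", "")
        for el in root.iter()
        if local_name(el.tag) == "OtherTag"
    }

    def label_of(el):
        for ref in (el.get("TAGREFS") or "").split():
            if ref in labels:
                return labels[ref]
        return ""

    zones = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        rank = zone_rank(label_of(block))
        if rank is None:
            continue  # not a MainZone: skip margins, numbering, etc.
        lines = []
        for line in block:
            if local_name(line.tag) != "TextLine" or label_of(line) != line_type:
                continue
            words = [
                s.get("CONTENT", "") for s in line if local_name(s.tag) == "String"
            ]
            lines.append(" ".join(words))
        zones.append((rank, lines))

    # Stable sort: zones with the same rank keep their document order.
    zones.sort(key=lambda zone: zone[0])
    return [text for _, lines in zones for text in lines]
```

Calling `extract_ordered(root)` and then `extract_ordered(root, "InterlinearLine")` on the same parsed file gives the main text and the glosses in the same zone order, which is the order the line numbers in the fiche refer to.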