This repository is a sandbox in which to prototype tools for cleanup, transformation, and validation of data curated by editors of DIMEV: An Open Access, Digital Edition of the "Index of Middle English Verse". Researchers interested in Middle English verse should consult dimev.net, not this repository, as the XML source files in this repository are a snapshot and will not be updated. They are for testing only. Commentary is welcome.

The repository also hosts source files for an experimental new DIMEV website, built with Jekyll and hosted by GitHub Pages. All this is very much work in progress. An inspiration is Andrew Dunning's prototype for a digital edition of Richard Sharpe, A Handlist of Latin Writers of Great Britain and Ireland Before 1540.

Repository contents

artefacts/ Warnings, reports, and CSV artefacts of the scripts in scripts/. Transformed source data are written instead to docs/ for use by the Jekyll website builder.
DIMEV_XML/ DIMEV source files as of May 2023.
docs/ Source files and templates for a website. The contents of docs/_items/ are written by scripts/transform-Records.py.
schemas/ JSON schemas for validation of transformed source files.
scripts/ Python scripts for review and transformation of the files in DIMEV_XML. For details see comments at the head of each file.

Technical direction

Records.xml will be atomized (one file per <record>) to make effective use of git distributed version control. Data will be parsed to identify irregularities, remediated (manually where necessary), and written to a new consistent structure. For instance, any field that may be an array must be an array (even if an array of one). After migration, subsequent updates to any file must validate against a schema. Early prototypes of data files are in docs/_items/. An early prototype of the schema is schemas/records.json. Cross references (i.e., those <record> items without an @xml:id) will be handled differently, tbd.
Manuscripts.xml and MSSIndex.xml will be de-duplicated. Data will be atomized (one file per <item>), parsed, remediated, and written to a new consistent structure. For an early partial prototype, see the output of scripts/transform-Manuscripts.py. Inscriptions.xml and PrintedBooks.xml will be handled similarly. After migration, subsequent updates to any file must validate against a schema.
Bibliography.xml. Data will be parsed and remediated (as above), written to a standard bibliographic data format and imported to Zotero for distribution and curation on that platform. For a prototype of this conversion, see artefacts/bibliography.yaml; the schema is schemas/csl-data.json. To import tags we must target a format other than CSL JSON, per this discussion. Tags will be used to link bibliographic items to their objects, as in the Bodleian Library's bibliographical references for Western manuscripts. Links to on-line facsimiles of manuscripts will be handled differently, probably as a field within the data structure for manuscripts.
Glossary.xml tbd.
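
As a rough illustration of the atomization step described above for Records.xml, the following Python sketch splits a small sample into one item per <record>, skips records without an @xml:id (cross references), and stores every field that may repeat as an array, even an array of one. It is not taken from scripts/transform-Records.py. The child element names (name, title), the ARRAY_FIELDS set, the sample data, and the choice of JSON as the output format are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical: fields that may repeat and so are always written as arrays.
ARRAY_FIELDS = {"title"}

# Invented sample data; the real Records.xml has a richer structure.
SAMPLE = """<records>
  <record xml:id="record-1">
    <name>Example first line</name>
    <title>Example title</title>
  </record>
  <record>
    <name>A cross reference, with no xml:id</name>
  </record>
</records>"""


def atomize(xml_text):
    """Return a dict mapping each record's xml:id to a dict of its fields."""
    root = ET.fromstring(xml_text)
    items = {}
    for record in root.iter("record"):
        record_id = record.get(XML_ID)
        if record_id is None:
            # Cross references are to be handled differently (tbd); skip them.
            continue
        item = {}
        for child in record:
            value = (child.text or "").strip()
            if child.tag in ARRAY_FIELDS:
                # Always an array, even if it holds a single value.
                item.setdefault(child.tag, []).append(value)
            else:
                item[child.tag] = value
        items[record_id] = item
    return items


def write_items(items, out_dir):
    """Write one JSON file per record, named after its xml:id."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for record_id, item in items.items():
        text = json.dumps(item, indent=2, ensure_ascii=False)
        (out / f"{record_id}.json").write_text(text, encoding="utf-8")
```

Calling write_items(atomize(SAMPLE), "some-output-directory") would leave a single file, record-1.json, in that directory. Keeping one file per record means that a change to one record shows up in git as a change to one small file.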

icornelius / dimev-sandbox
