
dimev-sandbox

This repository is a sandbox in which to prototype tools for cleanup, transformation, and validation of data curated by editors of DIMEV: An Open Access, Digital Edition of the "Index of Middle English Verse". Researchers interested in Middle English verse should consult dimev.net, not this repository, as the XML source files in this repository are a snapshot and will not be updated. They are for testing only. Commentary is welcome.

The repository also hosts source files for an experimental new DIMEV website, built with Jekyll and hosted by GitHub Pages. All this is very much work in progress. An inspiration is Andrew Dunning's prototype for a digital edition of Richard Sharpe, A Handlist of Latin Writers of Great Britain and Ireland Before 1540.

Repository contents

  • artefacts/ Warnings, reports, and CSV artefacts produced by the scripts in scripts/. Transformed source data are written to docs/ instead, for use by the Jekyll website builder.
  • DIMEV_XML/ DIMEV source files as of May 2023.
  • docs/ Source files and templates for a website. The contents of docs/_items/ are written by scripts/transform-Records.py.
  • schemas/ JSON schemas for validation of transformed source files.
  • scripts/ Python scripts for review and transformation of the files in DIMEV_XML. For details see comments at the head of each file.

Technical direction

  • Records.xml will be atomized (one file per <record>) to make effective use of Git distributed version control. Data will be parsed to identify irregularities, remediated (manually where necessary), and written to a new consistent structure. For instance, any field that may be an array must be an array (even if an array of one). After migration, subsequent updates to any file must validate against a schema. Early prototypes of data files are in docs/_items/. An early prototype of the schema is schemas/records.json. Cross references (i.e., those <record> items without an @xml:id) will be handled differently, tbd.
  • Manuscripts.xml and MSSIndex.xml will be de-duplicated. Data will be atomized (one file per <item>), parsed, remediated, and written to a new consistent structure. For an early partial prototype, see the output of scripts/transform-Manuscripts.py. Inscriptions.xml and PrintedBooks.xml will be handled similarly. After migration, subsequent updates to any file must validate against a schema.
  • Bibliography.xml. Data will be parsed and remediated (as above), written to a standard bibliographic data format and imported to Zotero for distribution and curation on that platform. For a prototype of this conversion, see artefacts/bibliography.yaml; the schema is schemas/csl-data.json. To import tags we must target a format other than CSL JSON, per this discussion. Tags will be used to link bibliographic items to their objects, as in the Bodleian Library's bibliographical references for Western manuscripts. Links to on-line facsimiles of manuscripts will be handled differently, probably as a field within the data structure for manuscripts.
  • Glossary.xml: tbd.
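
The atomization and array rule described for Records.xml can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/transform-Records.py: the sample XML, the element names (title, subject), the choice of JSON as output format, and the names ARRAY_FIELDS, record_to_dict, and atomize are all assumptions made for the example.

```python
import json
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# ElementTree exposes the xml:id attribute under the fixed XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample; these element names are assumptions, not the real DIMEV markup.
SAMPLE = """<records>
  <record xml:id="record-1">
    <title>First sample title</title>
    <subject>lyric</subject>
  </record>
  <record xml:id="record-2">
    <title>Second sample title</title>
    <subject>carol</subject>
    <subject>nativity</subject>
  </record>
  <record>
    <title>Cross reference without an xml:id</title>
  </record>
</records>"""

# Assumption: the fields that may repeat, and so must always be arrays.
ARRAY_FIELDS = {"subject"}


def record_to_dict(record):
    """Convert one <record> element to a dict with a consistent shape."""
    data = {"id": record.get(XML_ID)}
    for field in ARRAY_FIELDS:
        data[field] = []  # always an array, even when empty or of length one
    for child in record:
        text = (child.text or "").strip()
        if child.tag in ARRAY_FIELDS:
            data[child.tag].append(text)
        else:
            data[child.tag] = text
    return data


def atomize(xml_text, out_dir):
    """Write one JSON file per identified <record>; count skipped cross references."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, skipped = [], 0
    for record in ET.fromstring(xml_text).iter("record"):
        if record.get(XML_ID) is None:
            skipped += 1  # cross references are to be handled separately
            continue
        data = record_to_dict(record)
        path = out_dir / f"{data['id']}.json"
        path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
        written.append(path)
    return written, skipped


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files, skipped = atomize(SAMPLE, tmp)
        print(sorted(p.name for p in files), "skipped:", skipped)
```

Writing one small file per record keeps each change to a record visible as its own diff, and forcing repeatable fields into arrays means a schema can declare a single type for each field.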

Contributors

  • icornelius
