Git Product home page Git Product logo

yearspans's Introduction

yearspans

Utility to determine start/end years from textual expressions of year periods. Note this is a Python reworking of some earlier C# .NET project work (https://github.com/cbinding/timespans)

Background

Archaeological dataset records often use a textual expression of dating rather than absolute numeric years for the dating of artefacts. These textual data values can be in a variety of formats and different languages. There can be prefixes present such as Circa, Early, Mid, Late - and suffixes such as A.D., B.C., C.E., B.C.E., B.P. that may influence the dates intended. This situation presents potential problems for temporal comparison of records in a single dataset, but also introduces wider data integration issues, as illustrated in the table of examples below:

Category Language Expresssion
Ordinal century Dutch Begin 11e eeuw voor Christus
English Circa Second Century BC
French Début du 11e siècle avant JC
German Frühes elfte Jahrhundert v
Italian XV secolo d.C.
Norwegian Tidlig ellevte århundre e.Kr.
Spanish Principios del siglo XI d.C.
Swedish Tidigt elfte århundrade f.Kr.
Welsh Canol y 15fed ganrif
Year span English 1450-1460
English 1485-86
Single year (with tolerance) English C. 1485
English 1540±9
English AD400+
English 400 AD
Decade English Circa 1860s
Italian intorno al decennio 1910
Welsh 1930au
Century span English 5th – 6th century AD
Italian VIII-VII secolo a.C.
Welsh 5ed 6ed ganrif
Month and year English July 1855
Italian Luglio 1855
Welsh Gorffennaf 1855
Season and year English Summer 1855
Italian Estate 1855
Welsh Haf 1855
Named periods (from lookup) English Georgian
English Victorian

Normalising such data can make subsequent search and comparison of records easier and more accurate. We can do this by supplementing the original data values with additional attributes defining the start and end dates of the timespan. This application attempts to match textual values representing timespans to a number of known patterns, and from there to derive the intended start/end dates of the timespan. For some cases the start/end dates are present and may be extracted directly from the textual string, however in most cases a degree of additional processing is required after the initial pattern match is made. The output facilitates better comparison of textual date spans as often expressed in datasets. Due to the wide variety of possible formats (including punctuation and spurious extra text or white space), the matching patterns developed cannot comprehensively cater for every possible free-text variation present, so any remaining records not processed by this initial automated method can be manually reviewed and assigned suitable start/end dates.

Century Subdivisions

The output dates produced are derived relative to Common Era (CE). Centuries are considered to start at year 1 and end at year 100. Prefix modifiers for centuries take the following meaning (in this application):

Prefix Start End
Early* 1 40
Mid* 30 70
Late* 60 100
First Half 1 50
Second Half 51 100
First Quarter 1 25
Second Quarter 26 50
Third Quarter 51 75
Fourth Quarter 76 100

*Note the boundaries of Early, Mid and Late overlap, suggesting that a level of approximation is intended when using such terms.

In the case of decades, centuries or stated tolerances, an offset is added or subtracted from the initial extracted year in order to interpret the overall extents of the year span being expressed. (e.g. 1540±9 = start year 1531, end year 1549)

For matches on known named periods (e.g. Georgian, Victorian etc.) the start/end years are derived from suitable authority list lookups.

Usage

Command:

python yearspanmatcher.py -i "{input}" [-l "{language}"]
python yearspanmatcher.py -i "Early 2nd Century" -l "en"

Input (required)

The textual timespan expression to be processed. The matching patterns employed are all case insensitive, e.g. 2nd Century AD and 2nd century ad would yield identical results.

Language (optional)

The ISO639-1 two character language code corresponding to the language of the input data. This hints to the underlying matching process the most appropriate matching patterns to use. Languages currently supported are:

  • Czech ('cs')
  • Dutch ('nl')
  • English ('en') [default]
  • Italian ('it')
  • German ('de')
  • French ('fr')
  • Norwegian ('no')
  • Spanish ('es')
  • Swedish ('sv')
  • Welsh('cy')

If the language parameter is omitted or is not one of these recognised values then the default used will be en (English).

Example Output

input language min year max year
Early 2nd Century BC en -0199 -0159
1839-1895 en 1839 1895
1839-75 en 1839 1875
c. 1521 en 1521 1521
140-144 d.C. it 0140 0144
Inizio undicesimo secolo d.C. it 1001 1040
inizio del undicesimo alla fine del dodicesimo secolo d.C. it 1001 1200
III e lo II secolo a.C. it -0299 -0100
intorno a VI secolo d.C. it 0501 0600
575-400 a.C. it -0574 -0399
début 11ème à fin 12ème siècle après JC fr 1001 1200
georgienne à victorienne fr 1714 1901
dechrau'r 11eg i ddiwedd y 12fed ganrif OC cy 1001 1200
Canoloesol i Edwardaidd cy 1066 1910
Frühes elfte Jahrhundert n. Chr de 1001 1040
Völkerwanderungszeit de 0375 0586
Principios del siglo XI a.C. es -1099 -1059
finales del 1 ° a principios del 2 ° milenio d.C. es 0600 1400
Begin 11e eeuw voor Christus nl -1099 -1059
laat 1e tot begin 2e millennium na Christus nl 0600 1400
Tidlig på 1100-tallet e.Kr. no 1001 1040
1950-tallet no 1950 1959
Tidigt 1100-tal f.Kr. sv -1099 -1059
1250 - 57 e.Kr. sv 1250 1257
50. léta 20. století cs 1950 1959
počátek dvanáctého až konec jedenáctého století př. n. l. cs 1199 1000

Testing

A suite of tests using the Python unittest framework are located under the tests directory. There are 315 unit tests in all, covering the various categories of year span textual expressions in each supported language.

yearspans's People

Contributors

cbinding avatar

Stargazers

Alex Brandsen avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.