helmadik / lsjlogeion


LSJ as edited for Logeion at Chicago; please report corrections

Home Page: https://logeion.uchicago.edu

License: Other

lsjlogeion's People

pharos-alexandria angelodel80 jeremymarch jacknoutch isaacbennettsmith

lsjlogeion's Issues

Missing or Inconsistent spacing between words

I have found many instances of spaces that appear to be missing between words, yet this is not always the case. For example, consider the entry for 'ἃ ἃ'. The following XML sequence is in the file:

<quote lang="greek">ἃ ἃ δασυνθὲν γέλωτα δηλοῖ</quote><author>Hsch.</author>

For this, we display:

δηλοῖHsch.

This is because the XML has no space between the <quote> and <author> elements. Yet, in other places in the same entry, the space is present:

<author>Pl.Com.</author> 16</bibl> (prob. l.), etc.; <cit>...

Here we see a space after </author> and </bibl> and before <cit>.

Also in this example from the entry for 'ἀάβακτοι':

<quote lang="greek">-βυκτον</quote><author>Cyr.</author>

For this, we display:

"-βυκτονCyr."

I mention this here to benefit others rather than to request a change in the data: adding the spaces would be a huge task, and it would not actually benefit us. After experimenting with programmatic changes, I found that the Microsoft .NET implementation of XML support strips out spaces between elements when there are no other characters between them!

It may be possible to address this through the use of CSS.
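
As an alternative to CSS, the separator can be added while the XML is flattened to text. The following is only a minimal Python sketch, not the method used by Logeion or Perseus: the `SPACED_PAIRS` rule set and the `flatten` function are hypothetical names, and the choice of which tag pairs need a space is an assumption that would have to be checked against the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: adjacent sibling tags that should be separated
# by a space when the XML has no text between them.
SPACED_PAIRS = {("quote", "author"), ("quote", "bibl")}

def flatten(elem):
    """Return the text of elem, adding a space between listed tag pairs."""
    parts = [elem.text or ""]
    children = list(elem)
    for i, child in enumerate(children):
        parts.append(flatten(child))
        tail = child.tail or ""
        nxt = children[i + 1] if i + 1 < len(children) else None
        # Only patch the gap when nothing at all separates the two elements.
        if tail == "" and nxt is not None and (child.tag, nxt.tag) in SPACED_PAIRS:
            tail = " "
        parts.append(tail)
    return "".join(parts)

sample = '<cit><quote lang="greek">ἃ ἃ δασυνθὲν γέλωτα δηλοῖ</quote><author>Hsch.</author></cit>'
rendered = flatten(ET.fromstring(sample))
```

Restricting the fix to named tag pairs avoids inserting spaces inside words that happen to be split across inline elements.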

replacing < > pair into unicode left and right angle brackets

Is it okay to replace the < and > pair (<...>, used as angle brackets) with the Unicode left and right angle brackets (⟨...⟩)? This would visually distinguish angle brackets from the etymological relation sign (<, which is the same as the less-than sign).
I'm testing this on my local copy of grc.lsj.xml, which comes with the Diogenes software: I search-and-replaced the pairs and am hand-trimming to leave alone the isolated < signs that mark etymological relations.
Before I fork and commit, I'd like to know whether this goes against the encoding standards of Logeion or PDL.
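
The search-and-replace described above could be approximated as follows. This is a hedged Python sketch: `to_unicode_brackets` is a hypothetical helper, and it assumes it is applied to decoded text nodes only (never to raw markup, where it would rewrite the XML tags themselves). A stray "<" followed later by an unrelated ">" would still be paired wrongly, so hand-checking remains necessary.

```python
import re

# Matches a "<...>" pair with no other angle bracket inside it.
PAIRED = re.compile(r"<([^<>]*)>")

def to_unicode_brackets(text: str) -> str:
    """Replace paired <...> with U+27E8 / U+27E9; a lone '<' is left as-is."""
    return PAIRED.sub("\u27e8\\1\u27e9", text)

converted = to_unicode_brackets("a <b> c < d")
```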

Please add a license or a CC0 dedication

This repository does not contain a license, which means I can't use it in some projects.

Could you please add either a Creative Commons license or a CC0 or something like that in a file called LICENSE.md? For instance:

This work is licensed under Creative Commons Attribution-ShareAlike 4.0 International

Or:

This work is dedicated to the Public Domain under CC0 1.0 Universal

There are, of course, other options. But a clear license makes it possible to convince our lawyers to let us use this ...

Bacchylides work numbers: "Perseus:abo:tlg,0199,002:17:25"

LSJ sv τάλαντον has this reference to Bacchylides: δίκαϲ ῥέπει τάλαντον Bacchylides 17.25.

The markup says: Perseus:abo:tlg,0199,002:17:25

But work 002 is not in TLG-E, right?

Or maybe I have the wrong map of number-to-works. But, as you will see below, the LSJ data knows of 001, 002, and 004. This is not enough works...

As ever, thanks for all the great work you are doing with the lexicon.

===

hipparchiaDB=# select universalid, title from works where universalid like 'gr0199%';
 universalid |             title              
-------------+--------------------------------
 gr0199w010  | Dithyrambi
 gr0199w011  | Dithyramborum fragmenta
 gr0199w015  | Paeanes (fragmenta)
 gr0199w017  | Partheneia (titulus solum)
 gr0199w018  | Hyporchemata (fragmenta)
 gr0199w019  | Erotica (fragmenta)
 gr0199w020  | Encomia (fragmenta)
 gr0199w021  | Fragmenta ex operibus incertis
 gr0199w012  | Epinicia
 gr0199w013  | Epinicorum fragmenta
 gr0199w014  | Hymnorum fragmenta
 gr0199w016  | Prosodia (fragmenta)
 gr0199w022  | Fragmenta dubia
 gr0199w023  | Epigrammata
(14 rows)

I am using LSJ data via helmadik commit 9143e9e.

Bacchylides 003 does not exist in the dictionary data. Nor does 005, 006, 007...
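
The same tally can be made outside the database. This is a small Python sketch under the assumption that entry bodies are available as plain strings; `works_by_author` is a hypothetical helper, and the pattern only recognises URNs of the `Perseus:abo:tlg,AAAA,WWW` shape quoted above.

```python
import re
from collections import defaultdict

# Captures the four-digit author number and the three-digit work number.
URN = re.compile(r"Perseus:abo:tlg,(\d{4}),(\d{3})")

def works_by_author(entry_bodies):
    """Map each TLG author number to the sorted work numbers cited for it."""
    found = defaultdict(set)
    for body in entry_bodies:
        for author, work in URN.findall(body):
            found[author].add(work)
    return {author: sorted(works) for author, works in found.items()}

bodies = [
    '<bibl n="Perseus:abo:tlg,0199,002:17:25">',
    "Perseus:abo:tlg,0199,001",
    "Perseus:abo:tlg,0199,004",
]
cited = works_by_author(bodies)
```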

Here is every place where Bacchylides 002 appears:

hipparchiaDB=# select entry_name from greek_dictionary where entry_body ~* 'Perseus:abo:tlg,0199,002' order by entry_name;
  entry_name
---------------
 Λυταῖοϲ
 Παιάν
 Ποϲειδῶν
 βία
 βαρυαχήϲ
 βαρύβρομοϲ
 βαϲιλεύϲ
 βοάω
 βοῶπιϲ
 δάμαλιϲ
 δαίμων
 δαιμόνιοϲ
 δαμαϲίχθων
 δινεύω
 δνόφεοϲ
 δολιχαύχην
 δόλιοϲ
 εὐαίνετοϲ
 εὐδαίδαλοϲ
 εὐθυμία
 εὐρυβίαϲ
 εὐρυνεφήϲ
 εὐρυϲθενήϲ
 εὐφεγγήϲ
 εὔθρονοϲ
 εὔπακτοϲ
 εὔτυκτοϲ
 θείνω
 θελημόϲ
 θεόπομποϲ
 θράϲοϲ
 θραϲυκάρδιοϲ
 θραϲυμήδηϲ
 θυμάρμενοϲ
 κάλυμμα
 κέλευθοϲ
 κακομήχανοϲ
 καλλικέραϲ
 καλλιπάρηοϲ
 κεδνόϲ
 κελαδέω
 κλάζω
 κλέω¹
 κλυτόϲ
 κλύω
 κραταιόϲ
 κρατερόϲ
 κρόμμυον
 κυανόπρῳροϲ
 κῆρ
 λείριοϲ
 λεπτόπρυμνοϲ
 λινόϲτολοϲ
 λιπαρόϲ
 μέριμνα
 μήδομαι
 μεγαλοῦχοϲ
 μεγιϲτοάναϲϲα
 μενέκτυποϲ
 μενεπτόλεμοϲ
 μιμνήϲκω
 μυρίοϲ
 ναῦϲ
 ξεϲτόϲ
 οὔλιοϲ
 οὖν
 πίτνω
 παγκρατήϲ
 παλαίϲτρα
 πανδερκήϲ
 παρθενική
 πεδοιχνέω
 ποδαρκήϲ
 πολέμαιγιϲ
 πολέμαρχοϲ
 πολεμήϊοϲ
 πολυήρατοϲ
 πολύδακρυϲ
 πορφύρεοϲ
 πορϲύνω
 ποταίνιοϲ
 πρίν
 προκόπταϲ
 πρύτανιϲ
 πρώθηβοϲ
 πυνθάνομαι
 πυριέθειρα
 πυρϲόχαιτοϲ
 πότεροϲ
 τάλαντον
 τέθηπα
 τέμνω¹
 τίκτω
 ταλαπενθήϲ
 τε¹
 τεόϲ
 τηλαυγήϲ
 τιϲ
 τόϲοϲ
 φέρω
 φερεϲτέφανοϲ
 φιλάγλαοϲ
 φρενοάραϲ
 φυτεύω
 χαίτη
 χαλκεόκτυποϲ
 χαλκοκώδων
 χείρ
 χρυϲεόπλοκοϲ
 χρυϲόπεπλοϲ
 χρύϲαϲπιϲ
 χρύϲεοϲ
 ϲέλαϲ
 ϲεύω
 ϲοέω
 ϲτίλβω
 ϲτεφανηφόροϲ
 ϲχάζω
 ἀγακλεήϲ
 ἀγλαόθρονοϲ
 ἀγλαόϲ
 ἀδίαντοϲ
 ἀκάματοϲ
 ἀλλοδημία
 ἀμφί
 ἀμφιβάλλω
 ἀμφικύμων
 ἀμύνω
 ἀμύϲϲω
 ἀνακάμπτω
 ἀνθεμόειϲ
 ἀνθεμώδηϲ
 ἀρέταιχμοϲ
 ἀταρβομάχαϲ
 ἁβρόβιοϲ
 ἁλιναιέτηϲ
 ἁμαρτέω
 ἄκοιτοϲ
 ἄλϲοϲ
 ἄϲπετοϲ
 ἆ
 ἐκβάλλω
 ἐπίφρων
 ἐπιδέχομαι
 ἐραννόϲ
 ἐρατόϲ
 ἐρατύω
 ἐρατώνυμοϲ
 ἐρύκω
 ἑπτάπυλοϲ
 ἔμποροϲ
 ἦ¹
 ἰαίνω
 ἰοβλέφαροϲ
 ἰόπλοκοϲ
 ἱμεράμπυξ
 ἴκρια
 ἵππιοϲ
 ὀβριμοδερκήϲ
 ὀβριμόϲποροϲ
 ὀπάων
 ὀρϲίαλοϲ
 ὀρϲιβάκχαϲ
 ὄφρα
 ὄψ¹
 ὑγρόϲ
 ὑπερήφανοϲ
 ὑφαίνω
 ὑψίκερωϲ
 ὠκύπομποϲ
 ὥτε
 ῥά
 ῥέπω
 ῥοδοδάκτυλοϲ
 ῥοδόειϲ
(175 rows)

Here is every place where Bacchylides 001 appears:

hipparchiaDB=# select entry_name from greek_dictionary where entry_body ~* 'Perseus:abo:tlg,0199,001' order by entry_name;
    entry_name
------------------
 Αἶϲα
 Δηλογενήϲ
 Κύκλωψ
 Λοξίαϲ
 Παλλάϲ
 Φλειοῦϲ
 αἰθήρ
 αἰολόπρυμνοϲ
 αἰπεινόϲ
 αἰπύϲ
 αἰχμοφόροϲ
 αἰόλοϲ
 αἱμακουρίαι
 αἴ
 αἴγλη
 αἴθων
 αἴτιοϲ
 αὐδήειϲ
 αὐθιγενήϲ
 βία
 βαθυδείελοϲ
 βαθυπλόκαμοϲ
 βαθύζωνοϲ
 βαθύξυλοϲ
 βαθύπλουτοϲ
 βαρυπενθήϲ
 βαρύτλητοϲ
 βαρύφθογγοϲ
 βιάω
 βληχρόϲ
 βοηθόοϲ
 βουζύγηϲ
 βούθυτοϲ
 βοῶπιϲ
 βρίθω
 βραχύϲ
 βροτωφελήϲ
 βροτόϲ
 βρύω
 γελανόω
 γεραίρω
 γηρύω
 γλυκύδωροϲ
 γλυκύϲ
 γνήϲιοϲ
 γυιαλκήϲ

Bacchylides 004:

hipparchiaDB=# select entry_name from greek_dictionary where entry_body ~* 'Perseus:abo:tlg,0199,004' order by entry_name;
  entry_name
--------------
 βάρβιτοϲ
 εἰκάϲ
 εὐέανοϲ
 εὐλύραϲ
 θραϲύχειρ
 καλλίϲφυροϲ
 καλυκῶπιϲ
 καταπαύω
 λεύκιπποϲ
 λιγυηχήϲ
 μαινόληϲ
 μείγνυμι
 μιαιφόνοϲ
 πάϲϲαλοϲ
 τανύπεπλοϲ
 χαλκεομίτραϲ
 χρυϲόλοφοϲ
 ϲεύω
 ϲκόλιον
 ἑπτάτονοϲ
(20 rows)

Thomas Magister

I'll now clean out the other spurious Thom.Mag. bibls. @jacknoutch :-)

Plutarch References for the Moralia do not tally with the Perseus URNs

Issue

LSJ cites Plutarch's Moralia by Stephanus pagination, without the title of the work. A typical reference looks like this:
<author>Plu.</author> 2.345a. (The 2. is there because LSJ used Wyttenbach's two-volume edition of the Moralia. It is superfluous for these purposes.)

These references do not tally with the Perseus URNs of the form Perseus:abo:tlg,0007,087,345a.

Aim

Each reference to Plutarch's Moralia should be wrapped in a <bibl n="Perseus:abo:tlg,0007,087,345a"> tag.
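
One piece of this can be sketched in Python: turning an LSJ-style reference into the URN form quoted above. `moralia_urn` is a hypothetical helper, the restriction of section letters to a-f is an assumption, and the work number must come from a Stephanus-page-to-work lookup that is not shown here and would have to be built separately.

```python
import re

# "2." is the Wyttenbach volume prefix; the rest is a Stephanus page
# number followed by a section letter (assumed here to be a-f).
MORALIA_REF = re.compile(r"2\.(\d+)([a-f])")

def moralia_urn(ref: str, work: str) -> str:
    """Build a URN from a reference such as '2.345a.' and a known work number."""
    match = MORALIA_REF.fullmatch(ref.strip().rstrip("."))
    if match is None:
        raise ValueError(f"not a Moralia reference: {ref!r}")
    page, section = match.groups()
    return f"Perseus:abo:tlg,0007,{work},{page}{section}"

urn = moralia_urn("2.345a.", "087")
```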

NB @helmadik, after your email of 16 Sep, I'm writing this up as a GitHub issue since I think it will be easier for communication. Just got started on the problem yesterday :)

Inconsistent use of XML for identifying entries

Most entries are of the form:

<div2 id="crosse)pau/+sas" orig_id="n38392" key="e)pau/+sas" type="main" opt="n">
    <headword extent="full" lang="greek" opt="n" orth_orig="ἐπαΰσας">ἐπαΰσας</headword>

But there are 2 that are different:

<div2 id="e)pauri/skw">
    <headword extent="full" lang="greek" opt="n" orth_orig="ἐπαυρίσκω">ἐπαυρίσκω</headword>
    ...
<div2 id="tomh/" orig_id="n104250a" key="tomh/" type="main" opt="n">
    <headword extent="suff" lang="greek" opt="n" orth_orig="τομ-ή">τομή</headword>

These 2 do not have "cross" as the prefix to the id, and the first is lacking a key attribute! Neither of these entries is reachable in the online Perseus lexicon or in the Kindle version of the "Middle Liddell".

I have tried to correct these in my local copy of the data, but I mention it here for the benefit of others.
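
A check along these lines can list such entries automatically. This is a minimal Python sketch: `irregular_entries` is a hypothetical name, and the rule that a regular id equals "cross" followed by the key is inferred only from the examples above.

```python
import xml.etree.ElementTree as ET

def irregular_entries(root):
    """List (id, problem) pairs for <div2> elements that break the pattern."""
    problems = []
    for div in root.iter("div2"):
        entry_id = div.get("id", "")
        key = div.get("key")
        if key is None:
            problems.append((entry_id, "missing key"))
        elif entry_id != "cross" + key:
            problems.append((entry_id, "id is not 'cross' + key"))
    return problems

# Attributes copied from the examples above; contents omitted.
sample = ET.fromstring(
    '<div1>'
    '<div2 id="crosse)pau/+sas" orig_id="n38392" key="e)pau/+sas" type="main" opt="n"/>'
    '<div2 id="e)pauri/skw"/>'
    '<div2 id="tomh/" orig_id="n104250a" key="tomh/" type="main" opt="n"/>'
    '</div1>'
)
found = irregular_entries(sample)
```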

Inconsistent use of head tag

I have noticed that there seems to be an inconsistency in the use of the <head> tag in the XML markup. For example:

<head>Preface 1925</head>
<div2 id="cross*a" orig_id="n0" key="*a" type="main" opt="n"><headword extent="full" lang="greek" opt="n" orth_orig="Α α">α</headword>...

In the first case, 'head' seems to mean 'heading'; in the second case, it seems to mean 'headword'. Do I understand this correctly?

In both cases, this makes it harder to convert the XML into HTML for display purposes, since <head> has another meaning in HTML. I would recommend using different, more distinct tags for these situations. Is that a possibility?

Seemingly inconsistent use of square brackets in headwords

I have encountered 15 occurrences of headwords containing square brackets. Here are a few examples:

<head extent="full" lang="greek" opt="n" orth_orig="τεσσερᾰκαιεβδο[μη]κοντούτης">τεσσερακαιεβδο[μη]κοντούτης</head>
<head extent="suff" lang="greek" opt="n" orth_orig="σκολῐό-δ[ειρ]ος">σκολιόδ[ειρ]ος</head>
<head extent="full" lang="greek" opt="n" orth_orig="ῥηξί-[ζῠγ]ος">ῥηξί[ζυγ]ος</head>
<head extent="full" lang="greek" opt="n" orth_orig="ἀπόλυγμα[τος]·">ἀπόλυγμα[τος]</head>

But then I saw one occurrence where the closing square bracket was after the end head tag:

<head extent="full" lang="greek" opt="n" orth_orig="οὐλομέτ[ριον">οὐλομέτ[ριον</head>], ...

And then there are a few cases like these, where there is only one square bracket:

<head extent="full" lang="greek" opt="n" orth_orig="προεδικ[ός">προεδικ[ός</head>
<head extent="full" lang="greek" opt="n" orth_orig="τρῐσατ]ῠχής">τρισατ]υχής</head>

Given that these are so few in number, and the appearance of the brackets is inconsistent, I wondered if these might be typographical errors...?
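
To separate the balanced cases from the suspicious ones, a small classifier is enough. This is a hedged Python sketch; `bracket_status` is a hypothetical helper that looks only at the headword text, so a closing bracket placed after the end tag (as in the οὐλομέτ[ριον case) is reported as unbalanced.

```python
def bracket_status(headword: str) -> str:
    """Classify a headword as 'none', 'balanced' or 'unbalanced'."""
    depth = 0
    seen = False
    for ch in headword:
        if ch == "[":
            seen = True
            depth += 1
        elif ch == "]":
            seen = True
            depth -= 1
            if depth < 0:  # a closing bracket with no opener before it
                return "unbalanced"
    if not seen:
        return "none"
    return "balanced" if depth == 0 else "unbalanced"

statuses = {hw: bracket_status(hw) for hw in ["ῥηξί[ζυγ]ος", "προεδικ[ός", "τρισατ]υχής"]}
```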

Inconsistent use of XML for parallel items

I have found the following (probably there are others...?):

    <cit>
        <quote lang="greek">-ρῶς</quote>
        <i>without giving offence,</i>
        <bibl n="Perseus:abo:tlg,4013,006:p.85D" default="NO">
            <author>Simp.</author>
            <title>in Epict.</title>p.85D.
        </bibl>
    </cit>; 
    <i>without taking offence,</i> ib.
    <bibl n="Perseus:abo:tlg,4013,006:p.88D" default="NO">p.88D.</bibl>

I have formatted this in order to make it easier to see the imbalance.

The 2 <i> tags seem to be intended as parallel, yet one is wrapped in <cit> and the other is not. I would have expected each of these to be considered a <cit>. This appears OK in the online Perseus dictionary, but the formatting of the first (blue text) is different from the second in my Kindle version of the printed dictionary.

This is probably not a grievous problem, as it may simply make it difficult to apply formatting consistently.

n114551, n114552 ... n114554, n114555: χράω(B) is missing from the XML

There is a gap in the LSJ coverage. One finds:

...
n114551 (χραύω)
n114552 (χράω [A])
n114554 (χρέα)
n114555 (χρεαγωγός)
...

This means that χράω[B] is missing. Presumably it should be n114553.

I suspect that this is an old, old omission. But the word is very common, so it is worth filling in this gap.
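
Other gaps of this kind could be found by scanning the orig_id sequence. This is a Python sketch that assumes ids have the form "n" plus digits, optionally followed by a letter suffix (as in n104250a); `missing_numbers` is a hypothetical name. A gap in the numbering is only a hint, since a number may have been skipped deliberately.

```python
import re

ORIG_ID = re.compile(r"n(\d+)")

def missing_numbers(orig_ids):
    """Return the integers absent between the lowest and highest id seen."""
    numbers = set()
    for orig_id in orig_ids:
        match = ORIG_ID.match(orig_id)
        if match:
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in numbers]

gaps = missing_numbers(["n114551", "n114552", "n114554", "n114555"])
```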

Missing sense numbers in XML text

The sense number for each entry is in an attribute, rather than in the text, so it is removed when the XML elements are removed, as is normally done when XML data is formatted for display.

We have had to address this through the use of CSS, which I show here for the benefit of others:

sense:before {
    content: attr(n) '. ';
}

This copies the sense number from the element attribute to the element text before the element tags are removed.
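
Where CSS is not available (for example, when exporting plain text), the number can be written into the text before the tags are stripped. This is a minimal Python sketch of that idea, not part of the original workflow; `inline_sense_numbers` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def inline_sense_numbers(root):
    """Prefix each <sense> element's text with its n attribute, e.g. 'II. '."""
    for sense in root.iter("sense"):
        number = sense.get("n")
        if number:
            sense.text = f"{number}. " + (sense.text or "").lstrip()
    return root

entry = ET.fromstring('<entry><sense n="II" id="n5.0" level="2" opt="n"> v. ἄας.</sense></entry>')
plain = "".join(inline_sense_numbers(entry).itertext())
```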

Unmarked occurrence of Greek text

I found this in the entry for 'ἄα':

<sense n="II" id="n5.0" level="2" opt="n"> v. ἄας.</sense>

Not marking this with lang="greek" prevents us from formatting this text or turning it into a link to another entry. To correct this, it would need to be:

<sense n="II" id="n5.0" level="2" opt="n"> v. <foreign lang="greek">ἄας</foreign>.</sense>

I have not had the chance to search for other occurrences like this...
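
A search for other occurrences could look like the following. This is a hedged Python sketch: `unmarked_greek` is a hypothetical helper, and it assumes that Greek is marked only through a lang="greek" attribute on the element itself or on an ancestor.

```python
import re
import xml.etree.ElementTree as ET

# A Greek letter (basic or extended block), then Greek letters or combining marks.
GREEK = re.compile(r"[\u0370-\u03ff\u1f00-\u1fff][\u0300-\u036f\u0370-\u03ff\u1f00-\u1fff]*")

def unmarked_greek(elem, in_greek=False):
    """Collect Greek runs found in text that is not inside lang="greek"."""
    in_greek = in_greek or elem.get("lang") == "greek"
    hits = []
    if not in_greek and elem.text:
        hits.extend(GREEK.findall(elem.text))
    for child in elem:
        hits.extend(unmarked_greek(child, in_greek))
        # A child's tail text belongs to the current element, not to the child.
        if not in_greek and child.tail:
            hits.extend(GREEK.findall(child.tail))
    return hits

before = ET.fromstring('<sense n="II"> v. \u1f04\u03b1\u03c2.</sense>')
after = ET.fromstring('<sense n="II"> v. <foreign lang="greek">\u1f04\u03b1\u03c2</foreign>.</sense>')
```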

Entries with duplicate headwords and/or keys

I used the following XPath expression to extract a list of the headwords: /tei.2/text/div1/div2/headword (Note that I have already changed <head> to <headword>, as discussed in another issue.)

This produced a list of 116,471 headwords. In this list, I found 672 duplicates, using this search pattern:

Regular Expression: <entry>([\w-]+)</entry>\n\s*<entry>\1</entry>

This, in itself, is not a problem as much as a challenge. Many dictionaries are designed to keep these 'homograph' entries together; many others are designed to allow for separate entries for homographs, as we discussed in the above linked issue.

In the latter case, some dictionaries use 'homograph numbers' to keep them separate. For online dictionaries, this helps immensely, so that the headwords can be used as unique indexes to the entries.

In the LSJ data, I see the 'key' attribute on the <div2> elements. I had assumed that this 'key' could be used as a unique index. To make use of it, we would need to replace our direct-link from headword to entry with a level of indirection: headword > key > entry. And this would require that duplicate headwords would be shown in the index, but, unseen by the user, each headword would be associated with a unique key, and this key would be used to find the correct entry.

This is doable (and seems to have been done for the online Perseus dictionary).

However, I have encountered another related issue, which may be a real problem: duplicate keys.

I used the following XPath expression to extract a list of the keys: /tei.2/text/div1/div2/@key

This produced a list of 116,474 keys. (Not sure why this number is different from the number of headwords!) In this list, I found 23 duplicates, using this search pattern:

Regular Expression: key="([^"]+)"\n key="\1"

One example is the key 'ai)go/keras'. The first entry has the headword 'αἰγόκερας'; the second entry has the headword 'αἰγοκερεύς'. I notice that the online Perseus dictionary only seems to show the first of these, as does my Kindle version of the print dictionary.

Another example is the key 'a)mfeleli/zw'. This occurs 3 times, with headwords, 'ἀμφελελίζω', 'ἀμφελικτός' and 'ἀμφελίσσω'. Again, only the first of these 3 entries can be seen in both the Kindle version and the online Perseus version.

Are these intentional, or mistakes?

Perhaps we should use the 'orig_id' attribute, rather than the 'key'...?

Here, again, I find the number is different:

116,471 headwords
116,474 keys
116,473 orig_ids

(At the moment I'm having difficulty finding the entry which contains a key, but no orig_id...)
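
The duplicate counts can also be taken directly from the parsed XML rather than from a regular expression over an extracted list. This is a minimal Python sketch with hypothetical helper names; unlike the adjacency-based patterns above, it also catches duplicates that are not next to each other, so its totals may differ from the figures quoted here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def attribute_values(root, name):
    """Collect one attribute from every <div2>, skipping entries that lack it."""
    return [div.get(name) for div in root.iter("div2") if div.get(name) is not None]

def duplicates(values):
    """Return each value that occurs more than once, with its count."""
    return {value: count for value, count in Counter(values).items() if count > 1}

sample = ET.fromstring(
    '<div1>'
    '<div2 key="ai)go/keras"/><div2 key="ai)go/keras"/>'
    '<div2 key="a)mfeleli/zw"/><div2 key="tomh/"/><div2 key="a)mfeleli/zw"/>'
    '</div1>'
)
duplicate_keys = duplicates(attribute_values(sample, "key"))
```

Comparing `len(attribute_values(root, "key"))` with `len(attribute_values(root, "orig_id"))` is one way to locate entries that carry one attribute but not the other.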

Non-Greek punctuation within spans of Greek text

There are a number of instances where non-Greek punctuation is included within spans of text which are labelled as Greek. For example:

<foreign lang="greek">ἄ-οινος, ἄ-υπνος,</foreign>

This prevents us from turning elements tagged as <foreign lang="greek"> into links to other entries. To correct this, it would need to be:

<foreign lang="greek">ἄ-οινος</foreign>, <foreign lang="greek">ἄ-υπνος</foreign>,

Here is another troublesome example:

[<foreign lang="greek">ῠ], <gen lang="greek" opt="n">ἡ</gen>, </foreign>

We don't, yet, have a solution for this.
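
For the simpler first case, the span can be split on punctuation. This is a hedged Python sketch: `split_greek_span` is a hypothetical helper, the punctuation set is an assumption, and it does not attempt the nested second example.

```python
import re
from xml.sax.saxutils import escape

# Assumed set of non-Greek punctuation, with any whitespace that follows it.
PUNCT = re.compile(r"([,.;:]\s*)")

def split_greek_span(text: str) -> str:
    """Wrap each run between punctuation marks in its own <foreign> element."""
    out = []
    for piece in PUNCT.split(text):
        if not piece:
            continue
        if PUNCT.fullmatch(piece):
            out.append(piece)  # punctuation stays outside the Greek span
        else:
            out.append(f'<foreign lang="greek">{escape(piece)}</foreign>')
    return "".join(out)

fixed = split_greek_span("ἄ-οινος, ἄ-υπνος,")
```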

< P> at the end of the entry under έπιουδίς

Under the headword έπιουδίς, a strange string < P> appears at the end of the entry. It is not present in the 1996 LSJ with revised supplement. I cannot decide whether it is a relic of an older version, a special indicator, or misplaced garbage that should be removed.

Is breve accent in various Greek words intentional?

I have seen a Unicode breve accent character in some places, and I wondered whether this is a typo or whether a combining breve accent was intended. Is this something that needs to be corrected?

Here are some examples:

<cit><quote lang="greek">ἄτε˘κνος</quote><author>A.</author></cit>
<headword extent="suff" lang="greek" opt="n" orth_orig="αὐλά˘κ-ιον">αὐλάκιον</headword>
<itype lang="greek" opt="n">δη˘ῐων</itype>
<pron extent="full" lang="greek" opt="n">[ω˘]</pron>
<foreign lang="greek">ἥρω˘ος</foreign>

Looks like there are 60 occurrences...
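
If a combining breve turns out to be what was intended, the occurrences could be counted and converted as follows. This is only a Python sketch: it assumes the character in the data is the spacing BREVE (U+02D8) and that it should attach to the preceding letter as COMBINING BREVE (U+0306). Neither assumption has been confirmed by the maintainers.

```python
SPACING_BREVE = "\u02d8"    # standalone breve character
COMBINING_BREVE = "\u0306"  # attaches to the preceding letter

def count_spacing_breves(text: str) -> int:
    """Count standalone breve characters in a string."""
    return text.count(SPACING_BREVE)

def to_combining_breve(text: str) -> str:
    """Replace each standalone breve with the combining form."""
    return text.replace(SPACING_BREVE, COMBINING_BREVE)

converted_pron = to_combining_breve("[\u03c9\u02d8]")
```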