perseusdl / canonical

This will be the base repo for all text and annotation data published in the PDL

canonical's Introduction

PerseusDL canonical repository

This was the first public GitHub repository home for the TEI XML texts of the Perseus Digital Library.

As our strategy for working with the texts through GitHub has evolved and as the repository has been growing, we have decided to move the texts to individual repositories subdivided by the CTS namespace to which the texts have been assigned. In general, namespace corresponds to language of original transmission for the work.

Greek works are now in http://github.com/PerseusDL/canonical-greekLit

Latin works are now in http://github.com/PerseusDL/canonical-latinLit

Anglo-Saxon works are now in http://github.com/PerseusDL/canonical-angLit

Italian works are now in http://github.com/PerseusDL/canonical-itaLit

Norse works are now in http://github.com/PerseusDL/canonical-norseLit

Farsi works are now in http://github.com/PerseusDL/canonical-farsiLit

We are still working through our strategy for secondary sources and reference works. For now you can find them in http://github.com/PerseusDL/canonical-pdlrefwk but this is subject to change.

If you are unsure of where to find a work you are interested in, please use the Perseus Catalog. The Catalog interface prominently displays the CTS URN for each edition or translation of a work. The file structure in our GitHub repositories for texts currently adheres to the following pattern:

canonical-NAMESPACE/data/TEXTGROUP/WORK/TEXTGROUP-WORK-VERSION.xml

More information on the CTS identifier structure of the Perseus texts can be found in the Catalog documentation.
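
As a rough illustration of the layout described above, the following sketch maps a version-level CTS URN to a repository-relative path. The function name `repo_path_for_urn` is hypothetical. The pattern above writes the filename with hyphens, but the filenames quoted elsewhere on this page (for example phi0448.phi001.perseus-lat1.xml) join the parts with dots, so the sketch assumes dots. Since file locations are subject to change, treat the result as a starting point, not a stable identifier.

```python
def repo_path_for_urn(urn: str) -> str:
    """Map a version-level CTS URN to a repository-relative file path.

    Assumption: filenames join TEXTGROUP, WORK and VERSION with dots.
    """
    parts = urn.split(":")
    # Expect urn:cts:NAMESPACE:TEXTGROUP.WORK.VERSION; a trailing passage
    # component, if present, is ignored.
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_id = parts[2], parts[3]
    pieces = work_id.split(".")
    if len(pieces) != 3:
        raise ValueError(f"expected TEXTGROUP.WORK.VERSION, got {work_id!r}")
    textgroup, work, version = pieces
    filename = f"{textgroup}.{work}.{version}.xml"
    return f"canonical-{namespace}/data/{textgroup}/{work}/{filename}"


print(repo_path_for_urn("urn:cts:latinLit:phi0448.phi001.perseus-lat1"))
```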

Note that all GitHub file locations are subject to change, and that URLs to the GitHub files should NOT be used as Permanent Stable Identifiers for the Perseus texts. Information on where and how to find stable identifiers for the Perseus texts is provided at

http://sites.tufts.edu/perseusupdates/beta-features/perseus-stable-uris/ and http://sites.tufts.edu/perseuscatalog/documentation/user-guide/catalogdata-uris/

Copyright

Tufts University holds the overall copyright to the Perseus Digital Library; the materials therein (including all texts, translations, images, descriptions, drawings, etc.) are provided for the personal use of students, scholars, and the public.

Materials within the Perseus DL have varying copyright status: please contact the project for more information about a specific component or object. Copyright is protected by the copyright laws of the United States and the Universal Copyright Convention.

Unless otherwise indicated, all contents of this repository are licensed under a Creative Commons Attribution-ShareAlike 3.0 United States License. You must offer Perseus any modifications you make. Perseus provides credit for all accepted changes.

canonical's People

nkallen gregorycrane simonastoyanova scotartt helmadik rwhaling ponteineptique srdee nevenjovanovic jeidsath mlj tariqyosef gcelano thomask81 somiyagawa

canonical's Issues

Silius Italicus Punica

This is a poetry work with a prose DTD, and the only unit declared in the refsDecl is 'book':
refsDecl doctype="TEI.2"
unit="book"
end refsDecl
(Angle brackets removed here because they make the whole thing disappear..)
https://github.com/PerseusDL/canonical/blob/master/CTS_XML_TEI/perseus/latinLit/phi1345/phi001/phi1345.phi001.perseus-lat1.xml

LSJ bad entry

Entries 42819 and 42820 should be part of entry 42818.

Strabo bibl fixes (citing Strabo)

Some texts use a different citation scheme for Strabo bibls, so they need to be changed to book:chapter:section. Look for Strabo citations that only have two numbers (Strab. X.XX), which will refer to the book and Casaubon chapter.

This requires finding the book, chapter and sections that refer to the book:Casaubon chapter (there will probably be multiple sections) and finding out which section actually has the bit of text that the citation refers to.

This process is not perfect, since we're not always sure which Greek text the author is using to determine the book:Casaubon chapter, but it gets you in the right general area. It's better than not having the citation work!

http://books.google.com/books?id=LfpGAAAAIAAJ (Greek, volumes 1-2)
http://books.google.com/books?id=1VwUAAAAYAAJ (Greek, volume 3)

Also used the English versions on the Perseus website once I determined the book:chapter:section(s).
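
A first pass at locating the two-number citations could look like the sketch below. It assumes the citations appear as plain text of the form "Strab. 8.6" with Arabic numerals; the function name `two_part_strabo_refs` is hypothetical. It only finds candidates, and deciding which section each one really refers to still needs a human reader.

```python
import re

# Matches "Strab. 9.401" but not "Strab. 8.6.20": exactly two numeric parts.
# The lookahead rejects a match that is followed by another ".digits" part.
TWO_PART = re.compile(r"\bStrab\.\s*(\d+)\.(\d+)(?!\.?\d)")


def two_part_strabo_refs(text):
    """Return (book, second_number) pairs for two-part Strabo citations."""
    return [(int(book), int(second)) for book, second in TWO_PART.findall(text)]


sample = "See Strab. 8.6.20 and Strab. 9.401; also Strab. 10.2."
print(two_part_strabo_refs(sample))
```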

Fix keys for place names

Since these were automatically extracted, a lot of places will refer to contemporary cities instead of the ancient site (i.e., Troy in New York).

A list of files with their status can be found in texts/textwork/textsWithEntities.xlsx

The key attribute in and tags indicate what long/lat that place refers to, which may need to be fixed if they refer to the wrong key. All of the place keys can be found in texts/textwork/places.txt. One can also look at the map on the Perseus website (using the texts document ID) to see if there are any outliers, but every and tag should be double checked.

cast lists are not displayed

0000721: cast list not displaying
Description Aristophanes, Euripides, Plato, Plautus, Sophocles, Terence directories have texts with cast tags but do not display the cast info
Note: This could be an issue with the hopper, but it appears that this info is not within a div (unlike Shakespeare) so I placed it in this repo for now.

front matter for lexica

It is there for some works (LSJ) but not at all obvious to the user.

For other works, front matter and back matter are completely missing (as are reference tables, appendices, etc.)

Named Entity Issues Bancroft 2001.05.0326

X-Original-To: [email protected]
Date: Wed, 4 Feb 2009 05:18:02 -0800 (PST)
Subject: Typographical errors
To: [email protected]

Dear Sir/Madame,

It has come to my attention numerous errors in your transcription of George Bancroft's History of the Colonization of the United States, vol. 1.

Page 10:

Places panel: There is neither England nor Asia listed in the menu, but there is in the text.

Names panel: Henry the Seventh is not referenced to in the names menu, but he is in the text.

Page 11:

In the places panel: St. Marks (Kansas, United States); the St. Mark about which he writes is in Venice, Italy.

Places panel: Bristol Bay (Rhode Island, United States); the Bristol Harbour they speak of is in England, not America.

Places panel: England, the Canaries, Grand Cham, and India are missing.

People panel: Amerigo Vespucci and Columbus are missing.

rsingh04 (administrator)
2009-09-18 08:26

This is weird. The entities are tagged, but are not being extracted for some reason. Will need to look into these some more. The incorrectly tagged ones will probably need to be removed as we don't have these places in the TGN xml files.

line numbers in Seneca files do not match source

There are automated line numbers in the Seneca files, which didn't necessarily reflect the publisher's line numbers. So these files will have tags like this:
<milestone unit="line" ed="exclude" n="135"/><l n="135">

Basically, a student has to check each Seneca file and make sure that where there is a line number in the hard copy it is in the <l> tag in the XML (so the n="135" bit).

This is a low priority project, but just something to keep it consistent with the hard copy.

oddities in encoding Slater 1999.04.0072

some Slater headwords do not render properly and thus produce no results in the morph tool and do not work as cross references

Presumably will be fixed by Unicode conversion, but see:
/hopper/text?doc=Perseus%3Atext%3A1999.04.0072%3Aalphabetic+letter%3Da%3Aentry+group%3D19%3Aentry%3Da)ua%2Fta

rendered α?̓υα?́τα

also might want to be sure that cross refs within the lexicon work properly

line numbers missing: latinLit:phi0893.phi001.perseus-lat1;latinLit:phi0620.phi001.perseus-lat1

User requested every fifth line be in display. No line numbers are included in xml for these files.

Martin overview structure 1999.04.0009

a couple of sections I was trying to read seemed to drop out of the overview
I went to 10.1:
/hopper/text.jsp?doc=10.1&fromdoc=Perseus%3Atext%3A1999.04.0009

and used the sidebar to click on the Mystery of the Mysteries

/hopper/text.jsp?doc=Perseus%3Atext%3A1999.04.0009%3Achapter%3D10%3Asection%3D1%3Asubsection%3D7%3Asubsubsection%3D1

I found that when I arrow backwards, I get section 10.1.5.1, when I had been in 10.1.7.1

I think that it wants 10.1.6.1, but since there is only a 10.1.6, it is getting confused.

There are discrepancies between the Side-TOC/navbar navigation and the forward/backward arrows, which I need to spend some time looking into and merging (this isn't the only text where these issues manifest). There are also smaller discrepancies between the Side-TOC and navbar, but they are less obvious.

The other issue is the structure of the text itself that may need to be reworked a bit since only some subsections have subsubsections, which is why the back/forward arrows don't work correctly because once you're in a subsubsection, it wants to go to the next/previous subsubsection.

I guess I might want to work on rewriting how the front/back arrows are linked, though I think it may be worth rethinking how these texts are structured (part of the regularization of the texts).

This seemed really strange to me. Let me know if there is something I can do to the structure to make the navigation work.
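
One way to think about the arrow problem is to derive previous/next from a single document-order list of leaf chunks, so that uneven nesting depth no longer matters. The sketch below is not how the hopper works; it is a minimal illustration with hypothetical names (`leaf_order`, `neighbours`) and a hand-built table of contents shaped like the case above, where 10.1.6 has no subsubsections.

```python
def leaf_order(tree, prefix=()):
    """Yield the citation of every leaf chunk in document order."""
    for label, children in tree:
        path = prefix + (label,)
        if children:
            yield from leaf_order(children, path)
        else:
            yield ".".join(path)


def neighbours(tree, current):
    """Return (previous, next) leaf citations around `current`."""
    leaves = list(leaf_order(tree))
    i = leaves.index(current)
    prev_ref = leaves[i - 1] if i > 0 else None
    next_ref = leaves[i + 1] if i + 1 < len(leaves) else None
    return prev_ref, next_ref


# Hypothetical fragment: each node is (label, children).
toc = [("10", [("1", [
    ("5", [("1", [])]),
    ("6", []),
    ("7", [("1", [])]),
])])]

print(neighbours(toc, "10.1.7.1"))
```

With this ordering, going back from 10.1.7.1 lands on 10.1.6 rather than skipping to 10.1.5.1.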

headers mistranslated

Do a global check to see if headers in texts appear in the correct language: some headers are not tagged separately (i.e., in a Greek text, the header is not tagged as English, so it is displayed as Greeklish...)

tlg0653.tlg001.perseus-grc1.xml Aratus Solensis structure

text structure of Aratus
Description Aratus. Anything other than Arat. 1 doesn't return results: I think
the default is just that this is one book. From what I see throughout
the collection, the refs we have go to Arat. + line number (Arat. 631, etc.)

Additional Information the aratus problem is because I added a "fake" book enclosing the
lines because I needed at least one top level element. I think I did
that wrong. I need to look at how we handle this for other line-only
cited texts like Aeschylus and fix the chunking.

review Smith et al for missing entities

0000666: adding Unicode character entities
Description I'm working in Smith's realia and noticed that some of the things which the data entry people thought were images, are actually symbols and notations for which there are now Unicode equivalents (even if most people don't have fonts to view them or we are still in the UTF-8 charset). Most of these are either not marked at all or are marked with placeholder entities which seem to be particular to this document or to Perseus. Will it cause any problems with the transformation of these texts if I add the hexadecimal codes for the entities I recognize?

For example, we use but the triseme has a Unicode equivalent now (&#x23d7;) and this entity doesn't seem to mean anything.

2006.05.0178.xml Richmond Dispatch

0000982: missing section of Richmond Dispatch
I'm not sure why this is happening, but sections of the Richmond Dispatch (at least one edition I am viewing) are not appearing on line. A user searched on the content, and it was found by Google, but you can't actually see it outside of the xml file.
This may be intentional but it seems strange.

User searched on "Robert A. S. Pittman , of the ship James Guthrie and Miss Ada V. Saunders"
which Google returns as
the Perseus XML document:
http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A2006.05.0178

In the XML, this reference appears in , but it does not appear on the online version. (Searches in this doc for Pittman, Saunders, etc, turn up empty).

everything from sentence 368 to 395 is not visible online

I would guess this happens with other editions of the Richmond Dispatch, but I'm not sure how to pinpoint this problem

This is happening because the text is chunked by article, and the text in question is not included in an enclosing article chunk.

I'm not sure how widely this applies to the Richmond collection, but in this particular file, the article chunks are at the div3 level, along with ad-blank. Anything that is in the higher div2 or div1 elements is getting omitted from the display. The div1 level includes types such as "page-image", "subscription","notices", "news". The div2 elements include various subcategories under each of those.

Articles are in any of the following paths:
/div1[@type='news' or @type='notices']/div2[@type='morning' or @type='evening' or @type='local' or @type='negroe' or @type='wants' or @type='servants' or @type='announcements' or @type='telegraphic' or @type='negro' or @type='slaves']

The undisplayed text above is in the following path
/div1[@type='notices']/div2[@type='advertisements']/div3[@type='ad-blank']

If the entire text is meant to be displayed then we would need to fix the chunking scheme and possibly cleanup the data as well.
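
To get a sense of how much text falls outside the article chunks, something like the sketch below could list every div3 whose enclosing div1/div2 types are not among the article paths given above. The type sets are copied from those paths; the function name `undisplayed_div3_paths`, the sample document, and the div3 type "article" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Type values taken from the article paths listed above.
ARTICLE_DIV1 = {"news", "notices"}
ARTICLE_DIV2 = {"morning", "evening", "local", "negroe", "wants", "servants",
                "announcements", "telegraphic", "negro", "slaves"}


def undisplayed_div3_paths(root):
    """List div3 elements that sit outside the article-bearing div1/div2 paths."""
    missing = []
    for div1 in root.iter("div1"):
        for div2 in div1.findall("div2"):
            if div1.get("type") in ARTICLE_DIV1 and div2.get("type") in ARTICLE_DIV2:
                continue  # this div2 is one of the article containers
            for div3 in div2.findall("div3"):
                missing.append("/".join(
                    f"{el.tag}[@type='{el.get('type')}']" for el in (div1, div2, div3)))
    return missing


# Hypothetical miniature of a Richmond Dispatch file.
sample = ET.fromstring("""
<body>
  <div1 type="news"><div2 type="local"><div3 type="article"><p>Shown.</p></div3></div2></div1>
  <div1 type="notices"><div2 type="advertisements"><div3 type="ad-blank"><p>Hidden.</p></div3></div2></div1>
</body>""")

print(undisplayed_div3_paths(sample))
```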

check "turn" for "tum"

This happens in many Latin texts: a common OCR relic.
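
A quick way to collect candidates is to search for "turn" as a standalone word, which avoids legitimate Latin words that merely contain those letters (nocturnum, for instance). This is only a sketch with a hypothetical function name, `suspect_turn_lines`; every hit still needs checking against the source before it is changed to "tum".

```python
import re

# Whole-word match only, so "nocturnum" is not flagged.
TURN = re.compile(r"\bturn\b", re.IGNORECASE)


def suspect_turn_lines(lines):
    """Return (line_number, line) pairs containing the standalone word 'turn'."""
    return [(n, line) for n, line in enumerate(lines, start=1) if TURN.search(line)]


sample_lines = ["turn vero dixit", "tum demum", "nocturnum tempus"]
print(suspect_turn_lines(sample_lines))
```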

phi0119.phi020 Plautus Truculentus

The English uses different character names from the Latin which can be confusing: Stratilax in the English is Truculentus in the Latin.

Strabo Greek text 1999.01.0197 notes

Teubner notes seem to be completely missing and there are "Loeb" notes, which are really just bibls and may or may not be correct. All of these issues need to be addressed.
Additional Information Teubner links:
http://books.google.com/books?id=MvoHAAAAQAAJ
http://books.google.com/books?id=3QhHAAAAIAAJ
http://books.google.com/books?id=f_oHAAAAQAAJ

This is listed as the Teubner edition, but the notes in the text are drawn from the Loeb.

Also, they begin with book 6, which is where we used to begin our excerpts.

We avoided using the Loeb in previous years because of copyright questions, but I do not know why Loeb notes are in a text we are listing as a Teubner.

I cleaned up and fixed Book 7, chapter fragments. I changed notes to resp="Perseus" so they won't interfere with potentially wrong notes (which have resp="Loeb").

There are 402 notes that need to be checked and possibly removed if the quote it refers to does not exist in the Teubner hard copy.

Then we need to add the Teubner notes.

Last check: there are both "Perseus" and "Teubner" notes in this work. Worth checking to see how far the student got and if the "Perseus" notes are necessary.

Allen and Greenough: New Latin Grammar 1999.04.0001

There are tables and other formatting inconsistencies in this document which should be cleaned up. Also structural issues that keep it from parsing.

Text was sent to Dickinson College Commentaries. Plans for new version underway there. (August 2013 update)

CTS_XML_TEI/perseus/greekLit/tlg0385/tlg001/tlg0385.tlg001.perseus-grc1.xml

Added decl re edits made to this text (fragments of uncertain authorship?) in cvs v1.8

Would like to see this information in this new file as well.

tlg0062.tlg048.perseus-grc1.xml

User comments that there are errors in first line of text. Since there is no "high level of accuracy" decl in this file, I did not make any notations on that.

Worth a thorough proofread?

Marx Celsus <add> tags 2007.01.0088

There are over 1,000 in the text and they should either be <hi rend="italics"> or <> entities (lang and rang). We don't have the text online, so a student will need to use the book.

Additional notes
rescanning text for digital copy (unavailable online)

Onions text problems 1999.03.0068

Many of the entries use quote tag when the text should be italicized (using ).

Fix duplicate entries in Onions (printout of entries: add 1,2, etc to differentiate between entry keys)

The second part was fixed by Ian, I don't think the first part has been fixed yet.

Shakespeare texts

0000715: Chunking schemes for Shakespeare plays
Description Look into the schemes some more because they're not playing nice with the hopper. Also, line number indications are incorrect.

Fix references to Cic. Ver. in Smith texts

Started working on realia, but also need to do bio and geo. Search for bibls with 2 numbers, i.e. n="Cic. Ver. \d+\.\d+" using regular expressions (note the escaped dot between the two numbers).
Chunking scheme should be act.book.section

1999.04.0063
1999.04.0064
1999.04.0104
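
The search described above can be sketched with the standard library as follows. It assumes the citations live in the n attribute of bibl elements, as the issue states; the function name `short_verrine_bibls` and the sample fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Two numeric parts only; the dots are escaped so they match literal periods.
TWO_PART_VER = re.compile(r"Cic\. Ver\. \d+\.\d+")


def short_verrine_bibls(root):
    """Return the n values of bibl elements citing Cic. Ver. with two numbers."""
    return [b.get("n") for b in root.iter("bibl")
            if TWO_PART_VER.fullmatch(b.get("n", ""))]


sample = ET.fromstring(
    '<p><bibl n="Cic. Ver. 2.15">a</bibl>'
    '<bibl n="Cic. Ver. 2.1.15">b</bibl>'
    '<bibl n="Cic. Off. 1.2">c</bibl></p>'
)
print(short_verrine_bibls(sample))
```

Using `fullmatch` keeps three-part citations, which already follow the target scheme, out of the results.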

Goodwin: Syntax 1999.04.0065

Goodwin: Syntax of the Moods and Tenses of the Greek Verb doesn't have the preface or contents in sgml or xml.

source for greekLit:tlg0526.tlg001.perseus-grc1

cf PerseusDL/perseus_catalog#33

canonical citation scheme for Aristotle Politics, Economics, Nicomachean Ethics

In order to add Aristotle Politics (tlg0086.tlg035.perseus-grc1/perseus-eng1), Economics (tlg0086.tlg029.perseus-grc1/perseus-eng1) and Nicomachean Ethics (tlg0086.tlg010.perseus-grc1/perseus-eng1) to Perseids as quotation sources for annotations of Bodin's De Republicae, we need to decide on the canonical citation scheme. In Perseus, these texts have Books as the containing divs, with the Bekker pages and Bekker lines identified as milestones. In Politics, the Bekker page milestones are tagged as unit="section" rather than unit="bekker page".

It seems like the book # is maybe an alternate citation scheme to bekker pages - it's not clear whether we should, for example, cite urn:cts:tlg0086.tlg035:1.1252a or urn:cts:tlg0086.tlg035:1252a

What is the Latin text in Book 3 of Aristotle Economics (tlg0086.tlg029.perseus-grc1)?

Book 3 of tlg0086.tlg029.perseus-grc1, Aristotle Economics based on the Loeb edition, is in Latin and appears to have a different citation scheme than the rest (book/chapter/section as well as rose pages). Is this a different work?

tlg0062.tlg048.perseus-grc1.xml quality issues

Need to check errors and remove "high level of accuracy" notation

bug report:

Hoping all is well with you, and that you will have had a pleasant weekend; also, glad my redaction of your Oppian is being helpful.

An item this morning though which won't be that welcome I fear: looking for Lucian's de Astrologia, I found it on Perseus, good. Much less good is that it's in a very preliminary stage, yet marked "proofread to a high degree of accuracy".
The very first sentence reads
"ἀμφι τε ὀνρανοὺ ἀμφί τε ἀστβρων ἡ γραφη, οὐκ αὐτῶν ἀστέρων οὐδ᾽ αὐτοῦ πβρι ὀνρανού,"
which must be
"ἀμφί τε οὐρανοῦ ἀμφί τε ἄστρων ἡ γραφή (or γραφὴ), οὐκ αὐτῶν ἀστέρων οὐδ᾽ αὐτοῦ περὶ οὐρανοῦ,"
for a total of 6 words wrong out of 15 — and it continues that way, although the frequency of the mistakes does seem to taper off.

Surely, pending proofreading, you'll want to delete "and has been proofread to a high level of accuracy"?

Pindar cross references not working

see PerseusDL/Perseus5#19

Appian chunking schemes (tlg0551)

It has book:chapter:section and book:section. The first one is the default scheme, but clicking on the second one chunks the same way as the first one, so it seems redundant.

add page numbers

Some texts have empty tags, so we just need to add the page number to them.

Renaissance/Schmidt/copyright/schmidt.xml: pb are in wrong spots but numbers are there

Classics/Celsus/opensource/cels_marx_lat.xml
Classics/Cicero/copyright/cic.leg_lat.xml
Classics/Cicero/opensource/cic.ad.brut_lat.xml
Classics/Cicero/opensource/cic.att_lat.xml - edition was confusing, out of order
Classics/Cicero/opensource/cic.fam_lat.xml
Classics/Josephus/opensource/j.aj_gk.xml
Classics/Josephus/opensource/j.bj_gk.xml
Classics/Josephus/opensource/j.vit_gk.xml
Classics/Pliny/opensource/PlinyNH.xml
Classics/Theophrastus/opensource/char_gk.xml
Classics/Vergil/opensource/serv.verg.aen_lat.xml
Classics/Vergil/opensource/serv.verg.ecl_lat.xml
Classics/Vergil/opensource/serv.verg.georg_lat.xml

XML normalisation or element differences in different texts

Is the following just a matter of the XML being normalised?

When I was looking at the TEI XML up on the Perseus github, I discovered that texts tend to use different markup elements for the divisions. I was expecting something on the lines of what I found in Caesar, where the refsDecl metadata element in the header describes the document as being numbered “book.chapter section”, i.e. “1.1.1”, and the XML looks as follows;

<div1 n="1" type="Book"> 
    <head>COMMENTARIUS PRIMUS</head> 
     <div2 n="1" type="chapter"> 
         <div3 n="1" type="section">

Clearly this is easily parsable in xpath/xquery, combining user input like “2.3.4” with the available metadata to give a xpath query something like:

.//div1[@type='Book'][@n='2']/div2[@type='chapter'][@n='3']
 /div3[@type='section'][@n='4']

Which pulls the desired section out of XML. (although I'll note the metadata in some texts sometimes seems to say the outer type is 'book' rather than 'Book' so I'll need to make that title case insensitive eventually.)
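
For readers who want to try this outside an XPath engine, here is a minimal Python sketch of the same lookup that compares the type attribute case-insensitively, so 'Book' and 'book' both match. The function name `resolve`, the default unit names, and the sample fragment are assumptions for illustration. It only covers texts that nest div1/div2/div3 in this way.

```python
import xml.etree.ElementTree as ET


def resolve(root, citation, units=("book", "chapter", "section")):
    """Return the element for a citation such as '2.3.4', or None if absent."""
    node = root
    for depth, (unit, n) in enumerate(zip(units, citation.split(".")), start=1):
        # Find the first matching div at this depth below the current node,
        # comparing the type attribute without regard to case.
        node = next(
            (child for child in node.iter(f"div{depth}")
             if child.get("n") == n and (child.get("type") or "").lower() == unit),
            None,
        )
        if node is None:
            return None
    return node


sample = ET.fromstring(
    '<body><div1 n="2" type="Book">'
    '<div2 n="3" type="chapter"><div3 n="4" type="section"><p>text</p></div3></div2>'
    '</div1></body>'
)
print(resolve(sample, "2.3.4").get("type"))
```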

But then I look at another text like the Iliad: the outer element is still “div1” but the inner elements have different names, e.g. ‘milestone’, plus, they (sort-of) nest;

<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l><milestone ed="P" unit="para"/>μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος</l>

There's nothing that I see in the metadata that could lead me to programmatically expect the second level element name to be 'milestone' with a 'unit' attribute rather than a 'type' attribute.

But then when I look closer at the file the header of the Homeric TEI doesn’t have a proper (or any) DOCTYPE declaration?

<?xml version="1.0" encoding="UTF-8"?><TEI.2>

Does this indicate there’s a format cleanup and/or conversion process that that particular document is still yet to undergo? In my own software I can probably simply detect and reject such XML until it undergoes that process, but if that’s many months or years off I would probably like to write a series of alternative xpath structures for such texts?

errors in Smith bio citations: 1999.04.0104

Smith has bibl n="Just." for Justin, which should be simply bibl n="Justin" to avoid confusion with Justinian, who is also abbreviated in Perseus as "Just."
The citations to the Digest of Justinian were not completely tagged:
for example
"... for in Dig. 26. tit. 7, s. 34 he gives ..."

the tag did not capture the full citation, which should be "Dig. 26.7.34"
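
The rewrite from the printed form to the dotted form can be sketched with a single regular expression. This assumes the printed citations consistently follow the "Dig. 26. tit. 7, s. 34" shape quoted above; other variants in Smith would need their own patterns. The function name `normalise_digest` is hypothetical.

```python
import re

# Captures book, title and section from the printed form of the citation.
DIGEST = re.compile(r"Dig\.\s*(\d+)\.\s*tit\.\s*(\d+),\s*s\.\s*(\d+)")


def normalise_digest(text):
    """Rewrite 'Dig. 26. tit. 7, s. 34' style citations as 'Dig. 26.7.34'."""
    return DIGEST.sub(r"Dig. \1.\2.\3", text)


print(normalise_digest("... for in Dig. 26. tit. 7, s. 34 he gives ..."))
```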

Seneca the Elder Excerpta Controversiae - missing pages 148-149 2008.01.0564

http://www.archive.org/stream/oratorumetrhetor00seneuoft#page/148/mode/2up
these 2 pages apparently didn't get keyed

http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A2008.01.0564%3Abook%3D1%3Achapter%3D7

not sure about 150-151

Cross-references from Schmidt incorrect: 1999.03.0079

Schmidt isn't mapping to the correct part of the Shakespeare plays, so all of the references are at the beginning of the play rather than in the right act and scene.

phi0448/phi001/phi0448.phi001.perseus-lat1.xml missing notes

0000987: add notes to Caesar Holmes edition
Description There are some notations in this text which don't make sense without the notes: use source to add notes to work

http://books.google.com/ebooks?id=QHVfAAAAMAAJ found here and elsewhere

Not getting right document ID for Cicero texts in search links morph.jsp

/hopper/morph.jsp?l=nocturnum&la=la&prior=te
Both of the search links link to Perseus:text:1999.01.0010 instead of Perseus:text:1999.01.0010:text=Catil., which is the text we're actually in.
Additional Information Each abo for this document ID maps to the correct document ID and subquery, so it needs to grab the subquery in order to give the users the right search link.

Note: (I think this is what this means)

If you are viewing:
http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.02.0010:text=Catil.:speech=1:chapter=1&highlight=te+nocturnum%2C

Click on nocturnum in the 4th sentence view WST:
http://www.perseus.tufts.edu/hopper/morph?l=nocturnum&la=la&can=nocturnum0&prior=te&d=Perseus:text:1999.02.0010:text=Catil.:speech=1:chapter=1&i=1

The link "10" under Max and the Corpus Name "Against Catiline" resolve to 1999.02.0010 NOT 1999.02.0010:text=Catil.

So splitting these speeches into separate docs would probably resolve the issue.

1999.03.0068 Onions

I tried various display options to see if I could get the other div1 sections to display. It appears that we are showing all of the first div1 (letters A-Z) but there are two subsequent sections of addenda and foreign words and phrases which do not display.

A user had a link to this:
hopper/text?doc=Perseus%3Atext%3A1999.03.0068%3Aentry%3DQuod+me+alit+me+extinguit

which you can see if you do a Perseus search:
http://www.perseus.tufts.edu/hopper/searchresults?q=quod%20me%20alit%20me%20extinguit&inContent=true&language=English

Steps To Reproduce Try the link above to see the problem. The content is in the xml file but it produces a broken link (doesn't display).

Additional Information Ideally, we'd probably just make this a flat, integrated document (alphabetize everything), perhaps. I'm not sure what we gain by preserving the book structure.

cross references in PEncyclopedia Perseus:text:1999.04.0004

0001012: fix links to Apollodorus' Library from PEncyclopedia

Due to the conversion away from book/page in earlier versions of Perseus, the links to Apollodorus are imprecise. When these were mapped from book/page to book/chapter/section the process was never checked by a reader.
About 2/3 of the links go to the wrong section. Most are in the right book/chapter, although there are egregious outliers with this (1.1.x and 1.2.x are actually supposed to go to 3.10.x, etc.)

I reviewed all 1.1.x links and half of the 1.2.x links and stopped making changes.

An alternative to rereading and checking these would be to drop the section and go with book and chapter. We would lose precision but eliminate most errors.

NB: when checking this for inclusion, the links I looked at matched those in P3. Not sure how widespread the issue is; presumably CTS will allow for better precision

refsDecl/state[@unit='book'] but div1[@type='Book']

In urn:cts:latinLit:phi0448.phi001.perseus-lat1 (i.e. Caesar, Gallic Wars) the refsDecl/state element declares the attribute unit='book'.

However, in the text body, the 'type' of the div1 is 'Book'. When you use an XPath 1.0 query engine (which is most of them), you can't do case transformation without awful hacks. Therefore, for this particular file, it's hard to programmatically match what the metadata says should be in the document with what the document is actually marked as.

The same situation applies for the English translation of that same text.

I can submit a pull request from my own branch which fixes those two files, if you like.
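
To find other files with the same mismatch, a small check along the following lines could compare the units declared in refsDecl/state with the type values actually used on the divs. The function name `case_mismatches` and the sample document are hypothetical, and the sketch only looks at div1 to div3. Within XPath 1.0 itself, the usual workaround is the translate() function, which is the kind of hack alluded to above.

```python
import xml.etree.ElementTree as ET


def case_mismatches(root):
    """Return (declared_unit, used_type) pairs that differ only by letter case."""
    declared = {s.get("unit") for s in root.iter("state") if s.get("unit")}
    used = {d.get("type") for d in root.iter()
            if d.tag in ("div1", "div2", "div3") and d.get("type")}
    lowered = {unit.lower(): unit for unit in declared}
    return sorted((lowered[t.lower()], t) for t in used
                  if t not in declared and t.lower() in lowered)


sample = ET.fromstring(
    '<TEI.2><teiHeader><encodingDesc><refsDecl>'
    '<state unit="book"/><state unit="chapter"/>'
    '</refsDecl></encodingDesc></teiHeader>'
    '<text><body><div1 type="Book" n="1"><div2 type="chapter" n="1"/></div1></body></text>'
    '</TEI.2>'
)
print(case_mismatches(sample))
```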

Book:chapter:section chunking issues

A lot of texts use that citation scheme, but most of them default to the section chunk rather than the chapter chunk when a user views the text. This is problematic because 1) the Vocab tool only goes down to the chapter chunk. 2) Users have complained that we are breaking up the text into too small chunks. 3) Some texts have a lot of citations that only go to the chapter level.
Additional Information Created a Google Spreadsheet with this information

Can change the texts to Book:chapter*:section schemes so when you go to a text, it defaults to the chapter. Though if we are going to keep the sections, I should probably rewrite the Vocab Tool data loader so it generates data for each level in the scheme so a user gets the vocab data no matter where they are in the text, though I don't yet know what that would involve.

Don't need to worry about vocab tool anymore, because it gets the data from hib_frequencies, which gives data for every chunk.

Still need to think about citations and size of chunks.

Issues with data incoming from Smith's Dictionary of Greek and Roman Geography

Languages incorrectly identified as Latin, e.g.: *karpa/sion, Schwarzwald, Black Forest, Bourbon l'Archam bault, El Hammat-el-Khabs, il Lagno.
Other things incorrectly identified as names, e.g.: ad loc.;, Inscr, Frag, Peripl, Indic.
Half Latin, half beta code entities, e.g.: Adernò)
Special characters in the Latin words, e.g.: {"-", "´", "?", ":", ";", "."}

comprehensive review of text bibliographic source info

Bibliographic info for texts is missing and/or incorrect. A systematic review is desired.

odd characters in Smyth Greek 1999.04.0007

There are Unicode characters in Smyth which were not properly rendered in beta code. They must be visually cross-checked, and the Unicode equivalents then entered somewhere in the document (perhaps as comments).

I do not know if these will display in Perseus at this time so this is low priority.

Aristophanes' Frogs (English edition) has a strange citation identifier

The last line group in tlg0019.tlg009.perseus-eng1 has the following citation value (i.e. in the div @n attribute): "1528--1". Is this an error or is there some meaning to the --1 suffix that needs to be conveyed in the citation identifier? It's causing a problem for the Alpheios CTS code, and we need to know whether this is an error or really something we need to support in some way in the citation identifiers.
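
Whatever the answer for this particular value, a quick scan can show whether other files carry similar identifiers. The sketch below flags any div n attribute that is not a plain number. Treating plain numbers as the only expected form is an assumption, and the function name `odd_citation_values` and the sample fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

PLAIN_NUMBER = re.compile(r"\d+")


def odd_citation_values(root):
    """Return n values on div elements that are not plain numbers."""
    return [el.get("n") for el in root.iter()
            if el.tag.startswith("div")
            and el.get("n") is not None
            and not PLAIN_NUMBER.fullmatch(el.get("n"))]


sample = ET.fromstring('<body><div1 n="1"/><div1 n="1500"/><div1 n="1528--1"/></body>')
print(odd_citation_values(sample))
```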

Cross references with Gildersleeve Greek syntax

I noticed that Hom. Il. 1.1 has a reference to:
Basil L. Gildersleeve, Syntax of Classical Greek, Syntax of the simple sentence

which leads here:
/hopper/text?doc=Perseus:text:1999.04.0074:section=3
Gildersleeve grammar 3

the reference is buried in an enormous page and actually appears here:
Gildersleeve grammar 3.2.20

Is there a way to achieve more precision in the link from the Iliad text to the Gildersleeve? Is it a matter of the Gildersleeve structure or something in the way the cross references are coded?

Steps To Reproduce Not sure if it is random or limited to Homer text(s)

Additional Information Could be a student assignment of checking these references and seeing which ones work or could require Gildersleeve structural retagging. Need more info.

Also found random broken cross ref here:

/hopper/text?doc=Perseus:abo:tlg,0012,001:11:831 to
Basil L. Gildersleeve, Syntax of Classical Greek, Forms of the verbal predicate

may need to rethink the way Gildersleeve operates: perhaps good summer project

Plutarch texts missing notes

The newest Greek Plutarch texts do not have the notes. tags are currently commented out and need to be added by hand.
Allie and Zehava are both working on this. I am also having them check and tags, per Greg.

Per Allie:

plut.alex_gk.xml
plut.caes_gk.xml
plut.cic_gk.xml
plut.comp.dem.cic_gk.xml
plut.comp.nic.crass_gk.xml
plut.comp.per.fab_gk.xml

Will need to double check these since Zehava completed all of the others, so perhaps she used the wrong editions.

plut.cam_gk.xml needs to be checked (one tag is commented out)

Also need to add notes for Greek Loeb Moralia:
081_loeb_gk.xml
082_loeb_gk.xml
082a_loeb_gk.xml
082b_loeb_gk.xml
083_loeb_gk.xml
084a_loeb_gk.xml
084b_loeb_gk.xml
085_loeb_gk.xml
086_loeb_gk.xml
087_loeb_gk.xml
088_loeb_gk.xml
093_loeb_gk.xml
094_loeb_gk.xml
095_loeb_gk.xml
096_loeb_gk.xml
097_loeb_gk.xml
098_loeb_gk.xml
099_loeb_gk.xml
100_loeb_gk.xml
101_loeb_gk.xml
102_loeb_gk.xml

Will have to redo all of the Greek Lives because Greg removed all of them. I will try to go back into CVS to grab the notes from older versions so at least they can just copy and paste the notes into the correct locations.

Need to do Teubner Moralia:
081-108, 110-111

Finished most of the Lives notes. Files that are left:
plut.alc_gk.xml
plut.alex_gk.xml
plut.art_gk.xml
plut.caes_gk.xml
plut.cic_gk.xml
plut.lyc_gk.xml
plut.mar_gk.xml
plut.num_gk.xml
plut.pyrrh_gk.xml

Still need to do Greek Moralia for Loeb and Teubner (English are finished)

Lewis and Short (ls.xml) split entry

note: "Joannes" is not missing but does resolve to "Joannis" as the LS entry is split. Need to remerge the entry and check the morphology.

user reports:
I believe that I have found an omission in the Perseus version of Lewis and Short's Latin dictionary.

I find that searching for "ioannes" with the Latin Word Study Tool

http://www.perseus.tufts.edu/hopper/morph?l=Joannes&la=la

turns up an entry from Lewis and Short as follows:

Jōannis , is, m., = Ἰωάννης. I. John the Baptist, Lact. 4, 15, 2; Vulg. Matt. 3, 1.— Nom. Joannis, Prud. Cath. 7, 46.— II. John the Evangelist, Vulg. Matt. 4, 21; Prud. Apoth. 9.—Nom. Joannis, Prud. Cath. 6, 108.

That is to say, the headword for the article is "Joannis".

My paper copy of Lewis and Short (Oxford Univ. Press "First edition 1879 | Impression of 1980") reads:

"Jōannes (trisyl. and quadrisyl.) and Jōannis, is, m.", etc. as above.

The first-given headword "Joannes" thus seems to be missing from the Perseus version of L&S; this has the side effect that searching for it fails to identify it as a singular form:

Joannis
John the Baptist
(Show lexicon entry in Lewis & Short) (search)

joannes noun pl masc voc
joannes noun pl masc nom
joannes noun pl masc acc