ubermichael / isetools


Tools for parsing data for the Internet Shakespeare Editions

License: GNU General Public License v2.0

Shell 0.03% Java 98.67% ANTLR 1.29%

isetools's People

telic emmental internetshakespeare

isetools's Issues

unify "typeform" handling

{s}, {r}, {w}, and {W} should be handled identically in the DOM. These are "typeforms" that appear a certain way in the printed text (ſ, ꝛ, vv, and VV respectively) which we want to generally handle as their modern equivalents in the digital text (s, r, w, and W).
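
One way to keep the four typeforms on a single code path is a shared lookup table. The following is only a sketch: the class `Typeforms` and its methods are hypothetical and are not part of the existing isetools API.

```java
import java.util.Map;

/** Hypothetical lookup table for typeforms; not an existing isetools class. */
final class Typeforms {
    /** Curly-escape letter mapped to the glyph(s) as printed. */
    private static final Map<String, String> PRINTED = Map.of(
            "s", "\u017F",  // long s
            "r", "\uA75B",  // r rotunda
            "w", "vv",
            "W", "VV");

    private Typeforms() {}

    /** True if the escape code (the letter inside the braces) names a typeform. */
    static boolean isTypeform(String code) {
        return PRINTED.containsKey(code);
    }

    /** Glyph(s) as they appear in the printed text, or null if unknown. */
    static String printed(String code) {
        return PRINTED.get(code);
    }

    /** Modern equivalent, which for all four typeforms is the escape letter itself. */
    static String modern(String code) {
        if (!isTypeform(code)) {
            throw new IllegalArgumentException("not a typeform: " + code);
        }
        return code;
    }
}
```

With a table like this, a single node class could serve all four escapes and the writers would only need to choose between `printed` and `modern`.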

Feature: required nesting validator

There are a few cases where a lack of nesting should be considered an error. In particular,

SP must be a descendant of S
BR must be a descendant of HW
CW, RT, SIG, PN, and COL must be descendants of PAGE

In addition, a document should either have all its SCENE tags nested in ACT, or not have any ACTs at all.
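
The three descendant rules above could be checked with a stack of open tags. This is a rough sketch, not the real validator: it assumes the document has already been reduced to a hypothetical list of event strings (`"S"` opens, `"/S"` closes, `"BR/"` is self-closing), and it does not cover the ACT/SCENE consistency rule.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a required-nesting check over a flat event list. */
final class RequiredNesting {
    /** Tag mapped to the ancestor it must appear inside. */
    private static final Map<String, String> REQUIRED_ANCESTOR = Map.of(
            "SP", "S",
            "BR", "HW",
            "CW", "PAGE",
            "RT", "PAGE",
            "SIG", "PAGE",
            "PN", "PAGE",
            "COL", "PAGE");

    private RequiredNesting() {}

    static List<String> check(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        List<String> errors = new ArrayList<>();
        for (String event : events) {
            if (event.startsWith("/")) {
                // Closing tag: drop the most recently opened tag of that name.
                open.removeFirstOccurrence(event.substring(1));
                continue;
            }
            boolean selfClosing = event.endsWith("/");
            String name = selfClosing ? event.substring(0, event.length() - 1) : event;
            String needed = REQUIRED_ANCESTOR.get(name);
            if (needed != null && !open.contains(needed)) {
                errors.add(name + " must be a descendant of " + needed);
            }
            if (!selfClosing) {
                open.push(name);
            }
        }
        return errors;
    }
}
```

Because closing tags remove by name rather than strictly popping, the check tolerates crossed tags and only reports the missing-ancestor cases.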

missing `lig` and `typeform` tags in XMLWriter output

The XMLWriter is not producing lig or typeform tags when it should.

For example, on the title page of doc_Lr_Q1.txt the word Hi{{s}t}orie is being serialized as a single text node "Hiſtorie", instead of the expected Hi<lig unicode="ﬅ"><typeform set="ſ">s</typeform>t</lig>orie.

Tested on doc_Lr_Q1.txt through the web interface on isebeta.

transformers should log their changes

Some transformers (especially the DeprecationTransformer) can make a large number of changes to a DOM which are largely opaque from the user's perspective. These transformers would be a lot less scary to use if they logged a message for every modification they made so that the user could review said changes after the fact.

A textual diff gets us halfway there, but log messages would be able to offer explanations for why each change took place, and show how multiple textual changes are related (eg. moving something from one place to another).

Feature: auto-closing tags

Several tags could be simplified by allowing them to be "auto-closing". In particular, the end of these tags can all be quite easily implied, rather than requiring the editor to explicitly close them:

WORK (at eof)
S (at next S)
ACT (at next ACT)
SCENE (at next SCENE or ACT)
DIV (at next DIV, ACT, SCENE, or BACKMATTER)
MODE (at next MODE)
BACKMATTER (at eof or closing WORK)
FRONTMATTER (at next ACT, SCENE, or DIV)
PAGE (at next PAGE)
COL (at next COL or PAGE)

Self-closing tags could also be made auto-closing (L, TLN, QLN, WLN, BR, RULE, SPACE).
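
The tag-triggered rules in the list above fit naturally into a table. The sketch below is illustrative only (the class `AutoClose` is hypothetical). The end-of-file cases (WORK, BACKMATTER) and the closing-WORK case are not triggered by an opening tag, so they are left out and would need separate handling.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical table of which opening tags implicitly close an open tag. */
final class AutoClose {
    private static final Map<String, Set<String>> CLOSED_BY = Map.of(
            "S", Set.of("S"),
            "ACT", Set.of("ACT"),
            "SCENE", Set.of("SCENE", "ACT"),
            "DIV", Set.of("DIV", "ACT", "SCENE", "BACKMATTER"),
            "MODE", Set.of("MODE"),
            "FRONTMATTER", Set.of("ACT", "SCENE", "DIV"),
            "PAGE", Set.of("PAGE"),
            "COL", Set.of("COL", "PAGE"));

    private AutoClose() {}

    /** True if encountering an opening nextTag should implicitly close openTag. */
    static boolean closes(String openTag, String nextTag) {
        return CLOSED_BY.getOrDefault(openTag, Set.of()).contains(nextTag);
    }
}
```

A parser or transformer could consult `closes` for each currently open tag whenever a new opening tag arrives, and emit the implied end tags before it.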

Feature: disallow multi-line ORNAMENT

ORNAMENT should never span more than one line.

ellipsis

I stumbled across a ticket in Trac (#1051) that would be more appropriate here.

Unicode ellipses, three consecutive periods, and three periods separated by spaces should all be replaced by the texts toolchain with an element (<ellipsis>?) containing a single unicode ellipsis character so they can be rendered identically.

I'm not convinced that this is something that belongs in code rather than editorial guidelines though.

If this is done in code, I don't see any reason to create a new tag for them. Just replace all ellipsis-like stuff with a real unicode ellipsis (…).
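
If it does end up in code, the replacement itself is small. A minimal sketch, assuming "separated by spaces" means exactly one space between periods; the class name `Ellipses` is made up for illustration:

```java
import java.util.regex.Pattern;

/** Hypothetical helper that normalizes ellipsis-like runs to U+2026. */
final class Ellipses {
    /** Three consecutive periods, or three periods separated by single spaces. */
    private static final Pattern ELLIPSIS_LIKE = Pattern.compile("\\.{3}|\\. \\. \\.");

    private Ellipses() {}

    static String normalize(String text) {
        return ELLIPSIS_LIKE.matcher(text).replaceAll("\u2026");
    }
}
```

Existing unicode ellipses pass through unchanged, and single periods are left alone. A run of four periods becomes an ellipsis followed by a period, which may or may not be what editors want.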

marg handled incorrectly in XMLWriter

This affects Malcolm's code; I'm not sure whether it has been merged yet.

See doc_Lr_Q1, near QLN 412 for an example where MARG is not getting serialized correctly. That line should result in markup something like this:

<l>
  <s k="123"></s>
  <marg>
    <l><sd t="entrance"><i>er Edgar</i></sd></l>
  </marg>
  <s k="123"><i>Edgar</i>; and out hee comes ...</s>
</l>

but instead we're getting

<l>
  <s k="123"></s>
  <marg/>
  <sd t="entrance"><i>er Edgar</i></sd>
  <i>Edgar</i>; and out hee comes ...
</l>

TagNode text is missing whitespace

The text property of TagNodes appears to be getting set improperly by the parser.

For example, parsing the document

<hello world="foo"/>

results in a TagNode with a text value of

<helloworld="foo"/>

Attribute line numbers

Don't have em. Need em.

don't warn about most nesting

The NestingValidator doesn't need to be quite so picky. Most tags should be allowed to cross without warning or error, including self-nested tags.

There are a few exceptions:

sectioning elements (FRONTMATTER, BACKMATTER, DIV, ACT, SCENE) should not cross MARG or BRACEGROUP
ACT, SCENE, and DIV should not cross others of the same type
DIV should not cross FRONTMATTER or BACKMATTER
ORNAMENT should not cross lines
CL and LD should not cross each other
COL should not cross itself

All those mentioned above should be errors, not simply warnings.

Feature: disallow sectioning to cross split-lines

Sectioning elements should not be allowed to "cross" a split line (marked with @part). They may fully contain a run of split lines, but must not have their start or end tag occurring amongst them. The affected elements are:

ACT
BACKMATTER
DIV
FRONTMATTER
MARG
PAGE
SCENE
STANZA

I'm not sure about QUOTE and LINEGROUP. It might be useful in some cases for these, but it would make our XML serialization much more difficult...

only allow "part" attribute in modern

The "part" attribute of the L tag (used for creating "split lines") should only be used in modern texts. Use in old-spelling texts should be reported as an error.

The schema already has a concept of "locations" that could be extended for use with attributes in order to cover this limitation (depends on #43).

Feature: warn about missing main content

Given that FRONTMATTER is auto-closing as per #2, a warning should be logged if there isn't any content outside of the front/backmatter of the document.

validate DOM using "location" subset schema

The schema includes a concept of "locations" (kinds of documents) where a subset of the defined tags are allowed, but this is not currently enforced anywhere. DOMValidator should be modified (or a new validator created) to only allow tags within a given location subset. The location to match against could either be encoded into the document itself in some way (eg. as an attribute on the root element?), and/or could be provided as a runtime parameter.

Feature: validator(s) to check ID uniqueness

Several elements have identifying attributes that should have unique values within elements of their type. Add one or more validators to check that there are no duplicates.

Affected elements:

ACT/@n
SCENE/@n - but duplicates may exist between different acts
DIV/@name
L/@n - but duplicates may exist between different scenes
MS/@n among MS tags with the same @t
QLN/@n, TLN/@n, WLN/@n
- should also consider MS of the same type (eg <QLN n="1"/> and <MS t="qln" n="1"/> overlap)
PAGE/@n
STANZA/@n when used outside of SCENE

Note that many (but not all) of these could be automatically fixed by the renumbering transformer.
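
The core of each of these checks is the same duplicate scan. Here is a small sketch of that piece alone, with a hypothetical class name; collecting the attribute values from the DOM is not shown.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: finds values that occur more than once. */
final class Duplicates {
    private Duplicates() {}

    /** Returns the repeated values, in the order their first repeat was seen. */
    static Set<String> find(List<String> values) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String value : values) {
            // Set.add returns false when the value is already present.
            if (!seen.add(value)) {
                duplicates.add(value);
            }
        }
        return duplicates;
    }
}
```

The scoped cases (SCENE/@n within an act, L/@n within a scene, MS/@n within a type) could reuse the same scan by prefixing each value with its scope, for example `"1/2"` for scene 2 of act 1.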

require indent/rule/space lengths to be non-zero

INDENT, RULE, and SPACE with 0 length don't make any sense and are probably editor error.

Add a validator that requires these lengths to be positive and non-zero.

(negative non-zero length might make sense for INDENT though... I'll check with MB)

Feature: case-insensitive tagging

All IML tags should be case-insensitive. Eg. <HW> and <hw> should be considered identical.
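
A sketch of the normalization step, assuming upper case is chosen as the canonical form (the issue does not say which case should win) and using a hypothetical helper class:

```java
import java.util.Locale;

/** Hypothetical helper for comparing tag names without regard to case. */
final class TagNames {
    private TagNames() {}

    /** Canonical form of a tag name; Locale.ROOT avoids locale-specific case rules. */
    static String canonical(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    static boolean same(String a, String b) {
        return canonical(a).equals(canonical(b));
    }
}
```

Using `Locale.ROOT` matters because the default-locale `toUpperCase()` maps `i` differently under a Turkish locale, which would break tags such as `sig`.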

Feature: transformer to resolve deprecations

Add a transformer that replaces deprecated tagging with the preferred new tagging when possible.

Feature: transformer to remove redundant tagging

Add a transformer that strips any redundant tagging (identified as per #5) from the DOM.

attributes with whitespace parsed incorrectly

Attributes with whitespace between the = and the opening " are not being parsed correctly. It looks like the element and attribute are built fine, but the attribute value ends up containing the opening quote.

This might affect whitespace on the other side of the = as well; I haven't checked.
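
For reference, the intended behaviour can be pinned down with a small regex-based sketch. This is not the ANTLR grammar the parser actually uses; the class `AttributeSketch` is hypothetical and only shows whitespace being tolerated on both sides of the `=`, which could serve as the expected result in a regression test.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reference implementation of attribute extraction. */
final class AttributeSketch {
    /** name, optional whitespace, '=', optional whitespace, double-quoted value. */
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_][\\w-]*)\\s*=\\s*\"([^\"]*)\"");

    private AttributeSketch() {}

    /** Extracts attributes from the raw text of a single tag, in source order. */
    static Map<String, String> parse(String tagText) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = ATTRIBUTE.matcher(tagText);
        while (matcher.find()) {
            attributes.put(matcher.group(1), matcher.group(2));
        }
        return attributes;
    }
}
```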

Log doesn't play well with multi-threading

Since we're now using the isetools as a library for a multi-threaded web service, having a singleton Log really doesn't make sense. Separate users validating separate documents don't want to see each other's error messages :) Worse than that, because Log is not synchronized, multi-threaded use may even lead to crashes...

It looks like @emmental made some progress on rewriting Log so that there was a separate instance for each DOM, but I'm not sure how far he got with that approach. It looks like he has a branch for it at https://github.com/emmental/isetools/tree/separate_logs.

An alternate strategy might be simply to make Log a singleton-per-thread using ThreadLocals. I imagine that would require a lot less work.
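
A minimal sketch of the ThreadLocal idea, using a hypothetical `ThreadLog` class with plain string messages rather than the real Log and its message types:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-thread log; each thread sees only its own messages. */
final class ThreadLog {
    private static final ThreadLocal<List<String>> MESSAGES =
            ThreadLocal.withInitial(ArrayList::new);

    private ThreadLog() {}

    static void add(String message) {
        MESSAGES.get().add(message);
    }

    /** Snapshot of the calling thread's messages. */
    static List<String> messages() {
        return new ArrayList<>(MESSAGES.get());
    }

    /** Discards the calling thread's messages. */
    static void clear() {
        MESSAGES.remove();
    }
}
```

One caveat: a web container typically reuses pooled threads, so each request would need to call `clear()` when it finishes (or before it starts), otherwise messages from a previous user's document would leak into the next request handled by the same thread.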

only one front/backmatter

Add a validator to ensure there is at most one FRONTMATTER and one BACKMATTER.

Feature: check usage of HW

HW should always be used at the end of a line, and never span more than one line.

digraphs are ligatures

After discussion with our coordinating editors, I've come to the conclusion that digraphs and ligatures are identical for our purposes. The DigraphCharNode class should be dropped, and all curly-escapes it supports should instead be handled with LigatureCharNode.

disallow RULE on a line with content

RULE should only be allowed on a line with no non-whitespace text content. (ie. it is not an "inline" tag)

Feature: log warnings for redundant tagging

Redundantly nested tagging should log a warning. For example, <EM>one <EM>two</EM> three</EM> is redundant since the inner EM doesn't add any information (we don't support multiple "levels" of emphasis). The following tags would be affected:

BLL (if no intervening R)
C
CL
CW
EM
FONT (iff @size is the same)
FOREIGN (iff @lang is the same)
I
J
LD
LS
PN
R (if no intervening BLL)
RA
RT
SC
SD
SIG
SP
TITLE
WORK

Of course, the "crossing" errors described in #4 would also apply to self-nesting.
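
For the tags in the list that have no extra condition, detection comes down to checking whether a tag of the same name is already open. A rough sketch follows, assuming a hypothetical flat event list (`"EM"` opens, `"/EM"` closes). The conditional cases (BLL, R, FONT, FOREIGN) are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a redundant self-nesting check. */
final class RedundantNesting {
    /** Tags for which any self-nesting is redundant, regardless of attributes. */
    private static final Set<String> UNCONDITIONAL = Set.of(
            "C", "CL", "CW", "EM", "I", "J", "LD", "LS", "PN",
            "RA", "RT", "SC", "SD", "SIG", "SP", "TITLE", "WORK");

    private RedundantNesting() {}

    static List<String> check(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        List<String> warnings = new ArrayList<>();
        for (String event : events) {
            if (event.startsWith("/")) {
                // Closing tag: drop the most recently opened tag of that name.
                open.removeFirstOccurrence(event.substring(1));
            } else {
                if (UNCONDITIONAL.contains(event) && open.contains(event)) {
                    warnings.add("redundant nested " + event);
                }
                open.push(event);
            }
        }
        return warnings;
    }
}
```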

Feature: validator to check FOREIGN/@lang codes

FOREIGN/@lang should be a valid two-letter ISO 639-1 or three-letter ISO 639-3 language code. Unrecognized codes should log a warning.
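
A partial sketch using only the JDK: `Locale.getISOLanguages()` supplies the two-letter codes, but the JDK does not ship an ISO 639-3 table, so the three-letter branch below only checks the shape of the code. A real validator would need an external list of 639-3 codes. The class name `LangCodes` is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a FOREIGN/@lang check. */
final class LangCodes {
    /** Two-letter language codes known to the JDK (lower case). */
    private static final Set<String> TWO_LETTER =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    private LangCodes() {}

    static boolean looksValid(String code) {
        if (code.length() == 2) {
            return TWO_LETTER.contains(code);
        }
        // Assumption: without a 639-3 table, accept any three lower-case letters.
        return code.matches("[a-z]{3}");
    }
}
```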

Feature: validate section coverage

Each kind of sectioning should fully-partition a document, and if this is not the case a warning should be logged.

Specifically,

DIVs should either be used only in FRONTMATTER and BACKMATTER, or contain all text content in the document
ACTs and SCENEs should contain all text content in the document that is not in FRONTMATTER or BACKMATTER
PAGEs should contain all text content in the document

Of course, each type may simply not be used at all as well.

log warning for duplicate attributes

When an element has two attributes with the same name, all but the last instance are silently dropped (eg. <ACT n="1" n="2"> would appear in the DOM with only one attribute "n" set to "2").

ca.nines.ise.node.TagNode#setAttribute should log a warning if it notices this is about to happen.
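
The check itself could be as small as inspecting the return value of `Map.put`. The sketch below uses a hypothetical stand-in class and collects warnings in a list; it does not reproduce the real TagNode or Log.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a tag's attribute storage. */
final class AttributeBag {
    private final Map<String, String> attributes = new LinkedHashMap<>();
    private final List<String> warnings = new ArrayList<>();

    void setAttribute(String name, String value) {
        // Map.put returns the previous value, or null if the name was unset.
        String previous = attributes.put(name, value);
        if (previous != null) {
            warnings.add("duplicate attribute " + name + ": \"" + previous
                    + "\" replaced by \"" + value + "\"");
        }
    }

    String getAttribute(String name) {
        return attributes.get(name);
    }

    List<String> warnings() {
        return warnings;
    }
}
```

If transformers also call setAttribute to overwrite values on purpose, the warning may need to be limited to the parsing path so that legitimate updates are not reported.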

One sp per s.

Don't allow multiple speech prefixes in s.