ubermichael / isetools
Tools for parsing data for the Internet Shakespeare Editions
License: GNU General Public License v2.0
{s}, {r}, {w}, and {W} should be handled identically in the DOM. These are "typeforms" that appear a certain way in the printed text (ſ, ꝛ, vv, and VV respectively) which we generally want to handle as their modern equivalents in the digital text (s, r, w, and W).
There are a few cases where a lack of nesting should be considered an error. In particular,
In addition, a document should either have all its SCENE tags nested in ACT, or not have any ACTs at all.
The XMLWriter is not producing lig or typeform tags when it should.
For example, on the title page of doc_Lr_Q1.txt the word Hi{{s}t}orie is being serialized as a single text node "Hiſtorie", instead of the expected Hi<lig unicode="ſt"><typeform set="ſ">s</typeform>t</lig>orie.
Tested on doc_Lr_Q1.txt through the web interface on isebeta.
Some transformers (especially the DeprecationTransformer) can make a large number of changes to a DOM which are largely opaque from the user's perspective. These transformers would be a lot less scary to use if they logged a message for every modification they made so that the user could review said changes after the fact.
A textual diff gets us halfway there, but log messages would be able to offer explanations for why each change took place, and show how multiple textual changes are related (eg. moving something from one place to another).
Several tags could be simplified by allowing them to be "auto-closing". In particular, the end of these tags can all be quite easily implied, rather than requiring the editor to explicitly close them:
Self-closing tags could also be made auto-closing (L, TLN, QLN, WLN, BR, RULE, SPACE).
ORNAMENT should never span more than one line.
I stumbled across a ticket in Trac (#1051) that would be more appropriate here.
Unicode ellipses, three consecutive periods, and three periods separated by spaces should all be replaced by the texts toolchain with an element (<ellipsis>?) containing a single Unicode ellipsis character so they can be rendered identically.
I'm not convinced that this is something that belongs in code rather than editorial guidelines though.
If this is done in code, I don't see any reason to create a new tag for them. Just replace all ellipsis-like stuff with a real unicode ellipsis (…).
This affects Malcolm's code; I'm not sure whether it has been merged yet.
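If the replacement is done in code, it could be as small as one regular expression. This is only a sketch: the class name EllipsisNormalizer and its normalize method are made up for illustration and are not part of isetools.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an existing isetools class.
final class EllipsisNormalizer {
    // Three periods, each optionally separated by one space, or an existing U+2026.
    private static final Pattern ELLIPSIS_LIKE =
            Pattern.compile("\\. ?\\. ?\\.|\u2026");

    private EllipsisNormalizer() {}

    /** Replaces every ellipsis-like run with a single U+2026 character. */
    static String normalize(String text) {
        return ELLIPSIS_LIKE.matcher(text).replaceAll("\u2026");
    }

    public static void main(String[] args) {
        System.out.println(normalize("and out hee comes . . ."));
    }
}
```

A run of four periods would become an ellipsis followed by a period, which may or may not be what the editors want.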
See doc_Lr_Q1, near QLN 412 for an example where MARG is not getting serialized correctly. That line should result in markup something like this:
<l>
  <s k="123"></s>
  <marg>
    <l><sd t="entrance"><i>er Edgar</i></sd></l>
  </marg>
  <s k="123"><i>Edgar</i>; and out hee comes ...</s>
</l>
but instead we're getting
<l>
  <s k="123"></s>
  <marg/>
  <sd t="entrance"><i>er Edgar</i></sd>
  <i>Edgar</i>; and out hee comes ...
</l>
The text property of TagNodes appears to be getting set improperly by the parser.
For example, parsing the document <hello world="foo"/> results in a TagNode with a text value of <helloworld="foo"/>.
Don't have em. Need em.
The NestingValidator doesn't need to be quite so picky. Most tags should be allowed to cross without warning or error, including self-nested tags.
There are a few exceptions:
All those mentioned above should be errors, not simply warnings.
Sectioning elements should not be allowed to "cross" a split line (marked with @part). They may fully contain a run of split lines, but must not have their start or end tag occurring amongst them. The affected elements are:
I'm not sure about QUOTE and LINEGROUP. It might be useful in some cases for these, but it would make our XML serialization much more difficult...
The "part" attribute of the L tag (used for creating "split lines") should only be used in modern texts. Use in old-spelling texts should be reported as an error.
The schema already has a concept of "locations" that could be extended for use with attributes in order to cover this limitation (depends on #43).
Given that FRONTMATTER is auto-closing as per #2, a warning should be logged if there isn't any content outside of the front/backmatter of the document.
The schema includes a concept of "locations" (kinds of documents) where a subset of the defined tags are allowed, but this is not currently enforced anywhere. DOMValidator should be modified (or a new validator created) to only allow tags within a given location subset. The location to match against could either be encoded into the document itself in some way (eg. as an attribute on the root element?), and/or could be provided as a runtime parameter.
Several elements have identifying attributes that should have unique values within elements of their type. Add one or more validators to check that there are no duplicates.
Affected elements:
ACT/@n
SCENE/@n
SCENE/@n - but duplicates may exist between different acts
DIV/@name
L/@n - but duplicates may exist between different scenes
MS/@n within tags with the same @t
QLN/@n, TLN/@n, WLN/@n, and MS of the same type (eg. <QLN n="1"/> and <MS t="qln" n="1"/> overlap)
PAGE/@n
STANZA/@n when used outside of SCENE
Note that many (but not all) of these could be automatically fixed by the renumbering transformer.
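A rough sketch of how such a check might work, assuming the validator can walk the DOM and report each identifying attribute together with the scope it must be unique within (the enclosing ACT for SCENE/@n, the enclosing SCENE for L/@n, and so on). The UniqueAttributeCheck and Occurrence names are hypothetical and not taken from isetools.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not an existing isetools validator.
final class UniqueAttributeCheck {
    /** One occurrence of an identifying attribute, e.g. ("ACT 1", "SCENE/@n", "2"). */
    static final class Occurrence {
        final String scope;
        final String attribute;
        final String value;

        Occurrence(String scope, String attribute, String value) {
            this.scope = scope;
            this.attribute = attribute;
            this.value = value;
        }
    }

    private UniqueAttributeCheck() {}

    /** Returns one message for each value that repeats within the same scope. */
    static List<String> findDuplicates(List<Occurrence> occurrences) {
        Set<List<String>> seen = new HashSet<>();
        List<String> errors = new ArrayList<>();
        for (Occurrence o : occurrences) {
            // Set.add returns false when the (scope, attribute, value) triple was already seen.
            if (!seen.add(Arrays.asList(o.scope, o.attribute, o.value))) {
                errors.add("Duplicate " + o.attribute + " value " + o.value + " within " + o.scope);
            }
        }
        return errors;
    }
}
```

Treating QLN/@n and MS/@n with t="qln" as overlapping would then only be a matter of reporting both under the same attribute label.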
INDENT, RULE, and SPACE with 0 length don't make any sense and are probably editor error.
Add a validator that requires these lengths to be positive and non-zero.
(negative non-zero length might make sense for INDENT though... I'll check with MB)
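The check itself is small. A minimal sketch, assuming the length arrives as the raw attribute string; LengthCheck is a hypothetical name, and which attribute actually carries the length is not stated here, so the caller passes the value in. Negative INDENT is rejected for now, pending the answer to the question above.

```java
import java.math.BigDecimal;

// Hypothetical sketch, not an existing isetools validator.
final class LengthCheck {
    private LengthCheck() {}

    /** Returns null when the length is acceptable, otherwise an error message. */
    static String check(String tagName, String rawLength) {
        if (rawLength == null) {
            return tagName + " has no length";
        }
        try {
            BigDecimal length = new BigDecimal(rawLength.trim());
            // signum() is -1, 0, or 1; only strictly positive lengths pass.
            if (length.signum() <= 0) {
                return tagName + " length must be positive, got " + rawLength;
            }
            return null;
        } catch (NumberFormatException e) {
            return tagName + " length is not a number: " + rawLength;
        }
    }
}
```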
All IML tags should be case-insensitive. Eg. <HW> and <hw> should be considered identical.
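One way to get this, sketched under the assumption that tag names can be normalized once when the node is created: fold every name to a canonical case with a fixed locale. TagNames is a hypothetical helper, not an isetools class.

```java
import java.util.Locale;

// Hypothetical helper, not an existing isetools class.
final class TagNames {
    private TagNames() {}

    /** Folds a tag name to upper case; Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i. */
    static String canonical(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    /** True when two tag names differ only by case. */
    static boolean same(String a, String b) {
        return canonical(a).equals(canonical(b));
    }
}
```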
Add a transformer that replaces deprecated tagging with the preferred new tagging when possible.
Add a transformer that strips any redundant tagging (identified as per #5) from the DOM.
Attributes with whitespace between the = and the opening " are not being parsed correctly. It looks like the element and attribute are built fine, but the attribute value ends up containing the opening quote.
This might affect whitespace on the other side of the = as well, I haven't checked.
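For comparison, here is a sketch of attribute matching that tolerates whitespace on both sides of the =. It uses a plain regular expression and is not how the isetools parser is actually built; AttributeSketch and parse are hypothetical names used only to pin down the expected behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the isetools parser.
final class AttributeSketch {
    // name, optional whitespace, '=', optional whitespace, then a double-quoted value.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_][\\w-]*)\\s*=\\s*\"([^\"]*)\"");

    private AttributeSketch() {}

    /** Extracts attribute name/value pairs from the raw text of a single tag. */
    static Map<String, String> parse(String tagText) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(tagText);
        while (m.find()) {
            attributes.put(m.group(1), m.group(2)); // the value excludes both quotes
        }
        return attributes;
    }
}
```

With this pattern, <ACT n = "1"> and <ACT n="1"> produce the same map, which is the behaviour the parser should match.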
Since we're now using the isetools as a library for a multi-threaded web service, having a singleton Log really doesn't make sense. Separate users validating separate documents don't want to see each other's error messages :) Worse than that, because Log is not synchronized, multi-threaded use may even lead to crashes...
It looks like @emmental made some progress on rewriting Log so that there was a separate instance for each DOM, but I'm not sure how far he got with that approach. It looks like he has a branch for it at https://github.com/emmental/isetools/tree/separate_logs.
An alternate strategy might be simply to make Log a singleton-per-thread using ThreadLocals. I imagine that would require a lot less work.
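A minimal sketch of the singleton-per-thread idea. ThreadLog stands in for the real Log class, whose actual fields and methods are not shown here, so every name below is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Log, illustrating one instance per thread.
final class ThreadLog {
    // Each thread lazily gets its own instance on first access.
    private static final ThreadLocal<ThreadLog> CURRENT =
            ThreadLocal.withInitial(ThreadLog::new);

    private final List<String> messages = new ArrayList<>();

    private ThreadLog() {}

    static ThreadLog getInstance() {
        return CURRENT.get();
    }

    void add(String message) {
        messages.add(message);
    }

    /** Returns a copy so callers cannot modify the log's internal list. */
    List<String> messages() {
        return new ArrayList<>(messages);
    }

    /** Drops the current thread's log; the next getInstance() call creates a fresh one. */
    static void reset() {
        CURRENT.remove();
    }
}
```

One caveat for the web service: servlet containers usually reuse pooled threads, so reset() would need to be called at the start or end of every request, otherwise messages from one user's document could still leak into the next request handled by the same thread.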
Add a validator to ensure there is at most one FRONTMATTER and one BACKMATTER.
HW should always be used at the end of a line, and never span more than one line.
After discussion with our coordinating editors, I've come to the conclusion that digraphs and ligatures are identical for our purposes. The DigraphCharNode class should be dropped, and all curly-escapes it supports should instead be handled with LigatureCharNode.
RULE should only be allowed on a line with no non-whitespace text content. (ie. it is not an "inline" tag)
Redundantly nested tagging should log a warning. For example, <EM>one <EM>two</EM> three</EM> is redundant since the inner EM doesn't add any information (we don't support multiple "levels" of emphasis). The following tags would be affected:
@size is the same)
@lang is the same)
Of course, the "crossing" errors described in #4 would also apply to self-nesting.
FOREIGN/@lang should be a valid two-letter ISO 639-1 or three-letter ISO 639-3 language code. Unrecognized codes should log a warning.
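A partial sketch of the check. The JDK exposes the two-letter codes through Locale.getISOLanguages(), but as far as I know it does not ship a complete ISO 639-3 table, so three-letter codes are only checked for shape here; a real validator would need its own code list. LanguageCodes is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not an existing isetools validator.
final class LanguageCodes {
    // Two-letter ISO 639 codes known to the JDK.
    private static final Set<String> TWO_LETTER =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    private LanguageCodes() {}

    /** True for a known two-letter code, or for any string shaped like a three-letter code. */
    static boolean looksValid(String code) {
        if (code == null) {
            return false;
        }
        if (code.length() == 2) {
            return TWO_LETTER.contains(code);
        }
        // Assumption: shape check only; this does not prove the code is assigned in ISO 639-3.
        return code.matches("[a-z]{3}");
    }
}
```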
Each kind of sectioning should fully-partition a document, and if this is not the case a warning should be logged.
Specifically,
Of course, each type may simply not be used at all as well.
When an element has two attributes with the same name, all but the last instance are silently dropped (eg. <ACT n="1" n="2"> would appear in the DOM with only one attribute "n" set to "2").
ca.nines.ise.node.TagNode#setAttribute should log a warning if it notices this is about to happen.
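Roughly what that could look like, assuming attributes are kept in a map. TagNodeSketch is a stand-in and does not reflect the real fields of ca.nines.ise.node.TagNode; the warnings list stands in for whatever logging call the real class would use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TagNode, showing only the duplicate check.
final class TagNodeSketch {
    private final Map<String, String> attributes = new LinkedHashMap<>();
    private final List<String> warnings = new ArrayList<>();

    void setAttribute(String name, String value) {
        // Map.put returns the previous value, or null if the name was not set before.
        String previous = attributes.put(name, value);
        if (previous != null) {
            warnings.add("Attribute " + name + " set more than once; \""
                    + previous + "\" replaced by \"" + value + "\"");
        }
    }

    String getAttribute(String name) {
        return attributes.get(name);
    }

    List<String> warnings() {
        return new ArrayList<>(warnings);
    }
}
```

The last value still wins, as it does today; the only change is that the overwrite is no longer silent.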
Don't allow multiple speech prefixes in s.