isetools's People

Contributors

telic, ubermichael

isetools's Issues

unify "typeform" handling

{s}, {r}, {w}, and {W} should be handled identically in the DOM. These are "typeforms" that appear a certain way in the printed text (ſ, ꝛ, vv, and VV respectively) which we want to generally handle as their modern equivalents in the digital text (s, r, w, and W).

Feature: required nesting validator

There are a few cases where a lack of nesting should be considered an error. In particular,

  • SP must be a descendant of S
  • BR must be a descendant of HW
  • CW, RT, SIG, PN, and COL must be descendants of PAGE

In addition, a document should either have all its SCENE tags nested in ACT, or not have any ACTs at all.
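
The per-tag rules above could be checked with a simple lookup. This is only a minimal sketch: the class `RequiredNesting`, its method, and the idea of passing the open ancestors as a list of tag names are assumptions made for illustration, not existing isetools API. It needs Java 9+ for `Map.of`, and it does not cover the all-or-nothing SCENE/ACT rule.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not an existing isetools validator.
class RequiredNesting {
    // Tag name -> ancestor it must appear inside (taken from the list above).
    private static final Map<String, String> REQUIRED_ANCESTOR = Map.of(
            "SP", "S",
            "BR", "HW",
            "CW", "PAGE",
            "RT", "PAGE",
            "SIG", "PAGE",
            "PN", "PAGE",
            "COL", "PAGE");

    // Returns true if the tag has no nesting requirement, or if the required
    // ancestor is among the currently open tags.
    static boolean isValid(String tag, List<String> openAncestors) {
        String required = REQUIRED_ANCESTOR.get(tag);
        return required == null || openAncestors.contains(required);
    }
}
```

A validator built this way would call `isValid` for each start tag it visits and log an error whenever it returns false.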

missing `lig` and `typeform` tags in XMLWriter output

The XMLWriter is not producing lig or typeform tags when it should.

For example, on the title page of doc_Lr_Q1.txt the word Hi{{s}t}orie is being serialized as a single text node "Hiſtorie", instead of the expected Hi<lig unicode="ſt"><typeform set="ſ">s</typeform>t</lig>orie.

Tested on doc_Lr_Q1.txt through the web interface on isebeta.

transformers should log their changes

Some transformers (especially the DeprecationTransformer) can make a large number of changes to a DOM which are largely opaque from the user's perspective. These transformers would be a lot less scary to use if they logged a message for every modification they made so that the user could review said changes after the fact.

A textual diff gets us halfway there, but log messages would be able to offer explanations for //why// each change took place, and show how multiple textual changes are related (e.g. moving something from one place to another).

Feature: auto-closing tags

Several tags could be simplified by allowing them to be "auto-closing". In particular, the end of these tags can all be quite easily implied, rather than requiring the editor to explicitly close them:

  • WORK (at eof)
  • S (at next S)
  • ACT (at next ACT)
  • SCENE (at next SCENE or ACT)
  • DIV (at next DIV, ACT, SCENE, or BACKMATTER)
  • MODE (at next MODE)
  • BACKMATTER (at eof or closing WORK)
  • FRONTMATTER (at next ACT, SCENE, or DIV)
  • PAGE (at next PAGE)
  • COL (at next COL or PAGE)

Self-closing tags could also be made auto-closing (L, TLN, QLN, WLN, BR, RULE, SPACE).
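
One way to express the rules above is a table from each open tag to the start tags that implicitly close it. The sketch below is an assumption-laden illustration: `AutoCloseRules` and `isClosedBy` are hypothetical names, and the end-of-file cases (WORK, BACKMATTER) are left out because they depend on how the parser signals eof. It needs Java 9+ for `Map.of` and `Set.of`.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not an existing isetools class.
class AutoCloseRules {
    // Open tag -> start tags that imply its end (from the list above).
    private static final Map<String, Set<String>> CLOSED_BY = Map.of(
            "S", Set.of("S"),
            "ACT", Set.of("ACT"),
            "SCENE", Set.of("SCENE", "ACT"),
            "DIV", Set.of("DIV", "ACT", "SCENE", "BACKMATTER"),
            "MODE", Set.of("MODE"),
            "FRONTMATTER", Set.of("ACT", "SCENE", "DIV"),
            "PAGE", Set.of("PAGE"),
            "COL", Set.of("COL", "PAGE"));

    // True if encountering the start of `next` should close an open `open` tag.
    static boolean isClosedBy(String open, String next) {
        return CLOSED_BY.getOrDefault(open, Set.of()).contains(next);
    }
}
```

For example, `isClosedBy("SCENE", "ACT")` is true while `isClosedBy("ACT", "SCENE")` is false, so a new SCENE stays inside the current ACT.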

ellipsis

I stumbled across a ticket in Trac (#1051) that would be more appropriate here.

Unicode ellipses, three consecutive periods, and three periods separated by spaces should all be replaced by the texts toolchain with an element (<ellipsis>?) containing a single unicode ellipsis character so they can be rendered identically.

I'm not convinced that this is something that belongs in code rather than editorial guidelines though.

If this is done in code, I don't see any reason to create a new tag for them. Just replace all ellipsis-like stuff with a real unicode ellipsis (…).
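
If the code route is taken, the replacement could be as small as one regular expression. This is a minimal sketch only: `EllipsisNormalizer` is a hypothetical helper, not an existing isetools class, and it assumes at most one space between the periods.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of the isetools API.
class EllipsisNormalizer {
    // Matches "...", ". . ." (at most one space between periods),
    // or an existing U+2026 character.
    private static final Pattern ELLIPSIS_LIKE =
            Pattern.compile("\\. ?\\. ?\\.|\u2026");

    // Replaces every ellipsis-like run with a single U+2026 character.
    static String normalize(String text) {
        return ELLIPSIS_LIKE.matcher(text).replaceAll("\u2026");
    }
}
```

Whether a fourth period (an ellipsis followed by a full stop) should be treated specially is an editorial question this sketch does not try to answer.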

marg handled incorrectly in XMLWriter

This affects Malcolm's code; I'm not sure whether it has been merged yet.

See doc_Lr_Q1, near QLN 412 for an example where MARG is not getting serialized correctly. That line should result in markup something like this:

<l>
  <s k="123"></s>
  <marg>
    <l><sd t="entrance"><i>er Edgar</i></sd></l>
  </marg>
  <s k="123"><i>Edgar</i>; and out hee comes ...</s>
</l>

but instead we're getting

<l>
  <s k="123"></s>
  <marg/>
  <sd t="entrance"><i>er Edgar</i></sd>
  <i>Edgar</i>; and out hee comes ...
</l>

TagNode text is missing whitespace

The text property of TagNodes appears to be getting set improperly by the parser.

For example, parsing the document

<hello world="foo"/>

results in a TagNode with a text value of

<helloworld="foo"/>

don't warn about most nesting

The NestingValidator doesn't need to be quite so picky. Most tags should be allowed to cross without warning or error, including self-nested tags.

There are a few exceptions:

  • sectioning elements (FRONTMATTER, BACKMATTER, DIV, ACT, SCENE) should not cross MARG or BRACEGROUP
  • ACT, SCENE, and DIV should not cross others of the same type
  • DIV should not cross FRONTMATTER or BACKMATTER
  • ORNAMENT should not cross lines
  • CL and LD should not cross each other
  • COL should not cross itself

All those mentioned above should be errors, not simply warnings.

Feature: disallow sectioning to cross split-lines

Sectioning elements should not be allowed to "cross" a split line (marked with @part). They may fully contain a run of split lines, but must not have their start or end tag occurring amongst them. The affected elements are:

  • ACT
  • BACKMATTER
  • DIV
  • FRONTMATTER
  • MARG
  • PAGE
  • SCENE
  • STANZA

I'm not sure about QUOTE and LINEGROUP. It might be useful in some cases for these, but it would make our XML serialization much more difficult...

only allow "part" attribute in modern

The "part" attribute of the L tag (used for creating "split lines") should only be used in modern texts. Use in old-spelling texts should be reported as an error.

The schema already has a concept of "locations" that could be extended for use with attributes in order to cover this limitation (depends on #43).

validate DOM using "location" subset schema

The schema includes a concept of "locations" (kinds of documents) where a subset of the defined tags are allowed, but this is not currently enforced anywhere. DOMValidator should be modified (or a new validator created) to only allow tags within a given location subset. The location to match against could either be encoded into the document itself in some way (eg. as an attribute on the root element?), and/or could be provided as a runtime parameter.

Feature: validator(s) to check ID uniqueness

Several elements have identifying attributes that should have unique values within elements of their type. Add one or more validators to check that there are no duplicates.

Affected elements:

  • ACT/@n
  • SCENE/@n - but duplicates may exist between different acts
  • DIV/@name
  • L/@n - but duplicates may exist between different scenes
  • MS/@n among MS tags with the same @t
  • QLN/@n, TLN/@n, WLN/@n
    • should also consider MS of the same type (eg <QLN n="1"/> and <MS t="qln" n="1"/> overlap)
  • PAGE/@n
  • STANZA/@n when used outside of SCENE

Note that many (but not all) of these could be automatically fixed by the renumbering transformer.
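
The core of any such validator is a duplicate check over the values collected for one element type within one scope. A minimal sketch follows; `DuplicateFinder` is a hypothetical name, and collecting the values per scope (per act for SCENE/@n, per scene for L/@n, and so on) is assumed to be done by the caller.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not an existing isetools validator.
class DuplicateFinder {
    // Returns the values that occur more than once, in order of first repeat.
    static Set<String> duplicates(List<String> values) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new LinkedHashSet<>();
        for (String value : values) {
            // Set.add returns false when the value was already present.
            if (!seen.add(value)) {
                repeated.add(value);
            }
        }
        return repeated;
    }
}
```

To treat `<QLN n="1"/>` and `<MS t="qln" n="1"/>` as overlapping, the caller would put both values into the same list before calling `duplicates`.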

require indent/rule/space lengths to be non-zero

INDENT, RULE, and SPACE with 0 length don't make any sense and are probably editor errors.

Add a validator that requires these lengths to be positive and non-zero.

(negative non-zero length might make sense for INDENT though... I'll check with MB)

attributes with whitespace parsed incorrectly

Attributes with whitespace between the = and the opening " are not being parsed correctly. It looks like the element and attribute are built fine, but the attribute value ends up containing the opening quote.

This might affect whitespace on the other side of the = as well, I haven't checked.
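
For reference, here is the behaviour that seems to be expected, written as a standalone regular expression that tolerates whitespace on both sides of the =. It is a sketch only: `AttributeSketch` is a hypothetical class, it handles double-quoted values only, and it makes no claim about how the actual parser is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the isetools parser.
class AttributeSketch {
    // name, optional whitespace, '=', optional whitespace, double-quoted value.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_][\\w-]*)\\s*=\\s*\"([^\"]*)\"");

    // Extracts attribute name/value pairs from the inside of a tag.
    static Map<String, String> parse(String tagBody) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = ATTRIBUTE.matcher(tagBody);
        while (matcher.find()) {
            // group(2) excludes both quotes, so the value never starts with '"'.
            attributes.put(matcher.group(1), matcher.group(2));
        }
        return attributes;
    }
}
```

With this pattern, `hello world= "foo"` and `hello world ="foo"` both yield a single attribute `world` with the value `foo`.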

Log doesn't play well with multi-threading

Since we're now using the isetools as a library for a multi-threaded web service, having a singleton Log really doesn't make sense. Separate users validating separate documents don't want to see each other's error messages :) Worse than that, because Log is not synchronized, multi-threaded use may even lead to crashes...

It looks like @emmental made some progress on rewriting Log so that there was a separate instance for each DOM, but I'm not sure how far he got with that approach. It looks like he has a branch for it at https://github.com/emmental/isetools/tree/separate_logs.

An alternate strategy might be simply to make Log a singleton-per-thread using ThreadLocals. I imagine that would require a lot less work.
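
A minimal sketch of the ThreadLocal strategy is shown below. `ThreadLog` and its methods are hypothetical stand-ins, since the real Log class's interface is not reproduced here; only `java.lang.ThreadLocal` itself is real API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Log; names and methods are assumptions.
class ThreadLog {
    // Each thread lazily gets its own instance on first access.
    private static final ThreadLocal<ThreadLog> CURRENT =
            ThreadLocal.withInitial(ThreadLog::new);

    private final List<String> messages = new ArrayList<>();

    private ThreadLog() {}

    // Returns the log belonging to the calling thread.
    static ThreadLog getInstance() {
        return CURRENT.get();
    }

    void add(String message) {
        messages.add(message);
    }

    // Returns a copy so callers cannot modify the log's internal list.
    List<String> messages() {
        return new ArrayList<>(messages);
    }

    // Discards the calling thread's log.
    static void reset() {
        CURRENT.remove();
    }
}
```

Because each instance is only ever touched by its own thread, no synchronization is needed. One caveat: web containers usually reuse pooled threads, so something like `reset()` would have to be called at the end of each request to keep one user's messages from appearing in the next user's log.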

digraphs are ligatures

After discussion with our coordinating editors, I've come to the conclusion digraphs and ligatures are identical for our purposes. The DigraphCharNode class should be dropped, and all curly-escapes it supports should instead be handled with LigatureCharNode.

Feature: log warnings for redundant tagging

Redundantly nested tagging should log a warning. For example, <EM>one <EM>two</EM> three</EM> is redundant since the inner EM doesn't add any information (we don't support multiple "levels" of emphasis). The following tags would be affected:

  • BLL (if no intervening R)
  • C
  • CL
  • CW
  • EM
  • FONT (iff @size is the same)
  • FOREIGN (iff @lang is the same)
  • I
  • J
  • LD
  • LS
  • PN
  • R (if no intervening BLL)
  • RA
  • RT
  • SC
  • SD
  • SIG
  • SP
  • TITLE
  • WORK

Of course, the "crossing" errors described in #4 would also apply to self-nesting.

Feature: validate section coverage

Each kind of sectioning should fully-partition a document, and if this is not the case a warning should be logged.

Specifically,

  • DIVs should either be used only in FRONTMATTER and BACKMATTER, or contain all text content in the document
  • ACTs and SCENEs should contain all text content in the document that is not in FRONTMATTER or BACKMATTER
  • PAGEs should contain all text content in the document

Of course, each type may simply not be used at all as well.
