intertext's Introduction

ⒾⓃⓉⒺⓇⓉⒺⓍⓉ

InterText: Services for Recurrent Text-related Tasks

InterText provides pre-packaged solutions for a number of frequently recurring tasks in text formatting and typesetting. I'm aiming to conduct comparative benchmarks and soundness checks for all solutions. Areas covered so far include:

  • InterText HYPH for hyphenating text in multiple languages (only en-US is covered so far, but the underlying software is multilingual and configurable).

  • InterText SLABS for segmenting and re-assembling text according to Unicode Standard Annex #14: Unicode Line Breaking Algorithm (UAX#14); this is useful to determine line breaking opportunities (LBOs) for running text.

See also the (rough) list of planned features.

Related Links

To Do

  • use INTERTEXT.rpr() for tabulation instead of JSON.stringify()
  • implement path manipulation, integrate pathmap
  • integrate color-related code from DataMill colorizer
  • implement number formatting using Intl.NumberFormat, including percentages, rounding
  • CupOfHTML: make compatible with Paragate HTMLish parser
  • CupOfHTML: consider using template strings as in H`div#id.class` 'content'
  • turn into monorepo
  • integrate jzr-old/timetunnel
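
The number formatting item above could start from something like the following. This is only a rough sketch of what the standard `Intl.NumberFormat` API offers for percentages and rounding; the helper names `formatPercent` and `formatRounded` are made up for illustration and are not part of InterText.

```javascript
// Hypothetical helpers (not InterText API), built on the standard Intl.NumberFormat.

// Percentages: the input is a fraction, so 0.1234 is rendered as a percentage
// with at most one fraction digit.
const percentFormatter = new Intl.NumberFormat('en-US', {
  style: 'percent',
  maximumFractionDigits: 1,
});

// Rounding: always show exactly two fraction digits, with grouping separators.
const roundedFormatter = new Intl.NumberFormat('en-US', {
  minimumFractionDigits: 2,
  maximumFractionDigits: 2,
});

function formatPercent(fraction) {
  return percentFormatter.format(fraction);
}

function formatRounded(value) {
  return roundedFormatter.format(value);
}

console.log(formatPercent(0.1234));    // 12.3%
console.log(formatRounded(1234.5678)); // 1,234.57
```

Creating the formatters once and reusing them avoids rebuilding locale data on every call.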

intertext's Issues

About words not hyphenated by Hyphenopoly

Hello

Recently I stumbled upon your analysis of the various hyphenation libraries (https://github.com/loveencounterflow/intertext#hyphenators).
As the author of Hyphenopoly, I was of course very pleased with your verdict. I thank you.

Nevertheless, there are those 36 words where Hyphenopoly does not find as many hyphenation possibilities as "hypher".
Of course I have investigated why this is the case, and I would now like to share my findings.

TL;DR: there are two groups:

  1. words with diacritical characters
  2. words in the exception list

About 1)
To extract words from a text string, Hyphenopoly searches for groups of characters from a particular alphabet. The alphabet is derived from the characters used in the patterns. Since there are no patterns with diacritical characters for en-us, the alphabet contains the characters "abcdefghijklmnopqrstuvwxyz".
Therefore, Hyphenopoly.js only finds groups of characters (mostly words) that contain these characters. For example, the word "Düsseldorfer" is not processed in its entirety, but only the part "sseldorfer". IMHO this makes sense, because the part with the "ü" could not be processed by the patterns anyway.
"hypher" on the other hand extracts words from text strings by dividing the string by non-word characters.
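
To make the difference concrete, here is a minimal sketch of the two extraction strategies. It is not the actual code of either library: the function names are invented, and the Unicode-aware split is only a stand-in for whatever expression "hypher" really uses.

```javascript
// Alphabet-based extraction: only runs of characters from the pattern
// alphabet count as candidates for hyphenation.
const EN_US_ALPHABET = 'abcdefghijklmnopqrstuvwxyz';

function extractByAlphabet(text, alphabet = EN_US_ALPHABET) {
  // The alphabet contains only plain letters here, so no regex escaping is needed.
  const re = new RegExp(`[${alphabet}]+`, 'gi');
  return text.match(re) ?? [];
}

// Splitting approach: cut the text at anything that is not a letter or digit.
// Assumption: this regex merely approximates "dividing by non-word characters".
function extractBySplitting(text) {
  return text.split(/[^\p{L}\p{N}]+/u).filter((part) => part.length > 0);
}

const sample = 'Ein Düsseldorfer Café';
console.log(extractByAlphabet(sample));  // the "ü" and "é" cut the words apart
console.log(extractBySplitting(sample)); // the words stay whole
```

With the en-us alphabet, "Düsseldorfer" falls apart into "D" and "sseldorfer"; a minimum word length would then presumably discard the lone "D".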

As a solution to this problem, one could use "substitutes". This mechanism is already implemented in the WebAssembly module, but is currently only used for the old German long s (Latin Small Letter Long S, U+017F). With substitutes, the alphabet is extended by the corresponding characters (e.g. é), but the hyphenation patterns are applied to the substitute character (e) instead. However, the list mapping characters with diacritics to characters without diacritics is quite long, and the quality of the resulting hyphenation is unknown (better no hyphenation than a wrong one), so I have not done that so far.
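
The idea can be sketched roughly as follows. This does not reproduce the WebAssembly implementation; the table contents (apart from the long s) and all function names are assumptions chosen for illustration.

```javascript
const BASE_ALPHABET = 'abcdefghijklmnopqrstuvwxyz';

// Hypothetical substitute table: character in the text -> character used for pattern lookup.
const SUBSTITUTES = new Map([
  ['ſ', 's'], // long s, the one case mentioned above
  ['é', 'e'], // illustrative only
  ['ü', 'u'], // illustrative only
]);

// The alphabet used for word extraction grows by the substituted characters.
function extendedAlphabet(base, substitutes) {
  return base + Array.from(substitutes.keys()).join('');
}

function extractWords(text, alphabet) {
  return text.match(new RegExp(`[${alphabet}]+`, 'gi')) ?? [];
}

// The form of the word that would be handed to the hyphenation patterns.
// Each substitute is one character for one character, so positions found on
// this form line up with positions in the original word.
function toPatternForm(word, substitutes = SUBSTITUTES) {
  return Array.from(word.toLowerCase(), (ch) => substitutes.get(ch) ?? ch).join('');
}

const alphabet = extendedAlphabet(BASE_ALPHABET, SUBSTITUTES);
const [word] = extractWords('Düsseldorfer', alphabet);
console.log(word, '->', toPatternForm(word));
```

Whether the patterns produce good break points for the substituted form is exactly the open quality question raised above; the sketch only shows the plumbing.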

About 2)
Most patterns also include a list of exceptions (e.g. https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/hyph-en-us.hyp.txt). Exceptions are words that are not hyphenated correctly by the patterns or whose hyphenation is ambiguous. For example, the word "project" is hyphenated as "pro-ject" when it is a verb and as "proj-ect" when it is a noun. A hyphenation program can hardly tell whether it is dealing with the verb or the noun. To avoid wrong hyphenations, such words are not broken at the error-prone places.
Hyphenopoly takes these exception lists into account, while "hypher" obviously does not.
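
In code, honouring an exception list amounts to a lookup that runs before the patterns. The following is a minimal sketch, not Hyphenopoly's implementation: the two entries are samples chosen for illustration, and `hyphenateWithPatterns` is a deliberately naive stand-in for a real pattern-based hyphenator.

```javascript
// Illustrative exception entries: hyphens mark the only allowed break points,
// and an entry without hyphens means "do not break this word at all".
const EXCEPTIONS = new Map(
  ['project', 'ta-ble'].map((entry) => [entry.split('-').join(''), entry]),
);

// Stand-in for a real pattern-based hyphenator (assumption): it simply
// inserts a break after every third character.
function hyphenateWithPatterns(word) {
  return word.replace(/(.{3})(?=.)/g, '$1-');
}

// Exceptions win over patterns. For simplicity the lookup is lowercased and
// the stored (lowercase) form is returned as is.
function hyphenate(word) {
  return EXCEPTIONS.get(word.toLowerCase()) ?? hyphenateWithPatterns(word);
}

console.log(hyphenate('project')); // taken from the exception list, left unbroken
console.log(hyphenate('table'));   // taken from the exception list
console.log(hyphenate('window'));  // falls through to the stand-in patterns
```

A hyphenator that skips the lookup would break "project" wherever its patterns suggest, which would explain extra break points for such words.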

Kind regards,
Mathias