wooorm / parse-latin

Latin-script (natural language) parser

Home Page: https://wooorm.com/parse-latin/

License: MIT License

JavaScript 100.00%

natural-language parse latin-script

parse-latin's Introduction

parse-latin

Natural language parser, for Latin-script languages, that produces nlcst.

What is this?
When should I use this?
Install
Use
API
- ParseLatin()
Algorithm
Types
Compatibility
Security
Related
Contribute
License

What is this?

This package exposes a parser that takes Latin-script natural language and produces a syntax tree.

When should I use this?

If you want to handle natural language as syntax trees manually, use this.

Alternatively, you can use the retext plugin retext-latin, which wraps this project to also parse natural language at a higher-level (easier) abstraction.

Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), this project does a good job at tokenizing it.

For English and Dutch, you can instead use parse-english and parse-dutch.

You can somewhat use this for Latin-like scripts, such as Cyrillic (“привет”), Georgian (“გამარჯობა”), Armenian (“Բարեւ”), and such.

Install

This package is ESM only. In Node.js (version 16+), install with npm:

npm install parse-latin

In Deno with esm.sh:

import {ParseLatin} from 'https://esm.sh/parse-latin@7'

In browsers with esm.sh:

<script type="module">
  import {ParseLatin} from 'https://esm.sh/parse-latin@7?bundle'
</script>

Use

import {ParseLatin} from 'parse-latin'
import {inspect} from 'unist-util-inspect'

const tree = new ParseLatin().parse('A simple sentence.')

console.log(inspect(tree))

Yields:

RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)

API

This package exports the identifier ParseLatin. There is no default export.

`ParseLatin()`

Create a new parser.

`ParseLatin#parse(value)`

Turn natural language into a syntax tree.

Parameters

value (string, optional) — value to parse

Returns

Tree (RootNode).
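
As an illustration of working with the returned tree, here is a minimal sketch that walks an nlcst-shaped object and collects its words. The tree literal is written by hand to mirror the output shown under “Use” (positions are omitted), and the helpers `toText` and `words` are hypothetical names for this example, not part of parse-latin.

```javascript
// Hand-written tree mirroring the shape shown under “Use” (positions omitted).
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Serialize a node: parents concatenate their children, leaves return their value.
function toText(node) {
  return 'children' in node ? node.children.map(toText).join('') : node.value
}

// Collect the text of every WordNode in document order.
function words(node) {
  if (node.type === 'WordNode') return [toText(node)]
  return 'children' in node ? node.children.flatMap(words) : []
}

console.log(words(tree))
console.log(toText(tree))
```

In a real project the tree would come from `new ParseLatin().parse(value)` rather than from a literal.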

Algorithm

👉 Note: The easiest way to see how parse-latin parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.

parse-latin splits text into white space, punctuation, symbol, and word tokens:

“word” is one or more unicode letters or numbers
“white space” is one or more unicode white space characters
“punctuation” is one or more unicode punctuation characters
“symbol” is one or more of anything else
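
The four token classes above can be sketched with Unicode property escapes. This is a simplified illustration based only on the descriptions in this list. It is not the expressions parse-latin actually uses, and the names `classify` and `tokenize` are hypothetical.

```javascript
// Hypothetical sketch; parse-latin's real expressions may differ.
const classes = [
  ['word', /[\p{L}\p{N}]/u], // letters or numbers
  ['whiteSpace', /\p{White_Space}/u],
  ['punctuation', /\p{P}/u]
]

// Return the class of a single character, falling back to “symbol”.
function classify(character) {
  for (const [type, expression] of classes) {
    if (expression.test(character)) return type
  }

  return 'symbol'
}

// Group adjacent characters of the same class into one token.
function tokenize(value) {
  const tokens = []

  // `for…of` on a string iterates by code point, not by UTF-16 unit.
  for (const character of value) {
    const type = classify(character)
    const previous = tokens[tokens.length - 1]

    if (previous && previous.type === type) {
      previous.value += character
    } else {
      tokens.push({type, value: character})
    }
  }

  return tokens
}

console.log(tokenize('A simple sentence.'))
```

For `'A simple sentence.'` this sketch produces six tokens, in the same order as the six children of the SentenceNode shown under “Use”.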

Then, it manipulates and merges those tokens into a syntax tree, adding sentences and paragraphs where needed.

some punctuation marks are part of the word they occur in, such as non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and…
some periods do not mark a sentence end, such as 1., e.g., id.
although periods, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, such as .), ."
…and many more exceptions
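
To make the first exception concrete, here is a rough sketch of a merge step that folds a punctuation mark into a word when it sits directly between two word tokens. The flat `{type, value}` token shape, the function name `mergeInnerWordPunctuation`, and the set of joining marks are assumptions for illustration. parse-latin itself operates on nlcst nodes and handles many more cases.

```javascript
// Assumed set of marks that may join two word tokens (illustrative only).
const innerWordMarks = new Set(['-', '’', "'", ':', '/'])

// Merge `word, mark, word` runs into a single word token.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0

  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]

    if (
      token.type === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous &&
      previous.type === 'word' &&
      next &&
      next.type === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy, so the input tokens are never mutated.
      result.push({...token})
      index++
    }
  }

  return result
}

console.log(
  mergeInnerWordPunctuation([
    {type: 'word', value: 'non'},
    {type: 'punctuation', value: '-'},
    {type: 'word', value: 'profit'}
  ])
)
```

Because the merged word becomes the “previous” token, chains such as `09:15:05` collapse into one word in a single pass.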

Types

This package is fully typed with TypeScript. It exports no additional types.

Compatibility

Projects maintained by me are compatible with maintained versions of Node.js.

When I cut a new major release, I drop support for unmaintained versions of Node. This means I try to keep the current release line, parse-latin@^7, compatible with Node.js 16.

Security

This package is safe.

Related

parse-english — English (natural language) parser
parse-dutch — Dutch (natural language) parser

Contribute

Yes please! See How to Contribute to Open Source.

License

MIT © Titus Wormer


parse-latin's Issues

Please publish @types/parse-latin

Cannot compile parse-latin using modern TypeScript.

Could not find a declaration file for module 'parse-latin'.
Try `npm i --save-dev @types/parse-latin` if it exists or add a new declaration (.d.ts) file
containing `declare module 'parse-latin';`

Please publish the types on npm. Here's a good tutorial.

Should support basic Source detection

Solely non-word content between two new lines (\n---\n, or \n* * * *\n);
Maybe even detect a word-to-punctuation ratio, for example in a link (http://www.example.com contains 5 punctuation marks, 0 white space, and 17 alphabetic characters, and is currently parsed as word, punctuation, punctuation, word, punctuation, word, punctuation, word)

Note: Repost from parse-english.

(maybe) Should add the slash to inner word punctuation

e.g., for:

N/A (not applicable, not available)
w/o (without)
c/o (care of)
&c.

Deny comma as first token in a sentence

“Oh no!”, she screamed, “…don’t do it!” is parsed incorrectly, while “Oh no!” she screamed, “…don’t do it!” is parsed correctly.

Possible implementation: add a fix on L854 and rename mergeInitialLowerCaseLetterSentences to, e.g., mergeInitialSentenceExceptions.

Throws an incorrect error

At L973, throws with “ParseEnglish”, where an error with “ParseLatin” should be thrown.

Mistakenly categorises :email: as SymbolNode + WordNode

Hello! Thank you for all of your work on this package.

A short example in v4.3.0.

var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')
var tree = new Latin().parse("You've got \u2709\uFE0F!")
                          // "You've got ✉️!"
console.log(inspect(tree))

RootNode[1] (1:1-1:15, 0-14)
└─ ParagraphNode[1] (1:1-1:15, 0-14)
   └─ SentenceNode[7] (1:1-1:15, 0-14)
      ├─ WordNode[3] (1:1-1:7, 0-6)
      │  ├─ TextNode: "You" (1:1-1:4, 0-3)
      │  ├─ PunctuationNode: "'" (1:4-1:5, 3-4)
      │  └─ TextNode: "ve" (1:5-1:7, 4-6)
      ├─ WhiteSpaceNode: " " (1:7-1:8, 6-7)
      ├─ WordNode[1] (1:8-1:11, 7-10)
      │  └─ TextNode: "got" (1:8-1:11, 7-10)
      ├─ WhiteSpaceNode: " " (1:11-1:12, 10-11)
      ├─ SymbolNode: "✉" (1:12-1:13, 11-12)           <------- This is a U+2709
      ├─ WordNode[1] (1:13-1:14, 12-13)               <------- 😢 
      │  └─ TextNode: "️" (1:13-1:14, 12-13)           <------- This is a U+FE0F
      └─ PunctuationNode: "!" (1:14-1:15, 13-14)

I've traced this down from a bug I was experiencing in https://github.com/tbroadley/spellchecker-cli when I spellcheck markdown that uses the :email: shortcode. It is flagged as a spelling mistake, due to this extra U+FE0F. Some other emoji are affected, ones that are based on older symbols, such as ✂️ and ✈️ .

I had a bit of a go at fixing this but didn't get very far. I would be very grateful if you could point me in the right direction so I can submit a PR, though if you would prefer to handle it yourself I will be equally grateful!

Edit: In particular, I got stuck trying to figure out which, if any, of the modules in lib/plugin ought to be amended to correct this behaviour.
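
One possible direction, sketched here as a post-processing step over a sentence's children: when a WordNode consists solely of variation selectors (U+FE00 to U+FE0F) and directly follows a SymbolNode, fold it into that symbol. The function name `mergeVariationSelectors` is hypothetical, positional info is ignored for brevity, and this is not necessarily how the library resolved the issue.

```javascript
// Matches text made up only of variation selectors (U+FE00 to U+FE0F).
const variationSelectors = /^[\uFE00-\uFE0F]+$/

// Fold a selector-only WordNode into the SymbolNode directly before it.
function mergeVariationSelectors(children) {
  const result = []

  for (const child of children) {
    const previous = result[result.length - 1]
    const text =
      child.type === 'WordNode'
        ? child.children.map((d) => d.value).join('')
        : undefined

    if (
      previous &&
      previous.type === 'SymbolNode' &&
      text !== undefined &&
      variationSelectors.test(text)
    ) {
      // Replace the symbol with a copy that includes the selector.
      result[result.length - 1] = {...previous, value: previous.value + text}
    } else {
      result.push(child)
    }
  }

  return result
}

console.log(
  mergeVariationSelectors([
    {type: 'SymbolNode', value: '\u2709'},
    {type: 'WordNode', children: [{type: 'TextNode', value: '\uFE0F'}]},
    {type: 'PunctuationNode', value: '!'}
  ])
)
```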

Using custom prefix exceptions

Certain texts can contain abbreviations that are not captured by the regex tests.

I'm curious if this might be worth adding as an option on the constructor - instead of having to rely on extending the plugins (the way ParseEnglish does):

class ParseEnglish extends ParseLatin {}

ParseEnglish.prototype.tokenizeParagraphPlugins = [
  modifyChildren(mergeEnglishPrefixExceptions)
].concat(ParseEnglish.prototype.tokenizeParagraphPlugins)

I was originally going to open an issue on the parse-english repo but I thought it might be worth exploring the option here for a few reasons:

ParseLatin can be reliably used for most European languages with abbreviations being the main barrier to accurate sentence tokenization.
Exceptions can be supplied as a string array - providing a simpler interface for i18n support out of the box.

I'm not sure if the juice is worth the squeeze for a feature like this (especially since a solution exists). But it may be worthwhile to expose the mergePrefixExceptions function and to document this particular use case for others in the future.
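
To make the proposal concrete, here is a minimal sketch of how a string array of abbreviations could be turned into a test function. The name `createPrefixExceptionTest` and the case-insensitive matching are assumptions for illustration. No such option exists in parse-latin as described in this issue.

```javascript
// Build a predicate from a list of abbreviations (given without the trailing period).
function createPrefixExceptionTest(abbreviations) {
  // Escape regex metacharacters so entries are matched literally.
  const escaped = abbreviations.map((d) =>
    d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  )
  const expression = new RegExp('^(' + escaped.join('|') + ')$', 'i')

  // True when `word` is exactly one of the listed abbreviations.
  return (word) => expression.test(word)
}

const isPrefixException = createPrefixExceptionTest(['Dr', 'Prof', 'z.B'])

console.log(isPrefixException('Dr'))
console.log(isPrefixException('Drs'))
```

A sentence-merging plugin could then call such a predicate on the word before a full stop to decide whether the stop ends the sentence.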

mergeNonWordSentences should give precedence to preceding, rather than following, children

If L840 would use a minus operator, and L851 a plus operator, things like full-stops delimited by spaces (ellipses) would glue to their previous sentence (currently one full-stop glues to the previous, then a white space between sentences, followed by a new sentence, starting with two full-stop-space sequences).

I can’t, however, think of other use cases, though there probably are many more (like #1).

Should have a list depicting how the parser works

Importing non-default export of "toString" as default in some plugins causes webpack errors

At least in some webpack configurations, because some of the parse-latin plugins import toString from nlcst-to-string as default, while toString is not in fact a default export, webpack fails with errors like:

ERROR in ./node_modules/parse-latin/lib/plugin/break-implicit-sentences.js 33:61-69
export 'default' (imported as 'toString') was not found in 'nlcst-to-string' (possible exports: toString)

The solution is to either mark toString as a default export in the nlcst-to-string repo, or correct the default import to a named import in the parse-latin plugin files that import toString.

Please let me know if you can address it, or want me to do a pull request with one of the two fixes above.

Thanks for your great work on this suite!

Should add a mergeEtceteraAbbreviation sentence modifier

The & in &c.

To turn the current result...

> parseLatin.tokenizeSentence('&c.').children
[ { type: 'PunctuationNode', children: [ [Object] ] },
  { type: 'WordNode', children: [ [Object] ] },
  { type: 'PunctuationNode', children: [ [Object] ] } ]

…into the following:

> parseLatin.tokenizeSentence('&c.').children
[ { type: 'WordNode', children: [ [Object], [Object], [Object] ] } ]

Typo in API makes apostrophes not work as inter-word punctuation

https://github.com/wooorm/parse-latin/blob/master/index.js#L377

Fix: …\2019… should be …\u2019…

Typo in unit tests for `cp.`

cf. should be cp.

Should count quotes to detect if they are part of adjacent words

This could help detect whether a quote depicts:

A possessive apostrophe (Warner Bros.’ movie, three cats’ toys);
An elision (’sup, ’round, ’nough, ’emselves);
An actual quote (Then he said “She said, ‘Shut up!’”).

Ignore sentence terminal markers meant as literals

Sometimes, full stops, exclamation marks, or question marks are meant as literals, and not as terminal markers, such as when enclosed in parentheses or quotes.

Maybe the following will work: One or more terminal markers enclosed in either the same punctuation marks, or in matching opening/initial and closing/final punctuation.

This would match “.”, !, (!?), and ‘... ’.

Should add Location and Position to "TextNode" and "SourceNode"

partial interface TextNode {
    location: Location;
}
partial interface SourceNode {
    location: Location;
}

Where Location is represented by:

interface Location {
    start: Position;
    end: Position;
}

And Position by:

interface Position {
    line: unsigned long >= 1;
    column: unsigned long >= 1;
}
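
As a rough illustration of how such values could be derived, the sketch below computes a 1-indexed Position from a string offset and combines two of them into a Location. The helper names `toPosition` and `toLocation` are hypothetical, and only `\n` is treated as a line break.

```javascript
// Compute a 1-indexed line and column for a 0-indexed offset into `value`.
function toPosition(value, offset) {
  let line = 1
  let column = 1

  for (let index = 0; index < offset && index < value.length; index++) {
    if (value.charAt(index) === '\n') {
      line++
      column = 1
    } else {
      column++
    }
  }

  return {line, column}
}

// Combine start and end offsets into a Location.
function toLocation(value, start, end) {
  return {start: toPosition(value, start), end: toPosition(value, end)}
}

console.log(toLocation('A simple sentence.', 0, 18))
```

For `'A simple sentence.'` with offsets 0 and 18, this gives 1:1 to 1:19, which agrees with the `(1:1-1:19, 0-18)` shown for the root node under “Use”.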

Should expose tokenizeWord, tokenizeWhiteSpace, and tokenizePunctuation

So content can be easily added in word, white space, and punctuation nodes using retext-content.

Error installing via npm

Hello, I have problems installing via npm; this error appears:

npm ERR! error installing [email protected]

npm ERR! Error: No compatible version found: parse-latin@'^0.1.0-rc.6'
npm ERR! Valid install targets:
npm ERR! ["0.1.0-rc.3","0.1.0-rc.4","0.1.0-rc.3","0.1.0-rc.4","0.1.0-rc.5","0.1.0-rc.6","0.1.0-rc.7","0.1.0-rc.8","0.1.0-rc.9","0.1.0-rc.10","0.1.0-rc.11","0.1.0-rc.12","0.1.0"]
npm ERR!     at installTargetsError (/usr/share/npm/lib/cache.js:488:10)
npm ERR!     at /usr/share/npm/lib/cache.js:375:15
npm ERR!     at saved (/usr/share/npm/lib/utils/npm-registry-client/get.js:147:7)

Should merge colon surrounded by words

“The concert begins at 21:45.”
“The rocket launched at 09:15:05”

Should expose tokenizeText, and tokenizeSource

Not doing any tokenisation though, just for consistency.

Should allow single closing quote as initial punctuation

Such as in ’sup, ’em, &c.

Note that the smart version should never occur at the start of a word to indicate quoted content, whereas a dumb single quote could.

This would allow better automatic initial elision detection in parse-latin and other parsers. However, for the dumb quote to work as elision, language-specific parsers should still contain a list of allowed initial elisions.

Should merge full-stops surrounded by words, or in initialisms

“C.I.A.” (initialism)
“You will need to arrive by 14.30”
“The programme will begin at 8.00 pm.”