parse-latin's Introduction

parse-latin

Natural language parser, for Latin-script languages, that produces nlcst.

What is this?

This package exposes a parser that takes Latin-script natural language and produces a syntax tree.

When should I use this?

If you want to handle natural language as syntax trees manually, use this.

Alternatively, you can use the retext plugin retext-latin, which wraps this project to also parse natural language at a higher-level (easier) abstraction.

Whether the text is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), this project does a good job of tokenizing it.

For English and Dutch, you can instead use parse-english and parse-dutch.

You can somewhat use this for Latin-like scripts, such as Cyrillic (“привет”), Georgian (“გამარჯობა”), Armenian (“Բարեւ”), and such.

Install

This package is ESM only. In Node.js (version 16+), install with npm:

npm install parse-latin

In Deno with esm.sh:

import {ParseLatin} from 'https://esm.sh/parse-latin@7'

In browsers with esm.sh:

<script type="module">
  import {ParseLatin} from 'https://esm.sh/parse-latin@7?bundle'
</script>

Use

import {ParseLatin} from 'parse-latin'
import {inspect} from 'unist-util-inspect'

const tree = new ParseLatin().parse('A simple sentence.')

console.log(inspect(tree))

Yields:

RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
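
To handle such a tree manually, a short recursive walk is often enough. The sketch below is illustrative only: collectWords and textOf are hypothetical helpers, not part of parse-latin, and the tree is a hand-written copy of the output above (positions left out) so that the example runs without installing anything.

```javascript
// Hand-written nlcst-like tree mirroring the output above (positions omitted).
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: serialize a node by joining the values of its leaves.
function textOf(node) {
  return 'children' in node ? node.children.map(textOf).join('') : node.value
}

// Hypothetical helper: collect the text of every WordNode, depth first.
function collectWords(node, words = []) {
  if (node.type === 'WordNode') {
    words.push(textOf(node))
  } else if ('children' in node) {
    for (const child of node.children) collectWords(child, words)
  }
  return words
}

console.log(collectWords(tree)) // [ 'A', 'simple', 'sentence' ]
```

With the real package, the same functions should work on the result of new ParseLatin().parse(value), since the node types are the ones shown in the output above.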

API

This package exports the identifier ParseLatin. There is no default export.

ParseLatin()

Create a new parser.

ParseLatin#parse(value)

Turn natural language into a syntax tree.

Parameters

  • value (string, optional) — value to parse

Returns

Tree (RootNode).

Algorithm

👉 Note: The easiest way to see how parse-latin parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.

parse-latin splits text into white space, punctuation, symbol, and word tokens:

  • “word” is one or more unicode letters or numbers
  • “white space” is one or more unicode white space characters
  • “punctuation” is one or more unicode punctuation characters
  • “symbol” is one or more of anything else
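
The four token kinds above can be approximated with Unicode property escapes. This is a rough sketch under stated assumptions, not the expressions parse-latin actually uses: classifyTokens is a hypothetical helper, and mapping "letters or numbers" to \p{L} and \p{N}, "white space" to \p{White_Space}, and "punctuation" to \p{P} is my own reading of the list.

```javascript
// Hypothetical helper, not part of parse-latin's API.
// Each alternative captures one kind; the last one catches anything else.
const tokenPattern =
  /([\p{L}\p{N}]+)|(\p{White_Space}+)|(\p{P}+)|([^\p{L}\p{N}\p{White_Space}\p{P}]+)/gu

const kinds = ['word', 'white space', 'punctuation', 'symbol']

function classifyTokens(value) {
  const tokens = []
  for (const match of value.matchAll(tokenPattern)) {
    // Exactly one capture group is defined per match; its index names the kind.
    const groupIndex = match.slice(1).findIndex((group) => group !== undefined)
    tokens.push({kind: kinds[groupIndex], value: match[0]})
  }
  return tokens
}

console.log(classifyTokens('A simple sentence.').map((token) => token.kind))
// [ 'word', 'white space', 'word', 'white space', 'word', 'punctuation' ]
```

A currency sign such as € falls into none of the first three classes, so this sketch reports it as a symbol.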

Then, it manipulates and merges those tokens into a syntax tree, adding sentences and paragraphs where needed.

  • some punctuation marks are part of the word they occur in, such as non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and…
  • some periods do not mark a sentence end, such as 1., e.g., id.
  • although periods, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, such as .), ."
  • …and many more exceptions

Types

This package is fully typed with TypeScript. It exports no additional types.

Compatibility

Projects maintained by me are compatible with maintained versions of Node.js.

When I cut a new major release, I drop support for unmaintained versions of Node. This means I try to keep the current release line, parse-latin@^7, compatible with Node.js 16.

Security

This package is safe.

Related

Contribute

Yes please! See How to Contribute to Open Source.

License

MIT © Titus Wormer

parse-latin's People

Contributors

greenkeeperio-bot, wooorm

parse-latin's Issues

Please publish @types/parse-latin

Cannot compile parse-latin using modern TypeScript.

Could not find a declaration file for module 'parse-latin'.
Try `npm i --save-dev @types/parse-latin` if it exists or add a new declaration (.d.ts) file
containing `declare module 'parse-latin';`

Please publish the types on npm. Here's a good tutorial.

Should support basic Source detection

  • Solely non-word content between two new lines (\n---\n, or \n* * * *\n);
  • Maybe even detect a word-to-punctuation ratio, for example in a link (http://www.example.com contains 5 punctuation marks, 0 white space, and 17 alphabetic characters, and is currently parsed as word, punctuation, punctuation, word, punctuation, word, punctuation, word)
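
The counts quoted for the link can be reproduced with a few lines. This is only a sketch of the ratio idea: countCharacterClasses is a hypothetical helper, the use of \p{L}, \p{P}, and \p{White_Space} is an assumption, and no threshold for "this is source" is proposed here.

```javascript
// Hypothetical helper: count letters, punctuation, and white space in a value.
function countCharacterClasses(value) {
  const count = (expression) => (value.match(expression) || []).length
  const letters = count(/\p{L}/gu)
  const punctuation = count(/\p{P}/gu)
  const whiteSpace = count(/\p{White_Space}/gu)
  // Letters per punctuation mark; Infinity when there is no punctuation.
  const lettersPerPunctuation = punctuation === 0 ? Infinity : letters / punctuation
  return {letters, punctuation, whiteSpace, lettersPerPunctuation}
}

console.log(countCharacterClasses('http://www.example.com'))
// { letters: 17, punctuation: 5, whiteSpace: 0, lettersPerPunctuation: 3.4 }
```

For comparison, "A simple sentence." has 15 letters and 1 punctuation mark, so ordinary prose scores much higher on this measure than the link does.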

Note: Repost from parse-english.

Deny comma as first token in a sentence

“Oh no!”, she screamed, “…don’t do it!” is parsed incorrectly, while “Oh no!” she screamed, “…don’t do it!” is parsed correctly.

Possible implementation: add a fix on L854 and rename mergeInitialLowerCaseLetterSentences to something like mergeInitialSentenceExceptions.

Mistakenly categorises :email: as SymbolNode + WordNode

Hello! Thank you for all of your work on this package.

A short example in v4.3.0.

var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')
// "You've got ✉️!" (U+2709 followed by U+FE0F)
var tree = new Latin().parse("You've got \u2709\uFE0F!")
console.log(inspect(tree))

RootNode[1] (1:1-1:15, 0-14)
└─ ParagraphNode[1] (1:1-1:15, 0-14)
   └─ SentenceNode[7] (1:1-1:15, 0-14)
      ├─ WordNode[3] (1:1-1:7, 0-6)
      │  ├─ TextNode: "You" (1:1-1:4, 0-3)
      │  ├─ PunctuationNode: "'" (1:4-1:5, 3-4)
      │  └─ TextNode: "ve" (1:5-1:7, 4-6)
      ├─ WhiteSpaceNode: " " (1:7-1:8, 6-7)
      ├─ WordNode[1] (1:8-1:11, 7-10)
      │  └─ TextNode: "got" (1:8-1:11, 7-10)
      ├─ WhiteSpaceNode: " " (1:11-1:12, 10-11)
      ├─ SymbolNode: "✉" (1:12-1:13, 11-12)           <------- This is a U+2709
      ├─ WordNode[1] (1:13-1:14, 12-13)               <------- 😢 
      │  └─ TextNode: "️" (1:13-1:14, 12-13)           <------- This is a U+FE0F
      └─ PunctuationNode: "!" (1:14-1:15, 13-14)

I've traced this down from a bug I was experiencing in https://github.com/tbroadley/spellchecker-cli when I spellcheck markdown that uses the :email: shortcode. It is flagged as a spelling mistake, due to this extra U+FE0F. Some other emoji are affected, ones that are based on older symbols, such as ✂️ and ✈️.
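
A quick check of the Unicode general categories may help narrow this down. The sketch below only inspects the two code points; it does not touch parse-latin. generalCategoryGroup is a hypothetical helper. That the word expression accepts combining marks (so that accented letters stay inside one word) is an assumption on my part, not something confirmed in this thread, but it would explain why a lone U+FE0F ends up as a WordNode.

```javascript
// Hypothetical helper: report the top-level Unicode general category of a
// single code point, using property escapes.
function generalCategoryGroup(character) {
  const groups = [
    ['L', /^\p{L}$/u], // letter
    ['N', /^\p{N}$/u], // number
    ['M', /^\p{M}$/u], // mark (combining)
    ['P', /^\p{P}$/u], // punctuation
    ['S', /^\p{S}$/u], // symbol
    ['Z', /^\p{Z}$/u] // separator
  ]
  const found = groups.find(([, expression]) => expression.test(character))
  return found ? found[0] : 'other'
}

console.log(generalCategoryGroup('\u2709')) // S (the envelope is a symbol)
console.log(generalCategoryGroup('\uFE0F')) // M (variation selector-16 is a nonspacing mark)
```

If that assumption holds, one place to look would be whichever step decides that a mark belongs to a word, so that a mark following a symbol is attached to the symbol instead.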

I had a bit of a go at fixing this but didn't get very far. I would be very grateful if you could point me in the right direction so I can submit a PR, though if you would prefer to handle it yourself I will be equally grateful!

Edit: In particular, I got stuck trying to figure out which, if any, of the modules in lib/plugin ought to be amended to correct this behaviour.

Using custom prefix exceptions

Certain texts can contain abbreviations that are not captured by the regex tests.

I'm curious if this might be worth adding as an option on the constructor - instead of having to rely on extending the plugins (the way ParseEnglish does):

class ParseEnglish extends ParseLatin {}

ParseEnglish.prototype.tokenizeParagraphPlugins = [
  modifyChildren(mergeEnglishPrefixExceptions)
].concat(ParseEnglish.prototype.tokenizeParagraphPlugins)

I was originally going to open an issue on the parse-english repo but I thought it might be worth exploring the option here for a few reasons:

  • ParseLatin can be reliably used for most European languages with abbreviations being the main barrier to accurate sentence tokenization.
  • Exceptions can be supplied as a string array, providing a simpler interface for i18n support out of the box.

I'm not sure if the juice is worth the squeeze for a feature like this (especially since a solution exists). But it may be worthwhile to expose the mergePrefixExceptions function and to document this particular use case for others in the future.
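
To make the proposal concrete, here is a minimal sketch of what a string-array interface could look like. It is not how parse-latin or parse-english is implemented: createPrefixMerger is a hypothetical name, and it works on plain sentence strings rather than on syntax tree nodes to keep the example self-contained.

```javascript
// Hypothetical factory: takes abbreviations (without the trailing period) and
// returns a function that re-joins sentences that were split after one of them.
function createPrefixMerger(exceptions) {
  const known = new Set(exceptions.map((exception) => exception.toLowerCase()))

  return function mergeSentences(sentences) {
    const result = []
    for (const sentence of sentences) {
      const previous = result[result.length - 1]
      // Look at the last word of the previous sentence, if it ends in a period.
      const match =
        previous === undefined ? null : /([\p{L}\p{N}]+)\.$/u.exec(previous)
      if (match && known.has(match[1].toLowerCase())) {
        result[result.length - 1] = previous + ' ' + sentence
      } else {
        result.push(sentence)
      }
    }
    return result
  }
}

const mergeGerman = createPrefixMerger(['abb', 'bzw'])
console.log(mergeGerman(['Siehe Abb.', '3 zeigt es.', 'Ende.']))
// [ 'Siehe Abb. 3 zeigt es.', 'Ende.' ]
```

A tree-based version would do the same comparison on the last WordNode of a sentence and merge the sibling SentenceNode into it.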

Importing non-default export of "toString" as default in some plugins causes webpack errors

At least in some webpack configurations, because some of the parse-latin plugins import toString from nlcst-to-string as default, while toString is not in fact a default export, webpack fails with errors like:

ERROR in ./node_modules/parse-latin/lib/plugin/break-implicit-sentences.js 33:61-69
export 'default' (imported as 'toString') was not found in 'nlcst-to-string' (possible exports: toString)

The solution is to either mark toString as a default export in the nlcst-to-string repo, or correct the default import to a named import in the parse-latin plugin files that import toString.

Please let me know if you can address it, or want me to do a pull request with one of the two fixes above.

Thanks for your great work on this suite!

Should add a mergeEtceteraAbbreviation sentence modifier

The & in &c.

To turn the current result...

> parseLatin.tokenizeSentence('&c.').children
[ { type: 'PunctuationNode', children: [ [Object] ] },
  { type: 'WordNode', children: [ [Object] ] },
  { type: 'PunctuationNode', children: [ [Object] ] } ]

…into the following:

> parseLatin.tokenizeSentence('&c.').children
[ { type: 'WordNode', children: [ [Object], [Object], [Object] ] } ]
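
A possible shape for such a modifier, as a sketch only: the function below works on plain node objects, is not taken from parse-latin, and handles only the exact sequence "&", "c", ".". The helper textOf is hypothetical as well; it accepts punctuation nodes that carry either a value or children, since both shapes appear in this document.

```javascript
// Hypothetical helper: serialize a node whether it is a leaf or a parent.
function textOf(node) {
  return 'children' in node ? node.children.map(textOf).join('') : node.value
}

function isPunctuation(node, value) {
  return Boolean(node) && node.type === 'PunctuationNode' && textOf(node) === value
}

// Sketch of the proposed modifier: merge `&`, `c`, `.` into one WordNode.
function mergeEtceteraAbbreviation(children) {
  const result = []
  let index = 0
  while (index < children.length) {
    const [ampersand, word, period] = children.slice(index, index + 3)
    const isEtcetera =
      isPunctuation(ampersand, '&') &&
      Boolean(word) &&
      word.type === 'WordNode' &&
      textOf(word).toLowerCase() === 'c' &&
      isPunctuation(period, '.')
    if (isEtcetera) {
      // Keep the original nodes as children of the new word.
      result.push({type: 'WordNode', children: [ampersand, ...word.children, period]})
      index += 3
    } else {
      result.push(children[index])
      index += 1
    }
  }
  return result
}

const merged = mergeEtceteraAbbreviation([
  {type: 'PunctuationNode', value: '&'},
  {type: 'WordNode', children: [{type: 'TextNode', value: 'c'}]},
  {type: 'PunctuationNode', value: '.'}
])
console.log(merged.length, merged[0].type, merged[0].children.length) // 1 WordNode 3
```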

Ignore sentence terminal markers meant as literals

Sometimes full stops, exclamation marks, or question marks are meant as literals and not as terminal markers, such as when enclosed in parentheses or quotes.

Maybe the following will work: One or more terminal markers enclosed in either the same punctuation marks, or in matching opening/initial and closing/final punctuation.

This would match “.”, !, (!?), and ‘... ’.
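
The rule could be prototyped as a predicate on a short string. This is a sketch under assumptions, not an implementation in parse-latin: isLiteralTerminalMarker is a hypothetical name, "opening/initial" and "closing/final" are read as the Unicode categories Ps/Pi and Pe/Pf, and the check does not verify that the two marks belong to the same pair (so "(!]" would pass too).

```javascript
// Hypothetical predicate: is `value` one or more terminal markers enclosed in
// the same punctuation mark, or in opening/initial plus closing/final marks?
function isLiteralTerminalMarker(value) {
  const characters = Array.from(value)
  if (characters.length < 3) return false

  const first = characters[0]
  const last = characters[characters.length - 1]
  const inner = characters.slice(1, -1).join('').trim()

  // Same mark on both sides, but not a terminal marker itself ("..." is not enclosed).
  const sameMark = first === last && /^\p{P}$/u.test(first) && !/^[.!?]$/.test(first)
  // Opening or initial punctuation, closed by closing or final punctuation.
  const matchingPair = /^[\p{Ps}\p{Pi}]$/u.test(first) && /^[\p{Pe}\p{Pf}]$/u.test(last)

  return (sameMark || matchingPair) && /^[.!?]+$/.test(inner)
}

console.log(isLiteralTerminalMarker('“.”')) // true
console.log(isLiteralTerminalMarker('(!?)')) // true
console.log(isLiteralTerminalMarker('‘... ’')) // true
console.log(isLiteralTerminalMarker('...')) // false
```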

Should add Location and Position to "TextNode" and "SourceNode"

partial interface TextNode {
    location: Location;
}
partial interface SourceNode {
    location: Location;
}

Where Location is represented by:

interface Location {
    start: Position;
    end: Position;
} 

And Position by:

interface Position {
    line: unsigned long >= 1;
    column: unsigned long >= 1;
} 
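
As an illustration of the proposed interfaces, a Location can be derived from character offsets in the source text. The sketch below is not taken from parse-latin: toPosition and toLocation are hypothetical names, and it assumes lines are separated by "\n" only.

```javascript
// Hypothetical helper: 1-based line and column for a 0-based offset.
function toPosition(value, offset) {
  const lines = value.slice(0, offset).split('\n')
  return {line: lines.length, column: lines[lines.length - 1].length + 1}
}

// Hypothetical helper: a Location as described above, from two offsets.
function toLocation(value, startOffset, endOffset) {
  return {start: toPosition(value, startOffset), end: toPosition(value, endOffset)}
}

console.log(toLocation('A simple sentence.', 0, 18))
// { start: { line: 1, column: 1 }, end: { line: 1, column: 19 } }
```

The result corresponds to the 1:1-1:19 span that unist-util-inspect prints for the root of "A simple sentence." in the Use section.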

Error installing via npm

Hello, I'm having problems installing via npm; this error appears:

npm ERR! error installing [email protected]

npm ERR! Error: No compatible version found: parse-latin@'^0.1.0-rc.6'
npm ERR! Valid install targets:
npm ERR! ["0.1.0-rc.3","0.1.0-rc.4","0.1.0-rc.3","0.1.0-rc.4","0.1.0-rc.5","0.1.0-rc.6","0.1.0-rc.7","0.1.0-rc.8","0.1.0-rc.9","0.1.0-rc.10","0.1.0-rc.11","0.1.0-rc.12","0.1.0"]
npm ERR!     at installTargetsError (/usr/share/npm/lib/cache.js:488:10)
npm ERR!     at /usr/share/npm/lib/cache.js:375:15
npm ERR!     at saved (/usr/share/npm/lib/utils/npm-registry-client/get.js:147:7)

Should allow single closing quote as initial punctuation

Such as in ’sup, ’em, &c.

Note that the smart version should never occur at the start of a word to indicate quoted content, whereas a dumb single quote could.

This would allow better automatic initial elision detection in parse-latin and other parsers. However, for the dumb quote to work as elision, language-specific parsers should still contain a list of allowed initial elision.
