hast-util-to-nlcst

hast utility to transform to nlcst.

What is this?

This package is a utility that takes a hast (HTML) syntax tree as input and turns it into nlcst (natural language).

When should I use this?

This project is useful when you want to work with syntax trees (ASTs) and inspect the natural language inside HTML. There is not yet a way to apply changes made to the nlcst tree back to the hast tree.

The mdast utility mdast-util-to-nlcst does the same but uses a markdown tree as input.

The rehype plugin rehype-retext wraps this utility to do the same at a higher level of abstraction, which is easier to use.

Install

This package is ESM only. In Node.js (version 16+), install with npm:

npm install hast-util-to-nlcst

In Deno with esm.sh:

import {toNlcst} from 'https://esm.sh/hast-util-to-nlcst@4'

In browsers with esm.sh:

<script type="module">
  import {toNlcst} from 'https://esm.sh/hast-util-to-nlcst@4?bundle'
</script>

Use

Say our document example.html contains:

<article>
  Implicit.
  <h1>Explicit: <strong>foo</strong>s-ball</h1>
  <pre><code class="language-foo">bar()</code></pre>
</article>

…and our module example.js looks as follows:

import {fromHtml} from 'hast-util-from-html'
import {toNlcst} from 'hast-util-to-nlcst'
import {ParseEnglish} from 'parse-english'
import {read} from 'to-vfile'
import {inspect} from 'unist-util-inspect'

const file = await read('example.html')
const tree = fromHtml(file)

console.log(inspect(toNlcst(tree, file, ParseEnglish)))

…now running node example.js yields:

RootNode[2] (1:1-6:1, 0-134)
├─0 ParagraphNode[3] (1:10-3:3, 9-24)
│   ├─0 WhiteSpaceNode "\n  " (1:10-2:3, 9-12)
│   ├─1 SentenceNode[2] (2:3-2:12, 12-21)
│   │   ├─0 WordNode[1] (2:3-2:11, 12-20)
│   │   │   └─0 TextNode "Implicit" (2:3-2:11, 12-20)
│   │   └─1 PunctuationNode "." (2:11-2:12, 20-21)
│   └─2 WhiteSpaceNode "\n  " (2:12-3:3, 21-24)
└─1 ParagraphNode[1] (3:7-3:43, 28-64)
    └─0 SentenceNode[4] (3:7-3:43, 28-64)
        ├─0 WordNode[1] (3:7-3:15, 28-36)
        │   └─0 TextNode "Explicit" (3:7-3:15, 28-36)
        ├─1 PunctuationNode ":" (3:15-3:16, 36-37)
        ├─2 WhiteSpaceNode " " (3:16-3:17, 37-38)
        └─3 WordNode[4] (3:25-3:43, 46-64)
            ├─0 TextNode "foo" (3:25-3:28, 46-49)
            ├─1 TextNode "s" (3:37-3:38, 58-59)
            ├─2 PunctuationNode "-" (3:38-3:39, 59-60)
            └─3 TextNode "ball" (3:39-3:43, 60-64)

API

This package exports the identifier toNlcst. There is no default export.

toNlcst(tree, file, Parser)

Turn a hast tree into an nlcst tree.

👉 Note: tree must have positional info and file must be a VFile corresponding to tree.
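
Because the note above requires positional info, a quick check before calling toNlcst can catch trees that were built by hand rather than parsed. The following is a minimal sketch in plain JavaScript. hasPosition is a hypothetical helper, not part of this package, and it assumes the usual unist position shape with start and end points.

```javascript
// Hypothetical helper (not exported by hast-util-to-nlcst): checks whether
// a unist node carries start and end points with numeric line and column.
function hasPosition(node) {
  const position = node && node.position
  return Boolean(
    position &&
      position.start &&
      position.end &&
      typeof position.start.line === 'number' &&
      typeof position.start.column === 'number' &&
      typeof position.end.line === 'number' &&
      typeof position.end.column === 'number'
  )
}

// A text node as a parser would produce it, and one built by hand.
const parsed = {
  type: 'text',
  value: 'Implicit.',
  position: {
    start: {line: 2, column: 3, offset: 12},
    end: {line: 2, column: 12, offset: 21}
  }
}
const handmade = {type: 'text', value: 'Implicit.'}

console.log(hasPosition(parsed)) // true
console.log(hasPosition(handmade)) // false
```

A tree produced by a parser such as hast-util-from-html (as in the example above) normally carries this info already; the check is mostly useful for generated trees.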

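Parameters

tree (hast node): hast tree to transform; must have positional info
file (VFile): virtual file corresponding to tree
Parser (ParserConstructor or ParserInstance): nlcst parser to use, such as parse-english

Returns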

NlcstNode.

Notes

Implied paragraphs

The algorithm supports implicit and explicit paragraphs, such as:

<article>
  An implicit paragraph.
  <h1>An explicit paragraph.</h1>
</article>

Overlapping paragraphs are also supported (see the tests or the HTML spec for more info).
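
To make the distinction concrete, here is a deliberately simplified sketch in plain JavaScript. It is not the algorithm this package uses and it does not handle overlapping paragraphs. The paragraphs and textOf helpers and the set of block elements are assumptions chosen for illustration.

```javascript
// Simplified sketch, NOT the library's algorithm: split the children of a
// hast-like element into paragraphs. Runs of text between block elements
// become implicit paragraphs; each block element is an explicit one.
// The set of block elements is an illustrative assumption.
const blocks = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p'])

// Collect the text inside a node and its descendants.
function textOf(node) {
  if (node.type === 'text') return node.value
  return (node.children || []).map(textOf).join('')
}

function paragraphs(parent) {
  const result = []
  let run = ''

  // Close the current run of loose text, if it holds anything but whitespace.
  const flush = () => {
    const value = run.trim()
    if (value) result.push({kind: 'implicit', value})
    run = ''
  }

  for (const child of parent.children) {
    if (child.type === 'element' && blocks.has(child.tagName)) {
      flush()
      result.push({kind: 'explicit', value: textOf(child).trim()})
    } else {
      run += textOf(child)
    }
  }

  flush()
  return result
}

// Hand-written hast-like tree for the <article> example above.
const article = {
  type: 'element',
  tagName: 'article',
  children: [
    {type: 'text', value: '\n  An implicit paragraph.\n  '},
    {
      type: 'element',
      tagName: 'h1',
      children: [{type: 'text', value: 'An explicit paragraph.'}]
    },
    {type: 'text', value: '\n'}
  ]
}

console.log(paragraphs(article))
```

For this tree the sketch reports one implicit paragraph (the loose text) followed by one explicit paragraph (the heading); the trailing whitespace-only text is dropped.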

Ignored nodes

Some elements are ignored and their content will not be present in nlcst: <script>, <style>, <svg>, <math>, <del>.

To ignore other elements, add a data-nlcst attribute with a value of ignore:

<p>This is <span data-nlcst="ignore">hidden</span>.</p>
<p data-nlcst="ignore">Completely hidden.</p>

Source nodes

<code> elements are mapped to Source nodes in nlcst.

To mark other elements as source, add a data-nlcst attribute with a value of source:

<p>This is <span data-nlcst="source">marked as source</span>.</p>
<p data-nlcst="source">Completely marked.</p>
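
The rules in the two sections above can be summarized in a small sketch. This is plain JavaScript over hast-like objects, not the package's own code. classify is a hypothetical helper, and the choice to let an explicit data-nlcst value win over the tag-name defaults is an assumption made for the example.

```javascript
// Elements ignored by default, as listed under "Ignored nodes".
const ignored = new Set(['script', 'style', 'svg', 'math', 'del'])
// Elements treated as source by default, as stated under "Source nodes".
const source = new Set(['code'])

// Hypothetical helper: decide how an element would be treated.
// In hast, the `data-nlcst` attribute is exposed as the `dataNlcst` property.
// Assumption: an explicit attribute value wins over the tag-name defaults.
function classify(element) {
  const flag = element.properties ? element.properties.dataNlcst : undefined
  if (flag === 'ignore') return 'ignore'
  if (flag === 'source') return 'source'
  if (ignored.has(element.tagName)) return 'ignore'
  if (source.has(element.tagName)) return 'source'
  return 'text'
}

const span = (dataNlcst) => ({
  type: 'element',
  tagName: 'span',
  properties: dataNlcst ? {dataNlcst} : {},
  children: []
})

console.log(classify(span('ignore'))) // ignore
console.log(classify(span('source'))) // source
console.log(classify(span())) // text
console.log(classify({type: 'element', tagName: 'code', properties: {}, children: []})) // source
```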

ParserConstructor

Create a new parser (TypeScript type).

Type

type ParserConstructor = new () => ParserInstance

ParserInstance

nlcst parser (TypeScript type).

For example, parse-dutch, parse-english, or parse-latin.

Type

type ParserInstance = {
  parse(value?: string | null | undefined): NlcstRoot
  tokenize(value?: string | null | undefined): Array<NlcstSentenceContent>
  tokenizeParagraph(value?: string | null | undefined): NlcstParagraph
  tokenizeParagraphPlugins: Array<(node: NlcstParagraph) => undefined | void>
  tokenizeSentencePlugins: Array<(node: NlcstSentence) => undefined | void>
}
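
Since both a constructor type and an instance type are described, code that receives a parser may need to handle either form. The sketch below shows one way to normalize them in plain JavaScript. StubParser and toParser are hypothetical names for illustration; the stub only mimics part of the ParserInstance shape and does no real parsing.

```javascript
// Stub standing in for a real nlcst parser such as parse-english.
// It returns an empty root and only exists to make the sketch runnable.
class StubParser {
  constructor() {
    this.tokenizeParagraphPlugins = []
    this.tokenizeSentencePlugins = []
  }

  parse(value) {
    return {type: 'RootNode', children: []}
  }
}

// Hypothetical helper: accept either a constructor or an instance and
// always hand back an instance.
function toParser(Parser) {
  return typeof Parser === 'function' ? new Parser() : Parser
}

const fromConstructor = toParser(StubParser)
const existing = new StubParser()
const fromInstance = toParser(existing)

console.log(fromConstructor instanceof StubParser) // true
console.log(fromInstance === existing) // true
```

Passing an instance is handy when plugins have already been added to tokenizeParagraphPlugins or tokenizeSentencePlugins, because the configured parser is reused as is.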

Types

This package is fully typed with TypeScript. It exports the additional types ParserConstructor and ParserInstance.

Compatibility

Projects maintained by the unified collective are compatible with maintained versions of Node.js.

When we cut a new major release, we drop support for unmaintained versions of Node. This means we try to keep the current release line, hast-util-to-nlcst@^4, compatible with Node.js 16.

Security

hast-util-to-nlcst does not change the original syntax tree so there are no openings for cross-site scripting (XSS) attacks.

Contribute

See contributing.md in syntax-tree/.github for ways to get started. See support.md for ways to get help.

This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.

License

MIT © Titus Wormer
