rgrove / parse-xml

A fast, safe, compliant XML parser for Node.js and browsers.

Home Page: https://rgrove.github.io/parse-xml

License: ISC License

JavaScript 56.10% HTML 0.14% TypeScript 43.76%

xml xml-parser xml-parsing parser node nodejs javascript js parsing parse-xml

parse-xml's Introduction

parse-xml

A fast, safe, compliant XML parser for Node.js and browsers.

Installation

npm install @rgrove/parse-xml

Or, if you like living dangerously, you can load the minified bundle in a browser via Unpkg and use the parseXml global.

Features

Returns a convenient object tree representing an XML document.
Works great in Node.js and browsers.
Provides helpful, detailed error messages with context when a document is not well-formed.
Mostly conforms to XML 1.0 (Fifth Edition) as a non-validating parser (see below for details).
Passes all relevant tests in the XML Conformance Test Suite.
Written in TypeScript and compiled to ES2020 JavaScript for Node.js and ES2017 JavaScript for browsers. The browser build is also optimized for minification.
Extremely fast and surprisingly small.
Zero dependencies.

Not Features

While this parser is capable of parsing document type declarations (<!DOCTYPE ... >) and including them in the node tree, it doesn't actually do anything with them. External document type definitions won't be loaded, and the parser won't validate the document against a DTD or resolve custom entity references defined in a DTD.

In addition, the only supported character encoding is UTF-8 because it's not feasible (or useful) to support other character encodings in JavaScript.
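
If your input arrives as bytes in some other encoding, one option is to decode it to a JavaScript string before handing it to the parser. The following is a minimal sketch, assuming a runtime that provides the standard TextDecoder global (modern browsers and Node.js). decodeBytes is a hypothetical helper, not part of parse-xml, and which encoding labels are available depends on the runtime.

```javascript
// Hypothetical helper (not part of parse-xml): turn encoded bytes into a
// JavaScript string that can then be passed to parseXml().
function decodeBytes(bytes, encoding = 'utf-8') {
  // fatal: true makes malformed input throw a TypeError instead of silently
  // inserting U+FFFD replacement characters.
  const decoder = new TextDecoder(encoding, { fatal: true });
  return decoder.decode(bytes);
}

// "<a/>" encoded as UTF-16LE (two bytes per character).
const utf16 = new Uint8Array([0x3c, 0x00, 0x61, 0x00, 0x2f, 0x00, 0x3e, 0x00]);
const xml = decodeBytes(utf16, 'utf-16le');
// `xml` is now an ordinary string and could be passed to parseXml(xml).
```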

Examples

Basic Usage

ESM

import { parseXml } from '@rgrove/parse-xml';
parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');

CommonJS

const { parseXml } = require('@rgrove/parse-xml');
parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');

The result is an XmlDocument instance containing the parsed document, with a structure that looks like this (some properties and methods are excluded for clarity; see the API docs for details):

{
  type: 'document',
  children: [
    {
      type: 'element',
      name: 'kittens',
      attributes: {
        fuzzy: 'yes'
      },
      children: [
        {
          type: 'text',
          text: 'I like fuzzy kittens.'
        }
      ],
      parent: { ... },
      isRootNode: true
    }
  ]
}
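
Because the result is a plain tree of nodes with type and children properties, a short recursive function is enough to walk it. Here is a sketch that gathers all text content. collectText is a hypothetical helper, and the hand-built object stands in for a parsed document so the example runs without the library installed.

```javascript
// Recursively gather the text of a node shaped like the tree shown above:
// text nodes carry a `text` string, container nodes carry a `children` array.
// Node types with neither property contribute nothing.
function collectText(node) {
  if (node.type === 'text') return node.text;
  return (node.children || []).map(collectText).join('');
}

// Hand-built stand-in for the parsed document.
const doc = {
  type: 'document',
  children: [{
    type: 'element',
    name: 'kittens',
    attributes: { fuzzy: 'yes' },
    children: [{ type: 'text', text: 'I like fuzzy kittens.' }]
  }]
};

console.log(collectText(doc)); // prints "I like fuzzy kittens."
```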

All parse-xml objects have toJSON() methods that return JSON-serializable objects, so you can easily convert an XML document to JSON:

let json = JSON.stringify(parseXml(xml));
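
This works because JSON.stringify() calls an object's toJSON() method, when one exists, and serializes the returned value instead of the object itself. That matters for trees whose nodes hold a parent back-reference, since such a graph is cyclic. The sketch below illustrates the mechanism with a stand-in class. DemoNode is an assumption made for illustration and is not parse-xml's implementation.

```javascript
// Stand-in node class (NOT parse-xml's implementation) with a `parent`
// back-reference, which makes the object graph cyclic.
class DemoNode {
  constructor(name, parent = null) {
    this.name = name;
    this.parent = parent;
    this.children = [];
    if (parent) parent.children.push(this);
  }

  // JSON.stringify() calls toJSON() when present and serializes its result,
  // so leaving `parent` out here breaks the cycle.
  toJSON() {
    return { name: this.name, children: this.children };
  }
}

const root = new DemoNode('kittens');
new DemoNode('kitten', root);

const demoJson = JSON.stringify(root);
console.log(demoJson);
// prints {"name":"kittens","children":[{"name":"kitten","children":[]}]}
```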

Friendly Errors

When something goes wrong, parse-xml throws an error that tells you exactly what happened and shows you where the problem is so you can fix it.

parseXml('<foo><bar>baz</foo>');

Output

Error: Missing end tag for element bar (line 1, column 14)
  <foo><bar>baz</foo>
               ^

In addition to a helpful message, error objects have the following properties:

column Number
Column where the error occurred (1-based).

excerpt String
Excerpt from the input string that contains the problem.

line Number
Line where the error occurred (1-based).

pos Number
Character position where the error occurred relative to the beginning of the input (0-based).
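
As a rough sketch of how these properties might be consumed, the helper below copies them into a plain object for logging or display. describeParseError is a hypothetical name, and the stand-in error object uses illustrative values so that the example runs without the library installed.

```javascript
// Hypothetical helper: copy the documented properties of a parse-xml error
// into a plain object, e.g. for logging or for display in a UI.
function describeParseError(err) {
  return {
    message: err.message,
    line: err.line,     // 1-based
    column: err.column, // 1-based
    pos: err.pos,       // 0-based offset into the input
    excerpt: err.excerpt
  };
}

// Typical use (assumes parseXml has been imported as shown earlier):
//
//   try {
//     parseXml('<foo><bar>baz</foo>');
//   } catch (err) {
//     console.error(describeParseError(err));
//   }

// Stand-in error with illustrative values (not captured library output).
const fakeError = Object.assign(new Error('Missing end tag for element bar'), {
  line: 1,
  column: 14,
  pos: 13,
  excerpt: '<foo><bar>baz</foo>'
});
const info = describeParseError(fakeError);
```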

Why another XML parser?

There are many XML parsers for Node, and some of them are good. However, most of them suffer from one or more of the following shortcomings:

Native dependencies.
Loose, non-standard parsing behavior that can lead to unexpected or even unsafe results when given input the author didn't anticipate.
Kitchen sink APIs that tightly couple a parser with DOM manipulation functions, a stringifier, or other tooling that isn't directly related to parsing and consuming XML.
Stream-based parsing. This is great in the rare case that you need to parse truly enormous documents, but can be a pain to work with when all you want is a node tree.
Poor error handling.
Too big or too Node-specific to work well in browsers.

parse-xml's goal is to be a small, fast, safe, compliant, non-streaming, non-validating, browser-friendly parser, because I think this is an under-served niche.

I think parse-xml demonstrates that it's not necessary to jettison the spec entirely or to write complex code in order to implement a small, fast XML parser.

Also, it was fun.

Benchmark

Here's how parse-xml's performance stacks up against a few comparable libraries:

fast-xml-parser, which claims to be the fastest pure JavaScript XML parser
libxmljs2, which is based on the native libxml library written in C
xmldoc, which is based on sax-js

While libxmljs2 is faster at parsing medium and large documents, its performance comes at the expense of a large C dependency, no browser support, and a history of security vulnerabilities in the underlying libxml2 library.

In these results, "ops/s" refers to operations per second. Higher is faster.

Node.js v18.14.0 / Darwin arm64
Apple M1 Max

Running "Small document (291 bytes)" suite...
Progress: 100%

  @rgrove/parse-xml 4.1.0:
    191 553 ops/s, ±0.10%   | fastest

  fast-xml-parser 4.1.1:
    142 565 ops/s, ±0.11%   | 25.57% slower

  libxmljs2 0.31.0 (native):
    74 646 ops/s, ±0.30%    | 61.03% slower

  xmldoc 1.2.0 (sax-js):
    66 823 ops/s, ±0.09%    | slowest, 65.12% slower

Finished 4 cases!
  Fastest: @rgrove/parse-xml 4.1.0
  Slowest: xmldoc 1.2.0 (sax-js)

Running "Medium document (72081 bytes)" suite...
Progress: 100%

  @rgrove/parse-xml 4.1.0:
    1 065 ops/s, ±0.11%   | 49.81% slower

  fast-xml-parser 4.1.1:
    637 ops/s, ±0.12%     | 69.98% slower

  libxmljs2 0.31.0 (native):
    2 122 ops/s, ±2.48%   | fastest

  xmldoc 1.2.0 (sax-js):
    444 ops/s, ±0.36%     | slowest, 79.08% slower

Finished 4 cases!
  Fastest: libxmljs2 0.31.0 (native)
  Slowest: xmldoc 1.2.0 (sax-js)

Running "Large document (1162464 bytes)" suite...
Progress: 100%

  @rgrove/parse-xml 4.1.0:
    93 ops/s, ±0.10%    | 53.27% slower

  fast-xml-parser 4.1.1:
    48 ops/s, ±0.60%    | 75.88% slower

  libxmljs2 0.31.0 (native):
    199 ops/s, ±1.47%   | fastest

  xmldoc 1.2.0 (sax-js):
    38 ops/s, ±0.09%    | slowest, 80.9% slower

Finished 4 cases!
  Fastest: libxmljs2 0.31.0 (native)
  Slowest: xmldoc 1.2.0 (sax-js)

See the parse-xml-benchmark repo for instructions on how to run this benchmark yourself.

License

ISC License

parse-xml's People

Forkers

petejohanson soldy intermundos andxor j8088 siegesmund renatrazumov rossj born2net mahboubii lizliu01 isabella232 aneiosi hyzyla

parse-xml's Issues

xml.replace is not a function

When I try to use the function, I get this error.

Documentation says second argument is optional, but TS compiler says it's required

Docs say I can do this:

parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');

But I get an error:

Expected 2 arguments, but got 1.

Workaround:

parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>', undefined);

I think that the type definition needs to make the second parameter optional with ?:

options?: ParserOptions

Positional info

Hi Ryan! 👋

I’m looking for a well-tested, mostly-spec-compliant XML parser that, importantly, provides positional info (specifically the starting and ending line, column, and offset into the whole string), for xast.

I see that this project has positional info on errors, but I don’t see it on nodes. Would it be of interest to add that data on nodes in this project?
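
For illustration, the kind of positional data meant here can be derived from a character offset into the source string. The sketch below is only an assumption about how that might look: positionAt is a hypothetical helper, it treats "\n" as the only line terminator, and it follows the convention used for errors (1-based line and column, 0-based offset).

```javascript
// Hypothetical helper: convert a 0-based character offset into a position
// with 1-based line and column numbers. Only "\n" counts as a line break.
function positionAt(input, offset) {
  let line = 1;
  let lineStart = 0;
  // Visit every newline that occurs before the offset.
  for (let i = input.indexOf('\n'); i !== -1 && i < offset; i = input.indexOf('\n', i + 1)) {
    line += 1;
    lineStart = i + 1;
  }
  return { line, column: offset - lineStart + 1, offset };
}

const source = '<a>\n  <b/>\n</a>';
console.log(positionAt(source, source.indexOf('<b/>')));
// prints { line: 2, column: 3, offset: 6 }
```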

Text content following CDATA is appended to a preceding `XmlCdata` node

When the preserveCdata parser option is true, text content that follows a CDATA section is incorrectly appended to the preceding XmlCdata node. Text content should only be appended to XmlText nodes.

Consider adding opt-in support for parsing XML 1.1

Hi there,

I've come across some XML files that contain hex references to control characters, e.g. <a>hello</a>. Currently, parsing these files with this library fails because of the check on line 400.

What are your thoughts on relaxing this check, or perhaps having an explicit opt-in option, to allow explicit character references in this control character range?

Thanks!
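
To make the request concrete: XML 1.0 restricts characters to the Char production, while XML 1.1 additionally permits the control characters #x1 through #x1F when they are written as character references. Here is a sketch of what an opt-in check could look like. Both function names and the allowXml11 option are hypothetical and are not part of parse-xml.

```javascript
// XML 1.0 `Char` production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isXml10Char(codePoint) {
  return codePoint === 0x9 || codePoint === 0xa || codePoint === 0xd ||
    (codePoint >= 0x20 && codePoint <= 0xd7ff) ||
    (codePoint >= 0xe000 && codePoint <= 0xfffd) ||
    (codePoint >= 0x10000 && codePoint <= 0x10ffff);
}

// Hypothetical opt-in check for a character reference such as &#x1;.
// With allowXml11 set, references to #x1-#x1F are accepted as in XML 1.1.
// #x0 is never allowed.
function isAllowedCharReference(codePoint, { allowXml11 = false } = {}) {
  if (isXml10Char(codePoint)) return true;
  return allowXml11 && codePoint >= 0x1 && codePoint <= 0x1f;
}

console.log(isAllowedCharReference(0x1));                       // prints false
console.log(isAllowedCharReference(0x1, { allowXml11: true })); // prints true
```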

Optionally include XML declarations and doctype declarations in the DOM

It would be useful if parse-xml provided parser options which, when enabled, would cause XML declarations and doctype declarations to be included in the DOM. This would allow a custom serializer to round-trip a parsed document to an equivalent (though not strictly equal) XML string.

Question "claims to test an illegal char, but tests the wrong char"

Nice job! I'm happy to have run across this.

I see some of your conformance test exceptions (e.g. not-wf-sa-173) say "claims to test an illegal char, but tests the wrong char". How are these the wrong character?

Add type information for errors

parse-xml error objects contain additional properties, but currently we don't export a type describing these errors. We should!

Does it support running in WebWorkers?

Does this library work from browser WebWorkers?

I noticed that the returned value is an XmlDocument instance, which I did not think WebWorkers supported.

Great work on this library - thanks for sharing it

/Paul

RegExp issue with very long attributes

I've run into a parsing issue with some very long XML attributes. For example, the following raises a RangeError: Maximum call stack size exceeded while running the Attribute regex:

parseXml(`<a b="${'a'.repeat(9000000)}"/>`);

I was able to avoid the error by slightly relaxing the attribute value parts of the Attribute regex as follows:

[^<&"] | ${exports.Reference} => [^<"]

but this change causes 4 tests to fail that are expecting exceptions from various bad & uses in attribute values.

Perhaps the regexes could be kept strict while avoiding the call stack exception, or perhaps a 2-phase approach could be used where the attribute value is checked for invalid references after being parsed from the overall tag.
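
The two-phase idea could look roughly like the sketch below. It is not parse-xml's actual code: scanAttributeValue is a hypothetical function, and the REFERENCE pattern is a simplified, ASCII-only stand-in for the spec's Reference production. Phase 1 finds the closing quote with indexOf, so no regex runs over the whole value. Phase 2 checks each "&" individually with a sticky regex.

```javascript
// Simplified, ASCII-only reference pattern (an assumption for illustration).
// The sticky flag makes it match only at `lastIndex`.
const REFERENCE = /&(?:#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_:][A-Za-z0-9_:.-]*);/y;

// `start` must be the offset of the opening quote.
function scanAttributeValue(input, start) {
  const quote = input[start];
  if (quote !== '"' && quote !== "'") {
    throw new Error(`Expected a quote at offset ${start}`);
  }

  // Phase 1: locate the closing quote without a regex.
  const end = input.indexOf(quote, start + 1);
  if (end === -1) throw new Error('Unterminated attribute value');
  const value = input.slice(start + 1, end);

  // Phase 2: validate the raw value. "<" is forbidden, and every "&" must
  // begin a well-formed reference.
  if (value.includes('<')) {
    throw new Error('"<" is not allowed in an attribute value');
  }
  for (let i = value.indexOf('&'); i !== -1; i = value.indexOf('&', REFERENCE.lastIndex)) {
    REFERENCE.lastIndex = i;
    if (!REFERENCE.test(value)) {
      throw new Error(`Invalid reference at offset ${start + 1 + i}`);
    }
  }

  // `end` in the result is the offset just past the closing quote.
  return { value, end: end + 1 };
}

const longValue = scanAttributeValue(`"${'a'.repeat(9000000)}"`, 0);
console.log(longValue.value.length); // prints 9000000
```

Both phases scan the value once, so the work grows linearly with its length and does not depend on regex recursion depth.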

Don't trim comment content

parse-xml trims the content of comment nodes, but the spec doesn't require this, and it differs from what libxml2 does.

Include additional structured information in error objects

Thanks for this library, it's the only one I've found with useful, descriptive error handling besides the DOMParser.

Would it be possible to extend the error object to break down the message a bit more? What I have in mind is to go from this:

Unclosed start tag for element `text` (line 1, column 243) iddle text-anchor="middle" font-size="200px" fill= ^

To this:

{
  errorCode: 0,
  message: 'Unclosed start tag for element',
  element: 'text',
  line: 1,
  col: 243,
  snippet: `middle text-anchor="middle" font-size="200px" fill=`
}

I know some of those entries are already in the object, which is great. Having the extra ones, like errorCode, message and element or similar would help with UI work.

Streams

Hi, great library!
I have a question though: the documentation says that this library can support stream-based parsing, but the xmlParser constructor only accepts a string as its argument. Any tips on how I could use this library with a ReadableStream?
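
For a parser that takes a string, one workaround is to buffer the whole stream into a string first and parse that. The sketch below assumes the stream is async-iterable and yields byte chunks (Uint8Array or Buffer), as Node.js readable streams do when no encoding is set. collectUtf8 is a hypothetical helper, not part of the library.

```javascript
// Hypothetical helper: read every chunk from an (async) iterable of bytes and
// return the decoded UTF-8 text.
async function collectUtf8(chunks) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  let text = '';
  for await (const chunk of chunks) {
    // stream: true keeps a multi-byte character that is split across two
    // chunks buffered until its remaining bytes arrive.
    text += decoder.decode(chunk, { stream: true });
  }
  // A final call without arguments flushes the decoder.
  return text + decoder.decode();
}

// Usage sketch (assumes parseXml is imported and `stream` yields bytes):
//
//   const doc = parseXml(await collectUtf8(stream));
```

This gives up the memory benefits of streaming, since the whole document is held in memory before parsing starts.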


&#xA; is interpreted as space instead of line feed

I have the following element with an attribute.

<element attribute="value0&#xA;value1" />


&#xA; is an encoded line feed, but the parser converts it to a space instead, which is unexpected.
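
For reference, the attribute-value normalization rules in XML 1.0 (section 3.3.3) replace a literal white space character (#x20, #xD, #xA, #x9) with a space, but append the referenced character itself for a character reference. A minimal sketch of that distinction follows. normalizeAttributeValue is a hypothetical function that assumes line endings were already normalized and leaves entity references such as &amp; untouched.

```javascript
// Hypothetical sketch of the whitespace and character-reference parts of
// attribute-value normalization. Entity references are out of scope here.
function normalizeAttributeValue(raw) {
  return raw.replace(/[\t\n\r]|&#x([0-9a-fA-F]+);|&#([0-9]+);/g, (match, hex, dec) => {
    // A character reference yields the referenced character as-is.
    if (hex !== undefined) return String.fromCodePoint(parseInt(hex, 16));
    if (dec !== undefined) return String.fromCodePoint(parseInt(dec, 10));
    // A literal tab, line feed, or carriage return becomes a space.
    return ' ';
  });
}

console.log(JSON.stringify(normalizeAttributeValue('value0&#xA;value1')));
// prints "value0\nvalue1"
console.log(JSON.stringify(normalizeAttributeValue('value0\nvalue1')));
// prints "value0 value1"
```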

Serialising back to XML

Hi there,

I'm wondering if you would consider adding functionality such that XmlDocument (and perhaps XmlNode) objects could be serialised back to strings of XML.

Thanks for a great library!

Very cool lib

It's not an issue.

I just want to say this is a very cool lib. The code is a pleasure to navigate too. 👍

Option to ignore missing ends

Hi,

Excellent library. The only thing that didn't work out for me was parsing old HTML documents, like a bookmarks file generated by Chrome:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="1501384533" LAST_MODIFIED="1509512466" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
    <DL><p>
        <DT><H3 ADD_DATE="1509512421" LAST_MODIFIED="1509512514">foo</H3>
        <DL><p>
            <DT><H3 ADD_DATE="1509512426" LAST_MODIFIED="1509512489">bar</H3>
            <DL><p>
                <DT><A HREF="https://github.com/" ADD_DATE="1509512445">Github</A>
                <DT><A HREF="https://getkozmos.com/" ADD_DATE="1509512489">Kozmos</A>
                <DT><A HREF="https://getkozmos.com/" ADD_DATE="1509512489">Duplicate</A>
            </DL><p>
            <DT><A HREF="http://novatogatorop.com/" ADD_DATE="1509512514">Nova Togatorop</A>
        </DL><p>
        <DT><H3 ADD_DATE="1509512466" LAST_MODIFIED="1509512472">span</H3>
        <DL><p>
            <DT><A HREF="http://azer.bike/" ADD_DATE="1509512461">Azer Koçulu</A>
        </DL><p>
    </DL><p>
</DL><p>

Unfortunately, even most modern browsers generate such terrible HTML files. Is there a way to parse it?

Parsing can hang on attributes with many references

I've come across a few XML documents that seem to stop my program in its tracks. I've traced the problem to the decoding of many character references in a single attribute value.

For example, take the following example XML with 35 references:

<a b="&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;"></a>

As is, it takes over 9 minutes to parse on my machine. With 34 &lt; references it takes 4.8 minutes, while 36 references take 17.5 minutes. My guess is that this is due to "catastrophic backtracking" in a RegExp.

Interestingly, self-closing the <a> tag makes things work at a normal / expected speed.

<a b="&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;"/>
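
For comparison, references can be decoded in a single pass with a pattern that has no nested quantifiers, which rules out exponential backtracking. The sketch below is an assumption about one way to do this, not the library's code. decodeReferences is a hypothetical function that handles character references and the five predefined entities only.

```javascript
// The five entities predefined by XML.
const PREDEFINED = { lt: '<', gt: '>', amp: '&', apos: "'", quot: '"' };

// Hypothetical single-pass decoder. Each alternative is a flat character
// class followed by ";", so a failed match is abandoned after one scan.
function decodeReferences(value) {
  return value.replace(/&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-z]+));/g, (match, hex, dec, name) => {
    if (hex !== undefined) return String.fromCodePoint(parseInt(hex, 16));
    if (dec !== undefined) return String.fromCodePoint(parseInt(dec, 10));
    if (Object.prototype.hasOwnProperty.call(PREDEFINED, name)) return PREDEFINED[name];
    throw new Error(`Unknown entity reference: ${match}`);
  });
}

console.log(decodeReferences('&lt;'.repeat(35)).length); // prints 35
```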