
parse-xml's Issues

Optionally include XML declarations and doctype declarations in the DOM

It would be useful if parse-xml provided parser options which, when enabled, would cause XML declarations and doctype declarations to be included in the DOM. This would allow a custom serializer to round-trip a parsed document to an equivalent (though not strictly equal) XML string.

Documentation says second argument is optional, but TS compiler says it's required

Docs say I can do this:

parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');

But I get an error:

Expected 2 arguments, but got 1.

Workaround:

parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>', undefined);

I think that the type definition needs to make the second parameter optional with ?:

options?: ParserOptions
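
A minimal sketch of what such a declaration might look like. `ParserOptions` below is a placeholder with a made-up field, and `parseXmlStub` is a hypothetical stand-in for the real function; only the `?` on the second parameter matters here.

```typescript
// Hypothetical options shape, for illustration only; the real fields may differ.
interface ParserOptions {
  exampleFlag?: boolean;
}

// The `?` marks `options` as optional, so a one-argument call type-checks.
function parseXmlStub(
  xml: string,
  options?: ParserOptions
): { source: string; options: ParserOptions } {
  // Stand-in body: the real function would parse `xml` into a document.
  return { source: xml, options: options || {} };
}

// Compiles without "Expected 2 arguments, but got 1."
const result = parseXmlStub('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');
```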

Include additional structured information in error objects

Thanks for this library, it's the only one I've found with useful, descriptive error handling besides the DOMParser.

Would it be possible to extend the error object to break down the message a bit more? What I have in mind is to go from this:

Unclosed start tag for element `text` (line 1, column 243) iddle text-anchor="middle" font-size="200px" fill= ^ 

To this:

{
  errorCode: 0,
  message: 'Unclosed start tag for element',
  element: 'text',
  line: 1,
  col: 243,
  snippet: `middle text-anchor="middle" font-size="200px" fill=`
}

I know some of those entries are already in the object, which is great. Having the extra ones, like errorCode, message and element or similar would help with UI work.
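
Until something like that exists, a rough workaround might be to split the message string on the caller's side. The sketch below assumes the first line of the message always has the form `<summary> (line N, column M)`, as in the example above. The names `StructuredError` and `splitErrorMessage` are hypothetical and not part of the library.

```typescript
// Hypothetical shape for the pieces recovered from a message string.
interface StructuredError {
  message: string;
  element?: string;
  line: number;
  column: number;
}

function splitErrorMessage(raw: string): StructuredError | null {
  // Matches "<summary> (line N, column M)" at the start of the string.
  const match = /^(.*?) \(line (\d+), column (\d+)\)/.exec(raw);
  if (match === null) return null;

  const summary = match[1];
  // Pull an element name out of backticks, if the summary mentions one.
  const element = /`([^`]+)`/.exec(summary);

  return {
    message: summary,
    element: element ? element[1] : undefined,
    line: Number(match[2]),
    column: Number(match[3]),
  };
}

const parsed = splitErrorMessage('Unclosed start tag for element `text` (line 1, column 243)');
```

This is brittle by design: it breaks as soon as the message wording changes, which is a fair argument for exposing the fields on the error object itself.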

Positional info

Hi Ryan! 👋

I’m looking for a well-tested, mostly-spec-compliant XML parser that, importantly, provides positional info (specifically the starting and ending line, column, and offset into the whole string), for xast.

I see that this project has positional info on errors, but I don’t see it on nodes. Would it be of interest to add that data on nodes in this project?
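
For illustration, here is a sketch of the kind of data meant here: a point with a one-based line and column plus a zero-based offset, which is the convention I believe unist-style trees use. `Point` and `pointAt` are hypothetical names, and only `\n` is treated as a line break.

```typescript
interface Point {
  line: number;   // one-based
  column: number; // one-based
  offset: number; // zero-based index into the source string
}

// Converts a string offset into a line/column pair by counting newlines before it.
function pointAt(source: string, offset: number): Point {
  const end = Math.min(Math.max(offset, 0), source.length); // clamp to the string
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < end; i++) {
    if (source[i] === '\n') {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: end - lineStart + 1, offset: end };
}

// The `<b/>` element starts at offset 6, which is line 2, column 3.
const start = pointAt('<a>\n  <b/>\n</a>', 6);
```

A parser that already tracks its scan position could record the start and end offsets per node and derive line and column lazily like this, rather than computing them for every node up front.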

Consider adding opt-in support for parsing XML 1.1

Hi there,

I've come across some XML files that contain hex character references to control characters, e.g. <a>hello&#x7;</a>. Currently, these files produce a parse error with this library due to the check on line 400.

What are your thoughts on relaxing this check, or perhaps having an explicit opt-in option, to allow explicit character references in this control character range?
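
A sketch of what an opt-in check might look like, based on the `Char` productions in the XML 1.0 and XML 1.1 specs. The function names and the `allowXml11` flag are hypothetical, and this is not the library's actual check.

```typescript
// True if `cp` matches the Char production of XML 1.0.
function isXml10Char(cp: number): boolean {
  return cp === 0x9 || cp === 0xa || cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff);
}

// True if `cp` matches the Char production of XML 1.1, which also admits
// U+0001 through U+001F (NUL is still excluded).
function isXml11Char(cp: number): boolean {
  return (cp >= 0x1 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff);
}

// Decides whether a character reference to `cp` is acceptable.
// `allowXml11` is a hypothetical opt-in flag, off by default.
function isAllowedCharRef(cp: number, allowXml11: boolean = false): boolean {
  return allowXml11 ? isXml11Char(cp) : isXml10Char(cp);
}
```

With the flag off, `&#x7;` is rejected as it is today; with it on, the reference is accepted, while `&#x0;` stays invalid either way.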

Thanks!

Parsing can hang on attributes with many references

I've come across a few XML documents that seem to stop my program in its tracks. I've traced the problem to the decoding of many character references in a single attribute value.

For example, take the following example XML with 35 references:

<a b="&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;"></a>

As is, it takes over 9 minutes to parse on my machine. Decreasing to 34 &lt;, it takes 4.8 minutes, while increasing to 36 gives a parse time of 17.5 minutes. My guess is that this is due to a "catastrophic backtracking" RegExp.

Interestingly, self-closing the <a> tag makes things work at a normal / expected speed.

<a b="&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;"/>
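
If backtracking is indeed the cause, one possible direction is to decode references with a flat pattern in which no quantifier wraps another quantifier. The sketch below is an assumption about how that could look, not the library's code: `decodeReferences` is a hypothetical name, it only knows the five predefined entities, and it does not validate code points (`String.fromCodePoint` throws a RangeError for values above 0x10FFFF).

```typescript
const PREDEFINED: { [name: string]: string } = {
  amp: '&', apos: "'", gt: '>', lt: '<', quot: '"',
};

function decodeReferences(value: string): string {
  // Each alternative is tried once per `&`; the group itself is never repeated,
  // so the work per reference stays proportional to its length.
  return value.replace(
    /&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z]+));/g,
    (whole: string, hex?: string, dec?: string, name?: string): string => {
      if (hex !== undefined) return String.fromCodePoint(parseInt(hex, 16));
      if (dec !== undefined) return String.fromCodePoint(parseInt(dec, 10));
      const replacement = name !== undefined ? PREDEFINED[name] : undefined;
      // Unknown entity names are left untouched in this sketch.
      return replacement !== undefined ? replacement : whole;
    }
  );
}

const decoded = decodeReferences('&lt;'.repeat(35));
```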

Add type information for errors

parse-xml error objects contain additional properties, but currently we don't export a type describing these errors. We should!

Very cool lib

It's not an issue.

I just want to say this is a very cool lib. The code is a pleasure to navigate too. 👍

RegExp issue with very long attributes

I've run into a parsing issue with some very long XML attributes. For example, the following will raise a RangeError: Maximum call stack size exceeded while running the Attribute regex:

parseXml(`<a b="${'a'.repeat(9000000)}"/>`);

I was able to avoid the error by slightly relaxing the attribute value parts of the Attribute regex as follows:

[^<&"] | ${exports.Reference} => [^<"]

but this change causes 4 tests to fail that are expecting exceptions from various bad & uses in attribute values.

Perhaps the regexes could be kept strict while avoiding the call stack exception, or perhaps a 2-phase approach could be used where the attribute value is checked for invalid references after being parsed from the overall tag.
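
A rough sketch of the two-phase idea, under some assumptions: the caller already knows the index of the opening quote, and entity names are limited to ASCII (the real `Name` production is much wider). `readAttributeValue` is a hypothetical helper, not something the library exposes.

```typescript
// Anchored pattern for one complete reference, e.g. "&lt;", "&#65;" or "&#x41;".
// Simplification: entity names are restricted to ASCII here.
const REFERENCE = /^&(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z_:][A-Za-z0-9_:.-]*);$/;

function readAttributeValue(
  input: string,
  openQuoteIndex: number
): { value: string; end: number } {
  const quote = input[openQuoteIndex];
  if (quote !== '"' && quote !== "'") throw new Error('Expected a quote character');

  // Phase 1: find the closing quote with a plain scan; no regex is involved,
  // so the length of the value cannot exhaust the call stack.
  const end = input.indexOf(quote, openQuoteIndex + 1);
  if (end === -1) throw new Error('Unterminated attribute value');
  const value = input.slice(openQuoteIndex + 1, end);

  // Phase 2: keep the strict checks, but apply them to the extracted value.
  if (value.indexOf('<') !== -1) throw new Error('"<" is not allowed in an attribute value');
  let amp = value.indexOf('&');
  while (amp !== -1) {
    const semi = value.indexOf(';', amp);
    if (semi === -1 || !REFERENCE.test(value.slice(amp, semi + 1))) {
      throw new Error('Invalid reference in attribute value');
    }
    amp = value.indexOf('&', semi + 1);
  }
  return { value, end };
}
```

The tests that expect exceptions for bad `&` usage should still have something to catch, since the second phase rejects a bare `&` or a malformed reference.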

Serialising back to XML

Hi there,

I'm wondering if you would consider adding functionality such that XmlDocument (and perhaps XmlNode) objects could be serialised back to strings of XML.
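
To make the request concrete, here is a minimal sketch of a serializer over a simplified tree. The `SketchNode` shape is invented for this example and is not the library's `XmlNode`; comments, CDATA, processing instructions, and attribute whitespace normalization are all ignored.

```typescript
// Hypothetical node shapes for illustration; the library's classes may differ.
type SketchNode =
  | { type: 'text'; text: string }
  | {
      type: 'element';
      name: string;
      attributes: { [name: string]: string };
      children: SketchNode[];
    };

function escapeText(s: string): string {
  // `&` must be escaped first so the other replacements are not double-escaped.
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function escapeAttribute(s: string): string {
  return escapeText(s).replace(/"/g, '&quot;');
}

function serialize(node: SketchNode): string {
  if (node.type === 'text') return escapeText(node.text);

  const { name, attributes, children } = node;
  const attrs = Object.keys(attributes)
    .map((key) => ` ${key}="${escapeAttribute(attributes[key])}"`)
    .join('');

  // Elements without children are written in self-closing form.
  if (children.length === 0) return `<${name}${attrs}/>`;
  return `<${name}${attrs}>${children.map((child) => serialize(child)).join('')}</${name}>`;
}

const kittens: SketchNode = {
  type: 'element',
  name: 'kittens',
  attributes: { fuzzy: 'yes' },
  children: [{ type: 'text', text: 'I like fuzzy kittens.' }],
};
const xml = serialize(kittens);
```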

Thanks for a great library!

Don't trim comment content

parse-xml trims the content of comment nodes, but the spec doesn't require this, and it differs from what libxml2 does.

Option to ignore missing ends

Hi,

Excellent library. The only thing that didn't work out for me was parsing old HTML documents, like a bookmarks file generated by Chrome:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="1501384533" LAST_MODIFIED="1509512466" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
    <DL><p>
        <DT><H3 ADD_DATE="1509512421" LAST_MODIFIED="1509512514">foo</H3>
        <DL><p>
            <DT><H3 ADD_DATE="1509512426" LAST_MODIFIED="1509512489">bar</H3>
            <DL><p>
                <DT><A HREF="https://github.com/" ADD_DATE="1509512445">Github</A>
                <DT><A HREF="https://getkozmos.com/" ADD_DATE="1509512489">Kozmos</A>
                <DT><A HREF="https://getkozmos.com/" ADD_DATE="1509512489">Duplicate</A>
            </DL><p>
            <DT><A HREF="http://novatogatorop.com/" ADD_DATE="1509512514">Nova Togatorop</A>
        </DL><p>
        <DT><H3 ADD_DATE="1509512466" LAST_MODIFIED="1509512472">span</H3>
        <DL><p>
            <DT><A HREF="http://azer.bike/" ADD_DATE="1509512461">Azer Koçulu</A>
        </DL><p>
    </DL><p>
</DL><p>

Unfortunately, even most modern browsers generate such terrible HTML files. Is there a way to parse it?

Does it support running in WebWorkers?

Does this library work from browser WebWorkers?

I noticed that the returned value is an XmlDocument instance, which I did not think WebWorkers supported.

Great work on this library - thanks for sharing it

/Paul

Streams

Hi, great library!
I have a question, though: the documentation says that this library can support stream-based parsing, but the xmlParser constructor only accepts a string as its argument. Any tips on how I could use this library with a ReadableStream?
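
If the parser only accepts a string, one fallback is to buffer the whole stream and decode it before parsing. This is not incremental parsing, just a sketch of that workaround. It assumes UTF-8 input and a runtime with a global `TextDecoder`; `decodeChunks`, `ChunkReader`, and `readAllText` are hypothetical names.

```typescript
// Decodes UTF-8 chunks into a single string. `stream: true` makes the decoder
// hold back a multi-byte character that is split across two chunks.
function decodeChunks(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder('utf-8');
  let text = '';
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  return text + decoder.decode(); // flush anything still buffered
}

// Minimal structural type for a reader such as the one returned by
// ReadableStream.getReader(); declared here so the sketch is self-contained.
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
}

// Drains the reader into memory, then decodes everything at once.
async function readAllText(reader: ChunkReader): Promise<string> {
  const chunks: Uint8Array[] = [];
  for (;;) {
    const { done, value } = await reader.read();
    if (value !== undefined) chunks.push(value);
    if (done) break;
  }
  return decodeChunks(chunks);
}
```

The string resolved by `readAllText(stream.getReader())` could then be handed to the parser as usual. Memory use grows with the document size, so this only helps when the document fits in memory.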
