ebookingservices / htmld

Lightweight and forgiving HTML parser and DOM

License: MIT License

D 100.00%

htmld's Issues

Adapt for CTFE compatibility

Currently the entity processing code in parse.d will not execute at compile time:

if (auto pindex = name in index_) {

Otherwise, this project can run under CTFE, which is nice during debugging.

html tags get attributes

Hi, how can I get the value of an HTML tag's attributes?

Creating and inserting a element produces a end tag

... but   is not supposed to have an end tag.

This bug appears to have been introduced in v0.3.5.
In v0.3.4 it produces a simple   with no end tag, as expected.

Minimal reproduction of the issue here: https://github.com/forbjok/htmld-repro

find() is completely broken

As far as I understand, find is supposed to search only in descendants of the current node. However, find actually searches most nodes in the document:

    auto doc = createDocument("<html>
    <body>
        <div id=\"a1\">
          <div id=\"b1\">
             <div id=\"c1\"></div>
             <div id=\"c2\"></div>
          </div>
          <div id=\"b2\"></div>
        </div>
        <div id=\"a2\"></div>
    </body></html>");
    auto elem = doc.querySelector("#c1");
    writeln(elem.id); //c1
    writeln(elem.find("div")); //[<div id="c1"/>, <div id="c2"/>, <div id="b2"/>, <div id="a2"/>]

To prevent find from scanning ancestors, the popFront function in DescendantsDFForward needs to be fixed:

if (parent /*&& (top_ != parent)*/) //FIX in commented code
    next = parent.next_;

However, even with this fix, find will still search siblings (c1 and c2). Is this the expected behaviour? DescendantsDFForward explicitly sets top_ = first.parent_, but I think it'd be more intuitive if find only searched descendants.
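
A descendants-only walk can be sketched as follows. This is a minimal illustration on a toy tree, not htmld's DescendantsDFForward: ToyNode, add, buildSample and descendantIds are hypothetical names made up for this example. The point is that the climb back up stops at the node the search started from, so neither siblings nor ancestors are visited.

```d
import std.stdio : writeln;

// Hypothetical stand-in for a DOM node; not htmld's actual Node type.
final class ToyNode {
    string id;
    ToyNode parent, firstChild, next;
    this(string id) { this.id = id; }

    // Append `child` as the last child of this node; returns this for chaining.
    ToyNode add(ToyNode child) {
        child.parent = this;
        if (firstChild is null) {
            firstChild = child;
        } else {
            auto last = firstChild;
            while (last.next !is null) last = last.next;
            last.next = child;
        }
        return this;
    }
}

// Collects the ids of the strict descendants of `top` in document order.
string[] descendantIds(ToyNode top) {
    string[] ids;
    auto curr = top.firstChild;
    while (curr !is null) {
        ids ~= curr.id;
        if (curr.firstChild !is null) {
            curr = curr.firstChild;
            continue;
        }
        // Climb until a next sibling exists, but never climb past `top`.
        while (curr !is top && curr.next is null)
            curr = curr.parent;
        curr = (curr is top) ? null : curr.next;
    }
    return ids;
}

// Same shape as the document in the report: body > a1 > (b1 > (c1, c2), b2), a2.
ToyNode[string] buildSample() {
    ToyNode[string] n;
    foreach (id; ["body", "a1", "a2", "b1", "b2", "c1", "c2"])
        n[id] = new ToyNode(id);
    n["b1"].add(n["c1"]).add(n["c2"]);
    n["a1"].add(n["b1"]).add(n["b2"]);
    n["body"].add(n["a1"]).add(n["a2"]);
    return n;
}

void main() {
    auto n = buildSample();
    writeln(descendantIds(n["c1"])); // []
    writeln(descendantIds(n["a1"])); // ["b1", "c1", "c2", "b2"]
}
```

With this behaviour, searching from c1 yields nothing, because c1 has no descendants.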

pointer operation deprecation in dmd 2.078.0

../../.dub/packages/htmld-0.2.18/htmld/src/html/alloc.d(39,12): Deprecation: cannot subtract pointers to different types: void* and ubyte*.

Implement Relation.Rule.Or?

This syntax:

obj.querySelectorAll("p,b")

should select all p or b elements.

Where is handler used?

Hi! I am a hobby programmer, and I do not fully understand some parts of the language. In the doc section you wrote "Example handler". What are handlers, and where can they be helpful? Could you give some examples?

Segfault when calling .destroy() on element while iterating querySelectorAll()

The following will cause a segfault from v0.3.1, up to and including the current master branch of htmld.

foreach(e; document.querySelectorAll("script", document.root)) {
    e.destroy();
}

Minimal reproduction: htmld-segfault-repro.zip

Tested with DMD v2.079 and LDC2 v1.8.0.

I don't know whether this would be considered a bug, or whether the fact that it worked before was just an accident, but I do know it works fine in v0.2.19 and older.

Currently, as a workaround, I've changed my code to just add all elements to an array and iterate and destroy each one from the array afterwards.
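
Why the collect-first workaround helps can be shown on a toy linked list. This is only a sketch under assumptions: Item, Walk, makeList and the two counting functions are hypothetical and do not use htmld; in the toy the lazy walk merely stops early rather than crashing. A lazy range reads the links of the current element after the loop body has already torn them down, while an eager snapshot made with std.array.array does not.

```d
import std.array : array;
import std.stdio : writeln;

// Hypothetical doubly linked toy node; not htmld's Node.
final class Item {
    int value;
    Item prev, next;
    this(int value) { this.value = value; }

    // Unlink this item and clear its own links, as a destroyed node might.
    void unlink() {
        if (prev !is null) prev.next = next;
        if (next !is null) next.prev = prev;
        prev = next = null;
    }
}

// Lazy input range that follows `next` pointers as it goes.
struct Walk {
    Item curr;
    bool empty() const { return curr is null; }
    Item front() { return curr; }
    void popFront() { curr = curr.next; }
}

// Builds a list holding the values 0 .. n and returns its head.
Item makeList(int n) {
    Item head, tail;
    foreach (i; 0 .. n) {
        auto item = new Item(i);
        if (tail is null) head = item;
        else { tail.next = item; item.prev = tail; }
        tail = item;
    }
    return head;
}

// Unlinks while iterating lazily: popFront then reads a cleared link.
int unlinkLazily(Item head) {
    int visited = 0;
    foreach (item; Walk(head)) {
        item.unlink();
        ++visited;
    }
    return visited;
}

// Unlinks after copying the matches into an array first.
int unlinkFromSnapshot(Item head) {
    int visited = 0;
    foreach (item; Walk(head).array) {
        item.unlink();
        ++visited;
    }
    return visited;
}

void main() {
    writeln(unlinkLazily(makeList(4)));       // 1
    writeln(unlinkFromSnapshot(makeList(4))); // 4
}
```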

UPDATE:
I did a bit more testing, and the exact commit that introduces the segfault is 8fe2134.

Compile Time support

It would be very interesting to have some functions working at compile time.
For example createDocument() would save a lot of time if parsing is done at compile time.

Returning a document silently produces corrupted documents.

Every node keeps a pointer to its document... a pointer. So that means documents cannot be copied. For instance:

unittest {
    Document foo() {
        return createDocument("<a>");
    }
    Document bar = foo();
    bar.root.appendChild(bar.createElement("fuuuuuuu"));
}

For reasons I can't fathom, this module does not allow moving a node from one document to another: appending a node that belongs to a different document just errors out. Removing that check would be pretty harmless, since documents don't really do anything, but I dunno if there are any subtle gotchas there. But because the document must always be the same, this code snippet is an unmitigable error.

createDocument() creates and returns a Document, and in creating it, createDocument adds a root node, and a node for "a". Then foo() returns a Document, which silently copies the document that createDocument returned. bar is now "secretly blitted document 2" but the pointers of every single node in it haven't been rewritten. Thus bar.root.document_ is still document 1. bar.root.appendChild() therefore uses document 1, while bar.createElement() creates elements with document 2.

The end result is a document that has nodes from another document, and nothing can add elements to those nodes.

Possible solutions:

1. Upgrade clone() to be the "this(this)" copy constructor. That way instead of secretly throwing away the valid document and creating an invalid one, it will secretly descend through the entire node tree, secretly duplicating each one.
2. Have createDocument return a Document*, since this module basically reimplements with structs everything that classes already do.
3. When appending a child, instead of flipping out if the documents are different, simply assign the child's document to the current one, and possibly detach() it from any old node tree it might be part of.

I like the third option, honestly. There's no reason for an unrecoverable error there. It's just move semantics by default. People who want true copies of nodes, can appendChild(node.clone()). It's very efficient to just move a node from one document to another this way, and there's no scaling problems like with stuff that has to recursively alter the node tree.

Only problem is that children of an appended child will still have the old document. But since that's not an error anymore, it doesn't really seem like a huge concern. And if it is a huge concern, and scaling is not an issue, then you can just recursively set the document_ of all the children. That would still be scads cheaper than cloning the document.
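
The third option, including the recursive document_ update, might look roughly like this. It is a sketch on toy types only: Doc, Node, adopt, detach and appendChild here are hypothetical and are not htmld's implementation.

```d
import std.stdio : writeln;

// Hypothetical toy types; htmld's real Document and Node differ.
final class Doc {
    string name;
    this(string name) { this.name = name; }
}

final class Node {
    Doc doc;
    Node parent;
    Node[] children;
    this(Doc doc) { this.doc = doc; }

    // Remove this node from its current parent, if any.
    void detach() {
        if (parent is null) return;
        Node[] kept;
        foreach (c; parent.children)
            if (c !is this) kept ~= c;
        parent.children = kept;
        parent = null;
    }

    // Point this node and its whole subtree at `newDoc`.
    void adopt(Doc newDoc) {
        doc = newDoc;
        foreach (c; children) c.adopt(newDoc);
    }

    // Move semantics: a child from another document is adopted, not rejected.
    void appendChild(Node child) {
        child.detach();
        if (child.doc !is doc) child.adopt(doc);
        child.parent = this;
        children ~= child;
    }
}

void main() {
    auto d1 = new Doc("one");
    auto d2 = new Doc("two");
    auto oldRoot = new Node(d1);
    auto moved = new Node(d1);
    auto grandchild = new Node(d1);
    oldRoot.appendChild(moved);
    moved.appendChild(grandchild);

    auto newRoot = new Node(d2);
    newRoot.appendChild(moved);    // moves the subtree from d1 to d2

    writeln(moved.doc.name);       // two
    writeln(grandchild.doc.name);  // two
    writeln(oldRoot.children.length); // 0
}
```

The recursive adopt costs one pointer write per node in the moved subtree, which is the "recursively set the document_" variant described above; skipping the recursion would leave grandchild pointing at the old document.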

The second option is an okay one too. Classes are implemented in D that way, with everything being a pointer. Since this module already reinvents all that logic internally with pointers and alloc(), it could stop pretending to be a real struct on the surface and just return a Document*. I don't like it, but at least it would eliminate hard to find errors where a document is secretly, implicitly copied. You could set @disable this(this) if you returned a pointer, I think. But there might be other errors where the code requires a document be copied no matter what.

The first option is terrible. If there are move semantics by default, and you want to copy a node, it's just move(...,node.clone()). If there are copy semantics by default, then moving a node is entirely impossible, and everything can only be copied. That gets really annoying when you're trying to embed one document fragment in another document.

Attribute values not quoted when getting HTML from Document

When simply parsing an HTML document and then immediately re-retrieving it from the DOM document, quotes around attribute values disappear.
I noticed this because unit tests in one of my projects started failing due to the retrieved HTML no longer matching the expected output.

This issue appears to have been introduced in v0.3.0. (0.2.19 and earlier don't have this, but of course, also don't compile with current D compilers)

Simple reproduction project showing the issue:
htmld-repro.zip

As far as I can tell, this is not technically a bug, but perhaps there should be a setting to force quoting of all attribute values, as I (and probably others) find inconsistent quoting in the resulting HTML to look messy and undesirable.

No apostrophes allowed!

writeln(createDocument!(DOMCreateOptions.None)("It's").root.html);
=>
I'ts

Instead of encoding text as entities when stored in the document, it also escapes those entities again when printing the document out. writeHTMLEscaped doesn't have any support for not doing that, so there's no way to stop characters from getting encoded as numeric entities. You have to escape the apostrophe if it's inside an attribute value, but there's no accounting for that. Plus there's smart quotes, and innumerable utf-8 characters that make perfectly valid HTML, but are forced into ugly numeric escapes instead, with this strategy.

It's confusing terminology too. "decode entities" seems to mean turning " " into " ". I don't think anyone would want to do that by default, and that's actually _trans_coding named entities into numeric entities, not decoding anything. "decoding entities" should at least mean turning the utf-8 encoded " " into " " Strictly speaking, that's transcoding too, since " " is still just a sequence of undecoded bytes, not a complex data structure. Decoding is where you take " " or " " and turn it into a call to onEntity() or something. It looks like the code tries to do that, but... doesn't really do it right.

I like to adopt the strategy of "never decode, until you are ready to display", which is a fancy way of writing "never decode." I only decode stuff when I need meaningful data inside it, otherwise I run the risk of double decoding it, or being unable to specify what encoding it forces everything into.

Really, entities are entirely separate from the structure of the HTML document. Entities aren't separate nodes in the DOM for any major web browser; they're just more bytes within the text node. There are also two entirely separate categories of entity that get conflated together: 7-bit characters that would mess up the HTML if unescaped, and 8-bit characters. I'll want to escape some HTML by replacing < and >, and maybe " with entities, but no other ones. Conversely, if someone wants to turn their HTML into proper English schoolteacher HTML, they will want all codepoints escaped, but leave < and > intact.

Here's how I think it should work:

writeln(createDocument("It’s").root.html);
=>
It’s

writeln(escapeEntities(createDocument("It’s <e/>").root.html));
=>
It'ts <e/>

writeln(escapeEntities!(named: false)(createDocument("It’s <e/>").root.html));
=>
Itߣts <e/>

writeln(escapeHTML(createDocument("It’s "<e/>"").root.html));
=>
It’s "<e/&gt"

writeln(escapeAttribute("this attribute has a \" in it, as well as > and <"));
=>
this attribute has a " in it, as well as > and <

node.html("<p>");
writeln(node.html);

=>


node.html(escapeHTML("<p>"));
writeln(node.html);

=>


I'm honestly starting to think that a HTML parser should not deal with entities at all. They have to be dealt with using a separate character-by-character parser since they're on the level of characters, and not part of the document structure.

Add indexing for id

I think it could be a good optimization:

On the dom struct, add Node*[string] idIdx_;
When node.attr(k, v) is called, if k == "id", do idIdx_[v] = &this;
If v == "" or when the node is destroyed, remove its entry

Now a (quite common) call to dom.querySelector("#test ...") could be done in a very fast way, using a lookup table (only if selector starts with #id)
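
A rough sketch of such an index, using toy classes rather than htmld's structs: Dom, Node, byId, unindex and dispose are hypothetical names, and class references stand in for the Node* pointers proposed above.

```d
import std.stdio : writeln;

// Hypothetical toy DOM holding the proposed id lookup table.
final class Dom {
    Node[string] idIdx_;

    // Constant-time lookup for selectors of the form "#id".
    Node byId(string id) {
        auto p = id in idIdx_;
        return p is null ? null : *p;
    }

    // Drop the entry for `id`, but only if it still points at `owner`.
    void unindex(string id, Node owner) {
        auto p = id in idIdx_;
        if (p !is null && *p is owner) idIdx_.remove(id);
    }
}

final class Node {
    Dom dom;
    string[string] attrs;
    this(Dom dom) { this.dom = dom; }

    // Setting "id" keeps the index in sync; an empty value removes the entry.
    void attr(string k, string v) {
        if (k == "id") {
            if (auto old = "id" in attrs) dom.unindex(*old, this);
            if (v.length) dom.idIdx_[v] = this;
        }
        attrs[k] = v;
    }

    // Stand-in for node destruction: the index entry must go too.
    void dispose() {
        if (auto old = "id" in attrs) dom.unindex(*old, this);
    }
}

void main() {
    auto dom = new Dom;
    auto node = new Node(dom);
    node.attr("id", "test");
    writeln(dom.byId("test") is node); // true
    node.dispose();
    writeln(dom.byId("test") is null); // true
}
```

One design question the sketch leaves open is what to do when two nodes share an id; here the most recently assigned node wins.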

PS: why doesn't something like node.querySelectorAll(HTMLString selector) { return document_.querySelectorAll(selector, &this); } exist?

depth first search is not depth first

I was toying with adding breadth first search, so you could skip trees of uninteresting children, but the more I looked at the algorithm, the more it looked breadth first, not depth first.

Okay, first in the constructor, it sets curr_ if the condition is true. But curr_ is the top node, so that's as breadth first as you get! Secondly, in popFront, it checks for curr_ to have a child, and sets curr_ to that child, then checks the condition. If that child itself has children, they don't get visited before this child is checked.

If it were depth first, it would need to loop until there were no children at all, then visit that node, then go up parent nodes, visiting each one. Once a parent node has a next node, go to that node, do not visit it, and start going down to the lowest child again.

I believe it is a breadth first traversal, even though it goes down the children, before going to the next siblings. The condition is checked at the topmost node, and no more depth is traversed, until the next iteration. Parent nodes will always be produced by the iterator before their children, since it checks the condition as soon as it goes down to firstChild, rather than continuing down firstChild to the bottom, then checking the condition only while returning back up.

So, good news for me: I can just say "void skipChildren() { curr_ = curr_.next_; }" and it'll skip those children. But if you needed depth first searching, say to prune children without messing up the iterator, you'd mess up the iterator. Whatever you mutated the tree into, the iterator would descend into that, after visiting the parent node.

If it is a breadth first iterator, maybe it should go to .next_ first, rather than .firstChild_. You can use next_ to go across, just like firstChild_ goes down, until you hit the end, and then .prev_ instead of .parent_ to return, until a node is found that has a .firstChild_. That'd be a more intuitive way of breadth first searching... I think?

Considering this document:

<A>
  <AB>
    <BE/><BF/><BG/>
  </AB>
  <AC>
    <CH/><CI/><CJ/>
  </AC>
  <AD>
    <DK/><DL/>
  </AD>
</A>

A
AB       AC       AD
BE BF BG CH CI CJ DK DL

currently the order returned is:
A, AB, BE, BF, BG, AC, CH, CI, CJ, AD, DK, DL (I checked)

A breadth first search that prioritized next_ above firstChild_ would have this order:
A, AB, AC, AD, DK, DL, CH, CI, CJ, BE, BF, BG

A depth first traversal would be thus:
BE, BF, BG, AB, CH, CI, CJ, AC, DK, DL, AD, A

That traversal prioritizes firstChild_ and goes from prev_ to next_. A depth first prioritizing firstChild_ and going from next_ to prev_ would be this:
DL, DK, AD, CJ, CI, CH, AC, BG, BF, BE, AB, A
which is the exact reverse order of the breadth first search prioritizing firstChild_ and going from prev_ to next_.
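
For comparison, here is how the same tree comes out under the three traversals as they are usually named in textbooks. This is a self-contained sketch on a toy tree; T, leaf, sample and the three functions are hypothetical and unrelated to htmld's iterators. In that common terminology, the order the iterator currently returns is depth-first pre-order (parent before children), the "BE, BF, BG, AB, ..." order is depth-first post-order (children before parent), and breadth-first means level by level.

```d
import std.stdio : writeln;

// Hypothetical toy tree node.
final class T {
    string name;
    T[] kids;
    this(string name, T[] kids = null) { this.name = name; this.kids = kids; }
}

T leaf(string name) { return new T(name); }

// The sample document from the report.
T sample() {
    return new T("A", [
        new T("AB", [leaf("BE"), leaf("BF"), leaf("BG")]),
        new T("AC", [leaf("CH"), leaf("CI"), leaf("CJ")]),
        new T("AD", [leaf("DK"), leaf("DL")]),
    ]);
}

// Depth-first, parent emitted before its children.
string[] preOrder(T n) {
    string[] res = [n.name];
    foreach (k; n.kids) res ~= preOrder(k);
    return res;
}

// Depth-first, parent emitted after its children.
string[] postOrder(T n) {
    string[] res;
    foreach (k; n.kids) res ~= postOrder(k);
    res ~= n.name;
    return res;
}

// Breadth-first: a FIFO queue yields one whole level before the next.
string[] levelOrder(T root) {
    string[] res;
    T[] queue = [root];
    while (queue.length) {
        auto n = queue[0];
        queue = queue[1 .. $];
        res ~= n.name;
        queue ~= n.kids;
    }
    return res;
}

void main() {
    auto tree = sample();
    writeln(preOrder(tree));   // A, AB, BE, BF, BG, AC, CH, CI, CJ, AD, DK, DL
    writeln(postOrder(tree));  // BE, BF, BG, AB, CH, CI, CJ, AC, DK, DL, AD, A
    writeln(levelOrder(tree)); // A, AB, AC, AD, BE, BF, BG, CH, CI, CJ, DK, DL
}
```

Pruning children safely while iterating, as discussed above, is the case where the post-order variant matters, since every child has already been produced by the time its parent is.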

Documentation

It would be great to have some documentation, any documentation, for how to use this module.