html5-php's People

Contributors

alecpl, alexpott, apeschar, bytestream, chris8934, downsider, ericdowell, goetas, idimopoulos, imsop, javiereguiluz, jslegers, kaznovac, kitaitimakoto, mattfarina, miso-belica, ohader, remicollet, rubenv, sakarikl, samnela, sasezaki, siwinski, stof, sylus, technosophos, tgalopin, timwolla, vasiliicuhar, zhaofengli

html5-php's Issues

Problem with loadHTMLFragment()

Hello,

I just started using your HTML5 parser and I'm trying to load a fragment.

Using the code shown in the wiki works without problems:

 require "vendor/autoload.php";
use Masterminds\HTML5;
$html5 = new HTML5();

// An example HTML fragment:
$fragment = "<p>This is a test of the HTML5 parser.<p>";
$dom = $html5->loadHTMLFragment($fragment);

But when I try to parse other tags, it does not work.
For example, this code:

 require "vendor/autoload.php";
use Masterminds\HTML5;
$html5 = new HTML5();

// An example HTML fragment:
$fragment = "<td>This is a test of the HTML5 parser.<td>";
$dom = $html5->loadHTMLFragment($fragment);

error shown:

Notice: Undefined property: DOMDocumentFragment::$tagName in C:\xampp\htdocs\html5\vendor\masterminds\html5\src\HTML5\Parser\TreeBuildingRules.php on line 138

stacktrace:

Function Location
1 {main}( ) ..\index.php:0
2 Masterminds\HTML5->loadHTMLFragment( ) ..\index.php:21
3 Masterminds\HTML5->parseFragment( ) ..\HTML5.php:128
4 Masterminds\HTML5\Parser\Tokenizer->parse( ) ..\HTML5.php:181
5 Masterminds\HTML5\Parser\Tokenizer->consumeData( ) ..\Tokenizer.php:83
6 Masterminds\HTML5\Parser\Tokenizer->tagOpen( ) ..\Tokenizer.php:126
7 Masterminds\HTML5\Parser\Tokenizer->tagName( ) ..\Tokenizer.php:269
8 Masterminds\HTML5\Parser\DOMTreeBuilder->startTag( ) ..\Tokenizer.php:371
9 Masterminds\HTML5\Parser\TreeBuildingRules->evaluate( ) ..\DOMTreeBuilder.php:398
10 Masterminds\HTML5\Parser\TreeBuildingRules->closeIfCurrentMatches( ) ..\TreeBuildingRules.php:90
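
The notice suggests that the rule code reads tagName from the current node, which at the top of a fragment is a DOMDocumentFragment rather than a DOMElement. A minimal sketch of a guard, using only PHP's built-in DOM classes, might look like this. The helper name currentTagName is hypothetical and not part of html5-php:

```php
<?php
// Hypothetical helper: return the tag name only for element nodes.
function currentTagName(DOMNode $node): ?string
{
    // Only DOMElement defines tagName; fragments, text nodes, and documents do not.
    return $node instanceof DOMElement ? $node->tagName : null;
}

$doc = new DOMDocument();
$fragment = $doc->createDocumentFragment();
$cell = $doc->createElement('td');

var_dump(currentTagName($fragment)); // NULL
var_dump(currentTagName($cell));     // string(2) "td"
```

A caller can then treat a null result as "no enclosing element" instead of reading an undefined property.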

Parse error

Hi, I encounter some problems when parsing HTML which contains certain elements:

HTML:

<!DOCTYPE html>
<html>
<body>
<div>
<table style="width: 520px; height: 361px;" border="1px solid">
<tbody>
        <tr>
                <td>a</td>
                <td>b</td>
                <td>c</td>
                <td>d</td>
                <td>d</td>
                <td>f</td>
        </tr>
</tbody>
</table>
</div>
</body>
</html>

PHP:

require_once(__DIR__ . "/vendor/autoload.php");

$html = file_get_contents("1.html");
$dom = HTML5::loadHTML($html); //DOMDocument
echo  HTML5::saveHTML($dom);

What I get is a wrong result:

<html><body>
<div>
<table style="width: 520px; height: 361px;" border="1px solid"></table>
<tbody></tbody>
        <tr></tr>
                <td></td>a
                <td></td>b
                <td></td>c
                <td></td>d
                <td></td>d
                <td></td>f



</div>
</body>

</html>

It works well using DOMDocument::loadHTML parsing the same test html file.

PHP Documentor

Is phpdocumentor/phpdocumentor in composer.json really needed?

It requires a lot of dependencies to be downloaded in build phase...

PHP Strict error messages

PHP Strict Standards:  Non-static method DOMImplementation::createDocumentType() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 60

Strict Standards: Non-static method DOMImplementation::createDocumentType() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 60
PHP Strict Standards:  Non-static method DOMImplementation::createDocument() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 62

Strict Standards: Non-static method DOMImplementation::createDocument() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 62

I've been getting these errors when running example.php.
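
The messages point at static calls to methods that are declared as instance methods. A minimal sketch of the non-static form, using only the built-in DOMImplementation class (the variable names are illustrative, not taken from DOMTreeBuilder.php):

```php
<?php
// Create an instance instead of calling the methods statically.
$impl = new DOMImplementation();

// An HTML5 doctype has a name but no public or system identifier.
$doctype = $impl->createDocumentType('html');

// Build a document whose root element is <html> and attach the doctype.
$doc = $impl->createDocument(null, 'html', $doctype);

echo $doc->doctype->name, "\n";
echo $doc->documentElement->nodeName, "\n";
```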

The input "<p>0</p>" is not handled correctly

$html5 = new HTML5();
$doc = $html5->loadHTML( "<!DOCTYPE html>\n<html><head><title></title></head><body><p>0</p><p>1</p></body></html>" );
echo $html5->saveHTML( $doc );

Result:

<!DOCTYPE html>
<html><head><title></title></head><body><p></p><p>01</p></body></html>

I'm using html5-php 2.0.0 and PHP 5.3.3-7+squeeze15.
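
One possible cause, which is an assumption since the report does not identify it, is a truthiness check on a text buffer: PHP treats the string "0" as false, so a test such as if ($text) skips it. The sketch below contrasts that check with a strict comparison. Both function names are hypothetical:

```php
<?php
// Hypothetical buggy check: "0" is falsy in PHP, so it counts as "no text".
function hasTextLoose(string $text): bool
{
    return (bool) $text;
}

// Safer check: only the empty string counts as "no text".
function hasTextStrict(string $text): bool
{
    return $text !== '';
}

var_dump(hasTextLoose('0'));  // bool(false)
var_dump(hasTextStrict('0')); // bool(true)
var_dump(hasTextStrict(''));  // bool(false)
```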

PSR-1/PSR-2

What about changing the coding standard to PSR-1/PSR-2 (with the upcoming 2.0 release)?

require-dev

The require-dev elements listed in the composer.json file are installed by default. That means every place someone installed this library they will also be installing phpunit and the symfony yaml parser in the vendor directory.

@technosophos should we pull or leave require-dev for phpunit?

CDATA sections parsed as comments, not CDATA.

The HTML5 spec supports CDATA sections, but the parser converts CDATA (incorrectly) into a comment section:

1) HTML5\Tests\SerializerTest::testCDATA
Failed asserting that '<!DOCTYPE html>
<html><head></head><body>a<!--[CDATA[ This <is--> a test. ]]&gt;b</body></html>
' matches PCRE pattern "|<![CDATA[ This <is> a test. ]]>|".

/Users/mattbutcher/Code/HP/HTML5-PHP/test/HTML5/SerializerTest.php:115

Error with parsing when HTML tags uppercased

Hi,
I discovered some weird behavior at this page http://rayer.g6.cz/. I also pasted source HTML here http://pastebin.com/FQjSEGCK .

Everything from the text in html > head > title onward is escaped (even the </TITLE> tag). I found out that if I use strtolower like this, \HTML5::loadHTML(strtolower($html)), the HTML is parsed correctly. Can you look at this, please?

Thank you for your work - I can parse HTML also in PHP finally :)

Elements with dashes (Web Components support)

Over at https://www.drupal.org/node/1333730, we're working on pulling in masterminds/html5 via Composer into Drupal 8 core. But we're running into a problem: this library doesn't seem to support elements with dashes.

Elements with dashes are necessary for Web Components support (http://w3c.github.io/webcomponents/spec/custom/). However, technically, the Web Components spec is non-normative (http://www.w3.org/TR/html5/references.html#references), so it's not necessary — strictly speaking.

That being said, I think most people would argue Web Components are clearly going to become an important aspect of web development in the not-too-distant future. Hence we want to make sure Drupal 8 doesn't break them, and it'd be necessary for this library not to break them if Drupal 8 is to use it.

Would you be willing to add support for Web Components, and hence elements with dashes?

IE conditional tags being stripped

This appears to be an issue only before the <html> tag. This:

<!--[if lte IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]> <!--><html class="no-js" lang="en"><!--<![endif]-->

is being turned into this:

<html class="no-js" lang="en"><!--<![endif]-->

Unescaped Output

The serializer currently is not encoding data properly for output. This enables certain documents to be crafted which can expose XSS vulnerabilities. For example, the cdata serializer just outputs the text directly. When crafted with a malicious payload, this results in an attack vector:

$html = "<!DOCTYPE html>
<html>
 <head>
  <title>TEST</title>
 </head>
 <body id='foo'>
  <h1>Hello World</h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>";

// Parse the document. $dom is a DOMDocument.
$dom = \HTML5::loadHTML($html);

$els = $dom->getElementsByTagName('h1');
$els->item(0)->appendChild(new DomCDataSection('this ]]><script>alert(hi!);</script><![CDATA[ is injected'));

var_dump(\HTML5::saveHTML($dom));

This will output:

<!DOCTYPE html>
<html><head>
  <title>TEST</title>
 </head>
 <body id="foo">
  <h1>Hello World<![CDATA[this ]]><script>alert(hi!);</script><![CDATA[ is     injected]]></h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>

Which is obviously bad.

Comments and raw text fields both suffer a similar problem as well (from a quick glance).
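
One way to close this hole, sketched here as an assumption rather than as the library's actual fix, is to emit CDATA content as escaped text. Browsers do not treat <![CDATA[ as a CDATA section in ordinary HTML content, so escaping &, <, and > is the safer choice there. The helper name escapeTextForHtml is hypothetical:

```php
<?php
// Hypothetical helper: escape character data so it cannot introduce markup.
function escapeTextForHtml(string $data): string
{
    // ENT_NOQUOTES leaves quotes alone; &, <, and > are always converted.
    return htmlspecialchars($data, ENT_NOQUOTES | ENT_HTML5, 'UTF-8');
}

$payload = 'this ]]><script>alert(hi!);</script><![CDATA[ is injected';
echo escapeTextForHtml($payload), "\n";
```

After escaping, the payload contains no literal < or > characters, so it can no longer open a script element.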

About namespaces...

Reading http://www.w3.org/TR/html51/syntax.html#the-before-html-insertion-mode

Especially:
A start tag whose tag name is "html"
Create an element for the token in the HTML namespace, with the Document as the intended parent. Append it to the Document object. Put this element in the stack of open elements.

HTML namespace should be: http://www.w3.org/1999/xhtml

Does this imply that line https://github.com/Masterminds/html5-php/blob/master/src/HTML5/Parser/DOMTreeBuilder.php#L227 should become:

$ele = $this->doc->createElement($lname, $htmlNs);

?
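
Note that the second argument of DOMDocument::createElement() is the element's text value, not a namespace. DOMDocument::createElementNS() is the namespace-aware call. A short sketch of the difference, with illustrative variable names:

```php
<?php
$htmlNs = 'http://www.w3.org/1999/xhtml';
$doc = new DOMDocument();

// createElement(): the second argument becomes the text content.
$plain = $doc->createElement('html', $htmlNs);

// createElementNS(): the first argument is the namespace URI.
$namespaced = $doc->createElementNS($htmlNs, 'html');

var_dump($plain->namespaceURI);      // NULL
var_dump($plain->textContent);       // the namespace string, stored as text
var_dump($namespaced->namespaceURI); // the XHTML namespace URI
```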

Whitespace in the end tag of RCDATA tags

The parser is confused if you add whitespaces into </title>, like:

<title>Note the space after "title"</title >
<title>Another example</title
>

Both examples above are valid according to the W3 Validator.

This behaviour is caused by Tokenizer.php which assumes the end tag is always exactly </title>.

Test script

<?php
require_once __DIR__ . "/vendor/autoload.php";
$html = <<<EOF
<!doctype html>
<html>
<head>
    <title>This is valid, really.</title >
</head>
<body></body>
</html>
EOF;
$parser = new Masterminds\HTML5;
$dom = $parser->loadHTML( $html );
echo $parser->saveHTML( $dom );

Output

<!DOCTYPE html>
<html><head>
    <title>This is valid, really.&lt;/title &gt;
&lt;/head&gt;
&lt;body&gt;&lt;/body&gt;
&lt;/html&gt;</title></head></html>
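
A tolerant end-tag search could match the tag name case-insensitively and accept whitespace, "/" or ">" after it. The sketch below illustrates that idea with a regular expression. findRcdataEndTag is a hypothetical helper, not the code in Tokenizer.php:

```php
<?php
// Hypothetical helper: find the offset of the end tag that closes an RCDATA
// element such as <title>, or null if there is none.
function findRcdataEndTag(string $html, string $tagName, int $offset = 0): ?int
{
    // Case-insensitive name, followed by whitespace, "/" or ">".
    $pattern = '~</' . preg_quote($tagName, '~') . '[\t\n\f\r />]~i';
    if (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        return $m[0][1];
    }
    return null;
}

var_dump(findRcdataEndTag('<title>This is valid, really.</title >', 'title'));
var_dump(findRcdataEndTag("<Title>Hello, world!</Title\n>", 'title'));
var_dump(findRcdataEndTag('<title>no end tag here', 'title'));
```

The same matching also covers end tags with unusual capitalisation, such as </TITLE> or </Title>.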

Real-world examples

Doc returned by loadHTML doesn't work with DOMXPath

$html5 = new HTML5();
$htmlStr = <<<HERE
  <!DOCTYPE html>
  <html>
  <head>
    <title></title>
  </head>
  <body>
    <p>Testing</p>
  </body>
HERE;
$doc = $html5->loadHTML( $htmlStr );

$xPath = new DOMXPath( $doc );
echo $xPath->query( '//p' )->length; // "0" in 2.0.0; "1" in 1.0.3

PHP 5.3.3.
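
A likely explanation, stated here as an assumption because the report does not confirm it, is that the 2.0.0 parser creates elements in the XHTML namespace (http://www.w3.org/1999/xhtml). In XPath 1.0 an unprefixed name such as p only matches elements in no namespace. The sketch below reproduces the effect with PHP's built-in DOM classes alone:

```php
<?php
$xhtml = 'http://www.w3.org/1999/xhtml';

// Build a tiny document whose elements live in the XHTML namespace.
$doc = new DOMDocument();
$html = $doc->appendChild($doc->createElementNS($xhtml, 'html'));
$body = $html->appendChild($doc->createElementNS($xhtml, 'body'));
$body->appendChild($doc->createElementNS($xhtml, 'p', 'Testing'));

$xpath = new DOMXPath($doc);

// An unprefixed name test looks for <p> in no namespace.
$plain = $xpath->query('//p')->length;

// Bind a prefix to the namespace and use it in the query.
$xpath->registerNamespace('x', $xhtml);
$prefixed = $xpath->query('//x:p')->length;

echo $plain, "\n";    // 0
echo $prefixed, "\n"; // 1
```

The prefix x is arbitrary. Only the namespace URI passed to registerNamespace() has to match.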

Wrong text mode for title element?

Hello,

Thanks for the nice library. I installed and tried this HTML5 lib.

I found that the handling of entity references in the title element is wrong, like this:

<?php
// entityref-in-title.php
require_once 'vendor/autoload.php';

$html = <<<EOH
<!doctype html>
<title>&#x27;</title>
<p>&#x27;</p>
EOH;

echo \HTML5::loadHTML($html)->saveHTML();
$ php -v
PHP 5.6.0beta1 (cli) (built: Apr 17 2014 15:46:38)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.6.0-dev, Copyright (c) 1998-2014 Zend Technologies
$ php ./entityref-in-title.php
<!DOCTYPE html>
<html><title>&amp;#x27;</title>
<p>'</p></html>

In the example above, &#x27; should be decoded to ' (an apostrophe), but it is not.

If I set text mode for title element to 81, the entity ref is decoded properly:

<?php
require_once 'vendor/autoload.php';

$html = <<<EOH
<!doctype html>
<title>&#x27;</title>
<p>&#x27;</p>
EOH;

\HTML5\Elements::$html5['title'] = 81;
echo \HTML5::loadHTML($html)->saveHTML();
$ php ./entityref-in-title.php
<!DOCTYPE html>
<html><title>'</title>
<p>'</p></html>

I intended to send a pull request, but I didn't because I don't know why \HTML5\Elements::$html5['title'] was set to 5.

Could you look into this?

Non-inline elements being moved outside of inline containers automatically

In the current Drupal 8 test coverage, which still uses PHP's DomDocument (and hence makes assertions based on a XHTML POV), we have the following two assertions:

    $f = Html::normalize('<p>line1<br/><hr/>line2</p>');
    $this->assertEqual($f, '<p>line1<br></p><hr>line2', 'HTML corrector -- Move non-inline elements outside of inline containers.');

    $f = Html::normalize('<p>line1<div>line2</div></p>');
    $this->assertEqual($f, '<p>line1</p><div>line2</div>', 'HTML corrector -- Move non-inline elements outside of inline containers.');

The second still works with HTML5. The first doesn't.

Instead of moving the <hr> outside of the <p>, it keeps it inside:

<p>line1<br><hr>line2</p>

Looking at \MasterMinds\HTML5\Elements, I see:

        "hr" => 73, // NORMAL | VOID_TAG | BLOCK_TAG

So it's definitely marked as a block-level element. Which makes me suspect that HTML5 simply doesn't do this kind of clean-up, and that it's merely by accident (as a side-effect of some other parsing aspect) that the second test case is handled correctly.

Which makes me wonder if this is behavior only required for XHTML parsers and not HTML5 parsers?
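
The value 73 is a bitmask. The sketch below shows how such a mask can be decoded. The individual flag values are assumptions chosen so that NORMAL | VOID_TAG | BLOCK_TAG equals 73, which is the only thing the comment above confirms:

```php
<?php
// Assumed flag values: the report only confirms that the three flags sum to 73.
const NORMAL    = 1;
const VOID_TAG  = 8;
const BLOCK_TAG = 64;

// Return the names of the flags that are set in a mask.
function describeFlags(int $mask): array
{
    $flags = ['NORMAL' => NORMAL, 'VOID_TAG' => VOID_TAG, 'BLOCK_TAG' => BLOCK_TAG];
    $set = [];
    foreach ($flags as $name => $bit) {
        if (($mask & $bit) === $bit) {
            $set[] = $name;
        }
    }
    return $set;
}

echo implode(' | ', describeFlags(73)), "\n"; // NORMAL | VOID_TAG | BLOCK_TAG
var_dump((73 & BLOCK_TAG) !== 0);             // bool(true)
```

A flag being set only records a property of the element. Whether the tree builder acts on it is a separate question.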

Very long-running

Source:
image
Output:
image

$html contains doctype HTML 4.01, total size 37 KB

loadHTML skips whitespace after <html> tag

$in = "<!DOCTYPE html>
<html>
  <head>
    <title>My Webpage</title>
  </head>
  <body>foo</body>
</html>";
$dom = \HTML5::loadHTML($in);
$out = \HTML5::saveHTML($dom);

($out == $in); // false < should be true

The value of $out is:

<!DOCTYPE html>
<html><head>
    <title>My Webpage</title>
  </head>
  <body>foo</body>
</html>

The whitespace between <html> and <head> has been removed.

Convert to raw text without tags?

Years ago, I was using simple_html_dom to read in an HTML file and convert it to raw text for indexing in Apache Solr.

I'm in the process of converting that code to your library. Is there a similar mechanism? If not, how do you recommend adding this functionality? Do I add a class that implements the RulesInterface?
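
Since loadHTML() returns a DOMDocument, one option is to read the textContent property of the tree. A minimal sketch using only built-in DOM classes follows. documentToText is a hypothetical helper, and the sample document is built with loadXML() to keep the example independent of the library:

```php
<?php
// Hypothetical helper: flatten a DOM tree to plain text for indexing.
function documentToText(DOMDocument $dom): string
{
    // textContent concatenates the text of all descendant nodes.
    $text = $dom->documentElement->textContent;
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

$dom = new DOMDocument();
$dom->loadXML("<html><body>\n<h1>Hello World</h1>\n<p>This is a <b>test</b>.</p>\n</body></html>");

echo documentToText($dom), "\n"; // Hello World This is a test.
```

Note that textContent also includes the contents of script and style elements, so those may need to be removed from the tree first.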

Tag names with strange capitalisation

The parser fails when it encounters a tag name with strange capitalisation (e.g. <Title>, <titlE>, etc). For example, this script

<?php
require_once __DIR__ . "/vendor/autoload.php";
$html = <<< 'HERE'
<!doctype html>
<html>
<head>
        <Title>Hello, world!</Title>
</head>
<body></body>
</html>
HERE;
$parser = new Masterminds\HTML5;
$dom = $parser->loadHTML( $html );

echo "== HTML5 rendering ==\n";
echo $parser->saveHTML( $dom );

echo "== XPath queries ==\n";
$xpath = new DOMXPath( $dom );
$xpath->registerNamespace( "x", "http://www.w3.org/1999/xhtml" );
echo "=== Value of <title> ===\n";
echo $xpath->query( "//x:title" )->item( 0 )->nodeValue;

outputs:

== HTML5 rendering ==
<!DOCTYPE html>
<html><head>
    <title>Hello, world!&lt;/Title&gt;
&lt;/head&gt;
&lt;body&gt;&lt;/body&gt;
&lt;/html&gt;</title></head></html>
== XPath queries ==
=== Value of <title> ===
Hello, world!</Title>
</head>
<body></body>

The HTML supplied is valid.

Validate a HTML5 string

I'm trying to use your library to validate a HTML5 string. DOMDocument::validate() is the method I would be using.

$parser = new \HTML5;
$dom = $parser->loadHTML("<html><head><title>Herro</title></head><body></body></html>");
var_dump($dom->validate());

I get the following error:

Warning: DOMDocument::validate(): No declaration for element html 

I presume this is something to do with requiring a dtd schema, although I presumed that (as your library is specific to HTML5), this would be handled. Can you tell me if it's possible to use your library to achieve what I require and if so, how? Thanks.

Traverser error when saving an empty string previously loaded

Hello,

After install, running an extreme test with an empty input string leads to the following error:

Notice: Trying to get property of non-object in [...]\vendor\masterminds\html5\src\HTML5\Serializer\Traverser.php on line 96
Call Stack:
    0.0020     127680   1. {main}() [...]\test.php:0
    0.0550     873336   2. Masterminds\HTML5->saveHTML() [...]\test.php:15
    0.0550     874376   3. Masterminds\HTML5->save() [...]\vendor\masterminds\html5\src\HTML5.php:238
    0.0620     965688   4. Masterminds\HTML5\Serializer\Traverser->walk() [...]\vendor\masterminds\html5\src\HTML5.php:215
    0.0620     965736   5. Masterminds\HTML5\Serializer\OutputRules->document() [...]\vendor\masterminds\html5\src\HTML5\Serializer\Traverser.php:68
    0.0620     966000   6. Masterminds\HTML5\Serializer\Traverser->node() [...]\vendor\masterminds\html5\src\HTML5\Serializer\OutputRules.php:118
<!-- Skipped --><!DOCTYPE html>

Here's the script executed with a PHP 5.5 CLI interpreter:

<?php
// Assuming you installed from Composer:
require __DIR__ . "/vendor/autoload.php";
use Masterminds\HTML5;

// An example HTML document:
$html = '';

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
echo $html5->saveHTML($dom);

To me, the library should throw an exception when it cannot deal with the input data. What happens if I pass an object, integer, or resource to loadHTML?
Thanks for your help!
Vincent
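
A caller can guard against such input before parsing. The following is a minimal sketch of that idea. assertParsableHtml is a hypothetical helper, not something the library provides:

```php
<?php
// Hypothetical guard: reject input that a parser cannot work with.
function assertParsableHtml($input): string
{
    if (!is_string($input)) {
        throw new InvalidArgumentException('Expected a string, got ' . gettype($input));
    }
    if (trim($input) === '') {
        throw new InvalidArgumentException('Expected a non-empty HTML string');
    }
    return $input;
}

foreach (['<p>ok</p>', '', 42] as $candidate) {
    try {
        assertParsableHtml($candidate);
        echo "accepted\n";
    } catch (InvalidArgumentException $e) {
        echo 'rejected: ', $e->getMessage(), "\n";
    }
}
```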

No inner DOMText if getElementById is used

Given:
Test
EOD;
$html5 = \HTML5::loadHTML($string);
print \HTML5::saveHTML($html5->getElementById('test'));
?>

Then:
$html5->getElementById('test') is empty.

(Do I use it wrong?)

Couldn't fetch DOMText. Node no longer exists

$html5 = \HTML5::loadHTML($string);
$newelem = new \DOMText('Test2');
$oldnode = $html5->getElementById('test');
$newnode = $oldnode->cloneNode()->appendChild($newelem); // <<<<
$parent = $oldnode->parentNode;
$parent->replaceChild($newnode, $oldnode);
print \HTML5::saveHTML($newnode);

Error:

Warning: Couldn't fetch DOMText. Node no longer exists in C:\...\HTML5\Serializer\Traverser.php on line 93

Notice: Undefined property: DOMText::$nodeType in C:\...\HTML5\Serializer\Traverser.php on line 93
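
Note that appendChild() returns the appended child, so in the snippet above $newnode holds the text node rather than the cloned element. A sketch of one way to restructure it with built-in DOM classes, creating the text node through the owning document (the sample markup and variable names are illustrative):

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<body><div id="test">Test</div></body>');
$oldNode = $doc->getElementsByTagName('div')->item(0);

// Shallow clone: copies the element and its attributes, not its children.
$newNode = $oldNode->cloneNode();
// Create the text node through the document that will own it.
$newNode->appendChild($doc->createTextNode('Test2'));

// Swap the clone in for the original element.
$oldNode->parentNode->replaceChild($newNode, $oldNode);

echo $doc->saveXML($newNode), "\n"; // <div id="test">Test2</div>
```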

Exception is thrown for wrong tag names

Hi,
I noticed some issues with pages that contain invalid tag names. I don't know how to deal with this, so maybe you can find a solution. Below is the list of pages with the tag names that are invalid. The exception DOMException#5: Invalid Character Error is always thrown at DOMTreeBuilder.php:227 by DOMDocument::createElement. Any solution other than throwing the exception is fine for me :) I can make the PR if you tell me what the proper fix is.




Why is <li> not a BLOCK_TAG?

According to the spec, li can contain any "flow content", i.e. practically anything. Why is it not categorized as a BLOCK_TAG?

Does this HTML5 parser support xml namespace attributes?

Hi

I have the following custom HTML5 file that I want to load and parse in PHP.

<html xmlns:tpl="http://cphwebsolutions.dk/2013/tpl">
    <head>
        <title tpl:replace="page.title">Welcome</title>
        <link href="css/styles.css" type="text/css" rel="stylesheet"/>
    </head>
    <body>
        <h1 id="test" tpl:replace="page.headline">Welcome!</h1>
        <p tpl:replace="page.content"></p>
        <audio><source src="test.txt"/></audio>
    </body>
</html>

I want to be able to use XPath to search for different attributes in the custom HTML5 file e.g. $xpath->query("//*[@tpl:replace]");

But it does not find anything. When I use DOMDocument::loadHTML in PHP instead, it finds all three places where tpl:replace is present in the custom HTML5 document.

I can search for id attribute for example $xpath->query("//*[@id]"); and it works correctly. So my only problem is when I use the custom XML namespace in my HTML5 templates.

Br.
Rune Christensen

<?php
// Assuming you installed from Composer:
require "../vendor/autoload.php";


// An example HTML document:
$html = <<< 'HERE'
<!DOCTYPE html>
<html xmlns:tpl="http://cphwebsolutions.dk/2013/tpl">
    <head>
        <title tpl:replace="page.title">Welcome</title>
        <link href="css/styles.css" type="text/css" rel="stylesheet"/>
    </head>
    <body>
        <h1 id="test" tpl:replace="page.headline">Welcome!</h1>
        <p tpl:replace="page.content"></p>
        <audio><source src="test.txt"/></audio>
    </body>
</html>
HERE;

// Parse the document. $dom is a DOMDocument.
$dom = HTML5::loadHTML($html);

$xpath = new DOMXpath($dom);
$xpath->registerNamespace("tpl", "http://cphwebsolutions.dk/2013/tpl");

// example 1: for everything with an id
//$elements = $xpath->query("//*[@id]");

// example 2: for node data in a selected id
//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");

// example 3: same as above with wildcard
$elements = $xpath->query("//*[@tpl:replace]");

if (!is_null($elements)) {
  foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "] ";
    echo "<br/> ". $element->attributes->length. "\n";

    for ($i=0; $i<$element->attributes->length;$i++) {
        echo " ".$element->attributes->item($i)->nodeName."<br/>\n";
    }

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeName. "\n";
    }
  }
}

echo "<br/><br/>\n";

// Render it as HTML5:
print HTML5::saveHTML($dom);

// Or save it to a file:
HTML5::save($dom, 'out.html');

Using an existing DOMDocument or subclass as target for parsed data

_PHPPowertools/DOM-Query_ is the first component of the _PHPPowertools_ framework that has been released to the public. Its purpose is similar to that of _technosophos/querypath_ but its implementation is far more true to both jQuery's syntax and its semantics. For example, _PHPPowertools/DOM-Query_ lets you do things like this:

// Add a span tag with classes 'icon' and 'icon-printer' to all buttons
$H->select('body')->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the data-val attribute of all gallery images
$H->select('.gallery li img')->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

What's lacking so far, is proper support for HTML5. I've been considering using _Masterminds/html5-php_ to do the DOM parsing.

The most elegant way to implement the feature, would be by adding a target option to the supported options for \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct with support for following datatypes :

  • \DOMDocument or subclasses of \DomDocument
  • \DOMImplementation or subclasses of \DOMImplementation

I would like to use this feature as follows :

namespace PowerTools;

use \Symfony\Component\CssSelector\CssSelector as CssSelector;
use \Masterminds\HTML5 as HTML5;

class DOM_Document extends \DOMDocument {

    protected $_isHTML = false;

    public function __construct($data = false, $version = null, $encoding = null) {
        parent::__construct($version, $encoding);
        $data = trim($data);
        if ($data && $data != '') {
            if ($this->_isHTML) {
                $html5 = new HTML5();
                @$html5->loadHTML($data, array('target' => $this));
            } else {
                @$this->loadXML($data);
            }
        }
    }

   [ ... ]
}

I've tried adding a simple if(){}else{} statement to \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct to replace $this->doc with $options['target'] if a value for $options['target'] has been set, but that doesn't seem to do it.

As an alternative, I've also considered reïmplementing \PowerTools\DOM_Document as a subclass of \DOMImplementation, but this is a far less elegant approach that introduces too many new issues to go any further in that area.

Any feedback would be appreciated!

See also PHPPowertools/DOM-Query#1

No line break after <html>

Hi

I tried the example from the README file and the result was that the line break after the <html> tag was removed:

Input:

  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>

Output:

<html><head>
    <title>TEST</title>
  </head>
  <body id="foo">
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>

It looks like it was the only line where input and output were different.

Br.
Rune Christensen

Expose debug mode and use PSR logger

In a few places there is a debug mode that prints to standard out. We can expose this in the Html5 and use the PSR logger interface (still printing to standard out by default).

Invalid Character Error

Hi,
when I'm trying to parse URL http://e107.funsite.cz/ I get DOMException("Invalid Character Error", 5) because of one unclosed tag in the markup. The snippet below causes the exception. It is caused by trying to set attribute with name <div in DOMTreeBuilder.php. As I understand from the doc all errors should be recorded in property $dom->errors. Can you fix this please?

<div class="wrapper"
                <div class="fleft">
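
One way to avoid aborting the parse is to catch the DOMException and record it. The sketch below shows the idea with built-in DOM classes. The $errors array and the attribute list are illustrative assumptions, not the library's internal structures:

```php
<?php
$doc = new DOMDocument();
$div = $doc->createElement('div');
$errors = [];

// Attribute names as a tokenizer might report them for the broken markup.
$attributes = ['class' => 'wrapper', '<div' => ''];

foreach ($attributes as $name => $value) {
    try {
        $div->setAttribute($name, $value);
    } catch (DOMException $e) {
        // Record the problem instead of aborting the whole parse.
        $errors[] = "Invalid attribute name '$name': " . $e->getMessage();
    }
}

echo $doc->saveXML($div), "\n";
print_r($errors);
```

The valid attribute is kept, and the invalid one ends up in the error list.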

Tokenizer quotedString function failure

Hi

I was trying to use the function quotedString from Tokenizer.php but it failed and I changed line 717 from:

if ($tok == '"' || "'") {

to:

if ($tok == '"' || $tok == "'") {

And now it works correctly in my PHP script.
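
The original condition is always true because the right-hand operand, the string "'", is non-empty and therefore truthy. A small sketch that contrasts the two forms (the function names are hypothetical):

```php
<?php
// Buggy form: "'" is a non-empty string, so the right-hand side is always true.
function isQuoteBuggy(string $tok): bool
{
    return $tok == '"' || "'";
}

// Fixed form: compare the token against both quote characters.
function isQuoteFixed(string $tok): bool
{
    return $tok == '"' || $tok == "'";
}

var_dump(isQuoteBuggy('a')); // bool(true), which is wrong
var_dump(isQuoteFixed('a')); // bool(false)
var_dump(isQuoteFixed("'")); // bool(true)
```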

Br.
Rune

ProcessorInstructions not working

Hi

Was trying to use the processor instruction functionality but got a fatal error

Fatal error: Call to a member function process() on a non-object in /home/www/cphwebsolutions.dk/cms/vendor/HTML5/Parser/DOMTreeBuilder.php on line 364

It looks like the error is on line 364:

$res = $processor->process($this->current, $name, $data);

I think that it should be changed to

$res = $this->processor->process($this->current, $name, $data);

Br.
Rune

DOMElement::setIdAttribute(): ID loading already defined

Hello,

Sometimes I get a strange error:

PHP Warning:  DOMElement::setIdAttribute(): ID loading already defined in /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 392
PHP Stack trace:
PHP   1. {main}() /home/xxx/domains/xxx/public_html/index.php:0
PHP   2. Core\xxx->loadPage() /home/xxx/domains/xxx/public_html/index.php:147
PHP   3. Modules\xxx\Main->load() /home/xxx/domains/xxx/public_html/xxx/core/xxx.php:619
PHP   4. Modules\xxx\Main->_startEngine() /home/xxx/domains/xxx/public_html/xxx/modules/xxx/main.php:301
PHP   5. Modules\xxx\Main->_runCrawler() /home/xxx/domains/xxx/public_html/xxx/modules/xxx/main.php:121
PHP   6. Core\Classes\Search\Crawler->visitPage() /home/xxx/domains/xxx/public_html/xxx/modules/xxx/main.php:154
PHP   7. Masterminds\HTML5->loadHTML() /home/xxx/domains/xxx/public_html/xxx/core/classes/search/crawler.php:28
PHP   8. Masterminds\HTML5->parse() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5.php:94
PHP   9. Masterminds\HTML5\Parser\Tokenizer->parse() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5.php:165
PHP  10. Masterminds\HTML5\Parser\Tokenizer->consumeData() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:83
PHP  11. Masterminds\HTML5\Parser\Tokenizer->tagOpen() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:126
PHP  12. Masterminds\HTML5\Parser\Tokenizer->tagName() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:269
PHP  13. Masterminds\HTML5\Parser\DOMTreeBuilder->startTag() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:371
PHP  14. DOMElement->setIdAttribute() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/DOMTreeBuilder.php:392
PHP Warning:  DOMElement::setIdAttribute(): ID placeholder already defined in /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 392

I don't know what the input is, but I hope you can help me.
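
The warnings suggest that the input contains duplicate id values ("loading" and "placeholder"). That is an assumption, since the input is unknown. Under it, one possible approach is to register only the first element that carries a given id, as in this sketch with built-in DOM classes (the $seen array is illustrative):

```php
<?php
$doc = new DOMDocument();
$root = $doc->appendChild($doc->createElement('div'));
$seen = [];

foreach (['loading', 'loading', 'placeholder'] as $id) {
    $el = $root->appendChild($doc->createElement('span'));
    $el->setAttribute('id', $id);

    // Register the attribute as an ID only the first time a value appears.
    if (!isset($seen[$id])) {
        $el->setIdAttribute('id', true);
        $seen[$id] = true;
    }
}

// getElementById() resolves to the first element that carried the id.
var_dump($doc->getElementById('loading')->isSameNode($root->firstChild));
```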