masterminds / html5-php


1.5K 49.0 112.0 2.47 MB

An HTML5 parser and serializer for PHP.

Home Page: http://masterminds.github.io/html5-php/

License: Other

PHP 49.54% HTML 50.46%

xml-namespaces php html5-parser dom domdocument html5-php html5-document html5lib

html5-php's People

Contributors

Stargazers

Watchers

Forkers

paladin gobb kublaj stormwild cognifloyd jasonll shannah iraziud kitaitimakoto web5design kalyse yanguanglan itaylor yfix sasezaki cs1000 viniciusferreira matiasnamendola zhaofengli thomasweinert mikkorantalainen goraneza ddmitry ggeorgaras westie loryhuang dikshadeo webmechanicx tighten securecloud-biz eric-seekas alexpott oswaldderiemaecker onlyone0001 jemmy655 jslegers layershifter magicdice benayoun mundschenk-at sidkshatriya speksforks marczhermo downsider dochne siwinski voku shtse8 rodrigo-speller michaelroosz laravel24 soitun huskyalvarez apeschar iglue-repo jdubreville webconsol idhamperdameian y3z3ki3l warezaddict-com ndekere javiereguiluz tgalopin stof the-cc-dev dulanhewage genjiluo jaespinol amawada remicollet xabbuh signpostmarv pedrohrfaria open-source-contributions alecpl bigbangcom03 sshyran ben7magare highscoresl bytestream imsop lyrixx israel-nogueira faiz-9 erreur32 summercms somkhane derrabus a1812 standardgalactic ajunlonglive kaznovac hatsikidee tryweirdier ohader rosk0 sakarikl rakhithjk withinboredom mar4ehk0

html5-php's Issues

Some Tag attributes are case sensitive

Some tag attributes are case sensitive. This happens when something like svg is embedded. So, not all attribute names should be converted to lowercase.

For ref on SVG see http://www.w3.org/Graphics/SVG/WG/wiki/SVG_in_HTML5

Problem with loadHTMLFragment()

Hello,

I just started using your HTML5 parser and I'm trying to load a fragment.

Using the code shown in the wiki, there is no problem:

require "vendor/autoload.php";
use Masterminds\HTML5;
$html5 = new HTML5();

// An example HTML fragment:
$fragment = "<p>This is a test of the HTML5 parser.<p>";
$dom = $html5->loadHTMLFragment($fragment);

But when I try to parse other tags, it doesn't work.
For example, this code:

require "vendor/autoload.php";
use Masterminds\HTML5;
$html5 = new HTML5();

// An example HTML fragment:
$fragment = "<td>This is a test of the HTML5 parser.<td>";
$dom = $html5->loadHTMLFragment($fragment);

error shown:

Notice: Undefined property: DOMDocumentFragment::$tagName in C:\xampp\htdocs\html5\vendor\masterminds\html5\src\HTML5\Parser\TreeBuildingRules.php on line 138

stacktrace:

	Function	Location
1	{main}( )	..\index.php:0
2	Masterminds\HTML5->loadHTMLFragment( )	..\index.php:21
3	Masterminds\HTML5->parseFragment( )	..\HTML5.php:128
4	Masterminds\HTML5\Parser\Tokenizer->parse( )	..\HTML5.php:181
5	Masterminds\HTML5\Parser\Tokenizer->consumeData( )	..\Tokenizer.php:83
6	Masterminds\HTML5\Parser\Tokenizer->tagOpen( )	..\Tokenizer.php:126
7	Masterminds\HTML5\Parser\Tokenizer->tagName( )	..\Tokenizer.php:269
8	Masterminds\HTML5\Parser\DOMTreeBuilder->startTag( )	..\Tokenizer.php:371
9	Masterminds\HTML5\Parser\TreeBuildingRules->evaluate( )	..\DOMTreeBuilder.php:398
10	Masterminds\HTML5\Parser\TreeBuildingRules->closeIfCurrentMatches( )	..\TreeBuildingRules.php:90

Parse error

Hi, I encounter some problems when parsing HTML which contains certain elements:

HTML:

<!DOCTYPE html>
<html>
<body>
<div>
<table style="width: 520px; height: 361px;" border="1px solid">
<tbody>
        <tr>
                <td>a</td>
                <td>b</td>
                <td>c</td>
                <td>d</td>
                <td>d</td>
                <td>f</td>
        </tr>
</tbody>
</table>
</div>
</body>
</html>

PHP:

require_once(__DIR__ . "/vendor/autoload.php");

$html = file_get_contents("1.html");
$dom = HTML5::loadHTML($html); //DOMDocument
echo  HTML5::saveHTML($dom);

What I get is a wrong result:

<html><body>
<div>
<table style="width: 520px; height: 361px;" border="1px solid"></table>
<tbody></tbody>
        <tr></tr>
                <td></td>a
                <td></td>b
                <td></td>c
                <td></td>d
                <td></td>d
                <td></td>f



</div>
</body>

</html>

It works well using DOMDocument::loadHTML parsing the same test html file.

PHP Documentor

Is phpdocumentor/phpdocumentor in composer.json really needed?

It requires a lot of dependencies to be downloaded in the build phase...

PHP Strict error messages

PHP Strict Standards:  Non-static method DOMImplementation::createDocumentType() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 60

Strict Standards: Non-static method DOMImplementation::createDocumentType() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 60
PHP Strict Standards:  Non-static method DOMImplementation::createDocument() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 62

Strict Standards: Non-static method DOMImplementation::createDocument() should not be called statically, assuming $this from incompatible context in /Users/mfarina/Code/HTML5-PHP/src/HTML5/Parser/DOMTreeBuilder.php on line 62

I've been getting these errors when running example.php.

The input "0" is not handled correctly

$html5 = new HTML5();
$doc = $html5->loadHTML( "<!DOCTYPE html>\n<html><head><title></title></head><body><p>0</p><p>1</p></body></html>" );
echo $html5->saveHTML( $doc );

Result:

<!DOCTYPE html>
<html><head><title></title></head><body><p></p><p>01</p></body></html>

I'm using html5-php 2.0.0 and PHP 5.3.3-7+squeeze15.

PSR-1/PSR-2

What about changing the coding standard (with the upcoming 2.0 release)?

Add support for parsing and serializing document fragments

The require-dev elements listed in the composer.json file are installed by default. That means every place someone installed this library they will also be installing phpunit and the symfony yaml parser in the vendor directory.

@technosophos should we pull or leave require-dev for phpunit?

CDATA sections parsed as comments, not CDATA.

The HTML5 spec supports CDATA sections, but the parser converts CDATA (incorrectly) into a comment section:

1) HTML5\Tests\SerializerTest::testCDATA
Failed asserting that '<!DOCTYPE html>
<html><head></head><body>a<!--[CDATA[ This <is--> a test. ]]&gt;b</body></html>
' matches PCRE pattern "|<![CDATA[ This <is> a test. ]]>|".

/Users/mattbutcher/Code/HP/HTML5-PHP/test/HTML5/SerializerTest.php:115

Error with parsing when HTML tags uppercased

Hi,
I discovered some weird behavior at this page http://rayer.g6.cz/. I also pasted source HTML here http://pastebin.com/FQjSEGCK .

Everything after the start of the text in html > head > title is escaped (even the </TITLE> tag). I found out that if I apply strtolower, like this: \HTML5::loadHTML(strtolower($html)), the HTML is parsed correctly. Can you look at this please?

Thank you for your work - I can parse HTML also in PHP finally :)

Elements with dashes (Web Components support)

Over at https://www.drupal.org/node/1333730, we're working on pulling in masterminds/html5 via Composer into Drupal 8 core. But we're running into a problem: this library doesn't seem to support elements with dashes.

Elements with dashes are necessary for Web Components support (http://w3c.github.io/webcomponents/spec/custom/). However, technically, the Web Components spec is non-normative (http://www.w3.org/TR/html5/references.html#references), so it's not necessary — strictly speaking.

That being said, I think most people would argue Web Components are clearly going to become an important aspect of web development in the not-too-distant future. Hence we want to make sure Drupal 8 doesn't break them, and hence it'd be necessary for this library not to break them if Drupal 8 wants to use it.

Would you be willing to add support for Web Components, and hence elements with dashes?

XML Namespaces in HTML5 documents not always working

Sporadically I have seen XML namespace errors in the parser, or cases where the parser ignores an XML namespace declaration.

Better error handling (HHVM compatibility)

Hi!
I suggest changing the error handling in DOMTreeBuilder (https://github.com/Masterminds/html5-php/blob/master/src/HTML5/Parser/DOMTreeBuilder.php#L79).

Injecting an errors property into DOMDocument is not very clean and won't always work, especially on HHVM.

IE conditional tags being stripped

This appears to be an issue only before the <html> tag. This:

<!--[if lte IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]> <!--><html class="no-js" lang="en"><!--<![endif]-->

is being turned into this:

<html class="no-js" lang="en"><!--<![endif]-->

Unescaped Output

The serializer currently is not encoding data properly for output. This enables certain documents to be crafted which can expose XSS vulnerabilities. For example, the cdata serializer just outputs the text directly. When crafted with a malicious payload, this results in an attack vector:

$html = "<!DOCTYPE html>
<html>
 <head>
  <title>TEST</title>
 </head>
 <body id='foo'>
  <h1>Hello World</h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>";

// Parse the document. $dom is a DOMDocument.
$dom = \HTML5::loadHTML($html);

$els = $dom->getElementsByTagName('h1');
$els->item(0)->appendChild(new DomCDataSection('this ]]><script>alert(hi!);</script><![CDATA[ is injected'));

var_dump(\HTML5::saveHTML($dom));

This will output:

<!DOCTYPE html>
<html><head>
  <title>TEST</title>
 </head>
 <body id="foo">
  <h1>Hello World<![CDATA[this ]]><script>alert(hi!);</script><![CDATA[ is     injected]]></h1>
  <p>This is a test of the HTML5 parser.</p>
 </body>
</html>

Which is obviously bad.

Comments and raw text fields both suffer a similar problem as well (from a quick glance).

About namespaces...

Reading http://www.w3.org/TR/html51/syntax.html#the-before-html-insertion-mode

Especially:
A start tag whose tag name is "html"
Create an element for the token in the HTML namespace, with the Document as the intended parent. Append it to the Document object. Put this element in the stack of open elements.

HTML namespace should be: http://www.w3.org/1999/xhtml

Does this imply that line https://github.com/Masterminds/html5-php/blob/master/src/HTML5/Parser/DOMTreeBuilder.php#L227 should become:

$ele = $this->doc->createElement($lname, $htmlNs);
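One caveat about the proposed line, shown as a minimal sketch that uses only PHP's built-in DOM extension and not html5-php's internals: DOMDocument::createElement() takes the element's text value as its second argument, whereas DOMDocument::createElementNS() takes the namespace URI first. The variable names below are illustrative only.

```php
<?php
// Sketch only: built-in DOM extension, not html5-php's DOMTreeBuilder.
$htmlNs = 'http://www.w3.org/1999/xhtml';
$doc = new DOMDocument();

// createElement(name, value): the second argument becomes text content.
$plain = $doc->createElement('html', $htmlNs);

// createElementNS(namespace, qualifiedName): the element is namespaced.
$namespaced = $doc->createElementNS($htmlNs, 'html');

var_dump($plain->namespaceURI);      // no namespace is set
var_dump($plain->textContent);       // the URI ended up as text
var_dump($namespaced->namespaceURI); // the XHTML namespace URI
```

So a namespace-aware version of the change would probably need createElementNS($htmlNs, $lname) rather than createElement($lname, $htmlNs).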

Whitespace in the end tag of RCDATA tags

The parser gets confused if you add whitespace inside </title>, like:

<title>Note the space after "title"</title >
<title>Another example</title
>

Both examples above are valid according to the W3C Validator.

This behaviour is caused by Tokenizer.php which assumes the end tag is always exactly </title>.
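As a rough illustration of a more tolerant match (a sketch, not the Tokenizer's actual code), an end tag can be located with a case-insensitive pattern that allows optional whitespace before the closing ">". The helper name findEndTag is hypothetical.

```php
<?php
// Hypothetical helper: returns the byte offset of the end tag for $name,
// or null if none is found. Not part of html5-php.
function findEndTag($html, $name)
{
    // Allow tab, LF, FF, CR or space between the tag name and ">",
    // and ignore the case of the tag name.
    $pattern = '#</' . preg_quote($name, '#') . '[\t\n\f\r ]*>#i';
    if (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1];
    }
    return null;
}

var_dump(findEndTag('<title>x</title >', 'title'));      // int(8)
var_dump(findEndTag("<title>a</title\n>", 'title'));     // int(8)
var_dump(findEndTag('<Title>Hello</Title>', 'title'));   // int(12)
var_dump(findEndTag('<title>never closed', 'title'));    // NULL
```

This covers only the whitespace and letter-case variations; a complete tokenizer has more cases to handle.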

Test script

<?php
require_once __DIR__ . "/vendor/autoload.php";
$html = <<<EOF
<!doctype html>
<html>
<head>
    <title>This is valid, really.</title >
</head>
<body></body>
</html>
EOF;
$parser = new Masterminds\HTML5;
$dom = $parser->loadHTML( $html );
echo $parser->saveHTML( $dom );

Output

<!DOCTYPE html>
<html><head>
    <title>This is valid, really.&lt;/title &gt;
&lt;/head&gt;
&lt;body&gt;&lt;/body&gt;
&lt;/html&gt;</title></head></html>

Real-world examples

http://www.infoplease.com/ipa/A0855613.html

Doc returned by loadHTML doesn't work with DOMXPath

$html5 = new HTML5();
$htmlStr = <<<HERE
  <!DOCTYPE html>
  <html>
  <head>
    <title></title>
  </head>
  <body>
    <p>Testing</p>
  </body>
HERE;
$doc = $html5->loadHTML( $htmlStr );

$xPath = new DOMXPath( $doc );
echo $xPath->query( '//p' )->length; // "0" in 2.0.0; "1" in 1.0.3

PHP 5.3.3.
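A possible explanation, assuming 2.0.0 places elements in the XHTML namespace (http://www.w3.org/1999/xhtml): in XPath 1.0 an unprefixed name such as //p matches only elements in no namespace. The sketch below reproduces that effect with the built-in DOM extension alone; the prefix "x" is an arbitrary choice.

```php
<?php
// Sketch only: builds a tiny namespaced document by hand instead of
// calling html5-php, to show the XPath behaviour in isolation.
$ns = 'http://www.w3.org/1999/xhtml';
$doc = new DOMDocument();
$html = $doc->appendChild($doc->createElementNS($ns, 'html'));
$body = $html->appendChild($doc->createElementNS($ns, 'body'));
$body->appendChild($doc->createElementNS($ns, 'p', 'Testing'));

$xpath = new DOMXPath($doc);
$plain = $xpath->query('//p')->length;      // unprefixed: no match

$xpath->registerNamespace('x', $ns);
$prefixed = $xpath->query('//x:p')->length; // prefixed: matches

echo $plain, ' ', $prefixed, "\n"; // prints "0 1"
```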

Wrong text mode for title element?

Hello,

Thanks for the nice library. I installed and tried this HTML5 lib.

I found that entity references in the title element are handled incorrectly, like this:

<?php
// entityref-in-title.php
require_once 'vendor/autoload.php';

$html = <<<EOH
<!doctype html>
<title>&#x27;</title>
<p>&#x27;</p>
EOH;

echo \HTML5::loadHTML($html)->saveHTML();

$ php -v
PHP 5.6.0beta1 (cli) (built: Apr 17 2014 15:46:38)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.6.0-dev, Copyright (c) 1998-2014 Zend Technologies
$ php ./entityref-in-title.php
<!DOCTYPE html>
<html><title>&amp;#x27;</title>
<p>'</p></html>

In the example above, &#x27; should be decoded as ' (an apostrophe), but it isn't.

If I set text mode for title element to 81, the entity ref is decoded properly:

<?php
require_once 'vendor/autoload.php';

$html = <<<EOH
<!doctype html>
<title>&#x27;</title>
<p>&#x27;</p>
EOH;

\HTML5\Elements::$html5['title'] = 81;
echo \HTML5::loadHTML($html)->saveHTML();

$ php ./entityref-in-title.php
<!DOCTYPE html>
<html><title>'</title>
<p>'</p></html>

I intended to send a pull request, but I didn't because I don't know why \HTML5\Elements::$html5['title'] is set to 5.

Could you look into this?

Non-inline elements being moved outside of inline containers automatically

In the current Drupal 8 test coverage, which still uses PHP's DOMDocument (and hence makes assertions based on an XHTML POV), we have the following two assertions:

    $f = Html::normalize('<p>line1<br/><hr/>line2</p>');
    $this->assertEqual($f, '<p>line1<br></p><hr>line2', 'HTML corrector -- Move non-inline elements outside of inline containers.');

    $f = Html::normalize('<p>line1<div>line2</div></p>');
    $this->assertEqual($f, '<p>line1</p><div>line2</div>', 'HTML corrector -- Move non-inline elements outside of inline containers.');

The second still works with HTML5. The first doesn't.

Instead of moving the <hr> outside of the <p>, it keeps it inside:

<p>line1<br><hr>line2</p>

Looking at \MasterMinds\HTML5\Elements, I see:

        "hr" => 73, // NORMAL | VOID_TAG | BLOCK_TAG

So it's definitely marked as a block-level element. Which makes me suspect that HTML5 simply doesn't do this kind of clean-up, and that it's merely by accident (as a side-effect of some other parsing aspect) that the second test case is handled correctly.

Which makes me wonder if this is behavior only required for XHTML parsers and not HTML5 parsers?
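For readers unfamiliar with the bitmask notation in the comment above, here is a small sketch of how a value like 73 decomposes into flags. The constant names and numeric values are assumptions chosen so that NORMAL | VOID_TAG | BLOCK_TAG equals 73; the library's real constants may be named or valued differently.

```php
<?php
// Assumed flag values (hypothetical): 1 + 8 + 64 = 73.
const FLAG_NORMAL = 1;
const FLAG_VOID_TAG = 8;
const FLAG_BLOCK_TAG = 64;

// Hypothetical helper: lists the known flags set in $mask.
function describeFlags($mask)
{
    $known = array(
        'NORMAL' => FLAG_NORMAL,
        'VOID_TAG' => FLAG_VOID_TAG,
        'BLOCK_TAG' => FLAG_BLOCK_TAG,
    );
    $names = array();
    foreach ($known as $name => $bit) {
        if (($mask & $bit) === $bit) {
            $names[] = $name;
            $mask &= ~$bit; // clear the bit we just reported
        }
    }
    if ($mask !== 0) {
        $names[] = 'unknown bits: ' . $mask;
    }
    return $names;
}

echo implode(' | ', describeFlags(73)), "\n"; // NORMAL | VOID_TAG | BLOCK_TAG
```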

Test suite failing on PHP 5.4

The codebase is now running through Travis CI. And, it shows some tests are failing in PHP 5.4. See https://travis-ci.org/Masterminds/html5-php/jobs/7584913 for more details.

Switch to Masterminds vendor namespace

Switch all classes from HTML5 namespace to Masterminds\HTML5 namespace

Why not using \SplFileObject or \php_stream_filter

Is there a good reason why we don't use \SplFileObject or \php_stream_filter?
It could probably make the parser less complex, perhaps starting from the next major update.

What are your thoughts?

characters following an ampersand are removed if it is not encoded prior to parsing

Noticed an issue on our site today.

We had the characters R&D in our HTML. Obviously this should be R&amp;D to be accurate; however, when html5-php parses this, it produces R&

I'm assuming this is because &D isn't an HTML entity, so it defaults to the &. I would have expected the output to be R&amp;D though.
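One possible workaround until the parser handles this, sketched under the assumption that pre-processing the input is acceptable: encode any ampersand that does not start a semicolon-terminated character reference. The function name encodeBareAmpersands is hypothetical, and legacy references written without a trailing semicolon (such as &copy) would also be encoded by this pattern.

```php
<?php
// Hypothetical pre-processing helper, not part of html5-php.
function encodeBareAmpersands($html)
{
    // Leave "&name;", "&#123;" and "&#x1F;" alone; encode every other "&".
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';
    return preg_replace($pattern, '&amp;', $html);
}

echo encodeBareAmpersands('R&D'), "\n";        // R&amp;D
echo encodeBareAmpersands('R&amp;D'), "\n";    // R&amp;D (unchanged)
echo encodeBareAmpersands('&#x27;'), "\n";     // &#x27; (unchanged)
```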

Very long-running

Source:

Output:

$html contains doctype HTML 4.01, total size 37 KB

loadHTML skips whitespace after <html> tag

$in = "<!DOCTYPE html>
<html>
  <head>
    <title>My Webpage</title>
  </head>
  <body>foo</body>
</html>";
$dom = \HTML5::loadHTML($in);
$out = \HTML5::saveHTML($dom);

($out == $in); // false < should be true

The value of $out is:

<!DOCTYPE html>
<html><head>
    <title>My Webpage</title>
  </head>
  <body>foo</body>
</html>

The whitespace between <html> and <head> has been removed.

MathML and SVG not tested.

Neither MathML nor SVG have been fully tested.

Convert to raw text without tags?

Years ago, I was using simple_html_dom to read in an HTML file and convert it to raw text for indexing in Apache Solr.

I'm in the process of converting that code to your library. Is there a similar mechanism? If not, how do you recommend adding this functionality? Do I add a class that implements the RulesInterface?
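Since the parser returns a standard DOMDocument, one option that needs no custom rules class is the built-in textContent property. The sketch below builds a small DOM by hand so it runs without the library; htmlToPlainText is a hypothetical helper name. Note that textContent also includes the contents of script and style elements, which an indexer would probably want to remove first.

```php
<?php
// Hypothetical helper: flattens a DOM node to whitespace-normalized text.
function htmlToPlainText(DOMNode $node)
{
    return trim(preg_replace('/\s+/', ' ', $node->textContent));
}

// Hand-built stand-in for a parsed document.
$doc = new DOMDocument();
$div = $doc->appendChild($doc->createElement('div'));
$div->appendChild($doc->createTextNode('Hello '));
$div->appendChild($doc->createElement('b', 'world'));
$div->appendChild($doc->createTextNode("\n   again "));

$text = htmlToPlainText($div);
echo $text, "\n"; // Hello world again
```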

Tag names with strange capitalisation

The parser fails when it encounters a tag name with strange capitalisation (e.g. <Title>, <titlE>, etc). For example, this script

<?php
require_once __DIR__ . "/vendor/autoload.php";
$html = <<< 'HERE'
<!doctype html>
<html>
<head>
        <Title>Hello, world!</Title>
</head>
<body></body>
</html>
HERE;
$parser = new Masterminds\HTML5;
$dom = $parser->loadHTML( $html );

echo "== HTML5 rendering ==\n";
echo $parser->saveHTML( $dom );

echo "== XPath queries ==\n";
$xpath = new DOMXPath( $dom );
$xpath->registerNamespace( "x", "http://www.w3.org/1999/xhtml" );
echo "=== Value of <title> ===\n";
echo $xpath->query( "//x:title" )->item( 0 )->nodeValue;

outputs:

== HTML5 rendering ==
<!DOCTYPE html>
<html><head>
    <title>Hello, world!&lt;/Title&gt;
&lt;/head&gt;
&lt;body&gt;&lt;/body&gt;
&lt;/html&gt;</title></head></html>
== XPath queries ==
=== Value of <title> ===
Hello, world!</Title>
</head>
<body></body>

The HTML supplied is valid.

Validate a HTML5 string

I'm trying to use your library to validate an HTML5 string. DOMDocument::validate() is the method I would be using.

$parser = new \HTML5;
$dom = $parser->loadHTML("<html><head><title>Herro</title></head><body></body></html>");
var_dump($dom->validate());

I get the following error:

Warning: DOMDocument::validate(): No declaration for element html

I presume this has something to do with requiring a DTD, although I assumed that (as your library is specific to HTML5) this would be handled. Can you tell me if it's possible to use your library to achieve what I need and, if so, how? Thanks.

Traverser error when saving an empty string previously loaded

Hello,

After install, running an extreme test with an empty input string leads to the following error:

Notice: Trying to get property of non-object in [...]\vendor\masterminds\html5\src\HTML5\Serializer\Traverser.php on line 96
Call Stack:
    0.0020     127680   1. {main}() [...]\test.php:0
    0.0550     873336   2. Masterminds\HTML5->saveHTML() [...]\test.php:15
    0.0550     874376   3. Masterminds\HTML5->save() [...]\vendor\masterminds\html5\src\HTML5.php:238
    0.0620     965688   4. Masterminds\HTML5\Serializer\Traverser->walk() [...]\vendor\masterminds\html5\src\HTML5.php:215
    0.0620     965736   5. Masterminds\HTML5\Serializer\OutputRules->document() [...]\vendor\masterminds\html5\src\HTML5\Serializer\Traverser.php:68
    0.0620     966000   6. Masterminds\HTML5\Serializer\Traverser->node() [...]\vendor\masterminds\html5\src\HTML5\Serializer\OutputRules.php:118
<!-- Skipped --><!DOCTYPE html>

Here's the script executed with a PHP 5.5 CLI interpreter:

<?php
// Assuming you installed from Composer:
require __DIR__ . "/vendor/autoload.php";
use Masterminds\HTML5;

// An example HTML document:
$html = '';

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
echo $html5->saveHTML($dom);

In my view, the library should throw an exception when it is not able to deal with the input data. What if I pass an object, integer, or resource to loadHTML? What will happen?
Thanks for your help!
Vincent
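Until the library does this itself, a caller-side guard is one way to get the behaviour asked for above. This is only a sketch: requireHtmlString is a hypothetical helper, and rejecting empty strings is a policy choice, not something the library prescribes.

```php
<?php
// Hypothetical guard to run before handing input to a parser.
function requireHtmlString($input)
{
    if (!is_string($input)) {
        throw new InvalidArgumentException(
            'Expected an HTML string, got ' . gettype($input)
        );
    }
    if (trim($input) === '') {
        throw new InvalidArgumentException('HTML input is empty');
    }
    return $input;
}

// Collect the outcome for a few sample inputs.
$results = array();
foreach (array('<p>ok</p>', '', 42) as $candidate) {
    try {
        requireHtmlString($candidate);
        $results[] = 'accepted';
    } catch (InvalidArgumentException $e) {
        $results[] = 'rejected';
    }
}
echo implode(', ', $results), "\n"; // accepted, rejected, rejected
```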

In-line style tag content is getting encoded

I have the following style tag in a page that I'm parsing w/ html5-php: http://www.diffchecker.com/k3opxnf5

As you can see, the CSS is getting broken by the parser because it's encoding characters in the CSS into HTML entities (">" for example).

Any idea how I can work around this?

Thanks.

Doctype is case sensitive and should not be

For reference: http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#the-doctype

The doctype declaration should be case insensitive. However, the parser currently treats it as case sensitive and only recognizes the uppercase form.

No inner DOMText if getElementById is used

Given:
Test EOD;
$html5 = \HTML5::loadHTML($string);
print \HTML5::saveHTML($html5->getElementById('test'));
?>

Then:
$html5->getElementById('test') is empty.

(Do I use it wrong?)

Couldn't fetch DOMText. Node no longer exists

$html5 = \HTML5::loadHTML($string);
$newelem = new \DOMText('Test2');
$oldnode = $html5->getElementById('test');
$newnode = $oldnode->cloneNode()->appendChild($newelem); // <<<<
$parent = $oldnode->parentNode;
$parent->replaceChild($newnode, $oldnode);
print \HTML5::saveHTML($newnode);

Error:

Warning: Couldn't fetch DOMText. Node no longer exists in C:\...\HTML5\Serializer\Traverser.php on line 93

Notice: Undefined property: DOMText::$nodeType in C:\...\HTML5\Serializer\Traverser.php on line 93
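A likely contributing factor in the snippet above is that appendChild() returns the node that was appended, not the parent, so the chained call leaves $newnode holding the DOMText while the cloned element is never stored. The sketch below, using only the built-in DOM extension, keeps the clone in its own variable; it is an illustration and not a confirmed diagnosis of the warning.

```php
<?php
// Sketch only: a hand-built document stands in for the parsed one.
$doc = new DOMDocument();
$old = $doc->appendChild($doc->createElement('div'));
$old->setAttribute('id', 'test');

// Keep a reference to the clone; appendChild() returns the child.
$clone = $old->cloneNode();
$returned = $clone->appendChild($doc->createTextNode('Test2'));

$old->parentNode->replaceChild($clone, $old);

var_dump($returned instanceof DOMText); // bool(true)
echo $clone->textContent, "\n";         // Test2
```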

Exception is thrown for wrong tag names

Hi,
I noticed some issues with pages that contain invalid tag names. I really don't know how to deal with them, so maybe you can find a solution. Below is a list of pages along with the invalid tag names they contain. The exception DOMException#5: Invalid Character Error is always thrown at DOMTreeBuilder.php:227 by DOMDocument::createElement. Any solution other than throwing the exception is fine for me :) I can make the PR if you tell me what the proper fix would be.

http://www.fatr.funsite.cz/ - tag a href="http: what is weird because it's valid form in HTML

http://greylink.4fan.cz/ - tag id="top_featured"
http://www.fcrozsicka.4fan.cz/ - tag color="white"
http://divejse.jecool.net/ - tag class='neaktivni_stranka'
http://black-horse.8u.cz/ - tag src=<a
http://bimbas.cz/ - tag bgcolor="white"
http://four-feathers.9e.cz/ - tag class="nom", here is also tag  that is valid but also invalid one <class="nom">

http://panskyklub.4fan.cz/ - tag br...<a
http://dfc.4fan.cz/news.php - tag span<
http://svadobneauta.sk/ - tag noscript<img
http://dsj4.g6.cz/ - tag br<br
http://pr-csf.cz/ - tag p<
http://www.vranovice500.tode.cz/ - tag wordpress<
http://www.xplay-games.eu/news.php - tag center<a

http://savcin.cz/ - tag li"
http://tjspacince.maweb.eu/ - tag p"
http://acid.funsite.cz/ - tag a�href="http:
http://aloo.cz/ - tag b 
http://www.wpmath.g6.cz/ - tag static*all
http://rchouby.cz/ - tag h*0720

Why is <li> not a BLOCK_TAG?

According to the spec, li can contain any "flow content", i.e. practically anything. Why is it not categorized as a BLOCK_TAG?

Does this HTML5 parser support xml namespace attributes?

I have the following custom HTML5 file that I want to load and parse in PHP.

<html xmlns:tpl="http://cphwebsolutions.dk/2013/tpl">
    <head>
        <title tpl:replace="page.title">Welcome</title>
        <link href="css/styles.css" type="text/css" rel="stylesheet"/>
    </head>
    <body>
        <h1 id="test" tpl:replace="page.headline">Welcome!</h1>
        <p tpl:replace="page.content"></p>
        <audio><source src="test.txt"/></audio>
    </body>
</html>

I want to be able to use XPath to search for different attributes in the custom HTML5 file e.g. $xpath->query("//*[@tpl:replace]");

But it does not find anything. When I use loadHTML from PHP's DOMDocument, it finds all three places where tpl:replace is present in the custom HTML5 document.

I can search for id attribute for example $xpath->query("//*[@id]"); and it works correctly. So my only problem is when I use the custom XML namespace in my HTML5 templates.

Br.
Rune Christensen

<?php
// Assuming you installed from Composer:
require "../vendor/autoload.php";


// An example HTML document:
$html = <<< 'HERE'
<!DOCTYPE html>
<html xmlns:tpl="http://cphwebsolutions.dk/2013/tpl">
    <head>
        <title tpl:replace="page.title">Welcome</title>
        <link href="css/styles.css" type="text/css" rel="stylesheet"/>
    </head>
    <body>
        <h1 id="test" tpl:replace="page.headline">Welcome!</h1>
        <p tpl:replace="page.content"></p>
        <audio><source src="test.txt"/></audio>
    </body>
</html>
HERE;

// Parse the document. $dom is a DOMDocument.
$dom = HTML5::loadHTML($html);

$xpath = new DOMXpath($dom);
$xpath->registerNamespace("tpl", "http://cphwebsolutions.dk/2013/tpl");

// example 1: for everything with an id
//$elements = $xpath->query("//*[@id]");

// example 2: for node data in a selected id
//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");

// example 3: same as above with wildcard
$elements = $xpath->query("//*[@tpl:replace]");

if (!is_null($elements)) {
  foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "] ";
    echo "<br/> ". $element->attributes->length. "\n";

    for ($i=0; $i<$element->attributes->length;$i++) {
        echo " ".$element->attributes->item($i)->nodeName."<br/>\n";
    }

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeName. "\n";
    }
  }
}

echo "<br/><br/>\n";

// Render it as HTML5:
print HTML5::saveHTML($dom);

// Or save it to a file:
HTML5::save($dom, 'out.html');

Rename HTML to Html to follow PSR

Using an existing DOMDocument or subclass as target for parsed data

_PHPPowertools/DOM-Query_ is the first component of the _PHPPowertools_ framework that has been released to the public. Its purpose is similar to that of _technosophos/querypath_, but its implementation is far more true to both jQuery's syntax and its semantics. For example, _PHPPowertools/DOM-Query_ lets you do stuff like this:

// Add a span tag with classes 'icon' and 'icon-printer' to all buttons
$H->select('body')->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the data-val attribute of all gallery images
$H->select('.gallery li img')->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

What's lacking so far, is proper support for HTML5. I've been considering using _Masterminds/html5-php_ to do the DOM parsing.

The most elegant way to implement the feature would be to add a target option to the supported options for \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct, with support for the following datatypes:

\DOMDocument or subclasses of \DOMDocument
\DOMImplementation or subclasses of \DOMImplementation

I would like to use this feature as follows :

namespace PowerTools;

use \Symfony\Component\CssSelector\CssSelector as CssSelector;
use \Masterminds\HTML5 as HTML5;

class DOM_Document extends \DOMDocument {

    protected $_isHTML = false;

    public function __construct($data = false, $version = null, $encoding = null) {
        parent::__construct($version, $encoding);
        $data = trim($data);
        if ($data && $data != '') {
            if ($this->_isHTML) {
                $html5 = new HTML5();
                @$html5->loadHTML($data, array('target' => $this));
            } else {
                @$this->loadXML($data);
            }
        }
    }

   [ ... ]
}

I've tried adding a simple if(){}else{} statement to \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct to replace $this->doc with $options['target'] if a value for $options['target'] has been set, but that doesn't seem to do it.

As an alternative, I've also considered reïmplementing \PowerTools\DOM_Document as a subclass of \DOMImplementation, but this is a far less elegant approach that introduces too many new issues to go any further in that area.

Any feedback would be appreciated!

No line break after <html>

I tried the example from the README file and the result was that the line break after the <html> tag was removed:

Input:

  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>

Output:

<html><head>
    <title>TEST</title>
  </head>
  <body id="foo">
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>

It looks like it was the only line where input and output were different.

Br.
Rune Christensen

Documentation

We need documentation.

TEXT_RCDATA Fields and Processing Instructions

In TEXT_RCDATA fields like <title> it is not possible to use processing instructions.

Could this be the sole exception for RCDATA fields or is this against the spec?

Expose debug mode and use PSR logger

In a few places there is a debug mode that prints to standard out. We can expose this in the Html5 and use the PSR logger interface (still printing to standard out by default).

Invalid Character Error

Hi,
when I'm trying to parse URL http://e107.funsite.cz/ I get DOMException("Invalid Character Error", 5) because of one unclosed tag in the markup. The snippet below causes the exception. It is caused by trying to set attribute with name <div in DOMTreeBuilder.php. As I understand from the doc all errors should be recorded in property $dom->errors. Can you fix this please?

<div class="wrapper"
                <div class="fleft">

Tokenizer quotedString function failure

I was trying to use the function quotedString from Tokenizer.php but it failed and I changed line 717 from:

if ($tok == '"' || "'") {

to:

if ($tok == '"' || $tok == "'") {

And now it works correctly in my PHP script.

Br.
Rune
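The reason the original condition misbehaves is that `$tok == '"' || "'"` is parsed as `($tok == '"') || ("'")`, and a non-empty string such as "'" is always truthy. A minimal sketch of the difference (the function names are illustrative, not from Tokenizer.php):

```php
<?php
// Original form: the right-hand operand is a truthy string literal,
// so the whole expression is true for every token.
function buggyIsQuote($tok)
{
    return $tok == '"' || "'";
}

// Corrected form: both sides compare against $tok.
function fixedIsQuote($tok)
{
    return $tok == '"' || $tok == "'";
}

var_dump(buggyIsQuote('a')); // bool(true)
var_dump(fixedIsQuote('a')); // bool(false)
var_dump(fixedIsQuote("'")); // bool(true)
```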

ProcessorInstructions not working

Was trying to use the processor instruction functionality but got a fatal error

Fatal error: Call to a member function process() on a non-object in /home/www/cphwebsolutions.dk/cms/vendor/HTML5/Parser/DOMTreeBuilder.php on line 364

It looks like the error is placed in line 364:

$res = $processor->process($this->current, $name, $data);

I think that it should be changed to

$res = $this->processor->process($this->current, $name, $data);

Br.
Rune

DOMElement::setIdAttribute(): ID loading already defined

Hello,

Sometimes I get a strange error:

PHP Warning:  DOMElement::setIdAttribute(): ID loading already defined in /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 392
PHP Stack trace:
PHP   1. {main}() /home/xxx/domains/xxx/public_html/index.php:0
PHP   2. Core\xxx->loadPage() /home/xxx/domains/xxx/public_html/index.php:147
PHP   3. Modules\xxx\Main->load() /home/xxx/domains/xxx/public_html/xxx/core/xxx.php:619
PHP   4. Modules\xxx\Main->_startEngine() /home/xxx/domains/xxx/public_html/xxx/modules/xxx/main.php:301
PHP   5. Modules\xxx\Main->_runCrawler() /home/xxx/domains/xxx/public_html/xxx/modules/xxx/main.php:121
PHP   6. Core\Classes\Search\Crawler->visitPage() /home/xxx/domains/xxx/public_html/xxx/modules/xxx/main.php:154
PHP   7. Masterminds\HTML5->loadHTML() /home/xxx/domains/xxx/public_html/xxx/core/classes/search/crawler.php:28
PHP   8. Masterminds\HTML5->parse() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5.php:94
PHP   9. Masterminds\HTML5\Parser\Tokenizer->parse() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5.php:165
PHP  10. Masterminds\HTML5\Parser\Tokenizer->consumeData() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:83
PHP  11. Masterminds\HTML5\Parser\Tokenizer->tagOpen() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:126
PHP  12. Masterminds\HTML5\Parser\Tokenizer->tagName() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:269
PHP  13. Masterminds\HTML5\Parser\DOMTreeBuilder->startTag() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/Tokenizer.php:371
PHP  14. DOMElement->setIdAttribute() /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/DOMTreeBuilder.php:392
PHP Warning:  DOMElement::setIdAttribute(): ID placeholder already defined in /home/xxx/domains/xxx/public_html/xxx/core/framework/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 392

I don't know what the input is, but i hope you can help me?

Create CREDITS file

Create a CREDITS file and add entry for #9 .
