paquettg / php-html-parser
An HTML DOM parser. It allows you to manipulate HTML. Find tags on an HTML page with selectors just like jQuery.
License: MIT License
I'm using this package to read some HTML and replace all anchor tags with some other href (to be more specific, I have coded a function getRedirectUrl that follows all redirects and gets the final URL after all redirections, but this is not relevant for the issue I'm running into).
This is the code I've come up with:
$dom = new Dom;
$dom->load($content);
$links = $dom->find('a');
foreach($links as $link)
{
$finalUrl = $this->getRedirectUrl($link->getAttribute('href'));
$tag = $link->getTag();
$tag->setAttribute('href', $finalUrl);
}
return trim(strip_tags($dom->root->innerHtml(), '<p><a><img>'));
$content contains very simple HTML, something like:
<div>
Actual content
<a href="http://bit.ly/1jIDoCy">This is a link</a>
<a href="http://bit.ly/1N0ZM5s">This is another one</a>
Parse this.
</div>
The final $dom->root->innerHtml() fails with:
local.ERROR: exception 'ErrorException' with message 'Illegal string offset 'value'' in /home/vagrant/Code/marketer/vendor/paquettg/php-html-parser/src/PHPHtmlParser/Dom/Tag.php:161
Am I doing something wrong? I'd appreciate any help; I've spent the past day trying to fix it, with no success.
Is it possible to install this without Composer? I am testing this on WampServer on Windows and plan to use it on shared hosting, so is there any way to install it without Composer?
my code
$dom = new Dom;
$dom->load('http://google.com');
got this
PHP Fatal error: Call to undefined function PHPHtmlParser\curl_init() in /home/howtomakeaturn/projects/veblen/vendor/paquettg/php-html-parser/src/PHPHtmlParser/Curl.php on line 17
Am I missing anything?
Or should I install something?
thanks!
Here's my code.
use PHPHtmlParser\Dom;
$dom = new Dom;
$dom->load('http://example.com');
$html = $dom->outerHtml;
echo $html;
When I view the source, the HTML isn't loaded completely. Only about half of the page is loaded.
For some reason I can't expose the real URL here.
There is no way to add a new attribute to a tag. For example, I wanted to add ng-app='appName' to an HTML tag.
Using the setAttribute method results in:
[ErrorException]
Illegal string offset 'value'
Ideally, the setAttribute method should create a new attribute if it doesn't exist yet.
$this->dom = new Dom;
$this->dom->loadFromUrl($url);
$this->dom->find('body')->innerHtml;
But it only gets part of the content in the body tag. I don't understand why. Could you help me? Thank you.
This is my URL: http://casiovietnam.net/dong-ho-dien-tu-casio-f91wg9sdf-dong-co-dien
When I updated to 1.6.9, it appears the way HTML entities are treated in parsed documents has changed, and it incorrectly handles some characters. The only entity I have seen so far that converts incorrectly is &#105; (lowercase "i"). When requesting a node's text with this entity in it, it returns "\n5;" instead (newline, "5", semicolon). As soon as I reverted to 1.6.8, the problem went away.
Is it possible to disable all entity handling entirely, since I can easily do that myself with html_entity_decode() if I need to?
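For the manual route mentioned above, here is a minimal sketch of decoding entities yourself with PHP's built-in html_entity_decode(). The sample string is made up for illustration, and the sketch assumes nothing about whether the parser can be told to leave entities untouched.

```php
<?php
// Sample text with entities left undecoded (hypothetical input).
$raw = 'Caf&eacute; &amp; b&#105;stro';

// Decode named and numeric references in one pass, treating the text as UTF-8.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

Passing the charset explicitly keeps the result independent of the default_charset setting.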
Hi,
If you have a space before the closing > in a tag, this library incorrectly assumes the following text is a set of attributes for the tag. Eg
<a href="http://www.example.com" >This is text</a>
The a node will then have the following attributes: href, This, is, text.
There is something wrong with the conversion applied to the text.
$dom = new Dom;
$dom->loadFromUrl($page_url, [ 'enforceEncoding'=> 'UTF-8']);
The text is sometimes decoded as it should be, and sometimes the decoding destroys my UTF-8 characters (the source is UTF-8 and I don't change the results returned by the ->text(TRUE) function).
I'm not sure this is a bug, because I traced your code and it should not apply any conversion when UTF-8 is enforced.
Hi!
How can I select an element that has a specific text?
And a suggestion:
In CSS or jQuery, when I want to select an element with two classes I use the syntax .class1.class2
but in your script it doesn't work. I think you chose this syntax for it instead: .class1+.class2
Please fix this. Thank you.
As the title says, I think it would be useful!
Using 1.7.0, selectors of the style "a.b.c" do not correctly find elements with multiple classes.
index.php:
<?php
require_once("vendor/autoload.php");
$dom = new \PHPHtmlParser\Dom();
$dom->loadFromFile("fake.html");
$A = $dom->find("a.b");
var_dump(count($A));
$B = $dom->find("a.b.c");
var_dump(count($B));
fake.html:
<a class="b">alpha</a>
<a class="b c">bravo</a>
output:
$ php -f index.php
int(2)
int(0)
$
I wouldn't necessarily consider this a bug, except the documentation says that "any CSS selector" can be used, which is not the case here. I also tested with 1.6.4 and 1.6.9, with the same results.
Thanks!
I have selected a table by finding it by class. I then need to loop over each row and examine the cells within it, but each row is returned as a TextNode, so I am unable to do selects/foreach on the table cells.
Is there any way to make it return as an HtmlNode?
Currently we are able to replace a tag node with another one (#52), but we can't replace a tag node with several nodes.
sunra/php-simple-html-dom-parser can do this.
$child->outertext = $child->innertext;
Here is my thought:
$child = $parent->find('child')[0];
$children = $child->getChildren();
$parent->replaceChild($child->id(), $children);
Thanks!
I'm currently working on a project and want to include php-html-parser as a dependency. I found the license in composer.json, but this should really be easier to find. Can this be added to the repo as a LICENSE-file?
Looks like the HTML is losing line breaks when getting $dom->find( '.content', 0 )->innerHtml; I'm still digging in to see why.
Setting the preserveLineBreaks option to TRUE doesn't seem to work.
I was trying to parse the HTML content of cnn.com news pages, and when I get the body tag, using both find() and getElementByTag(), half the content is gone. I put the parsed content into a file and realized that some tags like <article> end up outside the <body> or <html> tag, something like this:
<html>
<head>...</head>
<body>...</body>
<article>...</article>
<div>...</div>
</html>
<div>...</div>
php code:
<?php
$dom = new PHPHtmlParser\Dom();
$url = 'http://edition.cnn.com/2015/11/19/tennis/world-tour-finals-federer-nishikori/index.html';
$dom->load($url);
file_put_contents('test.html', (string) $dom);
Hey there. First off, I want to thank you so much for your continuation of this amazing project. It is a godsend for me and my projects. Keep it up. Second, I would like to request that you add itinance's getChildren() feature found here: Link . I think it would be a great feature to add.
P.s. I can create a pull request if needed.
Thanks,
Mooror
Hello!
I'm trying to do some easy html manipulation and I can't seem to figure it out.
According to the docs for sunra/php-simple-html-dom-parser, which I figure should work (correct me if I'm wrong), you should be able to do this.
// Remove an element: set its outertext to an empty string
$e->outertext = '';
But in my testcase, that doesn't seem to work.
This is the relevant part of my code
$dom = new Dom;
$dom->loadStr( $pageMarkup, [] );
$menu = $dom->find( '.inPageMenu' );
$menu->outertext = '';
$html = $dom->outerHtml;
After that, the item with the class is still present in $html.
Am I going at this backwards or am I missing something?
Thanks in advance!
We need to add more functions, innerHtml and outerHtml, because the parent class calls them but this class doesn't define them.
Hi !
I don't know if I'm doing something wrong or if there is a bug in the code, but I would like to fetch items using a selector, and php-html-parser returns only one result when there are several.
//$body is the content of this page : http://www.novaplanet.com/radionova/cetaitquoicetitre
//I can't use $dom->load('http://www.novaplanet.com/radionova/cetaitquoicetitre') here; I need to use a string.
$dom = new Dom;
$dom->loadStr($body, []);
$track_nodes = $dom->find('.cestquoicetitre_results .resultat');
I get only the first result.
Can someone help here ?
If I load the following,
<div class="content">
<div class="grid-container" ui-view>
<!-- the main content appears here -->
</div>
</div>
then when I render the HTML I get (note the extra > in the inner div),
<div class="content">
<div class="grid-container" ui-view>> </div>
</div>
However, if I move the ui-view attribute before the class attribute or add a value to it, then it is rendered correctly.
I was trying to get all the elements,
here's what I tried:
$dom->load('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
$a = $dom->find('*');
exit(var_dump($a));
It returns 1 (int), so it seems that php-html-parser doesn't support the wildcard symbol?
Is there any option to wrap existing nodes with new ones or some way to change node attributes?
Fatal error: Method PHPHtmlParser\Dom::__toString() must not throw an exception, caught Error: Cannot use object of type PHPHtmlParser\Dom\HtmlNode as array
Here is some example code to reproduce
$dom = new Dom;
$dom->load($content);
$images = $dom->find('img');
$newimages = [];
foreach ($images as $image) {
$tag = new Tag('amp-img');
$src = $image->getAttribute('src') ?: $image->getAttribute('data-src');
$tag->setAttribute('src', $src);
$html = new HtmlNode($tag);
$image->getParent()->replaceChild($image->id(), $html);
}
return (string)$dom;
On home page of this repo, I see an example of setting an attribute:
$tag->setAttribute('class', 'foo');
But the above code does not work and throws the error below:
Illegal string offset 'value'
But if I use an array for the second parameter, it works fine:
$tag->setAttribute('class', array('value'=>'foo', 'doubleQuote'=>true));
Fatal error: Class 'stringEncode\Encode' not found in /var/www/clients/client19/web83/web/libs/PHPHtmlParser/Dom.php on line 593
If the parser encounters something like this:
<p>.....</p>
<script>
some code ....
document.write("<script src='some script'><\/script>")
some code ....
</script>
<p>....</p>
the cleaner removes much of the HTML body.
It can be fixed by changing this code:
$str = preg_replace("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is", '', $str);
to
$str = preg_replace("'<\s*script[^>]*[^/]>(.*?)<[^\/]*/\s*script\s*>'is", '', $str);
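To see what the proposed change does, here is a small sketch that runs both quoted patterns over a made-up sample. The sample HTML and the scriptTagBalance() helper are illustrative assumptions; only the two patterns come from the report, and the sketch looks at this one preg_replace call in isolation.

```php
<?php
// Hypothetical sample: a script block that writes another script tag,
// with the inner closing tag escaped as <\/script>.
$html = <<<'HTML'
<p>before</p><script>document.write("<script src='s.js'><\/script>")</script><p>after</p>
HTML;

// The two patterns quoted above, unchanged.
$original = "'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is";
$proposed = "'<\s*script[^>]*[^/]>(.*?)<[^\/]*/\s*script\s*>'is";

$withOriginal = preg_replace($original, '', $html);
$withProposed = preg_replace($proposed, '', $html);

// Count the opening and closing script tags left behind by each pattern.
function scriptTagBalance(string $s): array
{
    return [substr_count($s, '<script'), substr_count($s, '</script>')];
}

var_dump(scriptTagBalance($withOriginal));
var_dump(scriptTagBalance($withProposed));
```

On this sample the original pattern matches from the inner <script src=...> through the outer </script>, leaving an unmatched opening <script> behind, while the proposed pattern stops at <\/script> and leaves a balanced pair. Neither pattern removes a bare <script> with no attributes, because [^>]*[^/]> needs at least one character between "script" and ">".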
Hi,
Do you have any idea how to fix the below error?
Fatal error: Uncaught exception 'PHPHtmlParser\Exceptions\CurlException' with message 'Error retrieving "http://google.com" (Resolving timed out after 5521 milliseconds)
Code is here:
$dom = new Dom;
$dom->loadFromUrl('http://google.com');
$html = $dom->outerHtml;
I got the same on dev-master and 1.6.4 versions.
Thanks.
public function __construct($text)
{
// remove double spaces
$text = preg_replace('/\s+/', ' ', $text);
$text = preg_replace('/\s+/', ' ', $text);
There is an encoding bug in these lines. With them deleted, it works fine.
test case:
$dom = new \PHPHtmlParser\Dom;
$dom->load(
' <div class="content">'.
' 一哥们开车巨慢,早上上班十几公里路能开四十多分钟,每天起很早为了不迟到,经常见一骑三轮车的保洁老大爷晃晃悠悠超他的车对他说:小伙子又早起练车呀,快点开吧,交警一会就上班了。。。'.
' </div>'
);
$content = $dom->find('div.content');
var_dump($content[0]->innerHtml);exit;
Maybe there is a better solution.
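If the corruption comes from \s being applied byte by byte to UTF-8 input (an assumption; the report does not confirm the cause), one alternative to deleting the lines is to add the u modifier so that PCRE treats the subject as UTF-8. The helper name normalizeWhitespace() below is hypothetical and not part of the library.

```php
<?php
// Collapse runs of whitespace in UTF-8 text without splitting multi-byte characters.
// normalizeWhitespace() is a hypothetical helper, not part of php-html-parser.
function normalizeWhitespace(string $text): string
{
    // The "u" modifier makes PCRE treat both pattern and subject as UTF-8.
    $result = preg_replace('/\s+/u', ' ', $text);

    // preg_replace() returns null on failure, for example when the subject
    // is not valid UTF-8; fall back to the untouched input in that case.
    return $result ?? $text;
}

echo normalizeWhitespace("一哥们  开车\n巨慢"), PHP_EOL;
```

The fallback keeps invalid input unchanged instead of silently returning an empty result.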
Bug in your code or documentation. https://github.com/paquettg/php-html-parser/blob/master/src/PHPHtmlParser/Dom.php#L151
$dom->loadFromUrl('http://google.com', new Connector);
The documentation says the optional second parameter is an implementation of CurlInterface, but in your code the second parameter is the options array.
Hi
I'm using this simple code to fetch a field from an external URL:
$dom = new Dom;
$dom->loadFromUrl($dom_address);
$time = $dom->getElementsByClass('exampleclass')->getAttribute('data-datetime');
The problem is that when it can't find an element with that class, for example because the website is not working properly (it is offline, or it displays a 404 page), I get a PHP error (it's fine when it CAN find the element). So how can I check whether the element was found, so that I can simply set $time = 'N/A'; and prevent the error?
Thanks
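One way to guard such a lookup is to check the result before calling a method on it. The sketch below uses a hypothetical helper, firstAttributeOr(), which only assumes that the lookup result can be passed to count() and indexed with [0], as snippets in other reports on this page do; it is not part of the library.

```php
<?php
// Hypothetical helper: read an attribute from the first match, or fall back to a default.
// It assumes only that $matches works with count() and [0], and that each match
// offers getAttribute(); anything else yields the default.
function firstAttributeOr($matches, string $attribute, string $default): string
{
    // Nothing usable was found (null, a non-countable value, or an empty result).
    if (!is_countable($matches) || count($matches) === 0) {
        return $default;
    }

    $value = $matches[0]->getAttribute($attribute);

    // The element exists but does not carry the attribute.
    return $value === null ? $default : (string) $value;
}
```

With the code from the question, this could be called as $time = firstAttributeOr($dom->getElementsByClass('exampleclass'), 'data-datetime', 'N/A'); assuming that call returns a countable collection. A page that cannot be fetched at all may still throw before this point, so wrapping loadFromUrl() in try/catch is a separate concern.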
When you attempt to load an html page from a URL using loadFromUrl the encoding is incorrect.
I have not found this problem when attempting to open the same html page but as a file on the local server.
As the title states above, I am wondering whether the find feature uses regex to search for things, and if so, how.
Thanks,
Mooror
Hey mate, great module, just started using it, just wondering why i can't pull out any script tags, and i can see they're getting stripped in your clean() function.
This could be just what you want, but i'm actually wanting to parse these out! Ah well could just fork it ay.
Thanks again.
Cheers
Rob
Add ability to wrap existing nodes in a dom
I am trying to load a string:
use PHPHtmlParser\Dom;
$dom = new Dom;
$dom->load('A HUGE HTML STRING ');
$a = $dom->find('a')[0];
echo $a->text; // "click here"
and I get the error
is_file(): File name is longer than the maximum allowed path length on this platform (4096):
I think you should create a function, or give access to loadStr, that does not check whether the input is a file:
public function load($str, $options = [])
{
// check if it's a file
if (is_file($str))
{
return $this->loadFromFile($str, $options);
}
// check if it's a url
if (preg_match("/^https?:\/\//i",$str))
{
return $this->loadFromUrl($str, $options);
}
return $this->loadStr($str, $options);
}
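A possible guard for the file check in load() is sketched below. looksLikeExistingFile() is a hypothetical helper, not part of php-html-parser; it skips is_file() for inputs that cannot be paths, which is the case that produced the warning in the report above.

```php
<?php
// Hypothetical guard, not part of php-html-parser: decide whether a string is
// worth passing to is_file() at all.
function looksLikeExistingFile(string $input): bool
{
    // Strings longer than the platform path limit cannot be file names;
    // this is the situation in which the report shows is_file() complaining.
    if (strlen($input) > PHP_MAXPATHLEN) {
        return false;
    }

    // Null bytes are never valid in a path.
    if (strpos($input, "\0") !== false) {
        return false;
    }

    return is_file($input);
}

// A long HTML string is rejected before is_file() is ever called.
var_dump(looksLikeExistingFile(str_repeat('<p>text</p>', 1000)));
```

Calling loadStr() directly, as other reports on this page do, sidesteps the check entirely when the input is known to be markup.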
Firstly, let me say thanks for maintaining such a great library, it has been exceptionally useful in my project. But on to the issue
Using the following code:
$dom->find('div > ul');
Results in an empty set, despite the html being valid. It seems the find() function does not support child selectors. I added in a unit test to SelectorTest.php to confirm the results.
The test code:
public function testFindClassWithChildSelector() {
$root = new HtmlNode(new Tag('root'));
$parent = new HtmlNode(new Tag('div'));
$child1 = new HtmlNode(new Tag('ul'));
$root->addChild($parent);
$parent->addChild($child1);
$selector = new Selector('div > ul');
$this->assertEquals(1, count($selector->find($root)));
}
I'm going to see if I can't add the functionality myself. However, given your familiarity with the library you may be able to make a quick change to fix this.
There are some cases in which it might be useful to use the DOM parser to check the presence of certain <style> or <script> tags. In those cases it should be possible to pass an option to skip some of the cleaning steps that remove those tags. This option could be either as broad as "skipHtmlCleanup" or specific such as "keepScriptsInDom" or "keepStylesInDom".
Given this HTML:
<a title="This is a "test" of double quotes" href="http://www.example.com">Hello</a>
When passed into Dom::load(), the parser ends up correctly finding the element, but misparses the attributes and body text. The attributes (from var_dump($element->getAttributes())) appear like so:
array(1) { ["title"]=> string(10) "This is a " }
and the body appears like so (from var_dump($element->text())):
string(58) "est" of double quotes" href="http://www.example.com">Hello"
I realize that putting double quotes inside an attribute is nonconformant HTML, but ideally PHPHtmlParser should be tolerant of such things and parse the element anyway, much in the way web browsers do. While it may be impossible to accurately determine what the intended value of the title attribute is, it should be possible to ensure that the element text does not include content from before the > marker.
Hey folks, I'm trying to use the parser on a website (code below), but it seems the parser cannot get the whole page. I tried both file_get_contents and $dom->load, and it still doesn't work.
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
$races = ['Karu', 'Elmo'];
$dom = new Dom;
$dom->loadFromUrl('www.nttgameonline.com/knight/en/ranking/clan/0/Karu');
$html = $dom->innerHtml;
$contents = $dom->find('#server');
echo count($contents); // It should print 4
foreach ($contents as $content)
{
echo ($content->plaintext);
}
This is the part of the code where the problem occurred:
https://gist.github.com/rachid804/1c0c23fc7b2398660f4e
I can access div#listing-image-frame, but when I try to get any of its child tags I get this:
PHPHtmlParser\Dom\Collection Object ( [collection:protected] => Array ( ) )
It would be very useful if it were possible to remove an attribute (or a list of attributes) from a tag. Ideally we can add two new methods to both the AbstractNode and the Tag classes. The first method, removeAttribute, just removes an attribute. The second method, removeAttributes, removes all attributes from a tag except the ones passed to the method.
This is the HTML:
<strong>hello</strong>
<code class="language-php">$foo = "bar";</code>
The parser only recognizes <strong>, according to the output:
there are 1 nodes
- strong
$dom = new Dom;
$dom->load('<strong>hello</strong><code class="language-php">$foo = "bar";</code>');
$nodes = $dom->find('*');
$total = count($nodes);
echo "there are {$total} nodes";
/** @var Dom\AbstractNode $node*/
foreach ($nodes as $node) {
$tag = $node->getTag();
echo "<br>- {$tag->name()}";
}
Any idea why <code> is ignored? I have tested a lot of tags and only this one is not recognized. Thank you in advance.
I use Linux Mint. Is there a video for this? It would be very helpful to understand it by watching a video.
Currently we are able to remove a specific child and add a new child at the tail of the parent, but we can't replace a specific tag node.
sunra/php-simple-html-dom-parser can do this.
$child->outertext = '<new />';
Here is my thought:
$child = $parent->find('child')[0];
$newChild = new Tag('new');
$parent->replaceChild($child->id(), $newChild);
Thanks!
This is the input I feed to php-html-parser:
matchstats2301484-b0f319a362db1f67e36f4702a9970e53.txt
$content = file_get_contents('file.html');
$dom = new Dom();
$container = $dom->loadStr($content, []);
echo $container->innerHtml;
gives
Any clue as to what causes this? I've tried all options, but nothing changes.
Hi,
It seems that index-related selectors do not work?
I need to have a selector to get the second matched item, like
$node = $html->find('a:eq(1)')
I can't use
$nodes = $html->find('a');
$node = $nodes[1];
in my script, I need to get it using a selector.
Any way to achieve this ?
Thanks.