jstedfast / htmlkit
A cross-platform .NET framework for parsing HTML
License: Other
This is basically a question. I need to retrieve the text from an HTML document without building a DOM, to maximize performance. Can HtmlKit help me achieve that? I would appreciate your advice.
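One way to approach this is to read tokens one at a time and keep only the character data. The sketch below assumes the tokenizer API that appears in the bug report further down (HtmlTokenizer, ReadNextToken, HtmlTokenKind, HtmlTagToken), plus HtmlDataToken.Data and HtmlTagToken.IsEndTag from the same library. The TextExtractor class and ExtractText method are hypothetical names for illustration, and the handling of style content is an assumption about how it is tokenized, not documented behavior.

```csharp
using System.IO;
using System.Text;
using HtmlKit;

static class TextExtractor
{
    // Hypothetical helper: collects character data token by token, so no DOM is built.
    public static string ExtractText(TextReader reader)
    {
        var tokenizer = new HtmlTokenizer(reader);
        var text = new StringBuilder();
        bool inStyle = false;

        while (tokenizer.ReadNextToken(out HtmlToken token))
        {
            switch (token.Kind)
            {
                case HtmlTokenKind.Tag:
                    // Assumption: <style> content may arrive as plain data tokens, so track it.
                    var tag = (HtmlTagToken)token;
                    if (tag.Id == HtmlTagId.Style)
                        inStyle = !tag.IsEndTag;
                    break;
                case HtmlTokenKind.Data:
                    if (!inStyle)
                        text.Append(((HtmlDataToken)token).Data);
                    break;
                // Script data, comments, and doctype tokens are ignored here.
            }
        }

        return text.ToString();
    }
}
```

It could be called as TextExtractor.ExtractText(new StreamReader("index.html")). Whitespace between tags is kept as-is in this sketch, so the result may need trimming or normalizing.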
Describe the bug
HtmlKit alters attribute values that contain HTML character entities.
We expect a returned attribute value to be literally equal to the text in the input stream.
Platform (please complete the following information):
.NET 6.0 and .NET Framework 4.8
To Reproduce
Steps to reproduce the behavior:
An index.html file (set to Copy if newer):
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title></title>
</head>
<body>
    <a href="https://google.com/?q=val-«-val" name="val-«">The First Link</a>
    <br />
    <a href="https://google.com/?q=val-&-val" name="val-&">The Second Link</a>
</body>
</html>
Program.cs file:
using System;
using System.IO;
using HtmlKit;

namespace HtmlKitTestProject
{
    internal class Program
    {
        static void Main(string[] args)
        {
            using var stream = new FileStream("index.html", FileMode.Open, FileAccess.Read);
            using var reader = new StreamReader(stream);
            var tokenizer = new HtmlTokenizer(reader);
            HtmlToken token;

            // ReadNextToken returns false once the end of the stream is reached.
            while (tokenizer.ReadNextToken(out token))
            {
                switch (token.Kind)
                {
                    case HtmlTokenKind.Tag:
                        var tag = (HtmlTagToken)token;

                        // Only <a> tags are of interest; skip everything else.
                        if (tag.Id != HtmlTagId.A)
                            continue;

                        // Print each attribute as the tokenizer reports it.
                        foreach (var attribute in tag.Attributes)
                        {
                            if (attribute.Value != null)
                                Console.WriteLine(" {0}={1}", attribute.Name, $"{attribute.Value}");
                            else
                                Console.WriteLine(" {0}", attribute.Name);
                        }
                        break;
                }
            }

            Console.ReadLine();
        }
    }
}
Expected behavior
The HTML file contains attributes whose values include HTML character entities.
When an attribute value is returned, it should be literally equal to the input stream, but as you can see, the entities are converted to their decoded form.
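If the goal is to write the value back into HTML, one possible workaround on the caller's side is to re-encode the decoded value. The sketch below uses only System.Net.WebUtility from the base class library. The AttributeEncoding class and Reencode method are hypothetical names, and this is not something HtmlKit itself provides. Note that re-encoding is not a byte-for-byte round trip: a named reference in the source may come back in a different spelling, for example as a numeric reference.

```csharp
using System.Net;

static class AttributeEncoding
{
    // Hypothetical helper: re-encodes an already-decoded attribute value
    // so that it is safe to emit inside an HTML attribute again.
    public static string Reencode(string decodedValue)
    {
        return WebUtility.HtmlEncode(decodedValue);
    }
}
```

For example, AttributeEncoding.Reencode("val-&") yields "val-&amp;". When the exact source text matters, as in the report above, re-encoding is only an approximation.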
I think your original version is better: less error-prone and more meaningful.
Just provide a new extension method.
It will override the original one,
so we can use it in .NET 2.
Hi,
I am evaluating whether an HTML parser could be a better fit for my mjml project: https://github.com/sebastianStehle/mjml-net
Your library seems to be one of the fastest solutions. Nice project and very clean code :)
Your solution is _HtmlReader;
_Html and _Html2 are HtmlAgilityPack and another solution. I am wondering whether this project is under active development and whether you would consider performance improvements and/or PRs.
Your library is already pretty fast, but I wonder if it can be improved further by keeping the tokens as fields (and properties), so that you can initialize names with pointers into the CharBuffer and thereby avoid a lot of allocations.
From pull request #8:
OK, so no events.
And what about reducing how often HtmlToken appears in the API?
(Use fewer out parameters, so callers need not worry about the method returning true or false.)
Would that be better?
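One way to picture this without changing the tokenizer is an extension method that hides the out parameter behind an enumerable. This is a minimal sketch, assuming the HtmlTokenizer.ReadNextToken(out HtmlToken) method used in the bug report above. The HtmlTokenizerExtensions class and ReadTokens method are hypothetical names, not part of HtmlKit or of the pull request.

```csharp
using System.Collections.Generic;
using HtmlKit;

static class HtmlTokenizerExtensions
{
    // Hypothetical wrapper: exposes the token stream as a lazy sequence,
    // so callers use foreach instead of an out parameter and a bool result.
    public static IEnumerable<HtmlToken> ReadTokens(this HtmlTokenizer tokenizer)
    {
        while (tokenizer.ReadNextToken(out HtmlToken token))
            yield return token;
    }
}
```

A caller would then write foreach (var token in tokenizer.ReadTokens()) { ... }. The sequence is lazy and single-pass, and it relies on each call returning a distinct token object; a design that reuses token instances would need a different shape.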
Hello
I don't intend to fragment your library.
I just want to use it with HtmlRenderer (https://github.com/prepare/HTML-Renderer),
so I must change some things to fit HtmlRenderer.
I'm a young developer, feel free to suggest/comment
Thank you for your work :)
Not really an issue, but...
Stumbled upon this brilliant library while considering writing a tokenizer myself, in order to be able to locate, at least semi-reliably, the position of an element inside the source HTML without resorting to e.g. regex. I need it for patching HTML files (e.g. automatically inserting <script> or elements at given positions without touching the surrounding HTML).
Am I right in thinking that LineNumber/LinePosition will be off by 1, iff the last token was an HtmlDataToken (due to lookahead)? Or are there any edge cases I'm missing that you can think of? I can't quite wrap my head around the HTML tokenization spec at the moment.
Just look at the method names and documentation comments (HTML5) only.
This is my proposal:
71cf4e7
It makes code review and study easier!
The HTML syntax (https://www.w3.org/TR/2011/WD-html5-20110405/syntax.html#elements-0) defines a number of void elements that don't have a close tag. In the XML serialization (<br />) they are correctly tokenized, with the attribute IsEmptyElement set to true, but in the HTML serialization (<br>) the same attribute is false.
Is this something that should be managed by the client application or it is a bug in HtmlKit?
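If this is left to the client application, a small lookup of void element names is usually enough. The sketch below is self-contained and does not call HtmlKit. The VoidElements class and both method names are hypothetical, and the name list follows the current HTML standard; older drafts list a few more names, such as command, keygen, and param.

```csharp
using System;
using System.Collections.Generic;

static class VoidElements
{
    // Void element names per the current HTML standard (assumption: adjust for older drafts).
    static readonly HashSet<string> Names = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"
    };

    // True when the element never has a close tag in the HTML serialization.
    public static bool IsVoid(string tagName)
    {
        return Names.Contains(tagName);
    }

    // Combines the tokenizer's self-closing flag with the void element lookup.
    public static bool ClosesImmediately(string tagName, bool isEmptyElement)
    {
        return isEmptyElement || IsVoid(tagName);
    }
}
```

With a tag token from the tokenizer, the call would look like VoidElements.ClosesImmediately(tag.Name, tag.IsEmptyElement), which treats <br> and <br /> the same way.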