jstedfast / htmlkit

77.0 15.0 55.0 1.17 MB

A cross-platform .NET framework for parsing HTML

License: Other

C# 45.76% HTML 53.92% Batchfile 0.01% PowerShell 0.31%

c-sharp html-parser html html5 parser

htmlkit's People

prepare nagyistge amesianx evenbing thecoder87 dotnetkits interestingitems wildgenie jither mobi-bytes mastermann rakhithjk bubdm manuelarriolag atikhan driekus77 ajunlonglive sebastianstehle awesomedotnetcore

htmlkit's Issues

Get text from HTML document

This is basically a question. I need to extract the text from an HTML document without building a DOM, to maximize performance. Can HtmlKit help me achieve that? I would appreciate your advice.
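One possible approach to the question above is to stream tokens and keep only the character data. The following is a minimal sketch, not an answer from the library author: it relies on the same `HtmlTokenizer.ReadNextToken` loop that appears in another issue on this page, and it assumes that `HtmlTokenKind.Data` tokens can be cast to `HtmlDataToken` and expose a `Data` string. `HtmlTextExtractor` and `ExtractText` are hypothetical names introduced for illustration.

```csharp
using System.IO;
using System.Text;
using HtmlKit;

public static class HtmlTextExtractor
{
    // Hypothetical helper (not part of HtmlKit): concatenates the character
    // data of a document by streaming tokens, without building a DOM.
    public static string ExtractText(TextReader reader)
    {
        var tokenizer = new HtmlTokenizer(reader);
        var text = new StringBuilder();

        while (tokenizer.ReadNextToken(out HtmlToken token))
        {
            // Keep only plain character data; tags, comments, doctypes and
            // every other token kind are skipped.
            if (token.Kind == HtmlTokenKind.Data)
                text.Append(((HtmlDataToken)token).Data);
        }

        return text.ToString();
    }
}
```

A call such as `HtmlTextExtractor.ExtractText(new StreamReader("index.html"))` would return the concatenated text. The sketch does not try to exclude style sheet content or to normalize whitespace, so a real extractor would probably need to track the current tag as well.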

HtmlKit manipulates the attribute value when they contain HTML special entities

Describe the bug
HtmlKit modifies attribute values that contain HTML character entities.
We expect a returned attribute value to be literally identical to the text in the input stream.

Platform (please complete the following information):

OS: Windows & Linux
.NET Framework: .NET 6.0 & .NET Framework 4.8
HtmlKit Version: 1.1.0

To Reproduce
Steps to reproduce the behavior:

Create a new Console application

Add the following HTML file in the project and mark it as Copy if newer:

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title></title>
</head>
<body>
    <a href="https://google.com/?q=val-&laquo;-val" name="val-&laquo;">The First Link</a>
    <br />
    <a href="https://google.com/?q=val-&amp;-val" name="val-&amp;">The Second Link</a>
</body>
</html>

Copy-Paste the following code in Program.cs file:

using System;
using System.IO;
using HtmlKit;

namespace HtmlKitTestProject
{
    internal class Program
    {
        static void Main(string[] args)
        {
            using var stream = new FileStream("index.html", FileMode.Open, FileAccess.Read);
            using var reader = new StreamReader(stream);

            var tokenizer = new HtmlTokenizer(reader);
            HtmlToken token;

            while (tokenizer.ReadNextToken(out token))
            {
                switch (token.Kind)
                {
                    case HtmlTokenKind.Tag:
                        var tag = (HtmlTagToken)token;

                        if (tag.Id != HtmlTagId.A)
                            continue;

                        foreach (var attribute in tag.Attributes)
                        {
                            if (attribute.Value != null)
                                Console.WriteLine(" {0}={1}", attribute.Name, $"{attribute.Value}");
                            else
                                Console.WriteLine(" {0}", attribute.Name);
                        }
                        break;
                }
            }

            Console.ReadLine();
        }
    }
}

Run the project and check the output:

Expected behavior
The HTML file contains attributes with some HTML special entities as their values:

A returned attribute value should be literally identical to the input stream but, as you can see, each value is converted to its decoded form.
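The difference between the raw and decoded forms can be illustrated independently of HtmlKit. The sketch below uses `System.Net.WebUtility.HtmlDecode` from the .NET base class library purely as a stand-in decoder; it makes no claim about how HtmlKit decodes internally. `EntityDemo` and its methods are hypothetical names.

```csharp
using System.Net;

public static class EntityDemo
{
    // Decodes character references in a raw attribute value, e.g.
    // "val-&amp;" becomes "val-&".
    public static string Decode(string rawAttributeValue)
    {
        return WebUtility.HtmlDecode(rawAttributeValue);
    }

    // Two different raw spellings can decode to the same string, which is
    // why the original markup cannot be recovered from a decoded value.
    public static bool DecodeToSameValue(string rawA, string rawB)
    {
        return Decode(rawA) == Decode(rawB);
    }
}
```

For the sample file in this report, `EntityDemo.Decode("val-&laquo;")` yields `val-«`, and `EntityDemo.DecodeToSameValue("&laquo;", "&#171;")` is true, so a caller that needs the exact source text has to obtain it before decoding takes place.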

StringBuilder.Clear() extension method is OK

ea7b9d8

I think your original version is better: less error-prone and more meaningful.
Just provide a new extension method;
it will override the original one,
so we can use it in .NET 2.

Active development?

Hi,

I am evaluating whether an HTML parser could be a better fit for my mjml project: https://github.com/sebastianStehle/mjml-net

Your library seems to be one of the fastest solutions. Nice project and very clean code :)

Your solution is _HtmlReader; _Html and _Html2 are HtmlAgilityPack and another solution. I am wondering whether this project is under active development and whether you would consider performance improvements and/or PRs.

Your library is already pretty fast, but I wonder if it can be further improved by keeping the tokens as fields (and properties) so that you can initialize names with pointers to the CharBuffer and therefore avoid a lot of allocations.

ok, not event, and what about reducing HtmlToken appearance?

From pull request #8:
OK -> not an event.

And what about reducing HtmlToken's appearance in the API?
(Reduce the use of the out parameter, so callers need not worry about the true/false return value.)
Is that better?

Hello!

Hello

I don't intend to fragment your library.

I just want to use it with the HtmlRenderer (https://github.com/prepare/HTML-Renderer),
so I have to change a few things to make it fit the HtmlRenderer.

I'm a young developer, so feel free to suggest or comment.
Thank you for your work :)

Reliability of HtmlTokenizer.LineNumber/LinePosition

Not really an issue, but...

Stumbled upon this brilliant library while considering writing a tokenizer myself, in order to locate, even semi-reliably, the position of an element inside the source HTML without resorting to e.g. regex. I need it for patching HTML files (e.g. automatically inserting <script> or elements at given positions without touching the surrounding HTML).

Am I right in thinking that LineNumber/LinePosition will be off by 1, iff the last token was an HtmlDataToken (due to lookahead)? Or are there any edge cases I'm missing that you can think of? I can't quite wrap my head around the HTML tokenization spec at the moment.

method rename proposal

Just look at the method names and documentation comments (HTML5).

This is my proposal:
71cf4e7

It makes code review and study easier!

Void elements should be reported as empty

The HTML syntax (https://www.w3.org/TR/2011/WD-html5-20110405/syntax.html#elements-0) defines a number of void elements that don't have a close tag. In the XML serialization (<br />) they are correctly tokenized, with the attribute IsEmptyElement set to true, but in the HTML serialization (<br>) the same attribute is false.

Is this something that should be managed by the client application, or is it a bug in HtmlKit?
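If the client application ends up handling this itself, one workaround is a small lookup of void element names combined with the tokenizer's flag. This is a sketch under assumptions, not HtmlKit's behavior: `VoidElements`, `IsVoid`, and `IsEffectivelyEmpty` are hypothetical names, and the list follows the current HTML Standard, so older drafts such as the one linked above may list additional names.

```csharp
using System;
using System.Collections.Generic;

public static class VoidElements
{
    // Void element names per the current HTML Standard (assumption: this is
    // the list the application cares about). Tag names are case-insensitive.
    static readonly HashSet<string> Names =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"
        };

    // True when the tag name denotes a void element such as <br> or <img>.
    public static bool IsVoid(string tagName)
    {
        return Names.Contains(tagName);
    }

    // Treats a tag as empty when the tokenizer already flagged it
    // (for example <br />) or when its name is a void element (<br>).
    public static bool IsEffectivelyEmpty(string tagName, bool isEmptyElement)
    {
        return isEmptyElement || IsVoid(tagName);
    }
}
```

With a tag token in hand, the check could be written as `VoidElements.IsEffectivelyEmpty(tag.Name, tag.IsEmptyElement)`, which gives the same answer for `<br>` and `<br />`.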
