
htmlkit's People

Contributors

dependabot[bot], jstedfast

htmlkit's Issues

Get text from HTML document

This is basically a question. I need to retrieve the text from an HTML document without building a DOM, to maximize performance. Can HtmlKit help me achieve that? I would appreciate your advice.
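One way to approach this is to read tokens directly and keep only the character data. The following is a minimal sketch, not an official HtmlKit recipe: the class and method names (TextExtractor, ExtractText) are hypothetical, and the handling of <style> rests on the assumption that its content arrives as ordinary data tokens.

```csharp
using System.IO;
using System.Text;
using HtmlKit;

public static class TextExtractor
{
    // Concatenates the character data of an HTML document without building a DOM.
    public static string ExtractText(TextReader reader)
    {
        var tokenizer = new HtmlTokenizer(reader);
        var text = new StringBuilder();
        var insideStyle = false;

        while (tokenizer.ReadNextToken(out HtmlToken token))
        {
            switch (token.Kind)
            {
                case HtmlTokenKind.Tag:
                    var tag = (HtmlTagToken)token;
                    // Assumption: <style> content is reported as plain data,
                    // so remember when we are inside it and skip the CSS.
                    if (tag.Id == HtmlTagId.Style)
                        insideStyle = !tag.IsEndTag;
                    break;
                case HtmlTokenKind.Data:
                case HtmlTokenKind.CData:
                    if (!insideStyle)
                        text.Append(((HtmlDataToken)token).Data);
                    break;
                // ScriptData, Comment and DocType tokens are ignored.
            }
        }

        return text.ToString();
    }
}
```

Whitespace between tags is kept exactly as it appears in the source, so callers may want to collapse or trim it afterwards.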

HtmlKit manipulates attribute values when they contain HTML special entities

Describe the bug
HtmlKit manipulates attribute values when they contain HTML special entities!
We expect a returned attribute value to be literally equal to the text in the input stream!

Platform (please complete the following information):

  • OS: Windows & Linux
  • .NET Framework: .NET 6.0 & .NET Framework 4.8
  • HtmlKit Version: 1.1.0

To Reproduce
Steps to reproduce the behavior:

  1. Create a new Console application
  2. Add the following HTML file in the project and mark it as Copy if newer:
    <!DOCTYPE html>
    
    <html lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="utf-8" />
        <title></title>
    </head>
    <body>
        <a href="https://google.com/?q=val-&laquo;-val" name="val-&laquo;">The First Link</a>
        <br />
        <a href="https://google.com/?q=val-&amp;-val" name="val-&amp;">The Second Link</a>
    </body>
    </html>
    
  3. Copy-Paste the following code in Program.cs file:
    using System;
    using System.IO;
    using HtmlKit;
    
    namespace HtmlKitTestProject
    {
        internal class Program
        {
            static void Main(string[] args)
            {
                using var stream = new FileStream("index.html", FileMode.Open, FileAccess.Read);
                using var reader = new StreamReader(stream);
    
                var tokenizer = new HtmlTokenizer(reader);
                HtmlToken token;
    
                while (tokenizer.ReadNextToken(out token))
                {
                    switch (token.Kind)
                    {
                        case HtmlTokenKind.Tag:
                            var tag = (HtmlTagToken)token;
    
                            if (tag.Id != HtmlTagId.A)
                                continue;
    
                            foreach (var attribute in tag.Attributes)
                            {
                                if (attribute.Value != null)
                                    Console.WriteLine(" {0}={1}", attribute.Name, $"{attribute.Value}");
                                else
                                    Console.WriteLine(" {0}", attribute.Name);
                            }
                            break;
                    }
                }
    
                Console.ReadLine();
            }
        }
    }
    
  4. Run the project and check the output.

Expected behavior
The HTML file contains attributes whose values include HTML special entities (&laquo; and &amp;).
When an attribute value is returned, it should be literally equal to the text in the input stream. Instead, the entities are converted to their decoded characters (« and &)!
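To make the reported difference concrete, the sketch below decodes the two name values from the sample file with System.Net.WebUtility from the .NET base class library. It does not use HtmlKit and says nothing about HtmlKit's internals; EntityDemo and Decode are hypothetical names chosen for illustration.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Decodes HTML character references with the BCL decoder (not HtmlKit).
    public static string Decode(string raw) => WebUtility.HtmlDecode(raw);

    public static void Main()
    {
        Console.WriteLine(Decode("val-&laquo;")); // the string val-«
        Console.WriteLine(Decode("val-&amp;"));   // the string val-&
    }
}
```

Once a value has been decoded, the original spelling of the entity is lost: re-encoding the decoded string will not necessarily give back &laquo; exactly as it was written in the file.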

Active development?

Hi,

I am evaluating whether an HTML parser could be a better fit for my mjml project: https://github.com/sebastianStehle/mjml-net

Your library seems to be one of the fastest solutions. Nice project and very clean code :)

In my results, your solution is _HtmlReader; _Html and _Html2 are HtmlAgilityPack and another solution. I am wondering whether this project is under active development and whether you would consider performance improvements and/or PRs.

Your library is already pretty fast, but I wonder if it could be improved further by keeping the tokens as fields (and properties), so that you can initialize names with pointers into the CharBuffer and therefore avoid a lot of allocations.

Hello!

Hello

I don't intend to fragment your library.

I just want to use it with the HtmlRenderer (https://github.com/prepare/HTML-Renderer),
so I have to change a few things to make it fit the HtmlRenderer.

I'm a young developer, so feel free to make suggestions or comments.
Thank you for your work :)

Reliability of HtmlTokenizer.LineNumber/LinePosition

Not really an issue, but...

Stumbled upon this brilliant library while considering writing a tokenizer myself, in order to locate, even semi-reliably, the position of an element inside the source HTML without resorting to e.g. regex. I need it for patching HTML files (e.g. automatically inserting <script> or elements at given positions without touching the surrounding HTML).

Am I right in thinking that LineNumber/LinePosition will be off by 1, iff the last token was an HtmlDataToken (due to lookahead)? Or are there any edge cases I'm missing that you can think of? I can't quite wrap my head around the HTML tokenization spec at the moment.
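A quick way to investigate this empirically is to log the tokenizer's reported position right after each token is read and compare it against a known input. The sketch below only uses the LineNumber and LinePosition properties named above; TokenPositions and Trace are hypothetical names, and the code makes no claim about which values are reported for data tokens.

```csharp
using System.Collections.Generic;
using System.IO;
using HtmlKit;

public static class TokenPositions
{
    // Records the tokenizer's reported position right after each token is read.
    public static List<string> Trace(TextReader reader)
    {
        var tokenizer = new HtmlTokenizer(reader);
        var trace = new List<string>();

        while (tokenizer.ReadNextToken(out HtmlToken token))
        {
            trace.Add($"{token.Kind}: line {tokenizer.LineNumber}, position {tokenizer.LinePosition}");
        }

        return trace;
    }
}
```

Running it over a small file that mixes tags and text on several lines should show whether the position after a data token differs from what you expect.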

method rename proposal

Just look at the method names and the documentation comments (HTML5).

This is my proposal:
71cf4e7

It makes code review and study easier!
