
Comments (3)

dmitshur commented on June 10, 2024

No, it looks like wr.WriteString(tokenizer.Token().String()) is really not the right thing to do for html.TextTokens. Consider the following test case:

w.Write(blackfriday.MarkdownCommon([]byte(`Here are <script> some "quotes".`)))

MarkdownCommon has HTML_SANITIZE_OUTPUT turned on, and the HTML it produces is completely incorrect:

<p>Here are &lt;script&gt; some &amp;ldquo;quotes&amp;rdquo;.&lt;/p&gt;

When rendered, the HTML appears as:

Here are <script> some &ldquo;quotes&rdquo;.</p>

However, using wr.Write(tokenizer.Raw()) instead produces correctly sanitized HTML:

<p>Here are &lt;script&gt; some &ldquo;quotes&rdquo;.</p>

Which renders as:

Here are <script> some “quotes”.

That test case should really be added. It's so weird that all blackfriday tests currently pass despite the above. I just looked, and there is test coverage in TestRawHtmlTag, but it has incorrect "expected" values.
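To make the difference between the two outputs concrete, here is a minimal sketch that uses only the standard library's html package. It is not blackfriday code: renderText is a hypothetical helper that approximates browser rendering by decoding each character reference exactly once.

```go
package main

import (
	"fmt"
	"html"
)

// renderText is a hypothetical helper (not part of blackfriday). It
// approximates what a browser displays for a run of escaped text by
// decoding each character reference exactly once.
func renderText(escaped string) string {
	return html.UnescapeString(escaped)
}

func main() {
	// Text portion of the incorrect output quoted above.
	doubleEscaped := `Here are &lt;script&gt; some &amp;ldquo;quotes&amp;rdquo;.&lt;/p&gt;`
	// Text portion of the output produced with tokenizer.Raw().
	singleEscaped := `Here are &lt;script&gt; some &ldquo;quotes&rdquo;.`

	// In the first case "&amp;ldquo;" decodes to the literal text "&ldquo;",
	// so the entity name stays visible to the reader.
	fmt.Println(renderText(doubleEscaped))
	// In the second case "&ldquo;" decodes to a real curly quote.
	fmt.Println(renderText(singleEscaped))
}
```

A check along these lines could help catch "expected" values that are themselves double-escaped.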

from blackfriday.

dmitshur commented on June 10, 2024

FWIW, here's an MVP patch that you can use as a reference/starting point. It's probably not this simple, though...

diff --git a/sanitize.go b/sanitize.go
index 92a0cc3..7f06c34 100644
--- a/sanitize.go
+++ b/sanitize.go
@@ -72,9 +72,9 @@ func sanitizeHtmlSafe(input []byte) []byte {
    for t := tokenizer.Next(); t != html.ErrorToken; t = tokenizer.Next() {
        switch t {
        case html.TextToken:
-           // Text is written escaped.
-           wr.WriteString(tokenizer.Token().String())
-       case html.SelfClosingTagToken, html.StartTagToken:
+           // TODO: This needs to be verified and justified.
+           wr.Write(tokenizer.Raw())
+       case html.StartTagToken:
            // HTML tags are escaped unless whitelisted.
            tag, hasAttributes := tokenizer.TagName()
            tagName := string(tag)
@@ -107,6 +107,8 @@ func sanitizeHtmlSafe(input []byte) []byte {
            } else {
                wr.WriteString(html.EscapeString(string(tokenizer.Raw())))
            }
+       case html.SelfClosingTagToken:
+           fallthrough // Currently, it can be handled identically to EndTagToken.
        case html.EndTagToken:
            // Whitelisted tokens can be written in raw.
            tag, _ := tokenizer.TagName()


mprobst commented on June 10, 2024

Hey,

Thanks for the test cases and debugging.

I was a bit surprised at first, too, but I think the behaviour of the sanitization code is actually correct for the quotes and the Unicode. You should definitely set the charset for the generated HTML. Within Go, it's Unicode text, so unless you encode it into some other charset (e.g. ISO-8859-1, which browsers assume by default), you must make sure it's interpreted as UTF-8, as it should be.

As for the example with the <script> tag, that's a bit confusing indeed. The problem is that the HTML5 parser switches into raw text parsing mode as soon as it sees the opening <script>. Everything after that is then parsed in a special mode until you run into a </script> tag - that's why the </p> ends up being escaped. This is what a browser does, too. I don't entirely understand why the quotes end up as an escaped "&quot;" in the token's text (which leads to the double escaping), that might be a bug in the HTML5 parser.

In any case, I found a way to effectively disable the raw text parsing mode, which is fine as we escape those tags anyway. See pull request #75.
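One possible explanation for the double escaping, offered as an assumption rather than a confirmed diagnosis: if the tokenizer decodes character references in ordinary text but leaves raw text (such as the contents of a script element) undecoded, then escaping the token text again would double-escape only in the raw case. The sketch below models that with the standard library's html package. reescape and its decoded flag are hypothetical and do not reflect the tokenizer's actual API.

```go
package main

import (
	"fmt"
	"html"
)

// reescape is a hypothetical model of how a sanitizer might write out a
// text token. decoded reports whether character references were decoded
// before escaping. The assumption (not verified here) is that this happens
// for ordinary text but not for raw text inside a <script> element.
func reescape(raw string, decoded bool) string {
	text := raw
	if decoded {
		text = html.UnescapeString(raw)
	}
	return html.EscapeString(text)
}

func main() {
	raw := `some &ldquo;quotes&rdquo;.</p>`

	// Ordinary text: decode, then escape. The curly quotes survive as
	// literal characters and only the markup is escaped.
	fmt.Println(reescape(raw, true))

	// Raw text: no decoding, so the "&" of each entity is escaped again.
	fmt.Println(reescape(raw, false))
}
```

Under this model, the second call reproduces the tail of the incorrect output reported at the top of the thread.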
