Comments (3)
No, it looks like wr.WriteString(tokenizer.Token().String()) is really not the right thing to do for html.TextToken tokens. Consider the following test case:
w.Write(blackfriday.MarkdownCommon([]byte(`Here are <script> some "quotes".`)))
MarkdownCommon has HTML_SANITIZE_OUTPUT turned on, and the HTML it produces is completely incorrect:
<p>Here are &lt;script&gt; some &amp;ldquo;quotes&amp;rdquo;.&lt;/p&gt;
When rendered, the HTML appears as:
Here are <script> some &ldquo;quotes&rdquo;.</p>
However, using wr.Write(tokenizer.Raw()) instead produces correct sanitized HTML:
<p>Here are &lt;script&gt; some &ldquo;quotes&rdquo;.</p>
Which renders as:
Here are <script> some “quotes”.
That test case should really be added. It's so weird that currently all blackfriday tests pass despite the above. I just looked, and there is test coverage in TestRawHtmlTag, but it has incorrect "expected" values.
from blackfriday.
FWIW, here's an MVP patch that you can use as a reference/starting point. It's probably not this simple, though...
diff --git a/sanitize.go b/sanitize.go
index 92a0cc3..7f06c34 100644
--- a/sanitize.go
+++ b/sanitize.go
@@ -72,9 +72,9 @@ func sanitizeHtmlSafe(input []byte) []byte {
 	for t := tokenizer.Next(); t != html.ErrorToken; t = tokenizer.Next() {
 		switch t {
 		case html.TextToken:
-			// Text is written escaped.
-			wr.WriteString(tokenizer.Token().String())
-		case html.SelfClosingTagToken, html.StartTagToken:
+			// TODO: This needs to be verified and justified.
+			wr.Write(tokenizer.Raw())
+		case html.StartTagToken:
 			// HTML tags are escaped unless whitelisted.
 			tag, hasAttributes := tokenizer.TagName()
 			tagName := string(tag)
@@ -107,6 +107,8 @@ func sanitizeHtmlSafe(input []byte) []byte {
 			} else {
 				wr.WriteString(html.EscapeString(string(tokenizer.Raw())))
 			}
+		case html.SelfClosingTagToken:
+			fallthrough // Currently, it can be handled identically to EndTagToken.
 		case html.EndTagToken:
 			// Whitelisted tokens can be written in raw.
 			tag, _ := tokenizer.TagName()
Hey,
Thanks for the test cases and debugging.
I was a bit surprised at first, too, but I think the behaviour of the sanitization code is actually correct for the quotes and the unicode. You should definitely set the charset for the generated HTML - within Go, it's Unicode text, so unless you encode it into some other charset (e.g. ISO8859-1 which browsers assume by default) you must make sure it's interpreted as UTF-8 as it should be.
As for the example with the <script> tag, that's a bit confusing indeed. The problem is that the HTML5 parser switches into raw text parsing mode as soon as it sees the opening <script>. Everything after that is then parsed in a special mode until you run into a </script> tag, which is why the </p> ends up being escaped. This is what a browser does, too. I don't entirely understand why the quotes end up already escaped in the token's text (which leads to the double escaping); that might be a bug in the HTML5 parser.
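To make the double escaping concrete, here is a small sketch that uses only the standard library's html.EscapeString as a stand-in for the "write text escaped" step. It does not run the real tokenizer; it assumes the text after <script> reaches the sanitizer with its entities still in escaped form, which is what raw text mode appears to do. The escapeText name is invented for this example.

```go
package main

import (
	"fmt"
	"html"
)

// escapeText is a hypothetical stand-in for the "text is written escaped"
// branch of the sanitizer: it escapes whatever text it is handed.
func escapeText(text string) string {
	return html.EscapeString(text)
}

func main() {
	// Assumed input: text after <script>, still carrying its entities
	// and the literal closing tag, as raw text mode would leave it.
	raw := " some &ldquo;quotes&rdquo;.</p>"

	// Escaping text that is already escaped turns "&" into "&amp;",
	// and the literal </p> into visible text.
	fmt.Println(escapeText(raw))

	// Writing the bytes through unchanged keeps the entities intact.
	fmt.Println(raw)
}
```

The first line printed is the double-escaped form from the bug report; the second corresponds to what writing the raw bytes would produce for the same span.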
In any case, I found a way to effectively disable the raw text parsing mode, which is fine as we escape those tags anyway. See pull request #75.