microcosm-cc / bluemonday Goto Github PK

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

Home Page: https://github.com/microcosm-cc/bluemonday

License: Other

Go 99.75% Makefile 0.25%

sanitization html security xss go owasp allowlist golang

bluemonday's Issues

Matching() does not do what's expected.

Hi @buro9,

I think there's a bug in bluemonday. According to your comment,

So class="foo bar bash" is fine and would be allowed, but class="javascript:alert('XSS')" would fail and be stripped, and class="><script src='http://hackers.org/XSS.js'></script>" would also fail and be stripped.

I've added tests for that and noticed they failed.

I tried using bluemonday directly, and saw behavior counter to what you suggested. All three class="foo bar bash", class="javascript:alert(123)", and class="><script src='http://hackers.org/XSS.js'></script>" were passed through, without failing and being stripped.

$ goe --quiet 'p:=bluemonday.UGCPolicy(); p.AllowAttrs("class").Matching(bluemonday.SpaceSeparatedTokens).OnElements("span"); println(p.Sanitize(`Hello <span class="foo bar bash">there</span> world.`))'
Hello <span class="foo bar bash">there</span> world.

$ goe --quiet 'p:=bluemonday.UGCPolicy(); p.AllowAttrs("class").Matching(bluemonday.SpaceSeparatedTokens).OnElements("span"); println(p.Sanitize(`Hello <span class="javascript:alert(123)">there</span> world.`))'
Hello <span class="javascript:alert(123)">there</span> world.

$ goe --quiet 'p:=bluemonday.UGCPolicy(); p.AllowAttrs("class").Matching(bluemonday.SpaceSeparatedTokens).OnElements("span"); println(p.Sanitize(`Hello <span class="><script src='"'http://hackers.org/XSS.js'"'></script>">there</span> world.`))'
Hello <span class="&gt;&lt;script src=&#39;http://hackers.org/XSS.js&#39;&gt;&lt;/script&gt;">there</span> world.

Looking at the code, the reason is clear.

SpaceSeparatedTokens = regexp.MustCompile(`[\s\p{L}\p{N}_-]+`)

if ap.regexp.MatchString(htmlAttr.Val) {
    cleanAttrs = append(cleanAttrs, htmlAttr)
}

The regex doesn't have ^ at the front nor $ at the end, so it matches any substring. That's why <span class="javascript:alert(123)">there</span> is not stripped, but <span class=":::::">there</span> is stripped as expected.

How to fix that... I leave to you (there's more than one way, and I'm not a fan of regexes).

AllowElements without attributes

AllowElement don't work with elements that don't have any attributes, for example, ,

. If these attributes are not listed in policy.go.

p = bluemonday.Policy{}
p.AllowElements("big")
output := p.Sanitize("test")

output is "test"

It is not said that bluemonday is only html5-sanitizer and it is good to allow users to customize their policies more.

Stripping non-tags

bluemonday.StrictPolicy().Sanitize("a<b")

returns "a".

Is there a reason its not looking for an actual tag, or is this a mistake?

apostrophes get turned into HTML entities - take 2

I realize this was asked before, but I would like to renew the discussion for my use case.

I am getting data that is being entered into an html form by a user, and that data will be saved in a database eventually and redisplayed later. Obviously such data needs to be sanitized, so I turned to bluemonday. If a user enters an apostrophe in the data, the apostrophe is coming through to my application in the form data as an apostrophe, so GO is not converting it. When I run it through bluemonday, it gets converted to an html entity.

bluemonday states that it is designed for sanitizing html that will be displayed as html, so I understand why this is correct behavior from bluemonday's perspective.

Soo..., I am asking for a new sanitizer policy that allows bluemonday to be used as a basic text sanitizer for my scenario, where I am not trying to have the end result be html, but still I want the goal that XSS attacks are cleaned out. I imagine it could be a simple process of running the result through UnescapeString, which I will do for now, but you guys are the experts and might have other thoughts as to why this may or may not be adequate.

Is Sanitize* safe to use by multiple goroutines

At a glance it looks like there is no reason to create new policy per each sanitized value (well, it's per goroutine, but in practice this often means - per each value), after initial policy setup it should be safe to use by multiple goroutines, but doc doesn't mention this (which apply default meaning: not safe). So, if it's safe, please document this.

Some AllowAttrs.Matching regexp are not anchored

It looks like this was already fixed in #3. Here are some examples, but my point is, all regexps should be reviewed and anchored unless there is a really good reason to not do this:

func (p *Policy) AllowStandardAttributes() {
	p.AllowAttrs(
		"lang",
	).Matching(regexp.MustCompile(`[a-zA-Z]{2,20}`)).Globally()
	p.AllowAttrs("id").Matching(
		regexp.MustCompile(`[a-zA-Z0-9\:\-_\.]+`),
	).Globally()

func (p *Policy) AllowTables() {
	p.AllowAttrs(
		"scope",
	).Matching(
		regexp.MustCompile(`(?i)(?:row|col)(?:group)?`),
	).OnElements("td", "th")
	p.AllowAttrs("nowrap").Matching(
		regexp.MustCompile(`(?i)|nowrap`),
	).OnElements("td", "th")

Also it's probably good idea to fix examples in README in same way.

Closing anchor and font tags mixed up

With the following sample program, in the output, the closing anchor tag is missing and in its place is an erroneous closing font tag:

package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    in := `<font face="Arial">No link here. <a href="http://link.com">link here</a>.</font> Should not be linked here.`
    p := bluemonday.UGCPolicy()
    p.AllowAttrs("color").OnElements("font")
    fmt.Printf("'%s'\n", p.Sanitize(in))
}

Am I doing something wrong, or is this a bug?

Enable Sourcegraph

I want to use Sourcegraph for bluemonday code search, browsing, and usage examples. Can an admin enable Sourcegraph for this repository? Just go to https://sourcegraph.com/github.com/microcosm-cc/bluemonday. (It should only take 30 seconds.)

Thank you!

A tag unnormal output

	p := bluemonday.UGCPolicy()
	//p.AllowAttrs("src").Matching(regexp.MustCompile(`(?i)mailto|https?`)).OnElements("img")
	centents := []string{
		`<script>alert(/xss/);</script>;`,
		`<script src="http://xxx.xx/xx.js"></script>`,
		`<body onload=alert('test1')>`,
		`<b onmouseover=alert('Wufff!')>click me</b>;`,
		`<b onmouseover=alert('Wufff!')>click me!</b>`,
		`<a href="http://www.google.com/"><img src="https://ssl.gstatic.com/accounts/ui/logo_2x.png"/></a>`,
		`<a href="javascript:alert('XSS1')" onmouseover="alert('XSS2')">XSS<a>`,
		`<a href="http://www.google.com/" onmouseover="alert('XSS2')">XSS<a>`,
	}

	for _, v := range centents {
		t.Log(p.Sanitize(v))
	}

<a href="http://www.google.com/" onmouseover="alert('XSS2')">XSS<a>
ouput
<a href="http://www.google.com/" rel="nofollow">XSS

Force html attribute to specific values

It would be useful to be able to forcibly add an attribute to elements. You have similar functionality with the rel="nofollow" on links. Something like:

Policy.AllowAttrs("rel").Force("nofollow").OnElements("a")

or:

Policy.ForceAttrs("rel").Value("nofollow").OnElements("a")

skip nested tags by attrs bug

Example:

p := bluemonday.NewPolicy()
p.AllowElements("tag1", "tag2")
res := p.Sanitize("<tag1><tag2>abc</tag2></tag1><p>test</p>")
fmt.Println(res)

Output:
abc</tag1>test
Correct output:
abctest

Add OmitSkipElements method

Recently, I felt a need to create a policy in my project that should allow iframe, embed and script. But I could not get the desired results . I suspect the reason being this.

Can you suggest any other workaround for this Or we can have one more function like OmitSkipElements which omits the given elements(like 'iframe') if present from the setOfElementsToSkipContent map.

tag with semver for vgo

Hi,

Maybe it's time to tag it to make it friendly with vgo and make our go.mod more readable ?

Logo proposal

Hey @buro9 . I desinged a logo for you. I put the initials of the project together and I chose a blue color. WDYT? Please tell me your thoughts.

"javascript" in sanitize.go.

Hi I was using bluemonday lib and in the sanitize method I noticed you are expecting the tagname to be "javascript" and not "script". I was unable to understand why that was?

For my case I thought of changing it to "script" but after I do that I get some failing test cases.

--- FAIL: TestAntiSamy (0.00s)
    sanitize_test.go:769: test 59 failed;
        input   : <SCRIPT>document.write("<SCRI");</SCRIPT>PT SRC="http://ha.ckers.org/xss.js"></SCRIPT>
        output  : PT SRC="http://ha.ckers.org/xss.js">
        expected: PT SRC=&#34;http://ha.ckers.org/xss.js&#34;&gt;
=== RUN   TestXSS
--- FAIL: TestXSS (0.00s)
    sanitize_test.go:1138: test 2 failed;
        input   : <SCRIPT>document.write("<SCRI");</SCRIPT>PT SRC="http://ha.ckers.org/xss.js"></SCRIPT>
        output  : PT SRC="http://ha.ckers.org/xss.js">
        expected: PT SRC=&#34;http://ha.ckers.org/xss.js&#34;&gt;
    sanitize_test.go:1138: test 73 failed;
        input   : <IMG """><SCRIPT>alert("XSS")</SCRIPT>">
        output  : ">
        expected: &#34;&gt;

Which should not happen as far as I understand. Can you help me understand this behaviour? And if required investigate it further?

I apologise if it's something obvious that I have missed here.

apostrophes get turned into HTML entities - '

Is there any way to disable this behavior, or is this a bug?

Attribute Filter/Transform Callbacks

It would be helpful to have a hook to allow custom attribute filtering. I propose something much simpler than #24 that would integrate with the existing builder syntax:

// AttrTransform is a user provided function to manipulate the value
// of an attribute
type AttrTransform func(v string) string

func main() {
  tf := func(v string) string {
    return strings.ToUpper(v)
  }

  p := bluemonday.NewPolicy()
  p.TransformAttrs(tf, "style").OnElements("div")
}

This would provide a convenient way to extend bluemonday without requiring the user to parse the HTML before or after calling Sanitize()

My goal is to implement CSS style attribute filtering for my project without forking bluemonday.

AllowElements iframe doesn't work

Hey

I've tried to add iframe to the whitelist, but it still sanitize iframes.

Example code:

package main

import (
	"github.com/microcosm-cc/bluemonday"
	"fmt"
)

func main() {
	raw := "<iframe></iframe>"
	p := bluemonday.NewPolicy()
	p.AllowElements("iframe")
	 res := p.Sanitize(raw)
	 if res != raw {
	 	fmt.Printf("got: %s\n", res)
	 } else {
	 	fmt.Println("happy, happy, joy, joy!")
	}
}

Any help will be awesome.

Thanks,
Jonathan

Inline Images get stripped

Using this content:

<img src="data:image/gif;base64,R0lGODdhEAAQAMwAAPj7+FmhUYjNfGuxYY
    DJdYTIeanOpT+DOTuANXi/bGOrWj6CONzv2sPjv2CmV1unU4zPgISg6DJnJ3ImTh8Mtbs00aNP1CZSGy0YqLEn47RgXW8amasW
    7XWsmmvX2iuXiwAAAAAEAAQAAAFVyAgjmRpnihqGCkpDQPbGkNUOFk6DZqgHCNGg2T4QAQBoIiRSAwBE4VA4FACKgkB5NGReAS
    FZEmxsQ0whPDi9BiACYQAInXhwOUtgCUQoORFCGt/g4QAIQA7">

and this policy:

	// Define a policy, we are using the UGC policy as a base.
	p := bluemonday.UGCPolicy()

	// Allow images to be embedded via data-uri
	p.AllowDataURIImages()

i get an empty string back.... whats wrong with my policy?

cheers max

Recent change results in html attributes potentially being repeated twice.

See 63440de#commitcomment-7587904 for details.

`make test` fails on master

I just checked out master and ran make test and it fails.

$ make test
...
=== RUN   TestXSS
--- FAIL: TestXSS (0.00s)
	sanitize_test.go:1138: test 74 failed;
		input   : <IMG SRC=`javascript:alert("RSnake says, 'XSS'")`>
		output  : 
		expected: <img src="%60javascript:alert%28%22RSnake">
	sanitize_test.go:1138: test 61 failed;
		input   : <IMG SRC="jav&#x0D;ascript:alert('XSS');">
		output  : 
		expected: <img src="jav%0Dascript:alert%28%27XSS%27%29;">
...

make lint also fails:

$ make lint
example_test.go is in package bluemonday_test, not bluemonday

I'm running go1.8.3.

$ go version
go version go1.8.3 darwin/amd64

Was Sanitized Flag

Is there a way to check if the input was sanitized? Maybe something like the code below?

// . . .
sanitized := p.Sanitize(unSanitized)
if p.WasSanitized {
    fmt.Println("Needed to be Sanitized")
}
// . . .

If there isn't a way to do this, could someone add it? Or would you recommend comparing the unsanitized input to the sanitized input?

Thanks! :)

Feature Request: Ability for external packages to perform custom URL filtering.

I would like to consider/try out using data URI scheme for image urls for my own Markdown files (not user generated content). After the discussion in #5, I understand it's not appropriate to increase the API size of this package significantly trying to add support for data URIs, as it's not something people would want to use when sanitizing user generated content and it's not possible for bluemonday to do a sufficient job of sanitizing it anyway.

However, in the spirit of keeping this package highly configurable and usable for custom needs, I propose a very simple single API addition that would allow external packages to bring in their own validation logic for URLs into bluemonday.

// Given a correctly parsed URL, this func should return true if the URL is to be allowed,
// and return false if this URL should not be allowed.
type UrlFilter func (*url.URL) bool

func (p *Policy) AllowURLSchemeWithCustomFiltering(scheme string, urlFilter UrlFilter)  *Policy

(Rough draft of the API, naming could use improvement and I'm very open to other changes.)

It would imply/automatically set RequireParseableURLs to true.

Next, before being allowed, each URL would be passed through the filter func, which can return true if the URL is to be allowed, and false if it should be filtered out.

That will allow me to use my own logic to decide which URLs I want to keep (implemented outside of this package), and not bloat the API of bluemonday, allowing it to remain secure and high quality.

p.AllowDocType() opens a vector for inserting unsanitized HTML

Reported via email:

package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func TestDoctype() {
	p := bluemonday.UGCPolicy()
	p.AllowDocType(true)

	original := `<!DOCTYPE html="&#34;&#62;<img src=x onerror=&#34;alert(/XSS/)">`
	html := p.Sanitize(original)

	// Output:
	// Original: <!DOCTYPE html="&#34;&#62;<img src=x onerror=&#34;alert(/XSS/)">
	// Sanitized: <!DOCTYPE html=""><img src=x onerror="alert(/XSS/)">
	fmt.Printf("Original: %s\nSanitized: %s\n",
		original, html)
}

func main() {
	TestDoctype()
}

Custom handlers

What about adding custom handlers for html elements?
Without it user is forced to parse html second time by himself. Also custom handler allows to use more complicated content-dependent sanitizing.

For example, emails. Emails stored as original, but displayed sanitized and to display embed images it is needed to replace content-id ("cid:url") with real url.

p.SetCustomElementHandler(
    func(token html.Token) bluemonday.HandlerResult {
        for i := range token.Attr {
            // possible image locations
            if token.Attr[i].Key == "src" || token.Attr[i].Key == "background" {
                cid := token.Attr[i].Val // get content-id
                url := GetUrlFromCid(cid)
                token.Attr[i].Val = url
            }
        }

        return bluemonday.HandlerResult{
            Token:         token,
            SkipContent:   false,
            SkipTag:       false,
            DoNotSanitize: false,
        }
    },
)

Or blog-posts, editor can form html with special values in custom attributes (e.g. x-my-attribute) and while saving to database, backend can process these attributes while sanitizing without needs to parse html again -> speedup.

I suggest this syntax for custom handlers, it receives html.Token and returns struct with modified token and flags:

SkipContent - skip content of current html-element only
SkipTag - skip closing tag for current html-element only
DoNotSanitize - do not apply sanitizing rules for this html-element only

These flags allows to sanitize with more complex rules that Regexp cannot handle. For example, golang regexp cannot use forward lookup.

Prevent escaping special characters

Hello,

We have this snippet:

package main

import (
        "fmt"

        "github.com/microcosm-cc/bluemonday"
)

func main() {
        p := bluemonday.UGCPolicy()
        html := p.Sanitize(
                `"Hello world!"  <script>alert(document.cookie)</script>`,
        )

        // Output:
        // &#34;Hello world!&#34;
        fmt.Println(html)
}

Which produces the following output:

&#34;Hello world!&#34;

We'd like to prevent this script from escaping the special characters like ", is there any way we can tell bluemonday to not escape special chars by default?

URLs with multiple query parameters escape the `&` delimiter incorrectly

Example the following test case exists which tests 1 query parameter

{
    in:       `<a href="?q=1">`,
    expected: `<a href="?q=1" rel="nofollow">`,
},

Add the following test case to produce a failure with more than one query parameter.

{
    in:       `<a href="?q=1&r=2">`,
    expected: `<a href="?q=1&r=2" rel="nofollow">`,
},

the result would be <a href="?q=1&r=2" rel="nofollow"> which breaks the queries

Require referrerpolicy="no-referrer" on images?

I'd like bluemonday to add referrerpolicy="no-referrer" to all img tags, but I don't see a way to do that with the current API. Maybe we could get an API like RequireReferrerPolicyOnImages(policy ReferrerPolicy)?

The referrerpolicy attribute is experimental but has decent support.

When a self-closing iframe is present, content afterward does not get sanitized.

package main

import (
	"fmt"
	"html"

	"github.com/microcosm-cc/bluemonday"
)

func main() {
	input := `<iframe /><script type="text/javascript">doBadStuff();</script>`

	fmt.Println(html.UnescapeString(bluemonday.UGCPolicy().Sanitize(input)))
	// output:
	// <script type="text/javascript">doBadStuff();</script>
}

Allow all body/head/title and only do xss removal

Whats the easiest way to allow all default/basic html tags
especially body/head/title is always removed, even if i allow them.

    p.AllowElements("html", "head", "title")

looking for a quick&dirty xss remover

Feature Request: Ability to filter URLs on a finer grained level.

Suppose I would like to allow using data URI scheme for image urls. For example:

<img src="data:image/png;base64,iVBORw0KGgoAAAANS...K5CYII=">

Currently, I can achieve that by doing:

p := bluemonday.UGCPolicy()
p.AllowURLSchemes("data")

However, that will allow all kinds of things, including "data:text/javascript;charset=utf-8,alert('hi');" or other unexpected values.

What I'd like to do is be able to filter on a finer level, similarly to what's possible with attributes and elements.

For example, I would imagine an API something like this:

p.RequireBase64().AllowMimeTypes("image/png", "image/jpeg").OnURLSchemes("data")

And it would make sure to filter out anything that's not one of those two mime types, not base64, or contains charset, and is valid base64 encoding (i.e., doesn't contain other characters, no query and no fragment).

What are your thoughts on this proposal?

Allow `RequireParsableURL` method to be applied selectively on tags.

Hi, currently it looks like RequireParsableURL cannot be applied to tags selectively.

i.e. something like -> policy.AllowParsabelURL(true).OnElements(<tag>).

I want to allow any url in src attr on img tag and nowhere else. Please update if you think this is a feature worth having.

Thanks.

css sanitization in style attributes

Per the bluemonday docs:

We are not yet including any tools to help whitelist and sanitize CSS. Which means that unless you wish to do the heavy lifting in a single regular expression (inadvisable), you should not allow the "style" attribute anywhere.

We use bluemonday and would like allow a limited subset of CSS (essentially limit to a handful of property names) to be allowed within the style attribute.

I can dedicate a few cycles to enhancing bluemonday with this functionality, but per the contributing guidelines, it asks that I create an issue first.

Since we have the issue, I figure it's worth proposing an API and an approach.

api

For the API, I'd propose something similar to allowAttrs(), for example allowStyles('font-size', 'text-align'). We have no need for tag-specific styles, so I'd propose foregoing the complexity of the builder initially and just allowing the styles globally. This would cause an API change down the line though, should such functionality be necessary.

approach

gorilla's css tokenizer is a useful starting point - https://github.com/gorilla/css and it is a reputable and community accepted source

unfortunately, you really need a parser built on top of the tokenizer to do this work. building a simple one is fairly trivial, but only for this use case (declarations in style attributes). If you want to sanitize inline css in <style> tags, or external css, then the task becomes much more difficult.

https://github.com/aymerick/douceur has a parser built on top of gorilla's tokenizer that purports to accomplish this, but it's unclear how well supported it is or how high the quality is… one issue points out an infinite loop

fortunately, it has an acceptable implementation of declaration parsing for our use case: https://github.com/aymerick/douceur/blob/master/parser/parser.go#L163-L201

given that, it seems like it's worth using for now. if the scope of css sanitization increases, then it may be worth re-evaluating options on whether starting fresh, forking, or contributing back upstream are better alternatives

README says supported versions 1.1–1.9, but .travis.yml tests 1.1–1.11

Compare:

bluemonday/README.md

Lines 59 to 63 in 506f3da

 ### Supported Go Versions 

 bluemonday is tested against Go 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, and tip. 

 We do not support Go 1.0 as we depend on `golang.org/x/net/html` which includes a reference to `io.ErrNoProgress` which did not exist in Go 1.0.

bluemonday/.travis.yml

Lines 3 to 13 in 506f3da

 - 1.1.x 

 - 1.2.x 

 - 1.3.x 

 - 1.4.x 

 - 1.5.x 

 - 1.6.x 

 - 1.7.x 

 - 1.8.x 

 - 1.9.x 

 - 1.10.x 

 - 1.11.x

Also, Go 1.12 is out now. Is it supported?

skip elements content nested bug

Hi,

p := bluemonday.NewPolicy()
p.SkipElementsContent("tag1", "tag2")
res := p.Sanitize(`<tag1>cut<tag2></tag2>harm</tag1><tag1>123</tag1><tag2>234</tag2>`)
fmt.Println(res)

Output is: harm
But result must be an empty string.
I will submit pull request for this issue soon.

My question is:

p := bluemonday.NewPolicy()
p.SkipElementsContent("tag")
p.AllowElements("p")
res := p.Sanitize(`<tag>234<p>asd</p></tag>`)
fmt.Println(res)

Output is: <p></p>.
Is that correct to skip nested tag's content but to keep tag itself? Should it be empty string result or <p>asd</p>?

Links are stripped when they shouldn't

package main

import (
	"fmt"
	"github.com/microcosm-cc/bluemonday"
)

func main() {
	policy := bluemonday.NewPolicy()
	policy.AllowElements("div", "a")
	fmt.Println(policy.Sanitize(`<div><a href="/">link</a></div>`))
}

Output: <div>link</div>

I expect it to output <div><a>link</a></div>

Request: DisallowElements function

I'd love to be able to do something like:

p := bluemonday.UGCPolicy().DisallowElements("h1", "h2")

Would this be a plausible PR, and if so would you be interested in it? I completely get it if you don't want to support it so asking here first.

Feature request: blacklist elements from policy

Hi there,

Forgive me if this feels like an inappropriate way to cast a vote, and feel free to close if so!

I noticed in the TODO that the Blacklist concept is mentioned, and that you were interested in investigating whether this would be useful to devs. So I wanted to cast a vote in favor of this request. I'd love to be able to use the UGCPolicy and then simply take out the parts that aren't applicable to my use case, such as UGCPolicy without Blockquotes.

Thanks for your work!

Url with ascii char encoded to hex

Hi,

thank you for this awesome package. It helped me a lot on my project.
There is just one issue that I try to solve. When sanitizing the following string:
http://my-server.com/index.php?name=<script>window.onload = function() {var link=document.getElementsByTagName("a");link[0].href="http://not-real-xssattackexamples.com/";}</script>

it result in:
http://my-server.com/index.php?name=

which is find for me.

But, when the evil part of the URL is encoded in hex like:
http://your-server/index.php?name=%3c%73%63%72%69%70%74%3e%77%69%6e%64%6f%77%2e%6f%6e%6c%6f%61%64%20%3d%20%66%75%6e%63%74%69%6f%6e%28%29%20%7b%76%61%72%20%6c%69%6e%6b%3d%64%6f%63%75%6d%65%6e%74%2e%67%65%74%45%6c%65%6d%65%6e%74%73%42%79%54%61%67%4e%61%6d%65%28%22%61%22%29%3b%6c%69%6e%6b%5b%30%5d%2e%68%72%65%66%3d%22%68%74%74%70%3a%2f%2f%61%74%74%61%63%6b%65%72%2d%73%69%74%65%2e%63%6f%6d%2f%22%3b%7d%3c%2f%73%63%72%69%70%74%3e

the sanitizer didn't work.

Do I miss something on the way to use the sanitizer? Is there a way to detect such situation and sanitize the string?

Thank you.

In the UGCPolicy() the cite attribute on q should be a URL not a paragraph.

Interestingly this is the only example of a href link that does not permit a rel="nofollow" attribute.

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/q

Add support for dataset attributes

Hey :)

In my use of the library I need to support arbitrary data attributes. However, I do not know which data attributes will be sent so I cannot whitelist them.

Can you please add support for arbitrary dataset attributes which will all start with "data-"
For reference of data attributes you can refer to MDN
They have no meaning for the browser only to pass data to certain elements.

Love to hear your thoughts

Thanks,
Jonathan

Suggestion: Insert white space when stripping tags

I'm using the StrictPolicy() to strip tags from text in order to feed mongoDB full text search. The text content of adjacent elements may be visually separated by html rendering even though there is no whitespace in the text. Stripping the tags therefore merges words potentially altering search results. Here's an example:

package main

import (
    "fmt"
    "github.com/microcosm-cc/bluemonday"
)

func main() {
    userInput := "<p>Why oh why</p><p>she swallowed a fly</p>"
    searchableText := bluemonday.StrictPolicy().Sanitize(userInput)

    fmt.Println(searchableText) // Why oh whyshe swallowed a fly
}

I can easily solve this in my own code, e.g. by inserting a space before or after every block-level html element before stripping the tags.

I wondered whether this would be a generally useful feature. A general case might need configuration given that even adjacent inline elements can be visually separated through CSS.

Use of strings.ToLower() incorrectly escapes chars and allows for insertion of scripts

Reported by email:

package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func TestEncoding() {
	p := bluemonday.NewPolicy()

	original := "<scr\u0130pt>&lt;script>alert(/XSS/)&lt;/script>"
	html := p.Sanitize(original)

	// Output:
	// Original: <scrİpt>&lt;script>alert(/XSS/)&lt;/script>
	// Sanitized: <script>alert(/XSS/)</script>
	fmt.Printf("Original: %s\nSanitized: %s\n",
		original, html)
}

func main() {
	TestEncoding()
}

Note that this is more severe than even the original reporter realised as this works on the NewPolicy which is a blank policy.

An explanation was provided:

This one occurs because you doesn't escape script/style tag contents:
https://github.com/microcosm-cc/bluemonday/blob/master/sanitize.go#L219-L228
The trick is using symbol \u0130 (İ) that converts to i by
strings.ToLower, how to find it: https://play.golang.org/p/jDMRCSNigR7

Investigation reveals that strings.ToLower() was not even required, and could be omitted which results in the expected (safe) behaviour.

A change is coming in a moment.

Credit to Yandex and @buglloc for reporting this.

golang.org/x/net/html is obsoleted.

Your package import golang.org/x/net/html. This url is obsoleted, which leads to

unrecognized import path "golang.org/x/net/html"

Maybe you need change it to https://github.com/golang/net

Turn disallowed tags into html entities

It would be useful to have the option to turn stripped tags into entities instead of completely removing them. So

<script></script>

would be turned into

&lt;script&gt;&lt;/script&gt;

rel="noopener" should be added if target="_blank" is on a link

Background:
https://dev.to/ben/the-targetblank-vulnerability-by-example
https://lists.w3.org/Archives/Public/public-whatwg-archive/2015Jan/0002.html

If rel="noopener" is not present on a link that has target="_blank" then we should add it.

We already have an option to force adding target="_blank" via AddTargetBlankToFullyQualifiedLinks and should extend the functionality of that option to add the rel="noopener" to mitigate the risk of this attack.

Add functionality to set rel="noreferrer" on a,area,link

Right now we have the capability to set rel="nofollow", and we add rel="noopener" if target="_blank". Another potential useful piece of functionality would be to add rel="noreferrer" if the user makes that part of the policy.

I would be willing to write the functionality. I was thinking of the API consisting of two functions:
(p *Policy) RequireNoReferrerOnLinks() *policy
(p *Policy) RequireNoReferrerOnFullyQualifiedLinks() *policy

How to allow emojis?

Using sanitization rule below converts emojies like 😀 to ? :

func StripHtml(s string) string {
	p := bluemonday.UGCPolicy()
	p.AllowAttrs("class").OnElements("img")
	return p.Sanitize(s)
}

I'm wondering how can I allow emojis to be saved without lowering the sanitization bar too much?

fatal error: concurrent map writes

We use bluemonday in our golang service and under high load and traffic we are noticing the error below
fatal error: concurrent map writes

fatal error: concurrent map writes

goroutine 239204 [running]:
runtime.throw(0xbea50a, 0x15)
#011/usr/lib/go-1.9/src/runtime/panic.go:605 +0x95 fp=0xc42157a508 sp=0xc42157a4e8 pc=0x42bd85
runtime.mapassign_faststr(0xaf4580, 0xc4201c3380, 0xbdc0da, 0x5, 0xc42025e168)
#011/usr/lib/go-1.9/src/runtime/hashmap_fast.go:861 +0x4da fp=0xc42157a588 sp=0xc42157a508 pc=0x40d41a
git.corp.adobe.com/service/vendor/github.com/microcosm-cc/bluemonday.(*attrPolicyBuilder).OnElements(0xc420321e60, 0xc42157a650, 0x1, 0x1, 0xc420321e60)
#011/go/src/git.corp.adobe.com/service/vendor/github.com/microcosm-cc/bluemonday/policy.go:176 +0x139 fp=0xc42157a620 sp=0xc42157a588 pc=0x9be669
git.corp.adobe.com/service/jobs.sanitizeComment(0xc4203edec0, 0xc, 0x32, 0xc421041e80)

An undocumented difference between the three Sanitize* funcs.

There are now 3 funcs to perform sanitization.

Their signatures and documentation suggest they differ only in the types for input/output.

However, func (p *Policy) Sanitize(s string) string differs from the other two in behavior:

$ goe 'bluemonday.UGCPolicy().Sanitize("Hi.\n")'
(string)("Hi.")
$ goe 'string(bluemonday.UGCPolicy().SanitizeBytes([]byte("Hi.\n")))'
(string)("Hi.\n")
$ goe 'bluemonday.UGCPolicy().SanitizeReader(strings.NewReader("Hi.\n")).String()'
(string)("Hi.\n")

This is because Sanitize performs s = strings.TrimSpace(s), while the other two don't do the equivalent action.

Is this difference intended? I am guessing it is (and I like the current behavior, it seems appropriate). But then the documentation should reflect it.

	### Supported Go Versions

	bluemonday is tested against Go 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, and tip.

	We do not support Go 1.0 as we depend on `golang.org/x/net/html` which includes a reference to `io.ErrNoProgress` which did not exist in Go 1.0.

	- 1.1.x
	- 1.2.x
	- 1.3.x
	- 1.4.x
	- 1.5.x
	- 1.6.x
	- 1.7.x
	- 1.8.x
	- 1.9.x
	- 1.10.x
	- 1.11.x

microcosm-cc / bluemonday Goto Github PK

bluemonday's Issues

Recommend Projects

Recommend Topics

Recommend Org