
Comments (8)

dmitshur avatar dmitshur commented on July 17, 2024

Note that url.Parse correctly handles data URI schemes:

url.URL{
    Scheme:   (string)("data"),
    Opaque:   (string)("image/png;base64,iVBORw0KGgoAAAANS...K5CYII="),
    User:     (*url.Userinfo)(nil),
    Host:     (string)(""),
    Path:     (string)(""),
    RawQuery: (string)(""),
    Fragment: (string)(""),
}

You just need to check the values of Scheme, Opaque, RawQuery, Fragment.
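A minimal sketch of that check, using only the standard library. The helper name `isBase64ImageDataURI` is hypothetical, and the rule it applies (claimed `image/*` type, base64 flag, no query or fragment) is an assumption about what one might want to allow. It inspects only the claimed header and says nothing about what the payload really contains.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isBase64ImageDataURI (hypothetical helper) reports whether raw parses as a
// data URI that claims an image/* media type and base64 encoding.
// It looks only at Scheme, Opaque, RawQuery and Fragment; the payload
// itself is not validated here.
func isBase64ImageDataURI(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	if u.Scheme != "data" || u.RawQuery != "" || u.Fragment != "" {
		return false
	}
	// For a data URI, Opaque holds "<mediatype>;base64,<payload>".
	i := strings.Index(u.Opaque, ",")
	if i < 0 {
		return false
	}
	header := u.Opaque[:i]
	return strings.HasPrefix(header, "image/") && strings.HasSuffix(header, ";base64")
}

func main() {
	fmt.Println(isBase64ImageDataURI("data:image/png;base64,iVBORw0KGgo="))
	fmt.Println(isBase64ImageDataURI("data:text/html;charset=utf-8,hello"))
	fmt.Println(isBase64ImageDataURI("http://www.example.org/attacker_image.png"))
}
```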

from bluemonday.

buro9 avatar buro9 commented on July 17, 2024

I hadn't added anything to support the data URI due to the way that it can be abused.

The problem is that the data URI isn't restricted to having just images and one can easily create a data URI that contains CSS, JavaScript, HTML or other things that could permit the loading of a remote resource. Filtering on the claimed mimetype isn't adequate, you should assume an attacker isn't telling the truth.

This is actually listed on the Wikipedia data URI page under disadvantages http://en.wikipedia.org/wiki/Data_URI_scheme#Disadvantages , which cites point 6 of the RFC itself http://tools.ietf.org/html/rfc2397 , notably that the effect of data of unknown length is not known, and that it compromises the ability of a broad range of other security software to scan the embedded data.

Beyond the security considerations, I also found that there were practical issues in that as a web site owner I want to be able to cache images using a CDN and keep my page size low, or to ensure that when I add the content to a search index that I'm not bloating it with large amounts of gibberish (base64 encoded data).

With all that in mind, I chose not to add features that would encourage the use of something that could make the content insecure and may have other side effects.

But... if you have a strong use-case in which you're fine accepting these risks and do wish to have this feature, I'll look into how best to add it.


dmitshur avatar dmitshur commented on July 17, 2024

Hi, thanks for your response. I am learning about the advantages and disadvantages of data URIs. Your reply helped with that, but it also raised some questions.

> The problem is that the data URI isn't restricted to having just images and one can easily create a data URI that contains CSS, JavaScript, HTML or other things that could permit the loading of a remote resource.

Yes, exactly, that's why I wanted to filter by mime-type.

> Filtering on the claimed mimetype isn't adequate, you should assume an attacker isn't telling the truth.

Can you please elaborate? How is it different from an attacker doing:

<img src="http://www.example.org/attacker_image.png">

And having some JavaScript instead of a PNG at that URL? At the end of the day, that image.png is a bunch of bytes, whether they come from an external resource or from base64 encoded data.

Most of the disadvantages listed at http://en.wikipedia.org/wiki/Data_URI_scheme#Disadvantages are practical reasons why someone might not want to use them, but not security reasons why they shouldn't. Yes, they don't have a filename, but that's expected. Yes, they make the document larger, also expected. If you include the same image twice, it takes up twice the space, etc.

> The effect of using long "data" URLs in applications is currently
> unknown; some software packages may exhibit unreasonable behavior
> when confronted with data that exceeds its allocated buffer size.

Again, how is this different from http://www.example.org/attacker_image.png being a 1.5 gigabyte file?

> With all that in mind, I chose not to add features that would encourage the use of something that could make the content insecure and may have other side effects.

Understandable, and I support that.

However, I have not found/understood any concrete reasons why data URIs present a risk unlike any other external resource. It seems like they're generally not supported because of "they seem scary", not because there are clearly demonstrable attacks that regular images loaded via URLs are not susceptible to.

Again, I'm not an expert on this topic, I'm only looking to get a better understanding. Perhaps if you could elaborate on your points, that'd be helpful.

My use case is not very strong, mostly just that I thought it'd be neat to be able to use images embedded into a Markdown file occasionally, and it seems to me so far that it can be done in a clean/secure way (no less secure than a link to an external image).


buro9 avatar buro9 commented on July 17, 2024

I guess the question is what is bluemonday for?

In my mind the answer is that it's a security package designed to eradicate the risks associated with user generated content that will be displayed on a web page. The biggest source of risk is XSS attacks, and so we whitelist the allowed elements and test their values for safety.

The data URI makes that difficult, consider this:
<img src="data:text/html;charset=utf-8,..." />

That would display an image-not-found placeholder, collapsing the <img /> into a single pixel. But the browser may still have loaded a web page, and it was then left to the browser to determine whether the content was valid for this context and whether a security policy should prevent it (modern browsers should have prevented this).

But... if the browser had allowed it, who is to say that the text (HTML) in the data URI didn't include JavaScript that grabbed the cookie information of the end user and dispatched it to some third party.

Now consider the example you gave: <img src="...K5CYII=">

If we trust this, then we trust that the claimed mimetype image/png speaks the truth and that the base64 encoded data is actually a PNG image.

One of the first rules of security is to not trust user supplied input, and we shouldn't trust that the mimetype is correct. What if the base64 encoded data was really an HTML page, and it merely claimed to be a PNG?

Traditionally browsers used to use mimetypes and file extensions as hints as to what the content is, and would conveniently try and load the files as those hints instructed... a PNG in your case... and when this failed, the browser would step back and ponder how it might best deal with this, perhaps sniff the first few bytes to find out what it is... aha! HTML... great, pass it to the HTML handler and render.

So we go back to that question, what is bluemonday for?

By not allowing the data URI bluemonday eradicates the risks and bluemonday is a security package. But if we allow the data URI, without somehow proving that the base64 encoded data is safe (by ignoring whatever mimetype it claims to be, verifying it and then matching the verification to the mimetype), then we're allowing one of these risks to remain. Most modern browsers are not so dumb as to fall for this, but some older browsers might fall for it. We could've made sure the risk didn't exist, but instead we passed the risk downstream to the browsers and all their versions historically. If we do that, then bluemonday is a cleaning/tidying utility, but shouldn't be considered a security package.

We just don't trust user generated content, even when the content looks lovely and asks kindly to trust it.


dmitshur avatar dmitshur commented on July 17, 2024

> Now consider the example you gave: <img src="...K5CYII=">
>
> If we trust this, then we trust that the claimed mimetype image/png speaks the truth and that the base64 encoded data is actually a PNG image.

But... you have no issues with the following?

<img src="http://www.example.org/dangerous_javascript.js.or_maybe.jpg.actually.gif">

Why is one okay (i.e. you don't try to download the image src, verify it's actually an image, etc.) but the other not okay?

> So we go back to that question, what is bluemonday for?

My understanding is that it's a configurable HTML sanitization library. But it's up to the user of bluemonday to decide what they want to do with it.

It offers some safe, well known and commonly used defaults, like p := bluemonday.UGCPolicy(). But if the user wants to do p.AllowAttrs("value").OnElements("li"), then they can. If the user wants to do p.AllowAttrs("class").Matching(bluemonday.SpaceSeparatedTokens).OnElements("div", "span"), then they're specifying they want those specific things to not be filtered out.

Similarly, if the user says p.AllowElements("script"), I would expect it to do exactly what the user requested and allow "script" elements.

As such, I don't expect people to allow data URI images on their sites very often, just like I don't expect them to allow "script" elements. But they won't allow data URI images for practical reasons rather than because it's unsafe.


buro9 avatar buro9 commented on July 17, 2024

> Why is one OK and the other not?

Because the browser security around HTTP requests and their mimetype is mature, and attempting to serve JavaScript through an HTTP request via an IMG tag won't work at all. It's bluemonday's job to make sure that trying to stuff that in without the HTTP request doesn't work either (the browser deals with security of HTTP requests, context and cross-origins, etc... bluemonday is dealing with attacks based on inline code).

The browser security model around items within the page that do not create an HTTP request is not as strong... the present model considers anything within the page to be trusted to some degree, and so we're relying on immature security models from the browsers to take care of inline data URI content... it is effectively trusted, just like inline JavaScript is, and inline CSS is.

Our only hope is that the implementation within the browser is mature enough that mime type and content is checked and verified and only handled when it's correct.

Chrome is fine as far as I can tell, IE had major issues and only has partial support even now (images and CSS only, below some size, but then... if you can control CSS you can control anything: https://www.youtube.com/watch?v=eb3suf4REyI ).

Things like Content Security Policy can be safely applied to everything that is an HTTP request, but isn't applied on inline data URI resources. That's the important takeaway: the data URI is inline, and is trusted by the page... so bluemonday really needs to help protect the web page.

> My understanding is that it's a configurable HTML sanitization library. But it's up to the user of bluemonday to decide what they want to do with it.

That's true, and you can certainly use the package in a way which is insecure if you choose. As I said in my first comment, I don't mind implementing this; I just don't want to encourage data URIs, as there clearly is a risk which you need to know you're accepting. I can add mimetype checking, and verification that something that claims it is base64 is actually base64... but risk will still remain.
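For illustration only, those two checks might look something like the sketch below. Nothing here is bluemonday API: `checkDataURIClaim`, the `allowedTypes` allowlist and the `maxPayloadLen` cap are all hypothetical names and values. It validates the claim and the encoding, and deliberately does not try to prove what the decoded bytes are, which is the risk that remains.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// allowedTypes is a hypothetical allowlist of claimed media types.
var allowedTypes = map[string]bool{
	"image/png":  true,
	"image/jpeg": true,
	"image/gif":  true,
}

// maxPayloadLen is an arbitrary cap on the encoded payload length.
const maxPayloadLen = 64 * 1024

// checkDataURIClaim (hypothetical helper) validates the opaque part of a
// data URI, "<mediatype>;base64,<payload>": the claimed type must be on the
// allowlist and the payload must be well-formed, padded base64.
func checkDataURIClaim(opaque string) error {
	i := strings.Index(opaque, ",")
	if i < 0 {
		return errors.New("missing comma separator")
	}
	header, payload := opaque[:i], opaque[i+1:]
	if !strings.HasSuffix(header, ";base64") {
		return errors.New("payload is not declared as base64")
	}
	mediaType := strings.ToLower(strings.TrimSuffix(header, ";base64"))
	if !allowedTypes[mediaType] {
		return fmt.Errorf("media type %q is not allowed", mediaType)
	}
	if len(payload) > maxPayloadLen {
		return errors.New("payload too long")
	}
	// Strict() also rejects non-zero trailing padding bits.
	if _, err := base64.StdEncoding.Strict().DecodeString(payload); err != nil {
		return fmt.Errorf("payload is not valid base64: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(checkDataURIClaim("image/png;base64,iVBORw0KGgo="))
	fmt.Println(checkDataURIClaim("text/html;base64,PGI+"))
}
```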


dmitshur avatar dmitshur commented on July 17, 2024

Thank you for finally answering my question of what the difference is between the two ways of inserting user generated data (images). That is much clearer and I understand your point now.

Basically, in _theory_ there shouldn't be a difference, but in practice one path is well tested and vetted, while the other is less so, so there's a higher chance of bugs or older browsers not doing a good job of interpreting the data, etc.

From that perspective, allowing user-supplied content to embed images directly in pages hosted on a site with lots of visitors is certainly not a great idea, and I now see why you wouldn't want to support that.

There are two major use cases for this library, and my understanding is that it's supposed to support both:

  1. Untrusted user generated content.
  2. Trusted admin/site owner generated content.

So far you've been using the perspective of "user supplied content that gets inserted into your web page, hence the need to be strict about it".

The other use case is... "site admin-generated content" where attacking your own website doesn't make sense, since you are the admin. In such a scenario, HTML sanitization can be skipped altogether, but I prefer to use it anyway for a few reasons.

Consider people using Markdown to write books. Or publish articles on their own site. Or use Markdown to generate some debug information as part of a development tool.

I still want to (continue to) use bluemonday for those cases, for a few reasons:

  • I don't expect to be writing <script> tags in my Markdown, so I don't want it to work by accident.
  • It feels good knowing you can say "I only allow things X, Y and Z to work, filter anything else".

I am planning to make a PR to blackfriday to refactor it to use bluemonday for HTML sanitization instead of the internal code it uses now, simply to reduce code duplication and allow each package to focus on one specific task. However, blackfriday also has two types of users: rendering safe sanitized markdown, meant for user supplied comments on sites, but also admin-generated content.

In order for that to continue to be possible, bluemonday has to stay true to its description as _highly configurable_.


That said, I understand your hesitation to add an API for data URIs when you really don't want to support them, because you cannot do a good enough job of ensuring user generated content is securely sanitized.

I have a better idea, which I hope is more amenable for everyone and allows both use cases to be supported. I will close this issue and create another proposal.

Thank you.


dmitshur avatar dmitshur commented on July 17, 2024

By the way, an interesting observation.

It is possible (not that anyone would want to do this, but just for argument's sake) to get around the potential browser bugs due to a different code-path for data URI images by having a Go web server that, when serving HTML with data URI images, would rewrite the data URI to become a normal link, and serve the image via normal means on a separate endpoint.

