
Sanitize

Sanitize is an allowlist-based HTML and CSS sanitizer. It removes all HTML and/or CSS from a string except the elements, attributes, and properties you choose to allow.

Using a simple configuration syntax, you can tell Sanitize to allow certain HTML elements, certain attributes within those elements, and even certain URL protocols within attributes that contain URLs. You can also allow specific CSS properties, @ rules, and URL protocols in elements or attributes containing CSS. Any HTML or CSS that you don't explicitly allow will be removed.

Sanitize is based on the Nokogiri HTML5 parser, which parses HTML the same way modern browsers do, and Crass, which parses CSS the same way modern browsers do. As long as your allowlist config only allows safe markup and CSS, even the most malformed or malicious input will be transformed into safe output.


Installation

gem install sanitize

Quick Start

require 'sanitize'

# Clean up an HTML fragment using Sanitize's permissive but safe Relaxed config.
# This also sanitizes any CSS in `<style>` elements or `style` attributes.
Sanitize.fragment(html, Sanitize::Config::RELAXED)

# Clean up an HTML document using the Relaxed config.
Sanitize.document(html, Sanitize::Config::RELAXED)

# Clean up a standalone CSS stylesheet using the Relaxed config.
Sanitize::CSS.stylesheet(css, Sanitize::Config::RELAXED)

# Clean up some CSS properties using the Relaxed config.
Sanitize::CSS.properties(css, Sanitize::Config::RELAXED)

Usage

Sanitize can sanitize the following types of input:

  • HTML fragments
  • HTML documents
  • CSS stylesheets inside HTML <style> elements
  • CSS properties inside HTML style attributes
  • Standalone CSS stylesheets
  • Standalone CSS properties

Warning

Sanitize cannot fully sanitize the contents of <math> or <svg> elements. MathML and SVG elements are foreign elements that don't follow normal HTML parsing rules.

By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you may create a security vulnerability in your application.

HTML Fragments

A fragment is a snippet of HTML that doesn't contain a root-level <html> element.

If you don't specify any configuration options, Sanitize will use its strictest settings by default, which means it will strip all HTML and leave only safe text behind.

html = '<b><a href="http://foo.com/">foo</a></b><img src="bar.jpg">'
Sanitize.fragment(html)
# => 'foo'

To keep certain elements, add them to the element allowlist.

Sanitize.fragment(html, :elements => ['b'])
# => '<b>foo</b>'

HTML Documents

When sanitizing a document, the <html> element must be allowlisted. You can also set :allow_doctype to true to allow well-formed document type definitions.

html = %[
  <!DOCTYPE html>
  <html>
    <b><a href="http://foo.com/">foo</a></b><img src="bar.jpg">
  </html>
]

Sanitize.document(html,
  :allow_doctype => true,
  :elements      => ['html']
)
# => %[
#      <!DOCTYPE html><html>foo
#
#      </html>
#    ]

CSS in HTML

To sanitize CSS in an HTML fragment or document, first allowlist the <style> element and/or the style attribute. Then allowlist the CSS properties, @ rules, and URL protocols you wish to allow. You can also choose whether to allow CSS comments or browser compatibility hacks.

html = %[
  <style>
    div { color: green; width: 1024px; }
  </style>

  <div style="height: 100px; width: 100px;"></div>
  <p>hello!</p>
]

Sanitize.fragment(html,
  :elements   => ['div', 'style'],
  :attributes => {'div' => ['style']},

  :css => {
    :properties => ['width']
  }
)
#=> %[
#   <style>
#     div {  width: 1024px; }
#   </style>
#
#   <div style=" width: 100px;"></div>
#   hello!
# ]

Standalone CSS

Sanitize will happily clean up a standalone CSS stylesheet or property string without needing to invoke the HTML parser.

css = %[
  @import url(evil.css);

  a { text-decoration: none; }

  a:hover {
    left: expression(alert('xss!'));
    text-decoration: underline;
  }
]

Sanitize::CSS.stylesheet(css, Sanitize::Config::RELAXED)
# => %[
#
#
#
#   a { text-decoration: none; }
#
#   a:hover {
#
#     text-decoration: underline;
#   }
# ]

Sanitize::CSS.properties(%[
  left: expression(alert('xss!'));
  text-decoration: underline;
], Sanitize::Config::RELAXED)
# => %[
#
#   text-decoration: underline;
# ]

Configuration

In addition to the ultra-safe default settings, Sanitize comes with three other built-in configurations that you can use out of the box or adapt to meet your needs.

Sanitize::Config::RESTRICTED

Allows only very simple inline markup. No links, images, or block elements.

Sanitize.fragment(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'

Sanitize::Config::BASIC

Allows a variety of markup including formatting elements, links, and lists.

Images and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto protocols, and a rel="nofollow" attribute is added to all links to mitigate SEO spam.

Sanitize.fragment(html, Sanitize::Config::BASIC)
# => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

Sanitize::Config::RELAXED

Allows an even wider variety of markup, including images and tables, as well as safe CSS. Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images are limited to HTTP and HTTPS. In this mode, rel="nofollow" is not added to links.

Sanitize.fragment(html, Sanitize::Config::RELAXED)
# => '<b><a href="http://foo.com/">foo</a></b><img src="bar.jpg">'

Custom Configuration

If the built-in modes don't meet your needs, you can easily specify a custom configuration:

Sanitize.fragment(html,
  :elements => ['a', 'span'],

  :attributes => {
    'a'    => ['href', 'title'],
    'span' => ['class']
  },

  :protocols => {
    'a' => {'href' => ['http', 'https', 'mailto']}
  }
)

You can also start with one of Sanitize's built-in configurations and then customize it to meet your needs.

The built-in configs are deeply frozen to prevent people from modifying them (either accidentally or maliciously). To customize a built-in config, create a new copy using Sanitize::Config.merge(), like so:

# Create a customized copy of the Basic config, adding <div> and <table> to the
# existing allowlisted elements.
Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
  :elements        => Sanitize::Config::BASIC[:elements] + ['div', 'table'],
  :remove_contents => true
))

The example above adds the <div> and <table> elements to a copy of the existing list of elements in Sanitize::Config::BASIC. If you instead want to completely overwrite the elements array with your own, you can omit the + operation:

# Overwrite :elements instead of creating a copy with new entries.
Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
  :elements        => ['div', 'table'],
  :remove_contents => true
))

Config Settings

:add_attributes (Hash)

Attributes to add to specific elements. If the attribute already exists, it will be replaced with the value specified here. Specify all element names and attributes in lowercase.

:add_attributes => {
  'a' => {'rel' => 'nofollow'}
}
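As a hedged sketch (the config below is illustrative, not from this README), :add_attributes pairs naturally with :attributes: allow href on links, then force rel="nofollow" onto every link regardless of what the input supplied.

```ruby
# Illustrative config (names chosen for this example): links keep their
# href, and rel="nofollow" is added to each one, replacing any rel
# attribute present in the input.
link_config = {
  :elements   => ['a'],
  :attributes => {'a' => ['href', 'rel']},

  :add_attributes => {
    'a' => {'rel' => 'nofollow'}
  }
}
```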

:allow_comments (boolean)

Whether or not to allow HTML comments. Allowing comments is strongly discouraged, since IE allows script execution within conditional comments. The default value is false.

:allow_doctype (boolean)

Whether or not to allow well-formed HTML doctype declarations such as `<!DOCTYPE html>` when sanitizing a document. This setting is ignored when sanitizing fragments. The default value is false.

:attributes (Hash)

Attributes to allow on specific elements. Specify all element names and attributes in lowercase.

:attributes => {
  'a'          => ['href', 'title'],
  'blockquote' => ['cite'],
  'img'        => ['alt', 'src', 'title']
}

If you'd like to allow certain attributes on all elements, use the symbol :all instead of an element name.

# Allow the class attribute on all elements.
:attributes => {
  :all => ['class'],
  'a'  => ['href', 'title']
}

To allow arbitrary HTML5 data-* attributes, use the symbol :data in place of an attribute name.

# Allow arbitrary HTML5 data-* attributes on <div> elements.
:attributes => {
  'div' => [:data]
}
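The three forms above can be combined in a single :attributes Hash. A sketch (element names chosen for illustration):

```ruby
# :all applies to every allowed element, 'a' gets its own list, and
# :data allows arbitrary data-* attributes on <div> only.
config = {
  :elements => ['a', 'div', 'span'],

  :attributes => {
    :all  => ['class'],
    'a'   => ['href', 'title'],
    'div' => [:data]
  }
}
```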

:css (Hash)

Hash of the following CSS config settings to be used when sanitizing CSS (either standalone or embedded in HTML).

:css => :allow_comments (boolean)

Whether or not to allow CSS comments. The default value is false.

:css => :allow_hacks (boolean)

Whether or not to allow browser compatibility hacks such as the IE * and _ hacks. These are generally harmless, but technically result in invalid CSS. The default is false.

:css => :at_rules (Array or Set)

Names of CSS at-rules to allow that may not have associated blocks, such as import or charset. Names should be specified in lowercase.

:css => :at_rules_with_properties (Array or Set)

Names of CSS at-rules to allow that may have associated blocks containing CSS properties. At-rules like font-face and page fall into this category. Names should be specified in lowercase.

:css => :at_rules_with_styles (Array or Set)

Names of CSS at-rules to allow that may have associated blocks containing style rules. At-rules like media and keyframes fall into this category. Names should be specified in lowercase.

:css => :import_url_validator

This is a Proc (or other callable object) that will be called and passed the URL specified for any @import at-rules.

You can use this to limit what can be imported, for example something like the following to limit @import to Google Fonts URLs:

Proc.new { |url| url.start_with?("https://fonts.googleapis.com") }
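Sketching how that validator slots into a :css config (the at-rule and property lists here are examples, not defaults); the validator itself is an ordinary callable, so it's easy to exercise on its own:

```ruby
# Only allow @import of Google Fonts stylesheets (example policy).
google_fonts_only = Proc.new { |url| url.start_with?("https://fonts.googleapis.com") }

css_config = {
  :css => {
    :at_rules             => ['import'],
    :import_url_validator => google_fonts_only,
    :properties           => ['color', 'font-family']
  }
}

google_fonts_only.call("https://fonts.googleapis.com/css?family=Roboto") # => true
google_fonts_only.call("https://evil.example/steal.css")                 # => false
```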

:css => :properties (Array or Set)

List of CSS property names to allow. Names should be specified in lowercase.

:css => :protocols (Array or Set)

URL protocols to allow in CSS URLs. Should be specified in lowercase.

If you'd like to allow the use of relative URLs which don't have a protocol, include the symbol :relative in the protocol array.

:elements (Array or Set)

Array of HTML element names to allow. Specify all names in lowercase. Any elements not in this array will be removed.

:elements => %w[
  a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
  q s samp small strike strong sub sup time u ul var
]

Warning

Sanitize cannot fully sanitize the contents of <math> or <svg> elements. MathML and SVG elements are foreign elements that don't follow normal HTML parsing rules.

By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you must assume that any content inside them will be allowed, even if that content would otherwise be removed or escaped by Sanitize. This may create a security vulnerability in your application.

Note

Sanitize always removes <noscript> elements and their contents, even if noscript is in the allowlist.

This is because a <noscript> element's content is parsed differently in browsers depending on whether or not scripting is enabled. Since Nokogiri doesn't support scripting, it always parses <noscript> elements as if scripting is disabled. This results in edge cases where it's not possible to reliably sanitize the contents of a <noscript> element because Nokogiri can't fully replicate the parsing behavior of a scripting-enabled browser.

:parser_options (Hash)

Parsing options to be supplied to nokogumbo.

:parser_options => {
  max_errors: -1,
  max_tree_depth: -1
}

:protocols (Hash)

URL protocols to allow in specific attributes. If an attribute is listed here and contains a protocol other than those specified (or if it contains no protocol at all), it will be removed.

:protocols => {
  'a'   => {'href' => ['ftp', 'http', 'https', 'mailto']},
  'img' => {'src'  => ['http', 'https']}
}

If you'd like to allow the use of relative URLs which don't have a protocol, include the symbol :relative in the protocol array:

:protocols => {
  'a' => {'href' => ['http', 'https', :relative]}
}

:remove_contents (boolean or Array or Set)

If this is true, Sanitize will remove the contents of any non-allowlisted elements in addition to the elements themselves. By default, Sanitize leaves the safe parts of an element's contents behind when the element is removed.

If this is an Array or Set of element names, then only the contents of the specified elements (when filtered) will be removed, and the contents of all other filtered elements will be left behind.

The default value is %w[iframe math noembed noframes noscript plaintext script style svg xmp].
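A sketch of the three accepted forms side by side (option hashes only, for illustration):

```ruby
require 'set'

# true: drop the contents of every filtered element.
strip_all      = { :remove_contents => true }

# Array/Set: drop contents only for these elements; other filtered
# elements still leave their safe contents behind.
strip_selected = { :remove_contents => Set['script', 'style'] }

# The default list, as documented above.
default_list = %w[iframe math noembed noframes noscript plaintext script style svg xmp]
```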

:transformers (Array or callable)

Custom HTML transformer or array of custom transformers. See the Transformers section below for details.

:whitespace_elements (Hash)

Hash of element names which, when removed, should have their contents surrounded by whitespace to preserve readability.

Each element name is a key pointing to another Hash, which provides the specific whitespace that should be inserted :before and :after the removed element's position. The :after value will only be inserted if the removed element has children, in which case it will be inserted after those children.

:whitespace_elements => {
  'br'  => { :before => "\n", :after => "" },
  'div' => { :before => "\n", :after => "\n" },
  'p'   => { :before => "\n", :after => "\n" }
}

The default elements with whitespace added before and after are:

address article aside blockquote br dd div dl dt
footer h1 h2 h3 h4 h5 h6 header hgroup hr li nav
ol p pre section ul
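A toy illustration of the :before/:after idea (this is not Sanitize's implementation, and it ignores the has-children condition described above): when an element is removed, wrapping its text in the configured whitespace keeps adjacent words from running together.

```ruby
whitespace = { 'p' => { :before => "\n", :after => "\n" } }

# Hypothetical helper for illustration only.
def wrap(name, text, whitespace)
  ws = whitespace.fetch(name, { :before => '', :after => '' })
  "#{ws[:before]}#{text}#{ws[:after]}"
end

wrap('p', 'foo', whitespace) # => "\nfoo\n"
wrap('b', 'foo', whitespace) # => "foo" (no whitespace configured for <b>)
```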

Transformers

Transformers allow you to filter and modify HTML nodes using your own custom logic, on top of (or instead of) Sanitize's core filter. A transformer is any object that responds to call() (such as a lambda or proc).

To use one or more transformers, pass them to the :transformers config setting. You may pass a single transformer or an array of transformers.

Sanitize.fragment(html, :transformers => [
  transformer_one,
  transformer_two
])

Input

Each transformer's call() method will be called once for each node in the HTML (including elements, text nodes, comments, etc.), and will receive as an argument a Hash that contains the following items:

  • :config - The current Sanitize configuration Hash.

  • :is_allowlisted - true if the current node has been allowlisted by a previous transformer, false otherwise. It's generally bad form to remove a node that a previous transformer has allowlisted.

  • :node - A Nokogiri::XML::Node object representing an HTML node. The node may be an element, a text node, a comment, a CDATA node, or a document fragment. Use Nokogiri's inspection methods (element?, text?, etc.) to selectively ignore node types you aren't interested in.

  • :node_allowlist - Set of Nokogiri::XML::Node objects in the current document that have been allowlisted by previous transformers, if any. It's generally bad form to remove a node that a previous transformer has allowlisted.

  • :node_name - The name of the current HTML node, always lowercase (e.g. "div" or "span"). For non-element nodes, the name will be something like "text", "comment", "#cdata-section", "#document-fragment", etc.

Output

A transformer doesn't have to return anything, but may optionally return a Hash, which may contain the following items:

  • :node_allowlist - Array or Set of specific Nokogiri::XML::Node objects to add to the document's allowlist, bypassing the current Sanitize config. These specific nodes and all their attributes will be allowlisted, but their children will not be.

If a transformer returns anything other than a Hash, the return value will be ignored.

Processing

Each transformer has full access to the Nokogiri::XML::Node that's passed into it and to the rest of the document via the node's document() method. Any changes made to the current node or to the document will be reflected instantly in the document and passed on to subsequently called transformers and to Sanitize itself. A transformer may even call Sanitize internally to perform custom sanitization if needed.

Nodes are passed into transformers in the order in which they're traversed. Sanitize performs top-down traversal, meaning that nodes are traversed in the same order you'd read them in the HTML, starting at the top node, then its first child, and so on.

html = %[
  <header>
    <span>
      <strong>foo</strong>
    </span>
    <p>bar</p>
  </header>

  <footer></footer>
]

transformer = lambda do |env|
  puts env[:node_name] if env[:node].element?
end

# Prints "header", "span", "strong", "p", "footer".
Sanitize.fragment(html, :transformers => transformer)

Transformers have a tremendous amount of power, including the power to completely bypass Sanitize's built-in filtering. Be careful! Your safety is in your own hands.
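As a minimal sketch of the transformer contract (the lambda and node names are hypothetical): a transformer that sets target="_blank" on every <a> element. A plain Hash stands in for the Nokogiri node here, since attribute access via []= works the same way; in real use, env[:node] is a Nokogiri::XML::Node, and target would also need to be allowed via :attributes (or the node allowlisted) to survive filtering.

```ruby
add_target_blank = lambda do |env|
  # Ignore everything except <a> elements.
  return unless env[:node_name] == 'a'

  env[:node]['target'] = '_blank'
  nil # return nothing; Sanitize's normal filtering still applies
end

fake_node = {'href' => 'http://example.com/'}
add_target_blank.call(:node_name => 'a', :node => fake_node)
fake_node['target'] # => "_blank"
```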

Example: Transformer to allow image URLs by domain

The following example demonstrates how to remove image elements unless they use a relative URL or are hosted on a specific domain. It assumes that the <img> element and its src attribute are already allowlisted.

require 'uri'

image_allowlist_transformer = lambda do |env|
  # Ignore everything except <img> elements.
  return unless env[:node_name] == 'img'

  node      = env[:node]
  image_uri = URI.parse(node['src'])

  # Only allow relative URLs or URLs with the example.com domain. The
  # image_uri.host.nil? check ensures that protocol-relative URLs like
  # "//evil.com/foo.jpg" are not mistakenly treated as relative URLs.
  unless image_uri.host == 'example.com' || (image_uri.host.nil? && image_uri.relative?)
    node.unlink # `Nokogiri::XML::Node#unlink` removes a node from the document
  end
end

Example: Transformer to allow YouTube video embeds

The following example demonstrates how to create a transformer that will safely allow valid YouTube video embeds without having to allow other kinds of embedded content, which would be the case if you tried to do this by just allowing all <iframe> elements:

youtube_transformer = lambda do |env|
  node      = env[:node]
  node_name = env[:node_name]

  # Don't continue if this node is already allowlisted or is not an element.
  return if env[:is_allowlisted] || !node.element?

  # Don't continue unless the node is an iframe.
  return unless node_name == 'iframe'

  # Verify that the video URL is actually a valid YouTube video URL.
  return unless node['src'] =~ %r|\A(?:https?:)?//(?:www\.)?youtube(?:-nocookie)?\.com/|

  # We're now certain that this is a YouTube embed, but we still need to run
  # it through a special Sanitize step to ensure that no unwanted elements or
  # attributes that don't belong in a YouTube embed can sneak in.
  Sanitize.node!(node, {
    :elements => %w[iframe],

    :attributes => {
      'iframe'  => %w[allowfullscreen frameborder height src width]
    }
  })

  # Now that we're sure that this is a valid YouTube embed and that there are
  # no unwanted elements or attributes hidden inside it, we can tell Sanitize
  # to allowlist the current node.
  {:node_allowlist => [node]}
end

html = %[
<iframe width="420" height="315" src="//www.youtube.com/embed/dQw4w9WgXcQ"
    frameborder="0" allowfullscreen></iframe>
]

Sanitize.fragment(html, :transformers => youtube_transformer)
# => '<iframe width="420" height="315" src="//www.youtube.com/embed/dQw4w9WgXcQ" frameborder="0" allowfullscreen=""></iframe>'

sanitize's People

Contributors

adamhooper, ardiesaeidi, banderson, eadz, ehudc, ejtttje, flavorjones, geniou, gogainda, gtd, heathd, igor-drozdov, jamiecobbett, janklimo, lis2, louim, m-nakamura145, martineriksson, mscrivo, nikz, pda, rafaelss, rafbm, randsina, rgrove, rubys, stanhu, stevecheckoway, whatcould, wilson


sanitize's Issues

Enable returning Nokogiri object

I'd really like to be able to tell Sanitize to return a Nokogiri object instead of a string. For my particular application, I need to paginate the output, and that requires a parsed DOM. There are probably other scenarios where a Nokogiri object would be needed. It seems silly to use CPU cycles writing to HTML then parsing again.

If you want, I could try to write a patch for this. I'm probably going to monkey-patch my own copy for now.

By the way, this is a great gem.

Thanks,
Jarrett

Bad encoding on non-ASCII replacements in transformer

Hello

I'm seeing strange transformer behaviour. When I try to replace node content with non-ASCII strings, the sanitizer returns a string double-encoded to UTF-8.

This is my transformer:

lambda do |env|
  node      = env[:node]
  node_name = env[:node_name]

  return if node_name == 'pre'
  return if node_name == 'code'

  if !node.text?
    # text = node.text.to_s
    text = "тест"
    # text.gsub!(/\n[\n]+/u, "</p><p>")
    # text.gsub!(/\n/u, "<br />")
    node.replace(text)
  end

  {:node_whitelist => [node]}
end

IO Code:

intext = STDIN.read()
outtext = Sanitize.clean(intext, HTMLSanitize::Config::Custom)
outlen = outtext.size()
STDOUT.write(outtext)

Output of sanitizer:

$ ./priv/html_sanitizer.rb
<p>тест
тест
</p>
ÑеÑÑ

But I expect тест as output.
How can I solve this problem with encodings?

Thank you

escape > inside text

My users are frustrated that if they accidentally have a bracket tag, it removes the html.

Sanitize.clean!("test < the more love") -> "test "

I've tried adding a filter but it was already gone by the time it saw it.

    node = env[:node]
    node.content = node.content.gsub(/</,'&lt;').gsub(/>/,'&gt;') if node.text?

Can I whitelist \n \r \t

My html is coming from a Word document, which is converted to html, rather poorly, and uses a ton of newline characters.

Is this a job for transformers? Or would a simple update allow for whitelisting these types of characters?

Thanks

Add option to escape non-whitelisted elements instead of stripping them

Feature request from Ævar Arnfjörð Bjarmason:

My use case is that I'm calling Sanitize like this:

def clean(html)
  return Sanitize.clean(html, :elements => ['a'],
    :attributes => {'a' => ['href']},
    :protocols  => {'a' => {'href' => ['http', 'https']}})
end

And this is the output I really want:

clean("<b>") ==> &lt;b&gt;
clean("<b><a href=\"http://example.com\">example</a>") ==> &lt;b&gt;<a href="http://example.com">example</a>

Instead, Sanitize will completely strip the unacceptable HTML tag.

We should implement a setting to make the stripping optional.

Sanitize broken in JRuby

Case #1 - MRI (1.9.2-p290, 1.9.3-p194)

Nokogiri::VERSION   # => "1.5.0"
Sanitize.clean("a<a> b </a>") # => "a b " 

Case #2 (JRuby-1.6.7.2)

Nokogiri::VERSION   # => "1.5.3"
Sanitize.clean("a<a> b </a>") # => "a" 

trouble sanitizing <script> tag contents

I am unable to strip the contents of the <script> tags from various parts of an HTML document. Try these commands out in your console:

url = "https://github.com/rgrove/sanitize"
raw = RestClient.get( url )
a = Sanitize.clean( raw )
a = Sanitize.clean( raw, :elements => ['p'] )
a = Sanitize.clean( raw, :elements => [] )
a = Sanitize.clean( raw, Sanitize::Config::RESTRICTED )

And witness that a always contains a lot of leftover script-ese.

(note: using the rest-client gem)

No matter which method of sanitizing I use, I am left with the contents of the <script> tags, which is quite annoying. I have tried using the :remove_contents => ['script'] parameter as well, with no success.

If this is pilot error on my part, please let me know. I would love to solve this.

Add support for sanitizing full documents (as opposed to fragments)

HTML and BODY tags are always removed from string:
Sanitize.clean("

foo bar

", :elements => ['html', 'body', 'p'], :remove_contents => true)
=> "

foo bar

"

Sanitize.clean("<p>foo bar</p><a>some text</a>", :elements => ['html', 'body', 'p'], :remove_contents => true)
=> "<p>foo bar</p>" 

Sanitize.clean("<html><body><p>foo bar</p><a>some text</a></body></html>", :elements => ['html', 'body', 'p'], :remove_contents => true)
=> "<p>foo bar</p>" 

How to whitelist &

If you submit a field with "hello & world" sanitize is saving that in the DB as:

hello &amp; world 

How can you whitelist the &? We want sanitize to remove all possible malicious HTML and JS/script tags, but we're ok allowing the ampersand.

I tried but that had no effect:

Sanitize.clean(self[column.name], :elements => %w[&])

Thanks

Nokogiri 1.4.0

Nokogiri 1.4.0 is out, and it fixes a segfault that I encountered when running Sanitize on a large number of documents. Could you please test / configure the gem to work with 1.4.0. I currently get RubyGem version error: nokogiri(1.4.0 not ~> 1.3.3).

And for the record, I confirmed that Nokogiri 1.4.0 fixed my segfault issue on a pre-release Nokogiri gem that was still in the 1.3.x version series.

URL fragment identifiers containing colons are stripped even when relative URLs are allowed

Using the Sanitize gem, I'm cleaning some HTML. In the href attribute of my anchor tags, I wish to parse the following:

<a href="#fn:1">1</a>

This is required for implementing footnotes using the Kramdown gem.

However, Sanitize doesn't appear to like the colon inside the href attribute. It simply outputs <a>1</a> instead, skipping the href attribute altogether.

My sanitize code looks like this:

# Setup whitelist of html elements, attributes, and protocols that are allowed.
allowed_elements = ['h2', 'a', 'img', 'p', 'ul', 'ol', 'li', 'strong', 'em', 'cite', 
  'blockquote', 'code', 'pre', 'dl', 'dt', 'dd', 'br', 'hr', 'sup', 'div']
allowed_attributes = {'a' => ['href', 'rel', 'rev'], 'img' => ['src', 'alt'], 
  'sup' => ['id'], 'div' => ['class'], 'li' => ['id']}
allowed_protocols = {'a' => {'href' => ['http', 'https', 'mailto', :relative]}}

# Clean text of any unwanted html tags.
html = Sanitize.clean(html, :elements => allowed_elements, :attributes => allowed_attributes, 
  :protocols => allowed_protocols)

Is there a way to get Sanitize to accept a colon in the href attribute?

This issue is a duplicate of this Stack Overflow question.

Segfault/exception when parsing http://www.fcc.gov/ftp/Bureaus/MB/Databases/cdbs/_readme.html

I'm getting a segmentation fault when running the following script. It appears to be something related to non-breaking-space parsing(???) because when I uncomment the line that gsub replaces these with regular spaces, the problem goes away.

#!/usr/bin/env ruby
require 'net/http'
require 'rubygems'
require 'sanitize'

headers,data = Net::HTTP.new("www.fcc.gov",80).get('/ftp/Bureaus/MB/Databases/cdbs/_readme.html')

# Uncomment this to fix segmentation fault
#data = data.gsub(/&nbsp;/, ' ')

puts Sanitize.clean(data)

using sanitize and mechanize together

I tried to use sanitize and mechanize together in the same app, but I received the following error:

"Gem::LoadError (can't activate nokogiri (~> 1.3.3, runtime) for ["sanitize-1.1.0"], already activated nokogiri-1.4.0 for ["mechanize-0.9.3"]):"

I checked with "gem list --local" and I've got the both versions of nokogiri (1.4.0 and 1.3.3)
*** LOCAL GEMS ***
...
mechanize (0.9.3)
nokogiri (1.4.0, 1.3.3)
sanitize (1.1.0)
...

The error message only appears the first time I try to run methods from the module using sanitize; it's a very simple function that only executes one line of sanitize (Sanitize.clean(html)). After the first time, the error message became this:

"MissingSourceFile (no such file to load -- sanitize):"

I'm kind of lost with this problem; it seems to me like a multiple inheritance problem. Do you have any idea how to solve this?

extra characters added

When I run the following text:

asdfa sdfasdf sdf

<style>
div {
background-color: red;
}
</style>

through this function:
Sanitize.clean(text,
  # Style element has no effect since the JS sterilizes it
  :elements   => Sanitize::Config::RELAXED[:elements].push('span', 'div', 'style'),
  :attributes => Sanitize::Config::RELAXED[:attributes].merge(:all => ['style']),
  :protocols  => Sanitize::Config::RELAXED[:protocols]
)

I get

asdfa sdfasdf sdf

<style>
div {
background-color: red;
}
</style>

If I do it again, I get:

 <div>asdfa sdfasdf sdf</div>&#13;
 <style>&amp;amp;amp;amp;#13;&#13;
 div {&amp;amp;amp;amp;#13;&#13;
 background-color: red;&amp;amp;amp;amp;#13;&#13;
 }&amp;amp;amp;amp;#13;&#13;
 </style>

great tool tho. everything else works super. :)

Won't keep BR tag

Hi and thanks for a great library!

I am experiencing what might be a bug, here is a failing test as a gist: http://gist.github.com/245526
I ran it with the specs; for me, "keep br and a tags, v1" will fail because Sanitize strips the <br> tag.

Would be happy if you could have a look at it and maybe help me. As of now I'm running:
sanitize 1.2.0.dev.20091104
nokogiri 1.4.0
Ubuntu Karmic Koala (ie, the libxml/xslt libs and whatnot thats in Karmic)

I had the same error with Sanitize 1.1.0.

Should exclude some empty tags

There are some tags that don't work if they're empty or if not all attributes have been filled. For example, I have:

<a></a>
<img> or <img />

and so on...
I want to be able to just wipe those out, as they are "non-functional" tags if they don't have all the attributes correctly filled.
They came from text copied and pasted from MS Word, after some sanitization.

Namespace support

Hi,
could you tell me please - how to add to allowed tags list tags with prefixes, for example

 <prefix:tag attr = "something" />

I have tried, but its not working :(

youtube iframe transformer

I couldn't get the wiki to let me edit it, so here it is:

T_YOUTUBE_IFRAME = lambda do |env|
  node = env[:node]
  return nil unless env[:node_name] == 'iframe'

  if node['src'] =~ /^http:\/\/www.youtube.com\/embed\//
    node['src'] += "?webkitfix"  # needed to get around webkit's XSS protection

    return {:node_whitelist => [node]}
  end
end

ol inside ul tags

<ul><li>one</li><ul><li>two</li></ul></ul><ol><li>one</li><ol><ol><ul><li>ok<br></li></ul></ol></ol></ol>

becomes

  • one
    • two
  1. one
    • ok

This might be a bug with Nokogiri though.

html stripped after comment tag

If there is a comment, HTML is stripped when it should be kept.

Here is an example of the issue in an irb session with a git clone from 11 Jan, 2011:

irb --> text = "<!-- comment --><b>Hello</b>"
    ==> "<!-- comment --><b>Hello</b>"

irb --> Sanitize.clean(text)
    ==> "Hello"

irb --> Sanitize.clean(text, Sanitize::Config::RESTRICTED)
    ==> "Hello"

irb --> Sanitize.clean(text, Sanitize::Config::BASIC)
    ==> "Hello"

irb --> Sanitize.clean(text, Sanitize::Config::RELAXED)
    ==> "Hello"

Add this to the 'strings' cases in the test file to add tests:

:raw_comment_with_html => {
  :html       => 'Hello',
  :default    => 'Hello',
  :restricted => 'Hello',
  :basic      => 'Hello',
  :relaxed    => 'Hello'
}

should strip HTML comments

I'm at a local Ruby group meeting, and the presenter is saying that he had to strip out HTML comments using a regular expression before passing the result to Sanitize. That shouldn't be necessary.

Closing slashes in single tags (img, br)

There's an example use case here: http://wonko.com/post/sanitize
It's actually very close to the Readme example, but differs a bit.

    html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    Sanitize.clean(html, Sanitize::Config::RELAXED)
    # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Please note the closing slash on the image tag - that's what I would actually expect as the result.

In reality, the same example cuts out that slash:

    html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    Sanitize.clean(html, Sanitize::Config::RELAXED)
    #=> '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'

The same happens to slashes in br tags, for example.
Is there a way to keep these slashes?

Thank you.
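One thing worth trying (a config sketch, not verified against this exact Sanitize version): the `:output` option shown in configs elsewhere in this thread accepts `:xhtml`, which should serialize void elements with self-closing slashes.

```ruby
# Config sketch: request XHTML serialization so void elements like <img>
# and <br> keep their closing slashes. Whether your installed Sanitize
# version honors :output => :xhtml is an assumption.
XHTML_CONFIG = Sanitize::Config::RELAXED.merge(:output => :xhtml)
# Sanitize.clean(html, XHTML_CONFIG)
```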

Transcoder for youtube or vimeo embeds through iframes

YouTube and Vimeo embeds through iframe:

    @@transformer = lambda do |env|
        node      = env[:node]
        node_name = env[:node_name]
    
        # Don't continue if this node is already whitelisted or is not an element.
        return if env[:is_whitelisted] || !node.element?
    
        # We look for <iframe> nodes
        return unless node_name == 'iframe'
    
        url = node['src']
    
        # Verify that the video URL is actually a valid YouTube or Vimeo video URL.
        return unless (url =~ /\Ahttp:\/\/(?:www\.)?youtube\.com\/embed\// ||
                       url =~ /\Ahttp:\/\/(?:(www|player)\.)?vimeo\.com\/video\//)
    
        Sanitize.clean_node!(node, {
          :elements => %w[iframe],
    
          :attributes => {
            'iframe' => ['src', 'width', 'height', 'allowfullscreen']
          },
    
          :add_attributes => {
            'iframe' => {'frameborder' => '0'}
          }
        })
    
        # Now that we're sure that this is a valid YouTube or Vimeo embed
        {:node_whitelist => [node]}
      end
    

Fails to load when using Active Support

    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/activesupport-3.0.0/lib/active_support/dependencies.rb:239:in `require'
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/activesupport-3.0.0/lib/active_support/dependencies.rb:239:in `block in require'
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/activesupport-3.0.0/lib/active_support/dependencies.rb:225:in `block in load_dependency'
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/activesupport-3.0.0/lib/active_support/dependencies.rb:591:in `new_constants_in'
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/activesupport-3.0.0/lib/active_support/dependencies.rb:225:in `load_dependency'
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/activesupport-3.0.0/lib/active_support/dependencies.rb:239:in `require'
    from (irb):8
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/railties-3.0.0/lib/rails/commands/console.rb:44:in `start'
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/railties-3.0.0/lib/rails/commands/console.rb:8:in `start'
    from /Users/admin/.rvm/gems/ruby-1.9.2-p0@readfeed/gems/railties-3.0.0/lib/rails/commands.rb:23:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'
    

Option to return the Nokogiri object rather than a string

Would you be open to a patch that would allow Sanitize.clean! to return the current Nokogiri document? In my current use case I end up re-parsing the string back into Nokogiri unnecessarily.

Just thought I'd check to see if you had interest. If so, I'd be happy to put something together.

Cheers, and thanks for a great tool.

Alex

Spaces after legitimate tags are being cleaned.

I have the following HTML:

    <a href="/pants" target="foo"><b>ipsum</b></a><br /><em><strong>dolor</strong></em> <i>amet</i> <img src="http://foo" height="20" width="20" align="center" alt="foo" /> <small>H<sub>2</sub>0</small> 2<sup>nd</sup> <u>dolor</u>
    

When I run it through Sanitize, I expect the following output:

    <a href="/pants" target="foo"><b>ipsum</b></a><br><em><strong>dolor</strong></em> <i>amet</i> <img src="http://foo" height="20" width="20" align="center" alt="foo"> <small>H<sub>2</sub>0</small> 2<sup>nd</sup> <u>dolor</u>
    

But I get this:

    <a href="/pants" target="foo"><b>ipsum</b></a><br><em><strong>dolor</strong></em> <i>amet</i> <img src="http://foo" height="20" width="20" align="center" alt="foo"><small>H<sub>2</sub>0</small> 2<sup>nd</sup><u>dolor</u>
    

The difference is that the space after the <img> tag as well as the space after the </sup> tag have been removed. It doesn't seem to matter whether I specify :html or :xhtml as the output. Here is the config that I am using:

    # Use a custom set of options for the Sanitize gem
    SANITIZE_CONFIG = {
      # HTML elements to allow. By default, no elements are allowed (which means
      # that all HTML will be stripped).
      :elements => %w[a b br em i img small strong sub sup u],
    
      # HTML attributes to allow in specific elements. By default, no attributes
      # are allowed.
      :attributes => {
        :all => [],
        'a' => ['href', 'target', 'rel'],
        'img' => ['align', 'alt', 'height', 'src', 'width']
      },
    
      # URL handling protocols to allow in specific attributes. By default, no
      # protocols are allowed. Use :relative in place of a protocol if you want
      # to allow relative URLs sans protocol.
      :protocols => {
        'a' => {'href' => ['ftp', 'http', 'https', :relative]},
        'img' => {'src' => ['http', 'https', :relative]}
      },
    
      # Whether or not to allow HTML comments. Allowing comments is strongly
      # discouraged, since IE allows script execution within conditional
      # comments.
      :allow_comments => false,
    
      # HTML attributes to add to specific elements. By default, no attributes
      # are added.
      :add_attributes => {},
    
      # Output format. Supported formats are :html and :xhtml. Default is :html.
      :output => :html,
    
      # Character encoding to use for HTML output. Default is 'utf-8'.
      :output_encoding => 'utf-8',
    
      # If this is true, Sanitize will remove the contents of any filtered
      # elements in addition to the elements themselves. By default, Sanitize
      # leaves the safe parts of an element's contents behind when the element
      # is removed.
      #
      # If this is an Array of element names, then only the contents of the
      # specified elements (when filtered) will be removed, and the contents of
      # all other filtered elements will be left behind.
      :remove_contents => false,
    
      # Transformers allow you to filter or alter nodes using custom logic. See
      # README.rdoc for details and examples.
      :transformers => [],
    
      # By default, transformers perform depth-first traversal (deepest node
      # upward). This setting allows you to specify transformers that should
      # perform breadth-first traversal (top node downward).
      :transformers_breadth => [],
    
      # Elements which, when removed, should have their contents surrounded by
      # space characters to preserve readability. For example,
      # `foo<div>bar</div>baz` will become 'foo bar baz' when the <div> is
      # removed.
      :whitespace_elements => %w[
      address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
      h6 header hgroup hr li nav ol p pre section ul
      ]
    }
    

Support for second level style attributes

Hey

I just ran into an issue where I'd like to allow users to pass the style attribute to a div or span:

    <span style="font-weight:bold"></span>
    <div style="text-align:center"></div>
    

However, I would like to whitelist the style properties that they can pass.

This doesn't seem possible with the current setup. Would you be willing to add another setting for allowed style properties, similar to what you're doing with the protocols setting?

HTML comments in <style> block

Hello Ryan,
This seems to be related to Nokogiri behaviour: I can't remove comments inside <style> nodes (HTML code produced by Microsoft Word):

    >> Sanitize.clean("<p><!-- test --></p>")
    => ""
    >> Sanitize.clean("<style><!-- test --></style>")
    => "&lt;!-- test --&gt;"
    

Comments in <style> nodes aren't removed. Nokogiri treats the content as a CDATA node instead of a Comment (which is probably normal):

    <Nokogiri::XML::CDATA:0x83194bb8 "<!-- test -->">
    

I could remove this <style> node with a transformer. Any better idea?

Emilien

Does not fully pass the XSS cheat sheet

http://ha.ckers.org/xss.html

Tried a random one on:

http://sanitize.pieisgood.org/

With relaxed settings:

    <A HREF="http://www.gohttp://www.google.com/ogle.com/">XSS</A>
    <A HREF="javascript:document.location='http://www.google.com/'">XSS</A>
    <A HREF="//google">XSS</A>
    <SCRIPT a=">'>" SRC="http://ha.ckers.org/xss.js"></SCRIPT>
    <HEAD><META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=UTF-7"> </HEAD>+ADw-SCRIPT+AD4-alert('XSS');+ADw-/SCRIPT+AD4-
    

Relax Dependency on nokogiri

I've been trying to use Sanitize with Nokogiri 1.5.0.beta.4, but Bundler fails because it can't find a compatible Nokogiri version once 1.5 is enabled. So, might it be possible to relax the dependency on Nokogiri from ~> 1.4.4 to ~> 1.4, thus including all 1.x point releases?

can't dup NilClass

This happens when running Sanitize.clean(string, Sanitize::Config::BASIC).

Has anyone ever had this problem? Any idea how to fix it?

Thanks

Option to strip whitespace?

It would be nice if Sanitize supported cleansing whitespace in addition to HTML.

    html = "some\t\ttext\t\nwith&nbsp;tabs    and\n\n newlines"
    expected_result = "some text with tabs and newlines"  # nbsp replaced w/ space
    actual_result = "some\t\ttext\t\nwith tabs    and\n\n newlines"
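Until something like this lands in Sanitize itself, a plain-Ruby post-processing step can get the expected result (a sketch; handling both the literal non-breaking-space character and a leftover `&nbsp;` entity is an assumption about what Sanitize leaves behind):

```ruby
# Sketch: normalize whitespace after sanitizing. Replaces the non-breaking
# space (as a character or as a leftover &nbsp; entity) with a plain space,
# then collapses runs of spaces, tabs, and newlines.
def normalize_whitespace(text)
  text.gsub(/\u00A0|&nbsp;/, ' ')
      .gsub(/[ \t\r\n]+/, ' ')
      .strip
end

normalize_whitespace("some\t\ttext\t\nwith\u00A0tabs    and\n\n newlines")
# => "some text with tabs and newlines"
```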
    

transformers won't work

Even when I use your example, it doesn't work:

    >> html        = '<div><span>foo</span></div>'
    => "<div><span>foo</span></div>"
    >>   transformer = lambda{|env| puts env[:node].name }
    => #<Proc:0x0000000101271968@(irb):3>
    >> 
    ?>   # Prints "span", then "div". 
    ?>   Sanitize.clean(html, :transformers => transformer)
    => "foo"
    

I'm using the default Ruby 1.8.7 installation on OS X 10.6 and the newest versions of sanitize and its dependencies.

Should convert BR and P to whitespace

We're trying to convert our home-grown sanitizer to use Sanitize instead, and we're seeing a bunch of failing edge cases.

    html = "Hello<br>Fred.<br /><p class=\"test\" >How's it\n\n going?</p>Fine. Thank you."
    expected_result = "Hello Fred. How's it going? Fine. Thank you."
    actual_result = "HelloFred.How's it\n\n going?Fine. Thank you."
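One option worth trying (a config sketch; whether your installed version supports `:whitespace_elements`, and accepts `br` and `p` in it, is an assumption based on the config shown elsewhere in this thread):

```ruby
# Config sketch: ask Sanitize to pad removed <br> and <p> elements with
# spaces via :whitespace_elements. Support for this option in your
# installed version is an assumption.
WS_CONFIG = {
  :elements => [],
  :whitespace_elements => %w[br p]
}
# Sanitize.clean(html, WS_CONFIG)
```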
    

Error: premature end of regular expression

Today I ran into an unusual error. I'm using Sanitize on a substring in my Rails project, like this:

    Sanitize.clean project.description[0,150], Sanitize::Config::BASIC

On one particular project, the 150th character was a tick mark. This generated an error:

    "premature end of regular expression"

I resolved this by deleting a character in the description so that the 150th character was a letter.

I'm not able to test on 1.2.0, and I couldn't find your changelog to see whether the problem has already been resolved. (I'm new to GitHub.)

    nokogiri (1.3.3)
    sanitize (1.1.0)
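A likely cause (an assumption, not confirmed against this exact setup): on Ruby 1.8, `str[0,150]` slices by bytes, so a multibyte character such as a tick mark can be cut in half, leaving an invalid string that breaks the regular expressions Sanitize runs internally. Truncating on character boundaries avoids the problem:

```ruby
# Sketch: truncate on character boundaries rather than byte boundaries,
# so a multibyte character is never cut in half.
def safe_truncate(str, limit)
  str.chars.first(limit).join
end

safe_truncate("protéines", 5)  # => "proté"
```

(On Ruby 1.9+, `str[0,150]` already slices by characters, so simply upgrading may also make the error go away.)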

How to remove the head of an HTML document?

I have a textarea where people sometimes paste HTML documents. How can I remove the head section and just leave the body section? Here's a snippet from my config:

    :transformers => [
      lambda { |env|
        env[:node].remove if env[:node].name == 'head'
        {}
      },
    ]
    

But when I try this out, I get these results:

    >> html = '<html><head>assdfd</head><body><p>hi</p></body></html>'
    >> Sanitize.clean html, Sanitize::Config::MyConfig
    => "<p>assdfd</p><p>hi</p>"
    

But I would like to get <p>hi</p>. Any suggestions?

Thanks and regards,
Andy
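A possible explanation (an assumption): Nokogiri's fragment parser doesn't allow a <head> element inside a fragment, so by the time the transformer runs there is no head node left, just its hoisted children. One rough workaround is to strip the head before handing the string to Sanitize; regexes on HTML are fragile, so treat this as a sketch only:

```ruby
# Sketch: crude pre-processing that drops a <head>...</head> section
# before sanitizing. Fragile against malformed markup; a sturdier fix
# would parse the string as a full document and extract the <body>.
def strip_head(html)
  html.sub(/<head\b.*?<\/head>/mi, '')
end

strip_head('<html><head>assdfd</head><body><p>hi</p></body></html>')
# => "<html><body><p>hi</p></body></html>"
```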

arithmetic operators

Why, with text = "soit 52 kcal, protéines < 0,1g, glucides : ",

does Sanitize.clean(text) return "soit 52 kcal, protéines"?
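What's most likely happening (an assumption): the HTML parser treats the bare `<` as the start of a tag, so everything from `< 0,1g` onward is consumed as a malformed tag and removed. Escaping any `<` that can't start a real tag before sanitizing is one rough workaround:

```ruby
# Sketch: escape a "<" that is not followed by a tag-name letter, a "/",
# or a "!", so the parser can't mistake it for the start of a tag.
# The character class is a rough heuristic, not a full HTML tokenizer.
def escape_lone_lt(text)
  text.gsub(/<(?![a-zA-Z\/!])/, '&lt;')
end

escape_lone_lt("protéines < 0,1g")  # => "protéines &lt; 0,1g"
```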

Incredibly slow

Hi, I just tried to use this gem to sanitize web-page HTML source before indexing it, but it makes the indexing process very, very slow:

Without Sanitize:

    Benchmark.measure { SiteSource.reindex }
     =>   6.090000   0.220000   6.310000 (  8.211175)
    

With Sanitize (Sanitize.clean(page_source, :remove_contents => %w[script style])):

    Benchmark.measure { SiteSource.reindex }
     => 120.590000   0.230000 120.820000 (122.315429)
    

It's also noticeably slow even when I try to sanitize a single HTML string from the Rails console:

    1.9.3p0 :009 > Benchmark.measure { Sanitize.clean(SiteSource.last.data["source"]) }
     =>   2.030000   0.000000   2.030000 (  2.023783)
    
    1.9.3p0 :010 > Benchmark.measure { page_source }
     =>   2.030000   0.000000   2.030000 (  2.038710)
    
    1.9.3p0 :011 > Benchmark.measure { page_source }
     =>   0.270000   0.000000   0.270000 (  0.271550)
    
    1.9.3p0 :012 > Benchmark.measure { page_source }
     =>   2.060000   0.060000   2.120000 (  2.122617)
    
    1.9.3p0 :013 > Benchmark.measure { page_source }
     =>   0.230000   0.000000   0.230000 (  0.237719)
    

It's weird that the times are so different.

sanitize replaces & with &amp; on whitelisted elements

I'm using a modified version of the YouTube filter to allow Grooveshark embeds. These have a URL like this in them, which Sanitize is mangling despite my best efforts:
hostname=cowbell.grooveshark.com&widgetID=25096656&style=metal&p=0 - the &'s are being replaced with &amp;, which breaks the embed.

Full details:
https://gist.github.com/998274

Update:
It looks like the &s weren't my only problem: Safari/Chrome get upset about a URL being posted in an embed and then shown on the next page, and won't show it for security reasons. But refreshing will show it, so I'm not sure what to do about that.
