flavorjones / loofah

Ruby library for HTML/XML transformation and sanitization

License: MIT License

Ruby 63.14% HTML 36.76% JavaScript 0.10%

loofah's Introduction

Loofah

https://github.com/flavorjones/loofah
Docs: http://rubydoc.info/github/flavorjones/loofah/main/frames
Mailing list: [email protected]

Description

Loofah is a general library for manipulating and transforming HTML/XML documents and fragments, built on top of Nokogiri.

Loofah also includes some HTML sanitizers based on html5lib's safelist, which are a specific application of the general transformation functionality.

Active Record extensions for HTML sanitization are available in the loofah-activerecord gem.

Features

Easily write custom transformations for HTML and XML
Common HTML sanitizing transformations are built-in:
- Strip unsafe tags, leaving behind only the inner text.
- Prune unsafe tags and their subtrees, removing all traces that they ever existed.
- Escape unsafe tags and their subtrees, leaving behind lots of &lt; and &gt; entities.
- Whitewash the markup, removing all attributes and namespaced nodes.
Other common HTML transformations are built-in:
- Add the nofollow attribute to all hyperlinks.
- Add the target=_blank attribute to all hyperlinks.
- Remove unprintable characters from text nodes.
Format markup as plain text, with (or without) sensible whitespace handling around block elements.
Replace Rails's strip_tags and sanitize view helper methods.

Compare and Contrast

Loofah is both:

a general framework for transforming XML, XHTML, and HTML documents
a specific toolkit for HTML sanitization

General document transformation

Loofah tries to make it easy to write your own custom scrubbers for whatever document transformation you need. You don't like the built-in scrubbers? Build your own, like a boss.

HTML sanitization

Another Ruby library that provides HTML sanitization is rgrove/sanitize, which is also built on top of Nokogiri and provides a bit more flexibility on the tags and attributes being scrubbed.

You may also want to look at rails/rails-html-sanitizer which is built on top of Loofah and provides some useful extensions and additional flexibility in the HTML sanitization.

The Basics

Loofah wraps Nokogiri in a loving embrace. Nokogiri is a stable, well-maintained parser for XML, HTML4, and HTML5.

Loofah implements the following classes:

Loofah::HTML5::Document
Loofah::HTML5::DocumentFragment
Loofah::HTML4::Document (aliased as Loofah::HTML::Document for now)
Loofah::HTML4::DocumentFragment (aliased as Loofah::HTML::DocumentFragment for now)
Loofah::XML::Document
Loofah::XML::DocumentFragment

These document and fragment classes are subclasses of the similarly-named Nokogiri classes Nokogiri::HTML5::Document et al.

Loofah also implements Loofah::Scrubber, which represents the document transformation, either by wrapping a block,

span2div = Loofah::Scrubber.new do |node|
  node.name = "div" if node.name == "span"
end

or by implementing a method.

Side Note: Fragments vs Documents

Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don't have a document, you have a fragment. For HTML, another rule of thumb is that documents have html and body tags, and fragments usually do not.

HTML fragments should be parsed with Loofah.html5_fragment or Loofah.html4_fragment. The result won't be wrapped in html or body tags, won't have a DOCTYPE declaration, head elements will be silently ignored, and multiple root nodes are allowed.

HTML documents should be parsed with Loofah.html5_document or Loofah.html4_document. The result will have a DOCTYPE declaration, along with html, head and body tags.

XML fragments should be parsed with Loofah.xml_fragment. The result won't have a DOCTYPE declaration, and multiple root nodes are allowed.

XML documents should be parsed with Loofah.xml_document. The result will have a DOCTYPE declaration and a single root node.

Side Note: HTML4 vs HTML5

⚠ HTML5 functionality is not available on JRuby, or with versions of Nokogiri < 1.14.0.

Currently, Loofah's methods Loofah.document and Loofah.fragment are aliases to .html4_document and .html4_fragment, which use Nokogiri's HTML4 parser. (Similarly, Loofah::HTML::Document and Loofah::HTML::DocumentFragment are aliased to Loofah::HTML4::Document and Loofah::HTML4::DocumentFragment.)

Please note that in a future version of Loofah, these methods and classes may switch to using Nokogiri's HTML5 parser and classes on platforms that support it [1].

We strongly recommend that you explicitly use .html5_document or .html5_fragment unless you know of a compelling reason not to. If you are sure that you need to use the HTML4 parser, you should explicitly call .html4_document or .html4_fragment to avoid breakage in a future version.

[1]: [feature request] HTML5 parser for JRuby implementation · Issue #2227 · sparklemotion/nokogiri

`Loofah::HTML5::Document` and `Loofah::HTML5::DocumentFragment`

These classes are subclasses of Nokogiri::HTML5::Document and Nokogiri::HTML5::DocumentFragment.

The module methods Loofah.html5_document and Loofah.html5_fragment will parse an HTML document and an HTML fragment, respectively.

Loofah.html5_document(unsafe_html).is_a?(Nokogiri::HTML5::Document)         # => true
Loofah.html5_fragment(unsafe_html).is_a?(Nokogiri::HTML5::DocumentFragment) # => true

Loofah injects a scrub! method, which takes either a symbol (for built-in scrubbers) or a Loofah::Scrubber object (for custom scrubbers), and modifies the document in-place.

Loofah overrides to_s to return HTML:

unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"

doc = Loofah.html5_fragment(unsafe_html).scrub!(:prune)
doc.to_s    # => "ohai! <div>div is safe</div> "

and text to return plain text:

doc.text    # => "ohai! div is safe "

Also, to_text is available, which does the right thing with whitespace around block-level and line break elements.

doc = Loofah.html5_fragment("<h1>Title</h1><div>Content<br>Next line</div>")
doc.text    # => "TitleContentNext line"            # probably not what you want
doc.to_text # => "\nTitle\n\nContent\nNext line\n"  # better

`Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`

These classes are subclasses of Nokogiri::HTML4::Document and Nokogiri::HTML4::DocumentFragment.

The module methods Loofah.html4_document and Loofah.html4_fragment will parse an HTML document and an HTML fragment, respectively.

Loofah.html4_document(unsafe_html).is_a?(Nokogiri::HTML4::Document)         # => true
Loofah.html4_fragment(unsafe_html).is_a?(Nokogiri::HTML4::DocumentFragment) # => true

`Loofah::XML::Document` and `Loofah::XML::DocumentFragment`

These classes are subclasses of Nokogiri::XML::Document and Nokogiri::XML::DocumentFragment.

The module methods Loofah.xml_document and Loofah.xml_fragment will parse an XML document and an XML fragment, respectively.

Loofah.xml_document(bad_xml).is_a?(Nokogiri::XML::Document)         # => true
Loofah.xml_fragment(bad_xml).is_a?(Nokogiri::XML::DocumentFragment) # => true

Nodes and Node Sets

Nokogiri's Node and NodeSet classes also get a scrub! method, which makes it easy to scrub subtrees.

The following code will apply the employee_scrubber only to the employee nodes (and their subtrees) in the document:

Loofah.xml_document(bad_xml).xpath("//employee").scrub!(employee_scrubber)

And this code will only scrub the first employee node and its subtree:

Loofah.xml_document(bad_xml).at_xpath("//employee").scrub!(employee_scrubber)

`Loofah::Scrubber`

A Scrubber wraps up a block (or method) that is run on a document node:

# change all <span> tags to <div> tags
span2div = Loofah::Scrubber.new do |node|
  node.name = "div" if node.name == "span"
end

This can then be run on a document:

Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
# => "<div>foo</div><p>bar</p>"

Scrubbers can be run on a document in either a top-down traversal (the default) or bottom-up. Top-down scrubbers can optionally return Scrubber::STOP to terminate the traversal of a subtree. Read below and in the Loofah::Scrubber class for more detailed usage.

Here's an XML example:

# remove all <employee> tags that have a "deceased" attribute set to true
bring_out_your_dead = Loofah::Scrubber.new do |node|
  if node.name == "employee" and node["deceased"] == "true"
    node.remove
    Loofah::Scrubber::STOP # don't bother with the rest of the subtree
  end
end
Loofah.xml_document(File.read('plague.xml')).scrub!(bring_out_your_dead)

Built-In HTML Scrubbers

Loofah comes with a set of sanitizing scrubbers that use html5lib's safelist algorithm:

doc = Loofah.html5_document(input)
doc.scrub!(:strip)       # replaces unknown/unsafe tags with their inner text
doc.scrub!(:prune)       # removes unknown/unsafe tags and their children
doc.scrub!(:escape)      # escapes unknown/unsafe tags, like this: &lt;script&gt;
doc.scrub!(:whitewash)   # removes unknown/unsafe/namespaced tags and their children,
                         # and strips all node attributes

Loofah also comes with some common transformation tasks:

doc.scrub!(:nofollow)    # adds rel="nofollow" attribute to links
doc.scrub!(:noopener)    # adds rel="noopener" attribute to links
doc.scrub!(:noreferrer)  # adds rel="noreferrer" attribute to links
doc.scrub!(:unprintable) # removes unprintable characters from text nodes
doc.scrub!(:targetblank) # adds target="_blank" attribute to links

See Loofah::Scrubbers for more details and example usage.

Chaining Scrubbers

You can chain scrubbers:

Loofah.html5_fragment("<span>hello</span> <script>alert('OHAI')</script>") \
      .scrub!(:prune) \
      .scrub!(span2div).to_s
# => "<div>hello</div> "

Shorthand

The class methods Loofah.scrub_html5_fragment and Loofah.scrub_html5_document (and the corresponding HTML4 methods) are shorthand.

These methods:

Loofah.scrub_html5_fragment(unsafe_html, :prune)
Loofah.scrub_html5_document(unsafe_html, :prune)
Loofah.scrub_html4_fragment(unsafe_html, :prune)
Loofah.scrub_html4_document(unsafe_html, :prune)
Loofah.scrub_xml_fragment(bad_xml, custom_scrubber)
Loofah.scrub_xml_document(bad_xml, custom_scrubber)

do the same thing as (and are arguably semantically clearer than):

Loofah.html5_fragment(unsafe_html).scrub!(:prune)
Loofah.html5_document(unsafe_html).scrub!(:prune)
Loofah.html4_fragment(unsafe_html).scrub!(:prune)
Loofah.html4_document(unsafe_html).scrub!(:prune)
Loofah.xml_fragment(bad_xml).scrub!(custom_scrubber)
Loofah.xml_document(bad_xml).scrub!(custom_scrubber)

View Helpers

Loofah has two "view helpers": Loofah::Helpers.sanitize and Loofah::Helpers.strip_tags, both of which are drop-in replacements for the Rails Action View helpers of the same name.

These are not required automatically. You must require loofah/helpers to use them.

Requirements

Nokogiri >= 1.5.9

Installation

Unsurprisingly:

gem install loofah

Requirements:

Ruby >= 2.5

Support

The bug tracker is available here:

https://github.com/flavorjones/loofah/issues

And the mailing list is on Google Groups:

Mail: [email protected]
Archive: https://groups.google.com/forum/#!forum/loofah-talk

Consider subscribing to Tidelift which provides license assurances and timely security notifications for your open source dependencies, including Loofah. Tidelift subscriptions also help the Loofah maintainers fund our automated testing which in turn allows us to ship releases, bugfixes, and security updates more often.

Security

See SECURITY.md for vulnerability reporting details.

Authors

Mike Dalessio (@flavorjones)
Bryan Helmkamp

Featuring code contributed by:

And a big shout-out to Corey Innis for the name, and feedback on the API.

Thank You

The following people have generously funded Loofah with financial sponsorship:

Bill Harding
Sentry @getsentry

Historical Note

This library was once named "Dryopteris", which was a very bad name that nobody could spell properly.

License

Distributed under the MIT License. See MIT-LICENSE.txt for details.


loofah's Issues

Scrub not fully applied on HTML::Document

I noticed that some HTML comment tags are not removed.

Here is an example, my_scrub should remove all the comments.

Loofah.document("<!DOCTYPE html><!--[if IE 7]><!-- --><html><body><script></script></body></html><!--ww -->").scrub!(my_scrub).to_xml
=> "<!DOCTYPE html>\n<!--[if IE 7]><!-- --><html></html>\n"

I checked the code and think the problem is here:

https://github.com/flavorjones/loofah/blob/master/lib/loofah/instance_methods.rb#L41

        case self
        when Nokogiri::XML::Document
          scrubber.traverse(root) if root
        when Nokogiri::XML::DocumentFragment
          children.scrub! scrubber
        else
          scrubber.traverse(self)
        end

So even an HTML::Document goes through scrubber.traverse(root) if root, which means that nodes outside the root html element never pass through the scrubber.

0.4.7 not compatible with Nokogiri 1.3.3

There is a call to at_xpath which means that scrub_fragment does not work. 0.4.3 works.

irb(main):002:0> Loofah.scrub_fragment("Bold", :prune).to_s
NoMethodError: undefined method `at_xpath' for #
        from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:31:in `serialize_root'
        from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:26:in `to_s'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `inspect'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_str'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_s'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `write'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `print'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:259:in `signal_status'
        from /opt/local/lib/ruby/1.8/irb.rb:147:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:146:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:70:in `start'
        from /opt/local/lib/ruby/1.8/irb.rb:69:in `catch'
        from /opt/local/lib/ruby/1.8/irb.rb:69:in `start'
        from /opt/local/bin/irb:13

whitewash test error with libxml 2.9.1

I'm seeing a test failure with the newer LibXML 2.9.0 in Fedora 19 and above:

=> testing with Nokogiri {"warnings"=>["Nokogiri was built against LibXML version 2.9.0, but has dynamically loaded 2.9.1"], "nokogiri"=>"1.5.9", "ruby"=>{"version"=>"2.0.0", "platform"=>"i386-linux", "description"=>"ruby 2.0.0p353 (2013-11-22 revision 43784) [i386-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "compiled"=>"2.9.0", "loaded"=>"2.9.1"}}

The same test succeeds on CentOS 6, with Ruby 1.9.3 and LibXML 2.8.0:

=> testing with Nokogiri {"warnings"=>[], "nokogiri"=>"1.6.1", "ruby"=>{"version"=>"1.9.3", "platform"=>"x86_64-linux", "description"=>"ruby 1.9.3p327 (2012-11-10 revision 37606) [x86_64-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "source"=>"packaged", "libxml2_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxml2/2.8.0", "libxslt_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxslt/1.1.26", "compiled"=>"2.8.0", "loaded"=>"2.8.0"}}

The failing test is this one:

  1) Failure:
IntegrationTestAdHoc#test_fragment_whitewash_on_microsofty_markup [/builddir/build/BUILD/loofah-1.2.1/usr/share/gems/gems/loofah-1.2.1/test/integration/test_ad_hoc.rb:146]:
--- expected
+++ actual
@@ -1 +1,4 @@
-"<p>Foo <b>BOLD</b></p>"
+"
+
+<p>Foo <b>BOLD</b></p>
+"

For some reason, nokogiri must be inserting some blank lines there.

I looked at the Travis builds, and apparently Travis is using the older LibXML 2.8.0, which might explain why we haven't seen this fail in Travis yet.

Not all CSS properties are whitelisted

One example is word-spacing, which is listed at http://www.w3.org/TR/CSS2/propidx.html but isn't included in Loofah's whitelist (or the whitelist in html5lib-ruby which is obviously not being maintained anymore).

There's a greater problem here of keeping whitelists up to date with specs. Perhaps we can solve both.

to_text and Loofah::Helpers

Hi,
is it possible that no longer loading Loofah::Helpers automatically
breaks to_text?
irb
require 'loofah'
Loofah.fragment("abc").to_text
NameError: uninitialized constant Loofah::Helpers

OTOH Loofah.fragment("abc").to_s works.

Or is to_text considered part of the ActionView helpers?
Feels a bit strange. And: the to_text method itself is there.

best
Morus

Not support -Ku option for ruby

Rake fails with error if ruby runs with RUBYOPT=-Ku

rake db:create
/home/nleo/.rvm/gems/ruby-1.9.3-p327/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/

improve rails integration points

I found that just config.gem'ing loofah didn't quite work, because it hadn't extended ActiveRecord::Base with html_fragment and friends.

require 'loofah/active_record' was enough to fix. I noticed an init.rb too, but it was referring to classes that didn't exist.

I took a page from factory_girl's initialization, and added some logic to lib/loofah.rb to require loofah/active_record if Rails is present. I also updated init.rb to just require loofah.

Updated in my fork/branch: http://github.com/technicalpickles/loofah/tree/better-rails-int

Loofah broken by colons

I'm using JRuby 1.6.5, Loofah 1.2.1, Nokogiri 1.5.5-java. Loofah is broken when it comes to colons inside the body of any tags. Here's a succinct example of what's wrong, from a console session:

>> Loofah.fragment("4:30am").to_s
=> "4:30am"
>> Loofah.fragment("<span>4:30am</span>").to_s
=> "<span>30am</span>"

>> Loofah.fragment("4:30am")
=> #<Loofah::HTML::DocumentFragment:0x1086 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1084 "4:30am">]>
>> Loofah.fragment("<span>4:30am</span>")
=> #<Loofah::HTML::DocumentFragment:0x108c name="#document-fragment" children=[#<Nokogiri::XML::Element:0x108a name="span" children=[#<Nokogiri::XML::Text:0x1088 "30am">]>]>

Notice how the "4:" has been mysteriously chomped. It doesn't matter if it's span, b, i or any other tag -- all remove the "4:". It seems that Nokogiri is able to parse the HTML initially just fine:

>> Nokogiri::HTML("4:30am")
=> #<Nokogiri::HTML::Document:0x1074 name="document" children=[#<Nokogiri::XML::DTD:0x106a name="html">, #<Nokogiri::XML::Element:0x1072 name="html" children=[#<Nokogiri::XML::Element:0x106c name="head">, #<Nokogiri::XML::Element:0x1070 name="body" children=[#<Nokogiri::XML::Text:0x106e "4:30am">]>]>]>

>> Nokogiri::HTML("<b>4:30am</b>")
=> #<Nokogiri::HTML::Document:0x1082 name="document" children=[#<Nokogiri::XML::DTD:0x1076 name="html">, #<Nokogiri::XML::Element:0x1080 name="html" children=[#<Nokogiri::XML::Element:0x1078 name="head">, #<Nokogiri::XML::Element:0x107e name="body" children=[#<Nokogiri::XML::Element:0x107c name="b" children=[#<Nokogiri::XML::Text:0x107a "4:30am">]>]>]>]>

So it looks like it must be Loofah that's removing it.

feature: xss_foliate should be able to accept custom scrubbers

currently only accepts built-in scrubbers

feature: text() should optionally do something intelligent with whitespace

For example,

  Loofah.fragment("<div>foo</div><div>bar</div>").text
  => "foobar"

For block HTML elements, we should insert whitespace (maybe even newline), so that the desired output would be:

  foo bar

  foo
  bar

Can't use negative value in html style attribute

Very cool sanitizer, but when I'm trying to use letter-spacing with negative value in one of the properties, it just returns 'style' without any properties

multibyte regex error

Reported on ruby-talk by Une Bévue [email protected]

because i'm using daily nokogiri i wanted to test loofah with a small
script (coming from http://loofah.rubyforge.org/loofah/) :

#! /opt/local/bin/ruby1.9
# encoding: utf-8

require 'rubygems'
require 'nokogiri'
require 'loofah'

unsafe_html="ohai! <div>div is safe</div> <script>but script is not</script>"

doc=Loofah.fragment(unsafe_html).scrub!(:strip)
puts doc.to_s

however i got :

SyntaxError:
/opt/local/lib/ruby1.9/gems/1.9.1/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/
method require in untitled document at line 29
method require in untitled document at line 29
method <top (required)> in loofah.rb at line 9
method require in untitled document at line 33
method rescue in require in untitled document at line 33
method require in untitled document at line 29
method <main> in loofah_first_test.rb at line 22

ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10]
over Mac OS X SL

Confirm that all HTML5 tags are whitelisted

Canonical list here:

http://www.w3.org/TR/html-markup/elements.html

undefined method `lstrip' for :Loofah::HTML::DocumentFragment

undefined method `lstrip' for :Loofah::HTML::DocumentFragment
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:21:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `parse'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:179:in `fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:184:in `scrub_fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/active_record.rb:27:in `html_fragment'

The code that I was using:

html_fragment :content, :scrub => :whitewash

Duplicated range warning

I keep getting this warning, but I'm not too familiar with the regex contents to fix it.

lib/loofah/html5/scrub.rb:12: warning: character class has duplicated range: /[`\x00-\x20\x7F\s�-ā]/

//@rafaelfranca
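One plausible source of this warning, offered as an assumption rather than a confirmed diagnosis, is that \s overlaps the \x00-\x20 range in the same character class: every ASCII whitespace character that \s matches already falls inside that range. The constant names below are hypothetical, and the sketch only covers the ASCII part of the pattern:

```ruby
# ASCII whitespace characters matched by \s in Ruby regular expressions.
ASCII_WHITESPACE = [" ", "\t", "\r", "\n", "\f", "\v"].freeze

# The ASCII portion of the character class with \s left out.
WITHOUT_S = /[`\x00-\x20\x7F]/

# Every whitespace character is still matched, so \s adds nothing here.
covered = ASCII_WHITESPACE.all? { |char| char.match?(WITHOUT_S) }

puts covered
```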

encoding with ruby 1.9

I've checked to see if this issue is fixed.
https://github.com/flavorjones/loofah/issues/issue/26


irb(main):008:0> utf8_string = "あ<b>い</b>う<script>え</script>お"
=> "あ<b>い</b>う<script>え</script>お"
irb(main):009:0> Loofah.scrub_fragment(utf8_string, :strip).to_s
=> "あ<b>い</b>うã\u0081\u0088お"

I think above code should return あ<b>い</b>うえお.
(:escape works correctly.)

LoadError: no such file to load -- loofah/xml/document

In 0.4.0 - lib/loofah.rb:

require 'loofah/xml/document'

This file is not in the gem. I'm puzzled why this isn't reported - doesn't this prevent everybody from using the gem??

Syntax error when running test in Textmate

loofah-1.2.0/lib/loofah/html5/scrub.rb:10: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/

When I remove the failing line 10, it works!

Ruby 1.9.1 and loofah 0.2.2 Encoding error. Ruby 1.8.7 is OK.

Just got this under Ruby 1.9.1 while parsing http://www.fd.nl/nieuws/laatstenieuws/?view=RSS

/home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:42:in `characters': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `native_parse_memory'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:11:in `initialize'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `new'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `parse'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:179:in `fragment'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:184:in `scrub_fragment'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1362:in `strip_html'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:442:in `get_feed_summary'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:888:in `update_rss_feed'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1130:in `update_feed'

The strip_html() method looks like this...

class String
  def strip_html
    html = Nokogiri::HTML.fragment(self.dup)
    (html/:br).each {|_br| _br.swap(' ') }
    (html/:p).each {|_p| _p.swap(_p.content + ' ') }
    Loofah.scrub_fragment(html.content, :strip).text
  end
end

The text it's working on is:

fd.nl - De nieuwsbron van ondernemend Nederland. Financieel-economisch nieuws, achtergronden en analyses. Artikelen over beleggen, carrière en ondernemen. Uitgebreide koersen, rentestanden en bedrijfsinformatie. Met veel aandacht voor aandelenmarkten en beleggingsfondsen.

Please integrate latest version of HTML5 sanitizer to allow support for HTML5 "data-" attributes

The source for whitelist.rb has been updated, and already includes this support, but needs to be integrated into loofah:
https://github.com/html5lib/html5lib-ruby/blob/master/lib/html5/sanitizer.rb

Loofah escaping carriage returns

We have carriage returns in our text fields, and loofah is turning them into entities.

This is undesirable. Is there a way to stop it?

Removes valid element <figure>

Loofah removes the <figure> element.

https://developer.mozilla.org/en-US/docs/HTML/Element/figure

Loofah.scrub_fragment("<span>hello</span> <figure>asd</figure>", :prune).to_s

Errors out in a Rails project that doesn't use ActiveRecord (like MongoMapper)

I'm using MongoMapper on a project and loading Feedzirra, which uses loofah. phew

I remove ActiveRecord from my configuration and get this error message:
/Library/Ruby/Gems/1.8/gems/activesupport-2.3.5/lib/active_support/dependencies.rb:443:in `load_missing_constant':NameError: uninitialized constant ActiveRecord

Trying to use rake also results in an error:
uninitialized constant ActiveRecord

Using this to remove AR:
config.frameworks -= [:active_record]

Strange behaviour with HTML comments

I cannot make sense of this.

Loofah.fragment('link').to_html

=> "link"

Loofah.fragment(' link').to_html

=> "link"

where is the link tag? But ok if comment is not at the start:

Loofah.fragment('
link').to_html

=> "
link"

feature: allow Loofah-ization of an existing Nokogiri document or fragment

Currently, the following code will add scrubbing behavior:

doc = Nokogiri::HTML HTML
doc.extend Loofah::ScrubBehavior::Node

but that doesn't include #text or the node and node set decorators.

So, I'd suggest that we allow Loofah initialize methods to receive a Nokogiri document / fragment and do the extension there. Then we should check that we're not relying on the typeiness of the Loofah document / fragment.

(And, actually, can we then eliminate the classes entirely, and just use extended Nokogiri docs everywhere?)

feature: Allow developers to override the acceptable elements whitelist

Some projects call for a more limited whitelist of acceptable elements. It would be easy enough to monkey patch, but you may consider this request common enough to support it through a method call instead.

The ActiveRecord extension almost works for MongoMapper

The methods defined in the ActiveRecord extension class actually work as-is for MongoMapper. The line that extends ActiveRecord::Base, though, blows up when there's no ActiveRecord present and I include that class manually. I "fixed" this in my own branch (mrkurt/loofah@a336a9c) by moving the actual extension call into the Rails/ActiveRecord initialization bits. Would you mind tweaking the real gem so I can use the ActiveRecordExtension like this?

encoding with ruby 1.9

Is this a bug?


irb(main):012:0> utf8_string = "日本語"
=> "日本語"
irb(main):012:0> utf8_string.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> escaped = Loofah.scrub_fragment(utf8_string, :escape).to_s
=> "&#26085;&#26412;&#35486;"
irb(main):015:0> escaped.encoding
=> #<Encoding:US-ASCII>
irb(main):016:0> escaped.encode('UTF-8')
=> "&#26085;&#26412;&#35486;"
irb(main):019:0> escaped.force_encoding('UTF-8')
=> "&#26085;&#26412;&#35486;"

Software versions:

Ubuntu 10.04
rvm 1.0.20
ruby 1.9.2p0
nokogiri (1.4.4)
loofah (1.0.0.beta.1.20101025234603)

Which code printed this error to the console "element _: validity error : ID _ already defined"

I tried to find a way to silence those messages but could not locate the code in loofah or nokogiri.

Any idea where it comes from? libxml?

> Loofah.document("<i id=a></i><i id=a></i>").to_text
element i: validity error : ID a already defined

Consider removing unprintable characters?

We've encountered some issues with HTML containing unprintable characters (namely \u2028\u2029) over on the Stringer project (stringer-rss/stringer#295). The issue manifests itself further down the chain once we try to load the HTML with unprintable characters as JSON.

We've resolved our issues by adding a gsub(/[^[:print:]]/, '') call after running the contents through loofah. I think it makes sense to add this as part of loofah's sanitizing process. Any thoughts before I take a stab at a PR?
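A minimal sketch of the gsub workaround described here, using plain Ruby strings. The helper name strip_unprintable is hypothetical. Note that [^[:print:]] also matches tabs and newlines, so this approach removes them along with \u2028 and \u2029:

```ruby
# Hypothetical helper wrapping the gsub call from this report.
def strip_unprintable(text)
  text.gsub(/[^[:print:]]/, "")
end

# Line separator (U+2028) and paragraph separator (U+2029) are removed.
separators = strip_unprintable("line\u2028separator\u2029end")

# Tabs and newlines are not in [:print:] either, so they are removed too,
# while ordinary spaces are kept.
controls = strip_unprintable("tab\tand\nnew line")

puts separators
puts controls
```

The README above also lists a built-in :unprintable scrubber, which operates on text nodes of a parsed document rather than on a raw string.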

chore: refactor tests into proper unit test and integration tests

Currently the grab bag that's in test_ad_hoc.rb is a hot mess.

OMG JRuby support

Travis is totally blowing up on the JRubies. Figure it out.

Scrubber inserts newlines between spans if there is no whitespace between them?

Bit of a weird one this. I've been trying to investigate this issue with clojure syntax highlighting on exercism.io, and it seems to be caused by some weirdness with how Loofah sanitization deals with span elements without whitespace between them. If there are no spaces between the spans then it adds a newline between each one. If there is a space between at least one set of spans, it leaves the whitespace intact.

Here's the example from the exercism.io issue. In each case the input is being run through Loofah.xml_fragment(html).scrub!(:escape).to_s.

# input with no spaces between spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>

# output has loads of newlines added
<pre class=\"highlight\">\n  <span class=\"p\">(</span>\n  <span class=\"k\">defn</span>\n  <span class=\"w\"/>\n  <span class=\"n\">to-rna</span>\n  <span class=\"p\">)</span>\n</pre>\n

# input with only one char difference, a single space between two of the spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>

# output retains its newlines correctly
<pre class=\"highlight\"><span class=\"p\">(</span><span class=\"k\">defn</span> <span class=\"w\"> </span> <span class=\"n\">to-rna</span><span class=\"p\">)</span></pre>\n

At this point I'm a bit lost - any ideas?

#to_text should preserve whitespace between inline elements

@flavorjones I love that #to_text preserves whitespace (as you added in #12), but it doesn't appear to preserve it between inline elements as you say in this post (sparklemotion/nokogiri#636), only between block elements. Would it be reasonable for it to preserve whitespace between inline elements? Am I missing an option to do this? If not, would you be open to a pull request to make this happen?

Loofah fails on the empty string

On my setup, Loofah gives a runtime error if you ask it to prune the empty string (but is fine with pruning a string with a single space):

>> Loofah.scrub_fragment(" ", :prune)
=> #<Loofah::HTML::DocumentFragment:0x1066 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1064 " ">]>

>> Loofah.scrub_fragment("", :prune)
NoMethodError: undefined method `scrub!' for []:Nokogiri::XML::NodeSet
    from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah/instance_methods.rb:44:in `scrub!'
    from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah.rb:49:in `scrub_fragment'
    from (irb):4:in `evaluate'
    from org/jruby/RubyKernel.java:1088:in `eval'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:158:in `eval_input'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:271:in `signal_status'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:155:in `eval_input'
    from org/jruby/RubyKernel.java:1420:in `loop'
    from org/jruby/RubyKernel.java:1192:in `catch'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:154:in `eval_input'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:71:in `start'
    from org/jruby/RubyKernel.java:1192:in `catch'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:70:in `start'
    from /home/neil/src/jruby-1.6.5/bin/jirb:13:in `(root)'

Rails 3 Support

Rails 3 Support / Bundler support with railties.

should handle <javascript>alert('evil')</javascript>

  obj = Object.new(:text => "&lt;javascript&gt;alert('evil')&lt;/javascript&gt;")
  obj.valid?
  obj.text.should == "alert('evil')"

 expected: "alert('evil')",
 got: "<javascript>alert('evil')</javascript>" (using ==)

Loofah seems to always :prune, no matter what scrubber is defined

No matter what scrubber I use, Loofah seems to always prune the contents. Here is a little test inside of Rails Console:

>> Loofah.scrub_fragment('I <3 You', :escape).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :prune).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :whitewash).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :strip).to_s
=> "I "
>> Loofah.scrub_fragment('I <3> You', :strip).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :whitewash).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :escape).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :prune).to_s
=> "I  You"

infos:
Ruby 1.9.3p327
Rails 3.2.11
Using gem 'loofah-activerecord'

from Gemfile.lock file

loofah (1.2.1)
  nokogiri (>= 1.4.4)
loofah-activerecord (1.1.0)
  loofah (>= 1.0.0)

Negative values inside of css properties.

I'm trying to sanitize a CSS property with a negative value, but Loofah removes the property.
I've attached the failing test:

class IntegrationTestCssNegative < Loofah::TestCase
  def test_css_negative_value_sanitization
    html = "<span style=\"word-spacing: -1px;\"></span>"
    sane = Nokogiri::HTML(Loofah.scrub_fragment(html, :escape).to_xml)
    assert_match %r/-1px/, sane.inner_html
  end
end

Escape removes numbers using greater than / less than.

The text '1<2 and 2>1' is incorrectly converted to '11'.

require 'Loofah'
=> []

str = "1<2 and 2>1"
=> "1<2 and 2>1"

result = Loofah.scrub_fragment(str,:escape)
=> #<Loofah::HTML::DocumentFragment:0x262233c name="#document-fragment" children
=[#<Nokogiri::XML::Text:0x2621cf2 "11">]>

result.to_s
=> "11"

exit
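A note on why this can happen: an HTML parser reads a bare `<` as the start of a tag, so `<2 and 2>` looks like markup rather than text. One possible workaround, assuming the input is meant to be plain text and not markup, is to entity-escape it before it reaches the parser. The sketch below uses only Ruby's standard library; the helper name `escape_plain_text` is made up for this example and is not a Loofah method.

```ruby
require "cgi"

# Hypothetical helper: entity-escapes text that should never be read as markup.
def escape_plain_text(text)
  CGI.escapeHTML(text)
end

puts escape_plain_text("1<2 and 2>1")
puts escape_plain_text("I <3 You")
```

This only fits input that should contain no tags at all; text that legitimately mixes markup with stray `<` characters needs a different approach.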

loofah not compatible with rails3

This blog post has details: http://www.mythoughtpot.com/2010/02/10/feedzirra-on-rails3/

$ rails server
/Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails.rb:44:in `configuration': undefined method `config' for nil:NilClass (NoMethodError)
    from /Library/Ruby/Gems/1.8/gems/loofah-0.4.7/lib/loofah.rb:89
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `each'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `each'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler.rb:112:in `require'
    from /Users/dphillips/cnxweb/config/application.rb:7
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28:in `require'
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27:in `tap'
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27
    from script/rails:6:in `require'
    from script/rails:6

error:`<module:XML>': uninitialized constant Nokogiri::XML::Document (NameError)

While running rspec tests for my application, which uses nokogiri, feedzirra and loofah, I get this error:

/home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:8:in `<module:XML>': uninitialized constant Nokogiri::XML::Document (NameError)
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:2:in `<module:Loofah>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:1:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah.rb:15:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `block (2 levels) in require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `each'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `block in require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `each'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler.rb:114:in `require'
    from /home/mydir/project_path/config/application.rb:13:in `<top (required)>'
    from /home/mydir/project_path/config/environment.rb:2:in `require'
    from /home/mydir/project_path/config/environment.rb:2:in `<top (required)>'
    from /home/mydir/project_path/spec/spec_helper.rb:3:in `require'
    from /home/mydir/project_path/spec/spec_helper.rb:3:in `<top (required)>'
    from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `require'
    from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `block in load_spec_files'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `map'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load_spec_files'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/command_line.rb:18:in `run'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:55:in `run_in_process'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:46:in `run'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:10:in `block in autorun'

I see the same error with ruby 1.8.7 and rspec 2.5.0. I'm running the rspec tests on Ubuntu 10.10 64-bit, and my RSS parsing gems are declared like this:
gem 'fii', '1.0.5'
gem "nokogiri"
gem "loofah", '1.0.0'
gem "feedzirra", '~> 0.0.24'

Please advise.

feature: allow developers to implement their own scrubbing methods

Implement a proper visitor pattern.
Make internal loofah methods available.
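To make the visitor idea concrete, here is a rough sketch in plain Ruby. Everything in it (`Node`, `TagCollector`, `visited_names`) is a hypothetical stand-in invented for this example; none of it is Loofah or Nokogiri code, and the real design may look quite different.

```ruby
# Hypothetical stand-in for a document node; not a Loofah or Nokogiri class.
Node = Struct.new(:name, :children) do
  # Depth-first walk that hands this node, then each descendant, to the visitor.
  def accept(visitor)
    visitor.visit(self)
    children.each { |child| child.accept(visitor) }
  end
end

# A visitor is any object that responds to #visit(node).
# This one just records the tag names it is shown.
class TagCollector
  attr_reader :seen

  def initialize
    @seen = []
  end

  def visit(node)
    @seen << node.name
  end
end

# Convenience wrapper: returns the tag names in visiting order.
def visited_names(tree)
  collector = TagCollector.new
  tree.accept(collector)
  collector.seen
end

tree = Node.new("div", [Node.new("p", [Node.new("b", [])]), Node.new("script", [])])
puts visited_names(tree).inspect
```

The point of the pattern is that traversal lives in one place (`accept`) while each scrubbing behaviour lives in its own visitor object, so developers could supply their own without touching the traversal.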

Documents with extended UTF-8 URIs causes regexp error

Any idea how to work around this issue?

ruby-1.9.2-p136 :002 > Loofah.document("<a href=\"\u5927\">").scrub!(:strip)
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
    from /Users/dphillips/.rvm/gems/ruby-1.9.2-p136/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20:in `gsub'
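For anyone trying to reproduce the error class outside Loofah, the sketch below triggers the same exception with plain Ruby. The pattern and helper names are assumptions chosen for illustration; the actual regexp at scrub.rb:20 is not shown in this report and may differ.

```ruby
# A regexp with the /n flag and a non-ASCII byte is fixed to ASCII-8BIT.
# Illustrative pattern only; it is not the regexp Loofah uses.
BINARY_PATTERN = /\xE5/n

# Returns true when matching the text against the binary regexp raises
# Encoding::CompatibilityError, false when the match attempt completes.
def binary_regexp_clash?(text)
  BINARY_PATTERN.match(text)
  false
rescue Encoding::CompatibilityError
  true
end

# Matches against a binary-tagged copy of the text, so both sides share
# the ASCII-8BIT encoding and no error is raised.
def binary_match?(text)
  !BINARY_PATTERN.match(text.dup.force_encoding(Encoding::BINARY)).nil?
end

uri = "http://example.com/\u5927"
puts binary_regexp_clash?(uri)
puts binary_match?(uri)
```

Pure-ASCII strings do not trigger the error, which is why it only shows up once a URI contains a character outside ASCII. Matching against a binary copy is shown only to illustrate the mechanism, not as the fix Loofah adopted.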

allow custom scrubbers to leverage the HTML5lib scrubbing already written

A couple of commonly requested features:

add or remove attributes from the whitelists
turn off CSS scrubbing