loofah's Introduction

Loofah

Description

Loofah is a general library for manipulating and transforming HTML/XML documents and fragments, built on top of Nokogiri.

Loofah also includes some HTML sanitizers based on html5lib's safelist, which are a specific application of the general transformation functionality.

Active Record extensions for HTML sanitization are available in the loofah-activerecord gem.

Features

  • Easily write custom transformations for HTML and XML
  • Common HTML sanitizing transformations are built-in:
    • Strip unsafe tags, leaving behind only the inner text.
    • Prune unsafe tags and their subtrees, removing all traces that they ever existed.
    • Escape unsafe tags and their subtrees, leaving behind lots of &lt; and &gt; entities.
    • Whitewash the markup, removing all attributes and namespaced nodes.
  • Other common HTML transformations are built-in:
    • Add the nofollow attribute to all hyperlinks.
    • Add the target=_blank attribute to all hyperlinks.
    • Remove unprintable characters from text nodes.
  • Format markup as plain text, with (or without) sensible whitespace handling around block elements.
  • Replace Rails's strip_tags and sanitize view helper methods.

Compare and Contrast

Loofah is both:

  • a general framework for transforming XML, XHTML, and HTML documents
  • a specific toolkit for HTML sanitization

General document transformation

Loofah tries to make it easy to write your own custom scrubbers for whatever document transformation you need. You don't like the built-in scrubbers? Build your own, like a boss.

HTML sanitization

Another Ruby library for HTML sanitization is rgrove/sanitize. It is also built on top of Nokogiri, and it provides a bit more flexibility over which tags and attributes are scrubbed.

You may also want to look at rails/rails-html-sanitizer, which is built on top of Loofah and provides some useful extensions and additional flexibility in HTML sanitization.

The Basics

Loofah wraps Nokogiri in a loving embrace. Nokogiri is a stable, well-maintained parser for XML, HTML4, and HTML5.

Loofah implements the following classes:

  • Loofah::HTML5::Document
  • Loofah::HTML5::DocumentFragment
  • Loofah::HTML4::Document (aliased as Loofah::HTML::Document for now)
  • Loofah::HTML4::DocumentFragment (aliased as Loofah::HTML::DocumentFragment for now)
  • Loofah::XML::Document
  • Loofah::XML::DocumentFragment

These document and fragment classes are subclasses of the similarly-named Nokogiri classes Nokogiri::HTML5::Document et al.

Loofah also implements Loofah::Scrubber, which represents the document transformation, either by wrapping a block,

span2div = Loofah::Scrubber.new do |node|
  node.name = "div" if node.name == "span"
end

or by implementing a method.

Side Note: Fragments vs Documents

Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don't have a document, you have a fragment. For HTML, another rule of thumb is that documents have html and body tags, and fragments usually do not.

HTML fragments should be parsed with Loofah.html5_fragment or Loofah.html4_fragment. The result won't be wrapped in html or body tags, won't have a DOCTYPE declaration, head elements will be silently ignored, and multiple root nodes are allowed.

HTML documents should be parsed with Loofah.html5_document or Loofah.html4_document. The result will have a DOCTYPE declaration, along with html, head and body tags.

XML fragments should be parsed with Loofah.xml_fragment. The result won't have a DOCTYPE declaration, and multiple root nodes are allowed.

XML documents should be parsed with Loofah.xml_document. The result will have a DOCTYPE declaration and a single root node.

Side Note: HTML4 vs HTML5

HTML5 functionality is not available on JRuby, or with versions of Nokogiri < 1.14.0.

Currently, Loofah's methods Loofah.document and Loofah.fragment are aliases to .html4_document and .html4_fragment, which use Nokogiri's HTML4 parser. (Similarly, Loofah::HTML::Document and Loofah::HTML::DocumentFragment are aliased to Loofah::HTML4::Document and Loofah::HTML4::DocumentFragment.)

Please note that in a future version of Loofah, these methods and classes may switch to using Nokogiri's HTML5 parser and classes on platforms that support it [1].

We strongly recommend that you explicitly use .html5_document or .html5_fragment unless you know of a compelling reason not to. If you are sure that you need to use the HTML4 parser, you should explicitly call .html4_document or .html4_fragment to avoid breakage in a future version.

[1]: [feature request] HTML5 parser for JRuby implementation · Issue #2227 · sparklemotion/nokogiri

Loofah::HTML5::Document and Loofah::HTML5::DocumentFragment

These classes are subclasses of Nokogiri::HTML5::Document and Nokogiri::HTML5::DocumentFragment.

The module methods Loofah.html5_document and Loofah.html5_fragment will parse an HTML document and an HTML fragment, respectively.

Loofah.html5_document(unsafe_html).is_a?(Nokogiri::HTML5::Document)         # => true
Loofah.html5_fragment(unsafe_html).is_a?(Nokogiri::HTML5::DocumentFragment) # => true

Loofah injects a scrub! method, which takes either a symbol (for built-in scrubbers) or a Loofah::Scrubber object (for custom scrubbers), and modifies the document in-place.

Loofah overrides to_s to return HTML:

unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"

doc = Loofah.html5_fragment(unsafe_html).scrub!(:prune)
doc.to_s    # => "ohai! <div>div is safe</div> "

and text to return plain text:

doc.text    # => "ohai! div is safe "

Also, to_text is available, which does the right thing with whitespace around block-level and line break elements.

doc = Loofah.html5_fragment("<h1>Title</h1><div>Content<br>Next line</div>")
doc.text    # => "TitleContentNext line"            # probably not what you want
doc.to_text # => "\nTitle\n\nContent\nNext line\n"  # better

Loofah::HTML4::Document and Loofah::HTML4::DocumentFragment

These classes are subclasses of Nokogiri::HTML4::Document and Nokogiri::HTML4::DocumentFragment.

The module methods Loofah.html4_document and Loofah.html4_fragment will parse an HTML document and an HTML fragment, respectively.

Loofah.html4_document(unsafe_html).is_a?(Nokogiri::HTML4::Document)         # => true
Loofah.html4_fragment(unsafe_html).is_a?(Nokogiri::HTML4::DocumentFragment) # => true

Loofah::XML::Document and Loofah::XML::DocumentFragment

These classes are subclasses of Nokogiri::XML::Document and Nokogiri::XML::DocumentFragment.

The module methods Loofah.xml_document and Loofah.xml_fragment will parse an XML document and an XML fragment, respectively.

Loofah.xml_document(bad_xml).is_a?(Nokogiri::XML::Document)         # => true
Loofah.xml_fragment(bad_xml).is_a?(Nokogiri::XML::DocumentFragment) # => true

Nodes and Node Sets

Nokogiri's Node and NodeSet classes also get a scrub! method, which makes it easy to scrub subtrees.

The following code will apply the employee_scrubber only to the employee nodes (and their subtrees) in the document:

Loofah.xml_document(bad_xml).xpath("//employee").scrub!(employee_scrubber)

And this code will only scrub the first employee node and its subtree:

Loofah.xml_document(bad_xml).at_xpath("//employee").scrub!(employee_scrubber)

Loofah::Scrubber

A Scrubber wraps up a block (or method) that is run on a document node:

# change all <span> tags to <div> tags
span2div = Loofah::Scrubber.new do |node|
  node.name = "div" if node.name == "span"
end

This can then be run on a document:

Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
# => "<div>foo</div><p>bar</p>"

Scrubbers can be run on a document in either a top-down traversal (the default) or bottom-up. Top-down scrubbers can optionally return Scrubber::STOP to terminate the traversal of a subtree. Read below and in the Loofah::Scrubber class for more detailed usage.

Here's an XML example:

# remove all <employee> tags that have a "deceased" attribute set to true
bring_out_your_dead = Loofah::Scrubber.new do |node|
  if node.name == "employee" and node["deceased"] == "true"
    node.remove
    Loofah::Scrubber::STOP # don't bother with the rest of the subtree
  end
end
Loofah.xml_document(File.read('plague.xml')).scrub!(bring_out_your_dead)

Built-In HTML Scrubbers

Loofah comes with a set of sanitizing scrubbers that use html5lib's safelist algorithm:

doc = Loofah.html5_document(input)
doc.scrub!(:strip)       # replaces unknown/unsafe tags with their inner text
doc.scrub!(:prune)       # removes unknown/unsafe tags and their children
doc.scrub!(:escape)      # escapes unknown/unsafe tags, like this: &lt;script&gt;
doc.scrub!(:whitewash)   # removes unknown/unsafe/namespaced tags and their children,
                         # and strips all node attributes

Loofah also comes with some common transformation tasks:

doc.scrub!(:nofollow)    # adds rel="nofollow" attribute to links
doc.scrub!(:noopener)    # adds rel="noopener" attribute to links
doc.scrub!(:noreferrer)  # adds rel="noreferrer" attribute to links
doc.scrub!(:unprintable) # removes unprintable characters from text nodes
doc.scrub!(:targetblank) # adds target="_blank" attribute to links

See Loofah::Scrubbers for more details and example usage.

Chaining Scrubbers

You can chain scrubbers:

Loofah.html5_fragment("<span>hello</span> <script>alert('OHAI')</script>") \
      .scrub!(:prune) \
      .scrub!(span2div).to_s
# => "<div>hello</div> "

Shorthand

The class methods Loofah.scrub_html5_fragment and Loofah.scrub_html5_document (and the corresponding HTML4 methods) are shorthand.

These methods:

Loofah.scrub_html5_fragment(unsafe_html, :prune)
Loofah.scrub_html5_document(unsafe_html, :prune)
Loofah.scrub_html4_fragment(unsafe_html, :prune)
Loofah.scrub_html4_document(unsafe_html, :prune)
Loofah.scrub_xml_fragment(bad_xml, custom_scrubber)
Loofah.scrub_xml_document(bad_xml, custom_scrubber)

do the same thing as (and are arguably semantically clearer than):

Loofah.html5_fragment(unsafe_html).scrub!(:prune)
Loofah.html5_document(unsafe_html).scrub!(:prune)
Loofah.html4_fragment(unsafe_html).scrub!(:prune)
Loofah.html4_document(unsafe_html).scrub!(:prune)
Loofah.xml_fragment(bad_xml).scrub!(custom_scrubber)
Loofah.xml_document(bad_xml).scrub!(custom_scrubber)

View Helpers

Loofah has two "view helpers": Loofah::Helpers.sanitize and Loofah::Helpers.strip_tags, both of which are drop-in replacements for the Rails Action View helpers of the same name.

These are not required automatically. You must require loofah/helpers to use them.

Requirements

  • Nokogiri >= 1.5.9

Installation

Unsurprisingly:

gem install loofah

Requirements:

  • Ruby >= 2.5

Support

The bug tracker is available here:

And the mailing list is on Google Groups:

Consider subscribing to Tidelift which provides license assurances and timely security notifications for your open source dependencies, including Loofah. Tidelift subscriptions also help the Loofah maintainers fund our automated testing which in turn allows us to ship releases, bugfixes, and security updates more often.

Security

See SECURITY.md for vulnerability reporting details.

Related Links

Authors

Featuring code contributed by:

And a big shout-out to Corey Innis for the name, and feedback on the API.

Thank You

The following people have generously funded Loofah with financial sponsorship:

  • Bill Harding
  • Sentry @getsentry

Historical Note

This library was once named "Dryopteris", which was a very bad name that nobody could spell properly.

License

Distributed under the MIT License. See MIT-LICENSE.txt for details.

loofah's People

Contributors

ahorek, asok, brynary, dependabot-preview[bot], flavorjones, joncalhoun, juanitofatas, junaruga, kaspth, kristianfreeman, ktdreyer, kyoshidajp, louim, m-nakamura145, mothonmars, mrpasquini, nick-desteffen, nikoroberts, olivierlacan, olleolleolle, orien, pauldix, queso, sampokuokkanen, stefannibrasil, tastycode, technicalpickles, tenderlove, trans, vipulnsward

loofah's Issues

Scrub not fully applied on HTML::Document

I noticed that some HTML comment tags are not removed.

Here is an example, my_scrub should remove all the comments.

Loofah.document("<!DOCTYPE html><!--[if IE 7]><!-- --><html><body><script></script></body></html><!--ww -->").scrub!(my_scrub).to_xml
=> "<!DOCTYPE html>\n<!--[if IE 7]><!-- --><html></html>\n"

I checked the code and think the problem is here:

https://github.com/flavorjones/loofah/blob/master/lib/loofah/instance_methods.rb#L41

        case self
        when Nokogiri::XML::Document
          scrubber.traverse(root) if root
        when Nokogiri::XML::DocumentFragment
          children.scrub! scrubber
        else
          scrubber.traverse(self)
        end

So even an HTML::Document goes through scrubber.traverse(root) if root, which means that nodes outside the root html element never pass through the scrubber.

0.4.7 not compatible with Nokogiri 1.3.3

There is a call to at_xpath which means that scrub_fragment does not work. 0.4.3 works.

irb(main):002:0> Loofah.scrub_fragment("Bold", :prune).to_s
NoMethodError: undefined method `at_xpath' for #
        from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:31:in `serialize_root'
        from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:26:in `to_s'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `inspect'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_str'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_s'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `write'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `print'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:259:in `signal_status'
        from /opt/local/lib/ruby/1.8/irb.rb:147:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:146:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:70:in `start'
        from /opt/local/lib/ruby/1.8/irb.rb:69:in `catch'
        from /opt/local/lib/ruby/1.8/irb.rb:69:in `start'
        from /opt/local/bin/irb:13

whitewash test error with libxml 2.9.1

I'm seeing a test failure with the newer LibXML 2.9.0 in Fedora 19 and above:

=> testing with Nokogiri {"warnings"=>["Nokogiri was built against LibXML version 2.9.0, but has dynamically loaded 2.9.1"], "nokogiri"=>"1.5.9", "ruby"=>{"version"=>"2.0.0", "platform"=>"i386-linux", "description"=>"ruby 2.0.0p353 (2013-11-22 revision 43784) [i386-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "compiled"=>"2.9.0", "loaded"=>"2.9.1"}}

The same test succeeds on CentOS 6, with Ruby 1.9.3 and LibXML 2.8.0:

=> testing with Nokogiri {"warnings"=>[], "nokogiri"=>"1.6.1", "ruby"=>{"version"=>"1.9.3", "platform"=>"x86_64-linux", "description"=>"ruby 1.9.3p327 (2012-11-10 revision 37606) [x86_64-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "source"=>"packaged", "libxml2_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxml2/2.8.0", "libxslt_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxslt/1.1.26", "compiled"=>"2.8.0", "loaded"=>"2.8.0"}}

The failing test is this one:

  1) Failure:
IntegrationTestAdHoc#test_fragment_whitewash_on_microsofty_markup [/builddir/build/BUILD/loofah-1.2.1/usr/share/gems/gems/loofah-1.2.1/test/integration/test_ad_hoc.rb:146]:
--- expected
+++ actual
@@ -1 +1,4 @@
-"<p>Foo <b>BOLD</b></p>"
+"
+
+<p>Foo <b>BOLD</b></p>
+"

For some reason, nokogiri must be inserting some blank lines there.

I looked at the Travis builds, and apparently Travis is using the older LibXML 2.8.0, which might explain why we haven't seen this fail in Travis yet.

to_text and Loofah::Helpers

Hi,
is it possible that removing the automatic loading of Loofah::Helpers
breaks to_text?
irb
require 'loofah'
Loofah.fragment("abc").to_text
NameError: uninitialized constant Loofah::Helpers

OTOH Loofah.fragment("abc").to_s works.

Or is to_text considered part of the ActionView helpers?
Feels a bit strange. And: the to_text method itself is there.

best
Morus

Not support -Ku option for ruby

Rake fails with an error if Ruby runs with RUBYOPT=-Ku:

rake db:create
/home/nleo/.rvm/gems/ruby-1.9.3-p327/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/

improve rails integration points

I found that just config.gem'ing loofah didn't quite work, because it hadn't extended ActiveRecord::Base with html_fragment and friends.

require 'loofah/active_record' was enough to fix. I noticed an init.rb too, but it was referring to classes that didn't exist.

I took a page from factory_girl's initialization, and added some logic to lib/loofah.rb to require loofah/active_record if Rails is loaded. I also updated init.rb to just require loofah.

Updated in my fork/branch: http://github.com/technicalpickles/loofah/tree/better-rails-int

Loofah broken by colons

I'm using JRuby 1.6.5, Loofah 1.2.1, Nokogiri 1.5.5-java. Loofah is broken when it comes to colons inside the body of any tags. Here's a succinct example of what's wrong, from a console session:

>> Loofah.fragment("4:30am").to_s
=> "4:30am"
>> Loofah.fragment("<span>4:30am</span>").to_s
=> "<span>30am</span>"

>> Loofah.fragment("4:30am")
=> #<Loofah::HTML::DocumentFragment:0x1086 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1084 "4:30am">]>
>> Loofah.fragment("<span>4:30am</span>")
=> #<Loofah::HTML::DocumentFragment:0x108c name="#document-fragment" children=[#<Nokogiri::XML::Element:0x108a name="span" children=[#<Nokogiri::XML::Text:0x1088 "30am">]>]>

Notice how the "4:" has been mysteriously chomped. It doesn't matter if it's span, b, i or any other tag -- all remove the "4:". It seems that Nokogiri is able to parse the HTML initially just fine:

>> Nokogiri::HTML("4:30am")
=> #<Nokogiri::HTML::Document:0x1074 name="document" children=[#<Nokogiri::XML::DTD:0x106a name="html">, #<Nokogiri::XML::Element:0x1072 name="html" children=[#<Nokogiri::XML::Element:0x106c name="head">, #<Nokogiri::XML::Element:0x1070 name="body" children=[#<Nokogiri::XML::Text:0x106e "4:30am">]>]>]>

>> Nokogiri::HTML("<b>4:30am</b>")
=> #<Nokogiri::HTML::Document:0x1082 name="document" children=[#<Nokogiri::XML::DTD:0x1076 name="html">, #<Nokogiri::XML::Element:0x1080 name="html" children=[#<Nokogiri::XML::Element:0x1078 name="head">, #<Nokogiri::XML::Element:0x107e name="body" children=[#<Nokogiri::XML::Element:0x107c name="b" children=[#<Nokogiri::XML::Text:0x107a "4:30am">]>]>]>]>

So it looks like it must be Loofah that's removing it.

multibyte regex error

Reported on ruby-talk by Une Bévue.


Because I use Nokogiri daily, I wanted to test Loofah with a small
script (taken from http://loofah.rubyforge.org/loofah/):

#! /opt/local/bin/ruby1.9
# encoding: utf-8

require 'rubygems'
require 'nokogiri'
require 'loofah'

unsafe_html="ohai! <div>div is safe</div> <script>but script is not</script>"

doc=Loofah.fragment(unsafe_html).scrub!(:strip)
puts doc.to_s

however i got :

SyntaxError:
/opt/local/lib/ruby1.9/gems/1.9.1/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/
method require in untitled document at line 29
method require in untitled document at line 29
method <top (required)> in loofah.rb at line 9
method require in untitled document at line 33
method rescue in require in untitled document at line 33
method require in untitled document at line 29
method <main> in loofah_first_test.rb at line 22

ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10]
over Mac OS X SL

undefined method `lstrip' for :Loofah::HTML::DocumentFragment

undefined method `lstrip' for :Loofah::HTML::DocumentFragment
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:21:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `parse'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:179:in `fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:184:in `scrub_fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/active_record.rb:27:in `html_fragment'

The code that I was using:

html_fragment :content, :scrub => :whitewash

Duplicated range warning

I keep getting this warning, but I'm not too familiar with the regex contents to fix it.

  • lib/loofah/html5/scrub.rb:12: warning: character class has duplicated range: /[`\x00-\x20\x7F\s�-ā]/

//@rafaelfranca

encoding with ruby 1.9

I've checked to see if this issue is fixed.
https://github.com/flavorjones/loofah/issues/issue/26


irb(main):008:0> utf8_string = "あ<b>い</b>う<script>え</script>お"
=> "あ<b>い</b>う<script>え</script>お"
irb(main):009:0> Loofah.scrub_fragment(utf8_string, :strip).to_s
=> "あ<b>い</b>うã\u0081\u0088お"

I think the above code should return あ<b>い</b>うえお.
(:escape works correctly.)

Syntax error when running test in Textmate

loofah-1.2.0/lib/loofah/html5/scrub.rb:10: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/

When I remove the failing line 10, it works!

Ruby 1.9.1 and loofah 0.2.2 Encoding error. Ruby 1.8.7 is OK.

Just got this under Ruby 1.9.1 while parsing http://www.fd.nl/nieuws/laatstenieuws/?view=RSS

/home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:42:in `characters': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `native_parse_memory'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:11:in `initialize'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `new'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `parse'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:179:in `fragment'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:184:in `scrub_fragment'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1362:in `strip_html'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:442:in `get_feed_summary'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:888:in `update_rss_feed'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1130:in `update_feed'

The strip_html() method looks like this...

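class String
  # Turn <br> and <p> elements into spaced text, then strip the remaining markup.
  def strip_html
    html = Nokogiri::HTML.fragment(self.dup)
    (html/:br).each { |_br| _br.swap(' ') }            # line breaks become spaces
    (html/:p).each { |_p| _p.swap(_p.content + ' ') }  # paragraphs become their text plus a space
    Loofah.scrub_fragment(html.content, :strip).text
  end
end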

The text it's working on is:

fd.nl - De nieuwsbron van ondernemend Nederland. Financieel-economisch nieuws, achtergronden en analyses. Artikelen over beleggen, carrière en ondernemen. Uitgebreide koersen, rentestanden en bedrijfsinformatie. Met veel aandacht voor aandelenmarkten en beleggingsfondsen.

Loofah escaping carriage returns

We have carriage returns in our text fields, and loofah is turning them into entities.

This is undesirable. Is there a way to stop it?

Errors out in a Rails project that doesn't use ActiveRecord (like MongoMapper)

I'm using MongoMapper on a project and loading Feedzirra, which uses loofah. phew

I remove ActiveRecord from my configuration and get this error message:
/Library/Ruby/Gems/1.8/gems/activesupport-2.3.5/lib/active_support/dependencies.rb:443:in `load_missing_constant':NameError: uninitialized constant ActiveRecord

Trying to use rake also results in an error:
uninitialized constant ActiveRecord

Using this to remove AR:
config.frameworks -= [:active_record]

feature: allow Loofah-ization of an existing Nokogiri document or fragment

Currently, the following code will add scrubbing behavior:

doc = Nokogiri::HTML HTML
doc.extend Loofah::ScrubBehavior::Node

but that doesn't include #text or the node and node set decorators.

So, I'd suggest that we allow Loofah initialize methods to receive a Nokogiri document / fragment and do the extension there. Then we should check that we're not relying on the typeiness of the Loofah document / fragment.

(And, actually, can we then eliminate the classes entirely, and just use extended Nokogiri docs everywhere?)

The ActiveRecord extension *almost* works for MongoMapper

The methods defined in the ActiveRecord extension class actually work as-is for MongoMapper. The line that extends ActiveRecord::Base, though, blows up when there's no ActiveRecord present and I include that class manually. I "fixed" this in my own branch (mrkurt/loofah@a336a9c) by moving the actual extension call into the Rails/ActiveRecord initialization bits. Would you mind tweaking the real gem so I can use the ActiveRecordExtension like this?

encoding with ruby 1.9

Is this a bug?


irb(main):012:0> utf8_string = "日本語"
=> "日本語"
irb(main):012:0> utf8_string.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> escaped = Loofah.scrub_fragment(utf8_string, :escape).to_s
=> "&#26085;&#26412;&#35486;"
irb(main):015:0> escaped.encoding
=> #<Encoding:US-ASCII>
irb(main):016:0> escaped.encode('UTF-8')
=> "&#26085;&#26412;&#35486;"
irb(main):019:0> escaped.force_encoding('UTF-8')
=> "&#26085;&#26412;&#35486;"

Software versions:

  • Ubuntu 10.04
  • rvm 1.0.20
  • ruby 1.9.2p0
  • nokogiri (1.4.4)
  • loofah (1.0.0.beta.1.20101025234603)
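
For readers comparing the two strings in the transcript above: the escaped output consists of decimal numeric character references, which use only ASCII characters. The sketch below decodes them by hand to show that they stand for the same three characters. It is only an illustration in plain Ruby and does not use Loofah.

```ruby
# The escaped string from the transcript above.
escaped = "&#26085;&#26412;&#35486;"

# Turn each decimal reference back into the character with that code point.
decoded = escaped.gsub(/&#(\d+);/) { [Regexp.last_match(1).to_i].pack("U") }
# => "日本語"
```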

Consider removing unprintable characters?

We've encountered some issues with HTML containing unprintable characters (namely \u2028\u2029) over on the Stringer project (stringer-rss/stringer#295). The issue manifests itself further down the chain once we try to load the HTML with unprintable characters as JSON.

We've resolved our issues by adding a gsub(/[^[:print:]]/, '') call after running the contents through loofah. I think it makes sense to add this as part of loofah's sanitizing process - any thoughts before I take a stab at a PR?
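
The effect of the gsub workaround quoted above can be seen on a small sample. This is a plain-Ruby sketch; the sample string is made up, and the narrower second variant is an assumption added for comparison, not something proposed in the report.

```ruby
# U+2028 (line separator) and U+2029 (paragraph separator) are the
# characters named in the report.
html = "one\u2028two\u2029three\nfour five"

# The quoted workaround: drop everything outside [[:print:]].
# Note that this also removes the ordinary newline.
broad = html.gsub(/[^[:print:]]/, "")

# A narrower variant (an assumption): drop only the two separators.
narrow = html.gsub(/[\u2028\u2029]/, "")
```

The broad form yields "onetwothreefour five", while the narrow form keeps the newline. Which one is appropriate probably depends on whether line breaks in the content matter.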

Scrubber inserts newlines between spans if there is no whitespace between them?

Bit of a weird one this. I've been trying to investigate this issue with clojure syntax highlighting on exercism.io, and it seems to be being caused by some weirdness with how Loofah sanitization deals with span elements without whitespace between them. If there are no spaces between the spans then it adds a newline between each one. If there is a space between at least one set of spans, it leaves the whitespace intact.

Here's the example from the exercism.io issue. In each case the input is being run through Loofah.xml_fragment(html).scrub!(:escape).to_s.

# input with no spaces between spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>

# output has loads of newlines added
<pre class=\"highlight\">\n  <span class=\"p\">(</span>\n  <span class=\"k\">defn</span>\n  <span class=\"w\"/>\n  <span class=\"n\">to-rna</span>\n  <span class=\"p\">)</span>\n</pre>\n

# input with only one char difference, a single space between two of the spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>

# output retains its newlines correctly
<pre class=\"highlight\"><span class=\"p\">(</span><span class=\"k\">defn</span> <span class=\"w\"> </span> <span class=\"n\">to-rna</span><span class=\"p\">)</span></pre>\n

At this point I'm a bit lost - any ideas?

Loofah fails on the empty string

On my setup, Loofah gives a runtime error if you ask it to prune the empty string (but is fine with pruning a string with a single space):

>> Loofah.scrub_fragment(" ", :prune)
=> #<Loofah::HTML::DocumentFragment:0x1066 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1064 " ">]>

>> Loofah.scrub_fragment("", :prune)
NoMethodError: undefined method `scrub!' for []:Nokogiri::XML::NodeSet
    from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah/instance_methods.rb:44:in `scrub!'
    from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah.rb:49:in `scrub_fragment'
    from (irb):4:in `evaluate'
    from org/jruby/RubyKernel.java:1088:in `eval'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:158:in `eval_input'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:271:in `signal_status'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:155:in `eval_input'
    from org/jruby/RubyKernel.java:1420:in `loop'
    from org/jruby/RubyKernel.java:1192:in `catch'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:154:in `eval_input'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:71:in `start'
    from org/jruby/RubyKernel.java:1192:in `catch'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:70:in `start'
    from /home/neil/src/jruby-1.6.5/bin/jirb:13:in `(root)'

Loofah seems to always :prune, no matter what scrubber is defined

No matter what scrubber I use, Loofah seems to always prune the contents. Here is a little test inside of Rails Console:

>> Loofah.scrub_fragment('I <3 You', :escape).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :prune).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :whitewash).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :strip).to_s
=> "I "
>> Loofah.scrub_fragment('I <3> You', :strip).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :whitewash).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :escape).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :prune).to_s
=> "I  You"

infos:
Ruby 1.9.3p327
Rails 3.2.11
Using gem 'loofah-activerecord'

from Gemfile.lock file

loofah (1.2.1)
  nokogiri (>= 1.4.4)
loofah-activerecord (1.1.0)
  loofah (>= 1.0.0)

Negative values inside of css properties.

I'm trying to sanitize a property with a negative value, but Loofah removes the property.
I've attached the failing test:

class IntegrationTestCssNegative < Loofah::TestCase
  def test_css_negative_value_sanitization
    html = "<span style=\"word-spacing: -1px;\"></span>"
    sane = Nokogiri::HTML(Loofah.scrub_fragment(html, :escape).to_xml)
    assert_match %r/-1px/, sane.inner_html
  end
end
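
To illustrate what the test above expects, here is a rough sketch of a CSS value check that tolerates a leading minus sign. It is not Loofah's actual implementation: the constant `CSS_NUMERIC_VALUE`, the unit list, and the helper `allowed_css_value?` are all hypothetical names chosen for this example.

```ruby
# Hypothetical value check, not taken from Loofah's source.
# Accepts an optional leading "-", an integer or decimal number,
# and an optional unit from a small illustrative list.
CSS_NUMERIC_VALUE = /\A-?(?:\d+(?:\.\d+)?|\.\d+)(?:px|em|ex|pt|pc|in|cm|mm|%)?\z/

def allowed_css_value?(value)
  CSS_NUMERIC_VALUE.match?(value.strip)
end

puts allowed_css_value?("-1px")                 # prints "true"
puts allowed_css_value?("expression(alert(1))") # prints "false"
```

The point of the sketch is only that the sign belongs to the numeric token, so a sanitizer can admit `-1px` without loosening its checks for anything else.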

Escape removes numbers using greater than / less than.

The text '1<2 and 2>1' is incorrectly converted to '11'.

require 'Loofah'
=> []

str = "1<2 and 2>1"
=> "1<2 and 2>1"

result = Loofah.scrub_fragment(str,:escape)
=> #<Loofah::HTML::DocumentFragment:0x262233c name="#document-fragment" children
=[#<Nokogiri::XML::Text:0x2621cf2 "11">]>

result.to_s
=> "11"

exit

loofah not compatible with rails3

This blog post has details: http://www.mythoughtpot.com/2010/02/10/feedzirra-on-rails3/

$ rails server
/Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails.rb:44:in `configuration': undefined method `config' for nil:NilClass (NoMethodError)
    from /Library/Ruby/Gems/1.8/gems/loofah-0.4.7/lib/loofah.rb:89
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `each'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `each'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler.rb:112:in `require'
    from /Users/dphillips/cnxweb/config/application.rb:7
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28:in `require'
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27:in `tap'
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27
    from script/rails:6:in `require'
    from script/rails:6

error:`<module:XML>': uninitialized constant Nokogiri::XML::Document (NameError)

While running RSpec tests for my application, which uses nokogiri, feedzirra and loofah, I get this error:

/home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:8:in `<module:XML>': uninitialized constant Nokogiri::XML::Document (NameError)
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:2:in `<module:Loofah>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:1:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah.rb:15:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `block (2 levels) in require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `each'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `block in require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `each'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler.rb:114:in `require'
    from /home/mydir/project_path/config/application.rb:13:in `<top (required)>'
    from /home/mydir/project_path/config/environment.rb:2:in `require'
    from /home/mydir/project_path/config/environment.rb:2:in `<top (required)>'
    from /home/mydir/project_path/spec/spec_helper.rb:3:in `require'
    from /home/mydir/project_path/spec/spec_helper.rb:3:in `<top (required)>'
    from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `require'
    from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `block in load_spec_files'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `map'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load_spec_files'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/command_line.rb:18:in `run'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:55:in `run_in_process'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:46:in `run'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:10:in `block in autorun'

I see the same error with Ruby 1.8.7 and RSpec 2.5.0. I'm running the RSpec tests on 64-bit Ubuntu 10.10, and my RSS parsing gems are declared like this:
gem 'fii', '1.0.5'
gem "nokogiri"
gem "loofah", '1.0.0'
gem "feedzirra", '~> 0.0.24'

Please advise.

Documents with extended UTF-8 URIs cause regexp error

Any idea how to work around this issue?

ruby-1.9.2-p136 :002 > Loofah.document("<a href=\"\u5927\">").scrub!(:strip)
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
    from /Users/dphillips/.rvm/gems/ruby-1.9.2-p136/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20:in `gsub'
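
The error message suggests that a byte-oriented (ASCII-8BIT) regexp is being matched against a UTF-8 string containing non-ASCII characters. The sketch below illustrates that class of mismatch and one way to sidestep it in your own code by matching against a binary copy of the string. `BINARY_PATTERN` and `binary_safe_index` are hypothetical names for this example, and the pattern is a stand-in, not the one used in Loofah's `scrub.rb`.

```ruby
# Illustration only: a stand-in byte-oriented (ASCII-8BIT) regexp.
BINARY_PATTERN = /\xE5/n

# Hypothetical helper: returns the match position, retrying on a binary
# copy of the text if Ruby reports an encoding mismatch.
def binary_safe_index(pattern, text)
  text =~ pattern
rescue Encoding::CompatibilityError
  # String#b returns an ASCII-8BIT copy, so both sides use the same encoding.
  text.b =~ pattern
end

utf8_value = "\u5927" # the href value from the report above
puts binary_safe_index(BINARY_PATTERN, utf8_value).inspect
```

Because the failing `gsub` happens inside the library, this does not patch Loofah itself; it only shows why the two encodings clash and how aligning them avoids the exception.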

push tags to github

When you tag releases, would you mind pushing the tags to GitHub? For example the latest version on Rubygems.org is 2.0.0, but the v2.0.0 tag is missing from the repository.
