error:`<module:XML>': uninitialized constant Nokogiri::XML::Document (NameError)

While running RSpec tests for my application, which uses nokogiri, feedzirra and loofah, I get this error:

/home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:8:in `<module:XML>': uninitialized constant Nokogiri::XML::Document (NameError)
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:2:in `<module:Loofah>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:1:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah.rb:15:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `block (2 levels) in require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `each'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `block in require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `each'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `require'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler.rb:114:in `require'
    from /home/mydir/project_path/config/application.rb:13:in `<top (required)>'
    from /home/mydir/project_path/config/environment.rb:2:in `require'
    from /home/mydir/project_path/config/environment.rb:2:in `<top (required)>'
    from /home/mydir/project_path/spec/spec_helper.rb:3:in `require'
    from /home/mydir/project_path/spec/spec_helper.rb:3:in `<top (required)>'
    from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `require'
    from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `<top (required)>'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `block in load_spec_files'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `map'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load_spec_files'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/command_line.rb:18:in `run'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:55:in `run_in_process'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:46:in `run'
    from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:10:in `block in autorun'

I see the same error with Ruby 1.8.7 and RSpec 2.5.0. I'm running the specs on Ubuntu 10.10 64-bit, and my RSS parsing gems are defined like this:
gem 'fii', '1.0.5'
gem "nokogiri"
gem "loofah", '1.0.0'
gem "feedzirra", '~> 0.0.24'

Please advise.

push tags to github

When you tag releases, would you mind pushing the tags to GitHub? For example, the latest version is 2.0.0, but the v2.0.0 tag is missing from the repository.

Duplicated range warning

I keep getting this warning, but I'm not familiar enough with the regex to fix it.

  • lib/loofah/html5/scrub.rb:12: warning: character class has duplicated range: /[`\x00-\x20\x7F\s�-ā]/


feature: allow Loofah-ization of an existing Nokogiri document or fragment

Currently, the following code will add scrubbing behavior:

doc = Nokogiri::HTML HTML
doc.extend Loofah::ScrubBehavior::Node

but that doesn't include #text or the node and node set decorators.

So, I'd suggest that we allow Loofah initialize methods to receive a Nokogiri document / fragment and do the extension there. Then we should check that we're not relying on the typeiness of the Loofah document / fragment.

(And, actually, can we then eliminate the classes entirely, and just use extended Nokogiri docs everywhere?)

Loofah escaping carriage returns

We have carriage returns in our text fields, and loofah is turning them into entities.

This is undesirable. Is there a way to stop it?
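
One possible workaround, assuming the application does not need to keep the carriage returns themselves, is to normalize line endings before the text reaches Loofah, so no `\r` is left to serialize as an entity. A minimal sketch in plain Ruby (the helper name `normalize_newlines` is made up for illustration and is not part of Loofah):

```ruby
# Hypothetical helper, not part of Loofah: convert Windows (CRLF) and
# old Mac (bare CR) line endings to a single LF.
def normalize_newlines(text)
  text.gsub(/\r\n?/, "\n")
end

normalized = normalize_newlines("line one\r\nline two\rline three")
puts normalized.inspect
```

The normalized string can then be passed to the scrubber as usual.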

Loofah fails on the empty string

On my setup, Loofah gives a runtime error if you ask it to prune the empty string (but is fine with pruning a string with a single space):

>> Loofah.scrub_fragment(" ", :prune)
=> #<Loofah::HTML::DocumentFragment:0x1066 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1064 " ">]>

>> Loofah.scrub_fragment("", :prune)
NoMethodError: undefined method `scrub!' for []:Nokogiri::XML::NodeSet
    from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah/instance_methods.rb:44:in `scrub!'
    from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah.rb:49:in `scrub_fragment'
    from (irb):4:in `evaluate'
    from org/jruby/ `eval'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:158:in `eval_input'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:271:in `signal_status'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:155:in `eval_input'
    from org/jruby/ `loop'
    from org/jruby/ `catch'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:154:in `eval_input'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:71:in `start'
    from org/jruby/ `catch'
    from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:70:in `start'
    from /home/neil/src/jruby-1.6.5/bin/jirb:13:in `(root)'
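
Until this is fixed, a caller-side guard may be enough: skip the scrubber for nil or empty input. The sketch below is only an illustration; `scrub_unless_empty` is a hypothetical wrapper, and the block stands in for the real call, for example `Loofah.scrub_fragment(h, :prune).to_s`.

```ruby
# Hypothetical wrapper, not part of Loofah: return an empty string for
# nil or empty input and only yield non-empty input to the scrubber block.
def scrub_unless_empty(html)
  return "" if html.nil? || html.empty?
  yield html
end

# Stand-in block for illustration; in an application this would be
#   scrub_unless_empty(input) { |h| Loofah.scrub_fragment(h, :prune).to_s }
result = scrub_unless_empty("") { |h| h.upcase }
puts result.inspect
```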

encoding with ruby 1.9

Is this a bug?

irb(main):012:0> utf8_string = "日本語"
=> "日本語"
irb(main):012:0> utf8_string.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> escaped = Loofah.scrub_fragment(utf8_string, :escape).to_s
=> "&#26085;&#26412;&#35486;"
irb(main):015:0> escaped.encoding
=> #<Encoding:US-ASCII>
irb(main):016:0> escaped.encode('UTF-8')
=> "&#26085;&#26412;&#35486;"
irb(main):019:0> escaped.force_encoding('UTF-8')
=> "&#26085;&#26412;&#35486;"

Software versions:

  • Ubuntu 10.04
  • rvm 1.0.20
  • ruby 1.9.2p0
  • nokogiri (1.4.4)
  • loofah (1.0.0.beta.1.20101025234603)
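
For what it's worth, the session above shows why `encode` and `force_encoding` make no visible difference: the escaped string contains only ASCII characters (numeric character references), so changing its encoding label cannot bring the original characters back. Whether that output is a bug is a separate question. The sketch below only illustrates that the references still stand for the same text; `decode_numeric_references` is a hypothetical helper and handles decimal references only.

```ruby
# Hypothetical helper: turn decimal numeric character references such as
# "&#26085;" back into the characters they stand for. Hex references and
# named entities are deliberately not handled in this sketch.
def decode_numeric_references(text)
  text.encode(Encoding::UTF_8).gsub(/&#(\d+);/) do
    [Regexp.last_match(1).to_i].pack("U")
  end
end

escaped = "&#26085;&#26412;&#35486;".encode(Encoding::US_ASCII)
decoded = decode_numeric_references(escaped)
puts decoded
puts decoded.encoding
```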

Ruby 1.9.1 and loofah 0.2.2 Encoding error. Ruby 1.8.7 is OK.

Just got this under Ruby 1.9.1 while parsing

/home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:42:in `characters': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `native_parse_memory'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:11:in `initialize'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `new'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `parse'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:179:in `fragment'
    from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:184:in `scrub_fragment'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1362:in `strip_html'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:442:in `get_feed_summary'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:888:in `update_rss_feed'
    from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1130:in `update_feed'

The strip_html() method looks like this...

class String
  def strip_html
    html = Nokogiri::HTML.fragment(self.dup)
    # Replace <br> with a space, and each <p> with its text plus a trailing space
    (html/:br).each {|_br| _br.swap(' ') }
    (html/:p).each {|_p| _p.swap(_p.content + ' ') }
    Loofah.scrub_fragment(html.content, :strip).text
  end
end
The text it's working on is: - De nieuwsbron van ondernemend Nederland. Financieel-economisch nieuws, achtergronden en analyses. Artikelen over beleggen, carrière en ondernemen. Uitgebreide koersen, rentestanden en bedrijfsinformatie. Met veel aandacht voor aandelenmarkten en beleggingsfondsen.
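
The error message says a UTF-8 regexp was matched against a string tagged ASCII-8BIT (binary) that contains non-ASCII bytes, which Ruby 1.9 refuses to do. Assuming the feed bytes really are UTF-8 (an assumption; the feed could use another encoding), re-tagging the string before parsing may avoid the error. A minimal sketch in plain Ruby, where `UTF8_PATTERN` and `match_as_utf8` are made-up names for illustration:

```ruby
# A regexp with a \u escape is fixed to UTF-8, much like a regexp literal
# containing non-ASCII characters in a UTF-8 source file.
UTF8_PATTERN = /\u00E9/

# Hypothetical helper: re-tag binary bytes as UTF-8 (no transcoding is done)
# and match only if the bytes are valid UTF-8.
def match_as_utf8(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "input is not valid UTF-8" unless text.valid_encoding?
  text.match?(UTF8_PATTERN)
end

raw = "carri\xC3\xA8re caf\xC3\xA9".b   # binary-tagged bytes, as read from a socket

begin
  raw =~ UTF8_PATTERN                   # raises Encoding::CompatibilityError
rescue Encoding::CompatibilityError => e
  puts "binary string: #{e.class}"
end

puts "re-tagged string matches: #{match_as_utf8(raw)}"
```

In `strip_html` the equivalent step would be calling `force_encoding` on the duplicated string before handing it to Nokogiri.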

loofah not compatible with rails3

This blog post has details:

$ rails server
/Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails.rb:44:in `configuration': undefined method `config' for nil:NilClass (NoMethodError)
    from /Library/Ruby/Gems/1.8/gems/loofah-0.4.7/lib/loofah.rb:89
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `each'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `each'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `require'
    from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler.rb:112:in `require'
    from /Users/dphillips/cnxweb/config/application.rb:7
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28:in `require'
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27:in `tap'
    from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27
    from script/rails:6:in `require'
    from script/rails:6

whitewash test error with libxml 2.9.1

I'm seeing a test failure with the newer LibXML 2.9.0 in Fedora 19 and above:

=> testing with Nokogiri {"warnings"=>["Nokogiri was built against LibXML version 2.9.0, but has dynamically loaded 2.9.1"], "nokogiri"=>"1.5.9", "ruby"=>{"version"=>"2.0.0", "platform"=>"i386-linux", "description"=>"ruby 2.0.0p353 (2013-11-22 revision 43784) [i386-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "compiled"=>"2.9.0", "loaded"=>"2.9.1"}}

The same test succeeds on CentOS 6, with Ruby 1.9.3 and LibXML 2.8.0:

=> testing with Nokogiri {"warnings"=>[], "nokogiri"=>"1.6.1", "ruby"=>{"version"=>"1.9.3", "platform"=>"x86_64-linux", "description"=>"ruby 1.9.3p327 (2012-11-10 revision 37606) [x86_64-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "source"=>"packaged", "libxml2_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxml2/2.8.0", "libxslt_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxslt/1.1.26", "compiled"=>"2.8.0", "loaded"=>"2.8.0"}}

The failing test is this one:

  1) Failure:
IntegrationTestAdHoc#test_fragment_whitewash_on_microsofty_markup [/builddir/bui
--- expected
+++ actual
@@ -1 +1,4 @@
-"<p>Foo <b>BOLD</b></p>"
+<p>Foo <b>BOLD</b></p>

For some reason, nokogiri must be inserting some blank lines there.

I looked at the Travis builds, and apparently Travis is using the older LibXML 2.8.0, which might explain why we haven't seen this fail in Travis yet.

Documents with extended UTF-8 URIs causes regexp error

Any idea how to work around this issue?

ruby-1.9.2-p136 :002 > Loofah.document("<a href=\"\u5927\">").scrub!(:strip)
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
    from /Users/dphillips/.rvm/gems/ruby-1.9.2-p136/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20:in `gsub'

to_text and Loofah::Helpers

Is it possible that no longer loading Loofah::Helpers automatically
broke to_text?
require 'loofah'
NameError: uninitialized constant Loofah::Helpers

OTOH Loofah.fragment("abc").to_s works.

Or is to_text considered part of the ActionView helpers?
Feels a bit strange. And: the to_text method itself is there.


Does not support the -Ku option for Ruby

Rake fails with an error if Ruby runs with RUBYOPT=-Ku

rake db:create
/home/nleo/.rvm/gems/ruby-1.9.3-p327/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/
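
The pattern in the message uses byte escapes: `\302` followed by `[\200-\240]` is the two-byte UTF-8 form of U+0080 through U+00A0. With `-Ku` the regexp is likely read as UTF-8, so a lone `\302` looks like a truncated multibyte character. One way to express the same set of characters independently of `-Ku` is to use code points instead of bytes. This is only a sketch of the idea, not necessarily the fix the library should adopt, and the constant name is an assumption:

```ruby
# Sketch: the same characters as /`|[\000-\040\177\s]+|\302[\200-\240]/,
# written with code points. The whitespace matched by \s already falls in
# U+0000..U+0020, so it is not listed separately.
CONTROL_CHARACTERS = /[`\u0000-\u0020\u007F-\u00A0]/

cleaned = "java\u00A0script\t:".gsub(CONTROL_CHARACTERS, "")
puts cleaned
```

Because of the `\u` escapes, Ruby treats the regexp as UTF-8 no matter which `-K` option is in effect.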

0.4.7 not compatible with Nokogiri 1.3.3

There is a call to at_xpath which means that scrub_fragment does not work. 0.4.3 works.

irb(main):002:0> Loofah.scrub_fragment("Bold", :prune).to_s
NoMethodError: undefined method `at_xpath' for #
        from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:31:in `serialize_root'
        from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:26:in `to_s'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `inspect'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_str'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_s'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `write'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `print'
        from /opt/local/lib/ruby/1.8/irb.rb:154:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:259:in `signal_status'
        from /opt/local/lib/ruby/1.8/irb.rb:147:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:146:in `eval_input'
        from /opt/local/lib/ruby/1.8/irb.rb:70:in `start'
        from /opt/local/lib/ruby/1.8/irb.rb:69:in `catch'
        from /opt/local/lib/ruby/1.8/irb.rb:69:in `start'
        from /opt/local/bin/irb:13

Escape removes numbers using greater than / less than.

The text '1<2 and 2>1' is incorrectly converted to '11'.

require 'Loofah'
=> []

str = "1<2 and 2>1"
=> "1<2 and 2>1"

result = Loofah.scrub_fragment(str,:escape)
=> #<Loofah::HTML::DocumentFragment:0x262233c name="#document-fragment" children
=[#<Nokogiri::XML::Text:0x2621cf2 "11">]>

=> "11"


undefined method `lstrip' for :Loofah::HTML::DocumentFragment

undefined method `lstrip' for :Loofah::HTML::DocumentFragment
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:21:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `parse'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:179:in `fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:184:in `scrub_fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/active_record.rb:27:in `html_fragment'

The code that I was using:

html_fragment :content, :scrub => :whitewash

Negative values inside of css properties.

I'm trying to sanitize a CSS property with a negative value, but Loofah removes the property.
I've attached the failing test:

class IntegrationTestCssNegative < Loofah::TestCase
  def test_css_negative_value_sanitization
    html = "<span style=\"word-spacing: -1px;\"></span>"
    sane = Nokogiri::HTML(Loofah.scrub_fragment(html, :escape).to_xml)
    # The negative value should survive sanitization
    assert_match %r/-1px/, sane.inner_html
  end
end

multibyte regex error

Reported on ruby-talk by Une Bévue

Because I use nokogiri daily, I wanted to test loofah with a small
script (coming from :

#! /opt/local/bin/ruby1.9
# encoding: utf-8

require 'rubygems'
require 'nokogiri'
require 'loofah'

unsafe_html="ohai! <div>div is safe</div> <script>but script is

puts doc.to_s

However, I got:

ub.rb:20: too short escaped multibyte character:
method require in untitled document at line 29
method require in untitled document at line 29
method <top (required)> in loofah.rb at line 9
method require in untitled document at line 33
method rescue in require in untitled document at line 33
method require in untitled document at line 29
method <main> in loofah_first_test.rb at line 22

ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10]
over Mac OS X SL

Scrub not fully applied on HTML::Document

I noticed that some HTML comment tags are not removed.

Here is an example, my_scrub should remove all the comments.

Loofah.document("<!DOCTYPE html><!--[if IE 7]><!-- --><html><body><script></script></body></html><!--ww -->").scrub!(my_scrub).to_xml
=> "<!DOCTYPE html>\n<!--[if IE 7]><!-- --><html></html>\n"

I checked the code and I think the problem is here:

        case self
        when Nokogiri::XML::Document
          scrubber.traverse(root) if root
        when Nokogiri::XML::DocumentFragment
          children.scrub! scrubber

So even an HTML::Document goes through scrubber.traverse(root) if root, which means nodes outside the root html element, such as the comments in the example above, never pass through the scrubber.

Loofah seems to always :prune, no matter what scrubber is defined

No matter what scrubber I use, Loofah seems to always prune the contents. Here is a little test inside of Rails Console:

>> Loofah.scrub_fragment('I <3 You', :escape).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :prune).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :whitewash).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :strip).to_s
=> "I "
>> Loofah.scrub_fragment('I <3> You', :strip).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :whitewash).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :escape).to_s
=> "I  You"
>> Loofah.scrub_fragment('I <3> You', :prune).to_s
=> "I  You"

Ruby 1.9.3p327
Rails 3.2.11
Using gem 'loofah-activerecord'

from Gemfile.lock file

loofah (1.2.1)
  nokogiri (>= 1.4.4)
loofah-activerecord (1.1.0)
  loofah (>= 1.0.0)
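
In every call above the result stops at the `<`, which suggests the HTML parser reads `<3 You` as the start of a tag before any scrubber runs, so the choice of scrubber makes no difference. If the input is meant to be plain text and not markup (an assumption about this use case), one possible workaround is to HTML-escape it before it reaches the parser. A minimal sketch using Ruby's standard `ERB::Util`; the helper name `escape_plain_text` is made up for illustration:

```ruby
require "erb"

# Hypothetical helper: escape &, <, >, " and ' so that plain text
# cannot be mistaken for markup by an HTML parser.
def escape_plain_text(text)
  ERB::Util.html_escape(text)
end

puts escape_plain_text("I <3 You")      # I &lt;3 You
puts escape_plain_text("1<2 and 2>1")   # 1&lt;2 and 2&gt;1
```

This only fits fields that should never contain HTML; for fields that may contain real markup, escaping first would also escape the legitimate tags.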

Syntax error when running tests in TextMate

loofah-1.2.0/lib/loofah/html5/scrub.rb:10: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/

When I remove the failing line 10, it works!

Errors out in a Rails project that doesn't use ActiveRecord (like MongoMapper)

I'm using MongoMapper on a project and loading Feedzirra, which uses loofah. phew

I removed ActiveRecord from my configuration and got this error message:
/Library/Ruby/Gems/1.8/gems/activesupport-2.3.5/lib/active_support/dependencies.rb:443:in `load_missing_constant':NameError: uninitialized constant ActiveRecord

Trying to use rake also results in an error:
uninitialized constant ActiveRecord

Using this to remove AR:
config.frameworks -= [:active_record]

Scrubber inserts newlines between spans if there is no whitespace between them?

This is a bit of a weird one. I've been investigating an issue with Clojure syntax highlighting, and it seems to be caused by how Loofah sanitization handles span elements with no whitespace between them. If there are no spaces between the spans, it adds a newline between each one. If there is a space between at least one pair of spans, it leaves the whitespace intact.

Here's the example from the issue. In each case the input is being run through Loofah.xml_fragment(html).scrub!(:escape).to_s.

# input with no spaces between spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>

# output has loads of newlines added
<pre class=\"highlight\">\n  <span class=\"p\">(</span>\n  <span class=\"k\">defn</span>\n  <span class=\"w\"/>\n  <span class=\"n\">to-rna</span>\n  <span class=\"p\">)</span>\n</pre>\n

# input with only one char difference, a single space between two of spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>

# output keeps its whitespace intact, with no newlines added
<pre class=\"highlight\"><span class=\"p\">(</span><span class=\"k\">defn</span> <span class=\"w\"> </span> <span class=\"n\">to-rna</span><span class=\"p\">)</span></pre>\n

At this point I'm a bit lost - any ideas?

Loofah broken by colons

I'm using JRuby 1.6.5, Loofah 1.2.1, Nokogiri 1.5.5-java. Loofah is broken when it comes to colons inside the body of any tags. Here's a succinct example of what's wrong, from a console session:

>> Loofah.fragment("4:30am").to_s
=> "4:30am"
>> Loofah.fragment("<span>4:30am</span>").to_s
=> "<span>30am</span>"

>> Loofah.fragment("4:30am")
=> #<Loofah::HTML::DocumentFragment:0x1086 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1084 "4:30am">]>
>> Loofah.fragment("<span>4:30am</span>")
=> #<Loofah::HTML::DocumentFragment:0x108c name="#document-fragment" children=[#<Nokogiri::XML::Element:0x108a name="span" children=[#<Nokogiri::XML::Text:0x1088 "30am">]>]>

Notice how the "4:" has been mysteriously chomped. It doesn't matter if it's span, b, i or any other tag -- all remove the "4:". It seems that Nokogiri is able to parse the HTML initially just fine:

>> Nokogiri::HTML("4:30am")
=> #<Nokogiri::HTML::Document:0x1074 name="document" children=[#<Nokogiri::XML::DTD:0x106a name="html">, #<Nokogiri::XML::Element:0x1072 name="html" children=[#<Nokogiri::XML::Element:0x106c name="head">, #<Nokogiri::XML::Element:0x1070 name="body" children=[#<Nokogiri::XML::Text:0x106e "4:30am">]>]>]>

>> Nokogiri::HTML("<b>4:30am</b>")
=> #<Nokogiri::HTML::Document:0x1082 name="document" children=[#<Nokogiri::XML::DTD:0x1076 name="html">, #<Nokogiri::XML::Element:0x1080 name="html" children=[#<Nokogiri::XML::Element:0x1078 name="head">, #<Nokogiri::XML::Element:0x107e name="body" children=[#<Nokogiri::XML::Element:0x107c name="b" children=[#<Nokogiri::XML::Text:0x107a "4:30am">]>]>]>]>

So it looks like it must be Loofah that's removing it.

improve rails integration points

I found that just config.gem'ing loofah didn't quite work, because it hadn't extended ActiveRecord::Base with html_fragment and friends.

require 'loofah/active_record' was enough to fix. I noticed an init.rb too, but it was referring to classes that didn't exist.

I took a page from factory_girl's initialization and added some logic to lib/loofah.rb to require loofah/active_record if Rails is present. I also updated init.rb to just require loofah.

Updated in my fork/branch:

encoding with ruby 1.9

I've checked to see if this issue is fixed.

irb(main):008:0> utf8_string = "あ<b>い</b>う<script>え</script>お"
=> "あ<b>い</b>う<script>え</script>お"
irb(main):009:0> Loofah.scrub_fragment(utf8_string, :strip).to_s
=> "あ<b>い</b>うã\u0081\u0088お"

I think the above code should return あ<b>い</b>うえお.
(:escape works correctly.)

The ActiveRecord extension *almost* works for MongoMapper

The methods defined in the ActiveRecord extension class actually work as-is for MongoMapper. The line that extends ActiveRecord::Base, though, blows up when ActiveRecord is not present and I include that class manually. I "fixed" this in my own branch (mrkurt/loofah@a336a9c) by moving the actual extension call into the Rails/ActiveRecord initialization bits. Would you mind tweaking the real gem so I can use the ActiveRecordExtension like this?
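
The shape of that change can be sketched as follows. The sketch is illustrative only and is not the code from the branch: `HypotheticalLoofahExtension` and `HypotheticalMongoDocument` are stand-in names, and the method body is a placeholder. The point is that the module can be defined unconditionally while the `ActiveRecord::Base.extend` call is guarded by `defined?`.

```ruby
# Stand-in for the extension module; the real module's contents differ.
module HypotheticalLoofahExtension
  def html_fragment(attribute, options = {})
    # Placeholder body: a real implementation would register a scrubbing
    # callback. Here it just returns its arguments.
    [attribute, options]
  end
end

# Only touch ActiveRecord when it has actually been loaded.
if defined?(ActiveRecord::Base)
  ActiveRecord::Base.extend(HypotheticalLoofahExtension)
end

# A non-ActiveRecord model class can opt in by extending the module itself.
class HypotheticalMongoDocument
  extend HypotheticalLoofahExtension
end

p HypotheticalMongoDocument.html_fragment(:content, :scrub => :whitewash)
```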

Consider removing unprintable characters?

We've encountered some issues with HTML containing unprintable characters (namely \u2028\u2029) over on the Stringer project (stringer-rss/stringer#295). The issue manifests itself further down the chain once we try to load the HTML with unprintable characters as JSON.

We've resolved our issues by adding a gsub(/[^[:print:]]/, '') call after running the contents through loofah. I think it makes sense to add this as part of loofah's sanitizing process. Any thoughts before I take a stab at a PR?
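
One thing to weigh before a PR: `[^[:print:]]` also matches newlines and tabs, so applying it to every document would flatten multi-line content. A narrower option is to remove only the two characters named above. The sketch below assumes that U+2028 and U+2029 are the only characters that need to go; `LINE_SEPARATORS` and `strip_line_separators` are made-up names.

```ruby
# U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR
LINE_SEPARATORS = /[\u2028\u2029]/

# Hypothetical helper: drop the two separators and leave ordinary
# newlines and tabs untouched.
def strip_line_separators(html)
  html.gsub(LINE_SEPARATORS, "")
end

puts strip_line_separators("first\u2028second\u2029\nthird").inspect
```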

