flavorjones / loofah
Ruby library for HTML/XML transformation and sanitization
License: MIT License
While running RSpec tests for my application, which uses nokogiri, feedzirra and loofah, I get this error:
/home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:8:in `<module:XML>': uninitialized constant Nokogiri::XML::Document (NameError)
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:2:in `<module:Loofah>'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah/xml/document.rb:1:in `<top (required)>'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/loofah-1.0.0/lib/loofah.rb:15:in `<top (required)>'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `require'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:68:in `block (2 levels) in require'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `each'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:66:in `block in require'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `each'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler/runtime.rb:55:in `require'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/bundler-1.0.9/lib/bundler.rb:114:in `require'
from /home/mydir/project_path/config/application.rb:13:in `<top (required)>'
from /home/mydir/project_path/config/environment.rb:2:in `require'
from /home/mydir/project_path/config/environment.rb:2:in `<top (required)>'
from /home/mydir/project_path/spec/spec_helper.rb:3:in `require'
from /home/mydir/project_path/spec/spec_helper.rb:3:in `<top (required)>'
from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `require'
from /home/mydir/project_path/spec/controllers/mongo_mapper_controller_spec.rb:1:in `<top (required)>'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `block in load_spec_files'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `map'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/configuration.rb:388:in `load_spec_files'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/command_line.rb:18:in `run'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:55:in `run_in_process'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:46:in `run'
from /home/mydir/.rvm/gems/ruby-1.9.2-p0/gems/rspec-core-2.3.0/lib/rspec/core/runner.rb:10:in `block in autorun'
I get the same error with Ruby 1.8.7 and RSpec 2.5.0. I'm running the RSpec tests on Ubuntu 10.10 64-bit, and my RSS parsing gems are declared like this:
gem 'fii', '1.0.5'
gem "nokogiri"
gem "loofah", '1.0.0'
gem "feedzirra", '~> 0.0.24'
Please advise.
yes, I said to_markdown.
When you tag releases, would you mind pushing the tags to GitHub? For example, the latest version on Rubygems.org is 2.0.0, but the v2.0.0 tag is missing from the repository.
Some projects call for a more limited whitelist of acceptable elements. It would be easy enough to monkey patch, but you may consider this request common enough to support it through a method call instead.
I keep getting this warning, but I'm not too familiar with the regex contents to fix it.
Currently, the following code will add scrubbing behavior:
doc = Nokogiri::HTML HTML
doc.extend Loofah::ScrubBehavior::Node
but that doesn't include #text or the node and node set decorators.
So, I'd suggest that we allow Loofah initialize methods to receive a Nokogiri document / fragment and do the extension there. Then we should check that we're not relying on the typeiness of the Loofah document / fragment.
(And, actually, can we then eliminate the classes entirely, and just use extended Nokogiri docs everywhere?)
currently only accepts built-in scrubbers
We have carriage returns in our text fields, and loofah is turning them into entities.
This is undesirable. Is there a way to stop it?
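One possible workaround, assuming the carriage returns carry no meaning of their own, is to normalize line endings before the text ever reaches Loofah. The sketch below uses only core Ruby; normalize_newlines is a hypothetical helper name, not part of Loofah.

```ruby
# Hypothetical helper (not part of Loofah): convert Windows (\r\n) and
# bare (\r) line endings to \n, so that no carriage returns are left
# in the text handed to the sanitizer.
def normalize_newlines(text)
  text.gsub(/\r\n?/, "\n")
end

puts normalize_newlines("line one\r\nline two\rline three").inspect
```

This changes the stored text, so it only fits cases where "\n" is an acceptable line ending for the field.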
For example,
Loofah.fragment("<div>foo</div><div>bar</div>").text
=> "foobar"
For block HTML elements, we should insert whitespace (maybe even newline), so that the desired output would be:
foo bar
or
foo
bar
On my setup, Loofah gives a runtime error if you ask it to prune the empty string (but is fine with pruning a string with a single space):
>> Loofah.scrub_fragment(" ", :prune)
=> #<Loofah::HTML::DocumentFragment:0x1066 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1064 " ">]>
>> Loofah.scrub_fragment("", :prune)
NoMethodError: undefined method `scrub!' for []:Nokogiri::XML::NodeSet
from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah/instance_methods.rb:44:in `scrub!'
from /home/neil/src/jruby-1.6.5/lib/ruby/gems/1.8/gems/loofah-1.2.1/lib/loofah.rb:49:in `scrub_fragment'
from (irb):4:in `evaluate'
from org/jruby/RubyKernel.java:1088:in `eval'
from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:158:in `eval_input'
from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:271:in `signal_status'
from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:155:in `eval_input'
from org/jruby/RubyKernel.java:1420:in `loop'
from org/jruby/RubyKernel.java:1192:in `catch'
from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:154:in `eval_input'
from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:71:in `start'
from org/jruby/RubyKernel.java:1192:in `catch'
from /home/neil/src/jruby-1.6.5/lib/ruby/1.8/irb.rb:70:in `start'
from /home/neil/src/jruby-1.6.5/bin/jirb:13:in `(root)'
Is this a bug?
irb(main):012:0> utf8_string = "日本語"
=> "日本語"
irb(main):012:0> utf8_string.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> escaped = Loofah.scrub_fragment(utf8_string, :escape).to_s
=> "日本語"
irb(main):015:0> escaped.encoding
=> #<Encoding:US-ASCII>
irb(main):016:0> escaped.encode('UTF-8')
=> "日本語"
irb(main):019:0> escaped.force_encoding('UTF-8')
=> "日本語"
Just got this under Ruby 1.9.1 while parsing http://www.fd.nl/nieuws/laatstenieuws/?view=RSS
/home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:42:in `characters': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `native_parse_memory'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:11:in `initialize'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `new'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah/html/document_fragment.rb:18:in `parse'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:179:in `fragment'
from /home/greg/.rvm/gems/ruby/1.9.1/gems/loofah-0.2.2/lib/loofah.rb:184:in `scrub_fragment'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1362:in `strip_html'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:442:in `get_feed_summary'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:888:in `update_rss_feed'
from /home/greg/code/development/agents/rss/rewrite/feeds.rb:1130:in `update_feed'
The strip_html() method looks like this...
class String
  def strip_html
    html = Nokogiri::HTML.fragment(self.dup)
    # Replace <br> with a newline, and <p> with its text plus a newline.
    (html/:br).each { |_br| _br.swap("\n") }
    (html/:p).each { |_p| _p.swap(_p.content + "\n") }
    Loofah.scrub_fragment(html.content, :strip).text
  end
end
The text it's working on is:
fd.nl - De nieuwsbron van ondernemend Nederland. Financieel-economisch nieuws, achtergronden en analyses. Artikelen over beleggen, carrière en ondernemen. Uitgebreide koersen, rentestanden en bedrijfsinformatie. Met veel aandacht voor aandelenmarkten en beleggingsfondsen.
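The error message above says a UTF-8 regexp was matched against an ASCII-8BIT string. That class of failure can be reproduced in plain Ruby without nokogiri or Loofah, and re-tagging the bytes as UTF-8 before matching avoids it. This is a minimal sketch, assuming the feed bytes really are UTF-8; utf8_pattern and utf8_match? are hypothetical names used only for illustration.

```ruby
# A regexp containing a non-ASCII \u escape is fixed to UTF-8.
def utf8_pattern
  /\u00e9/
end

# Hypothetical helper: tag the raw bytes as UTF-8 (replacing any invalid
# sequences with "?") before matching, instead of matching binary data.
def utf8_match?(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  text = text.scrub("?") unless text.valid_encoding?
  !!(utf8_pattern =~ text)
end

binary = "caf\u00e9".b # same bytes, but tagged ASCII-8BIT

begin
  utf8_pattern =~ binary
rescue Encoding::CompatibilityError => e
  puts "direct match failed: #{e.class}"
end

puts utf8_match?(binary)
```

If the feed declares another encoding, transcoding with String#encode from that encoding would be the safer choice than force_encoding.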
obj = Object.new(:text => "<javascript>alert('evil')</javascript>")
obj.valid?
obj.text.should == "alert('evil')"
expected: "alert('evil')",
got: "<javascript>alert('evil')</javascript>" (using ==)
This blog post has details: http://www.mythoughtpot.com/2010/02/10/feedzirra-on-rails3/
$ rails server
/Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails.rb:44:in `configuration': undefined method `config' for nil:NilClass (NoMethodError)
from /Library/Ruby/Gems/1.8/gems/loofah-0.4.7/lib/loofah.rb:89
from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:64:in `require'
from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `each'
from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:62:in `require'
from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `each'
from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler/runtime.rb:51:in `require'
from /Library/Ruby/Gems/1.8/gems/bundler-1.0.3/lib/bundler.rb:112:in `require'
from /Users/dphillips/cnxweb/config/application.rb:7
from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28:in `require'
from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:28
from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27:in `tap'
from /Library/Ruby/Gems/1.8/gems/railties-3.0.1/lib/rails/commands.rb:27
from script/rails:6:in `require'
from script/rails:6
I'm seeing a test failure with the newer LibXML 2.9.0 in Fedora 19 and above:
=> testing with Nokogiri {"warnings"=>["Nokogiri was built against LibXML version 2.9.0, but has dynamically loaded 2.9.1"], "nokogiri"=>"1.5.9", "ruby"=>{"version"=>"2.0.0", "platform"=>"i386-linux", "description"=>"ruby 2.0.0p353 (2013-11-22 revision 43784) [i386-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "compiled"=>"2.9.0", "loaded"=>"2.9.1"}}
The same test succeeds on CentOS 6, with Ruby 1.9.3 and LibXML 2.8.0:
=> testing with Nokogiri {"warnings"=>[], "nokogiri"=>"1.6.1", "ruby"=>{"version"=>"1.9.3", "platform"=>"x86_64-linux", "description"=>"ruby 1.9.3p327 (2012-11-10 revision 37606) [x86_64-linux]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "source"=>"packaged", "libxml2_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxml2/2.8.0", "libxslt_path"=>"/home/gitoriousci193/.gems/gems/nokogiri-1.6.1/ports/x86_64-redhat-linux/libxslt/1.1.26", "compiled"=>"2.8.0", "loaded"=>"2.8.0"}}
The failing test is this one:
1) Failure:
IntegrationTestAdHoc#test_fragment_whitewash_on_microsofty_markup [/builddir/build/BUILD/loofah-1.2.1/usr/share/gems/gems/loofah-1.2.1/test/integration/test_ad_hoc.rb:146]:
--- expected
+++ actual
@@ -1 +1,4 @@
-"<p>Foo <b>BOLD</b></p>"
+"
+
+<p>Foo <b>BOLD</b></p>
+"
For some reason, nokogiri must be inserting some blank lines there.
I looked at the Travis builds, and apparently Travis is using the older LibXML 2.8.0, which might explain why we haven't seen this fail in Travis yet.
Rails 3 Support / Bundler support with railties.
Any idea how to work around this issue?
ruby-1.9.2-p136 :002 > Loofah.document("<a href=\"\u5927\">").scrub!(:strip)
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
from /Users/dphillips/.rvm/gems/ruby-1.9.2-p136/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20:in `gsub'
Hi,
Is it possible that removing the automatic loading of Loofah::Helpers breaks to_text?
irb
require 'loofah'
Loofah.fragment("abc").to_text
NameError: uninitialized constant Loofah::Helpers
OTOH Loofah.fragment("abc").to_s works.
Or is to_text considered part of the ActionView helpers?
Feels a bit strange. And: the to_text method itself is there.
best
Morus
Rake fails with an error if Ruby runs with RUBYOPT=-Ku:
rake db:create
/home/nleo/.rvm/gems/ruby-1.9.3-p327/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/
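The pattern in the message is written in terms of bytes: \302[\200-\240] is the UTF-8 byte sequence for U+0080 through U+00A0. When the source is treated as UTF-8, a lone \302 is likely read as an incomplete multibyte character, which matches the "too short escaped multibyte character" complaint. One way to express the same class in terms of code points is sketched below. It is an illustration only, not necessarily the fix Loofah adopted, and the method names are hypothetical.

```ruby
# Code-point form of /`|[\000-\040\177\s]+|\302[\200-\240]/.
# \s is left out because Ruby's \s (space, \t, \n, \v, \f, \r) already
# falls inside \u0000-\u0020.
def control_characters
  /`|[\u0000-\u0020\u007f]+|[\u0080-\u00a0]/
end

# Hypothetical helper: remove backticks, ASCII control characters and
# whitespace, and U+0080..U+00A0 from a value.
def strip_control_characters(value)
  value.gsub(control_characters, "")
end

puts strip_control_characters("java\u00a0script:\talert(`1`)")
```

Because the pattern contains no raw high bytes, it compiles the same way regardless of the -K option or the source encoding comment.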
There is a call to at_xpath which means that scrub_fragment does not work. 0.4.3 works.
irb(main):002:0> Loofah.scrub_fragment("Bold", :prune).to_s
NoMethodError: undefined method `at_xpath' for #
from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:31:in `serialize_root'
from /opt/local/lib/ruby/gems/1.8/gems/loofah-0.4.7/lib/loofah/html/document_fragment.rb:26:in `to_s'
from /opt/local/lib/ruby/1.8/irb.rb:154:in `inspect'
from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_str'
from /opt/local/lib/ruby/1.8/irb.rb:154:in `to_s'
from /opt/local/lib/ruby/1.8/irb.rb:154:in `write'
from /opt/local/lib/ruby/1.8/irb.rb:154:in `print'
from /opt/local/lib/ruby/1.8/irb.rb:154:in `eval_input'
from /opt/local/lib/ruby/1.8/irb.rb:259:in `signal_status'
from /opt/local/lib/ruby/1.8/irb.rb:147:in `eval_input'
from /opt/local/lib/ruby/1.8/irb.rb:146:in `eval_input'
from /opt/local/lib/ruby/1.8/irb.rb:70:in `start'
from /opt/local/lib/ruby/1.8/irb.rb:69:in `catch'
from /opt/local/lib/ruby/1.8/irb.rb:69:in `start'
from /opt/local/bin/irb:13
The text '1<2 and 2>1' is incorrectly converted to '11'.
require 'Loofah'
=> []
str = "1<2 and 2>1"
=> "1<2 and 2>1"
result = Loofah.scrub_fragment(str,:escape)
=> #<Loofah::HTML::DocumentFragment:0x262233c name="#document-fragment" children=[#<Nokogiri::XML::Text:0x2621cf2 "11">]>
result.to_s
=> "11"
exit
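If the input is known to be plain text rather than markup, one way to keep an HTML parser from reading "<2 and 2>" as a tag is to escape the text before it goes anywhere near the parser. This is a sketch using CGI.escapeHTML from Ruby's standard library; escape_plain_text is a hypothetical wrapper name, and the approach sidesteps the parse rather than changing how Loofah treats the input.

```ruby
require "cgi"

# Hypothetical wrapper: turn <, >, & and quotes into entities so the
# text is displayed literally instead of being parsed as markup.
def escape_plain_text(text)
  CGI.escapeHTML(text)
end

puts escape_plain_text("1<2 and 2>1") # prints: 1&lt;2 and 2&gt;1
```

This only applies when no markup at all is expected in the field; escaping real HTML this way would display the tags as text.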
Also, quantify the performance penalty with some benchmarks.
In 0.4.0 - lib/loofah.rb:
require 'loofah/xml/document'
This file is not in the gem. I'm puzzled why this isn't reported - doesn't this prevent everybody from using the gem??
undefined method `lstrip' for :Loofah::HTML::DocumentFragment
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/fragment_handler.rb:21:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/nokogiri-1.3.3/lib/nokogiri/xml/document_fragment.rb:7:in `initialize'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `new'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/html/document_fragment.rb:18:in `parse'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:179:in `fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah.rb:184:in `scrub_fragment'
/gentoo/usr/lib/ruby/gems/1.8/gems/loofah-0.2.0/lib/loofah/active_record.rb:27:in `html_fragment'
The code that I was using:
html_fragment :content, :scrub => :whitewash
Currently the grab bag that's in test_ad_hoc.rb is a hot mess.
I'm trying to sanitize a CSS property with a negative value, but Loofah removes the property. I've attached the failing test:
class IntegrationTestCssNegative < Loofah::TestCase
  def test_css_negative_value_sanitization
    html = "<span style=\"word-spacing: -1px;\"></span>"
    sane = Nokogiri::HTML(Loofah.scrub_fragment(html, :escape).to_xml)
    assert_match %r/-1px/, sane.inner_html
  end
end
Canonical list here:
Reported on ruby-talk by Une Bévue
Because I use nokogiri daily, I wanted to test loofah with a small script (taken from http://loofah.rubyforge.org/loofah/):
#! /opt/local/bin/ruby1.9
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'loofah'
unsafe_html="ohai! <div>div is safe</div> <script>but script is not</script>"
doc=Loofah.fragment(unsafe_html).scrub!(:strip)
puts doc.to_s
However, I got:
SyntaxError:
/opt/local/lib/ruby1.9/gems/1.9.1/gems/loofah-1.0.0/lib/loofah/html5/scrub.rb:20: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/
method require in untitled document at line 29
method require in untitled document at line 29
method <top (required)> in loofah.rb at line 9
method require in untitled document at line 33
method rescue in require in untitled document at line 33
method require in untitled document at line 29
method <main> in loofah_first_test.rb at line 22
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10]
over Mac OS X SL
I noticed that some HTML comment tags are not removed.
Here is an example; my_scrub should remove all the comments.
Loofah.document("<!DOCTYPE html><!--[if IE 7]><!-- --><html><body><script></script></body></html><!--ww -->").scrub!(my_scrub).to_xml
=> "<!DOCTYPE html>\n<!--[if IE 7]><!-- --><html></html>\n"
I checked the code and think the problem is here:
https://github.com/flavorjones/loofah/blob/master/lib/loofah/instance_methods.rb#L41
case self
when Nokogiri::XML::Document
  scrubber.traverse(root) if root
when Nokogiri::XML::DocumentFragment
  children.scrub! scrubber
else
  scrubber.traverse(self)
end
So even an HTML::Document goes through scrubber.traverse(root) if root, which means that nodes outside the root html element (such as the comments before and after it) never pass through the scrubber.
No matter what scrubber I use, Loofah seems to always prune the contents. Here is a little test inside of Rails Console:
>> Loofah.scrub_fragment('I <3 You', :escape).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :prune).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :whitewash).to_s
=> "I "
>> Loofah.scrub_fragment('I <3 You', :strip).to_s
=> "I "
>> Loofah.scrub_fragment('I <3> You', :strip).to_s
=> "I You"
>> Loofah.scrub_fragment('I <3> You', :whitewash).to_s
=> "I You"
>> Loofah.scrub_fragment('I <3> You', :escape).to_s
=> "I You"
>> Loofah.scrub_fragment('I <3> You', :prune).to_s
=> "I You"
infos:
Ruby 1.9.3p327
Rails 3.2.11
Using gem 'loofah-activerecord'
from Gemfile.lock file
loofah (1.2.1)
nokogiri (>= 1.4.4)
loofah-activerecord (1.1.0)
loofah (>= 1.0.0)
loofah-1.2.0/lib/loofah/html5/scrub.rb:10: too short escaped multibyte character: /`|[\000-\040\177\s]+|\302[\200-\240]/
When I remove the failing line 10, it works!
I'm using MongoMapper on a project and loading Feedzirra, which uses loofah. phew
I removed ActiveRecord from my configuration and got this error message:
/Library/Ruby/Gems/1.8/gems/activesupport-2.3.5/lib/active_support/dependencies.rb:443:in `load_missing_constant':NameError: uninitialized constant ActiveRecord
Trying to use rake also results in an error:
uninitialized constant ActiveRecord
Using this to remove AR:
config.frameworks -= [:active_record]
Travis is totally blowing up on the JRubies. Figure it out.
Bit of a weird one, this. I've been trying to investigate this issue with Clojure syntax highlighting on exercism.io, and it seems to be caused by some weirdness in how Loofah sanitization deals with span elements that have no whitespace between them. If there are no spaces between the spans, it adds a newline between each one. If there is a space between at least one pair of spans, it leaves the whitespace intact.
Here's the example from the exercism.io issue. In each case the input is being run through Loofah.xml_fragment(html).scrub!(:escape).to_s.
# input with no spaces between spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>
# output has loads of newlines added
<pre class=\"highlight\">\n <span class=\"p\">(</span>\n <span class=\"k\">defn</span>\n <span class=\"w\"/>\n <span class=\"n\">to-rna</span>\n <span class=\"p\">)</span>\n</pre>\n
# input with only one char difference, a single space between two of the spans
<pre class="highlight"><span class="p">(</span><span class="k">defn</span><span class="w"></span><span class="n">to-rna</span><span class="p">)</span></pre>
# output retains its newlines correctly
<pre class=\"highlight\"><span class=\"p\">(</span><span class=\"k\">defn</span> <span class=\"w\"> </span> <span class=\"n\">to-rna</span><span class=\"p\">)</span></pre>\n
At this point I'm a bit lost - any ideas?
I'm using JRuby 1.6.5, Loofah 1.2.1, Nokogiri 1.5.5-java. Loofah is broken when it comes to colons inside the body of any tags. Here's a succinct example of what's wrong, from a console session:
>> Loofah.fragment("4:30am").to_s
=> "4:30am"
>> Loofah.fragment("<span>4:30am</span>").to_s
=> "<span>30am</span>"
>> Loofah.fragment("4:30am")
=> #<Loofah::HTML::DocumentFragment:0x1086 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x1084 "4:30am">]>
>> Loofah.fragment("<span>4:30am</span>")
=> #<Loofah::HTML::DocumentFragment:0x108c name="#document-fragment" children=[#<Nokogiri::XML::Element:0x108a name="span" children=[#<Nokogiri::XML::Text:0x1088 "30am">]>]>
Notice how the "4:" has been mysteriously chomped. It doesn't matter if it's span, b, i or any other tag -- all remove the "4:". It seems that Nokogiri is able to parse the HTML initially just fine:
>> Nokogiri::HTML("4:30am")
=> #<Nokogiri::HTML::Document:0x1074 name="document" children=[#<Nokogiri::XML::DTD:0x106a name="html">, #<Nokogiri::XML::Element:0x1072 name="html" children=[#<Nokogiri::XML::Element:0x106c name="head">, #<Nokogiri::XML::Element:0x1070 name="body" children=[#<Nokogiri::XML::Text:0x106e "4:30am">]>]>]>
>> Nokogiri::HTML("<b>4:30am</b>")
=> #<Nokogiri::HTML::Document:0x1082 name="document" children=[#<Nokogiri::XML::DTD:0x1076 name="html">, #<Nokogiri::XML::Element:0x1080 name="html" children=[#<Nokogiri::XML::Element:0x1078 name="head">, #<Nokogiri::XML::Element:0x107e name="body" children=[#<Nokogiri::XML::Element:0x107c name="b" children=[#<Nokogiri::XML::Text:0x107a "4:30am">]>]>]>]>
So it looks like it must be Loofah that's removing it.
A couple of commonly requested features:
One example is word-spacing, which is listed at http://www.w3.org/TR/CSS2/propidx.html but isn't included in Loofah's whitelist (or the whitelist in html5lib-ruby, which is obviously not being maintained anymore).
There's a greater problem here of keeping whitelists up to date with specs. Perhaps we can solve both.
The source for whitelist.rb has been updated, and already includes this support, but needs to be integrated into loofah:
https://github.com/html5lib/html5lib-ruby/blob/master/lib/html5/sanitizer.rb
@flavorjones I love that #to_text preserves whitespace (as you added here #12), but it doesn't appear to preserve it between inline elements as you say in this post (sparklemotion/nokogiri#636), only between block elements. Would it be reasonable for it to preserve whitespace between inline elements? Am I missing an option to do this? If not, would you be open to a pull request to make this happen?
I found that just config.gem'ing loofah didn't quite work, because it hadn't extended ActiveRecord::Base with html_fragment and friends.
require 'loofah/active_record' was enough to fix it. I noticed an init.rb too, but it referred to classes that didn't exist.
I took a page from factory_girl's initialization and added some logic to lib/loofah.rb to require loofah/active_record if Rails is present. I also updated init.rb to just require loofah.
Updated in my fork/branch: http://github.com/technicalpickles/loofah/tree/better-rails-int
I've checked to see if this issue is fixed.
https://github.com/flavorjones/loofah/issues/issue/26
irb(main):008:0> utf8_string = "あ<b>い</b>う<script>え</script>お"
=> "あ<b>い</b>う<script>え</script>お"
irb(main):009:0> Loofah.scrub_fragment(utf8_string, :strip).to_s
=> "あ<b>い</b>うã\u0081\u0088お"
I think the above code should return あ<b>い</b>うえお.
(:escape works correctly.)
Loofah removes the <figure> element: https://developer.mozilla.org/en-US/docs/HTML/Element/figure
Loofah.scrub_fragment("<span>hello</span> <figure>asd</figure>", :prune).to_s
The methods defined in the ActiveRecord extension class actually work as-is for MongoMapper. The line that extends ActiveRecord::Base, though, blows up when there's no ActiveRecord present and I include that class manually. I "fixed" this in my own branch (mrkurt/loofah@a336a9c) by moving the actual extension call into the Rails/ActiveRecord initialization bits. Would you mind tweaking the real gem so I can use the ActiveRecordExtension like this?
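The general shape of that change can be sketched in plain Ruby: define the macro in a module, extend ActiveRecord::Base only when it is actually loaded, and let other model classes opt in by hand. Everything named below (HtmlFragmentMacro, scrubbed_attributes, Post) is a hypothetical stand-in for illustration, not Loofah's real extension code or the code in the linked commit.

```ruby
# Hypothetical stand-in for the ActiveRecord extension module.
module HtmlFragmentMacro
  # Record which attribute should be scrubbed and with which scrubber.
  def html_fragment(attribute, options = {})
    scrubbed_attributes[attribute] = options.fetch(:scrub)
  end

  # Per-class registry of attribute => scrubber name.
  def scrubbed_attributes
    @scrubbed_attributes ||= {}
  end
end

# Extend ActiveRecord::Base only if it has been loaded, so that loading
# this file without ActiveRecord does not raise NameError.
ActiveRecord::Base.extend(HtmlFragmentMacro) if defined?(ActiveRecord::Base)

# Without ActiveRecord (for example, a MongoMapper model), a class can
# opt in manually.
class Post
  extend HtmlFragmentMacro
  html_fragment :content, :scrub => :whitewash
end

p Post.scrubbed_attributes
```

The defined? check evaluates to nil instead of raising when the constant is missing, which is what makes the guard safe in a project that has removed ActiveRecord.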
I tried to find a way to silence these messages but could not locate the code in loofah or nokogiri.
Any idea where they come from? libxml?
> Loofah.document("<i id=a></i><i id=a></i>").to_text
element i: validity error : ID a already defined
We've encountered some issues with HTML containing unprintable characters (namely \u2028\u2029) over on the Stringer project (stringer-rss/stringer#295). The issue manifests itself further down the chain once we try to load the HTML with unprintable characters as JSON.
We've resolved our issues by adding a gsub(/[^[:print:]]/, '') call after running the contents through loofah. I think it makes sense to add this as part of loofah's sanitizing process - any thoughts before I take a stab at a PR?
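One caveat with that pattern is that [[:print:]] also excludes newlines and tabs, so the gsub above removes them too. A slightly narrower variant that keeps them is sketched below; strip_unprintable is a hypothetical helper name, and whether newlines and tabs should survive is an assumption about the use case.

```ruby
# Hypothetical helper: drop characters outside [[:print:]] (which
# includes U+2028 and U+2029), but keep newlines and tabs.
def strip_unprintable(text)
  text.gsub(/[^[:print:]\n\t]/, "")
end

puts strip_unprintable("first\u2028second\u2029\nthird").inspect
```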
Very cool sanitizer, but when I try to use letter-spacing with a negative value in one of the properties, it just returns 'style' without any properties.