Comments (12)
Hi! Thanks for asking this question. Let me explain what's going on ...
doc = Nokogiri::HTML4::DocumentFragment.parse("This is <sometext>")
doc.to_html
# => "This is <sometext></sometext>"
Nokogiri's HTML parser sees this tag and auto-corrects it with a closing tag. This is how the parser is configured for HTML, and it mimics what many browsers do when they encounter invalid or ill-formed HTML.
Loofah then sits on top of that Nokogiri object. When it strips invalid tags, that tag gets stripped.
Keep in mind that bare <
and >
are invalid outside of CDATA contexts like <script>
or <style>
.
You haven't explained your use case, but here are a few options for you to consider:
- if you have control of this content, make sure you use HTML entities instead of bare
<
and>
:
Loofah.scrub_fragment("This is <sometext>", :strip).to_html
# => "This is <sometext>"
- If you don't have control, but you want tags to appear in the output, use
:escape
instead of:strip
:
Loofah.scrub_fragment("This is <sometext>", :escape).to_html
# => "This is <sometext></sometext>"
I'm going to close this issue, but if you have more questions I'm happy to answer them.
from loofah.
Sorry, one latest comment, and I'm sorry to keep writing but your question and response are making me nervous because I think you're trying to do something that's unsafe.
In your original description you say the desired output includes the bytes this is <sometext>
without the closing tag. If you manage to do this, I think you'll see that the browser will follow the html5 spec and auto close it, and make it a sometext
element in the DOM, and the user will just see this is
. Does that make sense? If not, please go through the exercise of putting those bytes in a static HTML page and render it in your browser.
I don't think this is what you want. I think what you want is to emit this is <sometext>
so that the browser renders something that the user sees as this is <sometext>
.
Please don't conflate the byte stream with what you want the user to see in a browser-rendered DOM. This is why it is so dangerous to use "encode_special_chars".
from loofah.
Hi @flavorjones, thanks for the reply. Sorry to have missed out on the exact use case. Let me elaborate on the issue in detail:
Providing some sample input string below to mimic user-inputted content:
"<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this Q"
where xxx
is of CDATA type content
None of :strip
or :escape
scrubbers were helping here solely due to the self-closing tags being added from Nokogiri's HTML parsing
:strip
is generating output as below:
Loofah.scrub_fragment("<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this Q", :strip).text(encode_special_chars: false)
# => "unsafe script inserted Maintaining some metric under"
:escape
as:
Loofah.scrub_fragment("<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this Q", :escape).text(encode_special_chars: false)
# => "<script>unsafe script inserted</script> Maintaining some metric under <xxxx dollars for this q></xxxx>"
- For the above use case, we are looking to generate sanitized content without the additional self-closing tags coming from HTML parsing
"unsafe script inserted Maintaining some metric under <xxxx dollars for this Q"
Wondering if we can patch on top of Nokogiri's DocumentFragment.parse
to skip closing/formatting unclosed HTML tags (which are harmless anyway) and treat them as regular text content.
from loofah.
Again, the behavior you're seeing is from libxml2's HTML4 parser, so there's very little we can do in Nokogiri or Loofah to change that behavior if we wanted to.
The key point is that you're passing ill-formed HTML to Loofah and Nokogiri. We could debate a bit about the "right" way to autocorrect ill-formed markup, since the HTML4 spec doesn't specify how it should be done, but ...
Notably, though, Nokogiri's HTML5 parser (libgumbo) which follows the HTML5 spec behaves similarly. The HTML5 spec is very specific about how it thinks ill-formed markup should be recovered, and we wouldn't want to change that behavior.
My point, I guess, is Loofah is supposed to make the content safe for rendering in a browser. It's doing that in this case. It's not doing in the way you wish it did, but then again I'm not sure how what you're proposing could be made safe.
from loofah.
The other option you should consider, if what you're saying is that the string should be considered as "text" and not HTML, is to just call CGI.escapeHTML on the string.
from loofah.
Also please don't use the encode_special_chars option on untrusted input as then you're no longer sanitizing the output.
from loofah.
Thanks for the correction, right, the expectation is to emit this is <sometext>
and render the user-inputted content in the text editor as is (which I guess should be fine as they are unsafe) instead of auto-closing and treating it as HTML. This issue is becoming more apparent as we are treating the user-inputted text (text essentially containing <some CDATA>
type strings) which is being treated as unclosed/ill-formatted HTML content and parsing it with libgumbo/libxml2 (where the autocorrelation logic is falling in place). Doing this as per html5 spec is totally justified but from a UX standpoint, it is likely to be safe content from a user's perspective. This was the exact pain point for us. So trying to find a solution to achieve this where the unclosed tags can be skipped and safely treated as plain text and the rest of the logic can be followed as is (scrubbing off unsafe / rightly-formatted HTML tags).
Also, regarding the usage of "encode_special_chars" we are doing, it looks like this is into action upon the scrubbed content where essentially all of the unsafe content is stripped off and this should be unsafe, cmiiw; so trying to understand how doing this would be "dangerous" as you have previously mentioned.
from loofah.
If the user content is supposed to be text, then I would treat it as text and use GCI.escapeHTML
instead of an HTML sanitizer. An HTML sanitizer is only really needed if the untrusted content is actually intended to be interpreted as a DOM structure by the browser. Does that make sense?
With respect to encode_special_chars: false
, here's the scenario:
- Untrusted user input is the string
"hello <script>alert(1);</script> there"
- You scrub it:
Loofah.fragment("hello <script>alert(1);</script> there").scrub!(:strip)
# =>
# #(DocumentFragment:0xdf84 {
# name = "#document-fragment",
# children = [ #(Text "hello <script>alert(1);</script> there")]
# })
- You then serialize it, choosing not to encode special character entities:
Loofah.fragment("hello <script>alert(1);</script> there").scrub!(:strip).text(encode_special_chars: fal
se)
# => "hello <script>alert(1);</script> there"
- when you stick that string into your webpage, the javascript runs in the browser
Instead, you should just serialize it with the defaults:
Loofah.fragment("hello <script>alert(1);</script> there").scrub!(:strip).text
# => "hello <script>alert(1);</script> there"
which, when it arrives in the browser, will be rendered to the user's eyeballs as "hello <script>alert(1);</script> there"
and not executed as javascript.
Make sense?
from loofah.
Thanks. Got it, understood. But, the use-case we discussed is to remove unsafe content but render the required user-inputted string as is. Let's take a sample string something like below to understand more:
"hello there, <script>alert(1); some unsafe script injected</script> logging revenue: <million dollars for this Q."
Now, the requirement here is to strip/prune the unsafe script tag and render the rest of everything as is.
Expected: "hello there, logging revenue: <million dollars for this Q."
(with :prune
)
- Treating the content as text and
CGI.escapeHTML
on this would give:
"hello there, <script>alert(1); unsafe script injected</script> logging revenue: <million dollars for this Q."
Doing the above would render the content as is (same as input) to the user but the script tag isn't pruned off as the input is escaped for HTML content.
- Whereas, treating input as html content and pruning off with HTML sanitizer:
Loofah.fragment("hello there, <script>alert(1); some unsafe script injected</script> logging revenue: <million dollars for this Q.").scrub!(:prune).text
this gives, "hello there, logging revenue: "
but the user-inputted substring "<million dollars for this Q."
is pruned off too! (which looks like data loss from a user perspective)
So looking for a way to achieve the expected output as, using html parser solely for this case is auto-closing this ...<million...
input too which is eventually getting pruned off along with unsafescript
tags, whereas using CGI.escapeHTML is retaining the required substring ...<million...
as expected but is also showing unsafe content, the script tag.
Hoping the issue is clear now?
from loofah.
So looking for a way to achieve the expected output as ...
Let me be crystal clear: you're asking for something that no HTML parser does. Nokogiri and Loofah will not do this for you.
Sanitizers' job is to make unsafe content safe, that's it. You're asking for something else, something aesthetic, which is to treat <
as a start-of-tag character if it's starting a well-formed tag, but to treat it as a plaintext character if it's not. No HTML parser I know of will do this for you, and I've been trying to communicate to you the why behind that. I'm sorry if I've not done that well.
You need to choose which behavior is an acceptable compromise. And I'm begging you not to choose the option that makes your site insecure.
from loofah.
And I'll say it one more time just for the record: please do not use encode_special_chars: false
if you're rendering for a web browser. If you do that, you will be vulnerable to XSS attacks. This option is intended for rendering in textual contexts like SMS messages, not HTML contexts like web browsers.
from loofah.
Makes sense now, understood. Thanks a ton for bearing with this!
from loofah.
Related Issues (20)
- A whitespace handling change in v2.9.0 is breaking a test in our code HOT 1
- `#text` should only render HTML elements HOT 1
- explore testing with the portswigger xss cheat sheet exploits
- `#to_text` doesn't handle `<br>` elements well. HOT 4
- Adding sms to ACCEPTABLE_PROTOCOLS HOT 3
- tests fail with latest versions of dependencies HOT 1
- Loofah removes HOT 3
- HTML5 empty attributes are being scrubbed HOT 5
- CSS Scrubber is removing the builtin extended CSS color properties in `>= v2.9.0` HOT 5
- RFC: should Loofah sanitize `<style>` tag contents HOT 2
- Preserving emails that look like tags HOT 2
- loofah issue with recent CVE release HOT 2
- Getting errors using Nokogiri < 1.12 HOT 11
- pass encode_special_chars to to_s HOT 1
- Whitespace Added around "/" in CSS HOT 3
- Add scrub to append `target=_blank` to all links HOT 3
- Built-in scrubbers don't escape unsafe HTML with Nokogiri > 1.15 HOT 2
- feat: encapsulate some whitespace-handling into a scrubber (or scrubbers) HOT 3
- placeholder: when Nokogiri 1.17 is released, use the `parse_noscript_content_as_text` option by default
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from loofah.