Comments (35)
i can understand that you do not have time but that should not be an argument to close this issue. you can give it a low priority maybe some one else can take this one.
It's not just that I don't have a lot of time, but also that I've seen nothing that leads me to believe that it serves a purpose ; I don't just prioritize it low, I can't distinguish its priority from Priority:Worthless.
from java-html-sanitizer.
I'm not going to expand the space of configurations to specify different policies on entity encoding.
Expanding the configuration space reduces the mean test coverage per configuration and the minimum test coverage per configuration.
Noone has yet outlined a use-case that requires this extra configuration.
I have too many other demands on time to write and maintain lots of extra tests for configurations that don't meet a required use-case.
I can't recommend this library for use in production if the minimum test coverage per configuration is low.
I want to be able to recommend this library for use in production.
CJK languages compress better when arbitrary code-points are not encoded as character reference.
Therefore, until I see a clear and well-argued use-case,
I will not complicate the code that decides whether to encode a code-point as a character reference.
from java-html-sanitizer.
This is a bug for sure. The sanitize should not replace entities as they are
the generalized way of expressing extended characters regardless of encoding.
They are perfectly valid and safe in HTML.
Original comment by [email protected]
on 24 Jun 2014 at 4:32
from java-html-sanitizer.
Are you not seeing codepoint U+A0 in the output?
Original comment by [email protected]
on 24 Jun 2014 at 9:10
from java-html-sanitizer.
I'm seeing "?" in the output. We use CKEditor which has a feature to insert
extended characters which are inserted as HTML entities.
You can see the ckeditor demo here:
http://ckeditor.com/demo
If the user does this, the result after sanitizing replaces these entities with
"?". This occurs whether I use the entity name or number. E.g.
<p>blah blah blah ♦</p>
<p>blah blah blah ♦!</p>
Original comment by [email protected]
on 25 Jun 2014 at 10:09
from java-html-sanitizer.
I don't know what CKEditor has to do with this bug, and you talk about '?' in
the output but to understand the output, I'd have to actually see the headers
and meta-content of the response you're serving.
I added a test (see patch below) and entities seems to work fine. The HTML
sanitizer decodes entities just fine and normalizes them in the output.
As long as your content-type header's charset matches the charset you used to
encode your string, then the content should reach the browser just fine.
Index: src/tests/org/owasp/html/HtmlPolicyBuilderTest.java
===================================================================
--- src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (revision 235)
+++ src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (working copy)
@@ -282,6 +282,17 @@
"<select>\n <option>1</option>\n <option>2</option>\n</select>"));
}
+ @Test
+ public static final void testEntities() throws Exception {
+ assertEquals(
+ "(Foo)\u00a0(Bar)\u2666\u2666\u2666\u2666(Baz)"
+ + "(Boo)",
+ apply(
+ new HtmlPolicyBuilder(),
+ "(Foo) (Bar)♦♦♦♦(Baz)"
+ + "\ud812\udc34(Boo)"));
+ }
+
private static String apply(HtmlPolicyBuilder b) throws Exception {
return apply(b, EXAMPLE);
}
Original comment by [email protected]
on 25 Jun 2014 at 1:09
from java-html-sanitizer.
I added a test to HtmlSanitizerTest
@Test
public static final void testDiamond() throws Exception {
assertEquals("<p>blah blah blah ♦</p>", sanitize("<p>blah blah blah ♦</p>"));
}
I get:
junit.framework.ComparisonFailure: null
Expected :<p>blah blah blah ♦</p>
Actual :<p>blah blah blah ♦</p>
Original comment by [email protected]
on 25 Jun 2014 at 3:10
from java-html-sanitizer.
And what I don't understand is
> Expected :<p>blah blah blah ♦</p>
> Actual :<p>blah blah blah ♦</p>
Why is the actual result a problem?
Original comment by [email protected]
on 5 Jul 2014 at 8:28
from java-html-sanitizer.
The issue is text portability. Named entities preserve the text as simple ascii
which is universal. In our situation we send various html content to 3rd party
systems so we can't guarantee a common unicode for special characters.
Currently we added jericho as a dependency to reparse and convert these back to
entities.
Unless you think there's some security issue with entities I think they should
be preserved. Best would probably be to have a setting for this in the
HtmlPolicyBuilder.
Thanks
Original comment by [email protected]
on 6 Jul 2014 at 10:53
from java-html-sanitizer.
Having same issue ' is replaced with it's ' code @ with @
Version: r239
What shoud I do to avoid such behaviour?
Original comment by [email protected]
on 11 Oct 2014 at 5:42
from java-html-sanitizer.
The adventures of ' from the input:
Version: r239
org.owasp.html.Encoding
line 55-60:
it calls method HtmlEntities.decodeEntityAt(s, amp, n);
it calculates codepoint 39 for ' so it puts ' symbol in StringBuilder instead
of '...
After it in encodeHtmlOnto method line:168-176
it rereplaces ' with it's code from String[] REPLACEMENTS (it's for sure ').
but yet I don't understand why, or how to easily add ' to allowed symbols?
Original comment by [email protected]
on 11 Oct 2014 at 6:29
from java-html-sanitizer.
Why are these replacements problems?
Original comment by [email protected]
on 13 Oct 2014 at 10:23
from java-html-sanitizer.
Numerical entities would be interpreted differently depending on unicode. Named
entities should really be preserved for text portability.
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Original comment by [email protected]
on 13 Oct 2014 at 11:10
from java-html-sanitizer.
Agree with [email protected].
Character entity references in HTML are already a symbol references, widely
used.
It will be great if they will be just ignored by, maybe some optional setting
implemented to HtmlEntities.
Text pocessing could be bore predictable and reversable in this case.
Original comment by [email protected]
on 14 Oct 2014 at 8:31
from java-html-sanitizer.
Replacing with a unicode no-break-space messes me up because I use UTF-8
encoding, and when I UTF-8 encode a unicode no-break-space it results in Â.
If I pass in " " then I want to get " " back. Why bother replacing it?
Original comment by [email protected]
on 28 Jan 2015 at 4:41
from java-html-sanitizer.
We have an another issue example, if you enter ’
sanitizer covert this to html character (´) which is not recognized by rendered engine (UTF-8).
Actually we are not understanding why santizer required to change html entities to characters. This happend to only some entites not for all (example <
is not converted)
Kr,
Urvish
from java-html-sanitizer.
Hi,
In fact if use different encoding rather than UTF-8 there is problem.
If use ISO-8859-1 it does not convert HTMLentities well to character.
But we are not understanding why it is required to convert HTML entities to character.
Example (default file encoding ISO-8599-1),
Before sanitize: ’
After sanitize: ? (if we use encoding UTF-8 it gives ´(SINGLE RIGHT QUOTATION MARK))
According to us this is incorrect?
Kr,
U
from java-html-sanitizer.
i can understand that you do not have time but that should not be an argument to close this issue. you can give it a low priority maybe some one else can take this one.
you need a clear and well argued use case. ok here is mine.
i have a front-end where, via a rich text editor, a user can add content. which will be stored in the db and shown to users on some other page.
i want to make this secure so the editor can't add tags which are not allowed but he should still be able to add all the text he wants.
i also want to give clear error messages.
so ideally i would receive the added text. i would run it through the sanitizer to remove tags i don't want. then i would compare the added text with the cleaned text. if this is different i return an error telling the editor he added wrong input.
second spaces should still be kept as   because multiple   is different presented in html than multiple spaces (they are compressed to one space).
the strange thin is i have all the options to configure which tags i want with even which arguments but i have no option at all to toggle the encoding.
both scenario's are not possible because the text is encoded when i sanitize. encoding and sanitation are 2 different steps which should be two different steps in the code. this will not add you more unit tests it just would make the lib more modular.
i won't mind helping out 😄
from java-html-sanitizer.
from #43 to put all the examples in 1 issue
test case:
@Test
public void testSpace() {
String text = "L&nbsp;&nbsp;&nbsp;L";
assertEquals(text, Sanitizers.FORMATTING.sanitize(text));
}
why:
1)
a is something different than a space. when i get a from my richttext editor i want to preserve it. in the above example when i would add the sanitized text into an html page it would look like L L instead of L L.
2)
i want to know if the user added something wrong and present an error:
if (!StringUtils.equals(text, clean)) {
addFieldError("wrong input! please check cleaned text");
}
i don't want this to happen after a spcae replacement.
todo:
remove the to space part or make it optional.
work around:
replace the by spaces. do the sanitizing and checking and re add the
from java-html-sanitizer.
is something different than a space
Agreed. That's why I don't normalize it as space (U+20) and instead re-encode it as non-breaking space (U+A0).
from java-html-sanitizer.
ok but for me this tool is worthless now because i need to sanitize and not decode/encode.
somehow it is not important for you to preserve added text. how can i sanitize and keep the text equal. how can i tell my users that they added wrong tags?
from java-html-sanitizer.
Testing string equality is extremely brittle -- there are many HTML strings with the same semantics.
What does HtmlChangeListener
not provide that you need to do your check?
from java-html-sanitizer.
yes i was looking into that one interesting that i get an call for every change well every sanitize change. the decode encode changes are not send to a listener.
but its not a problem that html strings have the same semantics when it stays the same i do not have to report any thing to the user.
but i agree with the listener i think i can report back to the user.
so how can i keep the text equal?
from java-html-sanitizer.
i ran the test again with some extra chars and they all get changed.:
@Test
public void testSpace() {
String text = " ' = _+ ;'\" ";
String result = Sanitizers.FORMATTING.sanitize(text);
assertEquals(text, result);
}
so the nbsp's get decoded and the others get encoded. what i find hard to understand is that you seem a very smart coder but don't mind a method does not do what it is called. when you create a method sanitize and what it does is sanitize and encode/decode i find it really confusing. apart that i cant' use the tool now. i either need to fork this or need to get the text out first and re-add it later.
from java-html-sanitizer.
If you have problems with HtmlChangeListener, open an issue about that.
I define "sanitize" in a specific way that allows semantics preserving changes to documents in addition to elision.
If you choose to fork or not use the tool then good luck to you.
from java-html-sanitizer.
i don't have any problem with the HtmlChangeListhren but it is interesting that you find some changes changes reportable and others are quietly done.
i think we agree on all but what you find semantics preserving changes for me (and the others above) are real changes which are not acceptable for me nor for my client. i will go on and fork it than.
i think your tool would be much better when you split the method sanitize into sanitize which will clean the text and a method dencode which will change chars for some reason.
from java-html-sanitizer.
i think we agree on all but what you find semantics preserving changes for me
This is defined by the HTML5 specification. It's not just a matter of opinion.
i think your tool would be much better when you split the method sanitize into sanitize which will clean the text and a method dencode which will change chars for some reason.
This is not just a whim. Sanitizing requires taking a large and inconsistently interpreted set of strings and producing an output that is in a safe and consistently interpreted set. You seem to think that defining that subset without encoding/decoding is easy. My experience writing sanitizers has taught me otherwise.
from java-html-sanitizer.
just for future references I ran into another issue. with encoding text which is that we use place holders which get replaced by a process later in the chain. these text can be anything.
so if the place holder text would be te'st than it will be encoded to te'st and lose its place holder capability.
one more thing when reading through the documentation i could not find anything about encoding. can you point me to these documents?
from java-html-sanitizer.
Encoding and decoding of HTML text nodes are specified in http://www.w3.org/html/wg/drafts/html/master/single-page.html#character-references
from java-html-sanitizer.
thanks. but w3 does not say what this html sanitizer does.
i would expect somewhere a remark or documentation about what this tool does besides sanitizing. which chars are decoded (at least the  ) and which are encoded.
from java-html-sanitizer.
I'm intentionally not specifying that. Clients should not rely on the exact character output because I may change that in response to new threats.
from java-html-sanitizer.
ah ok so basically user who use this tool should be warned that by upgrading the encoding and decoding could changed without notice.
from java-html-sanitizer.
Those changes would be discussed in release notes.
Again, if you are upset at Mr. Samuels choices, I encourage you to fork
the project.
Please note that Michael is a volunteer and this is an open source
project that he has donated to the OWASP foundation. He is also kind
enough to continue maintaining the project to the best of his ability.
Please be sure to maintain respectful, helpful discourse.
Aloha,
Jim
On 10/6/15 1:44 PM, tibistibi wrote:
ah ok so basically user who use this tool should be warned that by
upgrading the encoding and decoding could changed without notice.—
Reply to this email directly or view it on GitHub
#30 (comment).
from java-html-sanitizer.
jim first I’m not addressing anyone in particular, anyone with knowledge of the project can respond.
second I spend my time on this forum/issue because I think it will improve the product so it could be used in more real life situations. I see some other posters who have problems with the encoding and decoding like myself.
third there was a question to provide clear and well argued cases, so that's what I’m doing.
off course I can fork the project to change it to whatever I want but that will not improve the community it will just clutter it and probably my implementation will not be as good because off lake of experience and overview being new to the code.
and finally if you think I’m in any way not respectful just point me to what sentence let you think that so I can update it. I’m not a native English speaker so it could be that I made some mistakes.
from java-html-sanitizer.
Fair comments and thank you for your support.
Jim Manico
@manicode
(808) 652-3805
On Oct 6, 2015, at 2:04 PM, tibistibi [email protected] wrote:
jim first I’m not addressing anyone in particular, anyone with knowledge of the project can respond.
second I spend my time on this forum/issue because I think it will improve the product so it could be used in more real life situations. I see some other posters who have problems with the encoding and decoding like myself.
third there was a question to provide clear and well argued cases, so that's what I’m doing.
off course I can fork the project to change it to whatever I want but that will not improve the community it will just clutter it and probably my implementation will not be as good because off lake of experience and overview being new to the code.
and finally if you think I’m in any way not respectful just point me to what sentence let you think that so I can update it. I’m not a native English speaker so it could be that I made some mistakes.
—
Reply to this email directly or view it on GitHub.
from java-html-sanitizer.
Related Issues (20)
- xxx-large font-size is discarded when allowStyling() is used HOT 6
- Issue while disallowing attributes matching pattern
- Remove malicious code from svg content HOT 1
- Encoding malicious code instead of removing it HOT 4
- Index out of bound when empty list is passed to `allowAttributes(...).globally()`
- Guava removal breaks compatibility (with JDK9) HOT 13
- Html sanitizer repeatedly adds rel="noopener noreferrer" even if it's pre-exist HOT 1
- SECURITY.MD currently does not contain sensible information
- Sanitizing CSS HOT 3
- ClassNotFoundException: org.owasp.shim.Java8Shim after update to 20240325.1 HOT 5
- Release 20240325 cannot be transpiled HOT 1
- Issue in 2024x version with styles
- Question: What means Recognize foreign content syntactic context: mathml / svg?
- Issues encountered while processing <a> tags
- rel attributes are reordered in 20220608.1
- Possible to enforce having mutliple attributes on tag?
- On Java8Shim class, better to catch Throwable instead of Error
- text-align literals are outdated
- Please build the Java8/10 shim classes into the sanitizer JAR
- Issue with HTML Sanitization: Improper Handling of <div> Tag Inside <table>
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from java-html-sanitizer.