Comments (10)
Marking private until triage and fix.
Original comment by [email protected]
on 27 Feb 2014 at 7:40
- Added labels: Private
from java-html-sanitizer.
> for the sanitizer to be effective for others who want to use it, they should
receive thoroughly tested configurations along with it.
agreed.
> Two sequential character classes are starred and not disjoint. Specifically:
[\\p{Zs}]*(\\s)*
Would making the first greedy address the problem?
Original comment by [email protected]
on 27 Feb 2014 at 7:41
Removing private flag since I triaged as non-critical.
Original comment by [email protected]
on 27 Feb 2014 at 8:27
- Changed state: Accepted
- Removed labels: Private
https://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=217
addresses this
Original comment by [email protected]
on 27 Feb 2014 at 8:30
[deleted comment]
Making either star lazy/greedy does not change the O(n^2) problem, it merely
rearranges the loop order that the regex engine will follow through those
O(n^2) steps.
You should always make adjacent starred (or plussed) character classes disjoint
(that is, no character should match both). In this case, since your
first character class already matches any \p{Zs} character, the remaining
disjoint set of characters in the second class is practically empty. The way
you remove this adjacent overlap of the character classes depends on whether
you need to capture the trailing whitespace at the end (e.g. for preserving it
in the output). My sense is that you don't, in which case, you should simply
remove the \\s* (whitespace characters), since it is already largely matched by
\p{Zs} (a whitespace character that is invisible, but does take up space). If
you do, you should require a non-whitespace character before you start matching
"trailing" whitespace.
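A minimal sketch of why the overlap matters, assuming only the two starred pieces quoted in this thread. The class name `OverlapDemo` and the helper `countSplits` are hypothetical and not part of the sanitizer. The helper counts how many ways a run of characters can be divided between the first star and the second. For a run of n characters that both classes accept there are n + 1 ways, and a backtracking engine may have to try each of them when the rest of the match fails, which is where the quadratic behaviour can come from when this repeats at many start positions.

```java
import java.util.regex.Pattern;

public class OverlapDemo {
    // The two adjacent starred pieces from the pattern under discussion.
    static final Pattern FIRST = Pattern.compile("[\\p{Zs}]*");
    static final Pattern SECOND = Pattern.compile("(\\s)*");

    /** Counts the split points at which FIRST matches the prefix and SECOND matches the suffix. */
    static int countSplits(String run) {
        int ways = 0;
        for (int k = 0; k <= run.length(); k++) {
            if (FIRST.matcher(run.substring(0, k)).matches()
                    && SECOND.matcher(run.substring(k)).matches()) {
                ways++;
            }
        }
        return ways;
    }

    public static void main(String[] args) {
        // SPACE is in both classes, so every split point of a run of four spaces works.
        System.out.println(countSplits("    "));              // prints 5
        // NO-BREAK SPACE is only in the first class and TAB only in the second,
        // so exactly one split point works.
        System.out.println(countSplits("\u00A0\u00A0\t\t"));  // prints 1
    }
}
```

When the two classes share no characters, there is only ever one way to divide the input between them, so the engine has nothing to retry.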
Original comment by [email protected]
on 27 Feb 2014 at 9:12
> Making either star lazy/greedy does not change the O(n^2) problem, it merely
rearranges the loop order that the regex engine will follow through those
O(n^2) steps.
Sorry, I misspoke. I changed it to be "possessive", not "greedy".
"Possessive" is defined thus at
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html :
"""* Possessive quantifiers, which greedily match as much as they can and do
not back off, even when doing so would allow the overall match to succeed."""
so my understanding is that the extra '+' in (x++) changes the backtracking
mode from Prolog-style backtracking to PEG-style backtracking.
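As a small illustration of the difference between greedy and possessive quantifiers in java.util.regex, here is a sketch using a toy pattern rather than the sanitizer's own. The class name `PossessiveDemo` and the helper `fullMatch` are hypothetical.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    /** Returns true if the whole input matches the regex. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Greedy: a* first takes all three 'a's, then backs off one
        // so that the trailing 'a' can match.
        System.out.println(fullMatch("a*a", "aaa"));   // prints true
        // Possessive: a*+ takes all three 'a's and never gives one back,
        // so the trailing 'a' has nothing left to match.
        System.out.println(fullMatch("a*+a", "aaa"));  // prints false
    }
}
```

Because a possessive star never revisits what it has consumed, it removes the retries between two overlapping adjacent stars, at the cost of rejecting inputs that only match by backing off.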
----
That said, I think you're right about
> whether you need to capture the trailing whitespace at the end (e.g. for
preserving it in the output). My sense is that you don't,
so there is no loss of function. I will eliminate that and leave a comment as
to why it's unnecessary.
Original comment by [email protected]
on 27 Feb 2014 at 9:38
Yes, I think any of those would be appropriate fixes, including the possessive
modifier you mentioned.
Original comment by [email protected]
on 27 Feb 2014 at 9:44
I can only eliminate the trailing \s* if \s is a subset of \p{Zs}, but it is
not: \s includes code points in the control character (Cc) category, which is
disjoint from the Z category.
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt says of TAB and SPACE
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
...
0020;SPACE;Zs;0;WS;;;;;N;;;;;
This affects HTML5 because HTML5 defines the space that can be ignored at the
beginning and end of certain attribute values at
http://www.w3.org/TR/html5/infrastructure.html#space-character thus:
"""
The space characters, for the purposes of this specification, are U+0020 SPACE,
"tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR" (U+000D).
"""
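A short sketch that checks the five HTML5 space characters quoted above against both classes, assuming the default java.util.regex flags (no UNICODE_CHARACTER_CLASS). The class name `Html5SpaceDemo` and the helper `classify` are hypothetical.

```java
import java.util.regex.Pattern;

public class Html5SpaceDemo {
    static final Pattern ZS = Pattern.compile("\\p{Zs}");
    static final Pattern S = Pattern.compile("\\s");

    /** Reports which of the two classes matches the single character c. */
    static String classify(char c) {
        String s = String.valueOf(c);
        return "Zs=" + ZS.matcher(s).matches() + " s=" + S.matcher(s).matches();
    }

    public static void main(String[] args) {
        // SPACE, tab, LF, FF and CR, as listed in the HTML5 definition.
        char[] html5Spaces = {' ', '\t', '\n', '\f', '\r'};
        for (char c : html5Spaces) {
            System.out.println(String.format("U+%04X ", (int) c) + classify(c));
        }
    }
}
```

Only U+0020 is matched by \p{Zs}; the other four are matched by \s alone, so dropping \s* would stop the pattern from covering them.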
----
Empirically, in my version of JDK6, the differences between \p{Zs} and \s for
code-points < U+E000 are:
9 inP=false, inQ=true
a inP=false, inQ=true
b inP=false, inQ=true
c inP=false, inQ=true
d inP=false, inQ=true
a0 inP=true, inQ=false
1680 inP=true, inQ=false
180e inP=true, inQ=false
2000 inP=true, inQ=false
2001 inP=true, inQ=false
2002 inP=true, inQ=false
2003 inP=true, inQ=false
2004 inP=true, inQ=false
2005 inP=true, inQ=false
2006 inP=true, inQ=false
2007 inP=true, inQ=false
2008 inP=true, inQ=false
2009 inP=true, inQ=false
200a inP=true, inQ=false
200b inP=true, inQ=false
202f inP=true, inQ=false
205f inP=true, inQ=false
3000 inP=true, inQ=false
as derived by
import java.util.regex.*;

public class Foo {
    public static void main(String[] argv) {
        Pattern p = Pattern.compile("[\\p{Zs}]");
        Pattern q = Pattern.compile("\\s");
        for (char c = 0; c < 0xE000; ++c) {
            String s = new StringBuilder(1).append(c).toString();
            boolean inP = p.matcher(s).matches();
            boolean inQ = q.matcher(s).matches();
            if (inP != inQ) {
                System.out.println(
                    Integer.toString(c, 16) + " inP=" + inP + ", inQ=" + inQ);
            }
            if (c == ' ' && !inP) {  // Sanity check
                System.err.println("Something widgy");
            }
        }
    }
}
Original comment by [email protected]
on 27 Feb 2014 at 10:55
The release with the fix is r223 which is currently staging to Maven central
and should be available at http://search.maven.org/#browse%7C84770979 shortly.
Original comment by [email protected]
on 28 Feb 2014 at 9:57
- Changed state: Fixed