Git Product home page Git Product logo

Comments (10)

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
Marking private until triage and fix.

Original comment by [email protected] on 27 Feb 2014 at 7:40

  • Added labels: Private

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
> for the sanitizer to be effective for others who want to use it, they should 
receive thoroughly tested configurations along with it.

agreed.

> Two sequential character classes are starred and not disjoint. Specifically: 
[\\p{Zs}]*(\\s)*

Would making the first greedy address the problem?

Original comment by [email protected] on 27 Feb 2014 at 7:41

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
Removing private flag since I triaged as non-critical.

Original comment by [email protected] on 27 Feb 2014 at 8:27

  • Changed state: Accepted
  • Removed labels: Private

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
https://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=217 
addresses this

Original comment by [email protected] on 27 Feb 2014 at 8:30

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
[deleted comment]

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
Making either star lazy/greedy does not change the O(n^2) problem, it merely 
rearranges the loop order that the regex engine will follow through those 
O(n^2) steps.

You should always make adjacent starred (or plused) character classes disjoint 
(they don't share any characters that match in both). In this case, since your 
first character class already matches any \p{Zs} character, the remaining 
disjoint set of characters in the second class is practically empty. The way 
you remove this adjacent overlap of the character classes depends on whether 
you need to capture the trailing whitespace at the end (e.g. for preserving it 
in the output). My sense is that you don't, in which case, you should simply 
remove the \\s* (whitespace characters), since it is already largely matched by 
\p{Zs} (a whitespace character that is invisible, but does take up space). If 
you do, you should require a non-whitespace character before you start matching 
"trailing" whitespace.

Original comment by [email protected] on 27 Feb 2014 at 9:12

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
> Making either star lazy/greedy does not change the O(n^2) problem, it merely 
rearranges the loop order that the regex engine will follow through those 
O(n^2) steps.

Sorry, I misspoke.  I changed it be "possessive", not "greedy".

"Possessive" is defined thus at 
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html :

"""* Possessive quantifiers, which greedily match as much as they can and do 
not back off, even when doing so would allow the overall match to succeed."""

so my understanding is that the extra '+' in (x++) changes the backtracking 
mode from Prolog-style backtracking to PEG-style backtracking.

----

That said, I think you're right about

> whether you need to capture the trailing whitespace at the end (e.g. for 
preserving it in the output). My sense is that you don't, 

so there is no loss of function.  I will eliminate that and leave a comment as 
to why it's unnecessary.

Original comment by [email protected] on 27 Feb 2014 at 9:38

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
Yes, I think any of those would be appropriate fixes, including the possessive 
modifier you mentioned.

Original comment by [email protected] on 27 Feb 2014 at 9:44

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
I can only eliminate the trailing \s* if \s is a subset of \p{Zs} but it is not 
because \s includes code-points in the control character (Cc) category which is 
disjoint with the Z category.

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt says of TAB and SPACE
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
...
0020;SPACE;Zs;0;WS;;;;;N;;;;;

This affects HTML5 because HTML5 defines the space that can be ignored at the 
beginning and end of certain attribute values at 
http://www.w3.org/TR/html5/infrastructure.html#space-character thus:
"""
The space characters, for the purposes of this specification, are U+0020 SPACE, 
"tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR" (U+000D).
"""

----
Empirically, in my version of JDK6, the differences between \p{Zs} and \s for 
code-points < U+E000 are:

9 inP=false, inQ=true
a inP=false, inQ=true
b inP=false, inQ=true
c inP=false, inQ=true
d inP=false, inQ=true
a0 inP=true, inQ=false
1680 inP=true, inQ=false
180e inP=true, inQ=false
2000 inP=true, inQ=false
2001 inP=true, inQ=false
2002 inP=true, inQ=false
2003 inP=true, inQ=false
2004 inP=true, inQ=false
2005 inP=true, inQ=false
2006 inP=true, inQ=false
2007 inP=true, inQ=false
2008 inP=true, inQ=false
2009 inP=true, inQ=false
200a inP=true, inQ=false
200b inP=true, inQ=false
202f inP=true, inQ=false
205f inP=true, inQ=false
3000 inP=true, inQ=false

as derived by

import java.util.regex.*;

public class Foo {
  public static void main(String[] argv) {
    Pattern p = Pattern.compile("[\\p{Zs}]");
    Pattern q = Pattern.compile("\\s");
    for (char c = 0; c < 0xE000; ++c) {
      String s = new StringBuilder(1).append(c).toString();
      boolean inP = p.matcher(s).matches();
      boolean inQ = q.matcher(s).matches();
      if (inP != inQ) {
        System.out.println(
            Integer.toString(c, 16) + " inP=" + inP + ", inQ=" + inQ);
      }
      if (c == ' ' && !inP) {  // Sanity check
        System.err.println("Something widgy");
      }
    }
  }
}

Original comment by [email protected] on 27 Feb 2014 at 10:55

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
The release with the fix is r223 which is currently staging to Maven central 
and should be available at http://search.maven.org/#browse%7C84770979 shortly.

Original comment by [email protected] on 28 Feb 2014 at 9:57

  • Changed state: Fixed

from java-html-sanitizer.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.