Comments (7)

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
Are there any other constraints on output that your XML parser requires?

Does it recognize HTML specific entities like '?
Does it disallow codepoints not in 
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets , e.g. control characters 
besides \t \r \n and orphaned surrogates whether escaped or not?


Original comment by [email protected] on 18 Sep 2012 at 9:01

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
Hello,
thanks for your response.

I use the SAXParser from the JDK (OpenJDK implementation) to further process the 
output (to convert e-mail addresses to bitmaps and to remove duplicate element 
attributes and unnecessary whitespace).
It does recognize '.
It seems not to allow code points outside the ranges defined in 
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets (I tried some other 
control characters and some of the code points reserved for surrogates). 
However, I did not know what "escaped" and "not escaped" mean; I tried 
only this form: &#x???; Will this be a problem in some cases, or is it the 
desired and correct behavior in this case of HTML processing?

Original comment by [email protected] on 19 Sep 2012 at 6:40

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
> However, I did not know what "escaped" and "not escaped" mean; I tried 
> only this form: &#x???; Will this be a problem in some cases, or is it the 
> desired and correct behavior in this case of HTML processing?

By escaped I mean the sequence of chars seen by the XML parser contains
  '&', '#', '8', ';'
which represents control character 8 in HTML,
but by "not escaped", I mean the sequence of chars seen by the XML parser 
contains control character 8.
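
To make the distinction concrete, here is a minimal Java sketch of the two character sequences. The class and constant names are hypothetical and exist only for illustration.

```java
final class EscapedVsRaw {
    /** Escaped: the four characters '&', '#', '8', ';' (a numeric character reference). */
    static final String ESCAPED = "&#8;";

    /** Not escaped: the single control character with code point 8 itself. */
    static final String RAW = "\b";

    public static void main(String[] args) {
        // The escaped form is ordinary printable text that a parser later decodes.
        System.out.println("escaped length = " + ESCAPED.length());
        // The raw form is one char whose numeric value is 8.
        System.out.println("raw length = " + RAW.length()
                + ", code point = " + (int) RAW.charAt(0));
    }
}
```

In both cases the parser ends up looking at control character 8; the difference is only whether it arrives as a reference or as the character itself.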

> Will this be a problem...

It will not be a problem.  I ask because if I am going to try to ensure that 
the output of the HTML sanitizer is parsable by XML parsers, then I would 
rather solve the problem in one release than give you a release and 
have you file another bug because the parser now just fails a little later on 
the same input.

Original comment by [email protected] on 19 Sep 2012 at 6:52

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
To summarize all cases (for the Java SAXParser):

escaped control character (0x7) - Java string "�"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 23; Character 
reference "&#

escaped orphaned surrogate (0xD800) - Java string "�"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 27; Character 
reference "&#

unescaped control character (0x7) - Java string "\u0007"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 19; An invalid XML 
character (Unicode: 0x7) was found in the element content of the document.

unescaped orphaned surrogate (0xD800) - Java string "\uD800"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 20; An invalid XML 
character (Unicode: 0xd800) was found in the element content of the document.



escaped or unescaped 0x9 and 0xd7ff (from the ranges in 
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets) are working correctly
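
The cases above could be reproduced with something like the following sketch. It assumes the escaped forms are hexadecimal references such as `&#x7;` (following the `&#x???;` form mentioned earlier in the thread) and that the text sits inside a single `<p>` element; the class name `XmlCharCheck` and the helper `parses` are made up for this example, and exact error messages and column numbers will depend on the surrounding document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlCharCheck {
    /** Returns true if the JDK SAX parser accepts the body wrapped in a p element. */
    static boolean parses(String body) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader("<p>" + body + "</p>")),
                         new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Includes SAXParseException, which is what the summary above reports.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"escaped control character 0x7", "&#x7;"},
            {"escaped orphaned surrogate 0xD800", "&#xD800;"},
            {"unescaped control character 0x7", "\u0007"},
            {"unescaped orphaned surrogate 0xD800", "\uD800"},
            {"escaped tab 0x9", "&#x9;"},
            {"unescaped tab 0x9", "\t"},
            {"escaped 0xD7FF", "&#xD7FF;"},
            {"unescaped 0xD7FF", "\uD7FF"},
        };
        for (String[] c : cases) {
            System.out.println(c[0] + ": " + (parses(c[1]) ? "accepted" : "rejected"));
        }
    }
}
```

Going by the results listed above, the first four inputs should be rejected and the last four accepted.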

Original comment by [email protected] on 20 Sep 2012 at 8:36

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
I believe 
http://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=114 
addresses this issue.

It does three things.
(1) Makes sure that characters not in the XML Character set do not make it to 
the policy as inputs.  All invalid code-units are elided.
(2) Makes sure that similar characters that are emitted by a policy are elided 
on rendering so will not appear in the HTML output.
(3) Adds the self-closing tag marker to all HTML5 void elements ( 
http://www.w3.org/TR/html-markup/syntax.html#void-element ), so instead of 
seeing "<br>" in the output, you will see "<br />".

r114 is not yet packaged into a release.  Let me know if that works for you and 
I will put out a release.
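
As a rough illustration of what eliding characters outside the XML character set means in points (1) and (2), here is a standalone sketch. It is not the code from r114; the class `XmlCharFilter` and its methods are hypothetical, and it handles only raw characters in a Java string, not escaped references such as `&#x7;`.

```java
final class XmlCharFilter {
    /** True if cp matches the Char production of XML 1.0. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns s with every code point outside the XML character set dropped. */
    static String stripNonXmlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns the surrogate value itself for an orphaned
            // surrogate, so orphans fail isXmlChar and are dropped.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cleaned = stripNonXmlChars("a\u0007b\uD800c");
        System.out.println(cleaned.length());
    }
}
```

Iterating by code point rather than by `char` keeps valid surrogate pairs (supplementary characters) intact while still removing orphaned surrogates.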

Original comment by [email protected] on 21 Sep 2012 at 10:25

  • Changed state: Started

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
It works. I checked:
- whether it solves the original problem with <br>, <hr>, etc.
- whether it removes the characters (escaped or not escaped) which are not parseable 
by the XML parser (even when they are in tag names, attribute names or 
attribute values)
- whether the policy allow/disallow rules work when there are such characters in the tag 
or attribute names (but I am not sure if I tried all the possible cases)

thanks

Original comment by [email protected] on 22 Sep 2012 at 12:41

GoogleCodeExporter avatar GoogleCodeExporter commented on August 16, 2024
Release 117 includes the XML compatibility changes and is now available via the 
Downloads tab and via Maven.  I'm marking this issue closed.  Please reopen if 
you run into related problems with the new release.

Change log : 
http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/CHANGE_LOG.html

Original comment by [email protected] on 22 Sep 2012 at 11:07

  • Changed state: Fixed
