Comments (7)
Are there any other constraints on output that your XML parser requires?
Does it recognize HTML specific entities like '?
Does it disallow codepoints not in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets , e.g. control characters
besides \t \r \n and orphaned surrogates whether escaped or not?
Original comment by [email protected]
on 18 Sep 2012 at 9:01
from java-html-sanitizer.
Hello,
thanks for your response.
I use the SAXParser from the JDK (OpenJDK implementation) to further process the
output (to convert e-mail addresses to bitmaps and to remove duplicate element
attributes and unnecessary whitespace).
It does recognize '.
It does not seem to allow code points outside the ranges defined in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets (I tried some other
control characters and some of the code points reserved for surrogates).
However, I did not know what "escaped" and "not escaped" mean; I tried
only this form: &#x???; Will this be a problem in some cases, or is it the
desired and correct behavior in this case of HTML processing?
Original comment by [email protected]
on 19 Sep 2012 at 6:40
> However, I did not know what "escaped" and "not escaped" mean; I tried
> only this form: &#x???; Will this be a problem in some cases, or is it the
> desired and correct behavior in this case of HTML processing?
By "escaped" I mean that the sequence of chars seen by the XML parser contains
'&', '#', '8', ';'
which represents control character 8 in HTML.
By "not escaped" I mean that the sequence of chars seen by the XML parser
contains control character 8 itself.
> Will this be a problem...
It will not be a problem. I ask because if I am going to try to ensure that
the output of the HTML sanitizer is parsable by XML parsers, then I would
rather solve the problem in one release than give you a release and
have you file another bug because the parser now just fails a little later on
the same input.
Original comment by [email protected]
on 19 Sep 2012 at 6:52
To summarize all cases (for the Java SAXParser):
escaped control character (0x7) - Java string "&#x7;"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 23; Character
reference "&#
escaped orphaned surrogate (0xD800) - Java string "&#xD800;"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 27; Character
reference "&#
unescaped control character (0x7) - Java string "\u0007"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 19; An invalid XML
character (Unicode: 0x7) was found in the element content of the document.
unescaped orphaned surrogate (0xD800) - Java string "\uD800"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 20; An invalid XML
character (Unicode: 0xd800) was found in the element content of the document.
escaped or unescaped 0x9 and 0xd7ff (both within the ranges in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets) are parsed without error
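The cases above can be reproduced with a small JDK-only sketch along these lines. The class name `XmlCharCases`, the helper `parseError`, the `<p>` wrapper element, and the `&#x...;` spelling of the escaped forms are all assumptions made for illustration; they are not the test document used in this thread, so line and column numbers in the messages will differ.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the sanitizer or of the original test.
class XmlCharCases {
    /** Returns null if the document parses, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (Exception e) {
            // I/O or parser configuration failure, unrelated to the document content.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Label and element content for each case; the wrapper element is an assumption.
        String[][] cases = {
            {"escaped control character 0x7", "&#x7;"},
            {"escaped orphaned surrogate 0xD800", "&#xD800;"},
            {"unescaped control character 0x7", "\u0007"},
            {"unescaped orphaned surrogate 0xD800", "\uD800"},
            {"escaped 0x9", "&#x9;"},
            {"unescaped 0x9", "\t"},
            {"escaped 0xD7FF", "&#xD7FF;"},
            {"unescaped 0xD7FF", "\uD7FF"},
        };
        for (String[] c : cases) {
            String error = parseError("<p>" + c[1] + "</p>");
            System.out.println(c[0] + ": " + (error == null ? "parsed" : "rejected: " + error));
        }
    }
}
```

The input is fed through a `StringReader` so that the unescaped characters reach the parser as Java chars rather than going through a byte encoding step.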
Original comment by [email protected]
on 20 Sep 2012 at 8:36
I believe
http://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=114
addresses this issue.
It does three things.
(1) Makes sure that characters not in the XML Character set do not make it to
the policy as inputs. All invalid code-units are elided.
(2) Makes sure that similar characters that are emitted by a policy are elided
on rendering so that they do not appear in the HTML output.
(3) Adds the self-closing tag marker to all HTML5 void elements (
http://www.w3.org/TR/html-markup/syntax.html#void-element ), so instead of
seeing "<br>" in the output, you will see "<br />".
r114 is not yet packaged into a release. Let me know if that works for you and
I will put out a release.
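As a rough illustration of the kind of filtering described in (1) and (2), the following sketch drops every code point that falls outside the Char production of the XML 1.0 specification linked earlier. It is not the code from r114; `XmlCharFilter`, `isXmlChar` and `elideNonXmlChars` are hypothetical names, and the sketch only handles already-decoded text, not numeric character references.

```java
// Hypothetical sketch, not the sanitizer's implementation.
class XmlCharFilter {
    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point outside the Char production dropped. */
    static String elideNonXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // An orphaned surrogate is returned as its own code unit value,
            // which lies in the excluded 0xD800-0xDFFF range.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(elideNonXmlChars("a\u0007b\uD800c")); // prints "abc"
    }
}
```

Iterating by code point rather than by char keeps valid surrogate pairs intact while still removing orphaned surrogates.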
Original comment by [email protected]
on 21 Sep 2012 at 10:25
- Changed state: Started
It works. I checked:
- whether it solves the original problem with <br>, <hr>, etc.
- whether it removes the characters (escaped or not escaped) that are not parseable
by the XML parser (even when they are in tag names, attribute names or
attribute values)
- whether the policy allow/disallow rules work when there are such characters in the tag
or attribute names (but I am not sure I tried all the possible cases)
thanks
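The first check can be approximated without the sanitizer itself by comparing how the JDK SAXParser treats the two spellings of a void element. This is only a sketch: `VoidElementCheck` and `isWellFormed` are hypothetical names, and the markup strings are hand-written stand-ins for sanitizer output rather than output produced by any release.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the inputs below are hand-written, not sanitizer output.
class VoidElementCheck {
    /** True if the JDK SAX parser accepts the string as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // I/O or parser configuration failure, unrelated to the markup.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style void element without the self-closing marker.
        System.out.println(isWellFormed("<div>a<br>b</div>"));
        // The same content with the self-closing marker.
        System.out.println(isWellFormed("<div>a<br />b<hr /></div>"));
    }
}
```

The fragments are wrapped in a single `<div>` because an XML document needs exactly one root element.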
Original comment by [email protected]
on 22 Sep 2012 at 12:41
Release 117 includes the XML compatibility changes and is now available via the
Downloads tab and via maven. I'm marking this issue closed. Please reopen if
you run into related problems with the new release.
Change log:
http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/CHANGE_LOG.html
Original comment by [email protected]
on 22 Sep 2012 at 11:07
- Changed state: Fixed