java-html-sanitizer's Introduction

OWASP Java HTML Sanitizer

A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS.

The existing dependency is on JSR 305. The other jars are only needed by the test suite. The JSR 305 dependency is a compile-only dependency, only needed for annotations.

This code was written with security best practices in mind, has an extensive test suite, and has undergone adversarial security review.

Getting Started

Getting Started includes instructions on how to get started with or without Maven.

Prepackaged Policies

You can use prepackaged policies:

PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.LINKS);
String safeHTML = policy.sanitize(untrustedHTML);

Crafting a policy

The tests show how to configure your own policy:

PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("a")
    .allowUrlProtocols("https")
    .allowAttributes("href").onElements("a")
    .requireRelNofollowOnLinks()
    .toFactory();
String safeHTML = policy.sanitize(untrustedHTML);

Custom Policies

You can write custom policies to do things like changing h1s to divs with a certain class:

PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("p")
    .allowElements(
        (String elementName, List<String> attrs) -> {
          // Add a class attribute.
          attrs.add("class");
          attrs.add("header-" + elementName);
          // Return elementName to include, null to drop.
          return "div";
        }, "h1", "h2", "h3", "h4", "h5", "h6")
    .toFactory();
String safeHTML = policy.sanitize(untrustedHTML);

Please note that the elements "a", "font", "img", "input" and "span" need to be explicitly whitelisted using the allowWithoutAttributes() method if you want them to be allowed through the filter when these elements do not include any attributes.

Attribute policies allow running custom code too. Adding an attribute policy will not water down any default policy like style or URL attribute checks.

PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("div", "span")
    .allowAttributes("data-foo")
        .matching(
            (String elementName, String attributeName, String value) -> {
              // Return value for the attribute or null to drop.
              // Placeholder: keep the value unchanged.
              return value;
            })
        .onElements("div", "span")
    .toFactory();

Preprocessors

Preprocessors allow inserting text and large scale structural changes.

PolicyFactory policy = new HtmlPolicyBuilder()
    // Use a preprocessor to be backwards compatible with the
    // <plaintext> element which 
    .withPreprocessor(
        (HtmlStreamEventReceiver r) -> {
          // Provide user with info about links before they click.
          // Before:                       <a href="https://example.com/...">
          // After:  (https://example.com) <a href="https://example.com/...">
          return new HtmlStreamEventReceiverWrapper(r) {
            @Override public void openTag(String elementName, List<String> attrs) {
              if ("a".equals(elementName)) {
                for (int i = 0, n = attrs.size(); i < n; i += 2) {
                  if ("href".equals(attrs.get(i))) {
                    String url = attrs.get(i + 1);
                    String origin;
                    try {
                      URI uri = new URI(url);
                      String scheme = uri.getScheme();
                      String authority = uri.getRawAuthority();
                      if (scheme == null && authority == null) {
                        origin = null;
                      } else {
                        origin = (scheme != null ? scheme + ":" : "")
                               + (authority != null ? "//" + authority : "");
                      }
                    } catch (URISyntaxException ex) {
                      origin = "about:invalid";
                    }
                    if (origin != null) {
                      text(" (" + origin + ") ");
                    }
                  }
                }
              }
              super.openTag(elementName, attrs);
            }
          };
        })
    .allowElements("a")
    ...
    .toFactory();

Preprocessing happens before a policy is applied, so it cannot affect the security of the output.

Telemetry

When a policy rejects an element or attribute it notifies an HtmlChangeListener.

You can use this to keep track of policy violation trends and find out when someone is making an effort to breach your security.

PolicyFactory myPolicyFactory = ...;
// If you need to associate reports with some context, you can do so.
MyContextClass myContext = ...;

String sanitizedHtml = myPolicyFactory.sanitize(
    unsanitizedHtml,
    new HtmlChangeListener<MyContextClass>() {
      @Override
      public void discardedTag(MyContextClass context, String elementName) {
        // ...
      }
      @Override
      public void discardedAttributes(
          MyContextClass context, String elementName, String... attributeNames) {
        // ...
      }
    },
    myContext);

Note: If a string sanitizes with no change notifications, it is not the case that the input string is necessarily safe to use. Only use the output of the sanitizer.

The sanitizer ensures that the output is in a sub-set of HTML that commonly used HTML parsers will agree on the meaning of, but the absence of notifications does not mean that the input is in such a sub-set, only that it does not contain elements or attributes that were removed.

See "Why sanitize when you can validate" for more on this topic.

Questions?

If you wish to report a vulnerability, please see AttackReviewGroundRules.

Subscribe to the mailing list to be notified of known Vulnerabilities and important updates.

Contributing

If you would like to contribute, please ping @mvsamuel or @manicode.

We welcome issue reports and PRs. PRs that change behavior or that add functionality should include both positive and negative tests.

Please be aware that contributions fall under the Apache 2.0 License.

Credits

Thanks to everyone who has helped with criticism and code.

java-html-sanitizer's People

Contributors

0xflotus, aakritisi, benapple, chuckdumont, claudioweiler, csware, cure53, dependabot[bot], edbaker83, jamesdaily, jed204, jmanico, jshields-squarespace, lillesand, mikesamuel, mymhealthltd-joshengland, nuke100pr, pukomuko, rnnds, ronabop, sbearcsiro, subbudvk

java-html-sanitizer's Issues

Latest version JRE 1.7 requirement

The version released yesterday, 20151202.1, has a new class AutoCloseableHtmlStreamRenderer, defined in org/owasp/html/HtmlStreamRenderer.java, that implements the java.lang.AutoCloseable interface. This change makes the library depend at runtime on JRE 1.7 or greater, and it throws a ClassDefNotFound exception if used with lower versions.

Is this a mistake, or will the library require JRE 1.7 from this version on?
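
One way to confirm whether a given runtime is affected is to probe for the interface directly. The sketch below is an illustration only: RuntimeCheck and isClassAvailable are hypothetical names, not part of the sanitizer, and the code deliberately avoids Java 7 syntax so that it can be compiled for an older JRE.

```java
// Hypothetical helper, not part of java-html-sanitizer.
final class RuntimeCheck {
  /** Returns true if the named class can be loaded by this runtime. */
  static boolean isClassAvailable(String className) {
    try {
      // 'false' means: locate the class but do not run its static initializers.
      Class.forName(className, false, RuntimeCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    } catch (LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("java.version = " + System.getProperty("java.version"));
    System.out.println("java.lang.AutoCloseable available: "
        + isClassAvailable("java.lang.AutoCloseable"));
  }
}
```

On a runtime older than 1.7 the probe should report that the interface is missing, which is consistent with the class-loading failure described above.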

Sanitizer html-encodes characters in attribute url

Sanitizing:

<a href="ftp://site.com:user@host/file.txt">click here</a>

the '@' is replaced by '&#64;'. However, the href and src attribute values are 
URLs, not HTML text, so I believe the '@' should be left unencoded, or if 
anything be URL-encoded.
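
For context, '&#64;' is an HTML decimal character reference, and an HTML parser decodes character references inside attribute values before the URL is used. The sketch below imitates only that one decoding step, to show that the URL recovered from the sanitized attribute is the same as the input. AtSignEncoding and decodeDecimalRefs are hypothetical names; this is not how the sanitizer or any browser is implemented, and it ignores named and hexadecimal references.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of java-html-sanitizer.
final class AtSignEncoding {
  private static final Pattern DECIMAL_REF = Pattern.compile("&#([0-9]+);");

  /** Decodes decimal character references such as "&#64;" into characters. */
  static String decodeDecimalRefs(String s) {
    Matcher m = DECIMAL_REF.matcher(s);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      // Sketch only: no handling for overflow or invalid code points.
      int codePoint = Integer.parseInt(m.group(1));
      String decoded = new String(Character.toChars(codePoint));
      m.appendReplacement(out, Matcher.quoteReplacement(decoded));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(decodeDecimalRefs("ftp://site.com:user&#64;host/file.txt"));
  }
}
```

Whether a non-HTML consumer of the sanitized string, such as a mail client resolving cid: URLs, performs the same decoding is a separate question that this sketch does not answer.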

Another context where I run into this is sanitizing email html content. It 
sometimes points to attached images using a cid: (rfc2392) URL, eg:

<img src="[email protected]">


What version of the product are you using? On what operating system?
r173, linux, java 6.

Thanks!

Fred

Original issue reported on code.google.com by [email protected] on 8 Jun 2013 at 8:26

Is there support for HTML5 data attributes?

I would like to allow users to provide HTML code such as:

<a data-target="..." data-action="..." data-profile="...">...</a>

Basically, any number of attributes on a tag (a in this example) that begin with data-. Is it possible do this with HtmlPolicyBuilder (or otherwise)?

&nbsp; should not be changed to a space

test case:

        @Test
        public void testSpace() {
            String text = "L&nbsp;&nbsp;&nbsp;L";
            assertEquals(text, Sanitizers.FORMATTING.sanitize(text));
        }

why:
1)
A &nbsp; is something different from a space. When I get a &nbsp; from my rich text editor I want to preserve it. In the above example, when I add the sanitized text to an HTML page it looks like L L instead of L&nbsp;&nbsp;&nbsp;L.
2)
I want to know if the user added something wrong and present an error:

        if (!StringUtils.equals(text, clean)) {
            addFieldError("wrong input! please check cleaned text");
        }

I don't want this to happen after a space replacement.

todo:
Remove the &nbsp; to space part or make it optional.

work around:
Replace the &nbsp; by spaces, do the sanitizing and checking, and re-add the &nbsp;.
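
The checking half of that workaround can be sketched without touching the sanitizer: normalize non-breaking spaces on both sides before comparing, so that a &nbsp; replacement alone does not trigger the field error. NbspCompare and its methods are hypothetical names, and the sketch assumes the only forms to normalize are the &nbsp; entity and the U+00A0 character (numeric references such as &#160; are not handled).

```java
// Hypothetical helper, not part of java-html-sanitizer.
final class NbspCompare {
  /** Replaces the &nbsp; entity and the U+00A0 character with a plain space. */
  static String normalizeNbsp(String s) {
    return s.replace("&nbsp;", " ").replace('\u00A0', ' ');
  }

  /** True if the two strings differ at most in how non-breaking spaces are written. */
  static boolean sameIgnoringNbsp(String original, String sanitized) {
    return normalizeNbsp(original).equals(normalizeNbsp(sanitized));
  }

  public static void main(String[] args) {
    // The second argument stands in for sanitizer output with spaces substituted.
    System.out.println(sameIgnoringNbsp("L&nbsp;&nbsp;&nbsp;L", "L   L"));
  }
}
```

Note that the README's Telemetry section advises against treating an unchanged string as proof that the input is safe, so a comparison like this is better suited to user feedback than to a security decision.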

</li> becomes </li&gt;

What steps will reproduce the problem?

  1. Use the sanitizer to sanitize:
<p><span class="application-font-size-14"><span style="color: rgb(40, 40, 40);" 
class="application-font-name-arial">Lorem ipsum dolor sit amet, adipiscing 
elit. In scelerisque condimentum. </span>Phasellus molestie hendrerit 
augue.</span></p><p><span class="application-bold application-font-size-14">In 
eget arcu at fermentum tortor:</span></p><ul 
class="application-ul-disc"><li><span style="color: rgb(40, 40, 40);" 
class="application-font-name-arial application-font-size-14">Sapien sed 
fermentum </span></li><li><span style="color: rgb(40, 40, 40);" 
class="application-font-name-arial application-font-size-14">Tellus consectetur 
sit amet</span></li><li><span style="color: rgb(40, 40, 40);" 
class="application-font-name-arial application-font-size-14">Sed interdum 
ligula nec </span></li></ul><ul class="application-ul-disc"><br></ul><p><span 
style="color: rgb(40, 40, 40);" class="application-font-name-arial 
application-font-size-14">Vestibulum ultricies, arcu neque euismod ipsum, id 
tempor sem ante quis sem.</span></p><p><span>Donec mi ipsum, pretium sit amet 
interdum quis, egestas et justo.</span></p> 

This becomes:

<p><span class="application-font-size-14"><span style="color:rgb( 40 , 40 , 40 
)" class="application-font-name-arial">Lorem ipsum dolor sit amet, adipiscing 
elit. In scelerisque condimentum. </span>Phasellus molestie hendrerit 
augue.</span></p><p><span class="application-bold application-font-size-14">In 
eget arcu at fermentum tortor:</span></p><ul 
class="application-ul-disc"><li><span style="color:rgb( 40 , 40 , 40 )" 
class="application-font-name-arial application-font-size-14">Sapien sed 
fermentum </span></li&gt;<li><span style="color:rgb( 40 , 40 , 40 )" 
class="application-font-name-arial application-font-size-14">Tellus consectetur 
sit amet</span></li><li><span style="color:rgb( 40 , 40 , 40 )" 
class="application-font-name-arial application-font-size-14">Sed interdum 
ligula nec </span></li></ul><ul class="application-ul-disc"><li><br 
/></li></ul><p><span style="color:rgb( 40 , 40 , 40 )" 
class="application-font-name-arial application-font-size-14">Vestibulum 
ultricies, arcu neque euismod ipsum, id tempor sem ante quis 
sem.</span></p><p><span>Donec mi ipsum, pretium sit amet interdum quis, egestas 
et justo.</span></p>

Which has a closing li that become </li&gt; in the sanitized version.

What is the expected output? What do you see instead?

The expected output would be </li>

What version of the product are you using? On what operating system?

'com.googlecode.owasp-java-html-sanitizer:owasp-java-html-sanitizer:r239'
I'm on Ubuntu 14.04, but this is also being run on Windows Server 2008

Please provide any additional information below.

The way I'm using the sanitizer is to run it, apply some manipulations as
necessary to the content before and after running it through a policy, and
compare the two to see if anything has been removed. The reason I'm doing this
is that my requirement is to block XSS and not to store the sanitized content.
If there were a way to check that something was removed, or to not clean up the
HTML, that would be helpful, in addition to not messing with the </li> tag.

Here is the policy that I'm using currently:

package com.affinnova.platform.util;

// Copyright (c) 2011, Mike Samuel
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
// are met:
//
// Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
// Neither the name of the OWASP nor the names of its contributors may
// be used to endorse or promote products derived from this software
// without specific prior written permission.
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
// FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
// COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
// INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
// BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
// LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
// CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
// ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.

import com.google.common.base.Predicate;
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

import java.util.regex.Pattern;

/**
 * Based on the
 * <a href="http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#Stage_2_-_Choosing_a_base_policy_file">AntiSamy EBay example</a>.
 * <blockquote>
 * eBay (http://www.ebay.com/) is the most popular online auction site in the
 * universe, as far as I can tell. It is a public site so anyone is allowed to
 * post listings with rich HTML content. It's not surprising that given the
 * attractiveness of eBay as a target that it has been subject to a few complex
 * XSS attacks. Listings are allowed to contain much more rich content than,
 * say, Slashdot- so it's attack surface is considerably larger. The following
 * tags appear to be accepted by eBay (they don't publish rules):
 * {@code <a>},...
 * </blockquote>
 */
public class JavaHtmlSanitizerPolicy {

  // Some common regular expression definitions.

  // The 16 colors defined by the HTML Spec (also used by the CSS Spec)
  private static final Pattern COLOR_NAME = Pattern.compile(
      "(?:aqua|black|blue|fuchsia|gray|grey|green|lime|maroon|navy|olive|purple"
      + "|red|silver|teal|white|yellow)");

  // HTML/CSS Spec allows 3 or 6 digit hex to specify color
  private static final Pattern COLOR_CODE = Pattern.compile(
      "(?:#(?:[0-9a-fA-F]{3}(?:[0-9a-fA-F]{3})?))");

  private static final Pattern NUMBER_OR_PERCENT = Pattern.compile(
      "[0-9]+%?");
  private static final Pattern PARAGRAPH = Pattern.compile(
      "(?:[\\p{L}\\p{N},'\\.\\s\\-_\\(\\)]|&[0-9]{2};)*");
  private static final Pattern HTML_ID = Pattern.compile(
      "[a-zA-Z0-9\\:\\-_\\.]+");
  // force non-empty with a '+' at the end instead of '*'
  private static final Pattern HTML_TITLE = Pattern.compile(
      "[\\p{L}\\p{N}\\s\\-_',:\\[\\]!\\./\\\\\\(\\)&]*");
  private static final Pattern HTML_CLASS = Pattern.compile(
      "[a-zA-Z0-9\\s,\\-_]+");

  private static final Pattern ONSITE_URL = Pattern.compile(
      "(?:[\\p{L}\\p{N}\\\\\\.\\#@\\$%\\+&;\\-_~,\\?=/!]+|\\#(\\w)+)");
  private static final Pattern OFFSITE_URL = Pattern.compile(
      "\\s*(?:(?:ht|f)tps?://|mailto:)[\\p{L}\\p{N}]"
      + "[\\p{L}\\p{N}\\p{Zs}\\.\\#@\\$%\\+&;:\\-_~,\\?=/!\\(\\)]*+\\s*");

  private static final Pattern NUMBER = Pattern.compile(
      "[+-]?(?:(?:[0-9]+(?:\\.[0-9]*)?)|\\.[0-9]+)");

  private static final Pattern NAME = Pattern.compile("[a-zA-Z0-9\\-_\\$]+");

  private static final Pattern ALIGN = Pattern.compile(
      "(?i)center|left|right|justify|char");

  private static final Pattern VALIGN = Pattern.compile(
      "(?i)baseline|bottom|middle|top");

  private static final Predicate<String> COLOR_NAME_OR_COLOR_CODE
      = new Predicate<String>() {
        public boolean apply(String s) {
          return COLOR_NAME.matcher(s).matches()
              || COLOR_CODE.matcher(s).matches();
        }
      };

  private static final Predicate<String> ONSITE_OR_OFFSITE_URL
      = new Predicate<String>() {
        public boolean apply(String s) {
          return ONSITE_URL.matcher(s).matches()
              || OFFSITE_URL.matcher(s).matches();
        }
      };

  private static final Pattern HISTORY_BACK = Pattern.compile(
      "(?:javascript:)?\\Qhistory.go(-1)\\E");

  private static final Pattern ONE_CHAR = Pattern.compile(
      ".?", Pattern.DOTALL);



  public static final PolicyFactory POLICY_DEFINITION = new HtmlPolicyBuilder()
          .allowAttributes("id").matching(HTML_ID).globally()
          .allowAttributes("class").matching(HTML_CLASS).globally()
          .allowAttributes("lang").matching(Pattern.compile("[a-zA-Z]{2,20}"))
              .globally()
          .allowAttributes("title").matching(HTML_TITLE).globally()
          .allowStyling()
          .allowAttributes("align").matching(ALIGN).onElements("p")
          .allowAttributes("for").matching(HTML_ID).onElements("label")
          .allowAttributes("color").matching(COLOR_NAME_OR_COLOR_CODE)
              .onElements("font")
          .allowAttributes("face")
              .matching(Pattern.compile("[\\w;, \\-]+"))
              .onElements("font")
          .allowAttributes("size").matching(NUMBER).onElements("font")
          .allowAttributes("href").matching(ONSITE_OR_OFFSITE_URL)
              .onElements("a")
          .allowStandardUrlProtocols()
          .allowAttributes("nohref").onElements("a")
          .allowAttributes("name").matching(NAME).onElements("a")
          .allowAttributes(
                  "onfocus", "onblur", "onclick", "onmousedown", "onmouseup")
              .matching(HISTORY_BACK).onElements("a")
          .requireRelNofollowOnLinks()
          .allowAttributes("src").matching(ONSITE_OR_OFFSITE_URL)
              .onElements("img")
          .allowAttributes("name").matching(NAME)
              .onElements("img")
          .allowAttributes("alt").matching(PARAGRAPH)
              .onElements("img")
          .allowAttributes("border", "hspace", "vspace").matching(NUMBER)
              .onElements("img")
          .allowAttributes("border", "cellpadding", "cellspacing")
              .matching(NUMBER).onElements("table")
          .allowAttributes("bgcolor").matching(COLOR_NAME_OR_COLOR_CODE)
              .onElements("table")
          .allowAttributes("background").matching(ONSITE_URL)
              .onElements("table")
          .allowAttributes("align").matching(ALIGN)
              .onElements("table")
          .allowAttributes("noresize").matching(Pattern.compile("(?i)noresize"))
              .onElements("table")
          .allowAttributes("background").matching(ONSITE_URL)
              .onElements("td", "th", "tr")
          .allowAttributes("bgcolor").matching(COLOR_NAME_OR_COLOR_CODE)
              .onElements("td", "th")
          .allowAttributes("abbr").matching(PARAGRAPH)
              .onElements("td", "th")
          .allowAttributes("axis", "headers").matching(NAME)
              .onElements("td", "th")
          .allowAttributes("scope")
              .matching(Pattern.compile("(?i)(?:row|col)(?:group)?"))
              .onElements("td", "th")
          .allowAttributes("nowrap")
              .onElements("td", "th")
          .allowAttributes("height", "width").matching(NUMBER_OR_PERCENT)
              .onElements("table", "td", "th", "tr", "img")
          .allowAttributes("align").matching(ALIGN)
              .onElements("thead", "tbody", "tfoot", "img",
                      "td", "th", "tr", "colgroup", "col")
          .allowAttributes("valign").matching(VALIGN)
              .onElements("thead", "tbody", "tfoot",
                      "td", "th", "tr", "colgroup", "col")
          .allowAttributes("charoff").matching(NUMBER_OR_PERCENT)
              .onElements("td", "th", "tr", "colgroup", "col",
                      "thead", "tbody", "tfoot")
          .allowAttributes("char").matching(ONE_CHAR)
              .onElements("td", "th", "tr", "colgroup", "col",
                      "thead", "tbody", "tfoot")
          .allowAttributes("colspan", "rowspan").matching(NUMBER)
              .onElements("td", "th")
          .allowAttributes("span", "width").matching(NUMBER_OR_PERCENT)
          .onElements("colgroup", "col")
          .allowElements(
                  "a", "label", "noscript", "h1", "h2", "h3", "h4", "h5", "h6", "p", "i", "b", "u", "strong", "em", "small", "big", "pre", "code",
                  "cite", "samp", "sub", "sup", "strike", "center", "blockquote", "hr", "br", "col", "font", "map", "span", "div", "img",
                  "ul", "ol", "li", "dd", "dt", "dl", "tbody", "thead", "tfoot", "table", "td", "th", "tr", "colgroup", "fieldset", "legend", "abbr",
                  "acronym", "address", "article", "aside", "basefont", "bdi", "bdo", "big", "caption", "colgroup", "del", "dfn", "dir", "font", "figcaption",
                  "figure", "footer", "header", "hgroup", "ins", "mark", "menu", "nav", "q", "s", "section", "style",
                  "tt", "var", "wbr")
          .allowWithoutAttributes(
                  "a", "label", "noscript", "h1", "h2", "h3", "h4", "h5", "h6", "p", "i", "b", "u", "strong", "em", "small", "big", "pre", "code",
                  "cite", "samp", "sub", "sup", "strike", "center", "blockquote", "hr", "br", "col", "font", "map", "span", "div", "img",
                  "ul", "ol", "li", "dd", "dt", "dl", "tbody", "thead", "tfoot", "table", "td", "th", "tr", "colgroup", "fieldset", "legend", "abbr",
                  "acronym", "address", "article", "aside", "basefont", "bdi", "bdo", "big", "caption", "colgroup", "del", "dfn", "dir", "font", "figcaption",
                  "figure", "footer", "header", "hgroup", "ins", "mark", "menu", "nav", "q", "s", "section", "style",
                  "tt", "var", "wbr")
          .toFactory();
}

Original issue reported on code.google.com by [email protected] on 6 Feb 2015 at 9:41

Misnested list-item and list elements break lists

Per 
https://groups.google.com/d/topic/owasp-java-html-sanitizer-support/LJFuNLa4T_8/
discussion

<ul>
  <li>asdf</li>
  <ul>
    <li>adfasdf</li>
  </ul>
</ul>

is getting sanitized into:

<ul>
  <li>asdf</li>
</ul>
<ul>
  <li>adfasdf</li>
</ul>

instead of what Jon Steven's expects:

<ul>
  <li>asdf</li>
  <li>
     <ul>
        <li>adfasdf</li>
     </ul>
  </li>
</ul>

Jim points out that the input is misnested and

Line 5, Column 6: document type does not allow element "UL" here; assuming 
missing "LI" start-tag

The tag balancer does not insert the missing LI start-tag.


Original issue reported on code.google.com by [email protected] on 23 Oct 2012 at 3:34

Quoted css font names with hyphens are filtered

What steps will reproduce the problem?
String css = "font-family:'Arial','sans-serif'";
StylingPolicy stylingPolicy = new StylingPolicy(CssSchema.DEFAULT);
stylingPolicy.sanitizeCssProperties(css);


What is the expected output? What do you see instead?
Expected: font-family:'arial' , 'sans-serif'
Actual: font-family:'arial' ,


What version of the product are you using? On what operating system?
svn trunk (r227) on Centos 6.5


Please provide any additional information below.
sanitizeCssProperties() works properly when 'sans-serif' is unquoted. It looks 
like quotedString in StylingPolicy.java doesn't allow for '-' in the font name. 
Attached is a patch with a potential fix.
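
The attached patch is not reproduced here. As a rough illustration of the kind of check involved, the sketch below accepts a quoted font family name made of ASCII letters, digits, spaces and hyphens. FontNameCheck, the pattern, and the method name are all hypothetical; this is not the code of quotedString in StylingPolicy.java, and the character set is an assumption chosen for the example.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not the library's StylingPolicy code.
final class FontNameCheck {
  // Opening quote, one or more allowed characters (including '-'), same closing quote.
  private static final Pattern QUOTED_FONT_NAME =
      Pattern.compile("(['\"])[a-zA-Z0-9 \\-]+\\1");

  static boolean isAcceptableQuotedFontName(String token) {
    return QUOTED_FONT_NAME.matcher(token).matches();
  }

  public static void main(String[] args) {
    System.out.println(isAcceptableQuotedFontName("'sans-serif'"));
    System.out.println(isAcceptableQuotedFontName("'Arial'"));
  }
}
```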

Original issue reported on code.google.com by [email protected] on 31 Mar 2014 at 3:43

  • Merged into: #10

URL protocol sanitization should be case insensitive

What steps will reproduce the problem?
1. HTML source contains the following link: <a href="HTTP://some.site.org/">Link</a>
2. the input is filtered by applying the following simplified policy:
.allowAttributes("href").onElements("a").allowStandardUrlProtocols().allowElements("a").toFactory();

What is the expected output?
The link remains intact.

What do you see instead?
The link is removed, even though "http" is an allowed protocol as per policy.

What version of the product are you using?
r215

Additional information:

Internet Standard STD 66 [1] states:
   [...] An implementation should accept uppercase letters as equivalent
   to lowercase in scheme names (e.g., allow "HTTP" as well as "http" [...]

Changing the compare functionality in 
src/org/owasp/html/FilterUrlByProtocolAttributePolicy.java will result in the 
desired outcome:

@@ -77,7 +77,7 @@
           }
           break protocol_loop;
         case ':':
-          if (!protocols.contains(s.substring(0, i))) { return null; }
+          if (!protocols.contains(s.substring(0, i).toLowerCase())) { return null; }
           break protocol_loop;
       }
     }

[1]: http://tools.ietf.org/html/std66#section-3.1
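
The idea behind the patch can be shown in isolation. The following is a minimal sketch, not the code of FilterUrlByProtocolAttributePolicy: SchemeCheck, ALLOWED, and hasAllowedScheme are hypothetical names, and the allowed set is an assumption chosen for the example. It lowercases the scheme with Locale.ROOT before the lookup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not the library's implementation.
final class SchemeCheck {
  // Assumed allow-list for the example; stored in lower case.
  private static final Set<String> ALLOWED =
      new HashSet<String>(Arrays.asList("http", "https", "mailto"));

  /** Returns true for URLs without a scheme or with an allowed scheme, ignoring case. */
  static boolean hasAllowedScheme(String url) {
    for (int i = 0; i < url.length(); i++) {
      char c = url.charAt(i);
      if (c == '/' || c == '#' || c == '?') {
        return true;  // No scheme before the path, fragment or query: relative URL.
      }
      if (c == ':') {
        String scheme = url.substring(0, i).toLowerCase(Locale.ROOT);
        return ALLOWED.contains(scheme);
      }
    }
    return true;  // No ':' at all: relative URL.
  }

  public static void main(String[] args) {
    System.out.println(hasAllowedScheme("HTTP://some.site.org/"));
  }
}
```

Passing Locale.ROOT is a deliberate difference from the patch's bare toLowerCase(): under a Turkish default locale, an uppercase 'I' lowercases to a dotless 'ı', so a scheme such as "MAILTO" would fail to match "mailto".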

Original issue reported on code.google.com by [email protected] on 12 Feb 2014 at 11:06

Move project to Github

Please, move project to Github. Project Hosting on Google Code will close on 
January 25th, 2016.

Original issue reported on code.google.com by [email protected] on 12 Apr 2015 at 12:31

r223 is not available for download from Google Code

It looks like latest official release is r223 [1], but Google Code provides 
r226 (and r223 is completely missing). It is kinda confusing. Also would it be 
possible to tag official releases in svn?

Thanks


[1]: http://repo1.maven.org/maven2/com/googlecode/owasp-java-html-sanitizer/owasp-java-html-sanitizer/

Original issue reported on code.google.com by [email protected] on 4 Mar 2014 at 8:10

Project Version in POM Seems to Be Incorrect

The project version in the pom is 1.1-SNAPSHOT, but a release of 1.1 exists on Maven Central. If the release of 1.1 is valid on Maven Central, then the version in the pom should be 1.2-SNAPSHOT.

Runtime error loading org/owasp/html/Sanitizers

What steps will reproduce the problem?
1. Install from Maven

Source code:
PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.BLOCKS); // error happens here
String safeHTML = policy.sanitize("<table>asdf</table>"); // never gets to this line

What is the expected output? What do you see instead?
Sanitized output. Getting the following error instead:

Aug 22, 2014 9:28:49 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [Jersey Web Application] in context with 
path [/asdf] threw exception [org.glassfish.jersey.server.ContainerException: 
java.lang.NoClassDefFoundError: org/owasp/html/Sanitizers] with root cause
java.lang.NoClassDefFoundError: org/owasp/html/Sanitizers
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:151)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:171)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:195)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:104)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:387)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:331)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:103)
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:297)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:254)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1028)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:372)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:381)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:344)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.filters.CorsFilter.handleNonCORS(CorsFilter.java:439)
    at org.apache.catalina.filters.CorsFilter.doFilter(CorsFilter.java:178)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:745)

What version of the product are you using? On what operating system?
OSX 10.9.4 using version r239. Tomcat v7.0.55. Java SE 1.7.0_67.


Original issue reported on code.google.com by [email protected] on 22 Aug 2014 at 1:38

<a> elements containing <div> elements are not working

When I sanitize links containing sub-elements, these elements are moved outside the <a> element.

<a href="https://www.xyz.com"><div>Button text</div></a>
will result in:
<a href="https://www.xyz.com" rel="nofollow"></a><div>Button text</div>

Can you please help me with this issue?

Example code:

        String html = "<a href=\"https://www.xyz.com\"><div>Button text</div></a>";
        PolicyFactory policy = new HtmlPolicyBuilder()
                .allowElements("a", "div")
                .allowUrlProtocols("https")
                .allowAttributes("href").onElements("a")
                .requireRelNofollowOnLinks()
                .toFactory();
        String safeHTML = policy.sanitize(html);

Single and double quotes are being transformed

I had hijacked another issue and was asked to create a new one :) After writing 
several tests, it's simpler than I thought.

What steps will reproduce the problem?
1. Pass an input string with a ' or " in it
2. Comes back escaped as &#39; or &#34;

What is the expected output? What do you see instead?
I expect my input to come back with the ' or " in it.

What version of the product are you using? On what operating system?
Using version r164 on Mac mountain lion

Please provide any additional information below.
The code is quite basic:

HtmlPolicyBuilder builder = new HtmlPolicyBuilder();
PolicyFactory factory = builder.toFactory();
String sanitized = factory.sanitize(input);
return sanitized;




Original issue reported on code.google.com by [email protected] on 24 Jun 2013 at 4:55

Sanitizers.STYLES not working as advertised


> I'm trying to sanitize the HTML generated by a WYSIWYG editor
> (http://hackerwins.github.io/summernote/), but sanitize() is stripping
> all the HTML tags. I'm doing this:
>
> PolicyFactory sanitizer =
>     Sanitizers.FORMATTING.and(Sanitizers.BLOCKS.and(Sanitizers.STYLES.and(Sanitizers.LINKS)))
> sanitizer.sanitize(unsafeHtml)
>
> Source string:
> "<span style="font-weight: bold; text-decoration: underline;
> background-color: yellow;">aaaaaaaaaaaaaaaaaaaaaaa</span>"
>
> Result:
> aaaaaaaaaaaaaaaaaaaaaaa
>
> Am I doing something wrong? From what I've read, the standard sanitizers
> should be enough in this case


This looks like a bug.  Sanitizers.STYLES doesn't work as advertised,
so the style="..." attribute is rejected out of hand, and <span> is
one of the elements that is, by default, stripped when it has no
attributes.

I'm looking into a fix and will respond to this thread when I know more.

I repeated the problem using:

    PolicyFactory sanitizer = Sanitizers.FORMATTING
        .and(Sanitizers.BLOCKS)
        .and(Sanitizers.STYLES)
        .and(Sanitizers.LINKS);
    String input = "<span style=\"font-weight: bold;"
        + " text-decoration: underline; background-color: yellow;\""
        + ">aaaaaaaaaaaaaaaaaaaaaaa</span>";
    String got = sanitizer.sanitize(input);
    String want = input;
    assertEquals(want, got);

Original issue reported on code.google.com by [email protected] on 30 Apr 2014 at 7:04

HTML em tag not accepted as an inline formatting element

What steps will reproduce the problem?
1. Initialize a sanitizer as Sanitizers.BLOCKS.and(Sanitizers.FORMATTING).
2. Attempt to sanitize the string "<em>Emphasized</em>".  This strips off the 
<em> tags.

What is the expected output? What do you see instead?
I would expect to see <em>Emphasized</em>.  I see Emphasized instead.

What version of the product are you using? On what operating system?
r239.  Any OS

Please provide any additional information below.
Some HTML programmers consider em and strong to be legacy and obsolete.  
However, the HTML standard still supports them.  Additionally, The OWASP 
Sanitizer supports the strong tag but not the em tag.  If strong is supported, 
so should be em.

Original issue reported on code.google.com by [email protected] on 9 Jul 2014 at 5:06

Configured regex has unbounded O(n^2) execution time

Exploit: "Cause an exception, crash, or inf. loop in the sanitizer that causes 
it to fail to provide service or consume inordinate resources for an input of 
that size."

What steps will reproduce the problem?
1. Enter <a href="http://x            '">t</a> with a large number of spaces 
between the x and the '
2. 15kb of spaces -> 2s execution time. 60kb of spaces -> 171s execution time

What is the expected output? What do you see instead?
Expected: Worst case execution time is at most O(n) for an input of size n (or 
execution time limited to mitigate an attack on server resources)
Observed: Worst case execution time is O(n^2) for an input of size n

What version of the product are you using? On what operating system?
As currently on http://canyouxssthis.com/HTMLSanitizer/reflect (no version 
specified)

Please provide any additional information below.

Not sure if this is "in bounds" since it is a problem with the configuration of 
the sanitizer rather than the sanitizer per se. However, for the sanitizer to 
be effective for others who want to use it, they should receive thoroughly 
tested configurations along with it.

The cause of this defect is the regex defined in 
Pattern OFFSITE_URL = 
Pattern.compile("(\\s)*((ht|f)tp(s?)://|mailto:)[\\p{L}\\p{N}][\\p{L}\\p{N}\\p{Zs}\\.\#@\$%\\+&;:\\-_~,\\?=/!\\(\\)]*(\\s)*");
Two sequential character classes are starred and not disjoint. Specifically: 
[\\p{Zs}]*(\\s)*
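
The overlap described above can be illustrated with a small sketch. This is not the library's actual fix: the patterns are simplified stand-ins for OFFSITE_URL, and the class, field, and method names are hypothetical. AMBIGUOUS keeps \p{Zs} in the starred body class followed by a starred \s. POSSESSIVE makes the body quantifier possessive (*+), so the matcher cannot hand spaces back to the trailing \s* and does not retry every split point when the match fails.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; simplified stand-ins for the OFFSITE_URL pattern.
final class OffsiteUrlPatternSketch {
  // The body class contains \p{Zs}, which overlaps with \s on the plain
  // space, so on a failed match every split between the two stars is retried.
  static final Pattern AMBIGUOUS = Pattern.compile(
      "\\s*(?:(?:ht|f)tps?://|mailto:)[\\p{L}\\p{N}]"
      + "[\\p{L}\\p{N}\\p{Zs}.:/?=&\\-]*\\s*");

  // Same pattern with a possessive body quantifier (*+): once the body has
  // consumed a run of characters it never gives any of them back.
  static final Pattern POSSESSIVE = Pattern.compile(
      "\\s*(?:(?:ht|f)tps?://|mailto:)[\\p{L}\\p{N}]"
      + "[\\p{L}\\p{N}\\p{Zs}.:/?=&\\-]*+\\s*");

  // Builds the input from the report: http://x, n spaces, then a quote.
  static String urlWithSpaces(int n) {
    StringBuilder sb = new StringBuilder("http://x");
    for (int i = 0; i < n; i++) {
      sb.append(' ');
    }
    return sb.append('\'').toString();
  }
}
```

Both patterns accept and reject the same strings, because anything the body could give back to \s* is whitespace that the body class already covers. Timing AMBIGUOUS.matcher(urlWithSpaces(n)).matches() against the POSSESSIVE equivalent for growing n is one way to observe the difference.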

Original issue reported on code.google.com by [email protected] on 27 Feb 2014 at 7:39

Deeply nested elements crash FF 8, Chrome 11

vytah said 

"""
OK, I didn't circumvent the protection, but I managed to crash Firefox 8 and 
make it unusable until I restarted it in safe mode.
My input was about 20000×<div> (opening only, no closing)
"""

Original issue reported on code.google.com by [email protected] on 10 Oct 2011 at 9:20

Can't seem to whitelist <span> without attributes

The following code:

StringBuilder retVal = new StringBuilder();

PolicyFactory policyFactory = new HtmlPolicyBuilder().allowElements("b", "i", "br", "p").allowWithoutAttributes("span").toFactory();

HtmlStreamRenderer renderer = HtmlStreamRenderer.create(retVal, 
                                                        new Handler<String>() {
                                                            public void handle(String x) {
                                                                throw new AssertionError(x);
                                                            }
                                                        });

HtmlSanitizer.sanitize("<span>foo</span>", policyFactory.apply(renderer));

Returns "foo", not "<span>foo</span>"

Text alternative in video-element

Johannes Lichtenberger writes

I have the following policy:

    /**
     * Allow media elements/attributes.
     */
    public static final PolicyFactory MEDIA = new HtmlPolicyBuilder().allowElements("video", "audio", "source")
            .allowAttributes("controls", "width", "height").onElements("video").allowAttributes("controls")
            .onElements("audio").allowAttributes("src", "type").onElements("source").allowTextIn("video", "audio")
            .toFactory();

and the HTML content I want to sanitize (all whitelisted content) is:

<p><video controls="controls" width="300" height="150">
<source src="media/video/small.webm" type="video/webm" />
<source src="media/video/small.mp4" type="video/mp4" />
<source src="media/video/small.ogv" type="video/ogg" />
<source src="media/video/small.3gp" type="video/3gp" />
Your browser does not support the video tag.</video></p>

But it seems character content within the video-element is never permitted (contents-member field is 0, probably it should be != 0?). Should be valid to have an alternative text I guess.

CENTER in H1 terminates header

[email protected] says

We have a similar behavior in this case:

assertEquals("<h1>TEXT</h1>", 
Sanitizers.BLOCKS.sanitize("<H1><center>TEXT</H1>"));

For this one the result is:

<h1></h1>TEXT

instead of:

<h1>TEXT</h1>

But test case:

assertEquals("<h1>TEXT</h1>", 
Sanitizers.BLOCKS.sanitize("<H1></center>TEXT</H1>"));

works as expected:

<h1>TEXT</h1>

What's wrong with the first one?

Original issue reported on code.google.com by [email protected] on 1 Oct 2014 at 12:46

Recognize URLs in <img srcset>

http://www.w3.org/html/wg/drafts/srcset/w3c-srcset/ describes an extension 
attribute to HTML <img> elements that allows multiple annotated URLs.

Make sure the URL protocol policy applies to all of them.
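
A minimal sketch of what "applies to all of them" could mean, assuming a simplified srcset syntax: comma-separated candidates, each a URL optionally followed by a descriptor such as 2x or 480w. The class and method names are hypothetical and not part of the sanitizer, and URLs that themselves contain commas are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not part of the sanitizer's API.
final class SrcsetProtocolSketch {
  // Returns true only if every candidate URL is relative or uses one of the
  // allowed schemes. Splitting on commas is a simplification.
  static boolean allCandidatesAllowed(String srcset, Set<String> allowedSchemes) {
    for (String candidate : srcset.split(",")) {
      String trimmed = candidate.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      // The URL is the first whitespace-delimited token; a descriptor may follow.
      String url = trimmed.split("\\s+")[0];
      String scheme = schemeOf(url);
      if (scheme != null && !allowedSchemes.contains(scheme)) {
        return false;
      }
    }
    return true;
  }

  // Returns the lower-cased scheme, or null when the URL looks relative.
  static String schemeOf(String url) {
    for (int i = 0; i < url.length(); i++) {
      char c = url.charAt(i);
      if (c == ':') {
        return url.substring(0, i).toLowerCase(Locale.ROOT);
      }
      if (c == '/' || c == '?' || c == '#') {
        return null;
      }
    }
    return null;
  }

  // Convenience helper for building the allowed-scheme set.
  static Set<String> schemes(String... names) {
    return new HashSet<>(Arrays.asList(names));
  }
}
```

The point of the sketch is that a single disallowed candidate rejects the whole attribute value, rather than only the first URL being checked.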

Original issue reported on code.google.com by [email protected] on 21 Jan 2014 at 3:59

Simplify policies that require constraints on a URL based on its protocol

Once a protocol is allowed, policy authors often want to place additional 
constraints on it: e.g. a data protocol with an image/... mime-type for use with <img 
src>, or a tel: protocol that contains a valid telephone number.

Right now, policy authors are tempted to do

allowUrlProtocols("data", "https", "http", "mailto")

allowAttributes("src").matching(Pattern.compile("^(data:image/(gif|png|jpeg)[,;]|http|https|mailto|//)", Pattern.CASE_INSENSITIVE))

which requires duplicative effort.

We should provide good alternatives to writing regular expressions to match 
URLs as it is error prone.

Perhaps a URL policy that recognizes structure in URLs.
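
One possible shape for such a structure-aware policy, sketched with the standard library only. Nothing here is the sanitizer's API: the class name, the RULES map, and isAllowed are all hypothetical. Each allowed scheme is listed once, together with a predicate over the part of the URL after "scheme:", so the protocol list and the extra constraint are not duplicated.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch of a structure-aware URL check; not the library's API.
final class SchemeConstraintSketch {
  // Extra constraint for data: URLs: only a few image MIME types.
  private static final Pattern IMAGE_DATA = Pattern.compile(
      "image/(?:gif|png|jpeg)[,;].*",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  // One entry per allowed scheme; the predicate sees the text after "scheme:".
  private static final Map<String, Predicate<String>> RULES = new HashMap<>();
  static {
    RULES.put("http", rest -> true);
    RULES.put("https", rest -> true);
    RULES.put("mailto", rest -> true);
    RULES.put("data", rest -> IMAGE_DATA.matcher(rest).matches());
  }

  static boolean isAllowed(String url) {
    int colon = url.indexOf(':');
    if (colon <= 0) {
      return false; // No scheme; relative URLs are out of scope for this sketch.
    }
    String scheme = url.substring(0, colon).toLowerCase(Locale.ROOT);
    Predicate<String> rule = RULES.get(scheme);
    return rule != null && rule.test(url.substring(colon + 1));
  }
}
```

A tel: rule that validates the digits could be added as one more map entry without touching the scheme list elsewhere.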

Original issue reported on code.google.com by [email protected] on 21 Jan 2014 at 4:09

documentation: maven.md has old version number

I tried installing java-html-sanitizer using the example from the maven.md file.
Maven reported download problems for some old versions.
Changing the version to
<version>[r239,)</version>
solved the problem.

Only a minor issue, but fixing it should help others.

empty-element tag transformed to start tag only

What steps will reproduce the problem?
1. new HtmlPolicyBuilder
2. .allowElements("hr")
3. HtmlSanitizer.sanitize("<hr />", policy);

What is the expected output? What do you see instead?
expected - <hr />
instead - <hr>

What version of the product are you using? On what operating system?
r99

Please provide any additional information below.
For browsers the output <hr> is correct. However, it is not usable if we need 
some additional XML processing of the output.

Original issue reported on code.google.com by [email protected] on 18 Sep 2012 at 6:59

html injection/XSS

What steps will reproduce the problem?
1. <a href="http://demo.testfire.net">CLICK HERE</a>
2. click on CLICK HERE
3.

What is the expected output? What do you see instead?
It should filter out HTML tags. In this context, it accepts the <a> tag and the 
href attribute, which is used to specify a link address. So, given the above 
input, clicking CLICK HERE goes to the malicious link specified in the href 
attribute, leading to HTML injection/XSS attacks.

What version of the product are you using? On what operating system?
OS-Windows XP
Version-1.5.2

Please provide any additional information below.
vulnerable to html injection attacks

Original issue reported on code.google.com by [email protected] on 11 Jan 2014 at 5:21

Single and double quotes encoded in text nodes

What steps will reproduce the problem?
1. Pass an input string with ' or " - for example: <div>And he said, 
"Hello."</div>
2. ' or " characters come back encoded

What is the expected output? What do you see instead?
I would expect that quotes within text nodes don't get encoded.

What version of the product are you using? On what operating system?
r209, Linux

Please provide any additional information below.

I already saw issue 15: 
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=15

To answer the question that wasn't answered in that issue - "How is this 
causing problems though?" - it causes a problem in rich text editors.

We expect that the user can enter text in a rich text editor; this includes 
quotes.  When that data gets stored and returned again in another/the same 
page, they should see the ' or " they entered, not the encoded version of that 
string.
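
A rough sketch of why the encoded form is equivalent once the output is parsed as HTML: decoding the numeric character references yields the characters that were entered. The class below is a hypothetical, stdlib-only decoder for decimal references and is not part of the sanitizer. An HTML parser performs this decoding itself, so presumably the difference only becomes visible when the sanitized string is handled as plain text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; decodes decimal references such as &#34; and &#39;.
final class NumericEntitySketch {
  private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#([0-9]{1,7});");

  static String decodeDecimalEntities(String html) {
    Matcher m = DECIMAL_ENTITY.matcher(html);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      int codePoint = Integer.parseInt(m.group(1));
      // Leave references to invalid code points untouched.
      String replacement = Character.isValidCodePoint(codePoint)
          ? new String(Character.toChars(codePoint))
          : m.group();
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```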

Original issue reported on code.google.com by [email protected] on 9 Sep 2013 at 3:34

Stackoverflow sanitizing HTML with large inline background-image

What steps will reproduce the problem?
Sanitizing the string below with ExamplesTest.java causes a StackOverflowError (with 
r173-EbayPolicy-based code), likely due to a very deep regular expression tree. 
Ran into this sanitizing a large email collection. Verified with clean r176 by 
adding a test to ExamplesTest.java. If I sufficiently shorten the image data, I 
no longer get a stack overflow.

1) testDataImage(org.owasp.html.ExamplesTest)java.lang.StackOverflowError
    at java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
[...repeats nothing more useful at bottom of list...]

The test is (String all on one line):

  public final void testDataImage() {
    String input = "<a class=\"atc_s addthis_button_compact\" style=\"background-image:url();\"></a>";
    String sanitized = EbayPolicyExample.POLICY_DEFINITION.sanitize(input);
    System.out.println(sanitized);
  }


What is the expected output? What do you see instead?
Sanitized HTML with this style monstrosity removed or passed.

What version of the product are you using? On what operating system?
r173-based code, repeated with test case in clean r176.

$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) Server VM (build 20.45-b01, mixed mode)

Linux 3.2.0-49-generic-pae #75-Ubuntu SMP

Attached gz version of smallest HTML fragment that causes error (image data can 
be shortened further, but attached is with original image data).

Thanks!

Fred

PS:

FWIW -
in StylingPolicy.html, the '.' in the below regexp should probably be escaped. 
However, doing so does not help with above problem (also doesn't cause new 
failures):

  private static final Pattern NON_NEGATIVE_LENGTH = Pattern.compile(
      "(?:0|[1-9][0-9]*)([.][0-9]+)?(ex|[ecm]m|v[hw]|p[xct]|in|%)?");

I also thought that in CssGrammar.java, the URL_CHARS regexp should require at least 
one character (didn't check with actual CssGrammar, though) by making '*' into 
'+', but that also did not resolve the problem (also doesn't cause new 
failures):

    // url chars               ({url_special_chars}|{nonascii}|{escape})+
    String URL_CHARS = "(?:"
        + url_special_chars + "|" + nonascii + "|" + escape + ")+";



Original issue reported on code.google.com by [email protected] on 16 Jul 2013 at 2:51

missing guava.jar causes hang with no exception

What steps will reproduce the problem?
1. include owasp-java-html-sanitizer.jar in the classpath but not guava.jar
2. run the code:

        PolicyFactory policy = Sanitizers.FORMATTING;
        logger.debug("Policy is " + policy);

What is the expected output? What do you see instead?

Expect either a thrown exception or the debug line to be printed

However, the debug line is never printed, the code just seems to hang

What version of the product are you using? On what operating system?

r198, Ubuntu Linux LTS 12

Please provide any additional information below.

This isn't a big problem, the setup instructions do after all say that 
guava.jar is necessary but for troubleshooting purposes, shouldn't there be 
some descriptive way of reporting the missing dependency?

Original issue reported on code.google.com by [email protected] on 25 Jul 2013 at 10:53

<span> without attributes dropped when skipIfEmpty(false) used and policies unioned

MGupta provided the below on the group list

"""
I'm trying to use the default policy and have observed following two issues.

1. <span> is not allowed
2. <br> is returned as <br > (note a space before the end tag)

For #1, I tried using a custom policy with allowElements("span") and it still 
didn't work.

I then tried allowAttributes("id").globally().
This allowed me to use something like this ...  <span id="abc">some text</span>

But I want to use <span> with NO attributes.

I even tried .allowWithoutAttributes("span"), but it did not work.

-----


    public static final PolicyFactory POLICY_DEFINITION = new HtmlPolicyBuilder()
        .allowAttributes("id", "class").globally()
        .allowAttributes("href", "target").onElements("a")
        .allowWithoutAttributes("span", "div")
        .allowElements("a", "span", "div","input", "textarea")
        .toFactory();

    public static String sanitizeWithDefaultPolicy(String htmlString){
        return Sanitizers.FORMATTING
                .and(Sanitizers.BLOCKS)
                .and(Sanitizers.IMAGES)
                .and(Sanitizers.STYLES)
                .and(POLICY_DEFINITION)
                .sanitize(htmlString);
    }
"""

Original issue reported on code.google.com by [email protected] on 10 Feb 2014 at 11:36

Latest Maven Version?

Hi, great work, but I've noticed something strange in the mvn repository and hope you can help and possibly update your release version numbers.

If you look in mvn repository you can see a release this year:
http://mvnrepository.com/artifact/com.googlecode.owasp-java-html-sanitizer/owasp-java-html-sanitizer/20150501.1

But the page is recommending that a newer release is available from last year !?
http://mvnrepository.com/artifact/com.googlecode.owasp-java-html-sanitizer/owasp-java-html-sanitizer/r239

And the OWASP page for the project suggests last year's version is the latest:
https://www.owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Project

Please advise whether the 2015 one is a release (or just a beta/alpha unstable thing).

If it is a release, then please change the versioning in the pom to ensure that the latest version is recognised as the latest by Maven tooling. This might be due to ASCII string comparison of the version number, i.e.
'r' > '2'
"r239" > "20150501.1"

I'm not sure what impact this has, for example if auto-versioning picks the latest version or the Maven enforcer warns about newer versions: by Maven version conventions, such tooling might suggest the 2014 release rather than the 2015 one (this might need testing). Users attempting to use the latest version may therefore get trapped on an old one.
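
The reporter's guess about plain string ordering is easy to check in isolation. The sketch below only shows how java.lang.String orders the two version strings; the class and method names are hypothetical, and whether Maven's own version ordering behaves the same way is a separate question that this sketch does not answer.

```java
// Hypothetical sketch: checks plain string ordering of the two versions.
final class VersionStringOrderSketch {
  // True when a sorts after b under String.compareTo (UTF-16 code unit order).
  static boolean sortsAfter(String a, String b) {
    return a.compareTo(b) > 0;
  }
}
```

Under this ordering, sortsAfter("r239", "20150501.1") is true because 'r' has a higher code unit value than '2', which is the comparison quoted in the report.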

Moving style attributes to font tag changes rendering

What steps will reproduce the problem?
1. Consider this HTML:
<table style="color: rgb(0, 0, 0); font-family: Arial, Geneva, sans-serif;">
<tbody>
<tr>
<th>Column One</th><th>Column Two</th>
</tr>
<tr>
<td align="center" style="background-color: rgb(255, 255, 254);"><font 
size="2">Size 2</font></td>
<td align="center" style="background-color: rgb(255, 255, 254);"><font 
size="7">Size 7</font></td>
</tr>
</tbody>
</table>

If you display this in a browser, all the text inside the table renders in a 
sans-serif font.

2. Sanitize that HTML with allowStyling(). Some of the style attributes are 
moved to a font tag. This is the output:
<table>
<font face="Arial, Geneva, sans-serif" style="color:#000">
<tbody>
<tr>
<th>Column One</th>
<th>Column Two</th>
</tr>
<tr>
<td align="center"><font style="background-color:#fffffe"><font size="2">Size 
2</font></font></td>
<td align="center"><font style="background-color:#fffffe"><font size="7">Size 
7</font></font></td>
</tr>
</tbody>
</font>
</table>

If you view this in a browser, the table text is now rendered in serif instead 
of sans-serif.

What is the expected output? What do you see instead?
I think this is the expected output, given the design of the library. However, 
I question whether transforming style attributes by adding a font tag is really 
the "right" thing to do. Besides changing how the HTML is rendered, the font 
tag is deprecated in HTML 4.0 and is not supported in HTML 5. If the code is 
able to generate sanitized style attributes for use in the font tag, why not 
put those same style attributes in the original style attribute (in this case, 
on the table element)?

What version of the product are you using? On what operating system?
r135 on Windows 7, with Java 6.

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 1 Feb 2013 at 4:47

Quote-dependent loss of fonts from font-family style

Thank you for a very useful tool!

I'm trying to deal with the result of content pasted from Excel spreadsheets 
into an HTML text area. I'm interested in preserving as much style info as 
possible. When I sanitize this, the font-family part of the style attribute 
misses fonts, depending on the way the font-family values are quoted.

Specifically, with "font-family: Arial,serif", the 'serif' is missing iff it 
was quoted in the style attribute. It looks like this does not happen to the 
first font listed, and happens to the second font listed if it is "sans-serif" 
but not if it is "Verdana" (std-font vs other handling?).

It does not seem to matter which quotes are used, ie style="font-family: 
'a','b'" yields the same results as style='font-family: "a","b"'.


With an allowStyling() eBay policy (which allows the span element), sanitize the 
following. These are all font-family declarations with 2 fonts and an irrelevant 
extra property; what differs within each set of 4 is which font(s) is/are quoted. 
Sets 1, 3 and 4 use single quotes outside and double quotes inside; this is 
reversed in set 2, without effect. Sets 1 and 2 use Arial, sans-serif; set 3 
uses Arial, Verdana; set 4 uses serif, sans-serif.

<span style='font-family:Arial,sans-serif;mso-fareast-language:EN-GB'>..</span>
<span style='font-family:"Arial",sans-serif;mso-fareast-language:EN-GB'>..</span>
<span style='font-family:Arial,"sans-serif";mso-fareast-language:EN-GB'>..</span>
<span style='font-family:"Arial","sans-serif";mso-fareast-language:EN-GB'>..</span>

<span style="font-family:Arial,sans-serif;mso-fareast-language:EN-GB">..</span>
<span style="font-family:'Arial',sans-serif;mso-fareast-language:EN-GB">..</span>
<span style="font-family:Arial,'sans-serif';mso-fareast-language:EN-GB">..</span>
<span style="font-family:'Arial','sans-serif';mso-fareast-language:EN-GB">..</span>

<span style='font-family:Arial,Verdana;mso-fareast-language:EN-GB'>..</span>
<span style='font-family:"Arial",Verdana;mso-fareast-language:EN-GB'>..</span>
<span style='font-family:Arial,"Verdana";mso-fareast-language:EN-GB'>..</span>
<span style='font-family:"Arial","Verdana";mso-fareast-language:EN-GB'>..</span>

<span style='font-family:serif,sans-serif;mso-fareast-language:EN-GB'>..</span>
<span style='font-family:"serif",sans-serif;mso-fareast-language:EN-GB'>..</span>
<span style='font-family:serif,"sans-serif";mso-fareast-language:EN-GB'>..</span>
<span style='font-family:"serif","sans-serif";mso-fareast-language:EN-GB'>..</span>

The output is:
<span style="font-family:&#34;Arial&#34;,sans-serif">..</span>
<span style="font-family:&#34;Arial&#34;,sans-serif">..</span>
<span style="font-family:&#34;Arial&#34;">..</span>
<span style="font-family:&#34;Arial&#34;">..</span>

<span style="font-family:&#34;Arial&#34;,sans-serif">..</span>
<span style="font-family:&#34;Arial&#34;,sans-serif">..</span>
<span style="font-family:&#34;Arial&#34;">..</span>
<span style="font-family:&#34;Arial&#34;">..</span>

<span style="font-family:&#34;Arial&#34;,&#34;Verdana&#34;">..</span>
<span style="font-family:&#34;Arial&#34;,&#34;Verdana&#34;">..</span>
<span style="font-family:&#34;Arial&#34;,&#34;Verdana&#34;">..</span>
<span style="font-family:&#34;Arial&#34;,&#34;Verdana&#34;">..</span>

<span style="font-family:serif,sans-serif">..</span>
<span style="font-family:&#34;serif&#34;,sans-serif">..</span>
<span style="font-family:serif">..</span>
<span style="font-family:&#34;serif&#34;">..</span>

What is the expected output? What do you see instead?
I expected to see all the fonts listed.

What version of the product are you using? On what operating system?
r164, Java-1.6, JUnit.

Thank you!
Fred


Original issue reported on code.google.com by [email protected] on 11 May 2013 at 8:08

Ending Tag Removal Not Emitting Event for HtmlChangeListener

Hey there,

I wanted to use this sanitization library to help detect issues with HTML input. I noticed that when I have a closing tag in my input with no opening tag, the sanitizer will take care of it but will not emit an event in the HtmlChangeListener.
I think it might be related to this as well:
#40

Would greatly appreciate if this closing tag with no opening tag could be emitted as an event to be captured in my implementation of the HtmlChangeListener.

Thank you

Element policies can receive list of attributes with duplicates.

A policy that uses a permissive attribute policy because there is an element policy that

  1. Looks for a particular attribute name
  2. Extracts and vets the value
  3. Assumes that all other attributes have been vetted by sufficiently strict attribute policies

can be confused.

We should prevent attributes with duplicate names from making it to an element policy to prevent element policy authors from being confused. The DOM model for elements already assumes that there is at most one value for any given (namespace/local-name) pair, so we lose no generality by restricting the output to have at most one attribute with a given name.
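
A minimal sketch of such a de-duplication step, assuming the flat [name, value, name, value, ...] attribute list that element policies receive. The class and method names are hypothetical, and keeping the first occurrence of each name is a choice made for this sketch rather than something the issue specifies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the sanitizer's implementation.
final class DedupeAttributesSketch {
  // attrs is a flat list of alternating names and values. Returns a new list
  // that keeps only the first name/value pair for each case-folded name.
  static List<String> keepFirstOfEachName(List<String> attrs) {
    Set<String> seen = new HashSet<>();
    List<String> out = new ArrayList<>();
    for (int i = 0; i + 1 < attrs.size(); i += 2) {
      String name = attrs.get(i).toLowerCase(Locale.ROOT);
      if (seen.add(name)) {
        out.add(attrs.get(i));
        out.add(attrs.get(i + 1));
      }
    }
    return out;
  }
}
```

Running this before the element policy sees the list would mean a policy that looks up one attribute by name cannot be handed a second, unvetted value under the same name.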

HtmlPolicyBuilder.allowUrlProtocols() doesn't work

If I specify a custom set of allowed URL protocols different from the set 
"http", "https" and "mailto", some URLs are not handled correctly.

E.g. for the input "<img src=\"http://canaries.org/canary.png\">" the policy 
builder
new HtmlPolicyBuilder()
            .allowElements("img")
            .allowAttributes("src").onElements("img")
            .allowUrlProtocols("http")
returns an empty string, but should return the unmodified input value.

I have attached a patch containing an additional test case that shows the issue 
and a fix for it in the class FilterUrlByProtocolAttributePolicy.

Original issue reported on code.google.com by [email protected] on 26 Mar 2012 at 10:12

"&nbsp;" returns space after sanitize instead of returning same "&nbsp;"

I cannot say this is a bug; maybe the policy we configured is wrong.


If the string contains the HTML entity "&nbsp;", then after sanitizing it shows 
an empty space instead of "&nbsp;", whereas other entities such as "&lt;" and 
"&gt;" are shown correctly after sanitizing.

example,

final String test = "&nbsp;&gt;";

final PolicyFactory policy = Sanitizers.FORMATTING.and(
Sanitizers.BLOCKS).and(Sanitizers.STYLES);
final String safeHTML = policy.sanitize(test);

System.out.println("Before:" +test);
System.out.println("After:" +safeHTML);

Result:
-------
Before:&nbsp;&gt;
After: &gt;

We actually need &nbsp; to be preserved after sanitizing, so can you provide 
guidance on how to achieve this?
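
One possible workaround is a post-processing step, sketched here under a loud assumption: that the sanitizer's output carries the decoded no-break space character (U+00A0) rather than an ordinary space, which the printed "After:" line above cannot show. The class and method names are hypothetical and not part of the library.

```java
// Hypothetical sketch; assumes the sanitized output contains U+00A0.
final class NbspReencodeSketch {
  // Replaces each no-break space character with the &nbsp; entity and
  // leaves ordinary spaces untouched.
  static String reencodeNbsp(String sanitizedHtml) {
    return sanitizedHtml.replace("\u00A0", "&nbsp;");
  }
}
```

For a browser the two forms render the same, so this only matters when the entity text itself has to survive, for example in stored markup that is later compared or edited as text.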

Thx in advance!

Kr,
Urvish

Original issue reported on code.google.com by [email protected] on 23 May 2014 at 1:17

Span ending tag is removing after Sanitization

I am using r239 on Windows 8. When I give the text below

<span style="color:rgb(72, 72, 72); font-family:helveticaneue"> <span>my &nbsp;</span> list of style names or a </span>

the sanitized text does not properly close the span tag. This is the text I got after sanitization:

<span style="color:rgb(72, 72, 72); font-family:helveticaneue"> my  </span> list of style names or a 

font-family not well-formed after sanitizing

What steps will reproduce the problem?

HTML before sanitizing

<span style="font-size:9.0pt;font-family:"Trebuchet MS","sans-serif";color:#505050">

HTML after sanitizing

<span style="font-size:9pt;font-family:'trebuchet ms' ,;color:#505050">

I have already read in other issues why the sans-serif font is dropped. That
would be fine, but a "," is left behind after removing the font.
Firefox struggles with this "," and will not use any of the provided fonts.

So the expected Output is:

<span style="font-size:9pt;font-family:'trebuchet ms' ;color:#505050">

When removing the "," Firefox renders the page as expected.
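
The dangling separator can be avoided by filtering the font names before joining them, instead of removing a name from an already joined list. A minimal sketch, with hypothetical class and method names and a caller-supplied keep rule standing in for whatever the sanitizer actually uses to drop a font:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; not the sanitizer's implementation.
final class FontFamilyJoinSketch {
  // Joins only the families accepted by keep, adding a comma before every
  // kept entry except the first, so no separator is left dangling.
  static String joinKept(List<String> families, Predicate<String> keep) {
    StringBuilder sb = new StringBuilder();
    for (String family : families) {
      if (!keep.test(family)) {
        continue;
      }
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(family);
    }
    return sb.toString();
  }
}
```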

What version of the product are you using? On what operating system?

r239, Windows 8.1, Firefox 32.0.3
No Issue in Internet Explorer and Chrome.

Original issue reported on code.google.com by [email protected] on 26 Oct 2014 at 12:21

select tag is closed prematurely relative to child option tags

See standalone JUnit test attached.

Briefly:

"<select>\n" +
"<option>A</option>\n" +
"<option>B</option>\n" +
"</select>\n"

will sanitize just fine into:

<select>" +
"<option>A</option>" +
"<option>B</option>" +
"</select>\n"

but

"<select>" +
"<option>A</option> \n" + // <-- notice the space before the newline
"<option>B</option> \n" + // <-- notice the space before the newline
"</select>\n";

produces this mangled result:

"<select><option>A</option></select> \n" +
"<option>B</option> \n" +
"\n"

Original issue reported on code.google.com by [email protected] on 3 May 2014 at 1:08

child elements are moved out of their parents

> What steps will reproduce the problem?
Execute the attached test case.

> What is the expected output? What do you see instead?
When sanitizing, the sanitizer moves inner elements out of their parents under
certain circumstances (see examples in the test case).

I don't want the sanitizer to change the markup, only to remove all content that
is not allowed.

> What version of the product are you using? On what operating system?
r135 / linux

Original issue reported on code.google.com by [email protected] on 1 Feb 2013 at 12:10

Disallow XSS vectors from style attribute

There are known Style Attribute XSS attacks like:

<DIV STYLE="color: red; width: expression(alert('XSS')); background-image: 
url('expression.png') ">
Or

<DIV STYLE="background-image: url(javascript:alert('XSS'));  border-image: 
url(images/javascript.png) 30 round round;">


And I need to sanitize the HTML to this:
<DIV STYLE="color: red; background-image: url('expression.png') ">
Or

<DIV STYLE="border-image: url(images/javascript.png) 30 round round;">



Does this library cover such cases?

Original issue reported on code.google.com by [email protected] on 19 Jun 2013 at 1:02
