HTML Sanitization

By default, Universal Feed Parser sanitizes HTML markup in several elements, removing HTML tags and attributes that could introduce JavaScript or other security risks.

These elements are sanitized by default:

The following HTML tags are allowed by default (all others are stripped): a, abbr, acronym, address, area, b, big, blockquote, br, button, caption, center, cite, code, col, colgroup, dd, del, dfn, dir, div, dl, dt, em, fieldset, font, form, h1, h2, h3, h4, h5, h6, hr, i, img, input, ins, kbd, label, legend, li, map, menu, ol, optgroup, option, p, pre, q, s, samp, select, small, span, strike, strong, sub, sup, table, tbody, td, textarea, tfoot, th, thead, tr, tt, u, ul, var

The following HTML attributes are allowed by default (all others are stripped): abbr, accept, accept-charset, accesskey, action, align, alt, axis, border, cellpadding, cellspacing, char, charoff, charset, checked, cite, class, clear, cols, colspan, color, compact, coords, datetime, dir, disabled, enctype, for, frame, headers, height, href, hreflang, hspace, id, ismap, label, lang, longdesc, maxlength, media, method, multiple, name, nohref, noshade, nowrap, prompt, readonly, rel, rev, rows, rowspan, rules, scope, selected, shape, size, span, src, start, summary, tabindex, target, title, type, usemap, valign, value, vspace, width

Note
The unit tests for HTML sanitizing show many different examples of dangerous markup that Universal Feed Parser sanitizes by default.
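To make the whitelist idea concrete, here is a minimal sketch of a whitelist-based sanitizer built on Python's standard html.parser module. It is not Universal Feed Parser's own implementation. The names WhitelistSanitizer and sanitize are hypothetical, and the two sets are heavily abbreviated stand-ins for the full lists above (URL-bearing attributes such as href and src are left out to keep the sketch short).

```python
from html import escape
from html.parser import HTMLParser

# Abbreviated, illustrative whitelists (assumption: NOT the full lists above).
ALLOWED_TAGS = {"b", "em", "i", "p", "span", "strong"}
ALLOWED_ATTRS = {"class", "lang", "title"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; strip everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tag: stripped
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED_ATTRS  # unknown attribute: stripped
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is always kept, re-escaped so it cannot reopen markup.
        # In this sketch, text inside a stripped element is kept too.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">ok</p>'))
# <p>ok</p>
print(sanitize("<span style=\"any: expression(window.location='http://example.org/')\">nasty tricks</span>"))
# <span>nasty tricks</span>
```

Note that the sketch never inspects the value of the style attribute. The attribute is dropped simply because it is not on the list, which is the whole point of a whitelist.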

Whitelist, Don't Blacklist

I am often asked why Universal Feed Parser is so hard-assed about HTML sanitizing. This topic usually comes up when someone notices that Universal Feed Parser strips all style attributes by default.

Here is an incomplete list of potentially dangerous HTML tags and attributes:

  • script, which can contain malicious script
  • applet, embed, and object, which can automatically download and execute malicious code
  • meta, which can contain malicious redirects
  • onload, onunload, and all other on* attributes, which can contain malicious script
  • style, link, and the style attribute, which can contain malicious script

style? Yes, style. CSS definitions can contain executable code.

Example: Embedding JavaScript in CSS

This sample is taken from http://feedparser.org/docs/examples/rss20.xml:

<description>Watch out for
&lt;span style="background: url(javascript:window.location='http://example.org/')"&gt;
nasty tricks&lt;/span&gt;</description>

This sample is more advanced, and does not contain the keyword javascript: that many naive HTML sanitizers scan for:

<description>Watch out for
&lt;span style="any: expression(window.location='http://example.org/')"&gt;
nasty tricks&lt;/span&gt;</description>

Internet Explorer for Windows will execute the JavaScript in both of these examples.
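As a small illustration of why keyword scanning falls short, the following sketch runs a deliberately naive blacklist check against the two style values shown above. The function naive_is_safe is hypothetical and stands in for any sanitizer that looks only for the literal string javascript:.

```python
def naive_is_safe(markup):
    """Blacklist-style check: reject only markup containing 'javascript:'."""
    return "javascript:" not in markup.lower()


first = "<span style=\"background: url(javascript:window.location='http://example.org/')\">"
second = "<span style=\"any: expression(window.location='http://example.org/')\">"

print(naive_is_safe(first))   # False: the keyword is present, so it is caught
print(naive_is_safe(second))  # True: no keyword, so the check wrongly passes it
```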

Now consider that in HTML, attribute values may be entity-encoded in several different ways.

Example: Embedding encoded JavaScript in CSS

To a browser, this:

<span style="any: expression(window.location='http://example.org/')">

is the same as this (without the line breaks):

<span style="&#97;&#110;&#121;&#58;&#32;&#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">

which is the same as this (without the line breaks):

<span style="&#x61;&#x6e;&#x79;&#x3a;&#x20;&#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">

And so on, plus several other variations, plus every combination of every variation.

The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about tags or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code. I will not attempt to preserve “just the good styles”. All styles are stripped.
