These elements are sanitized by default:
The following HTML tags are allowed by default (all others are stripped): a, abbr, acronym, address, area, b, big, blockquote, br, button, caption, center, cite, code, col, colgroup, dd, del, dfn, dir, div, dl, dt, em, fieldset, font, form, h1, h2, h3, h4, h5, h6, hr, i, img, input, ins, kbd, label, legend, li, map, menu, ol, optgroup, option, p, pre, q, s, samp, select, small, span, strike, strong, sub, sup, table, tbody, td, textarea, tfoot, th, thead, tr, tt, u, ul, var
The following HTML attributes are allowed by default (all others are stripped): abbr, accept, accept-charset, accesskey, action, align, alt, axis, border, cellpadding, cellspacing, char, charoff, charset, checked, cite, class, clear, cols, colspan, color, compact, coords, datetime, dir, disabled, enctype, for, frame, headers, height, href, hreflang, hspace, id, ismap, label, lang, longdesc, maxlength, media, method, multiple, name, nohref, noshade, nowrap, prompt, readonly, rel, rev, rows, rowspan, rules, scope, selected, shape, size, span, src, start, summary, tabindex, target, title, type, usemap, valign, value, vspace, width
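To make the allow-list behaviour concrete, here is a minimal sketch built on Python's standard `html.parser` module. It is not Universal Feed Parser's actual implementation: the class name `AllowListSanitizer`, the helper `sanitize`, and the deliberately tiny `ALLOWED_TAGS` and `ALLOWED_ATTRS` sets are all assumptions made for illustration. It only shows the general idea of re-emitting known-safe tags and attributes and dropping everything else.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subsets only (assumption); the full default lists are given above.
ALLOWED_TAGS = {"a", "b", "em", "p", "span", "strong"}
ALLOWED_ATTRS = {"class", "href", "title"}
# Tags whose text content is dropped along with the tag (assumption).
DROP_CONTENT = {"script", "style"}


class AllowListSanitizer(HTMLParser):
    """Hypothetical sketch: re-emit only allow-listed tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # the tag is dropped; any text inside it is still emitted
        # Attribute values arrive already entity-decoded; re-escape on output.
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup with everything outside the allow-lists removed."""
    parser = AllowListSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Because the decision is "keep only what is listed", an attribute such as style or onclick disappears without the sanitizer ever having to inspect its value.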
The unit tests for HTML sanitizing show many different examples of dangerous markup that Universal Feed Parser sanitizes by default.
I am often asked why Universal Feed Parser is so hard-assed about HTML sanitizing. This topic usually comes up when someone notices that Universal Feed Parser strips all style attributes by default.
Here is an incomplete list of potentially dangerous HTML tags and attributes:
- script, which can contain malicious script
- applet, embed, and object, which can automatically download and execute malicious code
- meta, which can contain malicious redirects
- onload, onunload, and all other on* attributes, which can contain malicious script
- style, link, and the style attribute, which can contain malicious script
style? Yes, style. CSS definitions can contain executable code.
This sample is taken from http://feedparser.org/docs/examples/rss20.xml:
<description>Watch out for <span style="any: expression(window.location='http://example.org/')"> nasty tricks</span></description>
Now consider that in HTML, attribute values may be entity-encoded in several different ways.
To a browser, this:
<span style="any: expression(window.location='http://example.org/')">
is the same as this (without the line breaks):
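<span style="&#97;&#110;&#121;&#58;&#32;&#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">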
which is the same as this (without the line breaks):
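<span style="&#x61;&#x6e;&#x79;&#x3a;&#x20;&#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">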
And so on, plus several other variations, plus every combination of every variation.
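The equivalence of these encodings can be checked with Python's standard `html.unescape`, which decodes character references much as a browser does when it reads an attribute value. This is only a sketch: the function `naive_filter_flags` is a hypothetical stand-in for a blacklist-style check, invented here to show how easily such a check is bypassed.

```python
from html import unescape

payload = "any: expression(window.location='http://example.org/')"

# Encode every character as a decimal or a hexadecimal character reference.
decimal_form = "".join(f"&#{ord(ch)};" for ch in payload)
hex_form = "".join(f"&#x{ord(ch):x};" for ch in payload)

# Mix the styles: plain, decimal, and hex characters in rotation.
mixed_form = "".join(
    (ch, f"&#{ord(ch)};", f"&#x{ord(ch):x};")[i % 3]
    for i, ch in enumerate(payload)
)


def naive_filter_flags(value):
    """Hypothetical blacklist check: look for a known-bad substring."""
    return "expression(" in value


# All three encodings decode to the same attribute value...
for encoded in (decimal_form, hex_form, mixed_form):
    assert unescape(encoded) == payload

# ...but the substring check only catches the unencoded one.
print(naive_filter_flags(payload))       # True
print(naive_filter_flags(decimal_form))  # False
print(naive_filter_flags(hex_form))      # False
```

A filter could decode the value before checking it, but it would still need to know every string a browser might treat as executable.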
The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about tags or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code. I will not attempt to preserve “just the good styles”. All styles are stripped.