-
- HTML Purifier
- by Edward Z. Yang
-
- There are a number of ad hoc HTML filtering solutions out there on the web
- (examples include HTML_Safe, kses and SafeHtmlChecker.class.php) that
- claim to filter HTML properly, preventing malicious JavaScript and
- layout-breaking HTML from getting through the parser. None of them, however,
- demonstrates a thorough knowledge of either the DTD that defines HTML
- or the caveats of HTML that cannot be expressed by a DTD. Configurable
- filters (such as kses or PHP's built-in strip_tags() function) have trouble
- validating the contents of attributes and can be subject to security attacks
- due to poor configuration. Other filters take the naive approach of
- blacklisting known threats and tags, failing to account for the introduction
- of new technologies, new tags, new attributes or quirky browser behavior.
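
To make the blacklist problem concrete, here is a small Python sketch (not
HTML Purifier code; the BLACKLIST and WHITELIST sets and all function names
are hypothetical). It lists which start tags each policy would let through,
using the standard library's html.parser as a stand-in tokenizer.

```python
from html.parser import HTMLParser

BLACKLIST = {"script", "iframe"}  # hypothetical list of "known threats"
WHITELIST = {"b", "i", "p"}       # hypothetical list of "known safe" tags


class TagCollector(HTMLParser):
    """Record the name of every start tag in the input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def passes_blacklist(html):
    # Everything is allowed unless someone thought to forbid it.
    return [tag for tag in start_tags(html) if tag not in BLACKLIST]


def passes_whitelist(html):
    # Nothing is allowed unless it was explicitly approved.
    return [tag for tag in start_tags(html) if tag in WHITELIST]


sample = '<b>hi</b><script>evil()</script><svg onload="evil()"></svg>'
print(passes_blacklist(sample))  # ['b', 'svg']
print(passes_whitelist(sample))  # ['b']
```

The blacklist blocks <script> but lets <svg onload=...> through, because
nobody listed it. The whitelist rejects it without needing to know that it
exists.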
-
- However, HTML Purifier takes a different approach, one that doesn't use
- specification-ignorant regexes or narrow blacklists. HTML Purifier will
- decompose the whole document into tokens, and rigorously process the tokens by:
- removing non-whitelisted elements, transforming bad-practice tags like <font>
- into <span>, properly checking the nesting of tags and their children, and
- validating all attributes according to their RFCs.
-
- To my knowledge, there is nothing like this on the web yet. Not even MediaWiki,
- which allows an amazingly diverse mix of HTML and wikitext in its documents,
- gets all the nesting quirks right. Existing solutions hope that no JavaScript
- will slip through, but either do not attempt to ensure that the resulting
- output is valid XHTML or send the HTML through a draconian XML parser (and yet
- still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
- tags from being nested within each other).
-
- This document is no longer a detailed description of how HTML Purifier works,
- as those descriptions have been moved to the appropriate code. The first
- draft was drawn up after two rough code sketches and the implementation of a
- forgiving lexer. You may also be interested in the unit tests located in the
- tests/ folder, which provide a living document on how exactly the filter deals
- with malformed input.
-
- In summary (see corresponding classes for more details):
-
- 1. Parse document into an array of tag and text tokens (Lexer)
- 2. Remove all elements not on whitelist and transform certain other elements
- into acceptable forms (e.g. <font>)
- 3. Make document well-formed while helpfully taking into account certain quirks,
- such as the fact that <p> tags traditionally are closed by other block-level
- elements.
- 4. Run through all nodes and check children for proper order (especially
- important for tables).
- 5. Validate attributes according to more restrictive definitions based on the
- RFCs.
- 6. Translate back into a string. (Generator)
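
The steps above can be sketched in a few dozen lines of Python. This is an
illustration only, not HTML Purifier's implementation: the whitelist, the
<font>-to-<span> mapping, the set of block-level elements, the list of safe
URL schemes and every function name are assumptions made for the example.
Step 4 (checking the order of children) is left out, and the standard
library's html.parser stands in for the forgiving lexer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical configuration, chosen only for this example.
ALLOWED = {"p": set(), "div": set(), "b": set(), "i": set(),
           "span": set(), "a": {"href"}}   # element -> allowed attributes
RENAME = {"font": "span"}                  # tag -> acceptable replacement
BLOCK = {"p", "div"}                       # block-level starts close an open <p>
SAFE_SCHEMES = {"http", "https", "mailto"}


class Lexer(HTMLParser):
    """Step 1: parse the document into a list of tag and text tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, None))

    def handle_data(self, data):
        self.tokens.append(("text", data, None))


def filter_elements(tokens):
    """Step 2: rename transformable tags, then drop tags not on the whitelist.

    Only the tags are dropped; text inside a removed element is kept.
    """
    renamed = [(k, v if k == "text" else RENAME.get(v, v), a)
               for k, v, a in tokens]
    return [t for t in renamed if t[0] == "text" or t[1] in ALLOWED]


def make_well_formed(tokens):
    """Step 3: balance tags; a block-level start tag closes an open <p>."""
    stack, out = [], []

    def close_through(name):
        while stack:
            closed = stack.pop()
            out.append(("end", closed, None))
            if closed == name:
                break

    for kind, value, attrs in tokens:
        if kind == "start":
            if value in BLOCK and "p" in stack:
                close_through("p")
            stack.append(value)
            out.append((kind, value, attrs))
        elif kind == "end":
            if value in stack:  # stray end tags are dropped
                close_through(value)
        else:
            out.append((kind, value, attrs))
    while stack:  # close whatever is still open at the end of input
        out.append(("end", stack.pop(), None))
    return out


def validate_attributes(tokens):
    """Step 5: keep whitelisted attributes; href must use a safe scheme."""
    out = []
    for kind, value, attrs in tokens:
        if kind == "start":
            attrs = {k: v for k, v in attrs.items()
                     if k in ALLOWED[value] and v is not None}
            if "href" in attrs:
                scheme = attrs["href"].split(":", 1)[0].strip().lower()
                if scheme not in SAFE_SCHEMES:  # also rejects relative URLs
                    del attrs["href"]
        out.append((kind, value, attrs))
    return out


def generate(tokens):
    """Step 6: translate the token list back into a string."""
    parts = []
    for kind, value, attrs in tokens:
        if kind == "text":
            parts.append(escape(value, quote=False))
        elif kind == "start":
            rendered = "".join(f' {k}="{escape(v)}"' for k, v in attrs.items())
            parts.append(f"<{value}{rendered}>")
        else:
            parts.append(f"</{value}>")
    return "".join(parts)


def purify(html):
    lexer = Lexer()
    lexer.feed(html)
    lexer.close()
    tokens = make_well_formed(filter_elements(lexer.tokens))
    return generate(validate_attributes(tokens))
```

For example, the call

    purify('<p>Hi <font color="red">there</font><script>alert(1)</script>'
           '<div><a href="javascript:x()" onclick="y()">link</a>')

returns

    <p>Hi <span>there</span>alert(1)</p><div><a>link</a></div>

The <font> tag becomes <span>, the <script> tags and the unsafe attributes
disappear, the open <p> is closed before the <div>, and the unclosed <div>
is closed at the end. In this sketch the text inside the removed <script>
element survives as plain, escaped text.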
-
- HTML Purifier is best suited for documents that require a rich array of
- HTML tags. Things like blog comments are, in all likelihood, most appropriately
- written in an extremely restrictive set of markup that doesn't require
- all this functionality (or not written in HTML at all), although this may
- change in the future with the addition of levels of filtering.
-
- vim: et sw=4 sts=4