|
-
- The Modularization of HTMLDefinition in HTML Purifier
-
- WARNING: This document was drafted before the implementation of this
- system, and some implementation details may have evolved over time.
-
- HTML Purifier uses the modularization of XHTML
- <http://www.w3.org/TR/xhtml-modularization/> to organize the internals
- of HTMLDefinition into a more manageable and extensible fashion. Rather
- than have one super-object, HTMLDefinition is split into HTMLModules,
- each of which is responsible for defining elements, their attributes,
- and other properties (for more in-depth coverage, see
- /library/HTMLPurifier/HTMLModule.php's docblock comments). These modules
- are managed by HTMLModuleManager.
-
- Modules that we don't support but could support are:
-
- * 5.6. Table Modules
- o 5.6.1. Basic Tables Module [?]
- * 5.8. Client-side Image Map Module [?]
- * 5.9. Server-side Image Map Module [?]
- * 5.12. Target Module [?]
- * 5.21. Name Identification Module [deprecated]
-
- These modules would be implemented as "unsafe":
-
- * 5.2. Core Modules
- o 5.2.1. Structure Module
- * 5.3. Applet Module
- * 5.5. Forms Modules
- o 5.5.1. Basic Forms Module
- o 5.5.2. Forms Module
- * 5.10. Object Module
- * 5.11. Frames Module
- * 5.13. Iframe Module
- * 5.14. Intrinsic Events Module
- * 5.15. Metainformation Module
- * 5.16. Scripting Module
- * 5.17. Style Sheet Module
- * 5.19. Link Module
- * 5.20. Base Module
-
- We will not be using W3C's XML Schemas or DTDs directly due to the lack
- of robust tools for handling them (the main problem is that most
- current parsers are PHP 5 only and solely validating, not
- correcting).
-
- This system may be generalized and ported over for CSS.
-
- == General Use-Case ==
-
- The outward API of HTMLDefinition has been largely preserved, not
- only for backwards compatibility but also by design. In addition,
- HTMLDefinition can be retrieved "raw", in which case it loads a structure
- that closely resembles the modules of XHTML 1.1. This structure is very
- dynamic, making it easy to make cascading changes to global content
- sets or remove elements in bulk.
-
- However, once HTML Purifier needs the actual definition, it retrieves
- a finalized version of HTMLDefinition. The finalized definition involves
- processing the modules into a form that is optimized for multiple
- calls. This final version is immutable and, even if editable, would
- be extremely hard to change.
-
- So, some code taking advantage of the XHTML modularization may look
- like this:
-
- <?php
- $config = HTMLPurifier_Config::createDefault();
- $def = $config->getHTMLDefinition(true); // raw, still-editable definition
- $def->addElement('marquee', 'Block', 'Flow', 'Common');
- $purifier = new HTMLPurifier($config);
- $html = '<marquee>Hello</marquee>'; // placeholder input
- $purifier->purify($html); // now the definition is finalized
- ?>
-
- == Inclusions ==
-
- One of the nice features of HTMLDefinition is that piggy-backing off
- of global attribute and content sets is extremely easy to do.
-
- === Attributes ===
-
- HTMLModule->elements[$element]->attr stores attribute information for the
- specific attributes of $element. This is quite close to the final
- API that HTML Purifier interfaces with, but there's an important
- extra feature: attr may also contain an array at index zero.
-
- <?php
- // $module is an HTMLModule instance
- $module->elements[$element]->attr[0] = array('AttrSet');
- ?>
-
- Rather than mapping an attribute name to an AttrDef, as every other
- key does, index 0 holds an array naming the attribute collections that
- should be merged into this element's attribute array.
-
- Furthermore, the value in an attribute name/value pair need not be a
- fully fledged AttrDef object. It can also be a string, which
- signifies an AttrDef that is looked up from a centralized registry,
- AttrTypes. This allows more concise attribute definitions that look
- more like W3C's declarations, as well as offering a centralized point
- for modifying the behavior of one attribute type. And, of course, the
- old method of manually instantiating an AttrDef still works.
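-
- The string lookup described above can be sketched roughly as follows.
- This is a minimal illustration, not HTML Purifier's implementation:
- SketchAttrDef, $attrTypes and sketch_resolve_types() are hypothetical
- names, and the registry contents are made up for the example.
-
```php
<?php
// Hypothetical stand-in for an AttrDef; not HTML Purifier's real class.
class SketchAttrDef
{
    public $name;
    public function __construct($name) { $this->name = $name; }
}

// Hypothetical central registry mapping type names to shared definitions.
$attrTypes = array(
    'Text' => new SketchAttrDef('Text'),
    'URI'  => new SketchAttrDef('URI'),
);

// Replace every string value with the registry entry of that name.
// Values that are already objects pass through untouched.
function sketch_resolve_types(array $attr, array $attrTypes)
{
    foreach ($attr as $key => $value) {
        if ($key === 0) {
            continue; // index 0 lists collections to include, not a type
        }
        if (is_string($value)) {
            if (!isset($attrTypes[$value])) {
                throw new InvalidArgumentException("Unknown attribute type: $value");
            }
            $attr[$key] = $attrTypes[$value];
        }
    }
    return $attr;
}

$attr = array(
    0       => array('Common'),              // inclusion list, left alone
    'href'  => 'URI',                        // looked up in the registry
    'title' => new SketchAttrDef('Custom'),  // manually instantiated
);
$resolved = sketch_resolve_types($attr, $attrTypes);
```
-
- Because every 'URI' string resolves to the same registry entry, changing
- that one entry would change the behavior of all attributes of that type.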
-
- === Attribute Collections ===
-
- Attribute collections are stored and processed in the AttrCollections
- object, which is responsible for performing the inclusions signified
- by the 0 index. These attribute collections are mutable too, through
- HTMLModule->attr_collections. You may add new attributes
- to a collection or define an entirely new collection for your module's
- use. Inclusions can also be cumulative.
-
- Attribute collections allow us to get rid of so-called "global attributes"
- (which actually aren't so global).
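-
- Cumulative inclusions can be pictured with the sketch below. It is an
- illustration under stated assumptions, not the AttrCollections code:
- the collection names and contents are invented, sketch_expand() is a
- hypothetical helper, and the rule that an element's own attributes win
- over included ones is an assumption of this sketch.
-
```php
<?php
// Hypothetical collections; names and contents are illustrative only.
$collections = array(
    'Core'   => array('class' => 'NMTOKENS', 'id' => 'ID'),
    'I18N'   => array('lang' => 'LanguageCode'),
    'Common' => array(0 => array('Core', 'I18N'), 'title' => 'Text'),
);

// Merge every collection named at index 0 into $attr, following nested
// inclusions. $seen guards against a collection including itself.
function sketch_expand(array $attr, array $collections, array $seen = array())
{
    $includes = isset($attr[0]) ? $attr[0] : array();
    unset($attr[0]);
    foreach ($includes as $name) {
        if (isset($seen[$name]) || !isset($collections[$name])) {
            continue; // already being expanded, or unknown collection
        }
        $seen[$name] = true;
        $merged = sketch_expand($collections[$name], $collections, $seen);
        // Assumption: attributes already present take precedence.
        $attr += $merged;
    }
    return $attr;
}

$anchor = sketch_expand(array(0 => array('Common'), 'href' => 'URI'), $collections);
ksort($anchor); // sort by attribute name for stable inspection
```
-
- Here 'Common' pulls in 'Core' and 'I18N', so an element that includes only
- 'Common' still ends up with the attributes of all three collections.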
-
- === Content Models and ChildDef ===
-
- The same inclusion approach used for attributes and attribute
- collections was applied to the ChildDef system. HTML Purifier uses
- a proprietary system called ChildDef for performance and flexibility
- reasons, but this does not line up very well with W3C's notion of
- regexps for defining the allowed children of an element.
-
- HTMLModule->elements[$element]->content_model and
- HTMLModule->elements[$element]->content_model_type store information
- about the final ChildDef that will be stored in
- HTMLModule->elements[$element]->child (we use a different variable
- because the two forms are sufficiently different).
-
- $content_model is an abstract, string representation of the internal
- state of ChildDef, while $content_model_type is a string identifier
- of which ChildDef subclass to instantiate. $content_model is processed
- by substituting all content set identifiers (capitalized element names)
- with their contents. It is then parsed and passed into the appropriate
- ChildDef class, as determined by ContentSets->getChildDef() or, for
- custom child definitions not in the core, the fallback
- HTMLModule->getChildDef().
-
- You'll need to use these facilities if you plan on referencing a content
- set like "Inline" or "Block"; using them is recommended even if you
- don't, because they are more concise.
-
- A few notes on $content_model: its structure can be as complicated
- as you want, but the pipe symbol (|) is reserved for defining possible
- choices, due to the content sets implementation. For example, a content
- model that looks like:
-
- "Inline -> Block -> a"
-
- ...when the Inline content set is defined as "span | b" and the Block
- content set is defined as "div | blockquote", will expand into:
-
- "span | b -> div | blockquote -> a"
-
- The custom HTMLModule->getChildDef() function will need to be able to
- then feed this information to ChildDef in a usable manner.
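-
- The substitution step in the example above can be sketched as below.
- This is only an approximation of the idea: sketch_expand_model() is a
- hypothetical helper, and the plain string replacement it uses is an
- assumption, not necessarily how ContentSets performs the expansion.
-
```php
<?php
// Content sets taken from the example above.
$contentSets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);

// Substitute each content set identifier with its contents. strtr()
// replaces all keys in a single pass, so substituted text is never
// rescanned for further identifiers.
function sketch_expand_model($model, array $contentSets)
{
    return strtr($model, $contentSets);
}

$expanded = sketch_expand_model('Inline -> Block -> a', $contentSets);
// $expanded is now "span | b -> div | blockquote -> a"
```
-
- Note that strtr() matches substrings, so this sketch would also rewrite
- an identifier embedded in a longer word; a real implementation would need
- to match whole identifiers only.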
-
- === Content Sets ===
-
- Content sets can be altered using HTMLModule->content_sets, an associative
- array of content set names to content set contents. If the content set
- already exists, your values are appended to it (great for, say,
- registering the font tag as an inline element); otherwise, it is
- created. Content sets are then substituted into content_model.
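-
- The append-or-create behavior might look like the following sketch.
- sketch_merge_sets() is a hypothetical helper, the 'Media' set is invented
- for the example, and joining with " | " is an assumption based on the
- pipe being the choice separator in content models.
-
```php
<?php
// Merge one module's content sets into the running table: append with
// " | " when the set already exists, create it otherwise.
function sketch_merge_sets(array $sets, array $moduleSets)
{
    foreach ($moduleSets as $name => $contents) {
        if (isset($sets[$name]) && $sets[$name] !== '') {
            $sets[$name] .= ' | ' . $contents;
        } else {
            $sets[$name] = $contents;
        }
    }
    return $sets;
}

$sets = array('Inline' => 'span | b');
// A module registers the font tag as inline and defines a new set.
$sets = sketch_merge_sets($sets, array('Inline' => 'font', 'Media' => 'img'));
```
-
- After the merge, any content model that references Inline would also
- accept the font element once the set is substituted in.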
-
- vim: et sw=4 sts=4
|