The Modularization of HTMLDefinition in HTML Purifier

WARNING: This document was drafted before the implementation of this
system, and some implementation details may have evolved over time.

HTML Purifier uses the modularization of XHTML
<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
of HTMLDefinition in a more manageable and extensible fashion. Rather
than having one super-object, HTMLDefinition is split into HTMLModules,
each of which is responsible for defining elements, their attributes,
and other properties (for more in-depth coverage, see the docblock
comments in /library/HTMLPurifier/HTMLModule.php). These modules
are managed by HTMLModuleManager.

Modules that we don't support but could support are:

    * 5.6. Table Modules
        o 5.6.1. Basic Tables Module [?]
    * 5.8. Client-side Image Map Module [?]
    * 5.9. Server-side Image Map Module [?]
    * 5.12. Target Module [?]
    * 5.21. Name Identification Module [deprecated]

These modules would be implemented as "unsafe":

    * 5.2. Core Modules
        o 5.2.1. Structure Module
    * 5.3. Applet Module
    * 5.5. Forms Modules
        o 5.5.1. Basic Forms Module
        o 5.5.2. Forms Module
    * 5.10. Object Module
    * 5.11. Frames Module
    * 5.13. Iframe Module
    * 5.14. Intrinsic Events Module
    * 5.15. Metainformation Module
    * 5.16. Scripting Module
    * 5.17. Style Sheet Module
    * 5.19. Link Module
    * 5.20. Base Module

We will not be using W3C's XML Schemas or DTDs directly due to the lack
of robust tools for handling them (the main problem is that the
current parsers are usually PHP 5 only and can only validate, not
correct).

This system may be generalized and ported over for CSS.

== General Use-Case ==

The outward API of HTMLDefinition has been largely preserved, not
only for backwards compatibility but also by design. What is new is
that HTMLDefinition can be retrieved "raw", in which case it loads a
structure that closely resembles the modules of XHTML 1.1. This structure
is very dynamic, making it easy to make cascading changes to global
content sets or remove elements in bulk.

However, once HTML Purifier needs the actual definition, it retrieves
a finalized version of HTMLDefinition. Finalization processes the
modules into a form that is optimized for multiple calls. This final
version is immutable and, even if it were editable, would be extremely
hard to change.

So, some code taking advantage of the XHTML modularization may look
like this:

<?php
    $config = HTMLPurifier_Config::createDefault();
    // Raw (editable) definition. The =& matters on PHP 4, where objects
    // are copied on assignment.
    $def =& $config->getHTMLDefinition(true);
    $def->addElement('marquee', 'Block', 'Flow', 'Common');
    $purifier = new HTMLPurifier($config);
    $html = '<marquee>Hello</marquee>'; // sample input
    $purifier->purify($html); // now the definition is finalized
?>

== Inclusions ==

One of the nice features of HTMLDefinition is that piggy-backing off
of global attribute and content sets is extremely easy to do.

=== Attributes ===

HTMLModule->elements[$element]->attr stores attribute information for the
specific attributes of $element. This is quite close to the final
API that HTML Purifier interfaces with, but there's an important
extra feature: attr may also contain an array at index zero.

<?php
    HTMLModule->elements[$element]->attr[0] = array('AttrSet');
?>

Rather than mapping the attribute key 0 to an AttrDef (as every other
key does), this array names the attribute collections that should
be merged into this element's attribute array.
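
The merge step can be sketched in a few lines of standalone PHP. This is
an illustration only, not HTML Purifier's implementation: the function
expand_attr_inclusions() and the sample collection contents are
hypothetical, and the rule that an element's own attributes win over
collection entries is an assumption of this sketch.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: merges the collections
// named under index 0 into an element's attribute array.
function expand_attr_inclusions(array $attr, array $collections)
{
    if (!isset($attr[0])) {
        return $attr; // nothing to include
    }
    $includes = $attr[0];
    unset($attr[0]);
    foreach ($includes as $name) {
        if (!isset($collections[$name])) {
            continue; // unknown collection: nothing to merge
        }
        // The array union operator keeps keys already present in $attr,
        // so element-specific definitions take precedence (assumption).
        $attr += $collections[$name];
    }
    return $attr;
}

// Sample data, invented for the example.
$collections = array(
    'Core' => array('class' => 'NMTOKENS', 'id' => 'ID'),
    'I18N' => array('lang' => 'LanguageCode'),
);
$attr = array(
    0       => array('Core', 'I18N'),
    'href'  => 'URI',
    'class' => 'Text', // element-specific override
);
$merged = expand_attr_inclusions($attr, $collections);
```

After the call, $merged no longer has a 0 key; it holds href and class
from the element itself plus id and lang pulled in from the collections.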

Furthermore, the value in an attribute key/value pair need not be a
fully fledged AttrDef object. It can also be a string, which
signifies an AttrDef that is looked up in the centralized registry
AttrTypes. This allows more concise attribute definitions that look
more like W3C's declarations, and it offers a centralized point
for modifying the behavior of one attribute type. And, of course, the
old method of manually instantiating an AttrDef still works.
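
A rough sketch of the string lookup follows. The classes SketchAttrDef
and SketchAttrTypes and the function resolve_attr_types() are stand-ins
invented for this example; the real AttrDef and AttrTypes classes have
their own interfaces, which this sketch does not try to reproduce.

```php
<?php
// Stand-in for an attribute definition object (illustration only).
class SketchAttrDef
{
    public $label;

    public function __construct($label)
    {
        $this->label = $label;
    }
}

// Hypothetical registry mapping type names to shared definition objects.
class SketchAttrTypes
{
    private $types = array();

    public function set($name, SketchAttrDef $def)
    {
        $this->types[$name] = $def;
    }

    public function get($name)
    {
        if (!isset($this->types[$name])) {
            throw new InvalidArgumentException("Unknown attribute type: $name");
        }
        return $this->types[$name];
    }
}

// Replace string type names with registry objects; values that are
// already objects (manually instantiated definitions) pass through.
function resolve_attr_types(array $attr, SketchAttrTypes $types)
{
    foreach ($attr as $name => $def) {
        if (is_string($def)) {
            $attr[$name] = $types->get($def);
        }
    }
    return $attr;
}

$types = new SketchAttrTypes();
$types->set('URI', new SketchAttrDef('uri'));
$types->set('Text', new SketchAttrDef('text'));

$attr = resolve_attr_types(
    array(
        'href'  => 'URI',                       // looked up by name
        'title' => 'Text',                      // looked up by name
        'rel'   => new SketchAttrDef('custom'), // manually instantiated
    ),
    $types
);
```

Because every 'URI' string resolves to the same registry entry, swapping
that one entry changes the behavior of every attribute declared with it.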

=== Attribute Collections ===

Attribute collections are stored and processed in the AttrCollections
object, which is responsible for performing the inclusions signified
by the 0 index. These attribute collections are also mutable, through
HTMLModule->attr_collections. You may add new attributes
to a collection or define an entirely new collection for your module's
use. Inclusions can also be cumulative.
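
One reading of "cumulative" is that a collection may itself include
other collections through its own 0 index. Under that assumption, the
expansion could be sketched as below; expand_collection() and the sample
collections are hypothetical and do not mirror AttrCollections' actual
code.

```php
<?php
// Hypothetical sketch: resolve a collection that includes other
// collections through its own 0 index.
function expand_collection($name, array $collections, array $seen = array())
{
    if (isset($seen[$name]) || !isset($collections[$name])) {
        return array(); // already on this path (cycle) or undefined
    }
    $seen[$name] = true;

    $attr = $collections[$name];
    $includes = isset($attr[0]) ? $attr[0] : array();
    unset($attr[0]);

    foreach ($includes as $included) {
        // Attributes defined directly on the collection take precedence
        // over included ones (an assumption of this sketch).
        $attr += expand_collection($included, $collections, $seen);
    }
    return $attr;
}

// Sample data, invented for the example.
$collections = array(
    'Core'   => array('id' => 'ID', 'class' => 'NMTOKENS'),
    'I18N'   => array('lang' => 'LanguageCode'),
    'Common' => array(0 => array('Core', 'I18N'), 'title' => 'Text'),
);
$common = expand_collection('Common', $collections);
```

Here $common ends up with title, id, class and lang, so an element that
includes only 'Common' receives the attributes of all three collections.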

Attribute collections allow us to get rid of so-called "global attributes"
(which actually aren't so global).

=== Content Models and ChildDef ===

The same approach used for attributes and attribute collections was
applied to the ChildDef system. HTML Purifier uses
a proprietary system called ChildDef for performance and flexibility
reasons, but this does not line up very well with W3C's notion of
regexps for defining the allowed children of an element.

HTMLPurifier->elements[$element]->content_model and
HTMLPurifier->elements[$element]->content_model_type store information
about the final ChildDef that will be stored in
HTMLPurifier->elements[$element]->child (we use a different variable
because the two forms are sufficiently different).

$content_model is an abstract string representation of the internal
state of ChildDef, while $content_model_type is a string identifier
of which ChildDef subclass to instantiate. $content_model is processed
by substituting all content set identifiers (capitalized element names)
with their contents. It is then parsed and passed into the appropriate
ChildDef class, as defined by ContentSets->getChildDef() or by the
custom fallback HTMLModule->getChildDef() for custom child definitions
not in the core.

You'll need to use these facilities if you plan on referencing a content
set like "Inline" or "Block", and using them is recommended even if you're
not, due to their conciseness.

A few notes on $content_model: its structure can be as complicated
as you want, but the pipe symbol (|) is reserved for defining possible
choices, due to the content sets implementation. For example, a content
model that looks like:

    "Inline -> Block -> a"

...when the Inline content set is defined as "span | b" and the Block
content set is defined as "div | blockquote", will expand into:

    "span | b -> div | blockquote -> a"

The custom HTMLModule->getChildDef() function will then need to be able
to feed this information to ChildDef in a usable manner.
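
The substitution in this example can be reproduced with a small
standalone function. expand_content_model() is a hypothetical name, and
matching identifiers by a leading capital letter is a simplification
chosen for this sketch, not necessarily how ContentSets does it. The
closure syntax requires PHP 5.3 or later.

```php
<?php
// Hypothetical sketch: substitute content set identifiers in a content
// model string with the contents of the matching content set.
function expand_content_model($model, array $content_sets)
{
    return preg_replace_callback(
        '/\b[A-Z]\w*\b/', // whole words starting with a capital letter
        function ($match) use ($content_sets) {
            $name = $match[0];
            // Unknown identifiers are left untouched.
            return isset($content_sets[$name]) ? $content_sets[$name] : $name;
        },
        $model
    );
}

$content_sets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);
$expanded = expand_content_model('Inline -> Block -> a', $content_sets);
```

With the two content sets defined as in the example, $expanded is the
string "span | b -> div | blockquote -> a". The lowercase element name
"a" is not treated as a content set and passes through unchanged.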

=== Content Sets ===

Content sets can be altered using HTMLModule->content_sets, an associative
array mapping content set names to content set contents. If the content
set already exists, your values are appended to it (great for, say,
registering the font tag as an inline element); otherwise it is
created. Content sets are substituted into content_model.
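
The append-or-create rule might be sketched as follows. The function
merge_content_sets() and the two sample modules are invented for
illustration, and joining appended values with " | " is an assumption
based on the pipe being the choice separator in content models.

```php
<?php
// Hypothetical sketch: combine the content_sets arrays of several
// modules, appending to existing sets and creating missing ones.
function merge_content_sets(array $per_module_sets)
{
    $merged = array();
    foreach ($per_module_sets as $sets) {
        foreach ($sets as $name => $contents) {
            if (isset($merged[$name])) {
                $merged[$name] .= ' | ' . $contents; // append to existing set
            } else {
                $merged[$name] = $contents;          // create new set
            }
        }
    }
    return $merged;
}

// Sample data, invented for the example.
$text_module   = array('Inline' => 'span | b', 'Block' => 'div | blockquote');
$legacy_module = array('Inline' => 'font', 'Media' => 'img');

$all = merge_content_sets(array($text_module, $legacy_module));
```

In this run the font tag is appended to the existing Inline set, Block
is carried over unchanged, and Media is created because no earlier
module defined it.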

    vim: et sw=4 sts=4