INCLUDES, AUTOLOAD, BYTECODE CACHES and OPTIMIZATION

The Problem
-----------

HTML Purifier contains a number of extra components that are not used all
of the time, only if the user explicitly specifies that we should use
them.

Some of these optional components are optionally included (Filter,
Language, Lexer, Printer), while others are included all the time
(Injector, URIFilter, HTMLModule, URIScheme). We will stipulate that these
are all developer specified: it is conceivable that certain Tokens are not
used, but this is user-dependent and should not be trusted.

We should come up with a consistent way to handle these things and ensure
that we get the maximum performance both when there are bytecode caches and
when there are not. Unfortunately, these two goals seem contrary to each
other.

A peripheral issue is the performance of ConfigSchema, which has been
shown to take a large, constant amount of initialization time, and is
intricately linked to the issue of includes due to its pervasive use
in our plugin architecture.

Pros and Cons
-------------

We will assume that user-based extensions will be included by them.

Conditional includes:
  Pros:
    - User management is simplified; only a single directive needs to be set
    - Only necessary code is included
  Cons:
    - Doesn't play nicely with opcode caches
    - Adds complexity to standalone version
    - Optional configuration directives are not exposed without a little
      extra coaxing (not implemented yet)

Include it all:
  Pros:
    - User management is still simple
    - Plays nicely with opcode caches and standalone version
    - All configuration directives are present
  Cons:
    - Lots of (how much?) extra code is included
    - Classes that inherit from external libraries will cause compile
      errors

Build an include stub (Let's do this!):
  Pros:
    - Only necessary code is included
    - Plays nicely with opcode caches and standalone version
    - require (without once) can be used, see above
    - Could further extend as a compilation to one file
  Cons:
    - Not implemented yet
    - Requires user intervention and use of a command line script
    - Standalone script must be chained to this
    - More complex and compiled-language-like
    - Requires a whole new class of system-wide configuration directives,
      as configuration objects can be reused
    - Determining what needs to be included can be complex (see above)
    - No way of autodetecting dynamically instantiated classes
    - Might be slow
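
To make the include stub option a little more concrete, here is a minimal
sketch of a generator for such a stub. The function name
build_include_stub() is hypothetical, and the class-to-path mapping assumes
PEAR-style naming (underscores become directory separators); nothing like
this is implemented yet.

```php
<?php
/**
 * Hypothetical helper: build the text of an include stub from class names.
 * Assumes PEAR-style naming, so HTMLPurifier_Filter_ExtractStyleBlocks
 * would live in HTMLPurifier/Filter/ExtractStyleBlocks.php.
 */
function build_include_stub(array $classes)
{
    $lines = array('<?php', '// Generated include stub; do not edit by hand.');
    foreach (array_unique($classes) as $class) {
        $path = str_replace('_', '/', $class) . '.php';
        // Plain require suffices: each file appears in the stub exactly once.
        $lines[] = "require '" . $path . "';";
    }
    return implode("\n", $lines) . "\n";
}

$stub = build_include_stub(array(
    'HTMLPurifier',
    'HTMLPurifier_Config',
    'HTMLPurifier_Filter_ExtractStyleBlocks', // optional component the user opted into
    'HTMLPurifier_Config',                    // duplicates are dropped
));
echo $stub;
```

The generated text would be written to a file by the command line script;
the hard part, deciding which classes go into the list, is not addressed
by this sketch.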

Include stubs
-------------

This solution may be "just right" for users who are heavily oriented
towards performance. However, there are a number of picky implementation
details to work out beforehand.

The number one concern is how to make the HTML Purifier files "work
out of the box", while still being able to easily get them into a form
that works with this setup. As the codebase stands right now, it would
be necessary to strip out all of the require_once calls. The only way
we could get rid of the require_once calls is to use __autoload or
use the stub for all cases (which might not be a bad idea).

    Aside
    -----
    An important thing to remember, however, is that these require_once's
    are valuable data about what classes a file needs. Unfortunately, there's
    no distinction between whether or not the file is needed all the time,
    or whether or not it is one of our "optional" files. Thus, this data
    is effectively useless.

    Deprecated
    ----------
    One of the things I'd like to do is have the code search for any classes
    that are explicitly mentioned in the code. If a class isn't mentioned, I
    get to assume that it is "optional," i.e. included via introspection.
    The choice is either to use PHP's tokenizer or use regexps; regexps would
    be faster but a tokenizer would be more correct. If this ends up being
    infeasible, adding dependency comments isn't a bad idea. (This could
    even be done automatically by search/replacing require_once, although
    we'd have to manually inspect the results for the optional requires.)

    NOTE: This ends up not being necessary, as we're going to make the user
    figure out all the extra classes they need, and only include the core
    which is predetermined.

Using the autoload framework with include stubs works nicely with
introspective classes: instead of having to have require_once inside
the function, we can let autoload do the work; we simply need to
new $class or accept the object straight from the caller. Handling filters
becomes a simple matter of ticking off configuration directives, and
if ConfigSchema spits out errors, adding the necessary includes. We could
also use the autoload framework as a fallback, in case the user forgets
to make the include, but doesn't really care about performance.
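
A fallback autoloader along these lines might look like the sketch below.
It uses spl_autoload_register() with a closure rather than __autoload,
which is an assumption on my part (it needs PHP 5.3+). The function
example_register_fallback() and the class HTMLPurifier_Filter_Demo are
hypothetical stand-ins, and the demo writes its class file into a
temporary directory so it can run on its own.

```php
<?php
/**
 * Hypothetical fallback autoloader. $root is wherever the library lives;
 * the class-to-path mapping assumes PEAR-style class names.
 */
function example_register_fallback($root)
{
    spl_autoload_register(function ($class) use ($root) {
        if (strpos($class, 'HTMLPurifier') !== 0) {
            return; // not ours; let other autoloaders try
        }
        $file = $root . '/' . str_replace('_', '/', $class) . '.php';
        if (is_file($file)) {
            require $file;
        }
    });
}

// Demo: place a stand-in class file in a temporary directory.
$root = sys_get_temp_dir() . '/hp_autoload_demo_' . uniqid();
mkdir($root . '/HTMLPurifier/Filter', 0777, true);
file_put_contents(
    $root . '/HTMLPurifier/Filter/Demo.php',
    "<?php class HTMLPurifier_Filter_Demo {}\n"
);
example_register_fallback($root);

$class  = 'HTMLPurifier_Filter_Demo'; // e.g. named by a configuration directive
$filter = new $class();               // autoload does the work, no require_once
echo get_class($filter), "\n";
```

Users who care about performance would still list the file in their
include stub; the autoloader only fires for classes that are not yet
defined.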

    Insight
    -------
    All of this talk is merely a natural extension of what our current
    standalone functionality does. However, instead of having our code
    perform the includes, or attempting to inline everything that possibly
    could be used, we boot the issue to the user, making them include
    everything or set up the fallback autoload handler.

Configuration Schema
--------------------

A common deficiency for all of the conditional include setups (including
the dynamically built include PHP stub) is that if one of these
conditionally included files includes a configuration directive, it
is not accessible to configdoc. A stopgap solution for this problem is
to have it piggy-back off of the data in the merge-library.php script
to figure out what extra files it needs to include, but if the file also
inherits classes that don't exist, we're in big trouble.

I think it's high time we centralized the configuration documentation.
However, the type checking has been a great boon for the library, and
I'd like to keep that. The compromise is to use some other source, and
then parse it into the ConfigSchema internal format (sans all of those
nasty documentation strings which we really don't need at runtime) and
serialize that for future use.

The next question is that of format. XML is very verbose, and the prospect
of setting defaults in it gives me the willies. However, this may be
necessary. Splitting up the file into manageable chunks may alleviate this
trouble, and we may even want to create our own format optimized for
specifying configuration. It might look like (based off the PHPT format,
which is nicely compact yet unambiguous and human-readable):

    Core.HiddenElements
    TYPE: lookup
    DEFAULT: array('script', 'style') // auto-converted during processing
    --ALIASES--
    Core.InvisibleElements, Core.StupidElements
    --DESCRIPTION--
    <p>
      Blah blah
    </p>

The first line is the directive name, the lines after that prior to the
first --HEADER-- block are single-line values, and the multiline values
come after that. No value is restricted to a particular format: DEFAULT
could very well be multiline if that would be easier.

This would make it insanely easy, also, to add arbitrary extra parameters,
like:

    VERSION: 3.0.0
    ALLOWED: 'none', 'light', 'medium', 'heavy' // this is wrapped in array()
    EXTERNAL: CSSTidy // this would be documented somewhere else with a URL
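
A parser for this format could be quite small. The sketch below is only
an illustration of the rules just described: the function name
example_parse_directive() is hypothetical, values are kept as raw strings
(no conversion of DEFAULT or ALLOWED into arrays), and the trailing
// annotations shown in the examples are left out of the input.

```php
<?php
/**
 * Hypothetical parser for the directive format sketched above.
 * Returns an array with an ID key plus one key per field or block.
 */
function example_parse_directive($text)
{
    $lines  = preg_split('/\r\n|\r|\n/', trim($text));
    $result = array('ID' => trim(array_shift($lines)));
    $block  = null; // name of the current --HEADER-- block, if any
    foreach ($lines as $line) {
        if (preg_match('/^--([A-Z]+)--$/', trim($line), $m)) {
            $block = $m[1];
            $result[$block] = '';
        } elseif ($block !== null) {
            // Inside a multiline block: keep lines as they are.
            $result[$block] .= $line . "\n";
        } elseif (strpos($line, ':') !== false) {
            // Single-line "KEY: value" pair before the first block.
            list($key, $value) = explode(':', $line, 2);
            $result[trim($key)] = trim($value);
        }
    }
    // Drop the trailing newline that block values accumulate.
    return array_map('rtrim', $result);
}

$source = <<<'EOT'
Core.HiddenElements
TYPE: lookup
DEFAULT: array('script', 'style')
VERSION: 3.0.0
--ALIASES--
Core.InvisibleElements, Core.StupidElements
--DESCRIPTION--
<p>
  Blah blah
</p>
EOT;

$directive = example_parse_directive($source);
print_r($directive);
```

Because unknown keys are simply stored, extra parameters such as VERSION
need no parser changes; interpreting them is left to whatever builds the
ConfigSchema internal format.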

The final loss would be that you wouldn't know what file the directive
was used in; with some clever regexps it should be possible to
figure out where $config->get($ns, $d); occurs. The problem of reflective
calls to the configuration object is mitigated by the fact that getBatch
is used, so we can simply talk about that in the namespace definition page.
This might be slow, but it would only happen when we are creating
the documentation for consumption, and is sugar.
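
As a rough idea of what such a regexp could look like, the sketch below
picks out calls whose namespace and directive are string literals. The
function name example_find_directive_uses() and the sample source are
hypothetical; calls that pass variables are deliberately skipped, and a
tokenizer would be more robust than this pattern.

```php
<?php
/**
 * Hypothetical scanner: find literal $config->get('Ns', 'Directive') calls
 * in a chunk of PHP source and return them as "Ns.Directive" strings.
 */
function example_find_directive_uses($code)
{
    // Two quoted words separated by a comma; \1 and \3 match the opening quote.
    $pattern = '/\$config->get\(\s*([\'"])(\w+)\1\s*,\s*([\'"])(\w+)\3\s*\)/';
    preg_match_all($pattern, $code, $matches, PREG_SET_ORDER);
    $found = array();
    foreach ($matches as $m) {
        $found[] = $m[2] . '.' . $m[4];
    }
    return array_values(array_unique($found));
}

$code = <<<'EOT'
$hidden = $config->get('Core', 'HiddenElements');
$tidy   = $config->get("HTML", "TidyLevel");
$again  = $config->get('Core', 'HiddenElements');
$dyn    = $config->get($ns, $d); // reflective call, skipped
EOT;

$uses = example_find_directive_uses($code);
print_r($uses);
```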

We can put this in a schema/ directory, outside of HTML Purifier. The
serialized data gets treated like entities.ser.

The final thing that needs to be handled is user-defined configurations.
They can be added at runtime using ConfigSchema::registerDirectory()
which globs the directory and grabs all of the directives to be incorporated
in. Then, the result is saved. We may want to take advantage of the
DefinitionCache framework, although it is not altogether certain what
configuration directives would be used to generate our key (meta-directives!).

    Further thoughts
    ----------------
    Our master configuration schema will only need to be updated once
    every new version, so it's easily versionable. User specified
    schema files are far more volatile, but it's far too expensive
    to check the filemtimes of all the files, so a DefinitionRev style
    mechanism works better. However, we can uniquely identify the
    schema based on the directories they loaded, so there's no need
    for a DefinitionId until we give them full programmatic control.

    These variables should be directly incorporated into ConfigSchema,
    and ConfigSchema should handle serialization. Some refactoring will be
    necessary for the DefinitionCache classes, as they are built with
    Config in mind. If the user changes something, the cache file gets
    rebuilt. If the version changes, the cache file gets rebuilt. Since
    our unit tests flush the caches before we start, and the operation is
    pretty fast, this will not negatively impact unit testing.
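
One way to get this rebuild behaviour is to fold all three inputs into the
cache key, so that any change simply misses the old cache file. The sketch
below is an assumption about how that might look, not an existing API:
example_schema_cache_key() is a hypothetical name, the revision number
plays the role of a DefinitionRev-style value bumped by the user, and it
assumes the order in which directories are loaded does not matter.

```php
<?php
/**
 * Hypothetical cache key for a serialized schema: library version, a
 * user-bumped revision number, and the set of directories that were loaded.
 */
function example_schema_cache_key($version, $rev, array $dirs)
{
    sort($dirs); // assumption: load order should not change the key
    return $version . ',' . (int) $rev . ',' . md5(implode("\n", $dirs));
}

$dirs = array('/app/schema', '/lib/schema');

$base     = example_schema_cache_key('3.0.0', 1, $dirs);
$reorder  = example_schema_cache_key('3.0.0', 1, array_reverse($dirs));
$bumped   = example_schema_cache_key('3.0.0', 2, $dirs);
$upgraded = example_schema_cache_key('3.1.0', 1, $dirs);

var_dump($base === $reorder);  // same schema, same cache file
var_dump($base === $bumped);   // user bumped the revision: rebuild
var_dump($base === $upgraded); // new library version: rebuild
```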

    One last thing: certain configuration directives require that files
    get added. They may even be specified dynamically. It is not a good idea
    for the HTMLPurifier_Config object to be used directly for such matters.
    Instead, the userland code should explicitly perform the includes. We may
    put in something like:

        REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks

    to indicate that if that class doesn't exist, and the user is attempting
    to use the directive, we should fatally error out. The stub includes the
    core files, and the user includes everything else. Any reflective things
    like new $class would be required to tie in with the configuration.
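
    A check of this kind could be sketched as follows. The function
    example_check_requires() and the class Example_Present_Filter are
    hypothetical, and the sketch throws an exception where the text above
    says "fatally error out", purely so the failure can be caught and
    shown in a demo.

```php
<?php
/**
 * Hypothetical enforcement of a directive's REQUIRES field.
 *
 * @param string $directive Directive name, e.g. "Filter.ExtractStyleBlocks"
 * @param array  $requires  Class names listed under REQUIRES
 * @param bool   $inUse     Whether the user is actually using the directive
 */
function example_check_requires($directive, array $requires, $inUse)
{
    if (!$inUse) {
        return; // directive not in use, nothing to enforce
    }
    foreach ($requires as $class) {
        // class_exists() also gives any registered autoloader a chance.
        if (!class_exists($class)) {
            throw new RuntimeException(
                "Directive $directive requires class $class; include it first"
            );
        }
    }
}

class Example_Present_Filter {} // stand-in for a class the user did include

// Passes silently: the required class is defined.
example_check_requires('Filter.Present', array('Example_Present_Filter'), true);

$error = null;
try {
    example_check_requires(
        'Filter.ExtractStyleBlocks',
        array('HTMLPurifier_Filter_ExtractStyleBlocks'),
        true
    );
} catch (RuntimeException $e) {
    $error = $e->getMessage();
}
echo $error, "\n";
```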

    It would work very well with rarely used configuration options, but it
    wouldn't be so good for "core" parts that can be disabled. In such cases
    the core include file would need to be modified, and the only way
    to properly do this is use the configuration object. Once again, our
    ability to create cache keys saves the day: we can create arbitrary
    stub files for arbitrary configurations and include those. They could
    even be the single file affairs. The only thing we'd need to include,
    then, would be HTMLPurifier_Config! Then, the configuration object would
    load the library.

    An aside...
    -----------
    One questions, however, the wisdom of letting PHP files write other PHP
    files. It seems like a recipe for disaster, or at least lots of headaches
    in highly secured setups, where PHP does not have the ability to write
    to its root. In such cases, we could use sticky bits or tell the user
    to manually generate the file.

    The other troublesome bit is actually doing the calculations necessary.
    For certain cases, it's simple (such as URIScheme), but for AttrDef
    and HTMLModule the dependency trees are very complex in relation to
    %HTML.Allowed and friends. I think that this idea should be shelved
    and looked at at a later, less insane date.

    An interesting dilemma presents itself when a configuration form is
    offered to the user. Normally, the configuration object is not accessible
    without editing PHP code; this facility changes things. The sensible
    thing to do is stipulate that all classes required by the directives you
    allow must be included.

Unit testing
------------

Setting up the parsing and translation into our existing format would not
be difficult to do. It might represent a good time for us to rethink our
tests for these facilities; as creative as they are, they are often hacky
and require public visibility for things that ought to be protected.
This is especially applicable for our DefinitionCache tests.

Migration
---------

Because we are not *adding* anything essentially new, it should be trivial
to write a script to take our existing data and dump it into the new format.
Well, not trivial, but fairly easy to accomplish. Primary implementation
difficulties would probably involve formatting the file nicely.

Backwards-compatibility
-----------------------

I expect that the ConfigSchema methods should stick around for a little bit,
but display E_USER_NOTICE warnings that they are deprecated. This will
require documentation!
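
In practice that could be as simple as a trigger_error() call at the top
of each legacy method. The sketch below uses a hypothetical stand-in class,
Example_LegacySchema, rather than the real ConfigSchema, and installs a
temporary error handler only so the demo can show the notice a caller
would receive.

```php
<?php
/**
 * Hypothetical stand-in for a legacy schema entry point that still
 * works but tells callers to migrate.
 */
class Example_LegacySchema
{
    public static $defined = array();

    public static function define($namespace, $name, $default)
    {
        trigger_error(
            __METHOD__ . ' is deprecated; describe directives in schema files instead',
            E_USER_NOTICE
        );
        // The old behaviour is preserved so existing code keeps working.
        self::$defined["$namespace.$name"] = $default;
    }
}

// Capture notices so the demo can show what a caller would see.
$notices = array();
set_error_handler(function ($errno, $errstr) use (&$notices) {
    $notices[] = $errstr;
    return true; // handled; skip PHP's default output
}, E_USER_NOTICE);

Example_LegacySchema::define('Core', 'HiddenElements', array('script', 'style'));
restore_error_handler();

echo $notices[0], "\n";
```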

New stuff
---------

VERSION: Version number in which the directive was introduced
DEPRECATED-VERSION: If the directive was deprecated, when was it deprecated?
DEPRECATED-USE: If the directive was deprecated, what should the user use now?
REQUIRES: What classes does this configuration directive require, but are
    not part of the HTML Purifier core?

    vim: et sw=4 sts=4