Considerations for ErrorCollection

Presently, HTML Purifier takes a code-execution-centric approach to handling
errors. Errors are organized and grouped according to which segment of the
code triggers them, not necessarily the portion of the input document that
triggered the error. This means that errors are pseudo-sorted by category,
rather than by location in the document.

One easy way to "fix" this problem would be to re-sort according to line number.
However, the "category" style information we derive from naively following
program execution is still useful. After all, each of the strategies which
can report errors still processes the document mostly linearly. Furthermore,
not only do they process linearly, but the way they pass off operations to
sub-systems mirrors the structure of the document. For example, AttrValidator
will proceed linearly through elements, and on each element will use AttrDef to
validate its attributes. From there, the attribute might have more
sub-components, which have execution passed off accordingly.

In fact, each strategy handles a very specific class of "error."

    RemoveForeignElements   - element tokens
    MakeWellFormed          - element token ordering
    FixNesting              - element token ordering
    ValidateAttributes      - attributes of elements

The crucial point is that while we care about the hierarchy governing these
different errors, we *don't* care about any other information about what actually
happens to the elements. This brings up another point: if HTML Purifier fixes
something, this is not really a notice/warning/error; it's really a suggestion
of a way to fix the aforementioned defects.

In short, the refactoring to take this into account kinda sucks.

Errors should not be recorded in the order that they are reported. Instead, they
should be bound to the line (and preferably element) in which they were found.
This means we need some way to uniquely identify every element in the document,
which doesn't presently exist. An easy way of adding this would be to track
line and column numbers. An important ramification of this is that we *must*
use the DirectLex implementation.

    1. Implement column numbers for DirectLex [DONE!]
    2. Disable error collection when not using DirectLex [DONE!]

Next, we need to re-orient all of the error declarations to give CurrentToken
the utmost importance. Since this is passed via Context, it's not always clear
whether it's available. ErrorCollector should complain HARD if it isn't available.
There are some locations where we don't have a token available. These include:

    * Lexing - this can actually have a row and column, but NOT correspond to
      a token
    * End of document errors - bump this to the end

Actually, we *don't* have to complain if CurrentToken isn't available; we just
set it as a document-wide error. And actually, nothing needs to be done here.

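As a rough illustration of that fallback, the sketch below resolves an error's
location from a context. The function name resolveErrorLocation, the plain-array
context, and the 'line'/'col' keys are assumptions made for this example; they
are not HTML Purifier's actual API.

```php
<?php
// Hypothetical helper: pick a location for an error from the context.
// $context stands in for the Context object; a token is modeled as an
// array with 'line' and 'col' keys (both assumptions for this sketch).
function resolveErrorLocation(array $context): array
{
    $token = $context['CurrentToken'] ?? null;
    if ($token !== null) {
        // Normal case: bind the error to the current token's position.
        return ['scope' => 'token', 'line' => $token['line'], 'col' => $token['col']];
    }
    if (isset($context['line'], $context['col'])) {
        // Lexing errors: a position exists but does not correspond to a token.
        return ['scope' => 'position', 'line' => $context['line'], 'col' => $context['col']];
    }
    // No token and no position: record a document-wide error, do not complain.
    return ['scope' => 'document', 'line' => null, 'col' => null];
}

$withToken = resolveErrorLocation(['CurrentToken' => ['line' => 3, 'col' => 14]]);
$lexing    = resolveErrorLocation(['line' => 5, 'col' => 2]);
$without   = resolveErrorLocation([]);
```

End-of-document errors would fall into the last branch here and could then be
bumped to the end when the report is rendered.
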
Something interesting to consider is whether or not we care about the locations
of attributes and CSS properties, i.e. the sub-objects that compose these things.
In terms of consistency, at the very least attributes should have column/line
numbers attached to them. However, this may be overkill, as attributes are
uniquely identifiable. You could go even further with CSS properties, but they
are also uniquely identifiable.

The bottom line, however, is that this information must be available, in the form
of the CurrentAttribute and CurrentCssProperty (theoretical) context variables,
and it must be used to organize the errors that the sub-processes may throw.
There is also a hierarchy of sorts that might have made merging these into one
context variable make more sense, were it not for HTML's reasonably rigid
structure. A CSS property will never contain an HTML attribute. So we won't ever
get recursive relations, and having multiple depths won't ever make sense. Leave
this be.

We already have this information, and consequently, using start and end is
*unnecessary*, so long as the context variables are set appropriately. We don't
care if an error was thrown by an attribute transform or an attribute definition;
to the end user these are the same (for a developer, they are different, but
they're better off with a stack trace (which we should add support for) in such
cases).

    3. Remove start()/end() code. Don't get rid of recursion, though [DONE]
    4. Set up ErrorCollector to use context information to set up hierarchies.
       This may require a different internal format. Use objects if it gets
       complex. [DONE]

    ASIDE
        More on this topic: since we are now binding errors to lines
        and columns, a particular error can have three relationships to that
        specific location:

        1. The token at that location directly
            RemoveForeignElements
            AttrValidator (transforms)
            MakeWellFormed
        2. A "component" of that token (i.e. attribute)
            AttrValidator (removals)
        3. A modification to that node (i.e. contents from start to end
           token) as a whole
            FixNesting

        This needs to be marked accordingly. In the presentation, it might
        make sense to keep (3) separate and have (2) as a sublist of (1). (1)
        can be a closing tag, in which case (3) makes no sense at all, OR it
        should be related to its opening tag (this may not necessarily
        be possible before MakeWellFormed is run).

        So, the line and column count as our identifier, so:

            $errors[$line][$col] = ...

        Then, we need to identify case 1, 2 or 3. They are identified as
        such:

        1. Need some sort of semaphore in RemoveForeignElements, etc.
        2. If CurrentAttr/CurrentCssProperty is non-null
        3. Default (FixNesting, MakeWellFormed)

        One consideration about (1) is that it usually is actually a
        (3) modification, but we have no way of knowing about that because
        of various optimizations. However, they can probably be treated
        the same. The other difficulty is that (3) is never a line and
        column; rather, it is a range (i.e. a pair of positions), and telling
        the user the very start of the range may confuse them. For example,

            <b>Foo<div>bar</div></b>
            ^     ^

        The node being operated on is <b>, so the error would be assigned
        to the first caret, with a "node reorganized" error. Then, the
        ChildDef would have submitted its own suggestions and errors with
        regard to what's going on in the internals. So I suppose this is
        ok. :-)

        Now, the structure of the earlier mentioned ... would be something
        like this:

            object {
                type = (token|attr|property),
                value, // appropriate for type
                errors = array(),
                sub-errors = [recursive],
            }

        This helps us keep things agnostic. It is also complex enough to
        warrant an object.

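The three-way identification described in the aside could be sketched roughly
as follows. Everything here is hypothetical: the classifyError function, the
$directTokenError flag standing in for the "semaphore", the REL_* constants,
and the array-based context are illustrative assumptions, not existing code.
The checks run in the order the aside lists them.

```php
<?php
const REL_TOKEN     = 1; // the token at that location directly
const REL_COMPONENT = 2; // a component of the token (attribute, CSS property)
const REL_NODE      = 3; // a modification to the node as a whole

// Decide which relationship an error has to its line/column location.
function classifyError(array $context, bool $directTokenError = false): int
{
    // Case 1: the strategy raised its flag (the "semaphore").
    if ($directTokenError) {
        return REL_TOKEN;
    }
    // Case 2: a sub-component is currently being processed.
    if (($context['CurrentAttr'] ?? null) !== null
        || ($context['CurrentCssProperty'] ?? null) !== null) {
        return REL_COMPONENT;
    }
    // Case 3: default.
    return REL_NODE;
}

// Errors keyed by line, then column, as in $errors[$line][$col].
function recordError(array &$errors, int $line, int $col, int $relation, string $message): void
{
    $errors[$line][$col][] = ['relation' => $relation, 'message' => $message];
}

$errors = [];
// Positions below are illustrative only.
recordError($errors, 1, 0, classifyError([]), 'node reorganized');
recordError($errors, 1, 6, classifyError(['CurrentAttr' => 'style']), 'attribute removed');
```

Storing a list at each position, rather than a single value, allows several
errors to share one line and column.
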
So, more hand-wringing about the object format is in order. The way HTML Purifier
is currently set up, the only possible hierarchy is:

    token -> attr -> css property

These relations do not exist all of the time; a comment or end token would not
ever have any attributes, and non-style attributes would never have CSS properties
associated with them.

I believe that it is worth supporting multiple paths. At some point, we might
have a hierarchy like:

    * -> syntax
      -> token -> attr -> css property
                       -> url
               -> css stylesheet <style>

et cetera. Now, one of the practical implications of this is that every "node"
on our tree is well-defined, so in theory it should be possible to either 1.
create a separate class for each error struct, or 2. embed this information
directly into HTML Purifier's token stream. Embedding the information in the
token stream is not a terribly good idea, since tokens can be removed, etc.
So that leaves us with 1... and if we use a generic interface we can cut down
on a lot of code we might need. So let's leave it like this.

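A minimal sketch of such a generic node, following the object outline given
earlier (type, value, errors, sub-errors). The class name ErrorNode, its
property names, and the getChild method are assumptions chosen for this
example; the actual class may look quite different.

```php
<?php
// Hypothetical error node: one per token, attribute, or CSS property.
class ErrorNode
{
    public string $type;          // 'token', 'attr', or 'property'
    public $value;                // whatever is appropriate for the type
    public array $errors = [];    // errors reported directly on this node
    public array $children = [];  // sub-errors, keyed by type, then identifier

    public function __construct(string $type, $value = null)
    {
        $this->type = $type;
        $this->value = $value;
    }

    public function addError(int $severity, string $message): void
    {
        $this->errors[] = [$severity, $message];
    }

    // Fetch the child node for a sub-component, creating it on first use.
    public function getChild(string $type, string $id): ErrorNode
    {
        if (!isset($this->children[$type][$id])) {
            $this->children[$type][$id] = new ErrorNode($type, $id);
        }
        return $this->children[$type][$id];
    }
}

// token -> attr -> css property
$token = new ErrorNode('token', '<a style="color:red;foo:bar">');
$token->getChild('attr', 'style')
      ->getChild('property', 'foo')
      ->addError(E_WARNING, 'Unrecognized CSS property');
```

Because children are keyed by type first, a new path such as "url" under an
attribute would not need a new class, only a new type string.
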
~~~~

Then we set up suggestions.

    5. Set up a separate error class which tells the user any modifications
       HTML Purifier made.

Some information about this:

Our current paradigm is to tell the user what HTML Purifier did to the HTML.
This is the most natural mode of operation, since that's what HTML Purifier
is all about; it was not meant to be a validator.

However, most other people have experience dealing with a validator. In cases
where HTML Purifier unambiguously does the right thing, simply giving the user
the correct version isn't a bad idea, but problems arise when:

- The user has such bad HTML that we do something odd, when we should have just
  flagged the HTML as an error. One example is when we remove text from
  directly inside a <table> tag. It was probably meant to
  be in a <td> tag or be outside the table, but we're not smart enough to
  realize this so we just remove it. In such a case, we should tell the user
  that there was foreign data in the table, but then we shouldn't "demand"
  that the user remove the data; it's more of a "here's a possible way of
  rectifying the problem".

- Giving line context for input is hard enough, but feasible; giving output
  line context will be extremely difficult due to shifting lines; we'd probably
  have to track what the tokens are and then find the appropriate output context,
  and it's not guaranteed to work, etc.

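One way to picture the split between errors and suggestions, using the <table>
case above. The Suggestion class, its fields, and the array shape of the report
are hypothetical names invented for this sketch; the position values are
placeholders.

```php
<?php
// Hypothetical record of what the purifier did, kept apart from the error itself.
class Suggestion
{
    public string $action; // what was done to the input
    public string $note;   // phrased as a possible fix, not a demand

    public function __construct(string $action, string $note)
    {
        $this->action = $action;
        $this->note = $note;
    }
}

// The error states the defect; the suggestion describes one possible remedy.
$report = [
    'line'       => 12, // illustrative input position
    'col'        => 4,
    'error'      => 'Foreign data found directly inside <table>',
    'suggestion' => new Suggestion(
        'removed text',
        'Possible fix: move the text into a <td> or outside the table'
    ),
];
```

Only the input position is stored here, which sidesteps the output line context
problem described in the second point.
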
````````````

Don't forget to spruce up output.

    6. Output needs to automatically give line and column numbers, basically
       "at line" on steroids. Look at W3C's output; it's ok. [PARTIALLY DONE]

       - We need a standard CSS to apply (check demo.css for some starting
         styling; some buttons would also be hip)

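A rough sketch of what line-and-column output could look like, assuming errors
are stored as $errors[$line][$col] lists of messages. The formatErrors function
and the message wording are assumptions for illustration and do not reflect the
actual output format.

```php
<?php
// Flatten $errors[$line][$col] into display lines, ordered by position.
function formatErrors(array $errors): array
{
    $lines = [];
    ksort($errors); // order by line number
    foreach ($errors as $line => $cols) {
        ksort($cols); // then by column within the line
        foreach ($cols as $col => $messages) {
            foreach ($messages as $message) {
                $lines[] = sprintf('Line %d, column %d: %s', $line, $col, $message);
            }
        }
    }
    return $lines;
}

$errors = [
    7 => [3 => ['Attribute removed']],
    2 => [10 => ['Tag closed by document end'], 1 => ['Foreign element removed']],
];
echo implode("\n", formatErrors($errors)), "\n";
```

Sorting at render time means errors can be recorded in whatever order the
strategies report them.
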
    vim: et sw=4 sts=4