Maven Doxia
DOXIA-226

Make XML based parsers better handle whitespace

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Regarding whitespace in XML documents, one needs to consider the following aspects:

      • ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent
      • collapsible whitespace, i.e. view "Text   Text" and "Text Text" as equivalent
      • trimmable whitespace, i.e. view "<p> Text </p>" and "<p>Text</p>" as equivalent
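
      As a rough illustration of the three notions above, the following minimal Java sketch applies each of them to a plain string. The class and method names are hypothetical and not part of Doxia, and the sketch assumes that "whitespace" means whatever the Java regex class \s and String.trim() treat as whitespace.

```java
public class WhitespaceKinds {

    /** Returns true if the text consists solely of whitespace, i.e. it could be ignored between elements. */
    public static boolean isIgnorable(String text) {
        return text.trim().isEmpty();
    }

    /** Collapses every run of whitespace into a single space. */
    public static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    /** Removes leading and trailing whitespace. */
    public static String trim(String text) {
        return text.trim();
    }

    public static void main(String[] args) {
        // The text node between <tr> and <td/> in "<tr> <td/> </tr>"
        System.out.println(isIgnorable(" "));          // prints "true"
        System.out.println(collapse("Text   Text"));   // prints "Text Text"
        System.out.println("[" + trim(" Text ") + "]"); // prints "[Text]"
    }
}
```

      Whether a given text node may actually be ignored, collapsed, or trimmed depends on the enclosing element, which is exactly the knowledge a schema or the parser itself has to supply.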

      These distinctions require a DTD/XSD in combination with a validating parser, application-specific knowledge, or both. For robustness, Doxia parsers for XML-based formats should not depend on the existence of a schema definition in order to reliably deliver events into the sinks. Hence I suggest hard-coding the required logic for proper whitespace handling into each parser.

      Currently, whitespace handling is rather static, e.g. XhtmlBaseParser pushes all input whitespace into the sink. This might cause trouble with sinks that do not expect to receive ignorable whitespace. To address this issue, it would help if AbstractXmlParser provided a default implementation of handleText() that subclasses can control via state flags instead of implementing handleText() from scratch in each parser. Copy and paste, which caused DOXIA-225, needs to be avoided.

      More precisely, I imagine the following changes:

      • Have AbstractXmlParser maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element
      • Have AbstractXmlParser push a tuple onto this stack before calling handleStartTag() and pop it after calling handleEndTag()
      • Have AbstractXmlParser provide setters to allow subclasses to control the desired whitespace handling in their handleStartTag() implementation
      • Have AbstractXmlParser implement handleText() where it evaluates the top-most tuple from the stack
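
      The four changes listed above could be sketched roughly as follows. This is a self-contained toy, not the real AbstractXmlParser: the class WhitespaceStackSketch, the Policy tuple, the setter names, the string-based sink, and the per-element rules in handleStartTag() are all assumptions made for illustration. It also assumes that a newly pushed tuple starts with all flags off, which corresponds to passing whitespace through unchanged.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WhitespaceStackSketch {

    /** The (ignorable, collapsible, trimmable) tuple for one element. */
    private static final class Policy {
        boolean ignorable;
        boolean collapsible;
        boolean trimmable;
    }

    private final Deque<Policy> stack = new ArrayDeque<Policy>();
    private final StringBuilder sink = new StringBuilder(); // stand-in for a real sink

    // Hypothetical setters that a subclass would call from handleStartTag().
    protected void setIgnorable(boolean value) { stack.peek().ignorable = value; }
    protected void setCollapsible(boolean value) { stack.peek().collapsible = value; }
    protected void setTrimmable(boolean value) { stack.peek().trimmable = value; }

    /** Pushes a fresh tuple, then lets the (sub)class configure it. */
    public void startElement(String name) {
        stack.push(new Policy());
        handleStartTag(name);
    }

    /** Lets the (sub)class react to the end tag, then pops the tuple. */
    public void endElement(String name) {
        handleEndTag(name);
        stack.pop();
    }

    /** Default text handling driven only by the top-most tuple. */
    public void text(String text) {
        Policy policy = stack.peek();
        if (policy == null) {
            return; // text outside the root element is dropped in this sketch
        }
        if (policy.ignorable && text.trim().isEmpty()) {
            return;
        }
        if (policy.collapsible) {
            text = text.replaceAll("\\s+", " ");
        }
        if (policy.trimmable) {
            text = text.trim();
        }
        sink.append(text);
    }

    // Toy rules, assumed for the demo: rows ignore blank text, paragraphs collapse and trim.
    protected void handleStartTag(String name) {
        if ("tr".equals(name)) {
            setIgnorable(true);
        } else if ("p".equals(name)) {
            setCollapsible(true);
            setTrimmable(true);
        }
        sink.append('<').append(name).append('>');
    }

    protected void handleEndTag(String name) {
        sink.append("</").append(name).append('>');
    }

    public String output() {
        return sink.toString();
    }

    public static void main(String[] args) {
        WhitespaceStackSketch parser = new WhitespaceStackSketch();
        // Events for "<p> Text   Text </p>"
        parser.startElement("p");
        parser.text(" Text   Text ");
        parser.endElement("p");
        System.out.println(parser.output()); // prints "<p>Text Text</p>"
    }
}
```

      Note that trimming each text event on its own is only adequate for elements without mixed content. For input such as "<p> Text <b>x</b> more </p>", a real implementation would presumably need to trim only at the element's boundaries.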

            People

            • Assignee: Unassigned
            • Reporter: Benjamin Bentmann