Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Auto Closed
Description
Regarding whitespace in XML documents, one needs to consider the following aspects:
- ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent
- collapsible whitespace, i.e. view "Text    Text" (a run of several spaces) and "Text Text" (a single space) as equivalent
- trimmable whitespace, i.e. view "<p> Text </p>" and "<p>Text</p>" as equivalent
Those distinctions require a DTD/XSD in combination with a validating parser and/or application-specific knowledge. For robustness, Doxia parsers for XML-based formats should not depend on the existence of a schema definition in order to reliably deliver events into the sinks. Hence I suggest hard-coding the required logic for proper whitespace handling into each parser.
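To make the three categories concrete, here is a minimal Java sketch of what each one means for a single chunk of character data. The class and method names (`WhitespaceKinds`, `isIgnorable`, `collapse`, `trim`) are hypothetical and not part of Doxia. The sketch assumes the XML 1.0 definition of whitespace (space, tab, carriage return, line feed).

```java
/** Hypothetical helper illustrating the three whitespace categories; not Doxia API. */
final class WhitespaceKinds {
    // Assumption: XML 1.0 whitespace is space, tab, carriage return and line feed.
    private static final String WS = "[ \\t\\r\\n]";

    private WhitespaceKinds() {
    }

    /** Ignorable: the text consists of whitespace only and may be dropped. */
    static boolean isIgnorable(String text) {
        return text.matches(WS + "*");
    }

    /** Collapsible: every run of whitespace is replaced by a single space. */
    static String collapse(String text) {
        return text.replaceAll(WS + "+", " ");
    }

    /** Trimmable: leading and trailing whitespace is removed. */
    static String trim(String text) {
        return text.replaceAll("\\A" + WS + "+|" + WS + "+\\z", "");
    }
}
```

Whether a given element's text is subject to any of these operations is exactly the application-specific knowledge that would have to be hard-coded per parser.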
Currently, whitespace handling is rather static, e.g. XhtmlBaseParser pushes all input whitespace into the sink. This might cause trouble with sinks that do not expect to receive ignorable whitespace. To address this issue, it would help if AbstractXmlParser provided a default implementation of handleText() that subclasses can control via state flags instead of implementing handleText() from scratch in each parser. Copy and paste, which caused DOXIA-225, needs to be avoided.
More precisely, I imagine the following changes:
- Have AbstractXmlParser maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element
- Have AbstractXmlParser push/pop a tuple from this stack before/after calling handleStartTag()/handleEndTag()
- Have AbstractXmlParser provide setters to allow subclasses to control the desired whitespace handling in their handleStartTag() implementation
- Have AbstractXmlParser implement handleText() where it evaluates the top-most tuple from the stack
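The four changes listed above could be sketched roughly as follows. This is a standalone illustration, not the actual AbstractXmlParser: the classes `WhitespaceStackSketch` and `XhtmlLikeSketch`, the `Policy` tuple, the setter names, and the `emitted` list standing in for a sink are all assumptions made for the example. It also assumes that a freshly pushed tuple has all flags off, so text passes through unchanged unless a subclass says otherwise.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the proposed base-class logic; not Doxia code. */
class WhitespaceStackSketch {
    /** One (ignorable, collapsible, trimmable) tuple per open element. */
    static final class Policy {
        boolean ignorable;
        boolean collapsible;
        boolean trimmable;
    }

    private static final String WS = "[ \\t\\r\\n]";
    private final Deque<Policy> stack = new ArrayDeque<>();

    /** Stands in for the sink: collects the text that would be emitted. */
    final List<String> emitted = new ArrayList<>();

    // Setters for subclasses; only valid while an element is open.
    protected void setIgnorable(boolean value) { stack.peek().ignorable = value; }
    protected void setCollapsible(boolean value) { stack.peek().collapsible = value; }
    protected void setTrimmable(boolean value) { stack.peek().trimmable = value; }

    /** Pushes a fresh tuple, then lets the subclass configure it. */
    final void startElement(String name) {
        stack.push(new Policy());
        handleStartTag(name);
    }

    /** Lets the subclass react, then discards the element's tuple. */
    final void endElement(String name) {
        handleEndTag(name);
        stack.pop();
    }

    protected void handleStartTag(String name) { }

    protected void handleEndTag(String name) { }

    /** Default text handling driven by the top-most tuple. */
    final void handleText(String text) {
        Policy policy = stack.isEmpty() ? new Policy() : stack.peek();
        if (policy.ignorable && text.matches(WS + "*")) {
            return; // whitespace-only text between child elements
        }
        String result = text;
        if (policy.collapsible) {
            result = result.replaceAll(WS + "+", " ");
        }
        if (policy.trimmable) {
            result = result.replaceAll("\\A" + WS + "+|" + WS + "+\\z", "");
        }
        if (!result.isEmpty()) {
            emitted.add(result);
        }
    }
}

/** Hypothetical subclass that only sets flags instead of reimplementing handleText(). */
class XhtmlLikeSketch extends WhitespaceStackSketch {
    @Override
    protected void handleStartTag(String name) {
        if (name.equals("table") || name.equals("tr")) {
            setIgnorable(true);
        } else if (name.equals("p")) {
            setCollapsible(true);
            setTrimmable(true);
        }
    }
}
```

With this shape, a subclass expresses its knowledge of the format purely through flags in handleStartTag(), and elements it does not mention (for example a preformatted block) keep their whitespace untouched.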
Issue Links
- is depended upon by: DOXIA-405 The generated xhtml document has the entire content on a single line (Closed)
- is duplicated by: DOXIA-251 The AbstractXmlParser should take care of EOL (Closed)
- is related to: DOXIA-577 Handle whitespace in tables properly in ConfluenceSink (Closed)
- relates to: DOXIA-263 Improve validation of input documents (Closed)