Details
Bug
Status: Resolved
Resolution: Fixed
2.6.2
None
None
Operating System: Other
Platform: Other
27583
Description
When Xerces (XMLReader.parse()) encounters malformed Unicode data, such as an
invalid UTF-8 sequence, it throws an IOException, more specifically a
UTFDataFormatException or a CharConversionException. However, according to the
SAX and XML specifications, this should be a SAXException that is reported to
the ErrorHandler's fatalError() method.
Note first that the XML spec states, in section 4.3.3:
It is a fatal error when an XML processor encounters an entity with an encoding
that it is unable to process. It is a fatal error if an XML entity is determined
(via default, encoding declaration, or higher-level protocol) to be in a certain
encoding but contains byte sequences that are not legal in that encoding.
Specifically, it is a fatal error if an entity encoded in UTF-8 contains any
irregular code unit sequences, as defined in Unicode 3.1 [Unicode3]. Unless an
encoding is determined by a higher-level protocol, it is also a fatal error if
an XML entity contains no encoding declaration and its content is not legal
UTF-8 or UTF-16.
The SAX spec says of the fatalError() method, "This corresponds to the
definition of 'fatal error' in section 1.2 of the W3C XML 1.0 Recommendation.
For example, a parser would use this callback to report the violation of a
well-formedness constraint." At one point I thought it was OK to report this as
an IOException. However, since the XML spec is unambiguous that character
encoding errors are fatal errors, and since the SAX spec does not limit fatal
errors to well-formedness errors, I think character encoding errors should be
reported as SAXExceptions rather than IOExceptions, and should be reported to
the fatalError() method.
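
To illustrate the requested behavior from the caller's side, here is a minimal
sketch of a wrapper that converts the two encoding-related IOExceptions into a
SAXParseException and passes it to the registered ErrorHandler. This is not the
Xerces fix. The class name EncodingErrorAdapter and its parse() helper are
hypothetical, and the sketch assumes the parser signals malformed bytes with a
CharConversionException or UTFDataFormatException (or a subclass). Line and
column are set to -1 because the wrapper has no locator information.

```java
import java.io.CharConversionException;
import java.io.IOException;
import java.io.UTFDataFormatException;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of Xerces or SAX.
final class EncodingErrorAdapter {
    private EncodingErrorAdapter() {}

    // Parses the input and reports character encoding failures as SAX fatal errors.
    static void parse(XMLReader reader, InputSource input)
            throws IOException, SAXException {
        try {
            reader.parse(input);
        } catch (CharConversionException | UTFDataFormatException e) {
            // Wrap the I/O-level failure; -1 means line and column are unknown.
            SAXParseException fatal = new SAXParseException(
                    String.valueOf(e.getMessage()),
                    input.getPublicId(), input.getSystemId(), -1, -1, e);
            ErrorHandler handler = reader.getErrorHandler();
            if (handler != null) {
                handler.fatalError(fatal); // the handler may itself throw
            }
            // A fatal error ends the parse even if the handler returns normally.
            throw fatal;
        }
    }
}
```

Other IOExceptions (for example a failed read from the underlying stream) pass
through unchanged, so genuine I/O problems stay distinguishable from encoding
errors. If the parser already reports the problem as a SAXParseException, the
wrapper does nothing extra.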