  1. Xerces2-J
  2. XERCESJ-913

Xerces throws IOExceptions that should be SAXExceptions for bad UTF-8 and similar

Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Fixed
    • Affects Version/s: 2.6.2
    • Fix Version/s: None
    • Component/s: SAX
    • Labels: None
    • Environment: Operating System: Other
      Platform: Other
    • Bugzilla Id: 27583

    Description

      When Xerces (XMLReader.parse()) encounters malformed Unicode data, such as an
      invalid UTF-8 sequence, it throws an IOException, more specifically a
      UTFDataFormatException or a CharConversionException. However, according to the
      SAX and XML specifications, this should be a SAXException that is reported to
      the ErrorHandler's fatalError() method.
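
      The behavior described above can be checked with a small probe. This is a
      minimal sketch and not code from the original report: the class
      MalformedUtf8Probe, its RecordingHandler, and the eight-byte test document
      are hypothetical, and the XMLReader is obtained through the standard JAXP
      SAXParserFactory, so it exercises whichever parser is on the classpath.
      Which branch is taken depends on the parser and its version.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe class; the names here are illustrative only.
class MalformedUtf8Probe {

    /** Records whether the parser invoked fatalError(), then rethrows. */
    static final class RecordingHandler extends DefaultHandler {
        boolean fatalErrorCalled = false;

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            fatalErrorCalled = true;
            throw e;
        }
    }

    /** A document with no encoding declaration whose content holds the byte 0xFF. */
    static byte[] malformedDocument() {
        // 0xFF never occurs in well-formed UTF-8, so this entity is not legal UTF-8.
        return new byte[] {'<', 'a', '>', (byte) 0xFF, '<', '/', 'a', '>'};
    }

    /** Parses the bytes and describes how the parser reported any problem. */
    static String probe(byte[] document) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        RecordingHandler handler = new RecordingHandler();
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new ByteArrayInputStream(document)));
            return "no error";
        } catch (SAXException e) {
            // The outcome this report asks for.
            return "SAXException, fatalError called: " + handler.fatalErrorCalled;
        } catch (IOException e) {
            // The outcome this report describes as a bug.
            return "IOException (" + e.getClass().getName()
                    + "), fatalError called: " + handler.fatalErrorCalled;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(probe(malformedDocument()));
    }
}
```

      A result beginning with "IOException" reproduces the problem reported
      here; a result beginning with "SAXException" with fatalError called
      indicates the parser already reports the error through the ErrorHandler.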

      First, note that the XML spec states, in section 4.3.3:

      It is a fatal error when an XML processor encounters an entity with an encoding
      that it is unable to process. It is a fatal error if an XML entity is determined
      (via default, encoding declaration, or higher-level protocol) to be in a certain
      encoding but contains byte sequences that are not legal in that encoding.
      Specifically, it is a fatal error if an entity encoded in UTF-8 contains any
      irregular code unit sequences, as defined in Unicode 3.1 [Unicode3]. Unless an
      encoding is determined by a higher-level protocol, it is also a fatal error if
      an XML entity contains no encoding declaration and its content is not legal
      UTF-8 or UTF-16.

      The SAX spec says of the fatalError() method, "This corresponds to the
      definition of 'fatal error' in section 1.2 of the W3C XML 1.0 Recommendation.
      For example, a parser would use this callback to report the violation of a
      well-formedness constraint." At one point I thought it was OK to report this as
      an IOException. However, since the XML spec is unambiguous that character
      encoding errors are fatal errors, and since the SAX spec does not limit fatal
      errors to well-formedness errors, I think character encoding errors should be
      reported as SAXExceptions rather than IOExceptions, and should be reported to
      the fatalError() method.
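
      Until a parser behaves this way, an application could approximate the
      requested behavior at the call site. The following is only a sketch of
      that idea under stated assumptions, not the fix applied in Xerces: the
      class EncodingErrorAdapter and its parse() helper are hypothetical, the
      handler is assumed to be non-null, and line and column numbers are set
      to -1 because the IOException carries no location information.

```java
import java.io.CharConversionException;
import java.io.IOException;
import java.io.UTFDataFormatException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical call-site workaround; not part of Xerces or SAX.
class EncodingErrorAdapter {

    /**
     * Parses the input and converts character-encoding IOExceptions into a
     * SAXParseException that is first passed to handler.fatalError().
     * Other IOExceptions (for example, a failed read) propagate unchanged.
     */
    static void parse(XMLReader reader, InputSource input, ErrorHandler handler)
            throws IOException, SAXException {
        reader.setErrorHandler(handler);
        try {
            reader.parse(input);
        } catch (CharConversionException | UTFDataFormatException e) {
            // Location is unknown here, so line and column are -1.
            SAXParseException fatal = new SAXParseException(
                    e.getMessage(), input.getPublicId(), input.getSystemId(), -1, -1, e);
            handler.fatalError(fatal);
            // Stop even if the handler chose not to throw.
            throw fatal;
        }
    }
}
```

      If the parser already reports the encoding error as a SAXException, the
      catch block is never entered and the handler is not called a second
      time by this helper.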

          People

            Assignee: Unassigned
            Reporter: elharo@metalab.unc.edu