  1. XalanJ2
  2. XALANJ-2419

Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: Serialization
    • Labels: None

    Description

      org.apache.xml.serializer.ToStream contains the following code:

          else if (m_encodingInfo.isInEncoding(ch))
          {
              // If the character is in the encoding, and
              // not in the normal ASCII range, we also
              // just leave it get added on to the clean characters
          }
          else
          {
              // This is a fallback plan, we should never get here
              // but if the character wasn't previously handled
              // (i.e. isn't in the encoding, etc.) then what
              // should we do? We choose to write out an entity
              writeOutCleanChars(chars, i, lastDirtyCharProcessed);
              writer.write("&#");
              writer.write(Integer.toString(ch));
              writer.write(';');
              lastDirtyCharProcessed = i;
          }

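      This causes the wrong (latter) branch to run for surrogates, because isInEncoding() returns false for surrogates when the encoding is UTF-8. Each half of a surrogate pair is then written as its own NCR. Escaping a surrogate as an NCR is always wrong, regardless of encoding: an XML character reference must refer to a legal XML character, and the surrogate code points U+D800 to U+DFFF are not legal XML characters.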
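
One way to avoid the problem is to combine a surrogate pair into a single code point before escaping it. The following is a minimal sketch of that idea only; it is not the code from the attached patch, and the class and method names (SurrogateAwareEscaper, appendNcr, escape) are hypothetical and are not part of Xalan. For UTF-8 output the pair could also simply be left among the clean characters, since UTF-8 can encode every astral character directly.

```java
// Hypothetical helper, not part of Xalan: pairs surrogates before writing an NCR.
final class SurrogateAwareEscaper {

    /**
     * Appends one numeric character reference for the character starting at
     * chars[i]. If chars[i] and chars[i + 1] form a surrogate pair, both are
     * consumed and a single NCR for the combined code point is written.
     * Returns the index of the last char consumed.
     */
    static int appendNcr(char[] chars, int i, int end, StringBuilder out) {
        char ch = chars[i];
        int codePoint = ch;
        int last = i;
        if (Character.isHighSurrogate(ch) && i + 1 < end
                && Character.isLowSurrogate(chars[i + 1])) {
            // Combine the two UTF-16 code units into one scalar value.
            codePoint = Character.toCodePoint(ch, chars[i + 1]);
            last = i + 1;
        } else if (Character.isSurrogate(ch)) {
            // A lone surrogate cannot be represented in well-formed XML.
            throw new IllegalArgumentException(
                    "Unpaired surrogate: 0x" + Integer.toHexString(ch));
        }
        out.append("&#").append(codePoint).append(';');
        return last;
    }

    /** Escapes every character of s as an NCR (for illustration only). */
    static String escape(String s) {
        char[] chars = s.toCharArray();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            i = appendNcr(chars, i, chars.length, out);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D49F is stored in a Java String as the pair D835 DC9F.
        System.out.println(escape("\uD835\uDC9F")); // prints "&#119967;"
    }
}
```

With the code quoted in the description, the same input would instead be written as the two references &#55349;&#56479;.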

      The practical effect of this bug is that any document containing astral characters is serialized as ill-formed XML, which an XML parser refuses to parse back.
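
The effect can be checked with the JDK's built-in parser. This is a small sketch, assuming the default JAXP implementation shipped with the JDK; the class name NcrWellFormednessCheck and the method parses are hypothetical. It parses one document that references an astral character as a pair of surrogate NCRs and one that references it as a single NCR.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical check, not part of Xalan.
final class NcrWellFormednessCheck {

    /** Returns true if the given string parses as well-formed XML. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors are reported as SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1D49F written as two NCRs carrying the surrogate values.
        System.out.println(parses("<r>&#55349;&#56479;</r>"));
        // The same character written as one NCR for the code point.
        System.out.println(parses("<r>&#119967;</r>"));
    }
}
```

The first document is expected to be rejected and the second accepted, because a character reference has to name a legal XML character and surrogate code points are excluded.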

      Attachments

        1. XALANJ-2419-fix-v3.txt
          4 kB
          Jesper Steen Møller
        2. XALANJ-2419-tests-v3.txt
          54 kB
          Jesper Steen Møller

          People

            Assignee: Unassigned
            Reporter: Henri Sivonen (hsivonen)
