  1. XalanJ2
  2. XALANJ-2419

Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: Serialization
    • Labels:
      None

      Description

      org.apache.xml.serializer.ToStream contains the following code:

          else if (m_encodingInfo.isInEncoding(ch))
          {
              // If the character is in the encoding, and
              // not in the normal ASCII range, we also
              // just leave it get added on to the clean characters
          }
          else
          {
              // This is a fallback plan, we should never get here
              // but if the character wasn't previously handled
              // (i.e. isn't in the encoding, etc.) then what
              // should we do? We choose to write out an entity
              writeOutCleanChars(chars, i, lastDirtyCharProcessed);
              writer.write("&#");
              writer.write(Integer.toString(ch));
              writer.write(';');
              lastDirtyCharProcessed = i;
          }

      As a result, surrogates take the wrong (latter) branch, because isInEncoding() returns false for surrogates when the encoding is UTF-8. Each half of the surrogate pair is then written as its own NCR. Escaping a surrogate as an NCR is always wrong, regardless of encoding.
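
For illustration, here is a minimal sketch of surrogate-aware escaping. It is not the attached fix and not Xalan code: the class SurrogateAwareEscaper and the method escapeNonAscii are hypothetical names, and the sketch assumes an ASCII-only output where every non-ASCII character must be escaped. The point is only that a surrogate pair is first combined into one code point, which is then written as a single NCR. With UTF-8 output the character could instead be written directly, with no NCR at all.

```java
public final class SurrogateAwareEscaper {

    /**
     * Hypothetical helper (not part of Xalan): escapes every non-ASCII
     * code point as one decimal NCR, never one NCR per surrogate.
     */
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins a valid high/low surrogate pair into one code point;
            // for an unpaired surrogate it returns the surrogate value itself.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);

            if (cp < 0x80) {
                out.append((char) cp);
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                // An unpaired surrogate has no valid XML representation.
                throw new IllegalArgumentException("Unpaired surrogate in input");
            } else {
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E is stored in a Java String as the surrogate pair D834 DD1E.
        String astral = new String(Character.toChars(0x1D11E));
        System.out.println(escapeNonAscii("a" + astral)); // prints a&#119070;
    }
}
```

The code quoted above would instead emit &#55348;&#56606; for the same character, that is, the two surrogate values 0xD834 and 0xDD1E as separate NCRs.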

      The practical effect of this bug is that any document containing astral characters is serialized as ill-formed XML, which an XML parser will not parse back.
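
The parse failure can be checked with the JDK's own SAX parser. The sketch below is an assumption-laden illustration rather than a test from the attached patches: the class SurrogateNcrParseCheck and the method parses are hypothetical names, and the two input strings are hand-written stand-ins for serializer output. It feeds the parser a document that escapes U+1D11E as two surrogate NCRs, and one that uses a single NCR for the whole code point.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public final class SurrogateNcrParseCheck {

    /** Hypothetical helper: returns true if the JDK SAX parser accepts the XML text. */
    public static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Well-formedness errors surface here, e.g. an invalid character reference.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1D11E written as two NCRs carrying the surrogate values (the reported output shape).
        System.out.println(parses("<r>&#55348;&#56606;</r>"));
        // The same character written as one NCR for the full code point.
        System.out.println(parses("<r>&#119070;</r>"));
    }
}
```

A conforming parser should reject the first document, because XML character references must refer to a legal XML character and surrogate code points are not legal, and accept the second.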

        Attachments

        1. XALANJ-2419-fix-v3.txt
          4 kB
          Jesper Steen Møller
        2. XALANJ-2419-tests-v3.txt
          54 kB
          Jesper Steen Møller

              People

              • Assignee:
                Unassigned
                Reporter:
                hsivonen Henri Sivonen
              • Votes:
                9
                Watchers:
                12