  1. XalanJ2
  2. XALANJ-2419

Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

Details

    Description

      org.apache.xml.serializer.ToStream contains the following code:
      else if (m_encodingInfo.isInEncoding(ch))
      {
          // If the character is in the encoding, and
          // not in the normal ASCII range, we also
          // just leave it get added on to the clean characters
      }
      else
      {
          // This is a fallback plan, we should never get here
          // but if the character wasn't previously handled
          // (i.e. isn't in the encoding, etc.) then what
          // should we do? We choose to write out an entity
          writeOutCleanChars(chars, i, lastDirtyCharProcessed);
          writer.write("&#");
          writer.write(Integer.toString(ch));
          writer.write(';');
          lastDirtyCharProcessed = i;
      }

      Because isInEncoding() returns false for surrogates when the encoding is UTF-8, surrogates fall through to the wrong (latter) branch, so each half of a surrogate pair is written as its own NCR. Escaping a surrogate as an NCR is always wrong, regardless of encoding.

      The practical effect of this bug is that any document containing astral characters is serialized as ill-formed XML, which an XML parser cannot read back.
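
      The difference between escaping per UTF-16 code unit and escaping per code point can be sketched as follows. This is a minimal illustration, not the ToStream code or the attached fix: the class SurrogateEscapeSketch, its methods, and the simplified isInEncoding() stand-in are hypothetical names introduced here, and rejecting unpaired surrogates with an exception is an assumption.

```java
public final class SurrogateEscapeSketch {

    private SurrogateEscapeSketch() {
    }

    /** Hypothetical stand-in for the encoding check: reports surrogates as "not in the encoding". */
    public static boolean isInEncoding(char ch) {
        return !Character.isSurrogate(ch);
    }

    /** Mirrors the reported behaviour: each surrogate code unit is escaped as its own NCR. */
    public static String escapePerCodeUnit(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isInEncoding(ch)) {
                out.append(ch);
            } else {
                // Fallback branch: writes the code unit value, which is ill-formed for surrogates.
                out.append("&#").append(Integer.toString(ch)).append(';');
            }
        }
        return out.toString();
    }

    /** Combines each surrogate pair into one code point before escaping it as a single NCR. */
    public static String escapePerCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isHighSurrogate(ch)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                int codePoint = Character.toCodePoint(ch, text.charAt(i + 1));
                out.append("&#").append(codePoint).append(';');
                i++; // skip the low surrogate that was just consumed
            } else if (Character.isSurrogate(ch)) {
                // Assumption: an unpaired surrogate is treated as an error.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }
}
```

      For the input "a\uD83D\uDE00b" (U+1F600), escapePerCodeUnit yields a&#55357;&#56832;b, which is not well-formed XML, while escapePerCodePoint yields a&#128512;b. With UTF-8 output the pair could equally be written unescaped, since UTF-8 can encode every astral code point directly.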

      Attachments

        1. XALANJ-2419-fix-v3.txt
          4 kB
          Jesper Steen Møller
        2. XALANJ-2419-tests-v3.txt
          54 kB
          Jesper Steen Møller

            People

              keshlam@alum.mit.edu Joe Kesselman
              hsivonen Henri Sivonen
              Votes: 11
              Watchers: 16
