Uploaded image for project: 'XalanJ2'
  1. XalanJ2
  2. XALANJ-2419

Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: Serialization
    • Labels:
      None

      Description

      org.apache.xml.serializer.ToStream contains the following code:
      else if (m_encodingInfo.isInEncoding(ch))

      { // If the character is in the encoding, and // not in the normal ASCII range, we also // just leave it get added on to the clean characters }

      else

      { // This is a fallback plan, we should never get here // but if the character wasn't previously handled // (i.e. isn't in the encoding, etc.) then what // should we do? We choose to write out an entity writeOutCleanChars(chars, i, lastDirtyCharProcessed); writer.write("&#"); writer.write(Integer.toString(ch)); writer.write(';'); lastDirtyCharProcessed = i; }

      This leads to the wrong (latter) if branch running for surrogates, because isInEncoding() for UTF-8 returns false for surrogates. It is always wrong (regardless of encoding) to escape a surrogate as an NCR.

      The practical effect of this bug is that any document with astral characters in it ends up in an ill-formed serialization and does not parse back using an XML parser.

      1. XALANJ-2419-fix.txt
        2 kB
        Jesper Steen Møller
      2. XALANJ-2419-tests.txt
        39 kB
        Jesper Steen Møller

        Issue Links

          Activity

          Hide
          richard.evans@datanomic.com Richard Evans added a comment -

          This is a serious problem because Xalan fails to encode any of the Unicode supplementary characters.

          The version shipped with Java 1.6 does not have this problem. However if I need to run with Java 1.5 (on systems which do not support 1.6), there is no solution for encoding supplementary characters - the 1.5 built in code does not handle them and plugging in a different Xalan will not help.

          Show
          richard.evans@datanomic.com Richard Evans added a comment - This is a serious problem because Xalan fails to encode any of the Unicode supplementary characters. The version shipped with Java 1.6 does not have this problem. However if I need to run with Java 1.5 (on systems which do not support 1.6), there is no solution for encoding supplementary characters - the 1.5 built in code does not handle them and plugging in a different Xalan will not help.
          Hide
          minchau@ca.ibm.com Brian Minchau added a comment -

          Henri, Richard:
          you are right, the code there should have checked if ch was a high in a high low/pair, and
          m_encodingInfo.isInEncoding(high,low) should have been called in this case.

          Definite bug. I'll see if I can supply a patch.

          Show
          minchau@ca.ibm.com Brian Minchau added a comment - Henri, Richard: you are right, the code there should have checked if ch was a high in a high low/pair, and m_encodingInfo.isInEncoding(high,low) should have been called in this case. Definite bug. I'll see if I can supply a patch.
          Hide
          lexspoon Lex Spoon added a comment -

          What are people doing to work around this bug? It appears to make Xalan unsuitable for any documents that might need surrogate characters. Yet, it is left lingering for years, so perhaps it is not as serious as it looks?

          Show
          lexspoon Lex Spoon added a comment - What are people doing to work around this bug? It appears to make Xalan unsuitable for any documents that might need surrogate characters. Yet, it is left lingering for years, so perhaps it is not as serious as it looks?
          Hide
          jespersm Jesper Steen Møller added a comment -

          I tried transforming XML using the Xalan in Java 1.6_20 and Xalan 2.7.1, and they both split astral characters and produce illegal character entities.

          My fix is attached, as tests and a possible fix against the 2.7.1. tag, against 'test' and 'java' respectively.

          The test patch adds three test cases to StreamResultAPITest where the same source XML file is transformed into three different files, each with different encoding (ASCII, ISO8859 and UTF-8). Also, a new test program, ToStreamTest, is also added which excercises the ToXMLStream class directly, checking for handling of encoding directly in the ToStream class.

          The patches and input files are licensed as <http://www.apache.org/licenses/LICENSE-2.0>

          Show
          jespersm Jesper Steen Møller added a comment - I tried transforming XML using the Xalan in Java 1.6_20 and Xalan 2.7.1, and they both split astral characters and produce illegal character entities. My fix is attached, as tests and a possible fix against the 2.7.1. tag, against 'test' and 'java' respectively. The test patch adds three test cases to StreamResultAPITest where the same source XML file is transformed into three different files, each with different encoding (ASCII, ISO8859 and UTF-8). Also, a new test program, ToStreamTest, is also added which excercises the ToXMLStream class directly, checking for handling of encoding directly in the ToStream class. The patches and input files are licensed as < http://www.apache.org/licenses/LICENSE-2.0 >
          Hide
          sergi Sergey Oplavin added a comment -

          Hello, is there any plans to fix this bug?

          Show
          sergi Sergey Oplavin added a comment - Hello, is there any plans to fix this bug?
          Hide
          smitchelus Scott Mitchell added a comment -

          Hi, we just got bitten by this same issue. I applied the patch supplied in this bug report and confirmed it fixed our issue. Is there any possibility of getting an officially supported release that has this fix in it? I'd hate to have to run our own customized version of Xalan.

          Show
          smitchelus Scott Mitchell added a comment - Hi, we just got bitten by this same issue. I applied the patch supplied in this bug report and confirmed it fixed our issue. Is there any possibility of getting an officially supported release that has this fix in it? I'd hate to have to run our own customized version of Xalan.
          Hide
          thetaphi Uwe Schindler added a comment -

          One workaround for this is:

          • Just specify output encoding to be UTF-16
          • Instead of OutputStream write the XML to a standard java.io.Writer with correct charset. Unfortunately this will print the wrong charset into the XML header, but this may be fixed with some FilterWriter. I did not have that issue, as we emit no XML header, so not sure how hard this is.

          But this bug should really be fixed. It makes serializer.jar unuseable with any far-east language! This also affects xerces.jar, which also uses serializer.jar when serializing XML.

          Show
          thetaphi Uwe Schindler added a comment - One workaround for this is: Just specify output encoding to be UTF-16 Instead of OutputStream write the XML to a standard java.io.Writer with correct charset. Unfortunately this will print the wrong charset into the XML header, but this may be fixed with some FilterWriter. I did not have that issue, as we emit no XML header, so not sure how hard this is. But this bug should really be fixed. It makes serializer.jar unuseable with any far-east language! This also affects xerces.jar, which also uses serializer.jar when serializing XML.
          Hide
          thetaphi Uwe Schindler added a comment -

          Maybe it's a better idea to ask the XERCES people to fix this bug? XALAN seems dead to me, unfortunately.

          Show
          thetaphi Uwe Schindler added a comment - Maybe it's a better idea to ask the XERCES people to fix this bug? XALAN seems dead to me, unfortunately.
          Hide
          garydgregory Gary Gregory added a comment -

          Does the whole test suite pass with this patch on top of 2.7.2?

          Show
          garydgregory Gary Gregory added a comment - Does the whole test suite pass with this patch on top of 2.7.2?
          Hide
          smitchelus Scott Mitchell added a comment -

          Hi Gary, I'm trying to answer that question for you, but so far I've been unable to figure out how to run the test suite. I originally downloaded a source distribution to make my modifications, but that doesn't appear to have the tests included in it. I then tried cloning the Git repo and checking out the SVN repo and neither of those seem to include the tests either. Any clue how I can get my hands on the tests? FWIW, here's the error I'm getting when I run "ant smoketest" or "minitest":

          tests-not-available:
          [echo] [tests] The tests do not seem to be present in ../test
          [echo] [tests] You must have checked out from CVS to run the tests,
          [echo] [tests] it is not included in binary distributions.
          [echo] [tests] See http://xml.apache.org/xalan-j/test/ for more info.

          Show
          smitchelus Scott Mitchell added a comment - Hi Gary, I'm trying to answer that question for you, but so far I've been unable to figure out how to run the test suite. I originally downloaded a source distribution to make my modifications, but that doesn't appear to have the tests included in it. I then tried cloning the Git repo and checking out the SVN repo and neither of those seem to include the tests either. Any clue how I can get my hands on the tests? FWIW, here's the error I'm getting when I run "ant smoketest" or "minitest": tests-not-available: [echo] [tests] The tests do not seem to be present in ../test [echo] [tests] You must have checked out from CVS to run the tests, [echo] [tests] it is not included in binary distributions. [echo] [tests] See http://xml.apache.org/xalan-j/test/ for more info.
          Hide
          smitchelus Scott Mitchell added a comment -

          Man, I can't figure out how to get the test source code. Does anyone watching this thread have any insight here? As far as I can tell it might be that the test source code is still in CVS???

          Show
          smitchelus Scott Mitchell added a comment - Man, I can't figure out how to get the test source code. Does anyone watching this thread have any insight here? As far as I can tell it might be that the test source code is still in CVS???
          Hide
          smitchelus Scott Mitchell added a comment - - edited

          I even figured out how to install CVS on my Mac (who thought I'd ever have to go there), and now the Apache CVS server is refusing connections. Sigh...

          Show
          smitchelus Scott Mitchell added a comment - - edited I even figured out how to install CVS on my Mac (who thought I'd ever have to go there), and now the Apache CVS server is refusing connections. Sigh...
          Hide
          jespersm Jesper Steen Møller added a comment - - edited

          I followed the instructions on https://xalan.apache.org/xalan-j/downloads.html#buildmyself on my Mac OS X 10.10.4 with Xcode developer tools installed.
          I had to add execute permissions on test/build.sh, and temporarily change my locale to "All American" (or the test "Extension test of javaSample3.xsl" fails)

          That worked, and I got 2 x CONGRATULATIONS when running "./build.sh smoketest"

          I then applied the patch containing the tests (using svn patch), and then ToStreamTest.runTest() and StreamResultAPITest.runTest() both failed, as was expected.

          I then applied the fix, and the tests were once again OK.

          So, yes, the fix still applies.

          This was on Java 1.7.

          Hope this helps!

          Show
          jespersm Jesper Steen Møller added a comment - - edited I followed the instructions on https://xalan.apache.org/xalan-j/downloads.html#buildmyself on my Mac OS X 10.10.4 with Xcode developer tools installed. I had to add execute permissions on test/build.sh, and temporarily change my locale to "All American" (or the test "Extension test of javaSample3.xsl" fails) That worked, and I got 2 x CONGRATULATIONS when running "./build.sh smoketest" I then applied the patch containing the tests (using svn patch), and then ToStreamTest.runTest() and StreamResultAPITest.runTest() both failed, as was expected. I then applied the fix, and the tests were once again OK. So, yes, the fix still applies. This was on Java 1.7. Hope this helps!
          Hide
          smitchelus Scott Mitchell added a comment -

          Thanks Jesper!

          So, does this imply it would be possible to get this integrated into a release?

          Show
          smitchelus Scott Mitchell added a comment - Thanks Jesper! So, does this imply it would be possible to get this integrated into a release?
          Hide
          garydgregory Gary Gregory added a comment - - edited
          Show
          garydgregory Gary Gregory added a comment - - edited The 2.7.x maintenance is coming out of: https://svn.apache.org/repos/asf/xalan/java/branches/xalan-j_2_7_1_maint https://svn.apache.org/repos/asf/xalan/test/branches/xalan-j_2_7_x Do the tests pass on this branch?
          Hide
          jespersm Jesper Steen Møller added a comment - - edited

          Yes, they work on those branches, too. The patches were originally produced against the 2.7.1 tag.
          (but, yes, I verified just now)

          Show
          jespersm Jesper Steen Møller added a comment - - edited Yes, they work on those branches, too. The patches were originally produced against the 2.7.1 tag. (but, yes, I verified just now)
          Hide
          jespersm Jesper Steen Møller added a comment -

          I noticed a confusing comment in the ToStreamTest, where the test method testCase2() asserts that

          	reporter.check(actual2, AELIG_OSLASH_ARING, "ISO-8859-1 characters should come out as entities");
          

          This should read

          	reporter.check(actual2, AELIG_OSLASH_ARING, "ISO-8859-1 characters should come out unscathed");
          

          ... as it's also what's asserted.

          Show
          jespersm Jesper Steen Møller added a comment - I noticed a confusing comment in the ToStreamTest, where the test method testCase2() asserts that reporter.check(actual2, AELIG_OSLASH_ARING, "ISO-8859-1 characters should come out as entities" ); This should read reporter.check(actual2, AELIG_OSLASH_ARING, "ISO-8859-1 characters should come out unscathed" ); ... as it's also what's asserted.
          Hide
          smitchelus Scott Mitchell added a comment -

          So, given that Jesper has confirmed the compatibility of the patch with the 2.7.X branch, what is the possibility of getting this included in a release in the near term?

          Show
          smitchelus Scott Mitchell added a comment - So, given that Jesper has confirmed the compatibility of the patch with the 2.7.X branch, what is the possibility of getting this included in a release in the near term?
          Hide
          garydgregory Gary Gregory added a comment - - edited

          The only thing blocking me from doing this ATM is priorities. Next week, next month, who knows. Got to pay the bills

          Show
          garydgregory Gary Gregory added a comment - - edited The only thing blocking me from doing this ATM is priorities. Next week, next month, who knows. Got to pay the bills
          Hide
          jharrop Jason Harrop added a comment -

          If you just want to use TransformerIdentityImpl with astral characters, it is possible to repackage little more than org.apache.xml.serializer (and apply the fix in this issue).

          You also need to modify the *.properties files in that package, to point to your new repackaging.

          In addition to that package, I repackaged TransformerIdentityImpl, SerializerSwitcher and OutputProperties.

          Of course, life would be simpler if the Xalan maintainers would just make a 2.7.3 fixing this 7 year old bug.

          Show
          jharrop Jason Harrop added a comment - If you just want to use TransformerIdentityImpl with astral characters, it is possible to repackage little more than org.apache.xml.serializer (and apply the fix in this issue). You also need to modify the *.properties files in that package, to point to your new repackaging. In addition to that package, I repackaged TransformerIdentityImpl, SerializerSwitcher and OutputProperties. Of course, life would be simpler if the Xalan maintainers would just make a 2.7.3 fixing this 7 year old bug.
          Hide
          smitchelus Scott Mitchell added a comment -

          Understood Gary, glad to hear it will come eventually. I certainly would prefer an official build vs. us forking it and rolling our own version.

          Show
          smitchelus Scott Mitchell added a comment - Understood Gary, glad to hear it will come eventually. I certainly would prefer an official build vs. us forking it and rolling our own version.
          Hide
          smitchelus Scott Mitchell added a comment -

          Hi Gary, any word on the timing for a bug fix release?

          Show
          smitchelus Scott Mitchell added a comment - Hi Gary, any word on the timing for a bug fix release?
          Hide
          smitchelus Scott Mitchell added a comment -

          Hate to be that nagging guy, but any chance of getting a release soon?

          Show
          smitchelus Scott Mitchell added a comment - Hate to be that nagging guy, but any chance of getting a release soon?

            People

            • Assignee:
              Unassigned
              Reporter:
              hsivonen Henri Sivonen
            • Votes:
              5 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development