  1. XalanJ2
  2. XALANJ-2617

Serializer produces separately escaped surrogate pair instead of codepoint


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1, 2.7.2
    • Fix Version/s: None
    • Component/s: Serialization, Xalan
    • Security Level: No security risk; visible to anyone (Ordinary problems in Xalan projects. Anybody can view the issue.)
    • Labels: None

      Description

      Serializing XML that contains a supplementary character encoded as the UTF-16 surrogate pair "\uD840\uDC0B" (code point U+2000B) fails: the XML Transformer escapes each surrogate of the pair separately, which makes the resulting XML unparseable, e.g.: SAXParseException; Character reference "&#55360" is an invalid XML character. I have tried several approaches and none worked. It looks like a bug introduced by the XALANJ-2271 fix.
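For illustration, the expected escaping can be sketched as follows. This is a minimal sketch, not Xalan's serializer code: the class `NcrEscaper` and the method `escapeNonAscii` are hypothetical names, and the sketch assumes that every non-ASCII code point is written as a decimal numeric character reference. It ignores markup characters such as `<` and `&`. The point is that the string is walked by code point, so a surrogate pair yields one reference.

```java
// Hypothetical helper for illustration only; not part of Xalan.
final class NcrEscaper {

    /** Writes each non-ASCII code point as one decimal numeric character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // codePointAt combines a valid high/low surrogate pair into one code point.
            int cp = text.codePointAt(i);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                // A lone surrogate has no valid XML representation.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            }
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by 2 chars for a supplementary code point, otherwise by 1.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("\uD840\uDC0B")); // prints &#131083;
    }
}
```

Iterating by `char` instead of by code point is the kind of loop that would produce `&#55360;&#56331;`.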

       

      Output of Xalan ver. 2.7.2
      kec@phoebe:~/Downloads$ java -version
      java version "1.8.0_171"
      Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
      Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
      
      kec@phoebe:~/Downloads$ java -cp /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.2/xalan-2.7.2.jar:/home/kec/.m2/repository/xalan/serializer/2.7.2/serializer-2.7.2.jar:. JI9053942
      Character: 𠀋
      EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
       ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
      [Fatal Error] :1:50: Character reference "&#
      
      But Xalan ver. 2.7.0 works OK
      kec@phoebe:~/Downloads$ java -cp /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.0/xalan-2.7.0.jar:/home/kec/.m2/repository/xalan/serializer/2.7.0/serializer-2.7.0.jar:. JI9053942
      Character: 𠀋
      EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
       ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
      ACTUAL PARSED CHAR 𠀋
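The numbers in the two outputs are related by the UTF-16 decoding formula. The sketch below (the class name `SurrogateArithmetic` and the method `combine` are made up for this example) shows that 55360 and 56331 are the two halves of 131083, and cross-checks the result against `Character.toCodePoint`.

```java
// Hypothetical example class; the arithmetic is the standard UTF-16 decoding formula.
final class SurrogateArithmetic {

    /** Combines a high surrogate (D800..DBFF) and a low surrogate (DC00..DFFF). */
    static int combine(int high, int low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int high = 55360; // 0xD840, first reference in the 2.7.2 output
        int low = 56331;  // 0xDC0B, second reference in the 2.7.2 output
        System.out.println(combine(high, low));                             // prints 131083
        System.out.println(Character.toCodePoint((char) high, (char) low)); // prints 131083
    }
}
```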
      
      Test
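      import java.io.StringReader;
      import java.io.StringWriter;
      import javax.xml.parsers.DocumentBuilder;
      import javax.xml.parsers.DocumentBuilderFactory;
      import javax.xml.transform.Transformer;
      import javax.xml.transform.TransformerFactory;
      import javax.xml.transform.dom.DOMSource;
      import javax.xml.transform.stream.StreamResult;
      import org.w3c.dom.Document;
      import org.w3c.dom.Element;
      import org.xml.sax.InputSource;

      public class JI9053942 {
          public static void main(String[] args) throws Exception {
              String value = "\uD840\uDC0B"; // U+2000B as a UTF-16 surrogate pair
              System.out.println("Character: " + value);
              System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>");
              StringWriter writer = new StringWriter();

              // Build a one-element DOM whose text is the supplementary character.
              final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
              Document dom = documentBuilder.newDocument();
              final Element rootEl = dom.createElement("a");
              rootEl.setTextContent(value);
              dom.appendChild(rootEl);

              Transformer transformer = TransformerFactory.newInstance().newTransformer();
              transformer.transform(new DOMSource(dom), new StreamResult(writer));
              String xml = writer.toString();
              System.out.println(" ACTUAL: " + xml);

              // Parse the output back; this fails when the surrogates were escaped separately.
              InputSource inputSource = new InputSource();
              inputSource.setCharacterStream(new StringReader(xml));
              System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());
          }
      }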
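The symptom can also be detected without running a parser. The following is a rough sketch, assuming it is enough to scan the serialized string for numeric character references whose value falls in the surrogate range D800..DFFF. The class `SurrogateReferenceCheck` and the method `hasSurrogateReference` are hypothetical names, not part of Xalan or the attached patches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regression-check helper; not part of Xalan.
final class SurrogateReferenceCheck {

    // Matches decimal (&#55360;) and hexadecimal (&#xD840;) character references.
    private static final Pattern NCR = Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    /** Returns true if any numeric character reference names a surrogate code unit. */
    static boolean hasSurrogateReference(String xml) {
        Matcher m = NCR.matcher(xml);
        while (m.find()) {
            String body = m.group(1);
            long value;
            try {
                value = body.startsWith("x")
                        ? Long.parseLong(body.substring(1), 16)
                        : Long.parseLong(body);
            } catch (NumberFormatException e) {
                continue; // too large to be a surrogate; skip it
            }
            if (value >= 0xD800 && value <= 0xDFFF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSurrogateReference("<a>&#55360;&#56331;</a>")); // prints true
        System.out.println(hasSurrogateReference("<a>&#131083;</a>"));        // prints false
    }
}
```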
      

        Attachments

        1. XALANJ-2617_test.patch
          9 kB
          Peter De Maeyer
        2. XALANJ-2617_java.patch
          4 kB
          Peter De Maeyer
        3. XALANJ-2617_Fix_missing_surrogate_pairs_support.patch
          1.0 kB
          Daniel Kec
        4. JI9053942.java
          2 kB
          Daniel Kec

              People

              • Assignee:
                Steven J. Hathaway (shathaway)
              • Reporter:
                Daniel Kec (danielkec)
              • Votes:
                5
              • Watchers:
                8
