[XALANJ-2593] Incorrect showing of supplementary characters in attributes - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.7.2
Fix Version/s: None
Component/s: Serialization
Security Level: No security risk; visible to anyone (Ordinary problems in Xalan projects. Anybody can view the issue.)
Labels: None
Environment: Win 7 x64, Java 1.6
Xalan info: PatchAvailable
Fix priority: fp1

Description

In Xalan 2.7.2, supplementary characters (see http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html for details) are serialized incorrectly in attribute values.
For example, I need to output the characters 𣎴 (&#144308;) and 𠘨 (&#132648;) in attribute "y" of element "x".
Expected result:

<?xml version="1.0" encoding="UTF-8"?><x y="&#144308; - &#132648;"/>

Actual result for Xalan 2.7.2 is:

 <?xml version="1.0" encoding="UTF-8"?><x y="&#55372;&#57268; - &#55361;&#56872;"/>
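
For reference, each pair of numbers in the actual result is the UTF-16 surrogate pair of one expected code point, written out as two separate character references. A minimal sketch that checks this with the standard java.lang.Character API (the class name SurrogatePairDemo and the helper combine are made up for illustration):

```java
public class SurrogatePairDemo {
    // Combine a UTF-16 high/low surrogate pair into a single Unicode code point.
    static int combine(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // Values taken from the actual result above.
        System.out.println(combine((char) 55372, (char) 57268)); // prints 144308
        System.out.println(combine((char) 55361, (char) 56872)); // prints 132648
    }
}
```

So the serializer emits the right UTF-16 code units, but as two references instead of one, which is not a valid way to reference a supplementary character in XML.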

Code snippet for test:

import java.io.StringReader;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

// The class name is arbitrary; it only makes the snippet compilable.
public class SupplementaryAttrTest {
    public static void main(String[] argv) throws Exception {
        TransformerFactory tFactory = TransformerFactory.newInstance();
        StreamSource stylesource = new StreamSource(new StringReader("<?xml version=\"1.0\" encoding=\"UTF-8\"?><xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\" ><xsl:template match=\"/\"><x y=\"{xslt/search/value1}\" /></xsl:template></xsl:stylesheet>"));
        Transformer transformer = tFactory.newTransformer(stylesource);
        StreamSource source = new StreamSource(new StringReader("<?xml version=\"1.0\"?><xslt><search><value1>𣎴 - 𠘨</value1></search></xslt>"));
        Result result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}

The problem relates to the method org.apache.xml.serializer.ToStream.writeAttrString(Writer, String, String).

            if (m_charInfo.shouldMapAttrChar(ch)) {
                // The character is supposed to be replaced by a String
                // e.g.   '&'  -->  "&amp;"
                // e.g.   '<'  -->  "&lt;"
                accumDefaultEscape(writer, ch, i, stringChars, len, false, true);
            }

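This branch does not handle multi-character sequences such as supplementary characters, which the Java platform represents as a surrogate pair of two char values. Each surrogate therefore falls through to the following fallback branch of the same method, which writes it out as a separate character reference: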

            else {
                // This is a fallback plan, we should never get here
                // but if the character wasn't previously handled
                // (i.e. isn't in the encoding, etc.) then what
                // should we do?  We choose to write out a character ref
                writer.write("&#");
                writer.write(Integer.toString(ch));
                writer.write(';');
            }
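
To illustrate the behavior expected here, a surrogate-aware escaping loop could look roughly like the sketch below. This is not Xalan's code and not the proposed patch: the class AttrEscapeSketch and the method escapeNonAscii are hypothetical, the sketch assumes that every character above 0x7E should become a numeric character reference, and it ignores the escaping of '&', '<' and quotes that the real serializer performs.

```java
public class AttrEscapeSketch {
    // Hypothetical helper (not part of Xalan): writes non-ASCII characters as
    // numeric character references, combining surrogate pairs into one reference.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (Character.isHighSurrogate(ch)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Two chars form one code point: emit a single reference
                // and skip the low surrogate.
                int codePoint = Character.toCodePoint(ch, s.charAt(i + 1));
                out.append("&#").append(codePoint).append(';');
                i++;
            } else if (ch > 0x7E) {
                // Assumption: anything outside printable ASCII is escaped.
                // An unpaired surrogate also ends up here, as its own reference.
                out.append("&#").append((int) ch).append(';');
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+233B4 and U+20628, the two characters from this report.
        System.out.println(escapeNonAscii("\uD84C\uDFB4 - \uD841\uDE28"));
        // prints &#144308; - &#132648;
    }
}
```

The key point is that the loop index advances by two when a pair is consumed, which is the same idea as assigning the return value of accumDefaultEscape() to i.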

PS: I can't attach a patch file, so I am including the patch here.

--- src\org\apache\xml\serializer\ToStream.java	2014-03-26 17:21:30 +0200
+++ src\org\apache\xml\serializer\ToStream.java	2014-09-09 19:09:30 +0300
@@ -2112,8 +2112,13 @@
                 // e.g.   '&'  -->  "&amp;"
                 // e.g.   '<'  -->  "&lt;"
                 accumDefaultEscape(writer, ch, i, stringChars, len, false, true);
-            }
-            else {
+            } else if (Encodings.isHighUTF16Surrogate(ch)) {
+                // more than single input character can be processed
+                // within accumDefaultEscape()
+                // so we set appropriate value for loop for().
+                i = accumDefaultEscape(writer, ch, i, stringChars, len, false, true); 
+
+            } else {
                 if (0x0 <= ch && ch <= 0x1F) {
                     // Range 0x00 through 0x1F inclusive
                     // This covers the non-whitespace control characters

People

Assignee: Steven J. Hathaway

Reporter: Eugene Shkel

Votes: 0

Watchers: 4

Dates

Created: 09/Sep/14 16:24

Updated: 23/May/18 07:51

Time Tracking

Estimated: 24h
Remaining: 24h
Logged: Not Specified