Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 2.7.1
- None
Description
org.apache.xml.serializer.ToStream contains the following code:
else if (m_encodingInfo.isInEncoding(ch))
else
{
    // This is a fallback plan, we should never get here
    // but if the character wasn't previously handled
    // (i.e. isn't in the encoding, etc.) then what
    // should we do? We choose to write out an entity
    writeOutCleanChars(chars, i, lastDirtyCharProcessed);
    writer.write("&#");
    writer.write(Integer.toString(ch));
    writer.write(';');
    lastDirtyCharProcessed = i;
}
As a result, surrogates fall through to the final else branch, because isInEncoding() for UTF-8 returns false for surrogates. Each half of a surrogate pair is then written as its own numeric character reference (NCR). Escaping a surrogate as an NCR is always wrong, regardless of encoding.
The practical effect of this bug is that any document containing astral characters (code points above U+FFFF) is serialized as ill-formed XML, which an XML parser cannot parse back.
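To make the failure concrete, here is a minimal standalone sketch. It is not the ToStream code and not the fix that was applied. The class and method names are hypothetical, and for simplicity it escapes every non-ASCII character, which the real serializer does not do. The first method escapes each UTF-16 code unit on its own, as the fallback branch does. The second combines a surrogate pair into one code point before escaping.

```java
// Hypothetical illustration only; names are not part of Xalan's serializer.
final class SurrogateEscapeSketch {

    // Escapes each UTF-16 code unit separately, like the fallback branch.
    // An astral character becomes two NCRs holding surrogate values.
    public static String escapePerCodeUnit(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                out.append(ch);
            } else {
                out.append("&#").append(Integer.toString(ch)).append(';');
            }
        }
        return out.toString();
    }

    // Walks the string by code point, so a surrogate pair yields one NCR.
    // Assumption: an unpaired surrogate is rejected, since it cannot be
    // represented in well-formed XML.
    public static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                throw new IllegalArgumentException("unpaired surrogate in input");
            }
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored in UTF-16 as the surrogate pair D83D DE00.
        String astral = "a\uD83D\uDE00";
        System.out.println(escapePerCodeUnit(astral));  // a&#55357;&#56832;
        System.out.println(escapePerCodePoint(astral)); // a&#128512;
    }
}
```

The first output is ill-formed because 55357 and 56832 are surrogate values, which XML's character production excludes. The second output refers to a single legal character. For an encoding such as UTF-8 that can represent the character directly, writing it unescaped would also be valid.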
Attachments
Issue Links
- is related to: XALANJ-2560 ToXMLStream does not support unicode supplementary characters (Resolved)
- relates to: XALANJ-2617 Serializer produces separately escaped surrogate pair instead of codepoint (Resolved)
- links to