[XALANJ-2419] Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.1
Fix Version/s: The Latest Development Code
Component/s: Serialization
Labels: None

Description

org.apache.xml.serializer.ToStream contains the following code:
    else if (m_encodingInfo.isInEncoding(ch))
    {
        // If the character is in the encoding, and
        // not in the normal ASCII range, we also
        // just leave it get added on to the clean characters
    }
    else
    {
        // This is a fallback plan, we should never get here
        // but if the character wasn't previously handled
        // (i.e. isn't in the encoding, etc.) then what
        // should we do? We choose to write out an entity
        writeOutCleanChars(chars, i, lastDirtyCharProcessed);
        writer.write("&#");
        writer.write(Integer.toString(ch));
        writer.write(';');
        lastDirtyCharProcessed = i;
    }

As a result, surrogates fall into the wrong (final else) branch, because isInEncoding() for UTF-8 returns false for surrogates. Escaping a surrogate as an NCR is always wrong, regardless of encoding.
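To make the difference concrete, here is a minimal sketch in plain Java. It is not the Xalan serializer code and not the attached fix: the class NcrSketch and both method names are hypothetical, and the sketch escapes every non-ASCII character as an NCR purely for illustration. escapePerChar mirrors the per-char fallback quoted above, while escapeNonAscii combines a surrogate pair into one code point before writing the NCR.

```java
final class NcrSketch {

    /** Mirrors the quoted fallback: one NCR per UTF-16 code unit (wrong for surrogates). */
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                out.append(ch);
            } else {
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    /** One NCR per code point: a surrogate pair is combined before escaping. */
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                out.append(ch);
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                int codePoint = Character.toCodePoint(ch, s.charAt(i + 1));
                out.append("&#").append(codePoint).append(';');
                i++; // the low surrogate has been consumed as part of the pair
            } else if (Character.isSurrogate(ch)) {
                // An unpaired surrogate cannot be represented in XML at all.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String astral = "\uD83D\uDE00"; // U+1F600 as a UTF-16 surrogate pair
        System.out.println(escapePerChar(astral));  // prints &#55357;&#56832;
        System.out.println(escapeNonAscii(astral)); // prints &#128512;
    }
}
```

For UTF-8 output the character could also be written directly instead of being escaped; the sketch only shows what a correct NCR looks like when one is needed.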

The practical effect of this bug is that any document containing astral characters is serialized as ill-formed XML, which an XML parser then refuses to parse back.
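The parse-back failure can be checked with the JAXP parser that ships with the JDK. The following is a small sketch, assuming a default DocumentBuilderFactory; the class RoundTripSketch and the method isWellFormed are hypothetical names chosen for this example. It feeds the parser a pair of surrogate NCRs and, for comparison, a single NCR for the same code point.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

final class RoundTripSketch {

    /** Returns true if the default JAXP parser accepts the given document text. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors are reported as SAXException subclasses.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F600 escaped as two NCRs carrying the surrogate values, as in the bug.
        System.out.println(isWellFormed("<r>&#55357;&#56832;</r>"));
        // The same character escaped as a single NCR for the code point.
        System.out.println(isWellFormed("<r>&#128512;</r>"));
    }
}
```

The first document is rejected because XML character references must refer to legal XML characters, and surrogate values are excluded from that set. The parser may also print its own fatal-error message to standard error for the rejected input.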

Attachments

XALANJ-2419-fix-v3.txt (4 kB, 16/Apr/18 11:51, Jesper Steen Møller)
XALANJ-2419-tests-v3.txt (54 kB, 16/Apr/18 11:51, Jesper Steen Møller)

Issue Links

is related to: XALANJ-2560 ToXMLStream does not support unicode supplementary characters (Resolved)

relates to: XALANJ-2617 Serializer produces separately escaped surrogate pair instead of codepoint (Resolved)

links to: GitHub Pull Request #163

People

Assignee: Joe Kesselman

Reporter: Henri Sivonen

Votes: 11

Watchers: 16

Dates

Created: 02/Jan/08 11:07

Updated: 27/Jan/24 15:48

Resolved: 27/Jan/24 15:48