Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- None
- None
- None
- Patch
Description
There's a bug in the surrogate handling when the reader buffer is exhausted and only the high surrogate of a pair can be written. On the next read the low surrogate gets added, but the buffer space calculation is then off by one.
The bug is triggered when parsing the current enwiktionary dump file, which fails with:
org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.
In the attached patch I added a fix and a test case for this bug. A related issue is that when the carried-over low surrogate is the last character of the stream, read returns -1 instead of 1.
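To illustrate the intended behaviour, here is a minimal sketch of the carry-over logic. It is not the Xerces code and not the attached patch: the class `SurrogateCarrySketch` and its fields are hypothetical, and it works on already decoded code points rather than UTF-8 bytes to keep the example short. The two points it tries to show are that the free space is computed after the pending low surrogate has been written, and that a read which only delivers the pending low surrogate returns 1.

```java
// Hypothetical illustration only; this is NOT the Xerces UTF8Reader.
class SurrogateCarrySketch {
    private final int[] codePoints; // input, assumed to be already decoded
    private int next = 0;           // index of the next code point to emit
    private int pendingLow = -1;    // low surrogate carried over, or -1 if none

    SurrogateCarrySketch(int[] codePoints) {
        this.codePoints = codePoints;
    }

    // Fills dst[off .. off+len) and returns the number of chars written,
    // or -1 once nothing is left (including no pending low surrogate).
    int read(char[] dst, int off, int len) {
        if (len == 0) {
            return 0;
        }
        int pos = off;
        int end = off + len;

        // Write the carried-over low surrogate first, so that the remaining
        // space (end - pos) already accounts for it.
        if (pendingLow != -1) {
            dst[pos++] = (char) pendingLow;
            pendingLow = -1;
        }

        while (pos < end && next < codePoints.length) {
            int cp = codePoints[next++];
            if (Character.isSupplementaryCodePoint(cp)) {
                dst[pos++] = Character.highSurrogate(cp);
                if (pos < end) {
                    dst[pos++] = Character.lowSurrogate(cp);
                } else {
                    // Only the high surrogate fit; keep the low one for the next call.
                    pendingLow = Character.lowSurrogate(cp);
                }
            } else {
                dst[pos++] = (char) cp;
            }
        }

        int written = pos - off;
        // A call that wrote only the pending low surrogate reports 1, not -1.
        return written == 0 ? -1 : written;
    }
}
```

With the input `{ 'a', 0x1D11E }` and a two-char buffer, the first call writes `a` plus the high surrogate, the second call writes only the low surrogate, and end of stream is reported on the third call.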
Is UTF8Reader still necessary? It might be safer to just use a plain InputStreamReader.
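For comparison, a rough sketch of a round-trip check using a plain `InputStreamReader` with deliberately small char buffers, so that surrogate pairs fall on buffer boundaries. The class name `SurrogateBoundaryCheck` and the sample text are made up for illustration; this is not the test case from the patch and says nothing about relative performance.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Xerces.
class SurrogateBoundaryCheck {

    // Decodes UTF-8 bytes through a Reader using a caller-chosen char buffer
    // size, so that surrogate pairs can land on a buffer boundary.
    static String readAll(byte[] utf8, int bufferSize) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[bufferSize];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+1D11E is outside the BMP: 4 bytes in UTF-8, 2 chars in Java.
        String clef = new String(Character.toChars(0x1D11E));
        String text = "ab" + clef + "c" + clef; // ends on a low surrogate
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        for (int size = 1; size <= 8; size++) {
            boolean same = readAll(utf8, size).equals(text);
            System.out.println("buffer=" + size + " roundtrip=" + same);
        }
    }
}
```

The sample text ends with a supplementary character, so the last read has to deliver a trailing low surrogate, which is the case the description mentions.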
Attachments
Issue Links
- relates to XERCESJ-1257 buffer overflow in UTF8Reader for characters out of BMP (Reopened)