[XERCESJ-1668] Off-by-one bug w/ surrogates in UTF8Reader - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Other
Labels: None
Flags: Patch

Description

There's a bug in the surrogate handling of UTF8Reader. When the reader's output buffer is exhausted and only the high surrogate of a pair can be written, the low surrogate is written on the next read, but the calculation of the remaining buffer space is then off by one.

The bug is triggered when parsing the current enwiktionary dump file:

org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.

The attached patch contains a fix and a test case for this bug. A related issue: when the low surrogate is the last character written from the stream, read returns -1 instead of 1.
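To make the two cases concrete, here is a minimal sketch of the bookkeeping involved. It is not the Xerces code and not the attached patch; the class `SurrogateCarryDemo` and its field `pendingLow` are hypothetical names, and the input is modeled as already-decoded code points rather than raw UTF-8 bytes. It only illustrates that a carried-over low surrogate occupies one buffer slot and counts as one returned char.

```java
/** Hypothetical illustration of surrogate carry-over; not the Xerces UTF8Reader. */
final class SurrogateCarryDemo {
    private final int[] codePoints;   // input, modeled as decoded code points
    private int next = 0;             // index of the next code point to emit
    private int pendingLow = -1;      // low surrogate carried to the next read, or -1

    SurrogateCarryDemo(String text) {
        this.codePoints = text.codePoints().toArray();
    }

    /** Fills buf[off .. off+len) and returns the chars written, or -1 at end of input. */
    int read(char[] buf, int off, int len) {
        if (len <= 0) {
            return 0;
        }
        int out = off;
        int end = off + len;
        // A carried-over low surrogate uses one slot, so it must be counted
        // both in the space left and in the return value.
        if (pendingLow != -1) {
            buf[out++] = (char) pendingLow;
            pendingLow = -1;
        }
        while (out < end && next < codePoints.length) {
            int cp = codePoints[next++];
            if (Character.isSupplementaryCodePoint(cp)) {
                buf[out++] = Character.highSurrogate(cp);
                char low = Character.lowSurrogate(cp);
                if (out < end) {
                    buf[out++] = low;
                } else {
                    pendingLow = low;   // no room left: write it on the next call
                }
            } else {
                buf[out++] = (char) cp;
            }
        }
        int written = out - off;
        return written == 0 ? -1 : written;   // -1 only if nothing at all was written
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, read through a 2-char buffer.
        SurrogateCarryDemo demo = new SurrogateCarryDemo("a\uD83D\uDE00");
        char[] buf = new char[2];
        System.out.println(demo.read(buf, 0, 2)); // 2: 'a' and the high surrogate
        System.out.println(demo.read(buf, 0, 2)); // 1: the carried low surrogate
        System.out.println(demo.read(buf, 0, 2)); // -1: end of input
    }
}
```

The second call is the interesting one: the only char written is the carried low surrogate at the end of the input, and the sketch returns 1 for it rather than -1.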

Is UTF8Reader still necessary? It might be safer to just use a plain InputStreamReader.
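As a possible workaround on the caller's side, a SAX parser can be handed a character stream so that the JDK performs the UTF-8 decoding. The sketch below is a suggestion under assumptions: the class `PlainReaderParse` and method `parseText` are hypothetical names, and the input is assumed to be UTF-8, since a parser given a character stream disregards the encoding declaration in the document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: parses UTF-8 XML through a plain InputStreamReader. */
final class PlainReaderParse {

    /** Returns the concatenated character data of the document. */
    static String parseText(byte[] xmlUtf8) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try (InputStream in = new ByteArrayInputStream(xmlUtf8);
             // Assumption: the bytes really are UTF-8; no auto-detection happens here.
             Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(reader), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] xml = "<r>a\uD83D\uDE00b</r>".getBytes(StandardCharsets.UTF_8);
        String text = parseText(xml);
        // One supplementary character, so 3 code points in 4 chars.
        System.out.println(text.codePointCount(0, text.length()));
    }
}
```

The trade-off is that encoding detection from the byte order mark or XML declaration is bypassed, so this only fits inputs whose encoding is known in advance.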