[HADOOP-136] Overlong UTF8's not handled well - ASF JIRA


Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: 0.2.0
Fix Version/s: 0.6.0
Component/s: io
Labels:
None

Description

When we feed an overlong string to the UTF8 constructor, three suboptimal things happen.

First, we truncate to 0xffff/3 characters on the assumption that every character takes three bytes in UTF-8. This can truncate strings that don't need it (an all-ASCII string takes only one byte per character), and it is also overoptimistic, since some characters take four bytes in UTF-8.
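To make the mismatch concrete, here is a minimal sketch that measures encoded widths with the JDK's standard UTF-8 encoder. It does not use Hadoop's UTF8 class, whose own encoding may differ; the class name `Utf8Widths` and the helper `utf8Bytes` are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Widths {
    // Encoded size of a string in standard UTF-8, as computed by the JDK encoder.
    // Assumption: this stands in for the real byte count; it is not Hadoop's UTF8 code.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The character cap described above: 0xffff / 3 = 21845.
        int charCap = 0xffff / 3;
        System.out.println("character cap: " + charCap);

        System.out.println(utf8Bytes("a"));            // ASCII letter: 1 byte
        System.out.println(utf8Bytes("\u00e9"));       // e with acute accent: 2 bytes
        System.out.println(utf8Bytes("\u20ac"));       // euro sign: 3 bytes
        System.out.println(utf8Bytes("\ud83d\ude00")); // emoji (one surrogate pair): 4 bytes

        // A 30000-character ASCII string needs only 30000 bytes, which fits in
        // 0xffff, yet a cap of 21845 characters would still truncate it.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 30000; i++) {
            sb.append('a');
        }
        System.out.println(utf8Bytes(sb.toString()) <= 0xffff); // true
    }
}
```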

Second, the code doesn't actually handle four-byte characters.

Third, there's a behavioral discontinuity. If the string is "discovered" to be overlong by the arbitrary limit described above, we truncate with a log message; otherwise we signal a RuntimeException. One feels that both forms of overflow should be treated alike. However, this issue is concealed by the second issue: the exception will never be thrown because UTF8.utf8Length can't return more than three times the length of its input.

I would recommend changing UTF8.utf8Length to let its caller know how many characters of the input string will actually fit if there's an overflow [perhaps by returning the negative of that number] and doing the truncation accurately as needed.
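The recommendation above could be sketched roughly as follows. This is not the attached patch and not Hadoop's actual UTF8.utf8Length; the class `Utf8Fit`, the method `utf8LengthOrFit`, and the `maxBytes` parameter are hypothetical names. The sketch assumes standard UTF-8 widths (one to four bytes per code point) and well-formed input with no unpaired surrogates.

```java
final class Utf8Fit {
    // Assumed byte budget: the largest length that fits in an unsigned 16-bit prefix.
    static final int MAX_BYTES = 0xffff;

    /**
     * Returns the UTF-8 byte length of s if it fits in maxBytes. Otherwise
     * returns the negative of the number of leading Java chars that do fit,
     * never splitting a surrogate pair. With maxBytes >= 4 at least one code
     * point always fits, so an overflow result is strictly negative.
     */
    static int utf8LengthOrFit(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // Standard UTF-8 width of one code point.
            int width = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + width > maxBytes) {
                return -i; // chars [0, i) fit; the rest must be truncated
            }
            bytes += width;
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "abcdef";
        int r = utf8LengthOrFit(s, 4);
        // A negative result tells the caller exactly where to cut.
        String kept = r < 0 ? s.substring(0, -r) : s;
        System.out.println(kept); // abcd
    }
}
```

A caller would pass `Utf8Fit.MAX_BYTES` as the budget and apply the same truncate-and-log path whenever the result is negative, which would also remove the behavioral discontinuity described in the third point.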

-dk

Attachments

largeutf8.patch (13 kB, 17/May/06 03:49, Michel Tourn)

Issue Links

is part of

HADOOP-302 class Text (replacement for class UTF8) was: HADOOP-136

Closed

People

Assignee: Hairong Kuang
Reporter: Dick King
Votes: 0
Watchers: 1

Dates

Created: 15/Apr/06 03:11
Updated: 30/Aug/06 23:33
Resolved: 30/Aug/06 23:33