
Key: HADOOP-302
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Hairong Kuang
Reporter: Michel Tourn
Votes: 0
Watchers: 0

Hadoop Common

class Text (replacement for class UTF8) was: HADOOP-136

Created: 14/Jun/06 07:32 AM   Updated: 04/Aug/06 10:22 PM
Component/s: io
Affects Version/s: None
Fix Version/s: 0.5.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
text.patch      2006-07-25 09:03 PM   Hairong Kuang   28 kB
textwrap.patch  2006-07-27 12:27 AM   Hairong Kuang   1 kB
VInt.patch      2006-07-25 08:50 PM   Hairong Kuang   12 kB
Issue Links:
Incorporates
 
Reference
 

Resolution Date: 26/Jul/06 08:05 AM


Description
Just to verify: which length-encoding scheme are we using for class Text (aka LargeUTF8)?

a) the "UTF-8/Lucene" scheme (the highest bit of each byte is an extension bit, which I think is what Doug is describing in his last comment), or
b) the record-IO scheme in o.a.h.record.Utils.java:readInt?
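
For illustration, scheme a) could look roughly like the following. This is a minimal sketch, assuming 7 payload bits per byte with the low-order group first and the high bit set on every byte except the last. The class and method names (VIntSketch, writeVInt, readVInt, encode, decode) are hypothetical and are not taken from the attached patches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of a high-bit-extension length encoding (scheme a).
final class VIntSketch {

    /** Writes a non-negative int, 7 bits per byte, low-order group first. */
    static void writeVInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            // More groups follow: set the extension bit.
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value); // last group, extension bit clear
    }

    /** Reads an int written by writeVInt, consuming one or more bytes. */
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Convenience helper: encodes one value into a fresh byte array. */
    static byte[] encode(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeVInt(new DataOutputStream(bytes), value);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience helper: decodes one value from the start of a byte array. */
    static int decode(byte[] encoded) {
        try {
            return readVInt(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under these assumptions, lengths up to 127 take one byte and there are no special marker values to remember.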

Either way, note that:

1. UTF8.java and its successor Text.java need to read the length in two ways:
1a. consume one or more bytes from a DataInput, and
1b. parse the length within a byte array at a given offset
(1b is used for the "WritableComparator optimized for UTF8 keys").

o.a.h.record.Utils only supports the DataInput mode.
It is not clear to me what the best way is to extend this Utils code to support both reading modes.

2. Methods like UTF8's WritableComparator need to be low-overhead; in particular, there should be no Object allocation.
For the byte array case, the varlen-reader utility needs to be extended to return both
the decoded length and the length of the encoded length
(so that the caller can do offset += encodedlength).
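
One possible way to return both values without allocating is to pack them into a single long. The sketch below assumes the high-bit-extension encoding of scheme a) (7 payload bits per byte, low-order group first). The class VIntArrayReader and its methods are hypothetical names for illustration, not the API from the attached patches.

```java
// Hypothetical allocation-free reader for a length stored in a byte array.
final class VIntArrayReader {

    /**
     * Decodes the length starting at bytes[offset].
     * Returns the number of bytes consumed in the high 32 bits
     * and the decoded value in the low 32 bits.
     */
    static long readVInt(byte[] bytes, int offset) {
        int pos = offset;
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return ((long) (pos - offset) << 32) | (value & 0xFFFFFFFFL);
    }

    /** Extracts the decoded length from a packed result. */
    static int value(long packed) {
        return (int) packed;
    }

    /** Extracts the size in bytes of the encoded length from a packed result. */
    static int encodedLength(long packed) {
        return (int) (packed >>> 32);
    }
}
```

A comparator could then call readVInt once per key, advance with offset += VIntArrayReader.encodedLength(packed), and compare the next VIntArrayReader.value(packed) bytes, all on primitives.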

3. A String length does not need (small) negative integers.

4. One advantage of a) is that it is standard (or at least well-known and natural), and it has no magic constants (like -120, -121, -124).


