Hadoop Common
HADOOP-302

class Text (replacement for class UTF8) was: HADOOP-136

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: io
    • Labels: None

      Description

      Just to verify: which length-encoding scheme are we using for class Text (aka LargeUTF8)?

      a) the "UTF-8/Lucene" scheme (the highest bit of each byte is an extension bit, which I think is what Doug is describing in his last comment), or
      b) the record-IO scheme in o.a.h.record.Utils.java:readInt?

      Either way, note that:

      1. UTF8.java and its successor Text.java need to read the length in two ways:
      1a. consume 1+ bytes from a DataInput, and
      1b. parse the length within a byte array at a given offset.
      (1b is used for the "WritableComparator optimized for UTF8 keys"; see the sketch after point 4 below.)

      o.a.h.record.Utils only supports the DataInput mode.
      It is not clear to me what the best way is to extend this Utils code to support both reading modes.

      2. Methods like UTF8's WritableComparator must be low-overhead; in particular, there should be no object allocation.
      For the byte-array case, the varlen-reader utility needs to be extended to return two things:
      the decoded length and the length of the encoded length
      (so that the caller can do offset += encodedLength).

      3. A String length is never negative, so the encoding does not need to support (small) negative integers.

      4. One advantage of a) is that it is standard (or at least well-known and natural), and there are no magic constants (like -120, -121, -124).
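
      To make scheme a) and the two reading modes concrete, here is a minimal sketch. It is illustrative only, not the committed code: the class and method names are hypothetical, and the low-order-groups-first byte order is an assumption the issue leaves open.

          import java.io.DataInput;
          import java.io.IOException;

          // Hypothetical sketch of scheme a): the high bit of each byte is an
          // extension bit ('1' = more bytes follow), low-order 7-bit groups first.
          public class VarLenSketch {

            // Mode 1a: consume 1+ bytes from a DataInput.
            public static int readVInt(DataInput in) throws IOException {
              int result = 0, shift = 0;
              byte b;
              do {
                b = in.readByte();
                result |= (b & 0x7F) << shift;
                shift += 7;
              } while ((b & 0x80) != 0);
              return result;
            }

            // Mode 1b: parse the length within a byte array at a given offset.
            // Packs both results into one long so a comparator allocates no
            // objects: low 32 bits = decoded value, high 32 bits = encoded size,
            // letting the caller do offset += (int) (packed >>> 32).
            public static long readVInt(byte[] bytes, int offset) {
              int result = 0, shift = 0, pos = offset;
              byte b;
              do {
                b = bytes[pos++];
                result |= (b & 0x7F) << shift;
                shift += 7;
              } while ((b & 0x80) != 0);
              return ((long) (pos - offset) << 32) | (result & 0xFFFFFFFFL);
            }
          }

      Packing the pair into a long is one way to satisfy the no-allocation requirement of point 2 while still returning both the decoded length and the encoded length.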

      Attachments

      1. text.patch
        28 kB
        Hairong Kuang
      2. textwrap.patch
        1 kB
        Hairong Kuang
      3. VInt.patch
        12 kB
        Hairong Kuang

          Activity

          Bryan Pendleton added a comment -

           I don't know what the culprits are, but here are two stack traces I got today that killed a 2-hour job. Maybe, for now, validateUTF should be called when first serializing, too. There still seem to be a few bugs in how Text handles content.

          java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 3
          at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:152)
          at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:272)
          at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1076)
          Caused by: java.nio.charset.MalformedInputException: Input length = 3
          at org.apache.hadoop.io.Text.validateUTF(Text.java:439)
          at org.apache.hadoop.io.Text.validateUTF8(Text.java:419)
          at org.apache.hadoop.io.Text.readFields(Text.java:228)
          at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:82)
          at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:370)
          at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:183)
          at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:149)
          ... 3 more

          java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 26
          at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:152)
          at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:272)
          at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1076)
          Caused by: java.nio.charset.MalformedInputException: Input length = 26
          at org.apache.hadoop.io.Text.validateUTF(Text.java:439)
          at org.apache.hadoop.io.Text.validateUTF8(Text.java:419)
          at org.apache.hadoop.io.Text.readFields(Text.java:228)
          at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:82)
          at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:370)
          at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:183)
          at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:149)
          ... 3 more

           Unlike last time, this content doesn't contain tabs. Contact me off-list for pointers to the dataset this occurred in.
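
           A write-time check along the lines Bryan suggests could look roughly like this. It is a sketch only, independent of Text's actual internals: writeValidated is a hypothetical helper, not Hadoop API, and a plain int length prefix stands in for the real variable-length one.

               import java.io.DataOutput;
               import java.io.IOException;
               import java.nio.ByteBuffer;
               import java.nio.CharBuffer;
               import java.nio.charset.CharacterCodingException;
               import java.nio.charset.Charset;
               import java.nio.charset.CharsetEncoder;
               import java.nio.charset.CodingErrorAction;

               public class WriteTimeValidation {
                 // Hypothetical helper: refuse to serialize content that is not valid UTF-8,
                 // so bad records fail at write time instead of killing a reducer hours later.
                 public static void writeValidated(DataOutput out, String s) throws IOException {
                   CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT);
                   ByteBuffer bytes;
                   try {
                     // Throws CharacterCodingException on unpaired surrogates and the like,
                     // instead of silently substituting replacement characters.
                     bytes = encoder.encode(CharBuffer.wrap(s));
                   } catch (CharacterCodingException e) {
                     throw new IOException("refusing to serialize malformed content: " + e);
                   }
                   out.writeInt(bytes.remaining());  // simplified fixed-width length prefix
                   out.write(bytes.array(), bytes.arrayOffset(), bytes.remaining());
                 }
               }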

          Hairong Kuang added a comment -

          Here is the patch that should fix Bryan's problem.

          Bryan Pendleton added a comment -

          Whoops, forgot that inline patches are bad form. The point is, the encoding of "Bad \t encoding \t testcase" fails the verifyUTF check. It seems to have something to do with the double tabs.

          Bryan Pendleton added a comment -

          Started using this in my code... looks like there are still some bugs. Here's a testcase that shouldn't fail. (I have no idea how UTF works, or I'd try to offer an actual solution):

           Index: hadoop/src/test/org/apache/hadoop/io/TestText.java
           ===================================================================
           --- hadoop/src/test/org/apache/hadoop/io/TestText.java (revision 425795)
           +++ hadoop/src/test/org/apache/hadoop/io/TestText.java (working copy)
           @@ -87,7 +87,11 @@
            
              public void testCoding() throws Exception {
            
           -    for (int i = 0; i < NUM_ITERATIONS; i++) {
           +    String badString = "Bad \t encoding \t testcase.";
           +    Text testCase = new Text(badString);
           +    assertTrue(badString.equals(testCase.toString()));
           +
           +    for (int i = 0; i < NUM_ITERATIONS; i++) {
                  try {
                    // generate a random string
                    String before;
          Doug Cutting added a comment -

          I just committed this. Thanks!

          Hairong Kuang added a comment -

           This patch includes the Text class, which stores a string in the standard UTF-8 format, compares two strings bytewise in UTF-8 order, and provides many other functions.

           This patch also includes a JUnit test for the Text class.

           Many thanks to Addison Philip for his time on the design discussion and the code review. He also contributed quite a lot of code.

          Hairong Kuang added a comment -

           This patch extracts the zero-compressed integer code from the hadoop record package into hadoop io. It also adds functions for comparing serialized integers bytewise.
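
           For illustration, a raw comparator in the spirit of those functions might look like this. It is a sketch building on the hypothetical VarLenSketch reader shown under the description, not the code in VInt.patch.

               // Sketch: compare two length-prefixed byte sequences without decoding
               // them into Strings. readVInt(byte[], int) packs (encoded size,
               // decoded value) into one long, so nothing is allocated here.
               public static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
                 long p1 = VarLenSketch.readVInt(b1, s1);
                 long p2 = VarLenSketch.readVInt(b2, s2);
                 int len1 = (int) p1, len2 = (int) p2;
                 int o1 = s1 + (int) (p1 >>> 32), o2 = s2 + (int) (p2 >>> 32);
                 int n = Math.min(len1, len2);
                 for (int i = 0; i < n; i++) {
                   int x = b1[o1 + i] & 0xFF;  // unsigned byte comparison: for
                   int y = b2[o2 + i] & 0xFF;  // UTF-8 this matches code point order
                   if (x != y) {
                     return x - y;
                   }
                 }
                 return len1 - len2;           // shared prefix: shorter sorts first
               }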

          Hairong Kuang added a comment -

          Sounds great! I believe that ordering by UTF8 is the right way to go.

          Doug Cutting added a comment -

          Re String comparison: The bug here is with Java. Since we wish to keep our persistent data structures language-independent, we should order by UTF-8, not UTF-16.

          The javadoc is confusing:

          http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#compareTo(java.lang.String)

           It says it compares Unicode characters, when in fact it compares UTF-16 code units.

          So any code that orders by Java String and expects things to align with the Hadoop Text class will be buggy when processing text with surrogate pairs. We should make this clear in the javadoc.

          Does this sound reasonable?

          Milind Bhandarkar added a comment -

           The recordio scheme also supports negative numbers, which are not needed here; dropping that support would let us save a few more bits.

          Hairong Kuang added a comment -

           If we use the recordio scheme, we need to extend it so that it can read a variable-length integer from a byte array. This is needed to support bytewise comparison.

          Hairong Kuang added a comment -

          If we use standard UTF8, comparison on the binary form does not produce the same results as the string comparison. See the following example provided by Addison:

          > Consider the sequence U+D800 U+DC00 (a surrogate pair). In String comparison, this compares as less than U+E000 (since
          > D800 < E000). In UTF-8 byte comparisons it is greater than E000 (because the lead byte of the Unicode character U+10000
          > encoded by the surrogate pair is 0xF0, which is bigger than lead byte of U+E000, which is 0xEE).

          Is it an issue?
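
           Addison's example is easy to check directly; here is a small self-contained demonstration (my illustration, not code from any patch):

               public class SurrogateOrderDemo {
                 public static void main(String[] args) throws Exception {
                   String a = "\uD800\uDC00";          // surrogate pair encoding U+10000
                   String b = "\uE000";
                   // String comparison works on UTF-16 code units: 0xD800 < 0xE000.
                   System.out.println(a.compareTo(b)); // negative: a sorts before b
                   byte[] ua = a.getBytes("UTF-8");    // F0 90 80 80
                   byte[] ub = b.getBytes("UTF-8");    // EE 80 80
                   // Unsigned comparison of the lead bytes: 0xF0 > 0xEE.
                   System.out.println((ua[0] & 0xFF) - (ub[0] & 0xFF)); // positive: a sorts after b
                 }
               }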

          eric baldeschwieler added a comment -

           +1 on Doug's suggestion. Let's use real UTF8. Then we can interoperate with more things.

           Agreed that we need to use one of the existing variable-length encodings. Inventing another would be counterproductive. My preference would be to use the recordio scheme, since it is already in Hadoop. If we choose to import the Lucene version, we should consider using it for recordio too; it is easy to change now, since it is still new.

          Doug Cutting added a comment -

          Please see the related Lucene issue. Note that Marvin has attached a patch that includes optimized code for conversion from standard UTF8 to Java strings.

          Doug Cutting added a comment -

           I think we should use this opportunity to switch to standard UTF-8 for persistent data. Optimized code should try to avoid converting these to Java strings. For example, comparison can be done on the binary form (since this yields the same results as lexicographic Unicode comparisons).

          Hairong Kuang added a comment -

           There are two issues with the current implementation of UTF8.

           The first is that it does not handle overly long strings: the length of a string is limited to a short, not an int. I'd like to address this problem by storing the length of a string in a variable-length format. The highest bit of each byte is an extension bit: '1' means that more bytes follow, while '0' marks the last byte.

           The second is that the class uses Java's modified UTF-8 as the serialized form. Some argue that we should use standard UTF-8. It seems to me that serializing a string to Java's modified UTF-8 is quite efficient, but it is Java's internal representation. If we want to support inter-programming-language communication, it makes more sense to use standard UTF-8.

           Also, for the name of the class, could I use "StringWritable"? It would be consistent with other classes that implement WritableComparable, like IntWritable, FloatWritable, etc.
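
           The proposed length encoding, as a writer-side sketch of the format just described (my illustration; the low-order-groups-first byte order is an assumption):

               // Sketch: emit 7 bits per byte, setting the high (extension) bit
               // to '1' while more bytes follow.
               public static void writeVInt(java.io.DataOutput out, int value)
                   throws java.io.IOException {
                 while ((value & ~0x7F) != 0) {           // more than 7 significant bits left
                   out.writeByte((value & 0x7F) | 0x80);  // extension bit '1': not the last byte
                   value >>>= 7;
                 }
                 out.writeByte(value);                    // extension bit '0': last byte
               }
               // Example: a length of 300 (0x12C) encodes as the two bytes 0xAC 0x02.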


            People

             • Assignee: Hairong Kuang
             • Reporter: Michel Tourn
             • Votes: 0
             • Watchers: 0
