Hadoop Common
HADOOP-5589

TupleWritable: Lift implicit limit on the number of values that can be stored

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.1
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      TupleWritable uses an instance field of the primitive type long, presumably so that it can quickly determine whether a position in its array of Writables has been written (by using bit-shifting operations on the long field). The problem is that this implies a maximum of 64 values that can be stored in a TupleWritable.

      An example of a use-case where I think this would be a problem: if you had two MR jobs with over 64 reduce tasks and you wanted to join the outputs with CompositeInputFormat, this would probably cause unexpected results under the current scheme.

      At the very least, the 64-value limit should be documented in TupleWritable.
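      The wrap-around behind the limit is easy to demonstrate in isolation. This is an illustrative snippet, not code from TupleWritable: in Java, shift counts on a long are taken modulo 64, so marking "position 64" in a long bitmask silently sets position 0 instead.

      ```java
      public class ShiftWrapDemo {
        public static void main(String[] args) {
          long written = 0L;
          written |= (1L << 64);        // intended: mark position 64
          System.out.println(written);  // prints 1 -- position 0 was set instead
          System.out.println((1L << 64) == (1L << 0)); // prints true
        }
      }
      ```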

      1. HADOOP-5589-1.patch
        9 kB
        Jingkei Ly
      2. HADOOP-5589-2.patch
        10 kB
        Jingkei Ly
      3. HADOOP-5589-3.patch
        10 kB
        Jingkei Ly
      4. HADOOP-5589-4.patch
        14 kB
        Jingkei Ly
      5. HADOOP-5589-4.patch
        14 kB
        Jingkei Ly

        Activity

        Jingkei Ly added a comment -

        The example below demonstrates the current 64-value limit in TupleWritable:

        Text emptyText = new Text("Should not be set written");

        // 64 values: position 63 is the last that the long bitmask can represent.
        public void testTupleBoundarySuccess() throws Exception {
            Writable[] values = new Writable[64];
            Arrays.fill(values, emptyText);
            values[63] = new Text("Should be the only value set written");

            TupleWritable tuple = new TupleWritable(values);
            tuple.setWritten(63);

            for (int pos = 0; pos < tuple.size(); pos++) {
              boolean has = tuple.has(pos);
              if (pos == 63) {
                assertTrue(has);
              } else {
                assertFalse("Tuple position is incorrectly labelled as set: " + pos, has);
              }
            }
        }

        // 65 values: position 64 overflows the long bitmask, so this test fails.
        public void testTupleBoundaryFailure() throws Exception {
            Writable[] values = new Writable[65];
            Arrays.fill(values, emptyText);
            values[64] = new Text("Should be the only value set written");

            TupleWritable tuple = new TupleWritable(values);
            tuple.setWritten(64);

            for (int pos = 0; pos < tuple.size(); pos++) {
              boolean has = tuple.has(pos);
              if (pos == 64) {
                assertTrue(has);
              } else {
                assertFalse("Tuple position is incorrectly labelled as set: " + pos, has);
              }
            }
        }
        
        Jingkei Ly added a comment -

        My suggestion would be to replace the long field, "written", with a java.util.BitSet. This should provide more than enough space to do the job of the written field in a more scalable fashion, and is relatively compact. I'm not sure of what the performance implications are though, if any.
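        A minimal sketch of the suggestion above (the class and field names mirror TupleWritable's existing "written"/"setWritten"/"has" members, but this is not the attached patch): java.util.BitSet grows on demand, so there is no fixed 64-position cap.

        ```java
        import java.util.BitSet;

        class TupleSketch {
          // Replaces the primitive long bitmask; grows as positions are set.
          private final BitSet written = new BitSet();

          void setWritten(int i)   { written.set(i); }
          void clearWritten(int i) { written.clear(i); }
          boolean has(int i)       { return written.get(i); }
        }
        ```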

        Jingkei Ly added a comment -

        I've attached my shot at implementing this with BitSets.

        Jingkei Ly added a comment -

        I've noticed a bug in my original patch where writeBitSet() would incorrectly detect when to write out a new long to the stream - this new patch should fix it.

        Jingkei Ly added a comment -

        This is another change to the patch - this time to make the serialized bitset more compact (by writing the bitset to the stream as bytes instead of variable-length longs). Also addressed a problem when writing a sparse bitset.
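        To illustrate the byte-packing idea (an illustrative sketch only, not the attached patch; the helper name "toBytes" is hypothetical): serializing the bitset one byte at a time lets the on-disk size grow with the number of values rather than in whole 8-byte longs.

        ```java
        import java.util.BitSet;

        class BitSetBytes {
          // Pack the first nbits positions of a BitSet into a compact byte array.
          static byte[] toBytes(BitSet bits, int nbits) {
            byte[] out = new byte[(nbits + 7) / 8];
            for (int i = bits.nextSetBit(0); i >= 0 && i < nbits;
                 i = bits.nextSetBit(i + 1)) {
              out[i / 8] |= (byte) (1 << (i % 8)); // set bit i in its byte
            }
            return out;
          }
        }
        ```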

        Jingkei Ly added a comment -

        I think I'm probably going to be hitting this problem soon. Has anybody got any comments on the proposed solution and the patch I submitted?

        Chris Douglas added a comment -

        Unfortunately, this would be an incompatible change to TupleWritable; one could no longer read tuples written using an older version of Hadoop, right?

        I see two possible approaches:

        1. Define backwards-compatible semantics for TupleWritable, so readFields will read both old and new
        2. Subclass or create a new TupleWritable2 (or whatever) and modify the join framework to use that instead
        Jingkei Ly added a comment -

        The patch could be backward-compatible if the bitset were written to the stream as VLongs (essentially what was implemented in HADOOP-5589-2.patch, minus the bug with sparse bitsets), as the bytes written to the stream would be exactly the same in both implementations as long as there were fewer than 64 values.

        However, because we can't read an old TupleWritable containing over 64 values without throwing an EOFException, it won't be "fully" backward-compatible.

        While I would be tempted to argue that TupleWritable never supported more than 64 values in a tuple anyway, is there still a need to support users who were storing tuples of over 64 values, albeit with incorrect results?
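        One way to picture the compatibility argument (an illustrative sketch, not the attached patch; it assumes, as the comment implies, that the legacy "written" long was serialized as a VLong): for tuples of fewer than 64 values, the new BitSet reduces to exactly the legacy long, so the VLong bytes on the stream are identical.

        ```java
        import java.util.BitSet;

        class Compat {
          // Collapse the low 64 positions of a BitSet into the legacy long field.
          static long toLong(BitSet bits) {
            long written = 0L;
            for (int i = bits.nextSetBit(0); i >= 0 && i < 64;
                 i = bits.nextSetBit(i + 1)) {
              written |= 1L << i;
            }
            return written; // byte-identical serialization for < 64 values
          }
        }
        ```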

        Chris Douglas added a comment -

        The patch could be backwards-compatible if the bitset was written to the stream as VLongs (essentially what had been implemented in HADOOP-5589-2.patch

        This seems like a reasonable approach to me. As long as this can read old data, it's OK if it writes records that cannot be read by the old implementation.

        is there still a need to support users who were storing tuples over 64 values but with incorrect results?

        Such data is already corrupt, so I don't think anything can be done for it.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12404067/HADOOP-5589-3.patch
        against trunk revision 765427.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/200/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/200/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/200/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/200/console

        This message is automatically generated.

        Jingkei Ly added a comment -

        I've attached a new patch to support reading of TupleWritables written by previous versions of the class. As discussed, it will fail to read older TupleWritables if they held > 64 values.

        The new patch contains additional tests to test backwards-compatibility and attempts to fix the findbugs warning.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12405766/HADOOP-5589-4.patch
        against trunk revision 767331.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/223/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/223/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/223/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/223/console

        This message is automatically generated.

        Jingkei Ly added a comment -

        The test that failed appears to be unrelated to this patch, and passes when run locally on my box.

        Chris Douglas added a comment -

        Just a few minor nits:

        • Instead of using BitSet::set(i, false), BitSet::clear(i) should accomplish the same thing in TupleWritable::clearWritten; in the iterator, using clear is probably more readable than flip though they're obviously equivalent.
        • The size check in equals can be dropped, since BitSet also performs it in its equals implementation
        • The loop in readBitSet would be easier to read as:
              for (int offset = Long.SIZE; offset < nbits; offset += Byte.SIZE) {
          
        • I'd replace setBit with the equivalent, single line of code in writeBitSet

        The rest of the changes look good.

        Jingkei Ly added a comment -

        Applied changes to the patch as per Chris's comments (I particularly like the simplification of my overcomplicated for-loop in #readBitSet).

        Chris Douglas added a comment -

        +1

             [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        

        Relevant tests, TestDatamerge and TestTupleWritable both pass

        Chris Douglas added a comment -

        I committed this. Thanks, Jingkei

        Hudson added a comment -

        Integrated in Hadoop-trunk #817 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/817/)
        Eliminate source limit of 64 for map-side joins imposed by
        TupleWritable encoding. Contributed by Jingkei Ly

        Jason Venner (www.prohadoop.com) added a comment -

        How does the CompositeRecordReader change with this patch. I have been backporting to 18.2 and it looks like CompositeRecordReader uses a long to hold information also.

        Jingkei Ly added a comment -

        It looks like you would probably need to backport HADOOP-3721 as well, as this incorporated a change to CompositeRecordReader.JoinCollector that removed the limitation from using longs.


          People

          • Assignee:
            Jingkei Ly
            Reporter:
            Jingkei Ly
          • Votes:
            0
            Watchers:
            2
