Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.3, 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      New BinaryPartitioner that partitions BinaryComparable keys by hashing a configurable part of the bytes array corresponding to the key.

Description

It would be useful to have a BinaryPartitioner that partitions BinaryComparable keys by hashing a configurable part of the bytes array corresponding to each key.
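To make the idea concrete, here is a minimal, illustrative sketch (not Hadoop's actual implementation; the function name and the Java-style rolling hash are assumptions for demonstration) of a partitioner that hashes only a configurable inclusive byte range of the key:

```python
# Illustrative sketch of a binary partitioner: hash a configurable
# [left_offset, right_offset] slice of the key's bytes (inclusive,
# negative offsets counting from the end) and map it to a partition.

def partition(key_bytes: bytes, num_partitions: int,
              left_offset: int = 0, right_offset: int = -1) -> int:
    length = len(key_bytes)
    # Normalize negative offsets relative to the key length.
    left = left_offset if left_offset >= 0 else left_offset + length
    right = right_offset if right_offset >= 0 else right_offset + length
    h = 0
    for b in key_bytes[left:right + 1]:
        h = (31 * h + b) & 0xFFFFFFFF  # Java-style rolling hash
    return (h & 0x7FFFFFFF) % num_partitions

# Keys that agree on the selected byte range land in the same partition:
assert partition(b"abcde", 4, 1, 3) == partition(b"xbcdz", 4, 1, 3)
```

The point of such a partitioner is that keys which differ only outside the configured range are still grouped onto the same reducer.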

      1. HADOOP-5528.patch
        10 kB
        Klaas Bosteels
      2. HADOOP-5528.patch
        22 kB
        Klaas Bosteels
      3. HADOOP-5528.patch
        24 kB
        Klaas Bosteels
      4. HADOOP-5528.patch
        20 kB
        Klaas Bosteels
      5. HADOOP-5528.patch
        18 kB
        Klaas Bosteels
      6. 5528_20090401.patch
        13 kB
        Tsz Wo Nicholas Sze
      7. HADOOP-5528-0.18.patch
        28 kB
        Tim Sell
      8. HADOOP-5528-branch-1.patch
        2 kB
        Klaas Bosteels

        Issue Links

          Activity

          Arun C Murthy added a comment -

          Thanks Klaas, I've committed this to branch-1 too.

          Klaas Bosteels added a comment -

          Ah but the version for the old mapred API seems to be missing, and that one is required for Dumbo compatibility since Streaming still uses the old API. Will attach a patch for this in sec...

          Klaas Bosteels added a comment -

          Seems like the BinaryPartitioner is already in branch-1 actually? Guess no further action is needed then – sorry for the confusion.

          Arun C Murthy added a comment -

          Klaas - sure, makes sense. Would you please provide a patch which applies to branch-1? Thanks!

          Klaas Bosteels added a comment -

          Hey Arun, since you seem to have been adding the typed bytes / binary streaming patches to 1.0, I wanted to suggest including this one as well because then 1.0 would be fully compatible with Dumbo out of the box...

          Robert Chansler added a comment -

          Editorial pass over all release notes prior to publication of 0.21.

          Tim Sell added a comment -

          We needed this for 0.18, so I back ported it. Attached patch includes HADOOP-4151 tweaked for 0.18. All tests passed.

          Hudson added a comment -

          Integrated in Hadoop-trunk #796 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/796/)
          . Add a configurable hash partitioner operating on ranges of BinaryComparable keys. Contributed by Klaas Bosteels.

          Konstantin Shvachko added a comment -

          I just committed this.
          Thank you Klaas, Nicholas and Chris for working on this.

          Konstantin Shvachko added a comment -

          +1.
          The patch takes care of the warnings and the redundant test.
          The new deprecated class is transitional, and the same goes for the findbugs warning.

          Klaas Bosteels added a comment -

          The findbugs warning is unavoidable (see also above), so, as far as I'm concerned, this new patch can be committed. Thanks for your help, Nicholas.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12404384/5528_20090401.patch
          against trunk revision 761082.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/console

          This message is automatically generated.

          Klaas Bosteels added a comment -

          Ah right, missed that (I was looking for @SuppressWarnings annotations). I'll close HADOOP-5604 then, since this patch fixes it.

          Tsz Wo Nicholas Sze added a comment -

          It needs <?> in the test.

          +    BinaryPartitioner<?> partitioner = 
          +      ReflectionUtils.newInstance(BinaryPartitioner.class, conf);
          
          Klaas Bosteels added a comment -

          It's not clear to me how you fixed the warning in the unit test, Nicholas. The unit test in your patch seems to be identical to the corresponding one in the patch that got committed.

          Tsz Wo Nicholas Sze added a comment -

          5528_20090401.patch: fixed all unchecked warnings introduced by the previous patch and removed the duplicated test.

          Konstantin Shvachko added a comment -

          I had to revert the patch.
          It needs to deal with deprecated classes, code replication and 32 javac warnings. See discussion in the linked issue.

          Chris Douglas added a comment -

          I committed this. Thanks, Klaas

          Klaas Bosteels added a comment -

          The failed tests are unrelated to the patch, and I think the current semantics are fine, so feel free to commit, Chris.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12403798/HADOOP-5528.patch
          against trunk revision 759321.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/console

          This message is automatically generated.

          Chris Douglas added a comment -

          FWIW, I like the current semantics. If you're OK with them, I'll commit this patch.

          Chris Douglas added a comment -

          Please don't hesitate to pick the semantics. After all, this is your patch. If you prefer python semantics to "python-like," then by all means: go that route. I mentioned the original comment only to explain where I was coming from, as part of our disagreement stemmed from a mismatch in our understanding. I have no opinion on which are "better", as both are equivalently expressive.

          I'm +1 on the current patch, and would not object if the indices were adjusted to fit the former semantics.

          Klaas Bosteels added a comment -

          Hudson seems to be rather busy, so maybe it's better to just copy my local test-patch output here (for now):

          [exec] -1 overall.  
          [exec] 
          [exec]     +1 @author.  The patch does not contain any @author tags.
          [exec] 
          [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
          [exec] 
          [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          [exec] 
          [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          [exec] 
          [exec]     -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.
          [exec] 
          [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec] 
          [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          

          (The findbugs warning is about the fact that the name of the deprecated class in mapred shadows the simple name of the class it extends.)

          Klaas Bosteels added a comment -

          Apparently I didn't read Owen's proposal carefully enough, Chris. I thought he suggested using the same indices as python, instead of similar but slightly different ones. I think I finally understood what you guys meant now, though, and the new patch implements this. My only concern with this is that the slight differentiation from Python might confuse some programmers, but the javadoc explains it clearly so I guess that shouldn't be much of an issue.

          Chris Douglas added a comment -

          Klaas,

          Thanks for clarifying the python syntax; I hadn't realized that an absent left or right offset would use the beginning/end of the list. It's what I get for not reading the javadoc or that section carefully enough. I was expecting python-like syntax, per Owen's original comment.

          The one point unaddressed is whether the partitioner should honor the indices when they exceed the key length. On this, I continue to disagree with the current patch. Particularly when dealing with raw bytes, it is better to error out and require user intervention than to silently proceed. Too often, this will lead to unexpected and undetected results.

          The class hierarchy still seems unnecessary to me. Since it's clear that what it effects can be mapped to the aforementioned expression, adjusting the user-provided indices to match the python semantics within that mapping takes less code, is more readily understandable, and is more efficient.

          Klaas Bosteels added a comment -

          Chris,

          • 3:-1 does indeed not refer to the last two bytes, but
            • that's how it works in Python as well:
              >>> l = [1,2,3,4,5]
              >>> l[1:-2], l[1:3], l[3:-1]
              ([2, 3], [2, 3], [4])
              
            • you can specify "the last n" bytes by setting only the left offset (because LastIndexer is the default right indexer), which is also how you do it in Python:
              >>> l[-2:]
              [4, 5]
              
            • because of the min in the PosOffsetIndexer, you can also just set the right offset to a large enough number to get "the last n" bytes.
          • I don't think that -1 should be the default right offset, since that would mean that the last byte is ignored by default.
          • It might indeed be possible to use (index + key.getLength()) % key.getLength() for both negative and positive offsets, but we need a separate indexer to implement the default right index anyway, and I think it makes sense to minimize the required computations by using more specialized indexers.

          So, personally, I think that:

          • we need the indexer classes (and cannot use -1 as default right index),
          • the max/min games are useful (and not merely a way of preventing exceptions),
          • the semantics are correct,

          which leaves me with nothing to change in the latest patch :) Can you agree with this, or is there still something you want me to change nevertheless?
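          The trade-off being debated can be made concrete with a short sketch (illustrative only; the helper below is hypothetical, not Hadoop code): under Python's exclusive-end slicing a right index of -1 excludes the last element, whereas with inclusive offsets normalized via (index + length) % length, -1 addresses the last byte itself, so "the last n bytes" stays expressible:

```python
# Compare exclusive Python slicing with inclusive offsets where
# negative indices are normalized via (index + length) % length,
# as floated in this thread. Hypothetical helper, not Hadoop code.

def inclusive_slice(data: bytes, left: int, right: int) -> bytes:
    """Select data[left..right] inclusive; negative offsets count
    from the end of the array."""
    n = len(data)
    left = (left + n) % n
    right = (right + n) % n
    return data[left:right + 1]

data = bytes([1, 2, 3, 4, 5])
# Python's exclusive-end slicing drops the last element for -1:
assert data[3:-1] == bytes([4])
# Inclusive offsets make -1 address the last byte, so 3..-1 covers
# the last two bytes:
assert inclusive_slice(data, 3, -1) == bytes([4, 5])
assert inclusive_slice(data, -2, -1) == bytes([4, 5])
```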

          Chris Douglas added a comment -

          This is looking good.

          • Instead of the Indexer classes, this can use (index + key.getLength()) % key.getLength() for both positive and negative values
          • The right index should probably default to -1
          • It's fair to assume that a user specifying an offset into binary data is certain that its length can accommodate it, so the max/min games preventing exceptions are probably too permissive.
          • The semantics as written for the unit tests confuse me. For example, given
            x = { 1, 2, 3, 4, 5 };
            y = { 6, 2, 3, 7, 8 };
            

            The test asserts that the left, right offsets 1, -2 and 1, 3 refer to the same elements, but this seems incorrect, because then 3, -1 would not refer to the last two bytes, but to the same byte. There would be no way to specify "the last n" bytes because the last byte is unaddressable.

          Klaas Bosteels added a comment -

          Thanks Chris. The new patch addresses all of the issues you raised:

          • The deprecated class now extends the one from mapreduce.
          • Static convenience methods for setting the offsets were added.
          • The indexer classes are private now (and hence also not configurable anymore).

          Since the offset properties still might need to be set manually when this partitioner is used by a non-Java Hadoop program (e.g. a Streaming program), they are still documented in the javadoc for the class, but I added a paragraph that encourages Java programmers to use the static convenience methods instead.

          Findbugs will probably complain about the fact that the name of the deprecated class in mapred shadows the simple name of the class it extends, by the way.

          Chris Douglas added a comment -
          • It would make sense for the deprecated partitioner class to extend the partitioner from mapreduce, per the idiom used in HADOOP-1230
          • Code using BinaryPartitioner might be more readable if it had a static method for setting the left and right indices in the config. There are other examples in lib
          • The indexer class hierarchy confuses me. Instead of making the semantics configurable, defining and describing them in the aforementioned static method would be much easier. I'd also be a fan of using the python semantics, all the time.
          Klaas Bosteels added a comment -

          Any further comments on the latest patch?

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12403429/HADOOP-5528.patch
          against trunk revision 757459.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/console

          This message is automatically generated.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12403429/HADOOP-5528.patch
          against trunk revision 756858.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/console

          This message is automatically generated.

          Klaas Bosteels added a comment -

          Corrected small mistake and extended unit tests accordingly.

          Klaas Bosteels added a comment -

          Discovered a small mistake ("Math.min(offset, length - 1)" should really be "Math.min(offset, length)"). Will submit a corrected patch soon.

          Klaas Bosteels added a comment - edited

          The revised patch allows the subarray to be defined by means of Python-style offsets:

          • mapred.binary.partitioner.left.offset: left Python-style offset in array
          • mapred.binary.partitioner.right.offset: right Python-style offset in array

          The best way to remember how these offsets work is by thinking of them as indices pointing between the array elements, with the left edge of the first element numbered 0, e.g.:

  +---+---+---+---+---+
            | B | B | B | B | B |
            +---+---+---+---+---+
            0   1   2   3   4   5
           -5  -4  -3  -2  -1
          

          The first row of numbers gives the position of the offsets 0...5 in the array; the second row gives the corresponding negative offsets. When i and j are specified as left and right offset, respectively, then all bytes between the edges labeled i and j are taken into account for the partitioning.

          More generally, the indexing logic can now be customized by specifying the BinaryPartitioner.Indexer classes to be used via the following properties:

          • mapred.binary.partitioner.left.indexer.class
          • mapred.binary.partitioner.right.indexer.class

          By default, FirstIndexer and LastIndexer are used (i.e. the whole byte array is taken into account for the hashing), and the offset properties trigger the usage of PosOffsetIndexer and/or NegOffsetIndexer, which implement the indexing by means of Python-style offsets.
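
          Assuming the semantics described above, the offset resolution and partitioning can be sketched in Python (the names resolve_offset and partition are illustrative, not from the patch; the hash loop is modelled on Hadoop's WritableComparator.hashBytes, which agrees with Python's unsigned bytes for ASCII input):

```python
def resolve_offset(offset, length):
    """Resolve a Python-style offset to an index in [0, length];
    negative offsets count back from the end of the array."""
    if offset < 0:
        offset += length
    return max(0, min(offset, length))

def partition(key, num_partitions, left=0, right=None):
    """Hash only the bytes between the resolved left and right offsets.
    right=None stands in for the default LastIndexer (end of array)."""
    i = resolve_offset(left, len(key))
    j = len(key) if right is None else resolve_offset(right, len(key))
    h = 1
    for b in key[i:j]:
        h = (31 * h + b) & 0xFFFFFFFF  # Java-style 31*h + b hash
    return h % num_partitions

# keys differing only beyond the right offset land in the same partition
assert partition(b"keyAA", 8, left=0, right=-2) == \
       partition(b"keyZZ", 8, left=0, right=-2)
```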

          Owen O'Malley added a comment -

          I understand now. right is defined as the number of bytes from the right to ignore. Unfortunately, that means you can't use it to pick up bytes 4-8 if you don't know the length. How about using Python-style offsets, where negative numbers mean counting from the right? That would allow a lot more flexibility.

          0 = start of bytes
          1 = after first byte
          -2 = before last byte
          -1 = end of bytes

          So left=4, right=8 would use bytes 4-8 counting from the left.
          Left=-5, right=-1 would use the last 4 bytes.
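
          For comparison, Python's own slice semantics behave as follows (a standalone illustration; note that in Python a right edge of -1 stops one byte short of the end, so including the final byte requires len(data) or an omitted edge):

```python
data = bytes(range(10))  # ten bytes, values 0..9

# positive offsets: left=4, right=8 selects the bytes at indices 4..7
assert data[4:8] == bytes([4, 5, 6, 7])

# negative offsets count from the right; -5:-1 yields four bytes
# but stops one byte before the end of the array
assert data[-5:-1] == bytes([5, 6, 7, 8])

# to take everything up to and including the final byte,
# use len(data) (or simply omit the right edge)
assert data[-4:len(data)] == bytes([6, 7, 8, 9])
```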

          Klaas Bosteels added a comment -

          The meaning is different, yes. Specifying left = x means that the x leftmost bytes will not be taken into account and, similarly, specifying right = y means that the y rightmost bytes will not be taken into account. So the right offset is not relative to the first byte of the array, but to the last one.
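
          In other words, a minimal Python illustration of these original left/right semantics (select_bytes is a made-up name, not code from the patch):

```python
def select_bytes(key: bytes, left: int = 0, right: int = 0) -> bytes:
    """Drop the `left` leftmost and `right` rightmost bytes of the key."""
    return key[left:len(key) - right]

# ignores the leading 'a' and the trailing 'ef'
assert select_bytes(b"abcdef", left=1, right=2) == b"bcd"
```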

          Owen O'Malley added a comment -

          I don't see the issue. How is it different specifying:

          offset = 1
          length = 5

          rather than

          left = 1
          right = 6

          Am I missing something?

          Klaas Bosteels added a comment -

          Well, I basically created this issue because I wanted the hashing to ignore the last n bytes from variable length keys, so specifying the offset and length wouldn't really work for me. Maybe we could support both and only take into account the length when both a right offset and a length are specified?
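
          In rough Python terms, this use case looks as follows (a sketch with an illustrative name, not code from any patch; the hash loop mimics a Java-style byte-array hash):

```python
def partition_ignoring_suffix(key: bytes, n: int, num_partitions: int) -> int:
    """Partition on all but the last n bytes of a variable-length key."""
    h = 1
    for b in key[:len(key) - n]:  # drop the trailing n bytes
        h = (31 * h + b) & 0xFFFFFFFF
    return h % num_partitions

# keys that differ only in their last n bytes end up in the same partition
assert partition_ignoring_suffix(b"user42\x00\x01", 2, 16) == \
       partition_ignoring_suffix(b"user42\x07\x09", 2, 16)
```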

          Owen O'Malley added a comment -

          I think this would be much clearer if the attribute names were:

          mapred.binary.partitioner.offset
          mapred.binary.partitioner.length

          especially if you used min(key.length, mapred.binary.partitioner.length) so that you could set it to a large value and use the entire key.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12402567/HADOOP-5528.patch
          against trunk revision 756256.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/console

          This message is automatically generated.

          Klaas Bosteels added a comment -

          The attached patch adds both org.apache.hadoop.mapred.lib.BinaryPartitioner and org.apache.hadoop.mapreduce.lib.partition.BinaryPartitioner, which both take into account the following configuration properties:

          • mapred.binary.partitioner.left.offset: left offset in the keys' byte arrays (0 by default)
          • mapred.binary.partitioner.right.offset: right offset in the keys' byte arrays (0 by default)

  People

  • Assignee: Klaas Bosteels
  • Reporter: Klaas Bosteels
  • Votes: 0
  • Watchers: 9
