Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-4839

TextPartioner for hashing Text with good hashing function to get better distribution

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      partitioner for Text keys using util.Hash framework for hashing function

      1. textpartitioner8.txt
        5 kB
        Radim Kolar
      2. textpartitioner7.txt
        4 kB
        Radim Kolar
      3. textpartitioner6.txt
        4 kB
        Radim Kolar
      4. textpartitioner4.txt
        5 kB
        Radim Kolar
      5. textpartitioner3.txt
        5 kB
        Radim Kolar
      6. textpartitioner2.txt
        5 kB
        Radim Kolar
      7. textpartitioner1.txt
        5 kB
        Radim Kolar

        Activity

        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12555646/textpartitioner1.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        -1 javac. The patch appears to cause the build to fail.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3096//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12555646/textpartitioner1.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. -1 javac . The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3096//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12556180/textpartitioner2.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        -1 javac. The applied patch generated 2014 javac compiler warnings (more than the trunk's current 2013 warnings).

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3099//testReport/
        Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3099//artifact/trunk/patchprocess/diffJavacWarnings.txt
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3099//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12556180/textpartitioner2.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. -1 javac . The applied patch generated 2014 javac compiler warnings (more than the trunk's current 2013 warnings). +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3099//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3099//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3099//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12556265/textpartitioner3.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3101//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3101//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12556265/textpartitioner3.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3101//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3101//console This message is automatically generated.
        Hide
        Luke Lu added a comment -

        Don't you think that the test should verify that configured hash fn is used as expected (at least the default and an arbitrary configuration)?

        Show
        Luke Lu added a comment - Don't you think that the test should verify that configured hash fn is used as expected (at least the default and an arbitrary configuration)?
        Hide
        Radim Kolar added a comment -

        it can not be tested that way without powermock which was rejected.

        Show
        Radim Kolar added a comment - it can not be tested that way without powermock which was rejected.
        Hide
        Robert Joseph Evans added a comment -

        I have not really looked at this in too much detail, but in the partitioner you are converting key to a String to convert it to UTF-8 bytes. Text was designed to store the data in UTF-8 internally instead of UCS-2 like String does. I think you could just call key.getBytes() directly. I think the only difference is that malformed and unmappable bytes will be replaced in the returned String, instead of leaving the bytes as is.

        Show
        Robert Joseph Evans added a comment - I have not really looked at this in too much detail, but in the partitioner you are converting key to a String to convert it to UTF-8 bytes. Text was designed to store the data in UTF-8 internally instead of UCS-2 like String does. I think you could just call key.getBytes() directly. I think the only difference is that malformed and unmappable bytes will be replaced in the returned String, instead of leaving the bytes as is.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12561545/textpartitioner4.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3134//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3134//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12561545/textpartitioner4.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3134//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3134//console This message is automatically generated.
        Hide
        Luke Lu added a comment -

        I agree with Bobby that we should just use key.getBytes() directly.

        Show
        Luke Lu added a comment - I agree with Bobby that we should just use key.getBytes() directly.
        Hide
        Radim Kolar added a comment -

        patch v4 is using that

        Show
        Radim Kolar added a comment - patch v4 is using that
        Hide
        Luke Lu added a comment -

        v4 still shows:

        +    int hashval = hashfn.hash(key.toString().getBytes());
        

        to me.

        Show
        Luke Lu added a comment - v4 still shows: + int hashval = hashfn.hash(key.toString().getBytes()); to me.
        Hide
        Luke Lu added a comment -

        ie., toString is not necessary.

        Show
        Luke Lu added a comment - ie., toString is not necessary.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12561791/textpartitioner6.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core:

        org.apache.hadoop.mapreduce.lib.partition.TestTextPartitioner

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3142//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3142//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12561791/textpartitioner6.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core: org.apache.hadoop.mapreduce.lib.partition.TestTextPartitioner +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3142//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3142//console This message is automatically generated.
        Hide
        Luke Lu added a comment -

        It seems to me that you really don't need mock for testing hash configurations in this case. You can use a real configuration object directly.

        Show
        Luke Lu added a comment - It seems to me that you really don't need mock for testing hash configurations in this case. You can use a real configuration object directly.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12561808/textpartitioner7.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3144//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3144//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12561808/textpartitioner7.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3144//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3144//console This message is automatically generated.
        Hide
        Radim Kolar added a comment -

        i didnt get it. How do you know that p.getPartition() called Hash function if you can not use powermock and hook into it?

        Show
        Radim Kolar added a comment - i didnt get it. How do you know that p.getPartition() called Hash function if you can not use powermock and hook into it?
        Hide
        Luke Lu added a comment -

        I mean you can assert the real value of partition is this case. If numReduceTasks is large enough, the likelihood the value is the same for the tested hashes are small, which IMO is good enough of a test.

        Show
        Luke Lu added a comment - I mean you can assert the real value of partition is this case. If numReduceTasks is large enough, the likelihood the value is the same for the tested hashes are small, which IMO is good enough of a test.
        Hide
        Radim Kolar added a comment -

        check if different hashes returns different partitions

        Show
        Radim Kolar added a comment - check if different hashes returns different partitions
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562377/partitioner8.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3172//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3172//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562377/partitioner8.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3172//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3172//console This message is automatically generated.
        Hide
        Radim Kolar added a comment -

        attached wrong patch, retry.

        Show
        Radim Kolar added a comment - attached wrong patch, retry.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562408/textpartitioner8.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3175//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3175//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562408/textpartitioner8.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3175//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3175//console This message is automatically generated.
        Hide
        Radim Kolar added a comment -

        do not need this in hadoop core

        Show
        Radim Kolar added a comment - do not need this in hadoop core

          People

          • Assignee:
            Radim Kolar
            Reporter:
            Radim Kolar
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development