Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      faster and better than murmur2

      1. murmur3.txt
        11 kB
        Radim Kolar
      2. murmur3-2.txt
        12 kB
        Radim Kolar
      3. murmur3-3.txt
        12 kB
        Radim Kolar
      4. murmur3-4.txt
        12 kB
        Radim Kolar
      5. murmur3-5.txt
        12 kB
        Radim Kolar
      6. murmur3-6.txt
        12 kB
        Radim Kolar
      7. murmur3-7.txt
        12 kB
        Radim Kolar

        Activity

        Radim Kolar added a comment -

        Was any decision made on this patch?

        Luke Lu added a comment -

        Murmur3 is the state of the art for non-cryptographic hashing, suitable for non-adversarial input. +1 on the latest patch. +1 on getting SipHash in as well.

        Radim Kolar added a comment -

        I just tried SipHash and it's way slower. I can add SipHash in a separate JIRA patch.

        Andy Isaacson added a comment -

        Current best practice is SipHash, https://131002.net/siphash/ rather than Murmur3. If we're going to change hash functions we should probably use that.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12560006/murmur3-7.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1851//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1851//console

        This message is automatically generated.

        Radim Kolar added a comment -

        Keep the broken murmur2 hash as the default.

        Luke Lu added a comment -

        I agree with ATM that we should not change the default hash type. OTOH, anyone who uses the hash for persistence (e.g., in a Bloom filter) should request an explicit hash type.

        Aaron T. Myers added a comment -

        It looks like this patch is changing the default hash type, which as Todd has pointed out should be considered an incompatible change:

           public static int getHashType(Configuration conf) {
        -    String name = conf.get("hadoop.util.hash.type", "murmur");
        +    String name = conf.get("hadoop.util.hash.type", "murmur3");
             return parseHashType(name);
           }
        

        I think it's fine to add this new option, but I agree with Todd that we should not change the default.
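        As a rough illustration of "add the option, keep the default", here is a minimal standalone sketch. It uses java.util.Properties in place of Hadoop's Configuration so that it runs without Hadoop on the classpath; the class name and the numeric type constants are assumptions made for this example, not the values from the patch.

```java
import java.util.Properties;

public class HashTypeConfigSketch {
    // Illustrative constants for this sketch only (assumed, not taken from the patch).
    public static final int INVALID = -1;
    public static final int JENKINS = 0;
    public static final int MURMUR = 1;
    public static final int MURMUR3 = 2;

    // Map a configured name to a hash type constant.
    public static int parseHashType(String name) {
        if ("jenkins".equalsIgnoreCase(name)) return JENKINS;
        if ("murmur".equalsIgnoreCase(name)) return MURMUR;
        if ("murmur3".equalsIgnoreCase(name)) return MURMUR3;
        return INVALID;
    }

    // The default stays "murmur"; "murmur3" is only used when requested explicitly.
    public static int getHashType(Properties conf) {
        String name = conf.getProperty("hadoop.util.hash.type", "murmur");
        return parseHashType(name);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("default is murmur: " + (getHashType(conf) == MURMUR));

        // Callers that persist hashes (e.g. Bloom filters) would pin the type explicitly.
        conf.setProperty("hadoop.util.hash.type", "murmur3");
        System.out.println("explicit murmur3: " + (getHashType(conf) == MURMUR3));
    }
}
```

        With this shape, existing users see no behavior change, and anything that writes hashes to disk can set the type itself instead of relying on the default.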

        Luke Lu added a comment -

        The patch lgtm. Besides Mahout, it's also being used in Cassandra (CASSANDRA-2975) for Bloom filters, with a significant performance improvement. I can see this getting used by HBase soon.

        I'll commit this, if there is no further objection.

        Radim Kolar added a comment -

        1. Downstream projects are using this function, so having it here avoids code duplication.
        2. murmur2 is currently slower than murmur3 and has a design flaw. It is currently used for generating Bloom filters.
        3. The new TextPartitioner in JIRA uses this hash framework and will benefit from a better hash function being available.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12555068/murmur3-6.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1831//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1831//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12555064/murmur3-5.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javac. The patch appears to cause the build to fail.

        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1830//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12555062/murmur3-4.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javac. The patch appears to cause the build to fail.

        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1829//console

        This message is automatically generated.

        Radim Kolar added a comment -

        The Bloom filter code uses the util.Hash family of functions.

        Radim Kolar added a comment -

        Faster variant from Mahout.

        Radim Kolar added a comment -

        Hash partitioner can do the same thing as Java's Hashtable: rehash hashCode() to get a better distribution. The current hash quality in Hadoop is low: if you have a lot of similar strings like "aaaaaaaa" and "aaaaaab", you will get about 20% suboptimal partitions in average cases, and in some specific cases the split can be something like 80:20 instead of close to 50:50.
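        The rehashing idea above could look roughly like the sketch below. It is not the patch's code: the class and method names are hypothetical, the plain variant follows the usual hashCode-modulo partitioning scheme, and the mixing step borrows the 32-bit finalizer from MurmurHash3 rather than the JDK's own supplemental hash.

```java
public class RehashPartitionSketch {
    // MurmurHash3 32-bit finalizer (fmix32), used here purely as a bit mixer.
    public static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Plain scheme: mask off the sign bit of hashCode() and take the remainder.
    public static int plainPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Rehashed scheme: mix the hashCode() bits first, then partition the same way.
    public static int mixedPartition(Object key, int numPartitions) {
        return (mix(key.hashCode()) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"aaaaaaaa", "aaaaaaab", "aaaaaaac", "aaaaaaad"};
        for (String key : keys) {
            System.out.println(key + " plain=" + plainPartition(key, 2)
                    + " mixed=" + mixedPartition(key, 2));
        }
    }
}
```

        Note that this changes which partition a given key lands in, which is the compatibility concern raised elsewhere in this thread, so it would have to be opt-in.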

        Todd Lipcon added a comment -

        I don't think HashPartitioner can be changed to use anything but Object.hashCode (which we assume the user has implemented). Changing WritableComparator to use a different hash hardly seems worth it. Have you seen cases where the existing hash code causes poor distribution?

        Also, I don't think we can change the default hashcode for existing types, since it would change user-visible partitioning behavior. Adding a new TextPartitioner which uses Murmur3 sounds useful, if you can show that there are indeed real-world datasets where the "poor hash behavior" of the existing partitioner causes skew.
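        For readers wondering what a Murmur3-based text partitioner might look like, here is a minimal standalone sketch. It is an assumption-laden illustration rather than the attached patch: the class and method names are hypothetical, the hash is a straightforward 32-bit MurmurHash3 (x86_32 variant) over the UTF-8 bytes of the key, and the seed of 0 is arbitrary.

```java
import java.nio.charset.StandardCharsets;

public class Murmur3PartitionSketch {
    // MurmurHash3 x86_32 over a byte array.
    public static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h = seed;
        int nblocks = data.length / 4;

        // Body: process 4-byte little-endian blocks.
        for (int i = 0; i < nblocks; i++) {
            int p = i * 4;
            int k = (data[p] & 0xff)
                    | ((data[p + 1] & 0xff) << 8)
                    | ((data[p + 2] & 0xff) << 16)
                    | ((data[p + 3] & 0xff) << 24);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }

        // Tail: the remaining 1 to 3 bytes, if any.
        int tail = nblocks * 4;
        int k1 = 0;
        switch (data.length & 3) {
            case 3: k1 ^= (data[tail + 2] & 0xff) << 16; // fall through
            case 2: k1 ^= (data[tail + 1] & 0xff) << 8;  // fall through
            case 1: k1 ^= (data[tail] & 0xff);
                    k1 *= c1;
                    k1 = Integer.rotateLeft(k1, 15);
                    k1 *= c2;
                    h ^= k1;
        }

        // Finalization: mix in the length and avalanche the bits.
        h ^= data.length;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Hypothetical partition function: hash the text bytes, drop the sign bit, take the remainder.
    public static int partition(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        return (murmur3_32(bytes, 0) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"aaaaaaaa", "aaaaaaab", "aaaaaaac"}) {
            System.out.println(key + " -> partition " + partition(key, 4));
        }
    }
}
```

        Keeping this as a separate, opt-in partitioner leaves the partitioning of existing jobs untouched. Whether it reduces skew on real datasets would still need to be measured.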

        Radim Kolar added a comment -

        To improve hashing. The WritableComparator hash is weak (it is used by BinaryPartitioner), and HashPartitioner sucks even more (it uses Object.hashCode). I will add a TextPartitioner using murmur3, which has a good distribution of results.

        Todd Lipcon added a comment -

        Hi Radim. We don't currently use the murmur2 hash anywhere AFAIK. Where do you anticipate wanting to use this new hash function?

        Radim Kolar added a comment -

        this will be the faster variant of murmur3: https://issues.apache.org/jira/secure/attachment/12501962/MAHOUT-862.patch

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12554783/murmur3-3.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1818//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1818//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12554781/murmur3-2.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1817//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/1817//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1817//console

        This message is automatically generated.

        Radim Kolar added a comment -

        shut up findbugs

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12554780/murmur3.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1816//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/1816//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1816//console

        This message is automatically generated.


          People

          • Assignee:
            Radim Kolar
            Reporter:
            Radim Kolar
          • Votes:
            0
            Watchers:
            8

            Dates

            • Created:
              Updated:
              Resolved:
