Lucene - Core
  1. Lucene - Core
  2. LUCENE-5604

Should we switch BytesRefHash to MurmurHash3?

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      MurmurHash3 has better hashing distribution than the current hash function we use for BytesRefHash which is a simple multiplicative function with 31 multiplier (same as Java's String.hashCode, but applied to bytes not chars). Maybe we should switch ...

      1. BytesRefHash.perturb.patch
        4 kB
        Michael McCandless
      2. LUCENE-5604.patch
        30 kB
        Michael McCandless
      3. LUCENE-5604.patch
        28 kB
        Michael McCandless
      4. LUCENE-5604.patch
        33 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial patch, Lucene tests pass, but solrj doesn't yet compile....

        I factored out Hash.murmurhash3_x86_32 from Solr into Lucene's StringHelper, and cut over BytesRef.hash, TermToBytesRefAttribute.fillBytesRef, and BytesRefHash.

        I left some nocommits: I think we should change TermToBytesRefAttribute to not return this hashCode? And also remove the BytesRefHash.add method that takes a hashCode? Seems awkward to make the hash code impl of BytesRefHash so public ... it should be under the hood.

        I also randomized/salted the hash seed per JVM instance (poached this from Guava), by setting a common static seed on JVM init (just System.currentTimeMillis()). This should frustrate denial of service attacks, and also can catch any places where we rely on this hash function not changing across JVM instances (e.g. persisting to disk somewhere).

        Show
        Michael McCandless added a comment - Initial patch, Lucene tests pass, but solrj doesn't yet compile.... I factored out Hash.murmurhash3_x86_32 from Solr into Lucene's StringHelper, and cut over BytesRef.hash, TermToBytesRefAttribute.fillBytesRef, and BytesRefHash. I left some nocommits: I think we should change TermToBytesRefAttribute to not return this hashCode? And also remove the BytesRefHash.add method that takes a hashCode? Seems awkward to make the hash code impl of BytesRefHash so public ... it should be under the hood. I also randomized/salted the hash seed per JVM instance (poached this from Guava), by setting a common static seed on JVM init (just System.currentTimeMillis()). This should frustrate denial of service attacks, and also can catch any places where we rely on this hash function not changing across JVM instances (e.g. persisting to disk somewhere).
        Hide
        Michael McCandless added a comment -

        I ran performance tests on first 5M Wikipedia "medium" (1 KB sized)
        docs and Geonames (sources for the benchmark are all in luceneutil):

        Wiki first 5M docs, no merge policy, 64 MB RAM buffer, 4 indexing threads, default codec:
          trunk:    136.985 sec, 189729244 conflicts
          murmur:   134.156 sec, 164990724 conflicts
        
        Geonames, no merge policy, 64 MB RAM buffer, 4 indexing threads, default codec:
          trunk:    167.354 sec, 236051203 conflicts
          murmur:   168.101 sec, 179747265 conflicts
        

        Net/net the indexing time is the same (within noise of run-to-run).
        The conflict count is how many times we had to probe in the open
        addressed hash table inside BytesRefHash, and Murmur3 gives a nice
        reduction (~ 13-24%). I think we should switch.

        Show
        Michael McCandless added a comment - I ran performance tests on first 5M Wikipedia "medium" (1 KB sized) docs and Geonames (sources for the benchmark are all in luceneutil): Wiki first 5M docs, no merge policy, 64 MB RAM buffer, 4 indexing threads, default codec: trunk: 136.985 sec, 189729244 conflicts murmur: 134.156 sec, 164990724 conflicts Geonames, no merge policy, 64 MB RAM buffer, 4 indexing threads, default codec: trunk: 167.354 sec, 236051203 conflicts murmur: 168.101 sec, 179747265 conflicts Net/net the indexing time is the same (within noise of run-to-run). The conflict count is how many times we had to probe in the open addressed hash table inside BytesRefHash, and Murmur3 gives a nice reduction (~ 13-24%). I think we should switch.
        Hide
        Michael McCandless added a comment -

        Separately, I also tried a different probing function inside
        BytesRefHash, poaching the "perturbing" approach from Python's
        dictionary object:

        Wiki
          murmur + perturb: 134.228 sec, 176358406 conflicts
        
        Geonames
          murmur + perturb: 167.735 sec, 200311281 conflicts
        

        Curiously, it increased the number of collisions from Murmur3 alone.
        It's possible I messed up the implementation (though all Lucene tests
        did pass).

        Or, it could be that because we only use 32 bits for our hash code
        (Python uses 64 bit hash codes on 64 bit arch), we just don't have
        enough bits to mixin when probing for new addresses.

        In fact, if we move all hashing to be private (under the hood) of
        BytesRefHash, maybe we could switch to the 128 bit variant MurmurHash3
        and then the "perturbing" might help.

        Show
        Michael McCandless added a comment - Separately, I also tried a different probing function inside BytesRefHash, poaching the "perturbing" approach from Python's dictionary object: Wiki murmur + perturb: 134.228 sec, 176358406 conflicts Geonames murmur + perturb: 167.735 sec, 200311281 conflicts Curiously, it increased the number of collisions from Murmur3 alone. It's possible I messed up the implementation (though all Lucene tests did pass). Or, it could be that because we only use 32 bits for our hash code (Python uses 64 bit hash codes on 64 bit arch), we just don't have enough bits to mixin when probing for new addresses. In fact, if we move all hashing to be private (under the hood) of BytesRefHash, maybe we could switch to the 128 bit variant MurmurHash3 and then the "perturbing" might help.
        Hide
        Robert Muir added a comment -

        Can we use methods like Integer.reverseBytes/rotateLeft instead of doing byte swapping or bit rotations manually? this may improve the speed, e.g. the former is a jvm intrinsic.

        Show
        Robert Muir added a comment - Can we use methods like Integer.reverseBytes/rotateLeft instead of doing byte swapping or bit rotations manually? this may improve the speed, e.g. the former is a jvm intrinsic.
        Hide
        Yonik Seeley added a comment -

        The JVM recognizes pairs of shifts that amount to a rotate and replaces them with an intrinsic.

        Initial patch, Lucene tests pass, but solrj doesn't yet compile....

        Right - SolrJ does not have lucene dependencies. Solr also depends on the exact hash, so it can't be tweaked (for example if a variant turns out to be better for lucene indexing). Perhaps Lucene should just make a copy of the one it needs (the byte[] version).

        Show
        Yonik Seeley added a comment - The JVM recognizes pairs of shifts that amount to a rotate and replaces them with an intrinsic. Initial patch, Lucene tests pass, but solrj doesn't yet compile.... Right - SolrJ does not have lucene dependencies. Solr also depends on the exact hash, so it can't be tweaked (for example if a variant turns out to be better for lucene indexing). Perhaps Lucene should just make a copy of the one it needs (the byte[] version).
        Hide
        Uwe Schindler added a comment -

        The JVM recognizes pairs of shifts that amount to a rotate and replaces them with an intrinsic.

        I still think we should replace them by the "methods". This is the same like replacing the ternary ? : with Number.compare(x,y) for comparators. Brings no improvements, just better readability in Java 7 and is less error-prone (cf. the possible overflows if implementing the compare with a dumb ternary op).

        Show
        Uwe Schindler added a comment - The JVM recognizes pairs of shifts that amount to a rotate and replaces them with an intrinsic. I still think we should replace them by the "methods". This is the same like replacing the ternary ? : with Number.compare(x,y) for comparators. Brings no improvements, just better readability in Java 7 and is less error-prone (cf. the possible overflows if implementing the compare with a dumb ternary op).
        Hide
        Dawid Weiss added a comment -

        > by setting a common static seed on JVM init (just System.currentTimeMillis()).

        This will render any tests that rely on hash ordering, etc. not-repeatable. I suggest initializing this to current time millis OR to the current random seed value (system property 'tests.seed').

        Show
        Dawid Weiss added a comment - > by setting a common static seed on JVM init (just System.currentTimeMillis()). This will render any tests that rely on hash ordering, etc. not-repeatable. I suggest initializing this to current time millis OR to the current random seed value (system property 'tests.seed').
        Hide
        Michael McCandless added a comment -

        New patch, folding in all feedback (thanks!). I think it's ready:

        • I reverted the Solr changes
        • I dup'd the murmurhash3_x86_32 taking byte[] into StringHelper, but changed to the intrinsics for Integer.rotateLeft
        • I added a small test case, confirming our MurmurHash3 impl matches a separate Python/C impl I found
        • I made the hashing private to BytesRefHash, and changed TermToBytesAtt.fillBytesRef to return void
        • For the seed/salt, I now pull from tests.seed property if it's non-null
        Show
        Michael McCandless added a comment - New patch, folding in all feedback (thanks!). I think it's ready: I reverted the Solr changes I dup'd the murmurhash3_x86_32 taking byte[] into StringHelper, but changed to the intrinsics for Integer.rotateLeft I added a small test case, confirming our MurmurHash3 impl matches a separate Python/C impl I found I made the hashing private to BytesRefHash, and changed TermToBytesAtt.fillBytesRef to return void For the seed/salt, I now pull from tests.seed property if it's non-null
        Hide
        Adrien Grand added a comment -

        Strong +1 on this change!

        Separately, I also tried a different probing function inside BytesRefHash

        I'm wondering if we should try linear probing? Now that we use a good hash function, the likelyness of having clusters of hashes in the hash table is much lower (especially given that BytesRefHash hard-codes quite a low load factor: 0.5) so linear probing might help get some performance back since it tends to be more cache-friendly?

        I added a small test case, confirming our MurmurHash3 impl matches a separate Python/C impl I found

        Maybe we could add Guava as a test dependency and do some duels on random bytes?

        Show
        Adrien Grand added a comment - Strong +1 on this change! Separately, I also tried a different probing function inside BytesRefHash I'm wondering if we should try linear probing? Now that we use a good hash function, the likelyness of having clusters of hashes in the hash table is much lower (especially given that BytesRefHash hard-codes quite a low load factor: 0.5) so linear probing might help get some performance back since it tends to be more cache-friendly? I added a small test case, confirming our MurmurHash3 impl matches a separate Python/C impl I found Maybe we could add Guava as a test dependency and do some duels on random bytes?
        Hide
        Robert Muir added a comment -

        Surely we can avoid depending on guava, lets please not do this.

        Show
        Robert Muir added a comment - Surely we can avoid depending on guava, lets please not do this.
        Hide
        Dawid Weiss added a comment -

        > I'm wondering if we should try linear probing?

        It's what fastutil and hppc use internally. I remember Mike did try switching to linear probing a while back (was it for BytesRefHash or the FST work?) and it didn't bring any reasonable improvement. Also, there are minor gotchas with linear probing like this one: http://issues.carrot2.org/browse/HPPC-80

        Show
        Dawid Weiss added a comment - > I'm wondering if we should try linear probing? It's what fastutil and hppc use internally. I remember Mike did try switching to linear probing a while back (was it for BytesRefHash or the FST work?) and it didn't bring any reasonable improvement. Also, there are minor gotchas with linear probing like this one: http://issues.carrot2.org/browse/HPPC-80
        Hide
        Michael McCandless added a comment -

        OK, new patch using simple linear probe.

        For the benchmark, I switched to a single indexing thread,
        NoMergePolicy, and set RAM buffer to 512 MB (so more entries in the
        hash table before flushing, to provoke more conflict chances). I
        index the first 5M docs of Wikipedia medium, and full Geonames corpus:

        wiki
           trunk:                     402.8 sec, 194906771 conflicts
           murmur3 + current probe:   410.8 sec, 136527391 conflicts
           murmur3 + linear probe:    395.8 sec,  63403239 conflicts
        
        geonames
           trunk:                     624.1 sec, 368781904 conflicts
           murmur3 + current probe:   607.1 sec, 168310018 conflicts
           murmur3 + linear probe:    606.6 sec, 176405531 conflicts
        

        Curious how the linear probe was such a big win for Wikipedia in
        reducing conflicts, but slightly higher conflicts for Geonames, but
        net/net I think we should switch to linear probe.

        Show
        Michael McCandless added a comment - OK, new patch using simple linear probe. For the benchmark, I switched to a single indexing thread, NoMergePolicy, and set RAM buffer to 512 MB (so more entries in the hash table before flushing, to provoke more conflict chances). I index the first 5M docs of Wikipedia medium, and full Geonames corpus: wiki trunk: 402.8 sec, 194906771 conflicts murmur3 + current probe: 410.8 sec, 136527391 conflicts murmur3 + linear probe: 395.8 sec, 63403239 conflicts geonames trunk: 624.1 sec, 368781904 conflicts murmur3 + current probe: 607.1 sec, 168310018 conflicts murmur3 + linear probe: 606.6 sec, 176405531 conflicts Curious how the linear probe was such a big win for Wikipedia in reducing conflicts, but slightly higher conflicts for Geonames, but net/net I think we should switch to linear probe.
        Hide
        Michael McCandless added a comment -

        I remember Mike did try switching to linear probing a while back (was it for BytesRefHash or the FST work?) and it didn't bring any reasonable improvement.

        Looks like it was LUCENE-2967, for FSTs.

        Show
        Michael McCandless added a comment - I remember Mike did try switching to linear probing a while back (was it for BytesRefHash or the FST work?) and it didn't bring any reasonable improvement. Looks like it was LUCENE-2967 , for FSTs.
        Hide
        Dawid Weiss added a comment -

        Looks good to me (although only skimmed through the patch, didn't see the consolidated code).

        Show
        Dawid Weiss added a comment - Looks good to me (although only skimmed through the patch, didn't see the consolidated code).
        Hide
        ASF subversion and git services added a comment -

        Commit 1587489 from mikemccand@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1587489 ]

        LUCENE-5604: switch BytesRef/BytesRefHash hashing to MurmurHash3

        Show
        ASF subversion and git services added a comment - Commit 1587489 from mikemccand@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1587489 ] LUCENE-5604 : switch BytesRef/BytesRefHash hashing to MurmurHash3
        Hide
        ASF subversion and git services added a comment -

        Commit 1587492 from mikemccand@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1587492 ]

        LUCENE-5604: switch BytesRef/BytesRefHash hashing to MurmurHash3

        Show
        ASF subversion and git services added a comment - Commit 1587492 from mikemccand@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1587492 ] LUCENE-5604 : switch BytesRef/BytesRefHash hashing to MurmurHash3
        Hide
        ASF subversion and git services added a comment -

        Commit 1587509 from mikemccand@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1587509 ]

        LUCENE-5604: woops, remove diagnostic conflict counter

        Show
        ASF subversion and git services added a comment - Commit 1587509 from mikemccand@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1587509 ] LUCENE-5604 : woops, remove diagnostic conflict counter
        Hide
        ASF subversion and git services added a comment -

        Commit 1587510 from mikemccand@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1587510 ]

        LUCENE-5604: woops, remove diagnostic conflict counter

        Show
        ASF subversion and git services added a comment - Commit 1587510 from mikemccand@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1587510 ] LUCENE-5604 : woops, remove diagnostic conflict counter
        Hide
        ASF subversion and git services added a comment -

        Commit 1587684 from mikemccand@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1587684 ]

        LUCENE-5604: switch these 'many terms' tests back to previous adversarial hash conflict case

        Show
        ASF subversion and git services added a comment - Commit 1587684 from mikemccand@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1587684 ] LUCENE-5604 : switch these 'many terms' tests back to previous adversarial hash conflict case
        Hide
        ASF subversion and git services added a comment -

        Commit 1587685 from mikemccand@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1587685 ]

        LUCENE-5604: switch these 'many terms' tests back to previous adversarial hash conflict case

        Show
        ASF subversion and git services added a comment - Commit 1587685 from mikemccand@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1587685 ] LUCENE-5604 : switch these 'many terms' tests back to previous adversarial hash conflict case
        Hide
        Michael McCandless added a comment -

        I confirmed that the new hash + probe "fixes" the adversarial case (sequential byte[] terms) that affected the old hash+probe, and converted two of our @Monster tests to use sequential bytes not random bytes.

        Given that sequential bytes could be a fairly common case (e.g. an app that assigns IDs sequentially), maybe we should back-port this fix for 4.8?

        Show
        Michael McCandless added a comment - I confirmed that the new hash + probe "fixes" the adversarial case (sequential byte[] terms) that affected the old hash+probe, and converted two of our @Monster tests to use sequential bytes not random bytes. Given that sequential bytes could be a fairly common case (e.g. an app that assigns IDs sequentially), maybe we should back-port this fix for 4.8?
        Hide
        Robert Muir added a comment -

        Personally I am +1, I think we should at least discuss the risks. We know the old behavior is close to bug territory at least, I hit this problem several times (e.g. trying to test increasing DV values for SortedBytes). And I think this is probably really common in practice.

        Show
        Robert Muir added a comment - Personally I am +1, I think we should at least discuss the risks. We know the old behavior is close to bug territory at least, I hit this problem several times (e.g. trying to test increasing DV values for SortedBytes). And I think this is probably really common in practice.
        Hide
        Michael McCandless added a comment -

        Uwe Schindler what do you think?

        Show
        Michael McCandless added a comment - Uwe Schindler what do you think?
        Hide
        Simon Willnauer added a comment -

        I'm +1 on getting this into 4.8 as well - I ran into a lot of problems where BytesRef#hashCode() had problems

        Show
        Simon Willnauer added a comment - I'm +1 on getting this into 4.8 as well - I ran into a lot of problems where BytesRef#hashCode() had problems
        Hide
        Uwe Schindler added a comment -

        I think the first release candidate will delay a bit ,onger, so why not. If this patch does not change index format, I am fine.

        Show
        Uwe Schindler added a comment - I think the first release candidate will delay a bit ,onger, so why not. If this patch does not change index format, I am fine.
        Hide
        Michael McCandless added a comment -

        OK, thanks Uwe Schindler, there's no index format change; I'll reopen & backport.

        Show
        Michael McCandless added a comment - OK, thanks Uwe Schindler , there's no index format change; I'll reopen & backport.
        Hide
        Michael McCandless added a comment -

        Reopen for 4.8 backport.

        Show
        Michael McCandless added a comment - Reopen for 4.8 backport.
        Hide
        ASF subversion and git services added a comment -

        Commit 1587873 from mikemccand@apache.org in branch 'dev/branches/lucene_solr_4_8'
        [ https://svn.apache.org/r1587873 ]

        LUCENE-5604: switch BytesRef/BytesRefHash hashing to MurmurHash3

        Show
        ASF subversion and git services added a comment - Commit 1587873 from mikemccand@apache.org in branch 'dev/branches/lucene_solr_4_8' [ https://svn.apache.org/r1587873 ] LUCENE-5604 : switch BytesRef/BytesRefHash hashing to MurmurHash3
        Hide
        ASF subversion and git services added a comment -

        Commit 1587874 from mikemccand@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1587874 ]

        LUCENE-5604: move CHANGES entry

        Show
        ASF subversion and git services added a comment - Commit 1587874 from mikemccand@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1587874 ] LUCENE-5604 : move CHANGES entry
        Hide
        ASF subversion and git services added a comment -

        Commit 1587875 from mikemccand@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1587875 ]

        LUCENE-5604: move CHANGES entry

        Show
        ASF subversion and git services added a comment - Commit 1587875 from mikemccand@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1587875 ] LUCENE-5604 : move CHANGES entry
        Hide
        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.

        Show
        Uwe Schindler added a comment - Move issue to Lucene 4.9.
        Hide
        David Smiley added a comment -

        Unless I'm missing something; shouldn't this be marked as resolved/closed and for 4.8?

        Show
        David Smiley added a comment - Unless I'm missing something; shouldn't this be marked as resolved/closed and for 4.8?
        Hide
        Michael McCandless added a comment -

        Woops, yes, thanks for reminding me David Smiley

        Show
        Michael McCandless added a comment - Woops, yes, thanks for reminding me David Smiley
        Hide
        David Smiley added a comment -

        The system property "tests.seed" is being set to the empty string "" in my test setup, and it would be annoying to have it not do this. If one looks at the Maven pom.xml we generate, it's written to pass along certain system properties which get defaulted essentially to "", unless set on the command-line. Can we have the static initializer in StringHelper guard against the empty string case so that it doesn't hit an exception? I'd be happy to commit the trivial change (and to 4.8.1)

        Show
        David Smiley added a comment - The system property "tests.seed" is being set to the empty string "" in my test setup, and it would be annoying to have it not do this. If one looks at the Maven pom.xml we generate, it's written to pass along certain system properties which get defaulted essentially to "", unless set on the command-line. Can we have the static initializer in StringHelper guard against the empty string case so that it doesn't hit an exception? I'd be happy to commit the trivial change (and to 4.8.1)
        Hide
        Robert Muir added a comment -

        Please open a new issue. i dont think this tests.seed stuff is useful anyway. but its not for here: this issue is released.

        Show
        Robert Muir added a comment - Please open a new issue. i dont think this tests.seed stuff is useful anyway. but its not for here: this issue is released.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development