Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6793

NumericRangeQuery.hashCode() produces frequent collisions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.6, 5.3
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We have a user who is developing a Solr plugin and needs to store NumericRangeQuery objects in a hash table. They found that NumericRangeQuery.hashCode() produces extremely frequent collisions. I understand that the contract for hashCode doesn't (and can't) guarantee unique hash codes for every value, but the distribution of this method seems particularly bad with an affinity for the hash value 897548010. Out of a set of 31 ranges, hashCode returned 897548010 14 times. This is going to result in very inefficient distribution of the objects in the hash table. The standard "times 31" hash function recommended by Effective Java fares quite a bit better, although it still produces quite a few collisions. Here's a test case that compares the results of the current hashCode function with the times 31 method. An even better method, like Murmur3 might be found here: http://floodyberry.com/noncryptohashzoo/

      package com.company;
      
      import org.apache.lucene.search.NumericRangeQuery;
      
      public class Main {
      
          public static int betterHash(NumericRangeQuery query) {
              // I can't subclass NumericRangeQuery since it's constructor is private, so I can't call super and
              // had to copy and paste from the hashCode method for both MultiTermQuery and NumericRangeQuery
      
              // MultiTermQuery.hashCode (copied verbatim)
              final int prime = 31;
              int result = 1;
              result = prime * result + Float.floatToIntBits(query.getBoost());
              result = prime * result + query.getRewriteMethod().hashCode();
              if (query.getField() != null) result = prime * result + query.getField().hashCode();
      
              // NumericRangeQuery.hashCode (changed XOR with random constant to times 31)
              result = result * prime + query.getPrecisionStep();
              if (query.getMin() != null) result = result * prime + query.getMin().hashCode();
              if (query.getMax() != null) result = result * prime + query.getMax().hashCode();
              result = result * prime + Boolean.valueOf(query.includesMin()).hashCode();
              result = result * prime + Boolean.valueOf(query.includesMax()).hashCode();
              return result;
          }
      
          public static void main(String[] args) {
              long previous = Long.MIN_VALUE;
              long[] list = {
                      -9223372036854775798L,
                      -8608480567731124078L,
                      -7993589098607472357L,
                      -7378697629483820637L,
                      -6763806160360168916L,
                      -6148914691236517196L,
                      -5534023222112865475L,
                      -4919131752989213755L,
                      -4304240283865562034L,
                      -3689348814741910314L,
                      -3074457345618258593L,
                      -2459565876494606873L,
                      -1844674407370955152L,
                      -1229782938247303432L,
                      -614891469123651711L,
                      10L,
                      614891469123651730L,
                      1229782938247303451L,
                      1844674407370955171L,
                      2459565876494606892L,
                      3074457345618258612L,
                      3689348814741910333L,
                      4304240283865562053L,
                      4919131752989213774L,
                      5534023222112865494L,
                      6148914691236517215L,
                      6763806160360168935L,
                      7378697629483820656L,
                      7993589098607472376L,
                      8608480567731124097L,
                      Long.MAX_VALUE
              };
      
              for (long current : list) {
                  NumericRangeQuery<Long> query =  NumericRangeQuery.newLongRange("_token_long", 8, previous, current, true, true);
                  System.out.println("[" + previous + " TO " + current + "]: " + query.hashCode() + " / " + betterHash(query));
                  previous = current + 1;
              }
          }
      }
      

        Activity

        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Hmmm, yeah... the XORs of random constants don't actually accomplish much in the current hashCode since there is no bit shifting (and the xor operations will be done the same for all NumericRangeQueries).
        Your betterHash looks much better, more in line with our other hashCode implementations.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Hmmm, yeah... the XORs of random constants don't actually accomplish much in the current hashCode since there is no bit shifting (and the xor operations will be done the same for all NumericRangeQueries). Your betterHash looks much better, more in line with our other hashCode implementations.
        Hide
        jpountz Adrien Grand added a comment -

        +1 to switching to the "times 31" method

        Show
        jpountz Adrien Grand added a comment - +1 to switching to the "times 31" method
        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch. This also fixes the new point queries that were using the same pattern for their hashcode.

        Show
        jpountz Adrien Grand added a comment - Here is a patch. This also fixes the new point queries that were using the same pattern for their hashcode.
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Adrien Grand!

        Show
        mikemccand Michael McCandless added a comment - Thanks Adrien Grand !
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 509c6a0acbeaee6291bb80a5f9537aaff55599c4 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=509c6a0 ]

        LUCENE-6793: Make LegacyNumericRangeQuery and point queries less subject to hash collisions.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 509c6a0acbeaee6291bb80a5f9537aaff55599c4 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=509c6a0 ] LUCENE-6793 : Make LegacyNumericRangeQuery and point queries less subject to hash collisions.

          People

          • Assignee:
            Unassigned
            Reporter:
            jblangston@datastax.com J.B. Langston
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development