Uploaded image for project: 'Commons Collections'
  1. Commons Collections
  2. COLLECTIONS-824

BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Done
    • 4.5.0-M1
    • 4.5.0-M1
    • Collection

    Description

                  public boolean forEachIndex(IntPredicate consumer) {
                      Objects.requireNonNull(consumer, "consumer");
                      int bits = shape.getNumberOfBits();
                      /*
                       * Essentially this is computing a wrapped modulus from a start point and an
                       * increment. So actually you only need two modulus operations before the loop.
                       * This avoids any modulus operation inside the while loop. It uses a long index
                       * to avoid overflow.
                       */
                      long index = mod(initial, bits);
                      int inc = mod(increment, bits);
      
      
                      for (int functionalCount = 0; functionalCount < shape.getNumberOfHashFunctions(); functionalCount++) {
      
      
                          if (!consumer.test((int) index)) {
                              return false;
                          }
                          index += inc;
                          index = index >= bits ? index - bits : index;
                      }
                      return true;
                  }
      
      

      This can be fixed to be non zero by using a smaller modulus:

                      // Ensure non-zero increment
                      int inc = 1 + mod(increment, bits - 1);
      

      This will prevent the 1/bits occurrence that the increment is zero and the hash is poor (it will be a single index).

      If bits == 1 then this will throw an exception. However if bits==1 this is an edge case that can be handled immediately after obtaining the number of bits as:

                      if (bits == 1) {
                          // No need to send k duplicates, the filter shape has been constructed incorrectly.
                          return consumer.test(0);
                      }
      
      

      https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144

      Here we have index as a long. However if the increment (which is always positive) is subtracted we can use an int and compare to zero:

      index -= inc;
      index = index < 0 ? index + bits : index;
      
      

      This does break a lot of unit tests. These tests have been created using SimpleHasher knowing what the hash modulus will be. The tests should be updated to use a FixedHasher in the test package:

          /**
           * To be used for testing only. The fixed indices must fit within the filter shape.
           */
          class FixedHasher implements Hasher {
              private final int[] indices;
              private int[] distinct;
      
              FixedHasher(int... indices) {
                  this.indices = indices.clone();
              }
      
              @Override
              public IndexProducer indices(Shape shape) {
                  checkIndices(shape);
                  return IndexProducer.fromIndexArray(indices);
              }
      
              @Override
              public IndexProducer uniqueIndices(Shape shape) {
                  checkIndices(shape);
                  int[] d = distinct;
                  if (d == null) {
                      distinct = d = Arrays.stream(indices).distinct().toArray();
                  }
                  return IndexProducer.fromIndexArray(d);
              }
      
              private void checkIndices(Shape shape) {
                  // Check the hasher is OK for the shape
                  int bits = shape.getNumberOfBits();
                  for (int i : indices) {
                      Assertions.assertTrue(i < bits);
                  }
              }
          }
      
      

      This hasher will allow filters to be constructed and manipulated. Or change it to use % bits and drop the assertion in checkIndices.

      The SimpleHasher should just be tested that it outputs random indices. I would suggest creating a few thousand randomly seeded hashers, outputting 5 indices from each and building a histogram of the counts of each index. This should be uniform when tested with a Chi-square test. This can be repeated for a few different shape sizes.

      Updating the filter in this way will still implement a Krisch and Mitzenmacher hasher where:

      g_i(x) = (h1(x) + i h2(x)) mod m
      

      The only difference is the subtraction rather than addition.

      The wikipedia article on Double Hashing (variants) notes that this type of hasher when used in Bloom filters will result in a collision rate that is twice as likely as intended.

      An alternative is to provide enhanced double hashing which changes the increment each loop iteration. This would require some additional work on wrapping the modulus of the increment.

      It may be better to change the name to KMHasher and more explicitly state that the hash functions are:

       

      g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
      

       

      Attachments

        Issue Links

          Activity

            People

              claude Claude Warren
              claude Claude Warren
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: