[COLLECTIONS-824] BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: 4.5.0-M1
Fix Version/s: 4.5.0-M1
Component/s: Collection
Labels:
- bloom-filter

Description

            public boolean forEachIndex(IntPredicate consumer) {
                Objects.requireNonNull(consumer, "consumer");
                int bits = shape.getNumberOfBits();
                /*
                 * Essentially this is computing a wrapped modulus from a start point and an
                 * increment. So actually you only need two modulus operations before the loop.
                 * This avoids any modulus operation inside the while loop. It uses a long index
                 * to avoid overflow.
                 */
                long index = mod(initial, bits);
                int inc = mod(increment, bits);


                for (int functionalCount = 0; functionalCount < shape.getNumberOfHashFunctions(); functionalCount++) {


                    if (!consumer.test((int) index)) {
                        return false;
                    }
                    index += inc;
                    index = index >= bits ? index - bits : index;
                }
                return true;
            }

This can be fixed to be non zero by using a smaller modulus:

                // Ensure non-zero increment
                int inc = 1 + mod(increment, bits - 1);

This will prevent the 1/bits occurrence that the increment is zero and the hash is poor (it will be a single index).

If bits == 1 then this will throw an exception. However if bits==1 this is an edge case that can be handled immediately after obtaining the number of bits as:

                if (bits == 1) {
                    // No need to send k duplicates, the filter shape has been constructed incorrectly.
                    return consumer.test(0);
                }

https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144

Here we have index as a long. However if the increment (which is always positive) is subtracted we can use an int and compare to zero:

index -= inc;
index = index < 0 ? index + bits : index;

This does break a lot of unit tests. These tests have been created using SimpleHasher knowing what the hash modulus will be. The tests should be updated to use a FixedHasher in the test package:

    /**
     * To be used for testing only. The fixed indices must fit within the filter shape.
     */
    class FixedHasher implements Hasher {
        private final int[] indices;
        private int[] distinct;

        FixedHasher(int... indices) {
            this.indices = indices.clone();
        }

        @Override
        public IndexProducer indices(Shape shape) {
            checkIndices(shape);
            return IndexProducer.fromIndexArray(indices);
        }

        @Override
        public IndexProducer uniqueIndices(Shape shape) {
            checkIndices(shape);
            int[] d = distinct;
            if (d == null) {
                distinct = d = Arrays.stream(indices).distinct().toArray();
            }
            return IndexProducer.fromIndexArray(d);
        }

        private void checkIndices(Shape shape) {
            // Check the hasher is OK for the shape
            int bits = shape.getNumberOfBits();
            for (int i : indices) {
                Assertions.assertTrue(i < bits);
            }
        }
    }

This hasher will allow filters to be constructed and manipulated. Or change it to use % bits and drop the assertion in checkIndices.

The SimpleHasher should just be tested that it outputs random indices. I would suggest creating a few thousand randomly seeded hashers, outputting 5 indices from each and building a histogram of the counts of each index. This should be uniform when tested with a Chi-square test. This can be repeated for a few different shape sizes.

Updating the filter in this way will still implement a Krisch and Mitzenmacher hasher where:

g_i(x) = (h1(x) + i h2(x)) mod m

The only difference is the subtraction rather than addition.

The wikipedia article on Double Hashing (variants) notes that this type of hasher when used in Bloom filters will result in a collision rate that is twice as likely as intended.

An alternative is to provide enhanced double hashing which changes the increment each loop iteration. This would require some additional work on wrapping the modulus of the increment.

It may be better to change the name to KMHasher and more explicitly state that the hash functions are:

g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]

Attachments

Issue Links

is a child of

COLLECTIONS-728 BloomFilter contribution

Resolved

BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change

Details

Description

Attachments

Issue Links

Activity

People

Dates