Uploaded image for project: 'Commons Collections'
  1. Commons Collections

BloomFilter: replace ArrayTracker



    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Done
    • 4.5.0-M1
    • None
    • Collection


      aherbert on 27 Feb

      The Hasher.ArrayTracker class could be replaced with an open addressed hash table. I have created an implementation that is very efficient and can handle up to 2^29 unique indices (with a load factor of 0.5). It is approximately 7x faster than TreeSet for add operations, even without explicitly setting the upper capacity. It is only outperformed by a BitMap type array when the density of indices exceeds 2-4 bits per long which matches up to how the IndexFilter is choosing the implementation type based on size. I added the simple array tracker to the benchmark as it is still faster when the number of hash functions is very small (e.g. 5).

      The same int hash table can be used for the SparseBloomFilter but it would not support a saturated filter if the number of bits is above 2^29. The second caveat is that iteration of indices in order is slow, they are naturally unsorted. Iteration in unsorted raw form is fast (as fast as TreeMap) but sorted requires extracting the indices and sorting them which is about 4-6x slower than TreeMap iteration. The sorted form is required to support the BitMapProducer interface. This then becomes a question of whether to optimise a sparse filter for merge and contains operations or for bitmap iteration (required for set operations).

      Currently this and the BitMapTracker are public. I would move them to package private implementation details (i.e. out of this class). This allows them to be changed later. All access should be through IndexTracker.create.


        Issue Links



              claude Claude Warren
              claude Claude Warren
              0 Vote for this issue
              1 Start watching this issue