Details

Type: Improvement

Status: Closed

Priority: Trivial

Resolution: Won't Fix

Affects Version/s: None

Fix Version/s: 4.0ALPHA

Component/s: None

Labels:None
Description
I recently had an interesting discussion with Sebastiano Vigna (fastutil), who suggested that linear probing, given a hash mixing function with good avalanche properties, is a way better method of constructing lookups in associative arrays compared to quadratic probing. Indeed, with linear probing you can implement removals from a hash map without removed slot markers and linear probing has nice properties with respect to modern CPUs (caches). I've reimplemented HPPC's hash maps to use linear probing and we observed a nice speedup (the same applies for fastutils of course).
This patch changes NodeHash's implementation to use linear probing. The code is a bit simpler (I think . I also moved the load factor to a constant – 0.5 seems like a generous load factor, especially if we allow large FSTs to be built. I don't see any significant speedup in constructing large automata, but there is no slowdown either (I checked on one machine only for now, but will verify on other machines too).
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
I spent some time on this. It's quite fascinating: the number of collisions for the default probing is smaller than:
a) linear probing with murmurhash mix of the original hash
b) linear probing without murmurhash mix (start from raw hash only).
Curiously, the number of collisions for (b) is smaller than for (a) – this could be explained if we assume bits are spread evently throughout the entire 32bit range after murmurhash, so after masking to table size there should be more collisions on lower bits compared to a raw hash (this would have more collisions on upper bits and fewer on lower bits because it is multiplicative... or at least I think so).
Anyway, I tried many different versions and I don't see any significant difference in favor of linear probing here. Measured the GC overhead during my tests too, but it is not the primary factor contributing to the total cost of constructing the FST (about 35% of the total time, running in parallel, typically).