Currently the hash tables used for hash join and aggregation use 16 bytes per bucket and 24 bytes per additional duplicate of a key.
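One plausible layout consistent with those sizes is sketched below. This is an assumption for illustration; the field names and ordering are hypothetical, but an 8-byte pointer, a 4-byte hash, and a 1-byte boolean plus alignment padding do add up to the 16 bytes quoted above:

```cpp
#include <cstdint>

struct Entry;  // hypothetical chain node for an additional duplicate (24 bytes)

struct Bucket {
    Entry   *first;     // 8 bytes: head of the chain of rows in this bucket
    uint32_t hash;      // 4 bytes: cached hash of the key
    bool     occupied;  // 1 byte: whether the bucket is in use
                        // + 3 bytes of implicit padding for 8-byte alignment
};
static_assert(sizeof(Bucket) == 16, "16 bytes per bucket on LP64");
```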
There are some comments in the code that suggest folding the boolean values into the upper bits of the pointers: on amd64 only 48 bits of the address space are currently used, though that is apparently moving to 57 bits with 5-level paging (see https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf). That would reduce the bucket to 12 bytes of actual data.
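A minimal sketch of the folding, assuming user-space heap pointers never have bit 63 set (true under both the current 48-bit and the upcoming 57-bit canonical-address layouts):

```cpp
#include <cstdint>

// Tagged pointer: the boolean lives in bit 63, which user-space addresses
// on amd64 leave clear even with 5-level paging.
struct TaggedPtr {
    static constexpr uintptr_t kFlag = uintptr_t{1} << 63;

    uintptr_t bits = 0;

    void set(void *p, bool flag) {
        bits = reinterpret_cast<uintptr_t>(p) | (flag ? kFlag : 0);
    }
    void *ptr() const { return reinterpret_cast<void *>(bits & ~kFlag); }
    bool flag() const { return (bits & kFlag) != 0; }
};
```

Clearing the tag on every dereference costs one AND; that is the bit-manipulation overhead to weigh against the cache effects below.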
This would give us the opportunity to significantly reduce the memory requirements of joins and the pressure on caches, provided we can work out the implementation issues and the cost of the bit manipulation doesn't exceed the benefit (my intuition is that cache effects are way more important, but I could be wrong).
Here's a rough idea of what we could do:
- Implement folding of the booleans into the pointer and mark struct Bucket as packed so that additional padding doesn't just undo the work (see the packed-struct sketch after this list).
- Modify Hashtable to work with the new bucket structure. This needs a little thought: the bucket allocations must be a power-of-two size in bytes, but we also need the number of buckets to be a power of two for masking the hash to get the bucket number, and with 12-byte buckets those two constraints conflict. We could either leave wasted space at the end of the buffer or switch to a non-power-of-two number of buckets and use an alternative method of getting the bucket from the hash (see the reduction sketch after this list): https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
- Run benchmarks to see if it's beneficial. The effect probably depends on the data set size (a starting-point microbenchmark is sketched after this list).
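For the first item, a sketch of the packed 12-byte bucket (again with hypothetical field names; `__attribute__((packed))` is the GCC/Clang spelling, MSVC would need `#pragma pack`):

```cpp
#include <cstdint>

struct __attribute__((packed)) PackedBucket {
    uint64_t entry_and_flag;  // tagged pointer: boolean folded into bit 63
    uint32_t hash;            // cached hash of the key
};
static_assert(sizeof(PackedBucket) == 12,
              "packed, so padding doesn't undo the savings");
```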
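For the second item, the alternative reduction from the linked post maps a hash to [0, num_buckets) with one multiply and a shift instead of masking, so the bucket count no longer has to be a power of two:

```cpp
#include <cstdint>

// Lemire's multiplicative reduction. It keys off the high bits of the hash,
// so the hash must be well distributed over the full 32-bit range.
inline uint32_t BucketFor(uint32_t hash, uint32_t num_buckets) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(hash) * num_buckets) >> 32);
}
```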
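For the benchmarking item, a hypothetical microbenchmark (using google/benchmark, not part of the codebase) that isolates just the bucket-index computation; the real measurement would of course run the join itself across data set sizes:

```cpp
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
#include <vector>

static std::vector<uint32_t> MakeHashes() {
    std::mt19937 rng(42);
    std::vector<uint32_t> h(1 << 20);
    for (auto &v : h) v = rng();
    return h;
}

static void BM_MaskReduction(benchmark::State &state) {
    const auto hashes = MakeHashes();
    const uint32_t mask = (1u << 16) - 1;  // power-of-two bucket count
    for (auto _ : state)
        for (uint32_t h : hashes) benchmark::DoNotOptimize(h & mask);
}
BENCHMARK(BM_MaskReduction);

static void BM_MultiplyReduction(benchmark::State &state) {
    const auto hashes = MakeHashes();
    const uint32_t buckets = 60000;  // arbitrary non-power-of-two count
    for (auto _ : state)
        for (uint32_t h : hashes)
            benchmark::DoNotOptimize(
                static_cast<uint32_t>(uint64_t{h} * buckets >> 32));
}
BENCHMARK(BM_MultiplyReduction);

BENCHMARK_MAIN();
```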