Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: Impala 4.0.0
Description
hash-table.h's DuplicateNode structure is currently 24 bytes:
```cpp
/// Linked list of entries used for duplicates.
struct DuplicateNode {
  /// Used for full outer and right {outer, anti, semi} joins. Indicates whether the
  /// row in the DuplicateNode has been matched.
  /// From an abstraction point of view, this is an awkward place to store this
  /// information.
  /// TODO: Fold this flag in the next pointer below.
  bool matched;
  /// padding to an 8 byte boundary
  /// Chain to next duplicate node, NULL when end of list.
  DuplicateNode* next;
  HtData htdata;
};
```
The matched boolean occupies 8 bytes due to the alignment of next and htdata. Apart from the space usage, this linked list is also not very cache friendly during iteration. Multiple DuplicateNodes are allocated from a single chunk of memory, so it is possible that there is some locality by happenstance, but there are no guarantees. One way to make this more cache friendly is to batch multiple duplicates into a single structure. For example, consider this structure:
```cpp
/// Linked list of entries used for duplicates, batched BATCH_SIZE at a time.
struct DuplicateNode {
  /// Used for full outer and right {outer, anti, semi} joins. Indicates whether the
  /// corresponding row in htdata has been matched.
  bool matched[BATCH_SIZE];
  HtData htdata[BATCH_SIZE];
  /// Chain to next duplicate node, NULL when end of list.
  DuplicateNode* next;
};
```
This amortizes the matched booleans and the next pointer over multiple data entries, reducing the per-row overhead. If BATCH_SIZE exceeded 8, the matched flags could be packed further as individual bits in a single field if needed.
BATCH_SIZE=2 => 32 bytes
BATCH_SIZE=3 => 40 bytes
BATCH_SIZE=4 => 48 bytes
Num Duplicates | BATCH_SIZE=1 | BATCH_SIZE=2 | BATCH_SIZE=3 | BATCH_SIZE=4 | BATCH_SIZE=5 | BATCH_SIZE=6 |
---|---|---|---|---|---|---|
1 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes |
2 | 48 bytes | 32 bytes | 40 bytes | 48 bytes | 56 bytes | 64 bytes |
3 | 72 bytes | 64 bytes | 40 bytes | 48 bytes | 56 bytes | 64 bytes |
4 | 96 bytes | 64 bytes | 80 bytes | 48 bytes | 56 bytes | 64 bytes |
5 | 120 bytes | 96 bytes | 80 bytes | 96 bytes | 56 bytes | 64 bytes |
6 | 144 bytes | 96 bytes | 80 bytes | 96 bytes | 112 bytes | 64 bytes |
7 | 168 bytes | 128 bytes | 120 bytes | 96 bytes | 112 bytes | 128 bytes |
8 | 192 bytes | 128 bytes | 120 bytes | 96 bytes | 112 bytes | 128 bytes |
9 | 216 bytes | 160 bytes | 120 bytes | 144 bytes | 112 bytes | 128 bytes |
10 | 240 bytes | 160 bytes | 160 bytes | 144 bytes | 112 bytes | 128 bytes |
11 | 264 bytes | 192 bytes | 160 bytes | 144 bytes | 168 bytes | 128 bytes |
12 | 288 bytes | 192 bytes | 160 bytes | 144 bytes | 168 bytes | 128 bytes |
It seems like BATCH_SIZE=3 or 4 could make the duplicates code more cache friendly without regressing memory usage or requiring any complicated logic. BATCH_SIZE=6 corresponds to 64 bytes and a single cache line.
Of course, there are also ways to make the logic more complicated (e.g. using a smaller batch size for the first node or two and then switching to larger ones).
Issue Links
- is related to IMPALA-7635 Reduce size of hash tables in-memory by packing buckets more densely (Resolved)