Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
0.1.0-m1
Description
There are cases where joining two record batches can result in redundant work. Consider a merge join performed on two tables (t1 and t2) with duplicate keys on both sides:
t1
key | value |
---|---|
2 | 'a' |
2 | 'b' |
t2
key | value |
---|---|
2 | 'A' |
2 | 'B' |
2 | 'C' |
The resulting table contains the cross product of all rows with key 2 (2 × 3 = 6 rows):
key | t1.value | t2.value |
---|---|---|
2 | 'a' | 'A' |
2 | 'a' | 'B' |
2 | 'a' | 'C' |
2 | 'b' | 'A' |
2 | 'b' | 'B' |
2 | 'b' | 'C' |
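The current implementation copies t2.value from the incoming vectors one element at a time for every matching t1 row, so the run ('A', 'B', 'C') is rebuilt once per t1 row. Ideally, the t2.value run would be constructed element by element only on the first pass; for each later t1 row with the same key, the already-built run can be copied in bulk.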
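The proposed behavior can be sketched roughly as follows. This is a minimal illustration only, not the project's actual join code: it uses plain Python lists in place of value vectors, and the function name `merge_join` and its output layout (three parallel columns) are assumptions made for the example.

```python
def merge_join(left, right):
    """Merge-join two key-sorted lists of (key, value) rows.

    Hypothetical sketch: returns three parallel output columns
    (keys, left values, right values) standing in for output vectors.
    """
    out_key, out_left, out_right = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1

            # First pass: build the right-side run element by element.
            start = len(out_right)
            for r in range(j, j_end):
                out_right.append(right[r][1])
            run_len = len(out_right) - start
            out_key.extend([lk] * run_len)
            out_left.extend([left[i][1]] * run_len)

            # Later passes: bulk-copy the run that was already built
            # instead of re-reading the incoming right-side rows.
            for l in range(i + 1, i_end):
                out_right.extend(out_right[start:start + run_len])
                out_key.extend([lk] * run_len)
                out_left.extend([left[l][1]] * run_len)

            i, j = i_end, j_end
    return out_key, out_left, out_right


# The t1 and t2 tables from the description.
t1 = [(2, 'a'), (2, 'b')]
t2 = [(2, 'A'), (2, 'B'), (2, 'C')]
keys, left_vals, right_vals = merge_join(t1, t2)
```

In a columnar implementation the slice copy would presumably correspond to a single bulk copy within the output vector, which is where the saving over per-element copying would come from.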