Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
A major operation in analytics is the join. This issue concerns adding the join operation.
Given the complexity of this task, I propose starting with a subset of all joins: a hash join whose "ON" clause can only be a set of column names (i.e. no expressions).
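To make the proposed scope concrete, here is a minimal sketch of an inner hash join keyed only on column names. It is an illustration in plain Python, not the planned implementation: rows are modelled as dicts, and the `hash_join` name and its `on` parameter are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, on):
    """Inner equi-join of two lists of dict rows on the column names in `on`.

    Hypothetical helper for illustration only; assumes every row has all
    columns listed in `on` and that key values are hashable.
    """
    # Build phase: index the right side by its tuple of key values.
    table = defaultdict(list)
    for row in right:
        table[tuple(row[c] for c in on)].append(row)

    # Probe phase: look up each left row's key and emit the merged rows.
    out = []
    for lrow in left:
        for rrow in table.get(tuple(lrow[c] for c in on), []):
            out.append({**rrow, **lrow})
    return out


left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 3, "a": "z"}]
right = [{"id": 1, "b": 10}, {"id": 1, "b": 11}, {"id": 3, "b": 30}]
joined = hash_join(left, right, on=["id"])
```

Restricting "ON" to column names means the key is just a tuple of column values, so no expression evaluation is needed in either phase.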
Suggested definition of done (DOD):
- physical plan to execute the join
- logical plan with the join
- SQL planner with the join
- tests on each of the above
One idea to perform this join in parallel is to join each RecordBatch on the left with the records on the right. Another way is to first hash-partition both sides by key and sort them, and then perform a "SortMergeJoin" on each of the partitions. There may be better ways to achieve this, though.
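The second idea can be sketched roughly as follows. This is an assumption-laden illustration in plain Python rather than the actual design: rows are dicts, key values are assumed non-null and mutually comparable, and the names `partition`, `sort_merge_join`, and `partitioned_join` are hypothetical. A thread pool is used only to show that the partitions are independent.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, on, n):
    """Hash-partition rows by key so equal keys land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(tuple(row[c] for c in on)) % n].append(row)
    return parts


def sort_merge_join(left, right, on):
    """Inner join of one partition pair: sort both sides, then merge."""
    def key(row):
        return tuple(row[c] for c in on)

    left, right = sorted(left, key=key), sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and key(left[i]) == lk:
                lrow = left[i]
                out.extend([{**rrow, **lrow} for rrow in right[j:j_end]])
                i += 1
            j = j_end
    return out


def partitioned_join(left, right, on, n=4):
    """Partition both sides the same way, then join each pair independently."""
    pairs = list(zip(partition(left, on, n), partition(right, on, n)))
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: sort_merge_join(p[0], p[1], on), pairs)
        return [row for part in results for row in part]


left = [{"id": 3, "a": "z"}, {"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": 11}, {"id": 3, "b": 30}, {"id": 1, "b": 10}]
result = partitioned_join(left, right, on=["id"])
```

Because both sides use the same hash function and partition count, matching keys always meet in the same partition pair, so no partition needs to see another's data. The order of rows in `result` depends on the partitioning and should not be relied on.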