Details
-
Sub-task
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
Possible optimizations:
1) If it is a skewed join, then we can combine ordering into it instead of doing a additional orderby as we skewed join already involves sampling.
2) If it is a normal join, then we can do the order by and then join. i.e
Current plan:
Vertex 1 (load massive), Vertex 2 (load big) -> Vertex 3 (join) -> Vertex 4 (sampler), Vertex 5 (Partitioner using vertex 4 sample) -> Vertex 6 (order by)
New plan:
Vertex 1 (load massive) > Vertex 2 (sampler), Vertex 3 (Partitioner using vertex 2 sample) -> Vertex 4 (order by and join) < Vertex 5 (load big and construct WeightedRangePartitioner from Vertex 2 sample)
3) If it is a replicated join, similar plan in 2) should work with Vertex 5 changing to broadcast input to Vertex 4 instead of using WeightedRangePartitioner.