Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Currently, when optimize some kinds of JOIN, the indexed or sampling files are saved into HDFS. By setting the replication to a larger number, it serves as distributed cache.
Spark's broadcast mechanism is suitable for this. It seems that we can add a physical operator to broadcast small RDDs.
This will benefit the optimization of some specialized Joins, such as Skewed Join, Replicated Join and so on.
Attachments
Attachments
Issue Links
- links to