Spark SQL should support the common optimization technique known as cost estimation. For example, each logical operator should be able to estimate its cardinality based on the estimates from its children.
As a first step, a framework to support this should be added, including an interface for the aforementioned cardinality estimation and possibly other metrics (e.g. estimated physical size in bytes).
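Such a framework might look like the following minimal sketch. All names here (`Statistics`, `sizeInBytes`, the operator classes) are illustrative assumptions, not a proposal for the actual API: each logical operator exposes a `statistics` method with a crude default (the product of its children's sizes, a safe upper bound for joins), and leaf operators override it with real numbers from the data source.

```scala
// Hypothetical statistics interface for logical operators.
// A metrics container; more fields (rowCount, etc.) could be added later.
case class Statistics(sizeInBytes: BigInt)

trait LogicalPlan {
  def children: Seq[LogicalPlan]
  // Default estimate: the product of the children's sizes, which is a
  // conservative upper bound for operators such as joins. Leaf operators
  // should override this with an estimate from the underlying data source.
  def statistics: Statistics =
    Statistics(sizeInBytes = children.map(_.statistics.sizeInBytes).product)
}

// A leaf relation whose physical size is known (e.g. from file metadata).
case class Relation(name: String, knownSizeInBytes: BigInt) extends LogicalPlan {
  def children: Seq[LogicalPlan] = Nil
  override def statistics: Statistics = Statistics(knownSizeInBytes)
}

// A join simply inherits the default product-of-children estimate.
case class Join(left: LogicalPlan, right: LogicalPlan) extends LogicalPlan {
  def children: Seq[LogicalPlan] = Seq(left, right)
}
```

With this default, `Join(Relation("a", 100), Relation("b", 50)).statistics.sizeInBytes` evaluates to 5000; a tighter estimate for equi-joins could later override `statistics` in `Join` using column-level statistics.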
As a first proof-of-concept use of this framework, a simple optimization strategy for certain equi-joins should be added: if one side of a qualifying join has a small estimated physical size (below some configurable threshold), the planner should choose a broadcast join physical plan, broadcasting the small side to every node and streaming through the larger side. A similar concept exists in Hive, where it is known as a map join.
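The planner-side decision can be sketched as below. This is an assumption-laden illustration, not the actual strategy: the threshold value, the `planEquiJoin` helper, and the physical-plan names are all hypothetical, and a real implementation would read the threshold from a configuration setting and pull sizes from the operators' statistics.

```scala
// Hypothetical broadcast-threshold check for planning an equi-join.
// 10 MB is an arbitrary example value; it should come from configuration.
val broadcastJoinThreshold: BigInt = BigInt(10L * 1024 * 1024)

sealed trait PhysicalJoin
// Broadcast the named side to all nodes; stream the other side through.
case class BroadcastHashJoin(broadcastSide: String, streamSide: String) extends PhysicalJoin
// Fallback: shuffle both sides by the join keys.
case class ShuffledHashJoin(left: String, right: String) extends PhysicalJoin

def planEquiJoin(leftName: String, leftSize: BigInt,
                 rightName: String, rightSize: BigInt): PhysicalJoin =
  if (rightSize <= broadcastJoinThreshold)
    BroadcastHashJoin(broadcastSide = rightName, streamSide = leftName)
  else if (leftSize <= broadcastJoinThreshold)
    BroadcastHashJoin(broadcastSide = leftName, streamSide = rightName)
  else
    ShuffledHashJoin(leftName, rightName)
```

For example, joining a multi-gigabyte fact table against a kilobyte-sized dimension table would select `BroadcastHashJoin` with the dimension table as the broadcast side, avoiding a shuffle of the large table entirely.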