I propose the following changes to Pig's optimizer, plan, and operator functionality to support more robust optimization:
1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern.
This has the downside that if a given rule applies to two patterns (say Load->Filter->Group, Load->Group) you have to write two rules. But it has the upside that
the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the
resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it
is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this
will have no effect on them.
2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the
optimizer applies the rules in a single pass. It would change to loop, reapplying the whole rule set until no rule makes a conversion or an iteration cap is hit.
The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple
times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying about getting the plan right in one pass.
For example, we might have a plan that looks like: Load->Join->Filter->Foreach, and we want to optimize it to Load->Foreach->Filter->Join. With two simple
rules (swap filter and join, and swap foreach and filter), applied iteratively, we can get from the initial to the final plan without needing to understand the
big picture of the entire plan.
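A self-contained sketch of this iteration scheme (the class, method, and constant names below are illustrative, not Pig's actual code; the plan is simplified to an ordered list of operator names): a single adjacent-swap rule, reapplied until nothing changes, pushes a filter past two joins even though the rule only ever examines two neighboring operators.

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Illustrative sketch only: a "rule" is modeled as a function that
// returns a transformed copy of the plan, or the plan unchanged if it
// does not match anywhere.
class PlanOptimizerSketch {
    static final int MAX_ITERATIONS = 500; // cap to guard against infinite loops

    static List<String> optimize(List<String> plan,
                                 List<UnaryOperator<List<String>>> rules) {
        boolean sawMatch;
        int iterations = 0;
        do {
            sawMatch = false;
            for (UnaryOperator<List<String>> rule : rules) {
                List<String> transformed = rule.apply(plan);
                if (!transformed.equals(plan)) { // the rule made a conversion
                    sawMatch = true;
                    plan = transformed;
                }
            }
        } while (sawMatch && ++iterations < MAX_ITERATIONS);
        return plan;
    }

    // A simple swap rule: wherever a Join immediately precedes a Filter,
    // move the Filter in front of the Join (one conversion per application).
    static List<String> swapJoinFilter(List<String> plan) {
        List<String> out = new ArrayList<>(plan);
        for (int i = 0; i + 1 < out.size(); i++) {
            if (out.get(i).equals("Join") && out.get(i + 1).equals("Filter")) {
                Collections.swap(out, i, i + 1);
                return out;
            }
        }
        return out;
    }
}
```

Starting from Load->Join->Join->Filter, the rule fires once in each of the first two iterations, leaving Load->Filter->Join->Join; the third iteration finds no match and the loop terminates.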
3) Add three calls to OperatorPlan: swap(), pushBefore(), and pushAfter().
The rules in the optimizer can use these three functions, along with the existing insertBetween(), replace(), and removeAndReconnect() calls, to operate on the plan.
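A minimal sketch of how the three new calls might be shaped (a guess at the signatures, not Pig's actual code): the plan is simplified from a DAG to a linear list, and the multi-input/multi-output cases are left as stubs since a list cannot model branches.

```java
import java.util.*;

// Simplified sketch: the real OperatorPlan is a DAG generic over Pig
// operators; here it is just an ordered list of operator names.
class OperatorPlanSketch {
    private final List<String> ops;

    OperatorPlanSketch(List<String> ops) { this.ops = new ArrayList<>(ops); }

    /** Swap two adjacent operators; both must be single-input, single-output. */
    void swap(String first, String second) {
        int i = ops.indexOf(first);
        if (i < 0 || i + 1 >= ops.size() || !ops.get(i + 1).equals(second))
            throw new IllegalArgumentException(first + " -> " + second + " not in plan");
        Collections.swap(ops, i, i + 1);
    }

    /** Push second in front of input number inputNum of first (for when
     *  first has multiple inputs, e.g. a join); stubbed in this sketch. */
    void pushBefore(String first, String second, int inputNum) {
        throw new UnsupportedOperationException("sketch only");
    }

    /** Push second after output number outputNum of first (for when first
     *  has multiple outputs, e.g. a split); stubbed for the same reason. */
    void pushAfter(String first, String second, int outputNum) {
        throw new UnsupportedOperationException("sketch only");
    }

    List<String> operators() { return Collections.unmodifiableList(ops); }
}
```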
4) Add a new call, rewire(), to Operator.
This method will be called by the swap, pushBefore, pushAfter, insertBetween, replace, and removeAndReconnect in OperatorPlan whenever an operator is moved
around so that the operator has a chance to make any necessary changes.
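One possible shape for this hook, with all names below being assumptions rather than Pig's actual signature; a toy subclass shows an operator remapping its projection source when its predecessor changes.

```java
// Sketch of the proposed rewire hook; the real signature in Pig may differ.
abstract class OperatorSketch {
    final String name;
    OperatorSketch(String name) { this.name = name; }

    /**
     * Called by swap, pushBefore, pushAfter, insertBetween, replace, and
     * removeAndReconnect whenever this operator is moved, so it can adjust
     * anything that referred to its old position in the plan.
     *
     * @param oldPred      predecessor before the move
     * @param oldPredIndex which input of this operator oldPred fed
     * @param newPred      predecessor after the move
     */
    abstract void rewire(OperatorSketch oldPred, int oldPredIndex,
                         OperatorSketch newPred);
}

// A filter that remembers which operator its projections refer to.
class FilterSketch extends OperatorSketch {
    OperatorSketch projectionSource;

    FilterSketch(String name, OperatorSketch source) {
        super(name);
        this.projectionSource = source;
    }

    @Override
    void rewire(OperatorSketch oldPred, int oldPredIndex, OperatorSketch newPred) {
        // Remap projections that pointed at the old predecessor.
        if (projectionSource == oldPred) projectionSource = newPred;
    }
}
```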
5) Add new calls to LogicalOperator and PhysicalOperator.
These calls will be called by optimizer rules to determine whether or not a swap can be done (for example, you can't swap two operators if the second one uses a
field added by the first), and once the swap is done they will be used by rewire to understand how to map projections in the operators.
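As a sketch of how a rule might use such calls (the class, method, and field names below are all illustrative assumptions, not Pig's API), each operator could report the fields it requires from its input and the fields it adds to its output, and a swap check could compare the two:

```java
import java.util.*;

// Illustrative sketch: each operator exposes which input fields it reads
// and which output fields it introduces.
class FieldInfoSketch {
    final Set<String> requiredFields; // fields this operator reads
    final Set<String> addedFields;    // fields this operator introduces

    FieldInfoSketch(Set<String> required, Set<String> added) {
        this.requiredFields = required;
        this.addedFields = added;
    }

    /** A swap is illegal if the downstream operator uses a field that the
     *  upstream operator adds. */
    static boolean canSwap(FieldInfoSketch upstream, FieldInfoSketch downstream) {
        return Collections.disjoint(upstream.addedFields, downstream.requiredFields);
    }
}
```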
6) It's not clear that the RuleMatcher, in its current form, will work with rules that are not linear. That is, it matches patterns that form a single chain of
operators (for example, Load->Filter->Group). But I don't know if it will match patterns that branch, such as a Split feeding two different successors.
For the optimizer to be able to determine join types and handle operations with splits, it will have to be able to do that.
Examples of types of rules that this optimizer could support:
1) Pushing filters in front of joins.
2) Pushing foreachs with flattens (which thus greatly expand the data) down the tree past filters, joins, etc.
3) Pushing type casting used for schemas in loads down to the point where the field is actually used.
4) Deciding when to do fragment/replicate join or sort/merge join instead of the standard hash join.
5) The current optimizations: pushing limit up the tree, making implicit splits explicit, merge load and stream where possible, using the combiner.
6) Merge filters or foreachs where possible.
In particular the combiner optimizer hopefully can be completely rewritten to use the optimizer framework to make decisions about how to rework physical plans
to push work into the combiner.