Thanks Koji Horaguchi for detailed explanation. This patch fixed several issues:
1. Remove fixDuplicateUids from LOJoin/LOGenerate
That's for sure need to be removed. We cannot arbitrary reassign uid. Every operator should have lineage knowledge and be able to trace to input uids which generate a particular output uid. Sometimes we do need to reassign the uid, but if this happens, we need to remember the mapping inside the operator, and update the logic ColumnDependencyVisitor of tracing every output uid to its input uid.
Take LOJoin for example, each output column is derived from one input column without transformation, so we keep uid unchanged. If input#1 of join has the uid (1, 2, 3), input#2 has the uid (4, 5), the join output will have the uid (1, 2, 3, 4, 5). When ColumnDependencyVisitor try to prune uid 5, we know that is from second column of input#2. In the case we do want to regenerate output uid to (21, 22, 23, 24, 25), join operator need to remember the mapping (1->21, 2->22, 3->23, 4->24, 5->25), so that when ColumnDependencyVisitor try to prune uid 25, we know it is from second column of input#2 as well. One example of uid regeneration is LOSplit, which has a field uidMapping to remember the mapping. ColumnDependencyVisitor will make use of this information to trace the lineage.
As we already mentioned, uid regeneration is already done in LOSplitOutput, so after ImplicitSplitInserter, we shall not have duplicate uids and should not have uid conflict. The issue seen in
PIG-3020 is due to a bug in LOSplitOutput uid mapping process as Koji mentioned.
2. LogicalPlanBuilder.expandAndResetVisitor change
Not sure what drive the original change, seems there is no reason to visit the whole plan after building one operator. The logical plan is not in a consistent state at this moment, global SchemaResetter might not work
3. move DuplicateForEachColumnRewrite and ImplicitSplitInserter
When we design optimizer, we try to include every plan transformation into it. But now I think we may separate it into two part. One is to bring the plan into a valid state, the other is the real optimizer rule. The reason is at validate stage, the plan is still inconsistent and many optimizer rule cannot handle, and the global SchemaReset/UidResetter does not work. We may want to do some refactory and make the separation clear in another Jira. "describe" should happen after validation, but might not need to go through optimizer. Move DuplicateForEachColumnRewrite/ImplicitSplitInserter into compile for now seems to be ok.
Overall, the patch looks pretty good. It fix above mentioned issue cleanly. We might want to preceed with clear validator/optmizer separation in anther Jira.
Will commit if tests pass.