While POFRJoin is being compiled in MRCompiler, it needs to identify, for each of its
predecessors in the physical plan, which compiled MROper that predecessor is part of. Currently, it is
assumed to be one of the compiledInputs (an array of MROpers which are the immediate predecessors of the current MROper in the MROper DAG).
This is usually true, but not when one physical operator results in two or more MROpers, as is the
case here. When there is an order-by before FRJoin, one of the inputs of POFRJoin will be
POSort, but the POSort operator will be in the first of the two generated MROpers
and thus will not be found in compiledInputs (which contains the second MROper). Thus, the
current way of identifying the MROper corresponding to a physical operator is unreliable.
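
The failing lookup can be illustrated with a small stand-alone sketch. It is not the MRCompiler code: SketchMROper, CompiledInputsLookupSketch and findInCompiledInputs are hypothetical names, and operators are modelled as plain strings purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an MROper: a name plus the physical operators it holds.
final class SketchMROper {
    final String name;
    final List<String> physicalOps;

    SketchMROper(String name, String... physicalOps) {
        this.name = name;
        this.physicalOps = Arrays.asList(physicalOps);
    }
}

final class CompiledInputsLookupSketch {
    // Mirrors the current assumption: the MROper holding a predecessor
    // is always one of compiledInputs. Returns null when it is not.
    static SketchMROper findInCompiledInputs(SketchMROper[] compiledInputs, String physicalOp) {
        for (SketchMROper mro : compiledInputs) {
            if (mro.physicalOps.contains(physicalOp)) {
                return mro;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Order-by yields two MROpers; as described above, POSort sits in the first.
        SketchMROper first = new SketchMROper("orderBy-1", "POSort");
        SketchMROper second = new SketchMROper("orderBy-2"); // contents omitted in this sketch

        // Only the second MROper is an immediate predecessor of the join's MROper.
        SketchMROper[] compiledInputs = { second };

        SketchMROper found = findInCompiledInputs(compiledInputs, "POSort");
        System.out.println(found == null ? "POSort not found in compiledInputs" : found.name);
    }
}
```

With only the second MROper in compiledInputs, the search for POSort comes back empty, which is the situation POFRJoin runs into.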
This bug also affects the implementation of merge-sort join
(https://issues.apache.org/jira/browse/PIG-845), since POMergeJoin needs to know which MROper
corresponds to its left input and which one corresponds to its right. It can do so by looking
into compiledInputs only as long as there is no order-by (or a similar PO which results in
multiple MROpers) among its predecessors. Doing an order-by before a merge
join is, however, a natural use case there.
The proposal is to introduce a new private member variable in MRCompiler, phyToMROperMap
(similar to logToPhyMap), through which the leaf MROper for a given
physical operator can be identified. Thoughts?
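
To make the proposal a bit more concrete, here is a rough sketch of what such a map could look like. Everything except the name phyToMROperMap is an assumption: PhyOpSketch and MROperSketch are hypothetical stand-ins for the real operator classes, recordLeaf and leafFor are illustrative method names, and "leaf" is taken to mean the last MROper generated for a physical operator.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a physical operator.
final class PhyOpSketch {
    final String name;
    PhyOpSketch(String name) { this.name = name; }
}

// Hypothetical stand-in for an MROper.
final class MROperSketch {
    final String name;
    MROperSketch(String name) { this.name = name; }
}

final class PhyToMROperMapSketch {
    // Keyed by operator instance (default Object identity), so two
    // distinct operators with the same name never collide.
    private final Map<PhyOpSketch, MROperSketch> phyToMROperMap =
            new HashMap<PhyOpSketch, MROperSketch>();

    // Called whenever a physical operator is compiled; a later call for the
    // same operator overwrites the earlier entry, so the map ends up holding
    // the last (leaf) MROper generated for that operator.
    void recordLeaf(PhyOpSketch op, MROperSketch leaf) {
        phyToMROperMap.put(op, leaf);
    }

    // Returns the leaf MROper for the operator, or null if none was recorded.
    MROperSketch leafFor(PhyOpSketch op) {
        return phyToMROperMap.get(op);
    }

    public static void main(String[] args) {
        PhyToMROperMapSketch compiler = new PhyToMROperMapSketch();
        PhyOpSketch poSort = new PhyOpSketch("POSort");

        // Assumed flow: order-by produces two MROpers, recorded in order.
        compiler.recordLeaf(poSort, new MROperSketch("orderBy-1"));
        compiler.recordLeaf(poSort, new MROperSketch("orderBy-2"));

        // A join compiled afterwards asks the map instead of compiledInputs.
        System.out.println(compiler.leafFor(poSort).name);
    }
}
```

Under these assumptions, POFRJoin and POMergeJoin would look up each predecessor in the map directly instead of searching compiledInputs for it.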