Need to establish order in shuffle inputs
Just a minor comment. Can we change "pig.shuffled.inputs" to something more generic since inputs can be shuffle, broadcast, or 1-1? What do you think?
How about "pig.popackage.inputs"?
Sounds good to me.
Why do we have to serialize this as a separate pig.popackage.inputs config? Can setInputKeys() of POShuffleTezLoad be used instead, since POShuffleTezLoad is already being serialized as part of the plan?
Daniel Dai, I am doing what Rohini suggests here as part of PIG-3604. I need to set inputKeys in POShuffleTezLoad to handle the case where both scatter/gather and broadcast edges are attached to the same vertex. For example:
a = LOAD 'foo' AS (x:int, y:chararray);
a1 = GROUP a BY x;
b = LOAD 'bar' AS (x:int, y:chararray);
d = JOIN a1 BY group, b BY x USING 'replicated'; -- replicated join in reducer
Let me post a new patch in PIG-3604 that includes the fix for this jira.
Yes, I see setInputKeys as well; that should be better.
Fixed as part of PIG-3604. Closing the jira.