Details
Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Description
In Spark mode, the flag discussed in PIG-2773 is NOT set to true internally for a merge join. The input splits are then ordered by file size, so the join output comes out of order.
Scenario:
cat input1
1 1
cat input2
2 2
cat input3
33 33
A = LOAD 'input*' as (a:int, b:int);
B = LOAD 'input*' as (a:int, b:int);
C = JOIN A BY $0, B BY $0 USING 'merge';
DUMP C;
expected result:
(1,1,1,1) (2,2,2,2) (33,33,33,33)
actual result:
(33,33,33,33) (1,1,1,1) (2,2,2,2)
In MR mode, the flag was set to true internally for a merge join (see PIG-2773). However, that no longer works: the output is still out of order, because hadoop-client orders the splits again. In Spark mode, we can solve this issue.
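The ordering seen in the scenario can be reproduced with a small model. The following is only a sketch in Python, not Pig or Spark code: it assumes the three files are tab-delimited with a trailing newline, that splits are ordered largest first, and that equally sized splits keep their original order. The names `files`, `order_by_size`, `order_by_name` and `first_keys` are hypothetical and exist only for this illustration.

```python
# Hypothetical model of the three input files from the scenario: (name, content).
# Assumption: fields are tab-delimited and each file ends with a newline.
files = [
    ("input1", "1\t1\n"),
    ("input2", "2\t2\n"),
    ("input3", "33\t33\n"),
]

def order_by_size(splits):
    # Assumption: larger splits come first. Python's sort is stable, so
    # equally sized splits (input1, input2) keep their original order.
    return sorted(splits, key=lambda s: len(s[1]), reverse=True)

def order_by_name(splits):
    # Ordering by file name keeps the join keys ascending in this scenario.
    return sorted(splits, key=lambda s: s[0])

def first_keys(splits):
    # Extract the join key (first column) of each single-row file.
    return [int(content.split()[0]) for _, content in splits]

by_size = first_keys(order_by_size(files))
by_name = first_keys(order_by_name(files))
print(by_size)  # [33, 1, 2]
print(by_name)  # [1, 2, 33]
```

Under these assumptions, ordering by size puts input3 first because it is the largest file, which matches the actual result above, while ordering by name gives the key order a merge join expects.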