[PIG-4848] pig.noSplitCombination=true should always be set internally for a merge join - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: spark-branch
Component/s: spark
Labels: None

Description

In Spark mode, pig.noSplitCombination is NOT set to true internally for a merge join. As a result, the input splits are ordered by file size rather than by key, and the output is out of order.

Scenario:
cat input1

1	1

cat input2

2	2

cat input3

33	33

A = LOAD 'input*' as (a:int, b:int);
B = LOAD 'input*' as (a:int, b:int);
C = JOIN A BY $0, B BY $0 USING 'merge';
DUMP C;

expected result:

(1,1,1,1)
(2,2,2,2)
(33,33,33,33)

actual result:

(33,33,33,33)
(1,1,1,1)
(2,2,2,2)

In MR mode, the flag was set to true internally for a merge join (see PIG-2773). However, that no longer works: the output is still out of order, because hadoop-client sorts the splits again. In Spark mode, we can solve this issue.
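The effect of split ordering on the scenario above can be sketched outside of Pig. The following Python simulation is only an illustration under stated assumptions: the helper names (`order_by_name`, `order_by_size`, `run_join`) are hypothetical, the per-split join is a simple lookup standing in for Pig's merge join, and "ordered by file size" is modeled as largest file first. It is not Pig's or Spark's actual implementation.

```python
# Hypothetical simulation of the scenario; not Pig or Spark code.
# File names and contents mirror the three input files in the report.
files = {
    "input1": "1\t1\n",
    "input2": "2\t2\n",
    "input3": "33\t33\n",
}

def read_rows(name):
    """Parse one file into a list of (a, b) integer tuples."""
    rows = []
    for line in files[name].splitlines():
        a, b = line.split("\t")
        rows.append((int(a), int(b)))
    return rows

def order_by_name(names):
    """Splits in file-name order, which here matches the key order."""
    return sorted(names)

def order_by_size(names):
    """Assumption: splits are sorted by size, largest first (stable for ties)."""
    return sorted(names, key=lambda n: len(files[n]), reverse=True)

def run_join(split_order):
    """Process the left-side splits in the given order and join each row
    against the full right side. The output order follows the split order."""
    right = sorted(row for name in files for row in read_rows(name))
    out = []
    for name in split_order:
        for left_row in read_rows(name):
            out.extend(left_row + r for r in right if r[0] == left_row[0])
    return out

print(run_join(order_by_name(list(files))))  # (1,...), (2,...), (33,...)
print(run_join(order_by_size(list(files))))  # (33,...), (1,...), (2,...)
```

In this sketch, input3 is the largest file, so ordering by size puts its split first and reproduces the out-of-order result shown above, while keeping the splits in their original order gives the expected result.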

Attachments

PIG-4848.patch (2 kB), 30/Mar/16 01:39, Xianda Ke
PIG-4848-2.patch (2 kB), 30/Mar/16 02:30, Xianda Ke
PIG-4848-hotfix.patch (0.8 kB), 31/Mar/16 05:58, Xianda Ke

People

Assignee: Xianda Ke

Reporter: Xianda Ke

Votes: 0

Watchers: 5

Dates

Created: 23/Mar/16 03:54

Updated: 21/Jun/17 09:18

Resolved: 02/Apr/16 21:56