[PIG-5040] Order by and CROSS partitioning is not deterministic due to usage of Random

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.17.0, 0.16.1
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

Maps can be rerun due to shuffle fetch failures. Half of the reducers can end up successfully pulling partitions from the first run of the map while the other half pull from the rerun after the shuffle fetch failures. If the Partitioner does not partition the data exactly the same way every time, this can lead to incorrect results (lost records and duplicated records). Although the issue has existed for 8 years with order by and affects MapReduce as well, it was found with Tez, where reruns due to shuffle fetch failures are frequent (the order by partitioner gets its data from a 1-1 edge, so there are no retries and a shuffle fetch failure triggers a rerun immediately).
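A minimal sketch of the idea, not the code from the attached patch: if the random choice of partition is driven by a `Random` seeded with a value that stays the same across task attempts, a rerun of the map reproduces the same assignment. The class name `DeterministicPartitionSketch`, the method `assignPartitions`, and the use of a task index as the seed are assumptions made for illustration; the sketch also assumes the rerun sees its input records in the same order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; names and seeding choice are assumptions.
class DeterministicPartitionSketch {

    // Picks a partition for each record from a Random seeded with a value
    // that is stable across attempts of the same task (assumed here to be
    // the task index, never the attempt ID or the current time).
    static List<Integer> assignPartitions(List<String> records, int numPartitions, long stableSeed) {
        Random random = new Random(stableSeed);
        List<Integer> partitions = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            partitions.add(random.nextInt(numPartitions));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "a", "a", "b", "c");
        long taskIndex = 3L; // assumed stable between the first run and the rerun

        List<Integer> firstRun = assignPartitions(records, 4, taskIndex);
        List<Integer> rerun = assignPartitions(records, 4, taskIndex);

        // Same seed and same input order give the same assignment.
        System.out.println(firstRun.equals(rerun)); // prints "true"
    }
}
```

With an unseeded `new Random()`, the two calls would generally produce different assignments, which is the situation that lets reducers see a mix of partitions from two inconsistent map runs.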

Attachments

- PIG-5040-1.patch, 12/Oct/16 17:10, 19 kB, Rohini Palaniswamy
- PIG-5040-1-nowhitespacechanges.patch, 12/Oct/16 17:10, 13 kB, Rohini Palaniswamy

Issue Links

is related to: PIG-5154 Fix GFCross related issues after merging from trunk to spark (Closed)

People

Assignee: Rohini Palaniswamy

Reporter: Rohini Palaniswamy

Votes: 0

Watchers: 3

Dates

Created: 12/Oct/16 17:05

Updated: 21/Jun/17 09:15

Resolved: 17/Oct/16 15:20