[MAPREDUCE-6166] Reducers do not validate checksum of map outputs when fetching directly to disk - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0
Fix Version/s: 2.7.0, 2.6.1, 3.0.0-alpha1
Component/s: mrv2
Labels:
- 2.6.1-candidate

Target Version/s: 2.7.0

Description

In very large map/reduce jobs (50000 maps, 2500 reducers), the intermediate map partition output gets corrupted on disk on the map side. If this corrupted map output is too large to shuffle in memory, the reducer streams it to disk without validating the checksum. In jobs this large, it can take hours before the reducer finally tries to read the corrupted file and fails. Because each retry of the failed reduce attempt also takes hours, the delay in discovering the failure is multiplied greatly.
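To illustrate the idea of validating while fetching to disk, here is a minimal Java sketch. It is not the Hadoop shuffle code and not the attached patch. The class name `ChecksummedCopy`, the method `copyAndVerify`, and the segment layout (payload bytes followed by a 4-byte big-endian CRC32 trailer) are assumptions made for this example only. The sketch computes the checksum as the bytes stream through to the output, so corruption is reported at fetch time rather than hours later when the file is first read.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

/** Hypothetical helper: copies a checksummed segment to a sink and verifies it on the fly. */
final class ChecksummedCopy {
    /** Assumed trailer size: one 4-byte big-endian CRC32 after the payload. */
    static final int TRAILER_BYTES = 4;

    /**
     * Streams the payload from {@code in} to {@code out} while accumulating a CRC32,
     * then compares it with the trailer. Returns the number of payload bytes written.
     * Throws IOException on a short stream or a checksum mismatch.
     */
    static long copyAndVerify(InputStream in, OutputStream out, long totalLength)
            throws IOException {
        if (totalLength < TRAILER_BYTES) {
            throw new IOException("segment too short to hold a checksum");
        }
        long remaining = totalLength - TRAILER_BYTES;
        CRC32 crc = new CRC32();
        byte[] buf = new byte[64 * 1024];

        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                throw new IOException("unexpected end of stream");
            }
            crc.update(buf, 0, n);   // checksum the same bytes that go to disk
            out.write(buf, 0, n);
            remaining -= n;
        }

        // Read the stored checksum as an unsigned 32-bit value.
        long expected = new DataInputStream(in).readInt() & 0xFFFFFFFFL;
        if (expected != crc.getValue()) {
            throw new IOException("checksum mismatch: expected " + expected
                    + ", computed " + crc.getValue());
        }
        return totalLength - TRAILER_BYTES;
    }
}
```

A caller would pass the fetch stream as `in` and a file stream as `out`, and on an `IOException` would discard the partially written file and report the fetch as failed so the map output can be fetched again or regenerated.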

Attachments

- MAPREDUCE-6166.v5.txt (7 kB, Eric Payne, 11/Dec/14 17:09)
- MAPREDUCE-6166.v4.txt (7 kB, Eric Payne, 10/Dec/14 20:20)
- MAPREDUCE-6166.v3.txt (6 kB, Eric Payne, 04/Dec/14 20:41)
- MAPREDUCE-6166.v2.201411251627.txt (6 kB, Eric Payne, 25/Nov/14 16:31)
- MAPREDUCE-6166.v1.201411221941.txt (6 kB, Eric Payne, 22/Nov/14 19:47)

People

Assignee: Eric Payne

Reporter: Eric Payne

Votes: 0

Watchers: 15

Dates

Created: 19/Nov/14 15:54

Updated: 14/Oct/19 15:38

Resolved: 16/Dec/14 03:29