[MAPREDUCE-2413] TaskTracker should handle disk failures at both startup and runtime - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.20.204.0
Fix Version/s: 0.20.204.0
Component/s: task-controller, tasktracker
Labels: None
Hadoop Flags: Reviewed

Description

At present, TaskTracker does not handle disk failures properly, either at startup or at runtime.

(1) Currently, TaskTracker does not come up if any of the mapred-local-dirs is on a bad disk. TaskTracker should ignore that particular mapred-local-dir, start up, and use only the remaining good mapred-local-dirs.
(2) If a disk goes bad while TaskTracker is running, TaskTracker currently does nothing special. This results in either
(a) TaskTracker continuing to try to use the bad disk, which causes many task failures and possibly job failures (because multiple TTs have bad disks), and eventually these TTs get graylisted for all jobs. Recovery requires a manual restart of the TT with a modified mapred-local-dirs configuration that avoids the bad disk; or
(b) the health check script identifying the disk as bad and the TT getting blacklisted. This also requires a manual restart of the TT with a modified mapred-local-dirs configuration that avoids the bad disk.

This JIRA is to make TaskTracker more fault-tolerant to disk failures by solving (1) and (2); i.e., TT should start as long as at least one of the mapred-local-dirs is on a good disk, and TT should adjust its in-memory list of mapred-local-dirs to avoid using bad mapred-local-dirs.
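
The behavior requested in (1) and (2) can be illustrated with a minimal Java sketch. This is not the code from the attached patches: the class `LocalDirTracker`, its methods, and the simplified usability check (the directory exists or can be created, and is readable, writable, and executable) are all hypothetical names and assumptions chosen for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not the class used in the MR-2413 patch. */
final class LocalDirTracker {
    // In-memory list of local dirs currently believed to be on good disks.
    private final List<String> goodDirs = new ArrayList<>();

    /** Startup: keep only the usable dirs; fail only if none is usable. */
    LocalDirTracker(List<String> configuredDirs) {
        for (String dir : configuredDirs) {
            if (isUsable(new File(dir))) {
                goodDirs.add(dir);
            }
        }
        if (goodDirs.isEmpty()) {
            throw new IllegalStateException(
                "No usable local dirs among " + configuredDirs);
        }
    }

    /** Simplified check (an assumption): dir exists or can be created, and is rwx. */
    static boolean isUsable(File dir) {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            return false;
        }
        return dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /** Runtime: re-check every dir, drop those that went bad, return the dropped ones. */
    synchronized List<String> recheck() {
        List<String> failed = new ArrayList<>();
        for (String dir : goodDirs) {
            if (!isUsable(new File(dir))) {
                failed.add(dir);
            }
        }
        goodDirs.removeAll(failed);
        return failed;
    }

    /** Read-only snapshot of the dirs that are currently considered good. */
    synchronized List<String> getGoodDirs() {
        return Collections.unmodifiableList(new ArrayList<>(goodDirs));
    }
}
```

In this sketch a caller would construct the tracker once from the configured mapred-local-dirs and invoke `recheck()` periodically, allocating task space only from `getGoodDirs()`. What the daemon should do when `recheck()` leaves the list empty is left open here.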

Attachments

MR-2413.v0.1.patch, 08/Apr/11 17:16, 44 kB, Ravi Gummadi
MR-2413.v0.2.patch, 20/Apr/11 00:13, 44 kB, Jagane Sundar
MR-2413.v0.3.patch, 20/Apr/11 05:50, 44 kB, Ravi Gummadi
MR-2413.v0.patch, 31/Mar/11 20:52, 44 kB, Ravi Gummadi

Issue Links

blocks

MAPREDUCE-2415 Distribute TaskTracker userlogs onto multiple disks

Closed

is related to

MAPREDUCE-2850 Add test for TaskTracker disk failure handling (MR-2413)

Closed

MAPREDUCE-2928 MR-2413 improvements

Closed

MAPREDUCE-2959 The TT daemon should shutdown on fatal exceptions

Open

supersedes

MAPREDUCE-134 TaskTracker startup fails if any mapred.local.dir entries don't exist

Resolved

People

Assignee: Ravi Gummadi
Reporter: Bharath Mundlapudi
Votes: 0
Watchers: 10

Dates

Created: 31/Mar/11 18:43
Updated: 29/Aug/13 04:30
Resolved: 22/Apr/11 23:07