[YARN-92] NM disk failure detection only covers local dirs - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: nodemanager
Labels: None

Description

This is the MR counterpart to HDFS-1848. Like HDFS volume failure detection, NM disk failure detection checks only a subset of the disks and a subset of the directories. E.g., the TT and the NM do not check the root disk for errors unless a local dir resides on it. Even if a local dir resides on the root disk, the disk checking code only checks the local dirs, so a failure that is only seen when accessing a part of the disk not hosting the local dirs will not be noticed. The disk that hosts the logs, pid, tmp dirs, etc. is critical, so it needs to be checked as well, and the NM should shut down if a critical disk is not available (to prevent MR issues similar to HDFS-1848 and ~~HDFS-2095~~). People currently work around this limitation (aside from ignoring it) by using RAID-1 for the root disk or a health script that checks the root disk health.
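
To make the request concrete, here is a minimal sketch of the kind of check being asked for: probing directories other than the local dirs (logs, pid, tmp) and reporting any that are unusable. It is not the NM's actual disk checking code. The class name `CriticalDirChecker` and its methods are hypothetical, and which directories count as critical is an assumption left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: probes a list of "critical"
// directories (e.g. log, pid, tmp dirs) in addition to the local dirs.
public class CriticalDirChecker {

    // A directory is considered usable if it exists, is readable and
    // writable, and a probe file can actually be created and deleted in it.
    static boolean isUsable(Path dir) {
        if (!Files.isDirectory(dir) || !Files.isReadable(dir) || !Files.isWritable(dir)) {
            return false;
        }
        try {
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException e) {
            // Any I/O error while touching the directory counts as a failure.
            return false;
        }
    }

    // Returns the subset of the given directories that failed the probe.
    static List<Path> findFailedDirs(List<Path> criticalDirs) {
        List<Path> failed = new ArrayList<>();
        for (Path dir : criticalDirs) {
            if (!isUsable(dir)) {
                failed.add(dir);
            }
        }
        return failed;
    }

    // Usage: java CriticalDirChecker <dir> [<dir> ...]
    // Exits non-zero if any directory is unusable, so a caller could
    // decide to shut the daemon down or mark the node unhealthy.
    public static void main(String[] args) {
        List<Path> dirs = new ArrayList<>();
        for (String arg : args) {
            dirs.add(Paths.get(arg));
        }
        List<Path> failed = findFailedDirs(dirs);
        if (!failed.isEmpty()) {
            System.err.println("Critical directories unavailable: " + failed);
            System.exit(1);
        }
        System.out.println("All critical directories are usable");
    }
}
```

A probe like this only exercises the directories it is given, so it shares the limitation described above: a fault on a part of the disk that hosts none of the listed directories would still go unnoticed.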

Issue Links

relates to: HDFS-2095 org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check storm making data node unavailable (Resolved)

People

Assignee: Unassigned

Reporter: Eli Collins

Votes: 0

Watchers: 7

Dates

Created: 27/Nov/11 22:31

Updated: 08/Sep/12 00:14