[YARN-4011] Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.4.0
Fix Version/s: None
Component/s: yarn
Labels: None

Description

We observed job failures because tasks could not launch on nodes, failing with "java.io.IOException No space left on device".
On digging further, we found a rogue job that had filled up the disk.
Specifically, it wrote a large number of map spill files (for example attempt_1432082376223_461647_m_000421_0_spill_10000.out) to nm-local-dir until the disk filled up. The job then failed or was killed, but these files were not cleaned up from nm-local-dir.
As a result the disk remained full, causing subsequent jobs to fail.

This JIRA is created to address why files under nm-local-dir do not get cleaned up when a job fails after filling up the disk.
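
To check whether a node is in this state, a small diagnostic script can help. The following is a minimal sketch, not part of Hadoop or YARN: it walks a local directory, matches file names against the spill-file pattern quoted above, and sums their sizes per application. The function names, the `running_app_ids` argument (which the operator would have to obtain separately, for example from the ResourceManager), and the mapping from the attempt name to an `application_<timestamp>_<id>` string are assumptions made for illustration.

```python
import os
import re
from collections import defaultdict

# Pattern derived from the file name quoted in this report, e.g.
# attempt_1432082376223_461647_m_000421_0_spill_10000.out
SPILL_RE = re.compile(r"^attempt_(\d+)_(\d+)_[mr]_\d+_\d+_spill_\d+\.out$")


def spill_usage_by_app(local_dir):
    """Return {application_id: total_bytes} for spill files under local_dir."""
    usage = defaultdict(int)
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            match = SPILL_RE.match(name)
            if match is None:
                continue
            # Assumption: the attempt's timestamp and sequence number
            # identify the owning application.
            app_id = "application_{}_{}".format(match.group(1), match.group(2))
            try:
                usage[app_id] += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may vanish while we scan; skip it.
                continue
    return dict(usage)


def leftover_spills(local_dir, running_app_ids):
    """Keep only spill usage for applications that are no longer running."""
    return {
        app_id: size
        for app_id, size in spill_usage_by_app(local_dir).items()
        if app_id not in running_app_ids
    }
```

Calling `leftover_spills(path_to_nm_local_dir, running_app_ids)` returns the bytes still held by spill files of applications that have already finished, which is the symptom described in this report. The script only reports; it deliberately deletes nothing.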

Issue Links

duplicates: YARN-90 NodeManager should identify failed disks becoming good again (Closed)

People

Assignee: Unassigned

Reporter: Ashwin Shankar

Votes: 0

Watchers: 5

Dates

Created: 03/Aug/15 19:03

Updated: 23/Sep/15 15:47

Resolved: 03/Aug/15 20:59