Details
- Bug
- Status: Resolved
- Major
- Resolution: Duplicate
- 2.4.0
- None
- None
Description
We observed job failures because tasks could not launch on nodes, with the error "java.io.IOException: No space left on device".
On digging further, we found a rogue job that had filled up the disk.
Specifically, it wrote a large number of map spill files (for example, attempt_1432082376223_461647_m_000421_0_spill_10000.out) to nm-local-dir, filling the disk. The job then failed or was killed, but these files were not cleaned up from nm-local-dir.
As a result, the disk remained full, causing subsequent jobs to fail.
This JIRA is created to address why files under nm-local-dir do not get cleaned up when a job fails after filling up the disk.
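To confirm the symptom on an affected node, something like the following could be used to find leftover spill files and total their size. This is only a rough sketch and not part of Hadoop: the function names are hypothetical, the file-name pattern is inferred solely from the one example quoted above, and the directory passed in is assumed to be whatever nm-local-dir points to on that node.

```python
import os
import re

# Pattern inferred from the single spill file name quoted in the description;
# this is an assumption, and real spill file names may differ.
SPILL_RE = re.compile(r"^attempt_\d+_\d+_[mr]_\d+_\d+_spill_\d+\.out$")


def find_spill_files(local_dir):
    """Return a list of (path, size_in_bytes) for spill-like files under local_dir."""
    found = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            if SPILL_RE.match(name):
                path = os.path.join(root, name)
                try:
                    found.append((path, os.path.getsize(path)))
                except OSError:
                    # The file disappeared between listing and stat; skip it.
                    continue
    return found


def total_spill_bytes(local_dir):
    """Sum the sizes of all spill-like files under local_dir."""
    return sum(size for _path, size in find_spill_files(local_dir))
```

Running total_spill_bytes against the node's local directory before and after the job is killed would show whether the spill files are actually being left behind.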
Issue Links
- duplicates YARN-90 NodeManager should identify failed disks becoming good again (Closed)