Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
This might be a known issue. Since FileSystemRMStateStore isn't used in HA scenarios, it might not be that important, unless there is something we need to fix at the RM layer to make it more tolerant of RMStore issues.
When the RM was hard shut down, the OS might not have had a chance to persist blocks to disk. Some of the stored application data ended up with size zero after reboot, and the RM then failed to load its state on startup.
ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
total 156
drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
-rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_000001
-rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_000001.crc
-rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
-rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
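The zero-length files are consistent with data that was written but never flushed to disk before the machine went down. As an illustration only, here is a minimal sketch of a write path that forces the bytes to disk before publishing the file under its final name. It uses plain `java.nio` rather than Hadoop's `FileSystem` API, and the class `DurableWriteSketch`, the helper `writeDurably`, and the `.tmp` suffix are all hypothetical names. It is not how FileSystemRMStateStore is implemented.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class DurableWriteSketch {

    // Hypothetical helper: write to a sibling temp file, force it to disk,
    // then rename it onto the target so readers never see a partial file.
    static void writeDurably(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Ask the OS to flush file content and metadata to the device.
            ch.force(true);
        }
        // Publish the fully written file under its final name.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore-sketch");
        Path file = dir.resolve("application_example");
        writeDurably(file, "app-state".getBytes(StandardCharsets.UTF_8));
        System.out.println(file + " size=" + Files.size(file));
    }
}
```

This sketch does not sync the parent directory, so the rename itself may still be lost on a hard shutdown. It only narrows the window in which an empty file can be left under the final name.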
When the RM starts up:
2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception:
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
...
2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
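The NullPointerException suggests that the recovery path received application state that was never populated from the empty file. One way the RM layer could be more tolerant is to skip zero-length entries while loading and report them, rather than passing them on. The following is a rough sketch of that idea under stated assumptions: it uses plain `java.nio` instead of Hadoop's `FileSystem` API, and `TolerantRecoverySketch`, `Result`, and `loadAppDir` are hypothetical names. It is not the actual FileSystemRMStateStore code, and it is not necessarily the approach taken in YARN-2010.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TolerantRecoverySketch {

    // Hypothetical result holder: payloads that loaded, and entries skipped.
    static final class Result {
        final Map<String, byte[]> loaded = new TreeMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    // Reads every state file in one application directory. Zero-length
    // files are recorded as skipped instead of being handed to recovery.
    static Result loadAppDir(Path appDir) throws IOException {
        Result result = new Result();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(appDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                // Ignore subdirectories and checksum side files.
                if (!Files.isRegularFile(entry) || name.endsWith(".crc")) {
                    continue;
                }
                if (Files.size(entry) == 0) {
                    result.skipped.add(name);
                    continue;
                }
                result.loaded.put(name, Files.readAllBytes(entry));
            }
        }
        Collections.sort(result.skipped);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore-recovery-sketch");
        Files.write(dir.resolve("application_example"), new byte[0]);
        Files.write(dir.resolve("appattempt_example_000001"), new byte[] {1, 2, 3});
        Result r = loadAppDir(dir);
        System.out.println("loaded=" + r.loaded.keySet() + " skipped=" + r.skipped);
    }
}
```

Whether a skipped application should be dropped, marked failed, or should abort startup is a policy decision that this sketch leaves open.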
Issue Links
- relates to YARN-2010 Handle app-recovery failures gracefully (Closed)