Details
- Bug
- Status: Open
- Critical
- Resolution: Unresolved
- Slider 0.80
- None
- None
- None
Description
I have a hung Slider AM in the following state.
The first app attempt failed to start, so this is the 2nd one. At first it looked like the 1st attempt's process was still running on the same machine, in a state where I could not jstack it even with -F, but that was some other process. The first container was on a different machine and did die.
The 2nd attempt received the container death notification for the first one:
2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - Container Completion for containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE, exitStatus=-105, diagnostics=Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
Note that this is from the logs of the 2nd container (container_e02_1450721565699_0007_02_000001). The jstack for the 2nd attempt shows the deadlock:
Found one Java-level deadlock:
=============================
"AMRM Callback Handler Thread":
  waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a org/apache/slider/server/appmaster/state/AppState),
  which is held by "main"
"main":
  waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a org/apache/slider/server/appmaster/SliderAppMaster),
  which is held by "AMRM Callback Handler Thread"
The jstack was taken with -F, so I cannot actually see thread names in the dump, but these look like the two threads involved (I am not sure about the first one):
Thread 11054: (state = BLOCKED)
 - org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus) @bci=0, line=1534 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List) @bci=119, line=1606 (Interpreted frame)
 - org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run() @bci=141, line=300 (Interpreted frame)
...
Thread 10254: (state = BLOCKED)
 - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, line=1369 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String) @bci=1291, line=822 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, line=576 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration, java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration, java.lang.String[]) @bci=4, line=475 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List) @bci=21, line=403 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) @bci=143, line=630 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) @bci=24, line=2327 (Interpreted frame)
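For anyone trying to picture the cycle, the pattern in the dump above looks like a classic lock-order inversion: one thread holds the SliderAppMaster monitor and wants the AppState monitor, while the other holds AppState and wants SliderAppMaster. The sketch below is not Slider code. It is a minimal standalone illustration, assuming two plain objects as stand-ins for the two monitors; the class name `LockOrderSketch` and the thread names are made up. It forces the same cycle and then asks the JVM to report it via `ThreadMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderSketch {

    /** Takes 'first', waits until the other thread also holds its first lock, then requests 'second'. */
    static Thread locker(String name, Object first, Object second, CountDownLatch ready) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                ready.countDown();
                try {
                    ready.await(); // both threads now hold one monitor each
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds 'second' and waits for 'first'
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit even though these threads stay blocked
        t.start();
        return t;
    }

    /** Returns how many of the two threads the JVM reports as monitor-deadlocked (0 on timeout). */
    static int deadlockedThreadCount() {
        Object master = new Object(); // hypothetical stand-in for the SliderAppMaster monitor
        Object state = new Object();  // hypothetical stand-in for the AppState monitor
        CountDownLatch ready = new CountDownLatch(2);
        Thread callback = locker("callback-like", master, state, ready);
        Thread main = locker("main-like", state, master, ready);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads(); // null when no cycle exists
                if (ids != null) {
                    int n = 0;
                    for (long id : ids) {
                        if (id == callback.getId() || id == main.getId()) n++;
                    }
                    if (n > 0) return n;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

Running `main` should report both threads as deadlocked. The usual ways out of this pattern are to acquire the two monitors in the same order everywhere, or to avoid taking one of them on the contended path at all.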
Issue Links
- is depended upon by: SLIDER-1069 Prepare & release Slider 0.82 (Open)
- is related to: YARN-4593 Deadlock in AbstractService.getConfig() (Resolved)