  1. Slider
  2. SLIDER-1052

Deadlock in slider AM


Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • Slider 0.80

    Description

      I have a hung slider AM in the following state.
      The first app attempt failed to start, so this is the 2nd one. At first it looked as if the 1st attempt's process was still running on the same machine, in a state where I could not jstack it even with -F, although YARN thinks it's killed. Never mind, that was some other process: the first container was on a different machine and did die.
      The 2nd attempt received the container death notification for the first one:

      2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO  appmaster.SliderAppMaster - Container Completion for containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE, exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
      Container killed on request. Exit code is 143
      Container exited with a non-zero exit code 143
      

      Note that this is from the logs of the 2nd attempt's container (container_e02_1450721565699_0007_02_000001). Jstack for the 2nd attempt shows the deadlock:

      Found one Java-level deadlock:
      =============================
      
      "AMRM Callback Handler Thread":
        waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a org/apache/slider/server/appmaster/state/AppState),
        which is held by "main"
      "main":
        waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a org/apache/slider/server/appmaster/SliderAppMaster),
        which is held by "AMRM Callback Handler Thread"
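
The report above is the classic two-monitor cycle: each thread holds one lock and waits for the other. As a minimal sketch of that pattern (not the Slider code), the following provokes the same kind of cycle with two plain objects and detects it in-process with `ThreadMXBean.findMonitorDeadlockedThreads()`. The class name, the thread names, and the `appMaster`/`appState` stand-in objects are assumptions made for illustration only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderInversionDemo {

    /** Provokes a two-monitor deadlock and returns the ids of the stuck threads. */
    public static long[] provokeAndDetect() throws InterruptedException {
        final Object appMaster = new Object(); // stand-in for the SliderAppMaster monitor
        final Object appState = new Object();  // stand-in for the AppState monitor
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Takes appMaster first, then wants appState.
        Thread callback = new Thread(() -> {
            synchronized (appMaster) {
                arriveAndWait(bothHoldFirst);
                synchronized (appState) { /* never reached */ }
            }
        }, "callback-sketch");

        // Takes appState first, then wants appMaster: the opposite order.
        Thread starter = new Thread(() -> {
            synchronized (appState) {
                arriveAndWait(bothHoldFirst);
                synchronized (appMaster) { /* never reached */ }
            }
        }, "main-sketch");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        callback.setDaemon(true);
        starter.setDaemon(true);
        callback.start();
        starter.start();

        // Poll for up to 10 seconds until the JVM reports the monitor cycle.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + 10_000_000_000L;
        while (System.nanoTime() < deadline) {
            long[] ids = mx.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids;
            }
            Thread.sleep(20);
        }
        return new long[0];
    }

    // Ensures both threads hold their first lock before either tries the second.
    private static void arriveAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + provokeAndDetect().length);
    }
}
```

The latch only makes the interleaving deterministic for the demo; in the AM the same cycle would depend on timing between startup and the first container-completion callback.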
      
      

      The jstack was taken with -F, so I cannot actually see thread names in the dump, but these look like the two threads involved (not sure about the first one):

      Thread 11054: (state = BLOCKED)
       - org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus) @bci=0, line=1534 (Interpreted frame)
       - org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List) @bci=119, line=1606 (Interpreted frame)
       - org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run() @bci=141, line=300 (Interpreted frame)
      ...
      Thread 10254: (state = BLOCKED)
       - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 (Interpreted frame)
       - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, line=1369 (Interpreted frame)
       - org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String) @bci=1291, line=822 (Interpreted frame)
       - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, line=576 (Interpreted frame)
       - org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration, java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
       - org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration, java.lang.String[]) @bci=4, line=475 (Interpreted frame)
       - org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List) @bci=21, line=403 (Interpreted frame)
       - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) @bci=143, line=630 (Interpreted frame)
       - org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) @bci=24, line=2327 (Interpreted frame)
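
One reading of the two traces is that the callback path takes the SliderAppMaster monitor and then the AppState monitor, while the startup path holds the AppState monitor and then blocks in a synchronized getter on the service. Under that assumption, one possible way out is to read whatever is needed from the service before taking the state lock, so the two locks are never nested in opposite orders. The sketch below illustrates the idea with hypothetical `Master` and `State` classes; it is not the Slider code and not a proposed patch.

```java
public class LockOrderSketch {

    /** Hypothetical stand-in for AppState. */
    static class State {
        private int completed;

        synchronized void onCompletedNode() {
            completed++;
        }

        synchronized int completedCount() {
            return completed;
        }
    }

    /** Hypothetical stand-in for the application master service. */
    static class Master {
        final State state = new State();
        private final String config = "hypothetical-config";

        synchronized String getConfig() {
            return config;
        }

        // Callback path: master lock first, then state lock.
        synchronized void onContainersCompleted() {
            state.onCompletedNode();
        }

        // Startup path: the master lock is taken and released inside getConfig()
        // BEFORE the state lock is acquired, so this thread never waits for the
        // master lock while holding the state lock.
        int createAndRunCluster() {
            String cfg = getConfig();
            synchronized (state) {
                return cfg.length() + state.completedCount();
            }
        }
    }

    /** Runs both paths concurrently; returns true if both finish within the timeout. */
    public static boolean runBoth(int iterations) throws InterruptedException {
        Master master = new Master();
        Thread callback = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                master.onContainersCompleted();
            }
        });
        Thread startup = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                master.createAndRunCluster();
            }
        });
        callback.setDaemon(true);
        startup.setDaemon(true);
        callback.start();
        startup.start();
        callback.join(10_000);
        startup.join(10_000);
        return !callback.isAlive() && !startup.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("both paths finished: " + runBoth(100_000));
    }
}
```

The alternative of always acquiring the two locks in one fixed order would also remove the cycle; hoisting the read is shown here only because it keeps the synchronized regions smaller.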
      
      
      

            People

              stevel@apache.org Steve Loughran
              sershe Sergey Shelukhin