  1. Hive
  2. HIVE-12904

LLAP: deadlock in task scheduling


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None

    Description

      Thread 34107: (state = BLOCKED)
       - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.isInWaitQueue() @bci=0, line=690 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.finishableStateUpdated(org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper, boolean) @bci=8, line=485 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.access$1500(org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService, org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper, boolean) @bci=3, line=78 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.finishableStateUpdated(boolean) @bci=27, line=733 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo$FinishableStateTracker.sourceStateUpdated(java.lang.String) @bci=76, line=210 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo.sourceStateUpdated(java.lang.String) @bci=5, line=164 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.registerSourceStateChange(java.lang.String, java.lang.String, org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateProto) @bci=34, line=228 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.sourceStateUpdated(org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateUpdatedRequestProto) @bci=47, line=255 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.sourceStateUpdated(org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateUpdatedRequestProto) @bci=5, line=328 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.sourceStateUpdated(com.google.protobuf.RpcController, org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateUpdatedRequestProto) @bci=5, line=105 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=80, line=13067 (Compiled frame)
       - org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(org.apache.hadoop.ipc.RPC$Server, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=246, line=616 (Compiled frame)
       - org.apache.hadoop.ipc.RPC$Server.call(org.apache.hadoop.ipc.RPC$RpcKind, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=9, line=969 (Compiled frame)
       - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=38, line=2151 (Compiled frame)
       - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=1, line=2147 (Compiled frame)
       - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame)
       - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=422 (Compiled frame)
       - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1657 (Compiled frame)
       - org.apache.hadoop.ipc.Server$Handler.run() @bci=315, line=2145 (Interpreted frame)
      
      
      and 
      
      
      Thread 34500: (state = BLOCKED)
       - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo$FinishableStateTracker.unregisterForUpdates(org.apache.hadoop.hive.llap.daemon.FinishableStateUpdateHandler) @bci=0, line=195 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo.unregisterFinishableStateUpdate(org.apache.hadoop.hive.llap.daemon.FinishableStateUpdateHandler) @bci=5, line=160 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.QueryFragmentInfo.unregisterForFinishableStateUpdates(org.apache.hadoop.hive.llap.daemon.FinishableStateUpdateHandler) @bci=5, line=143 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.maybeUnregisterForFinishedStateNotifications() @bci=20, line=681 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$InternalCompletionListener.onSuccess(org.apache.tez.runtime.task.TaskRunner2Result) @bci=32, line=548 (Compiled frame)
       - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$InternalCompletionListener.onSuccess(java.lang.Object) @bci=5, line=535 (Compiled frame)
       - com.google.common.util.concurrent.Futures$4.run() @bci=55, line=1149 (Compiled frame)
       - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1142 (Compiled frame)
       - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=617 (Interpreted frame)
       - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
      
      "IPC Server handler 0 on 15001":
        waiting to lock Monitor@0x00007f5d322ecb08 (Object@0x00007f67032cd2c0, a org/apache/hadoop/hive/llap/daemon/impl/TaskExecutorService$TaskWrapper),
        which is held by "ExecutionCompletionThread #0"
      "ExecutionCompletionThread #0":
        waiting to lock Monitor@0x00007f6066b9e8c8 (Object@0x00007f66b6570200, a org/apache/hadoop/hive/llap/daemon/impl/QueryInfo$FinishableStateTracker),
        which is held by "IPC Server handler 0 on 15001"
      
      Found a total of 1 deadlock.
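
A monitor deadlock like the one in this dump can also be confirmed from inside the JVM. This is a minimal sketch that assumes only the standard java.lang.management API. The class name DeadlockProbe is made up for illustration and is not part of LLAP.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Hive/LLAP.
class DeadlockProbe {
    /** Returns one line per thread stuck in a monitor deadlock, or an empty array if there is none. */
    static String[] describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked on object monitors.
        long[] ids = mx.findMonitorDeadlockedThreads();
        if (ids == null) {
            return new String[0];
        }
        ThreadInfo[] infos = mx.getThreadInfo(ids);
        String[] out = new String[infos.length];
        for (int i = 0; i < infos.length; i++) {
            ThreadInfo ti = infos[i];
            // An entry is null if the thread exited between the two calls.
            out[i] = (ti == null)
                ? "thread no longer alive"
                : ti.getThreadName() + " waiting to lock " + ti.getLockName()
                    + " held by " + ti.getLockOwnerName();
        }
        return out;
    }
}
```

Called periodically from a watchdog thread, a probe of this kind would report the same pair of threads and locks that the dump above shows.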
      
      

      Looks like it's caused by synchronized methods that take the two locks in opposite orders:

      TaskWrapper:
      public synchronized void maybeUnregisterForFinishedStateNotifications()

      eventually calls

      FinishableStateTracker:
      synchronized void unregisterForUpdates(FinishableStateUpdateHandler handler) {

      and

      FinishableStateTracker:
      synchronized void sourceStateUpdated(String sourceName) {

      eventually calls

      TaskWrapper:
      public synchronized boolean isInWaitQueue() {

      The latter only returns a boolean, so it definitely doesn't need to be synchronized. However, I don't know whether there are other similar issues or what actually has to stay inside the synchronized blocks; perhaps there's a better fix.
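
To illustrate the point about the boolean getter: a minimal sketch, assuming the flag is a plain boolean field that one thread writes and others read. WaitQueueFlag is a hypothetical class, not the real TaskWrapper. Simply dropping synchronized would lose the visibility guarantee, so the sketch marks the field volatile instead.

```java
// Hypothetical stand-in for the flag held by TaskWrapper.
class WaitQueueFlag {
    // volatile makes writes visible to other threads without taking the object's monitor,
    // so a reader can never block on a lock held by another thread.
    private volatile boolean inWaitQueue;

    boolean isInWaitQueue() {
        return inWaitQueue;
    }

    void setInWaitQueue(boolean value) {
        inWaitQueue = value;
    }
}
```

This only covers a single independent flag. If the value has to change atomically together with other fields, a lock is still needed around that update.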

      Overall, I'd say synchronized methods should not be used on objects that call into any other non-trivial objects. Perhaps for now it would be good to replace all synchronized methods with synchronized blocks that cover the entire method, and to remove the unnecessary ones such as isInWaitQueue. The scope of the blocks can then be narrowed later based on the logic.
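
The narrowing described above could look roughly like the following. This is a sketch under stated assumptions, not the change made in the attached patches. Tracker, Task and LockScopeDemo are hypothetical stand-ins for FinishableStateTracker, TaskWrapper and the two threads in the dump. Each class updates its own state inside a synchronized block and calls the other object only after releasing its lock, so neither thread holds one monitor while waiting for the other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FinishableStateTracker.
class Tracker {
    private final List<Task> handlers = new ArrayList<>();

    void register(Task t) {
        synchronized (this) { handlers.add(t); }
    }

    void unregister(Task t) {
        synchronized (this) { handlers.remove(t); }
    }

    // Copy the handler list under the lock, then call the handlers without holding it.
    int sourceStateUpdated() {
        List<Task> snapshot;
        synchronized (this) { snapshot = new ArrayList<>(handlers); }
        int waiting = 0;
        for (Task t : snapshot) {
            if (t.isInWaitQueue()) {
                waiting++;
            }
        }
        return waiting;
    }
}

// Hypothetical stand-in for TaskWrapper.
class Task {
    private final Tracker tracker;
    private boolean registered = true;
    private boolean inWaitQueue = true;

    Task(Tracker tracker) { this.tracker = tracker; }

    boolean isInWaitQueue() {
        synchronized (this) { return inWaitQueue; }
    }

    void maybeUnregister() {
        boolean doUnregister;
        synchronized (this) {          // only the local state change is guarded
            doUnregister = registered;
            registered = false;
            inWaitQueue = false;
        }
        if (doUnregister) {
            tracker.unregister(this);  // called after the Task lock is released
        }
    }
}

class LockScopeDemo {
    /** Runs both call paths concurrently; returns true if both threads finish. */
    static boolean runOnce() {
        Tracker tracker = new Tracker();
        Task task = new Task(tracker);
        tracker.register(task);
        Thread ipc = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                tracker.sourceStateUpdated();
            }
        });
        Thread completion = new Thread(task::maybeUnregister);
        ipc.setDaemon(true);
        completion.setDaemon(true);
        ipc.start();
        completion.start();
        try {
            ipc.join(5000);
            completion.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !ipc.isAlive() && !completion.isAlive();
    }
}
```

Either change on its own (the snapshot in Tracker or the narrowed block in Task) would break the lock cycle seen in the dump; the sketch applies both to show the general rule of not calling out to other objects while holding a lock.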

      Attachments

        1. HIVE-12904.2.patch
          11 kB
          Siddharth Seth
        2. HIVE-12904.3.patch
          12 kB
          Siddharth Seth
        3. HIVE-12904.patch
          5 kB
          Sergey Shelukhin


          People

            sershe Sergey Shelukhin
            huizane Hui Zheng
            Votes: 0
            Watchers: 4
