Apache Tez / TEZ-4543

Throw a special exception to DagClient when there is no current DAG

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.4
    • Component/s: None
    • Labels: None

    Description

      Given the following scenario:

      1. A DAG is assigned to an AM.
      2. The AM is killed (e.g. OOMKilled by k8s). HS2 keeps polling for the DAG status and hits network errors:

      hiveserver2 <14>1 2024-02-26T15:59:56.538Z hiveserver2-0 hiveserver2 1 dedef3f4-339f-4ba3-a6ae-300751d3561d [mdc@18060 class="client.DAGClientImpl" dagId="dag_1708961199044_0003_1" level="INFO" operationLogLevel="EXECUTION" queryId="hive_20240226155836_6b1e9eb9-efd7-42fd-8872-f4189c5dda3a" sessionId="9e4cb344-ad7f-4344-9b24-aedaf0e73bf4" thread="HiveServer2-Background-Pool: Thread-129"] Cannot retrieve DAG Status due to IOException: DestHost:destPort query-coordinator-0-0.query-coordinator-0-service.compute-1708603165-qlg5.svc.cluster.local:22222 , LocalHost:localPort hiveserver2-0/100.100.83.80:0. Failed on local exception: java.io.IOException: java.io.IOException: Connection reset by peer
      

      At this point, HS2 cannot tell whether the AM is lost for good or whether this is a recoverable, intermittent network issue.

      3. The AM restarts quickly, and the DagClient in HS2 tries to fetch the DAG status (getDagStatus call) from the restarted coordinator. HS2 cannot tell that it is now talking to a new AM, so it keeps asking for the DAG status.
      4. In the AM, the exception below is thrown on every such call, and the DagClient does not handle it:

       <14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221
      org.apache.tez.dag.api.TezException: No running dag at present
          at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
          at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
          at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
          at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
          at java.base/java.security.AccessController.doPrivileged(Native Method)
          at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
      

      The AM should return a specialized exception that the client can handle.
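
The scenario above can be sketched as a small, self-contained Java example. It is only an illustration of the idea, not the actual Tez change: `DagException`, `NoCurrentDagException`, `StatusSource`, `Outcome`, and `poll` are all hypothetical names invented here, and the real Tez classes and RPC plumbing are deliberately left out. The point it shows is that a dedicated exception subtype lets the client distinguish "the AM is reachable but has no current DAG" from a transient network error and stop polling.

```java
import java.io.IOException;

public class NoCurrentDagSketch {

    /** Hypothetical stand-in for a generic, TezException-like checked exception. */
    static class DagException extends Exception {
        DagException(String message) {
            super(message);
        }
    }

    /** Hypothetical specialized subtype: the AM answered, but it has no current DAG. */
    static class NoCurrentDagException extends DagException {
        NoCurrentDagException(String dagId) {
            super("No running DAG at present: " + dagId);
        }
    }

    /** Hypothetical minimal view of the status call made against the AM. */
    interface StatusSource {
        String getDagStatus(String dagId) throws IOException, DagException;
    }

    enum Outcome { STATUS_RECEIVED, DAG_LOST, GAVE_UP }

    /** Polls the AM, retrying on generic failures but stopping once the DAG is known to be gone. */
    static Outcome poll(StatusSource am, String dagId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                am.getDagStatus(dagId);
                return Outcome.STATUS_RECEIVED;
            } catch (NoCurrentDagException e) {
                // The (possibly restarted) AM is reachable but does not know this DAG:
                // further polling cannot succeed, so report the DAG as lost.
                return Outcome.DAG_LOST;
            } catch (IOException | DagException e) {
                // Possibly a transient problem (e.g. connection reset): try again.
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated AM: two network errors while it restarts, then "no current DAG".
        StatusSource restartedAm = dagId -> {
            if (calls[0]++ < 2) {
                throw new IOException("Connection reset by peer");
            }
            throw new NoCurrentDagException(dagId);
        };
        System.out.println(poll(restartedAm, "dag_1708961199044_0003_1", 10));
    }
}
```

Catching the subtype before the generic exceptions keeps the retry path for network errors unchanged while giving the "no current DAG" case its own terminal outcome.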

            People

              Assignee: László Bodor (abstractdog)
              Reporter: László Bodor (abstractdog)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1.5h