Apache Tez / TEZ-2550

DAGAppMaster gets locked up due to ATS


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      Thread 30453: (state = IN_NATIVE)
       - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
       - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150 (Compiled frame)
       - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121 (Compiled frame)
       - java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame)
       - java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=286 (Compiled frame)
       - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=345 (Compiled frame)
       - sun.net.www.http.HttpClient.parseHTTPHeader(sun.net.www.MessageHeader, sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) @bci=51, line=703 (Compiled frame)
       - sun.net.www.http.HttpClient.parseHTTP(sun.net.www.MessageHeader, sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) @bci=56, line=647 (Compiled frame)
       - sun.net.www.protocol.http.HttpURLConnection.getInputStream0() @bci=327, line=1534 (Compiled frame)
       - sun.net.www.protocol.http.HttpURLConnection.getInputStream() @bci=52, line=1439 (Compiled frame)
       - java.net.HttpURLConnection.getResponseCode() @bci=16, line=480 (Compiled frame)
       - com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(com.sun.jersey.api.client.ClientRequest) @bci=272, line=240 (Interpreted frame)
       - com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(com.sun.jersey.api.client.ClientRequest) @bci=2, line=147 (Interpreted frame)
       - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run() @bci=11, line=226 (Interpreted frame)
       - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientRetryOp) @bci=11, line=162 (Interpreted frame)
       - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(com.sun.jersey.api.client.ClientRequest) @bci=18, line=237 (Interpreted frame)
       - com.sun.jersey.api.client.Client.handle(com.sun.jersey.api.client.ClientRequest) @bci=35, line=648 (Interpreted frame)
       - com.sun.jersey.api.client.WebResource.handle(java.lang.Class, com.sun.jersey.api.client.ClientRequest) @bci=10, line=670 (Interpreted frame)
       - com.sun.jersey.api.client.WebResource.access$200(com.sun.jersey.api.client.WebResource, java.lang.Class, com.sun.jersey.api.client.ClientRequest) @bci=3, line=74 (Compiled frame)
       - com.sun.jersey.api.client.WebResource$Builder.post(java.lang.Class, java.lang.Object) @bci=12, line=563 (Compiled frame)
       - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(java.lang.Object, java.lang.String) @bci=41, line=472 (Compiled frame)
       - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(java.lang.Object, java.lang.String) @bci=3, line=321 (Compiled frame)
       - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(org.apache.hadoop.yarn.api.records.timeline.TimelineEntity[]) @bci=55, line=301 (Compiled frame)
       - org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(java.util.List) @bci=188, line=343 (Compiled frame)
       - org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop() @bci=273, line=229 (Interpreted frame)
       - org.apache.hadoop.service.AbstractService.stop() @bci=32, line=221 (Interpreted frame)
       - org.apache.hadoop.service.ServiceOperations.stop(org.apache.hadoop.service.Service) @bci=5, line=52 (Interpreted frame)
       - org.apache.hadoop.service.ServiceOperations.stopQuietly(org.apache.commons.logging.Log, org.apache.hadoop.service.Service) @bci=1, line=80 (Interpreted frame)
       - org.apache.hadoop.service.CompositeService.stop(int, boolean) @bci=115, line=157 (Interpreted frame)
       - org.apache.hadoop.service.CompositeService.serviceStop() @bci=58, line=131 (Interpreted frame)
       - org.apache.tez.dag.history.HistoryEventHandler.serviceStop() @bci=11, line=80 (Interpreted frame)
       - org.apache.hadoop.service.AbstractService.stop() @bci=32, line=221 (Interpreted frame)
       - org.apache.hadoop.service.ServiceOperations.stop(org.apache.hadoop.service.Service) @bci=5, line=52 (Interpreted frame)
       - org.apache.hadoop.service.ServiceOperations.stopQuietly(org.apache.commons.logging.Log, org.apache.hadoop.service.Service) @bci=1, line=80 (Interpreted frame)
       - org.apache.hadoop.service.ServiceOperations.stopQuietly(org.apache.hadoop.service.Service) @bci=4, line=65 (Interpreted frame)
       - org.apache.tez.dag.app.DAGAppMaster.stopServices() @bci=137, line=1675 (Interpreted frame)
       - org.apache.tez.dag.app.DAGAppMaster.serviceStop() @bci=30, line=1831 (Interpreted frame)
       - org.apache.hadoop.service.AbstractService.stop() @bci=32, line=221 (Interpreted frame)
       - org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run() @bci=48, line=840 (Interpreted frame)
       - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
      
      .....
      .....
      .....
      .....
      
      
      Thread 26211: (state = BLOCKED)
       - org.apache.tez.dag.app.DAGAppMaster.shutdownTezAM() @bci=0, line=1176 (Interpreted frame)
       - org.apache.tez.dag.api.client.DAGClientHandler.shutdownAM() @bci=22, line=124 (Interpreted frame)
       - org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.shutdownSession(com.google.protobuf.RpcController, org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$ShutdownSessionRequestProto) @bci=55, line=179 (Interpreted frame)
       - org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=152, line=7473 (Compiled frame)
       - org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(org.apache.hadoop.ipc.RPC$Server, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=246, line=619 (Compiled frame)
       - org.apache.hadoop.ipc.RPC$Server.call(org.apache.hadoop.ipc.RPC$RpcKind, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=9, line=962 (Compiled frame)
       - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=38, line=2039 (Compiled frame)
       - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=1, line=2035 (Compiled frame)
       - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame)
       - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=422 (Compiled frame)
       - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1628 (Compiled frame)
       - org.apache.hadoop.ipc.Server$Handler.run() @bci=308, line=2033 (Interpreted frame)
      

      DAGAppMaster.serviceStop() acquires a lock that is never released, because the shutdown thread (Thread 30453) is stuck in a blocking socket read while posting events to ATS. I expected a socket read timeout to apply, but the call never returned; I waited for more than 10-15 minutes. As a result, shutdownTezAM() (Thread 26211) stays blocked and ends up occupying the slot.
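The lock-up pattern described above can be reproduced in isolation. The following is only a minimal sketch, not Tez code: the class `ShutdownLockSketch` and its method bodies are hypothetical stand-ins, and it assumes that `serviceStop()` and `shutdownTezAM()` synchronize on the same object, which the thread dump suggests but does not prove. A `CountDownLatch` takes the place of the ATS socket read that never returns.

```java
import java.util.concurrent.CountDownLatch;

public class ShutdownLockSketch {
    // Stands in for the ATS socket read that never returns (assumption).
    private final CountDownLatch stuckRead = new CountDownLatch(1);
    // Signals that the stopping thread now holds the monitor.
    private final CountDownLatch stopEntered = new CountDownLatch(1);

    // Hypothetical stand-in for DAGAppMaster.serviceStop().
    public synchronized void serviceStop() throws InterruptedException {
        stopEntered.countDown();
        stuckRead.await(); // blocks while still holding the monitor
    }

    // Hypothetical stand-in for DAGAppMaster.shutdownTezAM().
    public synchronized void shutdownTezAM() {
        // Nothing to do; the caller only needs to get past the monitor.
    }

    // Returns the state of the second caller while the first one is stuck.
    public static Thread.State observeSecondCaller() throws InterruptedException {
        ShutdownLockSketch am = new ShutdownLockSketch();
        Thread stopper = new Thread(() -> {
            try {
                am.serviceStop();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "AMShutdownThread");
        Thread rpcHandler = new Thread(am::shutdownTezAM, "IPC-Handler");

        stopper.start();
        am.stopEntered.await(); // the stopper now owns the monitor
        rpcHandler.start();

        // Poll until the second thread is parked on the monitor (5 s cap).
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (rpcHandler.getState() != Thread.State.BLOCKED
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = rpcHandler.getState();

        // Release the simulated read so both threads can finish.
        am.stuckRead.countDown();
        stopper.join();
        rpcHandler.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("second caller state: " + observeSecondCaller());
    }
}
```

In this sketch the second caller can proceed only once the simulated read is released. By analogy, any bound on the blocking call (for example a read timeout on the underlying connection, such as `URLConnection.setReadTimeout`) would let the monitor be released eventually. Whether and where such a timeout can be configured for the ATS client is not established in this ticket.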

      This happened with the latest Tez master (commit ce26b3f52761d2a48a612a7613d99b712a320204). I am not sure whether this is consistently reproducible; I am creating this ticket as a placeholder.

            People

              Assignee: Unassigned
              Reporter: Rajesh Balamohan (rajesh.balamohan)
              Votes: 0
              Watchers: 3
