Hadoop YARN / YARN-1422

RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      If getQueueUserAclInfo() on a parent/root queue (e.g. via CapacityScheduler.getQueueUserAclInfo) is called, and a container is completing, then the ResourceManager can deadlock.

      It is similar to https://issues.apache.org/jira/browse/YARN-325.

      More details:

      • Thread A ("ResourceManager Event Processor" in the stack trace)
        1) Inside a synchronized block (lock id 0x00000000c18d8870, the LeafQueue instance), LeafQueue.completedContainer wants to inform the parent queue that a container is completing and invokes ParentQueue.completedContainer.
        3) ParentQueue.completedContainer waits to acquire the lock on the parent queue (lock id 0x00000000c1846350, the ParentQueue instance) before entering its synchronized block. It cannot acquire this lock because Thread B already holds it.
      • Thread B ("IPC Server handler 39 on 8032" in the stack trace)
        0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. It invokes the synchronized method ParentQueue.getQueueUserAclInfo (lock id 0x00000000c1846350, the ParentQueue instance) and acquires the lock that Thread A will later wait for.
        2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the child queues' ACLs and calls the synchronized method LeafQueue.getQueueUserAclInfo, which needs the lock on the LeafQueue instance (lock id 0x00000000c18d8870). That lock is already held by LeafQueue.completedContainer in Thread A.

      The order that causes the deadlock: B0 -> A1 -> B2 -> A3.
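
The interleaving above can be modelled outside of YARN. The following is only a minimal sketch, not the actual scheduler code: the classes Parent and Leaf are hypothetical stand-ins for ParentQueue and LeafQueue, and the two latches exist solely to force the B0 -> A1 order so that the lock-order inversion shows up on every run. Both threads are daemon threads, so the JVM can still exit once the check is done.

```java
import java.util.concurrent.CountDownLatch;

public class LockOrderInversionDemo {

    // Hypothetical stand-in for LeafQueue: both methods lock the leaf's monitor.
    static final class Leaf {
        Parent parent; // assigned by the Parent constructor before any thread starts
        final CountDownLatch leafLocked = new CountDownLatch(1);

        synchronized void completedContainer() {
            leafLocked.countDown();       // A1: the leaf monitor is now held
            parent.completedContainer();  // A3: needs the parent monitor, held by B
        }

        synchronized void getQueueUserAclInfo() { }
    }

    // Hypothetical stand-in for ParentQueue: both methods lock the parent's monitor.
    static final class Parent {
        final Leaf child;
        final CountDownLatch parentLocked = new CountDownLatch(1);

        Parent(Leaf child) {
            this.child = child;
            child.parent = this;
        }

        synchronized void getQueueUserAclInfo() throws InterruptedException {
            parentLocked.countDown();     // B0: the parent monitor is now held
            child.leafLocked.await();     // demo only: wait until A1 has happened
            child.getQueueUserAclInfo();  // B2: needs the leaf monitor, held by A
        }

        synchronized void completedContainer() { }
    }

    // Returns true once both threads are blocked on each other's monitor.
    static boolean reproduce() {
        Leaf leaf = new Leaf();
        Parent parent = new Parent(leaf);

        Thread b = new Thread(() -> {
            try {
                parent.getQueueUserAclInfo();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "thread-B");

        Thread a = new Thread(() -> {
            try {
                parent.parentLocked.await(); // demo only: make sure B0 comes first
                leaf.completedContainer();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "thread-A");

        a.setDaemon(true);
        b.setDaemon(true);
        b.start();
        a.start();

        try {
            // Poll for up to about five seconds until both threads are BLOCKED.
            for (int i = 0; i < 500; i++) {
                if (a.getState() == Thread.State.BLOCKED
                        && b.getState() == Thread.State.BLOCKED) {
                    return true;
                }
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: " + reproduce());
    }
}
```

In this model each thread can only be BLOCKED on a monitor that the other thread holds and never releases, so seeing both in that state means the cycle has formed. The parent is locked first on one path and the leaf first on the other, which is the same inversion the description reports.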

      Java Stacktrace

      Found one Java-level deadlock:
      =============================
      "1956747953@qtp-109760451-1959":
        waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
        which is held by "IPC Server handler 39 on 8032"
      "IPC Server handler 39 on 8032":
        waiting to lock monitor 0x00000000422bbc58 (object 0x00000000c18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
        which is held by "ResourceManager Event Processor"
      "ResourceManager Event Processor":
        waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
        which is held by "IPC Server handler 39 on 8032"
      
      Java stack information for the threads listed above:
      ===================================================
      "1956747953@qtp-109760451-1959":
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
      	- waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
      	at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.<init>(CapacitySchedulerInfo.java:49)
      	at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
      	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
      	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
      	at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
      	at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
      	at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
      	at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
      	at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
      	at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
      	at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
      	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
      	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
      	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
      	at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
      	at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
      	at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
      	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
      	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
      	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
      	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
      	at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
      	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
      	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
      	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
      	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      	at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
      	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
      	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
      	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
      	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
      	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
      	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
      	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
      	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
      	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
      	at org.mortbay.jetty.Server.handle(Server.java:326)
      	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
      	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
      	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
      	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
      	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
      	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
      	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
      "IPC Server handler 39 on 8032":
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueUserAclInfo(LeafQueue.java:544)
      	- waiting to lock <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueUserAclInfo(ParentQueue.java:351)
      	- locked <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueUserAclInfo(CapacityScheduler.java:622)
      	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:517)
      	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueUserAcls(ApplicationClientProtocolPBServiceImpl.java:225)
      	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:255)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:396)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
      "ResourceManager Event Processor":
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.completedContainer(ParentQueue.java:693)
      	- waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1460)
      	- locked <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:838)
      	- locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:648)
      	- locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
      	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
      	at java.lang.Thread.run(Thread.java:662)
      
      Found 1 deadlock.
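
A report like the one above can also be approximated from inside a JVM through the standard java.lang.management API. This is a hedged sketch rather than anything the ResourceManager is known to do: the class name DeadlockReport and the output format are made up for illustration, and findDeadlockedThreads() may throw UnsupportedOperationException on JVMs that do not support synchronizer usage monitoring.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockReport {

    // Returns one line per deadlocked thread, or an empty list if none are found.
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();

        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is detected
        if (ids == null) {
            return lines;
        }

        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            lines.add(String.format("\"%s\" is waiting to lock %s, which is held by \"%s\"",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName()));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = describeDeadlocks();
        if (lines.isEmpty()) {
            System.out.println("No deadlock found.");
        } else {
            lines.forEach(System.out::println);
        }
    }
}
```

Called periodically from a monitoring thread, a check of this kind could flag a hung scheduler without someone having to run jstack by hand.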
      

              People

              • Assignee: Unassigned
              • Reporter: Adam Kawa (kawaa)
              • Votes: 0
              • Watchers: 10
