  1. Hadoop YARN
  2. YARN-9871 Miscellaneous scalability improvement
  3. YARN-9738

Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission


Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved

    Description

      Environment:
      Server OS: Ubuntu
      Number of cluster nodes: 9120 NMs
      Security mode: Secure

      Preconditions:
      ~9120 NMs were running
      ~1250 applications were in the running state
      35K applications were in the pending state

      Test Steps:
      1. Submit applications from 5 clients, each with 2 threads, across 10 queues in total
      2. Increase the application submission rate (each distributed shell application calls getClusterNodes)

      ClientRMService#getClusterNodes calls ClusterNodeTracker#getNodeReport, which takes the read lock guarding the nodes map, so the IPC handler threads end up waiting on that lock:

      "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 tid=0x00007f75095de000 nid=0x1949c waiting on condition [0x00007f74cff78000]
      java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f759f6d8858> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
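
The trace above shows a reader parked on a fair ReentrantReadWriteLock. A minimal sketch of how that can happen, assuming some other thread holds the write lock at the time (the class and method names below are hypothetical and not part of YARN):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; it only illustrates read-lock contention in general.
class ReadLockContentionSketch {

    // Starts a reader thread and reports whether it obtained the read lock
    // within timeoutMs milliseconds.
    static boolean readerAcquires(ReentrantReadWriteLock lock, long timeoutMs) {
        final boolean[] acquired = new boolean[1];
        Thread reader = new Thread(() -> {
            try {
                if (lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    acquired[0] = true;
                    lock.readLock().unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();
        try {
            reader.join(); // join makes the reader's write to acquired[0] visible
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return acquired[0];
    }

    public static void main(String[] args) {
        // 'true' selects the fair policy, matching the FairSync in the trace.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

        lock.writeLock().lock();                 // assumed: a writer is updating the node map
        boolean whileWriting = readerAcquires(lock, 200);
        lock.writeLock().unlock();

        boolean afterRelease = readerAcquires(lock, 200);
        System.out.println(whileWriting + " " + afterRelease); // prints "false true"
    }
}
```

In the trace the reader uses the untimed lock() call, so instead of timing out it stays parked until the lock becomes available.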

      Instead, we can make nodes a ConcurrentHashMap and remove the read lock from getNodeReport.
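
A minimal sketch of that idea, assuming a simplified tracker: the class, the NodeInfo record, and the string report format are hypothetical stand-ins, not the code in the attached patches.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for ClusterNodeTracker.
class NodeTrackerSketch {

    // Hypothetical immutable node record; the real tracker stores scheduler node objects.
    static final class NodeInfo {
        final String nodeId;
        final int usedContainers;

        NodeInfo(String nodeId, int usedContainers) {
            this.nodeId = nodeId;
            this.usedContainers = usedContainers;
        }
    }

    // ConcurrentHashMap allows lookups without any external lock.
    private final Map<String, NodeInfo> nodes = new ConcurrentHashMap<>();

    void addNode(NodeInfo node) {
        nodes.put(node.nodeId, node);
    }

    NodeInfo removeNode(String nodeId) {
        return nodes.remove(nodeId);
    }

    // Lock-free lookup: a single get() is thread-safe, so no read lock is taken.
    String getNodeReport(String nodeId) {
        NodeInfo node = nodes.get(nodeId);
        return node == null ? null : node.nodeId + ":" + node.usedContainers;
    }

    public static void main(String[] args) {
        NodeTrackerSketch tracker = new NodeTrackerSketch();
        tracker.addNode(new NodeInfo("nm-1", 3));
        System.out.println(tracker.getNodeReport("nm-1"));    // prints "nm-1:3"
        System.out.println(tracker.getNodeReport("missing")); // prints "null"
    }
}
```

A single-key lookup like this needs no lock. Operations that need a consistent view across several nodes would presumably still require some other form of coordination.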

      Attachments

        1. YARN-9738-003.patch
          4 kB
          Bilwa S T
        2. YARN-9738-002.patch
          2 kB
          Bilwa S T
        3. YARN-9738-001.patch
          2 kB
          Bilwa S T

          People

            BilwaST Bilwa S T