Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-10695 Event related improvement of YARN for better usage.
  3. YARN-10739

GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time

    XMLWordPrintableJSON

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.0, 3.3.1, 3.2.3
    • Fix Version/s: 3.4.0
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      Due to YARN-8995 YARN-10642 add GenericEventHandler.printEventQueueDetails on AsyncDispatcher, if the event queue size is too large, the printEventQueueDetails will cost too much time and RM  take a long time to process.

      For example:
      If we have 4K nodes on cluster and 4K apps running, if we do switch and the node manager will register with RM, and RM will call NodesListManager to do RMAppNodeUpdateEvent, code like below:

      for(RMApp app : rmContext.getRMApps().values()) {
        if (!app.isAppFinalStateStored()) {
          this.rmContext
              .getDispatcher()
              .getEventHandler()
              .handle(
                  new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                      appNodeUpdateType));
        }
      }

      So the total event is 4k*4k=16 mil, during this window, the GenericEventHandler.printEventQueueDetails will print the event queue detail and be called frequently, once the event queue size reaches 1 mil+, the Iterator of the queue from printEventQueueDetails will be so slow refer to below: 

      private void printEventQueueDetails() {
        Iterator<Event> iterator = eventQueue.iterator();
        Map<Enum, Long> counterMap = new HashMap<>();
        while (iterator.hasNext()) {
          Enum eventType = iterator.next().getType();
      

      Then RM recovery will cost too much time.....
      Refer to our log:

      2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 12000000
      
      2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
      2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
      2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
      2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
      

      Between AsyncDispatcher.handle and printEventQueueDetails, here is more than 1s to do Iterator.

      I upload a file to ensure the printEventQueueDetails only be called one-time pre-30s.

       

        Attachments

        1. YARN-10739.003.patch
          4 kB
          Qi Zhu
        2. YARN-10739.003.patch
          4 kB
          Qi Zhu
        3. YARN-10739.004.patch
          4 kB
          Qi Zhu
        4. YARN-10739.005.patch
          4 kB
          Qi Zhu
        5. YARN-10739.006.patch
          3 kB
          Qi Zhu
        6. YARN-10739-001.patch
          2 kB
          Zhanqi Cai
        7. YARN-10739-002.patch
          2 kB
          Zhanqi Cai

          Issue Links

            Activity

              People

              • Assignee:
                zhuqi Qi Zhu
                Reporter:
                zhanqi.cai Zhanqi Cai
              • Votes:
                0 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: