  1. Hadoop YARN
  2. YARN-6102

RMActiveService context to be updated with new RMContext on failover

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 2.7.3
    • Fix Version/s: 2.9.0, 3.0.0-beta1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in dispatcher thread
      java.lang.Exception: No handler for registered for class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
              at java.lang.Thread.run(Thread.java:745)
      2017-01-17 16:42:17,914 INFO  [AsyncDispatcher ShutDown handler] event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..

      The same stack trace was also seen when TestResourceTrackerOnHA exits abnormally. After some analysis, I was able to reproduce it.

      Once a node heartbeat reaches the RM, org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest) forwards it to the dispatcher through
      this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);
      If an RM failover is triggered before that call, the dispatcher is reset. However, in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean) the new dispatcher is started first and the event handlers are registered only afterwards.

      So the order of events looks like this:
      1. A node heartbeat is sent to ResourceTrackerService.
      2. In ResourceTrackerService.nodeHeartbeat, before the event is passed to the dispatcher, an RM failover is triggered.
      3. During the failover, the current active RM resets the dispatcher in reinitialize, i.e. ( resetDispatcher(); + createAndInitActiveServices(); )

      Now, between resetDispatcher(); and createAndInitActiveServices();, ResourceTrackerService.nodeHeartbeat invokes the dispatcher.
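
The interleaving described above can be replayed without a debugger in a small single-threaded sketch. This is an illustration only: MiniDispatcher is a hypothetical stand-in for AsyncDispatcher, the method bodies are simplified assumptions that merely borrow the names resetDispatcher, createAndInitActiveServices and nodeHeartbeat from the description, and the exception text is shortened rather than copied from the log.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class DispatcherResetRace {
    /** Hypothetical stand-in for AsyncDispatcher: one handler per event-type enum. */
    static final class MiniDispatcher {
        private final Map<Class<?>, Consumer<Enum<?>>> handlers = new ConcurrentHashMap<>();

        void register(Class<?> eventType, Consumer<Enum<?>> handler) {
            handlers.put(eventType, handler);
        }

        void dispatch(Enum<?> event) {
            Consumer<Enum<?>> handler = handlers.get(event.getDeclaringClass());
            if (handler == null) {
                // Simplified message; the real dispatcher logs its own wording.
                throw new IllegalStateException(
                    "No handler registered for " + event.getDeclaringClass().getSimpleName());
            }
            handler.accept(event);
        }
    }

    /** Illustrative enum; only the constant named in the report is modelled. */
    enum RMNodeEventType { STATUS_UPDATE }

    private MiniDispatcher dispatcher;

    // Step 3a: the old dispatcher is replaced by a fresh one with no handlers.
    void resetDispatcher() {
        dispatcher = new MiniDispatcher();
    }

    // Step 3b: handlers are registered only here, after the reset.
    void createAndInitActiveServices() {
        dispatcher.register(RMNodeEventType.class, event -> { });
    }

    // Steps 1-2: the heartbeat path hands STATUS_UPDATE to whatever dispatcher is current.
    String nodeHeartbeat() {
        try {
            dispatcher.dispatch(RMNodeEventType.STATUS_UPDATE);
            return "handled";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    /** Replays the problematic order: reset, heartbeat, then registration. */
    static String heartbeatDuringFailover() {
        DispatcherResetRace rm = new DispatcherResetRace();
        rm.resetDispatcher();
        rm.createAndInitActiveServices();   // initial active state
        rm.resetDispatcher();               // failover begins
        String outcome = rm.nodeHeartbeat(); // heartbeat lands inside the window
        rm.createAndInitActiveServices();   // too late for that event
        return outcome;
    }

    public static void main(String[] args) {
        System.out.println(heartbeatDuringFailover());
    }
}
```

In this sketch the heartbeat that lands between the two calls finds no handler, while a heartbeat issued after createAndInitActiveServices() would be handled normally.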

      This causes the error above: at the point when the STATUS_UPDATE event is handed to the dispatcher in ResourceTrackerService, the new dispatcher (from the failover) may be started but not yet registered for events.
      Using the same steps (pausing the JVM in a debugger), I was able to reproduce this in a production cluster as well: the STATUS_UPDATE event reaches the active service, and before the service forwards it to the RM dispatcher, a failover is triggered and the dispatcher reset falls between resetDispatcher(); and createAndInitActiveServices();.
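
One general way to close such a window is to register all handlers on the new dispatcher before it becomes visible to other threads. The sketch below illustrates that ordering with an AtomicReference swap. It is an assumption-laden illustration, not the approach taken by the patches attached to this issue: MiniDispatcher, PublishAfterRegister, newRegisteredDispatcher and failuresUnderConcurrentFailover are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class PublishAfterRegister {
    /** Hypothetical stand-in for AsyncDispatcher: one handler per event-type enum. */
    static final class MiniDispatcher {
        private final Map<Class<?>, Consumer<Enum<?>>> handlers = new ConcurrentHashMap<>();

        void register(Class<?> eventType, Consumer<Enum<?>> handler) {
            handlers.put(eventType, handler);
        }

        void dispatch(Enum<?> event) {
            Consumer<Enum<?>> handler = handlers.get(event.getDeclaringClass());
            if (handler == null) {
                throw new IllegalStateException("No handler registered");
            }
            handler.accept(event);
        }
    }

    enum RMNodeEventType { STATUS_UPDATE }

    // Readers only ever see dispatchers that already have their handlers.
    private final AtomicReference<MiniDispatcher> current =
        new AtomicReference<>(newRegisteredDispatcher());

    private static MiniDispatcher newRegisteredDispatcher() {
        MiniDispatcher d = new MiniDispatcher();
        d.register(RMNodeEventType.class, event -> { });
        return d;
    }

    // Build and register first, publish last: no half-initialized state is visible.
    void failover() {
        current.set(newRegisteredDispatcher());
    }

    void nodeHeartbeat() {
        current.get().dispatch(RMNodeEventType.STATUS_UPDATE);
    }

    /** Runs heartbeats and failovers concurrently and counts missing-handler errors. */
    static int failuresUnderConcurrentFailover(int iterations) {
        PublishAfterRegister rm = new PublishAfterRegister();
        AtomicInteger failures = new AtomicInteger();
        Thread failoverThread = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                rm.failover();
            }
        });
        Thread heartbeatThread = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                try {
                    rm.nodeHeartbeat();
                } catch (IllegalStateException e) {
                    failures.incrementAndGet();
                }
            }
        });
        failoverThread.start();
        heartbeatThread.start();
        try {
            failoverThread.join();
            heartbeatThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + failuresUnderConcurrentFailover(10_000));
    }
}
```

Because every dispatcher is fully registered before current.set(...) publishes it, a concurrent heartbeat cannot observe a dispatcher without handlers. An event that read the old reference just before the swap is still delivered to the old instance's handlers, so this sketch does not by itself address stale-context concerns.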

        Attachments

        1. YARN-6102-YARN-5355-branch-2.addendum.patch
          7 kB
          Varun Saxena
        2. YARN-6102-branch-2.003-addendum.patch
          9 kB
          Subru Krishnan
        3. YARN-6102-branch-2.002-addednum.patch
          1.0 kB
          Rohith Sharma K S
        4. YARN-6102-branch-2.002.patch
          46 kB
          Rohith Sharma K S
        5. YARN-6102-branch-2.001.patch
          46 kB
          Rohith Sharma K S
        6. YARN-6102.07.patch
          57 kB
          Rohith Sharma K S
        7. YARN-6102.06.patch
          57 kB
          Rohith Sharma K S
        8. YARN-6102.05.patch
          48 kB
          Rohith Sharma K S
        9. YARN-6102.04.patch
          47 kB
          Rohith Sharma K S
        10. YARN-6102.03.patch
          44 kB
          Rohith Sharma K S
        11. YARN-6102.02.patch
          43 kB
          Rohith Sharma K S
        12. YARN-6102.01.patch
          23 kB
          Rohith Sharma K S
        13. eventOrder.JPG
          30 kB
          Ajith S

              People

              • Assignee: Rohith Sharma K S (rohithsharma)
              • Reporter: Ajith S (ajithshetty)
              • Votes: 0
              • Watchers: 15
