Hadoop YARN / YARN-8451

Multiple NM heartbeat threads created when a slow NM resyncs with RM


Details

    • Reviewed

    Description

      During an NM resync with the RM (for example, after an RM master/slave switch), a slow NM's existing heartbeat thread may put more than one RESYNC event into the NM dispatcher before any of them is processed. As a result, multiple new heartbeat threads are later created and start heartbeating to the RM concurrently, each with its own responseId. If at some point one thread falls more than one step behind the others, the RM sends back a resync signal in that heartbeat response, killing all containers on this NM.

      See comments below for details on how this can happen.
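
      The backlog effect described above can be illustrated with a small, single-threaded simulation. This is only a sketch under stated assumptions, not the attached patch and not Hadoop's actual NodeManager code: the class ResyncSketch, the ResyncEvent type, and the "generation" tag used to recognize stale events are all hypothetical names introduced for illustration. The sketch counts how many replacement heartbeat threads would be started when several RESYNC events are queued before the dispatcher handles any of them, with and without a staleness check.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ResyncSketch {

    /** RESYNC event tagged with the generation of the heartbeat thread that raised it (hypothetical). */
    static final class ResyncEvent {
        final int generation;

        ResyncEvent(int generation) {
            this.generation = generation;
        }
    }

    /**
     * Simulates a slow dispatcher: the live heartbeat thread enqueues
     * queuedResyncs RESYNC events before any of them is handled.
     * Returns how many replacement heartbeat threads the handler starts.
     */
    static int simulate(int queuedResyncs, boolean guarded) {
        Queue<ResyncEvent> dispatcher = new ArrayDeque<>();
        int currentGeneration = 0; // identifies the live heartbeat thread
        int threadsStarted = 0;

        // The same heartbeat thread keeps running, so it raises RESYNC repeatedly.
        for (int i = 0; i < queuedResyncs; i++) {
            dispatcher.add(new ResyncEvent(currentGeneration));
        }

        // The dispatcher finally drains the backlog.
        while (!dispatcher.isEmpty()) {
            ResyncEvent event = dispatcher.poll();
            if (guarded && event.generation != currentGeneration) {
                continue; // stale: raised by a thread that has already been replaced
            }
            currentGeneration++; // stands in for starting a replacement heartbeat thread
            threadsStarted++;
        }
        return threadsStarted;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + simulate(3, false));
        System.out.println("guarded:   " + simulate(3, true));
    }
}
```

      With three queued events, the unguarded run starts three replacement threads while the guarded run starts one. The generation tag is just one possible way to discard duplicate RESYNC events; the sketch does not model the responseId bookkeeping on the RM side.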

      Attachments

        1. YARN-8451.v1.patch
          9 kB
          Botong Huang
        2. YARN-8451.v2.patch
          9 kB
          Botong Huang

        Activity


          People

            Assignee: Botong Huang
            Reporter: Botong Huang
            Votes: 0
            Watchers: 4

