  Hadoop YARN / YARN-8451

Multiple NM heartbeat threads created when a slow NM resyncs with RM

    Details

    • Hadoop Flags:
      Reviewed

      Description

      During an NM resync with the RM (for example, after an RM master/slave switch), if the NM is running slowly, the existing heartbeat thread may put more than one RESYNC event into the NM dispatcher before any of them is processed. As a result, multiple new heartbeat threads are later created and start heartbeating to the RM concurrently, each with its own responseId. If at any point one thread falls more than one step behind the others, the RM sends back a resync signal in its heartbeat response, killing all containers on this NM.

      See comments below for details on how this can happen.
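
      The failure mode described above can be illustrated with a small, self-contained Java sketch. It is a toy model and not the actual NodeManager or ResourceManager code, and it is not the attached patch. The names ToyRM, ToyHeartbeater and ResyncGuard are hypothetical, and the responseId rule (one step behind is tolerated, more than one step behind triggers a resync) is a simplification assumed from the description. The guard at the end shows one possible way to collapse duplicate RESYNC requests so that only one new heartbeat thread would be started.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class HeartbeatResyncSketch {

    /** Marker value returned by the toy RM when it asks the NM to resync. */
    static final int RESYNC = -1;

    /** Toy RM (hypothetical): remembers the last responseId it returned to this NM. */
    static final class ToyRM {
        private int lastResponseId = 0;

        synchronized int heartbeat(int nmResponseId) {
            if (nmResponseId + 1 == lastResponseId) {
                return lastResponseId;   // one step behind: resend the last response
            }
            if (nmResponseId + 1 < lastResponseId) {
                return RESYNC;           // more than one step behind: ask for resync
            }
            return ++lastResponseId;     // in step: advance (other cases not modeled)
        }
    }

    /** Toy heartbeat loop state (hypothetical): each instance has its own responseId. */
    static final class ToyHeartbeater {
        private int responseId = 0;

        int beat(ToyRM rm) {
            int response = rm.heartbeat(responseId);
            if (response != RESYNC) {
                responseId = response;
            }
            return response;
        }
    }

    /** Hypothetical guard: accepts a RESYNC request only if none is already pending. */
    static final class ResyncGuard {
        private final AtomicBoolean pending = new AtomicBoolean(false);

        /** Returns true if the caller should enqueue a RESYNC event. */
        boolean request() {
            return pending.compareAndSet(false, true);
        }

        /** Called once the pending RESYNC event has been processed. */
        void handled() {
            pending.set(false);
        }
    }

    public static void main(String[] args) {
        ToyRM rm = new ToyRM();
        ToyHeartbeater a = new ToyHeartbeater();
        ToyHeartbeater b = new ToyHeartbeater();

        // Two heartbeaters share one RM; a runs ahead while b stalls.
        a.beat(rm);
        b.beat(rm);
        a.beat(rm);
        a.beat(rm);
        System.out.println("b asked to resync: " + (b.beat(rm) == RESYNC));

        // With the guard, a second request before handling is dropped.
        ResyncGuard guard = new ResyncGuard();
        System.out.println("first request accepted: " + guard.request());
        System.out.println("second request accepted: " + guard.request());
    }
}
```

      In this model, a single heartbeater never trips the resync check on its own; the problem only appears once two loops advance the shared RM-side counter independently, which is why suppressing the duplicate RESYNC event is enough to avoid it.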

        Attachments

        1. YARN-8451.v2.patch
          9 kB
          Botong Huang
        2. YARN-8451.v1.patch
          9 kB
          Botong Huang

          Activity

            People

            • Assignee:
              Botong Huang (botong)
              Reporter:
              Botong Huang (botong)