  1. Hadoop Common
  2. HADOOP-1018

Single lost heartbeat leads to a "Lost task tracker"

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0, 0.11.2, 0.12.0
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None
    • Environment: Nutch trunk/ (Hadoop 0.10.0), Linux, JDK 1.5, a cluster of 9 machines.

    Description

      Under heavy load, a task tracker may lose the heartbeat response from the JobTracker. The task tracker then resends its last heartbeat message, which the job tracker treats as a "duplicate" and ignores. Because the task tracker keeps resending the same heartbeat message, with the same id, over and over again, no "valid" messages reach the job tracker, and after a while the job tracker considers the task tracker lost. The task tracker cannot recover from this state and needs to be restarted.
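
      The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not Hadoop's actual code: the class names (LostHeartbeatDemo, ToyJobTracker), the integer heartbeat id, and the interval and expiry values are all hypothetical and chosen purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the reported failure; not Hadoop's real classes. */
class LostHeartbeatDemo {

    /** Toy job tracker that silently ignores any heartbeat id it has already seen. */
    static class ToyJobTracker {
        private final Map<String, Integer> lastSeenId = new HashMap<>();
        private final Map<String, Long> lastValidContact = new HashMap<>();

        /** Returns the id the tracker should use next, or null if the message is ignored. */
        Integer heartbeat(String tracker, int id, long now) {
            Integer seen = lastSeenId.get(tracker);
            if (seen != null && id <= seen) {
                return null; // treated as a duplicate: no processing, no response
            }
            lastSeenId.put(tracker, id);
            lastValidContact.put(tracker, now);
            return id + 1;
        }

        /** A tracker counts as lost once no valid heartbeat arrived within expiryMillis. */
        boolean isLost(String tracker, long now, long expiryMillis) {
            Long last = lastValidContact.get(tracker);
            return last == null || now - last > expiryMillis;
        }
    }

    /** Runs the scenario and reports whether the task tracker ends up declared lost. */
    static boolean simulate(int retries, long intervalMillis, long expiryMillis) {
        ToyJobTracker jt = new ToyJobTracker();
        long now = 0;
        int id = 0;

        // Heartbeat 0 is processed, but its response is lost in transit,
        // so the task tracker never learns that it should move on to id 1.
        jt.heartbeat("tt1", id, now);

        // The task tracker keeps resending the same id; every resend is ignored.
        for (int i = 0; i < retries; i++) {
            now += intervalMillis;
            Integer next = jt.heartbeat("tt1", id, now);
            if (next != null) {
                id = next;
            }
        }
        return jt.isLost("tt1", now, expiryMillis);
    }

    public static void main(String[] args) {
        // Illustrative values only: a resend every 10 s against a 10 min expiry.
        System.out.println("lost = " + simulate(100, 10_000L, 600_000L));
    }
}
```

      In this toy model the single lost response is enough: every later resend carries the stale id, so the last valid contact time never advances and the expiry check eventually fires.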

      Looking at Hadoop trunk/, I believe this problem may still occur: in JobTracker.java.heartbeat():992 the JobTracker should not ignore duplicate messages but acknowledge them without processing them. This would let the task tracker sync its last heartbeat id back up with the last heartbeat id remembered by the job tracker.
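
      The suggestion of acknowledging duplicates without reprocessing them might look roughly like the sketch below. It is a minimal illustration under stated assumptions and does not claim to reflect the attached patch or the real JobTracker API: AckDuplicateHeartbeatSketch, ToyJobTracker, Response, and the integer id scheme are hypothetical names introduced here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "acknowledge duplicates without processing"; not Hadoop code. */
class AckDuplicateHeartbeatSketch {

    /** Reply sent to the task tracker: the id to use next, and whether this was a re-ack. */
    static final class Response {
        final int nextId;
        final boolean replayed;

        Response(int nextId, boolean replayed) {
            this.nextId = nextId;
            this.replayed = replayed;
        }
    }

    /** Toy job tracker that remembers the last processed heartbeat id per task tracker. */
    static class ToyJobTracker {
        private final Map<String, Integer> lastProcessedId = new HashMap<>();
        int processedCount = 0;

        Response heartbeat(String tracker, int id) {
            Integer last = lastProcessedId.get(tracker);
            if (last != null && id == last) {
                // Duplicate of the last heartbeat: skip the work, but still reply
                // so the task tracker can advance to the next id.
                return new Response(last + 1, true);
            }
            // New heartbeat: process it (represented here by a counter) and remember its id.
            processedCount++;
            lastProcessedId.put(tracker, id);
            return new Response(id + 1, false);
        }
    }

    public static void main(String[] args) {
        ToyJobTracker jt = new ToyJobTracker();
        jt.heartbeat("tt1", 0);                          // processed; assume the response is lost
        Response ack = jt.heartbeat("tt1", 0);           // resend is acknowledged, not reprocessed
        Response next = jt.heartbeat("tt1", ack.nextId); // task tracker is back in sync
        System.out.println(ack.replayed + " " + next.replayed + " " + jt.processedCount);
        // prints "true false 2"
    }
}
```

      The design point is that the duplicate check decides only whether the heartbeat's payload is processed, not whether a response is sent, so a lost response no longer strands the task tracker on a stale id.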

      Attachments

        1. HADOOP-1018_1_20070906.patch
          1 kB
          Arun Murthy

        Activity

          People

            Assignee: Arun Murthy (acmurthy)
            Reporter: Andrzej Bialecki (ab)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: