Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.203.0
    • Component/s: jobtracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We should have a configurable knob to throttle how many out of band heartbeats are sent.

        Issue Links

          Activity

          Owen O'Malley created issue -
          Dick King added a comment -

          The reason we need this is that if many jobs have short tasks, the job tracker can get beat up with too many heartbeats.

          I think that the patch should have two pieces.

          1: On any one node, we should delay an out-of-band heartbeat that we are considering sending but that would otherwise occur too soon after the most recent heartbeat, in the hope of reporting multiple task attempt completions in one heartbeat and thus reducing the total load placed on the job tracker. This involves a compromise, because the node won't get a new task immediately.

          2: We should cap the total number of heartbeats sent over a time interval. The cap and the interval should be configurable. If that interval is INT and the cap is C, we should track the times of the last C heartbeats we sent; if the time T of the oldest one is less than INT ago and we otherwise meet the criteria for sending a heartbeat, we should unconditionally send one at time T + INT rather than immediately. (A sketch of this capping logic follows at the end of this comment.)

          Since principle 2 may induce a longish delay, perhaps each heartbeat should say when the next heartbeat should occur? This makes the patch a bigger deal, because up to now all changes could be localized to the TaskTracker whereas now they can't, but it might be worthwhile.
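
          As an illustration of the capping described in point 2, below is a minimal sketch of how a TaskTracker-side throttle could track the last C send times. The class and method names (HeartbeatThrottle, earliestSendTime, recordSend) are made up for this example and are not taken from the actual TaskTracker code or any attached patch.

              import java.util.ArrayDeque;
              import java.util.Deque;

              // Sketch only: allow at most C heartbeats per rolling interval INT.
              class HeartbeatThrottle {
                private final int cap;          // C: maximum heartbeats per interval
                private final long intervalMs;  // INT: length of the rolling interval
                private final Deque<Long> sendTimes = new ArrayDeque<Long>(); // last C send times

                HeartbeatThrottle(int cap, long intervalMs) {
                  this.cap = cap;
                  this.intervalMs = intervalMs;
                }

                // Given that we otherwise want to heartbeat now, return the earliest
                // time at which the heartbeat may actually be sent.
                synchronized long earliestSendTime(long now) {
                  if (sendTimes.size() < cap) {
                    return now;                          // fewer than C recent sends
                  }
                  long oldest = sendTimes.peekFirst();   // time T of the oldest of the last C
                  if (now - oldest >= intervalMs) {
                    return now;                          // T is at least INT ago: send now
                  }
                  return oldest + intervalMs;            // otherwise defer to T + INT
                }

                // Record that a heartbeat was actually sent at the given time.
                synchronized void recordSend(long sentAt) {
                  sendTimes.addLast(sentAt);
                  if (sendTimes.size() > cap) {
                    sendTimes.removeFirst();             // keep only the last C send times
                  }
                }
              }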

          Todd Lipcon added a comment -

          Attaching the patch from the security branch; I also added the new config parameter to mapred-default with an explanation of what the values mean.
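
          For illustration, such a knob would normally be consumed through Hadoop's configuration API on the TaskTracker side. The property name and default value below are assumptions made for the sake of the example; the mapred-default.xml entry in the attached patch is the authoritative definition.

              import org.apache.hadoop.mapred.JobConf;

              // Illustration only: reading a heartbeat-throttling knob the way daemons
              // pick up values from mapred-default.xml / mapred-site.xml. The property
              // name and default below are assumptions, not confirmed against the patch.
              public class HeartbeatKnobExample {
                public static void main(String[] args) {
                  JobConf conf = new JobConf();   // loads mapred-default.xml and mapred-site.xml
                  long damper = conf.getLong(
                      "mapreduce.tasktracker.outofband.heartbeat.damper", 1000000L);
                  System.out.println("out-of-band heartbeat damper = " + damper);
                }
              }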

          Todd Lipcon made changes -
          Attachment: mr-2355-from-sec-branch.txt [ 12473739 ]
          Lianhui Wang added a comment -

          I think it is important for each heartbeat to indicate when the next heartbeat should occur.
          How should that time interval be computed? That is decided by the requirements of the different jobs.

          Todd Lipcon added a comment -

          Should this be marked as resolved? It seems to have been committed in 0.20.202.

          Liyin Liang made changes -
          Link: This issue is related to MAPREDUCE-4478 [ MAPREDUCE-4478 ]
          Suresh Srinivas added a comment -

          This change has been available since 0.20.203 in Apache. Marking this as resolved.

          Suresh Srinivas made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Hadoop Flags: Reviewed [ 10343 ]
          Fix Version/s: 0.20.203.0 [ 12316151 ]
          Resolution: Fixed [ 1 ]

            People

             • Assignee:
               Arun C Murthy
             • Reporter:
               Owen O'Malley
             • Votes:
               0
             • Watchers:
               10

              Dates

              • Created:
                Updated:
                Resolved:
