Hadoop Map/Reduce
  MAPREDUCE-1906

Lower default minimum heartbeat interval for tasktracker > Jobtracker

    Details

    • Hadoop Flags:
      Reviewed
    • Release Note:
      The default minimum heartbeat interval has been dropped from 3 seconds to 300ms to increase scheduling throughput on small clusters. Users may tune mapreduce.jobtracker.heartbeats.in.second to adjust this value.

      Description

      I get a 0% to 15% performance increase for smaller clusters by making the heartbeat throttle stop penalizing clusters with fewer than 300 nodes.

      Between 0.19 and 0.20, the default minimum heartbeat interval increased from 2s to 3s. If a JobTracker is throttled at 100 heartbeats / sec for large clusters, why should a cluster with 10 nodes be throttled to 3.3 heartbeats per second?

      1. MAPREDUCE-1906.branch-1.patch
        1 kB
        Brandon Li
      2. mapreduce-1906.txt
        11 kB
        Todd Lipcon
      3. mapreduce-1906.txt
        11 kB
        Todd Lipcon
      4. MAPREDUCE-1906-0.21.patch
        1 kB
        Scott Carey

        Activity

        Scott Carey added a comment -

        JobTracker.java has this code:

        (0.21 branch, line 2497)

        public int getNextHeartbeatInterval() {
          // get the number of task trackers
          int clusterSize = getClusterStatus().getTaskTrackers();
          int heartbeatInterval = Math.max(
              (int) (1000 * HEARTBEATS_SCALING_FACTOR *
                     Math.ceil((double) clusterSize / NUM_HEARTBEATS_IN_SECOND)),
              HEARTBEAT_INTERVAL_MIN);
          return heartbeatInterval;
        }
        

        HEARTBEAT_INTERVAL_MIN is 3000 (milliseconds). This means that only after a cluster has reached 300 nodes does the jobtracker get 100 heartbeats / second.

        This throttle is far too large in my experience. I have a development cluster with 10 nodes; each node can handle 10 maps and 10 reduces concurrently. With 0.20, the most the scheduler will do is one map and one reduce per heartbeat. The result is an always-underutilized cluster whenever anything but very large jobs is running. Many of our data flows start out large, then end with a couple dozen smaller jobs that are mostly chained together.

        I have been running in production and development with a patch to MRConstants.java that improves cluster utilization significantly by changing HEARTBEAT_INTERVAL_MIN to 300 ms. In small clusters, a heartbeat every 300ms is not an issue. The above code already throttles the system; the floor of 3000ms is too large. With a 300ms floor, it still takes a cluster of 30 machines to get to the 100 heartbeat/sec threshold.
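        The 300-node and 30-node thresholds quoted above follow from simple arithmetic. The sketch below is illustrative only and is not code from the patch: the class and method names are hypothetical, and it assumes every node heartbeats at exactly the floor interval.

```java
// Sketch only: HeartbeatThreshold and nodesToReachCap are hypothetical names,
// not part of Hadoop. Assumes every node heartbeats exactly once per floorMs.
final class HeartbeatThreshold {

    /**
     * Smallest cluster size at which nodes heartbeating every floorMs
     * milliseconds together reach capPerSecond heartbeats per second.
     */
    static int nodesToReachCap(int floorMs, int capPerSecond) {
        // Each node contributes 1000 / floorMs heartbeats per second, so
        // n nodes reach the cap when n * 1000 / floorMs >= capPerSecond.
        return (int) Math.ceil((double) capPerSecond * floorMs / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(nodesToReachCap(3000, 100)); // 3000 ms floor
        System.out.println(nodesToReachCap(300, 100));  // 300 ms floor
    }
}
```

        With a cap of 100 heartbeats per second, the 3000ms floor gives 300 nodes and the 300ms floor gives 30 nodes, matching the figures in this comment.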

        I also could not find an explanation why this was increased from 2000 to 3000 between 0.19 and 0.20.

        Scott Carey added a comment -

        This patch changes the default minimum TaskTracker > JobTracker heartbeat interval from 3000ms to 300ms.

        Effectively, this makes clusters between 30 and 300 nodes increase their heartbeat rate to a cluster-wide 100 heartbeats per second.
        Clusters larger than 300 nodes remain unchanged at a cluster-wide 100 heartbeats per second.

        Clusters with fewer than 30 nodes have a constant 300ms between pings per node, so a 15-node cluster sees 50 heartbeats per second and a 3-node cluster sees 10 heartbeats per second.

        Scott Carey added a comment -

        Is it possible to consider this for 0.21?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12448507/MAPREDUCE-1906-0.21.patch
        against trunk revision 960808.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/288/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/288/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/288/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/288/console

        This message is automatically generated.

        Scott Carey added a comment -

        This is a one-line change to a static constant; no new unit tests are needed.

        The three tests that fail are:
        org.apache.hadoop.mapred.TestSimulatorSerialJobSubmission.testMain
        org.apache.hadoop.mapred.TestSimulatorDeterministicReplay.testMain
        org.apache.hadoop.mapred.TestMapredHeartbeat.testJobDirCleanup (TestMapredHeartbeat.java:46)

        The first two seem unrelated.

        The last one looks like the test is explicitly testing the constant. The test assumes that the minimum heartbeat interval will in fact be HEARTBEAT_INTERVAL_MIN, but the calculation in JobTracker.getNextHeartbeatInterval() is a step function: every additional NUM_HEARTBEATS_IN_SECOND nodes raises the heartbeat delay by one full step.

        Currently, with NUM_HEARTBEATS_IN_SECOND = 100 and HEARTBEATS_SCALING_FACTOR = 0.001, a cluster with 500 nodes would have a 5 second heartbeat interval, but one with 501 nodes would have a 6 second interval. Is there a good reason for the intervals to be rounded up to the next whole second? How about we just remove the Math.ceil() and round to the next millisecond? This will make the test's assumptions hold, and provide smooth throttling as nodes come and go.

        However, it is possible that somewhere else in the code there is an assumption that jobtracker pings will be at whole second intervals.

        public int getNextHeartbeatInterval() {
          // get the number of task trackers
          int clusterSize = getClusterStatus().getTaskTrackers();
          int heartbeatInterval = Math.max(
              (int) (1000 * HEARTBEATS_SCALING_FACTOR *
                     ((double) clusterSize / NUM_HEARTBEATS_IN_SECOND)),
              HEARTBEAT_INTERVAL_MIN);
          return heartbeatInterval;
        }
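
        To make the step-versus-smooth difference concrete, here is a minimal standalone sketch that compares the two calculations side by side. It is not the JobTracker code: the class is hypothetical, the constants only borrow the names used in this thread, and the scaling factor is assumed to be 1.0 because that is the value that reproduces the 5 second and 6 second figures quoted above.

```java
// Sketch only: a standalone comparison, not the real JobTracker class.
// The constant values are assumptions; in particular the scaling factor
// is set to 1.0 so that 500 nodes yields the 5 s interval quoted above.
final class HeartbeatIntervalSketch {
    static final int NUM_HEARTBEATS_IN_SECOND = 100;
    static final double HEARTBEATS_SCALING_FACTOR = 1.0; // assumed value
    static final int HEARTBEAT_INTERVAL_MIN = 3000;      // ms, pre-patch floor

    /** Step function: Math.ceil rounds up to a whole number of steps. */
    static int stepInterval(int clusterSize) {
        return Math.max(
            (int) (1000 * HEARTBEATS_SCALING_FACTOR
                   * Math.ceil((double) clusterSize / NUM_HEARTBEATS_IN_SECOND)),
            HEARTBEAT_INTERVAL_MIN);
    }

    /** Smooth function: same expression without Math.ceil. */
    static int smoothInterval(int clusterSize) {
        return Math.max(
            (int) (1000 * HEARTBEATS_SCALING_FACTOR
                   * ((double) clusterSize / NUM_HEARTBEATS_IN_SECOND)),
            HEARTBEAT_INTERVAL_MIN);
    }

    public static void main(String[] args) {
        for (int size : new int[] {500, 501}) {
            System.out.println(size + " nodes: step=" + stepInterval(size)
                + " ms, smooth=" + smoothInterval(size) + " ms");
        }
    }
}
```

        Under these assumptions both versions give 5000 ms for 500 nodes, while for 501 nodes the step function jumps to 6000 ms and the version without Math.ceil gives 5010 ms.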
        

        What were the reasons for the long minimum ping time in the first place? Why did it go up from 2 to 3 seconds between 0.19 and 0.20?

        Scott Carey added a comment -

        The patch adds a one-line change to JobTracker.java to make the heartbeat interval a smooth function instead of a step function. The total patch is two one-line changes.

        Scott Carey added a comment -

        MAPREDUCE-1906-0.21-v2.patch changes the ping interval from a step function to a smooth function and lowers the minimum to 300ms. Clusters larger than 300 nodes only see the step-function > smooth function change. Clusters between 30 and 300 nodes smoothly increase their ping interval. Clusters with 30 nodes or fewer have 300ms ping intervals when the TT has nothing to do. This improves scheduling latency on small clusters significantly.

        The cluster wide ping interval is roughly proportional to how fast the cluster can schedule a job.

        cluster size  current ping interval (ms)  current ping rate at JT  patched ping interval (ms)  patched ping rate at JT
        10            3000                        3.33 /sec                300                         33.3 /sec
        30            3000                        10 /sec                  300                         100 /sec
        100           3000                        33.3 /sec                1000                        100 /sec
        300           3000                        100 /sec                 3000                        100 /sec
        301           4000                        75 /sec                  3010                        100 /sec
        1000          10000                       100 /sec                 10000                       100 /sec
        1001          11000                       91 /sec                  10010                       100 /sec

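
        The rate columns in the table can be rechecked from the interval columns: the ping rate at the JT is the cluster size divided by the per-node interval. A minimal sketch follows, with a hypothetical class name; the interval values are copied from the table rather than recomputed.

```java
// Sketch only: PingRate is a hypothetical helper, not part of Hadoop.
// Each row holds {cluster size, current interval ms, patched interval ms},
// copied from the table above.
final class PingRate {

    /** Heartbeats per second arriving at the JobTracker from the whole cluster. */
    static double perSecond(int clusterSize, int intervalMs) {
        return clusterSize * 1000.0 / intervalMs;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {10, 3000, 300}, {30, 3000, 300}, {100, 3000, 1000},
            {300, 3000, 3000}, {301, 4000, 3010},
            {1000, 10000, 10000}, {1001, 11000, 10010},
        };
        for (int[] r : rows) {
            System.out.printf("%5d nodes: current %.2f /sec, patched %.2f /sec%n",
                r[0], perSecond(r[0], r[1]), perSecond(r[0], r[2]));
        }
    }
}
```

        The step function is what pulls the current rate below 100 /sec just past a step boundary (301 and 1001 nodes), while the patched intervals keep the rate at 100 /sec for every size of 30 nodes or more.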
        Scott Carey added a comment -

        re-submit for hudson.

        Scott Carey added a comment -

        re-submit for hudson.

        Scott Carey added a comment -

        replaced the original patch with the latest.

        Todd Lipcon added a comment -

        Perhaps this should be made into an expert-level configurable option? I agree that 3000ms is a bit excessive on a small cluster where the JobTracker is generally "bored", and we've also seen big throughput improvements, especially when users submit jobs with too-small tasks.

        Todd Lipcon added a comment -

        This is Scott's patch but also makes the minimum interval configurable. I set the default to 300ms as Scott suggests.
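
        As a rough illustration of what a configurable floor with a 300ms default could look like, here is a standalone sketch. It is not the committed patch: it uses java.util.Properties in place of Hadoop's configuration classes, and the class name, method names, and property key are all hypothetical.

```java
import java.util.Properties;

// Sketch only: stands in for the real configuration plumbing. The key
// below is a made-up name for illustration, not the key used by Hadoop.
final class ConfigurableFloor {
    static final String MIN_INTERVAL_KEY = "example.heartbeat.interval.min"; // hypothetical key
    static final int DEFAULT_MIN_INTERVAL_MS = 300; // default suggested in this thread

    /** Reads the floor from the configuration, falling back to the default. */
    static int minIntervalMs(Properties conf) {
        String raw = conf.getProperty(MIN_INTERVAL_KEY);
        return raw == null ? DEFAULT_MIN_INTERVAL_MS : Integer.parseInt(raw.trim());
    }

    /** Smooth interval for the cluster size, never below the configured floor. */
    static int nextHeartbeatInterval(int clusterSize, int heartbeatsPerSecond, Properties conf) {
        int scaled = (int) (1000.0 * clusterSize / heartbeatsPerSecond);
        return Math.max(scaled, minIntervalMs(conf));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(nextHeartbeatInterval(10, 100, conf)); // default floor applies
        conf.setProperty(MIN_INTERVAL_KEY, "3000");               // restore the old floor
        System.out.println(nextHeartbeatInterval(10, 100, conf));
    }
}
```

        Keeping the floor in configuration lets operators of large or heavily loaded clusters restore a longer minimum without patching a constant.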

        Scott Carey added a comment -

        When I last looked at this 6 months ago, the patch caused some test failures. They seemed to be because the tests had hard-coded assumptions about what the interval was, but I did not fix them and resubmit a patch.

        Todd Lipcon added a comment -

        Had to update TestMapredHeartbeat to fix an assertion for the new minimum.

        Eli Collins added a comment -

        +1

        The latest patch looks ready to go to me.

        Todd Lipcon added a comment -

        Committed to trunk only. Thanks for the original contribution and for your patience, Scott!

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #566 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/566/)
        MAPREDUCE-1906. Lower minimum heartbeat interval for TaskTracker. Contributed by Scott Carey and Todd Lipcon

        Brandon Li added a comment -

        Uploaded a patch to back port the change to branch-1. The tests I ran were teragen/terasort/teravalidate.

        Siddharth Seth added a comment -

        +1 for the backport.

        Siddharth Seth added a comment -

        Committed to branch-1.

        Arun C Murthy added a comment -

        Matt - if you don't mind, I'd like to merge this into branch-1.1 since it's been well baked-in. Thoughts?

        Arun C Murthy added a comment -

        I merged it to branch-1.1 too.


          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Scott Carey
          • Votes:
            0
            Watchers:
            12
