Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-284

Improvements to RPC between Child and TaskTracker

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Enables tcp.nodelay for RPC between Child and TaskTracker.

      Description

      We could improve the RPC between the Child and TaskTracker:

      • Set ping interval lower by default to 5s
      • Disable nagle's algorithm (tcp no-delay)
      1. MR-284.patch
        1.0 kB
        Ravi Gummadi
      2. MR-284.v1.patch
        0.5 kB
        Ravi Gummadi
      3. MR-284.v2.patch
        0.6 kB
        Ravi Gummadi

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #39 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/39/)

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #39 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/39/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #83 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/83/)
        . Enables ipc.client.tcpnodelay in Tasktracker's Child. Contributed by Ravi Gummadi.

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #83 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/83/ ) . Enables ipc.client.tcpnodelay in Tasktracker's Child. Contributed by Ravi Gummadi.
        Hide
        Sharad Agarwal added a comment -

        I just committed this. Thanks Ravi!

        Show
        Sharad Agarwal added a comment - I just committed this. Thanks Ravi!
        Hide
        Jothi Padmanabhan added a comment -

        +1.

        Show
        Jothi Padmanabhan added a comment - +1.
        Hide
        Ravi Gummadi added a comment -

        unit tests passed on my local machine.

        ant test-patch gave:

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
        [exec] Please justify why no new tests are needed for this patch.
        [exec] Also please list what manual steps were performed to verify this patch.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        I don't see a simple way to add a unit testcase for this.

        Show
        Ravi Gummadi added a comment - unit tests passed on my local machine. ant test-patch gave: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no new tests are needed for this patch. [exec] Also please list what manual steps were performed to verify this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. I don't see a simple way to add a unit testcase for this.
        Hide
        Ravi Gummadi added a comment -

        As changing ipc.client.tcpnodelay affects other clients(not just child to TT) and that needs more discussion, am uploading a new patch that sets tcpnodelay only in Child.java as part of this JIRA.

        Please review and provide your comments.

        Show
        Ravi Gummadi added a comment - As changing ipc.client.tcpnodelay affects other clients(not just child to TT) and that needs more discussion, am uploading a new patch that sets tcpnodelay only in Child.java as part of this JIRA. Please review and provide your comments.
        Hide
        Ravi Gummadi added a comment -

        Attaching patch that sets ipc.client.tcpnodelay to true in core-default.xml

        Show
        Ravi Gummadi added a comment - Attaching patch that sets ipc.client.tcpnodelay to true in core-default.xml
        Hide
        Arun C Murthy added a comment -

        The reasoning behind reducing rpc ping intervals was to allow the child tasks to 'fail fast' when sockets to either the TaskTracker or the DataNode were unexpectedly shutdown.

        That was meant to read as 'TaskTracker, JobTracker or NameNode'...

        Show
        Arun C Murthy added a comment - The reasoning behind reducing rpc ping intervals was to allow the child tasks to 'fail fast' when sockets to either the TaskTracker or the DataNode were unexpectedly shutdown. That was meant to read as 'TaskTracker, JobTracker or NameNode'...
        Hide
        Arun C Murthy added a comment -

        I still don't see a strong reason for reducing the ping interval.

        The reasoning behind reducing rpc ping intervals was to allow the child tasks to 'fail fast' when sockets to either the TaskTracker or the DataNode were unexpectedly shutdown.

        Also, I am worried about the approach followed in the patch. In particular, i don't like the way that ping interval is hardcoded to 5 seconds in the Child process (overriding the configuration passed). I also think that in the current implementation, it affects the task->namenode ping interval which is not desirable.

        Fair enough, any other ideas?

        I propose that for this patch, we just fix the case of TCP_NODELAY being ON by default.

        +1 if we can't come up with better heuristics. Maybe in that case we can open another jira to track it?

        Show
        Arun C Murthy added a comment - I still don't see a strong reason for reducing the ping interval. The reasoning behind reducing rpc ping intervals was to allow the child tasks to 'fail fast' when sockets to either the TaskTracker or the DataNode were unexpectedly shutdown. Also, I am worried about the approach followed in the patch. In particular, i don't like the way that ping interval is hardcoded to 5 seconds in the Child process (overriding the configuration passed). I also think that in the current implementation, it affects the task->namenode ping interval which is not desirable. Fair enough, any other ideas? I propose that for this patch, we just fix the case of TCP_NODELAY being ON by default. +1 if we can't come up with better heuristics. Maybe in that case we can open another jira to track it?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12415328/MR-284.patch
        against trunk revision 801517.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12415328/MR-284.patch against trunk revision 801517. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/446/console This message is automatically generated.
        Hide
        Devaraj Das added a comment -

        I still don't see a strong reason for reducing the ping interval. Also, I am worried about the approach followed in the patch. In particular, i don't like the way that ping interval is hardcoded to 5 seconds in the Child process (overriding the configuration passed). I also think that in the current implementation, it affects the task->namenode ping interval which is not desirable.
        I propose that for this patch, we just fix the case of TCP_NODELAY being ON by default.
        Thoughts?

        Show
        Devaraj Das added a comment - I still don't see a strong reason for reducing the ping interval. Also, I am worried about the approach followed in the patch. In particular, i don't like the way that ping interval is hardcoded to 5 seconds in the Child process (overriding the configuration passed). I also think that in the current implementation, it affects the task->namenode ping interval which is not desirable. I propose that for this patch, we just fix the case of TCP_NODELAY being ON by default. Thoughts?
        Hide
        Ravi Gummadi added a comment -

        All unit tests passed on my local machine.

        ant test-patch gave:

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
        [exec] Please justify why no new tests are needed for this patch.
        [exec] Also please list what manual steps were performed to verify this patch.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        This doesn't need a unit test as this just sets ping interval to a smaller value.
        Tested manually with 3 sort jobs started at a time on 5 nodes cluster and there is no SocketTimeoutException seen.

        Show
        Ravi Gummadi added a comment - All unit tests passed on my local machine. ant test-patch gave: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no new tests are needed for this patch. [exec] Also please list what manual steps were performed to verify this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. This doesn't need a unit test as this just sets ping interval to a smaller value. Tested manually with 3 sort jobs started at a time on 5 nodes cluster and there is no SocketTimeoutException seen.
        Hide
        Ravi Gummadi added a comment -

        Attaching patch that sets the ping interval to 5sec and disable Nagle's algorithm (tcp no-delay).

        Show
        Ravi Gummadi added a comment - Attaching patch that sets the ping interval to 5sec and disable Nagle's algorithm (tcp no-delay).

          People

          • Assignee:
            Ravi Gummadi
            Reporter:
            Arun C Murthy
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development