Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: ipc
    • Labels: None

      Description

      I think we should switch the default for the IPC client and server NODELAY options to true. As Wikipedia says:

      In general, since Nagle's algorithm is only a defense against careless applications, it will not benefit a carefully written application that takes proper care of buffering; the algorithm has either no effect, or negative effect on the application.

      Since our IPC layer is well contained and does its own buffering, we shouldn't be careless.
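      For context, here is a minimal sketch of what enabling the option means at the socket level. This is plain java.net.Socket rather than the Hadoop IPC code, and the endpoint and payload are made up for illustration:

      import java.net.InetSocketAddress;
      import java.net.Socket;

      public class TcpNoDelaySketch {
        public static void main(String[] args) throws Exception {
          Socket s = new Socket();
          // Disable Nagle's algorithm: small writes are sent immediately instead of
          // being held back until the previous segment is ACKed.
          s.setTcpNoDelay(true);
          s.connect(new InetSocketAddress("localhost", 12345), 10000);  // made-up endpoint
          try {
            s.getOutputStream().write("ping".getBytes("UTF-8"));
            s.getOutputStream().flush();
          } finally {
            s.close();
          }
        }
      }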

        Issue Links

          Activity

          Todd Lipcon created issue -
          Todd Lipcon added a comment -

          A good reason to do this is that, when I run an IPC benchmark on an "echo" function, as soon as the message size eclipses 8K the latency goes up to 40ms due to the interaction of nagling and delayed ACK.

          Todd Lipcon made changes -
          Field Original Value New Value
          Link This issue is related to HADOOP-8071 [ HADOOP-8071 ]
          Todd Lipcon added a comment -

          Attached patch changes the defaults and removes these keys from the documentation. There isn't any good reason that the user should change them, since we do our own buffering at the IPC layer.
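          For anyone who does still want to override the new defaults explicitly, a minimal sketch (not part of the patch; the class name is made up, and the key names are the same ones passed to the benchmark commands below):

          import org.apache.hadoop.conf.Configuration;

          // Illustration only: explicitly restore the old behavior (Nagle enabled)
          // for clients/servers created from this Configuration. With the patch
          // applied, both keys simply default to true.
          public class RestoreNagleSketch {
            public static Configuration confWithNagle() {
              Configuration conf = new Configuration();
              conf.setBoolean("ipc.client.tcpnodelay", false);
              conf.setBoolean("ipc.server.tcpnodelay", false);
              return conf;
            }
          }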

          Todd Lipcon made changes -
          Attachment hadoop-8069.txt [ 12514452 ]
          Todd Lipcon added a comment -

          Illustrating the improvement that NODELAY makes using the benchmark from HADOOP-8070:

          todd@todd-w510:~/git/hadoop-common/hadoop-common-project/hadoop-common$ /usr/lib/jvm/java-6-sun/bin/java -cp /home/todd/git/hadoop-common/hadoop-dist/target/hadoop-0.24.0-SNAPSHOT/share/hadoop/common/lib/*:target/classes:target/test-classes org.apache.hadoop.ipc.RPCCallBenchmark -Dipc.server.tcpnodelay=false -Dipc.client.tcpnodelay=false -c 1 -s 1 -t 10 -m 8300  -e protobuf
          Calls per second: 21.0
          Calls per second: 24.0
          Calls per second: 24.0
          Calls per second: 24.0
          Calls per second: 23.0
          Calls per second: 25.0
          Calls per second: 24.0
          Calls per second: 24.0
          Calls per second: 23.0
          Calls per second: 24.0
          ====== Results ======
          Options:
          rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
          serverThreads=1
          serverReaderThreads=1
          clientThreads=1
          host=0.0.0.0
          port=12345
          secondsToRun=10
          msgSize=8300
          Total calls per second: 24.0
          CPU time per call on client: 691056 ns
          CPU time per call on server: 894308 ns
          todd@todd-w510:~/git/hadoop-common/hadoop-common-project/hadoop-common$ /usr/lib/jvm/java-6-sun/bin/java -cp /home/todd/git/hadoop-common/hadoop-dist/target/hadoop-0.24.0-SNAPSHOT/share/hadoop/common/lib/*:target/classes:target/test-classes org.apache.hadoop.ipc.RPCCallBenchmark -Dipc.server.tcpnodelay=true -Dipc.client.tcpnodelay=true -c 1 -s 1 -t 10 -m 8300  -e protobuf
          Calls per second: 642.0
          Calls per second: 859.0
          Calls per second: 1593.0
          Calls per second: 2378.0
          Calls per second: 2069.0
          Calls per second: 2716.0
          Calls per second: 3400.0
          Calls per second: 3973.0
          Calls per second: 4117.0
          Calls per second: 4075.0
          ====== Results ======
          Options:
          rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
          serverThreads=1
          serverReaderThreads=1
          clientThreads=1
          host=0.0.0.0
          port=12345
          secondsToRun=10
          msgSize=8300
          Total calls per second: 2582.0
          CPU time per call on client: 137426 ns
          CPU time per call on server: 151749 ns
          

          Note that 24 calls/sec corresponds to about 41 ms/call, which is just above the 40 ms delay you see from the interaction of delayed ACK and nagling on Linux.
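          (That works out to 1000 ms / 24 calls ≈ 41.7 ms per call, i.e. essentially one ~40 ms nagle/delayed-ACK stall per round trip.)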

          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12514452/hadoop-8069.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/594//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/594//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          +1, the patch looks good to me. Great analysis/benchmarking.

          Eli Collins added a comment -

          +1 nice

          Todd Lipcon added a comment -

          Before committing this, I want to double check a couple things to make sure there are no cases where we end up making more packets than before.

          Todd Lipcon added a comment -

          One issue here with nagling off is the following:

          In the Server implementation, we write with at most 8KB per write() call, to avoid a heap malloc inside the JDK's SocketOutputStream implementation (with less than 8K it uses a stack buffer instead).
          So, if we write a 10KB response, we end up doing a write(8KB) followed by a write(2KB).

          The problem here, when NODELAY is on, is that the TCP MSS doesn't divide neatly into the 8K buffer size. So we get the following behavior:
          write(8K):
            sends 5 packets of MSS size (e.g. 1490 bytes)
            sends 1 packet of around half MSS (around 750 bytes)
          write(2K):
            sends 1 packet of MSS
            sends 1 packet of around 1/3 MSS

          Although the response should have fit in 7 packets, we used 8.
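          (For concreteness, assuming a 10240-byte response and the ~1490-byte MSS from the example: a single write would need ceil(10240 / 1490) = 7 segments, while the split write needs ceil(8192 / 1490) + ceil(2048 / 1490) = 6 + 2 = 8.)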

          The following thread about postfix discusses a similar issue:
          http://tech.groups.yahoo.com/group/postfix-users/message/224183

          Possible solutions:
          1) accept the inefficiency - it's bounded by one extra "small" packet for every 8KB in the response
          2) try to set the write buffer size to an exact multiple of MSS. This is difficult because Java doesn't let you call getsockopt(TCP_MAXSEG)
          3) use TCP_CORK and TCP_UNCORK to control the packet sending behavior. This is difficult because Java also doesn't expose those
          4) in the Server.channelIO loop, turn off NODELAY while writing all but the last buffer's worth, then turn NODELAY back on for the last buffer. This should act as a flush of all the remaining buffered data (a rough sketch of this follows at the end of this comment)

          Canceling patch for now to work through this
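          A rough sketch of option #4, assuming a blocking channel and a hypothetical helper class (this is an illustration, not the actual Server.channelIO code):

          import java.io.IOException;
          import java.nio.ByteBuffer;
          import java.nio.channels.SocketChannel;

          /**
           * Hypothetical sketch of option #4 above: keep Nagle enabled while writing
           * all but the last <=8KB chunk of a response, then enable TCP_NODELAY for
           * the final chunk so the trailing partial segment is flushed immediately
           * (roughly what TCP_CORK/uncork would give us).
           */
          public class NoDelayFlushSketch {
            private static final int CHUNK = 8 * 1024;   // matches the 8KB write size discussed above

            public static void writeResponse(SocketChannel channel, ByteBuffer data) throws IOException {
              while (data.hasRemaining()) {
                boolean lastChunk = data.remaining() <= CHUNK;
                // Nagle on for the bulk of the response, off for the final chunk.
                channel.socket().setTcpNoDelay(lastChunk);

                int end = Math.min(data.position() + CHUNK, data.limit());
                ByteBuffer slice = data.duplicate();
                slice.limit(end);
                while (slice.hasRemaining()) {
                  channel.write(slice);               // assumes a blocking channel for simplicity
                }
                data.position(end);
              }
            }
          }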

          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Daryn Sharp added a comment -

          The short packets should only result in 40 bytes of packet overhead. Delayed ack should coalesce the acks to prevent any overhead there. Overall, the overhead for short packets should be ~0.5%.

          The key is avoiding 200ms bubbles in the default settings for a socket. From memory: nagle holds data until a full packet is assembled, an ack for the previous packet arrives, or 200ms expires. The receiver sends a delayed ack for every other packet, or when 200ms expires. If the last partial packet is an odd-numbered packet, the sender and receiver end up waiting on each other: the receiver's 200ms dack timer expires, it sends the ack for the next-to-last (even) packet, and only then does the sender send the final odd packet.

          I think, but I'm rusty, that the main differences between nagle and cork are:

          • nagle may send a partial packet if the receiver acks before another full packet is assembled
          • cork ignores acks and just sends full packets, or 200ms expires
          • uncorking flushes the socket buffer and sets the tcp push flag (causing immediate ack, not dack) on the partial packet
          • nodelay might be setting the push flag on all packets (generating 2X acks), but I think it's just the partial packets
          • nodelay is portable, cork is not

          All said, I think #4 is probably the best bet. It should in effect be like cork unless the writes for a given chunk of data are written in a slow/sporadic fashion, thus causing acks to send out partial packets. Most comparisons are straight nodelay or straight cork, so your findings will be interesting.

          Todd Lipcon added a comment -

          Hi Daryn. Your descriptions above sound right, except that the nagle delay on Linux is 40ms rather than 200 (I think the dack delay is 200, though, like you said).
          I hacked up something like my #4 yesterday morning, but didn't really like the way I did it, so I threw it away. I'll try again soon.

          Suresh Srinivas added a comment -

          Todd, how many RPC responses go beyond 8K in size? Roughly, what would you guess is the percentage of total RPC calls this represents?

          Todd Lipcon added a comment -

          My hunch is that it's pretty small. I think the only RPC to the NN which would be at all frequent and cross the 8K boundary would be getListing(). On one production HBase cluster I collected metrics from a while back, getListing represented 8.3% of the RPCs. On one of our QA clusters that's been running MR workloads, it represents 2.3%. Unfortunately, we don't have enough metrics to get any info on the size distribution of those responses.

          Would be interested to hear if some of your production clusters show a similar mix.

          Suresh Srinivas added a comment -

          ... cross the 8K boundary would be getListing()

          This is what I was thinking. However, we have iterative listing now, so perhaps the probability of such RPCs going over 8K is lower. That said, we should certainly tweak DFS_LIST_LIMIT_DEFAULT based on your findings.

          Additionally, there are other RPCs such as Namenode#getBlocks() and ClientProtocol#listCorruptBlocks(), but these are not called frequently.

          James Fitch made changes -
          Link This issue relates to HBASE-1177 [ HBASE-1177 ]
          Robert Joseph Evans made changes -
          Target Version/s 0.23.2 [ 12319855 ] 2.0.0, 3.0.0 [ 12320352, 12320357 ]
          Suresh Srinivas made changes -
          Link This issue is duplicated by HADOOP-7421 [ HADOOP-7421 ]
          Daryn Sharp added a comment -

          Should we again consider changing the default for nodelay? On one of our busiest clusters, listStatus represents about 3% of the RPC load. A few extra packets here and there on a fast network are going to be delivered much faster than the nagle delay.


            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              1
              Watchers:
              12
