HADOOP-6713 (Hadoop Common)

The RPC server Listener thread is a scalability bottleneck

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: ipc
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The Hadoop RPC Server implementation has a single Listener thread that reads requests from the socket and puts them into a call queue. This means that this single thread can pull RPC requests off the network only as fast as a single CPU can execute. This is a scalability bottleneck in our cluster.
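
      For reference, here is a minimal sketch of the pattern the description refers to, written against plain Java NIO. It is only an illustration of the single accept-plus-read loop (the actual org.apache.hadoop.ipc.Server code is far more involved); all names here are simplified stand-ins.

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.nio.channels.SelectionKey;
        import java.nio.channels.Selector;
        import java.nio.channels.ServerSocketChannel;
        import java.nio.channels.SocketChannel;
        import java.util.Iterator;
        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.LinkedBlockingQueue;

        /** One thread does everything: accept, read, deserialize, enqueue. */
        public class SingleListenerSketch extends Thread {
          private final Selector selector;
          private final ServerSocketChannel acceptChannel;
          private final BlockingQueue<Object> callQueue = new LinkedBlockingQueue<Object>();

          public SingleListenerSketch(int port) throws IOException {
            acceptChannel = ServerSocketChannel.open();
            acceptChannel.configureBlocking(false);
            acceptChannel.socket().bind(new InetSocketAddress(port));
            selector = Selector.open();
            acceptChannel.register(selector, SelectionKey.OP_ACCEPT);
          }

          @Override
          public void run() {
            while (!isInterrupted()) {
              try {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                  SelectionKey key = it.next();
                  it.remove();
                  if (key.isAcceptable()) {              // doAccept()
                    SocketChannel ch = acceptChannel.accept();
                    if (ch != null) {
                      ch.configureBlocking(false);
                      ch.register(selector, SelectionKey.OP_READ);
                    }
                  } else if (key.isReadable()) {
                    // doRead(): the CPU-bound work that serializes all
                    // clients behind this one thread.
                    callQueue.offer(readCall((SocketChannel) key.channel()));
                  }
                }
              } catch (IOException e) {
                // log and continue
              }
            }
          }

          /** Placeholder for deserializing one RPC request off the channel. */
          private Object readCall(SocketChannel ch) throws IOException {
            return new Object();
          }
        }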

      Attachments

      1. HADOOP-6713.2.patch
        11 kB
        Dmytro Molkov
      2. HADOOP-6713.patch
        10 kB
        Dmytro Molkov
      3. HADOOP-6713-rel20.2.patch
        12 kB
        Bharath Mundlapudi
      4. HADOOP-6713-rel20.3.patch
        13 kB
        Bharath Mundlapudi
      5. HADOOP-6713-rel20.patch
        13 kB
        Bharath Mundlapudi

          Activity

          Suresh Srinivas added a comment -

          +1 for the patch.

          Bharath Mundlapudi added a comment -

          Attached the patch addressing the comments.

          Suresh Srinivas added a comment -

          Comments for Y20 version of the patch.

          1. Please remove YAHOO-CHANGES.txt from the patch.
          2. Remove the whitespace-only changes that show up in the diff between the two patches: lines 51, 73, 138, 171, 203, 220. There could be others.
          3. Line 283 exceeds 80 characters.
          Bharath Mundlapudi added a comment -

          Attached the patch for Y20S branch.

          Hairong Kuang added a comment -

          I've just committed this. Thanks, Dmytro!

          Hairong Kuang added a comment -

          +1

          Note that this patch assigns read channels to readers in a round-robin fashion. Ideally, all readers would share a set of selected keys, with ready-to-read channels dynamically assigned to each reader; that would enforce better balance among the readers. But I think we can make that improvement at a later time.
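
          To make the trade-off concrete, the round-robin handoff amounts to something like the following (a sketch; the Reader type and the names are placeholders, not the patch's exact code):

            import java.nio.channels.SocketChannel;

            class RoundRobinDispatch {
              interface Reader {
                void add(SocketChannel ch);   // hand a channel to this reader
              }

              private final Reader[] readers;
              private int currentReader = 0;

              RoundRobinDispatch(Reader[] readers) {
                this.readers = readers;
              }

              /** Cycle through readers in a fixed order, ignoring their load. */
              synchronized Reader getReader() {
                currentReader = (currentReader + 1) % readers.length;
                return readers[currentReader];
              }
            }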

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12442902/HADOOP-6713.2.patch
          against trunk revision 938136.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/52/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/52/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/52/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/52/console

          This message is automatically generated.

          Dmytro Molkov added a comment -

          Resubmitting for Hudson

          Dmytro Molkov added a comment -

          Modified the patch to address Hairong's comments:

          1. Removed the System.out
          2. Removed doRead from Listener#run
          3. Changed the code to accept all the connections in the queue
          4. Added a comment on the startAdd method to explain the synchronization a little (sketched below). I went with this design to preserve the semantics of doAccept: once the doAccept iteration for a given channel completes, that channel is registered with the selector and ready to be read. Doing otherwise could affect other parts of the code, since a connection would be accepted (its doAccept iteration over) while connectionList had not yet been modified; the Connection object is only created when the channel is registered with the selector.
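
          A condensed sketch of that handshake, based on the description above (method and field names are assumptions drawn from the comment, not necessarily the committed code):

            import java.io.IOException;
            import java.nio.channels.ClosedChannelException;
            import java.nio.channels.SelectionKey;
            import java.nio.channels.Selector;
            import java.nio.channels.SocketChannel;

            class Reader extends Thread {
              private volatile boolean adding = false;
              private final Selector readSelector;

              Reader() throws IOException {
                readSelector = Selector.open();
              }

              /** Listener calls this before registering an accepted channel. */
              void startAdd() {
                adding = true;
                readSelector.wakeup();       // knock the reader out of select()
              }

              synchronized SelectionKey registerChannel(SocketChannel ch)
                  throws ClosedChannelException {
                return ch.register(readSelector, SelectionKey.OP_READ);
              }

              /** Listener calls this once the Connection is attached to the key. */
              synchronized void finishAdd() {
                adding = false;
                notify();
              }

              @Override
              public void run() {
                while (!isInterrupted()) {
                  try {
                    readSelector.select();
                    synchronized (this) {
                      while (adding) {       // park while the listener is still
                        wait(1000);          // wiring up a new connection
                      }
                    }
                    // ... iterate selected keys and doRead() each attached Connection
                  } catch (Exception e) {
                    // log and continue
                  }
                }
              }
            }

            // Listener side, inside doAccept(), the sequence would be:
            //   reader.startAdd();
            //   SelectionKey readKey = reader.registerChannel(channel);
            //   readKey.attach(connection);  // Connection exists before reader resumes
            //   reader.finishAdd();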

          Hairong Kuang added a comment -

          This is a great idea! Separating "accept" from "read" should also greatly reduce the connection-reset errors observed at the client when the NameNode is busy. Dhruba asked me to review this patch, so here are a few comments:

          1. Please remove the System.out.println or change it to a log statement;
          2. Listener#run() should remove the doRead() else branch;
          3. Now that accept is done in a separate thread, doAccept() should accept as many connections as possible (not limited to 10 as in trunk). Another option is to use a blocking accept channel.
          4. Optional: the synchronization between the listener thread and the read threads is very interesting. It took me a while to figure out that it works, but the code seems hard to understand and maintain. Another option is for each reader thread to maintain a queue of channels pending registration: after choosing a reader, the listener thread simply adds an accepted channel to that reader's pending queue and wakes the reader up, and each reader thread registers all pending channels before select() (see the sketch after this list).
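
          For comparison, the pending-queue alternative in item 4 might look like this (all names hypothetical):

            import java.io.IOException;
            import java.nio.channels.SelectionKey;
            import java.nio.channels.Selector;
            import java.nio.channels.SocketChannel;
            import java.util.concurrent.BlockingQueue;
            import java.util.concurrent.LinkedBlockingQueue;

            class QueueingReader extends Thread {
              private final Selector readSelector;
              private final BlockingQueue<SocketChannel> pendingConnections =
                  new LinkedBlockingQueue<SocketChannel>();

              QueueingReader() throws IOException {
                readSelector = Selector.open();
              }

              /** Called by the listener thread right after accept(). */
              void addConnection(SocketChannel ch) {
                pendingConnections.add(ch);
                readSelector.wakeup();       // never blocks the listener
              }

              @Override
              public void run() {
                while (!isInterrupted()) {
                  try {
                    // Register everything queued since the last pass, then select.
                    SocketChannel ch;
                    while ((ch = pendingConnections.poll()) != null) {
                      ch.register(readSelector, SelectionKey.OP_READ);
                    }
                    readSelector.select();
                    // ... iterate selected keys and doRead()
                  } catch (IOException e) {
                    // log and continue
                  }
                }
              }
            }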

          Dmytro Molkov added a comment -

          We ran a performance test on our test cluster.
          The test creates a tree of directories with files at the leaves, depth-first: there is a root, and for the test we create N directories in the root directory; each mapper then starts in one of those directories and creates its own subtree with files at the leaves.

          Then a read job runs: each mapper does an ls on its directory and picks a random entry; if it is a directory it repeats, and if it is a file it reads the file. The files are 4 KB in size, so the read time is small and this job mostly hits the NameNode (see the sketch below).
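
          Against the standard FileSystem API, each mapper's walk would look roughly like this (a sketch; job wiring and error handling omitted):

            import java.io.IOException;
            import java.util.Random;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            class RandomReadWalk {
              void randomRead(FileSystem fs, Path root) throws IOException {
                Random rand = new Random();
                Path p = root;
                while (true) {
                  FileStatus[] entries = fs.listStatus(p);  // the "ls": one NameNode RPC
                  if (entries == null || entries.length == 0) {
                    return;
                  }
                  FileStatus pick = entries[rand.nextInt(entries.length)];
                  if (pick.isDir()) {
                    p = pick.getPath();                     // directory: descend, repeat
                  } else {
                    FSDataInputStream in = fs.open(pick.getPath());
                    byte[] buf = new byte[4096];            // files are only 4 KB, so the
                    while (in.read(buf) > 0) { }            // read itself is cheap and the
                    in.close();                             // NameNode dominates the cost
                    return;
                  }
                }
              }
            }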

          We ran a branch that had this fix and that also used read-write locks in the NameNode instead of synchronized sections.

          The version without the fixes could only drive the NameNode to 175% CPU. With the fixes in place we saw 750% CPU for the read-only load (when the read job was running on its own) and 550% for the read-write load (when the two jobs were running in parallel).

          In the read-write mode the ratio of reads to writes was 8:1 (800 read clients vs. 100 write clients).

          We are not putting the read-write locks into production in this iteration, since we feel we need to do more testing on them. As soon as I have results for a branch with only this fix, I will post my findings here.

          Koji Noguchi added a comment -

          Nice. Any performance numbers?

          dhruba borthakur added a comment -

          +1 code looks good.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12442483/HADOOP-6713.patch
          against trunk revision 936463.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/49/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/49/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/49/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/49/console

          This message is automatically generated.

          Dmytro Molkov added a comment -

          Please have a look. We made the same change on hadoop-0.20 and ran it for a day under heavy write and read load on the test cluster.
          The included unit test just reruns TestRPC with different numbers of reader threads (see the note below).
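
          For context, the patch exposes the reader count as a server configuration knob, so the test variation boils down to something like the following (the exact key string should be treated as an assumption here):

            // Rerun the existing TestRPC cases with more than one reader thread.
            org.apache.hadoop.conf.Configuration conf =
                new org.apache.hadoop.conf.Configuration();
            conf.setInt("ipc.server.read.threadpool.size", 4);
            // ... build the RPC server under test with this conf as usual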

          dhruba borthakur added a comment -

          Hi Christian, I agree with you. We have around 40K clients hitting the NameNode, and the NN (in spite of running on the newest and greatest Nehalem) has one CPU completely maxed out by the Server.Listener thread.

          Christian Kunz added a comment -

          That would be a great improvement for our clusters as well, and probably for any cluster with many clients that are not necessarily map-reduce applications.

          dhruba borthakur added a comment -

          My proposal is that the Server.Listener.run() thread only executes doAccept(). There would also be a pool of open Selector objects (instead of the single Selector object in the current code), each handled by its own thread.

          Server.Listener.run() invokes doAccept(), which inserts the accepted SocketChannel into a randomly selected Selector object. The thread monitoring that Selector object then invokes doRead() and processes the RPC (see the sketch below).
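
          A sketch of that proposal (names hypothetical; SelectorThread stands in for a reader thread like the ones sketched earlier on this page):

            import java.io.IOException;
            import java.nio.channels.SelectionKey;
            import java.nio.channels.ServerSocketChannel;
            import java.nio.channels.SocketChannel;
            import java.util.Random;

            class AcceptOnlyListener {
              interface SelectorThread {
                void add(SocketChannel channel);  // registers for OP_READ, then doRead()
              }

              private final SelectorThread[] selectorThreads;
              private final Random rand = new Random();

              AcceptOnlyListener(SelectorThread[] threads) {
                this.selectorThreads = threads;
              }

              /** The listener's select loop now only ever sees OP_ACCEPT events. */
              void doAccept(SelectionKey key) throws IOException {
                ServerSocketChannel server = (ServerSocketChannel) key.channel();
                SocketChannel channel;
                while ((channel = server.accept()) != null) {  // drain the backlog
                  channel.configureBlocking(false);
                  // Hand off to a randomly selected selector thread.
                  selectorThreads[rand.nextInt(selectorThreads.length)].add(channel);
                }
              }
            }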


            People

            • Assignee: Dmytro Molkov
            • Reporter: dhruba borthakur
            • Votes: 0
            • Watchers: 19
