Hadoop Common
HADOOP-2870

Datanode.shutdown() and Namenode.stop() should close all rpc connections

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.17.0
    • Component/s: ipc
    • Labels:
      None

      Description

      Currently these two cleanup methods do not close all existing RPC connections. If a mini DFS cluster gets shut down and then restarted, as we do in TestFileCreation, RPCs in the second mini cluster reuse the unclosed connections opened in the first run, but there is no server running to serve the requests. So the client gets stuck waiting for a response forever if the client-side timeout is removed, as suggested by HADOOP-2811.

      Attachments

      1. closeConnection4.patch
        25 kB
        Hairong Kuang
      2. closeConnection3.patch
        25 kB
        Hairong Kuang
      3. closeConnection2.patch
        25 kB
        Hairong Kuang
      4. closeConnection1.patch
        25 kB
        Hairong Kuang
      5. closeConnection.patch
        22 kB
        Hairong Kuang

          Activity

          Hairong Kuang added a comment -

          I am looking at how to close the client-side connections. It looks like an IPC client is an abstraction over a SocketFactory. Different proxies share the same IPC client as long as they use the same socket factory, even though they may talk to different IPC servers. Is this true? What's the motivation of the design?

          Doug Cutting added a comment -

          > What's the motivation of the design?

          Conservation of connections was the original motivation. Connections are pooled in the IPC client, so sharing an IPC client permits multiple RPC proxy instances to share connections. We could achieve this in other ways, for example, RPC could cache IPC clients by port too. But, unless we cache proxy instances, we shouldn't create a new IPC client per proxy or we'd lose connection pooling.
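
          To make the sharing concrete, here is a minimal sketch of the scheme described above, assuming simplified names (illustrative only, not the actual org.apache.hadoop.ipc source): clients are cached per SocketFactory, and each client pools one connection per remote address.

          import java.net.InetSocketAddress;
          import java.util.HashMap;
          import java.util.Map;

          import javax.net.SocketFactory;

          // Illustrative sketch, not the real ipc classes: proxies created with
          // the same SocketFactory share one Client, and the Client pools its
          // connections by server address.
          class ClientSharingSketch {

            static class Connection {
              final InetSocketAddress server;
              Connection(InetSocketAddress server) { this.server = server; }
            }

            static class Client {
              // one pooled connection per remote server this client talks to
              private final Map<InetSocketAddress, Connection> connections =
                  new HashMap<InetSocketAddress, Connection>();

              synchronized Connection getConnection(InetSocketAddress server) {
                Connection conn = connections.get(server);
                if (conn == null) {
                  conn = new Connection(server);
                  connections.put(server, conn);
                }
                return conn;
              }
            }

            // clients are keyed by SocketFactory, not by server address
            private final Map<SocketFactory, Client> clients =
                new HashMap<SocketFactory, Client>();

            synchronized Client getClient(SocketFactory factory) {
              Client client = clients.get(factory);
              if (client == null) {
                client = new Client();
                clients.put(factory, client);
              }
              return client;
            }
          }

          In such a scheme a proxy never owns a connection; it only borrows whatever its shared Client has pooled for the target address, which is what gives the connection conservation described above.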

          Hairong Kuang added a comment -

          If two proxies talk to different servers, I do not see how they can share connections by sharing the same IPC client. It seems to me that it is better to allow two proxies to share a client when they talk to the same server and have the same socket factory.

          Hairong Kuang added a comment -

          I feel that it makes more sense for the key to be the remote server address. Then an IPC client caches all connections to a single server.

          Doug Cutting added a comment -

          > allow two proxies to share a client when they talk to the same server and have the same socket factory

          Yes, that sounds reasonable. The cache keys would be <host,port,socketFactory>, right?

          Each IPC client currently has a connectionCuller thread, so creating a client per host will roughly double the number of RPC-related threads (each client also has a thread per open connection, to read responses asynchronously). We could perhaps make the culler static, rather than per-client, if this is thought to be too costly.

          Why does the culler have to be a separate thread anyway? Couldn't the connection thread itself simply exit when reading a response times out and the idle time has been exceeded?
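
          A culler-less connection could look roughly like the sketch below: the connection's own reader thread exits once it has been idle past a threshold. This is a sketch only; maxIdleTime and dispatch() are hypothetical names, not Hadoop APIs.

          import java.io.DataInputStream;
          import java.io.IOException;
          import java.net.Socket;
          import java.net.SocketTimeoutException;

          // Sketch of the idea above: set a read timeout on the socket so the
          // reader thread wakes up periodically, and let it exit (culling the
          // connection) once it has been idle longer than maxIdleTime.
          class IdleExitConnection implements Runnable {
            private final Socket socket;
            private final int maxIdleTime;   // milliseconds of idleness tolerated

            IdleExitConnection(Socket socket, int maxIdleTime) throws IOException {
              this.socket = socket;
              this.maxIdleTime = maxIdleTime;
              socket.setSoTimeout(maxIdleTime);   // reads return after maxIdleTime
            }

            public void run() {
              long lastActivity = System.currentTimeMillis();
              try {
                DataInputStream in = new DataInputStream(socket.getInputStream());
                while (true) {
                  try {
                    int callId = in.readInt();   // blocks until a response arrives
                    lastActivity = System.currentTimeMillis();
                    dispatch(callId, in);        // hand the response to its caller
                  } catch (SocketTimeoutException e) {
                    // No response within the read timeout: exit only if the idle
                    // period has been exceeded, otherwise keep waiting.
                    if (System.currentTimeMillis() - lastActivity >= maxIdleTime) {
                      return;                    // thread exits; connection is culled
                    }
                  }
                }
              } catch (IOException e) {
                // connection broken; fall through and close
              } finally {
                try { socket.close(); } catch (IOException ignored) { }
              }
            }

            private void dispatch(int callId, DataInputStream in) throws IOException {
              // placeholder: a real client would read the response body here and
              // wake up the caller waiting on this call id
            }
          }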

          Hairong Kuang added a comment -

          > Why does the culler have to be a separate thread anyway? Couldn't the connection thread itself simply exit when reading a response times out and the idle time has been exceeded?
          Yes, absolutely. I removed the culler thread in the patch that's submitted to HADOOP-2811.

          > The cache keys would be <host,port,socketFactory>, right?
          I am thinking of making the client cache key host:port, so a Client maps to a server and maintains all connections to that server. If the key is <host:port, socketFactory>, then a Client is more or less equal to a connection. Does this make sense?

          Hairong Kuang added a comment -

          Now I understand the reason that each IPC client allows connections to multiple servers: it supports parallel calls in the IPC client, which issue requests to multiple servers simultaneously.

          So I do not plan to change the caching structure. In this JIRA, I will add a reference counter to each entry in the IPC client cache; a client will be closed only when its reference counter becomes zero. I will also add a static method to the RPC class, stopProxy(VersionedProtocol), which closes a proxy and in turn closes an IPC client when its reference counter becomes zero.
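
          As a rough illustration of this reference-counting plan (a sketch with simplified names, not the actual patch), the cache could look like this:

          import java.util.HashMap;
          import java.util.Map;

          import javax.net.SocketFactory;

          // Sketch of the plan above: getClient() bumps a per-entry reference
          // count, and stopClient() closes the cached client only when the
          // count drops back to zero.
          class RefCountedClientCacheSketch {

            static class Client {
              int refCount;                       // guarded by the cache's lock
              void stop() { /* close all pooled connections */ }
            }

            private final Map<SocketFactory, Client> clients =
                new HashMap<SocketFactory, Client>();

            /** Get or create the shared client for this factory and add a reference. */
            synchronized Client getClient(SocketFactory factory) {
              Client client = clients.get(factory);
              if (client == null) {
                client = new Client();
                clients.put(factory, client);
              }
              client.refCount++;
              return client;
            }

            /** Drop one reference; close and evict the client once it is unused. */
            synchronized void stopClient(SocketFactory factory) {
              Client client = clients.get(factory);
              if (client != null && --client.refCount == 0) {
                clients.remove(factory);
                client.stop();
              }
            }
          }

          RPC.stopProxy(VersionedProtocol) would then call something like stopClient() for the proxy being closed.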

          Hairong Kuang added a comment -

          This patch:
          1. adds a reference counter to each entry in the IPC Client cache;
          2. adds a static method to RPC: stopProxy(VersionedProtocol);
          3. makes sure that IPC clients and servers close connections in their shutdown methods;
          4. makes sure that DataNode and DFSClient stop the proxy in their shutdown methods (see the sketch after this list);
          5. cleans up the RPC code by removing junit-test-related methods;
          6. fixes the slowRPC test code in TestRPC so it does not need the change to the RPC code;
          7. fixes various junit tests to explicitly close their connections.
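
          For context, item 4 amounts to a call pattern like the one sketched below in a component's shutdown path. This is a hypothetical, heavily simplified caller (the real DataNode/DFSClient code differs); RpcCache stands in for the RPC/ClientCache machinery, and in the real code the release is the new static RPC.stopProxy(...).

          import java.io.Closeable;
          import java.io.IOException;

          // Hypothetical caller illustrating item 4: whoever created an RPC proxy
          // releases it on shutdown so the shared IPC client's reference count can
          // reach zero and its connections can be closed.
          class ShutdownSketch implements Closeable {
            interface NamenodeProtocol { /* RPC methods elided */ }

            interface RpcCache {
              NamenodeProtocol getProxy() throws IOException;
              void stopProxy(NamenodeProtocol proxy);   // decrements the client's ref count
            }

            private final RpcCache rpc;
            private NamenodeProtocol namenode;

            ShutdownSketch(RpcCache rpc) throws IOException {
              this.rpc = rpc;
              this.namenode = rpc.getProxy();
            }

            @Override
            public void close() {                       // the shutdown path
              if (namenode != null) {
                rpc.stopProxy(namenode);                // releases the shared IPC client
                namenode = null;
              }
            }
          }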

          Owen O'Malley added a comment -

          I think a ClientInfo that wraps Client just to add a reference count seems like overkill. Why not add the reference count to Client itself?

          I wonder if the reference counting is better done with a WeakReference? In particular, you should also have the JobTracker and TaskTracker free up their IPC clients too.

          In the test code 'slowping' should be 'slowPing'.

          A big thanks for removing the unit test code from the main source code.

          Raghu Angadi added a comment -

          One change: client.stop() could wait a long time if the client is waiting for a response from the server, so it could block other things, such as another thread creating a different proxy. We could move client.stop() outside the synchronized block in ClientCache.stopClient(), after removing the client from the cache.

          > I wonder if the reference counting is better done by WeakReference?
          I doubt it. I think the requirement here is to close the connection as soon as the ref count goes to zero. A WeakReference depends on Java GC, so the original test failure that triggered this JIRA would still occur.
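
          The locking change suggested in the first paragraph would look roughly like this (a sketch with simplified names, not the actual patch):

          import java.util.HashMap;
          import java.util.Map;

          import javax.net.SocketFactory;

          // Sketch of the suggestion above: remove the client from the cache while
          // holding the lock, but call client.stop() after releasing it, so a
          // slow stop() cannot block threads that are creating other proxies.
          class StopOutsideLockSketch {

            static class Client {
              int refCount;
              void stop() { /* may block while a response is outstanding */ }
            }

            private final Map<SocketFactory, Client> clients =
                new HashMap<SocketFactory, Client>();

            void stopClient(SocketFactory factory) {
              Client toStop = null;
              synchronized (this) {                 // short critical section
                Client client = clients.get(factory);
                if (client != null && --client.refCount == 0) {
                  clients.remove(factory);
                  toStop = client;
                }
              }
              if (toStop != null) {
                toStop.stop();                      // potentially slow, now outside the lock
              }
            }
          }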

          Hairong Kuang added a comment - edited

          The new patch incorporates the following comments:
          1. Move the reference count into Client;
          2. Make TaskTracker and JobClient close connections in close;
          3. Move client.stop out of the critical section in stopClient;
          4. slowping -> slowPing

          Raghu Angadi added a comment -

          > Move client.stop out of the critical section in stopClient

          stopClient() is still synchronized at the top.

          I haven't entirely thought through it, but it would be nice if RPC.stopProxy() could take a RetryProxy (which is a wrapper around the regular proxy) as well. That way users wouldn't need to distinguish the two. Right now DataNode has two proxies: 'namenode' for RPCs and 'rpcNamenode' for stopProxy().

          Raghu Angadi added a comment -

          Just to clarify, I did not mean to make the above suggestion about RetryProxy a requirement for this patch. It was just a thought.

          Hairong Kuang added a comment -

          Thanks, Raghu. The patch removes the top-level synchronization in stopClient. As for the thought of providing a method that closes a RetryProxy and in turn closes the wrapped RPC proxy, I agree that it is a good idea; I thought about it too. But the problem is that it is hard to make it general. Thanks for clarifying that it is not a requirement for this patch.

          Raghu Angadi added a comment -

          +1 from my side. Thanks, Hairong.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377286/closeConnection2.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 12 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1907/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1907/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1907/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1907/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          The attached patch fixes the Findbugs warning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377300/closeConnection3.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 12 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1911/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1911/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1911/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1911/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          I do not think this Findbugs warning needs to be fixed, but the attached patch should fix it anyway.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377394/closeConnection4.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 12 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1916/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1916/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1916/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1916/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          I just committed this. Thanks Hairong!

          Hudson added a comment -

          Integrated in Hadoop-trunk #426 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/426/ )

          Tom White added a comment -

          Just noticed that this change removed the public RPC.stopClient() method. It's still used by DFSLocationsRoot in contrib. Should we add it back as a deprecated no-op? Or at least mark this change as incompatible in the change log and fix contrib.
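
          If the deprecated no-op route were taken, it would presumably look something like the sketch below (illustrative only, not committed code):

          // Sketch of the deprecated no-op suggested above: keeps callers such as
          // DFSLocationsRoot compiling while doing nothing, since clients are now
          // reference counted and closed through RPC.stopProxy().
          public class RpcCompatSketch {
            /** @deprecated clients are closed via stopProxy() now. */
            @Deprecated
            public static void stopClient() {
              // intentionally a no-op
            }
          }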


            People

             • Assignee: Hairong Kuang
             • Reporter: Hairong Kuang
             • Votes: 0
             • Watchers: 0
