Flume
  1. Flume
  2. FLUME-1254

RpcClient can hang when communication is broken with the source.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.2.0
    • Component/s: None
    • Labels:
      None

      Description

      If the source that the client is connected to fails, the RpcClient can hang indefinitely. This problem also affects the AvroCLIClient and AvroSink since they internally use the same RpcClient.

      1. FLUME-1254-4.patch
        28 kB
        Arvind Prabhakar
      2. FLUME-1254-3.patch
        28 kB
        Arvind Prabhakar
      3. FLUME-1254-2.patch
        28 kB
        Arvind Prabhakar

        Activity

        Hide
        Arvind Prabhakar added a comment -

        Review board is down still. Will post a review when it comes back up.

        Show
        Arvind Prabhakar added a comment - Review board is down still. Will post a review when it comes back up.
        Hide
        Arvind Prabhakar added a comment -

        This patch does the following:

        • Upgrades Avro to version 1.6.3 which fixes two underlying bugs AVRO-982 and AVRO-1013.
        • Modifies the NettyAvroRpcClient to use a custom ThreadFactory for the transceiver that sets each thread as a daemon.
        • Modifies the NettyAvroRpcClient to allow the configuration of connection and request timeouts
        • Modifies the FailoverRpcClient to use the properties based factory method for creating individual NettyAvroRpcClients
        • General clean up of configuration keys/default values and tests
        • Modifies the AvroSink to startup normally even if there is a failure while establishing connection since the process() retries establishing the connection anyway.

        Ran all tests. Will be doning more manual testing in the meantime.

        Show
        Arvind Prabhakar added a comment - This patch does the following: Upgrades Avro to version 1.6.3 which fixes two underlying bugs AVRO-982 and AVRO-1013 . Modifies the NettyAvroRpcClient to use a custom ThreadFactory for the transceiver that sets each thread as a daemon. Modifies the NettyAvroRpcClient to allow the configuration of connection and request timeouts Modifies the FailoverRpcClient to use the properties based factory method for creating individual NettyAvroRpcClients General clean up of configuration keys/default values and tests Modifies the AvroSink to startup normally even if there is a failure while establishing connection since the process() retries establishing the connection anyway. Ran all tests. Will be doning more manual testing in the meantime.
        Hide
        Arvind Prabhakar added a comment -

        Fixed a problem with the sink runner due to which the AvroSink when not connected can spin out of control leading to OOM exceptions.

        Show
        Arvind Prabhakar added a comment - Fixed a problem with the sink runner due to which the AvroSink when not connected can spin out of control leading to OOM exceptions.
        Hide
        Mike Percy added a comment -

        Hi Arvind, I'm looking at your changes. Can you please rebase the patch? It's not applying cleanly for me.

        Show
        Mike Percy added a comment - Hi Arvind, I'm looking at your changes. Can you please rebase the patch? It's not applying cleanly for me.
        Hide
        Arvind Prabhakar added a comment -

        Thanks Mike. Rebased the patch to latest trunk.

        Show
        Arvind Prabhakar added a comment - Thanks Mike. Rebased the patch to latest trunk.
        Hide
        Mike Percy added a comment - - edited

        Hi Arvind,
        These changes look really good. Great job digging into and fixing this stuff!

        Comments:

        • I was going to say that caching InetSocketAddress instead of using the new HostInfo class might be a good thing for performance. But then again it makes sense to allow DNS to change under us... so, not caching each InetSocketAddress instance actually a reliability improvement in that respect. Awesome.

        Just a few nits:
        1. RpcClientConfigurationConstants.java

        • DEFAULT_CONNECT_TIMEOUT_MILLIS - maybe make it 5 seconds default? 60 seconds seems like a long time. I know this was there from before.
        • DEFAULT_REQUEST_TIMEOUT_MILLIS - maybe make it 20 seconds default? (Same as above)

        2. It seems like much of the code in these classes is just parsing properties file configuration. Would be nice if it was factored out somehow (using Context object internally?) - maybe this should be another JIRA

        3. Doesn't look like the commons-compress addition to pom.xml is used.

        Show
        Mike Percy added a comment - - edited Hi Arvind, These changes look really good. Great job digging into and fixing this stuff! Comments: I was going to say that caching InetSocketAddress instead of using the new HostInfo class might be a good thing for performance. But then again it makes sense to allow DNS to change under us... so, not caching each InetSocketAddress instance actually a reliability improvement in that respect. Awesome. Just a few nits: 1. RpcClientConfigurationConstants.java DEFAULT_CONNECT_TIMEOUT_MILLIS - maybe make it 5 seconds default? 60 seconds seems like a long time. I know this was there from before. DEFAULT_REQUEST_TIMEOUT_MILLIS - maybe make it 20 seconds default? (Same as above) 2. It seems like much of the code in these classes is just parsing properties file configuration. Would be nice if it was factored out somehow (using Context object internally?) - maybe this should be another JIRA 3. Doesn't look like the commons-compress addition to pom.xml is used.
        Hide
        Arvind Prabhakar added a comment -

        Thanks for the review Mike.

        * DEFAULT_CONNECT_TIMEOUT_MILLIS - maybe make it 5 seconds default? 60 seconds seems like a long time. I know this was there from before.

        done

        * DEFAULT_REQUEST_TIMEOUT_MILLIS - maybe make it 20 seconds default? (Same as above)

        done

        2. It seems like much of the code in these classes is just parsing properties file configuration. Would be nice if it was factored out somehow (using Context object internally?) - maybe this should be another JIRA

        The rational behind having properties based factory methods was to eventually adopt a properties file based client configuration. So RpcClientFactory.getInstance(CONFIG_FILE_NAME) will return a client that is configured vi the property file CONFIG_FILE_NAME. Although it could internally construct a Context to do getInteger() etc calls, but that would be an implementation detail. I would much rather leave it as is for now.

        3. Doesn't look like the commons-compress addition to pom.xml is used.

        Removed.

        Show
        Arvind Prabhakar added a comment - Thanks for the review Mike. * DEFAULT_CONNECT_TIMEOUT_MILLIS - maybe make it 5 seconds default? 60 seconds seems like a long time. I know this was there from before. done * DEFAULT_REQUEST_TIMEOUT_MILLIS - maybe make it 20 seconds default? (Same as above) done 2. It seems like much of the code in these classes is just parsing properties file configuration. Would be nice if it was factored out somehow (using Context object internally?) - maybe this should be another JIRA The rational behind having properties based factory methods was to eventually adopt a properties file based client configuration. So RpcClientFactory.getInstance(CONFIG_FILE_NAME) will return a client that is configured vi the property file CONFIG_FILE_NAME . Although it could internally construct a Context to do getInteger() etc calls, but that would be an implementation detail. I would much rather leave it as is for now. 3. Doesn't look like the commons-compress addition to pom.xml is used. Removed.
        Hide
        Arvind Prabhakar added a comment -

        Updated with review feedback.

        Show
        Arvind Prabhakar added a comment - Updated with review feedback.
        Hide
        Mike Percy added a comment -

        +1

        Show
        Mike Percy added a comment - +1
        Hide
        Mike Percy added a comment -

        Patch committed. Thanks Arvind!

        Show
        Mike Percy added a comment - Patch committed. Thanks Arvind!
        Hide
        Hudson added a comment -

        Integrated in flume-trunk #222 (See https://builds.apache.org/job/flume-trunk/222/)
        FLUME-1254. RpcClient can hang when communication is broken with the source.

        (Arvind Prabhakar via Mike Percy) (Revision 1346220)

        Result = FAILURE
        mpercy : http://svn.apache.org/viewvc/?view=rev&rev=1346220
        Files :

        • /incubator/flume/trunk/flume-ng-core/src/main/java/org/apache/flume/SinkRunner.java
        • /incubator/flume/trunk/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java
        • /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/AbstractRpcClient.java
        • /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/FailoverRpcClient.java
        • /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/NettyAvroRpcClient.java
        • /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/RpcClientConfigurationConstants.java
        • /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/RpcClientFactory.java
        • /incubator/flume/trunk/flume-ng-sdk/src/test/java/org/apache/flume/api/TestFailoverRpcClient.java
        • /incubator/flume/trunk/pom.xml
        Show
        Hudson added a comment - Integrated in flume-trunk #222 (See https://builds.apache.org/job/flume-trunk/222/ ) FLUME-1254 . RpcClient can hang when communication is broken with the source. (Arvind Prabhakar via Mike Percy) (Revision 1346220) Result = FAILURE mpercy : http://svn.apache.org/viewvc/?view=rev&rev=1346220 Files : /incubator/flume/trunk/flume-ng-core/src/main/java/org/apache/flume/SinkRunner.java /incubator/flume/trunk/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/AbstractRpcClient.java /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/FailoverRpcClient.java /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/NettyAvroRpcClient.java /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/RpcClientConfigurationConstants.java /incubator/flume/trunk/flume-ng-sdk/src/main/java/org/apache/flume/api/RpcClientFactory.java /incubator/flume/trunk/flume-ng-sdk/src/test/java/org/apache/flume/api/TestFailoverRpcClient.java /incubator/flume/trunk/pom.xml

          People

          • Assignee:
            Arvind Prabhakar
            Reporter:
            Arvind Prabhakar
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development