Avro
  1. Avro
  2. AVRO-1292

NettyTransceiver: Client threads can block under certain connection failure scenarios

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.4
    • Fix Version/s: 1.7.5
    • Component/s: java
    • Labels:

      Description

      I've recently found a couple of different failure scenarios with NettyTransceiver that result in:

      • Client threads blocking for long periods of time (uninterruptibly at that) while holding the stateLock write lock
      • RPCs (either sync or async) never returning because a failure in sending the RPC was not propagated back up to the caller

      The patch I'm going to submit will probably be a lot easier to understand, but I'll try to explain the main problems I found. There is a single type of underlying connectivity issue that seems to trigger both of these problems in NettyTransceiver: a failure at the network layer causes all packets to be dropped somewhere between the RPC client and server. You might think this would be a rare scenario, but it has happened several times in our production environment and usually occurs after the RPC server machine becomes unresponsive and needs to be physically rebooted. The only way I've been able to reproduce this scenario for testing purposes has been to set up an iptables rule on the RPC server that simply drops all incoming packets from the client. For example, if the client's IP is 10.0.0.1 I would use the following iptables rule on the server to reproduce the failure:

      iptables -t mangle -A INPUT --source 10.0.0.1 -j DROP
      

      After looking through a lot of stack traces I think I've identified 2 main problems:

      Problem 1: NettyTransceiver calls ChannelFuture#awaitUninterruptibly(long) in a couple places, getChannel() and disconnect(boolean,boolean,Throwable). Under the dropped packet scenario I outlined above, the client thread ends up blocking uninterruptibly for the entire connection timeout duration while holding the stateLock write lock. The stack trace for this situation looks like this:

      "RPC Executor - 11 - 1363627762930" daemon prio=10 tid=0x00002aaad005f000 nid=0x56cf in Object.wait() [0x0000000049344000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:443)
              at org.jboss.netty.channel.DefaultChannelFuture.await0(DefaultChannelFuture.java:265)
              - locked <0x0000000703acfa00> (a org.jboss.netty.channel.DefaultChannelFuture)
              at org.jboss.netty.channel.DefaultChannelFuture.awaitUninterruptibly(DefaultChannelFuture.java:237)
              at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:248)
              at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:199)
              at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:148)
      

      At a minimum it should be possible to interrupt these connection attempts.

      Problem 2: When an error occurs writing to the Netty channel the error is not passed back up the stack or callback chain (whether it's a sync or async RPC), so the client can end up waiting indefinitely for an RPC that will never return because an error occurred sending the Netty packet (i.e. it was never sent to the server in the first place). This scenario might yield a stack trace like the following:

      "main" prio=10 tid=0x00007f9400008800 nid=0x379b waiting on condition [0x00007f9406bc6000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000007af677960> (a java.util.concurrent.CountDownLatch$Sync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
              at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
              at org.apache.avro.ipc.CallFuture.await(CallFuture.java:141)
              at org.apache.avro.ipc.Requestor.request(Requestor.java:150)
              at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
              at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
              at $Proxy9.send(Unknown Source)
      
      

      It's difficult to provide a unit test for these issues because a connection refused error alone will not trigger it. The only way I've been able to reliably reproduce it is by setting the iptables rule I mentioned above. Hopefully a code review will be sufficient, but if necessary I can try to find a way to create a unit test.

      1. AVRO-1292-Part2-v2.patch
        6 kB
        James Baldassari
      2. AVRO-1292-Part2.patch
        6 kB
        James Baldassari
      3. AVRO-1292-Part1.patch
        1 kB
        James Baldassari

        Activity

        Hide
        James Baldassari added a comment -

        Attaching Part 1 of the patch:

        • Changed all awaitUninterruptibly(long) calls to await(long) with a try/catch that resets the interrupt flag and then rethrows the InterruptedException as an IOException
        Show
        James Baldassari added a comment - Attaching Part 1 of the patch: Changed all awaitUninterruptibly(long) calls to await(long) with a try/catch that resets the interrupt flag and then rethrows the InterruptedException as an IOException
        Hide
        James Baldassari added a comment -

        Attaching Part 2 of the patch:

        • Add a ChannelFutureListener on all async transceive calls that will pass errors to the callback if an exception is thrown while writing data to the netty channel
        • writeBuffers (synchronous call) now waits for the data to be written to the OS network stack before returning so that it can pass up any errors that may occur
        • Increased visibility on a few methods and inner classes from private/package to protected to allow subclasses to override some behaviors since NettyTransceiver doesn't allow much customization currently (e.g. no event listener interface for when a connection is closed).
        Show
        James Baldassari added a comment - Attaching Part 2 of the patch: Add a ChannelFutureListener on all async transceive calls that will pass errors to the callback if an exception is thrown while writing data to the netty channel writeBuffers (synchronous call) now waits for the data to be written to the OS network stack before returning so that it can pass up any errors that may occur Increased visibility on a few methods and inner classes from private/package to protected to allow subclasses to override some behaviors since NettyTransceiver doesn't allow much customization currently (e.g. no event listener interface for when a connection is closed). Enabled tcp keep-alive by default. I can't think of a reason why a client wouldn't want to use this feature to preemptively detect connectivity issues This was also brought up by Frank Grimes about a little over a year ago in this thread: http://apache-avro.679487.n3.nabble.com/NettyServer-not-setting-TCP-keep-alive-leading-to-leaked-connections-td3788498.html
        Hide
        James Baldassari added a comment -

        Noticed one minor issue with Part 2 of the patch. The correct keep-alive parameter for the client bootstrap is keepAlive rather than child.keepAlive. The latter is the option for the server bootstrap.

        Show
        James Baldassari added a comment - Noticed one minor issue with Part 2 of the patch. The correct keep-alive parameter for the client bootstrap is keepAlive rather than child.keepAlive. The latter is the option for the server bootstrap.
        Hide
        Doug Cutting added a comment -

        James, is this ready to commit? What kind of testing have you done? Is it possible to add tests for this?

        Show
        Doug Cutting added a comment - James, is this ready to commit? What kind of testing have you done? Is it possible to add tests for this?
        Hide
        James Baldassari added a comment -

        Hi Doug. Yes, it's ready to commit. The only way I've found to reproduce this issue is by using iptables to cause packets to suddenly be dropped as described above. I can't think of a way to unit test this fix since the problem seems to be triggered by things happening at the network layer. It might be possible to simulate the issue in a unit test using a modified version of NettyServer that is designed to hang in certain situations. If you'd like me to give that a shot I can try it.

        Show
        James Baldassari added a comment - Hi Doug. Yes, it's ready to commit. The only way I've found to reproduce this issue is by using iptables to cause packets to suddenly be dropped as described above. I can't think of a way to unit test this fix since the problem seems to be triggered by things happening at the network layer. It might be possible to simulate the issue in a unit test using a modified version of NettyServer that is designed to hang in certain situations. If you'd like me to give that a shot I can try it.
        Hide
        James Baldassari added a comment -

        I've been trying to come up with a unit test that will reproduce these problems, but unfortunately I haven't been able to do so. It seems that these issues are triggered by certain failure conditions at the network layer only, such as when packets are dropped. I have tested this patch in a manual way as described in my earlier comment by using VMs and iptables to simulate network problems. If you can think of a way to simulate packet loss in a unit test, let me know, but otherwise I don't think it's going to be possible to come up with a good unit test for this.

        Show
        James Baldassari added a comment - I've been trying to come up with a unit test that will reproduce these problems, but unfortunately I haven't been able to do so. It seems that these issues are triggered by certain failure conditions at the network layer only, such as when packets are dropped. I have tested this patch in a manual way as described in my earlier comment by using VMs and iptables to simulate network problems. If you can think of a way to simulate packet loss in a unit test, let me know, but otherwise I don't think it's going to be possible to come up with a good unit test for this.
        Hide
        James Baldassari added a comment -

        Although I haven't been able to come up with a unit test to easily reproduce this issue, my organization has been running with this patch in production for several weeks now, and we haven't seen this deadlock occur since deploying the patch.

        Show
        James Baldassari added a comment - Although I haven't been able to come up with a unit test to easily reproduce this issue, my organization has been running with this patch in production for several weeks now, and we haven't seen this deadlock occur since deploying the patch.
        Hide
        ASF subversion and git services added a comment -

        Commit 1499040 from Doug Cutting
        [ https://svn.apache.org/r1499040 ]

        AVRO-1292. Java: Fix potential client blocking in NettyTransceiver. Contributed by James Baldassari.

        Show
        ASF subversion and git services added a comment - Commit 1499040 from Doug Cutting [ https://svn.apache.org/r1499040 ] AVRO-1292 . Java: Fix potential client blocking in NettyTransceiver. Contributed by James Baldassari.
        Hide
        Doug Cutting added a comment -

        I committed this. Thanks, James!

        Show
        Doug Cutting added a comment - I committed this. Thanks, James!
        Hide
        Hudson added a comment -

        Integrated in AvroJava #381 (See https://builds.apache.org/job/AvroJava/381/)
        AVRO-1292. Java: Fix potential client blocking in NettyTransceiver. Contributed by James Baldassari. (Revision 1499040)

        Result = SUCCESS
        cutting :
        Files :

        • /avro/trunk/CHANGES.txt
        • /avro/trunk/lang/java/ipc/src/main/java/org/apache/avro/ipc/NettyTransceiver.java
        Show
        Hudson added a comment - Integrated in AvroJava #381 (See https://builds.apache.org/job/AvroJava/381/ ) AVRO-1292 . Java: Fix potential client blocking in NettyTransceiver. Contributed by James Baldassari. (Revision 1499040) Result = SUCCESS cutting : Files : /avro/trunk/CHANGES.txt /avro/trunk/lang/java/ipc/src/main/java/org/apache/avro/ipc/NettyTransceiver.java

          People

          • Assignee:
            James Baldassari
            Reporter:
            James Baldassari
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development