Apache Avro / AVRO-1292

NettyTransceiver: Client threads can block under certain connection failure scenarios


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.4
    • Fix Version/s: 1.7.5
    • Component/s: java

    Description

      I've recently found a couple of different failure scenarios with NettyTransceiver that result in:

      • Client threads blocking for long periods of time (uninterruptibly at that) while holding the stateLock write lock
      • RPCs (either sync or async) never returning because a failure in sending the RPC was not propagated back up to the caller

      The patch I'm going to submit will probably be a lot easier to understand, but I'll try to explain the main problems I found. There is a single type of underlying connectivity issue that seems to trigger both of these problems in NettyTransceiver: a failure at the network layer causes all packets to be dropped somewhere between the RPC client and server. You might think this would be a rare scenario, but it has happened several times in our production environment and usually occurs after the RPC server machine becomes unresponsive and needs to be physically rebooted. The only way I've been able to reproduce this scenario for testing purposes has been to set up an iptables rule on the RPC server that simply drops all incoming packets from the client. For example, if the client's IP is 10.0.0.1 I would use the following iptables rule on the server to reproduce the failure:

      iptables -t mangle -A INPUT --source 10.0.0.1 -j DROP
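
      For completeness, the client-side half of the reproduction looks roughly like the sketch below (using NettyTransceiver's connect-timeout constructor). The class name, host, port, and timeout are placeholders rather than our real settings, and the DROP rule above must already be in place on the server:

      import java.net.InetSocketAddress;
      import org.apache.avro.ipc.NettyTransceiver;

      public class BlackholeRepro {
          public static void main(String[] args) throws Exception {
              // Because the server silently drops every packet from this client, the TCP
              // handshake never completes. The constructor ends up in getChannel() ->
              // ChannelFuture#awaitUninterruptibly(long) and blocks, uninterruptibly,
              // for the full connect timeout while holding the stateLock write lock.
              NettyTransceiver transceiver =
                  new NettyTransceiver(new InetSocketAddress("10.0.0.2", 65111), 60000L);
              transceiver.close();
          }
      }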
      

      After looking through a lot of stack traces, I think I've identified two main problems:

      Problem 1: NettyTransceiver calls ChannelFuture#awaitUninterruptibly(long) in a couple of places: getChannel() and disconnect(boolean,boolean,Throwable). Under the dropped-packet scenario outlined above, the client thread ends up blocking uninterruptibly for the entire connection timeout while holding the stateLock write lock. The stack trace for this situation looks like this:

      "RPC Executor - 11 - 1363627762930" daemon prio=10 tid=0x00002aaad005f000 nid=0x56cf in Object.wait() [0x0000000049344000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:443)
              at org.jboss.netty.channel.DefaultChannelFuture.await0(DefaultChannelFuture.java:265)
              - locked <0x0000000703acfa00> (a org.jboss.netty.channel.DefaultChannelFuture)
              at org.jboss.netty.channel.DefaultChannelFuture.awaitUninterruptibly(DefaultChannelFuture.java:237)
              at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:248)
              at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:199)
              at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:148)
      

      At a minimum it should be possible to interrupt these connection attempts.
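
      As an illustration only (this is not the actual patch, and the helper class below is hypothetical), the same wait could be performed interruptibly with ChannelFuture#await(long), which would let a stuck connect attempt be aborted:

      import java.io.IOException;
      import java.net.InetSocketAddress;
      import org.jboss.netty.bootstrap.ClientBootstrap;
      import org.jboss.netty.channel.Channel;
      import org.jboss.netty.channel.ChannelFuture;

      final class InterruptibleConnect {
          /** Waits for the connect interruptibly instead of via awaitUninterruptibly(long). */
          static Channel connect(ClientBootstrap bootstrap, InetSocketAddress remoteAddr,
                                 long connectTimeoutMillis) throws IOException {
              ChannelFuture future = bootstrap.connect(remoteAddr);
              try {
                  // await(long) lets the calling thread be interrupted, so a stuck connect
                  // no longer pins the thread (and the stateLock write lock it holds) for
                  // the entire timeout the way awaitUninterruptibly(long) does.
                  if (!future.await(connectTimeoutMillis)) {
                      future.cancel();
                      throw new IOException("Connect timed out after " + connectTimeoutMillis + " ms");
                  }
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  future.cancel();
                  throw new IOException("Interrupted while connecting to " + remoteAddr);
              }
              if (!future.isSuccess()) {
                  throw new IOException("Error connecting to " + remoteAddr, future.getCause());
              }
              return future.getChannel();
          }
      }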

      Problem 2: When an error occurs while writing to the Netty channel, the error is not propagated back up the stack or callback chain (whether the RPC is sync or async), so the client can end up waiting indefinitely for a response that will never arrive because the request failed at the Netty layer and was never sent to the server in the first place. This scenario might yield a stack trace like the following:

      "main" prio=10 tid=0x00007f9400008800 nid=0x379b waiting on condition [0x00007f9406bc6000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000007af677960> (a java.util.concurrent.CountDownLatch$Sync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
              at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
              at org.apache.avro.ipc.CallFuture.await(CallFuture.java:141)
              at org.apache.avro.ipc.Requestor.request(Requestor.java:150)
              at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
              at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
              at $Proxy9.send(Unknown Source)
      
      

      It's difficult to provide a unit test for these issues because a connection refused error alone will not trigger them. The only way I've been able to reproduce them reliably is by adding the iptables rule mentioned above. Hopefully a code review will be sufficient, but if necessary I can try to find a way to create a unit test.
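
      As an illustration of the kind of change Problem 2 calls for (again, this is not the patch itself; the helper class and its wiring into NettyTransceiver are hypothetical), a write failure could be surfaced through the existing Callback so the blocked Requestor.request() call completes exceptionally instead of waiting forever:

      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.util.List;
      import org.apache.avro.ipc.Callback;
      import org.apache.avro.ipc.NettyTransportCodec.NettyDataPack;
      import org.jboss.netty.channel.Channel;
      import org.jboss.netty.channel.ChannelFuture;
      import org.jboss.netty.channel.ChannelFutureListener;

      final class FailFastWrite {
          /** Writes the request and fails the RPC callback if the Netty write itself fails. */
          static void writeDataPack(final Channel channel, NettyDataPack dataPack,
                                    final Callback<List<ByteBuffer>> callback) {
              channel.write(dataPack).addListener(new ChannelFutureListener() {
                  @Override
                  public void operationComplete(ChannelFuture future) {
                      if (!future.isSuccess()) {
                          // Propagate the transport error instead of dropping it; the caller's
                          // CallFuture is then completed with this exception rather than
                          // blocking indefinitely in await().
                          callback.handleError(new IOException(
                              "Error writing RPC request to " + channel.getRemoteAddress(),
                              future.getCause()));
                      }
                  }
              });
          }
      }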

      Attachments

        1. AVRO-1292-Part1.patch (1 kB) - James Baldassari
        2. AVRO-1292-Part2.patch (6 kB) - James Baldassari
        3. AVRO-1292-Part2-v2.patch (6 kB) - James Baldassari


          People

            Assignee: James Baldassari (jbaldassari)
            Reporter: James Baldassari (jbaldassari)
            Votes: 0
            Watchers: 4
