Avro
  1. Avro
  2. AVRO-1027

NettyTransceiver will deadlock when attempting transceive/disconnect on the same thread

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: 1.6.3
    • Component/s: java
    • Labels:
      None

      Description

      If an Exception is caught while trying to write to a Channel, Netty can deliver the Exception to a ChannelUpstreamHandler on the same thread that attempted to write to the Channel. If this occurs with the NettyClientAvroHandler implementation of ChannelUpstreamHandler then the thread will deadlock.

      Specifically, NettyClientAvroHandler overrides the ChannelUpstreamHandler.exceptionCaught() method to perform a disconnect, which requires the NettyTransceiver's write lock. However, in the above situation, the thread will already have locked the NettyTransceiver's read lock to write to the Channel. ReentrantReadWriteLock does not allow upgrading from a read to a write lock, hence the thread deadlocks.

      Example stack trace (simplified):

      "SessionManager-TimeoutPoller" prio=10 tid=0x7b689c00 nid=0x375d waiting on condition [0x7b0ad000..0x7b0ade70]
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for <0xf2a944d8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
        >>> [Acquire write lock] at org.apache.avro.ipc.NettyTransceiver.disconnect(NettyTransceiver.java:285)
        at org.apache.avro.ipc.NettyTransceiver.access$2(NettyTransceiver.java:281)
        at org.apache.avro.ipc.NettyTransceiver$NettyClientAvroHandler.exceptionCaught(NettyTransceiver.java:499)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:122)
        at org.apache.avro.ipc.NettyTransceiver$NettyClientAvroHandler.handleUpstream(NettyTransceiver.java:473)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:238)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:122)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:432)
        at org.jboss.netty.channel.socket.nio.NioWorker.cleanUpWriteBuffer(NioWorker.java:661)
        at org.jboss.netty.channel.socket.nio.NioWorker.writeFromUserCode(NioWorker.java:372)
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:117)
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:771)
        at org.jboss.netty.channel.Channels.write(Channels.java:632)
        at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:70)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
        at org.jboss.netty.channel.Channels.write(Channels.java:611)
        at org.jboss.netty.channel.Channels.write(Channels.java:578)
        at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:251)
        >>> [Acquire read lock] at org.apache.avro.ipc.NettyTransceiver.writeDataPack(NettyTransceiver.java:413)
        >>> [Acquire read lock] at org.apache.avro.ipc.NettyTransceiver.transceive(NettyTransceiver.java:394)
        at org.apache.avro.ipc.Requestor.request(Requestor.java:147)
        at org.apache.avro.ipc.Requestor.request(Requestor.java:129)
        at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:68)
        <snip>

      Note, in Avro 1.6.1 the read lock is acquired in both NettyTransceiver.transceive() and NettyTransceiver.writeDataPack(). AVRO-1013 fixes this so that it is acquired only once in NettyTransceiver.transceive().

      I've attached a patch that demonstrates a potential fix for the deadlock; the patch assumes that AVRO-1013 has also been applied.

      1. AVRO-1027.patch
        2 kB
        Simon Wilkinson
      2. AVRO-1027-v2.patch
        2 kB
        James Baldassari

        Activity

        Hide
        Simon Wilkinson added a comment -

        Formatted:

        private void writeDataPack(NettyDataPack dataPack) throws IOException {
          Channel channel = getChannel(); // Channel appears ready
        
          // Server closes
        
          channel.write(dataPack); // Netty will notice the disconnection and handle on the calling thread
        }
        
        Show
        Simon Wilkinson added a comment - Formatted: private void writeDataPack(NettyDataPack dataPack) throws IOException { Channel channel = getChannel(); // Channel appears ready // Server closes channel.write(dataPack); // Netty will notice the disconnection and handle on the calling thread }
        Hide
        Simon Wilkinson added a comment -

        Correction, the server must close after the Netty Channel has been checked for 'readiness' in NettyTransceiver.writeDataPack(NettyDataPack dataPack), but before the request gets very far in the NettyTransceiver's Netty pipeline.

        I.e.:

        private void writeDataPack(NettyDataPack dataPack) throws IOException

        { Channel channel = getChannel(); // Channel appears ready // Server closes channel.write(dataPack); // Netty will notice the disconnection and handle on the calling thread }
        Show
        Simon Wilkinson added a comment - Correction, the server must close after the Netty Channel has been checked for 'readiness' in NettyTransceiver.writeDataPack(NettyDataPack dataPack), but before the request gets very far in the NettyTransceiver's Netty pipeline. I.e.: private void writeDataPack(NettyDataPack dataPack) throws IOException { Channel channel = getChannel(); // Channel appears ready // Server closes channel.write(dataPack); // Netty will notice the disconnection and handle on the calling thread }
        Hide
        Simon Wilkinson added a comment -

        No problem.

        I've been thinking about how to construct the unit test for this. As James has said, it will be tricky.
        Specifically, to force the disconnect, the server has to be closed after a client thread has acquired a NettyTransceiver's read lock in transceive(), but before the request gets very far in the NettyTransceiver's Netty pipeline.

        There doesn't seem any way to make this happen consistently without adding some debug synchronization to NettyTransceiver.

        Show
        Simon Wilkinson added a comment - No problem. I've been thinking about how to construct the unit test for this. As James has said, it will be tricky. Specifically, to force the disconnect, the server has to be closed after a client thread has acquired a NettyTransceiver's read lock in transceive(), but before the request gets very far in the NettyTransceiver's Netty pipeline. There doesn't seem any way to make this happen consistently without adding some debug synchronization to NettyTransceiver.
        Hide
        Doug Cutting added a comment -

        Reassigning credit. Thanks, Simon!

        Show
        Doug Cutting added a comment - Reassigning credit. Thanks, Simon!
        Hide
        James Baldassari added a comment -

        Doug, just a minor correction on the attribution for this patch. The original patch was contributed by Simon Wilkinson. I just updated it to apply cleanly with patch -p0, so I think Simon should probably be the contributor for this one.

        Show
        James Baldassari added a comment - Doug, just a minor correction on the attribution for this patch. The original patch was contributed by Simon Wilkinson. I just updated it to apply cleanly with patch -p0, so I think Simon should probably be the contributor for this one.
        Hide
        George Fletcher added a comment -

        A release candidate today would be fantastic! Thanks!

        Show
        George Fletcher added a comment - A release candidate today would be fantastic! Thanks!
        Hide
        Doug Cutting added a comment -

        I committed this. Thanks, James!

        Show
        Doug Cutting added a comment - I committed this. Thanks, James!
        Hide
        Doug Cutting added a comment -

        James, thanks for running your tests.

        I guess I'll go ahead and commit this now and roll a release candidate today.

        Show
        Doug Cutting added a comment - James, thanks for running your tests. I guess I'll go ahead and commit this now and roll a release candidate today.
        Hide
        James Baldassari added a comment -

        All my tests are passing with trunk + AVRO-1027

        Show
        James Baldassari added a comment - All my tests are passing with trunk + AVRO-1027
        Hide
        James Baldassari added a comment -

        Creating a test that reliably triggers these types of deadlocks can be tricky. I fixed a bug not too long ago that I could reproduce reliably with my own unit tests but couldn't seem to get it to happen in the Avro tests for whatever reason. I think this is a fairly low-risk change, and it would be unfortunate if people hit this in production. So I guess my vote would be a +0 on getting this in for 1.6.3 whether the RC gets rolled today or early next week.

        I could build a snapshot off trunk + AVRO-1027 and run it through my own suite of Avro Netty unit tests. That might at least give us some additional confidence that this won't introduce any regressions.

        Show
        James Baldassari added a comment - Creating a test that reliably triggers these types of deadlocks can be tricky. I fixed a bug not too long ago that I could reproduce reliably with my own unit tests but couldn't seem to get it to happen in the Avro tests for whatever reason. I think this is a fairly low-risk change, and it would be unfortunate if people hit this in production. So I guess my vote would be a +0 on getting this in for 1.6.3 whether the RC gets rolled today or early next week. I could build a snapshot off trunk + AVRO-1027 and run it through my own suite of Avro Netty unit tests. That might at least give us some additional confidence that this won't introduce any regressions.
        Hide
        Doug Cutting added a comment -

        Should we:

        • a. commit this without tests and roll a 1.6.3 RC today
        • b. hold off on 1.6.3 until this has tests next week
        • c. roll a 1.6.3 RC today without this

        I don't have a strong opinion.

        Show
        Doug Cutting added a comment - Should we: a. commit this without tests and roll a 1.6.3 RC today b. hold off on 1.6.3 until this has tests next week c. roll a 1.6.3 RC today without this I don't have a strong opinion.
        Hide
        Simon Wilkinson added a comment -

        Unfortunately I don't have any time to work on a test until Monday.

        Show
        Simon Wilkinson added a comment - Unfortunately I don't have any time to work on a test until Monday.
        Hide
        Doug Cutting added a comment -

        Simon, do you expect to supply a test today?

        It would be good to include this in 1.6.3, but I don't think it's a showstopper, since it's not a regression, is it?

        Show
        Doug Cutting added a comment - Simon, do you expect to supply a test today? It would be good to include this in 1.6.3, but I don't think it's a showstopper, since it's not a regression, is it?
        Hide
        Simon Wilkinson added a comment -

        I could trigger the bug as part of a larger test. I'll see if I can extract this into a specific unit test.

        Show
        Simon Wilkinson added a comment - I could trigger the bug as part of a larger test. I'll see if I can extract this into a specific unit test.
        Hide
        James Baldassari added a comment -

        Simon, do you have a test for this? If not, I'll try to make one. Not sure when I'll have time to work on it, though.

        Show
        James Baldassari added a comment - Simon, do you have a test for this? If not, I'll try to make one. Not sure when I'll have time to work on it, though.
        Hide
        Scott Carey added a comment -

        +1
        Looks good. Is there a test that could cover this?

        Show
        Scott Carey added a comment - +1 Looks good. Is there a test that could cover this?
        Hide
        James Baldassari added a comment -

        +1
        Thanks for the patch. I'm going to attach a different version with the same changes that applies cleanly from trunk with patch -p0. It would be good to get this in for Avro 1.6.2.

        Show
        James Baldassari added a comment - +1 Thanks for the patch. I'm going to attach a different version with the same changes that applies cleanly from trunk with patch -p0. It would be good to get this in for Avro 1.6.2.

          People

          • Assignee:
            Simon Wilkinson
            Reporter:
            Simon Wilkinson
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development