Avro
  1. Avro
  2. AVRO-842

Internal NPE when accessing proxy to a closed server endpoint

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.5.2
    • Component/s: java
    • Labels:
      None
    • Environment:

      generic

      Description

      For development of an avro-based transport for remote OSGi services I had to wrap & unwind transport-level exceptions into proper exceptions thrown to a client. One such scenario is accessing a disconnected/dead endpoint, which should be a detectable condition. This works fine with the SaslSocketTransceiver (where my wrapper can properly detect the IO Exception), but not with Netty which hiccups with an internal NPE.
      Generally speaking any transport-level and remote-endpoint exceptions must not only be clearly detectable (either directly by client code or wrapper infrastructure around avro) but also distinguishable from another.

      1. CrashTest.java
        2 kB
        Holger Hoffstätte
      2. AVRO-842-v3.patch
        21 kB
        James Baldassari
      3. AVRO-842-v2.patch
        21 kB
        James Baldassari
      4. AVRO-842.patch
        16 kB
        James Baldassari

        Activity

        Holger Hoffstätte created issue -
        Hide
        Holger Hoffstätte added a comment -

        Small example that demonstrates the internal NPE when accessing a proxy after closing a Netty-based server. This works "fine" (meaning: in a detectable way) with the SaslSocketServer.

        Show
        Holger Hoffstätte added a comment - Small example that demonstrates the internal NPE when accessing a proxy after closing a Netty-based server. This works "fine" (meaning: in a detectable way) with the SaslSocketServer.
        Holger Hoffstätte made changes -
        Field Original Value New Value
        Attachment CrashTest.java [ 12483701 ]
        Hide
        Holger Hoffstätte added a comment -

        I believe this is related to AVRO-747, which seems to mention the same problem.

        Show
        Holger Hoffstätte added a comment - I believe this is related to AVRO-747 , which seems to mention the same problem.
        Hide
        James Baldassari added a comment -

        I reproduced this in trunk (thanks for the test!), and I assume it's a problem in the 1.5 branch as well. I'll see if I can put together a patch to fix this issue as I've recently done some work on the Netty server and transceiver.

        Show
        James Baldassari added a comment - I reproduced this in trunk (thanks for the test!), and I assume it's a problem in the 1.5 branch as well. I'll see if I can put together a patch to fix this issue as I've recently done some work on the Netty server and transceiver.
        Hide
        James Baldassari added a comment -

        I have a patch that fixes this issue that I'll post shortly. The main changes are:

        • Allow a couple of methods in Transceiver and Requestor to throw IOException
        • Don't try to use the Netty channel if it is null. Instead, we try to reconnect, which will result in an IOException if the connection fails. This also allows the NettyTransceiver to recover from temporary network outages.
        • Fixed potential deadlock in NettyTransceiver (that I introduced in AVRO-539) between close() and NettyTransceiver.handleUpstream(ChannelHandlerContext, ChannelEvent)

        If the connection to the server goes down and the client subsequently tries to invoke another RPC:

        • If using the Callback API an IOException will be thrown
        • If using the non-Callback API, the IOException will be wrapped in an AvroRemoteException because those methods can only throw AvroRemoteException
          • The alternatives are to allow the proxy to throw UndeclaredThrowableException or to throw a RuntimeException, but neither of these options seemed like the best choice to me

        I wrote a new unit test in TestNettyServerWithCallbacks that mimics the behavior of Holger's test but also tests the Callback API. All existing unit tests pass.

        Show
        James Baldassari added a comment - I have a patch that fixes this issue that I'll post shortly. The main changes are: Allow a couple of methods in Transceiver and Requestor to throw IOException Don't try to use the Netty channel if it is null. Instead, we try to reconnect, which will result in an IOException if the connection fails. This also allows the NettyTransceiver to recover from temporary network outages. Fixed potential deadlock in NettyTransceiver (that I introduced in AVRO-539 ) between close() and NettyTransceiver.handleUpstream(ChannelHandlerContext, ChannelEvent) If the connection to the server goes down and the client subsequently tries to invoke another RPC: If using the Callback API an IOException will be thrown If using the non-Callback API, the IOException will be wrapped in an AvroRemoteException because those methods can only throw AvroRemoteException The alternatives are to allow the proxy to throw UndeclaredThrowableException or to throw a RuntimeException, but neither of these options seemed like the best choice to me I wrote a new unit test in TestNettyServerWithCallbacks that mimics the behavior of Holger's test but also tests the Callback API. All existing unit tests pass.
        Hide
        James Baldassari added a comment -

        Here's a patch for trunk. This also applies cleanly to the 1.5 branch. All tests pass in trunk and 1.5.

        Show
        James Baldassari added a comment - Here's a patch for trunk. This also applies cleanly to the 1.5 branch. All tests pass in trunk and 1.5.
        James Baldassari made changes -
        Attachment AVRO-842.patch [ 12483769 ]
        Hide
        James Baldassari added a comment -

        Here's a new version of the patch that adds the enhancement requested in AVRO-747.

        Show
        James Baldassari added a comment - Here's a new version of the patch that adds the enhancement requested in AVRO-747 .
        James Baldassari made changes -
        Attachment AVRO-842-v2.patch [ 12483790 ]
        Hide
        Holger Hoffstätte added a comment -

        I've given this a try; it applies cleanly and runs. Nice! A few more/new observations. When I run this against my initial test I've noticed that:

        • the client Transceiver now hangs on close() and the VM does not exit properly any longer
        • because the Transceiver's threads hang around
        • because the Transceiver's factory.releaseExternalResources() is never called
        • because close() calls disconnect(true, true) which waits for channel shutdown forever or hiccups, throwing more exceptions.

        There also seems to be an awful lot of exception logging in various places (in -ipc in general). This makes things a lot harder to read or track and very confusing - propagate, handle or log but not all three at once

        As far as the propagated exception goes, wrapping the IOException into an AvroException with cause is fine and consistent with the SaslSocketServer. We just need some consistent way to figure out that something went kaput.

        Show
        Holger Hoffstätte added a comment - I've given this a try; it applies cleanly and runs. Nice! A few more/new observations. When I run this against my initial test I've noticed that: the client Transceiver now hangs on close() and the VM does not exit properly any longer because the Transceiver's threads hang around because the Transceiver's factory.releaseExternalResources() is never called because close() calls disconnect(true, true) which waits for channel shutdown forever or hiccups, throwing more exceptions. There also seems to be an awful lot of exception logging in various places (in -ipc in general). This makes things a lot harder to read or track and very confusing - propagate, handle or log but not all three at once As far as the propagated exception goes, wrapping the IOException into an AvroException with cause is fine and consistent with the SaslSocketServer. We just need some consistent way to figure out that something went kaput.
        Hide
        James Baldassari added a comment -

        Holger, thanks for trying out the patch! I investigated the hang on close() issue you reported, and I found the cause. It's strange that the unit tests I wrote were not affected by this. It only happened when I ran the test as a normal java process rather than a unit test. Anyway, I'll prepare a new patch. I would appreciate it if you could test this one out too.

        Show
        James Baldassari added a comment - Holger, thanks for trying out the patch! I investigated the hang on close() issue you reported, and I found the cause. It's strange that the unit tests I wrote were not affected by this. It only happened when I ran the test as a normal java process rather than a unit test. Anyway, I'll prepare a new patch. I would appreciate it if you could test this one out too.
        Hide
        James Baldassari added a comment -

        Here's v3 of the patch which fixes the issue with the Netty worker thread pool not closing when NettyTransceiver.close() is called.

        Show
        James Baldassari added a comment - Here's v3 of the patch which fixes the issue with the Netty worker thread pool not closing when NettyTransceiver.close() is called.
        James Baldassari made changes -
        Attachment AVRO-842-v3.patch [ 12484444 ]
        Doug Cutting made changes -
        Assignee James Baldassari [ jbaldassari ]
        Hide
        Doug Cutting added a comment -

        Holger, have you had a chance to try this?

        Also, should we target this fix for 1.5.2?

        Show
        Doug Cutting added a comment - Holger, have you had a chance to try this? Also, should we target this fix for 1.5.2?
        Hide
        James Baldassari added a comment -

        Hi Doug,

        Yes, this patch should go into 1.5.2. It has some important fixes, and it should apply cleanly to the 1.5 branch. If Holger has time to test it out one more time, that would be great, but I'm pretty confident that the most recent version of the path fixes the issues reported here (and in AVRO-747).

        Show
        James Baldassari added a comment - Hi Doug, Yes, this patch should go into 1.5.2. It has some important fixes, and it should apply cleanly to the 1.5 branch. If Holger has time to test it out one more time, that would be great, but I'm pretty confident that the most recent version of the path fixes the issues reported here (and in AVRO-747 ).
        Doug Cutting made changes -
        Fix Version/s 1.5.2 [ 12316398 ]
        Hide
        Holger Hoffstätte added a comment -

        Just gave patch v3 a try and it now works fine. The VM even correctly exits when the client is not shut down explicitly, since the ExecutorService reaps its thread after a minute and then cleanly exits. This shows that there are no accidental leaks. nice

        Show
        Holger Hoffstätte added a comment - Just gave patch v3 a try and it now works fine. The VM even correctly exits when the client is not shut down explicitly, since the ExecutorService reaps its thread after a minute and then cleanly exits. This shows that there are no accidental leaks. nice
        Hide
        James Baldassari added a comment -

        Thanks for verifying, Holger!

        Show
        James Baldassari added a comment - Thanks for verifying, Holger!
        Hide
        Doug Cutting added a comment -

        I committed this. Thanks, James!

        Show
        Doug Cutting added a comment - I committed this. Thanks, James!
        Doug Cutting made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            James Baldassari
            Reporter:
            Holger Hoffstätte
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development