Derby
  1. Derby
  2. DERBY-1425

testProtocol hangs with 10.1 client talking to 10.2 server

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 10.2.1.6
    • Fix Version/s: 10.7.1.1
    • Component/s: Network Server
    • Labels:
      None

      Description

      I was trying to verify that the changes in DERBY-920 hadn't
      introduced any new compatibility problems (they shouldn't, because
      we were changing an internal class, but I wanted to make sure).

      So I was trying to follow some old tips about how to run tests with
      an old client against a new server, as documented in:
      http://wiki.apache.org/db-derby/TestingOldClientNewServer

      However, when I did this, the test "testProtocol" did not terminate.

      1. test.diff
        3 kB
        Knut Anders Hatlen
      2. ddmreader_initialize.diff
        0.5 kB
        Knut Anders Hatlen

        Activity

        Hide
        Kristian Waagan added a comment -

        Closing issue.

        Show
        Kristian Waagan added a comment - Closing issue.
        Hide
        Knut Anders Hatlen added a comment -

        Removed network client from components since the problem was in the server code.

        Show
        Knut Anders Hatlen added a comment - Removed network client from components since the problem was in the server code.
        Hide
        Knut Anders Hatlen added a comment -

        All the regression tests ran cleanly. Committed revision 961271.

        I'm not planning to back-port the fix. It's most likely harmless, but since the analysis so far suggests that this is not a problem that will be seen when using the client driver to talk to the server, it's probably not worth the effort to back-port it.

        Show
        Knut Anders Hatlen added a comment - All the regression tests ran cleanly. Committed revision 961271. I'm not planning to back-port the fix. It's most likely harmless, but since the analysis so far suggests that this is not a problem that will be seen when using the client driver to talk to the server, it's probably not worth the effort to back-port it.
        Hide
        Bryan Pendleton added a comment -

        +1 to proceed with this patch. Thanks for returning to this problem and cleaning it up!

        Show
        Bryan Pendleton added a comment - +1 to proceed with this patch. Thanks for returning to this problem and cleaning it up!
        Hide
        Knut Anders Hatlen added a comment -

        Attaching a patch which makes derbynet/testProtocol.java expose the hang. The patch inserts a test case right before the one Bryan saw hanging, and that new test case executes VALUES 1 without reading the last DSS which contains QRYDTA. This gets DDMReader into the problematic state seen with the 10.1 version of the protocol test, and the next test case hangs. With the suggested fix in DDMReader's initialize() method, testProtocol completes successfully.

        Show
        Knut Anders Hatlen added a comment - Attaching a patch which makes derbynet/testProtocol.java expose the hang. The patch inserts a test case right before the one Bryan saw hanging, and that new test case executes VALUES 1 without reading the last DSS which contains QRYDTA. This gets DDMReader into the problematic state seen with the 10.1 version of the protocol test, and the next test case hangs. With the suggested fix in DDMReader's initialize() method, testProtocol completes successfully.
        Hide
        Knut Anders Hatlen added a comment -

        There has been no activity on this bug for four years, but there's a patch attached. I think we should either see if we could get the patch committed or close the bug as Won't Fix. I'll have a look. Intuitively, clearing the two flags on reinitialization sounds correct, so my intention is to commit it if the tests still pass.

        Show
        Knut Anders Hatlen added a comment - There has been no activity on this bug for four years, but there's a patch attached. I think we should either see if we could get the patch committed or close the bug as Won't Fix. I'll have a look. Intuitively, clearing the two flags on reinitialization sounds correct, so my intention is to commit it if the tests still pass.
        Hide
        Kathey Marsden added a comment -

        From the comments it looks like this is not a product regression but only a test issue. Unchecking the Regression checkbox.

        Show
        Kathey Marsden added a comment - From the comments it looks like this is not a product regression but only a test issue. Unchecking the Regression checkbox.
        Hide
        Knut Anders Hatlen added a comment -

        Yes, the test still fails, but I don't think a separate issue is needed since the test should fail in a mixed environment. It does not test whether the client works, but whether the server responds exactly the way a 10.1 server would. One might even argue that derbynetmats/derbynetclientmats is not the correct suite for this test since it neither uses the client driver nor JCC.

        Show
        Knut Anders Hatlen added a comment - Yes, the test still fails, but I don't think a separate issue is needed since the test should fail in a mixed environment. It does not test whether the client works, but whether the server responds exactly the way a 10.1 server would. One might even argue that derbynetmats/derbynetclientmats is not the correct suite for this test since it neither uses the client driver nor JCC.
        Hide
        Bryan Pendleton added a comment -

        Thanks! Yes, that makes sense.

        With your patch, the hang no longer occurs for me, so I can confirm that your change resolves the hang.

        The test does not hang, but it does not pass, either, but I think that is probably a topic for a separate issue, do you agree?

        Show
        Bryan Pendleton added a comment - Thanks! Yes, that makes sense. With your patch, the hang no longer occurs for me, so I can confirm that your change resolves the hang. The test does not hang, but it does not pass, either, but I think that is probably a topic for a separate issue, do you agree?
        Hide
        Knut Anders Hatlen added a comment -

        Hi Bryan, the patch is for the 10.2 network server. When you are running derbynetclientmats with the configuration you have described in http://wiki.apache.org/db-derby/TestingOldClientNewServer, testProtocol from the 10.1 derbyTesting.jar is used, but it extends the TestProto class found in the 10.2 derbynet.jar, which again uses DDMReader/DDMWriter in the 10.2 derbynet.jar to communicate with the server. So, in a way, you are running 10.2 network server code both on the client side and on the server side, but the actual commands are read from 10.1 test code. Does this make you less confused?

        Note that I'm pretty sure this particular hang cannot occur in the network server itself. When a new session is started, DRDAConnThread.processCommands() begins a loop reading commands and writing responses. It always calls DDMReader.readDssHeader() before writing anything back to the client. readDssHeader() explicitly sets the values of dssIsChainedWithSameID and dssIsChainedWithDiffID. Since the hang was caused by those two variables having values from the previous session when the first flush occurred, I don't see how it could happen on the server.

        Show
        Knut Anders Hatlen added a comment - Hi Bryan, the patch is for the 10.2 network server. When you are running derbynetclientmats with the configuration you have described in http://wiki.apache.org/db-derby/TestingOldClientNewServer , testProtocol from the 10.1 derbyTesting.jar is used, but it extends the TestProto class found in the 10.2 derbynet.jar, which again uses DDMReader/DDMWriter in the 10.2 derbynet.jar to communicate with the server. So, in a way, you are running 10.2 network server code both on the client side and on the server side, but the actual commands are read from 10.1 test code. Does this make you less confused? Note that I'm pretty sure this particular hang cannot occur in the network server itself. When a new session is started, DRDAConnThread.processCommands() begins a loop reading commands and writing responses. It always calls DDMReader.readDssHeader() before writing anything back to the client. readDssHeader() explicitly sets the values of dssIsChainedWithSameID and dssIsChainedWithDiffID. Since the hang was caused by those two variables having values from the previous session when the first flush occurred, I don't see how it could happen on the server.
        Hide
        Bryan Pendleton added a comment -

        Hi Knut Anders, thanks for the patch. I'm a little confused; is this a patch for the
        10.1 Network Client code, or for the 10.2 Network Server code? Or should I
        apply it to both code bases?

        Show
        Bryan Pendleton added a comment - Hi Knut Anders, thanks for the patch. I'm a little confused; is this a patch for the 10.1 Network Client code, or for the 10.2 Network Server code? Or should I apply it to both code bases?
        Hide
        Knut Anders Hatlen added a comment -

        ddmreader_initialize.diff contains the changes to DDMReader which made the hang go away. I have not had time to check whether these changes have unwanted consequences. Use at your own risk!

        Show
        Knut Anders Hatlen added a comment - ddmreader_initialize.diff contains the changes to DDMReader which made the hang go away. I have not had time to check whether these changes have unwanted consequences. Use at your own risk!
        Hide
        Bryan Pendleton added a comment -

        Comment from Knut Anders on derby-dev:

        It seems like the hang is in fact caused by a bug in
        DDMReader. dssIsChainedWithDiffID and dssIsChainedWithSameID are not
        reset in initialize(), so values from a previous session might be
        used. In this particular case, the old values made
        DDMReader.getCurrChainState() return an incorrect value, which again
        made calls to DDMWriter.finalizeChain() a no-op. The protocol test
        therefore never sent the DRDA commands it thought it had sent, and
        both sides ended up waiting for data from the other one.

        It is not clear to me whether this bug actually can be triggered
        outside the protocol test since the client driver doesn't use
        DDMReader/DDMWriter. The way DRDAConnThread uses those classes should
        be safe, as the server always reads a command before responding and
        the fields seem to be set to reasonable values at each read.

        Anyway, setting those two fields to false in DDMReader.initialize()
        made the hang go away, and derbynetclientmats runs cleanly with that
        change.

        Show
        Bryan Pendleton added a comment - Comment from Knut Anders on derby-dev: It seems like the hang is in fact caused by a bug in DDMReader. dssIsChainedWithDiffID and dssIsChainedWithSameID are not reset in initialize(), so values from a previous session might be used. In this particular case, the old values made DDMReader.getCurrChainState() return an incorrect value, which again made calls to DDMWriter.finalizeChain() a no-op. The protocol test therefore never sent the DRDA commands it thought it had sent, and both sides ended up waiting for data from the other one. It is not clear to me whether this bug actually can be triggered outside the protocol test since the client driver doesn't use DDMReader/DDMWriter. The way DRDAConnThread uses those classes should be safe, as the server always reads a command before responding and the fields seem to be set to reasonable values at each read. Anyway, setting those two fields to false in DDMReader.initialize() made the hang go away, and derbynetclientmats runs cleanly with that change.
        Hide
        Bryan Pendleton added a comment -

        Comment from Knut Anders on derby-dev:

        I also see this. However, the CPU usage is more like 0% than 100%. I
        think this is caused by the pre-fetching that was added to the network
        client in DERBY-822. If you put this at the end of values1.inc

        readReplyDss
        readLengthAndCodepoint QRYDTA
        skipBytes

        or simply

        skipDss

        testProtocol will terminate successfully.

        This is not a compatibility issue, since the network client (also the
        10.1 client) knows that a QRYDTA object may or may not arrive. The
        protocol test, on the other hand, is written with a specific version
        of the server in mind, with the expected server response
        hard-coded. Since the 10.1 version of the test doesn't expect QRYDTA
        from an OPNQRY, the actual server response and the expected response
        will be out of sync.

        I'm not sure what causes the hang, though. The test that is hanging is
        "Test for too large value for OUTEXP in EXCSQLSTT", and it happens in
        the first skipDss in connect.inc. When this test starts, there is at
        least a left-over QRYDTA from the previous test (and perhaps more
        since that one doesn't read any data sent from the server). But that
        should not be a problem since the endTest command is supposed to close
        the socket and streams and open new ones. We should probably look into
        why this is happening, at least so that we can eliminate that there's
        something wrong with the network server code.

        Show
        Bryan Pendleton added a comment - Comment from Knut Anders on derby-dev: I also see this. However, the CPU usage is more like 0% than 100%. I think this is caused by the pre-fetching that was added to the network client in DERBY-822 . If you put this at the end of values1.inc readReplyDss readLengthAndCodepoint QRYDTA skipBytes or simply skipDss testProtocol will terminate successfully. This is not a compatibility issue, since the network client (also the 10.1 client) knows that a QRYDTA object may or may not arrive. The protocol test, on the other hand, is written with a specific version of the server in mind, with the expected server response hard-coded. Since the 10.1 version of the test doesn't expect QRYDTA from an OPNQRY, the actual server response and the expected response will be out of sync. I'm not sure what causes the hang, though. The test that is hanging is "Test for too large value for OUTEXP in EXCSQLSTT", and it happens in the first skipDss in connect.inc. When this test starts, there is at least a left-over QRYDTA from the previous test (and perhaps more since that one doesn't read any data sent from the server). But that should not be a problem since the endTest command is supposed to close the socket and streams and open new ones. We should probably look into why this is happening, at least so that we can eliminate that there's something wrong with the network server code.

          People

          • Assignee:
            Knut Anders Hatlen
            Reporter:
            Bryan Pendleton
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development