DERBY-1219

jdbcapi/checkDataSource.java and jdbcapi/checkDataSource30.java hang intermittently with client

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 10.2.1.6
    • Fix Version/s: 10.2.1.6
    • Labels:
      None
    • Environment:
      More often on jdk 1.5 or jdk 1.6 but hangs on jdk 1.4.2 as well

      Description

      The tests checkDataSource.java and checkDataSource30.java
      hang intermittently especially with jdk 1.5.

      Attached is the test run output and traces when the server is started separately.

      1) Enable checkDataSource30.java by taking it out of functionTests/suites/DerbyNetClient.exclude.

      2) Run the test with client.
      java -Dij.exceptionTrace=true -Dkeepfiles=true -Dframework=DerbyNetClient org.apache.derbyTesting.functionTests.harness.RunTest jdbcapi/checkDataSource30.java

      Attachments:
      testfiles_after_hang.zip - Test directory.

      traces_on_hang.txt - Server side traces obtained by starting the server separately before running the test.

      I wish I had time to work on this right now as I would really like to see this valuable test in the suite, but hopefully someone else will pick it up.

      1. derby-1219-enable-tests.diff
        22 kB
        Deepa Remesh
      2. derby-1219-enable-tests.status
        0.5 kB
        Deepa Remesh
      3. no-sessions-for-closed-threads.diff
        2 kB
        Bryan Pendleton
      4. interrupt.diff
        1 kB
        Bryan Pendleton
      5. skipThreads.diff
        4 kB
        Bryan Pendleton
      6. drda_traces_050206.zip
        89 kB
        Deepa Remesh
      7. client_stack_trace_050306.txt
        4 kB
        Deepa Remesh
      8. server_stack_trace_050306.txt
        5 kB
        Deepa Remesh
      9. traces_on_hang.txt
        14 kB
        Kathey Marsden
      10. testfiles_afterhang.zip
        67 kB
        Kathey Marsden

        Activity

        Kathey Marsden added a comment -

        test output after hang.

        Kathey Marsden added a comment -

        server side stack traces on hang.

        Deepa Remesh added a comment -

        For DERBY-1148, I was trying to run the checkDataSource test with the client to repro the issue. I am getting the hang reported by Kathey almost all the time. So I looked into the hang and narrowed it down to this: the hang is occurring in the ClientDataSource.getConnection method at this line: return ClientDriver.getFactory().newNetConnection((NetLogWriter) dncLogWriter, user, password, this, -1, false);
        It looks like it is hanging in the method NetConnection.flowServerAttributesAndKeyExchange. I tried to capture server traces, but running a few times with trace on did not produce the hang for me. I'll be working with this test some more and will update if I get any info. If anyone else gets any clues as to what could be wrong, please post them.

        Deepa Remesh added a comment -

        I talked to Kathey about this issue on IRC and she suggested trying this test outside the harness. I could repro it outside the harness too and I am attaching the stack traces from the client and the server (client_stack_trace_050306.txt, server_stack_trace_050306.txt). From the trace and from the debugging I did with a few print statements, I think this is what is happening in the test:

        • For getting a new connection, the client sends EXCSAT and ACCSEC to the server. Then the client waits for the server's response to these commands. In all the test runs, I found the hang always happens in this request from the client, sent as part of the getConnection method.
        • The server tries to read the data from the client and does not get any data in DDMReader.fill. So the server thinks the client has disconnected and ends the connection thread.
        • The client does not get back any response, as the server has ended the connection thread. Thus the client is blocked trying to read the input stream.

        I have not yet figured out the reason for this miscommunication between the server and the client. I would appreciate it if someone could also go through the traces and confirm that my analysis is correct.

        Bryan Pendleton added a comment -

        Hi Deepa,

        Can you get the DRDA trace files for the hang? I.e., set traceDirectory on the client and derby.drda.traceAll on the server, as described in the Protocol Tracing section of http://wiki.apache.org/db-derby/ProtocolDebuggingTips

        Deepa Remesh added a comment -

        Thanks Bryan. I am glad that you are looking at this issue. Hopefully we can narrow this down quickly with your DRDA expertise. I am attaching the client and server DRDA traces that I had captured previously when I ran the test inside the harness. You may find some traces which start with "DERBY-1219" which I had added for some debugging. Please ignore these. If needed, I can try to capture fresh traces and upload them.

        I had let the test run for a long time. At the end of the test I killed network server and this accounts for the last DisconnectException seen in client_trace.txt_sds_1. To me, it looked like the client was hanging after sending EXCSAT and ACCSEC whereas network server was idle.

        Bryan Pendleton added a comment -

        Hi Deepa, nothing obvious has jumped out at me yet, but I will keep looking. Three questions:

        1) You said you were able to reproduce this outside the harness; can you post a brief description of the steps, so that I can experiment with the code in my environment?
        2) Were you running with sane=true, or sane=false? If sane=false, can you try sane=true?
        3) You mentioned that the server does not get any data in DDMReader.fill, and thinks the client has disconnected. Can you expand on how you came to that conclusion? Were you able to see an IOException being thrown? Can you get a dump of that exception? After the exception has been thrown (i.e., during the hang), does netstat think that there is still an active TCP/IP connection between the client and the server?

        thanks, bryan

        Deepa Remesh added a comment -

        Hi Bryan,

        Here are the answers to your questions:

        1) You said you were able to reproduce this outside the harness; can you post a brief description of the steps, so that I can experiment with the code in my environment?

        To run the test jdbcapi/checkDataSource.java without using the test harness, I started network server and then used the following command to run the test:
        java -Dderby.system.home=C:\deepa\Derby\derby_testing\nwserver -Dframework=DerbyNetClient -Dij.database=jdbc:derby://localhost:1527/wombat;create=true org.apache.derbyTesting.functionTests.tests.jdbcapi.checkDataSource

        The test hangs intermittently and the place where the test hangs is in one of the getConnection methods. The hang location varies in different runs but is always in the getConnection method. I can repro it quite easily (~ 1 out of 5 runs hang) on my machine. I hope you are able to repro it too.

        2) Were you running with sane=true, or sane=false? If sane=false, can you try sane=true?

        I was running with sane=true. The debug output can be seen in the derby.log in the attached zip file (drda_traces_050206.zip).

        3) You mentioned that the server does not get any data in DDMReader.fill, and thinks the client has disconnected. Can you expand on how you came to that conclusion? Were you able to see an IOException being thrown? Can you get a dump of that exception? After the exception has been thrown (i.e., during the hang), does netstat think that there is still an active TCP/IP connection between the client and the server?

        I was looking at the server debug trace from a normal run (without hang) and when the test hangs. On comparing both and also looking at the client and server traces, I found the following from the trace files:

        • Client trace stops with send of EXCSAT and ACCSEC (client_trace.txt_sds_1)
        • At the server side, I can see an empty trace file is created but cannot see the EXCSAT and ACCSEC as the last entries in any trace file (Server10.trace, Server9.trace)
        • In the debug trace of the server, the last trace is "Ending connection thread". I added a few other traces in DDMReader.fill and DRDAConnThread.run and found that the server is actually reaching the end of the input stream (actualBytesRead == -1) without reading any data (totalBytesRead is 0), and hence calls "agent.markCommunicationsFailure ("DDMReader.fill()", "InputStream.read()", "insufficient data", "*");". The connection thread catches this disconnect exception and exits normally, thinking the client has disconnected. The debug traces from a successful run show that the network server starts a new connection thread at this point. (A small sketch of this end-of-stream check follows below.)
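
        A minimal sketch of the end-of-stream condition described in the last bullet above; this is illustrative Java, not the actual DDMReader source, and the class name, parameters, and the IOException are hypothetical stand-ins for the agent.markCommunicationsFailure call quoted from the trace:

        import java.io.IOException;
        import java.io.InputStream;

        // Illustrative only: a read() that returns -1 before any bytes arrive is what
        // the server interprets as a client disconnect.
        class FillSketch {
            static void fill(InputStream in, byte[] buffer, int bytesNeeded) throws IOException {
                int totalBytesRead = 0;
                while (totalBytesRead < bytesNeeded) {
                    int actualBytesRead = in.read(buffer, totalBytesRead, buffer.length - totalBytesRead);
                    if (actualBytesRead == -1) {
                        // End of stream before enough data arrived; with totalBytesRead == 0 the
                        // server concludes the client went away, logs "Ending connection thread",
                        // and lets the connection thread exit.
                        throw new IOException("insufficient data, read only " + totalBytesRead + " bytes");
                    }
                    totalBytesRead += actualBytesRead;
                }
            }
        }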

        From the above, I thought of the following possibilities:

        • The server is not getting a set of data from the client; the data is lost. This looks unlikely.
        • The server thread is reading the wrong stream (from an already closed connection) and so thinks the client has no more data and has disconnected.
        • The session state associated with the server thread is wrong. Instead of starting a new session, the server thinks it has an active session and tries to process commands for that session.

        I am just throwing in some ideas which come to me and could be totally off here. I plan to look at this some more and will post if I find something else.

        There are no IOExceptions at server or client. The IO exception seen in the client trace file (client_trace.txt_sds_1) is because I killed network server after the test was hanging for a long time. This was just to reconfirm where the client is hanging.

        In netstat output, I can see listening and established statuses for both client and server process during the hang.

        Output of the runtimeinfo command:
        --- Derby Network Server Runtime Information ---
        ---------- Session Information ---------------
        Session # :10

        -------------------------------------------------------------

        Connection Threads : 2
        Active Sessions : 1
        Waiting Sessions : 0

        Total Memory : 3694592 Free Memory : 2649768

        Hope this helps.

        Deepa Remesh added a comment -

        I forgot to mention that when I set up tracing for the client, I was getting an NPE if I did not set all the ClientDataSource properties. The other option was to make a small change in the client's LogWriter.java. I have opened DERBY-1298 for this. Just mentioning it in case someone else hits the same problem when tracing the client.

        Bryan Pendleton added a comment -

        Hi Deepa. Thank you for the good notes. As you say:

        > The test hangs intermittently and the place where the test hangs is in one of the getConnection
        > methods. The hang location varies in different runs but is always in the getConnection method.
        > I can repro it quite easily

        I appear to be able to reproduce it quite easily, too. So that is good.

        I think that an interesting aspect of this test is that it creates and tears down a lot of connections
        in very rapid succession, and I am wondering whether there is a race condition somewhere
        that is causing the code to lose track of which connection is which.

        I think that your analysis of the (actualBytesRead == -1) case is good, but is perhaps a red
        herring. I believe that this is the normal way that the server cleans up when a client
        connection disconnects and goes away. So I don't think this is directly the place where things
        are going wrong; it's just evidence that the server is seeing a lot of connections come and go.

        It seems like, at least in the cases I've seen so far, the bottom line is this:

        • the client believes it's initiated a new connection with the server
        • the server, however, has no record of that connection, and believes it's cleaned up all
          its connections and is idle

        So I think something is going wrong in the connection management code on one side or
        the other.

        Thanks for the great test case and set of notes; I'll continue to study this one some more and
        let you know if I figure anything out.

        Bryan Pendleton added a comment -

        Attached are "skipThreads.diff" and "interrupt.diff", but before
        reading the diffs, please read these notes.

        I think I understand what is causing the hangs, and I can even make
        the hangs go away. However, I don't think I yet understand how to
        really fix the problem, so I'm sure we'll want to talk about this
        for a while, to see if some of the reviewers can come up with a
        proper solution or at least some techniques to pursue.

        Here's what I see, and what I think it means:

        1) One, or maybe several, times in the test, checkDataSource causes a
        shutdown of the server. It has several different variants on the shutdown
        processing, but at least one of them causes the server to go through
        NetworkServerControlImpl.startNetworkServer() to perform a server restart.

        2) During the server restart processing, the Network Server restart
        code iterates through all the DRDAConnThread instances and closes them.
        This close() call is supposed to cause the DRDAConnThread to terminate itself.

        3) However, all the close() call actually does is mark the thread's
        "close" variable as true, and depending on when the thread checks that
        variable, it may or may not immediately exit. In my test runs, it is
        often the case that at least one of the DRDAConnThread instances is, at this
        point, sitting blocked in NetworkServerControlImpl.getNextSession().
        Calling close() on this thread marks it as closed, but doesn't cause
        it to exit the getNextSession() wait.

        4) A little bit later, the test program makes some new connections
        to the server, and one of those connections is given to the thread
        which was blocked in the getNextSession() call. The thread picks
        up the session and returns to the DRDAConnThread.run() main loop.

        5) At this point, the thread notices that it has been closed, and it
        exits, without sending any response back to the client, and without
        closing the connection to the client. This causes the hang.
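
        A minimal sketch of the pattern described in steps 1-5, assuming invented class and field names; it is not Derby's actual DRDAConnThread code. The point it illustrates is that close() only sets a flag, so a thread blocked waiting for its next session can still be handed one and then exit without servicing it:

        import java.util.concurrent.BlockingQueue;

        // Hypothetical connection thread: "closed" is only a flag, and it is checked
        // only after the blocking wait for the next session returns.
        class ConnThreadSketch extends Thread {
            private volatile boolean closed = false;
            private final BlockingQueue<Runnable> runQueue;     // stands in for the server's run queue

            ConnThreadSketch(BlockingQueue<Runnable> runQueue) { this.runQueue = runQueue; }

            void close() { closed = true; }                     // does not wake or stop the thread

            @Override public void run() {
                try {
                    while (true) {
                        Runnable session = runQueue.take();     // may block here across a server restart
                        if (closed) {
                            // Closed while blocked: exit holding the session, never answering
                            // the client -- the hang described in step 5.
                            return;
                        }
                        session.run();
                    }
                } catch (InterruptedException e) {
                    // an interrupt (as in interrupt.diff) would break the thread out of take()
                }
            }
        }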

        Because this problem involves multi-threading, and thread scheduling,
        there is a bunch of nondeterministic behavior, which I believe is why
        others have been experiencing varied results during their tests. The
        behavior of the threads is definitely unpredictable for me.

        There are several aspects to this scenario that puzzle me, but let
        me describe what I've been experimenting with as a patch. I've changed
        the NetworkServerControlImpl restart logic so that, instead of
        closing the DRDAConnThreads, it just leaves the threads alone.
        This change is in "skipThreads.diff", and it seems to make the hangs
        go away.

        The "skipThreads.diff" diff also contains some hacks to the test so
        that I could run it multiple times in a row outside of the harness
        without destroying and re-creating the database each time.
        Those changes don't really belong with this diff, but I didn't bother
        to edit them out.

        I also experimented with a change which tried to close the threads,
        but also, after closing, interrupted the threads, which
        caused them to be blown out of the getNextSession loop and back to
        the main run() loop, at which point the threads shut themselves down,
        which seems like the right behavior for Network Server restart.

        I was hoping that this was the "right" fix, but unfortunately this
        change fixed some, but not all, of the hangs, which was too bad.
        And I'm nervous about adding the call to Thread.interrupt(), which is
        an extremely powerful call and not to be used lightly. For reviewers
        who want to experiment with this change and see how it works for them,
        I've also attached "interrupt.diff".

        I'm still disturbed by the fact that when the main run() method
        in DRDAConnThread noticed that it was closed, it just exited without
        apparently sending any response back to the client or closing the
        socket.

        And, although my change makes the hangs go away, it does not make the
        checkDataSource and checkDataSource30 tests pass. Instead, they run
        to completion, and get a bunch of diffs, and I'm not sure whether my
        changes caused these diffs or not.

        But at this point, before I work on this much more, I'd like to get
        some feedback from the reviewers about the analysis up to this point,
        and the effects of this patch in their environment:

        • does this patch cause the hangs to disappear for you?
        • if so, do the checkDataSource and checkDataSource30 tests pass for you?
        • if they fail, do the failures make sense to you?
        • what should we be doing with the background connection threads during
          a Network Server restart?

        Thanks!

        Deepa Remesh added a comment -

        I ran the standalone repro with both your patches. I was not able to reproduce the hang on my machine after 25 runs with each patch.

        Thanks for the detailed explanation of the problem. I think I was reading some overlapping traces and was misled. After reading your description, I added more traces and confirmed that the run method breaks out because it finds that the thread has been marked closed. As you said, it seems not quite okay to just break out without informing the client. Maybe it is expected that this code can be reached only after the client has already been informed, after an exception or at the end of a session. A quick look at the places where the close method is called seems to indicate this.

        Both your solutions seem to work on my machine but I am not very clear about them. About solution 1 (skipThreads.diff), it does not seem quite right to remove the cleanup code (the code to close existing threads and clear threadList). If it is okay to reuse the same threads after a restart, then it should be okay to remove this cleanup code. Even then, I think the real cause of this problem could be somewhere else. I have not looked at solution 2 as you mentioned it does not work in all cases.

        Just sharing another thought which occurred to me after reading your descriptions. It looks like the new thread which was opened to serve the new connection gets added to threadList before the list is going to be cleaned up during restart. This causes the new thread to be closed unexpectedly when the server is restarted. I do not want to lead you down the wrong path. So I am looking at how/where threadList is accessed and whether there could be some missing synchronization here. I will post if I find something.

        Bryan Pendleton added a comment -

        One strategy that occurred to me is that, rather than clearing and re-using the RunQueue and the ThreadList,
        we could have the restart processing set up a new RunQueue and ThreadList, and that way make a clean
        distinction between old sessions and threads, versus new sessions and threads. That is, try to avoid the
        problem of "a new session comes in, but gets handled by an old thread which is being closed", by teaching
        the code how to tell the difference between old and new threads and sessions.

        Some sort of restart algorithm like:

        oldRunQueue = RunQueue
        oldThreadList = ThreadList
        create new RunQueue and new ThreadList
        go through the old RunQueue and thread list and close and clean them up

        The idea being that, once we reach a certain point in Network Server restart processing, all new
        threads go onto the new thread list, and all new sessions go onto the new run queue, and it
        should be easier to tell the difference between old threads & sessions, which have been closed
        and should now die, and new threads and sessions, which are allowed to start doing new work.
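
        A rough Java rendering of the algorithm above, under the assumption that sessions and threads expose a close() method; the names and the Closable placeholder are invented, and this is not the actual NetworkServerControlImpl code:

        import java.util.ArrayList;
        import java.util.List;

        class RestartSketch {
            interface Closable { void close(); }        // stands in for DRDAConnThread / Session

            private List<Closable> runQueue = new ArrayList<>();
            private List<Closable> threadList = new ArrayList<>();

            synchronized void restart() {
                // Swap in fresh collections first, so new sessions and threads can never
                // land in the lists that are about to be torn down.
                List<Closable> oldRunQueue = runQueue;
                List<Closable> oldThreadList = threadList;
                runQueue = new ArrayList<>();
                threadList = new ArrayList<>();

                // Then close and clean up the old generation at leisure.
                for (Closable s : oldRunQueue) s.close();
                for (Closable t : oldThreadList) t.close();
            }
        }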

        A similar, but not identical, idea would be to establish some sort of "generation" counter, which
        is incremented by one each time the Network Server restarts within a process. Instead
        of using the "close" API to shut down sessions and threads, we would use the idea of generations:
        a session or thread which is in an older generation is treated as dead and gets cleaned up,
        while new-generation threads can start processing new-generation sessions. If a thread
        and a session ever found that their generations didn't match, they would know that something
        horrible had occurred and could trigger a sanity check or assertion.
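
        And a sketch of the generation-counter variant, again with hypothetical names rather than actual Derby code:

        // Every thread and session would record the generation it was created in;
        // anything from an older generation is treated as dead.
        class GenerationSketch {
            private volatile int serverGeneration = 0;

            void restart() { serverGeneration++; }      // bumped on every in-process restart

            boolean isCurrent(int memberGeneration) {
                return memberGeneration == serverGeneration;
            }

            void checkMatch(int threadGeneration, int sessionGeneration) {
                // A thread and a session from different generations should never meet;
                // this is the "something horrible has occurred" sanity check.
                if (threadGeneration != sessionGeneration) {
                    throw new IllegalStateException("generation mismatch: thread "
                            + threadGeneration + ", session " + sessionGeneration);
                }
            }
        }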

        Anyway, just some thoughts to keep the discussion going.

        Deepa Remesh added a comment -

        I am thinking along slightly different lines. The restart processing is doing some cleanup and reloading the driver. Shouldn't we wait for the reload to happen before we start working on the new session which comes in? In that case, we will need some logic to check if a restart is in progress before we add a new thread/session to the lists.

        Or do we expect the driver to be reloaded by the time the new thread/session gets ready to do the real work?

        Bryan Pendleton added a comment -

        I think that waiting for the reload to complete is fine, but we'll need to figure out a way to have a
        positive confirmation that the threads have received their notification and shut themselves down,
        since, from a certain point of view, the essence of this bug is that calling DRDAConnThread.close()
        simply asks a thread to shut itself down, but does not wait for that shutdown to actually occur.

        Such an algorithm would look something like:

        • call close on each thread
        • call interrupt on each thread
        • while the threadlist is not empty
          wait for a while

        I'm always nervous about algorithms like this because of the need for an open-ended wait:

        • how do we know how long to wait?
        • what happens if we've waited a long time and the threads still haven't shut themselves down?

        The algorithms that I was proposing in my previous comment basically replace this
        open-ended wait logic with a potential resource leak in the scenario where the old
        threads for some reason don't respond to the request to shut themselves down.
        That is, either we accept the risk of leaking away old threads, or we accept the risk
        of waiting indefinitely for old threads, (or, I suppose, we figure out some way to be
        VERY confident that we can shut the old threads down in a reasonable amount of time).
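
        A sketch of that shutdown sequence with the open-ended wait replaced by a bounded join per thread; it is illustrative only, plain Thread has no close() so that step appears as a comment, and the timeout value is arbitrary:

        import java.util.List;

        class ShutdownSketch {
            static void shutDown(List<Thread> threadList, long perThreadTimeoutMillis)
                    throws InterruptedException {
                for (Thread t : threadList) {
                    // the real code would first call the thread's own close()/shutdown method here
                    t.interrupt();                      // knock idle threads out of their wait
                }
                for (Thread t : threadList) {
                    t.join(perThreadTimeoutMillis);     // bounded wait instead of waiting forever
                    if (t.isAlive()) {
                        // Give up on this thread: accept a possible leak rather than hang the
                        // restart indefinitely -- exactly the trade-off described above.
                        System.err.println("connection thread did not shut down: " + t.getName());
                    }
                }
                threadList.clear();
            }
        }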

        Deepa Remesh added a comment -

        Hi Bryan,
        We seem to be thinking of the problem differently. I am just thinking aloud to explain how I understood the scenario:

        I think the scenario is:

        One of the threads has asked to restart the server, and this "restart" includes:
        1. a) close all sessions in runQueue b) clear runQueue
        2. a) close all threads in threadList b) clear threadList
        3. reload the embedded driver

        (Note: 1a and 2a are not single steps)

        This restart is in progress and we get a "new connection" from the client. We can do two things:
        1. queue it by adding to the runQueue list
        OR
        2. a) create a new thread for the connection b) add it to threadList c) start the new connection thread

        In addition to this, we can also have other connection threads which are running.

        Some cases I can think of which can cause hang are if we have actions overlapped as follows:
        case1: restart_1a, restart_1b, new_conn_2a, new_conn_2b, restart_2a, new_conn_2c
        case2: new_conn_1, restart 1a, restart 1b ...

        In case 1, after we have added a new connection thread to the threadList, threadList is cleared by restart thread and all threads including the new thread are marked closed. We then start the new thread (new_conn_2c) and it exits because it finds it was marked closed.

        In case 2, we queue up the connection request, runQueue is cleared during restart. So we miss the request.

        To me, the problem seems to be that we access the lists runQueue and threadList when a restart is in progress.

        In case of thread shutdown, I think the way it currently works is that we mark the thread as closed (set close = true). And the next time this thread is used, it checks if it was closed, and if so it ends itself without doing anything. As the thread will not do anything else once it is marked closed, I am thinking it is okay to consider it as closed and not wait for it to actually terminate. To me, the culprits seem to be the lists maintained by the server.
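
        A sketch of the kind of synchronization this points at: route every mutation of runQueue and threadList through one lock, so a restart's close-and-clear steps and a new connection's add steps cannot interleave. The names are hypothetical and interrupt() merely stands in for closing each thread; this is not a proposed patch:

        import java.util.ArrayList;
        import java.util.List;

        class ListGuardSketch {
            private final Object listLock = new Object();
            private final List<Runnable> runQueue = new ArrayList<>();
            private final List<Thread> threadList = new ArrayList<>();

            void restart() {
                synchronized (listLock) {
                    // Steps 1a/1b and 2a/2b happen atomically with respect to new connections.
                    runQueue.clear();
                    for (Thread t : threadList) t.interrupt();   // stands in for closing each thread
                    threadList.clear();
                    // ... reload the embedded driver ...
                }
            }

            void addConnection(Runnable session, Thread handler) {
                synchronized (listLock) {
                    // A connection arriving mid-restart blocks here until restart() finishes, so it
                    // is queued against the fresh state rather than the state being torn down.
                    runQueue.add(session);
                    threadList.add(handler);
                }
            }
        }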

        Any thoughts/comments?

        Bryan Pendleton added a comment -

        Hi Deepa,
        I think that your scenarios are excellent, and they definitely demonstrate the problems in this area.
        I think that the fun thing about a bug like this, is that there are many possible scenarios.

        The one I was concentrating on is a bit different from yours, so let me try to diagram it as follows:

        1) Some thread is idling, blocked in NetworkServerControlImpl.getNextSession() as called by
        DRDAConnThread.run(). I think this is the standard place for a thread to block when it is idle.

        2) Server restart occurs, and runs straight through to completion. This results in calling close()
        on the thread from point (1), and also removing that thread from the ThreadList. *But the thread
        does not terminate.*

        3) Some time later, some new connections start coming in. The first new connection, as you
        point out, will create a new thread to handle the session. The next new connection, however, will
        find that a thread already exists, and so it will simply put the session onto the RunQueue list.

        4) The original thread then wakes up, grabs the session, notices that the thread has been
        closed, and exits.

        The point I'm trying to make here is that no overlapping of actions is required, and the connection
        does not have to arrive during the restart.

        It seems to me that, once a restart happens while 1 or more threads happen to be sitting idle,
        blocked in their getNextSession() calls, then those threads are "poisoned", and there is
        now a ticking time bomb in the server. At some point in the future, a session will get added
        to the RunQueue, and one of these "poisoned" (closed) threads will grab the session, and
        will then terminate prematurely without processing the session.

        So the only place where I differ with your analysis, I believe, is that I think it is not okay to
        leave these threads out there, marked as closed, because at some point in the future the
        threads will grab sessions off the run queue and fail to process them.

        So I think one crucial thing to ensure is that, once a thread is marked as closed, it will no
        longer pick up a new session to process.

        With that in mind, I've experimented with yet another patch, called "no-sessions-for-closed-threads.diff",
        which attempts to prevent threads marked as closed from fetching sessions to process.

        It seems to resolve the hang for me, but I haven't exhaustively tested it. Still, I thought it
        showed enough promise to attach for you to examine.
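
        A sketch of the idea behind the patch, i.e. a thread that has been marked closed refuses to fetch a session; the class and its members are invented for illustration and this is not the actual diff:

        import java.util.concurrent.BlockingQueue;

        class NoSessionsForClosedThread extends Thread {
            private volatile boolean closed = false;
            private final BlockingQueue<Runnable> runQueue;

            NoSessionsForClosedThread(BlockingQueue<Runnable> runQueue) { this.runQueue = runQueue; }

            void close() { closed = true; interrupt(); }     // mark closed and wake a blocked take()

            private Runnable getNextSession() throws InterruptedException {
                // Never hand a session to a closed thread.
                return closed ? null : runQueue.take();
            }

            @Override public void run() {
                try {
                    Runnable session;
                    while ((session = getNextSession()) != null) {
                        session.run();
                    }
                } catch (InterruptedException e) {
                    // asked to shut down while blocked; exit without taking a session
                }
            }
        }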

        Deepa Remesh added a comment -

        Hi Bryan,
        I applied your new patch 'no-sessions-for-closed-threads.diff' and ran the checkDataSource repro. However, I still get the hang intermittently on my machine. Though the hang has not gone away, I think the scenario you explain is a probable one. And as you said, the bug is quite interesting as it can have many possible scenarios.

        Deepa Remesh added a comment -

        Posting my observation with the patch 'no-sessions-for-closed-threads.diff': I think this patch solves the problem of "poisoned" threads partially. It makes sure that a thread which has been marked closed does not get a session to work on. However, the session which came in is still hanging because there are no new threads which can pick it up. When the session came in, server had queued it so that a free thread picks it up. But the thread which picked it up was a "poisoned" (closed) thread. So the session is still waiting for another thread.

        To test that the hang will go away when a new thread gets created, I opened another connection using ij. This made the hang go away and the repro ran to the end. When the connection from ij came in, the newly created thread was able to work on the waiting session and resolve the hang.

        Bryan, I am looking at your trial patch "interrupt.diff" which looks like a good solution to me too. This patch made the hang go away on my machine. But you had said that it did not resolve all hangs for you. Can you please post your observations with this patch?

        Bryan Pendleton added a comment -

        I will gladly work on interrupt.diff some more, and try to provide more testing details.
        But I won't get to it until at least this weekend, sorry. Thanks very much for continuing
        to work on this with me; I think we're making some real progress here!

        Bryan Pendleton added a comment -

        I've been trying to run this test in one of my alternate environments, and I'd like to be able to run the server on a port other than 1527. Unfortunately, when I change the value of ij.database so that it specifies a different port number, the test runs partway, then fails with:

        java.sql.SQLException: java.security.PrivilegedActionException : Error connecting to server localhost on port 1,527 with message Connection refused.
        at org.apache.derby.client.am.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45)
        at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:342)
        at org.apache.derby.jdbc.ClientDataSource.getConnection(ClientDataSource.java:191)
        at org.apache.derby.jdbc.ClientDataSource.getConnection(ClientDataSource.java:162)
        at org.apache.derbyTesting.functionTests.tests.jdbcapi.checkDataSource.runTest(checkDataSource.java:190)
        at org.apache.derbyTesting.functionTests.tests.jdbcapi.checkDataSource.main(checkDataSource.java:132)

        Line 190 of checkDataSource.java is the last line of the following snippet:

        DataSource dscs = TestUtil.getDataSource(attrs);

        if (testConnectionToString)
            checkToString(dscs);

        DataSource ds = dscs;

        checkConnection("DataSource", ds.getConnection());

        Is it possible to configure this test to run on a port other than 1527?

        Bryan Pendleton added a comment -

        I think the interrupt.diff patch is very close to being good, but I think there are some
        synchronization issues to resolve. The interrupt.diff patch is very reliable on my
        Windows environment; that is, I cannot reproduce the hang with that diff in place.

        In my Linux environment, however, the test still hangs with that patch. But if
        I introduce even very small timing changes into the test, for example a
        println statement in a critical section of the getNextSession() logic, the
        hang disappears.

        So I think that the interrupt.diff patch can perhaps be combined with some
        additional synchronization as Deepa suggested earlier, to solve some of the
        races between the Network Server restart, the DRDAConnThreads calling
        getNextSession, and the ClientThread logic involving creating a new session
        and conditionally creating a new thread.
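
        As a rough sketch of the kind of additional synchronization I mean (hypothetical code with made-up names, not the actual NetworkServerControl / DRDAConnThread implementation), the idea is that handing out a session and checking whether the receiving thread has been closed happen under one monitor, so a restart cannot slip in between the two:

        import java.util.LinkedList;

        // Hypothetical sketch only; none of this is the real Derby implementation.
        class SessionQueueSketch {
            static class ThreadState { volatile boolean closed; }

            private final LinkedList<Object> runQueue = new LinkedList<Object>();

            // Called by a connection thread that wants more work.
            synchronized Object getNextSession(ThreadState state) throws InterruptedException {
                while (runQueue.isEmpty() && !state.closed) {
                    wait();
                }
                if (state.closed) {
                    notifyAll();        // give a live thread a chance at the queue
                    return null;        // a closed thread never walks off with a session
                }
                return runQueue.removeFirst();
            }

            // Called by the accept/client thread when a new connection arrives.
            synchronized void addSession(Object session) {
                runQueue.add(session);
                notifyAll();            // wake every waiter, not just one
            }

            // Called during a restart for each existing connection thread.
            synchronized void close(ThreadState state) {
                state.closed = true;
                notifyAll();            // wake the thread so it can observe 'closed'
            }
        }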

        I'll continue to fiddle with the configuration that I have in which I can still provoke
        a hang, and see what I can do to understand it better.

        Bryan Pendleton added a comment -

        Well, I ran a lot of experiments, but I don't have anything conclusive to offer. I believe that:

        • the issue is intricately linked to the handling of threads and sessions during restart
        • the hangs are because closed threads are given sessions to run, but then abandon
          those sessions without running them when they discover they've been closed
        • the interrupt() call makes things better (see the sketch after this list), but does not totally
          fix the problem. I can still reproduce the problem on my RedHat Linux / Sun JDK 1.4.2 environment.
        • but my various attempts to try to improve on the interrupt() fix with additional changes,
          only made things worse.
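
        To make the interrupt() point concrete, here is a made-up illustration of the idea; it is not a reproduction of interrupt.diff, and the names are assumptions:

        // Hypothetical illustration only. On restart, each existing connection thread is
        // marked closed and then interrupted, so a thread parked in wait() (or blocked
        // elsewhere) wakes up, re-checks its state and exits instead of sitting on, or
        // waiting for, a queued session.
        static void poisonAndWake(java.util.List<Thread> connThreads,
                                  java.util.Set<Thread> closedThreads) {
            for (Thread t : connThreads) {
                closedThreads.add(t);   // stand-in for whatever "mark closed" flag the server uses
                t.interrupt();          // unblock the thread so it can notice the flag
            }
        }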

        Unfortunately, I'm not going to have much more time to work on this right away, so I'm
        unassigning myself in the hopes that somebody else can work on it.

        Deepa Remesh added a comment -

        Thanks Bryan for helping to identify the cause of the hang.

        Since we now know what is causing the intermittent hang, I think we can break the issue as follows:

        1) Open a separate code bug against network server component to solve the real hang problem and link to the details in this test issue.

        2) Enable the test to run with client by removing the code which shuts down the system when we run the test using the client framework. This is the code which causes the intermittent hang, and it seems this code is not required to check the scenarios the test is meant to cover. I disabled the following shutdown code using a boolean and ran the checkDataSource tests with the client and have not seen the hang so far:

        // DERBY -???? - link to new bug opened for 1) above
        private static boolean hangAfterSystemShutdown = TestUtil.isDerbyNetClientFramework();
        ...
        // Shutdown the system only if we are not running in client framework
        if (!hangAfterSystemShutdown) {
            try {
                TestUtil.getConnection("", "shutdown=true");
            } catch (SQLException sqle) {
                JDBCDisplayUtil.ShowSQLException(System.out, sqle);
            }
        }

        The test has some existing diffs because of some intermediate check-ins which changed the messages/behaviour. I will look at these diffs and submit a patch to enable the checkDataSource tests to run with client. I think it would be good to get the test running with client so that we can keep the masters up to date and also be able to catch any regressions in this area.

        Please provide any comments/feedback.

        Deepa Remesh added a comment -

        The two categories of diffs seen when running the checkDataSource test with client are:

        < DriverManager <closedstmt>.execute() null - Invalid operation: statement closed
        11a11
        > DriverManager <closedstmt>.execute() 08003 - No current connection

        AND

        < autocommitxastart expected 'ResultSet' already closed.
        460a459
        > autocommitxastart expected ResultSet not open. Verify that autocommit is OFF.

        These diffs are message changes done as part of DERBY-843/DERBY-842. However, in the case of the first diff, I see a difference in the behaviour of the client and embedded drivers.

        In the case of the client driver, the Statement.execute method first checks whether the connection associated with the statement is closed, and only then checks whether the statement itself is closed. In the case of the embedded driver, the check for a closed statement is done before the check for a closed connection. I would like to clarify whether the order of the checks matters and, if it does, which behaviour is correct.
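
        To illustrate the difference, the two orderings look roughly like this (illustration only, with made-up method names standing in for the real checks; this is not the actual driver source):

        // Made-up stand-ins for the real checks.
        static void checkForClosedConnection() throws java.sql.SQLException {
            // would raise SQLState 08003, "No current connection", if the connection is closed
        }
        static void checkForClosedStatement() throws java.sql.SQLException {
            // would raise "Invalid operation: statement closed" if the statement is closed
        }

        // Client driver behaviour as observed: the connection check runs first.
        static void executeClientStyle() throws java.sql.SQLException {
            checkForClosedConnection();
            checkForClosedStatement();
            // ... execute the statement ...
        }

        // Embedded driver behaviour as observed: the statement check runs first.
        static void executeEmbeddedStyle() throws java.sql.SQLException {
            checkForClosedStatement();
            checkForClosedConnection();
            // ... execute the statement ...
        }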

        Kathey Marsden added a comment -

        Deepa said ...
        > (The shutdown) is the code which causes the intermittent hang and it seems this code is not required to check the scenarios meant to be covered by the test.

        I agree that the conditional is a good solution for getting the test running with client, but would like to clarify that I think the shutdown is a valuable part of the test, because we verify that the global transaction state is valid even after the database has been shut down and restarted. Once the hang has been resolved it would be good to re-enable this part of the test for client.
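
        For reference, the shutdown part of the test is checking behaviour along these lines (a rough outline only, not the actual checkDataSource30 code; the XADataSource and Xid are assumed to be supplied by the caller):

        import javax.sql.XAConnection;
        import javax.sql.XADataSource;
        import javax.transaction.xa.XAResource;
        import javax.transaction.xa.Xid;

        // Outline: a prepared global transaction should still be reported by recover()
        // after the database has been shut down and brought back up.
        static void checkGlobalTxSurvivesRestart(XADataSource xads, Xid xid) throws Exception {
            XAConnection xac = xads.getXAConnection();
            XAResource xar = xac.getXAResource();
            xar.start(xid, XAResource.TMNOFLAGS);
            // ... do some work on xac.getConnection() ...
            xar.end(xid, XAResource.TMSUCCESS);
            xar.prepare(xid);                        // leave an in-doubt transaction behind

            // shut down and restart the database here (the step that triggers the
            // intermittent hang with the client driver), then reconnect:
            XAConnection xac2 = xads.getXAConnection();
            Xid[] inDoubt = xac2.getXAResource().recover(XAResource.TMSTARTRSCAN);
            // 'xid' should appear in inDoubt, showing the global transaction state
            // survived the shutdown and restart.
        }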

        Kathey Marsden added a comment -

        Deepa asked about these diffs:
        < DriverManager <closedstmt>.execute() null - Invalid operation: statement closed
        11a11
        > DriverManager <closedstmt>.execute() 08003 - No current connection

        AND

        < autocommitxastart expected 'ResultSet' already closed.
        460a459
        > autocommitxastart expected ResultSet not open. Verify that autocommit is OFF.

        These diffs are related to the DERBY-843 change. I posted a question about this to DERBY-843, but I think it makes sense to go ahead and update the master and get the test running while that issue is being resolved.

        Deepa Remesh added a comment -

        Attaching a patch 'derby-1219-enable-tests.diff' which enables the tests jdbcapi/checkDataSource.java and jdbcapi/checkDataSource30.java to run with client.

        The patch adds a condition to skip the system shutdown when running with the client framework. This condition has to be removed once the hang (DERBY-1326) is resolved. Thanks Kathey for mentioning the use of shutdown in this test.

        With this patch, I ran the 2 tests using embedded and client driver with Sun jdk1.4.2 on Windows XP. I ran the tests multiple times with client driver to check that the hang does not occur. I would appreciate if someone can take a look at this patch.

        Bryan Pendleton added a comment -

        I ran both tests multiple times on my Linux JDK 1.4.2 environment with no hangs, and tests passed
        each time. This particular machine was very reliable in reproducing the hang before, so I think that's
        a good result, and it confirms your result.

        I would be glad to commit this patch. Do you think there are any other details we should check first?

        Deepa Remesh added a comment -

        Were you asking if we need to run more tests? I think this patch changes only test code, and I have run the tests with the embedded and client drivers. The tests are excluded with the JCC driver. We also confirmed that the hang does not occur with the changed test.

        Bryan Pendleton added a comment -

        I committed patch derby-1219-enable-tests.diff to subversion as revision 406776:
        http://svn.apache.org/viewcvs?rev=406776&view=rev

        It will be nice to have this test running regularly in the client framework.

        Kathey Marsden added a comment -

        Cannot reproduce hang. Thanks Deepa


          People

          • Assignee:
            Deepa Remesh
            Reporter:
            Kathey Marsden