
DERBY-1326: Network server may abandon sessions when Derby system is shutdown and this causes intermittent hangs in the client

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.2.1.6, 10.3.1.4
    • Component/s: Network Server
    • Labels: None

      Description

      This issue was found when working on DERBY-1219. More details can be found in the comments at http://issues.apache.org/jira/browse/DERBY-1219

      Attachments

      1. withNewThreadAndNoShutdownOfCurrentSession.diff
        12 kB
        Bryan Pendleton
      2. unify_NSImpl_instances.diff
        4 kB
        Bryan Pendleton
      3. sessionMgmt1.diff
        7 kB
        Bryan Pendleton
      4. sessionMgmt1_and_nosessionsforclosedthreads.diff
        8 kB
        Deepa Remesh
      5. Restart.java
        2 kB
        Knut Anders Hatlen
      6. resolve_DRDConnThread_conflict.diff
        12 kB
        Bryan Pendleton
      7. repro1326.java
        1 kB
        Deepa Remesh
      8. derby-1326-skipThreads.diff
        1.0 kB
        Knut Anders Hatlen
      9. derby-1326-restartTest.diff
        2 kB
        Knut Anders Hatlen
      10. derby-1326-cleanup_exceptions.diff
        3 kB
        Knut Anders Hatlen
      11. derby-1326-checkds.diff
        3 kB
        Knut Anders Hatlen


          Activity

          Deepa Remesh added a comment -

          Closing the issue as the fix is available in 10.2 and trunk. Thanks to Knut, Bryan (and others involved) for resolving this issue.

          Knut Anders Hatlen added a comment -

          Committed derby-1326-restartTest.diff with revision 449616. Marking as fixed in 10.2.1.5 since the fix (but not all tests) made it into the release candidate.

          Knut Anders Hatlen added a comment -

          Attaching a patch which makes derbynet/ShutDownDBWhenNSShutsDownTest.junit run the repro (Restart.java) as a JUnit test case.

          Knut Anders Hatlen added a comment -

          derby-1326-checkds.diff re-enables the shutdown in checkDataSource. Without the skipThreads patch, checkDataSource and checkDataSource30 hung intermittently. With the skipThreads patch, they didn't. Derbyall ran cleanly. Committed with revision 447462.

          Knut Anders Hatlen added a comment -

          Thanks for the suggestions, Deepa. I committed skipThreads with revision 447375. Will post a new patch with the suggested test changes.

          Deepa Remesh added a comment -

          I'll file a separate issue for removing all the existing sessions after a restart. From Knut's comments, this is only an optimization issue.

          As derby-1326-skipThreads.diff resolves this hang, we can re-enable the shutdown in the test jdbcapi/checkDataSource.java. This was commented out as it caused the test to hang intermittently. It may be good to add the new repro "Restart.java" to the regression tests as it reliably reproduces this hang.

          Knut Anders Hatlen added a comment -

          Deepa wrote:
          > I still cannot wrap my head around old sessions hanging around in the server.

          They hang around in the server because the DRDAConnThread could be blocked in a read() call. read() will block until the socket stream is closed (either by the client or the server) or the client sends data. Calling close() on the DRDAConnThread only sets a flag, so this won't interrupt read(). I think Bryan was very close to solving this when he went through sessionTable and closed each session (which will take down the socket stream and make read() return -1 immediately). See one of his earlier comments about this. I don't think that issue is directly related to the hang or abandoning of sessions, so it should probably be filed as a separate issue.
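          To illustrate the point above, here is a minimal, hedged sketch (not Derby's actual DRDAConnThread code; the class and method names are made up) of a worker whose close() only sets a flag while the thread is blocked in read(), and why closing the socket is what actually unblocks it:

            import java.io.IOException;
            import java.io.InputStream;
            import java.net.Socket;

            class ReadBlockedWorker extends Thread {
                private final Socket socket;
                private volatile boolean closed;   // close() only sets this flag

                ReadBlockedWorker(Socket socket) { this.socket = socket; }

                public void run() {
                    try (InputStream in = socket.getInputStream()) {
                        int b;
                        // read() blocks here until the peer sends data or the socket is
                        // closed; setting 'closed' from another thread does not wake it up.
                        while (!closed && (b = in.read()) != -1) {
                            handle(b);
                        }
                    } catch (IOException e) {
                        // closing the socket from another thread makes the blocked read()
                        // fail, so the worker finally notices the shutdown and exits
                    }
                }

                void close() { closed = true; }   // does not interrupt a blocked read()

                void closeSocket() throws IOException { socket.close(); }   // this does

                private void handle(int b) { /* process one byte of the request */ }
            }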

          Knut Anders Hatlen added a comment -

          Attaching patch (derby-1326-skipThreads.diff) which modifies NetworkServerControlImpl the same way as Bryan's skipThreads.diff (attached to DERBY-1219). I cannot reproduce the hang with the patch. Derbyall ran cleanly on Solaris 10 x86, Sun JVM 1.5.0.

          Knut Anders Hatlen added a comment -

          Committed derby-1326-cleanup_exceptions.diff with revision 446538.

          Deepa Remesh added a comment -

          Thanks Knut for the answers. I still cannot wrap my head around old sessions hanging around in the server. I think this concerns the current code and is not related to your patch. The cleanups you suggested sound good to me.

          Bryan Pendleton added a comment -

          Hi Knut Anders, thanks for the clarifications. I am comfortable with your explanation.

          Knut Anders Hatlen added a comment -

          Deepa wrote:

          > * I think it is okay to re-use free/idle threads without closing
          > them. It may not be okay to re-use active threads without
          > closing/poisoning them. If we do this, wouldn't we end up with
          > threads getting invalid/closed sessions to work on?

          Yes, that is correct. This is also the case with the current code. The
          active threads remain alive until the client sends a new
          command. Since the embedded connection is closed, it gets a
          connection-closed exception and closes the session, before it
          discovers that the thread is closed as well, and exits. If we don't
          close the threads, we get the same behaviour, except that the thread
          is reused instead of closed down after the connection-closed
          exception.

          > * Isn't this a possible scenario which can cause a hang in a client:
          > When restart is in progress, a new client session comes in and gets
          > added to runQueue. After the session is added, the restart code
          > closes the new session or clears the runQueue. So the server just
          > discards the new session and the client keeps waiting.

          This is a possible scenario, but it won't cause a hang. Since the
          session and its socket streams have been closed, the client will get
          an IOException on read() (seen as DRDAProtocolException, for instance:
          insufficient data, read -1 bytes, expected 6).

          (But there is in fact a similar situation that can occur in the
          current code, and that will cause a hang: If a new session is added to
          runQueue after the restart code has closed all sessions in runQueue,
          but before it has called runQueue.clear(), the new session will be
          removed from runQueue without being closed, and the client will
          hang. derby-1326-cleanup_exceptions.diff should fix that issue.)
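          A hedged sketch of the race just described, with illustrative names rather than Derby's actual fields: if closing the queued sessions and clearing the queue are not atomic with respect to addSession(), a session added in the gap is discarded without ever being closed and its client hangs. Doing both steps under the runQueue lock, as the cleanup patch's synchronized block does, removes that window:

            import java.util.ArrayList;
            import java.util.List;

            class RunQueueSketch {
                interface Session { void close(); }   // close() also closes the client socket

                private final List<Session> runQueue = new ArrayList<Session>();

                void addSession(Session s) {
                    synchronized (runQueue) {
                        runQueue.add(s);
                        runQueue.notifyAll();   // wake a free worker thread
                    }
                }

                // Restart path: close and clear in one synchronized block, so no session
                // can be added between "close all" and "clear" and then silently dropped.
                void closeAllQueuedSessionsOnRestart() {
                    synchronized (runQueue) {
                        for (Session s : runQueue) {
                            s.close();   // client sees end-of-stream instead of hanging
                        }
                        runQueue.clear();
                    }
                }
            }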

          Knut Anders Hatlen added a comment -

          Bryan, I'm not sure I understand what impact this could have on
          DERBY-51. DERBY-51 is about server shutdown, whereas this issue is
          about a client opening a connection which has the side effect of
          shutting down the engine.

          > That is, it seems to me that a user might expect that if they shut
          > down the Network Server, that when that call returned to them, their
          > application could assume that the database was also shut down as of
          > that point.

          If the network server is started from command line and the user
          invokes NetworkServerControl.shutdown() this is exactly what happens
          in Derby 10.2. In 10.1, and in 10.2 when the server is not started
          from command line, the server does not shut down the engine/databases
          on shutdown. However, getConnection("...;shutdown=true") does not shut
          down the server, only the engine.

          I don't see that poisoning or not poisoning the threads should affect
          that behaviour or "force our hands" in any way w.r.t. DERBY-51. When
          the server at a later point is told to shut down, it is free to shut
          down (or not to shut down) the engine regardless of threads with old
          sessions hanging around. Also, threads with old sessions can be
          present on the server for a long time after an engine shutdown even in
          the current code (there is no call to Thread.interrupt() after the
          engine has been shut down).

          Knut Anders Hatlen added a comment -

          Attaching a patch (derby-1326-cleanup_exceptions.diff) with two small
          cleanups in preparation for the real fix (for which the approach
          hasn't been chosen yet). Neither of the two cleanups fixes the hang.

          Description of cleanup in DRDAConnThread.handleException(): When I was
          stress testing one possible fix for the hang, I noticed a new type of
          hang that I didn't understand. After a bit of investigation, I found
          that the stress test triggered a NullPointerException while running
          handleException(). handleException() is written so that all code paths
          end with closeSession() and close(). However, if a runtime exception
          is thrown, closeSession() and close() are skipped, and the
          DRDAConnThread stops without closing the session (which leads to a
          hang on the client). To fix this, I refactored handleException() so
          that closeSession() and close() are called from a finally clause.
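          The shape of that refactoring, as a hedged sketch (abstract stubs stand in for the real DRDAConnThread methods): whatever goes wrong while the error is being reported, closeSession() and close() now run from a finally clause, so a runtime exception can no longer leave the session open and the client hanging.

            abstract class HandleExceptionSketch {
                final void handleException(Exception e) {
                    try {
                        sendErrorToClient(e);   // may itself throw an unexpected runtime exception
                    } finally {
                        closeSession();         // always executed, even on a RuntimeException
                        close();
                    }
                }

                abstract void sendErrorToClient(Exception e);
                abstract void closeSession();
                abstract void close();
            }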

          Description of cleanup in NetworkServerControlImpl.startNetworkServer():
          Currently, the restart code closes and removes all sessions in
          runQueue, but it leaves them in sessionTable as long as the network
          server runs. Fix: Remove them from sessionTable as well. Also, a
          synchronized block was added around the runQueue loop in case
          addSession() or getNextSession() modifies the size of runQueue while
          the loop is running.

          I thought it would be better to get in these changes in advance to
          make the real fix smaller and cleaner.

          Derbyall ran cleanly with these changes.

          Deepa Remesh added a comment -

          Thanks Knut for working on this.

          Some questions/thoughts:

          • I think it is okay to re-use free/idle threads without closing them. It may not be okay to re-use active threads without closing/poisoning them. If we do this, wouldn't we end up with threads getting invalid/closed sessions to work on?
          • Isn't this a possible scenario which can cause a hang in a client: When restart is in progress, a new client session comes in and gets added to runQueue. After the session is added, the restart code closes the new session or clears the runQueue. So the server just discards the new session and the client keeps waiting.
          Bryan Pendleton added a comment -

          > For the latter (threads with sessions), it would make sense to notify
          > them about the shutdown so they could drop their invalid sessions
          > earlier, but that would be an optimization and not a correctness
          > issue, I think.

          I think there could be a correctness issue here, actually.

          One of my worries had to do with DERBY-51, and whether there was
          any implicit or explicit contract about how Network Server shutdown and
          engine/database shutdown were linked.

          That is, it seems to me that a user might expect that if they shut down
          the Network Server, that when that call returned to them, their application
          could assume that the database was also shut down as of that point.

          NOTE: I'm not saying that this is how things currently work; I'm just saying
          that it seems to me that users might make that assumption and we
          haven't been explicit about it one way or another. But that's DERBY-51,
          properly, not this bug.

          Still, if we make a change for this bug that forces our hand w.r.t. DERBY-51,
          we should make it clear that we're doing so intentionally so that we don't
          later reverse that change and re-introduce this problem.

          Knut Anders Hatlen added a comment -

          I'm also exploring these options:
          1) Poisoning all active threads (those with session != null) and keeping the free threads.
          2) Poisoning all threads and make poisoned threads put sessions back in runQueue. This would also require moving the creation of new threads to runQueueAdd() in order to prevent the queue from growing bigger than the pool of free threads.

          Knut Anders Hatlen added a comment -

          Francois wrote:

          > Were you actually implying some auto database reconnect of the
          > current DRDA threads created in the network server once the Derby
          > engine or/and databases are brought back up?

          Yes, when the server detects that one of the clients has shut down the
          engine (it checks the SQLSTATE of exceptions before sending them to
          the client), it (a) closes all sessions that are in the run queue
          waiting for a thread, (b) poisons all DRDAConnThreads, and (c) reboots
          the engine. Because the poisoned DRDAConnThreads are left around, they
          will eventually pick up a session but abandon it.
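          A hedged sketch of that restart sequence (the names, the SQLSTATE value, and the surrounding structure are illustrative, not the actual NetworkServerControlImpl code):

            import java.sql.SQLException;
            import java.util.List;

            abstract class RestartHandlerSketch {
                interface SessionSketch { void close(); }
                interface WorkerSketch  { void poison(); }   // mark a DRDAConnThread-like worker as closed

                // Called before an exception is sent to the client; an engine-shutdown
                // SQLSTATE (assumed to be XJ015 here) triggers the restart handling.
                void maybeRestart(SQLException e, List<SessionSketch> runQueue, List<WorkerSketch> workers) {
                    if ("XJ015".equals(e.getSQLState())) {
                        for (SessionSketch s : runQueue) {
                            s.close();      // (a) close sessions still waiting for a thread
                        }
                        runQueue.clear();
                        for (WorkerSketch w : workers) {
                            w.poison();     // (b) poison every worker thread
                        }
                        rebootEngine();     // (c) bring the engine back up
                    }
                }

                abstract void rebootEngine();
            }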

          Bryan wrote:

          > Deepa and I discussed this a fair amount in DERBY-1219; you should
          > probably review that discussion to see if there's more you can learn
          > from it. In particular, the discussion regarding "skipThreads.diff"
          > at: http://issues.apache.org/jira/browse/DERBY-1219#action_12378300

          There are many interesting patches attached to that issue (though I am
          definitely not going for any solution which involves
          Thread.interrupt()). skipThreads.diff seems to do exactly what I
          proposed. From the comments, it seems like there were two objections:

          1) It didn't fix all of the hangs.
          2) Removing the cleanup code didn't feel quite right, but if the
          threads could be reused after a restart, it should be okay.

          The hangs mentioned in (1) could be the same ones as DERBY-1817
          fixed. I also believe that the threads can be reused after a restart.

          > If we didn't close down the threads, would we do anything to notify
          > them of the shutdown? Were you proposing to leave them alone
          > entirely? Or were you proposing to notify them of the shutdown, but
          > not require the threads to exit?

          I don't think we need to notify them about the shutdown. If a thread
          is waiting for runQueue notification in getNextSession(), it is not
          bound to a session and hence doesn't need to know about the
          shutdown. If a thread has a session bound to it, it would stay
          sleeping (and unavailable for new sessions) until the client closes
          the connection or disconnects. This is quite similar to the current
          behaviour (the closed threads are not freed until there is some
          activity on them).
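          A hedged sketch (illustrative, not Derby's actual getNextSession()) of why a poisoned worker does not notice it has been closed until it has already taken a session off the run queue:

            import java.util.ArrayDeque;
            import java.util.Deque;

            class GetNextSessionSketch {
                private final Deque<Object> runQueue = new ArrayDeque<Object>();
                private volatile boolean closed;   // set when the thread is "poisoned"

                Object getNextSession() throws InterruptedException {
                    synchronized (runQueue) {
                        while (runQueue.isEmpty()) {
                            runQueue.wait();   // a free thread sleeps here and holds no session
                        }
                        Object session = runQueue.pop();
                        // Only now does the thread look at 'closed'. If it simply exits at
                        // this point, the session it just took is abandoned and the client hangs.
                        return closed ? null : session;
                    }
                }
            }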

          For the latter (threads with sessions), it would make sense to notify
          them about the shutdown so they could drop their invalid sessions
          earlier, but that would be an optimization and not a correctness
          issue, I think.

          > Are you sure a DRDAConnThread doesn't have dependencies on the
          > engine instance? The thread has a pointer to a session, and a
          > session seems like it could point to database objects (like
          > Connection objects, e.g.)

          All DRDAConnThread's dependencies on the engine go through the session
          pointer. For free threads, the session pointer is null, hence there
          are no dependencies. If it picks up a new session, it would depend on
          the new session's engine, not the old one.

          For active DRDAConnThread instances, there are dependencies on the old
          engine after the shutdown. However, this is also the case for the
          current code (the thread is marked as closed, but the session is not
          abandoned and the thread is not stopped until the client tries to
          perform an operation on the connection).

          Francois Orsini added a comment -

          Well, we do connect to the Derby engine instance from the network server layer, so if the embedded Derby engine goes down, we would have to reinstantiate the database connections (the logic is not there right now, if I am not mistaken), which means some state has to be kept around and some logic has to be added. But then, how do you handle incoming requests from the client(s), especially if it takes some time to recover the database(s)? What if there is a corruption and some database cannot be recovered?

          What if, after a reboot of the engine, some Derby properties have affected the behavior and configuration of the recovered databases? For instance, some database properties are static, so they won't be taken into account until the next reboot, which can also mean that a client (user) would no longer be allowed to authenticate if the authentication provider was changed before the next reboot. This is true for Strong Password Substitute, which only works with the BUILT-IN or NONE authentication providers right now. Hence, in this last case, a security mechanism negotiation needs to happen between the client and the network server.

          Another case is that if you enable authentication at the engine level but the Derby instance has not rebooted yet, you will then not be able to re-authenticate based on the cached database info per DRDA session, simply because a password may not have been cached at all: there was no authentication required beforehand, and now there is, because the static property got picked up by the database in question at the next engine/database reboot. The connection with the database would not work, and the client would probably be confused by an authentication exception when he/she was already authenticated before the Derby engine got shut down.

          Were you actually implying some auto database reconnect of the current DRDA threads created in the network server once the Derby engine or/and databases are brought back up?

          Bryan Pendleton added a comment -

          > Does anyone see problems with reusing the threads?

          Deepa and I discussed this a fair amount in DERBY-1219; you should
          probably review that discussion to see if there's more you can learn from it.
          In particular, the discussion regarding "skipThreads.diff" at:
          http://issues.apache.org/jira/browse/DERBY-1219#action_12378300

          If we didn't close down the threads, would we do anything to notify them of
          the shutdown? Were you proposing to leave them alone entirely? Or were
          you proposing to notify them of the shutdown, but not require the threads to exit?

          Are you sure a DRDAConnThread doesn't have dependencies on the engine
          instance? The thread has a pointer to a session, and a session seems like
          it could point to database objects (like Connection objects, e.g.)

          Knut Anders Hatlen added a comment -

          It is not clear to me why an engine shutdown should close down all DRDAConnThreads. I kind of understand why the sessions are closed (although that would happen implicitly anyway) since their embedded connections are closed as part of the engine shutdown. The DRDAConnThreads are however merely worker threads which don't have any dependencies on the engine instance that was active when they were created, so there is no reason why they can't be reused after the restart of the engine. Does anyone see problems with reusing the threads? In my opinion, it would be easier to ensure a consistent state after a restart if the threads were kept.

          Bryan Pendleton added a comment -

          Restart.java reliably reproduces the hang in my environment as well. Very nice.

          Knut Anders Hatlen added a comment -

          Attaching a repro (Restart.java) which always causes a hang in my environment. What it does is:

          1) Open 20 connections to the server.
          2) Close all connections.
          3) Shut down the engine (will make the server restart).
          4) Open and close a connection 20 times (usually hangs on the second connection).

          After (3), the server has 20 threads sleeping in runQueue.wait(). All of these are closed, but when they start waking up in (4), they don't notice that they have been closed until they have picked up a session from the run queue. That session will be abandoned, and the client hangs.
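          For readers without the attachment, a hedged reconstruction of what Restart.java does, based only on the four steps above. The URLs, port, database name, and the exact form of the engine-shutdown URL are assumptions and may differ from the attached repro:

            import java.sql.Connection;
            import java.sql.DriverManager;
            import java.sql.SQLException;

            public class RestartReproSketch {
                private static final String DB_URL = "jdbc:derby://localhost:1527/testdb;create=true";

                public static void main(String[] args) throws Exception {
                    Class.forName("org.apache.derby.jdbc.ClientDriver");   // 10.2-era driver loading

                    // 1) Open 20 connections to the server.
                    Connection[] conns = new Connection[20];
                    for (int i = 0; i < conns.length; i++) {
                        conns[i] = DriverManager.getConnection(DB_URL);
                    }
                    // 2) Close all connections.
                    for (Connection c : conns) {
                        c.close();
                    }
                    // 3) Shut down the engine; the network server notices and restarts it.
                    try {
                        DriverManager.getConnection("jdbc:derby://localhost:1527/;shutdown=true");
                    } catch (SQLException expected) {
                        // a successful shutdown is reported through an exception
                    }
                    // 4) Open and close a connection 20 times (the original repro usually
                    //    hung on the second of these connections).
                    for (int i = 0; i < 20; i++) {
                        DriverManager.getConnection(DB_URL).close();
                    }
                }
            }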

          Bryan Pendleton added a comment -

          I'm not actively working on this. I hope to return to it in the future, but if somebody else has time in the meantime, please feel free to explore this issue.

          Bryan Pendleton added a comment -

          I want to thank all the reviewers for their helpful feedback! Those are very good suggestions.

          I am hoping to return to this bug and continue working on it, incorporating the various ideas.

          I was dismayed by how hard it seemed it would be to make a variation of the Network Server "ping" API that is silent.

          John H. Embretsen added a comment -

          > Deepa Remesh commented:
          > -----------------------

          [snip]

          > From current behaviour, we do not shutdown network server when we shutdown Derby. Does anyone know if there is any plan to do as indicated by the comment: "Want to use Monitor.startModule, so it can all get shutdown when cloudscape shuts down, but can't get it working right now." ?

          Speaking of current behaviour, here are some observations I have made recently:

          a) In the "embedded server using properties" scenario, the Network Server starts when the embedded driver is loaded and stops when it is "unloaded", i.e. when the Derby system is shut down: DriverManager.getConnection("jdbc:derby:;shutdown=true");

          b) In the "embedded server using NetworkServerControl API" scenario, the Network server of course starts when NetworkServerControl.start() is called, and stops when shutdown() is called. However, if the Derby system is shut down (by loading the embedded driver and doing as shown above), clients can no longer connect (getting "No suitable driver". Why?). The Network Server port seems to be listening on the same port still, though, according to netstat. Then, if I try to restart the server using the API (without calling shutdown() first), I get the message "Could not listen on port 1527 on host 0.0.0.0.". This makes sense, since the port is already in use. However, after start() is called the second time, clients are in fact able to connect. There seems to be something happening when start() is called, even if the port is busy, that is needed to make the server able to accept connections again.

          I have not done any code inspection in these cases, so I don't know if this is very helpful with regards to this Jira issue. I just reacted when I saw the above comment.
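          For observation (a), a minimal sketch of shutting down the whole Derby system through the embedded driver; on a successful system shutdown Derby reports SQLState XJ015 via an exception, so the "failure" is expected:

            import java.sql.DriverManager;
            import java.sql.SQLException;

            class SystemShutdownSketch {
                static void shutDownDerbySystem() {
                    try {
                        DriverManager.getConnection("jdbc:derby:;shutdown=true");
                    } catch (SQLException e) {
                        // XJ015 = Derby system shutdown completed normally
                        if (!"XJ015".equals(e.getSQLState())) {
                            throw new IllegalStateException("unexpected failure during shutdown", e);
                        }
                    }
                }
            }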

          Deepa Remesh added a comment -

          Thanks again for posting a detailed description. I read through the description and the patch "unify_NSImpl_instances.diff" quickly.

          I have a few small comments/questions:

          • Currently, the patch does the following: "Modify the DRDAServerStarter boot code so that it loads the embedded engine." The description also mentions this: "an alternative would be to load the embedded engine in NetworkServerControlImpl.start()". I like this alternative. I am thinking of the scenario where the derby.drda.startNetworkServer property is set and we boot the embedded driver. In this scenario, if we load the embedded engine in the DRDAServerStarter boot code, it will appear that we are trying to boot the embedded driver twice. However, this will not do any harm as it will be a no-op. I think it would be better to avoid doing this, as it causes confusion about why we are loading the driver again when we trace the path from loading the embedded engine.
          • There is a slightly different code path when we start network server from the command line. In this case, we create an instance of NetworkServerControlImpl and then call executeWork which processes the "start" command by calling blockingStart(). There is no DRDAServerStarter involved here as we do not need to launch another thread. This is not very relevant for the current patch but just mentioning it in case we plan to re-organize the code.
          • It looks like the only purpose of calling DRDAServerStarter in the NetworkServerControlImpl.start method is to launch a separate thread. The javadoc comment for NetworkServerControlImpl.start says:

            /**
             * Start a network server. Launches a separate thread with
             * DRDAServerStarter. Want to use Monitor.startModule,
             * so it can all get shutdown when cloudscape shuts down, but
             * can't get it working right now.
             */

          From current behaviour, we do not shut down the network server when we shut down Derby. Does anyone know if there is any plan to do as indicated by the comment: "Want to use Monitor.startModule, so it can all get shutdown when cloudscape shuts down, but can't get it working right now."?
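          A hedged sketch of the derby.drda.startNetworkServer scenario discussed in the first point above (the database name and the Class.forName style of driver loading are illustrative):

            import java.sql.DriverManager;

            class AutoStartSketch {
                public static void main(String[] args) throws Exception {
                    // Must be set before the embedded engine boots.
                    System.setProperty("derby.drda.startNetworkServer", "true");
                    // Loading the embedded driver boots the engine, and DRDAServerStarter
                    // then brings up the network server as a side effect.
                    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
                    DriverManager.getConnection("jdbc:derby:autostartdb;create=true").close();
                }
            }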

          Bryan Pendleton added a comment -

          Attached is 'unify_NSImpl_instances.diff', with these notes to explain. This
          patch belongs with the rest of the DERBY-1326 changes, but I've extracted it
          out into a standalone patch to make it easier to study and review.

          Recently, I've been investigating the hang that Deepa uncovered, which occurs
          with the latest patch proposal for DERBY-1326.
          http://issues.apache.org/jira/secure/attachment/12334997/withNewThreadAndNoShutdownOfCurrentSession.diff

          Investigating that problem has taken me through some interesting analysis
          of NetworkServer startup/shutdown issues, which I'd like to describe here, to
          see what sort of reaction I get from the team.

          There are several different ways to start up the Network Server, but after
          some twists and turns there appear to be two essential techniques:
          1) You can instantiate a NetworkServerControl object, and call the start()
          method on it.
          2) You can define the derby.drda.startNetworkServer property to true, and then
          start the embedded engine.
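          A minimal sketch of technique (1), using the public NetworkServerControl API (the host, port, and null console writer are illustrative defaults):

            import java.net.InetAddress;
            import org.apache.derby.drda.NetworkServerControl;

            class ApiStartSketch {
                public static void main(String[] args) throws Exception {
                    NetworkServerControl server =
                            new NetworkServerControl(InetAddress.getByName("localhost"), 1527);
                    server.start(null);   // delegates to NetworkServerControlImpl, as described below
                }
            }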

          These two code paths proceed somewhat differently, but end up getting to the
          same place:
          1) NetworkServerControl.start() delegates to NetworkServerControlImpl.start(),
          which calls NetworkServerControlImpl.startNetworkServer(), then instantiates
          a DRDAServerStarter object and calls its boot() method.
          2) Loading the embedded engine eventually calls JDBCBoot.boot(), which notices
          that derby.drda.startNetworkServer has been set, and calls
          Monitor.startSystemModule() to instantiate a DRDAServerStarter object and
          call its boot() method.

          The DRDAServerStarter.boot() method uses reflection to:

          • dynamically load the Network Server code
          • create an instance of NetworkServerControlImpl
          • create a daemon server thread, which will then:
            • call NetworkServerControlImpl.blockingStart()

          So whichever way that we start the Network Server, we end up constructing
          a NetworkServerControlImpl instance and calling blockingStart() on that
          instance from a background thread. This is good.

          However, there is an important and interesting difference between the two
          startup methods:

          DRDAServerStarter.boot() always creates its own instance of the
          NetworkServerControlImpl object and calls blockingStart() on that instance.
          So if you start the Network Server by instantiating a NetworkServerControl
          object in your own code, that NetworkServerControl instance creates a
          NetworkServerControlImpl instance, but then when it calls DRDAServerStarter,
          we end up creating a second NetworkServerControlImpl instance to use as
          the master instance, and we never really use the NetworkServerControlImpl
          instance that was used by the NetworkServerControl object that you created.

          Worse, both NetworkServerControlImpl instances are started up, in a certain
          sense, because NetworkServerControlImpl.startNetworkServer() is called on
          each instance:

          • NetworkServerControlImpl.start() calls startNetworkServer() on itself before
            it instantiates the DRDAServerStarter object and calls boot()
          • NetworkServerControlImpl.blockingStart() calls startNetworkServer() on
            itself as its first action.

          So if you start up the Network Server by calling NetworkServerControl.start(),
          you end up actually creating two Network Server instances, although one of
          them doesn't really do very much. This is the source of the hang that I created
          with the most recent patch proposal to DERBY-1326: when I changed the
          startNetworkServer() method to create a new thread, that code ends up being
          called twice, but we don't shut down both of these instances, so we never
          shut down the extra thread that I created, and that caused the process to
          hang and not terminate.

          Clearly, one way out of this problem is to re-arrange the thread creation logic
          so that the thread is only created when the real NetworkServerControlImpl
          instance is started.

          But I was curious about why there were two NetworkServerControlImpl instances,
          since it seems quite clear to me that the NetworkServerControlImpl class is
          designed to behave as a singleton object. I thought that if I had made this
          assumption that there would only ever be a single instance in a Java app,
          others might tend to make this same assumption in the future, and more bugs
          like the one I accidentally introduced might get introduced in the future.

          So, for the last week or so, I've been studying this code, trying to figure
          out if we really need to have two NetworkServerControlImpl instances, or
          whether it would be cleaner to just have a single instance.

          My conclusion is that, for the most part, we don't need to have both
          instances, and it is cleaner to just have a single instance, but there
          are a few messy details.

          The changes that I've been experimenting with in this area are as follows:
          1) Refactor the DRDAServerStarter code slightly so that the caller can
          optionally pass in the NetworkServerControlImpl instance to use. If the
          caller does not pass in an instance, DRDAServerStarter creates one via
          reflection, as it does now.
          2) Simplify NetworkServerControlImpl.start() so that it no longer calls
          startNetworkServer() on itself, but instead simply calls DRDAServerStarter,
          passing the "this" pointer in as the instance to use for starting the server.
          3) Modify the DRDAServerStarter boot code so that it loads the embedded
          engine. The current DRDAServerStarter code assumes that the embedded engine
          has already been started; I believe that this is the main reason for the
          current code's call to startNetworkServer() from NetworkServerControlImpl's
          start() method, as a side effect of startNetworkServer() is to load the
          embedded engine. It seems simple to have DRDAServerStarter load the
          embedded engine itself, but an alternative would be to load the embedded
          engine in NetworkServerControlImpl.start().

          With these changes in place, the Network Server startup and shutdown code
          seems substantially cleaner to me, and the tests seem to run very well.
          I'm pleased with the results.

          However, there is one messy detail that remains to be resolved, and it
          involves the Network Server shutdown processing. The shutdown() method for
          the Network Server, in NetworkServerControlImpl.shutdown(), uses the following
          algorithm:

          • tell the server to shut down
          • loop for a while, ping'ing the Network server, until we get an error
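          A hedged sketch of that shutdown algorithm using the public API (the retry count and sleep interval are illustrative; NetworkServerControl.ping() throws an exception once the server stops responding):

            import java.net.InetAddress;
            import org.apache.derby.drda.NetworkServerControl;

            class ShutdownWaitSketch {
                static void shutDownAndWait() throws Exception {
                    NetworkServerControl server =
                            new NetworkServerControl(InetAddress.getByName("localhost"), 1527);
                    server.shutdown();                  // tell the server to shut down
                    for (int i = 0; i < 100; i++) {     // then ping until we get an error
                        try {
                            server.ping();
                            Thread.sleep(100);
                        } catch (Exception serverIsDown) {
                            return;
                        }
                    }
                }
            }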

          Unfortunately, during this "ping loop", the code wants to ensure that no
          stray messages from the failed ping operation escape to the user, so the
          code intentionally disables the Network Server console.

          When there were two NetworkServerControlImpl instances in play, intentionally
          disabling the console of the one instance was a relatively safe thing to do.

          However, when there is only a single instance in the process, disabling the
          console is a quite disruptive action, because it affects all the messages
          that the Network Server might want to print, including the "normal shutdown"
          message that the Network Server prints at the conclusion of shutdown.

          Therefore, I'm coming to believe that we shouldn't disable the console
          during shutdown, but rather should implement a variant "ping" operation which
          doesn't print any messages when it fails, but instead just throws an exception,
          which the shutdown code could catch and handle. Unfortunately, this is not
          a trivial one-line change, as the code which prints error messages to the
          Network Server console is buried deep in the Network Server command
          processing code.

          Obviously, I've still got more work to do here. But I'm attaching the current
          state of my changes anyway, since I think it's a useful intermediate stage
          and would be interesting to reviewers.

          The other comment to be made is that this change is now becoming quite large,
          and probably should be split into multiple smaller patches. I'll pursue that
          strategy later, once I get to a point where I have code that seems solid.

          Thanks in advance for all comments, suggestions, and other feedback!

          Bryan Pendleton added a comment -

          Well, this is very interesting! There is always something new to learn.

          The test program intends to do the following (a small sketch using the public API follows the list):

          • start up the Network Server
          • check that it started successfully
          • shut it down
          • start it up with a slightly different configuration
          • check that it started successfully
          • shut it down
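
          A minimal sketch of that cycle using the public NetworkServerControl API; the port
          numbers and the polling interval are illustrative assumptions, not what the actual
          test does.

          import java.net.InetAddress;
          import org.apache.derby.drda.NetworkServerControl;

          public class StartCheckShutdownTwice {
              public static void main(String[] args) throws Exception {
                  for (int port : new int[] { 1527, 1528 }) {   // two slightly different configurations
                      NetworkServerControl server =
                          new NetworkServerControl(InetAddress.getByName("localhost"), port);
                      server.start(null);                       // start the Network Server in this JVM
                      waitUntilUp(server);                      // check that it started successfully
                      server.shutdown();                        // shut it down again
                  }
              }

              // start() returns before the server is listening, so poll with ping() for a while.
              private static void waitUntilUp(NetworkServerControl server) throws Exception {
                  for (int i = 0; i < 30; i++) {
                      try {
                          server.ping();
                          return;
                      } catch (Exception notYet) {
                          Thread.sleep(500);
                      }
                  }
                  throw new Exception("server did not come up");
              }
          }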

          So we think we are starting it up twice and shutting it down twice.

          When I investigated the hang that Deepa discovered, however, I learned that
          in fact we are starting up the Network Server four times, and shutting it down
          twice. These extra Network Server instances didn't harm anything in the past,
          because it was just a waste of memory to start up the extra instances.

          However, when I added the new thread creation code in the network server
          startup, it meant that whenever the network server started, a DRDAConnThread
          was also started. And if that network server wasn't shut down, then the
          DRDAConnThread instance remained, and prevented the server from shutting
          down, causing the hang.

          The reason that we are getting the extra NetworkServerControlImpl instances is very
          simple: when you call NetworkServerControlImpl.start, it does the following:

          public void start(PrintWriter consoleWriter)
              throws Exception
          {
              DRDAServerStarter starter = new DRDAServerStarter();
              starter.setStartInfo(hostAddress, portNumber, consoleWriter);
              startNetworkServer(null);
              starter.boot(false, null);
          }

          That is, it directly starts this instance, then it spawns a background thread to
          run the DRDAServerStarter code, which instantiates another NetworkServerControlImpl.

          My first idea was to remove the call to startNetworkServer() from this method, and just
          let DRDAServerStarter start up the Network Server.

          Sure enough, this fixes the hang in DerbyNetNewServer.

          And, conceptually, it seems like a good change to make, because it doesn't seem
          like we should be starting the Network Server twice.

          However, removing the call to startNetworkServer causes test failures in a bunch of
          other Network Server tests, such as NSinSameJVM, dataSourcePermissions_net,
          ownServerTests, and testSecMec.

          So, there's a bunch more investigation to do here.

          Deepa Remesh added a comment -

          I am seeing a hang in the test derbynet/DerbyNetNewServer.java with the 'resolve_DRDConnThread_conflict.diff' patch.

          I was using the same workspace where I had applied this patch to try out some things for DERBY-803. I started seeing a hang in the derbynet/DerbyNetNewServer.java test. On looking at the .tmp file, I found that the test itself was running to the end; it looked like it was waiting for the network server process to stop. I rechecked by reverting my changes, reapplying the patch and running a full build. I was still getting a hang in the test. It looks like this patch introduces additional changes to network server shutdown.

          In an earlier mail, Bryan had mentioned that derbyall ran successfully with this patch. This could be because of the "timeout=60" property specified in the derbynetclientmats suite. To check this, I ran the test by specifying "-Dtimeout=3" on the command line. Then, the test process ran to the end within the timeout period and the test passed. It had this additional line on the console:
          "Server Process did not complete in time. Destroying..."

          Bryan, if you still have the results from the derbyall run with this patch, can you please check how long the test derbynet/DerbyNetNewServer.java took? If it took close to an hour to complete, it will confirm what I just observed. Otherwise, it may be a problem with my test environment.

          Deepa Remesh added a comment -

          I applied 'resolve_DRDConnThread_conflict.diff' and ran the repro 50 times without any hang. This patch solves the hang mentioned in this issue. It also resolves the problems discovered while working on this issue.

          I have only minor comments:

          • Indentation seems to be out of place in NetworkServerControlImpl.java in the comment added to the startNetworkServer method, and also in the changes in checkDataSource.java.
          • It may be useful to have some more commentary in the startNetworkServer method as to why we interrupt the threads in addition to calling close on them.
          Bryan Pendleton added a comment -

          Thanks for desk-checking the patch, Deepa! I updated my workspace and resolved the merge conflicts in DRDAConnThread, and am re-attaching the patch proposal as 'resolve_DRDAConnThread_conflict.diff'. Hopefully you can now apply this patch to the head of the trunk.

          Deepa Remesh added a comment -

          I read through the changes in the 'withNewThreadAndNoShutdownOfCurrentSession.diff' patch and I think they address all the problems found in the previous comments. I could not apply the patch to my workspace; I think some intermediate check-ins changed the same file.

          Bryan Pendleton added a comment -

          This latest attachment, 'withNewThreadAndNoShutdownOfCurrentSession.diff',
          contains the following changes relative to the previous patch proposal
          (sessionMgmt1_and_nosessionsforclosedthreads.diff):

          1) runQueueAdd now calls runQueue.notifyAll() rather than runQueue.notify()
          2) startNetworkServer now primes the thread pool by starting the first
          DRDAConnThread right away, rather than waiting for the first Session to come
          in. To make this work, I had to make a few small tweaks to DRDAConnThread
          initialization so that it would defer its initialization until it had its first Session.
          3) startNetworkServer now takes the calling Session as an argument. In the
          initial server start case this is null, but if a restart is triggered by a shutdown
          exception, the caller passes its Session in. startNetworkServer uses this to
          avoid closing the Session which is triggering the restart, so it can stay open
          long enough to return the shutdown exception back to the client.
          4) The client-masters for checkDataSource and checkDataSource30 are
          modified to add the shutdown exception. This modification of the master file
          goes hand-in-hand with the modification of the test program to re-enable
          the server shutdown/restart during the test.

          Please take a look at this latest proposal and let me know your comments.

          Bryan Pendleton added a comment -

          I had an idea about how to address the "failure to create a new thread" aspect
          of the hang, as follows:

          Suppose that, at the end of restart processing, we go ahead and create a new
          DRDAConnThread instance, even though at that point there may not be
          any sessions for it to process yet. If not, it will just go into getNextSession and wait.

          With the current combined patch, during restart processing, we "poison" each
          of the DRDAConnThread instances by closing and interrupting them, so that
          we know that at some point those threads will die.

          The problem that we had was that if a new session came in while all those
          poisoned threads were still in the process of dying, then the create-new-session
          logic might not realize that it needed to create a new thread because it might
          mistakenly see a non-zero freeThreads count which was counting only poisoned
          threads.

          But if we always create at least one new thread, then we know that the new
          session will eventually get processed by that new thread.

          In order for this to be 100% solid I think we have to change runQueueAdd to do
          a notifyAll(), but that's a slightly different topic, and I think I'll put that bit into
          the Wiki page, because it should live beyond this bug.

          So, in conclusion, here is my latest idea:
          1) At the end of restart processing, create a new thread even though we may not
          have any new sessions for it to process yet.
          2) change runQueueAdd to do a notifyAll(), not a notify(). (Both points are sketched below.)
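
          The following is a small, self-contained sketch of the two points; the names
          runQueueAdd and getNextSession echo the discussion, but this is not Derby's code,
          only an illustration of why notify() alone can hand the single wakeup to a poisoned
          worker.

          import java.util.ArrayDeque;
          import java.util.Queue;

          public class RunQueueSketch {
              private final Queue<Object> runQueue = new ArrayDeque<Object>();

              // Producer side: notifyAll() wakes every waiter, so even if some of the waiting
              // workers are poisoned (closed), a live worker still sees the new session.
              public synchronized void runQueueAdd(Object session) {
                  runQueue.add(session);
                  notifyAll();
              }

              // Worker side: a closed (poisoned) worker never takes a session; it returns null
              // and exits, so the freshly created thread from point 1) is the one that
              // eventually serves the queue.
              public synchronized Object getNextSession(boolean closed) throws InterruptedException {
                  while (!closed && runQueue.isEmpty()) {
                      wait();
                  }
                  return closed ? null : runQueue.poll();
              }
          }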

          Deepa Remesh added a comment -

          Hi Bryan, I don't think you are being picky. After reading your explanation, I agree the combined patch still has some chance of causing a hang. As you explain, the following need not always be true: "The interrupt calls and the added synchronization assure that freeThreads will become 0 before we start accepting new sessions". So the assumption that freeThreads will have the right value (0) when a new session comes in after a restart does not hold.

          Keeping the code to reset "freeThreads = 0" during restart seems to be an option. But that will cause the problem that you describe under "freeThreads counter maintenance" in http://wiki.apache.org/db-derby/NetworkServerSessionManagement. The counter can become negative.

          If we plan to use this patch, we need to find out some way to ensure the freeThreads counter is maintained correctly.

          Bryan Pendleton added a comment -

          Hi Deepa, thank you for the excellent analysis. It is very helpful.

          The only part that troubles me is this:

          > The interrupt calls and the added synchronization assure that freeThreads will
          > become 0 before we start accepting new sessions

          I am not yet able to see why that is necessarily true.

          I agree that we issue the interrupt() calls prior to completing the restart and prior
          to accepting new sessions.

          However, the thread which is interrupted responds to this interrupt asynchronously.

          I believe that the only guarantee made by Java is that the thread which we are interrupting
          will receive that interrupt some time in the future, but not necessarily immediately,
          and not necessarily before we reach the end of the startNetworkServer() method
          and declare the restart complete.

          In practice, given that the startNetworkServer() method is going to unload the
          embedded engine, call System.gc(), and then reload the embedded engine,
          that's probably enough work so that we have a very good chance that all the
          old threads have received and responded to their interrupt() before we complete.

          But I don't think it's actually guaranteed by the current changes in the combined patch.

          I may be being too picky here. What do you think?

          Deepa Remesh added a comment -

          Bryan,
          Combined with your code changes in 'sessionMgmt1.diff', I think my previous comment about 'no-sessions-for-closed-threads.diff' no longer applies. This is the reason I think the combined patch is okay:
          "
          The problem we found previously with 'no-sessions-for-closed-threads.diff' was that "createSession" thought free threads were available (freeThreads != 0), but the threads it counted as available were already closed. For the combined patch to work, freeThreads must become 0 after a restart, to ensure that a new thread is created when a new session comes in after a restart. I think the interrupt calls and the added synchronization assure that freeThreads will become 0 before we start accepting new sessions.
          "

          To rephrase it,

          1) With only 'no-sessions-for-closed-threads.diff', we can get to the following scenario: when a client starts a new session, the "createSession" logic in the server checks whether free threads are available. Because of the restart, the freeThreads counter may be stale (freeThreads != 0), which makes the "createSession" logic think free threads are available. Hence, it does not create a new thread for the incoming session; instead, the session is put on the runQueue. However, all the existing threads were marked closed by the restart, and we do not give sessions to closed threads, so this session keeps waiting for a thread. When we start another session (e.g. with ij), the hang goes away: when the connection from ij comes in, the newly created thread is able to work on the waiting session. I think the patch does the right thing by not giving sessions to closed threads, but it can sometimes strand sessions because of the incorrect freeThreads value in the "createSession" logic.

          2) With the combined patch 'sessionMgmt1_and_nosessionsforclosedthreads.diff', I think we can assume the freeThreads counter will have the correct value when a new session comes in after a restart: the interrupt calls and the added synchronization assure that freeThreads will become 0 before we start accepting new sessions. If this assumption is correct, then we will not have the stranded-session case mentioned in 1) above.

          Bryan Pendleton added a comment -

          Deepa, I was experimenting with your "sessionMgmt1_and_nosessionsforclosedthreads.diff" patch.

          I definitely cannot reproduce any hang with this diff; the only problem I can reproduce is the
          "Insufficient Data" exception which I discussed in the previous comment above
          (http://issues.apache.org/jira/browse/DERBY-1326#action_12414193).

          However, I am concerned about a comment you made several weeks back, when we
          were first discussing this problem in the context of DERBY-1219:
          http://issues.apache.org/jira/browse/DERBY-1219#action_12378954

          I think that your comment is still quite valid, and I am wondering how you think we should address it?

          Bryan Pendleton added a comment -

          I think that the cause of the new "Insufficient data" exception that is appearing with sessionMgmt1.diff
          has to do with a specific part of the change to the NetworkServerControlImpl restart processing.

          I changed the restart processing so that, instead of just closing the Session objects on the RunQueue,
          the restart code closes all the Session objects in the sessionTable. In general, I think that this is
          the right thing to do, as restart processing needs to close all the sessions, not just those that are
          currently waiting for a free thread.

          However, the new code is too powerful, because it also closes the current Session, which is the
          session that is performing the shutdown.

          Closing the current Session terminates it before it has a chance to format and return the Shutdown Exception
          message back to the client. When the client doesn't get the expected Shutdown Exception, it reports
          this as "Insufficient data".

          If I tweak the patch so that the restart logic is changed back to just closing the Sessions on the RunQueue,
          the nasty "Insufficient data" exception disappears, and the client gets a much more reasonable message:

          ERROR XJ015: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ015, SQLERRMC: Derby system shutdown.

          So I think that the patch should be slightly modified, so that this critical loop in the restart processing
          gets implemented as:

          "For each Session in the sessionTable, except for the Session performing the shutdown, close it"

          But I'm not quite sure how to code that "except" clause.
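
          For what it's worth, the "except" clause could look something like the following
          standalone sketch; Session, sessionTable, and the close() details are stand-ins
          for the Network Server's internals, and the only point is the identity check that
          skips the calling session.

          import java.util.Enumeration;
          import java.util.Hashtable;

          class SessionCloseSketch {
              static void closeAllExcept(Hashtable<Integer, Session> sessionTable,
                                         Session currentSession) {
                  for (Enumeration<Session> e = sessionTable.elements(); e.hasMoreElements();) {
                      Session s = e.nextElement();
                      if (s == currentSession) {
                          continue;   // leave the session driving the shutdown open so it can
                                      // still return the shutdown exception to its client
                      }
                      s.close();
                  }
              }

              // Minimal stand-in so the sketch compiles on its own.
              static class Session {
                  void close() { /* release the socket and per-session resources */ }
              }
          }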

          Deepa Remesh added a comment -

          Hi Bryan,

          I saw your question and suggestions in the previous comment while I was working on this new combined patch. So this comment may sound off-base but I am just posting my thoughts.

          I was experimenting a bit with your patch and I think it is not the code you added in the DRDAConnThread.run method that makes this new exception appear instead of the old one. The reason I say this is that I modified your patch, trying to see how we can avoid the closed_thread_getting_valid_session scenario; in that patch, I removed the code you commented with //FIXME in DRDAConnThread.run. Even after that, I see the same exception instead of the usual shutdown exception. However, I have not figured out what is causing this new exception.

          In the attached trial patch 'sessionMgmt1_and_nosessionsforclosedthreads.diff', I combined your patches 'sessionMgmt1.diff' and 'no-sessions-for-closed-threads.diff'. To me, it looked like 'sessionMgmt1.diff' solves all the problems we found except the closed_thread_getting_valid_session scenario, which seems to be solved by your previous patch 'no-sessions-for-closed-threads.diff'. The problem we found previously with 'no-sessions-for-closed-threads.diff' was that "createSession" thought free threads were available (freeThreads != 0), but the threads it counted as available were already closed. For the combined patch to work, freeThreads must become 0 after a restart, to ensure that a new thread is created when a new session comes in after a restart. I think the interrupt calls and the added synchronization assure that freeThreads will become 0 before we start accepting new sessions. Does this sound right?

          I have run the repro multiple times with this patch and did not see the hang. If you think this patch makes sense, can you please try it on the machine where you can repro the hang easily to see if it solves the hang? Thanks.

          I will think more about your question and suggestions in the previous comment as I am planning to look at DERBY-803/DERBY-1020 later today.

          Bryan Pendleton added a comment -

          Hi Deepa, thanks very much for looking at the patch!

          The new message that you see is definitely a result of this patch; I see it as well,
          and should have mentioned it when I attached the patch.

          I believe that the new symptom is a result of the code I added to DRDAConnThread's
          run() method, to try to handle the case when a thread returns from getNextSession() to
          find itself in the interesting situation of having been given a Session to run, but also having
          been closed():

          if (closed())
          {
              // FIXME – I think that if we return from
              // getNextSession() with a non-null Session, but
              // we ourselves have been closed(), this indicates
              // a bug in the Network Server session management,
              // as it shouldn't have assigned a valid session
              // to a closed thread. Closing the session here
              // is rude, but at least it prevents a hard hang. (DERBY-1326)
              try
              {
                  if (session != null)
                      session.close();
              }
              catch (Exception e)
              {
                  e.printStackTrace();
              }
              break;
          }

          It's clear that this code is wrong, as evidenced by the symptoms you see.

          Note also that in my code I didn't know what to do if I hit an exception; this is
          very closely related to bug DERBY-1020 (thanks Kathey for pointing this out).

          The question I have is (and this is the main question left in this bug, I think):

          what should the DRDAConnThread do in this situation?

          The possibilities I've thought of include:

          • ignore the fact that we've been closed; re-open ourselves and handle
            the Session we've been given
          • give the Session back to the NetworkServerControlImpl and ask it to
            assign a different thread to this task
          • hard-close the Session, then exit the thread. This was the solution I tried,
            and as you can see it has some awkward side-effects.

          Either of the first two seems like it might be a reasonable approach in the
          case where the Network Server was restarted, not shut down, but they both
          seem wrong in the case where the Network Server was shut down; in that
          case it seems that the Session is doomed and we should just pack our
          bags and leave quietly.

          Please let me know what you think about how best to handle this case.

          Deepa Remesh added a comment -

          Bryan, I applied the patch and ran the repro. I do not get the hang, but I see this exception instead of the usual shutdown exception after a system shutdown:

          EXPECTED EXCEPTION: Insufficient data while reading from the network - expected a minimum of 6 bytes and received only -1 bytes. The connection has been terminated.

          instead of

          EXPECTED EXCEPTION: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ015, SQLERRMC: Derby system shutdown.

          I am still looking at the changes but wanted to check if you get the same results with the patch?

          Bryan Pendleton added a comment -

          I've been experimenting with the attached patch "sessionMgmt1.diff". In my environment, it resolves the hang in checkDataSource, and it also passes Deepa's custom "repro1326" test, even after I modified that test to run the shutdown test case 200 times. So I'm starting to feel it's pretty reliable as far as avoiding hangs.

          However, I'm not completely happy with it. One issue is that I'm still seeing situations in which a closed() DRDAConnThread is given a non-closed Session to run, which I think is incorrect behavior (non-closed sessions should only be assigned to non-closed threads).

          So I've still got a ways to go on this problem. But since I've had some significant progress, I wanted to post the work-in-progress changes in order to get feedback, and in order not to lose them.

          Comments and suggestions most welcome!

          Deepa Remesh added a comment -

          Bryan, I took a quick look at the excellent Wiki page you created about network server session management. I have one minor comment: in the sections on Session lifetime, it may be good to note that sessions are also ended when the network server restarts (when the Derby system is shut down) or shuts down. For DRDAConnThread lifetime, it would be good to mention the network server restart, as this seems to be the cause of the bugs we are seeing.

          I am also attaching a smaller repro, 'repro1326.java', which I had created to simulate the hang. With this repro too, the hang is intermittent; I have not been able to get a predictable hang so far. To run the repro, start the network server on port 1527 and run 'java -Dframework=DerbyNetClient repro1326'.

          Bryan Pendleton added a comment -

          I've placed some background notes at http://wiki.apache.org/db-derby/NetworkServerSessionManagement. I'd like to get some review and checking of this background material before moving on to the next step. Thanks!

          Bryan Pendleton added a comment -

          I am interested in working on this problem. I have some thoughts and ideas which I will try to develop on the wiki and on the mailing list. As the discussion evolves, hopefully the proper solution will become clear.

          Deepa Remesh added a comment -

          To repro this issue, run the test jdbcapi/checkDataSource.java with the client framework. The test will hang intermittently.

          To avoid this hang, the code for system shutdown is currently not executed when running with the client driver. This condition has to be removed when this issue is resolved.


            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Deepa Remesh
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:

                Development