Derby
  1. Derby
  2. DERBY-1817

Race condition in network server's thread pool

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.2.1.6
    • Fix Version/s: 10.2.1.6
    • Component/s: Network Server
    • Labels:
      None

      Description

      If there is a free DRDAConnThread when a client connects to the network server, the session is put into a queue from which one of the free DRDAConnThreads can pick it up. However, if another client connects after the session was put into the queue, but before the DRDAConnThread has picked it up, one might end up with more sessions in the queue than there are free threads. This can lead to hangs like the ones that we currently see in many of Ole's tests (for instance checkDataSource - http://www.multinet.no/~solberg/public/Apache/TinderBox_Derby/testlog/SunOS-5.10_i86pc-i386/440518-derbyall_diff.txt).

      1. 1817-4-cleanup.stat
        0.2 kB
        Knut Anders Hatlen
      2. 1817-4-cleanup.diff
        4 kB
        Knut Anders Hatlen
      3. 1817-3-sync.diff
        3 kB
        Knut Anders Hatlen
      4. 1817-2.diff
        4 kB
        Knut Anders Hatlen
      5. 1817.stat
        0.1 kB
        Knut Anders Hatlen
      6. 1817.diff
        4 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Knut Anders Hatlen added a comment -

          Thank you for the quick reviews, Francois and Bryan!
          Committed 1817-3 with revision 442462, and 1817-4 with revision 442463.
          Leaving the issue open until the fixes have been merged to 10.2.

          Show
          Knut Anders Hatlen added a comment - Thank you for the quick reviews, Francois and Bryan! Committed 1817-3 with revision 442462, and 1817-4 with revision 442463. Leaving the issue open until the fixes have been merged to 10.2.
          Hide
          Francois Orsini added a comment -

          1817-3-sync.diff looks good to me Knut. Cheers.

          Show
          Francois Orsini added a comment - 1817-3-sync.diff looks good to me Knut. Cheers.
          Hide
          Bryan Pendleton added a comment -

          Both 1817-3 and 1817-4 look like good changes to me. Thanks for continuing
          to clean up this code. I'm eager for you to have a look at DERBY-1326 now

          Show
          Bryan Pendleton added a comment - Both 1817-3 and 1817-4 look like good changes to me. Thanks for continuing to clean up this code. I'm eager for you to have a look at DERBY-1326 now
          Hide
          Knut Anders Hatlen added a comment -

          The attached patch (1817-4-cleanup) contains the cleanup suggested by Bryan. It is independent of the 1817-3-sync patch, which is not committed yet.

          Description:

          • moves generation of connection number into addSession()
          • adds new method removeThread() which can be used instead of getThreadList().remove()
          • removes methods that are no longer used
          • makes methods that are only used by NetworkServerControlImpl private
          Show
          Knut Anders Hatlen added a comment - The attached patch (1817-4-cleanup) contains the cleanup suggested by Bryan. It is independent of the 1817-3-sync patch, which is not committed yet. Description: moves generation of connection number into addSession() adds new method removeThread() which can be used instead of getThreadList().remove() removes methods that are no longer used makes methods that are only used by NetworkServerControlImpl private
          Hide
          Knut Anders Hatlen added a comment -

          Attached 1817-3-sync.diff which reduces the amount of code synchronized on runQueue in NetworkServerControlImpl.addSession(). There is still more code synchronized on runQueue after this patch than before the 1817-2 patch, but the synchronization that was not necessary for the fix has been removed. Patch passes derbyall and DERBY-1757 still cannot be reproduced. Reviews are welcome. Thanks!

          Show
          Knut Anders Hatlen added a comment - Attached 1817-3-sync.diff which reduces the amount of code synchronized on runQueue in NetworkServerControlImpl.addSession(). There is still more code synchronized on runQueue after this patch than before the 1817-2 patch, but the synchronization that was not necessary for the fix has been removed. Patch passes derbyall and DERBY-1757 still cannot be reproduced. Reviews are welcome. Thanks!
          Hide
          Knut Anders Hatlen added a comment -

          One "unfortunate" effect of my changes is that addSession() has become
          thread safe. That is, multiple threads may invoke addSession()
          concurrently without interfering with each other. Since addSession()
          is only invoked from one thread (ClientThread), this is not strictly
          necessary.

          Also, the last sentence of this comment is not true as long as only
          one thread invokes addSession():

          // Synchronize on runQueue to prevent other threads from updating
          // runQueue or freeThreads. Hold the monitor's lock until a thread is
          // started or the session is added to the queue. If we release the lock
          // earlier, we might start too few threads (DERBY-1817).

          When I wrote that comment, I was thinking about multiple threads
          invoking addSession(), but that doesn't happen. The way the current
          code works, it would be enough to synchronize on runQueue when
          comparing runQueue.size() and freeThreads, like this:

          boolean enoughThreads;
          synchronized (runQueue)

          { enoughThreads = (runQueue.size() < freeThreads); }

          I am tempted to write a new patch which reduces the amount of code
          synchronized on runQueue, but I also see the value of addSession()
          being thread safe (in case the use of it changes, for instance by
          having multiple ClientThreads listening to different ports). Does
          anyone have an opinion on this?

          Thanks.

          Show
          Knut Anders Hatlen added a comment - One "unfortunate" effect of my changes is that addSession() has become thread safe. That is, multiple threads may invoke addSession() concurrently without interfering with each other. Since addSession() is only invoked from one thread (ClientThread), this is not strictly necessary. Also, the last sentence of this comment is not true as long as only one thread invokes addSession(): // Synchronize on runQueue to prevent other threads from updating // runQueue or freeThreads. Hold the monitor's lock until a thread is // started or the session is added to the queue. If we release the lock // earlier, we might start too few threads ( DERBY-1817 ). When I wrote that comment, I was thinking about multiple threads invoking addSession(), but that doesn't happen. The way the current code works, it would be enough to synchronize on runQueue when comparing runQueue.size() and freeThreads, like this: boolean enoughThreads; synchronized (runQueue) { enoughThreads = (runQueue.size() < freeThreads); } I am tempted to write a new patch which reduces the amount of code synchronized on runQueue, but I also see the value of addSession() being thread safe (in case the use of it changes, for instance by having multiple ClientThreads listening to different ports). Does anyone have an opinion on this? Thanks.
          Hide
          Francois Orsini added a comment -

          > It is true that maxThreads could potentially change after it has been copied into max and before the new thread is added to the thread list.

          This is what I meant Yes.

          > I don't think this is a problem since we could end up with more threads than maxThreads regardless of synchronization
          > (for instance by setting maxThreads to 3 when there are five threads). However, I don't see that expanding the synchronization should cause
          > any problems either, so I will update my patch.

          Great.

          > Thank you very much for looking at the patch!

          My pleasure Knut

          Show
          Francois Orsini added a comment - > It is true that maxThreads could potentially change after it has been copied into max and before the new thread is added to the thread list. This is what I meant Yes. > I don't think this is a problem since we could end up with more threads than maxThreads regardless of synchronization > (for instance by setting maxThreads to 3 when there are five threads). However, I don't see that expanding the synchronization should cause > any problems either, so I will update my patch. Great. > Thank you very much for looking at the patch! My pleasure Knut
          Hide
          Knut Anders Hatlen added a comment -

          Thank you for reviewing the patch, Bryan! I think you are right, those methods could be removed or made private now. I will address that in a follow-up patch. Committed 1817-2.diff into trunk with revision 441802. Leaving the issue open until the suggested clean-up of NetworkServerControlImpl has been performed.

          Show
          Knut Anders Hatlen added a comment - Thank you for reviewing the patch, Bryan! I think you are right, those methods could be removed or made private now. I will address that in a follow-up patch. Committed 1817-2.diff into trunk with revision 441802. Leaving the issue open until the suggested clean-up of NetworkServerControlImpl has been performed.
          Hide
          Bryan Pendleton added a comment -

          I looked at 1817-2.diff and I think it is a good change. I think that moving this logic
          from ClientThread to NetworkServerControlImpl is the right way to go; I had tried
          several variants that were quite similar to this as part of working on DERBY-1326.
          The code is simpler and cleaner after this change.

          My only substantive comment is that I think, with this change, you can now
          encapsulate the thread list entirely within NetworkServerControlImpl, because
          no other class now needs access to these server control internals. That is, I think
          you can remove the getFreeThreads, getThreadList, etc. methods from the
          NetworkServerControlImpl class because they were there only to allow that access
          from ClientThread which you have now removed, and so you can now hide the
          ThreadList inside NetworkServerControlImpl and not expose these internals.

          Thanks for working on this problem!

          Show
          Bryan Pendleton added a comment - I looked at 1817-2.diff and I think it is a good change. I think that moving this logic from ClientThread to NetworkServerControlImpl is the right way to go; I had tried several variants that were quite similar to this as part of working on DERBY-1326 . The code is simpler and cleaner after this change. My only substantive comment is that I think, with this change, you can now encapsulate the thread list entirely within NetworkServerControlImpl, because no other class now needs access to these server control internals. That is, I think you can remove the getFreeThreads, getThreadList, etc. methods from the NetworkServerControlImpl class because they were there only to allow that access from ClientThread which you have now removed, and so you can now hide the ThreadList inside NetworkServerControlImpl and not expose these internals. Thanks for working on this problem!
          Hide
          Knut Anders Hatlen added a comment -

          Updated the patch based on Francois's comment. Ran derbynetclientmats with no errors.

          Show
          Knut Anders Hatlen added a comment - Updated the patch based on Francois's comment. Ran derbynetclientmats with no errors.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks Francois,

          I noticed that one myself too, that's why I created the local copy (max) instead of calling getMaxThreads() twice, as the original code did. I'm not sure I understand exactly what you mean by "seems like 'max' could be reset". Since it is a local int variable, other threads cannot change it, if that's what you mean by "reset".

          It is true that maxThreads could potentially change after it has been copied into max and before the new thread is added to the thread list. I don't think this is a problem since we could end up with more threads than maxThreads regardless of synchronization (for instance by setting maxThreads to 3 when there are five threads). However, I don't see that expanding the synchronization should cause any problems either, so I will update my patch.

          (Talking about synchronization in NetworkServerControlImpl, I also noticed that timeSlice has a mutex object, timeSliceSync, but only setTimeSlice() synchronizes on that object. Seems like a bug...)

          Thank you very much for looking at the patch!

          Show
          Knut Anders Hatlen added a comment - Thanks Francois, I noticed that one myself too, that's why I created the local copy (max) instead of calling getMaxThreads() twice, as the original code did. I'm not sure I understand exactly what you mean by "seems like 'max' could be reset". Since it is a local int variable, other threads cannot change it, if that's what you mean by "reset". It is true that maxThreads could potentially change after it has been copied into max and before the new thread is added to the thread list. I don't think this is a problem since we could end up with more threads than maxThreads regardless of synchronization (for instance by setting maxThreads to 3 when there are five threads). However, I don't see that expanding the synchronization should cause any problems either, so I will update my patch. (Talking about synchronization in NetworkServerControlImpl, I also noticed that timeSlice has a mutex object, timeSliceSync, but only setTimeSlice() synchronizes on that object. Seems like a bug...) Thank you very much for looking at the patch!
          Hide
          Francois Orsini added a comment -

          Great changes Knut.

          My only comment thus far:

          • I think there is a tiny and potential race condition upon retrieving the maximum number of threads to create in the network server (via getMaxThreads()) and adding a newly created one on the thread list. Not sure if this is possible if there is some some higher-level lock but it seems like 'max' could be reset while holding the runQueue lock (as we're synchronizing on 2 different objects) - It seems like the number of maxThreads can be updated at runtime (see processCommands()) - just raising this issue for now - maybe something to investigate if you haven't done so already. Just from a quick review.

          Cheers.

          Show
          Francois Orsini added a comment - Great changes Knut. My only comment thus far: I think there is a tiny and potential race condition upon retrieving the maximum number of threads to create in the network server (via getMaxThreads()) and adding a newly created one on the thread list. Not sure if this is possible if there is some some higher-level lock but it seems like 'max' could be reset while holding the runQueue lock (as we're synchronizing on 2 different objects) - It seems like the number of maxThreads can be updated at runtime (see processCommands()) - just raising this issue for now - maybe something to investigate if you haven't done so already. Just from a quick review. Cheers.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching a patch for this bug. Instead of always putting new sessions in the run queue when there are free threads, the network server now compares the number of free threads and the size of the run queue. This is done to prevent the run queue from growing to a size greater than the number of free threads. Also, the server now synchronizes on runQueue until the session has been added to the queue. This is to prevent two threads from deciding that there are enough free threads and adding the session to the run queue, when there in fact only were enough free threads for one of them. With this patch, I am not able to reproduce DERBY-1757 on platforms where the failure was easily reproduced before.

          The patch passes derbyall (Sun JVM 1.4.2, FreeBSD) and is ready for review. Thanks!

          Show
          Knut Anders Hatlen added a comment - Attaching a patch for this bug. Instead of always putting new sessions in the run queue when there are free threads, the network server now compares the number of free threads and the size of the run queue. This is done to prevent the run queue from growing to a size greater than the number of free threads. Also, the server now synchronizes on runQueue until the session has been added to the queue. This is to prevent two threads from deciding that there are enough free threads and adding the session to the run queue, when there in fact only were enough free threads for one of them. With this patch, I am not able to reproduce DERBY-1757 on platforms where the failure was easily reproduced before. The patch passes derbyall (Sun JVM 1.4.2, FreeBSD) and is ready for review. Thanks!
          Hide
          Knut Anders Hatlen added a comment -

          Thanks for the useful pointers, Deepa and Bryan. The wiki page was very interesting! I won't address the shutdown and restart problems as part of this issue, but I will take a look at DERBY-1326 later and see if I can come up with a bright idea (or use one of the bright ideas on the wiki page).

          Show
          Knut Anders Hatlen added a comment - Thanks for the useful pointers, Deepa and Bryan. The wiki page was very interesting! I won't address the shutdown and restart problems as part of this issue, but I will take a look at DERBY-1326 later and see if I can come up with a bright idea (or use one of the bright ideas on the wiki page).
          Hide
          Deepa Remesh added a comment -

          I think this issue generalizes the problem found in DERBY-1326. I am linking both the issues.

          Bryan created a nice wiki page discussing network server session management here: http://wiki.apache.org/db-derby/NetworkServerSessionManagement

          Show
          Deepa Remesh added a comment - I think this issue generalizes the problem found in DERBY-1326 . I am linking both the issues. Bryan created a nice wiki page discussing network server session management here: http://wiki.apache.org/db-derby/NetworkServerSessionManagement
          Hide
          Bryan Pendleton added a comment -

          For more thoughts on the issues involving Network Server session management,
          see DERBY-1326 and http://wiki.apache.org/db-derby/NetworkServerSessionManagement

          I should unassign myself from DERBY-1326 because I'm not actively
          working on it and won't have time to return to it right away. I'll do so now.

          Show
          Bryan Pendleton added a comment - For more thoughts on the issues involving Network Server session management, see DERBY-1326 and http://wiki.apache.org/db-derby/NetworkServerSessionManagement I should unassign myself from DERBY-1326 because I'm not actively working on it and won't have time to return to it right away. I'll do so now.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development