Attached are "skipThreads.diff" and "interrupt.diff", but before
reading the diffs, please read these notes.

I think I understand what is causing the hangs, and I can even make
the hangs go away. However, I don't think I yet understand how to
really fix the problem, so I'm sure we'll want to talk about this
for a while, to see if some of the reviewers can come up with a
proper solution or at least some techniques to pursue.

Here's what I see, and what I think it means:

1) One, or maybe several, times in the test, checkDataSource causes a
shutdown of the server. It has several different variants on the shutdown
processing, but at least one of them causes the server to go through
NetworkServerControlImpl.startNetworkServer() to perform a server restart.
2) During the server restart processing, the Network Server restart
code iterates through all the DRDAConnThread instances and closes them.
This close() call is supposed to cause the DRDAConnThread to terminate itself.
3) However, all the close() call actually does is mark the thread's
"close" variable as true, and depending on when the thread checks that
variable, it may or may not immediately exit. In my test runs, it is
often the case that at least one of the DRDAConnThread instances is, at this
point, sitting blocked in NetworkServerControlImpl.getNextSession().
Calling close() on this thread marks it as closed, but doesn't cause
it to exit the getNextSession() wait.
4) A little bit later, the test program makes some new connections
to the server, and one of those connections is given to the thread
which was blocked in the getNextSession() call. The thread picks
up the session and returns to the DRDAConnThread.run() main loop.
5) At this point, the thread notices that it has been closed, and it
exits, without sending any response back to the client, and without
closing the connection to the client. This causes the hang.

Because this problem involves multi-threading and thread scheduling,
the behavior is nondeterministic, which I believe is why
others have been experiencing varied results during their tests. The
behavior of the threads is definitely unpredictable for me.

There are several aspects to this scenario that puzzle me, but let
me describe what I've been experimenting with as a patch. I've changed
the NetworkServerControlImpl restart logic so that, instead of
closing the DRDAConnThreads, it just leaves the threads alone.
This change is in "skipThreads.diff", and it seems to make the hangs
go away.

The "skipThreads.diff" diff also contains some hacks to the test so
that I could run it multiple times in a row outside of the harness
without destroying and re-creating the database each time.
Those changes don't really belong with this diff, but I didn't bother
to edit them out.

I also experimented with a change which closes the threads and then
interrupts them. The interrupt blows the threads out of the
getNextSession() wait and back to the main run() loop, at which point
they shut themselves down, which seems like the right behavior for a
Network Server restart.
I was hoping that this was the "right" fix, but unfortunately this
change fixed some, but not all, of the hangs, which was too bad.
And I'm nervous about adding the call to Thread.interrupt(), which is
an extremely powerful call and not to be used lightly. For reviewers
who want to experiment with this change and see how it works for them,
I've also attached "interrupt.diff".
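
For reviewers who have not looked at the diff yet, the idea behind the close-then-interrupt experiment can be sketched roughly as follows. This is a hedged illustration, not the contents of "interrupt.diff": the Worker class and its lock are hypothetical, and I am assuming the thread is parked in an interruptible Object.wait().

```java
// Hypothetical model of close() followed by Thread.interrupt().
class InterruptOnCloseSketch {
    static class Worker extends Thread {
        private final Object lock = new Object();
        private volatile boolean closed = false;
        volatile boolean exitedCleanly = false;

        void close() { closed = true; }

        @Override public void run() {
            while (!closed) {
                try {
                    synchronized (lock) {
                        lock.wait();   // stand-in for the getNextSession() wait
                    }
                } catch (InterruptedException e) {
                    // Fall through and re-check the closed flag.
                }
            }
            exitedCleanly = true;      // back in run(): shut down
        }
    }

    // Returns true if the worker terminated after close() plus interrupt().
    static boolean closeAndInterrupt() {
        Worker w = new Worker();
        w.start();
        try {
            // Wait until the worker is parked in wait().
            while (w.getState() != Thread.State.WAITING) {
                Thread.sleep(10);
            }
            w.close();       // mark the thread as closed
            w.interrupt();   // wake it so it actually notices the flag
            w.join(5000);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return !w.isAlive() && w.exitedCleanly;
    }

    public static void main(String[] args) {
        System.out.println(closeAndInterrupt());
    }
}
```

The order matters in this sketch: setting the flag before interrupting means the woken thread sees closed == true when it re-checks the loop condition.
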

I'm still disturbed by the fact that when the main run() method
in DRDAConnThread noticed that it was closed, it just exited, apparently
without sending any response back to the client or closing the
connection.
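
One direction that might be worth discussing, purely as a sketch: if a closed thread finds itself holding a session, it could at least close that session before exiting, so the client sees a dropped connection rather than waiting forever. The FakeSession class and handle() method below are hypothetical and only model the idea; they say nothing about how DRDAConnThread actually manages its sockets.

```java
import java.io.Closeable;

// Hypothetical model: release the client connection when a closed worker exits.
class CloseSessionOnExitSketch {
    // Stand-in for a session wrapping a client socket.
    static class FakeSession implements Closeable {
        volatile boolean connectionClosed = false;

        @Override public void close() {
            connectionClosed = true;
        }
    }

    // One pass of the worker's main loop for a session it just picked up.
    static void handle(FakeSession session, boolean workerClosed) {
        if (workerClosed) {
            // Do not exit silently: drop the connection so the client is not left hanging.
            session.close();
            return;
        }
        // Normal request processing would go here.
    }

    public static void main(String[] args) {
        FakeSession s = new FakeSession();
        handle(s, true);
        System.out.println(s.connectionClosed);
    }
}
```
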

And, although my change makes the hangs go away, it does not make the
checkDataSource and checkDataSource30 tests pass. Instead, they run
to completion, and get a bunch of diffs, and I'm not sure whether my
changes caused these diffs or not.

But at this point, before I work on this much more, I'd like to get
some feedback from the reviewers about the analysis up to this point,
and the effects of this patch in their environment:
- does this patch cause the hangs to disappear for you?
- if so, do the checkDataSource and checkDataSource30 tests pass for you?
- if they fail, do the failures make sense to you?
- what should we be doing with the background connection threads during
a Network Server restart?