Attached is 'unify_NSImpl_instances.diff', with these notes to explain. This
patch belongs with the rest of the
DERBY-1326 changes, but I've extracted it
out into a standalone patch to make it easier to study and review.
Recently, I've been investigating that hang that Deepa uncovered which occurs
with the latest patch proposal for
Investigating that problem has taken me through some interesting analysis
of NetworkServer startup/shutdown issues, which I'd like to describe here, to
see what sort of reaction I get from the team.
There are several different ways to start up the Network Server, but after
some twists and turns there appear to be two essential techniques:
1) You can instantiate a NetworkServerControl object, and call the start()
method on it.
2) You can define the derby.drda.startNetworkServer property to true, and then
start the embedded engine.
These two code paths proceed somewhat differently, but end up getting to the
1) NetworkServerControl.start() delegates to NetworkServerControlImpl.start(),
which calls NetworkServerControlImpl.startNetworkServer(), then instantiates
a DRDAServerStarter object and calls its boot() method.
2) Loading the embedded engine eventually calls JDBCBoot.boot(), which notices
that derby.drda.startNetworkServer has been set, and calls
Monitor.startSystemModule() to instantiate a DRDAServerStarter object and
call its boot() method.
The DRDAServerStarter.boot() method uses reflection to:
- dynamically load the Network Server code
- create an instance of NetworkServerControlImpl
- create a daemon server thread, which will then:
- call NetworkServerControlImpl.blockingStart()
So whichever way that we start the Network Server, we end up constructing
a NetworkServerControlImpl instance and calling blockingStart() on that
instance from a background thread. This is good.
However, there is an important and interesting difference between the two
DRDAServerStarter.boot() always creates its own instance of the
NetworkServerControlImpl object and calls blockingStart() on that instance.
So if you start the Network Server by instantiating a NetworkServerControl
object in your own code, that NetworkServerControl instance creates a
NetworkServerControlImpl instance, but then when it calls DRDAServerStarter,
we end up creating a second NetworkServerControlImpl instance to use as
the master instance, and we never really use the NetworkServerControlImpl
instance that was used by the NetworkServerControl object that you created.
Worse, both NetworkServerControlImpl instances are started up, in a certain
sense, because NetworkServerControlImpl.startNetworkServer() is called on
- NetworkServerControlImpl.start() calls startNetworkServer() on itself before
it instantiates the DRDAServerStarter object and calls boot()
- NetworkServerControlImpl.blockingStart() calls startNetworkServer() on
itself as its first action.
So if you start up the Network Server by calling NetworkServerControl.start(),
you end up actually creating two Network Server instances, although one of
them doesn't really do very much. This is the source of the hang that I created
with the most recent patch proposal to
DERBY-1326: when I changed the
startNetworkServer() method to create a new thread, that code ends up being
called twice, but we don't shut down both of these instances, so we never
shut down the extra thread that I created, and that caused the process to
hang and not terminate.
Clearly, one way out of this problem is to re-arrange the thread creation logic
so that the thread is only created when the real NetworkServerControlImpl
instance is started.
But I was curious about why there were two NetworkServerControlImpl instances,
since it seems quite clear to me that the NetworkServerControlImpl class is
designed to behave as a singleton object. I thought that if I had made this
assumption that there would only ever be a single instance in a Java app,
others might tend to make this same assumption in the future, and more bugs
like the one I accidentally introduced might get introduced in the future.
So, for the last week or so, I've been studying this code, trying to figure
out if we really need to have two NetworkServerControlImpl instances, or
whether it would be cleaner to just have a single instance.
My conclusion is that, for the most part, we don't need to have both
instances, and it is cleaner to just have a single instance, but there
are a few messy details.
The changes that I've been experimenting with in this area are as follows:
1) Refactor the DRDAServerStarter code slightly so that the caller can
optionally pass in the NetworkServerControlImpl instance to use. If the
caller does not pass in an instance, DRDAServerStarter creates one via
reflection, as it does now.
2) Simplify NetworkServerControlImpl.start() so that it no longer calls
startNetworkServer() on itself, but instead simply calls DRDAServerStarter,
passing the "this" pointer in as the instance to use for starting the server.
3) Modify the DRDAServerStarter boot code so that it loads the embedded
engine. The current DRDAServerStarter code assumes that the embedded engine
has already been started; I believe that this is the main reason for the
current code's call to startNetworkServer() from NetworkServerControlImpl's
start() method, as a side effect of startNetworkServer() is to load the
embedded engine. It seems simple to have DRDAServerStarter load the
embedded engine itself, but an alternative would be to load the embedded
engine in NetworkServerControlImpl.start().
With these changes in place, the Network Server startup and shutdown code
seems substantially cleaner to me, and the tests seem to run very well.
I'm pleased with the results.
However, there is one messy detail that remains to be resolved, and it
involves the Network Server shutdown processing. The shutdown() method for
the Network Server, in NetworkServerControlImpl.shutdown(), uses the following
- tell the server to shut down
- loop for a while, ping'ing the Network server, until we get an error
Unfortunately, during this "ping loop", the code wants to ensure that no
stray messages from the failed ping operation escape to the user, so the
code intentionally disables the Network Server console.
When there were two NetworkServerControlImpl instances in play, intentionally
disabling the console of the one instance was a relatively safe thing to do.
However, when there is only a single instance in the process, disabling the
console is a quite disruptive action, because it affects all the messages
that the Network Server might want to print, including the "normal shutdown"
message that the Network Server prints at the conclusion of shutdown.
Therefore, I'm coming to believe that we shouldn't disable the console
during shutdown, but rather should implement a variant "ping" operation which
doesn't print any messages when it fails, but instead just throws an exception,
which the shutdown code could catch and handle. Unfortunately, this is not
a trivial one-line change, as the code which prints error messages to the
Network Server console is buried deep in the Network Server command
Obviously, I've still got more work to do here. But I'm attaching the current
state of my changes anyway, since I think it's a useful intermediate stage
and would be interesting to reviewers.
The other comment to be made is that this change is now becoming quite large,
and probably should be split into multiple smaller patches. I'll pursue that
strategy later, once I get to a point where I have code that seems solid.
Thanks in advance for all comments, suggestions, and other feedback!