Details
-
Bug
-
Status: Closed
-
Resolution: Fixed
-
2.1.0
-
None
-
Operating System: Solaris
Platform: Sun
-
17516
Description
Hi all,
Firstly a quick caveat: While I'm only running under 2.1.0, I've checked 2.2.0
at cvs.apache.org and the faults I've found in the code appear to still exist.
But, since I haven't actually compiled and run under 2.2.0, I'm more than happy
to be proved wrong...
I've written an application which contains multiple DOM parsers each running in
a seperate thread, each doing full validation using a schema that contains regex
facets. When this application is run on a multi-processor host (SunFire 880)
fatal errors (core dumps, assertions, etc) are regularly observed.
From my investigations I believe the fault is at least partially due to errors
in the thread-saftey of the RangeTokenMap class (.../util/regx/RangeTokenMap.*).
The single static RangeTokenMap object is created in the first call to the
static RangeTokenMap::instance() method. The race condition is supposedly
handled by the compareAndSwap() method which initialises the static fInstance
pointer only if it's NULL. Even though compareAndSwap is mutex protected, by
this stage it's already too late.
Because the race condition results in double construction of the RangeTokenMap
object, the second instance has to be deleted. Unfortunately the destructor
contains a line setting the fInstance pointer back to NULL. Since fInstance is
a static pointer, this clobbers the RangeTokenMap we actually want to use. So,
by the time we come to return fInstance from RangeTokenMap::instance() it is
useless...
I replaced the implementation of RangeTokenMap::instance() with a version that
only constructs the object once...
RangeTokenMap* RangeTokenMap::instance()
{
static XMLRegisterCleanup instanceCleanup;
if( !fInstance )
{
XMLMutexLock lockInit( &fMutex );
if( !fInstance )
}
return (fInstance);
}
Obviously this change meant fMutex needed to become static.
I also added mutex protection for the other methods of RangeTokenMap, including
moving the locking in getRange() to the start of the method. These extra locks
may have been overkill, but since the methods didn't appear to be obviously
const I thought it was probably prudent. Can someone who knows the intended
uses of RangeTokenMap better than I do (i.e. at all) advise me otherwise?
After rebuilding Xerces with these fixes, the fatal errors became less
frequent, but didn't disappear completely.
I grepped around in the source a bit more, and I found the RangeTokenMap
approach to constructing static objects was reused in the EncodingValidator
class (.../util/EncodingValidator.*). I made the same single construction and
method mutexing changes there as well.
Finally tracked another problem to the use of the RangeTokenMap in the
TokenFactory class (.../util/regx/TokenFactory.*). The
TokenFactory::initializeRegistry() method calls multiple methods of the
(static) RangeTokenMap object, but only if its fRangeInitialized flag is false.
Since fRangeInitialized wasn't static, every thread that came to initialise the
TokenFactory did the RangeTokenMap initialisation again.
I don't fully recall why this multiple initialisation was harmful, as opposed
to just unecessary, since it's a little while since I was looking at this
problem. The fix I made was to add a static mutex to TokenFactory and made the
fRangeInitialized flag static. The mutex is locked before attempting anything
inside TokenFactory::initializeRegistry()
After making all three of these changes I haven't had any further problems with
my app. But, since most of the changes were made without any real understanding
of what these objects actually do, I have a suspicion that I may have let
myself in for some more subtle problems in the future.
Unfortunately I can't provide any test code for these bugs, since my app is too
tightly bound to our code base to be useful to anyone else and I don't have time
to create a fully portable version. Besides which, I doubt that test code would
necessarily guarantee these issues were excercised anyway... Hopefully I've
provided enough explanation to illustrate the problem without example code.
Thanks for your time...
Cheers,
Andrew