Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- Bugtraq ID: 6355743
Description
Bugtraq ID 6355743
Taken from the jini-users mailing list, http://archives.java.sun.com/cgi-bin/wa?A2=ind0511&L=jini-users&F=&S=&P=25095:
This is a sad horror story about a default value for a load factor in
Mahalo that turned out to halt our software system at regular intervals,
but never in a deterministic way, leading to many lost development
hours, loss of faith and even worse.
In short, what we experienced was that some operations in our software
system (which includes JavaSpaces and various services that perform
operations under a distributed transaction) that should take place in
parallel were executed serially. We noticed this behavior occurred only
under certain (at that time unknown) conditions. Not only was throughput
harmed, but our assumptions about the maximum time in which operations
should complete no longer held, and things started to fail. One can argue
that this is what distributed systems are all about, but it is nevertheless
something you try to avoid, especially when all parts seem to function
properly.
We were not able to find deadlocks in our code or any other problem that
could cause this behavior. Given the large number of services, their
interactions, the associated thousands of threads across multiple JVMs, and
the fact that you can't freeze-frame time for your system, this appeared to
be a tricky problem to tackle. One of those moments you really regret
having started to develop a distributed application in the first place.
However, a little voice told me that Mahalo must be involved in all this
trouble. This was in line with my feeling about Mahalo, as I knew the code
a bit (due to fitting it into Seven) and recalled Jim Hurley's remark at
the 7th JCM that "Mahalo is the weakest child of the contributed services",
or similar wording.
So I decided to assume there was a bug in Mahalo, and that the only way to
find out was to develop a scenario that could make the bug obvious and to
improve the logging a lot (proper tracking of the transactions and
participants involved). So I recently started to develop some scenarios,
and none of them could reproduce a bug or explain what we saw. Until
lately, when I experimented with transaction participants that are able to
'take their time' in the prepare method [1]. When using random prepare
times of 3-10 seconds, I noticed that the parallelism of Mahalo and the
throughput of a transaction (the time from client commit to completion)
varied and was no direct function of the prepare time. The behavior I
experienced could only be explained if the scheduling of the various
internal tasks was constrained by something. Knowing the code, I suddenly
realized there must be a 'load factor' applied to the thread pool used for
the commit-related tasks. I was rather shocked to find out that the default
was 3.0, and suddenly the mystery became completely clear to me: Mahalo
has, out of the box, a built-in constraint that can make the system
serialize transaction-related operations when participants really take
their time to return.
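To make the effect concrete, here is a minimal sketch of a pool governed by
such a rule. This is an illustration only, not Mahalo's actual TaskManager
code: the class name and the exact spawn condition (start a new worker only
when the backlog exceeds workers * loadFactor) are assumptions based on the
description above. The main method queues ten one-second 'prepare-like'
tasks, first with a load factor of 3.0 and then with 1.0, and prints how
long each batch takes (roughly 3 seconds versus roughly 1 second on an
otherwise idle machine).

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.CountDownLatch;

    /** Illustrative load-factor-governed task pool; not Mahalo's code. */
    public class LoadFactorPool {
        private final int maxThreads;
        private final float loadFactor;
        private final Queue<Runnable> pending = new ArrayDeque<>();
        private int workers;

        public LoadFactorPool(int maxThreads, float loadFactor) {
            this.maxThreads = maxThreads;
            this.loadFactor = loadFactor;
        }

        public synchronized void add(Runnable task) {
            pending.add(task);
            // The crucial rule: a new worker is started only when the backlog
            // exceeds loadFactor tasks per existing worker. With loadFactor 3.0
            // several slow prepare() calls pile up behind one busy thread.
            if (workers < maxThreads && pending.size() > workers * loadFactor) {
                workers++;
                Thread t = new Thread(this::workLoop);
                t.setDaemon(true);
                t.start();
            }
        }

        private void workLoop() {
            while (true) {
                Runnable task;
                synchronized (this) {
                    task = pending.poll();
                    if (task == null) {
                        workers--;   // no backlog left, let this worker retire
                        return;
                    }
                }
                task.run();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            for (float lf : new float[] {3.0f, 1.0f}) {
                LoadFactorPool pool = new LoadFactorPool(10, lf);
                int tasks = 10;
                CountDownLatch done = new CountDownLatch(tasks);
                long start = System.currentTimeMillis();
                for (int i = 0; i < tasks; i++) {
                    pool.add(() -> {
                        try {
                            Thread.sleep(1000);   // stand-in for a slow prepare()
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                        done.countDown();
                    });
                }
                done.await();
                System.out.printf("loadFactor %.1f: %d tasks finished in %d ms%n",
                        lf, tasks, System.currentTimeMillis() - start);
            }
        }
    }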
So it turned out that Mahalo is a fine service after all, but that one
'freak' chose a very unfortunate default value for the load factor [2].
Load factors for thread pools (and max limits, to a lesser degree) are so
tricky to get right [3]; therefore, IMHO, high load factors should only be
used when you know for sure you are dealing with bursts of tasks with a
guaranteed short duration, and I think that is really something people
should tune themselves.
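To put rough numbers on that, under the worker-creation rule sketched
above: with one worker stuck in a 10 second prepare and a load factor of
3.0, the next three commit-related tasks simply queue up behind it (the
backlog has to exceed three tasks per worker before another thread is
started), so transactions whose own participants are idle still wait many
seconds and are then handled one after another. With a load factor of 1.0,
a second worker would have been started as soon as two tasks were waiting.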
Maybe it was stupid of me and I should have read and understood the Mahalo
documentation better. But I would expect any system to use an
out-of-the-box load factor of 1.0 for a thread pool whose tasks are
potentially long running [3], especially for something as delicate as a
transaction manager, which operates as the proverbial spider in the web. It
is better to have a system consume too many threads than to constrain it in
a way that leads to problems that are very hard to track down.
I hope this mail is seen as an RFE for a default load factor of 1.0, to
prevent people from running into problems similar to ours, and as a
lesson/warning for those working with Mahalo about the risks of load
factors in general.
[1] In our system some services have to consult external systems when
prepare is called on them, and under some conditions it can take a long
time to return from the prepare method. We are aware this is something you
want to prevent, but we have requirements that mandate it.
[2] The one that gave us problems in production was Mahalo from JTSK 2.0,
which didn't have the ability to specify a task pool through the
configuration. The load factor of 3.0 was hardcoded (with a TODO) and, if I
recall correctly, not documented at the time (I don't have a 2.0
distribution at hand).
[3] More and more I'm starting to believe that each task in a thread pool
should have a deadline by which it should be assigned to a worker thread;
for this purpose our thread pools support a priority constraint that can be
attached to Runnables, see
http://www.cheiron.org/utils/nightly/api/org/cheiron/util/thread/PriorityConstraints.html.
In a discussion on the Porter mailing list Bob Scheifler once said "I have
in a past life been a fan of deadline scheduling."; I'm very interested to
know whether he still is a fan.
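As a rough illustration of that deadline idea (and not a reflection of the
org.cheiron PriorityConstraints API, which is not reproduced here), the
sketch below tags each task with the latest time by which a worker should
pick it up, dequeues tasks earliest-deadline-first, and at least makes a
missed deadline visible; a real pool would grow itself, reject the task, or
raise an alarm at that point.

    import java.util.concurrent.PriorityBlockingQueue;
    import java.util.concurrent.TimeUnit;

    /** Illustrative deadline-ordered task queue with a single worker. */
    public class DeadlineQueueDemo {

        static final class DeadlineTask implements Comparable<DeadlineTask>, Runnable {
            final long deadlineMillis;   // absolute time a worker should have picked this up
            final Runnable body;

            DeadlineTask(long maxQueueMillis, Runnable body) {
                this.deadlineMillis = System.currentTimeMillis() + maxQueueMillis;
                this.body = body;
            }

            public int compareTo(DeadlineTask other) {
                return Long.compare(deadlineMillis, other.deadlineMillis);
            }

            public void run() {
                long late = System.currentTimeMillis() - deadlineMillis;
                if (late > 0) {
                    // A deadline-aware pool would react here instead of only logging.
                    System.err.println("task started " + late + " ms past its deadline");
                }
                body.run();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            PriorityBlockingQueue<DeadlineTask> queue = new PriorityBlockingQueue<>();

            // One worker taking tasks earliest-deadline-first.
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        queue.take().run();
                    }
                } catch (InterruptedException ignored) {
                }
            });
            worker.setDaemon(true);
            worker.start();

            // A slow task, then (once the worker is busy) a task that should be
            // picked up within 100 ms: with a single worker its deadline is
            // missed, exactly the condition a deadline-aware pool should detect.
            queue.put(new DeadlineTask(1000, () -> sleep(2000)));
            Thread.sleep(50);
            queue.put(new DeadlineTask(100, () -> System.out.println("urgent task ran")));
            TimeUnit.SECONDS.sleep(3);
        }

        private static void sleep(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException ignored) {
            }
        }
    }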
Evaluation:
Given a low priority, since in 2.1 the task pool objects are
user-configurable. This request is to change the default settings for those
objects.
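For reference, an override along the lines this request argues for might
look roughly as follows in a 2.1 deployment, written in the Jini
configuration language. The com.sun.jini.mahalo component and the
com.sun.jini.thread.TaskManager(maxThreads, timeout, loadFactor)
constructor exist, but the "taskManager" entry name is an assumption used
for illustration; the exact entry names and defaults should be taken from
the Mahalo configuration documentation of the release in use.

    /* Sketch only: "taskManager" stands for whichever TaskManager entry the
       release defines for Mahalo's commit-related tasks. */
    import com.sun.jini.thread.TaskManager;

    com.sun.jini.mahalo {
        /* up to 50 threads, 15 second idle timeout, and a load factor of 1.0
           instead of the old hardwired 3.0 */
        taskManager = new TaskManager(50, 1000 * 15, 1.0F);
    }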