River (Retired) / RIVER-206

Change default load factors from 3 to 1


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: River_2.1.2
    • Component/s: None
    • Labels: None
    • Bugtraq ID: 6355743

    Description

      Bugtraq ID 6355743
      Taken from the jini-users mailing list (http://archives.java.sun.com/cgi-bin/wa?A2=ind0511&L=jini-users&F=&S=&P=25095):

      This is a sad horror story about a default value for a load factor in
      Mahalo that turned out to halt our software system at regular intervals,
      but never in a deterministic way, leading to many lost development
      hours, loss of faith and even worse.

      In short, what we experienced was that some operations in our
      software system (which includes JavaSpaces and various services that
      perform operations under a distributed transaction) that should take
      place in parallel took place in a serialized manner. We noticed this
      behavior only occurred under some (at that time unknown) conditions.
      Not only was throughput harmed, but our assumptions about the maximum
      time in which operations should complete no longer held, and things
      started to fail. One can argue that is what distributed systems are
      all about, but nevertheless it is something you try to avoid,
      especially when all parts seem to function properly.

      We were not able to find deadlocks in our code or any other problem
      that could cause this behavior. Given the large number of services,
      their interactions, the associated thousands of threads over multiple
      JVMs, and the fact that you can't freeze-frame time for your system,
      this appeared to be a tricky problem to tackle. One of those moments
      you really regret having started to develop a distributed application
      in the first place.

      However, a little voice told me that Mahalo must be involved in all
      this trouble. This was in line with my feeling about Mahalo, as I
      knew the code a bit (due to fitting it into Seven) and recalled Jim
      Hurley's remark at the 7th JCM that "Mahalo is the weakest child of
      the contributed services", or similar wording.

      So I decided to assume there was a bug in Mahalo, and that the only
      way to find out was to develop a scenario that could make that bug
      obvious and to improve logging a lot (proper tracking of the
      transactions and participants involved). I started to develop some
      scenarios, and none of them could reproduce a bug or explain what we
      saw, until recently, when I experimented with transaction
      participants that are able to 'take their time' in the prepare
      method [1]. When using random prepare times of 3 - 10 seconds, I
      noticed the parallelism of Mahalo and the throughput of a transaction
      (time from client commit to completion) varied and was no direct
      function of the prepare time. The behavior I experienced could only
      be explained if the scheduling of the various internal tasks was
      constrained by something. Knowing the code, I suddenly realized there
      must have been a 'load factor' applied to the thread pool used for
      the commit-related tasks. I was rather shocked to find out that the
      default was 3.0, and suddenly the mystery became completely clear to
      me. Mahalo has, out of the box, a built-in constraint that can make
      the system serialize transaction-related operations when participants
      really take their time to return.
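
      To make the effect concrete, here is a minimal sketch of such a
      load-factor rule. This is not the actual com.sun.jini.thread.TaskManager
      code, only my approximation of its admission rule (a new worker thread
      is started only when the number of tasks exceeds loadFactor times the
      number of existing threads, up to a maximum); the class and names
      below are purely illustrative.

      import java.util.ArrayDeque;
      import java.util.Deque;

      /*
       * Illustrative sketch only -- not the real com.sun.jini.thread.TaskManager.
       * Approximation of a load-factor-governed pool: a new worker thread is
       * started only when the number of tasks (running + pending) exceeds
       * loadFactor * (number of existing threads), up to maxThreads.
       */
      public class LoadFactorPool {
          private static final long START = System.currentTimeMillis();

          private final int maxThreads;
          private final float loadFactor;
          private final Deque<Runnable> pending = new ArrayDeque<Runnable>();
          private int threads = 0;   // worker threads currently alive
          private int running = 0;   // tasks currently executing

          public LoadFactorPool(int maxThreads, float loadFactor) {
              this.maxThreads = maxThreads;
              this.loadFactor = loadFactor;
          }

          public synchronized void add(Runnable task) {
              pending.addLast(task);
              // The admission rule in question: with loadFactor = 3.0 several
              // slow prepare() calls queue behind a single worker before a
              // second worker is ever started.
              if (threads < maxThreads
                      && (pending.size() + running) > threads * loadFactor) {
                  threads++;
                  new Thread(this::workerLoop, "worker-" + threads).start();
              }
          }

          private void workerLoop() {
              while (true) {
                  Runnable task;
                  synchronized (this) {
                      task = pending.pollFirst();
                      if (task == null) {
                          threads--;   // no work left, let this worker exit
                          return;
                      }
                      running++;
                  }
                  try {
                      task.run();
                  } finally {
                      synchronized (this) {
                          running--;
                      }
                  }
              }
          }

          public static void main(String[] args) {
              // Eight simulated prepare() calls of ~3 seconds each.
              LoadFactorPool pool = new LoadFactorPool(10, 3.0f);
              for (int i = 0; i < 8; i++) {
                  final int id = i;
                  pool.add(() -> {
                      System.out.println("prepare " + id + " started  +"
                              + (System.currentTimeMillis() - START) + " ms on "
                              + Thread.currentThread().getName());
                      try {
                          Thread.sleep(3000);
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                      }
                      System.out.println("prepare " + id + " finished +"
                              + (System.currentTimeMillis() - START) + " ms");
                  });
              }
          }
      }

      With a load factor of 3.0 only three workers are ever started for the
      eight calls, so they complete in roughly three 'rounds'; with a load
      factor of 1.0 each call gets its own worker and they all run in
      parallel.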

      So it turned out that Mahalo is a fine service after all, but that
      one 'freak' chose a very unfortunate default value for the load
      factor [2].

      Load factors for thread pools (and maximum thread limits, to a lesser
      degree) are tricky to get right [3]. Therefore, IMHO, high load
      factors should only be used when you know for sure you are dealing
      with bursts of tasks with a guaranteed short duration, and that is
      really something people should tune themselves.

      Maybe it was stupid of me and I should have read and understood the
      Mahalo documentation better. But I would expect any system to use an
      out-of-the-box load factor of 1.0 for a thread pool whose tasks are
      potentially long-running [3], especially for something as delicate as
      a transaction manager, which seems to operate as the so-called spider
      in the web. It is better to have a system consume too many threads
      than to constrain it in a way that leads to problems that are very
      hard to track down.

      I hope this mail is seen as an RFE for a default load factor of 1.0,
      to prevent people from running into problems similar to ours, and as
      a lesson/warning for those working with Mahalo about the risks of
      using load factors in general.

      [1] In our system some services have to consult external systems when
      prepare is called on them, and under some conditions it can take a
      long time to return from the prepare method. We are aware this is
      something you want to prevent, but we have requirements that mandate
      it.

      [2] The one that gave us problems in production was Mahalo from JTSK
      2.0, which didn't have the ability to specify a task pool through the
      configuration. The load factor of 3.0 was hard-coded (with a TODO)
      and, if I recall correctly, not documented at that time (I don't have
      a 2.0 distribution at hand).

      [3] More and more I'm starting to believe that each task in a thread
      pool should have a deadline by which it must be assigned to a worker
      thread. For this purpose our thread pools support a priority
      constraint that can be attached to Runnables, see
      http://www.cheiron.org/utils/nightly/api/org/cheiron/util/thread/PriorityConstraints.html.
      In a discussion on the Porter mailing list Bob Scheifler once said "I
      have in a past life been a fan of deadline scheduling."; I'm very
      interested to know whether he still is.
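
      As a rough illustration of what I mean by a deadline for assignment
      (this is not the PriorityConstraints API linked above, just a plain
      java.util.concurrent sketch with made-up names): if no pool worker
      has started a task by its deadline, it is handed to a dedicated
      fallback thread instead of queueing any longer.

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicBoolean;

      /*
       * Sketch of deadline-constrained dispatch: a task is offered to a
       * bounded pool, and if no worker has started it by its deadline a
       * dedicated fallback thread runs it, so it never waits in the queue
       * indefinitely.
       */
      public class DeadlineDispatch {
          private final ExecutorService pool = Executors.newFixedThreadPool(2);
          private final ScheduledExecutorService watchdog =
                  Executors.newSingleThreadScheduledExecutor();

          public void submit(final Runnable task, long deadlineMillis) {
              final AtomicBoolean started = new AtomicBoolean(false);
              // Whichever path wins the compare-and-set runs the task exactly once.
              final Runnable guarded = () -> {
                  if (started.compareAndSet(false, true)) {
                      task.run();
                  }
              };
              pool.execute(guarded);
              // If the pool has not started the task by its deadline, run it on
              // a fallback thread. (If the pool wins the race in the meantime,
              // the fallback's compare-and-set fails and it does nothing.)
              watchdog.schedule(() -> {
                  if (!started.get()) {
                      new Thread(guarded, "deadline-fallback").start();
                  }
              }, deadlineMillis, TimeUnit.MILLISECONDS);
          }

          public static void main(String[] args) throws InterruptedException {
              DeadlineDispatch d = new DeadlineDispatch();
              for (int i = 0; i < 6; i++) {
                  final int id = i;
                  // Six 3-second tasks saturate the two pool threads; the 500 ms
                  // deadline moves the remaining four onto fallback threads
                  // instead of letting them queue for seconds.
                  d.submit(() -> {
                      System.out.println("task " + id + " running on "
                              + Thread.currentThread().getName());
                      try {
                          Thread.sleep(3000);
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                      }
                  }, 500);
              }
              Thread.sleep(5000);   // let the demo finish before shutting down
              d.pool.shutdown();
              d.watchdog.shutdown();
          }
      }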

      Evaluation:
      Given a low priority since in 2.1 the task pool objects are user configurable. This request is to change the default setting for those objects.
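
      For reference, an override along these lines is the kind of
      configuration the evaluation refers to. The entry names (taskPool,
      settlerPool) and the TaskManager constructor arguments (max threads,
      idle timeout in milliseconds, load factor) are my reading of the 2.1
      Mahalo documentation and should be verified against the release in
      use:

      /* mahalo.config -- sketch of overriding the default Mahalo pools with
         a load factor of 1.0; entry names and arguments to be verified. */
      import com.sun.jini.thread.TaskManager;

      com.sun.jini.mahalo {
          taskPool    = new TaskManager(50, 15000, 1.0f);
          settlerPool = new TaskManager(50, 15000, 1.0f);
      }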

    Attachments

        1. RIVER-206.diff (0.5 kB, Robert Resendes)

    Activity

    People

        Assignee: Robert Resendes (resendes)
        Reporter: Ronald Mann (rjmann)
        Votes: 0
        Watchers: 1

    Dates

        Created:
        Updated:
        Resolved: