Karaf / KARAF-6224

Race condition in BaseActivator on first launch


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.0.10, 4.1.7, 4.2.4
    • Fix Version/s: 4.3.0, 4.2.7
    • Component/s: karaf
    • Labels:
      None

      Description

      We run several Karaf containers on a single machine with a high core count (20), which may make this problem hard to reproduce elsewhere. We have customized the RMI and JMX ports for each container so that they do not conflict. However, after the first Karaf JVM is launched and claims ports 1099/44444, the second JVM will briefly attempt to do the same before its customized configuration has been read from the ${karaf.etc} directory. You can see the management bundle start and a configuration update then happen immediately afterwards with the corrected values.

      Looking over BaseActivator, it appears that a thread is created to dispatch initialization, and this thread can sometimes observe a null "config" field before the asynchronous managed service event arrives. In that case the configuration is missing and the defaults are used, so ports 1099 and 44444 are temporarily bound until the first managed service event arrives via the updated() method. Immediately afterwards, the service reconfigures itself with the proper customized values.
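
      The race described above can be modeled with a simplified, self-contained sketch. This is not Karaf's actual BaseActivator code; the class, method, and property names here are illustrative. The point is that a port-resolving helper falls back to the default whenever the asynchronous updated() callback has not yet delivered the real configuration.

```java
import java.util.Dictionary;
import java.util.Hashtable;

// Simplified model of the race: a worker thread reads `config` while the
// ManagedService updated() callback arrives asynchronously and may lose.
public class ActivatorRaceDemo {
    // volatile mirrors the asynchronous hand-off between the two threads
    static volatile Dictionary<String, Object> config = null;

    static int resolveRmiPort() {
        // Fall back to the default when no configuration has been
        // delivered yet -- the wrong port for a customized container.
        Dictionary<String, Object> c = config;
        if (c == null || c.get("rmiRegistryPort") == null) {
            return 1099;
        }
        return (Integer) c.get("rmiRegistryPort");
    }

    public static void main(String[] args) {
        // The worker runs before updated() has delivered the real config,
        // so the default port is used at startup.
        System.out.println("startup uses port " + resolveRmiPort());

        // Later, the managed-service event arrives with the customized
        // value (the equivalent of updated(props) being called).
        Dictionary<String, Object> props = new Hashtable<>();
        props.put("rmiRegistryPort", 11099);
        config = props;
        System.out.println("after updated() uses port " + resolveRmiPort());
    }
}
```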

      This is a problem for us because this temporary window can cause a client to mistakenly connect to the wrong container. We use JMX over RMI for a number of management operations, and this initial startup is unreliable. Our three Karaf containers have interdependencies that this transient condition disrupts.

      The problem occurs less often on subsequent restarts, which suggests that the initial provisioning of ${karaf.etc} is part of the race. We have, however, seen it happen at any time, albeit rarely. We believe the high core count of the server makes the race condition more likely to be hit.

      The suggested fix is for run() to call Config Admin directly to read the configuration when this.config is null. This would close the race here, although it is unclear whether it could cause other bad interactions with Config Admin.
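
      The suggested fix could be sketched as follows. This is illustrative only: EagerConfigActivator and configAdminLookup are hypothetical stand-ins for BaseActivator and a direct ConfigurationAdmin query, not Karaf's actual API.

```java
import java.util.Dictionary;
import java.util.Hashtable;
import java.util.function.Supplier;

// Sketch of the proposed fix: before starting services that bind ports,
// fall back to a direct, synchronous Config Admin read when no updated()
// event has arrived yet.
public class EagerConfigActivator {
    private volatile Dictionary<String, Object> config;
    private final Supplier<Dictionary<String, Object>> configAdminLookup;

    EagerConfigActivator(Supplier<Dictionary<String, Object>> lookup) {
        this.configAdminLookup = lookup;
    }

    // Equivalent of the asynchronous ManagedService callback.
    void updated(Dictionary<String, Object> props) {
        this.config = props;
    }

    // Equivalent of run(): ensure the configuration is present before use.
    Dictionary<String, Object> run() {
        if (config == null) {
            // The proposed fix: read Config Admin directly instead of
            // waiting for the (possibly late) updated() event.
            config = configAdminLookup.get();
        }
        return config;
    }

    public static void main(String[] args) {
        Dictionary<String, Object> stored = new Hashtable<>();
        stored.put("rmiRegistryPort", 11099);

        // updated() has not fired yet, but run() still sees the real port.
        EagerConfigActivator a = new EagerConfigActivator(() -> stored);
        System.out.println("port = " + a.run().get("rmiRegistryPort"));
    }
}
```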


              People

              • Assignee:
                Jean-Baptiste Onofré (jbonofre)
              • Reporter:
                Kurt Westerfeld (kurt.westerfeld@gmail.com)
              • Votes: 0
              • Watchers: 4
