KARAF-1309

Cellar causes Karaf container to freeze if the system's network interfaces change between container restarts

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: cellar-2.2.4
    • Component/s: cellar-hazelcast
    • Labels:
      None
    • Environment:

      Karaf-2.2.6-SNAPSHOT from 20120403, Cellar-2.2.4-SNAPSHOT from 20120312

      Description

      1. Started 4 Karaf instances and installed Cellar on them

      cluster:node-list
        No. Host Name  Port ID
      * 1   opti.local 5701 opti.local:5701
        2   opti.local 5702 opti.local:5702
        3   opti.local 5703 opti.local:5703
        4   opti.local 5704 opti.local:5704

      opti.local = 192.168.1.86

      2. 1,2 - default group; 3,4 - "group1" (not sure if groups are essential here, but were a part of the test case)

      cluster:group-list
        Node            Group
        opti.local:5701 default
        opti.local:5702 default
        opti.local:5703 group1
      * opti.local:5704 group1

      3. Stopped Karaf containers

      4. Turned the VPN client on, which added the tun0 (172.27.210.11) network interface

      5. Restarted Karaf containers

      • the 1st and the 2nd ones seemed to be working fine:

      cluster:node-list
        No. Host Name     Port ID
      * 1   172.27.210.11 5701 172.27.210.11:5701
        2   172.27.210.11 5703 172.27.210.11:5703
        3   172.27.210.11 5702 172.27.210.11:5702
        4   172.27.210.11 5704 172.27.210.11:5704

      • the 3rd container became unresponsive:

      karaf@trun> osgi:list
      ...no response....

      • the 4th shows the following static picture:

      ...skipped...
      [ 211] [Active ] [Creating ] [ ] [ 60] hazelcast (1.9.4.6)
      Fragments: 213
      [ 212] [Active ] [Created ] [ ] [ 60] Apache Karaf :: Cellar :: Core (2.2.4.SNAPSHOT)
      [ 213] [Resolved ] [ ] [ ] [ 60] Apache Karaf :: Cellar :: Hazelcast (2.2.4.SNAPSHOT)
      Hosts: 211
      [ 214] [Active ] [GracePeriod ] [ ] [ 60] Apache Karaf :: Cellar :: Config (2.2.4.SNAPSHOT)
      [ 215] [Active ] [GracePeriod ] [ ] [ 60] Apache Karaf :: Cellar :: Features (2.2.4.SNAPSHOT)
      [ 216] [Active ] [GracePeriod ] [ ] [ 60] Apache Karaf :: Cellar :: Bundle (2.2.4.SNAPSHOT)
      [ 217] [Active ] [Created ] [ ] [ 60] Apache Karaf :: Cellar :: DOSGi (2.2.4.SNAPSHOT)
      [ 218] [Active ] [Created ] [ ] [ 60] Apache Karaf :: Cellar :: Utils (2.2.4.SNAPSHOT)
      [ 219] [Active ] [Created ] [ ] [ 60] Apache Karaf :: Cellar :: Shell (2.2.4.SNAPSHOT)
      [ 220] [Active ] [Created ] [ ] [ 60] Apache Karaf :: Cellar :: Management (2.2.4.SNAPSHOT)
      karaf@trun>
      ...and the status does not change over time

      Could it be that the stored config conflicts with the newly detected one and causes this instability?

      Logs of 4 Karaf instances are attached.
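
      To compare what the JVM sees before and after the VPN client is started, a small standalone check like the one below can be run on the host. This is only a sketch built on the standard java.net.NetworkInterface API; the class name InterfaceLister and its helper methods are made up for illustration and are not part of Karaf, Cellar or Hazelcast. It does not show which address Hazelcast actually picks, only which ones are available to choose from.

      ```java
      import java.net.InetAddress;
      import java.net.NetworkInterface;
      import java.net.SocketException;
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.Enumeration;
      import java.util.List;

      // Hypothetical helper, not part of Karaf/Cellar/Hazelcast.
      final class InterfaceLister {

          /** Formats one interface/address pair, e.g. "tun0 (172.27.210.11)". */
          static String describe(String interfaceName, String hostAddress) {
              return interfaceName + " (" + hostAddress + ")";
          }

          /** Returns one entry per address bound to an interface that is currently up. */
          static List<String> listAddresses() throws SocketException {
              List<String> lines = new ArrayList<String>();
              Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
              if (interfaces == null) {
                  return lines; // no interfaces reported by the platform
              }
              for (NetworkInterface nic : Collections.list(interfaces)) {
                  if (!nic.isUp()) {
                      continue; // skip interfaces that are down
                  }
                  for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                      lines.add(describe(nic.getName(), address.getHostAddress()));
                  }
              }
              return lines;
          }

          public static void main(String[] args) throws SocketException {
              for (String line : listAddresses()) {
                  System.out.println(line);
              }
          }
      }
      ```

      Running it once with the VPN off and once with it on should make the extra tun0 entry visible, which helps confirm that the node IDs in cluster:node-list changed because a different local address was selected.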

      1. logs.tgz
        141 kB
        Alexey Bespaly

        Activity

        Alexey Bespaly added a comment -

        Log files of the corresponding Karaf instances

        Jean-Baptiste Onofré added a comment -

        It should be fixed with the "merged" Hazelcast configuration.

        Alexey Bespaly added a comment -

        TESB-EE-Runtime 5.1.1 SNAPSHOT #365
        Apache Cellar 2.2.4 SNAPSHOT as of 8.05.2012

        While retesting, I got the following situation with the 4th node (after adding a network interface and starting all containers):

        [ 198] [Active ] [Created ] [ ] [ 80] Apache Karaf :: Cellar :: Core (2.2.4.SNAPSHOT)
        [ 199] [Resolved ] [ ] [ ] [ 80] Apache Karaf :: Cellar :: Hazelcast (2.2.4.SNAPSHOT)
        Hosts: 197
        [ 200] [Active ] [Failure ] [ ] [ 80] Apache Karaf :: Cellar :: Config (2.2.4.SNAPSHOT)
        [ 201] [Active ] [Failure ] [ ] [ 80] Apache Karaf :: Cellar :: Features (2.2.4.SNAPSHOT)
        [ 202] [Active ] [GracePeriod ] [ ] [ 80] Apache Karaf :: Cellar :: Bundle (2.2.4.SNAPSHOT)
        [ 203] [Active ] [Failure ] [ ] [ 80] Apache Karaf :: Cellar :: DOSGi (2.2.4.SNAPSHOT)
        [ 204] [Active ] [Created ] [ ] [ 80] Apache Karaf :: Cellar :: Utils (2.2.4.SNAPSHOT)
        [ 205] [Resolved ] [ ] [ ] [ 80] Apache Karaf :: Cellar :: Shell (2.2.4.SNAPSHOT)
        [ 206] [Resolved ] [ ] [ ] [ 80] Apache Karaf :: Cellar :: Management (2.2.4.SNAPSHOT)

        tesb.log excerpt:
        ...
        12:44:10,724 | WARN | ol-10-thread-130 | lar.core.event.EventDispatchTask 88 | 198 - org.apache.karaf.cellar.core - 2.2.4.SNAPSHOT | Failed to retrieve handler for event class org.apache.karaf.cellar.features.RemoteFeaturesEvent
        12:44:10,732 | WARN | ol-10-thread-131 | lar.core.event.EventDispatchTask 88 | 198 - org.apache.karaf.cellar.core - 2.2.4.SNAPSHOT | Failed to retrieve handler for event class org.apache.karaf.cellar.features.RemoteFeaturesEvent
        12:44:10,752 | WARN | ol-10-thread-132 | lar.core.event.EventDispatchTask 88 | 198 - org.apache.karaf.cellar.core - 2.2.4.SNAPSHOT | Failed to retrieve handler for event class org.apache.karaf.cellar.features.RemoteFeaturesEvent
        12:44:10,758 | WARN | ol-10-thread-133 | lar.core.event.EventDispatchTask 88 | 198 - org.apache.karaf.cellar.core - 2.2.4.SNAPSHOT | Failed to retrieve handler for event class org.apache.karaf.cellar.features.RemoteFeaturesEvent
        ...

        and finally:

        12:48:41,257 | ERROR | rint Extender: 3 | ntainer.BlueprintContainerImpl$1 293 | 10 - org.apache.aries.blueprint - 0.3.1 | Unable to start blueprint container for bundle org.apache.karaf.cellar.config due to unresolved dependencies [(objectClass=org.apache.karaf.cellar.core.GroupManager)]
        java.util.concurrent.TimeoutException
        at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:287)[10:org.apache.aries.blueprint:0.3.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)[:1.6.0_30]
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)[:1.6.0_30]
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)[:1.6.0_30]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)[:1.6.0_30]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)[:1.6.0_30]
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)[:1.6.0_30]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)[:1.6.0_30]
        at java.lang.Thread.run(Thread.java:662)[:1.6.0_30]

        The same error is logged for all failed bundles.

        In the end, the container failed to shut down cleanly (it hung), with the following exception repeated throughout the log:

        13:39:55,322 | WARN | cached.thread-73 | dardLoggerFactory$StandardLogger 51 | - - | 172.27.210.7/172.27.210.7:5704 [cellar] You probably have too long Hazelcast configuration!
        java.io.IOException: Invalid argument
        at java.net.PlainDatagramSocketImpl.send(Native Method)[:1.6.0_30]
        at java.net.DatagramSocket.send(DatagramSocket.java:625)[:1.6.0_30]
        at com.hazelcast.impl.MulticastService.send(MulticastService.java:148)[197:hazelcast:1.9.4.8]
        at com.hazelcast.impl.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:95)[197:hazelcast:1.9.4.8]
        at com.hazelcast.impl.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:58)[197:hazelcast:1.9.4.8]
        at com.hazelcast.impl.SplitBrainHandler.access$000(SplitBrainHandler.java:22)[197:hazelcast:1.9.4.8]
        at com.hazelcast.impl.SplitBrainHandler$1.doRun(SplitBrainHandler.java:46)[197:hazelcast:1.9.4.8]
        at com.hazelcast.impl.FallThroughRunnable.run(FallThroughRunnable.java:23)[197:hazelcast:1.9.4.8]
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)[:1.6.0_30]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)[:1.6.0_30]
        at java.lang.Thread.run(Thread.java:662)[:1.6.0_30]
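
        The trace above fails inside a multicast send. Whether the newly added interface is responsible is not established in this report; the sketch below only lists which interfaces the JVM reports as up and multicast-capable, using the standard java.net.NetworkInterface methods. The class name MulticastCheck and the rule of skipping loopback interfaces are assumptions made for this illustration, not Hazelcast's own selection logic.

        ```java
        import java.net.NetworkInterface;
        import java.net.SocketException;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.Enumeration;
        import java.util.List;

        // Hypothetical helper, not Hazelcast code.
        final class MulticastCheck {

            /**
             * Assumed rule for this sketch: an interface is a multicast candidate
             * when it is up, is not loopback, and reports multicast support.
             */
            static boolean usableForMulticast(boolean up, boolean loopback, boolean supportsMulticast) {
                return up && !loopback && supportsMulticast;
            }

            /** Returns the names of all local interfaces that pass the rule above. */
            static List<String> multicastCandidates() throws SocketException {
                List<String> names = new ArrayList<String>();
                Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
                if (interfaces == null) {
                    return names; // no interfaces reported by the platform
                }
                for (NetworkInterface nic : Collections.list(interfaces)) {
                    if (usableForMulticast(nic.isUp(), nic.isLoopback(), nic.supportsMulticast())) {
                        names.add(nic.getName());
                    }
                }
                return names;
            }

            public static void main(String[] args) throws SocketException {
                for (String name : multicastCandidates()) {
                    System.out.println(name);
                }
            }
        }
        ```

        If the interface that owns the address shown in the warning (172.27.210.7 here) is missing from the output, that would be a hint worth following up; if it is present, the cause is more likely elsewhere.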

        Jean-Baptiste Onofré added a comment -

        It should be fixed with the Hazelcast 3.2.3 upgrade.


          People

          • Assignee:
            Jean-Baptiste Onofré
            Reporter:
            Alexey Bespaly
          • Votes:
            0
            Watchers:
            3
