ServiceMix NMR
  1. ServiceMix NMR
  2. SMX4NMR-310

NMR Registration Fails on Startup Intermittently

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Environment:

      Fuse 4.3

    • Regression:
      Regression

      Description

      We have a number of services which start when servicemix starts, based on servicemix-cxf-se, which publish endpoints to the NMR implicitly by using org.apache.servicemix.common.osgi.EndpointExporter. We are seeing the situation where after startup, some of our services are missing from the endpoint registry. We cannot see them using the admin command "nmr:list", and in tracking this down, have noticed that the org.apache.servicemix.nmr.core.EndpointRegistryImpl does not contain a mapping for the service. This is clearly a bug, because if we do an osgi:restart on the bundle containing our service the NMR will correctly register the endpoints.

      This is obviously causing us concern. We have worked around other initialization problems with our servicemix-cxf-se services related to generation of proxies by injecting the component registry (org.apache.servicemix.jbi.runtime.ComponentRegistry) to the proxy. We are wondering if there is a similar workaround we must perform in order to ensure our components do not initialize before the NMR service registry tracker is ready to receive them.

      Noticing a normal initialization sees this stacktrace fragment:

      org.apache.servicemix.nmr.core.EndpointRegistryImpl.register(org.apache.servicemix.nmr.api.Endpoint, java.util.Map<java.lang.String,?>) line: 115
      org.apache.servicemix.nmr.core.EndpointRegistryImpl.register(java.lang.Object, java.util.Map) line: 49
      org.apache.servicemix.nmr.osgi.OsgiServiceRegistryTracker<T>.addingService(org.osgi.framework.ServiceReference) line: 78

      So, it seems that the org.apache.servicemix.nmr.osgi.OsgiServiceRegistryTracker listener must be getting initialized after our own bundle in this case. Since this class registers a service tracker, it is my bet that it is not tracking services before our service's endpoint exporter has started running. In another odd twist, the main culprit in our use case exports 3 services, and only one of them seems to be missing.

      The complexity of getting our services to start properly and reliably has been the most significant issue facing us in our port from servicemix 3 to 4. We'd appreciate some help and/or fixes to get us moving in the right direction.

      1. sm-2011-testcase.zip
        20 kB
        Kurt Westerfeld

        Activity

        Gert Vanthienen made changes -
        Project ServiceMix [ 12311206 ] ServiceMix NMR [ 12311209 ]
        Key SM-2011 SMX4NMR-310
        Jeff Turner made changes -
        Project Import Sat Nov 27 00:46:19 EST 2010 [ 1290836779991 ]
        Kurt Westerfeld made changes -
        Attachment sm-2011-testcase.zip [ 19874 ]
        Hide
        Kurt Westerfeld added a comment -

        Re-attaching.

        Show
        Kurt Westerfeld added a comment - Re-attaching.
        Kurt Westerfeld made changes -
        Attachment sm-2011-testcase.zip [ 19873 ]
        Kurt Westerfeld made changes -
        Field Original Value New Value
        Attachment sm-2011-testcase.zip [ 19873 ]
        Hide
        Kurt Westerfeld added a comment -

        This is a testcase to demonstrates the issue.

        Note: the testcase actually does a System.exit() from the seeker service when it successfully locates all the target services. This is actually usually the case, and is intended to be run in a loop where the system is restarted until the problem occurs. Obviously, this is not a normal occurrence-the exiting of the system-but is really the only way to reliably create the conditions of the bug during initialization.

        Show
        Kurt Westerfeld added a comment - This is a testcase to demonstrates the issue. Note: the testcase actually does a System.exit() from the seeker service when it successfully locates all the target services. This is actually usually the case, and is intended to be run in a loop where the system is restarted until the problem occurs. Obviously, this is not a normal occurrence- the exiting of the system -but is really the only way to reliably create the conditions of the bug during initialization.
        Hide
        Freeman Fang added a comment -

        Hi Kurt,

        Thanks for telling use the workaround.

        And absolutely the testcase is necessary and more helpful, we can play with your testcase and do start/clean/restarts to try to reproduce it from our side.

        Best Regards
        Freeman

        Show
        Freeman Fang added a comment - Hi Kurt, Thanks for telling use the workaround. And absolutely the testcase is necessary and more helpful, we can play with your testcase and do start/clean/restarts to try to reproduce it from our side. Best Regards Freeman
        Hide
        Kurt Westerfeld added a comment -

        Chris & Freeman,

        I stumbled onto a fix for this issue, on a hunch I had by looking at how the endpoint exporter does it's thing. Basically, endpoint exporter queries the application context for a list of endpoints, and that process seems to return the list with a set of services that all have manufactured identifiers. For example, each service looks something like "...CxfEndpoint#0", "...CxfEndpoint#1", etc.

        My thinking was that if I declared my cxfse:endpoint with ids, the problem would go away. And sure enough, it does. Now, to me this looks like a bug in spring, but I don't even see how the application context is manufactured from more than one thread. If it is, then that's a problem because it doesn't seem thread-safe. If it isn't, well, I have no explanation.

        To sum up, if I declare my services something like this:

        <cxfse:endpoint id="service1">
        <cxfse:pojo>
        <bean class="com.example.service.Service1" />
        </cxfse:pojo>
        </cxfse:endpoint>

        <cxfse:endpoint id="service2">
        <cxfse:pojo>
        <bean class="com.example.service.Service2" />
        </cxfse:pojo>
        </cxfse:endpoint>

        Then they always are constructed and found on the bus. Without an id declaration, I get random and silent failures on init....basically some of the services never get constructed.

        I did create an example service and the "seeker" service that looks for the services to reproduce this. I can see about whittling that down some to provide you with a testcase, unless you don't need it. Even with this testcase, it does require several start/clean/restarts before the problem ever shows up.

        Show
        Kurt Westerfeld added a comment - Chris & Freeman, I stumbled onto a fix for this issue, on a hunch I had by looking at how the endpoint exporter does it's thing. Basically, endpoint exporter queries the application context for a list of endpoints, and that process seems to return the list with a set of services that all have manufactured identifiers. For example, each service looks something like "...CxfEndpoint#0", "...CxfEndpoint#1", etc. My thinking was that if I declared my cxfse:endpoint with ids, the problem would go away. And sure enough, it does. Now, to me this looks like a bug in spring, but I don't even see how the application context is manufactured from more than one thread. If it is, then that's a problem because it doesn't seem thread-safe. If it isn't, well, I have no explanation. To sum up, if I declare my services something like this: <cxfse:endpoint id="service1"> <cxfse:pojo> <bean class="com.example.service.Service1" /> </cxfse:pojo> </cxfse:endpoint> <cxfse:endpoint id="service2"> <cxfse:pojo> <bean class="com.example.service.Service2" /> </cxfse:pojo> </cxfse:endpoint> Then they always are constructed and found on the bus. Without an id declaration, I get random and silent failures on init....basically some of the services never get constructed. I did create an example service and the "seeker" service that looks for the services to reproduce this. I can see about whittling that down some to provide you with a testcase, unless you don't need it. Even with this testcase, it does require several start/clean/restarts before the problem ever shows up.
        Hide
        Kurt Westerfeld added a comment -

        Chris, on your request here is what I did.

        I have a "broken" scenario where my cxf-se components are deployed from one
        spring artifact. It turns out breaking this scenario into three spring files
        does not help either, so at least it consistently breaks. I mistakenly
        thought this change fixed things, but in creating this controlled test
        scenario it's clear that it does not (read below).

        Using this scenario, I'm able to reproduce my problem about 1 times in 10 starts.
        I wrote a small service that detects the problem by querying the JBI/NMR registry to
        find the 3 services in question, and when it finds all three it simply does a
        System.exit(). This "seeker" service has to be careful on system startup to
        try multiple times after sleeping, to be sure that it indeed cannot contact all
        three of these services. The reason for this is, for my test scenario I would like this
        to run in a loop, where the system is started, and if it starts OK, just restart.

        To test your single-processor affinity theory, I am using Windows so I had to
        use a slightly different technique than your suggestion. But I validated that
        on startup the java executable is indeed running in single CPU affinity (I use
        sysinternals process explorer for this validation). For those that might be
        interested in how to do this, you can download an executable utility that marks
        an executable as having a specific affinity mask here:

        http://www.robpol86.com/index.php/ImageCFG.

        I simply copyied java.exe to another executable file and launch the marked one
        instead of the stock java.exe.

        So, all this runs, and I can reproduce the problem with both multi-processor
        startup and single-processor startup. So, that theory goes out the window.

        I have the ability to test different theories on why this failure occurs if you have
        any other ideas.

        Show
        Kurt Westerfeld added a comment - Chris, on your request here is what I did. I have a "broken" scenario where my cxf-se components are deployed from one spring artifact. It turns out breaking this scenario into three spring files does not help either, so at least it consistently breaks. I mistakenly thought this change fixed things, but in creating this controlled test scenario it's clear that it does not (read below). Using this scenario, I'm able to reproduce my problem about 1 times in 10 starts. I wrote a small service that detects the problem by querying the JBI/NMR registry to find the 3 services in question, and when it finds all three it simply does a System.exit(). This "seeker" service has to be careful on system startup to try multiple times after sleeping, to be sure that it indeed cannot contact all three of these services. The reason for this is, for my test scenario I would like this to run in a loop, where the system is started, and if it starts OK, just restart. To test your single-processor affinity theory, I am using Windows so I had to use a slightly different technique than your suggestion. But I validated that on startup the java executable is indeed running in single CPU affinity (I use sysinternals process explorer for this validation). For those that might be interested in how to do this, you can download an executable utility that marks an executable as having a specific affinity mask here: http://www.robpol86.com/index.php/ImageCFG . I simply copyied java.exe to another executable file and launch the marked one instead of the stock java.exe. So, all this runs, and I can reproduce the problem with both multi-processor startup and single-processor startup. So, that theory goes out the window. I have the ability to test different theories on why this failure occurs if you have any other ideas.
        Hide
        Chris Custine added a comment -

        Hi Kurt,
        This is an interesting issue and has me a bit concerned. We don't currently have any options for debugging/testing on this many cores but I will see if I can find something to reproduce it. It might take some time to get access to this kind of hardware so do let us know if you find anything.

        In the mean time, I am wondering if the frequency of these occurrences are reduced if you run it on a machine with fewer cores. If that isn't an option for you to test with, could you set cpu affinity on start up and see if that makes things better? Adding something like "taskset -c 1,2" to the java command in the bin/servicemix startup script (near line 337 I believe, after the exec command) should limit execution to 2 cores.

        Let us know if you find anything, and we'll try to get this resolved...

        Show
        Chris Custine added a comment - Hi Kurt, This is an interesting issue and has me a bit concerned. We don't currently have any options for debugging/testing on this many cores but I will see if I can find something to reproduce it. It might take some time to get access to this kind of hardware so do let us know if you find anything. In the mean time, I am wondering if the frequency of these occurrences are reduced if you run it on a machine with fewer cores. If that isn't an option for you to test with, could you set cpu affinity on start up and see if that makes things better? Adding something like "taskset -c 1,2" to the java command in the bin/servicemix startup script (near line 337 I believe, after the exec command) should limit execution to 2 cores. Let us know if you find anything, and we'll try to get this resolved...
        Hide
        Kurt Westerfeld added a comment -

        I broke the three service declarations into three separate spring files and a fourth to declare the EndpointExporter, and the initialization problem went away. I don't know what to make of that.

        About the only working theory I have is that on a 64bit machine each of these services could potentially return the same hashcode, since none override hashCode(). If that is the case, then putting them into a map by key might de-duplicate that, resulting in 2 of the three services resulting in a map. I'm only guessing – I've seen this before on especially IBM VMs which return System.identityHashCode() duplicates with greater frequency....I am only using the Sun JVM (1.6.0_18).

        Show
        Kurt Westerfeld added a comment - I broke the three service declarations into three separate spring files and a fourth to declare the EndpointExporter, and the initialization problem went away. I don't know what to make of that. About the only working theory I have is that on a 64bit machine each of these services could potentially return the same hashcode, since none override hashCode(). If that is the case, then putting them into a map by key might de-duplicate that, resulting in 2 of the three services resulting in a map. I'm only guessing – I've seen this before on especially IBM VMs which return System.identityHashCode() duplicates with greater frequency....I am only using the Sun JVM (1.6.0_18).
        Hide
        Kurt Westerfeld added a comment -

        Another update. I added a log message to each of the three services in question's constructor. When this problem occurs, spring does not even create the three services--only 2 of them are created.

        What could cause this? It no longer looks like a servicemix/nmr init timing issue, but one of spring's basic construction of services. Specifically, the cxfse:endpoint is simply not generating a new class in 1 of these 3 services.

        Quite baffling.

        Show
        Kurt Westerfeld added a comment - Another update. I added a log message to each of the three services in question's constructor. When this problem occurs, spring does not even create the three services--only 2 of them are created. What could cause this? It no longer looks like a servicemix/nmr init timing issue, but one of spring's basic construction of services. Specifically, the cxfse:endpoint is simply not generating a new class in 1 of these 3 services. Quite baffling.
        Hide
        Kurt Westerfeld added a comment -

        OK, it looks like this latest workaround I tried didn't solve the problem.

        Here's some factors that might be part of the equation:

        • we install our own feature bundle at first startup
        • the service missing in question is bundled along with 2 other services
        • we create 3 binding components as well, which are companions to these 3 services by the same service name (ie. external)
        • the 3 binding components initialize fine
        • 2 of the 3 service components initialize fine
        • 1 of the 3 service components does not initialize, and there are no messages in the logs indicating why, or what might have gone wrong
        • a successful initialization path just results in all 3 of the service components initializing
        • the service that fails to register is always the first of the 3 service components in the grouping

        The 3 service initializations look like the following (all in the same spring beans.xml file):

        <cxfse:endpoint>
        <cxfse:pojo>
        <bean class="<impl.class.name>" />
        </cxfse:pojo>
        </cxfse:endpoint>

        Show
        Kurt Westerfeld added a comment - OK, it looks like this latest workaround I tried didn't solve the problem. Here's some factors that might be part of the equation: we install our own feature bundle at first startup the service missing in question is bundled along with 2 other services we create 3 binding components as well, which are companions to these 3 services by the same service name (ie. external) the 3 binding components initialize fine 2 of the 3 service components initialize fine 1 of the 3 service components does not initialize, and there are no messages in the logs indicating why, or what might have gone wrong a successful initialization path just results in all 3 of the service components initializing the service that fails to register is always the first of the 3 service components in the grouping The 3 service initializations look like the following (all in the same spring beans.xml file): <cxfse:endpoint> <cxfse:pojo> <bean class="<impl.class.name>" /> </cxfse:pojo> </cxfse:endpoint>
        Hide
        Kurt Westerfeld added a comment -

        I thought about it, and I can't see how I'd create a testcase. My issue involves start/stop/start/stop with the servicemix container, and I believe the issue to be that my box, having 8 cores, can start initialization of some of the OSGi bundles that need to register with the NMR and it hasn't finished installing it's endpoint tracker listener.

        If you do an ls on:

        [ 82] [Active ] [Created ] [ ] [ 60] Apache ServiceMix NMR OSGi (1.3.0.fuse-01-00)

        This service produces the NMR to the registry very early in its initialization:

        osgi.service.blueprint.compname = servicemix
        objectClass = org.apache.servicemix.nmr.api.NMR
        service.id = 178

        I had a look at its blueprint file here:

        http://svn.apache.org/repos/asf/servicemix/smx4/nmr/trunk/nmr/osgi/src/main/resources/OSGI-INF/blueprint/servicemix-nmr.xml

        And it looks to me that service registration of the NMR occurs independent of the OsgiServiceRegistryTracker, and I don't know enough about blueprint to say whether serivce registrations happen inline with bean creation. I can say this: based on the fact that we're still having the problem sporadically, I really don't think the NMR injection would guarantee that the service tracker that OsgiServiceRegistryTracker installs is active.

        So, back to workaound. We currently inject a reference to the NMR to hope that the OsgiServiceRegistryTracker is installed. This doesn't work. So, the only thing else I can try is perhaps make our services depend on the servicemix components themselves, this way:

        <osgi:reference
        id="servicemix-cxf-bc-component"
        interface="javax.jbi.component.Component"
        filter="(NAME=servicemix-cxf-bc)"
        cardinality="1..1" />

        Which is ugly, and it has to be done on each of our bundles that create a service or binding. We have about 40 of these, so it's not that big a deal. I do think, however, an initialization problem like this should/could be handled by a well-known technique.

        Show
        Kurt Westerfeld added a comment - I thought about it, and I can't see how I'd create a testcase. My issue involves start/stop/start/stop with the servicemix container, and I believe the issue to be that my box, having 8 cores, can start initialization of some of the OSGi bundles that need to register with the NMR and it hasn't finished installing it's endpoint tracker listener. If you do an ls on: [ 82] [Active ] [Created ] [ ] [ 60] Apache ServiceMix NMR OSGi (1.3.0.fuse-01-00) This service produces the NMR to the registry very early in its initialization: osgi.service.blueprint.compname = servicemix objectClass = org.apache.servicemix.nmr.api.NMR service.id = 178 I had a look at its blueprint file here: http://svn.apache.org/repos/asf/servicemix/smx4/nmr/trunk/nmr/osgi/src/main/resources/OSGI-INF/blueprint/servicemix-nmr.xml And it looks to me that service registration of the NMR occurs independent of the OsgiServiceRegistryTracker, and I don't know enough about blueprint to say whether serivce registrations happen inline with bean creation. I can say this: based on the fact that we're still having the problem sporadically, I really don't think the NMR injection would guarantee that the service tracker that OsgiServiceRegistryTracker installs is active. So, back to workaound. We currently inject a reference to the NMR to hope that the OsgiServiceRegistryTracker is installed. This doesn't work. So, the only thing else I can try is perhaps make our services depend on the servicemix components themselves, this way: <osgi:reference id="servicemix-cxf-bc-component" interface="javax.jbi.component.Component" filter="(NAME=servicemix-cxf-bc)" cardinality="1..1" /> Which is ugly, and it has to be done on each of our bundles that create a service or binding. We have about 40 of these, so it's not that big a deal. I do think, however, an initialization problem like this should/could be handled by a well-known technique.
        Hide
        Freeman Fang added a comment -

        Hi Kurt,

        Any chance you append a testcase which I can build and deploy to reproduce this problem?
        You can remove your business logic and just a simple testcase to produce the problem you mentioned.

        Freeman

        Show
        Freeman Fang added a comment - Hi Kurt, Any chance you append a testcase which I can build and deploy to reproduce this problem? You can remove your business logic and just a simple testcase to produce the problem you mentioned. Freeman
        Hide
        Kurt Westerfeld added a comment -

        Freeman, I added this osgi reference and it indeed improved our situation.

        However, I cannot report the problem as fixed. We are able to reproduce this problem a lot less often, but I believe there to be a race condition related to the OsgiServiceRegistryTracker initialization.

        So, what should we inject such that we can guarantee our endpoint exporter does indeed register all endpoints to the NMR?

        Show
        Kurt Westerfeld added a comment - Freeman, I added this osgi reference and it indeed improved our situation. However, I cannot report the problem as fixed. We are able to reproduce this problem a lot less often, but I believe there to be a race condition related to the OsgiServiceRegistryTracker initialization. So, what should we inject such that we can guarantee our endpoint exporter does indeed register all endpoints to the NMR?
        Hide
        Kurt Westerfeld added a comment -

        I will try this and report back.

        Show
        Kurt Westerfeld added a comment - I will try this and report back.
        Hide
        Freeman Fang added a comment -

        Hi Kurt,

        How about add

        <osgi:reference id="nmr"
        interface="org.apache.servicemix.nmr.api.NMR"
        cardinality="1..1" />

        into your endpoint beans.xml, this can ensure the NMR always get started before your bundle.

        Freeman

        Show
        Freeman Fang added a comment - Hi Kurt, How about add <osgi:reference id="nmr" interface="org.apache.servicemix.nmr.api.NMR" cardinality="1..1" /> into your endpoint beans.xml, this can ensure the NMR always get started before your bundle. Freeman
        Kurt Westerfeld created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Kurt Westerfeld
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development