  1. CXF Distributed OSGi (Retired)
  2. DOSGI-173

unregistering an exported service does not remove it from zookeeper (and remote clients)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • 1.5.0
    • 1.5.0
    • None
    • None
    • Unknown

    Description

      I have some bundles exporting and consuming services, running on two hosts. I've noticed more than once that while stopping and starting different bundles on the two hosts (just playing around with them manually to see how robust the distributed system is), at some point one of the hosts doesn't see that a service it was using from the other host is down. Connecting to ZooKeeper directly, I see that the node for that service is still there, i.e. the service was not properly removed from ZK even though the bundle is stopped and the service is gone.

      Investigating this is a bit tricky, since it involves various trackers, endpoint listeners and service listeners and there is not enough code documentation to understand what the intended flow is... however I've found a few interesting related findings that may point at the solution:

      1. Following the logs and some debugging, it appears that the problem is not with the discovery.zookeeper package/bundle itself, since the endpoint removed event never gets there.

      2. In EndpointListenerNotifier.notifyListenersOfRemoval(), the EndpointDescription appears to be null, so there is never a filter match and the endpointRemoved callback is never triggered on the EndpointListeners. This is because all of the ExportRegistrations are already closed by the time the notifier processes them. It seems that the premature closing is done by the service tracker created in ExportRegistrationImpl.startServiceTracker(). My guess is that the order in which the service tracker and the service listener (in TopologyManagerExport, which triggers the EndpointListenerNotifier) receive the events is arbitrary and depends on a race condition, which may explain why the bug reproduces only intermittently. I would like to say that the solution is to get rid of the service tracker altogether (it doesn't do anything else, and as a separate bug, is never closed), but I'm not sure why it was introduced in the first place or if there are any other scenarios in which it was necessary, so I really don't know what the proper solution should be.

      3. Another element that may have been masking this bug to some degree is the local discovery bundle which was running, and during debugging I saw it triggering some EndpointListener removal events which were picked up by the other components. I'm not entirely sure yet of what this bundle does (I didn't find any mention of it on the website, and didn't get to the code yet), but I just leave this bundle in the stopped state for now, with no visible effects on the testing, making debugging easier.

      4. An additional related issue which bugged me during a previous code review was that InterfaceMonitorManager.addInterest() is closing and recreating an InterfaceMonitor every time it is invoked with an existing scope, even though the old and new IMs monitor the same ZK node and are practically identical - so why not just leave the old monitor running? This replacement causes a bunch of unnecessary extra work (including several ZK server accesses), a flurry of unnecessary filter-matching logs, and an unnecessary gap in monitoring for ZK changes. This also relates to the bug at hand since InterfaceMonitor.close() also sends some EndpointListener notifications about the endpoints being removed, which leaves some gaps in the registration coverage (before they are re-added moments later) and might interact in some other unpredictable (at least to me) way with the rest of the mechanism. It seems these IM close/start cycles sometimes occur tens of times in a row.
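
      The ordering problem described in finding 2 can be illustrated with a small, self-contained sketch. It does not use the real DOSGi classes: Registration, notifyListenersOfRemoval() and unregister() below are hypothetical stand-ins, and the only assumption carried over from the report is that a closed registration no longer exposes its endpoint, so the notifier has nothing to match against.

```java
import java.util.ArrayList;
import java.util.List;

class RemovalOrderSketch {

    /** Hypothetical stand-in for an export registration; not the real ExportRegistrationImpl. */
    static final class Registration {
        private String endpointId;

        Registration(String endpointId) {
            this.endpointId = endpointId;
        }

        /** Returns null once the registration has been closed. */
        String getEndpointId() {
            return endpointId;
        }

        void close() {
            endpointId = null;
        }
    }

    /** Endpoint ids for which a removal notification was actually delivered. */
    static final List<String> removed = new ArrayList<>();

    /** Mimics a notifier that silently skips registrations whose endpoint is already gone. */
    static void notifyListenersOfRemoval(Registration reg) {
        String id = reg.getEndpointId();
        if (id == null) {
            return; // nothing to match against, so no listener is called
        }
        removed.add(id);
    }

    /** Runs the two callbacks in the given order and returns how many notifications went out. */
    static int unregister(boolean trackerFirst) {
        removed.clear();
        Registration reg = new Registration("endpoint-1");
        if (trackerFirst) {
            reg.close();                   // tracker callback closes the registration first
            notifyListenersOfRemoval(reg); // listener callback arrives too late
        } else {
            notifyListenersOfRemoval(reg); // listener callback still sees a live endpoint
            reg.close();
        }
        return removed.size();
    }

    public static void main(String[] args) {
        System.out.println("tracker first:  " + unregister(true));
        System.out.println("listener first: " + unregister(false));
    }
}
```

      In this sketch the removal is delivered only when the listener callback runs before the close, which matches the intermittent behaviour described above if the real callback order is not guaranteed.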

      To sum it up, there's definitely a bug occurring. When I tested a bit with fixes for both potential causes above (IM stop/start replaced with a single start the first time a given scope is encountered, and close invocation in service tracker removed), I could no longer recreate the bug, but I don't understand all the component interactions well enough to know if there are any side effects, or why they were implemented this way in the first place (I tend to assume there was a good reason for it which I'm unaware of).
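
      The "single start the first time a given scope is encountered" idea can be sketched as follows. This is only an illustration under assumptions, not the attached patch: Monitor, addInterest() and startCount() here are hypothetical simplifications, and the real InterfaceMonitorManager deals with ZooKeeper connections and listener bookkeeping that are omitted.

```java
import java.util.HashMap;
import java.util.Map;

class MonitorReuseSketch {

    /** Hypothetical stand-in for a ZooKeeper node monitor; not the real InterfaceMonitor. */
    static final class Monitor {
        final String scope;
        boolean running;

        Monitor(String scope) {
            this.scope = scope;
        }

        void start() {
            running = true;
        }
    }

    private final Map<String, Monitor> monitors = new HashMap<>();
    private int starts;

    /** Starts a monitor the first time a scope is seen and reuses it on later calls. */
    synchronized Monitor addInterest(String scope) {
        return monitors.computeIfAbsent(scope, s -> {
            Monitor m = new Monitor(s);
            m.start();
            starts++; // counts how often a monitor was really started
            return m;
        });
    }

    synchronized int startCount() {
        return starts;
    }

    public static void main(String[] args) {
        MonitorReuseSketch manager = new MonitorReuseSketch();
        Monitor first = manager.addInterest("(objectClass=com.example.Foo)");
        Monitor second = manager.addInterest("(objectClass=com.example.Foo)");
        System.out.println("same monitor reused: " + (first == second));
        System.out.println("monitors started: " + manager.startCount());
    }
}
```

      Because the existing monitor is kept, repeated calls with the same scope cause no close/start cycle and therefore no spurious removal notifications or monitoring gap. Whether the real code needs the restart for some other reason is exactly the open question raised above.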

      Attachments

        1. fix_zk_unregisteration.diff
          6 kB
          Amichai Rothman

            People

              Assignee: Amichai Rothman
              Reporter: Amichai Rothman