Sling / SLING-4139

regression: stale topology announcements possible after crash/reconfig

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Discovery Impl 1.0.12
    • Fix Version/s: Discovery Impl 1.1.0
    • Component/s: Extensions
    • Labels: None

    Description

      discovery.impl 1.0.4, with SLING-3389, introduced a bug that made it possible for a stale topology announcement to remain in the system (without being cleaned up) when a combination of crash/restart and reconfiguration/switch-over of topology connectors occurred.

      SLING-3726 describes one symptom of this problem, which resulted in a duplicate instance in the topology tree reported by discovery.impl.

      Another case where this can be reproduced is the following scenario:

      • consider 3 instances A, B and C. A and B are in the same cluster. C has a topology connector to A.
      • now A crashes, which leaves B and C not seeing each other through the topology (which is correct, since the connector C-A cannot be established)
      • now consider C removing the topology connector (config change): C will then see itself isolated in a topology (which is correct)
      • now consider A restarting
        • at this point the announcement from C is still stored under /var/discovery/impl/clusterInstance/A/announcements/C
        • there is a filter which only reports incoming announcements (i.e. the announcements stored at A in this case) if the connector client (C in this case) is really connected. As a result, A reports a topology which consists of only 1 cluster containing A and B (which is correct).
        • the above-mentioned filter, however, does not apply to B. SLING-3389 removed the writing of announcement timestamps to the repository in order to reduce write activity (which was thought to be unnecessary). Thus after SLING-3389 the idea is that it is A's responsibility to make sure all the announcements it contains (from C in this case) are current/alive/correct.
        • now unfortunately (and that's the bug in this case) there is only a filter (which applies to A) but no actual removal of outdated announcements. Thus B will report a topology consisting of 1 cluster containing A and B, plus it reports C in the topology as well (as it 'sees' C through the announcement stored at A/announcements/C).
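
      The scenario above can be illustrated with a small toy model. This is only a sketch: the class, the maps and the topologyView method are hypothetical names made up for illustration and are not discovery.impl APIs. It assumes that the filter is applied only when an instance looks at the announcements it owns itself.

```java
import java.util.*;

/** Toy model of the scenario; all names are hypothetical, not discovery.impl APIs. */
class StaleAnnouncementDemo {
    // Announcements stored in the repository: owner instance -> ids of announcing instances.
    static final Map<String, Set<String>> stored = new HashMap<>();
    // Connectors currently alive: owner instance -> ids of connected connector clients.
    static final Map<String, Set<String>> connected = new HashMap<>();

    /** Topology as seen by 'viewer', which is a member of 'cluster'. */
    static Set<String> topologyView(String viewer, List<String> cluster) {
        Set<String> view = new TreeSet<>(cluster);
        for (String owner : cluster) {
            Set<String> announcers = stored.getOrDefault(owner, Collections.<String>emptySet());
            Set<String> live = connected.getOrDefault(owner, Collections.<String>emptySet());
            for (String announcer : announcers) {
                // Assumption: the filter only kicks in when the viewer owns the announcement.
                boolean filtered = owner.equals(viewer) && !live.contains(announcer);
                if (!filtered) {
                    view.add(announcer);
                }
            }
        }
        return view;
    }

    public static void main(String[] args) {
        List<String> cluster = Arrays.asList("A", "B");
        // The announcement from C survived A's crash and restart.
        stored.put("A", new HashSet<>(Collections.singleton("C")));
        // C removed its connector, so nothing is connected to A.
        System.out.println("A sees " + topologyView("A", cluster));
        System.out.println("B sees " + topologyView("B", cluster));
    }
}
```

      In this model A filters out the stale entry, while B still picks up C from the announcement stored at A.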

      Hence the filter mechanism which replaced timestamps in SLING-3389 introduced a regression and must be replaced with a proper cleanup mechanism for outdated announcements.
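
      One possible shape of such a cleanup is sketched below. This is not the actual fix: AnnouncementCleanup and removeStale are hypothetical names, plain in-memory sets stand in for the repository nodes, and the extra instance D is invented only to show that announcements with a live connector are kept.

```java
import java.util.*;

/** Hypothetical cleanup sketch; not the actual discovery.impl fix. */
class AnnouncementCleanup {
    /**
     * Removes every stored announcement whose announcer has no live connector.
     * Mutates storedAnnouncers and returns the ids that were removed.
     */
    static Set<String> removeStale(Set<String> storedAnnouncers, Set<String> liveConnectors) {
        Set<String> removed = new TreeSet<>(storedAnnouncers);
        removed.removeAll(liveConnectors);      // what is stored but no longer connected
        storedAnnouncers.removeAll(removed);    // actual removal, not just filtering
        return removed;
    }

    public static void main(String[] args) {
        // Announcements stored at A: C is stale, D (hypothetical) is still connected.
        Set<String> storedAtA = new HashSet<>(Arrays.asList("C", "D"));
        Set<String> live = Collections.singleton("D");
        System.out.println("removed " + removeStale(storedAtA, live));
        System.out.println("remaining " + storedAtA);
    }
}
```

      Because the stale entry is deleted rather than hidden, other cluster members such as B no longer find it when they read A's announcements.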

            People

              stefanegli Stefan Egli