discovery.impl 1.0.4, with SLING-3389, introduced a bug that made it possible for a stale topology announcement to remain in the system (without being cleaned up) when a combination of crash/restart and reconfiguration/switch-over of topology connectors occurred. SLING-3726 describes one symptom of this problem, which resulted in a duplicate instance in the topology tree reported by discovery.impl.
Another case where this can be reproduced is the following scenario:
- consider 3 instances A, B and C. A and B are in the same cluster. C has a topology connector to A.
- now A crashes - which leaves B and C not seeing each other through the topology (which is correct since the connector C-A is not possible)
- now consider C removing the topology connector (config change) - hence C will see itself isolated in a topology (which is correct)
- now consider A restarting
- at this point the announcement from C is still stored under /var/discovery/impl/clusterInstance/A/announcements/C
- there is a filter which only reports incoming announcements (i.e. the announcements stored at A in this case) if the connector client (C in this case) is really connected. This results in A reporting a topology which consists of only 1 cluster containing A and B (which is correct).
- the above-mentioned filter, however, does not apply to B.
SLING-3389 removed the writing of announcement timestamps to the repository in order to reduce write activity (the timestamps were thought to be unnecessary). Thus, since SLING-3389, the idea is that it is A's responsibility to make sure all the announcements it stores (from C in this case) are current/alive/correct.
- now unfortunately (and that's the bug in this case) there is only a filter (which applies to A) but no actual removal of outdated announcements. Thus B will report a topology consisting of 1 cluster containing A and B, plus it reports C in the topology as well (as it 'sees' C through the announcement stored at A/announcements/C).
Hence the filter mechanism which replaced timestamps in SLING-3389 introduced a regression and must be replaced with a proper cleanup mechanism for outdated announcements.
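
To make the difference between filtering and cleanup concrete, here is a minimal in-memory sketch of the scenario above. It is not the discovery.impl code: the class StaleAnnouncementSketch, its fields and its methods are all hypothetical, and the repository is modelled as a plain map. It only illustrates why a filter that applies on A alone leaves B with a stale view, and why deleting the stored announcement would fix both views.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of the scenario; none of these names exist in discovery.impl. */
class StaleAnnouncementSketch {
    // owner -> senders, standing in for .../clusterInstance/<owner>/announcements/<sender>
    final Map<String, Set<String>> announcements = new HashMap<>();
    // connectors that are currently alive, keyed as "<owner><-<sender>"
    final Set<String> liveConnectors = new HashSet<>();
    // members of the local cluster (A and B in the scenario)
    final Set<String> localCluster = new TreeSet<>();

    void store(String owner, String sender) {
        announcements.computeIfAbsent(owner, k -> new HashSet<>()).add(sender);
    }

    boolean isConnected(String owner, String sender) {
        return liveConnectors.contains(owner + "<-" + sender);
    }

    /** Topology as computed by the given viewer instance. */
    Set<String> topologySeenBy(String viewer) {
        Set<String> result = new TreeSet<>(localCluster);
        for (Map.Entry<String, Set<String>> e : announcements.entrySet()) {
            String owner = e.getKey();
            for (String sender : e.getValue()) {
                // The filter is only effective on the instance owning the connector.
                boolean filtered = owner.equals(viewer) && !isConnected(owner, sender);
                if (!filtered) {
                    result.add(sender);
                }
            }
        }
        return result;
    }

    /** Assumed fix: the owner deletes announcements whose connector is gone. */
    void cleanup(String owner) {
        Set<String> senders = announcements.get(owner);
        if (senders != null) {
            senders.removeIf(sender -> !isConnected(owner, sender));
        }
    }

    /** State after A has restarted and C has removed its connector. */
    static StaleAnnouncementSketch scenarioAfterRestart() {
        StaleAnnouncementSketch s = new StaleAnnouncementSketch();
        s.localCluster.add("A");
        s.localCluster.add("B");
        s.store("A", "C"); // stale announcement left over from before the crash
        return s;          // no live connector A<-C is registered
    }

    public static void main(String[] args) {
        StaleAnnouncementSketch s = scenarioAfterRestart();
        System.out.println("A sees " + s.topologySeenBy("A")); // A sees [A, B]
        System.out.println("B sees " + s.topologySeenBy("B")); // B sees [A, B, C]
        s.cleanup("A");
        System.out.println("B sees " + s.topologySeenBy("B")); // B sees [A, B]
    }
}
```

In this sketch the filter hides C only from A, while B still picks up the stored announcement. Once A actually removes the outdated entry, every instance in the cluster computes the same topology without needing a per-viewer filter.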