[SLING-5030] replace isolated mode with (larger) TOPOLOGY_CHANGING phase - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Discovery Impl 1.0.2
Fix Version/s: Discovery Impl 1.1.8
Component/s: Extensions
Labels:
None

Description

As described in SLING-3432, one major reason why duplicate leaders occur in discovery.impl is the isolated mode. The rule of the discovery API is that every instance is always in a cluster, which makes sense in itself. However, when the connection to the cluster (i.e. to the repository) is faulty or delayed for some reason, and the rest of the cluster no longer considers the local instance alive (i.e. its heartbeats have timed out), the local instance currently notices this 'isolated' state and wraps itself into a pseudo cluster consisting only of itself, of which it is by definition the leader.

This is completely wrong: there should be no isolated mode. When the local instance is cut off from the cluster in this way, it should immediately send out a TOPOLOGY_CHANGING event and remain in that state until things have settled with the repository and it has successfully taken part in a voting. Only then can it send out a TOPOLOGY_CHANGED event.

This should fix a large number of the situations in which SLING-3432 has been seen.
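
The proposed behaviour can be illustrated with a minimal sketch. This is not the discovery.impl code: the class `TopologyPhaseSketch`, its `Event` enum, and the callbacks `onCutOffFromCluster` and `onVotingSucceeded` are hypothetical names chosen for illustration, and only the two event names come from the description above. The sketch assumes the instance is told when it is cut off and when a voting it took part in has succeeded.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative and not taken from discovery.impl.
class TopologyPhaseSketch {

    // Mirrors the two event names used in the description.
    enum Event { TOPOLOGY_CHANGING, TOPOLOGY_CHANGED }

    // True while the instance is in the TOPOLOGY_CHANGING phase.
    private boolean changing = false;

    // Events sent to listeners so far, in order.
    private final List<Event> sent = new ArrayList<>();

    // Assumed callback: the rest of the cluster no longer sees the local heartbeats.
    // No pseudo cluster is formed; the instance only announces that things are changing.
    void onCutOffFromCluster() {
        if (!changing) {
            changing = true;
            sent.add(Event.TOPOLOGY_CHANGING);
        }
    }

    // Assumed callback: the repository has settled and a voting has succeeded.
    // Only now may TOPOLOGY_CHANGED be sent.
    void onVotingSucceeded() {
        if (changing) {
            changing = false;
            sent.add(Event.TOPOLOGY_CHANGED);
        }
    }

    List<Event> sentEvents() {
        return sent;
    }

    public static void main(String[] args) {
        TopologyPhaseSketch s = new TopologyPhaseSketch();
        s.onCutOffFromCluster();
        s.onCutOffFromCluster(); // still cut off: no second TOPOLOGY_CHANGING
        s.onVotingSucceeded();
        System.out.println(s.sentEvents()); // prints [TOPOLOGY_CHANGING, TOPOLOGY_CHANGED]
    }
}
```

The point of the sketch is that there is no code path in which the instance declares itself leader of a single-member cluster; between the two events listeners only know that the topology is changing.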

Issue Links

blocks

SLING-3432 pseudo network partition causes job deserialization issue in a cluster (when reading while job is being reassigned)

Closed

is related to

SLING-5058 introduce viewCnt to ./establishedView to be able to detect missing changes

Closed

People

Assignee: Stefan Egli

Reporter: Stefan Egli

Votes: 0

Watchers: 2

Dates

Created: 17/Sep/15 08:17

Updated: 30/Sep/15 05:44

Resolved: 24/Sep/15 14:37