[SOLR-13072] Management of markers for nodeLost / nodeAdded events is broken - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 7.5, 7.6, 8.0
Fix Version/s: 7.7, 8.0, 9.0
Component/s: AutoScaling
Labels: None

Description

To prevent nodeLost events from being lost when the lost node is the Overseer leader itself, a mechanism was added in ZkController.registerLiveNodesListener() that lets any other live node record markers for these events. A similar mechanism also exists for nodeAdded events.

On Overseer leader restart, if the autoscaling configuration doesn't contain any triggers that consume nodeLost events, these markers are removed. If there are one or more trigger configs that consume nodeLost events, these triggers read the markers, remove them, and generate the appropriate events.
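The restart behavior described above can be sketched as follows. This is a simplified in-memory model, not Solr's code: the class MarkerCleanupSketch, the method onOverseerRestart and the "nodeLost:" event strings are hypothetical names chosen for illustration, and a plain Set stands in for the marker nodes kept in ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class MarkerCleanupSketch {
    /** Hypothetical stand-in for the ZooKeeper location holding nodeLost markers. */
    static final Set<String> nodeLostMarkers = new TreeSet<>();

    /**
     * Models an Overseer leader restart. triggerEventTypes lists the event type
     * consumed by each configured trigger. Returns the events that get generated.
     */
    static List<String> onOverseerRestart(List<String> triggerEventTypes) {
        List<String> events = new ArrayList<>();
        if (triggerEventTypes.contains("nodeLost")) {
            // At least one consumer exists: turn every marker into an event.
            for (String node : nodeLostMarkers) {
                events.add("nodeLost:" + node);
            }
        }
        // In both cases the markers are gone afterwards: either consumed by a
        // trigger, or removed because nothing would ever consume them.
        nodeLostMarkers.clear();
        return events;
    }
}
```

The model assumes a single consumer. It does not capture what happens when several triggers consume the same markers.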

However, as NodeMarkersRegistrationTest shows, this mechanism is broken and susceptible to race conditions.

It's not unusual to have more than one nodeLost trigger because, in addition to any user-defined triggers, there's always one that is automatically defined if missing: .auto_add_replicas. However, if there's more than one nodeLost trigger, the process of consuming and removing the markers becomes non-deterministic: each trigger may pick up (and delete) all, none, or some of the markers.

So as it stands, this mechanism is broken whenever more than one nodeLost or more than one nodeAdded trigger is defined.
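To make the race concrete, here is a minimal sketch of one possible interleaving of two triggers that share the same markers. It is an illustration under stated assumptions, not Solr's implementation: MarkerRaceSketch, snapshot, deleteSeen and interleavedRun are hypothetical names, the node names n1..n3 are made up, and a concurrent in-memory set stands in for the markers in ZooKeeper. The interleaving is scripted by hand so the outcome is reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

class MarkerRaceSketch {
    /** Hypothetical shared marker store (sorted, so snapshots have a stable order). */
    static final Set<String> markers = new ConcurrentSkipListSet<>();

    /** A trigger first lists the markers that currently exist. */
    static List<String> snapshot() {
        return new ArrayList<>(markers);
    }

    /** Then it deletes the ones it saw; it only "owns" those it actually removed. */
    static List<String> deleteSeen(List<String> seen) {
        List<String> claimed = new ArrayList<>();
        for (String marker : seen) {
            if (markers.remove(marker)) {
                claimed.add(marker);
            }
        }
        return claimed;
    }

    /** Scripted interleaving of triggers A and B; returns [claimedByA, claimedByB]. */
    static List<List<String>> interleavedRun() {
        markers.clear();
        markers.addAll(List.of("n1", "n2", "n3"));

        List<String> seenA = snapshot();   // A sees n1, n2, n3
        List<String> seenB = snapshot();   // B sees the same three markers

        List<String> claimedA = new ArrayList<>(deleteSeen(seenA.subList(0, 1))); // A removes n1
        List<String> claimedB = deleteSeen(seenB);                                // B removes n2, n3
        claimedA.addAll(deleteSeen(seenA.subList(1, 3)));                         // nothing left for A

        return List.of(claimedA, claimedB);
    }
}
```

In this run trigger A ends up with only n1 and trigger B with n2 and n3, so neither trigger sees the full set of lost nodes. A different scheduling would give a different split, including one trigger getting everything and the other nothing.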

Issue Links

causes: SOLR-14504 ZkController LiveNodesListener has NullPointerException in startup race (Closed)

People

Assignee: Andrzej Bialecki

Reporter: Andrzej Bialecki

Votes: 0

Watchers: 2

Dates

Created: 13/Dec/18 18:43

Updated: 07/Aug/23 17:25

Resolved: 06/Mar/19 12:52