Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 7.5, 7.6, 8.0
- None
Description
To prevent nodeLost events from being lost when the lost node is the Overseer leader itself, a mechanism was added in ZkController.registerLiveNodesListener() whereby any other live node records a marker for the event. A similar mechanism also exists for nodeAdded events.
On Overseer leader restart, if the autoscaling configuration doesn't contain any triggers that consume nodeLost events, these markers are removed. If one or more trigger configs consume nodeLost events, those triggers read the markers, remove them, and generate the appropriate events.
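The restart behavior described above can be sketched roughly as follows. This is a simplified, in-memory illustration and not the actual Solr code: the class RestartMarkerSketch, the method cleanUpOnRestart, and the representation of markers as a set of node names are all hypothetical stand-ins for the ZooKeeper-backed markers and the autoscaling trigger configs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class RestartMarkerSketch {

    /**
     * Hypothetical model of the cleanup on Overseer leader restart.
     * markers: node names with a pending nodeLost marker (assumed representation).
     * triggerEventTypes: the event type consumed by each configured trigger.
     * Returns the markers that were discarded because nothing would consume them.
     */
    public static List<String> cleanUpOnRestart(Set<String> markers,
                                                List<String> triggerEventTypes) {
        // If at least one trigger consumes nodeLost events, leave the markers
        // in place so that the triggers can read and remove them.
        if (triggerEventTypes.contains("nodeLost")) {
            return new ArrayList<>();
        }
        // Otherwise no trigger will ever read them, so drop them all.
        List<String> dropped = new ArrayList<>(markers);
        markers.clear();
        return dropped;
    }

    public static void main(String[] args) {
        Set<String> markers = new TreeSet<>(List.of("node1", "node2"));
        System.out.println("dropped: " + cleanUpOnRestart(markers, List.of("nodeAdded")));
        System.out.println("remaining: " + markers);
    }
}
```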
However, as NodeMarkersRegistrationTest shows, this mechanism is broken and susceptible to race conditions.
It's not unusual to have more than one nodeLost trigger because, in addition to any user-defined triggers, there's always one that is automatically defined if missing: .auto_add_replicas. However, if there's more than one nodeLost trigger, the process of consuming and removing the markers becomes non-deterministic - each trigger may pick up (and delete) all, none, or some of the markers.
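The race described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the Solr implementation: the class MarkerRaceSketch and the method consume are hypothetical, the marker store is modeled as a plain set of node names, and a trigger is assumed to generate an event only for a marker it manages to delete itself. The interleaving is scripted by hand so that the outcome is reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MarkerRaceSketch {

    /**
     * Hypothetical trigger step: try to delete every marker the trigger listed
     * earlier. Only markers this call actually removes produce an event.
     */
    public static List<String> consume(Set<String> store, List<String> listed) {
        List<String> events = new ArrayList<>();
        for (String node : listed) {
            if (store.remove(node)) {
                events.add(node);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        Set<String> store = new TreeSet<>(List.of("node1", "node2", "node3"));

        // Both triggers list the markers before either one deletes anything.
        List<String> seenByA = new ArrayList<>(store);
        List<String> seenByB = new ArrayList<>(store);

        // One possible interleaving: A handles its first marker, then B runs
        // to completion, then A resumes with the rest of its stale listing.
        List<String> aFirst = consume(store, seenByA.subList(0, 1));
        List<String> bAll = consume(store, seenByB);
        List<String> aRest = consume(store, seenByA.subList(1, 3));

        System.out.println("trigger A events: " + aFirst + " then " + aRest);
        System.out.println("trigger B events: " + bAll);
    }
}
```

In this interleaving trigger A generates an event only for node1 and trigger B only for node2 and node3, so neither trigger sees the full set of lost nodes. A different ordering would split the markers differently, which is the non-determinism the description refers to.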
As it stands, this mechanism is broken whenever more than one nodeLost trigger or more than one nodeAdded trigger is defined.
Issue Links
- causes: SOLR-14504 ZkController LiveNodesListener has NullPointerException in startup race (Closed)