Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
8.6
None
None
Description
When a node is shutting down, it calls into:
This sends a request to Overseer to mark all of the replicas DOWN for the soon-to-be down node.
The issue we encountered was as follows:
- Solr node shuts down
- DOWNNODE message is enqueued for Overseer
- Solr node comes back up (running on K8s, so a new node is auto-started as soon as the old node is detected as down)
- DOWNNODE is dequeued for processing and marks all replicas DOWN for the node, which is now live again
The only place where these replicas would later be marked ACTIVE again is after ShardLeaderElection, but we never reached that point. An easy fix is to check node liveness before marking the replicas down, but a lot of tests fail with this change. Is the current behavior intended?
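To make the sequence above concrete, here is a minimal, self-contained toy model of the race and of the proposed liveness check. It is only a sketch: every class, field, and method name in it (`DownNodeRaceSketch`, `overseerQueue`, `processQueue`, and so on) is hypothetical and does not correspond to Solr's actual classes. It also ignores whatever state the restarted node would publish for itself during startup and recovery.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Toy model of the DOWNNODE race. All names are hypothetical, not Solr APIs. */
public class DownNodeRaceSketch {
    public enum State { ACTIVE, DOWN }

    // Simplified stand-ins for cluster state, the live-nodes set, and the Overseer queue.
    final Map<String, State> replicaStates = new HashMap<>();  // replica name -> state
    final Map<String, String> replicaToNode = new HashMap<>(); // replica name -> node name
    final Set<String> liveNodes = new HashSet<>();
    final Queue<String> overseerQueue = new ArrayDeque<>();    // pending DOWNNODE messages

    void addReplica(String replica, String node) {
        replicaToNode.put(replica, node);
        replicaStates.put(replica, State.ACTIVE);
    }

    /** Shutdown: the node leaves the live set and enqueues a DOWNNODE message. */
    void nodeShutsDown(String node) {
        liveNodes.remove(node);
        overseerQueue.add(node);
    }

    /** Restart: the node rejoins the live set, possibly before the queue is drained. */
    void nodeStarts(String node) {
        liveNodes.add(node);
    }

    /** Drains the queue. With checkLiveness, messages for nodes that are live again are skipped. */
    void processQueue(boolean checkLiveness) {
        String node;
        while ((node = overseerQueue.poll()) != null) {
            if (checkLiveness && liveNodes.contains(node)) {
                continue; // stale message: the node came back before we processed it
            }
            for (Map.Entry<String, String> e : replicaToNode.entrySet()) {
                if (e.getValue().equals(node)) {
                    replicaStates.put(e.getKey(), State.DOWN);
                }
            }
        }
    }

    /** Replays the sequence from the report and returns the replica's final state. */
    public static State run(boolean checkLiveness) {
        DownNodeRaceSketch cluster = new DownNodeRaceSketch();
        cluster.liveNodes.add("node1");
        cluster.addReplica("replica1", "node1");
        cluster.nodeShutsDown("node1");      // DOWNNODE is enqueued
        cluster.nodeStarts("node1");         // node is back before the message is processed
        cluster.processQueue(checkLiveness); // message is dequeued late
        return cluster.replicaStates.get("replica1");
    }

    public static void main(String[] args) {
        System.out.println("without liveness check: " + run(false)); // DOWN
        System.out.println("with liveness check:    " + run(true));  // ACTIVE
    }
}
```

In this model the liveness check simply drops a DOWNNODE message whose node is live again. Whether that is safe in the real Overseer depends on behavior the toy leaves out, which may be why the tests mentioned above fail.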