SOLR-15386

Internal DOWNNODE request will mark replicas down even if their host node is now live


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.6
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

      Description

      When a node is shutting down, it calls into:

      1. CoreContainer.shutdown()
      2. ZkController.preClose()
      3. ZkController.publishNodeAsDown

      This enqueues a DOWNNODE request for the Overseer, asking it to mark all of the replicas on the soon-to-be-down node as DOWN. The Overseer handles that message via (a simplified sketch follows this list):

      1. Overseer.processMessage()
      2. NodeMutator.downNode()
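
      For reference, the Overseer-side processing effectively amounts to something like the following. This is a simplified, self-contained sketch (the Replica/State types and the markReplicasDown() method are illustrative stand-ins, not the actual NodeMutator code), but it shows the behavior we hit: every replica whose node name matches the DOWNNODE message is set to DOWN, with no check against the current live_nodes set.

      {code:java}
      import java.util.List;
      import java.util.Map;

      // Simplified illustration of the Overseer-side handling of a DOWNNODE message.
      // Replica, State, and markReplicasDown() are hypothetical stand-ins, not Solr classes.
      class DownNodeSketch {
          enum State { ACTIVE, DOWN, RECOVERING }

          static class Replica {
              final String name;
              final String nodeName;
              State state = State.ACTIVE;

              Replica(String name, String nodeName) {
                  this.name = name;
                  this.nodeName = nodeName;
              }
          }

          // Mark every replica hosted on downNodeName as DOWN. Nothing here consults
          // the current live_nodes set, so a node that has already come back up still
          // gets all of its replicas marked DOWN.
          static void markReplicasDown(Map<String, List<Replica>> replicasByCollection, String downNodeName) {
              for (List<Replica> replicas : replicasByCollection.values()) {
                  for (Replica replica : replicas) {
                      if (downNodeName.equals(replica.nodeName)) {
                          replica.state = State.DOWN;
                      }
                  }
              }
          }
      }
      {code}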

      The issue we encountered was as follows:

      1. Solr node shuts down
      2. DOWNNODE message is enqueued for Overseer
      3. Solr node comes back up (running on K8s, so a new node is auto-started as soon as the old node is detected as down)
      4. DOWNNODE message is dequeued and processed, marking all replicas DOWN for the node that is now live again.

      The only place where these replicas would later be marked ACTIVE again is after ShardLeaderElection, but we never reached that path. An easy fix is to add a check for node liveness prior to marking replicas down (sketched below), but a lot of tests fail with this change. Was this the intended functionality?
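
      A rough sketch of that liveness check, continuing the illustrative types from the sketch above (the liveNodes set stands in for the cluster state's live_nodes; this is not a patch against NodeMutator):

      {code:java}
      import java.util.List;
      import java.util.Map;
      import java.util.Set;

      // Hypothetical guard: only apply the DOWNNODE state change if the node is still
      // absent from live_nodes at the time the message is actually processed.
      class DownNodeWithLivenessCheck {
          static void markReplicasDownIfStillDead(Map<String, List<DownNodeSketch.Replica>> replicasByCollection,
                                                  String downNodeName,
                                                  Set<String> liveNodes) {
              if (liveNodes.contains(downNodeName)) {
                  // The node has already re-registered as live (e.g. the replacement K8s pod
                  // came up before the Overseer dequeued the message), so skip the state
                  // change rather than overwriting the state of replicas on a live node.
                  return;
              }
              DownNodeSketch.markReplicasDown(replicasByCollection, downNodeName);
          }
      }
      {code}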



            People

            • Assignee: Unassigned
            • Reporter: Megan Carey (megancarey)
            • Votes: 0
            • Watchers: 5
