Solr / SOLR-17049

Marking replicas down at startup and waiting does not wait


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.6
    • Fix Version/s: 9.7, 9.6.1
    • Component/s: None
    • Labels: None

    Description

      We observed an unexpected behaviour where a node was taking traffic for a replica that was not ready to take it. It seems to happen when the node is marked as live and the replica is marked as active, while the corresponding core is not loaded yet on the node.
       
      I looked at the code, and in theory this should not happen, because ZkController#init does the following in order: mark the node as down, wait for its replicas to be marked as down, and then register the node as live. However, publishAndWaitForDownStates only waits for down states of the replicas whose cores are returned by CoreContainer#getCoreDescriptors, and that list is empty at this point: ZkController#init is called before cores are discovered, which happens later in CoreContainer#load.
       
      It therefore seems to me that we never actually wait for any replica to be marked as down. The startup sequence continues and marks the node as live, so the node might take traffic for a short period of time for a replica that is not ready (e.g., if the node previously crashed and the replica stayed active).
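
      The ordering problem described above can be illustrated with a small, self-contained Java sketch. This is a toy model and not Solr code: the class StartupWaitSketch, its fields, and the discoverFirst flag are all hypothetical names introduced for illustration, and the "discover first" branch is only a contrast case, not a description of the committed fix. The sketch shows that waiting on an empty list of known cores returns immediately, so a replica left "active" by a crash is still active when the node goes live.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the startup ordering; none of these names are Solr APIs. */
class StartupWaitSketch {

    /** Simulated cluster state: core name -> replica state ("active" or "down"). */
    final Map<String, String> clusterState = new LinkedHashMap<>();

    /** Cores the node knows about locally; empty until discovery has run. */
    final List<String> knownCores = new ArrayList<>();

    StartupWaitSketch(Map<String, String> staleState) {
        clusterState.putAll(staleState);
    }

    /** Marks every locally known core as down; returns how many replicas were covered. */
    int publishAndWaitForDownStates() {
        for (String core : knownCores) {
            clusterState.put(core, "down");
        }
        return knownCores.size(); // 0 when called before discovery: nothing to wait for
    }

    /** Lists the replicas currently marked active in the simulated cluster state. */
    List<String> activeReplicas() {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, String> e : clusterState.entrySet()) {
            if ("active".equals(e.getValue())) {
                active.add(e.getKey());
            }
        }
        return active;
    }

    /** Runs one startup and returns the replicas still active when the node becomes live. */
    static List<String> start(Map<String, String> staleState,
                              List<String> coresOnDisk,
                              boolean discoverFirst) {
        StartupWaitSketch node = new StartupWaitSketch(staleState);
        if (discoverFirst) {
            node.knownCores.addAll(coresOnDisk); // hypothetical contrast ordering
        }
        node.publishAndWaitForDownStates();
        // The node registers as live here; anything still "active" can receive traffic.
        List<String> exposed = node.activeReplicas();
        if (!discoverFirst) {
            node.knownCores.addAll(coresOnDisk); // reported ordering: discovery happens too late
        }
        return exposed;
    }

    public static void main(String[] args) {
        Map<String, String> stale = Map.of("core1", "active"); // left over from a crash
        List<String> onDisk = List.of("core1");
        System.out.println("reported order: " + start(stale, onDisk, false)); // [core1]
        System.out.println("discover first: " + start(stale, onDisk, true));  // []
    }
}
```

      With the reported ordering, core1 is still marked active at the moment the node becomes live; in the contrast case the list of exposed replicas is empty.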

            People

              Assignee: Houston Putman (houston)
              Reporter: Vincent Primault (vprimault)
              Votes: 1
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 40m