  1. Solr
  2. SOLR-12291

Async prematurely reports completed status that causes severe shard loss


    Details

      Description

      The OverseerCollectionMessageHandler sliceCmd assumes that a node hosts at most one replica of a slice.

      When multiple replicas of a slice are on the same node, we track only one replica's async request. This happens because the async requestMap is keyed by "node_name", so replicas on the same node share a single map entry.
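
The key collision described above can be sketched in isolation. This is a minimal illustration, not the actual OverseerCollectionMessageHandler code: the class `AsyncRequestTracking`, the `SubRequest` type, the node, core, and request ID values, and the composite "node/core" key are all hypothetical names chosen for the example. The second method shows one possible way to avoid the collision; it is not necessarily what the attached patches do.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AsyncRequestTracking {

    /** Hypothetical stand-in for one per-replica async sub-request. */
    static final class SubRequest {
        final String nodeName;
        final String coreName;
        final String requestId;

        SubRequest(String nodeName, String coreName, String requestId) {
            this.nodeName = nodeName;
            this.coreName = coreName;
            this.requestId = requestId;
        }
    }

    /**
     * Tracks requests in a map keyed by node name only, as the description
     * says requestMap is. A later request for the same node replaces the
     * earlier one, so the earlier request ID is never polled.
     */
    static Map<String, String> trackByNode(List<SubRequest> requests) {
        Map<String, String> requestMap = new LinkedHashMap<>();
        for (SubRequest r : requests) {
            requestMap.put(r.nodeName, r.requestId);
        }
        return requestMap;
    }

    /**
     * Assumed alternative: key by node name plus core name, so every
     * replica keeps its own entry even when replicas share a node.
     */
    static Map<String, String> trackByNodeAndCore(List<SubRequest> requests) {
        Map<String, String> requestMap = new LinkedHashMap<>();
        for (SubRequest r : requests) {
            requestMap.put(r.nodeName + "/" + r.coreName, r.requestId);
        }
        return requestMap;
    }

    public static void main(String[] args) {
        // Two replicas of the same slice placed on the same node.
        List<SubRequest> requests = List.of(
            new SubRequest("node1:8983_solr", "coll_shard1_replica_n1", "req-1"),
            new SubRequest("node1:8983_solr", "coll_shard1_replica_n2", "req-2"));

        System.out.println("keyed by node:          " + trackByNode(requests));
        System.out.println("keyed by node and core: " + trackByNodeAndCore(requests));
    }
}
```

With both replicas on one node, the node-keyed map ends up with a single entry holding only the later request ID, while the composite-keyed map retains both. Any code that later polls the status of each map entry would therefore never ask about the first replica's request in the node-keyed case.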

      I discovered this when Mikhail shared some logs of a restore issue in which the second replica was added before the first replica had completed its restorecore action.

      While looking at the logs, I noticed that the overseer never called REQUESTSTATUS for the restorecore action, almost as if it had missed tracking that particular async request.

        Attachments

        1. SOLR-122911.patch
          49 kB
          Varun Thacker
        2. SOLR-12291.patch
          29 kB
          Mikhail Khludnev
        3. SOLR-12291.patch
          65 kB
          Mikhail Khludnev
        4. SOLR-12291.patch
          6 kB
          Mikhail Khludnev
        5. SOLR-12291.patch
          3 kB
          Mikhail Khludnev
        6. SOLR-12291.patch
          56 kB
          Mikhail Khludnev
        7. SOLR-12291.patch
          73 kB
          Mikhail Khludnev
        8. SOLR-12291.patch
          75 kB
          Mikhail Khludnev

            People

            • Assignee:
              mkhl Mikhail Khludnev
              Reporter:
              varun Varun Thacker
