SOLR-12729

SplitShardCmd should lock the parent shard to prevent parallel splitting requests


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.6, 8.0
    • Component/s: AutoScaling
    • Labels: None

    Description

      This scenario was discovered by the simulation framework, but it also exists in the non-simulated code.

      When IndexSizeTrigger requests SPLITSHARD, the command is successfully started and “completed” from the point of view of ExecutePlanAction. In reality, it can still take a significant amount of time before the new replicas fully recover and cause the switch of shard states (parent to INACTIVE, child from RECOVERY to ACTIVE).

      If this time is longer than the trigger's waitFor, the trigger will issue the same SPLITSHARD request again. SplitShardCmd doesn't prevent this new request from being processed because the parent shard is still ACTIVE. However, a section of the code in SplitShardCmd will detect that sub-slices with the target names already exist and are not active, at which point it will delete these sub-slices (SplitShardCmd:182).

      The end result is an infinite loop, where IndexSizeTrigger keeps generating SPLITSHARD requests and SplitShardCmd keeps deleting the recovering sub-slices created by the previous command.

      A simple solution is to mark the parent shard to indicate that it is in the process of splitting, so that no other split is attempted on the same shard. Furthermore, IndexSizeTrigger could temporarily exclude such shards from monitoring.
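
      The idea of marking the parent shard can be illustrated with a minimal sketch. This is not the actual SplitShardCmd code: the class SplitLockSketch and its methods are hypothetical, and an in-memory set stands in for whatever shared cluster state a real fix would use. It only shows the intended behavior: a second split request on the same parent shard is rejected while the first is still in progress, and the trigger can skip shards that are marked as splitting.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names exist in Solr.
public class SplitLockSketch {
    // Keys of parent shards that currently have a split in progress.
    private final Set<String> splitting = ConcurrentHashMap.newKeySet();

    private static String key(String collection, String shard) {
        return collection + "/" + shard;
    }

    // Returns true if the caller may start a split, false if one is already running.
    public boolean tryLock(String collection, String shard) {
        return splitting.add(key(collection, shard));
    }

    // Called once the split has finished or failed, so the shard can be split again.
    public void unlock(String collection, String shard) {
        splitting.remove(key(collection, shard));
    }

    // What a trigger could do: drop shards that are marked as splitting.
    public List<String> monitorable(String collection, List<String> shards) {
        return shards.stream()
                .filter(s -> !splitting.contains(key(collection, s)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SplitLockSketch locks = new SplitLockSketch();
        System.out.println(locks.tryLock("coll", "shard1")); // first request: true
        System.out.println(locks.tryLock("coll", "shard1")); // repeated request: false
        System.out.println(locks.monitorable("coll", List.of("shard1", "shard2")));
        locks.unlock("coll", "shard1");
        System.out.println(locks.tryLock("coll", "shard1")); // after unlock: true
    }
}
```

      In a real cluster the marker would have to be visible to every node that can process a split, and it would need to be cleared even when the split fails; both concerns are outside the scope of this sketch.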

          People

            Assignee: Andrzej Bialecki
            Reporter: Andrzej Bialecki