Solr / SOLR-17331

Improve OrderedNodePlacementPlugin placements

Details

    • Type: Test
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.7
    • Component/s: SolrCloud
    • Labels: None

    Description

      The test MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget fails intermittently (< 3% failure rate) on its last assertion, as shown by the trend history of test failures.

       

      This test spins up a 5-node cluster and creates a collection with 3 shards and a replication factor of 2.

      It then vacates 2 randomly chosen nodes using the Migrate Replicas command and, once the migration completes, expects the vacated nodes to hold no replicas and the 6 replicas to be evenly spread across the 3 non-vacated nodes (i.e., 2 replicas on each node).

      However, this last assertion sometimes fails because the replicas are not always evenly spread over the 3 non-vacated nodes.

      The non-source node '127.0.0.1:36007_solr' has the wrong number of replicas after the migration expected:<2> but was:<1> 
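The check behind this message can be sketched as follows. This is an illustrative reconstruction, not the actual test code: the class name, the method name and the message format are hypothetical, and it assumes at least one node is left after vacating.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class EvenSpreadCheckSketch {

    /**
     * Returns a description of the first node violating the expectation, or
     * null when vacated nodes are empty and every other node holds exactly
     * (total replicas / number of non-vacated nodes) replicas.
     */
    static String firstViolation(Map<String, Integer> replicasPerNode, Set<String> vacated) {
        int total = replicasPerNode.values().stream().mapToInt(Integer::intValue).sum();
        int remaining = replicasPerNode.size() - vacated.size();
        int expected = total / remaining; // in the test: 6 replicas / 3 nodes = 2
        for (Map.Entry<String, Integer> e : replicasPerNode.entrySet()) {
            int want = vacated.contains(e.getKey()) ? 0 : expected;
            if (e.getValue() != want) {
                return e.getKey() + ": expected " + want + " but was " + e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical uneven outcome on a 5-node cluster with 2 vacated nodes.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("NODE_0", 2);
        counts.put("NODE_1", 3);
        counts.put("NODE_2", 1);
        counts.put("NODE_3", 0);
        counts.put("NODE_4", 0);
        System.out.println(firstViolation(counts, Set.of("NODE_3", "NODE_4")));
        // prints NODE_1: expected 2 but was 3
    }
}
```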

       

      A closer look at a failing run shows that this test is inherently expected to fail under some circumstances, given how the Migrate Replicas command operates.

      When migrating replicas, the new positions of the replicas to be moved are calculated sequentially and, for each consecutive move, the position is decided by the logic of the currently configured replica placement plugin.

      We can therefore end up in the following situation.

      Failing scenario

      Note that this test always uses the default replica placement strategy, which is Simple as of today.

      Let's assume the following initial state, after the collection creation.

              |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
      --------+---------+---------+---------+---------+---------+
      SHARD_1 |    X    |         |         |    X    |         |
      SHARD_2 |         |    X    |         |    X    |         |
      SHARD_3 |         |         |    X    |         |    X    | 

      The test now runs the migrate command to vacate NODE_3 and NODE_4. It therefore needs to perform 3 replica moves to empty these two nodes.

      Move 1

      We are moving the replica of SHARD_1 positioned on NODE_3.

      NODE_0 is not an eligible destination for this replica as this node is already assigned a replica of SHARD_1, and both NODE_1 and NODE_2 can be chosen as they host the same number of replicas.

      NODE_1 is arbitrarily chosen amongst the two best candidate destination nodes.

              |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
      --------+---------+---------+---------+---------+---------+
      SHARD_1 |    X    |    X    |         |         |         |
      SHARD_2 |         |    X    |         |    X    |         |
      SHARD_3 |         |         |    X    |         |    X    | 

      Move 2

      We are moving the replica of SHARD_2 positioned on NODE_3.

      NODE_1 is not an eligible destination for this replica as this node is already assigned a replica of SHARD_2, and both NODE_0 and NODE_2 can be chosen as they host the same number of replicas.

      NODE_0 is arbitrarily chosen amongst the two best candidate destination nodes.

              |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
      --------+---------+---------+---------+---------+---------+
      SHARD_1 |    X    |    X    |         |         |         |
      SHARD_2 |    X    |    X    |         |         |         |
      SHARD_3 |         |         |    X    |         |    X    |

      Move 3

      We are moving the replica of SHARD_3 positioned on NODE_4.

      NODE_2 is not an eligible destination for this replica as this node is already assigned a replica of SHARD_3, and both NODE_0 and NODE_1 can be chosen as they host the same number of replicas.

      NODE_1 is arbitrarily chosen amongst the two best candidate destination nodes.

              |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
      --------+---------+---------+---------+---------+---------+
      SHARD_1 |    X    |    X    |         |         |         |
      SHARD_2 |    X    |    X    |         |         |         |
      SHARD_3 |         |    X    |    X    |         |         |

       

      The test then fails because the replicas are not evenly spread across the non-vacated nodes (NODE_0, NODE_1 and NODE_2 end up with 2, 3 and 1 replicas respectively), even though this is arguably the expected outcome given how the Simple placement strategy is implemented.
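The three moves above can be replayed with a small simulation. This is a minimal sketch, not Solr's placement code: it assumes each move picks the least-loaded non-vacated node that does not already host the shard, and the class, the methods and the scripted tie-breaker are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class SequentialPlacementSketch {

    /** Moves replicas off the vacated nodes one at a time; ties are resolved by tieBreaker. */
    static Map<String, Set<String>> vacate(Map<String, Set<String>> initial,
                                           List<String> vacated,
                                           Function<List<String>, String> tieBreaker) {
        Map<String, Set<String>> state = new LinkedHashMap<>();
        initial.forEach((node, shards) -> state.put(node, new LinkedHashSet<>(shards)));
        for (String source : vacated) {
            for (String shard : new ArrayList<>(state.get(source))) {
                // Collect the least-loaded eligible nodes, given the moves decided so far.
                List<String> best = new ArrayList<>();
                int bestLoad = Integer.MAX_VALUE;
                for (String node : state.keySet()) {
                    if (vacated.contains(node) || state.get(node).contains(shard)) {
                        continue; // vacated, or already hosts a replica of this shard
                    }
                    int load = state.get(node).size();
                    if (load < bestLoad) { best.clear(); bestLoad = load; }
                    if (load == bestLoad) { best.add(node); }
                }
                if (best.isEmpty()) {
                    throw new IllegalStateException("No eligible node for " + shard);
                }
                String target = tieBreaker.apply(best);
                state.get(source).remove(shard);
                state.get(target).add(shard);
            }
        }
        return state;
    }

    /** Initial state from the description: 3 shards, 2 replicas each, 5 nodes. */
    static Map<String, Set<String>> initialState() {
        Map<String, Set<String>> s = new LinkedHashMap<>();
        s.put("NODE_0", new LinkedHashSet<>(List.of("SHARD_1")));
        s.put("NODE_1", new LinkedHashSet<>(List.of("SHARD_2")));
        s.put("NODE_2", new LinkedHashSet<>(List.of("SHARD_3")));
        s.put("NODE_3", new LinkedHashSet<>(List.of("SHARD_1", "SHARD_2")));
        s.put("NODE_4", new LinkedHashSet<>(List.of("SHARD_3")));
        return s;
    }

    /** Replays Moves 1 to 3 with the same arbitrary tie-break choices as above. */
    static Map<String, Integer> replayFailingScenario() {
        Iterator<String> picks = List.of("NODE_1", "NODE_0", "NODE_1").iterator();
        Map<String, Set<String>> end =
            vacate(initialState(), List.of("NODE_3", "NODE_4"), candidates -> picks.next());
        Map<String, Integer> counts = new LinkedHashMap<>();
        end.forEach((node, shards) -> counts.put(node, shards.size()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(replayFailingScenario());
        // prints {NODE_0=2, NODE_1=3, NODE_2=1, NODE_3=0, NODE_4=0}
    }
}
```

Under the same assumptions, the outcome depends only on the tie-breaks: had Move 2 picked NODE_2 instead of NODE_0, Move 3 would have had NODE_0 as its single best candidate and each non-vacated node would have ended up with 2 replicas.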

      People

              Assignee: Houston Putman (houston)
              Reporter: Yohann Callea (ycallea)