Details

      Description

      Solr’s shard split can lose documents if the parent/sub-shard leader is killed (or crashes) between the time the new sub-shard replica is created and the time it finishes recovering. In such a case the slice has already been put into the ‘recovery’ state; the sub-shard replica comes up, finds that no other replica is up, waits until the leader vote wait time expires, and then proceeds to become the leader as well as publish itself as active. If the former leader node comes back online, the overseer, seeing that all replicas of the sub-shard are now ‘active’, sets the parent slice to ‘inactive’ and the new sub-shard to ‘active’.

      1. SOLR-9438.patch
        27 kB
        Shalin Shekhar Mangar
      2. SOLR-9438.patch
        26 kB
        Shalin Shekhar Mangar
      3. SOLR-9438.patch
        22 kB
        Shalin Shekhar Mangar
      4. SOLR-9438.patch
        21 kB
        Shalin Shekhar Mangar
      5. SOLR-9438.patch
        9 kB
        Shalin Shekhar Mangar
      6. SOLR-9438.patch
        9 kB
        Shalin Shekhar Mangar
      7. SOLR-9438-false-replication.log
        1007 kB
        Shalin Shekhar Mangar
      8. SOLR-9438-split-data-loss.log
        1.00 MB
        Shalin Shekhar Mangar

          Activity

          shalinmangar Shalin Shekhar Mangar added a comment -

          A simple fix is for the overseer to check live node information before setting the parent shard to ‘inactive’. This will work because, by the time the leader vote wait period expires, the killed former leader’s ephemeral nodes should have expired.

          But it gets trickier if the leader comes back online and recovers from this new (incomplete) replica. This will again mark the sub-shard as active. To prevent this, the overseer must ensure that the live node of the sub-shard leader still exists (with the same sequence number assigned at the time of split) before changing the sub-slice state to active.
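
          For illustration, the live-node half of that check could look roughly like the sketch below; the path and class name are assumptions for the example, not the actual Overseer code.

            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.ZooKeeper;

            /** Illustrative sketch only. */
            class LiveNodeCheck {
              /** True if the (former) parent shard leader's live node still exists in ZooKeeper. */
              static boolean parentLeaderStillLive(ZooKeeper zk, String parentLeaderNodeName)
                  throws KeeperException, InterruptedException {
                // Live nodes are ephemeral: if the leader was killed, this znode disappears once
                // its session expires, which should happen before the leader vote wait elapses.
                return zk.exists("/live_nodes/" + parentLeaderNodeName, false) != null;
              }
            }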

          shalinmangar Shalin Shekhar Mangar added a comment -

          We should also mark the slice as recovery_failed in case we find that the live node has changed. Any sub-shard in this new state should not be forwarded updates. It will also be a clear indication that the shard split operation has failed and must be retried.
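
          As a hypothetical illustration of that routing rule, assuming the proposed state ends up as a Slice.State constant named RECOVERY_FAILED (the helper method itself is made up for this sketch):

            import java.util.ArrayList;
            import java.util.Collection;
            import java.util.List;

            import org.apache.solr.common.cloud.Slice;

            class SubShardUpdateTargets {
              /** Hypothetical helper: only sub-shards whose recovery did not fail receive forwarded updates. */
              static List<Slice> forwardableSubShards(Collection<Slice> subShards) {
                List<Slice> targets = new ArrayList<>();
                for (Slice slice : subShards) {
                  // A slice in recovery_failed must not receive updates; the split has to be retried.
                  if (slice.getState() != Slice.State.RECOVERY_FAILED) {
                    targets.add(slice);
                  }
                }
                return targets;
              }
            }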

          shalinmangar Shalin Shekhar Mangar added a comment -

          A test which attempts to tickle this bug. It is not deterministic because it tries to kill the leader at just the right time; we retry a few times and time out after 2 minutes. Beasting this test around 50 times has reproduced the bug once. The test works around SOLR-9440 by calling cloudClient.getZkStateReader().registerCore("collection1");

          shalinmangar Shalin Shekhar Mangar added a comment -

          Patch with the test updated to master.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Beasting this test sometimes fails with nodes not recovering even after working around SOLR-9440. I finally found the cause:

            [beaster]   2> 232316 ERROR (coreZkRegister-123-thread-1-processing-n:127.0.0.1:54683_ x:collection1_shard1_0_replica0 s:shard1_0 c:collection1 r:core_node7) [n:127.0.0.1:54683_ c:collection1 s:shard1_0 r:core_node7 x:collection1_shard1_0_replica0] o.a.s.c.ZkContainer :org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1_0
            [beaster]   2> 	at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:994)
            [beaster]   2> 	at org.apache.solr.cloud.ZkController.register(ZkController.java:900)
            [beaster]   2> 	at org.apache.solr.cloud.ZkController.register(ZkController.java:843)
            [beaster]   2> 	at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
            [beaster]   2> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
            [beaster]   2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            [beaster]   2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            [beaster]   2> 	at java.lang.Thread.run(Thread.java:745)
            [beaster]   2> Caused by: org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1_0 our state says:http://127.0.0.1:54683/collection1_shard1_0_replica1/ but zookeeper says:http://127.0.0.1:49547/collection1_shard1_0_replica1/
            [beaster]   2> 	at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:975)
            [beaster]   2> 	... 7 more
          

          The problem is that restarting the node (which assigns a new port number) sometimes confuses the hell out of SolrCloud, and then such nodes keep their old port number in the cluster state and never recover, can't elect leaders, etc. I have a suspicion that this behavior is intentional. I'll keep digging.

          shalinmangar Shalin Shekhar Mangar added a comment -

          This log is from a run which reproduced this bug. A newly created replica logs the following:

          38737 INFO  (parallelCoreAdminExecutor-8-thread-1-processing-n:127.0.0.1:42309_ 9208de91-9c97-4a42-94f5-9e00e3b6189b388000949708638 CREATE) [n:127.0.0.1:42309_ c:collection1 s:shard1_1 r:core_node8 x:collection1_shard1_1_replica0] o.a.s.c.ShardLeaderElectionContext Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later
          

          After this point, this replica becomes the leader (with 0 docs inside!) and eventually when the old replica comes back up, it syncs with this empty index and loses all data except for whatever was indexed after the split.

          shalinmangar Shalin Shekhar Mangar added a comment -

          The attached SOLR-9438-false-replication.log shows another kind of failure.

          1. We create the sub-shard replica and restart the leader node.
          2. The leader comes back online and the replica tries to recover.
          3. The leader reports its index version as 0.
          4. The replica, seeing the master version as 0, assumes the index is empty and reports the replication as successful.
          5. The sub-shard becomes active.

          The root cause is that after the split we do not commit, so the commit timestamp (used for version checks) is not written to the index. When the leader is restarted, IndexWriter.close performs a commit on close, and upon restart the leader reports its version as 0 even though it contains data.
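
          As a rough sketch of the direction of the fix, the split code could write commit user data carrying a timestamp and commit explicitly. This uses Lucene's IndexWriter directly and the key name is an assumption for illustration, not necessarily what the patch does:

            import java.io.IOException;
            import java.util.HashMap;
            import java.util.Map;

            import org.apache.lucene.index.IndexWriter;

            class SplitCommitSketch {
              /**
               * Illustrative only: after writing the split partitions, commit with user data that
               * records a commit timestamp, so that a restarted leader reports a non-zero version
               * to replicas instead of looking like an empty index.
               */
              static void commitWithTimestamp(IndexWriter writer) throws IOException {
                Map<String, String> userData = new HashMap<>();
                userData.put("commitTimeMSec", String.valueOf(System.currentTimeMillis()));
                writer.setLiveCommitData(userData.entrySet());
                writer.commit();
              }
            }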

          shalinmangar Shalin Shekhar Mangar added a comment -

          A scenario related to the above that I ran into: if this new replica becomes the leader, anyone else trying to replicate from the leader will also report a successful replication with 0 docs.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Patch with a simplified testSplitWithChaosMonkey() that doesn't restart repeatedly (to avoid the port change problems) and yet reproduces all bugs on beasting.

          I also added another test called testSplitStaticIndexReplication which tickles the bug related to commit data not being present. A fix for this is also included.

          There are still a few nocommits.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Changes:

          1. We record the parent leader node name and the ephemeral owner of its live node (the ZK session id which created the live node) at the start of the split process.
          2. These two pieces of information, called "shard_parent_node" and "shard_parent_zk_session" respectively, are stored in the cluster state along with the slice information.
          3. When all replicas of all sub-shards are live, the overseer checks whether the parent leader node is still live and whether its ephemeral owner is still the same. If yes, it switches the sub-shard states to active and the parent to inactive. If not, it changes the sub-shard state to a newly introduced "recovery_failed" state (a rough sketch of this check follows after this list).
          4. Any shard in the "recovery_failed" state does not receive any indexing or querying traffic.
          5. I beefed up the test to check for both outcomes and to assert that all documents that were successfully indexed are visible in a distributed search. Additionally, if the split succeeds, we also assert that all replicas of the sub-shards are consistent, i.e. have the same number of docs.
          6. Fixed a test bug where concurrent watcher invocations on the collection state would shut down the leader node again even after the test had already restarted it to assert document counts.
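
          The overseer-side decision described in point 3 could look roughly like the following sketch. The property names follow the description above, but reading them out of a plain map, the /live_nodes path, and the class itself are assumptions for illustration, not the actual patch:

            import java.util.Map;

            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.ZooKeeper;
            import org.apache.zookeeper.data.Stat;

            class SubShardStateDecision {
              /** Returns the state a sub-shard slice should switch to once all of its replicas are live. */
              static String decide(ZooKeeper zk, Map<String, Object> sliceProps)
                  throws KeeperException, InterruptedException {
                String parentNode = String.valueOf(sliceProps.get("shard_parent_node"));
                long parentSession = Long.parseLong(String.valueOf(sliceProps.get("shard_parent_zk_session")));
                Stat stat = zk.exists("/live_nodes/" + parentNode, false);
                // The parent leader must still be live *and* still be the same ZooKeeper session
                // that was recorded at the start of the split; otherwise it restarted in between.
                boolean parentUnchanged = stat != null && stat.getEphemeralOwner() == parentSession;
                return parentUnchanged ? "active" : "recovery_failed";
              }
            }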

          Results of beasting are looking good as far as this particular bug is concerned, but there is a curious failure where one and only one core stays down and times out the waiting-for-recovery check. I'm still digging.

          shalinmangar Shalin Shekhar Mangar added a comment -

          The test failure due to only one node remaining down was because sometimes the parent leader node itself is selected to host the new sub-shard replica. When we shut down that node at the right time, the add replica call fails but the replica has already been created in the cluster state. Since the physical core doesn't actually exist, it will never recover and stays in the down state.

          Changes:

          1. The test now checks that all replicas actually exist as cores before we wait for recovery and for sub-shards to switch states.
          2. The split shard API puts sub-shards into the recovery_failed state itself if the parent leader changes before any replicas can be created.

          I've been beasting this test and so far everything looks good but I'll continue beasting for a little while more.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Fixed comments and javadocs in a few places. I beasted this test overnight for more than 500 runs and everything looks good! I'll commit this shortly.

          jira-bot ASF subversion and git services added a comment -

          Commit f177a660f5745350207dc61b46396b49404fd383 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f177a66 ]

          SOLR-9438: Shard split can be marked successful and sub-shard states switched to 'active' even when one or more sub-shards replicas do not recover due to the leader crashing or restarting between the time the replicas are created and before they can recover

          jira-bot ASF subversion and git services added a comment -

          Commit 1b03c940398de384c60fb0083a82ddb601db3909 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1b03c94 ]

          SOLR-9438: Shard split can be marked successful and sub-shard states switched to 'active' even when one or more sub-shards replicas do not recover due to the leader crashing or restarting between the time the replicas are created and before they can recover

          (cherry picked from commit f177a66)

          jira-bot ASF subversion and git services added a comment -

          Commit 8027eb9803c2248a427f45c3771d1cb88d57f4b1 in lucene-solr's branch refs/heads/branch_6_2 from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8027eb9 ]

          SOLR-9438: Shard split can be marked successful and sub-shard states switched to 'active' even when one or more sub-shards replicas do not recover due to the leader crashing or restarting between the time the replicas are created and before they can recover

          (cherry picked from commit f177a66)

          (cherry picked from commit 1b03c94)

          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.2.1 release


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Shalin Shekhar Mangar
            • Votes: 0
            • Watchers: 4
