SOLR-9439: Shard split clean up logic for older failed splits is faulty

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.10.4, 5.5.2, 6.1
    • Fix Version/s: 6.2.1, 6.3, master (7.0)
    • Component/s: SolrCloud
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      In case a split finds that previous sub-shards exist in the construction or recovery state, it tries to clean them up by invoking the deleteshard API. However, the clean-up logic invokes deleteshard on the same sub-shards as many times as the requested number of sub-ranges. Such repeat calls to deleteshard fail and therefore fail the entire shard split operation.
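
      For illustration, a minimal Java sketch of the pattern described above; the names (deleteShard, requestedSubRanges, leftoverSubSlices) are hypothetical and do not come from the actual split code:

         import java.util.Arrays;
         import java.util.List;

         // Hedged sketch only: deleteShard() stands in for a Collections API
         // DELETESHARD call; this is not Solr's real cleanup code.
         public class SplitCleanupSketch {

           static void deleteShard(String collection, String slice) {
             // A second DELETESHARD for a slice that is already gone fails,
             // and that failure aborted the whole SPLITSHARD request.
             System.out.println("DELETESHARD " + collection + "/" + slice);
           }

           public static void main(String[] args) {
             List<String> requestedSubRanges = Arrays.asList("80000000-bfffffff", "c0000000-ffffffff");
             List<String> leftoverSubSlices = Arrays.asList("shard1_0", "shard1_1");

             // Faulty pattern: cleanup nested inside the per-sub-range loop, so each
             // leftover sub-shard is deleted as many times as there are sub-ranges.
             for (String subRange : requestedSubRanges) {
               for (String subSlice : leftoverSubSlices) {
                 deleteShard("collection1", subSlice); // repeat calls fail
               }
             }

             // Fix: delete each leftover sub-shard exactly once, outside the loop.
             for (String subSlice : leftoverSubSlices) {
               deleteShard("collection1", subSlice);
             }
           }
         }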

      Attachments

      1. Lucene-Solr-tests-master.8015.log.gz
        343 kB
        Steve Rowe
      2. SOLR-9439.patch
        11 kB
        Shalin Shekhar Mangar
      3. SOLR-9439.patch
        9 kB
        Shalin Shekhar Mangar
      4. SOLR-9439.patch
        2 kB
        Shalin Shekhar Mangar
      5. SOLR-9439-fix-deleteshard.patch
        11 kB
        Shalin Shekhar Mangar
      6. SOLR-9439-fix-deleteshard.patch
        9 kB
        Shalin Shekhar Mangar


          Activity

          shalinmangar Shalin Shekhar Mangar added a comment -

          Trivial fix is attached. The test which tickled this bug is part of SOLR-9438 but I'll try to write a minimal test case here as well.

          shalinmangar Shalin Shekhar Mangar added a comment -
           1. Uses TestInjection to inject a failure into the split process before the additional replicas are created (the general fault-point pattern is sketched below)
           2. New TestInjection#injectSplitFailureBeforeReplicaCreation method for the above
           3. New test ShardSplitTest#testSplitAfterFailedSplit which fails without the fix but passes with it.

          I'll commit shortly.
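
           For readers unfamiliar with the mechanism, here is a hedged sketch of the general TestInjection fault-point pattern the list above refers to; it is a simplified guess, not the committed TestInjection code:

             // Hedged sketch of a TestInjection-style fault point; the real
             // TestInjection#injectSplitFailureBeforeReplicaCreation differs in detail.
             public class TestInjectionSketch {

               // A test arms the fault point by setting this to a non-null value.
               public static volatile String splitFailureBeforeReplicaCreation = null;

               // The split code path calls this just before creating sub-shard replicas.
               public static boolean injectSplitFailureBeforeReplicaCreation() {
                 if (splitFailureBeforeReplicaCreation != null) {
                   throw new RuntimeException("Injected split failure before replica creation");
                 }
                 return true;
               }

               public static void main(String[] args) {
                 splitFailureBeforeReplicaCreation = "true";
                 try {
                   injectSplitFailureBeforeReplicaCreation();
                 } catch (RuntimeException expected) {
                   System.out.println("Fault injected: " + expected.getMessage());
                 }
               }
             }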

          shalinmangar Shalin Shekhar Mangar added a comment -

           The last patch didn't handle exceptions caused by non-existent cores. This patch adds metadata to the SolrException when an attempt is made to delete a non-existent core. We check for this metadata and abort only if the cause is null or does not indicate a non-existent core.
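
           A hedged sketch of the approach described above; the metadata key is illustrative rather than the one the patch actually uses, and it assumes SolrException's String-keyed setMetadata/getMetadata helpers:

             import org.apache.solr.common.SolrException;
             import org.apache.solr.common.SolrException.ErrorCode;

             // Hedged sketch, not the committed patch; "solr.core.nonexistent" is an
             // illustrative metadata key.
             public class NonExistentCoreSketch {
               static final String NON_EXISTENT_CORE_KEY = "solr.core.nonexistent";

               // Thrown by the unload path when asked to delete a core that does not exist.
               static SolrException nonExistentCoreError(String coreName) {
                 SolrException e = new SolrException(ErrorCode.BAD_REQUEST,
                     "Cannot unload non-existent core [" + coreName + "]");
                 e.setMetadata(NON_EXISTENT_CORE_KEY, "true");
                 return e;
               }

               // Caller side: abort only if the cause is null or does not carry the
               // non-existent-core marker; otherwise treat the core as already gone.
               static boolean shouldAbort(Throwable cause) {
                 if (cause instanceof SolrException
                     && "true".equals(((SolrException) cause).getMetadata(NON_EXISTENT_CORE_KEY))) {
                   return false;
                 }
                 return true;
               }
             }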

          shalinmangar Shalin Shekhar Mangar added a comment -

          With the right patch this time.

          jira-bot ASF subversion and git services added a comment -

          Commit 7d2f42e5436dc669cd48df8dafd45036bd6f9d76 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d2f42e ]

          SOLR-9439: Shard split clean up logic for older failed splits is faulty

          jira-bot ASF subversion and git services added a comment -

          Commit 97b62160e90a262e7b05883d13b8af45d9052705 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=97b6216 ]

          SOLR-9439: Shard split clean up logic for older failed splits is faulty
          (cherry picked from commit 7d2f42e)

          steve_rowe Steve Rowe added a comment -

           My Jenkins has seen ShardSplitTest.testSplitAfterFailedSplit() (the new test committed under this issue) fail 4 times (links below). I tried a couple of the repro lines and they did not reproduce for me (on the same machine where my Jenkins runs).

           One of the failures is shown below; I'm also attaching a gzipped excerpt from the build log for this run (Lucene-Solr-tests-master.8015.log.gz):

             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ShardSplitTest -Dtests.method=testSplitAfterFailedSplit -Dtests.seed=F8621B62A68543EC -Dtests.slow=true -Dtests.locale=mt-MT -Dtests.timezone=America/Costa_Rica -Dtests.asserts=true -Dtests.file.encoding=UTF-8
             [junit4] FAILURE 34.6s J10 | ShardSplitTest.testSplitAfterFailedSplit <<<
             [junit4]    > Throwable #1: java.lang.AssertionError: Shard split did not succeed after a previous failed split attempt left sub-shards in construction state
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([F8621B62A68543EC:12F88CD9AF00E66]:0)
             [junit4]    > 	at org.apache.solr.cloud.ShardSplitTest.testSplitAfterFailedSplit(ShardSplitTest.java:138)
             [junit4]    > 	at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:985)
             [junit4]    > 	at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:960)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
          

          Links to all 4 failing runs (in case more logs would be helpful):

          http://jenkins.sarowe.net/job/Lucene-Solr-tests-6.x/2194/
          http://jenkins.sarowe.net/job/Lucene-Solr-tests-master/8015/
          http://jenkins.sarowe.net/job/Lucene-Solr-tests-6.x/2207/
          http://jenkins.sarowe.net/job/Lucene-Solr-tests-6.x/2214/

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Steve, that is very helpful. I'm seeing related failures in SOLR-9438 so this might have a clue.

          shalinmangar Shalin Shekhar Mangar added a comment -

          The root cause of these test failures is that the deleteshard API is not resilient against non-existent cores. If it fails trying to delete a core which is already deleted then it fails to remove the slice from the cluster state.

          shalinmangar Shalin Shekhar Mangar added a comment - edited

           Actually, the fixes for ignoring unload failures for non-existent cores that I made in this issue are not necessary if the deleteshard API internally calls the deletereplica API, which already does the right thing. Now that we have a parallel mode for the deletereplica API, we can just invoke it and then clear the slice from the cluster state. I'll put up a patch.

          shalinmangar Shalin Shekhar Mangar added a comment -

           Patch which reverts some of the changes I made earlier to track and ignore unload failures for non-existent cores, because they are no longer necessary. This patch changes the delete shard API to call the deletereplica API for all replicas in parallel instead of using custom delete logic.
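
           A hedged sketch of the flow this describes; the interfaces and names below are stand-ins, not the actual Solr overseer/deleteshard code:

             import java.util.Arrays;
             import java.util.List;
             import java.util.concurrent.CompletableFuture;
             import java.util.stream.Collectors;

             // Hedged sketch of "delete shard = delete every replica in parallel,
             // then clear the slice"; the interfaces are illustrative stand-ins.
             public class DeleteShardSketch {

               interface ReplicaDeleter {
                 void deleteReplica(String collection, String slice, String replica);
               }

               interface ClusterStateUpdater {
                 List<String> replicasOf(String collection, String slice);
                 void removeSlice(String collection, String slice);
               }

               static void deleteShard(String collection, String slice,
                                       ReplicaDeleter deleter, ClusterStateUpdater state) {
                 // Delete all replicas of the slice in parallel; the deletereplica logic
                 // is assumed to tolerate cores that no longer exist.
                 List<CompletableFuture<Void>> deletions = state.replicasOf(collection, slice).stream()
                     .map(r -> CompletableFuture.runAsync(() -> deleter.deleteReplica(collection, slice, r)))
                     .collect(Collectors.toList());
                 deletions.forEach(CompletableFuture::join);

                 // Only after every replica deletion has completed, drop the slice
                 // itself from the cluster state.
                 state.removeSlice(collection, slice);
               }

               public static void main(String[] args) {
                 ReplicaDeleter deleter =
                     (c, s, r) -> System.out.println("deletereplica " + c + "/" + s + "/" + r);
                 ClusterStateUpdater state = new ClusterStateUpdater() {
                   public List<String> replicasOf(String c, String s) {
                     return Arrays.asList("core_node1", "core_node2");
                   }
                   public void removeSlice(String c, String s) {
                     System.out.println("removed slice " + c + "/" + s);
                   }
                 };
                 deleteShard("collection1", "shard1_0", deleter, state);
               }
             }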

          I am going to beast this test for a bit before committing.

          shalinmangar Shalin Shekhar Mangar added a comment -

          The last patch failed CollectionsAPISolrJTest.testCreateAndDeleteShard because the changed implementation did not return the "success" flag. I beasted this test 50 times but couldn't get it to fail.

          jira-bot ASF subversion and git services added a comment -

          Commit 02b97a29b747e439bba8ad95a0269f959bea965e in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=02b97a2 ]

          SOLR-9439: The delete shard API has been made more resilient against failures resulting from non-existent cores.

          jira-bot ASF subversion and git services added a comment -

          Commit 6bf9513b9385a53557dc0849eb36a062aceb8e8c in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6bf9513 ]

          SOLR-9439: The delete shard API has been made more resilient against failures resulting from non-existent cores.
          (cherry picked from commit 02b97a2)

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Steve!

          shalinmangar Shalin Shekhar Mangar added a comment -

          Re-opened to back-port to 6.2.1

          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.2.1 release


            People

             • Assignee:
               shalinmangar Shalin Shekhar Mangar
             • Reporter:
               shalinmangar Shalin Shekhar Mangar
             • Votes:
               0
             • Watchers:
               3
