Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8983

Failed Collection CREATE call should cleanup the cluster state before returning

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 5.5, 6.0
    • Fix Version/s: 5.5.1, 6.0.1, 6.1
    • Component/s: None
    • Labels:
      None

      Description

      In case of a failed collection creation call, the cluster state is updated leaving an entry for the failed collection. This is also returned by the LIST command, allowing the users to believe that the collection exists.

      CREATE call should cleanup in case of a failed attempt at creating the collection.

      1. SOLR-8983.patch
        9 kB
        Anshum Gupta
      2. SOLR-8983.patch
        9 kB
        Anshum Gupta
      3. SOLR-8983.patch
        7 kB
        Anshum Gupta
      4. SOLR-8983.patch
        7 kB
        Anshum Gupta
      5. SOLR-8983.patch
        4 kB
        Anshum Gupta
      6. SOLR-8983.patch
        4 kB
        Anshum Gupta
      7. SOLR-8983-test.patch
        3 kB
        Anshum Gupta

        Issue Links

          Activity

          Hide
          anshumg Anshum Gupta added a comment -

          Here's a test that highlights the issue. Will work on a patch so that this gets into 5.5.1.

          Show
          anshumg Anshum Gupta added a comment - Here's a test that highlights the issue. Will work on a patch so that this gets into 5.5.1.
          Hide
          anshumg Anshum Gupta added a comment -

          Patch with test

          Show
          anshumg Anshum Gupta added a comment - Patch with test
          Hide
          anshumg Anshum Gupta added a comment -

          Ignore that patch, though I'd want to do this but it would break existing apps as it would throw an exception instead of returning a "failure" map. I'll post an updated patch.

          Show
          anshumg Anshum Gupta added a comment - Ignore that patch, though I'd want to do this but it would break existing apps as it would throw an exception instead of returning a "failure" map. I'll post an updated patch.
          Hide
          anshumg Anshum Gupta added a comment -

          I'll commit this first thing tomorrow morning, before starting the back-ports for 5.5.1.

          Show
          anshumg Anshum Gupta added a comment - I'll commit this first thing tomorrow morning, before starting the back-ports for 5.5.1.
          Hide
          anshumg Anshum Gupta added a comment -

          Had to increase the socket time out for one of the tests as it involved cleaning up of a collection and so had started to take longer.

          Show
          anshumg Anshum Gupta added a comment - Had to increase the socket time out for one of the tests as it involved cleaning up of a collection and so had started to take longer.
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Anshum,

          Patch looks good!

          Few comments on the patch:

          1. Can we move the test to not use AbstractFullDistribZkTestBase and use SolrCloudTestCase instead? It should make the test a lot faster in my experience. The current test on my machine took 55s .
          2. In testCreateCollectionCleanup() after we confirm that the collection doesn't exist via the LIST api call, can we try to create a collection without the bogus dataDir param to make sure it gets created successfully.
          3. Maybe we can randomize what param we pass wrongly? We can try with a wrong configName , try with legacyCloud=false etc. It might trigger different artifacts which get left behind to give better test coverage?
          4. A separate test which uses more than one replica so that we know all the cores were cleaned up properly?
          5. In OverseerCollectionMessageHandler there is one INFO and one WARN statement with a similar message. Maybe we should just put across a message like "failed to create collection. Cleaning up artifacts .." ?
          6. The cleanup delete operation does not pass along the async param which might have been used.
          7. Slightly unrelated but in createCollection instead of catching Exception, catching IOException should be enough and we don't even need to catch SolrException ?

          Show
          varunthacker Varun Thacker added a comment - Hi Anshum, Patch looks good! Few comments on the patch: 1. Can we move the test to not use AbstractFullDistribZkTestBase and use SolrCloudTestCase instead? It should make the test a lot faster in my experience. The current test on my machine took 55s . 2. In testCreateCollectionCleanup() after we confirm that the collection doesn't exist via the LIST api call, can we try to create a collection without the bogus dataDir param to make sure it gets created successfully. 3. Maybe we can randomize what param we pass wrongly? We can try with a wrong configName , try with legacyCloud=false etc. It might trigger different artifacts which get left behind to give better test coverage? 4. A separate test which uses more than one replica so that we know all the cores were cleaned up properly? 5. In OverseerCollectionMessageHandler there is one INFO and one WARN statement with a similar message. Maybe we should just put across a message like "failed to create collection. Cleaning up artifacts .." ? 6. The cleanup delete operation does not pass along the async param which might have been used. 7. Slightly unrelated but in createCollection instead of catching Exception, catching IOException should be enough and we don't even need to catch SolrException ?
          Hide
          anshumg Anshum Gupta added a comment -

          Thanks for taking a look Varun!

          Can we move the test to not use AbstractFullDistribZkTestBase and use SolrCloudTestCase instead? It should make the test a lot faster in my experience. The current test on my machine took 55s .

          sure

          In testCreateCollectionCleanup() after we confirm that the collection doesn't exist via the LIST api call, can we try to create a collection without the bogus dataDir param to make sure it gets created successfully.

          There are other tests that check for exactly that, I think we can skip that and save time here.

          Maybe we can randomize what param we pass wrongly? We can try with a wrong configName , try with legacyCloud=false etc. It might trigger different artifacts which get left behind to give better test coverage?

          wrong configName etc, fail fast i.e. without creating any core. If they don't they should. There are wrong config name issues that we could simulate e.g. create a collection, and during the process, force remove the config from zk so that a core creation fails but I think that isn't something that we should include with this issue.

          A separate test which uses more than one replica so that we know all the cores were cleaned up properly?

          Yes, but I think the deleteCollection would already be testing the same. We just reuse the code.

          In OverseerCollectionMessageHandler there is one INFO and one WARN statement with a similar message. Maybe we should just put across a message like "failed to create collection. Cleaning up artifacts .." ?

          warn ? Can you point me to it? Here's what I see:

          Failure
          log.error("Cleaning up collection [" + collectionName + "]." ); and
          log.info("Cleaned up collection [" + collectionName + "].");
          
          Success
          log.debug("Finished create command on all shards for collection: "
                      + collectionName);
          

          The cleanup delete operation does not pass along the async param which might have been used.

          I thought about that but decided against passing that param. Is there a reason why you would want to pass that along? We aren't passing/appending the core responses and if the request was async, the user already received a response about request submission.

          Slightly unrelated but in createCollection instead of catching Exception, catching IOException should be enough and we don't even need to catch SolrException ?

          +1, but I didn't want to include it here. I ideally wanted to fail fast and throw an exception from here (see my previous patches) but that would change the behavior for end users so decided against it for now.

          Show
          anshumg Anshum Gupta added a comment - Thanks for taking a look Varun! Can we move the test to not use AbstractFullDistribZkTestBase and use SolrCloudTestCase instead? It should make the test a lot faster in my experience. The current test on my machine took 55s . sure In testCreateCollectionCleanup() after we confirm that the collection doesn't exist via the LIST api call, can we try to create a collection without the bogus dataDir param to make sure it gets created successfully. There are other tests that check for exactly that, I think we can skip that and save time here. Maybe we can randomize what param we pass wrongly? We can try with a wrong configName , try with legacyCloud=false etc. It might trigger different artifacts which get left behind to give better test coverage? wrong configName etc, fail fast i.e. without creating any core. If they don't they should. There are wrong config name issues that we could simulate e.g. create a collection, and during the process, force remove the config from zk so that a core creation fails but I think that isn't something that we should include with this issue. A separate test which uses more than one replica so that we know all the cores were cleaned up properly? Yes, but I think the deleteCollection would already be testing the same. We just reuse the code. In OverseerCollectionMessageHandler there is one INFO and one WARN statement with a similar message. Maybe we should just put across a message like "failed to create collection. Cleaning up artifacts .." ? warn ? Can you point me to it? Here's what I see: Failure log.error( "Cleaning up collection [" + collectionName + "]." ); and log.info( "Cleaned up collection [" + collectionName + "]." ); Success log.debug( "Finished create command on all shards for collection: " + collectionName); The cleanup delete operation does not pass along the async param which might have been used. I thought about that but decided against passing that param. Is there a reason why you would want to pass that along? We aren't passing/appending the core responses and if the request was async, the user already received a response about request submission. Slightly unrelated but in createCollection instead of catching Exception, catching IOException should be enough and we don't even need to catch SolrException ? +1, but I didn't want to include it here. I ideally wanted to fail fast and throw an exception from here (see my previous patches) but that would change the behavior for end users so decided against it for now.
          Hide
          anshumg Anshum Gupta added a comment -

          Change the CreateCollectionCleanupTest to extend SolrCloudTestCase.

          Show
          anshumg Anshum Gupta added a comment - Change the CreateCollectionCleanupTest to extend SolrCloudTestCase.
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Anshum,

          Awesome. The test now runs in 34s vs 55s ! Most of the time was spent waiting for the collection to come up live

          6435 INFO  (qtp1682863192-26) [n:127.0.0.1:59727_solr    ] o.a.s.h.a.CollectionsHandler Wait for new collection to be active for at most 30 seconds. Check all shard replicas
          36507 ERROR (qtp1682863192-26) [n:127.0.0.1:59727_solr    ] o.a.s.h.a.CollectionsHandler Timed out waiting for new collection's replicas to become ACTIVE with timeout=30
          

          Maybe since we know the collection is going to fail pass a custom solrXML string to MiniSolrCloudCluster which has a smaller createCollectionWaitTimeTillActive ?

          Is there a reason why you would want to pass that along? We aren't passing/appending the core responses and if the request was async, the user already received a response about request submission.

          I guess adding the async param would make the call more resilient for longer delete times?

          we don't even need to catch SolrException

          I think we shouldn't change that . Sorry for the noise. It makes sense to catch it and throw it back so that the caller is informed correctly about a runtime exception.

          Other changes look good! Thanks

          Show
          varunthacker Varun Thacker added a comment - Hi Anshum, Awesome. The test now runs in 34s vs 55s ! Most of the time was spent waiting for the collection to come up live 6435 INFO (qtp1682863192-26) [n:127.0.0.1:59727_solr ] o.a.s.h.a.CollectionsHandler Wait for new collection to be active for at most 30 seconds. Check all shard replicas 36507 ERROR (qtp1682863192-26) [n:127.0.0.1:59727_solr ] o.a.s.h.a.CollectionsHandler Timed out waiting for new collection's replicas to become ACTIVE with timeout=30 Maybe since we know the collection is going to fail pass a custom solrXML string to MiniSolrCloudCluster which has a smaller createCollectionWaitTimeTillActive ? Is there a reason why you would want to pass that along? We aren't passing/appending the core responses and if the request was async, the user already received a response about request submission. I guess adding the async param would make the call more resilient for longer delete times? we don't even need to catch SolrException I think we shouldn't change that . Sorry for the noise. It makes sense to catch it and throw it back so that the caller is informed correctly about a runtime exception. Other changes look good! Thanks
          Hide
          anshumg Anshum Gupta added a comment -

          The test now runs in 12s on my machine. Thanks Varun
          I think we should slowly and gradually start doing this to more of our tests.

          Show
          anshumg Anshum Gupta added a comment - The test now runs in 12s on my machine. Thanks Varun I think we should slowly and gradually start doing this to more of our tests.
          Hide
          anshumg Anshum Gupta added a comment -

          updated patch that improves the logging to explicitly mention artifacts cleanup for a failed create collection request

          Show
          anshumg Anshum Gupta added a comment - updated patch that improves the logging to explicitly mention artifacts cleanup for a failed create collection request
          Hide
          varunthacker Varun Thacker added a comment -

          +1 . Looks good!

          Show
          varunthacker Varun Thacker added a comment - +1 . Looks good!
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 84a6ff697e78cc944067bf7e196533dff7d88b8e in lucene-solr's branch refs/heads/master from anshum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=84a6ff6 ]

          SOLR-8983: Cleanup clusterstate in case of a failed CREATE collection call

          Show
          jira-bot ASF subversion and git services added a comment - Commit 84a6ff697e78cc944067bf7e196533dff7d88b8e in lucene-solr's branch refs/heads/master from anshum [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=84a6ff6 ] SOLR-8983 : Cleanup clusterstate in case of a failed CREATE collection call
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 55de1ba539ca16b0b68de4ad8280d41e3e4bc60d in lucene-solr's branch refs/heads/branch_6x from anshum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=55de1ba ]

          SOLR-8983: Cleanup clusterstate in case of a failed CREATE collection call

          Show
          jira-bot ASF subversion and git services added a comment - Commit 55de1ba539ca16b0b68de4ad8280d41e3e4bc60d in lucene-solr's branch refs/heads/branch_6x from anshum [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=55de1ba ] SOLR-8983 : Cleanup clusterstate in case of a failed CREATE collection call
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit f5d84b6fce82df87c8bb8a0d585ff32e5a9fcd66 in lucene-solr's branch refs/heads/branch_5x from anshum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f5d84b6 ]

          SOLR-8983: Cleanup clusterstate in case of a failed CREATE collection call

          Show
          jira-bot ASF subversion and git services added a comment - Commit f5d84b6fce82df87c8bb8a0d585ff32e5a9fcd66 in lucene-solr's branch refs/heads/branch_5x from anshum [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f5d84b6 ] SOLR-8983 : Cleanup clusterstate in case of a failed CREATE collection call
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit b727111732417871050c80fd7f598f78e1729a2e in lucene-solr's branch refs/heads/branch_5_5 from anshum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b727111 ]

          SOLR-8983: Cleanup clusterstate in case of a failed CREATE collection call

          Show
          jira-bot ASF subversion and git services added a comment - Commit b727111732417871050c80fd7f598f78e1729a2e in lucene-solr's branch refs/heads/branch_5_5 from anshum [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b727111 ] SOLR-8983 : Cleanup clusterstate in case of a failed CREATE collection call
          Hide
          steve_rowe Steve Rowe added a comment -

          Reopening to backport to 6.0.1.

          Show
          steve_rowe Steve Rowe added a comment - Reopening to backport to 6.0.1.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4f9159744c914f881f422f1da8ecef5e0e25cfb6 in lucene-solr's branch refs/heads/branch_6_0 from anshum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4f91597 ]

          SOLR-8983: Cleanup clusterstate in case of a failed CREATE collection call

          For branch_6_0: Modified CreateCollectionCleanupTest.java to use old-style CollectionAdminRequest.Create and .List, since SOLR-8976 will land in 6.1.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4f9159744c914f881f422f1da8ecef5e0e25cfb6 in lucene-solr's branch refs/heads/branch_6_0 from anshum [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4f91597 ] SOLR-8983 : Cleanup clusterstate in case of a failed CREATE collection call For branch_6_0: Modified CreateCollectionCleanupTest.java to use old-style CollectionAdminRequest.Create and .List, since SOLR-8976 will land in 6.1.
          Hide
          steve_rowe Steve Rowe added a comment -

          Replying here because I can't change the commit message: SOLR-8976 is not relevant - I meant that SOLR-8765 will land in 6.1.

          Show
          steve_rowe Steve Rowe added a comment - Replying here because I can't change the commit message: SOLR-8976 is not relevant - I meant that SOLR-8765 will land in 6.1.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5856dcf5082b2016f94c63cf697fa9f5a44ab303 in lucene-solr's branch refs/heads/branch_6_0 from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5856dcf ]

          SOLR-8983: backport to branch_6_0 the required new cloud-minimal configset, originally committed on branch_6x under SOLR-8782

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5856dcf5082b2016f94c63cf697fa9f5a44ab303 in lucene-solr's branch refs/heads/branch_6_0 from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5856dcf ] SOLR-8983 : backport to branch_6_0 the required new cloud-minimal configset, originally committed on branch_6x under SOLR-8782
          Hide
          steve_rowe Steve Rowe added a comment -

          Bulk close issues included in the 6.0.1 release.

          Show
          steve_rowe Steve Rowe added a comment - Bulk close issues included in the 6.0.1 release.

            People

            • Assignee:
              anshumg Anshum Gupta
              Reporter:
              anshumg Anshum Gupta
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development