SOLR-10365

Collection re-creation fails if previous collection creation had failed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Steps to reproduce:

      1. Create a collection using a bad configset that has errors, so that collection creation fails.
      2. Now create a collection with the same name, but a good configset. This sometimes fails (about 25-30% of the time, by my rough estimate).

      Here's what happens during the second step (can be seen from stacktrace below):

      1. In CoreContainer's create(CoreDescriptor, boolean, boolean), there's a line: zkSys.getZkController().preRegister(dcore);
      2. This calls ZkController's publish(), which in turn calls CoreContainer's getCore() method. This call should return null (since the previous attempt at core creation didn't succeed), but it throws the exception associated with the previous failure.

      Here's the stack trace for the same.

      Caused by: org.apache.solr.common.SolrException: SolrCore 'newcollection2_shard1_replica1' is not available due to init failure: blahblah
      	at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1312)
      	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1225)
      	at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1399)
      	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:945)
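      The lookup behavior behind this trace can be illustrated with a small self-contained sketch. The class, field, and method names below are illustrative simplifications, not Solr's actual code: the point is that a failed init leaves an entry in a coreInitFailures map, and a later getCore() for that name rethrows the stale failure instead of returning null.

      ```java
      import java.util.HashMap;
      import java.util.Map;

      // Hypothetical, simplified model of the core lookup described above.
      public class CoreLookupSketch {
          static final Map<String, Object> cores = new HashMap<>();
          static final Map<String, Exception> coreInitFailures = new HashMap<>();

          static Object getCore(String name) {
              Object core = cores.get(name);
              if (core != null) return core;
              Exception failure = coreInitFailures.get(name);
              if (failure != null) {
                  // This is what the bug report hits: the caller expected null
                  // for a non-existent core, but gets the previous failure.
                  throw new IllegalStateException(
                      "SolrCore '" + name + "' is not available due to init failure",
                      failure);
              }
              return null; // core simply doesn't exist
          }

          public static void main(String[] args) {
              coreInitFailures.put("newcollection2_shard1_replica1",
                  new Exception("bad configset"));
              try {
                  getCore("newcollection2_shard1_replica1");
                  System.out.println("returned");
              } catch (IllegalStateException e) {
                  System.out.println("threw: " + e.getMessage());
              }
          }
      }
      ```
      
      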
      

      While working on SOLR-6736, I ran into this (nasty?) issue. I'll try to isolate it into a standalone test that demonstrates the problem. For now, it can be seen in SOLR-6736's testUploadWithScriptUpdateProcessor() test (which tries to re-create the collection, but sometimes fails).

      1. SOLR-10365.patch
        4 kB
        Ishan Chattopadhyaya
      2. SOLR-10365.patch
        4 kB
        Ishan Chattopadhyaya
      3. SOLR-10365.patch
        4 kB
        Ishan Chattopadhyaya
      4. SOLR-10365.patch
        0.8 kB
        Ishan Chattopadhyaya

        Activity

        ichattopadhyaya Ishan Chattopadhyaya added a comment -

        Here's a patch that fixes this situation.

        noble.paul Noble Paul added a comment -

        LGTM

        But how can we test this?

        ichattopadhyaya Ishan Chattopadhyaya added a comment -

        Thanks for your review, Noble.

        I think what is happening is the following:

        How does a failed collection get cleaned up?

        1. At CoreContainer's create(CoreDescriptor,boolean,boolean) method, there's a preRegister step. This publishes the core as DOWN before even attempting to initialize the core.
        2. When there's a failure to initialize the core, the CoreContainer's coreInitFailures map gets populated with the exception.
        3. At OCMH, when the CreateCollection command fails, a cleanup attempt is made. This calls DELETE, which in turn invokes the UNLOAD core admin command from DeleteCollectionCmd.java.
        4. This UNLOAD command is invoked from OCMH's collectionCmd() method, which calls UNLOAD on every replica registered in step 1.
        5. At the replica's CoreContainer, when the unload() method is invoked, the coreInitFailures map gets cleared.
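        The dependency in steps 3-5 can be sketched as follows (a simplified, hypothetical model, not Solr's actual classes): unload() is the only thing that clears the recorded init failure, so if the UNLOAD never reaches the replica, a retry with the same core name still sees the stale exception.

        ```java
        import java.util.HashMap;
        import java.util.Map;

        // Illustrative sketch of why cleanup depends on UNLOAD reaching the replica.
        public class CleanupSketch {
            static final Map<String, Exception> coreInitFailures = new HashMap<>();

            // Step 2 above: a failed init records its exception.
            static void recordInitFailure(String core, Exception e) {
                coreInitFailures.put(core, e);
            }

            // Step 5 above: unload removes the failure entry.
            static void unload(String core) {
                coreInitFailures.remove(core);
            }

            // If the entry survives, the next create with this name hits it.
            static boolean retryWouldFail(String core) {
                return coreInitFailures.containsKey(core);
            }
        }
        ```

        If the DOWN publish is lost and step 4 never issues the UNLOAD, unload() is never called and retryWouldFail() stays true for that core name.
        
        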

        This is all fine, when it works. However, the publish step in preRegister seems to work only intermittently. Sometimes the publish doesn't take effect: I can see that the state operation is offered to the distributed queue properly, but that message doesn't actually seem to get processed. Hence, at step 4, no UNLOAD command is sent to the replica. The latest SOLR-6736 patch's TestConfigSetsAPI#testUploadWithScriptUpdateProcessor() demonstrates this.

        While this may be a larger issue with the way OCMH works, the patch I added here does the job in those circumstances, and the code path followed after the core is registered successfully removes the previous exception from the coreInitFailures map. Unless someone objects, I am inclined to commit this patch, then commit SOLR-6736, and continue investigating the above scenario.
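        A minimal, hypothetical sketch of the fix direction (names below are illustrative, not the actual patch): during the publish step, treat a core whose lookup throws an init-failure exception the same as a core that does not exist yet, so the stale exception from the previous attempt cannot abort the new create.

        ```java
        // Illustrative sketch only; the real change lives in Solr's publish path.
        public class PublishSketch {
            // Stand-in for the exception a failed previous init leaves behind.
            static class CoreInitFailureException extends RuntimeException {
                CoreInitFailureException(String msg) { super(msg); }
            }

            interface CoreLookup { Object getCore(String name); }

            // For publishing state, a core with a stale init failure is
            // equivalent to a core that doesn't exist: return null, don't throw.
            static Object lookupForPublish(CoreLookup container, String name) {
                try {
                    return container.getCore(name);
                } catch (CoreInitFailureException e) {
                    return null;
                }
            }
        }
        ```
        
        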

        jira-bot ASF subversion and git services added a comment -

        Commit 0322068ea4648c93405da5b60fcbcc3467f5b009 in lucene-solr's branch refs/heads/master from Ishan Chattopadhyaya
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0322068 ]

        SOLR-10365: Handle a SolrCoreInitializationException while publishing core state during SolrCore creation

        jira-bot ASF subversion and git services added a comment -

        Commit c37cb7e94e312fbfe650cb4cc4e812dbc2034478 in lucene-solr's branch refs/heads/branch_6x from Ishan Chattopadhyaya
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c37cb7e ]

        SOLR-10365: Handle a SolrCoreInitializationException while publishing core state during SolrCore creation


          People

          • Assignee:
            ichattopadhyaya Ishan Chattopadhyaya
            Reporter:
            ichattopadhyaya Ishan Chattopadhyaya
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:
