Solr / SOLR-10020

CoreAdminHandler silently swallows some errors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      With the setup from SOLR-10006, after removing some index files and restarting that Solr instance, I issued a REQUESTRECOVERY command and it came back as a success even though nothing actually happened. When the core is accessed, subsequent calls to getCore() return a core init exception. There is no catch block after the try, so no error is returned to the client.

      Looking through the code I see several other commands that have a similar pattern:
      FORCEPREPAREFORLEADERSHIP_OP
      LISTSNAPSHOTS_OP
      getCoreStatus

      and perhaps others. getCore() can throw an exception; about the only one it explicitly throws is when the core has an initialization error.
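
The failure mode described above can be sketched outside Solr. This is a minimal illustration with hypothetical names (getCore, recoverSilently, and recoverOrThrow are stand-ins for this sketch, not the actual Solr methods):

```java
// Minimal sketch of the swallowed-error pattern described above.
// All names here are hypothetical stand-ins, not real Solr APIs.
public class CoreLookupSketch {

    // Stand-in for a core lookup that yields null for unknown cores.
    static Object getCore(String name) {
        return "collection1".equals(name) ? new Object() : null;
    }

    // Buggy pattern: the missing core is only logged, so the caller
    // still sees success (cf. status=0 in the HTTP response).
    static boolean recoverSilently(String name) {
        Object core = getCore(name);
        if (core == null) {
            System.err.println("Could not find core to call recovery:" + name);
        }
        return true; // caller sees success either way
    }

    // Fixed pattern: propagate the failure so the client sees an error.
    static void recoverOrThrow(String name) {
        Object core = getCore(name);
        if (core == null) {
            throw new IllegalArgumentException("Unable to locate core " + name);
        }
        // ... kick off recovery with the core ...
    }
}
```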

      Attachments

      1. SOLR-10020.patch
        4 kB
        Erick Erickson
      2. SOLR-10020.patch
        4 kB
        Mike Drob
      3. SOLR-10020.patch
        2 kB
        Mike Drob
      4. SOLR-10020.patch
        2 kB
        Mike Drob

        Activity

        Mike Drob added a comment -

        When trying this I see the following in the solr logs:

        7329 INFO  (qtp407457709-27) [n:127.0.0.1:40165_solr    ] o.a.s.h.a.CoreAdminOperation It has been requested that we recover: core=collection1_shard1_replica1
        7330 INFO  (qtp407457709-27) [n:127.0.0.1:40165_solr    ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=collection1_shard1_replica1&action=REQUESTRECOVERY&wt=javabin&version=2} status=0 QTime=1
        7331 ERROR (Thread-30) [n:127.0.0.1:40165_solr    ] o.a.s.h.a.CoreAdminOperation Could not find core to call recovery:collection1_shard1_replica1
        

        So, two things to note here: an error is logged noting that the operation failed, and it is logged after the request has already completed. I think it is correct to process the recovery async, since it could presumably take a long time, but we should also treat it like every other async request and return an id for it.

        In LISTSNAPSHOTS_OP there is an explicit null check that throws SolrException. I agree with you that FORCEPREPAREFORLEADERSHIP_OP should likely throw instead of log; that might be better as a separate issue, though, since I don't know what relies on the current behaviour. getCoreStatus is probably fine, since the problem will be evident from the lack of info returned. I didn't see any other cases where we need to fix anything (REQUESTBUFFERUPDATES_OP throws an exception).
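
The "return an id" idea could be sketched like this (hypothetical names; Solr's real async bookkeeping lives in the core-admin request-status tracking, which this toy does not reproduce):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch (hypothetical names): the handler submits the recovery, hands
// back an id, and the client polls it, much like other async
// core-admin requests are polled via REQUESTSTATUS.
public class AsyncIdSketch {

    static final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // let a demo JVM exit promptly
        return t;
    });
    static final Map<String, Future<?>> tracked = new ConcurrentHashMap<>();

    static String submitRecovery(Runnable recovery) {
        String id = UUID.randomUUID().toString();
        tracked.put(id, pool.submit(recovery));
        return id;
    }

    static String status(String id) {
        Future<?> f = tracked.get(id);
        if (f == null) return "notfound";
        if (!f.isDone()) return "running";
        try {
            f.get();
            return "completed";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    // Convenience for demos/tests: spin until the task is done.
    static String awaitStatus(String id) {
        while ("running".equals(status(id))) Thread.onSpinWait();
        return status(id);
    }
}
```

With this shape, a failed recovery surfaces when the id is polled, instead of being visible only in the server log.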

        Mike Drob added a comment -

        I haven't run any tests yet, but this is what one approach could look like. What do you think, Erick Erickson?

        Mike Drob added a comment -

        ant test came back clean for me.

        Mike Drob added a comment -

        Erick Erickson - Do you have time to look at this patch?

        Erick Erickson added a comment -

        Sorry, this week is crazy, so I probably can't until at least the weekend.

        Erick Erickson added a comment -

        Had a chance to look and this looks fine. We now get a response that shows the FileNotFound error for the three commands I'd eyeballed.....

        +1 and thanks!

        Mike Drob added a comment -

        Looked at this again and realized I had the core close in the wrong spot. New patch attached.
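
The close-placement issue is generic; here is a sketch with hypothetical names (SolrCore's real close() is reference-counted, which this toy class ignores):

```java
// Sketch of why close placement matters: the core must stay open for
// the full duration of the work that uses it. Hypothetical names.
public class CoreCloseSketch {

    static class Core {
        private boolean closed = false;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
        void doWork() {
            if (closed) throw new IllegalStateException("core already closed");
        }
    }

    // Wrong spot: core released before the work that needs it runs.
    static void closeTooEarly(Core core, Runnable work) {
        core.close();
        work.run(); // blows up if the work touches the core
    }

    // Right spot: close only after the work completes, even on error.
    static void closeAfterUse(Core core, Runnable work) {
        try {
            work.run();
        } finally {
            core.close();
        }
    }
}
```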

        Mike Drob added a comment -

        Adding a test, and also unwrapping a layer of async: DefaultSolrCoreState.doRecovery already uses a separate thread/executor, so we don't need to spawn one directly in CoreAdminOperation.
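
The unwrapping described here amounts to validating synchronously before handing off to the existing executor; a sketch with hypothetical names (the executor stands in for the machinery doRecovery already provides):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch (hypothetical names): do the core lookup synchronously so a
// missing core fails the request, and hand only the long-running
// recovery itself to the already-async recovery machinery.
public class AsyncRecoverySketch {

    // Daemon threads so a demo JVM can exit promptly.
    static final ExecutorService recoveryExecutor =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    static Object getCore(String name) {
        return "collection1".equals(name) ? new Object() : null;
    }

    static void requestRecovery(String name) {
        Object core = getCore(name); // synchronous check
        if (core == null) {
            throw new IllegalArgumentException("Unable to locate core " + name);
        }
        // No extra Thread here: the recovery itself is already async.
        recoveryExecutor.submit(() -> { /* doRecovery(core) */ });
    }
}
```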

        Erick Erickson added a comment -

        Mike Drob Do you think this patch is ready to commit? If so I'll look it over again and commit it sometime Real Soon Now.

        Erick

        Mike Drob added a comment -

        Yea, I think this one is ready.

        Erick Erickson added a comment -

        Mike:

        Just to check my understanding here. Essentially you took out a thread that had no other real purpose than to start a thread, right? We haven't changed the asynchronous nature of the call at all for RequestRecovery.

        Looking more closely at
        FORCEPREPAREFORLEADERSHIP_OP
        LISTSNAPSHOTS_OP

        I don't think the same problem occurs there, since they don't spawn threads, so there's nothing to keep the error from propagating back.

        Testing & etc now.

        Mike Drob added a comment -

        Exactly. And because we don't start a new thread in the handler, we can throw an exception which eventually gets back to the caller instead of solely logging the problem.

        LISTSNAPSHOTS_OP also throws an exception, so does not have this problem.

        FORCEPREPAREFORLEADERSHIP_OP logs but does not throw, so a client will not see the problem with a non-existent core. This is easy to fix with something like

                core.getCoreDescriptor().getCloudDescriptor().setLastPublished(Replica.State.ACTIVE);
                log().info("Setting the last published state for this core, {}, to {}", core.getName(), Replica.State.ACTIVE);
              } else {
        -        SolrException.log(log(), "Could not find core: " + cname);
        +        throw new SolrException(ErrorCode.BAD_REQUEST, "Unable to locate core " + cname);
              }
            }
          }),
        

        I didn't do that in this patch because I'm not sure who the callers of this API are and didn't want to rock too many boats at once.

        Erick Erickson added a comment -

        Same patch with CHANGES attribution.

        ASF subversion and git services added a comment -

        Commit 14b3622608d3312eca32ba749132ce2f8531326a in lucene-solr's branch refs/heads/master from Erick
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=14b3622 ]

        SOLR-10020: CoreAdminHandler silently swallows some errors

        ASF subversion and git services added a comment -

        Commit f5ea2022097503df4ed62e59f7d1cb061c8266ee in lucene-solr's branch refs/heads/branch_6x from Erick
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f5ea202 ]

        SOLR-10020: CoreAdminHandler silently swallows some errors

        (cherry picked from commit 14b3622608d3312eca32ba749132ce2f8531326a)

        Erick Erickson added a comment -

        Mark Miller, Yonik Seeley, Ishan Chattopadhyaya: I had some merge issues when I merged CHANGES.txt from trunk to 6x for this JIRA. It seems like some entries were moved around in trunk (SOLR-10114) in CHANGES.txt but not merged into the 6x version.

        The changes from around SOLR-10114 through "optimizations" in the 6.5 section of solr/CHANGES.txt, where I had the unexpected conflict, look OK to me; this is just a heads-up in case I messed up the merge and you want to take a look.


          People

          • Assignee: Erick Erickson
          • Reporter: Erick Erickson
          • Votes: 0
          • Watchers: 3
