
SOLR-9092: Add safety checks to delete replica/shard/collection commands

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2, 7.0
    • Component/s: None
    • Labels: None

      Description

      We should verify the delete commands against live_nodes to make sure the API can at least be executed correctly.

      Suppose we have a two-node cluster and a collection with 1 shard and 2 replicas. Call the delete replica command for the replica whose node is currently down.
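
      For reference, the request that triggers this is a plain Collections API call along these lines (the collection, shard, and replica names match the error output further down; the host and port are illustrative):

         http://192.168.1.101:8983/solr/admin/collections?action=DELETEREPLICA&collection=gettingstarted&shard=shard1&replica=core_node3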

      You get an exception:

      <response>
         <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">5173</int>
         </lst>
         <lst name="failure">
            <str name="192.168.1.101:7574_solr">org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.1.101:7574/solr</str>
         </lst>
      </response>
      

      At this point the entry for the replica is gone from state.json. The client application retries since an error was thrown, but the delete command can never succeed now and an error like this will be seen:

      <response>
         <lst name="responseHeader">
            <int name="status">400</int>
            <int name="QTime">137</int>
         </lst>
         <str name="Operation deletereplica caused exception:">org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
         <lst name="exception">
            <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
            <int name="rspCode">400</int>
         </lst>
         <lst name="error">
            <lst name="metadata">
               <str name="error-class">org.apache.solr.common.SolrException</str>
               <str name="root-error-class">org.apache.solr.common.SolrException</str>
            </lst>
            <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
            <int name="code">400</int>
         </lst>
      </response>
      

      For create collection/add-replica we check the "createNodeSet" and "node" params respectively against live_nodes to make sure the request has a chance of succeeding.

      We should add a check against live_nodes for the delete commands as well.
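
      As a rough illustration, the kind of check being proposed might look something like the sketch below (illustrative only, not an actual patch; it assumes the caller already has the ClusterState and the Replica being deleted in hand, and the class/method names are made up):

      // Sketch only: reject a delete request up front when the replica's node is not
      // currently live, mirroring the createNodeSet/node checks mentioned above.
      import org.apache.solr.common.SolrException;
      import org.apache.solr.common.cloud.ClusterState;
      import org.apache.solr.common.cloud.Replica;

      class DeleteReplicaLiveNodeCheck {
        static void verifyReplicaNodeIsLive(ClusterState clusterState, Replica replica) {
          String nodeName = replica.getNodeName();
          if (!clusterState.getLiveNodes().contains(nodeName)) {
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                "Cannot delete replica " + replica.getName() + ": node " + nodeName
                    + " is not in live_nodes");
          }
        }
      }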

      Another situation where I saw this can be a problem: a second Solr cluster was cloned from the first, but the script didn't correctly change the hostnames in the state.json file. When a delete command was issued against the second cluster, Solr deleted the replica from the first cluster.

      In the above case the script was obviously buggy, but if we verified against live_nodes then Solr wouldn't have gone ahead and deleted replicas not belonging to its cluster.

      Attachments

      1. SOLR-9092.patch (3 kB) Varun Thacker
      2. SOLR-9092.patch (3 kB) Varun Thacker

        Activity

        shalinmangar Shalin Shekhar Mangar added a comment -

        The delete replica API already has an 'onlyIfDown' parameter. I think you want to add an `onlyIfLive` parameter? The delete replica API is supposed to delete non-live replicas from cluster state, and I think that should remain the default, because otherwise how else would you remove a replica which has been decommissioned? If we change the default to be onlyIfLive=true then that's a major back-compat break for this API.
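
        (For context, onlyIfDown is an existing DELETEREPLICA parameter, used along these lines; the collection/shard/replica names below are just the ones from this ticket:

           /admin/collections?action=DELETEREPLICA&collection=gettingstarted&shard=shard1&replica=core_node3&onlyIfDown=true )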

        anshumg Anshum Gupta added a comment -

        Agreed that changing the default of onlyIfLive to true would break back-compat, but we still need to fix this. This could easily spiral into a bigger problem with large clusters. Also, we should fix whatever is returned to the user in terms of failure/success. If it's a failure, the user should have a way to at least retry and clean up whatever cruft was left behind.
        If we choose to return a success, we should be responsible for returning more information about what was not cleaned up from the disk in a manner that makes it easier to parse and process for the users.

        varunthacker Varun Thacker added a comment -

        "otherwise how else would you remove a replica which has been decommissioned."

        Fair point. One caveat though: for people who aren't using legacyCloud=true (default), if the node ever comes back up the collection will be created, which will be puzzling to the user.
        But maybe we should stop optimizing for this as we want to move to "zk as truth" sooner rather than later.

        mewmewball Jessica Cheng Mallet added a comment -

        We do want to delete the replica from cluster state even if the node doesn't appear under live_nodes, but it shouldn't send the core admin UNLOAD to the node unless it appears under live_nodes. In the situation where the node is down and we want to remove it from the cluster state, we would rely on the deleteCoreNode call to have it removed: https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/OverseerCollectionMessageHandler.java#L634

        varunthacker Varun Thacker added a comment -

        I started to revisit this issue and took a quick crack at it.

        The approach I've taken is: before sending the core UNLOAD command, verify that the core is on a live node. If the node isn't part of live_nodes the command would fail anyway, so this check won't break back-compat. The cluster state info still gets deleted. It will protect us from the second scenario mentioned on the ticket:

        Another situation where I saw this can be a problem - A second solr cluster cloned from the first but the script didn't correctly change the hostnames in the state.json file. When a delete command was issued against the second cluster Solr deleted the replica from the first cluster. In the above case the script was buggy obviously but if we verify against live_nodes then Solr wouldn't have gone ahead and deleted replicas not belonging to its cluster.

        Just wanted to get some feedback before writing tests etc.
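
        Roughly, the shape of the change is something like this sketch (a simplified illustration, not the actual patch; sendCoreUnload and removeReplicaFromClusterState are placeholders for the real Overseer/ShardHandler plumbing):

        // Sketch only: always remove the replica from the cluster state, but only
        // attempt the core admin UNLOAD when the replica's node is actually live.
        import java.util.Set;
        import org.apache.solr.common.cloud.Replica;

        class DeleteReplicaSketch {
          void deleteReplica(Set<String> liveNodes, Replica replica) {
            if (liveNodes.contains(replica.getNodeName())) {
              sendCoreUnload(replica);               // node is up: ask it to unload the core
            }
            removeReplicaFromClusterState(replica);  // happens regardless of liveness
          }

          void sendCoreUnload(Replica replica) { /* placeholder for the core admin UNLOAD request */ }
          void removeReplicaFromClusterState(Replica replica) { /* placeholder for the state.json update */ }
        }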

        shalinmangar Shalin Shekhar Mangar added a comment -

        +1 to the patch. To summarise, we always delete the replica from the cluster state. But we attempt to send the core admin delete request only if that node is actually up. No new parameters are introduced.

        varunthacker Varun Thacker added a comment -

        Updated patch which compiles against master.

        All the tests and precommit pass. I'll commit this now.

        jira-bot ASF subversion and git services added a comment -

        Commit 1d9be84cb67ed5e57bcd60ae483f45d3abd09bd5 in lucene-solr's branch refs/heads/master from Varun Thacker
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1d9be84 ]

        SOLR-9092: In the deletereplica commandand add a live check before calling delete core

        jira-bot ASF subversion and git services added a comment -

        Commit 45f9b6b1ffcae19bd43a523c0d8e62af5b71fafc in lucene-solr's branch refs/heads/branch_6x from Varun Thacker
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=45f9b6b ]

        SOLR-9092: In the deletereplica commandand add a live check before calling delete core

        varunthacker Varun Thacker added a comment -

        Thanks Jessica for bringing it up!

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.


          People

          • Assignee: varunthacker Varun Thacker
          • Reporter: varunthacker Varun Thacker
          • Votes: 0
          • Watchers: 6
