Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9527

Solr RESTORE api doesn't distribute the replicas uniformly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1
    • Fix Version/s: 6.6, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None
    1. SOLR-9527.patch
      18 kB
      Varun Thacker
    2. SOLR-9527.patch
      21 kB
      Hrishikesh Gadre
    3. SOLR-9527.patch
      10 kB
      Rohit
    4. SOLR-9527.patch
      9 kB
      Stephen Lewis
    5. SOLR-9527.patch
      7 kB
      Hrishikesh Gadre
    6. SOLR-9527.patch
      7 kB
      Stephen Lewis
    7. Solr 9527.pdf
      242 kB
      Rohit

      Issue Links

        Activity

        Hide
        steviebee Stephen Lewis added a comment -

        This patch accomplishes two things in the branch branch_6_2.

        (a) It fixes the issue of replica distribution for the leaders
        (b) Simultaneously, supports the CreateNodeSet parameter for collection restore.

        Show
        steviebee Stephen Lewis added a comment - This patch accomplishes two things in the branch branch_6_2. (a) It fixes the issue of replica distribution for the leaders (b) Simultaneously, supports the CreateNodeSet parameter for collection restore.
        Hide
        hgadre Hrishikesh Gadre added a comment -

        Stephen Lewis Thanks for the patch I updated your patch with a unit test which reproduces the problem (without your patch) and passes with your patch.

        The failure can be reproduced with this command,

        ant test -Dtestcase=TestLocalFSCloudBackupRestore -Dtests.method=test -Dtests.seed=F465CD94BAC9633C -Dtests.slow=true -Dtests.locale=lt -Dtests.timezone=America/Inuvik -Dtests.asserts=true -Dtests.file.encoding=UTF-8

        Varun Thacker Could you please take a look?

        Show
        hgadre Hrishikesh Gadre added a comment - Stephen Lewis Thanks for the patch I updated your patch with a unit test which reproduces the problem (without your patch) and passes with your patch. The failure can be reproduced with this command, ant test -Dtestcase=TestLocalFSCloudBackupRestore -Dtests.method=test -Dtests.seed=F465CD94BAC9633C -Dtests.slow=true -Dtests.locale=lt -Dtests.timezone=America/Inuvik -Dtests.asserts=true -Dtests.file.encoding=UTF-8 Varun Thacker Could you please take a look?
        Hide
        steviebee Stephen Lewis added a comment -

        This update removes some extra logging I'd meant to rip and passes through the createNodeSet param from the request to the function call.

        Show
        steviebee Stephen Lewis added a comment - This update removes some extra logging I'd meant to rip and passes through the createNodeSet param from the request to the function call.
        Hide
        hgadre Hrishikesh Gadre added a comment -

        Varun Thacker David Smiley Can you review this patch please?

        Show
        hgadre Hrishikesh Gadre added a comment - Varun Thacker David Smiley Can you review this patch please?
        Hide
        dsmiley David Smiley added a comment -

        FYI When I originally worked on the backup capability, I didn't add support for createNodeSet because I wasn't yet sure how to resolve the possibility of duplicating logic with regular collection creation. I gave a cursory look at the patch here; it's interesting that the patch is actually kinda small. I should look at it more when I have time. I noticed this doesn't seem to be tested; right? I saw a test modification but it doesn't seem to test the distribution.

        Show
        dsmiley David Smiley added a comment - FYI When I originally worked on the backup capability, I didn't add support for createNodeSet because I wasn't yet sure how to resolve the possibility of duplicating logic with regular collection creation. I gave a cursory look at the patch here; it's interesting that the patch is actually kinda small. I should look at it more when I have time. I noticed this doesn't seem to be tested; right? I saw a test modification but it doesn't seem to test the distribution.
        Hide
        hgadre Hrishikesh Gadre added a comment -

        Hi David Smiley

        I noticed this doesn't seem to be tested; right? I saw a test modification but it doesn't seem to test the distribution.

        I added assertion in the test to verify that the number of replicas per host are in-line with the expected value of max_shards_per_node param. I think this is sufficient to test the distribution . Although I have not added any test for "createNodeSet" functionality. Let me add that test as well.

        Show
        hgadre Hrishikesh Gadre added a comment - Hi David Smiley I noticed this doesn't seem to be tested; right? I saw a test modification but it doesn't seem to test the distribution. I added assertion in the test to verify that the number of replicas per host are in-line with the expected value of max_shards_per_node param. I think this is sufficient to test the distribution . Although I have not added any test for "createNodeSet" functionality. Let me add that test as well.
        Hide
        DewaldV Dewald Viljoen added a comment -

        I've run smack dab straight into this issue recently. I was wondering what the progress is on this patch?

        I'm currently running Solr 6.4.1 and would really like to take advantage of the Collections Backup/Restore functionality in combination with HDFS. All works well until I restore the collection and all my shards end up on one of my SolrCloud nodes. I can specify a replicationFactor of 2 and then though some other API calls make the replica's the leaders and rebalance everything but it's a bit of a mess.

        I'm happy to lend my efforts to get this issue resolved.

        Show
        DewaldV Dewald Viljoen added a comment - I've run smack dab straight into this issue recently. I was wondering what the progress is on this patch? I'm currently running Solr 6.4.1 and would really like to take advantage of the Collections Backup/Restore functionality in combination with HDFS. All works well until I restore the collection and all my shards end up on one of my SolrCloud nodes. I can specify a replicationFactor of 2 and then though some other API calls make the replica's the leaders and rebalance everything but it's a bit of a mess. I'm happy to lend my efforts to get this issue resolved.
        Hide
        rohitcse Rohit added a comment -

        Trying to refine the replica placement to be more uniform across multiple nodes if we have sufficient amount of live_nodes in Solr available,
        1. When we have live_nodes in excess of the replication factor which existed during the time to backup we can do away with creating multiple replicas for same shard on the same live_node to ensure better distribution.

        2. User should be given warning or, option to go ahead with the Restore API if the number of live_nodes is less that what the collection was originally created.

        Details of the two issue in the attached PDF and initial patch attached with the ticket.

        Show
        rohitcse Rohit added a comment - Trying to refine the replica placement to be more uniform across multiple nodes if we have sufficient amount of live_nodes in Solr available, 1. When we have live_nodes in excess of the replication factor which existed during the time to backup we can do away with creating multiple replicas for same shard on the same live_node to ensure better distribution. 2. User should be given warning or, option to go ahead with the Restore API if the number of live_nodes is less that what the collection was originally created. Details of the two issue in the attached PDF and initial patch attached with the ticket.
        Hide
        jove4015 Stephen Weiss added a comment -

        Any chance we can get this patch merged in? We really need this issue fixed, and it's made upgrading a serious pain. Every new version we have to recompile our own Solr for this one little detail. If you need it to be tested... well... we've used it in production for months and it's great. What's actually in master is no good. Why didn't that need to be tested as much?

        Show
        jove4015 Stephen Weiss added a comment - Any chance we can get this patch merged in? We really need this issue fixed, and it's made upgrading a serious pain. Every new version we have to recompile our own Solr for this one little detail. If you need it to be tested... well... we've used it in production for months and it's great. What's actually in master is no good. Why didn't that need to be tested as much?
        Hide
        dsmiley David Smiley added a comment -

        +1 to the patch

        Show
        dsmiley David Smiley added a comment - +1 to the patch
        Hide
        dsmiley David Smiley added a comment -

        +1 to the patch

        Show
        dsmiley David Smiley added a comment - +1 to the patch
        Hide
        varunthacker Varun Thacker added a comment -

        Thanks David for the review. I'll go through it and commit in on Monday

        Show
        varunthacker Varun Thacker added a comment - Thanks David for the review. I'll go through it and commit in on Monday
        Hide
        varunthacker Varun Thacker added a comment -

        I reviewed the latest patch ( 6th April ) and here are a few things I believe we should fix:

        • Should we rename DocCollection#getReplicaCount to DocCollection#getNodeCount ?
        • RestoreCmd has an unused import. We should generally remove all unused imports to make "ant precommit" happy
        String createNodeArg = message.getStr(CREATE_NODE_SET, null);
        if (createNodeArg == CREATE_NODE_SET_EMPTY) {
          throw new SolrException(
              SolrException.ErrorCode.BAD_REQUEST,
              "Cannot restore with a CREATE_NODE_SET of CREATE_NODE_SET_EMPTY."
          );
        }
        
        • Can we move this check from RestoreCmd to the CollectionsHandler#RESTORE_OP?
        • In line 206 of RestoreCmd , shouldn't

        int repFactor = backupCollectionState.getReplicaCount();

        be

        int repFactor = message.getInt(REPLICATION_FACTOR, backupCollectionState.getReplicaCount());

        • We should also allow users to pass createNodeSet.shuffle
        Show
        varunthacker Varun Thacker added a comment - I reviewed the latest patch ( 6th April ) and here are a few things I believe we should fix: Should we rename DocCollection#getReplicaCount to DocCollection#getNodeCount ? RestoreCmd has an unused import. We should generally remove all unused imports to make "ant precommit" happy String createNodeArg = message.getStr(CREATE_NODE_SET, null ); if (createNodeArg == CREATE_NODE_SET_EMPTY) { throw new SolrException( SolrException.ErrorCode.BAD_REQUEST, "Cannot restore with a CREATE_NODE_SET of CREATE_NODE_SET_EMPTY." ); } Can we move this check from RestoreCmd to the CollectionsHandler#RESTORE_OP? In line 206 of RestoreCmd , shouldn't int repFactor = backupCollectionState.getReplicaCount(); be int repFactor = message.getInt(REPLICATION_FACTOR, backupCollectionState.getReplicaCount()); We should also allow users to pass createNodeSet.shuffle
        Hide
        hgadre Hrishikesh Gadre added a comment -

        Varun Thacker I have addressed all the review comments. In addition I also added a unit test to verify createNodeSet functionality. I made few changes along the way,

        (a) Added support for configuring createNodeSet.shuffle parameter for restore command.
        (b) The logic in RestoreCmd.java comparing the current num_live_nodes against the num_nodes for backup collection was not quite correct since it was not considering the createNodeSet as well as maxShardsPerNode parameters. So I fixed that logic.
        (c) Fixed UpdateRequestProcessorFactoryTest (unrelated to these changes). It was failing for following combination of parameters,

        ant test  -Dtestcase=UpdateRequestProcessorFactoryTest -Dtests.method=testRequestTimeUrp -Dtests.seed=91817E95F0748507 -Dtests.slow=true -Dtests.locale=be-BY -Dtests.timezone=America/Indiana/Vevay -Dtests.asserts=true -Dtests.file.encoding=UTF-8
        

        Can you take a look and let me know your feedback?

        Show
        hgadre Hrishikesh Gadre added a comment - Varun Thacker I have addressed all the review comments. In addition I also added a unit test to verify createNodeSet functionality. I made few changes along the way, (a) Added support for configuring createNodeSet.shuffle parameter for restore command. (b) The logic in RestoreCmd.java comparing the current num_live_nodes against the num_nodes for backup collection was not quite correct since it was not considering the createNodeSet as well as maxShardsPerNode parameters. So I fixed that logic. (c) Fixed UpdateRequestProcessorFactoryTest (unrelated to these changes). It was failing for following combination of parameters, ant test -Dtestcase=UpdateRequestProcessorFactoryTest -Dtests.method=testRequestTimeUrp -Dtests.seed=91817E95F0748507 -Dtests.slow=true -Dtests.locale=be-BY -Dtests.timezone=America/Indiana/Vevay -Dtests.asserts=true -Dtests.file.encoding=UTF-8 Can you take a look and let me know your feedback?
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        testRequestTimeUrp has been fixed. Just pull

        Show
        mkhludnev Mikhail Khludnev added a comment - testRequestTimeUrp has been fixed. Just pull
        Hide
        varunthacker Varun Thacker added a comment -

        Hi Hrishikesh,

        Starting to look at it right now.

        Show
        varunthacker Varun Thacker added a comment - Hi Hrishikesh, Starting to look at it right now.
        Hide
        varunthacker Varun Thacker added a comment -

        Hi Hrishikesh,

        Patch looks good. Attaching a new patch without the changes to UpdateRequestProcessorFactoryTest as that's already been committed.

        I'm going to run all the tests and commit this soon.

        Show
        varunthacker Varun Thacker added a comment - Hi Hrishikesh, Patch looks good. Attaching a new patch without the changes to UpdateRequestProcessorFactoryTest as that's already been committed. I'm going to run all the tests and commit this soon.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit b1efd37ba7b57cc01b3e2ac740b4f1d13fb86cd2 in lucene-solr's branch refs/heads/master from Varun Thacker
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b1efd37 ]

        SOLR-9527: Improve distribution of replicas when restoring a collection

        Show
        jira-bot ASF subversion and git services added a comment - Commit b1efd37ba7b57cc01b3e2ac740b4f1d13fb86cd2 in lucene-solr's branch refs/heads/master from Varun Thacker [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b1efd37 ] SOLR-9527 : Improve distribution of replicas when restoring a collection
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1adeebdd92a96233836f83f6144da9eead65a2cf in lucene-solr's branch refs/heads/branch_6x from Varun Thacker
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1adeebd ]

        SOLR-9527: Improve distribution of replicas when restoring a collection

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1adeebdd92a96233836f83f6144da9eead65a2cf in lucene-solr's branch refs/heads/branch_6x from Varun Thacker [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1adeebd ] SOLR-9527 : Improve distribution of replicas when restoring a collection
        Hide
        varunthacker Varun Thacker added a comment -

        Thanks all for contributing!

        Show
        varunthacker Varun Thacker added a comment - Thanks all for contributing!
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 201e2125f4db3b4fd1ecfe5ddef5471f305fc1f0 in lucene-solr's branch refs/heads/branch_6_6 from Varun Thacker
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=201e212 ]

        SOLR-9527: Improve distribution of replicas when restoring a collection

        Show
        jira-bot ASF subversion and git services added a comment - Commit 201e2125f4db3b4fd1ecfe5ddef5471f305fc1f0 in lucene-solr's branch refs/heads/branch_6_6 from Varun Thacker [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=201e212 ] SOLR-9527 : Improve distribution of replicas when restoring a collection

          People

          • Assignee:
            varunthacker Varun Thacker
            Reporter:
            hgadre Hrishikesh Gadre
          • Votes:
            9 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development