Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1
    • Component/s: SolrCloud
    • Labels: None

      Description

      We should have an easy way to do backups and restores in SolrCloud. The ReplicationHandler supports a backup command which can create snapshots of the index, but that alone is not enough.

      The command should be able to backup:

      1. Snapshots of all indexes, or just the indexes from the shard leaders
      2. Config set
      3. Cluster state
      4. Cluster properties
      5. Aliases
      6. Overseer work queue?

      A restore should be able to completely restore the cloud i.e. no manual steps required other than bringing nodes back up or setting up a new cloud cluster.

      SOLR-5340 will be a part of this issue.

      1. SOLR-5750.patch
        35 kB
        Varun Thacker
      2. SOLR-5750.patch
        38 kB
        Varun Thacker
      3. SOLR-5750.patch
        35 kB
        Varun Thacker
      4. SOLR-5750.patch
        47 kB
        Varun Thacker
      5. SOLR-5750.patch
        40 kB
        Varun Thacker
      6. SOLR-5750.patch
        48 kB
        David Smiley
      7. SOLR-5750.patch
        67 kB
        David Smiley

        Issue Links

          Activity

          hgadre Hrishikesh Gadre added a comment -

          Tim Owen David Smiley I think this is not yet implemented (due to some unit test failure?).

          https://github.com/apache/lucene-solr/commit/70bcd562f98ede21dfc93a1ba002c61fac888b29#diff-e864a6be5b98b5340273c1db4f4677a6R107

          I am not sure why this problem exists just for the restore operation (and not for create).

          TimOwen Tim Owen added a comment -

          David Smiley you mentioned in the mailing list back in March that you'd fixed the situation where restored collections are created using the old stateFormat=1 but it still seems to be doing that ... did that fix not make it into this ticket before merging? We've been trying out the backup/restore and noticed it's putting the collection's state into the global clusterstate.json instead of where it should be.

          hgadre Hrishikesh Gadre added a comment -

          Varun Thacker Thanks for the documentation. I have a few comments:

          Backup and Restore Solr collections and their associated configurations to a shared filesystem - for example HDFS or a Network File System

          Currently we don't support integration with HDFS directly. This is being done as part of SOLR-9055.

          location/string/no/The location on the shared drive for the backup command to write to. Alternately it can be set as a cluster property (hyperlink to CLUSTERPROP and document it as a supported property)

          I don't think we have added support for configuring location via CLUSTERPROP API. We have discussed an alternate approach of configuring file-systems (or Backup repositories) via solr.xml. So I think it would make more sense to configure "default" location as part of that change (instead of enabling it via CLUSTERPROP API). This is being done as part of SOLR-9055.

          varunthacker Varun Thacker added a comment -

          Some docs on the feature. Any thoughts on this? Otherwise I'll add it over to the ref guide.

          SolrCloud Backup and Restore of Collections

          Backup and Restore Solr collections and their associated configurations to a shared filesystem - for example HDFS or a Network File System

          Backup command:

          /admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/path/to/my/shared/drive

          The backup command will backup Solr indexes and configurations for a specified collection.
          The backup command takes one copy from each shard for the indexes. For configurations, it backs up the configSet associated with the collection along with other metadata.

          key/Type/Required/Default/Description

          name/string/yes/<empty>/The backup name
          collection/string/yes/<empty>/The name of the collection that needs to be backed up
          location/string/no/<empty>/The location on the shared drive for the backup command to write to. Alternatively it can be set as a cluster property (hyperlink to CLUSTERPROP and document it as a supported property)
          async (copy over from existing docs)

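          For illustration, a hedged example of the backup call described above, assuming a node at localhost:8983 and a shared mount at /mnt/solr_backups (both placeholders):

            curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/mnt/solr_backups'
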
          Restore command:

          /admin/collections?action=RESTORE&name=myBackupName&location=/path/to/my/shared/drive&collection=myRestoredCollectionName

          Restores Solr indexes and associated configurations.
          The restore operation creates a collection with the name given in the collection parameter. You cannot restore into an existing collection; the collection should not be present at the time of the restore, as Solr will create it for you. The created collection will have the same number of shards and replicas as the original collection, preserving routing information etc. Optionally you can override some parameters, documented below. For the associated configSet: if a configSet with the same name exists in ZooKeeper then Solr will reuse it, else it will upload the backed-up configSet to ZooKeeper and use that for the restored collection.

          You can use the Collection ALIAS (hyperlink) feature to make sure clients don't need to change the endpoint to query or index against the restored collection.

          key/Type/Required/Default/Description

          name/string/yes/<empty>/The backup name that needs to be restored
          collection/string/yes/<empty>/The collection where the indexes will be restored to.
          location/string/no/<empty>/The location on the shared drive for the restore command to read from. Alternatively it can be set as a cluster property (hyperlink to CLUSTERPROP and document it as a supported property)

          (copy over from existing docs)
          async
          collection.configName
          replicationFactor
          maxShardsPerNode
          autoAddReplicas
          property.Param
          stateFormat

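          And a hedged example of the restore call described above, again with placeholder names and a placeholder shared mount; it assumes the backup named myBackupName already exists at that location and that myRestoredCollectionName does not yet exist:

            curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=myBackupName&collection=myRestoredCollectionName&location=/mnt/solr_backups'
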
          jira-bot ASF subversion and git services added a comment -

          Commit 83640278860d58ac01a7233ea4a96a49a3b6843d in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8364027 ]

          SOLR-5750: Fix test to specify the collection on add
          (cherry picked from commit 18d933e)

          jira-bot ASF subversion and git services added a comment -

          Commit 18d933ee65320ab1cae92a79d6635996fee9e818 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=18d933e ]

          SOLR-5750: Fix test to specify the collection on add

          jira-bot ASF subversion and git services added a comment -

          Commit 0a20dd47d1abbf0036896fac03dc7d801ebcd5bd in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0a20dd4 ]

          SOLR-5750: Fix test to specify the collection on commit

          jira-bot ASF subversion and git services added a comment -

          Commit 13832b4f857bc6f726a4764aa446a487b847bfee in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=13832b4 ]

          SOLR-5750: Fix test to specify the collection on commit
          (cherry picked from commit 0a20dd4)

          dsmiley David Smiley added a comment -

          Thanks for reporting; I'll dig.

          shalinmangar Shalin Shekhar Mangar added a comment -

          I am seeing a reproducible failure in TestCloudBackupRestore

            2> NOTE: reproduce with: ant test  -Dtestcase=TestCloudBackupRestore -Dtests.method=test -Dtests.seed=4AD582D4894F28CF -Dtests.slow=true -Dtests.locale=fr-BE -Dtests.timezone=Europe/Simferopol -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
          [19:07:22.801] ERROR   28.2s J0 | TestCloudBackupRestore.test <<<
             > Throwable #1: org.apache.solr.client.solrj.SolrServerException: No collection param specified on request and no default collection has been set.
             >    at __randomizedtesting.SeedInfo.seed([4AD582D4894F28CF:C281BD0E27B34537]:0)
             >    at org.apache.solr.client.solrj.impl.CloudSolrClient.directUpdate(CloudSolrClient.java:590)
             >    at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1073)
             >    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:962)
             >    at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:898)
             >    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
             >    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484)
             >    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
             >    at org.apache.solr.cloud.TestCloudBackupRestore.test(TestCloudBackupRestore.java:105)
          
          jira-bot ASF subversion and git services added a comment -

          Commit dac044c94a33ebd655c1d5f5c628c83c75bf8697 in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dac044c ]

          SOLR-5750: Add /admin/collections?action=BACKUP and RESTORE
          (cherry picked from commit 70bcd56)

          jira-bot ASF subversion and git services added a comment -

          Commit 70bcd562f98ede21dfc93a1ba002c61fac888b29 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=70bcd56 ]

          SOLR-5750: Add /admin/collections?action=BACKUP and RESTORE

          hgadre Hrishikesh Gadre added a comment -

          David Smiley sure. I have filed SOLR-9055 and uploaded the patch. Please take a look and let me know your feedback.

          dsmiley David Smiley added a comment -

          I made some small updates to the branch like I said I would. I elected not to switch SnapShooter to use mkdir instead of mkdirs for back-compat concerns, so I have the backup caller make this check instead.

          Hrishikesh Gadre I took a look. As this issue finally nears the end, I don't feel good about rushing into the multiple abstractions introduced in the P/R (not to say I don't like them). Please file a new issue and add me as a Watcher. I have some comments but I'll wait to put them there.

          hgadre Hrishikesh Gadre added a comment -

          David Smiley

          Please take a look at the following pull request: https://github.com/apache/lucene-solr/pull/36

          Unfortunately I wasn't able to apply your latest patch, hence I had to send a PR. I am still working on refactoring the "restore" API, but in the meantime any feedback would be great.

          hgadre Hrishikesh Gadre added a comment -

          David Smiley I am almost done with refactoring the patch. I will submit the patch in the next couple of hours.

          >> to avoid risk of confusion with "snapshot" possibly being a named commit (SOLR-9038), in the log statements and backup.properties I'll call it a backupName, not snapshotName.

          I have already fixed this in my patch.

          dsmiley David Smiley added a comment -

          I'll give some more time for review, maybe until Monday, unless there are further changes to be done from any review/feedback. Some things I think I want to change (which I will do today):

          • simply remove the Overseer.processMessage case statements for RESTORE & BACKUP as they aren't used. This resolves a nocommit.
          • to avoid risk of confusion with "snapshot" possibly being a named commit (SOLR-9038), in the log statements and backup.properties I'll call it a backupName, not snapshotName.

          Tentative CHANGES.txt is as follows:

          * SOLR-5750: Add /admin/collections?action=BACKUP and RESTORE assuming access to a shared file system.
            (Varun Thacker, David Smiley)
          

          About the "shared file system" requirement, it occurred to me this isn't really tested; it'd be nice it if failed fast if not all shards can see the backup location's ZK backup export. I'm working on ensuring the backup fails if all slices don't see the backup directory that should be created at the start of the backup process. This seems a small matter of ensuring that SnapShooter.validateCreateSnapshot call mkdir (which will fail if the parent dir isn't there) and not mkdirs but I'm testing to ensure the replication handler's use of SnapShooter is fine with this; I think it is.

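          As a rough shell analogy (not the actual Java code) of why mkdir rather than mkdirs gives the fail-fast behavior: the non-recursive form refuses to create a directory whose parent is missing, while the recursive form would quietly create the whole path locally even when the shared mount isn't visible to a node. The path below is a placeholder for the shared backup mount.

            mkdir /mnt/solr_backups/myBackupName      # fails fast if /mnt/solr_backups is not mounted on this node
            mkdir -p /mnt/solr_backups/myBackupName   # would hide the problem by creating the path locally
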
          dsmiley David Smiley added a comment -

          I commented on SOLR-9038. Should that issue come to pass, you're right that we need to differentiate a "backup" from a "snapshot" in our terminology, assuming we even want these names. As it is, in some of the code I used "snapshotName" as a nod to the existing snapshooter code using such terminology.

          hgadre Hrishikesh Gadre added a comment -

          Ok please ignore the last point regarding the API naming. I see that we are already using BACKUP/RESTORE as API names...

          hgadre Hrishikesh Gadre added a comment -

          David Smiley Shalin Shekhar Mangar I am refactoring the backup/restore implementation in the Overseer, primarily to make it extensible. Although we could do that as part of a separate JIRA, it would be nice if we can get it done as part of this patch itself. I should be able to complete this in the next day or two.

          Also I filed SOLR-9038 to provide a lightweight alternative to the "full" backup solution implemented by this JIRA. Both of these JIRAs need the following functionality:

          • Backing up collection metadata
          • Restore the collection (both index data and collection metadata).

          Since the APIs defined in this patch use the name "snapshot", I wonder how we should reconcile these two features? I can think of the following options:

          • provide a way to override the behavior (e.g. as an API parameter OR some other global configuration)
          • rename the APIs here to indicate that it's a "full" backup (e.g. CREATEBACKUP/DELETEBACKUP). This will be consistent with the current core-level operations (BACKUP/RESTORE).

          Please let me know your thoughts on this...

          dsmiley David Smiley added a comment -

          I pushed changes to the branch and attached a patch. The changes include:

          • test asyncId
          • refactored some Snapshooter.create* logic which included fixing a bug in which a core backup wasn't reserving the IndexCommit
          • some small miscellaneous stuff to resolve nocommits

          I didn't add to the test that core properties made their way to the restored core, as I'm not sure exactly how to do that, but I manually verified they got there.

          There are just 2 nocommits to resolve:

          1. Are we sure we want the parameter "name" for backup & restore to be the backup/snapshot name and not the collection name, and furthermore are we sure we want the collection name to be the parameter "collection"? I have no strong convictions but I see for other collection oriented commands we use "name" as the name of the collection. The requests here extend CollectionSpecificAdminRequest which have a getParams that put the collection name into "name" but we override it... which gave me some pause to question if these parameter names are best. Perhaps the backup name parameter could be "snapshot" or "snapshotName"? (note that "snapshotName" shows up in some backup/snapshot related properties files).
          2. Varun Thacker I don't understand why Overseer.processMessage has case statements for RESTORE & BACKUP that do nothing. At a minimum there should be comments there explaining why; it sure looks buggy the way it is. I set breakpoints there and the test never hit it. I don't understand the differentiation between Overseer.processMessage and OverseerCollectionMessageHandler.processMessage which seem remarkably similar and redundant.

          Shalin Shekhar Mangar if you have time I would love a code review.

          Otherwise, I think it's committable. Tests pass. If I don't get a code review or further comments for that matter, I'll commit in a couple days.

          Propagating createNodeSet, snitch, and rule options can be a follow-on issue. Using HDFS as a backup location can be another issue too.

          dsmiley David Smiley added a comment -

          I pushed some new commits to the branch: https://github.com/apache/lucene-solr/commits/solr-5750
          Some important bits:

          • Restored conf name is configurable; won't overwrite existing. Defaults to original.
          • replicationFactor and some other settings are customizable on restoration.
          • Shard/slice info is restored instead of reconstituted. Thus shard hash ranges (from e.g. shard split) are restored.

          I still want to:

          • test asyncId
          • test property.customKey=customVal passes through
          • resolve nocommits on how to pass config name & replicationFactor on the SolrJ side

          Beyond that there are just some nocommits of a cleanup/documentation nature. A code review would be much appreciated. Should we just use GitHub branches (do I need my own fork?) for that, or ReviewBoard?

          Possibly in this issue but more likely in another, it would be great if collection restoration could honor CREATE_NODE_SET, snitches, and rules. I need to see how much extra work that'd be but I don't want to delay this too much.

          hgadre Hrishikesh Gadre added a comment -

          David Smiley Yes, collection "aliases" is a nice option to satisfy the above requirement. I have made a few comments in SOLR-8950, mostly from a usability perspective. Please share your thoughts there (just to keep all relevant discussion in one thread).

          dsmiley David Smiley added a comment - edited

          (I'm working moving this forward still; expect another update in a day or so)

          Hrishikesh Gadre the current functionality in the patch allows one to specify a restored collection name. In the event one is trying to restore without downtime, I think this can and should be accomplished with collection aliases. So if you intend on needing that, you would create the collection at the outset with a sequence number 0. When restoring, use the next sequence number, then switch the alias to the new collection and delete the old one (sketched below). I'll grant you that it's less ideal in that one has to plan for this in advance and do this little dance.

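          For illustration, a hedged curl sketch of that alias dance against the Collections API (the alias, collection, and backup names and the shared path are placeholders, not values from the patch):

            # 'prod' initially points at the original collection, mycoll_0
            curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=prod&collections=mycoll_0'
            # restore the backup into the next sequence number
            curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=myBackupName&collection=mycoll_1&location=/mnt/solr_backups'
            # switch the alias to the restored collection, then drop the old one
            curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=prod&collections=mycoll_1'
            curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=mycoll_0'
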
          hgadre Hrishikesh Gadre added a comment -

          The current patch assumes that we are restoring on a cluster which does not contain the specified "collection". In case a user wants to restore a snapshot on the same cluster, we would need to replace the index state (after verifying that the ZK metadata state is valid). In this scenario, we would need to "disable" the Solr collection so that we can safely restore the snapshot. Please refer to SOLR-8950 for details.

          hgadre Hrishikesh Gadre added a comment - edited

          Varun Thacker Thanks for the info. That makes sense.

          I am currently investigating Solr backup/restore changes for HDFS. I found this old email thread which discusses different approaches:
          http://qnalist.com/questions/5433155/proper-way-to-backup-solr

          I think we settled on "copying" behavior for portability reasons. I think we should make this a configurable behavior so that we can make the best use of the capabilities of the underlying file-system. Maybe we introduce a higher-level API? I think the API should consider source and destination file-systems to figure out the "correct" and "optimized" behavior...

          Any thoughts?

          CC Shawn Heisey Mark Miller Yonik Seeley

          dsmiley David Smiley added a comment -

          Agreed; I noted this as a nocommit in my last comment. "option of not restoring the config set; assume there's one there that is suitable". Or to be more precise, restore the configset but don't overwrite if it's already there.

          janhoy Jan Høydahl added a comment -

          Creates a core-less collection with the config set from the backup (it prefixes the configSet name with restore. to avoid collisions)

          What if we backup two collections using the same config "foo"? After restore of the first collection, we will have a config "restore.foo", but what happens during restore of the second collection? I have not looked at the patch code. Should it not be a goal that a backup + nuke + restore leaves you with a state as close to the original as possible, i.e. with one config called "foo", used by both collections? Perhaps a restore option shareConfig=true which creates the config if missing, otherwise reuses what is there?

          varunthacker Varun Thacker added a comment -

          Hi Hrishikesh,

          Is there a reason why we have introduced a core Admin API for Backup/restore instead of reusing the replication handler?

          When we issue a backup command with an async param, the core admin handler already has the necessary hooks to deal with async requests. Hence in my patch I added core admin APIs for backup/restore.

          hgadre Hrishikesh Gadre added a comment -

          Varun Thacker Is there a reason why we have introduced a core Admin API for Backup/restore instead of reusing the replication handler?

          dsmiley David Smiley added a comment -

          I made a branch from master at the time of Varun's last patch and applied it, then merged in the latest from master, then did some development, committed, and pushed it as "solr-5750". I also have an attached patch file. The tests now pass consistently for me. I know I fixed some bugs in this code. I also added some nocommit comments, some related to this issue and some related to a bug I'm seeing in Overseer (which I'll file a separate issue for).

          Some nocommits:

          • specify what the restored config set name is; default to that of the original
          • option of not restoring the config set; assume there's one there that is suitable
          • shard hash ranges aren't restored; this error could be disastrous
          • user defined collection properties aren't restored, nor are snitches and perhaps some other props. IMO it's better to copy everything over to the extent we can. The user is free to edit the backup properties to their needs before restoring.
          hgadre Hrishikesh Gadre added a comment -

          Varun Thacker Is it possible that when the backup command is sent, one of the shards (or specifically the shard leader) is in the "recovering" state? If yes, what happens with the current implementation?

          dsmiley David Smiley added a comment -

          Hello Varun. I've been kicking the tires on this. When I run it explicitly, it seems to work fine – no errors. I also tested changing my IP address between backup & restore, simply by switching my network connection, and I'm glad to see the restored collection with the correct IPs.

          Unfortunately, the test has never succeeded for me. Half the time I get an NPE from Overseer; the other half the time the test fails its asserts at the end, but before then I see connection failure problems that I have yet to look at more closely. The Overseer NPE is curious... clusterState.json is null, causing an NPE when a log message is printed around ~line 214, log.info("processMessage: ..... I started to debug that a bit but moved on. I plan to look further into these things but I wanted to report this progress.

          BTW I see no RestoreStatus in the patch.

          I was thinking maybe I'll throw up an updated patch in ReviewBoard, but I'm waiting on INFRA-11152

          varunthacker Varun Thacker added a comment -
          • Added SolrJ support for the Backup and Restore Collection Admin actions
          • 2 API calls - Backup and Restore. Both support async, and it is recommended to use async and poll to see whether the task completed (see the sketch below). There are no BackupStatus and RestoreStatus commands like there were in previous patches.

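          A hedged sketch of the async-plus-polling pattern (the node address, request id, and paths are placeholders, not values from the patch):

            # submit the backup asynchronously
            curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/mnt/solr_backups&async=backup-1001'
            # poll the async task until it reports completed (or failed)
            curl 'http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=backup-1001'
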
          Backup:
          Required Params - name and collection.
          "location" can be optionally set via the cluster prop api. If the query parameter does not have it we refer to the value set in the cluster prop api

          What it backs up in the location directory

          • Index data from the shard leaders
          • collection_state_backup.json ( the backed up collection state )
          • backup.properties ( meta-data information )
          • configSet

          Restore:
          Required Params - name and collection.
          "location" can be optionally set via the cluster prop api. If the query parameter does not have it we refer to the value set in the cluster prop api

          How it works

          • The restore collection name should not be present; Restore will create it for you. You can use a collection alias to point clients at it once it has been restored. We purposely don't allow restoring into an existing collection since rolling back in a distributed setup would be tricky. Maybe in the future, if we are confident, we can allow this.
          • Creates a core-less collection with the config set from the backup (it prefixes the configSet name with restore. to avoid collisions)
          • Marks the shards in "construction" state so that if someone is sending it documents they get buffered in the tlog. TODO don't do
          • Create one replica per shard and restore the data into it
          • Adds the necessary replicas to meet the same replication factor

          Another question is I wonder if any of these loops should be done in parallel or if they are issuing asynchronous requests so it isn't necessary. It would help to document the pertinent loops with this information, and possibly do some in parallel if they should be done so.

          Yes that makes sense. We need to add this

          I looked at the patch. On the restore side I noticed a loop of slices and then a loop of replicas starting with this comment: "//Copy data from backed up index to each replica". Shouldn't there be just one replica per shard to restore, and then later the replicationFactor will expand to the desired level?

          Yeah true. This patch has those changes.

          It's still a work in progress. The restore needs hardening.

          dsmiley David Smiley added a comment -

          What's left to do on this issue? Does it actually work? The feedback I've seen seems to be around improvements to make it better – i.e. using a shared filesystem resulting in no actual copying of files.

          I looked at the patch. On the restore side I noticed a loop of slices and then a loop of replicas starting with this comment: "//Copy data from backed up index to each replica". Shouldn't there be just one replica per shard to restore, and then later the replicationFactor will expand to the desired level?
          Another question is I wonder if any of these loops should be done in parallel or if they are issuing asynchronous requests so it isn't necessary. It would help to document the pertinent loops with this information, and possibly do some in parallel if they should be done so.

          markrmiller@gmail.com Mark Miller added a comment -

          Might be a bit hacky and certainly not necessary for a first impl, but it would be great if we could do hard snapshots if we detect a UNIX-like OS.

          gchanan Gregory Chanan added a comment -

          I think the best approach is to make sure "snapshotDirectory" is mandatory if DirectoryFactory.isShared() is false. If it is a shared drive we can have the path default to a directory relative to dataDir.

          sgtm, thanks.

          varunthacker Varun Thacker added a comment - edited

          I think the best approach is to make sure "snapshotDirectory" is mandatory if DirectoryFactory.isShared() is false. If it is a shared drive we can have the path default to a directory relative to dataDir.

          gchanan Gregory Chanan added a comment -

          When we do a restore, the current impl. restores into a fresh collection. So the backed up data can be on another node not accessible to the restoring collection shard.

          Hard times on a non-shared file system. In that case, sure, require it to be set for non-shared file systems (we can do it for all filesystems for now and change it in the future). Or, restore on the same node as the existing shard where you took the snapshot, to remove the shared mount requirement?

          varunthacker Varun Thacker added a comment -

          We can't default to some path relative to the shard's dataDir?

          When we do a restore, the current impl. restores into a fresh collection. So the backed up data can be on another node not accessible to the restoring collection shard.

          gchanan Gregory Chanan added a comment -

          I don't think it should default to anything and if not specified snapshots should fail. The reason being users not on a shared filesystem need to specify a shared mount for this to work.

          We can't default to some path relative to the shard's dataDir?

          varunthacker Varun Thacker added a comment -

          1) I know we already have "location" via https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups+of+SolrCores but it just seems needlessly risky / error prone. What if a user purposefully or accidentally overwrites important data? You are giving anyone making a snapshot call solr's permissions. Beyond that, making "location" a required param is not the greatest interface. Most of the time when I'm taking a snapshot I don't even care where it is, I expect the system to just do something sensible and let me interact with the API with some id (i.e. name). HDFS and HBase snapshots work in this way, for example. Why not just have a backup location specified in solr.xml with some sensible default?

          2) On the above point: "I expect the system to just do something sensible and let me interact with the API with some id (i.e. name)" – why do I pass in a location for RESTORE? Can't the system just remember that from the backup call?

          Nice idea. We could let users specify it within the solr.xml solrcloud tag.
          I don't think it should default to anything and if not specified snapshots should fail. The reason being users not on a shared filesystem need to specify a shared mount for this to work.
          And by specifying it within the solr.xml file, even the restore command wouldn't need the user to specify the location, which will solve your concern in the 2nd point:

          <solr>
            <solrcloud>
              <!-- The path specified here should be a shared mount accessible by all nodes
                   in the cluster for backup/restore to work on non-shared file systems. -->
              <str name="snapshotDirectory">${snapshotDirectory:}</str>
            </solrcloud>
          </solr>
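
          If a default like this were in place, a backup call could presumably omit the location parameter entirely, e.g. /admin/collections?action=backup&name=my_backup&collection=techproducts (hypothetical, assuming the snapshotDirectory above points at a shared mount), and restore could read the location back from the backup metadata.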
          
          gchanan Gregory Chanan added a comment - - edited

          Did a cursory look at this. Some questions/comments:

          1) I know we already have "location" via https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups+of+SolrCores but it just seems needlessly risky / error prone. What if a user purposefully or accidentally overwrites important data? You are giving anyone making a snapshot call solr's permissions. Beyond that, making "location" a required param is not the greatest interface. Most of the time when I'm taking a snapshot I don't even care where it is, I expect the system to just do something sensible and let me interact with the API with some id (i.e. name). HDFS and HBase snapshots work in this way, for example. Why not just have a backup location specified in solr.xml with some sensible default?
          2) On the above point: "I expect the system to just do something sensible and let me interact with the API with some id (i.e. name)" – why do I pass in a location for RESTORE? Can't the system just remember that from the backup call?
          3) There's no API for deleting a snapshot?
          4) There's no API for listing snapshots? (I don't think this needs to be in an initial version necessarily)
          5) From

          So the idea is the location that you give should be a shared file system so that all the replica backup along with the ZK information stay in one place. Then during restore the same location can be used.
          We can then support storing to other locations such as s3, hdfs etc as separate Jiras then

          I'm not sure the shard-at-a-time makes sense for shared file systems. For example, it's much more efficient to take an hdfs snapshot of the entire collection directory than of each individual shard. I haven't fully thought through how we support both, e.g. we could do something different based on the underlying storage of the collection (though that wouldn't let you backup a local FS collection to a shared FS), or allow a "snapshotType" parameter or something. I think we can just make whatever you have the default here, so I don't think we strictly need to do anything in this patch. Just pointing that out.
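
          To make the HDFS point concrete, a collection-level snapshot there is a single metadata operation rather than a per-shard index copy. A hedged sketch using the standard Hadoop FileSystem API (the /solr/techproducts path is a placeholder, and the directory must have been made snapshottable by an admin first):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class HdfsCollectionSnapshotSketch {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              // Prerequisite (done once by an admin): hdfs dfsadmin -allowSnapshot /solr/techproducts
              // Snapshotting the whole collection directory is then one cheap metadata call,
              // no matter how many shards live underneath it.
              fs.createSnapshot(new Path("/solr/techproducts"), "my_backup");
              fs.close();
            }
          }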

          varunthacker Varun Thacker added a comment -

          Patch which is updated to trunk.

          One change is the way restore works. Folding in Noble's suggestion, this is how a restore works (sketched as API calls below):

          • Creates a coreless collection for the indexes to be restored into
          • Marks all shards of it as CONSTRUCTION
          • Creates one replica for each shard and restores the indexes
          • Calls add replica for each shard to reach the replicationFactor

          It still needs iterating to make the command async.
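
          A rough sketch of the flow described above, expressed as the equivalent Collections API calls (host, backup location, shard names, and collection names are placeholders; per the comment the ADDREPLICA step happens inside the restore command itself, it is only spelled out here for illustration):

          import java.io.InputStream;
          import java.net.URL;
          import java.util.List;

          public class RestoreFlowSketch {
            public static void main(String[] args) throws Exception {
              String base = "http://localhost:8983/solr/admin/collections"; // placeholder node
              // RESTORE creates the new (initially coreless) collection, marks its shards as
              // CONSTRUCTION, and restores the backed-up index into one replica per shard.
              get(base + "?action=restore&name=my_backup&location=/my_location&collection=techproducts_restored");
              // The remaining replicas are then added shard by shard until replicationFactor is reached.
              for (String shard : List.of("shard1", "shard2")) {
                get(base + "?action=ADDREPLICA&collection=techproducts_restored&shard=" + shard);
              }
            }

            private static void get(String url) throws Exception {
              try (InputStream in = new URL(url).openStream()) {
                in.readAllBytes(); // response body ignored in this sketch
              }
            }
          }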

          varunthacker Varun Thacker added a comment -

          Hi Jan,

          Thanks for looking into this. Let me bring up the patch to date and post it again.

          gchanan Gregory Chanan added a comment -

          I'm planning to take a look at getting something like that up for shared file systems (hdfs). That's an easier problem in that the underlying storage mechanism already supports snapshots and taking a snapshot doesn't require going to every shard (assuming reasonable data dir settings).

          janhoy Jan Høydahl added a comment -

          Reviving this. What about getting a basic, even if imperfect, cloud backup command into the hands of users, then following up with improvements? I have not reviewed the patch, but it looks like it can add great value already as is.

          noble.paul Noble Paul added a comment -

          Two design suggestions:

          1) Move the operations to CollectionsHandler. When the process starts, add the backup name and the node that is processing the task to ZK.
          2) Do not restore all the replicas at once. Just create one replica of a shard first and then do ADDREPLICA till you have enough replicas.

          varunthacker Varun Thacker added a comment -

          Hi Jan,

          If you look at the patch most of the backup/restore work is being offloaded to the ReplicationHandler. So to make the underlying storage pluggable we just need to make the changes there. The only change required with respect to this patch is the backup of zk files.

          janhoy Jan Høydahl added a comment -

          See my comment on SOLR-7374 - we should have something pluggable here.

          varunthacker Varun Thacker added a comment -

          So the idea is that the location you give should be a shared file system, so that all the replica backups along with the ZK information stay in one place. Then during restore the same location can be used.

          We can then support storing to other locations such as S3, HDFS etc. as separate JIRAs.

          varunthacker Varun Thacker added a comment -

          Updated patch to trunk.

          grishick Greg Solovyev added a comment -

          Varun Thacker if I am understanding the code correctly, when a collection has multiple shards spread over multiple nodes, data from each shard will be backed up only on that shard's node, metadata from ZK will be saved on whichever node happens to handle the backup request. Am I missing something here?

          varunthacker Varun Thacker added a comment -

          Updated patch to trunk and improved the existing test case to also check in scenarios when an implicit router is used.

          varunthacker Varun Thacker added a comment -

          First pass at the feature.

          BACKUP:
          Required params - collection, name, location

          Example API:
          /admin/collections?action=backup&name=my_backup&location=/my_location&collection=techproducts

          It will create the following layout under the given location:

          /my_location
            /my_backup
              /shard1
              /shard2
              /zk_backup
                /configs/configName (the config which was being used for the backed-up collection)
                /collection_state.json (the cluster state for that collection, always stored as collection_state.json)
              /backup.properties (metadata about the backup)

          If you have set up any aliases or roles or any other special property, that will not be backed up. It might not be that useful to restore anyway, as the backup could be restored in some other cluster. We can add it later if it's required.

          BACKUPSTATUS:
          Required params - name

          Example API: /admin/collections?action=backupstatus&name=my_backup

          RESTORE:
          Required params - collection, name, location

          Example API: /admin/collections?action=restore&name=my_backup&location=/my_location&collection=techproducts_restored

          You can't restore into an existing collection. Provide a collection name where you want to restore the index into. The restore process will create a collection similar to the backed-up collection and restore the indexes.

          Restoring into the same collection would be simple to add, but in that case we should only restore the indexes.

          RESTORESTATUS:
          Required params - name

          Example API: /admin/collections?action=restorestatus&name=my_backup

          Would appreciate a review on this. I'll work on adding more tests.
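
          For an end-to-end picture of the API as described in this iteration of the patch (host and paths are placeholders; the backupstatus/restorestatus actions are the ones proposed here, not necessarily what ships):

          import java.io.InputStream;
          import java.net.URL;

          public class BackupRestoreUsageSketch {
            public static void main(String[] args) throws Exception {
              String base = "http://localhost:8983/solr/admin/collections"; // placeholder node
              // 1. Back up the 'techproducts' collection into /my_location/my_backup.
              call(base + "?action=backup&name=my_backup&location=/my_location&collection=techproducts");
              // 2. Check the backup status (a real client would poll until it completes).
              call(base + "?action=backupstatus&name=my_backup");
              // 3. Restore into a brand-new collection (restoring into an existing one is not supported here).
              call(base + "?action=restore&name=my_backup&location=/my_location&collection=techproducts_restored");
              // 4. Check the restore status.
              call(base + "?action=restorestatus&name=my_backup");
            }

            private static void call(String url) throws Exception {
              try (InputStream in = new URL(url).openStream()) {
                System.out.println(url + " -> " + in.readAllBytes().length + " bytes");
              }
            }
          }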

          dk Damien Kamerman added a comment -

          Only snapshot if the index version has changed.

          mbennett Mark Bennett added a comment -

          At least 2 clients that I know of have recently asked about this. Would be awesome!

          thetaphi Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          reparker Robert Parker added a comment -

          You should have the option of backing up/replicating a live searchable collection on SolrCloud A to a live searchable collection across a WAN on SolrCloud B, each with its own separate ZooKeeper ensemble. You should also be able to rename the collection on the fly, so that the live collection on SolrCloud A is called "collectionA" and its continuously updated, searchable replication copy is known as "collectionB". That would allow a single remote SolrCloud instance to be multi-homed as a replication target for multiple other Solr instances' collections, even if those collections happen to have the same name on each of their source instances. WAN compression/optimization would be helpful as well.

          shalinmangar Shalin Shekhar Mangar added a comment -

          • It should also be able to take a backup of a specific collection only
          • Option of compressing data which is backed up

          +1

          noble.paul Noble Paul added a comment -
          • It should also be able to take a backup of a specific collection only
          • Option of compressing data which is backed up

            People

            • Assignee: varunthacker Varun Thacker
            • Reporter: shalinmangar Shalin Shekhar Mangar
            • Votes: 15
            • Watchers: 44
