SOLR-9055: Make collection backup/restore extensible

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: None
    • Labels: None

      Description

      SOLR-5750 implemented a backup/restore API for Solr. This JIRA is to track code cleanup/refactoring. Specifically, the following improvements should be made:

      • Add Solr/Lucene version to check the compatibility between the backup version and the version of Solr on which it is being restored.
      • Add a backup implementation version to check the compatibility between the "restore" implementation and backup format.
      • Introduce a Strategy interface to define how the Solr index data is backed up (e.g. using file copy approach).
      • Introduce a Repository interface to define the file-system used to store the backup data (currently this works only with the local file system, but it can be extended). This should be enhanced to introduce support for "registering" repositories (e.g. HDFS, S3 etc.). A rough sketch of these two interfaces follows below.
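
      A rough, hypothetical sketch of the two interfaces described above (the names and signatures are illustrative assumptions, not the committed API):

      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;
      import java.net.URI;

      // Hypothetical extension point for *how* index data is backed up (e.g. a plain file-copy strategy).
      interface IndexBackupStrategy {
        void backup(String collectionName, URI backupLocation) throws IOException;
      }

      // Hypothetical extension point for *where* the backup data lives (local FS, HDFS, S3, ...).
      interface BackupRepository {
        boolean exists(URI path) throws IOException;
        OutputStream createOutput(URI path) throws IOException;
        InputStream openInput(URI path) throws IOException;
      }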
      Attachments

      1. SOLR-9055.patch
        80 kB
        Hrishikesh Gadre
      2. SOLR-9055.patch
        34 kB
        Hrishikesh Gadre
      3. SOLR-9055.patch
        56 kB
        Hrishikesh Gadre

        Issue Links

          Activity

          hgadre Hrishikesh Gadre added a comment -

          The repository interface defined as part of this patch could be used while defining APIs in SOLR-7374

          dsmiley David Smiley added a comment -

          Thanks for contributing, Hrishikesh Gadre. I suggest renaming this to: Make collection backup/restore extensible

          Does this API enable the possibility of a hard-link based copy (applicable for both backup & restore)? It doesn't seem so, but I'm unsure.

          I have a general question about HDFS; I have no real experience with it: I wonder if Java's NIO file abstractions could be used so we don't have to have separate code? If so it would be wonderful – simpler and less code to maintain. See https://github.com/damiencarol/jsr203-hadoop What do you think?

          Nitpick: In your editor, if it has this feature (IntelliJ does), configure it to strip trailing whitespace only on Modified Lines. IntelliJ: Editor/General/Other "Strip trailing spaces on Save".

          Before committing to this API, it would be good to have it implement something useful (HDFS or whatever), otherwise we very well may miss problems with the API – we probably will. I'm not saying this issue needs to implement HDFS, but at least the proposed patch might have an implementation specific part in some separate files that wouldn't be committed with this issue. I suppose this isn't strictly required but it would help.

          hgadre Hrishikesh Gadre added a comment -

          >>I have a general question about HDFS; I have no real experience with it: I wonder if Java's NIO file abstractions could be used so we don't have to have separate code? If so it would be wonderful – simpler and less code to maintain. See https://github.com/damiencarol/jsr203-hadoop What do you think?

          Although integrating HDFS and the Java NIO API sounds interesting, I would prefer it to be provided directly by the HDFS client library rather than a third-party library which may or may not be supported in the future. Also, since Solr provides an HDFS-backed Directory implementation, it probably makes sense to reuse it.

          However, if we want to keep things simple, we can choose not to provide separate APIs to configure "repositories". Instead we can just pick the same file-system used to store the indexed data. That means in the case of the local file-system, the backup will be stored on a shared file-system using the SimpleFSDirectory implementation, AND for HDFS we will use the HdfsDirectory impl. Does that make sense?

          hgadre Hrishikesh Gadre added a comment -

          >>Nitpick: In your editor, if it has this feature (IntelliJ does), configure it to strip trailing whitespace only on Modified Lines. IntelliJ: Editor/General/Other "Strip trailing spaces on Save".

          Sorry about that. Let me resubmit the patch without this noise.

          >>Does this API enable the possibility of a hard-link based copy (applicable for both backup & restore). It doesn't seem so but I'm unsure?

          The current "IndexBackupStrategy" API works at the Overseer level and not at the "core" level. Since "hard-link" based copy needs to be done at the "core" level, it doesn't handle this use-case.

          >>Before committing to this API, it would be good to have it implement something useful (HDFS or whatever), otherwise we very well may miss problems with the API – we probably will. I'm not saying this issue needs to implement HDFS, but at least the proposed patch might have an implementation specific part in some separate files that wouldn't be committed with this issue. I suppose this isn't strictly required but it would help.

          My primary motivation was just to make the code modular (instead of having one gigantic method incorporating all logic). But I agree that delaying the interface definition would probably be better. So I can remove the "IndexBackupStrategy" interface and have BackupManager use "CopyFilesStrategy" by default. Would that be sufficient?

          hgadre Hrishikesh Gadre added a comment -

          >>However if we want to keep things simple, we can choose to not provide separate APIs to configure "repositories". Instead we can just pick the same file-system used to store the indexed data. That means in case of local file-system, the backup will be stored on shared file-system using SimpleFSDirectory implementation AND for HDFS we will use HdfsDirectory impl. Make sense?

          I think the main problem here is identifying the type of file-system used for a given collection at the Overseer level. (The Solr core, on the other hand, already has a DirectoryFactory reference, so we can instantiate the appropriate directory in the snapshooter.)

          markrmiller@gmail.com Mark Miller added a comment -

          Let me resubmit the patch without this noise.

          Can you attach a cleaned up patch so it's easier to review?

          dsmiley David Smiley added a comment -

          (p.s. use bq. to quote)

          (me) I have a general question about HDFS; I have no real experience with it: I wonder if Java's NIO file abstractions could be used so we don't have to have separate code? If so it would be wonderful – simpler and less code to maintain. See https://github.com/damiencarol/jsr203-hadoop What do you think?

          (Gadre) Although integrating HDFS and Java NIO API sounds interesting, I would prefer if it is directly provided by HDFS client library as against a third party library which may/may not be supported in future. Also since Solr provides a HDFS backed Directory implementation, it probably make sense to reuse it.

          Any thoughts on this one Mark Miller or Gregory Chanan perhaps?

          However if we want to keep things simple, we can choose to not provide separate APIs to configure "repositories". Instead we can just pick the same file-system used to store the indexed data. That means in case of local file-system, the backup will be stored on shared file-system using SimpleFSDirectory implementation AND for HDFS we will use HdfsDirectory impl. Make sense?

          I understand what you mean, but it seems a shame, and loses the extensibility we want. I think what this comes down to is, should we re-use the Lucene Directory API for moving data in/out of the backup location, or should we use something else.

          I think the main problem here is identifying type of file-system used for a given collection at the Overseer (The solr core on the other hand already has a Directory factory reference. So we can instantiate appropriate directory in the snapshooter).

          It was suggested early in SOLR-5750 that the location param should have a protocol/impl scheme URL prefix (assume file:// if not specified). That may help the Overseer? Or, if you mean it needs to know the directory impl of the live indexes, I imagine it could look this up in the same way it is done from Solr's admin screen (which shows the impl factory).

          I doubt I'll have time to help much more here... I'm a bit behind on my work load.

          hgadre Hrishikesh Gadre added a comment -

          It was suggested early in SOLR-5750 that the location param should have a protocol/impl scheme URL prefix (assume file:// if not specified). That may help the Overseer?

          I thought about that, and the BackupRepositoryFactory implementation (in my patch) is using the "scheme" of the URI to instantiate the correct repository instance. The problem is that the repository implementation may require additional parameters (e.g. S3 credentials, Kerberos settings for HDFS etc.) which will also need to be propagated from the client to the Overseer AND from the Overseer to the individual cores. We will also need to come up with a mechanism to specify these "extra" parameters. Instead of providing such a complicated interface to the users, I am thinking of providing a "registry" of repositories configured across the cluster. The users will need to configure the registry once and just refer to it by name. In this case Solr will already have all the information necessary to communicate with the repository (which can be a file-system or an object store).
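
          As an illustration of the scheme-based dispatch and the proposed registry (every name here is an assumption made for the sketch, not code from the patch):

          import java.net.URI;
          import java.util.HashMap;
          import java.util.Map;

          // Stand-in for the repository abstraction discussed above.
          interface BackupRepository { }

          // Hypothetical registry: repositories (with credentials, Kerberos settings, etc.) are
          // configured once and then referred to either by URI scheme or by a user-supplied name.
          final class BackupRepositoryRegistry {
            private final Map<String, BackupRepository> repositories = new HashMap<>();

            void register(String name, BackupRepository repo) {
              repositories.put(name, repo);
            }

            // Resolve from a location such as "hdfs://nn:8020/solr-backups" or "/local/backups".
            BackupRepository resolveByLocation(String location) {
              String scheme = URI.create(location).getScheme();
              return resolveByName(scheme == null ? "file" : scheme); // assume file:// when no scheme is given
            }

            BackupRepository resolveByName(String name) {
              BackupRepository repo = repositories.get(name);
              if (repo == null) {
                throw new IllegalArgumentException("No backup repository registered as '" + name + "'");
              }
              return repo;
            }
          }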

          I think what this comes down to is, should we re-use the Lucene Directory API for moving data in/out of the backup location, or should we use something else.

          Yes, I think it is possible to use a Lucene Directory implementation without requiring a different "Repository" interface. Currently we don't have a Directory implementation available for S3 though. Should we do that?

          hgadre Hrishikesh Gadre added a comment -

          Mark Miller Please take a look at the updated patch.

          gchanan Gregory Chanan added a comment -

          Any thoughts on this one Mark Miller or Gregory Chanan perhaps?

          I agree with Hrishikesh's take here. In addition to the third party library issue, HDFS does not in general support all the local file system APIs you'd expect. I don't know if there are issues with nio specifically, but two examples in the past have been truncate and directory sync.

          markrmiller@gmail.com Mark Miller added a comment -

          I think Uwe has brought up a similar idea for the current HDFS integration. I think it's certainly worth exploring at some time, but for this issue, I'm also +1 on using our current 'known' approach to interacting with HDFS.

          hgadre Hrishikesh Gadre added a comment -

          Yes I think it is possible to use Lucene Directory implementation without requiring a different "Repository" interface. Currently we don't have Directory implementation available for S3 though. Should we do that?

          OK. I will update the patch to use the Directory interface (and remove the Repository interface). But I would still like to understand how we should proceed with integration with different file-systems. It occurs to me that the "DirectoryFactory" configuration in solrconfig.xml can be exposed at a higher level so that it would be useful for both index management and backup/restore, e.g. consider how HDFS configuration is done today:
          https://github.com/cloudera/lucene-solr/blob/25d722e35238cca776abbe3a621e0c5b733e762d/cloudera/solrconfig.xml#L119

          If this is exposed via a separate "Repository" API, then solrconfig.xml can also refer to it via a user-configurable "name". (Please note that some care needs to be taken to allow the "block-cache" to be configured selectively, as the backup/restore solution does not need it.) This way users can register multiple repositories (e.g. local file-system, HDFS etc.) and choose one for index management and another for backup/restore without duplicating configuration (e.g. one in solrconfig.xml and another as part of the "Repository" API).

          It seems like a major change though it is a "correct" solution. So any feedback on this would be great.

          markrmiller@gmail.com Mark Miller added a comment -

          I don't know that a Repository is a bad idea. A Directory is a very specific object that is made for accessing files for reading and writing Lucene indexes - I don't know that we want to tie that up with simply needing to be able to copy off files to different places. A Repository impl might use Directory impls to do its work, but I don't know that I buy that Directory is a good replacement for Repository. Many possible Repository locations may not even make reasonable Directory implementations.

          hgadre Hrishikesh Gadre added a comment -

          Yes, that's why I thought to define a separate interface (although there is some redundancy w.r.t. the Directory interface). So I am thinking of defining a new section in solr.xml to configure the backup directories, e.g.:

          <backup-locations>
            <backup-location name="hdfs" type="solr.HdfsRepository">
              <base_location>/solr-backups</base_location>
              ... (other params)
            </backup-location>
            ...
          </backup-locations>

          During the backup/restore operation, the user can specify the "name" of the location. In case this parameter is absent, we will use the local file-system implementation for backwards compatibility.
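
          As a usage illustration only (the request parameter name below is an assumption for this sketch, not part of any committed API), a backup request referring to the "hdfs" location defined above might look like:

          http://localhost:8983/solr/admin/collections?action=BACKUP&name=nightly&collection=techproducts&repository=hdfs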

          Thoughts?

          markrmiller@gmail.com Mark Miller added a comment -

          +1, that makes sense to me.

          hgadre Hrishikesh Gadre added a comment -

          Mark Miller

          I successfully refactored the "core backup" logic to use the Repository interface. But it looks like during the "restore" operation we need to use a Lucene API to compute the file checksum. This API unfortunately requires the Lucene Directory/IndexInput implementation.
          https://github.com/apache/lucene-solr/blob/a5586d29b23f7d032e6d8f0cf8758e56b09e0208/solr/core/src/java/org/apache/solr/handler/RestoreCore.java#L83

          Is there any other way to compute the checksum? Without such support, we will need to implement the Directory interface for each type of file-system we want to integrate with (which makes the Repository interface redundant).

          Also I didn't quite understand the logic in the following code-block,
          https://github.com/apache/lucene-solr/blob/a5586d29b23f7d032e6d8f0cf8758e56b09e0208/solr/core/src/java/org/apache/solr/handler/RestoreCore.java#L88-L95

          Why would we want to use the files in the "local" directory? In case of collection restoration, there will be no files (since we create a new core). I am not sure if I understand the actual problem here...

          varunthacker Varun Thacker added a comment -

          Also I didn't quite understand the logic in the following code-block,

          You can restore to an existing core. So for example:

          • Core=CoreX gets backed up
          • Some docs get added to CoreX
          • We want to restore CoreX to an earlier point.

          In this scenario there can be lots of segment files which are already present for CoreX. So instead of copying the entire restore folder, this code block tries to optimize and prefers a local copy so that the copy is, in theory, faster.
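
          A compact sketch of that optimization (illustrative only; this is not the actual RestoreCore code, and the class and helper names are assumptions):

          import java.util.Map;
          import java.util.Set;

          // Decide, per backed-up file, whether the copy already present in the core can be reused.
          final class RestoreCopyPlan {
            static boolean canReuseLocal(String fileName,
                                         Map<String, Long> localChecksums,
                                         Map<String, Long> backupChecksums) {
              Long local = localChecksums.get(fileName);
              Long backup = backupChecksums.get(fileName);
              // Reuse only when the file exists locally and its checksum matches the backed-up copy;
              // otherwise it must be copied from the backup location.
              return local != null && local.equals(backup);
            }

            static void plan(Set<String> backupFiles,
                             Map<String, Long> localChecksums,
                             Map<String, Long> backupChecksums) {
              for (String f : backupFiles) {
                if (canReuseLocal(f, localChecksums, backupChecksums)) {
                  System.out.println("copy " + f + " from the existing core directory (fast local copy)");
                } else {
                  System.out.println("copy " + f + " from the backup location");
                }
              }
            }
          }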

          varunthacker Varun Thacker added a comment -

          Should we mark our current backup/restore API as experimental, since this patch might need to change the API? It would be safer in case this doesn't make it in by the 6.1 release (whose date we don't know either, though).

          markrmiller@gmail.com Mark Miller added a comment -

          But it looks like during "restore" operation we need to use Lucene API to compute the file checksum.

          I think it's reading it from the index files, not calculating it? But yeah, I'm sure you could recalculate it, though some files may not even have it.

          It doesn't seem like a full Directory implementation is required - just the ability to read the checksums for a file in the repo (read a header and see if it matches a checksum). And of course an impl could just do the same checksum computation over itself. It should just be used to be sure we don't treat a file with the same name and size as another file the same when the data is actually different. This can happen fairly easily with different Lucene indexes.

          thetaphi Uwe Schindler added a comment -

          Yes, you can read it from any index file that has a codec header. The CodecUtil class has methods for it: https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/codecs/CodecUtil.html#retrieveChecksum-org.apache.lucene.store.IndexInput-

          To validate and recalculate, there is also a method, but this may take a long time: https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/codecs/CodecUtil.html#checksumEntireFile-org.apache.lucene.store.IndexInput-
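
          A minimal sketch of using these two CodecUtil methods (it assumes a Lucene Directory and a file name are already in hand; this is an illustration, not code from the patch):

          import java.io.IOException;
          import org.apache.lucene.codecs.CodecUtil;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.IOContext;
          import org.apache.lucene.store.IndexInput;

          final class ChecksumReader {
            // Reads the checksum stored in the file's codec footer without scanning the whole file.
            static long storedChecksum(Directory dir, String fileName) throws IOException {
              try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
                return CodecUtil.retrieveChecksum(in);
              }
            }

            // Recomputes the checksum over the entire file and verifies it against the footer (slower).
            static long recomputedChecksum(Directory dir, String fileName) throws IOException {
              try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
                return CodecUtil.checksumEntireFile(in);
              }
            }
          }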

          thetaphi Uwe Schindler added a comment -

          In both cases you don't need a full Directory impl. An IndexInput impl for read access to the file is enough.

          hgadre Hrishikesh Gadre added a comment -

          Uwe Schindler Mark Miller

          Thanks for the comments. Yes, we just need an IndexInput and not the entire Directory implementation. Let me work on this.

          markrmiller@gmail.com Mark Miller added a comment -

          Yes we just need IndexInput and not the entire directory implementation.

          Or even just something that can read the start of files and do the same hash calc, or does the hash calc while copying...but yeah, probably by an IndexInput usually.

          I have not had a chance to see why we do this though. I know we do it when replicating because you are copying a Lucene index into an existing index. But when backing up, you don't expect any files to already exist, do you? Can't you just ensure the backup is only done to new/empty locations? I have to look at some more context still though; I've only looked at the couple of lines pointed at on GitHub above.

          hgadre Hrishikesh Gadre added a comment -

          Varun Thacker Thanks for the comments!

          Should we mark our current backup/restore API as experimental as this patch might need to change the API? It would be safer in-case this doesn't make it to whenever the 6.1 release ( which we don't know when either though )

          I think it makes sense.

          BTW I have filed SOLR-9091 to capture various problems with the "restore" operation.

          hgadre Hrishikesh Gadre added a comment -

          Mark Miller I have posted a partial patch in SOLR-7374. The primary reason for this split is to keep the patch short and easy to review. I will post the remaining changes as part of this issue soon.

          Please take a look and let me know your feedback.

          markrmiller@gmail.com Mark Miller added a comment -

          Cool, I'll take a look over the next day or two.

          hgadre Hrishikesh Gadre added a comment - edited

          Mark Miller Please find the updated patch. This builds on the patch submitted as part of SOLR-7374.

          This patch implements the following:

          • Added Solr/Lucene version to check the compatibility between the backup version and the version of Solr on which it is being restored.
          • Added a backup implementation version to check the compatibility between the "restore" implementation and backup format.
          • Introduced a Strategy interface to define how the Solr index data is backed up (e.g. using file copy approach)
          • Solr cloud backup/restore implementation is file-system agnostic.
          • Unit test added to verify integration with HDFS ( + some unit test refactoring ).
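
          For the first two items in the list above, a rough illustration of how such version information could be recorded alongside a backup and consulted on restore (the property names and the policy are hypothetical, not taken from the patch):

          import java.io.IOException;
          import java.io.Reader;
          import java.io.Writer;
          import java.util.Properties;

          final class BackupVersionInfo {
            // Written when the backup is created.
            static void write(Writer out, String solrVersion, String backupFormatVersion) throws IOException {
              Properties props = new Properties();
              props.setProperty("solr.version", solrVersion);                  // e.g. "6.4.0"
              props.setProperty("backup.format.version", backupFormatVersion); // e.g. "1.0"
              props.store(out, "backup metadata");
            }

            // Consulted when the backup is restored; the actual compatibility policy was still undecided.
            static void check(Reader in, String currentSolrVersion) throws IOException {
              Properties props = new Properties();
              props.load(in);
              String backedUpWith = props.getProperty("solr.version");
              if (backedUpWith == null) {
                throw new IOException("Backup does not record a Solr version; cannot verify compatibility");
              }
              System.out.println("Backup created with Solr " + backedUpWith
                  + "; restoring with Solr " + currentSolrVersion);
            }
          }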
          markrmiller@gmail.com Mark Miller added a comment -

          Sorry! I got sucked deep into vacation. Tomorrow I spend time on this.

          markrmiller@gmail.com Mark Miller added a comment -

          Hmm...going to try again, but having some trouble compiling after applying the patch.

          markrmiller@gmail.com Mark Miller added a comment -

          Oh wait, sorry, missed the "builds on" part. Let me go again.

          hgadre Hrishikesh Gadre added a comment -

          Mark Miller Varun Thacker Based on our discussion in SOLR-7374, I created SOLR-9242 to track the changes required to support collection-level backup/restore for other file systems. Once those changes are committed, I will submit another patch here. It would include the following:

          • Add Solr/Lucene version to check the compatibility between the backup version and the version of Solr on which it is being restored.
          • Add a backup implementation version to check the compatibility between the "restore" implementation and backup format.
          • Introduce a Strategy interface to define how the Solr index data is backed up (e.g. using file copy approach).
          githubbot ASF GitHub Bot added a comment -

          GitHub user hgadre opened a pull request:

          https://github.com/apache/lucene-solr/pull/67

          SOLR-9055 Make collection backup/restore extensible

          • Introduced a Strategy interface to define how the Solr index data is backed up.
          • Two concrete implementations of this strategy interface defined.
          • One using core Admin API (BACKUPCORE)
          • Other skipping the backup of index data altogether. This is useful when
            the index data is copied via an external mechanism in combination with named
            snapshots (Please refer to SOLR-9038 for details)
          • In future we can add additional implementations of this interface (e.g. based on HDFS snapshots etc.)
          • Added a backup property to record the Solr version. This helps to check the compatibility
            of backup with respect to the current version during the restore operation. This
            compatibility check is not added since its unclear what the Solr level compatibility guidelines
            are. But at-least having version information as part of the backup would be very useful.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/hgadre/lucene-solr SOLR-9055_fix

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/67.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #67


          commit ee07e54a36989637c39b110f1cba19c8af14a0fb
          Author: Hrishikesh Gadre <hgadre@cloudera.com>
          Date: 2016-08-10T21:41:12Z

          SOLR-9055 Make collection backup/restore extensible

          • Introduced a Strategy interface to define how the Solr index data is backed up.
          • Two concrete implementations of this strategy interface defined.
          • One using core Admin API (BACKUPCORE)
          • Other skipping the backup of index data altogether. This is useful when
            the index data is copied via an external mechanism in combination with named
            snapshots (Please refer to SOLR-9038 for details)
          • In future we can add additional implementations of this interface (e.g. based on HDFS snapshots etc.)
          • Added a backup property to record the Solr version. This helps to check the compatibility
            of backup with respect to the current version during the restore operation. This
            compatibility check is not added since its unclear what the Solr level compatibility guidelines
            are. But at-least having version information as part of the backup would be very useful.

          hgadre Hrishikesh Gadre added a comment -

          Varun Thacker Mark Miller Please take a look at this pull request. I think we should check the compatibility of the backed-up index version during the restore operation. But I am not quite sure about the compatibility guidelines for Solr (which include the Lucene index format + Solr config files + other collection-level meta-data).

          jira-bot ASF subversion and git services added a comment -

          Commit 1381dd9287a23c950eaaa3c258249a5ebc812f35 in lucene-solr's branch refs/heads/master from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1381dd9 ]

          SOLR-9055: Make collection backup/restore extensible.

          • Introduced a parameter for the Backup operation to specify index backup strategy.
          • Introduced two strategies for backing up index data.
          • One using core Admin API (BACKUPCORE)
          • Other skipping the backup of index data altogether. This is useful when
            the index data is copied via an external mechanism in combination with named
            snapshots (Please refer to SOLR-9038 for details)
          • In future we can add additional implementations of this interface (e.g. based on HDFS snapshots etc.)
          • Added a backup property to record the Solr version. This helps to check the compatibility
            of backup with respect to the current version during the restore operation. This
            compatibility check is not added since its unclear what the Solr level compatibility guidelines
            are. But at-least having version information as part of the backup would be very useful.
          jira-bot ASF subversion and git services added a comment -

          Commit 03cac8c7b5cb03a0940b1810bcece58466744f26 in lucene-solr's branch refs/heads/branch_6x from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=03cac8c ]

          SOLR-9055: Make collection backup/restore extensible.

          • Introduced a parameter for the Backup operation to specify index backup strategy.
          • Introduced two strategies for backing up index data.
          • One using core Admin API (BACKUPCORE)
          • Other skipping the backup of index data altogether. This is useful when
            the index data is copied via an external mechanism in combination with named
            snapshots (Please refer to SOLR-9038 for details)
          • In future we can add additional implementations of this interface (e.g. based on HDFS snapshots etc.)
          • Added a backup property to record the Solr version. This helps to check the compatibility
            of backup with respect to the current version during the restore operation. This
            compatibility check is not added since its unclear what the Solr level compatibility guidelines
            are. But at-least having version information as part of the backup would be very useful.
          Conflicts:
              solr/CHANGES.txt
          markrmiller@gmail.com Mark Miller added a comment -

          Thanks! If you want to discuss enhancing that version check, let's spin it off into a new issue.


            People

            • Assignee: markrmiller@gmail.com Mark Miller
            • Reporter: hgadre Hrishikesh Gadre
            • Votes: 0
            • Watchers: 8
