Lucene - Core
LUCENE-3082

Add tool to upgrade all segments of an index to the most recent supported index format without optimizing

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      Currently if you want to upgrade an old index to the format of your current Lucene version, you have to optimize your index or use addIndexes(IndexReader...) [see LUCENE-2893] to copy to a new directory. The optimize() approach fails if your index is already optimized.

      I propose to add a custom MergePolicy to upgrade all segments to the latest format. This MergePolicy could also simply ignore all segments that are already up-to-date. All segments in prior formats would be merged to a new segment using another MergePolicy's optimize strategy.

      This issue is different from LUCENE-2893, as it would only support upgrading indexes from previous Lucene versions in-place using the official path. It's a tool for the end user, not a developer tool.

      This addition should also go to Lucene 3.x, as we need users with pre-3.0 indexes to take the intermediate step through 3.x; otherwise they would not be able to open their index with 4.0. With this tool in 3.x, users could safely upgrade their index without relying on optimize to work on already-optimized indexes.
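The selection rule proposed in the description can be sketched in a few lines. This is a simplified, self-contained model rather than the patch itself: `Segment`, `CURRENT_FORMAT`, and `segmentsToUpgrade` are hypothetical names, and a real implementation would work on Lucene's own segment metadata and hand the selected segments to the wrapped MergePolicy.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the proposed selection step; none of these types are Lucene classes. */
final class UpgradeSelectionSketch {

    /** Hypothetical stand-in for the format written by the running Lucene version. */
    static final String CURRENT_FORMAT = "4.0";

    /** Hypothetical stand-in for a segment: a name plus the format it was written with. */
    static final class Segment {
        final String name;
        final String formatVersion;

        Segment(String name, String formatVersion) {
            this.name = name;
            this.formatVersion = formatVersion;
        }
    }

    /** Returns only the segments written in an older format; up-to-date segments are ignored. */
    static List<Segment> segmentsToUpgrade(List<Segment> index) {
        List<Segment> old = new ArrayList<>();
        for (Segment s : index) {
            if (!CURRENT_FORMAT.equals(s.formatVersion)) {
                old.add(s);
            }
        }
        return old;
    }
}
```

In this model, an already-optimized index with a single old-format segment still yields one segment to rewrite, which is the case a plain optimize() would skip.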

      1. LUCENE-3082-reorder-warnings.patch
        2 kB
        Uwe Schindler
      2. LUCENE-3082-reorder-warnings.patch
        5 kB
        Uwe Schindler
      3. LUCENE-3082.patch
        10 kB
        Uwe Schindler
      4. LUCENE-3082.patch
        10 kB
        Uwe Schindler
      5. LUCENE-3082.patch
        10 kB
        Uwe Schindler
      6. LUCENE-3082.patch
        18 kB
        Uwe Schindler
      7. LUCENE-3082.patch
        19 kB
        Uwe Schindler
      8. index.31.optimized.nocfs.zip
        4 kB
        Uwe Schindler
      9. index.31.optimized.cfs.zip
        2 kB
        Uwe Schindler

          Activity

          Robert Muir added a comment -

          Bulk closing for 3.2

          Uwe Schindler added a comment -

          Committed trunk revision: 1102658
          Merged 3.x revision: 1102659

          Uwe Schindler added a comment -

          Upgraded patch. Will now be committed.

          I added a Version ctor argument, as in 3.x this would choose the default merge policy.

          Uwe Schindler added a comment -

          Patch that adds some warnings about reordering of document IDs if the index was partially upgraded before execution.

          Uwe Schindler added a comment -

          We should add a warning to the MergePolicy/IndexUpgrader that this tool reorders segments if the index was partially upgraded before (e.g. by adding new documents). Segments that were already in the new format before the call to the MP's optimize come first, followed by the newly upgraded ones.

          Uwe Schindler added a comment -

          I also used the full random IndexWriterConfig now after LUCENE-3083 was committed (Fix MockRandomMergePolicy).

          I will now commit and merge the test code to produce the optimized indexes.

          Uwe Schindler added a comment -

          Committed trunk revision: 1101088
          Committed 3.x revision: 1101093

          Uwe Schindler added a comment -

          New patch with renamed class and added documentation as suggested by Mike.

          The previous patch also had a bug in the command line tool (instead of "dir" it still used "args[0]" to invoke the ctor, which was a relic from an earlier version of the tool).

          I also fixed javadocs and added lucene.experimental to the UpgradeIndexMergePolicy, as we should not make it too public (but it's not really "internal", because there are use cases not covered by the easy-to-use IndexUpgrader tool).

          Michael McCandless added a comment -

          How about this wording:

          Expert: this tool keeps only the last commit in an index; for this
          reason, if the incoming index has more than one commit, the tool
          refuses to run by default. Specify -delete-prior-commits to override
          this, allowing the tool to delete all but the last commit.

          Maybe just call it IndexUpgrader? (Format seems redundant?)

          There's a missing { and } after the "if (commits.size() > 1)"
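The behaviour described in the proposed wording, together with the fully braced check, might look roughly like the sketch below. It is a self-contained model, not the tool's code: `checkCommits` and `deletePriorCommitsRequested` are hypothetical helpers, the commit count is passed in as a plain number instead of being read from an index, and only the flag name comes from the wording above.

```java
/** Simplified model of the commit check discussed above; not the tool's actual code. */
final class CommitCheckSketch {

    /** Flag name taken from the proposed wording. */
    static final String DELETE_PRIOR_COMMITS = "-delete-prior-commits";

    /** Returns true if the override flag appears anywhere in the arguments. */
    static boolean deletePriorCommitsRequested(String[] args) {
        for (String arg : args) {
            if (DELETE_PRIOR_COMMITS.equals(arg)) {
                return true;
            }
        }
        return false;
    }

    /** Refuses to proceed when the index has several commits and the override is absent. */
    static void checkCommits(int commitCount, boolean deletePriorCommits) {
        if (commitCount > 1) {
            if (!deletePriorCommits) {
                throw new IllegalArgumentException(
                    "index has " + commitCount + " commits; pass "
                        + DELETE_PRIOR_COMMITS + " to keep only the last one");
            }
        }
    }
}
```

Failing by default keeps the destructive step (dropping all but the last commit) an explicit opt-in.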

          Uwe Schindler added a comment -

          Patch with updated and randomized tests, command line tool (oal.index.IndexFormatUpgrader) and javadocs.

          I think it's ready to commit.

          Uwe Schindler added a comment -

          Small change to the merging of the leftover segments that are not scheduled for merge by the wrapped MergePolicy: they are now merged together into one segment instead of separately. Normally there are only a few of them (e.g. when TieredMergePolicy only optimizes the first 30 segments and leaves the rest for later). As we have no cascading optimize, we merge the remaining segments into one.
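The leftover handling described in this comment can be modelled as below. This is an illustrative sketch, not the patch: segments are plain name strings, a merge is just a list of names, and `withLeftovers` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of collecting leftover old segments into one extra merge. */
final class LeftoverMergeSketch {

    /**
     * @param oldSegments   names of segments still in an old format, in index order
     * @param wrappedMerges merges scheduled by the wrapped policy (each a list of segment names)
     * @return the wrapped merges, plus at most one extra merge covering every old
     *         segment that the wrapped policy did not schedule
     */
    static List<List<String>> withLeftovers(List<String> oldSegments,
                                            List<List<String>> wrappedMerges) {
        Set<String> leftover = new LinkedHashSet<>(oldSegments);
        List<List<String>> result = new ArrayList<>();
        for (List<String> merge : wrappedMerges) {
            result.add(merge);
            leftover.removeAll(merge);
        }
        if (!leftover.isEmpty()) {
            // One combined merge instead of one merge per leftover segment.
            result.add(new ArrayList<>(leftover));
        }
        return result;
    }
}
```

Because there is no cascading pass to pick up stragglers later, a single combined merge ensures that every old segment is rewritten in one run.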

          Michael McCandless added a comment -

          Patch looks great!

          The segmentsToOptimize ought to contain every segment in the index; that's only present for the case where optimize() is called in a bg thread but other threads continue to index new documents causing new segments to be flushed. These new segments would then NOT be in the segmentsToOptimize when the optimize merges need to cascade.

          TODO: for the command-line tool, we should make sure the index only has a single commit point (ie, abort if not). Upgrading an index with more than one commit point is hairy (I think it's fine not to support this case... but we should not remove the commits).
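The role of segmentsToOptimize described above can be illustrated with a small model. It is only a sketch under simplifying assumptions: segments are name strings, the set is a snapshot taken when optimize() starts, and `eligibleForOptimize` is a hypothetical helper, not a Lucene method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified model of restricting optimize merges to the segments seen at optimize() time. */
final class SegmentsToOptimizeSketch {

    /**
     * @param liveSegments       segments currently in the index, including any flushed
     *                           by other threads after optimize() started
     * @param segmentsToOptimize snapshot of segment names taken when optimize() started
     * @return the live segments that a cascading optimize merge may still touch
     */
    static List<String> eligibleForOptimize(List<String> liveSegments,
                                            Set<String> segmentsToOptimize) {
        List<String> eligible = new ArrayList<>();
        for (String name : liveSegments) {
            if (segmentsToOptimize.contains(name)) {
                eligible.add(name);
            }
        }
        return eligible;
    }
}
```

When nothing is indexed concurrently, the snapshot and the live list are identical, which matches the expectation that the set normally contains every segment in the index.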

          Uwe Schindler added a comment -

          Upgraded patch with a protected shouldUpgradeSegment(SI) method.

          Uwe Schindler added a comment -

          Shai:

          • The supplied patch should handle everything you want. There would be only one addition, the proposed 'boolean shouldUpgradeSegment(SegmentInfo)' method, which is a one-liner; I will upload a new patch for that and make the merge policy non-final.
          • It will not do cascading merges, because when the merge policy recognizes that all segments already have the new version, it will not merge anything. After the first iteration all segments will be upgraded, so on the next run this policy will return null merges.

          The other ideas like PayloadProcessor can be done outside of that in user code (but beware, it will not touch segments already in the new version).
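The "no cascading" argument above can be checked with a toy model. This is a sketch and not the merge policy itself: segments are reduced to a format label, `findUpgradeMerge` and `applyMerge` are hypothetical helpers, and placing the merged segment at the end of the list is a simplification of this model.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model showing that a second upgrade pass finds nothing to merge. */
final class NoCascadeSketch {

    /** Hypothetical label for segments already written in the current format. */
    static final String CURRENT = "new";

    /** Returns the positions of old-format segments, or null when every segment is current. */
    static List<Integer> findUpgradeMerge(List<String> formats) {
        List<Integer> toMerge = new ArrayList<>();
        for (int i = 0; i < formats.size(); i++) {
            if (!CURRENT.equals(formats.get(i))) {
                toMerge.add(i);
            }
        }
        return toMerge.isEmpty() ? null : toMerge;
    }

    /** Drops the merged segments and adds one current-format segment (at the end, by assumption). */
    static List<String> applyMerge(List<String> formats, List<Integer> merge) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < formats.size(); i++) {
            if (!merge.contains(i)) {
                result.add(formats.get(i));
            }
        }
        result.add(CURRENT);
        return result;
    }
}
```

After one pass, every remaining segment carries the current label, so the next call to `findUpgradeMerge` returns null, mirroring the "null merges" behaviour described in the comment.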

          Uwe Schindler added a comment - - edited

          Patch that implements this with a merge policy:

          It does not yet contain the command-line updater. If you want to upgrade an old index, the API code to do this is very simple:

          IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_XX, new KeywordAnalyzer());
          iwc = iwc.setMergePolicy(new UpgradeIndexMergePolicy(iwc.getMergePolicy()));
          IndexWriter w = new IndexWriter(dir, iwc);
          w.optimize();
          w.close();
          

          The patch contains new tests in TestBackwards that verify the upgrade process:

          • It tries to upgrade all old indexes from the well-known list in TestBackwards. When this is done, all of them should contain exactly one segment (because all segments previously in the index are of an older version, so they are merged/optimized together in the new format). It also verifies all segment versions to be Constants.LUCENE_MAIN_VERSION.
          • It tries to upgrade two old, already optimized indexes (with the previous version; I changed TestBackwards in my 3.1 checkout to generate those). It verifies the segment versions after the upgrade. This special case is needed, as optimizing a one-segment index is a no-op without the special merge policy.
          • It uses the old optimized indexes, opens them using the standard merge policy and adds some documents to them. After that it upgrades the index with a new IndexWriter using the special merge policy. In that case (as some segments are already in the new version), the index should only have the old segments merged together; the newly added ones are untouched. So the segment count is verified to be > 1.
          Shai Erera added a comment -

          This is a great idea. We should also allow one to plug in a PayloadProcessorProvider so he can rewrite the payload "on the go" if need be.

          Also, while the index is being upgraded, I think it will be useful if we merge the segments that are upgraded, but without doing cascading merges. Since segments are rewritten anyway, we can only gain from the merge. As always, if not everybody agrees on this, we can make it a parameter.

          And let's make sure that whatever 'upgrade' means is under the application's control. I.e., upgrade can be simply upgrading from 3.x to 4.0, but it can also be using a PayloadProcessorProvider, as well as suddenly deciding that all segments should be compound. I'm pretty sure I'll want to control the first two, not so sure about the last one.

          It can be a simple 'boolean shouldUpgradeSegment(SegmentInfo)' on this UpgradeMP, which apps can override.
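The overridable hook suggested here could look something like the following sketch. It is a self-contained model with hypothetical types: `Segment`, `UpgradePolicy`, and `CompoundEnforcingPolicy` are stand-ins invented for illustration, and only the method name `shouldUpgradeSegment` comes from the suggestion above.

```java
/** Simplified model of an application-overridable upgrade decision. */
final class ShouldUpgradeSketch {

    /** Hypothetical stand-in for segment metadata. */
    static final class Segment {
        final String formatVersion;
        final boolean compound;

        Segment(String formatVersion, boolean compound) {
            this.formatVersion = formatVersion;
            this.compound = compound;
        }
    }

    /** Default behaviour: upgrade only segments that are not in the current format. */
    static class UpgradePolicy {
        final String currentFormat;

        UpgradePolicy(String currentFormat) {
            this.currentFormat = currentFormat;
        }

        protected boolean shouldUpgradeSegment(Segment s) {
            return !currentFormat.equals(s.formatVersion);
        }
    }

    /** Application-specific variant: additionally rewrite segments that are not compound. */
    static class CompoundEnforcingPolicy extends UpgradePolicy {
        CompoundEnforcingPolicy(String currentFormat) {
            super(currentFormat);
        }

        @Override
        protected boolean shouldUpgradeSegment(Segment s) {
            return super.shouldUpgradeSegment(s) || !s.compound;
        }
    }
}
```

Keeping the decision in one small protected method lets an application widen or narrow what counts as "needs upgrading" without touching the merge logic.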

          Uwe Schindler added a comment -

          Here is the discussion from the #lucene-dev IRC channel: http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2011-05-08#l117

          Michael McCandless added a comment -

          Maybe instead of a new method on IW, this is a new tool (eg oal.index.UpgradeIndex)? That tool would create IW w/ a custom UpgradeMergePolicy that rewrites all segments (or only segments not matching current format, but often that would presumably be all segments).


            People

            • Assignee: Uwe Schindler
            • Reporter: Uwe Schindler
            • Votes: 0
            • Watchers: 1
