Lucene - Core
  1. Lucene - Core
  2. LUCENE-3126

IndexWriter.addIndexes can make any incoming segment into CFS if it isn't already

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Today, IW.addIndexes(Directory) does not modify the CFS-mode of the incoming segments. However, if IndexWriter's MP wants to create CFS (in general), there's no reason why not turn the incoming non-CFS segments into CFS. We anyway copy them, and if MP is not against CFS, we should create a CFS out of them.

      Will need to use CFW, not sure it's ready for that w/ current API (I'll need to check), but luckily we're allowed to change it (@lucene.internal).

      This should be done, IMO, even if the incoming segment is large (i.e., passes MP.noCFSRatio) b/c like I wrote above, we anyway copy it. However, if you think otherwise, speak up .

      I'll take a look at this in the next few days.

      1. LUCENE-3126.patch
        15 kB
        Shai Erera
      2. LUCENE-3126.patch
        11 kB
        Shai Erera

        Activity

        Hide
        Michael McCandless added a comment -

        Cool idea!

        Show
        Michael McCandless added a comment - Cool idea!
        Hide
        Shai Erera added a comment -

        Hmm, that got me thinking – what if someone adds a CFS segment to IW, where IW's MP sets cfs=off? Should we consider it a violation (that's what happens today if you try it)? I personally think that this is not a common (or even expected) scenario, and therefore javadocs is enough, but if others think this is a violation, then we should fix it - we'll need to 'open' the CFS into its individual files.

        Show
        Shai Erera added a comment - Hmm, that got me thinking – what if someone adds a CFS segment to IW, where IW's MP sets cfs=off? Should we consider it a violation (that's what happens today if you try it)? I personally think that this is not a common (or even expected) scenario, and therefore javadocs is enough, but if others think this is a violation, then we should fix it - we'll need to 'open' the CFS into its individual files.
        Hide
        Michael McCandless added a comment -

        Seems like we should split open the CFS in this case? Though, I don't think it's super-important to do this... it ought to be rare?

        Show
        Michael McCandless added a comment - Seems like we should split open the CFS in this case? Though, I don't think it's super-important to do this... it ought to be rare?
        Hide
        Shai Erera added a comment -

        Patch includes:

        • Modify CFW to not assume the files it 'merges' into a CFS exist in the same Directory as it was initialized with + test
        • Modify SegmentMerger following CFW updates + test
        • Modify IW to copy a segment into a CFS + test.

        NOTES:

        • I did not cover 'unrolling' a CFS if the MP does not support CFS. I don't think it's so critical (left a TODO in TestAddIndexes)
        • About shared doc stores – I'd appreciate a review of this. Since we don't do shared doc stores anymore, I don't know how to simulate it for the test.
        Show
        Shai Erera added a comment - Patch includes: Modify CFW to not assume the files it 'merges' into a CFS exist in the same Directory as it was initialized with + test Modify SegmentMerger following CFW updates + test Modify IW to copy a segment into a CFS + test. NOTES: I did not cover 'unrolling' a CFS if the MP does not support CFS. I don't think it's so critical (left a TODO in TestAddIndexes) About shared doc stores – I'd appreciate a review of this. Since we don't do shared doc stores anymore, I don't know how to simulate it for the test.
        Hide
        Shai Erera added a comment -

        Patch does not handle all files well (few tests fail). Apparently, the .del file should not be rolled into the .cfs. SegmentMerger.createCompoundFile does this by default, however it's only called from code that ensures no deletions exist. Would have been nice if this method documented it .

        Also, I think *.s<num> should not be rolled into .cfs (those are the separate norms files). I don't know how to create such files in the first place (thought they're of old format, but 3.1 indexes have them also), and TestBackCompat fails. Is there a way to identify those files? Is it safe to check if the file extension starts w/ IndexFileNames.SEPARATE_NORMS_EXTENSION? Feels hacky to me.

        Another thing, I think in order to avoid shared doc stores (and whatever other old-format) stuff, since it's only an optimization, that the code should copy into CFS only if the segment version is on or after 3.1 (that is StringHelper.getVersionComparator().compare(info.getVersion, "3.1") >= 0).

        I think I'm close to finish it, just need to figure out the separate norms thing.

        Show
        Shai Erera added a comment - Patch does not handle all files well (few tests fail). Apparently, the .del file should not be rolled into the .cfs. SegmentMerger.createCompoundFile does this by default, however it's only called from code that ensures no deletions exist. Would have been nice if this method documented it . Also, I think *.s<num> should not be rolled into .cfs (those are the separate norms files). I don't know how to create such files in the first place (thought they're of old format, but 3.1 indexes have them also), and TestBackCompat fails. Is there a way to identify those files? Is it safe to check if the file extension starts w/ IndexFileNames.SEPARATE_NORMS_EXTENSION? Feels hacky to me. Another thing, I think in order to avoid shared doc stores (and whatever other old-format) stuff, since it's only an optimization, that the code should copy into CFS only if the segment version is on or after 3.1 (that is StringHelper.getVersionComparator().compare(info.getVersion, "3.1") >= 0). I think I'm close to finish it, just need to figure out the separate norms thing.
        Hide
        Michael McCandless added a comment -

        Patch does not handle all files well (few tests fail). Apparently, the .del file should not be rolled into the .cfs.

        Right, .del files never appear inside a CFS.

        SegmentMerger.createCompoundFile does this by default, however it's only called from code that ensures no deletions exist. Would have been nice if this method documented it .

        Please add comments to this! It's non-obvious

        Also, I think *.s<num> should not be rolled into .cfs (those are the separate norms files). I don't know how to create such files in the first place (thought they're of old format, but 3.1 indexes have them also), and TestBackCompat fails.

        Right, these too only live outside a CFS. You create them by opening a writable IndexReader (I know: confusing!) and calling setNorm, then closing it. They are not only for old indices... 4.0 creates them too.

        Is there a way to identify those files? Is it safe to check if the file extension starts w/ IndexFileNames.SEPARATE_NORMS_EXTENSION? Feels hacky to me.

        Hackish though it seems (I agree) I think that's the only way? SegmentInfo.hasSeparateNorms is equally hacky...

        Another thing, I think in order to avoid shared doc stores (and whatever other old-format) stuff, since it's only an optimization, that the code should copy into CFS only if the segment version is on or after 3.1 (that is StringHelper.getVersionComparator().compare(info.getVersion, "3.1") >= 0).

        Shared doc stores, yes, but the separate del docs / norms are produced by all versions.

        More generally: does addIndexes properly refuse to import a too-old index? We should throw IndexFormatTooOldExc in this case? (And, maybe also IndexFormatTooNewExc?).

        Show
        Michael McCandless added a comment - Patch does not handle all files well (few tests fail). Apparently, the .del file should not be rolled into the .cfs. Right, .del files never appear inside a CFS. SegmentMerger.createCompoundFile does this by default, however it's only called from code that ensures no deletions exist. Would have been nice if this method documented it . Please add comments to this! It's non-obvious Also, I think *.s<num> should not be rolled into .cfs (those are the separate norms files). I don't know how to create such files in the first place (thought they're of old format, but 3.1 indexes have them also), and TestBackCompat fails. Right, these too only live outside a CFS. You create them by opening a writable IndexReader (I know: confusing!) and calling setNorm, then closing it. They are not only for old indices... 4.0 creates them too. Is there a way to identify those files? Is it safe to check if the file extension starts w/ IndexFileNames.SEPARATE_NORMS_EXTENSION? Feels hacky to me. Hackish though it seems (I agree) I think that's the only way? SegmentInfo.hasSeparateNorms is equally hacky... Another thing, I think in order to avoid shared doc stores (and whatever other old-format) stuff, since it's only an optimization, that the code should copy into CFS only if the segment version is on or after 3.1 (that is StringHelper.getVersionComparator().compare(info.getVersion, "3.1") >= 0). Shared doc stores, yes, but the separate del docs / norms are produced by all versions. More generally: does addIndexes properly refuse to import a too-old index? We should throw IndexFormatTooOldExc in this case? (And, maybe also IndexFormatTooNewExc?).
        Hide
        Shai Erera added a comment -

        Right, these too only live outside a CFS. You create them by opening a writable IndexReader (I know: confusing!) and calling setNorm, then closing it. They are not only for old indices... 4.0 creates them too.

        Thanks Mike ! I was able to reproduce it and fix it (+ add to test). Are there other files that are normally created outside the .cfs? I've seen sometimes that the stored fields of CFS are created outside. Was it only for shared doc stores?

        More generally: does addIndexes properly refuse to import a too-old index? We should throw IndexFormatTooOldExc in this case? (And, maybe also IndexFormatTooNewExc?).

        Not today. I believe it will fail in later stages (e.g. commit()), but we better fail up front. I think it's a separate issue though, only for 4.0 (b/c 3x supports all formats today)?

        Show
        Shai Erera added a comment - Right, these too only live outside a CFS. You create them by opening a writable IndexReader (I know: confusing!) and calling setNorm, then closing it. They are not only for old indices... 4.0 creates them too. Thanks Mike ! I was able to reproduce it and fix it (+ add to test). Are there other files that are normally created outside the .cfs? I've seen sometimes that the stored fields of CFS are created outside. Was it only for shared doc stores? More generally: does addIndexes properly refuse to import a too-old index? We should throw IndexFormatTooOldExc in this case? (And, maybe also IndexFormatTooNewExc?). Not today. I believe it will fail in later stages (e.g. commit()), but we better fail up front. I think it's a separate issue though, only for 4.0 (b/c 3x supports all formats today)?
        Hide
        Shai Erera added a comment -

        Patch w/ all fixes. Tests pass. No CHANGES entry yet, I'll add it in the next patch (after some comments).

        Show
        Shai Erera added a comment - Patch w/ all fixes. Tests pass. No CHANGES entry yet, I'll add it in the next patch (after some comments).
        Hide
        Michael McCandless added a comment -

        Are there other files that are normally created outside the .cfs? I've seen sometimes that the stored fields of CFS are created outside. Was it only for shared doc stores?

        I think just separate norms (_N.sM) and deletions (_N.del) live
        outside CFS? Yes, only w/ shared doc stores did the shared doc store
        files live outside cfs...

        More generally: does addIndexes properly refuse to import a too-old index? We should throw IndexFormatTooOldExc in this case? (And, maybe also IndexFormatTooNewExc?).

        Not today. I believe it will fail in later stages (e.g. commit()), but
        we better fail up front. I think it's a separate issue though, only
        for 4.0 (b/c 3x supports all formats today)?

        OK I see LUCENE-3138 for this... thanks.

        Show
        Michael McCandless added a comment - Are there other files that are normally created outside the .cfs? I've seen sometimes that the stored fields of CFS are created outside. Was it only for shared doc stores? I think just separate norms (_N.sM) and deletions (_N.del) live outside CFS? Yes, only w/ shared doc stores did the shared doc store files live outside cfs... More generally: does addIndexes properly refuse to import a too-old index? We should throw IndexFormatTooOldExc in this case? (And, maybe also IndexFormatTooNewExc?). Not today. I believe it will fail in later stages (e.g. commit()), but we better fail up front. I think it's a separate issue though, only for 4.0 (b/c 3x supports all formats today)? OK I see LUCENE-3138 for this... thanks.
        Hide
        Michael McCandless added a comment -

        Patch looks good!

        Maybe we should add asserts to SegmentMerger.createCompoundFile, that SI.files() did not return del/separate norms? I don't like the ambiguity here... and then strengthen the comment saying SI.files should never return these in this context? Hopefully this does not cause any test files! (We can do this as a separate issue...).

        Show
        Michael McCandless added a comment - Patch looks good! Maybe we should add asserts to SegmentMerger.createCompoundFile, that SI.files() did not return del/separate norms? I don't like the ambiguity here... and then strengthen the comment saying SI.files should never return these in this context? Hopefully this does not cause any test files! (We can do this as a separate issue...).
        Hide
        Shai Erera added a comment -

        Maybe we should add asserts to SegmentMerger.createCompoundFile, that SI.files() did not return del/separate norms?

        Let's do that in a separate issue? I'll also remove the comment from SM in this issue, and upload it to the other one (after I open it ). Keep this issue focused on IW.addIndexes.

        Show
        Shai Erera added a comment - Maybe we should add asserts to SegmentMerger.createCompoundFile, that SI.files() did not return del/separate norms? Let's do that in a separate issue? I'll also remove the comment from SM in this issue, and upload it to the other one (after I open it ). Keep this issue focused on IW.addIndexes.
        Hide
        Shai Erera added a comment -

        Committed revision 1127871 (trunk).
        Committed revision 1127872 (3x).

        Show
        Shai Erera added a comment - Committed revision 1127871 (trunk). Committed revision 1127872 (3x).
        Hide
        Robert Muir added a comment -

        Bulk closing for 3.2

        Show
        Robert Muir added a comment - Bulk closing for 3.2

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development