Lucene - Core
LUCENE-2773

Don't create compound file for large segments by default

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.4, 3.0.3, 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      Spinoff from LUCENE-2762.

      CFS is useful for keeping the open file count down. But it adds some
      time during indexing to build, and also ties up temporary disk space,
      causing, e.g., a large spike on the final merge of an optimize.

      Since MergePolicy dictates which segments should be CFS, we can
      change it to only build CFS for "smallish" merges.

      I think we should also set a maxMergeMB by default so that very large
      merges aren't done.
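
      As a rough sketch of the proposed rule (illustrative only, not code from
      the patch; the class and method names here are hypothetical, and the 10%
      threshold mirrors the default discussed in the comments below):

        /** Sketch: build a CFS only for "smallish" merges. */
        public class NoCfsRatioSketch {
          // Assumed threshold: merges above 10% of the index stay non-compound.
          static final double NO_CFS_RATIO = 0.10;

          /** True if the merged segment should be written in compound format. */
          static boolean useCompoundFile(long estimatedMergedBytes, long totalIndexBytes) {
            return estimatedMergedBytes <= NO_CFS_RATIO * totalIndexBytes;
          }

          public static void main(String[] args) {
            long indexBytes = 100L << 30;                                // 100 GB index
            System.out.println(useCompoundFile(5L << 30, indexBytes));   // 5 GB merge  -> true (CFS)
            System.out.println(useCompoundFile(20L << 30, indexBytes));  // 20 GB merge -> false
          }
        }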

      Attachments

      1. LUCENE-2773.patch (13 kB) - Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch.

        I added a get/setNoCFSRatio to LogMergePolicy, defaulted to 10% (0.1),
        meaning that if the estimated size of the merged segment is greater than
        10% of the total size of the index, then we leave the merged segment in
        non-compound format.

        I also defaulted the maxMergeMB to 2 GB, meaning (w/ the default
        mergeFactor of 10) your biggest segments will be 2 - 20 GB.
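
        As a point of reference, here is a minimal sketch of setting these values
        explicitly on a 3.1-era API; it assumes the setNoCFSRatio setter this patch
        adds, alongside the existing setMaxMergeMB and setMergeFactor on
        LogByteSizeMergePolicy:

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.index.LogByteSizeMergePolicy;
          import org.apache.lucene.util.Version;

          public class MergeDefaultsSketch {
            public static void main(String[] args) {
              LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
              mp.setMergeFactor(10);       // default: merge 10 segments at a time
              mp.setNoCFSRatio(0.10);      // merges above 10% of the index stay non-compound
              mp.setMaxMergeMB(2 * 1024);  // segments above ~2 GB are not merged further

              IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,
                  new StandardAnalyzer(Version.LUCENE_31));
              iwc.setMergePolicy(mp);
              // pass iwc to new IndexWriter(directory, iwc) as usual
            }
          }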

        Michael McCandless added a comment -

        I'll commit this soon to trunk, but...

        I think we should also back-port it to 2.9.x/3.0.x.

        On the one hand, it's a sizable change to IndexWriter's defaults: suddenly, if you use CFS, you'll see your large segments no longer being converted to CFS, and if you have a large index you'll see your large segments no longer getting merged away, due to the change to maxMergeMB. Though, these decisions have always been "under the hood", so the change the app sees would be, e.g., on listing the directory, and not really on any "external" factors.

        But, on the other hand, if we don't back port, then suddenly large merges require substantially more transient peak disk space than before, which is a very external change.

        So, it's a lesser-of-evils situation, and I think the lesser evil is to change the defaults.

        Shay Banon added a comment -

        Mike, are you sure about the default maxMergeMB being set to 2 GB? This is a big change in default behavior. For systems that do updates (deletes) we are covered, because deletes are taken (partially) into account when computing the segment size. But, let's say you have a 100 GB index: you will end up with 50 segments, no?

        Michael McCandless added a comment -

        But, let's say you have a 100 GB index: you will end up with 50 segments, no?

        So, in the "worst" case, yes... but in the "best" case you could end up with 5 segments. This threshold applies to segments-to-be-merged, so if you have a bunch of segments just under 2 GB, they will get merged and make a nearly 20 GB segment, which would then not be merged further.

        So basically this setting is terribly coarse. I think this can be improved (e.g. something along the lines of BalancedSegmentMergePolicy), perhaps by merging (much) fewer than mergeFactor segments at a time to keep the immense merges "smallish". But until we cut over to a better merge policy, we're stuck with this coarse setting...

        So... maybe 5 GB?

        But, on the deletes... in 2.9.x and 3.0.x we do NOT in fact take deletions into account by default; I think, along with this change, we should also fix 2.9.x and 3.0.x to take deletions into account.
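
        To make the bounds above concrete, here is a small worked example for a
        100 GB index, assuming the 2 GB cap and the default mergeFactor of 10;
        the arithmetic is illustrative only:

          public class SegmentCountBounds {
            public static void main(String[] args) {
              double indexGB = 100.0;
              double maxMergeGB = 2.0;  // proposed maxMergeMB default
              int mergeFactor = 10;

              // Worst case: every segment already exceeds the cap, so nothing merges further.
              long worst = (long) Math.ceil(indexGB / maxMergeGB);                 // 50 segments
              // Rough best case: segments just under 2 GB still qualify, so one more
              // mergeFactor-way merge yields segments approaching 2 GB * mergeFactor.
              long best = (long) Math.ceil(indexGB / (maxMergeGB * mergeFactor));  // 5 segments

              System.out.println("worst ~" + worst + " segments, best ~" + best + " segments");
            }
          }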

        Michael McCandless added a comment -

        OK, I think for 2.9/3.0 I will only backport the "don't make a CFS if the merged segment is large" change; that change will reduce the temp disk space required.

        I think the change to maxMergeMB / take deletions into account is too big for 2.9/3.0.

        So for 3.x/trunk (which already take deletions into account by default), I'll switch maxMergeMB default to 2 GB. I think this is an OK default given that it means your biggest segments will range from 2GB - 20GB.
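
        For applications that would rather keep the previous behaviour after
        upgrading, here is a hedged sketch of dialing the new defaults back; the
        exact values that reproduce the old, effectively unbounded defaults are
        an assumption:

          import org.apache.lucene.index.LogByteSizeMergePolicy;

          public class RestoreOldDefaultsSketch {
            public static void main(String[] args) {
              LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
              // Build a CFS for every merge, regardless of size (pre-change behaviour).
              mp.setNoCFSRatio(1.0);
              // Effectively remove the merge-size cap (assumed to match the old unbounded default).
              mp.setMaxMergeMB(Long.MAX_VALUE / (1024.0 * 1024.0));
            }
          }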

        Simon Willnauer added a comment -

        So for 3.x/trunk (which already take deletions into account by default), I'll switch maxMergeMB default to 2 GB. I think this is an OK default given that it means your biggest segments will range from 2GB - 20GB.

        Mike, this also means that an optimize will have no effect if all segments are > 2 GB with this as the default? That seems kind of odd to me, eh?

        Michael McCandless added a comment -

        Mike, this also means that an optimize will have no effect if all segments are > 2 GB with this as the default? That seems kind of odd to me, eh?

        There was a separate issue for this – LUCENE-2701.

        I agree it's debatable... and it's not clear which way we should default it.
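
        A sketch of keeping the cap for ordinary background merges while leaving
        optimize free to produce a single segment; the optimize-specific setter is
        assumed to be the one LUCENE-2701 introduces, so its exact name here is an
        assumption:

          import org.apache.lucene.index.LogByteSizeMergePolicy;

          public class OptimizeCapSketch {
            public static void main(String[] args) {
              LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
              // Cap only the merges that run during normal indexing.
              mp.setMaxMergeMB(2 * 1024);
              // Assumed setter from LUCENE-2701: leave optimize() unconstrained so it
              // can still collapse the index down to a single segment.
              mp.setMaxMergeMBForOptimize(Long.MAX_VALUE / (1024.0 * 1024.0));
            }
          }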

        Simon Willnauer added a comment -

        There was a separate issue for this - LUCENE-2701.

        I think we should reopen and fix this. I expect optimize to have single-segment semantics when I call optimize(), as the JavaDocs state. However we do that


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 0
