Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-4323

Add max cfs segment size to LogMergePolicy and TieredMergePolicy

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.0, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Our application is managing thousands of indexes ranging from a few KB to a few GB in size. To keep the number of files under control and at the same time avoid the overhead of compound file format for large segments, we would like to keep only small segments as CFS. The meaning of "small" here is in absolute byte size terms, not as a percentage of the overall index. It is ok and in fact desirable to have the entire index as CFS as long as it is below the threshold.

      The attached patch adds a new configuration option maxCFSSegmentSize which sets the absolute limit on the compound file segment size, in addition to the existing noCFSRatio, i.e. the lesser of the two will be used. The default is to allow any size (Long.MAX_VALUE) so that the default behavior is exactly as it was before.

      The patch is for the trunk as of Aug 23, 2012.

      1. LUCENE-4323.patch
        6 kB
        Alexey Lef
      2. LUCENE-4323.patch
        12 kB
        Uwe Schindler
      3. diskFullFix.patch
        1.0 kB
        Uwe Schindler

        Activity

        Hide
        thetaphi Uwe Schindler added a comment -

        Closed after release.

        Show
        thetaphi Uwe Schindler added a comment - Closed after release.
        Hide
        rcmuir Robert Muir added a comment -

        this is actually fixed, I reopened it because of a jenkins fail, but it was only slightly related (more to Uwe's new test randomization)

        Show
        rcmuir Robert Muir added a comment - this is actually fixed, I reopened it because of a jenkins fail, but it was only slightly related (more to Uwe's new test randomization)
        Hide
        rcmuir Robert Muir added a comment -

        Thanks for digging Uwe: I just wanted to be sure here. +1 to commit this. I'm not sure if any of the other disk full methods might need this too.

        Show
        rcmuir Robert Muir added a comment - Thanks for digging Uwe: I just wanted to be sure here. +1 to commit this. I'm not sure if any of the other disk full methods might need this too.
        Hide
        thetaphi Uwe Schindler added a comment -

        Here the patch that fixes the test bug, explanation (why Roberts fix fixed the bug was just a side effect of his code change, not setting the maxMergeSize). The code committed here as fine!

        Show
        thetaphi Uwe Schindler added a comment - Here the patch that fixes the test bug, explanation (why Roberts fix fixed the bug was just a side effect of his code change, not setting the maxMergeSize). The code committed here as fine!
        Hide
        thetaphi Uwe Schindler added a comment -

        You fix just hides the problem which is caused by another randomization in LTC:

        step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)!

        Here is the explanation:
        Your fix - (setting LMP's maxMerge) is not changing anything here - the max size is still Long.MAX_VALUE internally! The change is really that you change the order of newLogMergePolicy(),newIndexWriterConfig() [they were the other order before] -> this changes random numbers completely.
        What I changed in my patch was missing randomization in new LogMergePolicy for cfsRatio! Before it was always 0.1, preventing creating CFS on addIndexes at all! In the original seed without your patch, the cfsRatio was then randomized to ~0.8, causing CFS files to be created -> causing disk Full.

        The cause is simply additional randomization, missing before for newLogMergePolicy().

        The correct way to fix this test is to prevent CFS files at all: newLogMergePolicy(false).

        Should I commit that fix?

        Show
        thetaphi Uwe Schindler added a comment - You fix just hides the problem which is caused by another randomization in LTC: step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)! Here is the explanation: Your fix - (setting LMP's maxMerge) is not changing anything here - the max size is still Long.MAX_VALUE internally! The change is really that you change the order of newLogMergePolicy(),newIndexWriterConfig() [they were the other order before] -> this changes random numbers completely. What I changed in my patch was missing randomization in new LogMergePolicy for cfsRatio! Before it was always 0.1, preventing creating CFS on addIndexes at all! In the original seed without your patch, the cfsRatio was then randomized to ~0.8, causing CFS files to be created -> causing disk Full. The cause is simply additional randomization, missing before for newLogMergePolicy(). The correct way to fix this test is to prevent CFS files at all: newLogMergePolicy(false). Should I commit that fix?
        Hide
        rcmuir Robert Muir added a comment -

        Keep in mind we should I think, still apply something like my patch to disk full tests that count MBytes, as we don't want this kicking in here I think.

        However i want to be absolutely positive we aren't hiding bugs before doing that.

        Show
        rcmuir Robert Muir added a comment - Keep in mind we should I think, still apply something like my patch to disk full tests that count MBytes, as we don't want this kicking in here I think. However i want to be absolutely positive we aren't hiding bugs before doing that.
        Hide
        rcmuir Robert Muir added a comment -

        This does not actually work in megabytes: see my email response on the list to test failures.

        step1: ant test -Dtestcase=TestIndexWriterOnDiskFull -Dtests.method=testAddIndexOnDiskFull -Dtests.seed=17160CAE40DA6C29 -Dtests.slow=true -Dtests.locale=hr_HR -Dtests.timezone=Etc/GMT+12 -Dtests.file.encoding=UTF-8 -Dtests.verbose=true

        You can see all values are 8.796093022207999E12

        step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)!

        but now if you run the seed again, it passes!

        Show
        rcmuir Robert Muir added a comment - This does not actually work in megabytes: see my email response on the list to test failures. step1: ant test -Dtestcase=TestIndexWriterOnDiskFull -Dtests.method=testAddIndexOnDiskFull -Dtests.seed=17160CAE40DA6C29 -Dtests.slow=true -Dtests.locale=hr_HR -Dtests.timezone=Etc/GMT+12 -Dtests.file.encoding=UTF-8 -Dtests.verbose=true You can see all values are 8.796093022207999E12 step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)! but now if you run the seed again, it passes!
        Hide
        thetaphi Uwe Schindler added a comment -

        Committed trunk revision: 1376766
        Committed 4.x revision: 1376767

        Thanks Alexey!

        Show
        thetaphi Uwe Schindler added a comment - Committed trunk revision: 1376766 Committed 4.x revision: 1376767 Thanks Alexey!
        Hide
        thetaphi Uwe Schindler added a comment -

        Updated patch:
        The setter had a documentation and overflow problem. I also fixed the other setters in Tiered to behave correctly when Double.POSITIVE_INFINITY is passed. Finally I added tests for the setters.

        I will commit this later!

        Show
        thetaphi Uwe Schindler added a comment - Updated patch: The setter had a documentation and overflow problem. I also fixed the other setters in Tiered to behave correctly when Double.POSITIVE_INFINITY is passed. Finally I added tests for the setters. I will commit this later!
        Hide
        steve_rowe Steve Rowe added a comment -

        +1

        (Too bad the useCompoundFile() and the related configuration getters&setters have to be implemented in both LogMergePolicy and TieredMergePolicy, since MergePolicy, their shared superclass, doesn't have a concrete implementation.)

        Show
        steve_rowe Steve Rowe added a comment - +1 (Too bad the useCompoundFile() and the related configuration getters&setters have to be implemented in both LogMergePolicy and TieredMergePolicy, since MergePolicy, their shared superclass, doesn't have a concrete implementation.)
        Hide
        mikemccand Michael McCandless added a comment -

        +1

        Show
        mikemccand Michael McCandless added a comment - +1
        Hide
        thetaphi Uwe Schindler added a comment -

        Thanks. What think the others committers about this?

        Show
        thetaphi Uwe Schindler added a comment - Thanks. What think the others committers about this?
        Hide
        alexeylef Alexey Lef added a comment -

        Sorry about that. I had eclipse do some auto-formatting on save, so the patch included more than it should have. Updated patch is attached (no tabs).

        Show
        alexeylef Alexey Lef added a comment - Sorry about that. I had eclipse do some auto-formatting on save, so the patch included more than it should have. Updated patch is attached (no tabs).
        Hide
        thetaphi Uwe Schindler added a comment -

        Hi, thanks for patch. The only thing: Can you fix your coding style to use no TABS but instead 2 spaces for indentation? For eclipse and idea there are code styles in the checkout (ant eclipse, ant idea).

        Show
        thetaphi Uwe Schindler added a comment - Hi, thanks for patch. The only thing: Can you fix your coding style to use no TABS but instead 2 spaces for indentation? For eclipse and idea there are code styles in the checkout (ant eclipse, ant idea).

          People

          • Assignee:
            thetaphi Uwe Schindler
            Reporter:
            alexeylef Alexey Lef
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development