Lucene - Core
  1. Lucene - Core
  2. LUCENE-4323

Add max cfs segment size to LogMergePolicy and TieredMergePolicy

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.0, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Our application is managing thousands of indexes ranging from a few KB to a few GB in size. To keep the number of files under control and at the same time avoid the overhead of compound file format for large segments, we would like to keep only small segments as CFS. The meaning of "small" here is in absolute byte size terms, not as a percentage of the overall index. It is ok and in fact desirable to have the entire index as CFS as long as it is below the threshold.

      The attached patch adds a new configuration option maxCFSSegmentSize which sets the absolute limit on the compound file segment size, in addition to the existing noCFSRatio, i.e. the lesser of the two will be used. The default is to allow any size (Long.MAX_VALUE) so that the default behavior is exactly as it was before.

      The patch is for the trunk as of Aug 23, 2012.

      1. diskFullFix.patch
        1.0 kB
        Uwe Schindler
      2. LUCENE-4323.patch
        12 kB
        Uwe Schindler
      3. LUCENE-4323.patch
        6 kB
        Alexey Lef

        Activity

        Hide
        Uwe Schindler added a comment -

        Hi, thanks for patch. The only thing: Can you fix your coding style to use no TABS but instead 2 spaces for indentation? For eclipse and idea there are code styles in the checkout (ant eclipse, ant idea).

        Show
        Uwe Schindler added a comment - Hi, thanks for patch. The only thing: Can you fix your coding style to use no TABS but instead 2 spaces for indentation? For eclipse and idea there are code styles in the checkout (ant eclipse, ant idea).
        Hide
        Alexey Lef added a comment -

        Sorry about that. I had eclipse do some auto-formatting on save, so the patch included more than it should have. Updated patch is attached (no tabs).

        Show
        Alexey Lef added a comment - Sorry about that. I had eclipse do some auto-formatting on save, so the patch included more than it should have. Updated patch is attached (no tabs).
        Hide
        Uwe Schindler added a comment -

        Thanks. What think the others committers about this?

        Show
        Uwe Schindler added a comment - Thanks. What think the others committers about this?
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Steve Rowe added a comment -

        +1

        (Too bad the useCompoundFile() and the related configuration getters&setters have to be implemented in both LogMergePolicy and TieredMergePolicy, since MergePolicy, their shared superclass, doesn't have a concrete implementation.)

        Show
        Steve Rowe added a comment - +1 (Too bad the useCompoundFile() and the related configuration getters&setters have to be implemented in both LogMergePolicy and TieredMergePolicy, since MergePolicy, their shared superclass, doesn't have a concrete implementation.)
        Hide
        Uwe Schindler added a comment -

        Updated patch:
        The setter had a documentation and overflow problem. I also fixed the other setters in Tiered to behave correctly when Double.POSITIVE_INFINITY is passed. Finally I added tests for the setters.

        I will commit this later!

        Show
        Uwe Schindler added a comment - Updated patch: The setter had a documentation and overflow problem. I also fixed the other setters in Tiered to behave correctly when Double.POSITIVE_INFINITY is passed. Finally I added tests for the setters. I will commit this later!
        Hide
        Uwe Schindler added a comment -

        Committed trunk revision: 1376766
        Committed 4.x revision: 1376767

        Thanks Alexey!

        Show
        Uwe Schindler added a comment - Committed trunk revision: 1376766 Committed 4.x revision: 1376767 Thanks Alexey!
        Hide
        Robert Muir added a comment -

        This does not actually work in megabytes: see my email response on the list to test failures.

        step1: ant test -Dtestcase=TestIndexWriterOnDiskFull -Dtests.method=testAddIndexOnDiskFull -Dtests.seed=17160CAE40DA6C29 -Dtests.slow=true -Dtests.locale=hr_HR -Dtests.timezone=Etc/GMT+12 -Dtests.file.encoding=UTF-8 -Dtests.verbose=true

        You can see all values are 8.796093022207999E12

        step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)!

        but now if you run the seed again, it passes!

        Show
        Robert Muir added a comment - This does not actually work in megabytes: see my email response on the list to test failures. step1: ant test -Dtestcase=TestIndexWriterOnDiskFull -Dtests.method=testAddIndexOnDiskFull -Dtests.seed=17160CAE40DA6C29 -Dtests.slow=true -Dtests.locale=hr_HR -Dtests.timezone=Etc/GMT+12 -Dtests.file.encoding=UTF-8 -Dtests.verbose=true You can see all values are 8.796093022207999E12 step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)! but now if you run the seed again, it passes!
        Hide
        Robert Muir added a comment -

        Keep in mind we should I think, still apply something like my patch to disk full tests that count MBytes, as we don't want this kicking in here I think.

        However i want to be absolutely positive we aren't hiding bugs before doing that.

        Show
        Robert Muir added a comment - Keep in mind we should I think, still apply something like my patch to disk full tests that count MBytes, as we don't want this kicking in here I think. However i want to be absolutely positive we aren't hiding bugs before doing that.
        Hide
        Uwe Schindler added a comment -

        You fix just hides the problem which is caused by another randomization in LTC:

        step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)!

        Here is the explanation:
        Your fix - (setting LMP's maxMerge) is not changing anything here - the max size is still Long.MAX_VALUE internally! The change is really that you change the order of newLogMergePolicy(),newIndexWriterConfig() [they were the other order before] -> this changes random numbers completely.
        What I changed in my patch was missing randomization in new LogMergePolicy for cfsRatio! Before it was always 0.1, preventing creating CFS on addIndexes at all! In the original seed without your patch, the cfsRatio was then randomized to ~0.8, causing CFS files to be created -> causing disk Full.

        The cause is simply additional randomization, missing before for newLogMergePolicy().

        The correct way to fix this test is to prevent CFS files at all: newLogMergePolicy(false).

        Should I commit that fix?

        Show
        Uwe Schindler added a comment - You fix just hides the problem which is caused by another randomization in LTC: step2: apply my patch, which forces this to instead be Double.POSITIVE_INFINITY instead of 8.796093022207999E12. This should really have no effect here, since the directories are tiny (all < 1MB)! Here is the explanation: Your fix - (setting LMP's maxMerge) is not changing anything here - the max size is still Long.MAX_VALUE internally! The change is really that you change the order of newLogMergePolicy(),newIndexWriterConfig() [they were the other order before] -> this changes random numbers completely. What I changed in my patch was missing randomization in new LogMergePolicy for cfsRatio! Before it was always 0.1, preventing creating CFS on addIndexes at all! In the original seed without your patch, the cfsRatio was then randomized to ~0.8, causing CFS files to be created -> causing disk Full. The cause is simply additional randomization, missing before for newLogMergePolicy(). The correct way to fix this test is to prevent CFS files at all: newLogMergePolicy(false). Should I commit that fix?
        Hide
        Uwe Schindler added a comment -

        Here the patch that fixes the test bug, explanation (why Roberts fix fixed the bug was just a side effect of his code change, not setting the maxMergeSize). The code committed here as fine!

        Show
        Uwe Schindler added a comment - Here the patch that fixes the test bug, explanation (why Roberts fix fixed the bug was just a side effect of his code change, not setting the maxMergeSize). The code committed here as fine!
        Hide
        Robert Muir added a comment -

        Thanks for digging Uwe: I just wanted to be sure here. +1 to commit this. I'm not sure if any of the other disk full methods might need this too.

        Show
        Robert Muir added a comment - Thanks for digging Uwe: I just wanted to be sure here. +1 to commit this. I'm not sure if any of the other disk full methods might need this too.
        Hide
        Robert Muir added a comment -

        this is actually fixed, I reopened it because of a jenkins fail, but it was only slightly related (more to Uwe's new test randomization)

        Show
        Robert Muir added a comment - this is actually fixed, I reopened it because of a jenkins fail, but it was only slightly related (more to Uwe's new test randomization)
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Alexey Lef
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development