Lucene - Core
  1. Lucene - Core
  2. LUCENE-2790

IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Spin off from here: http://www.gossamer-threads.com/lists/lucene/java-dev/112311.

      I will attach a patch shortly that addresses the issue on trunk.

      1. LUCENE-2790.patch
        23 kB
        Shai Erera
      2. LUCENE-2790.patch
        23 kB
        Shai Erera
      3. LUCENE-2790.patch
        21 kB
        Earwin Burrfoot
      4. LUCENE-2790.patch
        20 kB
        Earwin Burrfoot
      5. LUCENE-2790.patch
        16 kB
        Shai Erera
      6. LUCENE-2790.patch
        14 kB
        Earwin Burrfoot
      7. LUCENE-2790.patch
        4 kB
        Shai Erera
      8. LUCENE-2790-3x.patch
        26 kB
        Shai Erera

        Activity

        Hide
        Shai Erera added a comment -

        Patch applied on trunk. I took the opportunity to fix some minor Javadoc warnings as well.

        Show
        Shai Erera added a comment - Patch applied on trunk. I took the opportunity to fix some minor Javadoc warnings as well.
        Hide
        Earwin Burrfoot added a comment -

        Fails addIndexesWithThreads with ConcurrentModificationException, if MergePolicy actually tries to iterate infos passed to useCompoundFile(SIS, SI).

        Show
        Earwin Burrfoot added a comment - Fails addIndexesWithThreads with ConcurrentModificationException, if MergePolicy actually tries to iterate infos passed to useCompoundFile(SIS, SI).
        Hide
        Earwin Burrfoot added a comment - - edited

        Check this patch out.
        It changes useCompoundFile(SIS, SI) to respect noCFSRatio and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place.
        It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them.

        Show
        Earwin Burrfoot added a comment - - edited Check this patch out. It changes useCompoundFile(SIS, SI) to respect noCFSRatio and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place. It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them.
        Hide
        Shai Erera added a comment -

        test-core passed for me before I uploaded the patch. Can you please post here the 'ant test' command that reproduces it?

        I checked who implements useCompoundFile and all I find is LogMP and NoMP, both don't iterate on the SegmentInfos. What MP did you test with?

        Anyway, need to take a closer look at that. So if you can paste here the 'ant test' that reproduces it, it'd be great.

        Show
        Shai Erera added a comment - test-core passed for me before I uploaded the patch. Can you please post here the 'ant test' command that reproduces it? I checked who implements useCompoundFile and all I find is LogMP and NoMP, both don't iterate on the SegmentInfos. What MP did you test with? Anyway, need to take a closer look at that. So if you can paste here the 'ant test' that reproduces it, it'd be great.
        Hide
        Earwin Burrfoot added a comment -

        I checked who implements useCompoundFile and all I find is LogMP and NoMP, both don't iterate on the SegmentInfos. What MP did you test with?

        Apply my patch, it changes LogMP to use SegmentInfos.

        So if you can paste here the 'ant test' that reproduces it, it'd be great.

        ant test -Dtestcase=TestAddIndexes -Dtestmethod=testAddIndexesWithThreads -Dtests.seed=5369960668186287821:331425426639083833 -Dtests.codec=randomPerField
        The test is threaded, so it doesn't fail always.

        Show
        Earwin Burrfoot added a comment - I checked who implements useCompoundFile and all I find is LogMP and NoMP, both don't iterate on the SegmentInfos. What MP did you test with? Apply my patch, it changes LogMP to use SegmentInfos. So if you can paste here the 'ant test' that reproduces it, it'd be great. ant test -Dtestcase=TestAddIndexes -Dtestmethod=testAddIndexesWithThreads -Dtests.seed=5369960668186287821:331425426639083833 -Dtests.codec=randomPerField The test is threaded, so it doesn't fail always.
        Hide
        Shai Erera added a comment -

        Patch fixes the threading issue Earwin reported, by checking whether to create the CFS in a sync block. Also, after discussing this on IRC, the code is further simplified by creating the compound file before the new segment is committed.

        However, some tests still fail on ConcurrentModException. I cannot debug it now, so am posting the patch in case someone wants to take a stab. I can continue later. To reproduce the failure:

        ant test -Dtestcase=TestIndexWriter -Dtestmethod=testDeleteUnusedFiles -Dtests.seed=-1861905402886420424:-8896948763797565454 -Dtests.codec=randomPerField

        Show
        Shai Erera added a comment - Patch fixes the threading issue Earwin reported, by checking whether to create the CFS in a sync block. Also, after discussing this on IRC, the code is further simplified by creating the compound file before the new segment is committed. However, some tests still fail on ConcurrentModException. I cannot debug it now, so am posting the patch in case someone wants to take a stab. I can continue later. To reproduce the failure: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testDeleteUnusedFiles -Dtests.seed=-1861905402886420424:-8896948763797565454 -Dtests.codec=randomPerField
        Hide
        Earwin Burrfoot added a comment -

        Okay, this patch fixes remaining threading issue in IW.mergeMiddle,
        and three tests that were expecting CFS segments and weren't getting ones
        due to flush now respecting noCFSRatio and noCFSRatio default of 0.1

        Show
        Earwin Burrfoot added a comment - Okay, this patch fixes remaining threading issue in IW.mergeMiddle, and three tests that were expecting CFS segments and weren't getting ones due to flush now respecting noCFSRatio and noCFSRatio default of 0.1
        Hide
        Michael McCandless added a comment -

        Patch looks great!

        My only concern is... it looks like addIndexes(IR[]), with compound file used in the end, may fail to delete the non-compound files once the SegmentInfo is committed? Maybe we should add a test to show the failure...

        I think we need to do something like this:

                  // delete new non cfs files directly: they were never
                  // registered with IFD
                  deleter.deleteNewFiles(merger.getMergedFiles(merge.info));
        
        Show
        Michael McCandless added a comment - Patch looks great! My only concern is... it looks like addIndexes(IR[]), with compound file used in the end, may fail to delete the non-compound files once the SegmentInfo is committed? Maybe we should add a test to show the failure... I think we need to do something like this: // delete new non cfs files directly: they were never // registered with IFD deleter.deleteNewFiles(merger.getMergedFiles(merge.info));
        Hide
        Michael McCandless added a comment -

        Hmm... something is amiss. I hit this failure:

        ant test -Dtestcase=TestIndexSplitter -Dtestmethod=test -Dtests.seed=5299033587626573117:-25334708766924714 -Dtests.codec=randomPerField
        

        But it passes on trunk...

        Show
        Michael McCandless added a comment - Hmm... something is amiss. I hit this failure: ant test -Dtestcase=TestIndexSplitter -Dtestmethod=test -Dtests.seed=5299033587626573117:-25334708766924714 -Dtests.codec=randomPerField But it passes on trunk...
        Hide
        Earwin Burrfoot added a comment -

        Fixed your test failure

        Show
        Earwin Burrfoot added a comment - Fixed your test failure
        Hide
        Shai Erera added a comment -

        Patch looks good. All tests pass for me. Let's give it a couple more tries, to allow for random tests to catch us. It'd be good if you can try running them too.

        Show
        Shai Erera added a comment - Patch looks good. All tests pass for me. Let's give it a couple more tries, to allow for random tests to catch us. It'd be good if you can try running them too.
        Hide
        Earwin Burrfoot added a comment -

        Shai, what about:

        My only concern is... it looks like addIndexes(IR[]), with compound file used in the end, may fail to delete the non-compound files once the SegmentInfo is committed?

        I fixed everything else, but can't answer this question.

        Show
        Earwin Burrfoot added a comment - Shai, what about: My only concern is... it looks like addIndexes(IR[]), with compound file used in the end, may fail to delete the non-compound files once the SegmentInfo is committed? I fixed everything else, but can't answer this question.
        Hide
        Shai Erera added a comment -

        Attached adds a test to TestAddIndexes w/ the fix as Mike proposed. The test fails w/o the fix and passes w/ it.

        Also, I noticed that if I don't set noCFSRatio to 1.0, then the added segments are not converted to a CFS. That is because useCompoundFiles on LMP decides not to do that, because the size of the segment, which is 377 bytes, is more than 10% of the total index size, which is ... 0. I wonder if we should handle that case, or leave it as is - at some point, when more documents are added, that segment will be converted to a CFS.

        I think that means that the first few segments that will be flushed will remain in non CFS format. I'm fine w/ it, just making sure I understand this right.

        Show
        Shai Erera added a comment - Attached adds a test to TestAddIndexes w/ the fix as Mike proposed. The test fails w/o the fix and passes w/ it. Also, I noticed that if I don't set noCFSRatio to 1.0, then the added segments are not converted to a CFS. That is because useCompoundFiles on LMP decides not to do that, because the size of the segment, which is 377 bytes, is more than 10% of the total index size, which is ... 0. I wonder if we should handle that case, or leave it as is - at some point, when more documents are added, that segment will be converted to a CFS. I think that means that the first few segments that will be flushed will remain in non CFS format. I'm fine w/ it, just making sure I understand this right.
        Hide
        Shai Erera added a comment -

        Same patch, only uses MockAnalyzer and not WhitespaceAnalyzer (which failed compilation from command line).

        Show
        Shai Erera added a comment - Same patch, only uses MockAnalyzer and not WhitespaceAnalyzer (which failed compilation from command line).
        Hide
        Earwin Burrfoot added a comment -

        Ok, let's commit?

        There's no need to force first few commits to CFS. CFS' sole purporse is to keep number of simultaneously open files low. Not likely you gonna see frightening numbers with only a pair of segments in index.
        Later these segments are merged (and probably CFSed), so no worries.

        Show
        Earwin Burrfoot added a comment - Ok, let's commit? There's no need to force first few commits to CFS. CFS' sole purporse is to keep number of simultaneously open files low. Not likely you gonna see frightening numbers with only a pair of segments in index. Later these segments are merged (and probably CFSed), so no worries.
        Hide
        Michael McCandless added a comment -

        Ok, let's commit?

        +1

        Show
        Michael McCandless added a comment - Ok, let's commit? +1
        Hide
        Shai Erera added a comment -

        Do you see any back-compat issues w/ back-porting it to 3x? I'm thinking about the change in behavior of useCompoundFile in LMP which now factors is noCFSRatio. However, I see that noCFSRatio is in 3x's LMP and defaults to 0.1, which already changes behavior, so I think we can apply this change to 3x as well. What do you think?

        Show
        Shai Erera added a comment - Do you see any back-compat issues w/ back-porting it to 3x? I'm thinking about the change in behavior of useCompoundFile in LMP which now factors is noCFSRatio. However, I see that noCFSRatio is in 3x's LMP and defaults to 0.1, which already changes behavior, so I think we can apply this change to 3x as well. What do you think?
        Hide
        Shai Erera added a comment -

        Committed revision 1042101 to trunk.

        I will back port to 3x if you agree this isn't a backwards break.

        BTW, I did not add a CHANGES entry, because it's an internal optimization we've made to IndexWriter. Hmm .. maybe we should document the changes to LMP.useCompoundFile (that it now factors in the noCFSRatio)?

        Show
        Shai Erera added a comment - Committed revision 1042101 to trunk. I will back port to 3x if you agree this isn't a backwards break. BTW, I did not add a CHANGES entry, because it's an internal optimization we've made to IndexWriter. Hmm .. maybe we should document the changes to LMP.useCompoundFile (that it now factors in the noCFSRatio)?
        Hide
        Michael McCandless added a comment -

        I think we should document the change to LMP.useCompoundFile?

        But: I don't consider this a backwards break.

        How Lucene manages the index files is under-the-hood so we are free to change it.

        Show
        Michael McCandless added a comment - I think we should document the change to LMP.useCompoundFile? But: I don't consider this a backwards break. How Lucene manages the index files is under-the-hood so we are free to change it.
        Hide
        Shai Erera added a comment -

        How Lucene manages the index files is under-the-hood so we are free to change it.

        That's correct. However, sadly, the backwards tests do not agree with you . Because the runtime behavior has changed, the tests fail. If you try to call LMP.setNoCFSRation, you get a NoSuchMethodError because the tests are compiled against 3.0's source, where indeed it does not exist.

        I'm trying to resolve it by fetching the method using reflection, but this shows another problem w/ how we maintain the backwards tests.

        Show
        Shai Erera added a comment - How Lucene manages the index files is under-the-hood so we are free to change it. That's correct. However, sadly, the backwards tests do not agree with you . Because the runtime behavior has changed, the tests fail. If you try to call LMP.setNoCFSRation, you get a NoSuchMethodError because the tests are compiled against 3.0's source, where indeed it does not exist. I'm trying to resolve it by fetching the method using reflection, but this shows another problem w/ how we maintain the backwards tests.
        Hide
        Shai Erera added a comment -

        Backport to 3x. Note the reflection hack I had to use to make the backwards tests run. I don't commit yet - waiting for some response about the backwards tests. If you're ok with it, I'll commit.

        Show
        Shai Erera added a comment - Backport to 3x. Note the reflection hack I had to use to make the backwards tests run. I don't commit yet - waiting for some response about the backwards tests. If you're ok with it, I'll commit.
        Hide
        Uwe Schindler added a comment - - edited

        I would simply disable the tests. Reflection should only be used when mock classes are used that affect thousands of tests. There are already lots of tests disabled.

        Show
        Uwe Schindler added a comment - - edited I would simply disable the tests. Reflection should only be used when mock classes are used that affect thousands of tests. There are already lots of tests disabled.
        Hide
        Shai Erera added a comment -

        I don't mind disabling the tests, but I think we should discuss the bigger issue (on that thread on the mailing list). If we decide to make it a 'policy' to disable backwards tests that break due to legal changes to the API and behavior, let's at least reach a consensus.

        Show
        Shai Erera added a comment - I don't mind disabling the tests, but I think we should discuss the bigger issue (on that thread on the mailing list). If we decide to make it a 'policy' to disable backwards tests that break due to legal changes to the API and behavior, let's at least reach a consensus.
        Hide
        Shai Erera added a comment -

        Committed revision 1042948 (3x)

        I decided to keep the reflection hack for now, until we come up w/ a better decision. One of the tests which had to be fixed is TestBackwardsCompatibility which needs to be in backwards and I don't think we can delete it, even if it's tested by 'core' as well.

        Thanks Earwin and others for your comments and help !

        Show
        Shai Erera added a comment - Committed revision 1042948 (3x) I decided to keep the reflection hack for now, until we come up w/ a better decision. One of the tests which had to be fixed is TestBackwardsCompatibility which needs to be in backwards and I don't think we can delete it, even if it's tested by 'core' as well. Thanks Earwin and others for your comments and help !
        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development