Lucene - Core
  1. Lucene - Core
  2. LUCENE-5038

Don't call MergePolicy / IndexWriter during DWPT Flush

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.3, 6.0
    • Fix Version/s: 4.4, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      We currently consult the indexwriter -> merge policy to decide if we need to write CFS or not which is bad in many ways.

      • we should call mergepolicy only during merges
      • we should never sync on IW during DWPT flush
      • we should be able to make the decision if we need to write CFS or not before flush, ie. we could write parts of the flush directly to CFS or even start writing stored fields directly.
      • in the NRT case it might make sense to write all flushes to CFS to minimize filedescriptors independent of the index size.

      I wonder if we can use a simple boolean for this in the IWC and get away with not consulting merge policy. This would simplify concurrency a lot here already.

      1. LUCENE-5038.patch
        70 kB
        Simon Willnauer
      2. LUCENE-5038.patch
        70 kB
        Simon Willnauer
      3. LUCENE-5038.patch
        48 kB
        Simon Willnauer
      4. LUCENE-5038.patch
        16 kB
        Simon Willnauer
      5. LUCENE-5038.patch
        14 kB
        Simon Willnauer
      6. LUCENE-5038.patch
        13 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          I wonder if we can use a simple boolean for this in the IWC and get away with not consulting merge policy.

          +1

          Show
          Michael McCandless added a comment - I wonder if we can use a simple boolean for this in the IWC and get away with not consulting merge policy. +1
          Hide
          Simon Willnauer added a comment -

          here is a patch removing the call to MP

          Show
          Simon Willnauer added a comment - here is a patch removing the call to MP
          Hide
          Simon Willnauer added a comment -

          updated patch that removes IW#useCompoundFile entirely only access via MP...

          Show
          Simon Willnauer added a comment - updated patch that removes IW#useCompoundFile entirely only access via MP...
          Hide
          Michael McCandless added a comment -

          Patch looks great, thanks Simon!

          I spotted this typo: indexWirterConfig

          Show
          Michael McCandless added a comment - Patch looks great, thanks Simon! I spotted this typo: indexWirterConfig
          Hide
          Simon Willnauer added a comment -

          new patch with changes entry and several fixed typos. Yet, I renamed the setter / getter on IWC to DocumentsWriterUseCompoundFile instead using the name flush in there which implies that this decision is made during flush but it could be made earlier in the future.

          Show
          Simon Willnauer added a comment - new patch with changes entry and several fixed typos. Yet, I renamed the setter / getter on IWC to DocumentsWriterUseCompoundFile instead using the name flush in there which implies that this decision is made during flush but it could be made earlier in the future.
          Hide
          Michael McCandless added a comment -

          I don't understand the rename ...I think the original name is better?

          Ie, DocumentsWriter is an internal class (implementation detail); I don't think we should expose it in our public APIs?

          Show
          Michael McCandless added a comment - I don't understand the rename ...I think the original name is better? Ie, DocumentsWriter is an internal class (implementation detail); I don't think we should expose it in our public APIs?
          Hide
          Simon Willnauer added a comment -

          The reasoning here is that if I put in flush into the name it implies that this decision is done at flush time but it might be in the future that we make the decision earlier ie. before the first doc is added so we can write into the CFS directly. I don't want this implying API here so I have chosen a different name. I agree DW is an impl detail and naming is hard what about NewSegmentsUseCompoundFile?

          Show
          Simon Willnauer added a comment - The reasoning here is that if I put in flush into the name it implies that this decision is done at flush time but it might be in the future that we make the decision earlier ie. before the first doc is added so we can write into the CFS directly. I don't want this implying API here so I have chosen a different name. I agree DW is an impl detail and naming is hard what about NewSegmentsUseCompoundFile ?
          Hide
          Shai Erera added a comment -

          Why not just setUseCompoundFiles with a simple jdoc that explains this sets how new segments will be written? Why do we have to have newSegment in there, or flush/commit/close? Let's not over-complicate? setIDP doesn't say setIDPCalledOnCommitOrCloseOrOpen .

          Show
          Shai Erera added a comment - Why not just setUseCompoundFiles with a simple jdoc that explains this sets how new segments will be written? Why do we have to have newSegment in there, or flush/commit/close? Let's not over-complicate? setIDP doesn't say setIDPCalledOnCommitOrCloseOrOpen .
          Hide
          Michael McCandless added a comment -

          I'm a little confused: what "decision" are you referring to? Isn't this just checking the boolean setting?

          Or are you talking about a liveness question, ie whether this setting takes effect "live"?

          If it's a live-ness issue, I don't think we need to rename because of that: other settings are not "live" either. I.e. I think naming it with "flush" is important because it tells the user that this setting impacts flushing. How "live" the setting is, is less important for naming I think?

          I'd rather not invent new terminology (NewSegment) when we already have flush ...

          Show
          Michael McCandless added a comment - I'm a little confused: what "decision" are you referring to? Isn't this just checking the boolean setting? Or are you talking about a liveness question, ie whether this setting takes effect "live"? If it's a live-ness issue, I don't think we need to rename because of that: other settings are not "live" either. I.e. I think naming it with "flush" is important because it tells the user that this setting impacts flushing. How "live" the setting is, is less important for naming I think? I'd rather not invent new terminology (NewSegment) when we already have flush ...
          Hide
          Michael McCandless added a comment -

          Shai, the issue is that (with this issue) we will now have two separate settings/times for whether a segment uses compound file: on flush vs on merge.

          Ie we will no longer consult the MergePolicy on whether to use CFS for a newly flushed segment.

          Show
          Michael McCandless added a comment - Shai, the issue is that (with this issue) we will now have two separate settings/times for whether a segment uses compound file: on flush vs on merge. Ie we will no longer consult the MergePolicy on whether to use CFS for a newly flushed segment.
          Hide
          Shai Erera added a comment -

          Oh I see, sorry I didn't read through the issue.

          So there will be two settings, one on IWC, the other on MP? What if a user sets IWC to true and is unaware that MP may make such decision too, and actually but default MP decides to keep large segments as non-CFS? Won't that be confusing?

          How complicated is it to keep on MP only? Maybe MP can have the two settings (at least they'll be on the same class!). IW will read it in the ctor and never consult MP again for new flushed segments. User-wise, the settings are near each other...

          Show
          Shai Erera added a comment - Oh I see, sorry I didn't read through the issue. So there will be two settings, one on IWC, the other on MP? What if a user sets IWC to true and is unaware that MP may make such decision too, and actually but default MP decides to keep large segments as non-CFS? Won't that be confusing? How complicated is it to keep on MP only? Maybe MP can have the two settings (at least they'll be on the same class!). IW will read it in the ctor and never consult MP again for new flushed segments. User-wise, the settings are near each other...
          Hide
          Simon Willnauer added a comment -

          I think we should make this two different settings to be honest! Frist of all I think we should only call MP if we merge otherwise it's not a MP anymore? Further I think it is important to not imply any place where we check this in the name ie. if you have flush in the name we should check on flush and should not make the decision ahead of time which we might want to do in the future. For instance you have set it to true and then set it to false afterwards but we already checked the boolean and wrapped the Directory in a CFSDirectory which might be confusing.

          I actually think it's a good thing to have this in two different places since those are two different usecases ie. for merges you might not want CFS at all but for new writes you want it because NRT can flush / write lots of segments. The defaults on MP are for merging and the defaults on IWC are for indexing. Btw. I think useCompoundFileOnWrite is a good fit?!

          How complicated is it to keep on MP only?

          I'd really appreciate to move it out of MP. it's not a merge setting and I think it's rather confusing today since we take the merge setting and use it for indexing. Aside of this, I think this is pretty expert too. Adding a boolean for this to MP would feel odd and I think it's the wrong thing as well.

          Show
          Simon Willnauer added a comment - I think we should make this two different settings to be honest! Frist of all I think we should only call MP if we merge otherwise it's not a MP anymore? Further I think it is important to not imply any place where we check this in the name ie. if you have flush in the name we should check on flush and should not make the decision ahead of time which we might want to do in the future. For instance you have set it to true and then set it to false afterwards but we already checked the boolean and wrapped the Directory in a CFSDirectory which might be confusing. I actually think it's a good thing to have this in two different places since those are two different usecases ie. for merges you might not want CFS at all but for new writes you want it because NRT can flush / write lots of segments. The defaults on MP are for merging and the defaults on IWC are for indexing. Btw. I think useCompoundFileOnWrite is a good fit?! How complicated is it to keep on MP only? I'd really appreciate to move it out of MP. it's not a merge setting and I think it's rather confusing today since we take the merge setting and use it for indexing. Aside of this, I think this is pretty expert too. Adding a boolean for this to MP would feel odd and I think it's the wrong thing as well.
          Hide
          Shai Erera added a comment -

          I get what you're saying but I reviewed the patch and I think it shows how more complex it has become. It's not enough to set useCFS(true) on MP, but you also need to set it to IWC. So now we have 3 settings: mp.setUseCFS(), mp.setNoCFSRatio() and iwc.setUseCFSOnWrite. That's too much, no?

          Worse thing is, if I do want to control CFS, how can I tell that IWC is not enough? At least when everything is on MP, or on any other class I don't care, I get a hint from auto-complete or reading jdocs.

          I understand that the two settings are different, but the API is already there today and the changes don't simplify it IMO. If it's a concurrency thing you want to solve, IW / DWPT can read that setting in the ctor and never access MP again. I prefer to leave the API clear.

          Show
          Shai Erera added a comment - I get what you're saying but I reviewed the patch and I think it shows how more complex it has become. It's not enough to set useCFS(true) on MP, but you also need to set it to IWC. So now we have 3 settings: mp.setUseCFS(), mp.setNoCFSRatio() and iwc.setUseCFSOnWrite. That's too much, no? Worse thing is, if I do want to control CFS, how can I tell that IWC is not enough? At least when everything is on MP, or on any other class I don't care, I get a hint from auto-complete or reading jdocs. I understand that the two settings are different, but the API is already there today and the changes don't simplify it IMO. If it's a concurrency thing you want to solve, IW / DWPT can read that setting in the ctor and never access MP again. I prefer to leave the API clear.
          Hide
          Robert Muir added a comment -

          we should only call MP if we merge otherwise it's not a MP anymore?

          I totally agree with this. Put the one used at flush in FlushPolicy?

          Show
          Robert Muir added a comment - we should only call MP if we merge otherwise it's not a MP anymore? I totally agree with this. Put the one used at flush in FlushPolicy?
          Hide
          Shai Erera added a comment -

          What if we keep setUseCFS on IWC (or FlushPolicy) and it defines flush + merged segments (i.e. MP will be the one consulting IWC/FlushPolicy). And some MPs can have in addition setNoCFSRatio? How fine grained do we need to get with this?

          Show
          Shai Erera added a comment - What if we keep setUseCFS on IWC (or FlushPolicy) and it defines flush + merged segments (i.e. MP will be the one consulting IWC/FlushPolicy). And some MPs can have in addition setNoCFSRatio? How fine grained do we need to get with this?
          Hide
          Simon Willnauer added a comment -

          hey maybe we should not allow to set it at all and just write CFS by default?

          I really think this is not that much of a deal here, I kind of like the flush policy idea though.

          So now we have 3 settings: mp.setUseCFS(), mp.setNoCFSRatio() and iwc.setUseCFSOnWrite. That's too much, no?

          Thinking out loud... I couldn't agree more, I think mp.setUseCFS() should go away entirely and we should only use setNonCFSRatio() to control for merges. 0.0 would mean no CFS in that case. on IWC we just use the setting from the MP as the default ie. if you disable it we also disable it in the IWC unless it's set explicitly via a setter. That means we are back to 2 settings, MP implies the setting on IWC...

          Yet, the MP settings are impl details and not part of the interface at all (which is ok I think?) would we awesome if we could clean this up even further in the future

          Show
          Simon Willnauer added a comment - hey maybe we should not allow to set it at all and just write CFS by default? I really think this is not that much of a deal here, I kind of like the flush policy idea though. So now we have 3 settings: mp.setUseCFS(), mp.setNoCFSRatio() and iwc.setUseCFSOnWrite. That's too much, no? Thinking out loud... I couldn't agree more, I think mp.setUseCFS() should go away entirely and we should only use setNonCFSRatio() to control for merges. 0.0 would mean no CFS in that case. on IWC we just use the setting from the MP as the default ie. if you disable it we also disable it in the IWC unless it's set explicitly via a setter. That means we are back to 2 settings, MP implies the setting on IWC... Yet, the MP settings are impl details and not part of the interface at all (which is ok I think?) would we awesome if we could clean this up even further in the future
          Hide
          Shai Erera added a comment -

          I don't think we can tie MP and IWC. I.e. you normally do MP = new MP() followed by new IWC().setMP(). IWC ctor already needs to set some value for useCFS, so it defaults to say true, but when you setMP with say false, it's not going to IWC.

          I prefer that we keep that simple and seems that you and I agree that setCFS on MP should go away. So we end with IWC.setCFS (defaults to true) and MP.setNoCFSRatio (default is per impl?). And that's it? If want to turn off CFS entirely, you do it in two places, but that's really expert. Otherwise we default to CFS and you only need to decide whether you want merged segments to be CFS or not, which is the more important decision I think.

          Show
          Shai Erera added a comment - I don't think we can tie MP and IWC. I.e. you normally do MP = new MP() followed by new IWC().setMP(). IWC ctor already needs to set some value for useCFS, so it defaults to say true, but when you setMP with say false, it's not going to IWC. I prefer that we keep that simple and seems that you and I agree that setCFS on MP should go away. So we end with IWC.setCFS (defaults to true) and MP.setNoCFSRatio (default is per impl?). And that's it? If want to turn off CFS entirely, you do it in two places, but that's really expert. Otherwise we default to CFS and you only need to decide whether you want merged segments to be CFS or not, which is the more important decision I think.
          Hide
          Simon Willnauer added a comment -

          +1 I will change the patch!

          Show
          Simon Willnauer added a comment - +1 I will change the patch!
          Hide
          Michael McCandless added a comment -

          So we end with IWC.setCFS (defaults to true) and MP.setNoCFSRatio (default is per impl?).

          +1, I like this! Plus it sidesteps trying to name the new method

          Show
          Michael McCandless added a comment - So we end with IWC.setCFS (defaults to true) and MP.setNoCFSRatio (default is per impl?). +1, I like this! Plus it sidesteps trying to name the new method
          Hide
          Simon Willnauer added a comment -

          here is an updated patch. I removed setUseCompoundFile from the MPs now we only have the ratio on the MP and setUseCF on IWC. Yet, given that patch I think we can safely move the CFS related settings from LogMP and TieredMP into MergePolicy or add an intermediate class. The are all identical and we should consolidate this. Any objections if I pull them up to MP?

          Show
          Simon Willnauer added a comment - here is an updated patch. I removed setUseCompoundFile from the MPs now we only have the ratio on the MP and setUseCF on IWC. Yet, given that patch I think we can safely move the CFS related settings from LogMP and TieredMP into MergePolicy or add an intermediate class. The are all identical and we should consolidate this. Any objections if I pull them up to MP?
          Hide
          Shai Erera added a comment -

          Patch looks good. +1 to pull setNoCFSRatio to MP. It's a generic setting. We can default to 1.0 (always) in MP in order to not break MP impls out there. Also, Can you add documentation to IWC.setUseCFS to see MP.setNoCFSRatio for controlling CFS for merged segments?

          Show
          Shai Erera added a comment - Patch looks good. +1 to pull setNoCFSRatio to MP. It's a generic setting. We can default to 1.0 (always) in MP in order to not break MP impls out there. Also, Can you add documentation to IWC.setUseCFS to see MP.setNoCFSRatio for controlling CFS for merged segments?
          Hide
          Simon Willnauer added a comment -

          Patch looks good. +1 to pull setNoCFSRatio to MP. It's a generic setting. We can default to 1.0 (always) in MP in order to not break MP impls out there. Also, Can you add documentation to IWC.setUseCFS to see MP.setNoCFSRatio for controlling CFS for merged segments?

          I will use 1.0 as the default I guess I will also pull up setMaxCFSSegmentSizeMB and friends since they are redundant as well. I agree default should be 1.0 on MP but the TMP and LMP use their defaults or do I miss something?

          Show
          Simon Willnauer added a comment - Patch looks good. +1 to pull setNoCFSRatio to MP. It's a generic setting. We can default to 1.0 (always) in MP in order to not break MP impls out there. Also, Can you add documentation to IWC.setUseCFS to see MP.setNoCFSRatio for controlling CFS for merged segments? I will use 1.0 as the default I guess I will also pull up setMaxCFSSegmentSizeMB and friends since they are redundant as well. I agree default should be 1.0 on MP but the TMP and LMP use their defaults or do I miss something?
          Hide
          Shai Erera added a comment -

          TMP and LMP should use their defaults. I was suggesting a default for MP so that you don't need to make the method abstract and potentially break existing MP impls out there. 1.0 sounds like a good default.

          Show
          Shai Erera added a comment - TMP and LMP should use their defaults. I was suggesting a default for MP so that you don't need to make the method abstract and potentially break existing MP impls out there. 1.0 sounds like a good default.
          Hide
          Simon Willnauer added a comment -

          here is a new patch refactoring MP to have a default useCompoundFile impl and all the setters etc. I also added some more javadocs and fixed all tests that did redundant casts.

          Shai Erera do you mind taking another look

          Show
          Simon Willnauer added a comment - here is a new patch refactoring MP to have a default useCompoundFile impl and all the setters etc. I also added some more javadocs and fixed all tests that did redundant casts. Shai Erera do you mind taking another look
          Hide
          Michael McCandless added a comment -

          +1

          I found a couple typos in MergePolicy.java:

          • The default implemeantion retruns
          • Default ratio for compound file system useage.
          Show
          Michael McCandless added a comment - +1 I found a couple typos in MergePolicy.java: The default implemeantion retruns Default ratio for compound file system useage.
          Hide
          Shai Erera added a comment -

          Looks good!

          Do you think that setMaxCFSSegmentSizeMB should also set to Long.MAX_VALUE if v * 1024 * 1024 < 0 (i.e. overflows)?

          Show
          Shai Erera added a comment - Looks good! Do you think that setMaxCFSSegmentSizeMB should also set to Long.MAX_VALUE if v * 1024 * 1024 < 0 (i.e. overflows)?
          Hide
          Michael McCandless added a comment -

          Do you think that setMaxCFSSegmentSizeMB should also set to Long.MAX_VALUE if v * 1024 * 1024 < 0 (i.e. overflows)?

          Yes.

          Show
          Michael McCandless added a comment - Do you think that setMaxCFSSegmentSizeMB should also set to Long.MAX_VALUE if v * 1024 * 1024 < 0 (i.e. overflows)? Yes.
          Hide
          Simon Willnauer added a comment -

          updated patch & fixed spellings. I also added a check if v < 0 but don't see how this happens if a double overflows it becomes Positive Infinity no? I'd love to know if I miss something

          Show
          Simon Willnauer added a comment - updated patch & fixed spellings. I also added a check if v < 0 but don't see how this happens if a double overflows it becomes Positive Infinity no? I'd love to know if I miss something
          Hide
          Shai Erera added a comment -

          Hmm I think you're right Simon. According to several places I read on the web, a double cannot overflow to negative, but it will be +INF. +1 to commit.

          Show
          Shai Erera added a comment - Hmm I think you're right Simon. According to several places I read on the web, a double cannot overflow to negative, but it will be +INF. +1 to commit.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] simonw
          http://svn.apache.org/viewvc?view=revision&revision=1492701

          LUCENE-5038: Refactor CompoundFile settings in MergePolicy and IndexWriterConfig

          Show
          Commit Tag Bot added a comment - [trunk commit] simonw http://svn.apache.org/viewvc?view=revision&revision=1492701 LUCENE-5038 : Refactor CompoundFile settings in MergePolicy and IndexWriterConfig
          Hide
          Simon Willnauer added a comment -

          I will let it bake in a bit and then backport tomorrow. I'd deprecate the setUseCFS on MP in branch_4x and just delegate to setNoCFSRatio(1.0 | 0.0) if nobody objects?

          Show
          Simon Willnauer added a comment - I will let it bake in a bit and then backport tomorrow. I'd deprecate the setUseCFS on MP in branch_4x and just delegate to setNoCFSRatio(1.0 | 0.0) if nobody objects?
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] simonw
          http://svn.apache.org/viewvc?view=revision&revision=1492770

          LUCENE-5038: Fix test, useCompoundFile doesn't exists anymore

          Show
          Commit Tag Bot added a comment - [trunk commit] simonw http://svn.apache.org/viewvc?view=revision&revision=1492770 LUCENE-5038 : Fix test, useCompoundFile doesn't exists anymore
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] simonw
          http://svn.apache.org/viewvc?view=revision&revision=1493022

          LUCENE-5038: Fix test to reliably work with all codecs / posting formats

          Show
          Commit Tag Bot added a comment - [trunk commit] simonw http://svn.apache.org/viewvc?view=revision&revision=1493022 LUCENE-5038 : Fix test to reliably work with all codecs / posting formats
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] simonw
          http://svn.apache.org/viewvc?view=revision&revision=1493225

          LUCENE-5038: Refactor CompoundFile settings in MergePolicy and IndexWriterConfig

          Show
          Commit Tag Bot added a comment - [branch_4x commit] simonw http://svn.apache.org/viewvc?view=revision&revision=1493225 LUCENE-5038 : Refactor CompoundFile settings in MergePolicy and IndexWriterConfig
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] simonw
          http://svn.apache.org/viewvc?view=revision&revision=1493238

          LUCENE-5038: Set useCFS on IWC in backwards-compatibility tests

          Show
          Commit Tag Bot added a comment - [branch_4x commit] simonw http://svn.apache.org/viewvc?view=revision&revision=1493238 LUCENE-5038 : Set useCFS on IWC in backwards-compatibility tests
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] simonw
          http://svn.apache.org/viewvc?view=revision&revision=1493319

          LUCENE-5038: Disable CFS on IWC for TestTermInfosReaderIndex non-CFS files are expected

          Show
          Commit Tag Bot added a comment - [branch_4x commit] simonw http://svn.apache.org/viewvc?view=revision&revision=1493319 LUCENE-5038 : Disable CFS on IWC for TestTermInfosReaderIndex non-CFS files are expected
          Hide
          Shalin Shekhar Mangar added a comment -

          This changed the LuceneTestCase.newLogMergePolicy method in a non backward compatible way.

          -  public static LogMergePolicy newLogMergePolicy(boolean useCFS, int mergeFactor) {
          +  public static MergePolicy newLogMergePolicy(boolean useCFS, int mergeFactor) {
          
          Show
          Shalin Shekhar Mangar added a comment - This changed the LuceneTestCase.newLogMergePolicy method in a non backward compatible way. - public static LogMergePolicy newLogMergePolicy( boolean useCFS, int mergeFactor) { + public static MergePolicy newLogMergePolicy( boolean useCFS, int mergeFactor) {
          Hide
          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues

          Show
          Steve Rowe added a comment - Bulk close resolved 4.4 issues

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development