Lucene - Core
LUCENE-845

If you "flush by RAM usage" then IndexWriter may over-merge

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think a good way to maximize performance of Lucene's indexing for a
      given amount of RAM is to flush (writer.flush()) the added documents
      whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
      RAM you can afford.
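
      For example, the add loop might look like the sketch below (method names
      as above; the 32 MB cap and the addAll wrapper are just illustrative, not
      part of Lucene):

        import java.io.IOException;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.index.IndexWriter;

        // Flush whenever the writer's buffered RAM crosses the cap.
        void addAll(IndexWriter writer, Iterable<Document> docs) throws IOException {
          final long maxRamBytes = 32 * 1024 * 1024;
          for (Document doc : docs) {
            writer.addDocument(doc);
            if (writer.ramSizeInBytes() > maxRamBytes) {
              writer.flush();
            }
          }
        }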

      But, this can confuse the merge policy and cause over-merging, unless
      you set maxBufferedDocs properly.

      This is because the merge policy looks at the current maxBufferedDocs
      to figure out which segments are level 0 (first flushed) or level 1
      (merged from <mergeFactor> level 0 segments).

      I'm not sure how to fix this. Maybe we can look at net size (bytes)
      of a segment and "infer" level from this? Still we would have to be
      resilient to the application suddenly increasing the RAM allowed.

      The good news is that to work around this bug, I think you just need to
      ensure that your maxBufferedDocs is less than mergeFactor *
      typical-number-of-docs-flushed.
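
      For example (a sketch, with illustrative values): if a RAM flush typically
      produces ~100-doc segments and mergeFactor is 10, keep maxBufferedDocs
      under 10 * 100 = 1000:

        writer.setMergeFactor(10);
        writer.setMaxBufferedDocs(900);  // below mergeFactor * typical flush size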

      Attachments

      1. LUCENE-845.patch (13 kB), Michael McCandless

          Activity

          Michael McCandless added a comment -

          This bug is actually rather serious.

          If you set maxBufferedDocs to a very large number (on the expectation
          that it's not used since you will manually flush by RAM usage) then
          the merge policy will always merge the index down to 1 segment as soon
          as it hits mergeFactor segments.

          This will be an O(N^2) slowdown. EG if based on RAM you are
          flushing every 100 docs, then at 1000 docs you will merge to 1
          segment. Then at 1900 docs, you merge to 1 segment again. At 2800,
          3700, 4600, ... (every 900 docs) you keep merging to 1 segment. Your
          indexing process will get very slow because every 900 documents the
          entire index is effectively being optimized.
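
           As a rough back-of-the-envelope illustration (not Lucene code; the
           numbers mirror the example above), the docs copied by merging grow
           quadratically:

             public class MergeCostSketch {
               public static void main(String[] args) {
                 int flushSize = 100, mergeFactor = 10, totalDocs = 10000;
                 long docsCopied = 0;
                 int mergedDocs = 0;      // docs in the single big merged segment
                 int smallSegments = 0;   // small segments flushed since the last merge
                 for (int added = flushSize; added <= totalDocs; added += flushSize) {
                   smallSegments++;
                   int segments = (mergedDocs > 0 ? 1 : 0) + smallSegments;
                   if (segments == mergeFactor) {
                     mergedDocs += smallSegments * flushSize; // merge everything to 1 segment
                     docsCopied += mergedDocs;                // whole index is rewritten
                     smallSegments = 0;
                   }
                 }
                 System.out.println("docs copied by merges: " + docsCopied); // ~O(N^2) growth
               }
             }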

          With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs
          entirely and switch to flushing by RAM usage instead (you can always
          manually flush every N documents in your app if for some reason you
          need that). But obviously we need to resolve this bug first.

          Yonik Seeley added a comment -

          > With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs entirely

           That might present a problem for users of ParallelReader. Right now, it's possible to construct two indices with corresponding docids.... switching to flush-by-ram would make segment merging unpredictable and destroy the docid matching.

          Michael McCandless added a comment -

          >> With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs entirely
          >
          > That might present a problem for users of ParallelReader. Right now,
           > it's possible to construct two indices with corresponding
           > docids.... switching to flush-by-ram would make segment merging
          > unpredictable and destroy the docid matching.

          Ahhh, this is a very good point. OK I won't deprecate "flushing by
          doc count" and instead will allow either "flush by RAM usage" (default
          to this?) or "flush by doc count".

          Michael McCandless added a comment -

          Just recapping some following discussion from java-dev ...

          The current merge policy can be thought of logically as two different
          steps:

          1. How to determine the "level" of each segment in the index.

           2. How & when to pick which level N segments to merge into a level
           N+1 segment.

          The current policy determines a segment's level by looking at the doc
          count in the segment as well as the current maxBufferedDocs, which is
          very problematic when you "flush by RAM usage" instead. This Jira
          issue, then, is proposing to instead look at overall byte size of a
           segment for determining its level, while keeping step 2 above.

          However, I would propose we also fix LUCENE-854 (which addresses step
          2 above and not step 1) at the same time, as a single merge policy,
          and maybe at some point in the future make this new merge policy the
          default merge policy.

          Steven Parkes added a comment -

           Following up on this, it's basically the idea that segments ought to be created/merged either by-segment-size or by-doc-count but not by a mixture? That wouldn't be surprising ...

          It does impact the APIs, though. It's easy enough to imagine, with factored merge policies, both by-doc-count and by-segment-size policies. But the initial segment creation is going to be handled by IndexWriter, so you have to manually make sure you don't set that algorithm and the merge policy in conflict. Not great, but I don't have any great ideas. Could put in an API handshake, but I'm not sure if it's worth the mess?

          Also, it sounds like, so far, there's no good way of managing parallel-reader setups w/by-segment-size algorithms, since the algorithm for creating/merging segments has to be globally consistent, not just per index, right?

          If that is right, what does that say about making by-segment-size the default? It's gonna break (as in bad results) people relying on that behavior that don't change their code. Is there a community consensus on this? It's not really an API change that would cause a compile/class-load failure, but in some ways, it's worse ...

          Michael McCandless added a comment -

           > Following up on this, it's basically the idea that segments ought to be created/merged either by-segment-size or by-doc-count but not by a mixture? That wouldn't be surprising ...

          Right, but we need the refactored merge policy framework in place
          first. I'll mark this issue dependent on LUCENE-847.

          > It does impact the APIs, though. It's easy enough to imagine, with factored merge policies, both by-doc-count and by-segment-size policies. But the initial segment creation is going to be handled by IndexWriter, so you have to manually make sure you don't set that algorithm and the merge policy in conflict. Not great, but I don't have any great ideas. Could put in an API handshake, but I'm not sure if it's worth the mess?

          Good question. I think it's OK (at least for our first go at this –
          progress not perfection!) to expect the developer to choose a merge
          policy and then to use IndexWriter in a way that's "consistent" with
          that merge policy? I think it's going to get too complex if we try to
          formally couple "when to flush/commit" with the merge policy?

          But, I think the default merge policy needs to be resilient to people
           doing things like changing maxBufferedDocs/mergeFactor partway through
          an index, calling flush() whenever they want, etc. The merge policy
          today is not resilient to these "normal" usages of IndexWriter. So I
          think we need to do something here even without the pressure from
          LUCENE-843.

          > Also, it sounds like, so far, there's no good way of managing parallel-reader setups w/by-segment-size algorithms, since the algorithm for creating/merging segments has to be globally consistent, not just per index, right?

          Right. We clearly need to keep the current "by doc" merge policy
          easily available for this use case.

          > If that is right, what does that say about making by-segment-size the default? It's gonna break (as in bad results) people relying on that behavior that don't change their code. Is there a community consensus on this? It's not really an API change that would cause a compile/class-load failure, but in some ways, it's worse ...

          I think there are actually two questions here:

          1) What exactly makes for a good default merge policy?

          I think the merge policy we have today has some limitations:

          • It's not resilient to "normal" usage of the public APIs in
            IndexWriter. If you call flush() yourself, if you change
            maxBufferedDocs (and maybe mergeFactor?) partway through an
            index, etc, you can cause disastrous amounts of over-merging
            (that's this issue).

          I think the default policy should be entirely resilient to
          any usage of the public IndexWriter APIs.

           • Default merge policy should strive to minimize net cost
             (amortized over time) of merging, but the current one
             doesn't:
             • When docs differ in size (frequently the case) it will be
               too costly in CPU/IO consumption because small segments are
               merged with large ones.
             • It does too much work in advance (too much "pay it
               forward"). I don't think a merge policy should
               "inadvertently optimize" (I opened LUCENE-854 to describe
               this).
           • It blocks LUCENE-843 (flushing by RAM usage).

          I think Lucene "out of the box" should give you good indexing
          performance. You should not have to do extra tuning to get
          substantially better performance. The best way to get that
          is to "flush by RAM" (with LUCENE-843). But current merge
          policy prevents this (due to this issue).

          2) Can we change the default merge policy?

          I sure hope so, given the issues above.

          I think the majority of Lucene users do the simple "create a
          writer, add/delete docs, close writer, while reader(s) use the
          same index" type of usage and so would benefit by the gained
          performance of LUCENE-843 and LUCENE-854.

          I think (but may be wrong!) it's a minority who use
          ParallelReader and therefore have a reliance on the specific
          merge policy we use today?

          Ideally we first commit the "decouple merge policy from IndexWriter"
          (LUCENE-847), then we would make a new merge policy that resolves this
          issue and LUCENE-854, and make it the default policy. I think this
          policy would look at size (in bytes) of each segment (perhaps
          proportionally reducing # bytes according to pending deletes against
          that segment), and would merge any adjacent segments (not just
          rightmost ones) that are "the most similar" in size. I think it would
          merge N (configurable) at a time and at no time would inadvertently
          optimize.

          This would mean users of ParallelReader on upgrading to this would
          need to change their merge policy to the legacy "merge by doc count"
          policy.

          Michael McCandless added a comment -

          First cut patch. You have to first apply the most recent patch from
          LUCENE-847:

          https://issues.apache.org/jira/secure/attachment/12363880/LUCENE-847.patch.txt

          and then apply this patch over it.

          This patch has two merge policies:

          LogDocMergePolicy

          This is "backwards compatible" to current merge policy, yet,
          resolve this "over-merge issue" by not using the current setting
          of "maxBufferedDocs" when computing levels. I think it should
          replace the current LogDocMergePolicy from LUCENE-847.

          LogByteSizeMergePolicy

          Chooses merges according to net size in bytes of all files for a
          segment. I think we should make this one the default merge
          policy, and also change IndexWriter to flush by RAM by default.

          They both subclass from abstract base LogMergePolicy and differ only
          in the "size" method which defines how you measure a segment's size (#
          docs in that segment or net size in bytes of that segment).
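
           Roughly, the shape described is something like this sketch (class and
           field names here are stand-ins, not the patch's actual code):

             abstract class LogMergePolicySketch {
               int mergeFactor = 10;
               abstract long size(SegmentStats seg);   // the only difference
             }
             class DocCountPolicySketch extends LogMergePolicySketch {
               long size(SegmentStats seg) { return seg.docCount; }     // LogDocMergePolicy
             }
             class ByteSizePolicySketch extends LogMergePolicySketch {
               long size(SegmentStats seg) { return seg.sizeInBytes; }  // LogByteSizeMergePolicy
             }
             class SegmentStats { long docCount; long sizeInBytes; }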

          The gist of the approach is the same as the current merge policy: you
          generally try to merge segments that are "roughly" the same size
          (where size can be doc count or byte size), mergeFactor at a time.

          The big difference is instead of starting from maxBufferedDocs and
          "going up" to determine level, I start from the max segment size (of
          all segments in the index) and "go down" to determine level. This
          resolves the bug because levels are "self-defined" by the segments,
          rather than by the current value of maxBufferedDocs on IndexWriter.

          I then pick merges exactly the same as the current merge policy: if
          any level has >= mergeFactor segments, we merge them.
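
           As a rough illustration of "going down from the max" (a sketch only,
           not the patch's code):

             // Each segment's level is derived from its own size (doc counts
             // here); a run of >= mergeFactor segments whose levels fall in the
             // same band gets merged, independent of maxBufferedDocs.
             double mergeFactor = 10.0;
             long[] sizes = { 1000, 950, 90, 85, 80, 9, 10, 8, 11, 9, 7, 10, 12, 9, 8 };
             double[] levels = new double[sizes.length];
             for (int i = 0; i < sizes.length; i++) {
               levels[i] = Math.log(Math.max(1, sizes[i])) / Math.log(mergeFactor);
             }
             // Walking down from the largest level in bands of width ~1, the ten
             // smallest segments above share one band and would merge together.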

          All tests pass, except:

          • One assert in testAddIndexesNoOptimize which was relying on the
            specific invariants of the current merge policy (it's the same
            assert that LUCENE-847 had changed; this assert is testing
            particular corner cases of the current merge policy). Changing
            the assertEquals to "4" instead of "3" fixes it.
          • TestLogDocMergePolicy (added in LUCENE-847) doesn't compile
            against the new version above because it's using methods that
            don't exist in the new one.
          Steven Parkes added a comment -

          This increases file descriptor usage in some cases, right? In the old scheme, if you set mergeFactor to 10 and maxBufferedDocs to 1000, you'd only get 10 segments with size <= 1000. But with this code, you can't bound that anymore. If I create single doc segments (perhaps by flushing based on latency), I can get 30 of them?

           Of course, if what we're trying to do is manage the number of file descriptors, we should just do that, rather than using maxBufferedDocs as a proxy (with all its nasty over-merging behavior).

          Michael McCandless added a comment -

          > This increases file descriptor usage in some cases, right? In the
          > old scheme, if you set mergeFactor to 10 and maxBufferedDocs to
          > 1000, you'd only get 10 segments with size <= 1000. But with this
          > code, you can't bound that anymore. If I create single doc segments
          > (perhaps by flushing based on latency), I can get 30 of them?

          Right, the # segments allowed in the index will be more than it is w/
          the current merge policy if you consistently flush with [far] fewer
          docs than maxBufferedDocs is set to.

           But, this is actually the essence of the bug. The case we're trying
          to fix is where you set maxBufferedDocs to something really large (say
          1,000,000) to avoid flushing by doc count, and you setRamBufferSizeMB
          to something like 32 MB. In this case, the current merge policy would
          just keep merging any set of 10 segments with < 1,000,000 docs each,
          such that eventually all your indexing time is being spent doing
          highly sub-optimal merges.

          Steven Parkes added a comment -

          I understand the merge problem but I'm still concerned about the increased number of file descriptors. Is this a concern?

          It seems like there are ways of approaching this, that might be able to fix both problems?

          For example, right now (pre-fix), if you have maxBufferedDocs set to 1000, mergeFactor set to 10, and add (for the sake of obvious example) 10 single doc segments, it's going to do a merge to one segment of size 1010, which is not great.

          One solution to this would be in cases like this to merge the small segments to one but not include the big segments. So you get [1000 10] where the last segment keeps growing until it reaches 1000. This does more copies than the current case, but always on small segments, with the advantage of a lower bound on the number of file descriptors?

          Of course, if no one's worried about this "moderate" (not exactly large, not exactly small) change in file descriptor usage, then it's not a big deal. It doesn't impact my work but I'm not sure about the greater community.

          Yonik Seeley added a comment -

           Is there a change in file descriptor use if you don't use setRamBufferSizeMB?

          Michael McCandless added a comment -

          > Is there a change in filedescriptor use if you don't use setRamBufferSizeMB?

          Yes. EG, if you set maxBufferedDocs to 1000 but then flush after
          every added doc, and you add 1000 docs, with the current merge policy,
          every 10 flushes you will merge all segments together. Ie, first
          segment has 10 docs, then 20, 30, 40, 50, ..., 1000. This is where
          O(N^2) cost on merging comes from. But, you will never have more than
          10 segments in your index.

          Whereas the new merge policy will make levels (segments of size 100,
          10, 1) and merge only segments from the same level together. So merge
          cost will be much less (not O(N^2)), but, you will have more max segments
          in the index (up to 1 + (mergeFactor-1) * log_mergeFactor(numDocs)),
          or 28 segments in this example (I think).

          Basically the new merge policy tries to make levels "all the way
          down" rather than forcefully stopping when the levels get smaller than
          maxBufferedDocs, to avoid the O(N^2) merge cost.

          > One solution to this would be in cases like this to merge the small
          > segments to one but not include the big segments. So you get [1000
          > 10] where the last segment keeps growing until it reaches 1000. This
          > does more copies than the current case, but always on small
          > segments, with the advantage of a lower bound on the number of file
          > descriptors?

          I'm not sure that helps? Because that "small segment" will have to
          grow bit by bit up to 1000 (causing the O(N^2) cost).

          Note that the goal here is to be able to switch to flushing by RAM
          buffer size instead of docCount (and also merge by byte-size of
          segments not doc count), by default, in IndexWriter. But, even once
          we do that, if you always flush tiny segments the new merge policy
          will still build levels "all the way down".

          Here's an idea: maybe we can accept the O(N^2) merge cost, when the
          segments are "small"? Ie, maybe doing 100 sub-optimal merges (in the
          example above) does not amount to that much actual cost in practice.
           (After all, nobody has complained about this.)

          I will run some tests. Clearly at some point the O(N^2) cost will
          dominate your indexing time, but maybe we can set a "rough" docCount
           below which all segments are counted as a single level, without
           taking too much of an indexing performance hit.

          Yonik Seeley added a comment -

          You may avoid the cost of a bunch of small merges, but then you pay the price in searching performance. I'm not sure that's the right tradeoff because if someone wanted to optimize for indexing performance, they would do more in batches.

          How does this work when flushing by MB? If you set setRamBufferSizeMB(32), are you guaranteed that you never have more than 10 segments less than 32MB (ignoring LEVEL_LOG_SPAN for now) if mergeFactor is 10?

          Almost seems like we need a minSegmentSize parameter too - using setRamBufferSizeMB confuses two different but related issues.

          Steven Parkes added a comment -

           > Here's an idea: maybe we can accept the O(N^2) merge cost, when the
           > segments are "small"?

          That's basically the underlying idea I was trying to get at.

          Michael McCandless added a comment -

          > You may avoid the cost of a bunch of small merges, but then you pay
          > the price in searching performance. I'm not sure that's the right
          > tradeoff because if someone wanted to optimize for indexing
          > performance, they would do more in batches.

          Agreed.

          It's like we would want to run "partial optimize" (ie, merge the tail
          of "small" segments) on demand, only when a reader is about to
          refresh.

          Or here's another random idea: maybe IndexReaders should load the tail
          of "small segments" into a RAMDirectory, for each one. Ie, an
          IndexReader is given a RAM buffer "budget" and it spends it on any
          numerous small segments in the index....?

          > How does this work when flushing by MB? If you set
          > setRamBufferSizeMB(32), are you guaranteed that you never have more
          > than 10 segments less than 32MB (ignoring LEVEL_LOG_SPAN for now) if
          > mergeFactor is 10?

          No, we have the same challenge of avoiding O(N^2) merge cost. When
          merging by "byte size" of the segments, I don't look at the current
          RAM buffer size of the writer.

          I feel like there should be a strong separation of "flush params" from
          "merge params".

          > Almost seems like we need a minSegmentSize parameter too - using
          > setRamBufferSizeMB confuses two different but related issues.

          Exactly! I'm thinking that I add "minSegmentSize" to the
          LogMergePolicy, which is separate from "maxBufferedDocs" and
          "ramBufferSizeMB". And, we default it to values that seem like an
          "acceptable" tradeoff of the cost of O(N^2) merging (based on tests I
          will run) vs the cost of slowdown to readers...

          I'll run some perf tests. O(N^2) should be acceptable for certain
          segment sizes....

          Michael McCandless added a comment -

          > > Here's an idea: maybe we can accept the O(N^2) merge cost, when
          > > the segments are "small"?
          >
          > That's basically the underlying idea I was trying to get at.

          Ahh, good! We agree

          Michael McCandless added a comment -

          > Or here's another random idea: maybe IndexReaders should load the
          > tail of "small segments" into a RAMDirectory, for each one. Ie, an
          > IndexReader is given a RAM buffer "budget" and it spends it on any
          > numerous small segments in the index....?

          Following up on this ... I think IndexReader could load "small tail
          segments" into RAMDirectory and then do a merge on them to make
          search even faster. It should typically be extremely fast if we set the
          defaults right, and RAM usage should be quite low since merging
          small segments usually gives great compression in net bytes used.

          This would allow us to avoid (or, minimize) the O(N^2) cost on merging
          to ensure that an index is "at all instants" ready for a reader to
          directly load it. This basically gives us our "merge tail segments on
          demand when a reader refreshes".

          We can do a combination of these two approaches, whereby the
           IndexWriter is free to make use of a "long tail" of segments so it
          doesn't have O(N^2) slowdown on merge cost, yet a reader pays very
          small (one-time) cost for such segments.

          I think the combination of these two changes should give a net/net
          sizable improvement on "low latency" apps.... because IndexWriter is
          free to make miniscule segments (document by document even) and
          IndexReader (especially combined with LUCENE-743) can quickly
          re-open and do a "mini-optimize" on the tail segments and have
          great performance.

          Steven Parkes added a comment -

           > I think the combination of these two changes should give a net/net
           > sizable improvement on "low latency" apps....

          I think this would be great. It's always been a pet peeve of mine that even in low pressure/activity environments, there is often a delay from write to read.

          Sounds like this would help take most of the work/risk off the developer.

          Michael McCandless added a comment -

          > I think this would be great. It's always been a pet peeve of mine
          > that even in low pressure/activity environments, there is often a
          > delay from write to read.

          I'll open a new issue.

          > Sounds like this would help take most of the work/risk off the
          > developer.

          Precisely! Out of the box we can have very low latency from
          IndexWriter flushing single doc segments, and not having to pay the
          O(N^2) merge cost of merging down such segments to be "at all moments"
          ready for an IndexReader to open the index, while IndexReader can load
          such an index (or re-open by loading only the "new" segments) and very
          quickly reduce the # segments so that searching is still fast.

          Yonik Seeley added a comment -

           Merging small segments in the reader seems like a cool idea on its own.
          But if it's an acceptable hit to merge in the reader, why is it not in the writer?

          Think about a writer flushing 10 small segments and a new reader opened each time:
          The reader would do ~10*10/2 merges if it just merged the small segments.
          If the writer were to do the merging instead, it would need to merge ~10 segments.

           Thinking about it another way... if there were no separation between reader and writer, and small segments were merged on an open, why not just write out the result so it wouldn't have to be done again? Now move "merge on an open" to "merge on the close" and that's what IndexWriter currently does. Why is it OK for a reader to pay the price but not the writer?

          Also, would this tail merging on an open be able to reduce the peak number of file descriptors?
          It seems like to do so, the tail would have to be merged before other index files were opened, further complicating matters.

          Michael McCandless added a comment -

          > Merging small segments in the reader seems like a cool idea on it's
          > own. But if it's an acceptable hit to merge in the reader, why is
          > it not in the writer?

          Good point. I think it comes down to how often we expect readers to
          refresh vs writers flushing.

          If indeed it's 1 to 1 (as the truest "low latency" app would in fact
          be, or a "single writer + reader with no separation"), then the writer
          should merge them because although it's paying an O(N^2) cost to keep
          the tail "short", merging on open would pay even more cost.

          But if writer flushes frequently and reader re-opens less frequently
          then it's better to merge on open.

          Of course, if the O(N^2) cost for IndexWriter to keep a short tail is
          in practice not too costly then we should just leave this in
          IndexWriter. I still need to run that test for LUCENE-845.

          > Also, would this tail merging on an open be able to reduce the peak
          > number of file descriptors? It seems like to do so, the tail would
          > have to be merged before other index files were opened, further
          > complicating matters.

           Right, I think to keep peak descriptor usage capped we must merge the
           tail first, then open the remaining segments, which definitely
           complicates things...

          Yonik Seeley added a comment -

          > But if writer flushes frequently and reader re-opens less frequently
          > then it's better to merge on open.

           Seems like an odd case though, because if readers aren't opened frequently, then it's a waste to flush small segments so often (and much slower for the writer than not doing so).

          Michael McCandless added a comment -

          Agreed. OK, I think this is a dead end: it adds complexity and won't
          help in "typical" uses of Lucene.

          So ... my plan of action is to assess the "actual" O(N^2) cost for
          IndexWriter to keep the tail short, add a parameter to LogMergePolicy
          so that it "floors" the level and always merges segments less than
          this floor together, despite the O(N^2) cost. And then pick a
          reasonable default for this floor.

          Michael McCandless added a comment -

          In the latest patch on LUCENE-847 I've added methods to
          LogDocMergePolicy (setMinMergeDocs) and LogByteSizeMergePolicy
          (setMinMergeMB) to set a floor on the segment levels such that all
          segments below this min size are aggressively merged as if they were in
          one level. This effectively "truncates" what would otherwise be a
          long tail of segment sizes, when you are flushing many tiny segments
          into your index.

          In order to pick reasonable defaults for the min segment size, I ran
          some benchmarks to measure the indexing cost of truncating the tail.

          I processed Wiki content into ~4 KB plain text documents and then
          indexed the first 10,000 docs using this alg:

          analyzer=org.apache.lucene.analysis.SimpleAnalyzer
          doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
          directory=FSDirectory
          docs.file=/lucene/wiki4K.txt
          max.buffered = 500

          ResetSystemErase
          CreateIndex
          {AddDoc >: 10000
          CloseIndex

          RepSumByName

          I'm using the SerialMergeScheduler.

          I modified contrib/benchmark to always flush a new segment after each
          added document: this simulates the "worst case" of tiny segments, ie,
          lowest latency indexing where every added doc must then be visible to
          searchers.

           Each time is the best of 2 runs. This was run on a Linux (2.6.22.1)
           Core 2 Duo 2.4 GHz machine with 4 GB RAM and a RAID 5 IO system,
           using Java 1.5 -server.

           maxBufferedDocs   seconds   slowdown
                        10        40        1.0
                       100        50        1.3
                       200        59        1.5
                       300        64        1.6
                       400        72        1.8
                       500        80        2.0
                       750        97        2.4
                      1000       114        2.9
                      1500       138        3.5
                      2000       169        4.2
                      3000       205        5.1
                      4000       264        6.6
                      5000       320        8.0
                      7500       404       10.1
                     10000       645       16.1

          Here's my thinking:

          • If you are flushing zillions of such tiny segments I think it's OK
            to accept a net/net sizable slowdown of your overall indexing
            speed. I'll choose a 4X slowdown "tolerance" to choose default
            values. This corresponds roughly to the "2000" line above.
            However, because I tested on a fairly fast CPU & IO system I'll
            multiply the numbers by 0.5.
          • Given this, I propose we default the minMergeMB
            (LogByteSizeMergePolicy) to 1.6 MB (avg size of real segments at
            the 2000 point above was 3.2 MB) and default minMergeDocs
            (LogDocMergePolicy) to 1000.
          • Note that when you are flushing large segments (larger than these
            min size settings) then there is no slowdown at all because the
            flushed segments are already above the minimum size.

          These are just defaults, so a given application can always change
          their "min merge size" as needed.

           Yonik Seeley added a comment - edited

          Thanks for adding minMergeMB, the default seems fine.
           Should minMergeDocs default to maxBufferedDocs (that should yield the old behavior)?
          Although 1000 isn't bad... much better to slow indexing a little in the odd app than to break it by running it out of descriptors.

          Michael McCandless added a comment -

           > Should minMergeDocs default to maxBufferedDocs (that should yield
          > the old behavior)?

          Good idea – I think we could do this dynamically so that whenever
          IndexWriter is flushing by doc count and the merge policy is
          LogDocMergePolicy we "write through" any changes to maxBufferedDocs
          --> LogDocMergePolicy.setMinMergeDocs? I'll take that approach to
          keep backwards compatibility.

          > Although 1000 isn't bad... much better to slow indexing a little in
          > the odd app than to break it by running it out of descriptors.

          Agreed, that's the right direction to "err" here.


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
             • Votes:
               0
               Watchers:
               3
