> Following up on this, it's basically the idea that segments ought to be created/merged either by-segment-size or by-doc-count, but not by a mixture? That wouldn't be surprising ...
Right, but we need the refactored merge policy framework in place
first. I'll mark this issue dependent on
> It does impact the APIs, though. It's easy enough to imagine, with factored merge policies, both by-doc-count and by-segment-size policies. But the initial segment creation is going to be handled by IndexWriter, so you have to manually make sure you don't set that algorithm and the merge policy in conflict. Not great, but I don't have any great ideas. Could put in an API handshake, but I'm not sure if it's worth the mess?
Good question. I think it's OK (at least for our first go at this –
progress not perfection!) to expect the developer to choose a merge
policy and then to use IndexWriter in a way that's "consistent" with
that merge policy? I think it's going to get too complex if we try to
formally couple "when to flush/commit" with the merge policy?
But, I think the default merge policy needs to be resilient to people
doing things like changing maxBufferedDocs/mergeFactor partway through
an index, calling flush() whenever they want, etc. The merge policy
today is not resilient to these "normal" usages of IndexWriter. So I
think we need to do something here even without the pressure from
> Also, it sounds like, so far, there's no good way of managing parallel-reader setups w/by-segment-size algorithms, since the algorithm for creating/merging segments has to be globally consistent, not just per index, right?
Right. We clearly need to keep the current "by doc" merge policy
easily available for this use case.
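To make the constraint concrete, here is a standalone sketch (all names invented for illustration; this is not Lucene code) of why a by-doc-count policy keeps two parallel indexes segment-aligned, while a hypothetical by-size policy with a max-merge-size cap can make their segment boundaries diverge when the two indexes store docs of different sizes:

```java
// Sketch: segment boundaries produced by two simulated merge policies.
import java.util.ArrayList;
import java.util.List;

public class ParallelSegmentsSketch {

    // By-doc-count policy: flush every `flush` docs; merge whenever the
    // last `mergeFactor` segments hold equal doc counts. Depends only on
    // doc counts, so parallel indexes stay segment-aligned.
    static List<Integer> byDocCount(int numDocs, int flush, int mergeFactor) {
        return build(numDocs, flush, mergeFactor, 1, Long.MAX_VALUE);
    }

    // By-size policy: same cascade, but refuse any merge whose result
    // would exceed maxMergeBytes. Behavior now depends on bytes per doc.
    static List<Integer> bySize(int numDocs, int flush, int mergeFactor,
                                int bytesPerDoc, long maxMergeBytes) {
        return build(numDocs, flush, mergeFactor, bytesPerDoc, maxMergeBytes);
    }

    private static List<Integer> build(int numDocs, int flush, int mergeFactor,
                                       int bytesPerDoc, long maxMergeBytes) {
        List<Integer> segs = new ArrayList<>();  // doc count per segment
        for (int i = 0; i < numDocs / flush; i++) {
            segs.add(flush);
            while (segs.size() >= mergeFactor) {
                int n = segs.size();
                int last = segs.get(n - 1);
                long mergedDocs = 0;
                boolean equal = true;
                for (int j = n - mergeFactor; j < n; j++) {
                    if (segs.get(j) != last) { equal = false; break; }
                    mergedDocs += segs.get(j);
                }
                if (!equal || mergedDocs * bytesPerDoc > maxMergeBytes) break;
                int merged = 0;
                for (int j = 0; j < mergeFactor; j++)
                    merged += segs.remove(segs.size() - 1);
                segs.add(merged);
            }
        }
        return segs;
    }

    public static void main(String[] args) {
        // Two parallel indexes see the same docs, but one stores far more
        // bytes per doc, so the size cap stops its merges much earlier:
        List<Integer> small = bySize(1000, 10, 10, 100, 1_000_000);
        List<Integer> big   = bySize(1000, 10, 10, 100_000, 1_000_000);
        System.out.println(small.equals(big));  // false: boundaries diverge

        // The doc-count policy is a function of doc counts alone, so both
        // parallel indexes always end up segment-aligned:
        System.out.println(byDocCount(1000, 10, 10)
            .equals(byDocCount(1000, 10, 10)));  // true
    }
}
```

The point of the toy size cap is just that any size-sensitive input breaks the lockstep that ParallelReader depends on.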
> If that is right, what does that say about making by-segment-size the default? It's gonna break (as in bad results) people relying on that behavior that don't change their code. Is there a community consensus on this? It's not really an API change that would cause a compile/class-load failure, but in some ways, it's worse ...
I think there are actually two questions here:
1) What exactly makes for a good default merge policy?
I think the merge policy we have today has some limitations:
- It's not resilient to "normal" usage of the public APIs in
IndexWriter. If you call flush() yourself, if you change
maxBufferedDocs (and maybe mergeFactor?) partway through an
index, etc, you can cause disastrous amounts of over-merging
(that's this issue).
I think the default policy should be entirely resilient to
any usage of the public IndexWriter APIs.
- Default merge policy should strive to minimize the net cost
(amortized over time) of merging, but the current one falls
short in two ways:
- When docs differ in size (frequently the case) it will be
too costly in CPU/IO consumption because small segments are
merged with large ones.
- It does too much work in advance (too much "pay it
forward"). I don't think a merge policy should
"inadvertently optimize" (I opened
LUCENE-854 to describe this).
I think Lucene "out of the box" should give you good indexing
performance. You should not have to do extra tuning to get
substantially better performance. The best way to get that
is to "flush by RAM" (with
LUCENE-843). But the current merge
policy prevents this (due to this issue).
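To make the "net cost" point concrete, here is a toy cost model (plain Java; all names are invented, this is not Lucene code) comparing total bytes rewritten when every flush is merged straight into one big segment versus merging only equal-sized runs of mergeFactor segments:

```java
// Toy model of bytes rewritten by merging over the life of an index.
import java.util.ArrayList;
import java.util.List;

public class MergeCostSketch {

    // Worst-case over-merging: every new flushed segment is immediately
    // merged into the single existing segment, so each flush rewrites the
    // whole index -> O(n^2) total bytes for n flushes.
    static long eagerCost(int flushes, long flushBytes) {
        long total = 0, index = 0;
        for (int i = 0; i < flushes; i++) {
            if (index == 0) {
                index = flushBytes;          // first segment: nothing to merge
            } else {
                index += flushBytes;
                total += index;              // rewrite the combined segment
            }
        }
        return total;
    }

    // Logarithmic policy: merge only when mergeFactor equal-sized segments
    // accumulate; each byte is rewritten ~log_mergeFactor(n) times.
    static long logCost(int flushes, long flushBytes, int mergeFactor) {
        List<Long> segs = new ArrayList<>();
        long total = 0;
        for (int i = 0; i < flushes; i++) {
            segs.add(flushBytes);
            while (segs.size() >= mergeFactor) {
                int n = segs.size();
                long last = segs.get(n - 1);
                boolean equal = true;
                for (int j = n - mergeFactor; j < n; j++)
                    if (segs.get(j) != last) { equal = false; break; }
                if (!equal) break;
                long merged = 0;
                for (int j = 0; j < mergeFactor; j++)
                    merged += segs.remove(segs.size() - 1);
                segs.add(merged);
                total += merged;             // bytes written by this merge
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(eagerCost(100, 1));    // prints 5049
        System.out.println(logCost(100, 1, 10));  // prints 200
    }
}
```

With 100 unit-size flushes the eager scheme rewrites 5049 units versus 200 for the logarithmic one; a policy pushed into the eager regime (e.g. by a mid-stream maxBufferedDocs change) pays roughly this kind of penalty.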
2) Can we change the default merge policy?
I sure hope so, given the issues above.
I think the majority of Lucene users do the simple "create a
writer, add/delete docs, close writer, while reader(s) use the
same index" type of usage and so would benefit from the gains of
LUCENE-843 and LUCENE-854.
I think (but may be wrong!) it's a minority who use
ParallelReader and therefore have a reliance on the specific
merge policy we use today?
Ideally we first commit the "decouple merge policy from IndexWriter"
change (LUCENE-847), then make a new merge policy that resolves this
issue and LUCENE-854, and make it the default policy. I think this
policy would look at size (in bytes) of each segment (perhaps
proportionally reducing # bytes according to pending deletes against
that segment), and would merge any adjacent segments (not just
rightmost ones) that are "the most similar" in size. I think it would
merge N (configurable) at a time and would at no time inadvertently
optimize.

This would mean that users of ParallelReader, on upgrading to this,
would need to change their merge policy to the legacy "merge by doc
count" policy.
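As a rough illustration of what that selection might look like (all class and method names below are hypothetical, not Lucene APIs), here's a sketch that discounts each segment's byte size by its pending deletes and then picks the adjacent run of N segments whose effective sizes are most similar:

```java
// Hypothetical sketch of the proposed by-size selection logic.
import java.util.List;

public class BySizeMergeSketch {

    static class SegInfo {
        final long sizeBytes;
        final int docCount, delCount;
        SegInfo(long sizeBytes, int docCount, int delCount) {
            this.sizeBytes = sizeBytes;
            this.docCount = docCount;
            this.delCount = delCount;
        }
        // Proportionally reduce byte size by pending deletes.
        double effectiveSize() {
            return sizeBytes * (1.0 - (double) delCount / docCount);
        }
    }

    // Consider every adjacent run of n segments (not just the rightmost)
    // and pick the run whose effective sizes are "most similar": the one
    // with the smallest max/min ratio. Returns the run's start index.
    static int pickMergeStart(List<SegInfo> segs, int n) {
        int best = -1;
        double bestRatio = Double.MAX_VALUE;
        for (int i = 0; i + n <= segs.size(); i++) {
            double min = Double.MAX_VALUE, max = 0;
            for (int j = i; j < i + n; j++) {
                double s = segs.get(j).effectiveSize();
                min = Math.min(min, s);
                max = Math.max(max, s);
            }
            double ratio = max / min;
            if (ratio < bestRatio) { bestRatio = ratio; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<SegInfo> segs = List.of(
            new SegInfo(1_000_000, 1000, 0),   // one big segment
            new SegInfo(64_000, 50, 0),        // three small, similar ones
            new SegInfo(60_000, 60, 0),
            new SegInfo(61_000, 40, 0));
        // The small, similar run wins; the big segment is left alone, so
        // the policy never "inadvertently optimizes" down to one segment.
        System.out.println(pickMergeStart(segs, 3));  // prints 1
    }
}
```

Because merging the big segment with small ones would score a large max/min ratio, mismatched merges are naturally avoided, which is exactly the CPU/IO concern raised above.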