Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If we factor the merge policy out of IndexWriter, we can make it pluggable, making it possible for apps to choose a custom merge policy and for easier experimenting with merge policy variants.
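The idea can be illustrated with a rough sketch (the names MergePolicySketch, SegmentInfo, findMerge, and SimpleDocCountMergePolicy below are hypothetical stand-ins for illustration, not the committed Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class MergePolicySketch {

    // Illustrative stand-in for Lucene's per-segment bookkeeping.
    public static class SegmentInfo {
        public final String name;
        public final int docCount;
        public SegmentInfo(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    // The pluggable contract: given the current segments, propose the
    // segments to merge now, or return null if no merge is needed.
    public interface MergePolicy {
        List<SegmentInfo> findMerge(List<SegmentInfo> segments);
    }

    // Toy doc-count policy: once mergeFactor segments exist, merge the
    // oldest mergeFactor of them.
    public static class SimpleDocCountMergePolicy implements MergePolicy {
        private final int mergeFactor;
        public SimpleDocCountMergePolicy(int mergeFactor) {
            this.mergeFactor = mergeFactor;
        }
        @Override
        public List<SegmentInfo> findMerge(List<SegmentInfo> segments) {
            if (segments.size() < mergeFactor) {
                return null;
            }
            return new ArrayList<>(segments.subList(0, mergeFactor));
        }
    }
}
```

With a contract like this, an app could hand the writer a custom policy, and experimenting with variants reduces to swapping implementations.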

      1. concurrentMerge.patch
        104 kB
        Ning Li
      2. LUCENE-847.patch.txt
        150 kB
        Steven Parkes
      3. LUCENE-847.patch.txt
        111 kB
        Steven Parkes
      4. LUCENE-847.take3.patch
        75 kB
        Michael McCandless
      5. LUCENE-847.take4.patch
        89 kB
        Michael McCandless
      6. LUCENE-847.take5.patch
        145 kB
        Michael McCandless
      7. LUCENE-847.take6.patch
        164 kB
        Michael McCandless
      8. LUCENE-847.take7.patch
        172 kB
        Michael McCandless
      9. LUCENE-847.take8.patch
        173 kB
        Michael McCandless
      10. LUCENE-847.txt
        117 kB
        Steven Parkes


          Activity

          steven_parkes Steven Parkes added a comment -

          Here's a first cut at a factored merge policy.

          It's not polished: it's sparsely commented, and there are probably a few changes that should be backed out.

          It factors a merge policy interface out of IndexWriter and creates an implementation of the existing merge policy.

          Actually, it's a tweak on the existing merge policy. Currently the merge policy is implemented in ways that assume certain things about the existing list of segments. The factored version doesn't make these assumptions. It simplifies the interface but I'm not yet sure if there are bad side effects. Among other things I want to run performance tests.

          There is part of a pass at a concurrent version of the current merge policy. It's not complete; I've been pushing on it to see if I understand the issues around concurrent merges. Interesting topics are 1) how to control the merges, 2) how/when to cascade merges if they are happening in parallel, and 3) how to handle synchronization of IndexWriter#segmentInfos. That last one in particular is a bit touchy.

          I did a quick implementation of KS's fib merge policy, but it's incomplete in that IndexWriter won't merge non-contiguous segment lists. I think I can fix that fairly easily with no major side effects. The factored merge policy makes this plug-in pretty clean ...

          cutting Doug Cutting added a comment -

          How public should such an API be? Should the interface be public? Should the implementations? The most conservative approach would be to make it all package private, to give more leeway for evolving the update API. But that also decreases the utility. Thoughts?

          steven_parkes Steven Parkes added a comment -

          Visibility is one of those things I haven't cleaned up yet.

          Client code is going to want to create and set merge policies, and it'll want to set "external" merge policy parameters. That's all probably not controversial.

          As for other stuff, I tend to leave things open, but I know that's debatable and don't have a strong opinion in this case.

          In fact, there are a few things here that are fairly subtle/important. The relationship/protocol between the writer and the policy is pretty strong. This can be seen in the strawman concurrent merge code, where the merge policy holds state and relies on being called from a synchronized writer method. If that goes forward anything like it is, it would argue for tightening that API up some. Chris suggested a way to make the writer<->policy relationship "atomic". I didn't add the code (yet), but I'm not against it.

          mikemccand Michael McCandless added a comment -

          Steven, I looked through the patch quickly. It looks great! First
          some general comments and then I'll add more specifics as
          separate comments.

          Can you open separate issues for the other new and interesting merge
          policies here? I think the refactoring of merge policy plus creation
          of the default policy that is identical to today's merge policy, which
          should be a fairly quick and low-risk operation, would then remain
          under this issue?

          Then, iterating / vetting / debugging the new interesting merge
          policies can take longer under their own separate issues and time
          frame.

          On staging, I think we could first do this issue (decouple MergePolicy
          from the writer), then LUCENE-845 because it blocks LUCENE-843 (which
          would then fix LogarithmicMergePolicy to use segment sizes instead of
          doc counts as the basis for determining levels), then LUCENE-843
          (performance improvements for how the writer uses RAM)?

          mikemccand Michael McCandless added a comment -

          My first comment, which I fear will be the most controversial feedback
          here, is a whitespace style question: I'm not really a fan of
          "cancerous whitespace", where every ( [ etc. has its own whitespace
          around it.

          I generally prefer minimal whitespace within reason (i.e., as long as
          it does not heavily hurt readability). The thing is, I don't know
          that Lucene has settled on this, or whether anyone else shares my
          opinion. It does seem that two-space indentation is the standard,
          which I agree with, but I don't know if whitespace style has
          otherwise been agreed on. Many people will say it's unimportant to
          agree on whitespace, but I feel it's actually quite important.

          mikemccand Michael McCandless added a comment -

          OK, some specific comments, only on the refactoring (i.e., I haven't
          really looked at the new merge policies yet):

          • I think maxBufferedDocs should not be exposed in any *MergePolicy
            classes or interfaces? I'm planning on deprecating this param
            with LUCENE-843 when we switch by default to "buffering by RAM
            usage" and it really relates to "how/when should writer flush its
            RAM buffer".
          • I also think "minMergeDocs" (which today is one and the same as
            maxBufferedDocs in IndexWriter but conceptually could be a
            different configuration) also should not appear in the MergePolicy
            interface. I think it should only appear in
            LogarithmicMergePolicy?

          If we remove these from the MergePolicy interface then maybe we
          don't need MergePolicyBase? (Just to make things simpler.)

          • I think we should not create a LegacyMergePolicy interface. But I
            realize you need it so the deprecated methods in IndexWriter
            (setMergeFactor, setMaxBufferedDocs, setMaxMergeDocs, etc.) can be
            implemented. How about instead having these methods work only if
            the current merge policy is a LogarithmicMergePolicy? You could
            check whether the current mergePolicy is an instanceof
            LogarithmicMergePolicy and throw, e.g., an IllegalStateException
            if it's not.

          I.e., going forward, with new and interesting merge policies,
          developers should interact with their merge policy directly to
          configure it.

          • I was a little spooked by this change to TestAddIndexesNoOptimize:

            - assertEquals(2, writer.getSegmentCount());
            + assertEquals(3, writer.getSegmentCount());

          I think with just the refactoring, there should not need to be any
          changes to unit tests right?

          • It's interesting that you've pulled "useCompoundFile" into the
            LegacyMergePolicy. I'm torn on whether it belongs in MergePolicy
            at all, since this is really a file format issue?

          For example, newly written segments (no longer a "merge" with
          LUCENE-843) must also know whether to write in compound file
          format. If we make interesting file format changes in the future
          that are configurable by the developer, we wouldn't want to change
          all MergePolicy classes to propagate that. It feels like using
          compound file or not should remain only in IndexWriter?
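The suggestion above about the deprecated setters can be sketched with an instanceof guard. This is a simplified stand-in, not Lucene's actual IndexWriter code; the class and method bodies here are hypothetical:

```java
public class LegacySetterSketch {

    // Marker for any pluggable policy.
    public interface MergePolicy {}

    // The one policy that understands the legacy mergeFactor parameter.
    public static class LogarithmicMergePolicy implements MergePolicy {
        private int mergeFactor = 10;
        public void setMergeFactor(int mergeFactor) { this.mergeFactor = mergeFactor; }
        public int getMergeFactor() { return mergeFactor; }
    }

    // Some other policy with no notion of mergeFactor.
    public static class OtherMergePolicy implements MergePolicy {}

    // Stand-in for IndexWriter holding a pluggable policy.
    public static class IndexWriter {
        private MergePolicy mergePolicy = new LogarithmicMergePolicy();

        public void setMergePolicy(MergePolicy p) { this.mergePolicy = p; }

        // Deprecated legacy setter: only meaningful for the logarithmic policy,
        // so refuse loudly when any other policy is installed.
        public void setMergeFactor(int mergeFactor) {
            if (!(mergePolicy instanceof LogarithmicMergePolicy)) {
                throw new IllegalStateException(
                    "setMergeFactor requires LogarithmicMergePolicy, not "
                        + mergePolicy.getClass().getSimpleName());
            }
            ((LogarithmicMergePolicy) mergePolicy).setMergeFactor(mergeFactor);
        }
    }
}
```

The guard keeps backward compatibility for existing callers while steering new code toward configuring the policy object directly.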

          ningli Ning Li added a comment -

          Here is a patch for concurrent merge as discussed in:
          http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651

          I put it under this issue because it helps design and verify a factored merge policy which would provide good support for concurrent merge.

          As described before, a merge thread is started when a writer is created and stopped when the writer is closed. The merge process consists of three steps: first, create a merge task/spec; then, carry out the actual merge; finally, "commit" the merged segment (replace segments it merged in segmentInfos), but only after appropriate deletes are applied. The first and last steps are fast and synchronous. The second step is where concurrency is achieved. Does it make sense to capture them as separate steps in the factored merge policy?

          As discussed in http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651: documents can be buffered while segments are merged, but no more than maxBufferedDocs can be buffered at any time, so this version provides limited concurrency. The main goal is to keep ingestion hiccups short, especially when the ingestion rate is low. After the factored merge policy, we could provide different concurrent merge policies that offer different levels of concurrency.

          All unit tests pass. If IndexWriter is replaced with IndexWriterConcurrentMerge, all unit tests pass except the following:

          • TestAddIndexesNoOptimize and TestIndexWriter*
            This is because they check segment sizes, expecting all merges to be done. These tests pass if the checks are performed after the concurrent merges finish. The modified tests (with waits for concurrent merges to finish) are in TestIndexWriterConcurrentMerge*.
          • testExactFieldNames in TestBackwardCompatibility and testDeleteLeftoverFiles in TestIndexFileDeleter
            In both cases, file name segments_a is expected, but the actual is segments_7. This is because with concurrent merge, if compound file is used, only the compound version is "committed" (added to segmentInfos), not the non-compound version, thus the lower segments generation number.

          Cheers,
          Ning
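The three-step protocol described above (fast synchronous spec creation, concurrent merge, fast synchronous commit) might be sketched roughly as follows. Everything here is a toy stand-in: the string-join "merge" replaces real segment I/O, and the class names are not from the patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentMergeSketch {
    private final List<String> segmentInfos = new ArrayList<>();
    // The merge thread lives for the lifetime of the writer.
    private final ExecutorService mergeThread = Executors.newSingleThreadExecutor();

    public synchronized void addSegment(String name) {
        segmentInfos.add(name);
        // Step 1 (fast, synchronous): create the merge task/spec.
        if (segmentInfos.size() >= 3) {
            final List<String> spec = new ArrayList<>(segmentInfos.subList(0, 3));
            // Step 2 (concurrent): the actual merge runs off the writer thread.
            mergeThread.submit(() -> {
                String merged = String.join("+", spec); // stands in for real merge I/O
                commitMerge(spec, merged);              // step 3
            });
        }
    }

    // Step 3 (fast, synchronous): replace the merged segments in segmentInfos.
    private synchronized void commitMerge(List<String> merged, String result) {
        segmentInfos.removeAll(merged);
        segmentInfos.add(result);
    }

    public synchronized int segmentCount() { return segmentInfos.size(); }

    // Stop the merge thread when the writer is closed, waiting for pending merges.
    public void close() {
        mergeThread.shutdown();
        try {
            mergeThread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Only steps 1 and 3 hold the writer lock, which is what keeps ingestion hiccups short while the expensive step 2 proceeds in the background.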

          steven_parkes Steven Parkes added a comment -

          Here are some numbers comparing the load performance for the factored vs. non-factored merge policies.

          The setup uses enwiki, loads 200K documents, and uses 4 different combinations of maxBufferedDocs and mergeFactor (just the defaults from the standard benchmark, not because I necessarily thought they were a good idea).

          The factored merge policy seems to be on the order of 1% slower at loading than the non-factored version, and I'm not sure why, so I'm going to look into this. The factored version does more examination of the segment list than the non-factored version, so there's compute overhead, but I would expect that to be swamped by I/O. Maybe that's not a good assumption? Or it might be doing different merges for reasons I haven't considered, which I'm going to check.

          Relating this to some of the merge discussions, I'm going to look at monitoring both the number of merges taking place and the size of those merges. I think that's helpful in understanding different candidate merge policies, in addition to the absolute difference in runtime.

          I think histogramming the per-doc cost would also be interesting, since mitigating the long delay at cascading merges is at least one goal of a concurrent merge policy.

          And all this doesn't even consider testing the recent work on merging multiple indexes. That's an area where the factored merge policy differs (because of the simpler interface).

          I'm curious if anyone is surprised by these numbers, the 60 docs/sec, in particular. This machine is a dual dual-core xeon, writing to a single local disk. My dual opty achieved ~85-100 docs/sec on a three disk SATA3 RAID5 array.

          Non-factored (current) merge policy

          Report sum by Prefix (MAddDocs) and Round (8 about 8 out of 33)

          Operation         round  mrg  buf  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
          MAddDocs_200000       0   10   10       1      200000   41.6    4,804.11  11,758,928   12,591,104
          MAddDocs_200000       1  100   10       1      200000   50.0    4,000.25  34,831,992   52,563,968
          MAddDocs_200000       2   10  100       1      200000   49.9    4,004.95  42,158,232   60,444,672
          MAddDocs_200000       3  100  100       1      200000   57.9    3,455.97  45,646,680   61,083,648
          MAddDocs_200000       4   10   10       1      200000   44.9    4,458.66  36,928,616   61,083,648
          MAddDocs_200000       5  100   10       1      200000   50.4    3,965.98  47,855,064   61,083,648
          MAddDocs_200000       6   10  100       1      200000   49.7    4,023.51  52,506,448   64,217,088
          MAddDocs_200000       7  100  100       1      200000   57.9    3,451.39  64,466,128   73,220,096

          Factored (new) merge policy

          Report sum by Prefix (MAddDocs) and Round (8 about 8 out of 33)

          Operation         round  mrg  buf  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
          MAddDocs_200000       0   10   10       1      200000   41.4    4,828.25  10,477,976   12,386,304
          MAddDocs_200000       1  100   10       1      200000   50.4    3,968.27  38,333,544   46,170,112
          MAddDocs_200000       2   10  100       1      200000   50.3    3,973.52  33,539,824   63,860,736
          MAddDocs_200000       3  100  100       1      200000   58.6    3,413.87  44,580,528   87,781,376
          MAddDocs_200000       4   10   10       1      200000   45.3    4,411.50  57,850,104   87,781,376
          MAddDocs_200000       5  100   10       1      200000   51.0    3,921.48  62,793,432   87,781,376
          MAddDocs_200000       6   10  100       1      200000   50.4    3,969.87  49,625,496   93,966,336
          MAddDocs_200000       7  100  100       1      200000   58.7    3,409.51  68,100,288  129,572,864

          steven_parkes Steven Parkes added a comment -

          Here's an update to the patch. I wouldn't say it's ready to be committed, but I think it's significantly closer than it was.

          The concurrent and other misc. stuff have been pulled out (that part still needs work, figuring out how to get the concurrency right).

          The new patch works against trunk, which means it handles the docswriter and is more compatible with merging by number of docs or merging by RAM (or size, to be more accurate?).

          My take on the migration path here was that we could well be going towards merging by size but need to keep merging by # docs for parallel index cases. The current patch still only does merging by # docs.

          I think I commented on a couple of other things on the dev list, but to reiterate:

          There's a small change in the test results because the new merge policy simplifies the treatment of addIndexes operations. The change is understood and shouldn't be a problem.

          useCompoundFile is delegated to the merge policy so a smart merge policy could make decisions looking at the state of all segments rather than all-or-nothing. There are a couple of fixme's in IndexWriter related to this and the segments being created by the docswriter.

          I'm going to look at that, plus the concurrent stuff: Ning's approach plus my old approach (which has to change, given the new docswriter stuff).

          steven_parkes Steven Parkes added a comment -

          For the time being, the patch also contains some of the code from LUCENE-971 since that's how I test it.

          mikemccand Michael McCandless added a comment -

          This looks great Steve!

          More specific feedback soon, but ... in thinking about concurrency
          (and from reading your comments about it in LogDocMergePolicy), I
          think we ideally would like concurrency to be fully independent of the
          merge policy.

          I.e., just like you can take any shell command and choose to run it in
          the background by sticking an "&" on the end, I should be able to take
          my favorite MergePolicy instance X and "wrap" it inside a "concurrent
          merge policy wrapper". E.g., instantiate ConcurrentMergePolicy(X), and
          ConcurrentMergePolicy would then take the merges requested by X and do
          them in the background.

          I think with one change to your MergePolicy API & control flow, we
          could make this work very well: instead of requiring the MergePolicy
          to call IndexWriter.merge, and do the cascading, it should just return
          the one MergeSpecification that should be done right now. This would
          mean the "MergePolicy.merge" method would return null if no merge is
          necessary right now, and would return a MergeSpecification if a merge
          is required.

          This way, it is IndexWriter that would execute the merge. When the
          merge is done, IndexWriter would then call the MergePolicy again to
          give it a chance to "cascade". This simplifies the locking because
          IndexWriter can synchronize on SegmentInfos when it calls
          "MergePolicy.merge" and so MergePolicy no longer has to deal with this
          complexity (that SegmentInfos change during merge).

          Then, with this change, we could make a ConcurrentMergePolicy that
          could (I think) easily "wrap" itself around another MergePolicy X,
          intercepting the calls to "merge". When the inner MergePolicy wants
          to do a merge, the ConcurrentMergePolicy would in turn kick off that
          merge in the BG but then return null to the IndexWriter allowing
          IndexWriter to return to its caller, etc.

          Then, this also simplifies MergePolicy implementations because you no
          longer have to deal w/ thread safety issues around driving your own
          merges, cascading merges, dealing with sneaky SegmentInfos changing
          while doing the merge, etc....

          steven_parkes Steven Parkes added a comment -

          > I think we ideally would like concurrency to be fully independent
          > of the merge policy.

          I thought of that, too, while taking a fresh look at things again. It's my current approach, though I'm not yet sure there won't be stumbling blocks. More soon, hopefully.

          > I think with one change to your MergePolicy API & control flow, we
          > could make this work very well: instead of requiring the MergePolicy
          > to call IndexWriter.merge, and do the cascading, it should just return
          > the one MergeSpecification that should be done right now.

          Hmm ... interesting idea. I thought about it briefly, though I didn't pursue it (see below). It would end up changing the possible space of merge policies subtly. You wouldn't be able to have any state in the algorithm. Arguably this is a good thing. There is also a bit more overhead, since starting the computation of potential merges from scratch each time could imply a little more computation, but I suspect this is not significant.

          > When the inner MergePolicy wants to do a merge, the
          > ConcurrentMergePolicy would in turn kick off that merge in the BG but
          > then return null to the IndexWriter allowing IndexWriter to return to
          > its caller, etc.

          I'm a little unsure here. Are you saying the ConcurrentMergePolicy does the merges itself, rather than using the writer? That's going to mean a synchronization dance between the CMP and the writer. There's no question but that there has to be some synch dance, but my current thinking was to try to keep as cleanly within one class, IW, as I could.

          mikemccand Michael McCandless added a comment -

          Some more feedback:

          • Is the separate IndexMerger interface really necessary? Can't we
            just use IndexWriter directly? It's somewhat awkward/forced now,
            because the interface has getters for ramBufferSizeMB and
            maxBufferedDocs that are really a "writer" (flushing) thing not a
            "merging" thing.

          While LogDocMergePolicy does need "maxBufferedDocs", I think,
          instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
          through" to the LogDocMergePolicy if that is the merge policy in
          use (and LogDocMergePolicy should store its own private
          "minMergeDocs").

          I think the three getters may not even be needed (based on
          comments below), in which case it seems like we shouldn't be
          creating a new interface.

          • In LogDocMergePolicy, it seems like the checking that's done for
            whether a SegmentInfo is in a different directory, as well as the
            subsequent copy to move it over to the IndexWriter's directory,
            should not live in the MergePolicy?

          Otherwise, every MergePolicy is going to have to duplicate this
          code. Not to mention, we may someday create a more efficient way
          to copy than running a single-segment merge (which is very
          inefficient, but we only do it in addIndexes* I think). I'd like
          to capture this in one place (IndexWriter).

          EG, the writer could have its own "copyExternalSegments" method
          which is called in addIndexes* and also optimize(), after the
          merge policy has done its merge, which does the check for wrong
          directory and subsequent copy.

          I think this would mean IndexMerger.getDirectory() is not really
          needed.

          • The "checkOptimize" method in LogDocMergePolicy seems like it
            belongs back in IndexWriter: I think we don't want every
            MergePolicy having to replicate that tricky while condition.

          Maybe we could change MergePolicy.merge to take a boolean "forced"
          which, if true, means that the MergePolicy must now pick a merge
          even if it didn't think it was time to. This would let us move
          that tricky while condition logic back into IndexWriter.

          Also, I think at some point we may want to have an optimize()
          method that takes an int parameter stating the max # segments in
          the resulting index. This would allow you to optimize down to <=
          N segments w/o paying full cost of a complete "only one segment"
          optimize. If we had a "forced" boolean then we could build such
          an optimize method in the future, whereas as it stands now it
          would not be so easy to retrofit previously created MergePolicy
          classes to do this.

          And some more minor feedback:

          • I love the simplification of addIndexesNoOptimize. Though (same
            comment as above) I think we should move that final "copy if
            different directory" step back into IndexWriter.
          • There are some minor things that should not be committed eg the
            added "infoStream = null" in IndexFileDeleter. I typically try to
            put a comment "// nocommit" above such changes as I make them to
            remind myself and keep them from slipping in.
          • In the deprecated IndexWriter methods you're doing a hard cast to
            LogDocMergePolicy which gives a bad result if you're using a
            different merge policy; maybe catch the class cast exception (or,
            better, check upfront if it's an instanceof) and raise a more
            reasonable exception if not?
          • IndexWriter around line 1908 has a for loop with a commented-out
            "System.err.println"; we should just comment out or remove the for loop.
          • These commented out synchronized spook me a bit:

          /* synchronized(segmentInfos) */ {

          Are they needed in these contexts? Is this only once we have
          concurrent merging? (This ties back to the first feedback to
          simplify MergePolicy API so that this kind of locking is only
          needed inside IndexWriter).

          • Can we support non-contiguous merges? If I understand it
            correctly, the MergeSpecification can express such a merge, it's
            just that the current IndexMerger (IndexWriter) cannot execute it,
            right? So we are at least not precluding fixing this in the
            future.
          • It confuses me that MergePolicy has a method "merge(...)" – can
            we rename it to "maybeMerge(..)" or "checkForMerge(...)"? Ie,
            this method determines whether a merge is necessary and, if so, it
            then asks IndexMerger to in fact execute the merge (or, returns
            the spec)?
          • Instead of IndexWriter.releaseMergePolicy() can we have
            IndexWriter only close the merge policy if it was the one that had
            created it? (Similar to how IndexWriter closes the dir if it has
            opened it from a String or File, but does not close it if it was
            passed in).
          mikemccand Michael McCandless added a comment -

          > > I think we ideally would like concurrency to be fully independent of
          > > the merge policy.
          >
          > I thought of that, too, while taking a fresh look at things
          > again. It's my current approach, though I'm not yet sure there won't
          > be stumbling blocks. More soon, hopefully.

          Well I think the current MergePolicy API (where the "merge" method
          calls IndexWriter.merge itself, must cascade itself, etc.) makes it
          hard to build a generic ConcurrentMergePolicy "wrapper" that you could
          use to make any MergePolicy concurrent. How would you do it?

          EG I'm working on a new MergePolicy for LUCENE-845, which would be
          nice to run concurrently, but I'd really rather not have to figure out
          how to build my own concurrency/locking/etc in it. Ideally
          "concurrency" is captured as a single wrapper class that we all can
          re-use on top of any MergePolicy. I think we can do that with the
          proposed simplification.

          > > I think with one change to your MergePolicy API & control flow, we
          > > could make this work very well: instead of requiring the MergePolicy
          > > to call IndexWriter.merge, and do the cascading, it should just
          > > return the one MergeSpecification that should be done right now.

          > Hmm ... interesting idea. I thought about it briefly, though I
          > didn't pursue it (see below). It would end up changing the possible
          > space of merge policies subtly. You wouldn't be able to have any
          > state in the algorithm. Arguably this is a good thing. There is also
          > a bit more overhead, since starting the computation of potential
          > merges from scratch each time could imply a little more computation,
          > but I suspect this is not significant.

          I think you can still have state (as instance variables in your
          class)? How would this simplification restrict the space of merge
          policies?

          > > When the inner MergePolicy wants to do a merge, the
          > > ConcurrentMergePolicy would in turn kick off that merge in the BG but
          > > then return null to the IndexWriter allowing IndexWriter to return to
          > > its caller, etc.
          >
          > I'm a little unsure here. Are you saying the ConcurrentMergePolicy
          > does the merges itself, rather than using the writer? That's going
          > to mean a synchronization dance between the CMP and the
          > writer. There's no question but that there has to be some synch
          > dance, but my current thinking was to try to keep as cleanly within
          > one class, IW, as I could.

          Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
          just with a separate thread. And so all synchronization required is
          still inside IndexWriter (I think?).

          In fact, if we stick with the current MergePolicy API, aren't you
          going to have to put some locking into eg the LogDocMergePolicy when
          concurrent merges might be happening? With the new approach,
          IndexWriter could invoke MergePolicy.merge under a
          "synchronized(segmentInfos)", and then each MergePolicy doesn't have
          to deal with locking at all.
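          That locking arrangement can be sketched in miniature (illustrative names only; the "merge" here just consumes a segment name): the writer holds the segmentInfos monitor while consulting the policy and applying its pick, so policy implementations need no locking of their own.

```java
import java.util.ArrayList;
import java.util.List;

public class LockingSketch {
    interface Policy {
        // Pick a segment to merge away, or null if nothing to do.
        String pickMerge(List<String> segs);
    }

    private final List<String> segmentInfos = new ArrayList<>();

    void maybeMerge(Policy policy) {
        synchronized (segmentInfos) {
            // The policy sees a stable snapshot of the segment list here;
            // no concurrent flush or merge can change it under its feet.
            String spec = policy.pickMerge(segmentInfos);
            if (spec != null) {
                segmentInfos.remove(spec);   // stand-in for executing the merge
            }
        }
    }

    public static void main(String[] args) {
        LockingSketch w = new LockingSketch();
        w.segmentInfos.add("_0");
        w.segmentInfos.add("_1");
        // A lock-free policy: just pick the first segment, if any.
        w.maybeMerge(segs -> segs.isEmpty() ? null : segs.get(0));
        System.out.println(w.segmentInfos);   // one segment consumed
    }
}
```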

          steven_parkes Steven Parkes added a comment -

          > Is the separate IndexMerger interface really necessary?

          I wrestled with this, so in the past, it wouldn't have taken much to convince me otherwise. The reason for the extra interface is I was hoping to discourage or create a little extra friction for merge policies in terms of looking too much into the internals of IndexWriter.

          As an example, it's not a good idea for merge policies to be able to access IndexWriter#segmentInfos directly. (That's a case I would like to solve by making segmentInfos private, but I haven't looked at that completely yet and even beyond that case, I'd still like merge policies to have a very clean interface with IWs.)

          It feels kinda weird for me to be arguing for constraints since I work mostly in dynamic languages that have none of this stuff. But it reflects my goal that merge policies simply be algorithms, not real workers.

          Moreover, I think it may be useful for implementing concurrency (see below).

          > While LogDocMergePolicy does need "maxBufferedDocs", I think,
          > instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
          > through" to the LogDocMergePolicy if that is the merge policy in
          > use (and LogDocMergePolicy should store its own private
          > "minMergeDocs").

          The thing here is that the merge policy has nothing to do with max buffered docs, right? The code for buffering docs for the first segment is wholly in the IndexWriter. LogDocMergePolicy happens to need it (in its current incarnation) in order to handle the way the log levels are computed. This could, of course, be changed. There's nothing that says a merge policy has to look at these values, just that they're available should the merge policy want to look.

          I guess my idea was that the index writer was responsible for handling the initial segment (with DocsWriter, if it wants) and also to provide an indication to the merge policies how it was handling them.

          It's possible to have the merge policy influence the first segment size but that opens up a lot of issues. Does every merge policy then have to know how to trade between buffering by doc count and buffering by ram? I was hoping to avoid that.

          I'm not all that happy with the way this is; supporting both by-doc and by-ram buffering is messy but seems necessary. This was my take on making it least messy.
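          For concreteness, the "write through" alternative under discussion could look like this sketch (class and method names are simplified stand-ins, not the patch's API): the writer forwards the setting to the policy it recognizes, instead of the policy reading it back through a getter.

```java
// Toy stand-in for LogDocMergePolicy keeping its own private copy.
class LogDocMergePolicySketch {
    private int minMergeDocs = 10;
    void setMinMergeDocs(int n) { minMergeDocs = n; }
    int getMinMergeDocs() { return minMergeDocs; }
}

public class WriteThroughSketch {
    private int maxBufferedDocs = 10;
    private Object mergePolicy = new LogDocMergePolicySketch();

    void setMaxBufferedDocs(int n) {
        maxBufferedDocs = n;
        // Forward the value only when the installed policy understands it,
        // so other policies never need a maxBufferedDocs getter at all.
        if (mergePolicy instanceof LogDocMergePolicySketch) {
            ((LogDocMergePolicySketch) mergePolicy).setMinMergeDocs(n);
        }
    }

    public static void main(String[] args) {
        WriteThroughSketch w = new WriteThroughSketch();
        w.setMaxBufferedDocs(1000);
        System.out.println(
            ((LogDocMergePolicySketch) w.mergePolicy).getMinMergeDocs());
    }
}
```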

          > In LogDocMergePolicy, it seems like the checking that's done for
          > whether a SegmentInfo is in a different directory, as well as the
          > subsequent copy to move it over to the IndexWriter's directory,
          > should not live in the MergePolicy?

          I see two parts to this.

          First, I hesitate to make it impossible for a merge policy to see the directory information, i.e., to remove IndexMerger#getDirectory(). Copying a segment is one way to get it into the current directory, but merging it with other segments does too. A merge policy could decide to go ahead and merge a bunch of segments that are in other directories rather than just copy them individually. Taking away getDirectory() removes that option.

          As to how to handle the case where single segments are copied, I guess my main reason for leaving that in the merge policy would be for simplicity. Seems nicer to have all segment amalgamation management in one place, rather than some in one class and some in another. Could be factored into an optional base merge policy for derived classes to use as they might like.

          > The "checkOptimize" method in LogDocMergePolicy seems like it
          > belongs back in IndexWriter: I think we don't want every
          > MergePolicy having to replicate that tricky while condition.

          The reason for not doing this was I can imagine different merge policies having a different model of what optimize means. I can imagine some policies that would not optimize everything down to a single segment but instead obeyed a max segment size. But we could factor the default conditional into an optional base class, as above.

          It is an ugly conditional that there might be better way to formulate, so that policies don't have to look at the grody details like hasSeparateNorms. But I'd still like the merge policies to be able to decide what optimize means at a high level.

          > Maybe we could change MergePolicy.merge to take a boolean "forced"
          > which, if true, means that the MergePolicy must now pick a merge
          > even if it didn't think it was time to. This would let us move
          > that tricky while condition logic back into IndexWriter.

          I didn't follow this. forced == optimize? If not, what does pick a merge mean? Not sure what LogDoc should do for merge(force=true) if it thinks everything is fine?

          > Also, I think at some point we may want to have an optimize()
          > method that takes an int parameter stating the max # segments in
          > the resulting index.

          I think this is great functionality for a merge policy, but what about just making that part of the individual merge policy interface, rather than linked to IndexWriter? That was one goal of making a factored merge policy: that you could do these tweaks without changing IndexWriter.

          > This would allow you to optimize down to <=
          > N segments w/o paying full cost of a complete "only one segment"
          > optimize. If we had a "forced" boolean then we could build such
          > an optimize method in the future, whereas as it stands now it
          > would not be so easy to retrofit previously created MergePolicy
          > classes to do this.

          I haven't looked at how difficult it would be to make LogDoc sufficiently derivable to allow a derived class to add this tweak. If I could, would it be enough?

          > There are some minor things that should not be committed eg the
          > added "infoStream = null" in IndexFileDeleter. I typically try to
          > put a comment "// nocommit" above such changes as I make them to
          > remind myself and keep them from slipping in.

          You're right and thanks for the idea. Obvious now.

          > In the deprecated IndexWriter methods you're doing a hard cast to
          > LogDocMergePolicy which gives a bad result if you're using a
          > different merge policy; maybe catch the class cast exception (or,
          > better, check upfront if it's an instanceof) and raise a more
          > reasonable exception if not?

          Agreed.

          > IndexWriter around line 1908 has for loop that has commented out
          > "System.err.println"; we should just comment out/remove for loop

          The whole loop will be gone before commit but I didn't want to delete it yet since I might need to turn it back on for more debugging. It should, of course, have a "// nocommit" comment.

          > These commented out synchronized spook me a bit:
          >
          > /* synchronized(segmentInfos) */ {

          This is from my old code. I left it in there as a hint as I work on the concurrent stuff again. It's only a weak hint, though, because things have changed a lot since that code was even partially functional. Ignore it. It won't go into the serial patch and anything for LUCENE-870 will have to have its own justification.

          > Can we support non-contiguous merges? If I understand it
          > correctly, the MergeSpecification can express such a merge, it's
          > just that the current IndexMerger (IndexWriter) cannot execute it,
          > right? So we are at least not precluding fixing this in the
          > future.

          As far as I've seen so far, there are no barriers to non-contiguous merges. Maybe something will crop up that is, but in what I've done, I haven't seen any.

          > It confuses me that MergePolicy has a method "merge(...)" – can
          > we rename it to "maybeMerge(..)" or "checkForMerge(...)"?

          I suppose. I'm not a big fan of the "maybeFoo" style of naming. I think of "merge" like "optimize": make it so / idempotent. But I'm certainly willing to write whatever people find clearest.

          > Instead of IndexWriter.releaseMergePolicy() can we have
          > IndexWriter only close the merge policy if it was the one that had
          > created it? (Similar to how IndexWriter closes the dir if it has
          > opened it from a String or File, but does not close it if it was
          > passed in).

          This precludes

          iw.setMergePolicy(new MyMergePolicy(...));
          ...
          iw.close();

          You're always going to have to

          MergePolicy mp = new MyMergePolicy(...);
          iw.setMergePolicy(mp);
          ...
          iw.close();
          mp.close();

          The implementation's much cleaner using the semantics you describe, but I was thinking it'd be better to optimize for the usability of the common client code case?
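          The "close only what you created" convention being debated could look like this sketch (all names illustrative; the closePolicy flag mirrors how IndexWriter only closes a Directory it opened itself). It makes the tradeoff concrete: a policy the caller passed in stays the caller's responsibility to close.

```java
public class OwnershipSketch {
    static class PolicySketch {
        boolean closed = false;
        void close() { closed = true; }
    }

    private PolicySketch mergePolicy = new PolicySketch();
    private boolean closePolicy = true;   // writer created the default policy

    void setMergePolicy(PolicySketch mp) {
        mergePolicy = mp;
        closePolicy = false;              // caller owns what it passed in
    }

    void close() {
        if (closePolicy) mergePolicy.close();
    }

    public static void main(String[] args) {
        OwnershipSketch w = new OwnershipSketch();
        PolicySketch mine = new PolicySketch();
        w.setMergePolicy(mine);
        w.close();
        // Under this convention the caller must still close its own policy:
        System.out.println(mine.closed);
    }
}
```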

          > Well I think the current MergePolicy API (where the "merge" method
          > calls IndexWriter.merge itself, must cascade itself, etc.) makes it
          > hard to build a generic ConcurrentMergePolicy "wrapper" that you could
          > use to make any MergePolicy concurrent. How would you do it?

          I really haven't had time to go heads down on this (the old concurrent merge policy was a derived class rather than a wrapper class). But I was thinking that perhaps ConcurrentMergePolicy would actually wrap IndexWriter as well as the serial merge policy, i.e., implement IndexMerger (my biggest argument for IM at this point). I haven't looked deeply at whether this will work, but I think it has at least a chance.

          I should know more about this in a day or two.

          I think you can still have state (as instance variables in your
          class)? How would this simplification restrict the space of merge
          policies?

          s/state/stack state/. Yeah, you can always unwind your loops and lift your recursions, put all that stack state into instance variables. But, well, yuck. I'd like to make it easy to write simple merge policies and take up the heavy lifting elsewhere. Hopefully there will be more merge policies than index writers.
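The "unwind your loops and lift your recursions" point above can be sketched with a toy example (purely illustrative — none of these class or method names are Lucene's): the policy picks at most one merge per call, and the cascade becomes an external loop driven by the caller instead of recursion inside the policy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: a policy whose "stack state" has been lifted out, so each call
// to pickMerge() selects at most one merge and the caller drives the cascade.
class ToyCascadingPolicy {
    final int mergeFactor;  // how many segments to merge at once

    ToyCascadingPolicy(int mergeFactor) { this.mergeFactor = mergeFactor; }

    // Returns the range [start, end) of segments to merge, or null if done.
    int[] pickMerge(List<Integer> segmentSizes) {
        if (segmentSizes.size() >= mergeFactor) {
            return new int[] {0, mergeFactor};
        }
        return null;
    }
}

public class CascadeDemo {
    public static void main(String[] args) {
        ToyCascadingPolicy policy = new ToyCascadingPolicy(3);
        List<Integer> sizes = new ArrayList<>(List.of(1, 1, 1, 1, 1, 1, 1));
        int merges = 0;
        int[] spec;
        // The "cascade" is now an external loop: merge, re-check, repeat.
        while ((spec = policy.pickMerge(sizes)) != null) {
            int sum = 0;
            for (int i = spec[1] - 1; i >= spec[0]; i--) sum += sizes.remove(i);
            sizes.add(spec[0], sum);
            merges++;
        }
        System.out.println(merges + " " + sizes);
    }
}
```

Whether that trade (simpler wrapping at the cost of per-call state in fields) is worth it is exactly the disagreement in this thread.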

          Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
          just with a separate thread. And so all synchronization required is
          still inside IndexWriter (I think?).

          That's my idea.

          The synchronization is still tricky, since parts of segmentInfos are getting changed at various times and there are references and/or copies of it other places. And as Ning pointed out to me, we also have to deal with buffered delete terms. I'd say I got about 80% of the way there on the last go around. I'm hoping to get all the way this time.
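The wrapper idea being discussed — a concurrent policy that delegates merge *selection* to any serial policy and then hands the chosen merge to a background thread which calls back into the writer — might look roughly like this. All names here (Merger, SerialPolicy, ConcurrentWrapper) are hypothetical stand-ins, not the actual Lucene API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

interface Merger { void merge(String spec) throws Exception; }  // stands in for IndexWriter.merge(spec)
interface SerialPolicy { String pickMerge(); }                  // returns a merge spec, or null

class ConcurrentWrapper {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final SerialPolicy policy;
    private final Merger merger;

    ConcurrentWrapper(SerialPolicy policy, Merger merger) {
        this.policy = policy;
        this.merger = merger;
    }

    // Selection runs on the calling thread; execution runs on the pool
    // thread, so all synchronization stays inside the Merger (IndexWriter).
    Future<?> checkForMerges() {
        String spec = policy.pickMerge();
        if (spec == null) return CompletableFuture.completedFuture(null);
        return pool.submit(() -> { merger.merge(spec); return null; });
    }

    void close() { pool.shutdown(); }
}

public class ConcurrentDemo {
    public static void main(String[] args) throws Exception {
        StringBuilder log = new StringBuilder();
        SerialPolicy p = new SerialPolicy() {
            boolean done = false;
            public String pickMerge() { if (done) return null; done = true; return "0-3"; }
        };
        ConcurrentWrapper w = new ConcurrentWrapper(p, spec -> log.append("merged " + spec));
        w.checkForMerges().get();  // wait for the background merge to finish
        w.close();
        System.out.println(log);
    }
}
```

This only illustrates the thread hand-off; it deliberately ignores the hard parts raised above (segmentInfos mutation, buffered delete terms, cascading after a concurrent merge completes).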

          In fact, if we stick with the current MergePolicy API, aren't you
          going to have to put some locking into eg the LogDocMergePolicy when
          concurrent merges might be happening?

          Yes and no. If CMP implements IndexMerger, I think we might be okay? In the previous iteration, I used derivation so that ConcurrentLogDocMergePolicy derived from the serial version but had the necessary threading. I agree that a wrapper is a better solution if it can be made to work.

          steven_parkes Steven Parkes added a comment -

          On a related note, Mike, there a few FIXME's in IW related to useCompoundFile: it doesn't exist in IW anymore (other than as a deprecated feature). The goal was to allow merge policies to decide when to use compound files, perhaps on a segment-by-segment basis. That all works fine for merge operations.

          But there's also the case of new segments: what format should they be in? These are segments that are going to be created by IW (via DocsWriter?). My stab at this was to have IW ask the merge policy. Since this isn't a merge, per se, the IW passes the current set of segment infos and the new segment info to the merge policy, asking what format the new segment should use. So MergePolicy has

          boolean useCompoundFile(SegmentInfos segments, SegmentInfo newSegment);

          Looking at IW, with the new DocsWriter stuff, it looks like there isn't a segmentInfo object for the new segment at the time the predicate is being called. Would it be possible to make one? Something like DocumentsWriter#getDocStoreSegmentInfo() analogous to DocumentsWriter#getDocStoreSegment()? It could be an object just thrown away.

          Is this a bad idea?

          mikemccand Michael McCandless added a comment -

          > Looking at IW, with the new DocsWriter stuff, it looks like there
          > isn't a segmentInfo object for the new segment at the time the
          > predicate is being called. Would it be possible to make one?
          > Something like DocumentsWriter#getDocStoreSegmentInfo() analogous to
          > DocumentsWriter#getDocStoreSegment()? It could be an object just
          > thrown away.

          Hmmm: it looks like you're calling
          "mergePolicy.useCompoundFile(segmentInfos,null)" twice: once inside
          flushDocStores and immediately thereafter when flushDocStores returns
          into the flush() code. Maybe you should change flushDocStores to
          return whether or not the flushed files are in compound format
          instead?

          Since flushDocStores() is flushing only the "doc store" index files
          (stored fields & term vectors) it's not a real "segment" so it's a
          somewhat forced fit to make a SegmentInfo in this case. I guess we
          could make a "largely empty" SegmentInfo and fill in what we can, but
          that's somewhat dangerous (eg methods like files() would have to be
          fixed to deal with this).

          Maybe, instead, we could use one of the SegmentInfo instances from a
          segment that refers to this doc store segment? This would just mean
          stepping through all SegmentInfo's and finding the first one (say)
          whose docStoreSegment equals the one we are now flushing? Still it
          would be good to differentiate this case (creating compound file for
          doc store files vs for a real segment) to MergePolicy somehow (maybe
          add a boolean arg "isDocStore" or some such?).
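The "step through all SegmentInfo's and find the first one whose docStoreSegment matches" idea reduces to a simple linear scan. A toy rendering (ToySegmentInfo is illustrative; this is not Lucene's SegmentInfo class):

```java
import java.util.List;

class ToySegmentInfo {
    final String name;
    final String docStoreSegment;  // shared doc store this segment refers to
    ToySegmentInfo(String name, String docStoreSegment) {
        this.name = name;
        this.docStoreSegment = docStoreSegment;
    }
}

public class DocStoreLookup {
    // Find the first segment that refers to the doc store being flushed.
    static ToySegmentInfo findByDocStore(List<ToySegmentInfo> infos, String docStore) {
        for (ToySegmentInfo si : infos) {
            if (docStore.equals(si.docStoreSegment)) return si;
        }
        return null;  // no segment refers to this doc store yet
    }

    public static void main(String[] args) {
        List<ToySegmentInfo> infos = List.of(
            new ToySegmentInfo("_0", "_ds0"),
            new ToySegmentInfo("_1", "_ds0"),
            new ToySegmentInfo("_2", "_ds1"));
        System.out.println(findByDocStore(infos, "_ds1").name);
    }
}
```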

          steven_parkes Steven Parkes added a comment -

          Ah. I understand better now. I have to admit, I haven't kept up to date on some of the deeper file stuff in LUCENE-843.

          It seems to me there's quite a bit of difference between segment files and doc store files. Since doc store files can be referred to by multiple segments, they can get quite large. I don't have any trouble imagining that a merge policy might want to CFS 10MB segments but not 10GB doc stores?

          I'm thinking maybe a MergePolicy#useCompoundDocStore( SegmentInfos ) makes sense? The naive case would still just use the static setting we have now, but we could think about a better implementation:

          • Maybe never CFS doc store files? Is that an unreasonable default? It just strikes me that there should be far fewer of these, so we don't need to, and on the other end, if they are big, CFS'ing them is going to be expensive.
          • Do something smart but easy depending on number and size
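The "do something smart but easy depending on size" option could be as simple as a size threshold: pack small doc stores into a compound file, leave big ones loose where repacking would be expensive. A hedged sketch — the method name and the 10MB cutoff are made up for illustration:

```java
public class DocStoreHeuristic {
    // Illustrative cutoff: beyond this, building a CFS costs more than it saves.
    static final long MAX_CFS_DOC_STORE_BYTES = 10L * 1024 * 1024;  // 10 MB

    static boolean useCompoundDocStore(long docStoreSizeBytes) {
        return docStoreSizeBytes <= MAX_CFS_DOC_STORE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(useCompoundDocStore(10 * 1024 * 1024));          // 10 MB: pack
        System.out.println(useCompoundDocStore(10L * 1024 * 1024 * 1024));  // 10 GB: leave loose
    }
}
```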
          mikemccand Michael McCandless added a comment -

          > I'm thinking maybe a MergePolicy#useCompoundDocStore( SegmentInfos )
          > makes sense? The naive case would still just use the static setting
          > we have now, but we could think about a better implementation:

          I think adding that new method to MergePolicy is great! And, just
          using the same "useCompoundFile" as for normal segments is good for
          starters (and, this matches what's done today).

          mikemccand Michael McCandless added a comment -

          New feedback:

          • Are you going to fix all unit tests that call the now-deprecated
            APIs? (You should still leave a few tests using the deprecated
            APIs to make sure they in fact continue to work, but most tests
            should not use the deprecated APIs).

          Responses to previous feedback:

          > As an example, it's not a good idea for merge policies to be able to
          > access IndexWriter#segmentInfos directly. (That's a case I would
          > like to solve by making segmentInfos private, but I haven't looked
          > at that completely yet and even beyond that case, I'd still like
          > merge policies to have a very clean interface with IWs.)

          Agreed, but the solution to that is to make segmentInfos private, not
          to make a whole new interface. Ie, this is an IndexWriter problem, so
          we should fix it in IndexWriter.

          > > While LogDocMergePolicy does need "maxBufferedDocs", I think,
          > > instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
          > > through" to the LogDocMergePolicy if that is the merge policy in
          > > use (and LogDocMergePolicy should store its own private
          > > "minMergeDocs").
          >
          > The thing here is that the merge policy has nothing to do with max
          > buffered docs, right? The code for buffering docs for the first
          > segment is wholly in the IndexWriter. LogDocMergePolicy happens to
          > need it (in its current incarnation) in order to handle the way the
          > log levels are computed. This could, of course, be changed. There's
          > nothing that says a merge policy has to look at these values, just
          > that they're available should the merge policy want to look.

          Exactly: these settings decide when a segment is flushed, so, why put
          them into IndexMerger interface? They shouldn't have anything to with
          merging; I think they should be removed.

          For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that
          does not use maxBufferedDocs.

          > I guess my idea was that the index writer was responsible for
          > handling the initial segment (with DocsWriter, if it wants) and also
          > to provide an indication to the merge policies how it was handling
          > them.
          >
          > It's possible to have the merge policy influence the first segment
          > size but that opens up a lot of issues. Does every merge policy then
          > have to know how to trade between buffering by doc count and
          > buffering by ram? I was hoping to avoid that.

          Yeah, I don't think merge policy should dictate flushing either;
          especially because app logic above IndexWriter can already easily call
          flush() if necessary.

          > > In LogDocMergePolicy, it seems like the checking that's done for
          > > whether a SegmentInfo is in a different directory, as well as the
          > > subsequent copy to move it over to the IndexWriter's directory,
          > > should not live in the MergePolicy?
          >
          > I see two parts to this.
          >
          > First, I hesitate to make it not possible for merge policy to see
          > the directory information, i.e., remove
          > IndexMerger#getDirectory(). Copying a segment is one way to get it
          > into the current directory, but merging with other segments does
          > too. A merge policy could decide to go ahead and merge a bunch of
          > segments that are in other directories rather than just copy them
          > individually. Taking away getDirectory() removes that option.

          Agreed, a "sophisticated" merge policy would go and merge segments in
          other directories. But, it should not have to.

          I'm not proposing making it "not possible"; I'm proposing the merge
          policy is given IndexWriter at which point it can getDirectory() from
          it. (Ie the extra interface solely for this purpose is overkill).

          > As to how to handle the case where single segments are copied, I
          > guess my main reason for leaving that in the merge policy would be
          > for simplicity. Seems nicer to have all segment amalgamation
          > management in one place, rather than some in one class and some in
          > another. Could be factored into an optional base merge policy for
          > derived classes to use as they might like.

          I don't see this as simpler: I see it as making the MergePolicy
          writer's job more complex. I also see it as substantial duplicated
          code (I just had to copy/paste a bunch of code in working on my
          MergePolicy in LUCENE-845).

          I think factoring into a base class is an OK solution, but, it
          shouldn't be MergePolicy's job to remember to call this final "move
          any segments in the wrong directory over" code. As long as it's in one
          place and people don't have to copy/paste code between MergePolicy
          sources.

          > > The "checkOptimize" method in LogDocMergePolicy seems like it
          > > belongs back in IndexWriter: I think we don't want every
          > > MergePolicy having to replicate that tricky while condition.
          >
          > The reason for not doing this was I can imagine different merge
          > policies having a different model of what optimize means. I can
          > imagine some policies that would not optimize everything down to a
          > single segment but instead obeyed a max segment size. But we could
          > factor the default conditional into an optional base class, as
          > above.
          >
          > It is an ugly conditional that there might be better way to
          > formulate, so that policies don't have to look at the grody details
          > like hasSeparateNorms. But I'd still like the merge policies to be
          > able to decide what optimize means at a high level.

          Agreed: I too don't want to preclude variants to optimize like
          "optimize to at most N segments". (Maybe we should add that method,
          now, to IndexWriter, and fix MergePolicy to work with this?).

          So, yes, please at least factor this out into a base class. In
          LUCENE-845 this was another copy/paste for me (ick). I think there
          should in fact be a default optimize() in the base class that does
          what current IndexWriter now does so that a MergePolicy need not
          implement optimize at all.

          > > Maybe we could change MergePolicy.merge to take a boolean "forced"
          > > which, if true, means that the MergePolicy must now pick a merge
          > > even if it didn't think it was time to. This would let us move
          > > that tricky while condition logic back into IndexWriter.
          >
          > I didn't follow this. forced == optimize? If not, what does pick a
          > merge mean? Not sure what LogDoc should do for merge(force=true) if
          > it thinks everything is fine?

          No, forced would mean the merge policy must do a merge; whereas,
          normally, it's free not to do a merge until it wants to. Instead of
          boolean argument we could have two methods, one called "merge" (you
          must merge) and one called "maybeMerge" or "checkForMerges".

          Ie, optimize is really a series of forced merges: we are merging
          segments from different levels, N times, until we are down to 1
          segment w/ no deletes, norms, etc.
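The two-method split being proposed — "checkForMerges" may decline, while the forced form must pick a merge whenever more than one segment remains — could look like this. The interface and class names are hypothetical, not Lucene's actual API:

```java
import java.util.List;

interface ToyMergePolicy {
    int[] checkForMerges(List<Integer> segs);  // may return null: nothing to do yet
    int[] merge(List<Integer> segs);           // forced: must pick a merge if segs.size() > 1
}

class ToyLogPolicy implements ToyMergePolicy {
    final int mergeFactor;
    ToyLogPolicy(int mergeFactor) { this.mergeFactor = mergeFactor; }

    public int[] checkForMerges(List<Integer> segs) {
        // Only merge once a full "level" of mergeFactor segments has accumulated.
        return segs.size() >= mergeFactor ? new int[] {0, mergeFactor} : null;
    }

    public int[] merge(List<Integer> segs) {
        // Forced (during optimize): merge whatever is there, even below mergeFactor.
        return segs.size() > 1 ? new int[] {0, segs.size()} : null;
    }
}

public class ForcedMergeDemo {
    public static void main(String[] args) {
        ToyLogPolicy p = new ToyLogPolicy(10);
        List<Integer> twoSegs = List.of(100, 5);
        System.out.println(p.checkForMerges(twoSegs) == null);  // declines: not a full level
        System.out.println(p.merge(twoSegs)[1]);                // forced: merges both
    }
}
```

With this shape, IndexWriter can own the "keep forcing merges until optimized" loop while the policy only answers the narrower question of which segments to combine.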

          > > Also, I think at some point we may want to have an optimize()
          > > method that takes an int parameter stating the max # segments in
          > > the resulting index.
          >
          > I think this is great functionality for a merge policy, but what
          > about just making that part of the individual merge policy
          > interface, rather than linked to IndexWriter? That was one goal of
          > making a factored merge policy: that you could do these tweaks
          > without changing IndexWriter.

          Well, it's sort of awkward if you want to vary that max # segments.
          Say during the day you optimize down to 15 segments every time you
          update the index, but then at night you want to optimize down to 5.
          If we don't add method to IndexWriter you then must have instance var
          on your MergePolicy that you set, then you call optimize. It's not
          clean since really it should be a parameter.

          And, with the merge/maybeMerge separation above, every merge policy
          could have a default implementation for optimize(int maxNumSegments)
          (in MergePolicy base class or in IndexWriter).
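A default optimize(maxNumSegments) built on forced merges might be nothing more than "merge until at most N segments remain". A toy sketch of that loop (merging the two smallest each round is just one illustrative strategy, not a claim about what Lucene does):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OptimizeDemo {
    // Repeatedly merge the two smallest "segments" (here, just sizes)
    // until at most maxNumSegments remain.
    static List<Integer> optimize(List<Integer> segs, int maxNumSegments) {
        List<Integer> s = new ArrayList<>(segs);
        while (s.size() > maxNumSegments) {
            Collections.sort(s);                  // smallest first
            int merged = s.remove(0) + s.remove(0);
            s.add(merged);                        // merge the two smallest
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(optimize(List.of(50, 10, 10, 5, 5), 2));
    }
}
```

The point of the parameter is exactly the day/night scenario above: the same index can be optimized down to 15 segments cheaply during the day and to a smaller count overnight, without stashing the target in a policy instance variable.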

          > > Can we support non-contiguous merges? If I understand it
          > > correctly, the MergeSpecification can express such a merge, it's
          > > just that the current IndexMerger (IndexWriter) cannot execute it,
          > > right? So we are at least not precluding fixing this in the
          > > future.
          >
          > As far as I've seen so far, there are no barriers to non-contiguous
          > merges. Maybe something will crop up that is, but in what I've done,
          > I haven't seen any.

          Wait: there is a barrier, right? In IndexWriter.replace we don't do
          the right thing with non-contiguous merges? What I'm asking is: is
          that the only barrier? Ie MergePolicy API will not need to change in
          the future once we fix IndexWriter.replace to handle non-contiguous
          merges?

          > > It confuses me that MergePolicy has a method "merge(...)" – can
          > > we rename it to "maybeMerge(..)" or "checkForMerge(...)"?
          >
          > I suppose. I'm not a big fan of the "maybeFoo" style of naming. I
          > think of "merge" like "optimize": make it so / idempotent. But I'm
          > certainly willing to write whatever people find clearest.

          I'm not wed to "maybeMerge()" but I really don't like "merge": it's
          overloaded now.

          EG IndexMerger interface has a method called "merge" that is named
          correctly because it will specifically go and do the requested merge.
          So does IndexWriter.

          Then, you have other [overloaded] methods in LogDocMergePolicy called
          "merge" that are named appropriately (they will do a specific merge).

          How about "checkForMerges()"?

          > > Instead of IndexWriter.releaseMergePolicy() can we have
          > > IndexWriter only close the merge policy if it was the one that had
          > > created it? (Similar to how IndexWriter closes the dir if it has
          > > opened it from a String or File, but does not close it if it was
          > > passed in).
          >
          > This precludes
          >
          > iw.setMergePolicy(new MyMergePolicy(...));
          > ...
          > iw.close();
          >
          > The implementation's much cleaner using the semantics you describe,
          > but I was thinking it'd be better to optimize for the usability of the
          > common client code case?

          The thing is, that method leaves IndexWriter in a broken state (null
          mergePolicy). What if you keep adding docs after that then suddenly
          hit an NPE?

          Also, I'm OK if people need to separately close their MergePolicy
          instances: this is an advanced use of Lucene so it's OK to expect that
          ("simple things should be simple; complex things should be possible").

          Maybe we could add a "setMergePolicy(MergePolicy policy, boolean
          doClose)" and default doClose to true?

          Finally: does MergePolicy really need a close()? Is this overkill (at
          this point)?

          > > Well I think the current MergePolicy API (where the "merge" method
          > > calls IndexWriter.merge itself, must cascade itself, etc.) makes it
          > > hard to build a generic ConcurrentMergePolicy "wrapper" that you
          > > could use to make any MergePolicy concurrent. How would you do
          > > it?
          >
          > I really haven't had time to go heads down on this (the old
          > concurrent merge policy was a derived class rather than a wrapper
          > class). But I was thinking that perhaps ConcurrentMergePolicy would
          > actually wrap IndexWriter as well as the serial merge policy, i.e.,
          > implement IndexMerger (my biggest argument for IM at this
          > point). But I haven't looked deeply at whether this will work but I
          > think it has at least a chance.
          >
          > I should know more about this in a day or two.

          I don't see how it can work (building the generic concurrency wrapper
          "under" IndexMerger) because the MergePolicy is in "serial control",
          eg, when it wants to cascade merges. How will you return that thread
          back to IndexWriter?

          Also it feels like the wrong place for concurrency – I think
          generally for "macro" concurrency you want it higher up, not lower
          down, in the call stack.

          With the concurrency wrapper "on the top" it's able to easily take a
          merge request as returned by the policy, kick it off in the
          background, and immediately return control of the original thread
          back to IndexWriter.

          But if you see a way to make it work "on the bottom", let's definitely
          explore it & understand the tradeoffs.

          > > I think you can still have state (as instance variables in your
          > > class)? How would this simplification restrict the space of merge
          > > policies?
          >
          > s/state/stack state/. Yeah, you can always unwind your loops and
          > lift your recursions, put all that stack state into instance
          > variables. But, well, yuck. I'd like to make it easy to write simple
          > merge policies and take up the heavy lifting elsewhere. Hopefully
          > there will be more merge policies than index writers.

          Can you give an example of the kind of "state" we're talking about?
          Is this just academic?

          Since the segments change in an unpredictable way (as seen by
          MergePolicy) eg from addIndexes*, flushing, concurrent merge swapping
          things at random times (thus requiring careful locking), etc, it seems
          like you can't be keeping much state (stack or instance) that depends
          on what segments are in the index?

          > > Oh, no: ConcurrentMergePolicy would still call
          > > IndexWriter.merge(spec), just with a separate thread. And so all
          > > synchronization required is still inside IndexWriter (I think?).
          >
          > That's my idea.

          Actually I was talking about my idea (to "simplify MergePolicy.merge
          API"). With the simplification (whereby MergePolicy.merge just
          returns the MergeSpecification instead of driving the merge itself) I
          believe it's simple to make a concurrency wrapper around any merge
          policy, and, have all necessary locking for SegmentInfos inside
          IndexWriter.
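
          The simplified API under discussion is easy to sketch. Everything
          below is illustrative (the names MergeSpecification, findMerges, and
          the Merger callback are assumptions, not a committed Lucene API):
          the policy only selects a merge and returns it, so a generic wrapper
          can hand the returned spec to a background thread while the caller's
          thread goes straight back to IndexWriter.

```java
import java.util.List;

// All types below are minimal stand-ins for the API under discussion;
// the names (MergeSpecification, findMerges, Merger) are illustrative.
interface MergeSpecification {}                 // describes which segments to merge

interface MergePolicy {
    // Only *selects* a merge; returns null when there is nothing to do.
    MergeSpecification findMerges(List<String> segments);
}

class ConcurrentMergeWrapper {
    // Plays the role of IndexWriter.merge(spec) in the discussion above.
    interface Merger { void merge(MergeSpecification spec); }

    private final MergePolicy policy;
    private final Merger merger;

    ConcurrentMergeWrapper(MergePolicy policy, Merger merger) {
        this.policy = policy;
        this.merger = merger;
    }

    // IndexWriter would hold its segmentInfos lock while findMerges runs;
    // any selected merge then executes on a background thread, so the
    // caller's thread returns immediately. Returns the merge thread
    // (null if no merge was selected) so callers can await it if they wish.
    Thread checkForMerges(List<String> segments) {
        MergeSpecification spec = policy.findMerges(segments);
        if (spec == null) return null;
        Thread t = new Thread(() -> merger.merge(spec));
        t.start();
        return t;
    }
}
```

          With this shape, a policy like LogDocMergePolicy needs no locking of
          its own: IndexWriter synchronizes around the findMerges call, and
          the wrapper never touches segmentInfos.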

          > > In fact, if we stick with the current MergePolicy API, aren't you
          > > going to have to put some locking into eg the LogDocMergePolicy
          > > when concurrent merges might be happening?
          >
          > Yes and no. If CMP implements IndexMerger, I think we might be okay?

          If CMP implements IndexMerger you must have locking inside any
          MergePolicy that's calling into CMP? Whereas with the simplified
          MergePolicy.merge API, no locking is necessary because IndexWriter
          would lock segmentInfos whenever it calls MergePolicy.merge.

          > In the previous iteration, I used derivation so that
          > ConcurrentLogDocMergePolicy derived from the serial version but had
          > the necessary threading. I agree that a wrapper is a better solution
          > if it can be made to work.

          I think it (concurrency wrapper around any merge policy) can be made
          to work, if we do simplify the MergePolicy.merge API. I'm not sure it
          can be made to work if we don't, but if you have an approach we should
          work through it!

          ningli Ning Li added a comment -

          On 8/8/07, Michael McCandless (JIRA) <jira@apache.org> wrote:
          > Actually I was talking about my idea (to "simplify MergePolicy.merge
          > API"). With the simplification (whereby MergePolicy.merge just
          > returns the MergeSpecification instead of driving the merge itself) I
          > believe it's simple to make a concurrency wrapper around any merge
          > policy, and, have all necessary locking for SegmentInfos inside
          > IndexWriter.

          I agree with Mike. In fact, MergeSelector.select, which is the counterpart
          of MergePolicy.merge in the patch I submitted for concurrent merge,
          simply returns a MergeSpecification. It's simple and sufficient to have
          all necessary locking for SegmentInfos in one class, say IndexWriter.
          For example, IndexWriter locks SegmentInfos when MergePolicy(MergeSelector)
          picks a merge spec. Another example, when a merge is finished, say
          IndexWriter.checkin is called which locks SegmentInfos and replaces
          the source segment infos with the target segment info.

          On 8/7/07, Steven Parkes (JIRA) <jira@apache.org> wrote:
          > The synchronization is still tricky, since parts of segmentInfos are
          > getting changed at various times and there are references and/or
          > copies of it other places. And as Ning pointed out to me, we also
          > have to deal with buffered delete terms. I'd say I got about 80% of
          > the way there on the last go around. I'm hoping to get all the way
          > this time.

          It just occurred to me that there is a neat way to handle deletes that
          are flushed during a concurrent merge. For example, MergePolicy
          decides to merge segments B and C, with B's delete file 0001 and
          C's 100. When the concurrent merge finishes, B's delete file becomes
          0011 and C's 110. We do a simple computation on the delete bit
          vectors and check in the merged segment with delete file 00110.
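
          The bit-vector computation above can be sketched in a few lines.
          This is only an illustration of the idea, with boolean[] standing in
          for Lucene's BitVector (true = deleted): a doc makes it into the
          merged segment iff it was live when the merge started, and its bit
          in the merged vector is simply its delete state now.

```java
// Fold deletes flushed during a concurrent merge into the merged
// segment's delete vector. atMergeStart[s][d] / now[s][d] give the
// delete state of doc d in source segment s at merge start and at
// check-in time respectively.
class DeleteFold {
    static boolean[] foldDeletes(boolean[][] atMergeStart, boolean[][] now) {
        int live = 0;
        for (boolean[] seg : atMergeStart)
            for (boolean del : seg) if (!del) live++;
        boolean[] merged = new boolean[live];
        int out = 0;
        for (int s = 0; s < atMergeStart.length; s++)
            for (int d = 0; d < atMergeStart[s].length; d++)
                if (!atMergeStart[s][d])          // doc survived into the merged segment
                    merged[out++] = now[s][d];    // deleted since the merge began?
        return merged;
    }
}
```

          Running it on the example (B: 0001 -> 0011, C: 100 -> 110) yields
          00110, matching the check-in above.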

          ningli Ning Li added a comment -

          The following comments are about the impact on merge if we add
          "deleteDocument(int doc)" (and deprecate IndexModifier). Since it
          concerns the topic in this issue, I also post it here to get your opinions.

          I'm thinking about the impact of adding "deleteDocument(int doc)" on
          LUCENE-847, especially on concurrent merge. The semantics of
          "deleteDocument(int doc)" is that the document to delete is specified
          by the document id on the index at the time of the call. When a merge
          is finished and the result is being checked into IndexWriter's
          SegmentInfos, document ids may change. Therefore, it may be necessary
          to flush buffered delete doc ids (thus buffered docs and delete terms
          as well) before a merge result is checked in.

          The flush is not necessary if there are no buffered delete doc ids. I
          don't think it should be the reason not to support "deleteDocument(int
          doc)" in IndexWriter. But its impact on concurrent merge is a concern.

          mikemccand Michael McCandless added a comment -

          > It just occurred to me that there is a neat way to handle deletes
          > that are flushed during a concurrent merge. For example, MergePolicy
          > decides to merge segments B and C, with B's delete file 0001 and C's
          > 100. When the concurrent merge finishes, B's delete file becomes
          > 0011 and C's 110. We do a simple computation on the delete bit
          > vectors and check in the merged segment with delete file 00110

          Excellent! This lets you efficiently merge in the additional deletes
          (if any) that were flushed against each of the merged segments after
          the merge had begun. Furthermore, I think this is all contained
          within IndexWriter, right?

          Ie when we go to "replace/checkin" the newly merged segment, this
          "merge newly flushed deletes" would execute at that time. And, I
          think, we would block flushes while this is happening, but
          addDocument/deleteDocument/updateDocument would still be allowed?

          It should in fact be quite fast to run since the delete BitVectors
          are all in RAM.

          > I'm thinking about the impact of adding "deleteDocument(int doc)" on
          > LUCENE-847, especially on concurrent merge. The semantics of
          > "deleteDocument(int doc)" is that the document to delete is
          > specified by the document id on the index at the time of the
          > call. When a merge is finished and the result is being checked into
          > IndexWriter's SegmentInfos, document ids may change. Therefore, it
          > may be necessary to flush buffered delete doc ids (thus buffered
          > docs and delete terms as well) before a merge result is checked in.
          >
          > The flush is not necessary if there is no buffered delete doc ids. I
          > don't think it should be the reason not to support
          > "deleteDocument(int doc)" in IndexWriter. But its impact on
          > concurrent merge is a concern.

          Couldn't we also just update the docIDs of pending deletes, and not
          flush? Ie we know the mapping of old -> new docID caused by the
          merge, so we can run through all deleted docIDs and remap?
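
          A sketch of that remapping, with all names and the docMap
          representation assumed for illustration: build the old -> new global
          docID map implied by merging a contiguous run of segments (deleted
          docs squeezed out, later segments shifted down), then translate each
          buffered delete docID through it instead of flushing.

```java
// Compute old->new global docIDs after merging a contiguous run of
// segments. base is the first global docID of the run; segDocCounts and
// deleted describe only the segments in the run; totalDocs is the whole
// index size. -1 marks a doc removed by the merge, so a buffered
// delete-by-docID entry mapped to -1 can simply be dropped.
class DocIdRemap {
    static int[] buildDocMap(int totalDocs, int base,
                             int[] segDocCounts, boolean[][] deleted) {
        int[] map = new int[totalDocs];
        for (int i = 0; i < base; i++) map[i] = i;   // segments before the run are untouched
        int oldId = base, newId = base;
        for (int s = 0; s < segDocCounts.length; s++)
            for (int d = 0; d < segDocCounts[s]; d++)
                map[oldId++] = deleted[s][d] ? -1 : newId++;
        int removed = oldId - newId;                 // deletions squeezed out by the merge
        for (int i = oldId; i < totalDocs; i++)
            map[i] = i - removed;                    // later segments shift down
        return map;
    }
}
```

          For segments A (3 docs), B (4), C (3), merging B and C with B's doc
          3 and C's doc 0 deleted: old global IDs 6 and 7 disappear, and old
          IDs 8, 9 become 6, 7.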

          ningli Ning Li added a comment -

          > Furthermore, I think this is all contained within IndexWriter, right?
          > Ie when we go to "replace/checkin" the newly merged segment, this
          > "merge newly flushed deletes" would execute at that time. And, I
          > think, we would block flushes while this is happening, but
          > addDocument/deleteDocument/updateDocument would still be allowed?

          Yes and yes.

          > Couldn't we also just update the docIDs of pending deletes, and not
          > flush? Ie we know the mapping of old -> new docID caused by the
          > merge, so we can run through all deleted docIDs and remap?

          Hmm, I was worried quite a number of delete docIDs could be buffered,
          but I guess it's still better than having to do a flush. So yes, this is better!

          steven_parkes Steven Parkes added a comment -

          > Are you going to fix all unit tests that call the now-deprecated APIs?

          Yeah. Thanks for the reminder.

          As to the IndexWriter vs. IndexMerger issue, I still think having the interface is useful, if only because it makes my testing much easier. I have a MockIndexMerger that implements only the functions in the interface, so I can test merge policies without creating a writer. For me this has been a big win ...

          Exactly: these settings decide when a segment is flushed, so, why put
          them into IndexMerger interface? They shouldn't have anything to do with
          merging; I think they should be removed.

          For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that
          does not use maxBufferedDocs.

          I can see that one could write a merge policy that didn't have any idea of how the initial buffering was done, but I worry about precluding it. Maybe the LUCENE-845 patch will show a strong enough pattern to believe no merge policies will need it?

          I think factoring into a base class is an OK solution, but, it
          shouldn't be MergePolicy's job to remember to call this final "move
          any segments in the wrong directory over" code. As long as its in one
          place and people don't have to copy/paste code between MergePolicy
          sources.

          In the case of concurrent merges, I think this gets more complicated. When do you do those directory copies? I think you can't do them at the return from the merge policy because the merge policy may want to do them, but later.

          I don't think IndexWriter has enough information to know when the copies need to be done. Doubly so if we have concurrent merges?

          I still stand by it should be the merge policy making the choice. You could have the code in IndexWriter too, but then there'd be duplicate code. To put the code only in IndexWriter removes the choice from the merge policy.

          I think there
          should in fact be a default optimize() in the base class that does
          what current IndexWriter now does so that a MergePolicy need not
          implement optimize at all.

          It'd be nice, but I don't know how to do it: the merge factor is not generic, so I don't know how to implement the loop generically.

          Ah ... I see: with your forced merge ... hmmm.

          No, forced would mean the merge policy must do a merge; whereas,
          normally, it's free not to do a merge until it wants to.

          Hmmmm ...

          I think adding a forced merge concept here is new ... If it's simply to support optimize, I'm not sure I find it too compelling. LogDoc as it stands uses different algorithms for incremental merges and optimize, so there's not too much of a concept of forced merges vs. optional merges to be factored out. So I guess I'm not seeing a strong compelling case for creating it?

          Well, it's sort of awkward if you want to vary that max # segments.
          Say during the day you optimize down to 15 segments every time you
          update the index, but then at night you want to optimize down to 5.
          If we don't add method to IndexWriter you then must have instance var
          on your MergePolicy that you set, then you call optimize. It's not
          clean since really it should be a parameter.

          Well, I don't know if I buy the argument that it should be a parameter. The merge policy has lots of state like docs/seg. I don't really see why segs/optimize is different.

          My main reason for not wanting put this into IndexWriter is then every merge policy must support it.

          Wait: there is a barrier, right? In IndexWriter.replace we don't do
          the right thing with non-contiguous merges?

          Yeah, I meant that I'm not aware of any barriers except fixing IndexWriter#replace, in other words, I'm not aware of any other places where non-contiguity would cause a failure.

          I'm not wed to "maybeMerge()" but I really don't like "merge". It's
          overloaded now.

          EG IndexMerger interface has a method called "merge" that is named
          correctly because it will specifically go and do the requested merge.
          So does IndexWriter.

          Then, you have other [overloaded] methods in LogDocMergePolicy called
          "merge" that are named appropriately (they will do a specific merge).

          How about "checkForMerges()"?

          I don't find it ambiguous based on class and argument type. It's all personal, of course.

          I'd definitely prefer maybe over checkFor because that sounds like a predicate.

          Maybe we could add a "setMergePolicy(MergePolicy policy, boolean
          doClose)" and default doClose to true?

          That sounds good.

          Finally: does MergePolicy really need a close()?

          I think so. The concurrent merge policy maintains all sorts of state.

          I don't see how it can work (building the generic concurrency wrapper
          "under" IndexMerger) because the MergePolicy is in "serial control",
          eg, when it wants to cascade merges. How will you return that thread
          back to IndexWriter?

          So this is how it looks now: the concurrent merge policy is both a merge policy and an index merger. The serial merge policy knows nothing about it other than that it does not get IndexWriter as its merger.

          The index writer wants its merge, so it does its merge/maybeMerge call on the concurrent merge policy. The CMP calls merge on the serial policy, but substitutes itself, rather than IndexWriter, as the merger.

          The serial merge policy goes on its merry way, looking for merges to do (in the current model, this is a loop; more on that in a minute). Each time it has a subset of segments to merge, it calls merger.merge(...).

          At this point, the concurrent merge policy takes over again. It looks at the segments to be merged and other segments being processed by all existing merge threads and determines if there's a conflict (a request to merge a segment that's currently in a merge). If there's no conflict, it starts a merge thread and calls IndexWriter#merge on the thread. The original calling thread returns immediately. (I have a few ideas how to handle conflicts, the simplest of which is to wait for the conflicting merge and then restart the serial merge, i.e., revert to serial).

          This seems to work pretty well, so far. The only difference in API for the serial merges is that the merge operation can't return the number of documents in the result (since it isn't known how many docs will be deleted).
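The conflict check described above could be sketched roughly as follows. This is illustrative only, not code from the patch; the class and method names are hypothetical, and segments are represented by their names for simplicity.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the CMP conflict check: a merge may start only
// if none of its segments is already part of a running merge.
class MergeConflictChecker {
    private final Set<String> segmentsInFlight = new HashSet<>();

    /** Reserves the segments and returns true iff no running merge touches them. */
    synchronized boolean tryStartMerge(List<String> segmentsToMerge) {
        for (String seg : segmentsToMerge) {
            if (segmentsInFlight.contains(seg)) {
                return false;   // conflict: wait for the running merge instead
            }
        }
        segmentsInFlight.addAll(segmentsToMerge);
        return true;
    }

    /** Called by a merge thread once its result has been checked in. */
    synchronized void mergeFinished(List<String> mergedSegments) {
        segmentsInFlight.removeAll(mergedSegments);
    }
}
```

On a conflict the caller would fall back to the "revert to serial" strategy mentioned above, i.e. wait for the in-flight merge to finish and re-run the serial policy.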

          With concurrency wrapper "on the top" it's able to easily take a merge
          request as returned by the policy, kick it off in the background, and
          immediately return control of the original thread back to IndexWriter.

          What I don't know how to do with this is figure out how to do a bunch of merges. Let's say I have two levels in LogDoc that are merge-worthy. If I call LogDoc, it'll return the lower level. That's all good. But what about doing the higher level in parallel? If I call LogDoc again, it's going to return the lower level again because it knows nothing about the current merge going on.

          LogDoc already does things in a loop: it's pretty much set up to call all possible merges at one time (if they return immediately).

          It would be possible to have it return a vector of segmentInfo subsets, but I don't see the gain (and it doesn't work out as well for my putative conflict resolution).

          have all necessary locking for SegmentInfos inside
          IndexWriter

          This was a red herring on my part. All the "segmentInfos locking" has always been in IndexWriter. That's not exactly sufficient. The fundamental issue is that IndexWriter#merge has to operate without a lock on IndexWriter. At some point, I was thinking that meant it would have to lock SegmentInfos but that's ludicrous, actually. It's sufficient for IndexWriter#replace to be synchronized.

          If CMP implements IndexMerger you must have locking inside any
          MergePolicy that's calling into CMP?

          No. CMP does its own locking (for purposes of thread management) but the serial merge policies know nothing of this (and they can expect to be called synchronously).

          steven_parkes Steven Parkes added a comment -

          It just occurred to me that there is a neat way to handle deletes that
          are flushed during a concurrent merge. For example, MergePolicy
          decides to merge segments B and C, with B's delete file 0001 and
          C's 100. When the concurrent merge finishes, B's delete file becomes
          0011 and C's 110. We do a simple computation on the delete bit
          vectors and check in the merged segment with delete file 00110.

          Well, that makes my life much easier. Now I don't have to figure out what to do, just have to make it so ...

          Thanks!
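The bit-vector computation Ning describes can be sketched as follows. This is illustrative only: real Lucene stores deletes in BitVector files, and strings are used here purely to mirror the example; bits read left to right with doc 0 first, '1' meaning deleted.

```java
// Hypothetical sketch of combining per-segment delete vectors after a
// concurrent merge finishes.
class MergedDeletes {
    /**
     * oldDel[s]: segment s's deletes at merge start (those docs were dropped
     * by the merge); newDel[s]: deletes at merge finish. Each doc that
     * survived the merge carries over its post-merge delete bit, in order.
     */
    static String compute(String[] oldDel, String[] newDel) {
        StringBuilder merged = new StringBuilder();
        for (int s = 0; s < oldDel.length; s++) {
            for (int doc = 0; doc < oldDel[s].length(); doc++) {
                if (oldDel[s].charAt(doc) == '0') {        // doc survived merge
                    merged.append(newDel[s].charAt(doc));  // carry new delete, if any
                }
            }
        }
        return merged.toString();
    }
}
```

Running this on the example above (B: 0001 → 0011, C: 100 → 110) reproduces the merged segment's delete file 00110.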

          steven_parkes Steven Parkes added a comment -

          Updated patch:

          • Don't call deprecated methods
          • note: currently renamed with "_" prepended to make them easy to find; don't commit
            those
          • Factor MergePolicyBase
          • comments to remind to delete before commit (though might still have missed some)
          • Make LDMP casts not throw bad cast
          • Get rid of releaseMergePolicy and add doClose parameter on set
          • Didn't factor copy from other dirs: requires compound file choices
          • Didn't (yet) rename merge -> maybeMerge
          • Does this mean optimize -> maybeOptimize, too?
          mikemccand Michael McCandless added a comment -

          One new small item: you've added a "public void merge()" to
          IndexWriter so that people can externally kick off a merge request,
          which is good I think.

          But, is it really necessary to flush here? It would be better to not
          flush so that users then have two separate methods (flush() and
          merge()) to do each function independently.

          > > Are you going to fix all unit tests that call the now-deprecated
          > > APIs?
          >
          > Yeah. Thanks for the reminder.

          On thinking about this more ... and on seeing all the diffs ... I no
          longer feel we should be deprecating "get/setUseCompoundFile()" nor
          "get/setMergeFactor()" nor "get/setMaxMergeDocs()" in IndexWriter.

          The vast majority of Lucene users will not make their own merge policy
          (just use the default merge policy) and so I don't think we should be
          complicating their lives with having to now write lines like this when
          they want to change settings:

          ((LogDocMergePolicy)writer.getMergePolicy()).setUseCompoundFile(useCompoundFile);

          Also, this ties our hands if ever we want to change the default merge
          policy (which, under LUCENE-845, I'd like to do).

          I think instead we should leave the methods, not deprecated, as
          convenience (sugar) methods. Simple things should be simple; complex
          things should be possible. Sorry I didn't think of this before you
          made the new patch, Steve.

          > > I'm not wed to "maybeMerge()" but I really don't like "merge".
          > > It's overloaded now.
          > >
          > > EG IndexMerger interface has a method called "merge" that is named
          > > correctly because it will specifically go and do the requested
          > > merge. So does IndexWriter.
          > >
          > > Then, you have other [overloaded] methods in LogDocMergePolicy
          > > called "merge" that are named appropriately (they will do a
          > > specific merge).
          > >
          > > How about "checkForMerges()"?
          >
          > I don't find it ambiguous based on class and argument type. It's all
          > personal, of course.
          >
          > I'd definitely prefer maybe over checkFor because that sounds like a
          > predicate.

          OK let's settle on "maybeMerge"?

          > - Does this mean optimize -> maybeOptimize, too?

          Uh, no: when someone calls optimize that means it really must be done,
          right? So "optimize" is the right name I think.

          > * Make LDMP casts not throw bad cast

          Can you factor this out, eg add a private method
          "getLogDocMergePolicy(String reason)" that would be the one place that
          does the class casting & throwing an error message from one single
          source line? Right now the message is copied in multiple places and,
          it's wrong for mergeFactor (was copied from useCompoundFile).
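The suggested single-source helper might look roughly like this. It is a sketch only: `MergePolicy` and `LogDocMergePolicy` are minimal stand-ins here, and the method name follows the suggestion above rather than the committed patch.

```java
// Hypothetical sketch of the one-place cast helper suggested above.
class CastHelperSketch {
    interface MergePolicy {}
    static class LogDocMergePolicy implements MergePolicy {}

    private final MergePolicy mergePolicy;

    CastHelperSketch(MergePolicy policy) { this.mergePolicy = policy; }

    /** The single place that does the class cast and builds the error message. */
    LogDocMergePolicy getLogDocMergePolicy(String reason) {
        if (!(mergePolicy instanceof LogDocMergePolicy)) {
            throw new IllegalArgumentException(
                "cannot " + reason + ": merge policy is not a LogDocMergePolicy");
        }
        return (LogDocMergePolicy) mergePolicy;
    }
}
```

With this in place, methods like setMergeFactor and setUseCompoundFile each pass their own `reason`, so the message is never copy/pasted or wrong for one of the callers.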

          > * Get rid of releaseMergePolicy and add doClose parameter on set

          Looks good, thanks. Can you add javadocs (w/ params) for both of
          these new methods?

          > As to the IndexWriter vs. IndexMerger issue, I still think having
          > the interface is useful, if only because it makes my testing much
          > easier. I have a MockIndexMerger that implements only the functions
          > in the interface and therefore I can test merge policies without
          > creating a writer. For me this has been a big win ...

          Subclassing IndexWriter to make MockIndexMerger would also work for
          testing? This is what MockRAMDirectory does for example...

          > > Exactly: these settings decide when a segment is flushed, so, why
          > > put them into IndexMerger interface? They shouldn't have anything
          > > to with merging; I think they should be removed.
          > >
          > > For LUCENE-845 I'm working on a replacement for LogDocMergePolicy
          > > that does not use maxBufferedDocs.

          > I can see that one could write a merge policy that didn't have any
          > idea of how the initial buffering was done, but I worry about
          > precluding it. Maybe the LUCENE-845 patch will show a strong enough
          > pattern to believe no merge policies will need it?

          We wouldn't be precluding it (people can still get it from their
          IndexWriter). This is one of the big reasons that I don't like making
          an interface out of IndexMerger: here we are having to pick & choose
          which settings from IndexWriter a merge policy is "allowed" to use. I
          don't think that's necessary (we are just making extra work for
          ourselves) and inevitably we won't get it right...

          > > I think factoring into a base class is an OK solution, but, it
          > > shouldn't be MergePolicy's job to remember to call this final
          > > "move any segments in the wrong directory over" code. As long as
          > > its in one place and people don't have to copy/paste code
          > > between MergePolicy sources.
          >
          > In the case of concurrent merges, I think this gets more
          > complicated. When do you do those directory copies? I think you
          > can't do them at the return from the merge policy because the merge
          > policy may want to do them, but later.
          >
          > I don't think IndexWriter has enough information to know when the
          > copies need to be done. Doubly so if we have concurrent merges?

          Ahh, good point. Though, this raises the tricky question of index
          consistency ... IndexWriter commits the new segments file right after
          mergePolicy.merge returns ... so for CMP we suddenly have an unusable
          index (as seen by an IndexReader). EG if things crash any time after
          this point and before the background merging finishes & commits,
          you're hosed.

          Maybe it's too ambitious to allow merges of segments from other
          directories to run concurrently?

          I would consider it a hard error in IndexWriter if after calling
          mergePolicy.merge from any of the addIndexes*, there remain segments
          in other directories. I think we should catch this & throw an
          exception?

          > I still stand by it should be the merge policy making the
          > choice. You could have the code in IndexWriter too, but then there'd
          > be duplicate code. To put the code only in IndexWriter removes the
          > choice from the merge policy.

          I agree that merge policy should be the one making the choice, but the
          execution of it should be a centralized place (IndexWriter). EG with
          the simplified API, the merge policy would just return, one by one,
          each of the segments that is in a different directory...

          We can't all be copy/pasting code (like I had to do for LUCENE-845)
          for checking & then moving segments across directories. I think we
          need single source for this, somehow.

          > > I think there should in fact be a default optimize() in the base class
          > > that does what current IndexWriter now does so that a MergePolicy need
          > > not implement optimize at all.
          >
          > It'd be nice, but I don't know how to do it: the merge factor is not
          > generic, so I don't know how to implement the loop generically.

          Hmmm, OK. I think what you did (factoring out that massive
          conditional) is good here.

          > Ah ... I see: with your forced merge ... hmmm.
          >
          > No, forced would mean the merge policy must do a merge; whereas,
          > normally, it's free not to do a merge until it wants to.
          >
          > I think adding a forced merge concept here is new ... If it's simply
          > to support optimize, I'm not sure I find it too compelling. LogDoc
          > as it stands uses different algorithms for incremental merges and
          > optimize, so there's not too much of a concept of forced merges
          > vs. optional merges to be factored out. So I guess I'm not seeing a
          > strong compelling case for creating it?

          OK, I agree, let's not add "forced". How about, instead we only
          require mergePolicy to implement optimize(int maxNumSegments)? (And
          current IndexWriter.optimize() calls this with parameter "1").
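The proposed API shape could be sketched as follows; this is a hypothetical outline of the suggestion, not code from the patch.

```java
// Hypothetical sketch: merge policies implement only optimize(int), and the
// no-arg form delegates with maxNumSegments = 1.
abstract class MergePolicySketch {
    /** Merge down until at most maxNumSegments segments remain. */
    abstract void optimize(int maxNumSegments);

    /** What IndexWriter.optimize() would call. */
    final void optimize() {
        optimize(1);
    }
}
```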

          > > Well, it's sort of awkward if you want to vary that max #
          > > segments. Say during the day you optimize down to 15 segments
          > > every time you update the index, but then at night you want to
          > > optimize down to 5. If we don't add method to IndexWriter you
          > > then must have instance var on your MergePolicy that you set,
          > > then you call optimize. It's not clean since really it should be
          > > a parameter.
          >
          > Well, I don't know if I buy the argument that it should be a
          > parameter. The merge policy has lots of state like docs/seg. I don't
          > really see why segs/optimize is different.

          I think this would be a useful enough method that it should be "made
          simple" (ie, this is different from the "other state" that a merge
          policy would store). I opened a separate issue LUCENE-982 to track
          this.

          > My main reason for not wanting put this into IndexWriter is then
          > every merge policy must support it.

          This is why I want to address it now, while we are cementing the
          MergePolicy API: I don't want to preclude it.

          > > Wait: there is a barrier, right? In IndexWriter.replace we don't do
          > > the right thing with non-contiguous merges?
          >
          > Yeah, I meant that I'm not aware of any barriers except fixing
          > IndexWriter#replace, in other words, I'm not aware of any other
          > places where non-contiguity would cause a failure.

          OK, good, that's my impression too.

          Although ... do you think we need some way for merge policy to
          state where the new segment should be inserted into SegmentInfos? For
          the contiguous case it seems clear that we should default to what is
          done now (new segment goes into same spot where old segments were).
          But for the non-contiguous case, how would IndexWriter know where to
          put the newly created segment?

          > > Finally: does MergePolicy really need a close()?
          >
          > I think so. The concurrent merge policy maintains all sorts of
          > state.

          OK. Hmmm, does CMP block on close while it joins to any running merge
          threads? How can the user close IndexWriter and abort the running
          merges? I guess CMP would provide a method to abort any running
          merges, and user would first call that before calling
          IndexWriter.close?

          > > I don't see how it can work (building the generic concurrency
          > > wrapper "under" IndexMerger) because the MergePolicy is in "serial
          > > control", eg, when it wants to cascade merges. How will you return
          > > that thread back to IndexWriter?
          >
          > So this is how it looks now: the concurrent merge policy is both a
          > merge policy and an index merger. The serial merge policy knows
          > nothing about it other than that it does not get IndexWriter as its
          > merger.
          >
          > The index writer wants its merge, so it does its merge/maybeMerge
          > call on the concurrent merge policy. The CMP calls merge on the
          > serial policy, but substitutes itself for the merger rather than
          > IndexWriter.
          >
          > The serial merge policy goes on its merry way, looking for merges to
          > do (in the current model, this is a loop; more on that in a
          > minute). Each time it has a subset of segments to merge, it calls
          > merger.merge(...).
          >
          > At this point, the concurrent merge policy takes over again. It
          > looks at the segments to be merged and other segments being
          > processed by all existing merge threads and determines if there's a
          > conflict (a request to merge a segment that's currently in a
          > merge). If there's no conflict, it starts a merge thread and calls
          > IndexWriter#merge on the thread. The original calling thread returns
          > immediately. (I have a few ideas how to handle conflicts, the
          > simplest of which is to wait for the conflicting merge and then
          > restart the serial merge, i.e., revert to serial).
          >
          > This seems to work pretty well, so far. The only difference in API
          > for the serial merges is that the merge operation can't return the
          > number of documents in the result (since it isn't known how many
          > docs will be deleted).

          Hmmm. This looks more complex than the proposed API simplification,
          because you now have CMP on the top and on the bottom. Also, this
          requires the IndexMerger interface, but with the simplification we
          would not need a separate interface. Finally, I'm pretty sure you
          have locking issues (more below...), which are required of all merge
          policies, that the simplified API wouldn't have.

          How we handle conflicts is important but I think independent of this
          API discussion (ie both your CMP and my CMP have this same challenge,
          and I agree we should start simple by just blocking when the selected
          merge conflicts with a previous one that's still in progress).

          > > With concurrency wrapper "on the top" it's able to easily take a
          > > merge request as returned by the policy, kick it off in the
          > > background, and immediately return control of original thread
          > > back to IndexWriter.
          >
          > What I don't know how to do with this is figure out how to do a
          > bunch of merges. Lets say I have two levels in LogDoc that are merge
          > worthy. If I call LogDoc, it'll return the lower level. That's all
          > good. But what about doing the higher level in parallel? If I call
          > LogDoc again, it's going to return the lower level again because it
          > knows nothing about the current merge going on.

          True, LogDoc as it now stands would never exploit concurrency (it will
          always return the highest level that needs merging). But, we could
          relax that such that if ever the lowest level has > 2*mergeFactor
          pending segments to merge then we select the 2nd set. This would
          expose concurrency that would only be used when CMP is in use. But I
          think we should do this, later, as an enhancement. Let's focus on
          simplifying the API now...
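
          Sketched as a toy selection function (names and the exact window
          handling are illustrative, not from any patch): normally pick the
          first mergeFactor pending segments of the lowest level; if that
          window is already being merged and the level has grown past
          2*mergeFactor, expose the second window so a concurrent wrapper can
          run both merges at once.

```java
import java.util.*;

class LevelSelector {
    /** Returns the [start, end) window of pending segment indexes to merge,
     *  or null if no merge should be selected right now. */
    static int[] select(int pending, int mergeFactor, boolean firstWindowBusy) {
        if (!firstWindowBusy) {
            // Serial case: merge the first full window, if there is one.
            return pending >= mergeFactor ? new int[] {0, mergeFactor} : null;
        }
        // First window busy: only offer the second set when the level has
        // more than 2*mergeFactor pending segments, as proposed above.
        return pending > 2 * mergeFactor ? new int[] {mergeFactor, 2 * mergeFactor} : null;
    }
}
```

          A serial caller never sees a difference; only a concurrent wrapper
          that marks the first window busy ever receives the second window.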

          > It would be possible to have it return a vector of segmentInfo
          > subsets, but I don't see the gain (and it doesn't work out as well
          > for my putative conflict resolution).

          Yeah that would make the API even more complex, which is the wrong
          direction here.

          > > have all necessary locking for SegmentInfos inside IndexWriter
          >
          > This was a red-herring on my part. All the "segmentInfos locking"
          has always been in IndexWriter. That's not exactly sufficient. The
          > fundamental issue is that IndexWriter#merge has to operate without a
          > lock on IndexWriter. At some point, I was thinking that meant it
          > would have to lock SegmentInfos but that's ludicrous, actually. It's
          > sufficient for IndexWriter#replace to be synchronized.

          Right: merging certainly shouldn't hold lock on IndexWriter nor
          segmentInfos.

          > > If CMP implements IndexMerger you must have locking inside any
          > > MergePolicy that's calling into CMP?
          >
          > No. CMP does it's own locking (for purposes of thread management)
          > but the serial merge policies no nothing of this (and they can
          > expect to be called synchronously).

          This I don't get: it seems to me that the serial merge policies must
          do their own locking when they access the SegmentInfos that's passed
          in? And that lock must be released, somehow, when they call merge?
          Would merge (inside IndexWriter) somehow release the lock on being
          called? I don't see how you're going to make the locking work, but I
          think it's required with the current API.

          This is another benefit of the simplified API: MergePolicy.maybeMerge
          would only be called with a lock already acquired (by IndexWriter) on
          the segmentInfos. Then maybeMerge looks at the segmentInfos, makes its
          choice, and returns it, and the lock is released. The lock is not
          held for an extended period of time...
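
          A minimal sketch of that locking discipline, with hypothetical names
          (the real IndexWriter and SegmentInfos are far richer): the policy
          only inspects the segments and returns a selection while the
          writer's lock is held; the actual merge runs after the lock is
          released.

```java
import java.util.*;

// Sketch of the simplified, "stateless" API: the policy never holds the lock
// across a merge; it just picks segments while the writer is synchronized.
interface StatelessMergePolicy {
    /** Called with the writer's lock held; must look at the segments,
     *  make its choice, and return quickly. */
    List<String> maybeMerge(List<String> segmentInfos);
}

class WriterSketch {
    private final List<String> segmentInfos = new ArrayList<>();
    final List<List<String>> executed = new ArrayList<>();    // merges actually run

    void add(String seg) { synchronized (this) { segmentInfos.add(seg); } }

    void maybeMerge(StatelessMergePolicy policy) {
        List<String> selection;
        synchronized (this) {                       // lock held only for selection
            selection = policy.maybeMerge(new ArrayList<>(segmentInfos));
        }
        if (selection != null) {
            executed.add(selection);                // the merge runs without the lock
        }
    }
}
```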

          Hide
          steven_parkes Steven Parkes added a comment -

          One new small item: you've added a "public void merge()" to
          IndexWriter so that people can externally kick off a merge request,
          which is good I think.

          But, is it really necessary to flush here? It would be better to not
          flush so that users then have two separate methods (flush() and
          merge()) to do each function independently.

          That makes sense.

          Note that merge() was added not for users (which I have no strong opinion about) but so that, potentially, CMP can check again for merges when a set of merge threads completes, i.e., cascade.

          I think instead we should leave the methods, not deprecated, as
          convenience (sugar) methods. Simple things should be simple; complex
          things should be possible.

          I think this argues for a LegacyMergePolicy interface again, then? If we change the default merge policy and someone changes their code to use LogDoc for their own purposes, in both cases the getters/setters should work? So cast to the interface and as long as the merge policy supports this, the getters/setters work (unless the merge policy decides to throw within), otherwise the getters/setters throw?

          Uh, no: when someone calls optimize that means it really must be done,
          right? So "optimize" is the right name I think.

          Yeah, but it might do nothing. Just as merge might do nothing.

          Can you factor this out, eg add a private method
          "getLogDocMergePolicy(String reason)"

          Sure.

          Looks good, thanks. Can you add javadocs (w/ params) for both of
          these new methods?

          Sure.

          Though, this raises the tricky question of index
          consistency ...

          Definitely. I'm still trying to understand all the subtleties here.

          IndexWriter commits the new segments file right after
          mergePolicy.merge returns ... so for CMP we suddenly have an unusable
          index (as seen by an IndexReader).

          How so? I figured that after mergePolicy.merge returns, in the case of CMP with an ongoing merge, segmentInfos won't have changed at all. Is that a problem?

          I thought the issue would be on the other end, where the concurrent merge finishes and needs to update segmentInfos.

          Maybe it's too ambitious to allow merges of segments from other
          directories to run concurrently?

          Yeah, that might be the case. At least as a default?

          I would consider it a hard error in IndexWriter if after calling
          mergePolicy.merge from any of the addIndexes*, there remain segments
          in other directories. I think we should catch this & throw an
          exception?

          It would be easy enough for CMP to block in this case, rather than returning immediately. Wouldn't that be better? And I suppose it's possible to imagine an API on CMP for specifying this behavior?

          I opened a separate issue LUCENE-982 to track this.

          I think this is good. I think it's an interesting issue but not directly related to the refactor?

          Although ... do you think we need some way for merge policy to
          state where the new segment should be inserted into SegmentInfos?

          Right now I assumed it would replace the left-most segment.

          Since I don't really know the details of what such a merge policy would look like, I don't really know what it needs.

          If you've thought about this more, do you have a suggestion? I suppose we could just add an int. But, then again, I'd do that as a separate function, leaving the original available, so we can do this later, completely compatibly?

          Hmmm, does CMP block on close while it joins to any running merge
          threads?

          Yeah, at least in my sandbox.

          How can the user close IndexWriter and abort the running
          merges? I guess CMP would provide a method to abort any running
          merges, and user would first call that before calling
          IndexWriter.close?

          I hadn't really thought about this but I can see that should be made possible. It's always safe to abandon a merge so it should be available, for fast, safe, and clean shutdown.
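
          One common shape for such a fast, safe shutdown is a volatile abort
          flag that the merge thread polls between units of work, so close can
          request a stop and then join. This is purely illustrative; none of
          these names come from the patch:

```java
// Illustration of an abortable background merge: the worker polls a volatile
// flag between chunks of work, so close() can stop it promptly and then join.
class AbortableMerge implements Runnable {
    private volatile boolean aborted = false;
    volatile boolean finished = false;
    volatile int chunksDone = 0;

    public void run() {
        for (int chunk = 0; chunk < 1000; chunk++) {
            if (aborted) return;                // abandon the merge: always safe,
                                                // the old segments remain valid
            chunksDone++;                       // stand-in for real merge work
            try { Thread.sleep(1); } catch (InterruptedException e) { return; }
        }
        finished = true;                        // only "commit" if never aborted
    }

    void abort() { aborted = true; }
}
```

          Abandoning mid-merge is safe precisely because nothing is committed
          until the merge completes; the index still references the old
          segments.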

          True, LogDoc as it now stands would never exploit concurrency (it will
          always return the highest level that needs merging). But, we could
          relax that such that if ever the lowest level has > 2*mergeFactor
          pending segments to merge then we select the 2nd set.

          Okay. But it will always return that? Still doesn't sound concurrent?

          The thing is, the serial merge policy has no concept of concurrent merges, so if the API is always to select the best merge, then until a previous merge finishes it will keep returning that same merge as the best one.

          Concurrent is going to require, by hook or by crook, that a merge policy be able to generate a set of non-conflicting merges, is it not?

          I think the LUCENE-845 merge policy does this now, given that CMP gathers up the merge calls. I'm not sure the current LUCENE-847 merge policy does (I'd have to double check) because it sometimes will try to use the result of the current merge in the next merge. The LUCENE-845 merge doesn't try to do this, which is an (inconsequential) change?

          This is another benefit of the simplified API: MergePolicy.maybeMerge
          would only be called with a lock already acquired (by IndexWriter) on
          the segmentInfos.

          Do you really mean a lock on segmentInfos or just the lock on IndexWriter? I'm assuming the latter and I think this is the case for both API models.

          I don't think it's feasible to have a lock on segmentInfos separately. Only IndexWriter should change segmentInfos and no code should try to look at segmentInfos w/o being called via an IW synch method.

          This does imply that CMP has to copy any segmentInfos data it plans to use during concurrent merging, since the IW lock is not held during these periods. Then, when the merge is done, segmentInfos is updated in IndexWriter via a synch call to IW#replace.

          This means IW#segmentInfos can change while a merge is in progress and this has to be accounted for. That's what I'm walking through now.
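
          The snapshot-then-replace flow might be sketched like this
          (hypothetical classes; the real replace must also cope with deletes
          and commits): the merge thread works on a copy, and the only
          synchronized step swaps the merged run back in, tolerating segments
          added to the list while the merge ran.

```java
import java.util.*;

// Sketch of the IW#replace idea: merges run on a snapshot of the segment list;
// the only synchronized step swaps the merged segments for their result.
class SegmentListSketch {
    private final List<String> segmentInfos = new ArrayList<>();

    SegmentListSketch(String... segs) { segmentInfos.addAll(Arrays.asList(segs)); }

    synchronized List<String> snapshot() { return new ArrayList<>(segmentInfos); }

    synchronized void add(String seg) { segmentInfos.add(seg); }    // e.g. a new flush

    /** Replace the contiguous run `merged` with `result`. The list may have
     *  changed since the snapshot was taken, so locate the run by its first
     *  element rather than assuming fixed indexes. */
    synchronized void replace(List<String> merged, String result) {
        int start = segmentInfos.indexOf(merged.get(0));
        if (start < 0) throw new IllegalStateException("merged segments vanished");
        segmentInfos.subList(start, start + merged.size()).clear();
        segmentInfos.add(start, result);
    }

    synchronized List<String> infos() { return new ArrayList<>(segmentInfos); }
}
```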

          Hide
          mikemccand Michael McCandless added a comment -

          > Note that merge() was added not for users (which I have no strong
          > opinion about) but so that, potentially, CMP can check again for
          > merges when a set of merge threads completes, i.e., cascade.

          OK, got it. In fact, then it seems more important that we NOT flush
          at this point?

          > > I think instead we should leave the methods, not deprecated, as
          > > convenience (sugar) methods. Simple things should be simple;
          > > complex things should be possible.
          >
          > I think this argues for a LegacyMergePolicy interface again, then?
          > If we change the default merge policy and someone changes their code
          > to use LogDoc for their own purposes, in both cases the
          > getters/setters should work? So cast to the interface and as long as
          > the merge policy supports this, the getters/setters work (unless the
          > merge policy decides to throw within), otherwise the getters/setters
          > throw?

          I don't think so: I think if someone changes the merge policy to
          something else, it's fine to require that they then do settings
          directly through that merge policy. I don't think we should bring
          back the LegacyMergePolicy interface.

          > > Uh, no: when someone calls optimize that means it really must be
          > > done, right? So "optimize" is the right name I think.
          >
          > Yeah, but it might do nothing. Just as merge might do nothing.

          Well... that's the exception not the rule. My vote would be for
          "maybeMerge(...)" and "optimize(...)".

          > > Though, this raises the tricky question of index consistency ...
          >
          > Definitely. I'm still trying to understand all the subtleties here.
          >
          > > IndexWriter commits the new segments file right after
          > > mergePolicy.merge returns ... so for CMP we suddenly have an
          > > unusable index (as seen by an IndexReader).
          >
          > How so? I figured that after mergePolicy.merge returns, in the case
          > of CMP with an ongoing merge, segmentInfos won't have changed at
          > all. Is that a problem?

          This is inside addIndexes that we're talking about. It will have
          changed because the added indexes were stuck into the segmentInfos.
          If you commit that segmentInfos, which now references segments in
          other directories, the index is inconsistent, until the merge policy
          finishes its work (including copying over segments from other dirs).
          In fact this used to be an issue but was fixed in LUCENE-702.

          > > Maybe it's too ambitious to allow merges of segments from other
          > > directories to run concurrently?
          >
          > Yeah, that might be the case. At least as a default?

          I think it's worse: I think we shouldn't allow any mergePolicy to
          leave the index inconsistent (failing to copy over segments from other
          directories). I think it's a bug if the mergePolicy does that and we
          should check & raise an exception, and not commit the new segments
          file. IndexWriter should in general protect itself from a mergePolicy
          that makes the index inconsistent (and, refuse to commit the resulting
          segments file).

          With the proposed "stateless API" we would keep calling the
          mergePolicy, after each merge, until it returned null, and then do the
          check that index is consistent.

          > > I would consider it a hard error in IndexWriter if after calling
          > > mergePolicy.merge from any of the addIndexes*, there remain
          > > segments in other directories. I think we should catch this &
          > > throw an exception?
          >
          > It would be easy enough for CMP to block in this case, rather than
          > returning immediately. Wouldn't that be better? And I suppose it's
          > possible to imagine an API on CMP for specifying this behavior?

          I think CMP should indeed block in this case. I wouldn't add an API
          to change it. It's too dangerous to allow an index to become
          inconsistent.

          > > I opened a separate issue LUCENE-982 to track this.
          >
          > I think this is good. I think it's an interesting issue but not
          > directly related to the refactor?

          I think it is related: we should not preclude it in this refactoring.
          I think we should fix MergePolicy.optimize to take "int
          maxNumSegments"?

          > > Although ... do you think we need some way for merge policy
          > > to state where the new segment should be inserted into
          > > SegmentInfos?
          >
          > Right now I assumed it would replace the left-most segment.
          >
          > Since I don't really know the details of what such a merge policy
          > would like, I don't really know what it needs.
          >
          > If you've thought about this more, do you have a suggestion? I
          > suppose we could just add an int. But, then again, I'd do that as a
          > separate function, leaving the original available, so we can do this
          > later, completely compatibly?

          I don't have a suggestion. And I agree, this is safely postponed while
          keeping future backwards compatibility, so, punt!

          > > How can the user close IndexWriter and abort the running merges? I
          > > guess CMP would provide a method to abort any running merges, and
          > > user would first call that before calling IndexWriter.close?
          >
          > I hadn't really thought about this but I can see that should be made
          > possible. It's always safe to abandon a merge so it should be
          > available, for fast, safe, and clean shutdown.

          OK. Seems like a CMP specific issue (doesn't impact this discussion).

          > > True, LogDoc as it now stands would never exploit concurrency (it
          > > will always return the highest level that needs merging). But, we
          > > could relax that such that if ever the lowest level has >
          > > 2*mergeFactor pending segments to merge then we select the 2nd
          > > set.
          >
          > Okay. But it will always return that? Still doesn't sound
          > concurrent?

          No, after another N (= mergeFactor) flushes, it would return a new
          suggested merge. I think this gives CMP concurrency to work with.
          Also I think other merge policies (eg the rough suggestion in
          LUCENE-854) could provide substantial concurrency.

          > The thing is, the serial merge policy has no concept of concurrent
          > merges, so if the API is always to select the best merge, until a
          > previous merge finishes, it will always return that as the best
          > merge.
          >
          > Concurrent is going to require, by hook or by crook, that a merge
          > policy be able to generate a set of non-conflicting merges, is it
          > not?

          Correct, if we want more than 1 merge running at once then
          mergePolicy must provide non-conflicting merges.

          But, providing just a single concurrent merge already gains us
          concurrency of merging with adding of docs. Just that is a great step
          forward, and, it's not clear we can expect performance gains by doing
          2 merges simultaneously with adding docs. Have you tested this to
          see?

          If we think there are still gains there, we can use the idea above, or
          apps can use other merge policies (like LUCENE-854) that don't always
          choose non-concurrent (conflicting) merges.

          > I think the LUCENE-845 merge policy does this now, given that CMP
          > gathers up the merge calls. I'm not sure the current LUCENE-847
          > merge policy does (I'd have to double check) because it sometimes
          > will try to use the result of the current merge in the next
          > merge. The LUCENE-845 merge doesn't try to do this which is a
          > (inconsequential) change?

          Right, the LUCENE-845 merge policy doesn't look @ the return result of
          "merge". It just looks at the newly created SegmentInfos.

          Hmmmm, in fact, I think your CMP wrapper would not work with the merge
          policy in LUCENE-845, right? Ie, won't it just recurse forever?
          So actually I don't see how your CMP (using the current API) can in
          general safely "wrap" around a merge policy w/o breaking things?

          Whereas w/ stateless API, where merge policy just returns what should
          be merged rather than executing it itself and cascading, would work
          fine.

          > > This is another benefit of the simplified API:
          > > MergePolicy.maybeMerge would only be called with a lock already
          > > acquired (by IndexWriter) on the segmentInfos.
          >
          > Do you really mean a lock on segmentInfos or just the lock on
          > IndexWriter? I'm assuming the latter and I think this is the case
          > for both API models.

          But, if you lock on IndexWriter, what about apps that use multiple
          threads to add documents but don't use CMP? When one thread gets
          tied up merging, you'll then block on the other synchronized methods?
          And you also can't flush from other threads either? I think flushing
          a new segment should be allowed to run concurrently with the merge?

          Whereas if you lock only segmentInfos, and use the proposed stateless
          API, I think the other threads would not be blocked? I guess I don't
          see the reason to synchronize on IndexWriter instead of segmentInfos.

          Net/net I'm still thinking we should simplify this API to be
          stateless. I think there are a number of benefits:

          • We would no longer need to add a new IndexMerger interface that
            adds unnecessary complexity to Lucene (and make the awkward
            decisions up front on which IndexWriter fields are allowed to be
            visible through the interface).
          • Keep CMP simpler (only top of stack (where I think "macro"
            concurrency should live), not top and bottom).
          • Work correctly as wrapper around other merge policies (ie not hit
            infinite recursion because mergePolicy had naturally assumed that
            "merge" would have changed the segmentInfos)
          • Allows locking on segmentInfos (not IndexWriter), and allows
            concurrency on multiple threads adding docs even without using
            CMP.
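
          To make the proposal concrete, the "stateless" API argued for above could look roughly like the following Java sketch. All names here (StatelessMergePolicy, MergeSpecification, SimpleLogDocMergePolicy) are illustrative, not the actual LUCENE-847 code: the policy only inspects the segments and returns what should be merged, leaving execution, cascading, and locking to IndexWriter or a scheduler above it.

          ```java
          import java.util.ArrayList;
          import java.util.List;

          // Illustrative stand-in for Lucene's per-segment metadata.
          class SegmentInfo {
            final String name;
            final int docCount;
            SegmentInfo(String name, int docCount) { this.name = name; this.docCount = docCount; }
          }

          // The policy's answer: which segments to merge. It does not run the merge.
          class MergeSpecification {
            final List<SegmentInfo> segmentsToMerge = new ArrayList<>();
          }

          interface StatelessMergePolicy {
            // Return a suggested merge, or null if none is needed; the caller
            // (IndexWriter or a scheduler) executes it and commits.
            MergeSpecification maybeMerge(List<SegmentInfo> segmentInfos);
          }

          // Toy LogDoc-style policy: find the first run of mergeFactor adjacent
          // segments at the same log level and propose merging them.
          class SimpleLogDocMergePolicy implements StatelessMergePolicy {
            private final int mergeFactor;
            SimpleLogDocMergePolicy(int mergeFactor) { this.mergeFactor = mergeFactor; }

            public MergeSpecification maybeMerge(List<SegmentInfo> infos) {
              int runStart = 0;
              for (int i = 0; i <= infos.size(); i++) {
                boolean boundary = i == infos.size()
                    || level(infos.get(i)) != level(infos.get(runStart));
                if (boundary) {
                  if (i - runStart >= mergeFactor) {
                    MergeSpecification spec = new MergeSpecification();
                    for (int j = runStart; j < runStart + mergeFactor; j++) {
                      spec.segmentsToMerge.add(infos.get(j));
                    }
                    return spec;
                  }
                  runStart = i;
                }
              }
              return null; // nothing to do
            }

            // Log level of a segment by doc count, base mergeFactor.
            private int level(SegmentInfo info) {
              return (int) Math.floor(Math.log(Math.max(1, info.docCount)) / Math.log(mergeFactor));
            }
          }
          ```

          Because maybeMerge never mutates segmentInfos itself, a concurrent wrapper can call it, hand the returned specification to a thread, and call it again after the next flush without any risk of the recursion problem described above.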
          Hide
          steven_parkes Steven Parkes added a comment -

          I don't think so: I think if someone changes the merge policy to
          something else, it's fine to require that they then do settings
          directly through that merge policy.

          You're going to want to change the default merge policy, right? So you're going to change the hard cast in IW to that policy? So it'll fail for anyone that wants to just getMergePolicy back to the old policy?

          If that's the case, I'm going to keep those tests the way they are because when you do change the policy, I'm assuming you'll keep many of them, just add the manual setMergePolicy(), and they'll need to have those casts put back in?

          Maybe we just put it in MergePolicy interface and let them throw (e.g., via MergePolicyBase) if called on an unsupported merge policy? That's moving from compile time checking to run time checking, but ...

          This is inside addIndexes that we're talking about.

          Ah. Right.

          I think we shouldn't allow any mergePolicy to
          leave the index inconsistent (failing to copy over segments from other
          directories).

          That makes sense to me. CMP could enforce this, even in the case of concurrent merges.

          No, after another N (= mergeFactor) flushes, it would return a new
          suggested merge.

          Okay. I think I'm following you here.

          Here's what I understand: in your model, (1) each call to merge will only ever generate one merge thread (regardless of how many levels might be full) and (2) you can get concurrency out of this as long as you consider a level "merge worthy" as different from "full" (i.e., blocking).

          You did say

          > > But, we
          > > could relax that such that if ever the lowest level has >
          > > 2*mergeFactor pending segments to merge then we select the 2nd
          > > set.

          And I think you'd want to modify that to select the lowest sufficiently oversubscribed level, not just the lowest level if it's oversubscribed?

          Perhaps this is sufficient, but not necessary? I see it as simpler just to have the merge policy (abstractly) generate a set of non-conflicting merges and let someone else worry about scheduling them.

          But, providing just a single concurrent merge already gains us
          concurrency of merging with adding of docs.

          I'm worried about when you start the leftmost merge, that, say, is going to take a day. With a steady influx of docs, it's not going to be long before you need another merge and if you have only one thread, you're going to block for the rest of the day. You've bought a little concurrency, but it's the almost day-long block I really want to avoid.

          With a log-like policy, I think it's feasible to have logN threads. You might not want them all doing disk i/o at the same time: you'd want to prioritize threads on the small merges and/or suspend large merge threads. The speed of the larger merge threads can vary when other merges are taking place; you just have to not stop them and start over.
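
          One way to picture the scheduling half of this ("let someone else worry about scheduling them") is a queue that always hands worker threads the cheapest pending merge, so small merges keep flowing while a day-long merge runs in its own thread. A minimal sketch; the names PendingMerge and SmallestFirstMergeQueue are made up for illustration:

          ```java
          import java.util.Comparator;
          import java.util.PriorityQueue;

          // A merge the policy has proposed but no thread has started yet.
          class PendingMerge {
            final String name;
            final long totalDocs; // proxy for merge cost
            PendingMerge(String name, long totalDocs) { this.name = name; this.totalDocs = totalDocs; }
          }

          // Hands workers the cheapest pending merge first, so a huge merge
          // never starves the frequent small ones.
          class SmallestFirstMergeQueue {
            private final PriorityQueue<PendingMerge> queue =
                new PriorityQueue<>(Comparator.comparingLong((PendingMerge m) -> m.totalDocs));

            synchronized void submit(PendingMerge merge) { queue.add(merge); }

            // Next merge a worker thread should run, or null if none pending.
            synchronized PendingMerge next() { return queue.poll(); }
          }
          ```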

          Right, the LUCENE-845 merge policy doesn't look @ the return result of
          "merge". It just looks at the newly created SegmentInfos.

          Yeah. My thinking was this would be tweaked. If merger.merge returns a valid number of docs, it could recurse as it does. If merger.merge returned -1 (which CMP does), it would not recurse but simply continue the loop.

          Hmmmm, in fact, I think your CMP wrapper would not work with the merge
          policy in LUCENE-845, right? Ie, won't it will just recurse forever?
          So actually I don't see how your CMP (using the current API) can in
          general safely "wrap" around a merge policy w/o breaking things?

          I think it's safe, just not concurrent. The recursion would generate the same set of segments to merge and CMP would make the second call block (abstractly, anyway: it actually throws an exception that unwinds the stack and causes the call to start again from the top when the conflicting merge finishes).
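
          The conflict check a concurrent merge policy needs here is simple: two candidate merges conflict iff they share a segment. A hedged sketch (not the actual CMP code; segment names stand in for SegmentInfo objects):

          ```java
          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          class MergeConflicts {
            // True if the two candidate merges share any segment, in which case
            // they cannot safely run concurrently.
            static boolean conflicts(List<String> mergeA, List<String> mergeB) {
              Set<String> a = new HashSet<>(mergeA);
              for (String seg : mergeB) {
                if (a.contains(seg)) return true;
              }
              return false;
            }
          }
          ```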

          But, if you lock on IndexWriter, what about apps that use multiple
          threads to add documents and but don't use CMP? When one thread gets
          tied up merging, you'll then block on the other synchronized methods?
          And you also can't flush from other threads either? I think flushing
          a new segment should be allowed to run concurrently with the merge?

          I'm not sure I'm following this. That's what happens now, right? Are you trying to get more concurrency than there is now w/o using CMP? I certainly haven't been trying to do that.

          I guess I don't
          see the reason to synchronize on IndexWriter instead of segmentInfos.

          I looked at trying to make IW work when a synchronization of IW didn't imply a synchronization of segmentInfos. It's a very, very heavily used little data structure. I found it very hard to convince myself I could catch all the places locks would be required. And at the same time, I seemed to be able to do everything I needed with IW locking.

          That said, the code's not done, so ....

          Net/net I'm still thinking we should simplify this API to be
          stateless. I think there are a number of benefits:

          • We would no longer need to add a new IndexMerger interface that
            adds unecessary complexity to Lucnee (and, make the awkward
            decisions up front on which IndexWriter fields are allowed to be
            visible through the interface).
          • Keep CMP simpler (only top of stack (where I think "macro"
            concurrency should live), not top and bottom).
          • Work correctly as wrapper around other merge policies (ie not hit
            infinite recursion because mergePolicy had naturally assumed that
            "merge" would have changed the segmentInfos)
          • Allows locking on segmentInfos (not IndexWriter), and allows
            concurrency on multiple threads adding docs even without using
            CMP.

          Hmmm ... I guess our approaches are pretty different. If you want to take a stab at this ...

          Hide
          mikemccand Michael McCandless added a comment -

          > > I don't think so: I think if someone changes the merge policy to
          > > something else, it's fine to require that they then do settings
          > > directly through that merge policy.
          >
          > You're going to want to change the default merge policy, right? So
          > you're going to change the hard cast in IW to that policy? So it'll
          > fail for anyone that wants to just getMergePolicy back to the old
          > policy?

          I don't really follow... my feeling is we should not deprecate
          setUseCompoundFile, setMergeFactor, setMaxMergeDocs.

          > > I think we shouldn't allow any mergePolicy to leave the index
          > > inconsistent (failing to copy over segments from other
          > > directories).
          >
          > That makes sense to me. CMP could enforce this, even in the case of
          > concurrent merges.

          I think IndexWriter should enforce it? Ie no merge policy should be
          allowed to leave segments in other dirs (= an inconsistent index) at
          the point of commit.

          > Perhaps this is sufficient, but not necessary? I see it as simpler
          > just to have the merge policy (abstractly) generate a set of
          > non-conflicting merges and let someone else worry about scheduling
          > them.

          I like that idea. It fits well w/ the stateless API. Ie, merge
          policy returns all possible merges and "someone above" takes care of
          scheduling them.
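          In code, the shape being discussed might look something like the
          sketch below. The class and method names (SegmentInfo, OneMerge,
          MergeSpecification, findCandidateMerges) follow this thread but are
          illustrative only, not the final API; the toy selection logic just
          groups each full run of mergeFactor adjacent segments. The point is
          that the policy only selects merges and returns them all at once;
          the writer (or a scheduler above it) decides when to run them.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the stateless API under discussion: the policy *selects* merges
// and returns them; it never runs them itself. Names are illustrative.
public class StatelessMergeSketch {
    static class SegmentInfo {
        final String name;
        final int docCount;
        SegmentInfo(String name, int docCount) { this.name = name; this.docCount = docCount; }
    }

    static class OneMerge {
        final List<SegmentInfo> segments;
        OneMerge(List<SegmentInfo> segments) { this.segments = segments; }
    }

    static class MergeSpecification {
        final List<OneMerge> merges = new ArrayList<>();
    }

    // Toy log-style selection: every full run of mergeFactor adjacent
    // segments becomes one candidate merge; all candidates are returned,
    // none is executed here.
    static MergeSpecification findCandidateMerges(List<SegmentInfo> infos, int mergeFactor) {
        MergeSpecification spec = new MergeSpecification();
        for (int start = 0; start + mergeFactor <= infos.size(); start += mergeFactor) {
            spec.merges.add(new OneMerge(new ArrayList<>(infos.subList(start, start + mergeFactor))));
        }
        return spec;
    }

    public static void main(String[] args) {
        List<SegmentInfo> infos = new ArrayList<>();
        for (int i = 0; i < 25; i++) infos.add(new SegmentInfo("_" + i, 1));
        MergeSpecification spec = findCandidateMerges(infos, 10);
        // 25 one-doc segments with mergeFactor 10 yield two non-conflicting candidates
        System.out.println(spec.merges.size()); // prints 2
    }
}
```

          Because the candidates are non-conflicting, "someone above" is free
          to run them serially, or hand each to its own thread.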

          > > But, providing just a single concurrent merge already gains us
          > > concurrency of merging with adding of docs.
          >
          > I'm worried about when you start the leftmost merge, that, say, is
          > going to take a day. With a steady influx of docs, it's not going to
          > be long before you need another merge and if you have only one
          > thread, you're going to block for the rest of the day. You've bought
          > a little concurrency, but it's the almost day-long block I really
          > want to avoid.

          Ahh ... very good point. I agree.

          > With a log-like policy, I think it's feasible to have logN
          > threads. You might not want them all doing disk i/o at the same
          > time: you'd want to prioritize threads on the small merges and/or
          > suspend large merge threads. The speed with which the larger merge
          > threads can vary when other merges are taking place, you just have
          > to not stop them and start over.

          Agreed: CMP should do this.

          > > Right, the LUCENE-845 merge policy doesn't look @ the return
          > > result of "merge". It just looks at the newly created
          > > SegmentInfos.
          >
          > Yeah. My thinking was this would be tweaked. If merger.merge returns
          > a valid number of docs, it could recurse as it does. If merger.merge
          > returned -1 (which CMP does), it would not recurse but simply
          > continue the loop.

          Hmm. This means each merge policy must know whether it's talking to
          CMP or IndexWriter underneath? With the stateless approach this
          wouldn't happen.

          > > Hmmmm, in fact, I think your CMP wrapper would not work with the
          > > merge policy in LUCENE-845, right? Ie, won't it will just recurse
          > > forever? So actually I don't see how your CMP (using the current
          > > API) can in general safely "wrap" around a merge policy w/o
          > > breaking things?
          >
          > I think it's safe, just not concurrent. The recursion would generate
          > the same set of segments to merge and CMP would make the second call
          > block (abstractly, anyway: it actually throws an exception that
          > unwinds the stack and causes the call to start again from the top
          > when the conflicting merge finishes).

          Oh I see... that's kind of sneaky (planning on using exceptions to
          abort a merge requested by the policy). I think the stateless
          approach would be cleaner here.

          > > But, if you lock on IndexWriter, what about apps that use multiple
          > > threads to add documents and but don't use CMP? When one thread
          > > gets tied up merging, you'll then block on the other synchronized
          > > methods? And you also can't flush from other threads either? I
          > > think flushing a new segment should be allowed to run concurrently
          > > with the merge?
          >
          > I'm not sure I'm following this. That's what happens now, right? Are
          > you trying to get more concurrency then there is now w/o using CMP?
          > I certainly haven't been trying to do that.

          True, this is something new. But since you're already doing the work
          to allow a merge to run in the BG without blocking adding of docs,
          flushing, etc, wouldn't this come nearly for free? Actually I think
          all that's necessary, regardless of sync'ing on IndexWriter or
          SegmentInfos is to move the "if (triggerMerge)" out of the
          synchronized method/block.
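          As a minimal sketch of that idea (all names hypothetical, not
          IndexWriter's actual fields): decide *whether* a merge is needed
          inside the short critical section, but run the potentially long
          merge outside it, so other threads can keep adding documents and
          flushing in the meantime.

```java
// Hypothetical sketch: the "if (triggerMerge)" decision happens under the
// lock, but the long-running merge itself runs outside it.
public class TriggerOutsideLock {
    private final Object lock = new Object();
    private int docsSinceMerge;
    private int mergesRun;

    public void addDocument() {
        boolean doMerge;
        synchronized (lock) {          // short: just bookkeeping
            docsSinceMerge++;
            doMerge = docsSinceMerge >= 10;
            if (doMerge) docsSinceMerge = 0;
        }
        if (doMerge) {
            merge();                   // long-running work, outside the lock
        }
    }

    private void merge() {
        synchronized (lock) { mergesRun++; }   // commit the result under the lock
    }

    public int mergesRun() {
        synchronized (lock) { return mergesRun; }
    }
}
```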

          > > I guess I don't see the reason to synchronize on IndexWriter
          > > instead of segmentInfos.
          >
          > I looked at trying to make IW work when a synchronization of IW
          > didn't imply a synchronization of segmentInfos. It's a very, very
          > heavily used little data structure. I found it very hard to convince
          > myself I could catch all the places locks would be required. And at
          > the same time, I seemed to be able to do everything I needed with IW
          > locking.

          Well, eg flush() now synchronizes on IndexWriter: we don't want 2
          threads doing this at once. But, the touching of segmentInfos inside
          flush (to add the new SegmentInfo) is a tiny fleeting event (like
          replace) and so you would want segmentInfos to be free to change while
          the flushing was running (eg by a BG merge that has finished).

          > Hmmm ... I guess our approaches are pretty different. If you want to
          > take a stab at this ...

          OK I will try to take a rough stab at the stateless approach....

          steven_parkes Steven Parkes added a comment -

          my feeling is we should not deprecate
          setUseCompoundFile, setMergeFactor, setMaxMergeDocs

          I understood that you didn't want to deprecate them in IndexWriter. I wasn't sure that you meant that they should be added to the MergePolicy interface? If you do, everything makes sense. Otherwise, it sounds like there's still a cast in there and I'm not sure about that.

          I think IndexWriter should enforce it? Ie no merge policy should be
          allowed to leave segments in other dirs (= at inconsistent index) at
          point of commit.

          I think it's just about code location: since a merge policy might want to factor the directories used into its algorithm, it needs the info, and it will presumably sometimes do the copy itself. Presumably you could provide code in MergePolicyBase so the policies could decide when, but wouldn't have to write the copy loop. If you put the code in IndexWriter too, it sounds duplicated, again presuming a policy might sometimes want to do it itself.

          I like that idea. It fits well w/ the stateless API. Ie, merge
          policy returns all possible merges and "someone above" takes care of
          scheduling them.

          So it returns a vector of specs?

          That's essentially what the CMP as an above/below wrapper does. I can see that above/below is strange enough to be less clever (I wasn't trying to be so much clever as backwards compatible) and more insane.

          Sane is good.

          Hmm. This means each merge policy must know whether it's talking to
          CMP or IndexWriter underneith? With the stateless approach this
          wouldn't happen.

          Well, I wouldn't so much say it has to know. All it cares is what merge returns. Doesn't have to know who returned it or why.

          The only real difference between this and the "generate a vector of merges" approach is that the merge policy can immediately take advantage of merge results in the serial case, whereas if you're generating a vector of merges, it can't.

          Of course, I guess in that case, if IndexWriter gets a vector of merges, it can always take the lowest and ignore the rest, calling the merge policy again in case it wants to request a different set. Then you only have the excess computation for merges you never really considered.
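          That serial driver could be sketched roughly as below. Everything
          here is illustrative: a toy "policy" stands in for the real one
          (it proposes merging the first mergeFactor segments whenever that
          many exist), and the loop runs only the first candidate, then
          re-evaluates, since the segment list has changed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the serial cascade: run one merge, re-ask the
// policy, repeat until no candidate remains. Segments are modeled as a
// list of doc counts.
public class SerialCascadeSketch {
    static int runCascade(List<Integer> docCounts, int mergeFactor) {
        int mergesRun = 0;
        while (docCounts.size() >= mergeFactor) {   // re-evaluate after each merge
            int total = 0;
            for (int i = 0; i < mergeFactor; i++) total += docCounts.remove(0);
            docCounts.add(0, total);                // replace the run with one merged segment
            mergesRun++;
        }
        return mergesRun;
    }

    public static void main(String[] args) {
        List<Integer> segs = new ArrayList<>();
        for (int i = 0; i < 25; i++) segs.add(1);
        // 25 -> 16 -> 7 segments: two merges before the policy goes quiet
        System.out.println(runCascade(segs, 10)); // prints 2
    }
}
```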

          Oh I see... that's kind of sneaky (planning on using exceptions to
          abort a merge requested by the policy).

          There's always going to be the chance of an exception to a merge. I'm pretty sure of that. But you're right, if the merge policy isn't in the control path, it would never see them. They'll be there, but it's out of the path.

          But since you're already doing the work
          to allow a merge to run in the BG without blocking adding of docs,
          flushing, etc, wouldn't this come nearly for free?

          I haven't looked at this.

          Well, eg flush() now synchronizes on IndexWriter

          Yeah, and making it not synchronized is less than straightforward. I've looked at this code a fair amount, experimented with different ideas, but hadn't gotten all the way to a working model.

          You can look at locking segmentInfos but there are many places that segmentInfos is iterated over that would require locks if the lock on IW wasn't sufficient to guarantee that the iteration was safe.

          I did look at that early on, so maybe my understanding was still too lacking and it's more feasible than I was thinking ...

          mikemccand Michael McCandless added a comment -

          OK I started from the original patch and made the changes described
          below.

          This is still a work in progress, but I think the new
          stateless approach works very well.

          All unit tests pass (one assert had to be changed in
          TestAddIndexesNoOptimize).

          I created a ConcurrentMergePolicyWrapper along with this (I'll post
          patch to LUCENE-870).

          I've also included the two merge policies from LUCENE-845 (still
          defaulting to LogDocMergePolicy).

          Here are the changes:

          • Renamed merge -> maybeMerge
          • Changed the API to be "stateless" meaning the merge policy is no
            longer responsible for running the merges itself. Instead, it
            quickly returns the specification, which describes which merges
            are needed, back to the writer and the writer then runs them. I
            also changed MergeSpecification to contain a list of OneMerge
            instances.
          • Removed IndexMerger interface (just use IndexWriter instead)
          • Put isOptimized() logic into LogMergePolicy: on thinking about
            this more (and seeing response to a thread on java-dev), I now
            agree with Steve that this logically belongs in LogMergePolicy
            because each MergePolicy is free to define just what it considers
            "optimized" to mean. Then I removed the MergePolicyBase.
          • Un-deprecated {get/set} {UseCompoundFile,MergeFactor,MaxMergeDocs},
            but I did leave the static constants deprecated.

          • IndexWriter keeps track of which segments are involved in running
            merges and throws a MergeException if it's asked to initiate a
            merge that involves a segment that's already being merged.
          • Fixed LogMergePolicy to return all possible merges (exposes
            concurrency).
          • Implemented the "merge deletes when committing the merge" algorithm
            that Ning suggested (this is in commitMerge).
          • Assert that the merge request is in fact contiguous (at start &
            finish of merge) & throw MergeException if not.
          • Fixed a number of sneaky concurrency issues so that CMPW would
            work. Broke "merge" into mergeInit, mergeMiddle and mergeFinish.
            The first & last are carefully synchronized.
          • I put copyDirFiles in IW and call this in addIndexesNoOptimize
            before committing new segments file: we can't let mergePolicy
            leave the index inconsistent.
          • I reverted the changes to addIndexes(IndexReader[]): I think the
            change here wasn't valid: you can't assume that you can re-create
            any IndexReader instance by loading from its directory; I put the
            original back for this method.
          • I'm not sure the changes to addIndexes are good.
          • Fixed LogMergePolicy to return more than 1 merge
          • Made CMPW
          • Renamed replace -> commitMerge; made it private.
          mikemccand Michael McCandless added a comment -

          OK new patch:

          • Added the missing MergePolicy.java from last time that Ning caught
            (thanks!)
          • Fixed some javadocs
          • Relaxed synchronization of merging so that merges can run
            concurrently with flushing if you are using multiple threads to do
            indexing. This gains concurrency of merging even if you are not
            using CMPW. But I left flushing as synchronized; I think we can
            relax this at some point in the future.
          • Fixed some concurrency issues
          • Added "minMergeDocs" to LogDocMergePolicy and "minMergeMB" to
            LogByteSizeMergePolicy; set their defaults as described in
            LUCENE-845.

          Still a few small things to do. I think it's getting close.

          ningli Ning Li added a comment -

          I include comments for both LUCENE-847 and LUCENE-870 here since they are closely related.

          I like the stateless approach used for refactoring merge policy. But modeling concurrent merge (ConcurrentMergePolicyWrapper) as a MergePolicy seems to be inconsistent with the MergePolicy interface:
          1 As you pointed out, "the merge policy is no longer responsible for running the merges itself". MergePolicy.maybeMerge simply returns a merge specification. But ConcurrentMergePolicyWrapper.maybeMerge actually starts concurrent merge threads thus doing the merges.
          2 Related to 1, cascading is done in IndexWriter in non-concurrent case. But in concurrent case, cascading is also done in merge threads which are started by ConcurrentMergePolicyWrapper.maybeMerge.

          MergePolicy.maybeMerge should continue to simply return a merge specification. (BTW, should we rename this maybeMerge to, say, findCandidateMerges?) Can we carve the merge process out of IndexWriter into a Merger? IndexWriter still provides the building blocks - merge(OneMerge), mergeInit(OneMerge), etc. Merger uses these building blocks. A ConcurrentMerger extends Merger but starts concurrent merge threads as ConcurrentMergePolicyWrapper does.
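          A very rough sketch of that split might look as follows. Class names
          (Merger, ConcurrentMerger) follow this comment, not any real Lucene
          API, and merges are modeled as plain strings: the base Merger runs a
          merge inline, while ConcurrentMerger overrides it to run each merge
          in a background thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the proposed Merger/ConcurrentMerger split.
public class MergerSplitSketch {
    static class Merger {
        protected final List<String> completed =
            Collections.synchronizedList(new ArrayList<>());
        void doMerge(String oneMerge) { completed.add(oneMerge); } // serial: inline
        void sync() { }                                            // nothing pending
        int completedCount() { return completed.size(); }
    }

    static class ConcurrentMerger extends Merger {
        private final List<Thread> threads = new ArrayList<>();
        @Override void doMerge(String oneMerge) {
            Thread t = new Thread(() -> completed.add(oneMerge));  // background merge
            threads.add(t);
            t.start();
        }
        @Override void sync() {                                    // wait for all merges
            for (Thread t : threads) {
                try { t.join(); } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

    public static void main(String[] args) {
        Merger m = new ConcurrentMerger();
        m.doMerge("_0+_1");
        m.doMerge("_2+_3");
        m.sync();
        System.out.println(m.completedCount()); // prints 2
    }
}
```

          The caller only talks to the Merger interface, so swapping serial
          for concurrent merging would not change the merge policy at all.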

          Other comments:
          1 UpdateDocument's and deleteDocument's bufferDeleteTerm are synchronized on different variables in this patch. However, the semantics of updateDocument changed since LUCENE-843. Before LUCENE-843, updateDocument, which is a delete and an insert, guaranteed the delete and the insert are committed together (thus an update). Now it's possible that they are committed in different transactions. If we consider DocumentsWriter as the RAM staging area for IndexWriter, then deletes are also buffered in RAM staging area and we can restore our previous semantics, right?

          2 OneMerge.segments seems to rely on its segment infos' reference to segment infos of IndexWriter.segmentInfos. The use in commitMerge, which calls ensureContiguousMerge, is an example. However, segmentInfos can be a cloned copy because of exceptions, thus the reference is broken.

          3 Calling optimize of an IndexWriter with the current ConcurrentMergePolicyWrapper may cause deadlock: the one merge spec returned by MergePolicy.optimize may be in conflict with a concurrent merge (the same merge spec will be returned without changes to segmentInfos), but a concurrent merge cannot finish because optimize is holding the lock.

          4 Finally, a couple of minor things:
          1 LogMergePolicy.useCompoundFile(SegmentInfos infos, SegmentInfo info) and useCompoundDocStore(SegmentInfos infos): why the parameters?
          2 Do we need doMergeClose in IndexWriter? Can we simply close a MergePolicy if not null?

          mikemccand Michael McCandless added a comment -

          Thanks for the detailed review Ning!

          > 1 As you pointed out, "the merge policy is no longer responsible for
          > running the merges itself". MergePolicy.maybeMerge simply returns a
          > merge specification. But ConcurrentMergePolicyWrapper.maybeMerge
          > actually starts concurrent merge threads thus doing the merges.

          True, but I was thinking CMPW could be an exception to this rule. I
          guess I would change the rule to "simple merge policies don't have to
          run their own merges"?

          However I do agree, CMPW is clearly a different beast from a typical
          merge policy because it entails scheduling, not selection, of merges.

          > 2 Related to 1, cascading is done in IndexWriter in non-concurrent
          > case. But in concurrent case, cascading is also done in merge
          > threads which are started by
          > ConcurrentMergePolicyWrapper.maybeMerge.

          Good point... I think I could refactor this so that cascading logic
          lives entirely in one place: IndexWriter.

          > MergePolicy.maybeMerge should continue to simply return a merge
          > specification. (BTW, should we rename this maybeMerge to, say,
          > findCandidateMerges?)

          Good! I like findCandidateMerges better; I'll change it.

          > Can we carve the merge process out of IndexWriter into a Merger?
          > IndexWriter still provides the building blocks - merge(OneMerge),
          > mergeInit(OneMerge), etc. Merger uses these building blocks. A
          > ConcurrentMerger extends Merger but starts concurrent merge threads
          > as ConcurrentMergePolicyWrapper does.

          How would this be used? Ie, how would one make an IndexWriter that
          uses the ConcurrentMerger? Would we add expert methods
          IndexWriter.set/getIndexMerger(...)? (And presumably the mergePolicy
          is now owned by IndexMerger so it would have the
          set/getMergePolicy(...)?)

          Also, how would you separate what remains in IW vs what would be in
          IndexMerger?

          I like this approach in principle; I'm just trying to hash out the
          details...

          > 1 UpdateDocument's and deleteDocument's bufferDeleteTerm are
          > synchronized on different variables in this patch.

          Woops, good catch! I'll fix.

          > However, the semantics of updateDocument changed since
          > LUCENE-843. Before LUCENE-843, updateDocument, which is a delete and
          > an insert, guaranteed the delete and the insert are committed
          > together (thus an update). Now it's possible that they are committed
          > in different transactions. If we consider DocumentsWriter as the RAM
          > staging area for IndexWriter, then deletes are also buffered in RAM
          > staging area and we can restore our previous semantics, right?

          Hmm ... you're right. This is a separate issue from merge policy,
          right? Are you proposing buffering deletes in DocumentsWriter
          instead?

          > 2 OneMerge.segments seems to rely on its segment infos' reference to
          > segment infos of IndexWriter.segmentInfos. The use in commitMerge,
          > which calls ensureContiguousMerge, is an example. However,
          > segmentInfos can be a cloned copy because of exceptions, thus the
          > reference broken.

          Good catch! How to fix? One thing we could do is always use
          SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
          level. This way using the default 'equals' (same instance) would
          work. Or we could establish identity (equals) of a SegmentInfo as
          checking if the directory plus segment name are equal? I think I'd
          lean to the 2nd option....
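For illustration, the second option could look like this minimal sketch (SegmentInfoSketch and its String directory field are hypothetical stand-ins for the real SegmentInfo, which holds a Directory reference):

```java
import java.util.Objects;

// Hypothetical sketch of option 2: SegmentInfo identity defined by
// directory plus segment name, so a cloned SegmentInfos can still be
// matched against OneMerge.segments. Not the real SegmentInfo class.
class SegmentInfoSketch {
    final String dir;   // stand-in for the Directory reference
    final String name;  // segment name, e.g. "_0"

    SegmentInfoSketch(String dir, String name) {
        this.dir = dir;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SegmentInfoSketch)) return false;
        SegmentInfoSketch s = (SegmentInfoSketch) o;
        return dir.equals(s.dir) && name.equals(s.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(dir, name);
    }
}
```

With this identity, a clone produced during exception handling still compares equal to the instance recorded in a pending merge.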

          > 3 Calling optimize of an IndexWriter with the current
          > ConcurrentMergePolicyWrapper may cause deadlock: the one merge spec
          > returned by MergePolicy.optimize may be in conflict with a
          > concurrent merge (the same merge spec will be returned without
          > changes to segmentInfos), but a concurrent merge cannot finish
          > because optimize is holding the lock.

          Hmmm yes. In fact I think we can remove synchronized from optimize
          altogether since within it we are synchronizing(this) at the right
          places? If more than one thread calls optimize at once, externally,
          it is actually OK: they will each pick a merge that's viable
          (concurrently) and will run the merge, else return once there is no
          more concurrency left. I'll add a unit test that confirms this.

          > 4 Finally, a couple of minor things:
          >
          > 1 LogMergePolicy.useCompoundFile(SegmentInfos infos, SegmentInfo
          > info) and useCompoundDocStore(SegmentInfos infos): why the
          > parameters?

          Well, useCompoundFile(...) is given a single newly flushed segment and
          should decide whether it should be CFS. Whereas
          useCompoundDocStore(...) is called when doc stores are flushed. When
          autoCommit=false, segments can share a single set of doc stores, so
          there's no single SegmentInfo to pass down.

          > 2 Do we need doMergeClose in IndexWriter? Can we simply close a
          > MergePolicy if not null?

          Good point. I think this is reasonable – I'll fix.

          ningli Ning Li added a comment -

          > True, but I was thinking CMPW could be an exception to this rule. I
          > guess I would change the rule to "simple merge policies don't have to
          > run their own merges"?

          Let's see if we have to make that exception.

          > Good point... I think I could refactor this so that cascading logic
          > lives entirely in one place IndexWriter.

          Another problem of the current cascading in CMPW.MergeThread is, if multiple candidate merges are found, all of them are added to IndexWriter.mergingSegments. But all but the first should be removed because only the first merge is carried out (thus removed from mergeSegments after the merge is done).

          How do you make cascading live entirely in IndexWriter? Just removing cascading from CMPW.MergeThread has one drawback. For example, segment sizes of an index are: 40, 20, 10, buffer size is 10 and merge factor is 2. A buffer full flush of 10 will trigger merge of 10 & 10, then cascade to 20 & 20, then cascade to 40 & 40. CMPW without cascading will stop after 10 & 10 since IndexWriter.maybeMerge has already returned. Then we have to wait for the next flush to merge 20 & 20.
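The cascade in that example can be traced with a toy simulation (`cascade` is a hypothetical helper working on segment sizes only, not LogMergePolicy):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of cascaded merges: whenever the newest `mergeFactor`
// segments have equal size, merge them and check again. This only
// traces the 40/20/10 example above, not the real merge policy.
class CascadeSketch {
    static List<Integer> cascade(List<Integer> sizes, int mergeFactor) {
        List<Integer> segs = new ArrayList<>(sizes);
        while (segs.size() >= mergeFactor && sameLevel(segs, mergeFactor)) {
            int sum = 0;
            for (int i = 0; i < mergeFactor; i++)
                sum += segs.remove(segs.size() - 1);  // merge the newest run
            segs.add(sum);                            // may enable the next cascade
        }
        return segs;
    }

    private static boolean sameLevel(List<Integer> segs, int mergeFactor) {
        int n = segs.size();
        for (int i = n - mergeFactor; i < n - 1; i++)
            if (!segs.get(i).equals(segs.get(i + 1))) return false;
        return true;
    }
}
```

Starting from segments 40, 20, 10 plus a freshly flushed 10, the cascade merges 10 & 10, then 20 & 20, then 40 & 40, ending with a single segment of 80; stopping after the first merge would strand the two 20s until the next flush.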

          > How would this be used? Ie, how would one make an IndexWriter that
          > uses the ConcurrentMerger? Would we add expert methods
          > IndexWriter.set/getIndexMerger(...)? (And presumably the mergePolicy
          > is now owned by IndexMerger so it would have the
          > set/getMergePolicy(...)?)
          >
          > Also, how would you separate what remains in IW vs what would be in
          > IndexMerger?

          Maybe Merger does and only does merge (so IndexWriter still owns MergePolicy)? Say, base class Merger.merge simply calls IndexWriter.merge. ConcurrentMerger.merge creates a merge thread if possible. Otherwise it calls super.merge, which does non-concurrent merge. IndexWriter simply calls its merger's merge instead of its own merge. Everything else remains in IndexWriter.
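A rough sketch of that shape (WriterStub stands in for IndexWriter, and a String stands in for OneMerge; all names are illustrative, not the actual patch):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposal above: a Merger that does nothing
// but merge, with IndexWriter keeping ownership of the MergePolicy.
class WriterStub {
    final AtomicInteger mergesDone = new AtomicInteger();
    void merge(String oneMerge) { mergesDone.incrementAndGet(); }
}

class Merger {
    // Base class: run the merge in the calling thread via IndexWriter.
    void merge(WriterStub writer, String oneMerge) {
        writer.merge(oneMerge);
    }
}

class ConcurrentMerger extends Merger {
    private final int maxThreads;
    private int active = 0;

    ConcurrentMerger(int maxThreads) { this.maxThreads = maxThreads; }

    @Override
    void merge(WriterStub writer, String oneMerge) {
        Thread t = null;
        synchronized (this) {
            if (active < maxThreads) {
                active++;
                t = new Thread(() -> {
                    try {
                        writer.merge(oneMerge);
                    } finally {
                        synchronized (this) { active--; }
                    }
                });
            }
        }
        if (t != null) t.start();
        else super.merge(writer, oneMerge);  // no thread slot: non-concurrent merge
    }
}
```

The key property is that everything else stays in IndexWriter; the merger only decides which thread executes the merge.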

          1
          > Hmm ... you're right. This is a separate issue from merge policy,
          > right? Are you proposing buffering deletes in DocumentsWriter
          > instead?

          Yes, this is a separate issue. And yes if we consider DocumentsWriter as staging area.

          2
          > Good catch! How to fix? One thing we could do is always use
          > SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
          > level. This way using the default 'equals' (same instance) would
          > work. Or we could establish identity (equals) of a SegmentInfo as
          > checking if the directory plus segment name are equal? I think I'd
          > lean to the 2nd option....

          I think the 2nd option is better.

          3
          > Hmmm yes. In fact I think we can remove synchronized from optimize
          > altogether since within it we are synchronizing(this) at the right
          > places? If more than one thread calls optimize at once, externally,
          > it is actually OK: they will each pick a merge that's viable
          > (concurrently) and will run the merge, else return once there is no
          > more concurrency left. I'll add a unit test that confirms this.

          That seems to be the case. The fact that "the same merge spec will be returned without changes to segmentInfos" reminds me: MergePolicy.findCandidateMerges finds merges which may not be eligible; but CMPW checks for eligibility when looking for candidate merges. Maybe we should unify the behaviour? BTW, MergePolicy.optimize (a rename?) doesn't check for eligibility either.

          4
          > Well, useCompoundFile(...) is given a single newly flushed segment and
          > should decide whether it should be CFS. Whereas
          > useCompoundDocStore(...) is called when doc stores are flushed. When
          > autoCommit=false, segments can share a single set of doc stores, so
          > there's no single SegmentInfo to pass down.

          The reason I asked is because none of them are used right now. So they might be used in the future?

          mikemccand Michael McCandless added a comment -

          > > Good point... I think I could refactor this so that cascading logic
          > > lives entirely in one place IndexWriter.
          >
          > Another problem of the current cascading in CMPW.MergeThread is, if
          > multiple candidate merges are found, all of them are added to
          > IndexWriter.mergingSegments. But all but the first should be removed
          > because only the first merge is carried out (thus removed from
          > mergeSegments after the merge is done).

          You're right – I'm only doing the first non-conflicting merge in
          CMPW (but then not releasing the rest of them). I think this would be
          fixed by having cascading logic only in IndexWriter.

          > How do you make cascading live entirely in IndexWriter? Just
          > removing cascading from CMPW.MergeThread has one drawback. For
          > example, segment sizes of an index are: 40, 20, 10, buffer size is
          > 10 and merge factor is 2. A buffer full flush of 10 will trigger
          > merge of 10 & 10, then cascade to 20 & 20, then cascade to 40 &
          > 40. CMPW without cascading will stop after 10 & 10 since
          > IndexWriter.maybeMerge has already returned. Then we have to wait
          > for the next flush to merge 20 & 20.

          Oh, I would remove it from CMPW and then add it into IndexWriter (so
          the scenario above would cascade normally). Meaning, IndexWriter,
          upon completing a merge, would always consult the policy for whether
          the completed merge has now enabled any new merges.
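That loop might look roughly like this (Policy and the String merge specs are hypothetical stand-ins for MergePolicy and OneMerge):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch: cascading driven entirely by IndexWriter. After each completed
// merge the writer consults the policy again, so cascades happen no
// matter which scheduler ran the merge. All names are illustrative.
class CascadingWriterSketch {
    interface Policy {
        List<String> findCandidateMerges();
    }

    int mergesRun = 0;

    void maybeMerge(Policy policy) {
        Deque<String> pending = new ArrayDeque<>(policy.findCandidateMerges());
        while (!pending.isEmpty()) {
            runMerge(pending.removeFirst());
            // the completed merge may have enabled new merges: cascade here
            pending.addAll(policy.findCandidateMerges());
        }
    }

    void runMerge(String oneMerge) {
        mergesRun++;  // placeholder for the real merge work
    }
}
```

Because the re-consultation happens after every completed merge, the policy itself never needs to know whether it is being called from a cascade.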

          This is somewhat messy though (with CMPW as a MergePolicy) because
          then findCandidateMerges would need to know if it was being called
          (due to cascading) under one of its own threads and if so act
          differently. Another good reason to make it a separate Merger
          subclass.

          > > How would this be used? Ie, how would one make an IndexWriter
          > > that uses the ConcurrentMerger? Would we add expert methods
          > > IndexWriter.set/getIndexMerger(...)? (And presumably the
          > > mergePolicy is now owned by IndexMerger so it would have the
          > > set/getMergePolicy(...)?)
          > >
          > > Also, how would you separate what remains in IW vs what would be
          > > in IndexMerger?
          >
          > Maybe Merger does and only does merge (so IndexWriter still owns
          > MergePolicy)? Say, base class Merger.merge simply calls
          > IndexWriter.merge. ConcurrentMerger.merge creates a merge thread if
          > possible. Otherwise it calls super.merge, which does non-concurrent
          > merge. IndexWriter simply calls its merger's merge instead of its
          > own merge. Everything else remains in IndexWriter.

          OK I will test out this approach.

          > > Hmm ... you're right. This is a separate issue from merge policy,
          > > right? Are you proposing buffering deletes in DocumentsWriter
          > > instead?
          >
          > Yes, this is a separate issue. And yes if we consider
          > DocumentsWriter as staging area.

          I will open new issue.

          > > Good catch! How to fix? One thing we could do is always use
          > > SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
          > > level. This way using the default 'equals' (same instance) would
          > > work. Or we could establish identity (equals) of a SegmentInfo as
          > > checking if the directory plus segment name are equal? I think
          > > I'd lean to the 2nd option....
          >
          > I think the 2nd option is better.

          I'll take this approach.

          > > Hmmm yes. In fact I think we can remove synchronized from
          > > optimize altogether since within it we are synchronizing(this) at
          > > the right places? If more than one thread calls optimize at once,
          > > externally, it is actually OK: they will each pick a merge that's
          > > viable (concurrently) and will run the merge, else return once
          > > there is no more concurrency left. I'll add a unit test that
          > > confirms this.
          >
          > That seems to be the case.

          I'll add unit test to confirm.

          > The fact that "the same merge spec will be returned without changes
          > to segmentInfos" reminds me: MergePolicy.findCandidateMerges finds
          > merges which may not be eligible; but CMPW checks for eligibility
          > when looking for candidate merges. Maybe we should unify the
          > behaviour?

          Not quite following you here... not being eligible because the merge
          is in-progress in a thread is something I think any given MergePolicy
          should not have to track? Once I factor out CMPW as its own Merger
          subclass I think the eligibility check happens only in IndexWriter?

          > BTW, MergePolicy.optimize (a rename?) doesn't check for eligibility
          > either.

          Rename to/from what? (It is currently called MergePolicy.optimize).
          IndexWriter steps through the merges and only runs the ones that do
          not conflict (are eligible)?

          > > Well, useCompoundFile(...) is given a single newly flushed segment
          > > and should decide whether it should be CFS. Whereas
          > > useCompoundDocStore(...) is called when doc stores are flushed.
          > > When autoCommit=false, segments can share a single set of doc
          > > stores, so there's no single SegmentInfo to pass down.
          >
          > The reason I asked is because none of them are used right now. So
          > they might be used in the future?

          Both of these methods are now called by IndexWriter (in the patch),
          upon flushing a new segment.

          ningli Ning Li added a comment -

          > Not quite following you here... not being eligible because the merge
          > is in-progress in a thread is something I think any given MergePolicy
          > should not have to track? Once I factor out CMPW as its own Merger
          > subclass I think the eligibility check happens only in IndexWriter?

          I was referring to the current patch: LogMergePolicy does not check for eligibility, but CMPW, a subclass of MergePolicy, checks for eligibility. Yes, the eligibility check only happens in IndexWriter after we do Merger class.

          > Rename to/from what? (It is currently called MergePolicy.optimize).
          > IndexWriter steps through the merges and only runs the ones that do
          > not conflict (are eligible)?

          Maybe rename to MergePolicy.findMergesToOptimize?

          > > The reason I asked is because none of them are used right now. So
          > > they might be used in the future?
          >
          > Both of these methods are now called by IndexWriter (in the patch),
          > upon flushing a new segment.

          I was referring to the parameters. The parameters are not used.

          mikemccand Michael McCandless added a comment -

          > > Not quite following you here... not being eligible because the
          > > merge is in-progress in a thread is something I think any given
          > > MergePolicy should not have to track? Once I factor out CMPW as
          > > its own Merger subclass I think the eligibility check happens only
          > > in IndexWriter?
          >
          > I was referring to the current patch: LogMergePolicy does not check
          > for eligibility, but CMPW, a subclass of MergePolicy, checks for
          > eligibility. Yes, the eligibility check only happens in IndexWriter
          > after we do Merger class.

          OK, let's leave eligibility check in IW.

          > > Rename to/from what? (It is currently called
          > > MergePolicy.optimize). IndexWriter steps through the merges and
          > > only runs the ones that do not conflict (are eligible)?
          >
          > Maybe rename to MergePolicy.findMergesToOptimize?

          OK, that's good.

          > > > The reason I asked is because none of them are used right
          > > > now. So they might be used in the future?
          > >
          > > Both of these methods are now called by IndexWriter (in the
          > > patch), upon flushing a new segment.
          >
          > I was referring to the parameters. The parameters are not used.

          Ahh, got it. Yes, the thinking is that merge policies in the future
          may want to look at the SegmentInfos to decide.

          mikemccand Michael McCandless added a comment -

          Attached new patch (take5) incorporating Ning's feedback.

          This patch includes LUCENE-845 (a new default merge policy plus
          a "merge by size in bytes of segment" merge policy), LUCENE-847
          (factor merge policy/scheduling out of IndexWriter) and LUCENE-870
          (ConcurrentMergeScheduler).

          The one thing remaining after these are done, that I'll open a
          separate issue for and commit separately, is to switch IndexWriter to
          flush by RAM usage by default (instead of by docCount == 10) as well
          as merge by size-in-bytes by default.

          I broke out a separate MergeScheduler interface. SerialMergeScheduler
          is the default (matches how merges are executed today: sequentially,
          using the calling thread). ConcurrentMergeScheduler runs the merges
          as separate threads (up to a max number at which point the extras are
          done sequentially).
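
          The split described above can be sketched in miniature. The
          following is an illustrative standalone model, not the patch
          code: MergeSource is a hypothetical stand-in for IndexWriter,
          and only the serial scheduler is shown.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class MergeSchedulerSketch {

    /** Stands in for IndexWriter handing out queued merges. */
    interface MergeSource {
        Runnable getNextPendingMerge(); // null when nothing is pending
    }

    /** The factored-out scheduling abstraction. */
    interface MergeScheduler {
        void merge(MergeSource source);
    }

    /** Runs every pending merge sequentially on the calling thread. */
    static class SerialMergeScheduler implements MergeScheduler {
        public synchronized void merge(MergeSource source) {
            Runnable m;
            while ((m = source.getNextPendingMerge()) != null) {
                m.run();
            }
        }
    }

    public static void main(String[] args) {
        Queue<Runnable> pending = new ArrayDeque<>();
        final int[] merged = {0};
        pending.add(() -> merged[0]++);
        pending.add(() -> merged[0]++);
        new SerialMergeScheduler().merge(pending::poll);
        System.out.println("merges run: " + merged[0]); // prints "merges run: 2"
    }
}
```

          A concurrent scheduler would implement the same interface but
          hand each merge to its own thread instead of the caller's.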

          Other changes:

          • Allow multiple threads to call optimize(). I added a unit test
            for this.
          • Tightened calls to deleter.refresh(), which removes partially
            created files on an exception, to remove only those files that the
            given piece of code would create. This is very important because
            otherwise refresh() could remove the files being created by a
            background merge.
          • Added some unit tests
          cutting Doug Cutting added a comment -

          Is there any reason not to make ConcurrentMergeScheduler the default too after this is committed?

          mikemccand Michael McCandless added a comment -

          > Is there any reason not to make ConcurrentMergeScheduler the default too after this is committed?

          Good question. The only downsides I can think of are:

          • It's all fresh code, so until we let it "age" some there's a higher
            risk that something broke. That said, there is decent unit test
            coverage for it and these unit tests did find some sneaky issues
            (which I fixed!).
          • It only actually helps on machines that have some concurrency.
            But in this case we are largely talking about IO concurrent w/ CPU
            which nearly all machines have I think.

          I think the benefits are sizable:

          • Good performance gains (25% speedup of net indexing time for all
            of Wikipedia content – details in LUCENE-870)
          • Trivial way to leverage concurrency (ie you don't need to manage
            your own threads).
          • No more unexpected long pauses on certain addDocument calls.

          So I think it would make sense to make it the default. I'll include
          that in the new issue for changing defaults in IndexWriter.

          mikemccand Michael McCandless added a comment -

          OK, as a better test of ConcurrentMergeScheduler, and towards making it
          the default merge scheduler, I tried making it the default in
          IndexWriter and then ran all unit tests, and uncovered problems with
          the current patch (notably how optimize works!). So I'm working on a
          new patch now...

          ningli Ning Li added a comment -

          Comments on optimize():

          • In the while loop of optimize(), LogMergePolicy.findMergesForOptimize returns a merge spec with one merge. If ConcurrentMergeScheduler is used, the one merge will be started in MergeScheduler.merge() and findMergesForOptimize will be called again. Before the one merge finishes, findMergesForOptimize will return the same spec but the one merge is already started. So only one concurrent merge is possible and the main thread will spin on calling findMergesForOptimize and attempting to merge.
          • One possible solution is to make LogMergePolicy.findMergesForOptimize return multiple merge candidates. It allows higher level of concurrency. It also alleviates a bit the problem of main thread spinning. To solve this problem, maybe we can check if a merge is actually started, then sleep briefly if not (which means all merges candidates are in conflict)?

          A comment on concurrent merge threads:

          • One difference between the current approach on concurrent merge and the patch I posted a while back is that, in the current approach, a MergeThread object is created and started for every concurrent merge. In my old patch, maxThreadCount of threads are created and started at the beginning and are used throughout. Both have pros and cons.
          mikemccand Michael McCandless added a comment -

          OK, another rev of the patch (take6). I think it's close!

          This patch passes all unit tests with SerialMergeScheduler (left as
          the default for now) and also passes all unit tests once you switch
          the default to ConcurrentMergeScheduler instead.

          I made one simplification to the approach: IndexWriter now keeps track
          of "pendingMerges" (merges that mergePolicy has declared are necessary
          but have not yet been started), and "runningMerges" (merges currently
          in flight). Then MergeScheduler just asks IndexWriter for the next
          pending merge when it's ready to run it. This also cleaned up how
          cascading works.

          Other changes:

          • Optimize: optimize is now fully concurrent (it can run multiple
            merges at once, new segments can be flushed during an optimize,
            etc). Optimize will optimize only those segments present when it
            started (newly flushed segments may remain separate).
          • New API: optimize(boolean doWait) allows you to not wait for
            optimize to complete (it runs in background). This only works
            when MergeScheduler uses threads.
          • New API: close(boolean doWait) allows you to not wait for running
            merges if you want to "close in a hurry". Also only works when
            MergeScheduler uses threads.
          • I fixed LogMergePolicy to expose merge concurrency during optimize
            by first calling the "normal" merge policy to see if it requires
            merges and returning those merges if so, and then falling back to
            the normal "merge the tail <= mergeFactor segments until there is
            only 1 left".
          • Because IndexModifier synchronizes on directory, it can't use
            ConcurrentMergeScheduler since this quickly leads to deadlock at
            least during IndexWriter.close. So I set it back to
            SerialMergeScheduler (it is deprecated anyway).
          • Added private IndexWriter.message(...) that prints message to the
            infoStream prefixed by the thread name and changed all
            infoStream.print*'s to message(...). Also added more messages in
            the exceptional cases to aid future diagnostics.
          • Added more unit tests
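
          The optimize semantics in the first bullet can be modeled in a
          few lines. segmentsToOptimize is the name used in the patch
          description; everything else below is a hypothetical stand-in.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Toy model: optimize() snapshots the segments present when it starts,
 * and only merges entirely inside that snapshot count toward it;
 * segments flushed afterwards may stay separate.
 */
public class OptimizeSnapshot {
    /** Snapshot taken when optimize() begins. */
    static Set<String> segmentsToOptimize(List<String> segmentsAtStart) {
        return new HashSet<>(segmentsAtStart);
    }

    /** Is this candidate merge part of the running optimize? */
    static boolean isOptimizeMerge(Set<String> snapshot, List<String> merge) {
        return snapshot.containsAll(merge);
    }

    public static void main(String[] args) {
        Set<String> snap = segmentsToOptimize(List.of("_0", "_1", "_2"));
        System.out.println(isOptimizeMerge(snap, List.of("_0", "_1"))); // true
        // "_3" was flushed after optimize started, so this merge is not
        // required for the optimize to complete:
        System.out.println(isOptimizeMerge(snap, List.of("_2", "_3"))); // false
    }
}
```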
          mikemccand Michael McCandless added a comment -

          > In the while loop of optimize(), LogMergePolicy.findMergesForOptimize
          > returns a merge spec with one merge. If ConcurrentMergeScheduler is
          > used, the one merge will be started in MergeScheduler.merge() and
          > findMergesForOptimize will be called again. Before the one merge
          > finishes, findMergesForOptimize will return the same spec but the
          > one merge is already started. So only one concurrent merge is
          > possible and the main thread will spin on calling
          > findMergesForOptimize and attempting to merge.

          Yes. The new patch has cleaned this up nicely, I think.

          > One possible solution is to make LogMergePolicy.findMergesForOptimize
          > return multiple merge candidates. It allows higher level of
          > concurrency.

          Good idea! I took exactly this approach in patch I just attached. I
          made a simple change: LogMergePolicy.findMergesForOptimize first
          checks if "normal merging" would want to do merges and returns them if
          so. Since "normal merging" exposes concurrent merges, this gains us
          concurrency for optimize in cases where the index has too many
          segments. I wasn't sure how otherwise to expose concurrency...

          > It also alleviates a bit the problem of main thread spinning. To
          > solve this problem, maybe we can check if a merge is actually
          > started, then sleep briefly if not (which means all merges
          > candidates are in conflict)?

          This is much cleaner in new patch: there is no more spinning. In new
          patch if multiple threads are merging (either spawned by
          ConcurrentMergeScheduler or provided by the application or both) then
          they all pull from a shared queue of "merges needing to run" and then
          return when that queue is empty. So no more spinning.

          > One difference between the current approach on concurrent merge and
          > the patch I posted a while back is that, in the current approach, a
          > MergeThread object is created and started for every concurrent
          > merge. In my old patch, maxThreadCount of threads are created and
          > started at the beginning and are used throughout. Both have pros and
          > cons.

          Yeah I thought I would keep it simple (launch thread when needed then
          let it finish when it's done) rather than use a pool. This way
          threads are only created (and are only alive) while concurrency is
          actually needed (ie > N merges necessary at once). But yes there are
          pros/cons either way.
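
          A minimal sketch of the "launch a thread when needed, capped at
          maxThreadCount" choice; the thread name, field names, and the
          launchOrRun helper are illustrative, not the actual
          ConcurrentMergeScheduler code.

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadPerMerge {
    final int maxThreadCount = 3;
    final List<Thread> mergeThreads = new ArrayList<>();

    synchronized void launchOrRun(Runnable oneMerge) {
        // prune finished threads so only live ones count against the cap
        mergeThreads.removeIf(t -> !t.isAlive());
        if (mergeThreads.size() < maxThreadCount) {
            Thread t = new Thread(oneMerge, "Lucene Merge Thread");
            mergeThreads.add(t);
            t.start();          // short-lived thread, dies when the merge ends
        } else {
            oneMerge.run();     // over the cap: caller runs it serially
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPerMerge scheduler = new ThreadPerMerge();
        scheduler.launchOrRun(() -> System.out.println("background merge"));
        for (Thread t : scheduler.mergeThreads) {
            t.join();           // wait for spawned merges before exiting
        }
    }
}
```

          A pool-based variant would instead create maxThreadCount
          long-lived workers up front and feed them from a queue.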

          ningli Ning Li added a comment -

          > OK, another rev of the patch (take6). I think it's close!

          Yes, it's close!

          > I made one simplification to the approach: IndexWriter now keeps track
          > of "pendingMerges" (merges that mergePolicy has declared are necessary
          > but have not yet been started), and "runningMerges" (merges currently
          > in flight). Then MergeScheduler just asks IndexWriter for the next
          > pending merge when it's ready to run it. This also cleaned up how
          > cascading works.

          I like this simplification.

          > * Optimize: optimize is now fully concurrent (it can run multiple
          > merges at once, new segments can be flushed during an optimize,
          > etc). Optimize will optimize only those segments present when it
          > started (newly flushed segments may remain separate).

          These semantics do add a bit of complexity - segmentsToOptimize, OneMerge.optimize.

          > Good idea! I took exactly this approach in patch I just attached. I
          > made a simple change: LogMergePolicy.findMergesForOptimize first
          > checks if "normal merging" would want to do merges and returns them if
          > so. Since "normal merging" exposes concurrent merges, this gains us
          > concurrency for optimize in cases where the index has too many
          > segments. I wasn't sure how otherwise to expose concurrency...

          Another option is to schedule merges for the newest N segments and the next newest N segments and the next next... N is the merge factor.
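
          That suggestion can be sketched as a simple partition of the
          segment list (newest last) into chunks of mergeFactor. This is
          an illustrative standalone version, not LogMergePolicy itself.

```java
import java.util.ArrayList;
import java.util.List;

public class OptimizePartition {
    /** Split segment names (newest last) into candidate merges of size N. */
    static List<List<String>> findMergesForOptimize(List<String> segments,
                                                    int mergeFactor) {
        List<List<String>> spec = new ArrayList<>();
        // walk from the newest (end of list) backwards in chunks of mergeFactor
        for (int end = segments.size(); end - mergeFactor >= 0; end -= mergeFactor) {
            spec.add(new ArrayList<>(segments.subList(end - mergeFactor, end)));
        }
        return spec;
    }

    public static void main(String[] args) {
        List<String> segs = List.of("_0", "_1", "_2", "_3", "_4");
        System.out.println(findMergesForOptimize(segs, 2));
        // two candidate merges, [_3,_4] and [_1,_2], can run concurrently;
        // _0 waits for a later cascading pass
    }
}
```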

          A couple of other things:

          • It seems you intended sync() to be part of the MergeScheduler interface?
          • IndexWriter.close([true]), abort(): The behaviour should be the same whether the calling thread is the one that actually gets to do the closing. Right now, only the thread that actually does the closing waits for the closing. The others do not wait for the closing.
          mikemccand Michael McCandless added a comment -

          > > Good idea! I took exactly this approach in patch I just attached. I
          > > made a simple change: LogMergePolicy.findMergesForOptimize first
          > > checks if "normal merging" would want to do merges and returns them if
          > > so. Since "normal merging" exposes concurrent merges, this gains us
          > > concurrency for optimize in cases where the index has too many
          > > segments. I wasn't sure how otherwise to expose concurrency...
          >
          > Another option is to schedule merges for the newest N segments and
          > the next newest N segments and the next next... N is the merge
          > factor.

          OK, that is simpler. I'll take that approach (and not call the
          "normal" merge policy first).

          > A couple of other things:
          >
          > - It seems you intended sync() to be part of the MergeScheduler
          > interface?

          I had started down this route but then backed away from it: I think
          IndexWriter should handle this rather than making every MergeScheduler
          have duplicated code for doing so. Oh I see, I had left an empty sync()
          in SerialMergeScheduler; I'll remove that.

          > - IndexWriter.close([true]), abort(): The behaviour should be the
          > same whether the calling thread is the one that actually gets to do
          > the closing. Right now, only the thread that actually does the
          > closing waits for the closing. The others do not wait for the
          > closing.

          Ahh good point. OK, I'll have other threads wait() until the
          close/abort is complete.

          mikemccand Michael McCandless added a comment -

          New patch (take 7).

          I folded in Ning's comments (above) and Yonik's comments from
          LUCENE-845, added javadocs & fixed Javadoc warnings and fixed two
          other small issues. All tests pass on Linux, OS X, win32, with either
          SerialMergeScheduler or ConcurrentMergeScheduler as the default.

          I plan to commit in a few days time...

          ningli Ning Li added a comment -

          Access of mergeThreads in ConcurrentMergeScheduler.merge() should be synchronized.

          mikemccand Michael McCandless added a comment -

          Ahh, good catch. Will fix!

          ningli Ning Li added a comment -

          Hmm, it's actually possible to have concurrent merges with SerialMergeScheduler.

          Making SerialMergeScheduler.merge synchronize on SerialMergeScheduler will serialize all merges. A merge can still be concurrent with a ram flush.

          Making SerialMergeScheduler.merge synchronize on IndexWriter will serialize all merges and ram flushes.
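
          The two locking choices can be contrasted with a toy model
          (class names here are hypothetical; Writer stands in for
          IndexWriter):

```java
public class SerializeChoices {

    /** Stand-in for IndexWriter: flushes take the writer's monitor. */
    static class Writer {
        synchronized void flushRam() { /* flush buffered docs */ }
    }

    /** Lock the scheduler: merges serialize with each other,
        but a flush on the writer can still run concurrently. */
    static class SchedulerLockedMerger {
        synchronized void merge(Runnable oneMerge) {
            oneMerge.run();
        }
    }

    /** Lock the writer: merges also exclude flushes,
        since both contend for the writer's monitor. */
    static class WriterLockedMerger {
        void merge(Writer writer, Runnable oneMerge) {
            synchronized (writer) {
                oneMerge.run();
            }
        }
    }

    public static void main(String[] args) {
        Writer w = new Writer();
        new SchedulerLockedMerger().merge(() -> System.out.println("merge A"));
        new WriterLockedMerger().merge(w, () -> System.out.println("merge B"));
    }
}
```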

          mikemccand Michael McCandless added a comment -

          > Hmm, it's actually possible to have concurrent merges with
          > SerialMergeScheduler.

          This was actually intentional: I thought it fine if the application is
          sending multiple threads into IndexWriter to allow merges to run
          concurrently. Because, the application can always back down to a
          single thread to get everything serialized if that's really required?

          ningli Ning Li added a comment -

          > This was actually intentional: I thought it fine if the application is
          > sending multiple threads into IndexWriter to allow merges to run
          > concurrently. Because, the application can always back down to a
          > single thread to get everything serialized if that's really required?

          Today, applications use multiple threads on IndexWriter to get some concurrency on document parsing. With this patch, applications that want concurrent merges would simply use ConcurrentMergeScheduler, no?

          Or a rename since it doesn't really serialize merges?

          markrmiller@gmail.com Mark Miller added a comment -

          I have to triple check, but on first glance, my app's performance halved using the ConcurrentMergeScheduler on a recent Core Duo with 2 GB RAM (as compared to the SerialMergeScheduler). Seems unexpected?

          mikemccand Michael McCandless added a comment -

          > Today, applications use multiple threads on IndexWriter to get some
          > concurrency on document parsing. With this patch, applications that
          > want concurrent merges would simply use ConcurrentMergeScheduler,
          > no?

          True. OK I will make SerialMergeScheduler.merge serialized. This way
          only one merge can happen at a time even when the application is using
          multiple threads.

          mikemccand Michael McCandless added a comment -

          > I have to triple check, but on first glance, my app's performance
          > halved using the ConcurrentMergeScheduler on a recent Core Duo with
          > 2 GB RAM (as compared to the SerialMergeScheduler). Seems unexpected?

          Whoa, that's certainly unexpected! I'll go re-run my perf test.

          markrmiller@gmail.com Mark Miller added a comment -

          Looks like some anomalous tests. Last night I checked twice, but today the results are 58 to 48 in favor of Concurrent. I am going to assume my first results were invalid. Sorry for the noise, and thanks for the great patch. It has passed quite a few stress tests I run on my app without any problems so far. Do both merge policies allow for a closer-to-constant add time, or is it just the Concurrent policy?

          mikemccand Michael McCandless added a comment -

          > Looks like some anomalous tests. Last night I checked twice, but
          > today results are: 58 to 48 in favor of Concurrent. I am going to
          > assume my first results were invalid. Sorry for the noise and
          > thanks for the great patch.

          OK, phew!

          > Has passed quite a few stress tests I run on my app without any
          > problems so far.

          I'm glad to hear that! Thanks for being such an early adopter!

          > Do both merge policies allow for a closer to constant add time or is
          > it just the Concurrent policy?

          Not sure I understand the question – you mean addDocument? Yes, it's
          only ConcurrentMergeScheduler that should keep addDocument calls
          constant time, because SerialMergeScheduler will hijack the addDocument
          thread to do its merges.

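          To make that last point concrete, here is a toy, self-contained sketch. All of the names here (TinyWriter, the every-10-docs merge trigger, the sleep standing in for merge work) are made up for illustration and are not Lucene's actual classes; the sketch only shows why a serial scheduler makes addDocument latency spiky while a concurrent one keeps it flat:

```java
// Toy sketch (hypothetical classes, NOT Lucene's implementation):
// a serial scheduler runs merges on the calling thread, a concurrent
// scheduler hands them to a background thread.
import java.util.concurrent.atomic.AtomicInteger;

public class MergeSchedulerSketch {
    interface MergeScheduler { void merge(Runnable merge); }

    // Serial: hijacks the calling (addDocument) thread; synchronized so
    // at most one merge runs at a time even with multiple app threads.
    static class Serial implements MergeScheduler {
        public synchronized void merge(Runnable m) { m.run(); }
    }

    // Concurrent: runs the merge in the background, so the caller's
    // addDocument returns immediately.
    static class Concurrent implements MergeScheduler {
        public void merge(Runnable m) { new Thread(m).start(); }
    }

    static class TinyWriter {
        final MergeScheduler scheduler;
        final AtomicInteger docCount = new AtomicInteger();

        TinyWriter(MergeScheduler s) { scheduler = s; }

        void addDocument() {
            int n = docCount.incrementAndGet();
            if (n % 10 == 0) {                // pretend a merge is due
                scheduler.merge(() -> {       // stand-in for real merge work
                    try { Thread.sleep(5); } catch (InterruptedException e) { }
                });
            }
        }
    }

    public static void main(String[] args) {
        TinyWriter serial = new TinyWriter(new Serial());
        for (int i = 0; i < 20; i++) serial.addDocument();   // blocks on each "merge"
        System.out.println("serial docs=" + serial.docCount.get());

        TinyWriter concurrent = new TinyWriter(new Concurrent());
        for (int i = 0; i < 20; i++) concurrent.addDocument(); // merges run in background
        System.out.println("concurrent docs=" + concurrent.docCount.get());
    }
}
```

          With the patch applied, an application would opt in with something along the lines of writer.setMergeScheduler(new ConcurrentMergeScheduler()); the serial behavior (merging on the calling thread) stays the default.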
          mikemccand Michael McCandless added a comment -

          Attached take8, incorporating Ning's feedback plus some small
          refactoring and fixing one case where optimize() would do an
          unnecessary merge.


            People

            • Assignee:
              steven_parkes Steven Parkes
              Reporter:
              steven_parkes Steven Parkes
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Development