Lucene - Core
LUCENE-994

Change defaults in IndexWriter to maximize "out of the box" performance

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is follow-through from LUCENE-845, LUCENE-847 and LUCENE-870;
      I'll commit this once those three are committed.

      Out of the box performance of IndexWriter is maximized when flushing
      by RAM instead of a fixed document count (the default today) because
      documents can vary greatly in size.

      Likewise, merging performance should be faster when merging by net
      segment size since, to minimize the net IO cost of merging segments
      over time, you want to merge segments of equal byte size.

      Finally, ConcurrentMergeScheduler improves indexing speed
      substantially (25% in a simple initial test in LUCENE-870) because it
      runs the merges in the background and doesn't block
      add/update/deleteDocument calls. Most machines have concurrency
      between CPU and IO and so it makes sense to default to this
      MergeScheduler.

      Note that these changes will break users of ParallelReader because the
      parallel indices will no longer have matching docIDs. Such users need
      to switch IndexWriter back to flushing by doc count, and switch the
      MergePolicy back to LogDocMergePolicy. It's likely also necessary to
      switch the MergeScheduler back to SerialMergeScheduler to ensure
      deterministic docID assignment.
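
      For ParallelReader users, a rough sketch of pinning the old, deterministic
      behavior might look like the following (a sketch only, using the 2.3-era
      IndexWriter setters; the directory path, analyzer, and buffer value are
      placeholders):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.LogDocMergePolicy;
        import org.apache.lucene.index.SerialMergeScheduler;
        import org.apache.lucene.store.FSDirectory;

        FSDirectory dir = FSDirectory.getDirectory("/path/to/index");    // placeholder path
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        writer.setMaxBufferedDocs(1000);                       // flush by doc count again (placeholder value)
        writer.setMergePolicy(new LogDocMergePolicy());        // select merges by doc count
        writer.setMergeScheduler(new SerialMergeScheduler());  // run merges in the calling thread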

      I think the combination of these three default changes, plus other
      performance improvements for indexing (LUCENE-966, LUCENE-843,
      LUCENE-963, LUCENE-969, LUCENE-871, etc.) should make for some sizable
      performance gains in Lucene 2.3!

      Attachments

      1. LUCENE-994.patch (21 kB) - Michael McCandless
      2. writerinfo.zip (813 kB) - Mark Miller

        Activity

        Michael McCandless added a comment -

        Attached patch.

        I changed these defaults:

        • Use ConcurrentMergeScheduler so merges are run in the background.
        • Flush by RAM usage by default. I set buffer size to 16 MB.
        • Merge segments according to byte size, not doc count.
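
        For reference, getting this behavior now requires no calls at all; a rough
        sketch of the equivalent explicit configuration (setter names per the 2.3
        IndexWriter API, with writer being any IndexWriter) would be:

          writer.setRAMBufferSizeMB(16.0);                          // flush when ~16 MB of buffered RAM is used
          writer.setMergePolicy(new LogByteSizeMergePolicy());      // pick merges by net segment byte size
          writer.setMergeScheduler(new ConcurrentMergeScheduler()); // run merges in background threads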

        Most unit tests just passed, but a handful had to be switched back to
        LogDocMergePolicy because they were checking specific doc-count based
        details on how merges are selected. All tests pass now.

        I added an entry in CHANGES.txt under "Changes in runtime behavior"
        including a NOTE for ParallelReader users. I think when we release
        2.3 we should also put a caveat into the release announcement calling
        attention to this specifically for users of ParallelReader.

        I also fixed up contrib/benchmark to accept double-typed params and to
        pull the defaults for its Open/CreateIndexTask from IndexWriter's
        defaults.

        Finally I had to make a few small changes to gdata-server.

        Michael McCandless added a comment -

        I just committed this!

        This is a non-backwards-compatible change (and affects at least users
        of ParallelReader). I put a comment in the top section of CHANGES.txt
        explaining this.

        Mark Miller added a comment -

        Perhaps this is expected, but my experience:

        I load all my docs and then optimize the index. I load with a mergefactor of 3 because I have found it takes just as much time to merge as you go as it does to optimize everything at the end (I have not tested this with recent improvements).

        After changing to the new default merge policy, my app (which does a lot more when loading a doc than just indexing it with Lucene) lost 46% of its performance (processing the data, fully loading Lucene and my database, and then optimizing an index).
        Switching back to LogDocMergePolicy() returns my performance.

        I am using flushbyram(42) and the concurrent merger.

        Triple checked this time <G>

        Is this expected?

        • Mark
        Michael McCandless added a comment -

        This is certainly not expected!

        So you are flushing by RAM usage, and then find that merging according
        to doc count gives substantially better performance than merging
        according to byte size of the segments?

        I'll run a test with contrib/benchmark on wikipedia content. I'll set
        mergeFactor to 3 and ramBufferMB = 42 and I'll optimize in the end.
        Is there anything else interesting in how you're using Lucene?

        Do you have any sense of where this sizable slowdown shows up? EG, is
        the optimize in the end substantially slower, or something?

        Is there any way to tease out the time actually spent in Lucene vs the
        rest of your application?

        Mark Miller added a comment - edited

        Okay, I ran some tests loading about 4000 docs:

        autocommit=false, non-compound format, mergefactor=3, flushbyram=42, build: latest from trunk (yesterday)

        With the new merge policy I load about 30 docs per second:

        time for load: 142828ms
        time for optimize: 2422ms

        With LogDocMergePolicy() I load about 50 docs per second:

        time for load: 86781ms
        optimize: 4891ms

        So it looks like optimize is quicker, but I pay for it during the load?

        I am not doing anything else special with Lucene that I can think of, and I got duplicate results for a much larger load.

        Not too easy to pull out the non Lucene parts without just writing a test from scratch for Lucene.

        • Mark
        Michael McCandless added a comment -

        Hmmm ... it seems like your index is fairly small because optimize
        runs pretty quickly in both cases. But that would mean (I think)
        you're not actually flushing very many segments since you have a high
        RAM buffer size (42 MB). So then I'm baffled why merge policy would
        be changing your numbers so much because your 4000 doc test should not
        (I think?) actually be doing that much merging.

        Are you creating the index from scratch in each test? How large is
        the resulting index? Are you using FSDirectory?

        I ran my own test on Wikipedia content. I ran this alg:

        analyzer=org.apache.lucene.analysis.SimpleAnalyzer
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
        directory=FSDirectory
        docs.file=/lucene/wikifull.txt

        ram.flush.mb=42
        max.field.length=2147483647
        merge.factor=3
        compound=false
        autocommit=false

        doc.maker.forever=false
        doc.add.log.step=5000

        ResetSystemErase
        CreateIndex
        {AddDoc >: *
        Optimize
        CloseIndex

        RepSumByName

        to index all of wikipedia with the same params you're using (flush @
        42 MB, compound false, merge factor 3).

        LogByteSizeMergePolicy (the current default) gives this output (times
        are best of 2 runs):

        indexing 1198 sec
        optimize 282 sec

        LogDocMergePolicy took this long

        indexing 1216 sec
        optimize 270 sec

        I think those numbers are "within the noise" of each other, ie pretty
        much the same. This is what I would expect. So we need to figure out
        why I'm seeing different results than you.

        Can you call writer.setInfoStream(System.out) and attach the resulting
        output from each of your 4000 doc runs? Thanks!
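
        (For anyone following along, that's a one-line change before indexing; the
        file name below is only an example:)

          // print IndexWriter's flush/merge diagnostics to stdout ...
          writer.setInfoStream(System.out);

          // ... or capture them per writer, e.g. for attaching to the issue
          writer.setInfoStream(new java.io.PrintStream(
              new java.io.FileOutputStream("writer1-infostream.txt")));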

        Mark Miller added a comment -

        Sorry about the small test... I just started using 4000 because I was getting the same results with 20000.

        A sample run with 20000:

        old merge:
        load: 1320453 ms
        optimize: 8891 ms

        new merge:
        load: 393625 ms
        optimize: 17937 ms

        It's a fresh index each time, with docs of about 5-10k. I am using StandardAnalyzer.

        And I forgot to mention a big quirk: I am writing to two indexes, but only analyzing to a token stream once (CachingTokenFilter) to have a stemmed and an unstemmed index. So obviously a slowdown in writing an index would be a bit exaggerated.

        Still, it's a major difference for my app.

        I will get you some debug output from the writers.

        Yonik Seeley added a comment -

        So based on your 20000 run, the "new merge" completed 3 times as fast?

        Some of the differences are going to be luck as to when big segment merges are done.

        If scheme "A" just did a big merge right before the optimize, much of that is wasted effort (the entire index will be rewritten anyway). If scheme "B" was just about to do a big merge, but then optimize was called, it wins.

        For a particular test run, tweaking the parameters can result in huge differences, but they may be "false".

        The only way I can think of minimizing this effect is to do very large runs and cap the maximum size of segments to get rid of the possibility of random huge segment merges.

        Mark Miller added a comment - edited

        It was 3 times as fast, but to be fair, it's more often closer to 2 times as fast. I just gave the result of the latest run. After running the test many, many times, the new merge is much closer to 1/2 as fast as the old, and it has never been faster than that... rarely it's slower... 1/3 is the worst I saw, actually.

        Michael McCandless added a comment -

        Mark, are you working on the debug output? I'm hoping that gives a clue as to why you see such performance loss when merging by net byte size of each segment rather than by doc count... thanks.

        Michael McCandless added a comment -

        Reopening until we get to the bottom of the performance loss...

        Mark Miller added a comment -

        Sorry for the delay. Here is the debug output. As I said, I am actually writing to two indexes per doc, but I am also doing other processing and storing, so the slowdown must be significant anyway (well over 25%?). The first writer has an analyzer that stems and caches the unstemmed form, and the second writer reads from the cached unstemmed tokens.

        If the debug does not lead anywhere I can try to isolate the slowdown out of my app code (if it exists without writing with two writers, though the writing is sequential). Also, I will try with some other merge factors etc.

        I think that the performance gap grows with the number of documents.

        Attached: 4 files, one for each writer with the old policy and the new. Run details: 4000-some docs, 30 docs/s new, 50 docs/s old.

        Michael McCandless added a comment -

        Thanks Mark! OK, I noticed a few things from the logs:

        • It looks like you are actually flushing every 10 docs, not 42 MB.
        • You seem to have a mergeFactor of 10 through all the indexing
          except at some point near the end, before optimize is called, you
          switch to mergeFactor 3?

        That said, the LogByteSizeMergePolicy is definitely not acting right.
        OH, I see the problem!

        OK, the bug happens when autoCommit is false and your docs have stored
        fields / term vectors and you're using LogByteSizeMergePolicy. In
        this case I am incorrectly calculating the byte size of the segment:
        I'm counting the shared doc stores against all segments. This then
        causes merge policy to think all segments are about the same size
        (since the doc stores grow very large).

        I'll open a new issue & fix it. Thanks for testing, Mark!

        Mark Miller added a comment -

        Anytime Michael. Thanks for pointing out the mergefactor issue to me. I recently retrofitted my indexer with google guice, and it seems that something is not working as expected. Glad this little debug session worked out for all <g>

        Can't thank you enough for all of your Lucene patches. Keep em coming!

        Michael McCandless added a comment -

        Marking this as fixed again; I opened LUCENE-1009 for the slowdown in merging.

        Michael McCandless added a comment -

        > Anytime Michael. Thanks for pointing out the mergefactor issue to
        > me. I recently retrofitted my indexer with google guice, and it
        > seems that something is not working as expected. Glad this little
        > debug session worked out for all <g>

        Sure! Make sure you also fix your flushing to actually flush at a 42 MB
        RAM buffer (things should go MUCH faster with that).

        > Can't thank you enough for all of your Lucene patches. Keep em
        > coming!

        You're welcome! I enjoy it

        Yonik Seeley added a comment -

        While trying Solr with the latest Lucene, I ran into this back-incompatibility:
        Caused by: java.lang.IllegalArgumentException: this method can only be called when the merge policy is LogDocMergePolicy
        at org.apache.lucene.index.IndexWriter.getLogDocMergePolicy(IndexWriter.java:316)
        at org.apache.lucene.index.IndexWriter.setMaxMergeDocs(IndexWriter.java:768)

        It's not an issue at all for Solr - we'll fix things up when we officially upgrade Lucene versions, but it does seem like it might affect a number of apps that try and just drop in a new lucene jar. Thoughts?

        Mark Miller added a comment -

        It was my impression that this Lucene release would be unusual in that you shouldn't just drop the jar without first making sure you are in compliance with the new changes? Since some apps are going to break no matter what (few they may be) perhaps you just make a big fuss about possible incompatible changes?

        Michael McCandless added a comment -

        > While trying Solr with the latest Lucene, I ran into this back-incompatibility:
        > Caused by: java.lang.IllegalArgumentException: this method can only be called when the merge policy is LogDocMergePolicy
        > at org.apache.lucene.index.IndexWriter.getLogDocMergePolicy(IndexWriter.java:316)
        > at org.apache.lucene.index.IndexWriter.setMaxMergeDocs(IndexWriter.java:768)
        >
        > It's not an issue at all for Solr - we'll fix things up when we
        > officially upgrade Lucene versions, but it does seem like it might
        > affect a number of apps that try and just drop in a new lucene
        > jar. Thoughts?

        Hmm, good catch.

        This should only happen when "setMaxMergeDocs" is called (this is the
        only method that requires a LogDocMergePolicy). I think we have
        various options:

        1. Leave things as is and put up-front comment in the release saying
        you could either switch to LogDocMergePolicy, or, use
        "setMaxMergeMB" on the default LogByteSizeMergePolicy, instead.
        Also put details in the javadocs for this method explaining these
        options.

        2. Switch back to LogDocMergePolicy by default "out of the box".

        3. If setMaxMergeDocs() is called, switch back to LogDocMergePolicy
        "on-demand".

        4. Modify LogByteSizeMergePolicy to in fact accept both
        "maxMergeDocs" or "maxMergeMB", allowing either one or both just
        like "flush by RAM" and/or "flush by doc count" is being done in
        LUCENE-1007.

        I think I like option 4 the best. 3 seems too magical (it violates the
        "principle of least surprise"). 2 I think is bad because it's best to
        match the merge policy with how we are flushing (by RAM by default).
        1 is clearly disruptive to people who want to drop the Lucene JAR in
        and test.
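
        To make option 4 concrete, a caller-side sketch of what it might allow
        (setMaxMergeDocs on the byte-size policy is hypothetical here, since adding
        it is exactly what this option proposes; setMaxMergeMB already exists, and
        the values are placeholders):

          LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
          mp.setMaxMergeMB(512.0);       // existing: cap the byte size of segments eligible for merging
          mp.setMaxMergeDocs(1000000);   // proposed by option 4: cap by doc count as well
          writer.setMergePolicy(mp);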

        I'll open a new issue. Thanks Yonik!

        Michael McCandless added a comment -

        > It was my impression that this Lucene release would be unusual in
        > that you shouldn't just drop the jar without first making sure you
        > are in compliance with the new changes? Since some apps are going to
        > break no matter what (few they may be) perhaps you just make a big
        > fuss about possible incompatible changes?

        I think this release should in fact "drop in" for most apps. The
        only known case where there is non-backwards compatibility (besides
        this setMaxMergeDocs issue) is users of ParallelReader, I think? I
        think Lucene 3.0 is when we are "allowed" to remove deprecated APIs,
        switch to Java 1.5, etc.

        Hoss Man added a comment -

        Lucene's "commitment" to backward compatibility requires that 2.X be API compatible with all previous 2.Y releases...

        http://wiki.apache.org/jakarta-lucene/BackwardsCompatibility

        ...the performance characteristics and file formats (and merge behavior) don't need to be exactly the same (so switching the merge policy should be fine) as long as a note is made about this in the "runtime behavior" section of CHANGES.txt (we already have one) ... but any code that could compile and run against 2.2 should still run against 2.3 ... it might be really slow without some minor tweaks, and it might violate the "principle of least surprise", but it's important that it still run so that people don't have to fear a minor version upgrade.

        If we can't provide that, then the release should be called 3.0, but I haven't seen anything that leads me to think it's not possible (it just might not be pretty).

        Mark Miller added a comment -

        Quick question due to a new issue I am seeing in my application...could the concurrent merge possibly break apps that add a doc and then expect to be able to find it immediately after? I suspect this is not the case, just wondering based on some odd new behavior I am seeing. For example, if you call adddoc and it triggers a background merge, the doc is still immediately visible to the next call from the same thread right? No possibility for a race?

        Yonik Seeley added a comment -

        > For example, if you call adddoc and it triggers a background merge, the doc is still immediately visible
        > to the next call from the same thread right?

        "visible" by opening a new reader you mean?
        I don't think so... this was never a guarantee (although it might have been normally true in the past), and concurrent merge breaks this.
        So does autocommit=false.

        Mark Miller added a comment -

        Well that would explain a lot then. I use both the concurrent merge and autocommit=false, where in the past I did not. Before that I certainly did seem to be able to count on it, as long as it was sequential in the same thread. I never wanted to count on it, but when other developers start using your libraries in ways you never intended or okay'd...so much for ever warming searchers.

        Isn't this a problem for certain unit tests? After adding a bunch of docs and then searching to see if they are there must you pause for a bit to make sure enough ms have passed?

        Yonik Seeley added a comment -

        Yes, I guess one could consider it a minor breakage.
        maxBufferedDocs previously ensured that changes were flushed every "n" docs. They were flushed in a visible way, and I don't recall anything saying that one shouldn't open a reader before the writer was closed.

        Mark Miller added a comment -

        Sorry Yonik... I was not being explicit enough: I am closing the Writer before opening the Reader, which is why I assumed I could count on this behavior. Semi-randomly it is failing in my app now, though. I am not positive it is due to Lucene; I just thought that maybe the concurrent merge was somehow not adding the document before triggering the merge in a background thread? Perhaps you don't see the doc till the background threads are done merging? Just looking for someone to tell me: no, even with concurrent merge, as long as you close the writer and then open a new reader, you are guaranteed to find the doc just added (if all from the same thread). I really do assume this is the case; I just have not changed anything else other than updating Lucene, so I am grasping at some straws...

        Michael McCandless added a comment -

        When you close the writer & open a reader, the reader should see all
        docs added, regardless of whether the writer was using concurrent
        merge scheduler or serial merge scheduler, autoCommit true or false,
        flushing by ram or by doc count, etc.

        IndexWriter.close() first flushes any buffered docs to a new segment,
        and then allows merges to run if they are necessary. It will also
        wait for all merges to finish (in the case of concurrent merge
        scheduler) unless you call close(false) which will abort all running
        merges.
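
        In code form (a sketch, assuming the close semantics described above; doc,
        dir, and writer come from the surrounding code):

          writer.addDocument(doc);   // may kick off a background merge under ConcurrentMergeScheduler

          writer.close();            // flushes buffered docs, then waits for any running merges
          // writer.close(false);    // alternative: abort running merges instead of waiting

          IndexReader reader = IndexReader.open(dir);  // sees every doc added before close()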

        If you're not seeing this behavior then something is seriously wrong!
        Can you post some more details about how you see this intermittent
        failure?

        Doug Cutting added a comment -

        > After adding a bunch of docs and then searching to see if they are there must you pause for a bit to make sure enough ms have passed?

        No. You could previously never rely on a newly added document being visible to search until you called IndexWriter#close(). Added documents have always been buffered and all buffers were only flushed by IndexWriter#close(). It used to be the case that the buffer was memory-only and held a fixed number of documents. So the last up to MaxBufferedDocs added would not yet be visible.

        Now there is an IndexWriter#flush() method that flushes buffers without closing the IndexWriter. And with the "autocommit=false" feature, nothing is visible to searchers until either #close() or #flush() is called. The primary change of concurrent merging is that calls to addDocument() generally return faster, with merging work done in the background, but concurrent merging and "autocommit=false" don't fundamentally change the need to call #close() or #flush() in order to guarantee that all changes are visible to searchers.

        At least that's my understanding...

        Mark Miller added a comment -

        Based on the responses I am going to assume the problem is not with Lucene or concurrent merge. I have to figure it out though, so if I can determine otherwise, you'll be the first to know. Gotta assume it's me first, though.

        Michael McCandless added a comment -

        > Based on the responses I am going to assume the problem is not with Lucene or concurrent merge. I have to figure it out though, so if I can determine otherwise, you'll be the first to know. Gotta assume it's me first, though.

        I sure hope you're right! Keep us posted...

        > No. You could previously never rely on a newly added document being visible to search until you called IndexWriter#close(). Added documents have always been buffered and all buffers were only flushed by IndexWriter#close(). It used to be the case that the buffer was memory-only and held a fixed number of documents. So the last up to MaxBufferedDocs added would not yet be visible.

        > Now there is an IndexWriter#flush() method that flushes buffers without closing the IndexWriter. And with the "autocommit=false" feature, nothing is visible to searchers until either #close() or #flush() is called. The primary change of concurrent merging is that calls to addDocument() generally return faster, with merging work done in the background, but concurrent merging and "autocommit=false" don't fundamentally change the need to call #close() or #flush() in order to guarantee that all changes are visible to searchers.

        > At least that's my understanding...

        This is my understanding too, except calling flush() with
        autoCommit=false does not actually make the changes visible to readers
        (though it does flush buffered added/deleted docs to the Directory).
        Only close() will make all changes visible to readers when
        autoCommit=false.
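
        A short sketch of that distinction, assuming autoCommit=false and the
        2.3-era constructor and flush() method (dir, analyzer, and doc are
        placeholders from the surrounding code):

          IndexWriter writer = new IndexWriter(dir, false, analyzer);  // autoCommit=false

          writer.addDocument(doc);
          writer.flush();   // buffered docs are written to the Directory, but new readers still see the old index

          writer.close();   // only now does a newly opened reader see the added doc
          IndexReader reader = IndexReader.open(dir);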


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 0
