Lucene - Core
  1. Lucene - Core
  2. LUCENE-1703

Add a waitForMerges() method to IndexWriter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It would be very useful to have a waitForMerges() method on the IndexWriter.

      Right now, the only way i can see to achieve this is to call IndexWriter.close()

      ideally, there would be a method on the IndexWriter to wait for merges without actually closing the index.
      This would make it so that background merges (or optimize) can be waited for without closing the IndexWriter, and then reopening a new IndexWriter

      the close() reopen IndexWriter method can be problematic if the close() fails as the write lock won't be released
      this could then result in the following sequence:

      • close() - fails
      • force unlock the write lock (per close() documentation)
      • new IndexWriter() (acquires write lock)
      • finalize() on old IndexWriter releases the write lock
      • Index is now not locked, and another IndexWriter pointing to the same directory could be opened

      If you don't force unlock the write lock, opening a new IndexWriter will fail until garbage collection calls finalize() the old IndexWriter

      If the waitForMerges() method is available, i would likely never need to close() the IndexWriter until right before the process being shutdown, so this issue would not occur (worst case scenario, the waitForMerges() fails)

      1. IndexWriter.java.diff
        2 kB
        Tim Smith
      2. IndexWriter.java.diff
        2 kB
        Tim Smith

        Activity

        Hide
        Michael McCandless added a comment -

        You can use ConcurrentMergeScheduler.sync()?

        Show
        Michael McCandless added a comment - You can use ConcurrentMergeScheduler.sync()?
        Hide
        Tim Smith added a comment -

        thought maybe that method would do it, however that method is undocumented (no javadoc)

        ideally, the IndexWriter would have these semantics in order to handle any configured MergeScheduler that may perform background syncs

        Show
        Tim Smith added a comment - thought maybe that method would do it, however that method is undocumented (no javadoc) ideally, the IndexWriter would have these semantics in order to handle any configured MergeScheduler that may perform background syncs
        Hide
        Michael McCandless added a comment -

        ideally, the IndexWriter would have these semantics in order to handle any configured MergeScheduler that may perform background syncs

        I agree. Wanna cons up a patch?

        Show
        Michael McCandless added a comment - ideally, the IndexWriter would have these semantics in order to handle any configured MergeScheduler that may perform background syncs I agree. Wanna cons up a patch?
        Hide
        Tim Smith added a comment -

        I'm not super familiar with internals of IndexWriter, but i'll give it a shot

        Show
        Tim Smith added a comment - I'm not super familiar with internals of IndexWriter, but i'll give it a shot
        Hide
        Tim Smith added a comment -

        Here's a diff for IndexWriter.java

        moved code from else block in finishMerges() into public waitForMerges() method

        this should wait for any outstanding merges to be complete prior to exiting

        will see if i can also work up a test case

        Show
        Tim Smith added a comment - Here's a diff for IndexWriter.java moved code from else block in finishMerges() into public waitForMerges() method this should wait for any outstanding merges to be complete prior to exiting will see if i can also work up a test case
        Hide
        Tim Smith added a comment -

        I'm finding it a bit tricky to create a proper unit test for this

        This is indirectly tested as IndexWriter.close() will call this method

        but explicit calling of this method is harder to test as it requires outstanding merges to be running (very difficult to trigger)

        I started with TestIndexWriter.testNoWaitClose() as a base
        however, modifying this to test waitForMerges() takes 11 seconds to run (surely because it always waits, and doesn't abort the merges)

        also, in order to properly test, another method should be added to IndexWriter

        public synchronized int getOutstandingMerges() {
          return mergingSegments.size();
        }
        

        I suppose the most proper test would create a subclass of ConcurrentMergeScheduler that would block all merges until after i released a lock. Then the test would add docs, fork off thread to unlock for merging in X seconds, then call waitForMerges.

        Show
        Tim Smith added a comment - I'm finding it a bit tricky to create a proper unit test for this This is indirectly tested as IndexWriter.close() will call this method but explicit calling of this method is harder to test as it requires outstanding merges to be running (very difficult to trigger) I started with TestIndexWriter.testNoWaitClose() as a base however, modifying this to test waitForMerges() takes 11 seconds to run (surely because it always waits, and doesn't abort the merges) also, in order to properly test, another method should be added to IndexWriter public synchronized int getOutstandingMerges() { return mergingSegments.size(); } I suppose the most proper test would create a subclass of ConcurrentMergeScheduler that would block all merges until after i released a lock. Then the test would add docs, fork off thread to unlock for merging in X seconds, then call waitForMerges.
        Hide
        Michael McCandless added a comment -

        Patch looks good! I don't think you actually need a separate unit test, since (as you pointed out), this method is well tested internally?

        Show
        Michael McCandless added a comment - Patch looks good! I don't think you actually need a separate unit test, since (as you pointed out), this method is well tested internally?
        Hide
        Shai Erera added a comment -

        May I ask what's the use case for this? I looked at IndexWriter and the following have doWait versions: close(), optimize() and expungeDeletes(). The only one that doesn't is commit(), which documents it will not wait for merges. So this seems to be a method just for

        try { 
          writer.commit(); 
        } finally { 
          writer.waitForMerges(); 
        }
        

        Is there any other use case that you would want to wait for merges to finish?

        It's just that adding methods means we need to support them going forward, and if there's no real use case (or at least none that is covered by current API), we should not do it?

        Show
        Shai Erera added a comment - May I ask what's the use case for this? I looked at IndexWriter and the following have doWait versions: close(), optimize() and expungeDeletes(). The only one that doesn't is commit(), which documents it will not wait for merges. So this seems to be a method just for try { writer.commit(); } finally { writer.waitForMerges(); } Is there any other use case that you would want to wait for merges to finish? It's just that adding methods means we need to support them going forward, and if there's no real use case (or at least none that is covered by current API), we should not do it?
        Hide
        Tim Smith added a comment -

        My primary use case for this is to stabilize an index prior to performing a backup operation

        If background merges are ongoing, then partial segment files will be written to disk (potentially while an rsync or other operation is running)

        this will result in partial segments being backed up

        while this should still be a viable index, as the partial segments have not been "committed", it results in an unclean backup (this can result in rsync throwing warnings/errors (which operations people greatly dislike)

        Show
        Tim Smith added a comment - My primary use case for this is to stabilize an index prior to performing a backup operation If background merges are ongoing, then partial segment files will be written to disk (potentially while an rsync or other operation is running) this will result in partial segments being backed up while this should still be a viable index, as the partial segments have not been "committed", it results in an unclean backup (this can result in rsync throwing warnings/errors (which operations people greatly dislike)
        Hide
        Tim Smith added a comment -

        Very minor change
        moved assert on mergingSegments.size() into waitForMerges()

        Show
        Tim Smith added a comment - Very minor change moved assert on mergingSegments.size() into waitForMerges()
        Hide
        Shai Erera added a comment -

        I'm just playing the devil's advocate here.

        I'm still not sure that this method is needed. A MergeScheduler is given to IndexWriter, and therefore your code should be able to sync() on it, whether it's ConcurrentMergeScheduler or a different one which does background merges, since you control its creation. If it's something that needs to improve on CMS, let's improve it (like documenting sync()).

        My primary use case for this is to stabilize an index prior to performing a backup operation

        Still .. if you call any of the methods I've mentioned, you can change your code to call the doWait variants, except for commit(). If commit() is what you're after, wouldn't it be better to add a commit(doWait) so that you can call it w/ doWait=true?

        BTW, Mike McCandless has written a great article on Hot Backups w/ Lucene, which you can find here: http://www.manning.com/free/green_HotBackupsLucene.html (need to register to Manning first). If you're not familiar w/ it, I suggest you read it since it shows a nice and elegant ways to perform backups, w/o the need to sync your writer.

        I'm planning to try it out. One thing you can do, after calling snapshot() is to compare the files returned from snapshot() to the ones in the backup and: (1) delete the ones in backup that do not appear in the returned list of files, and copy over the files that do not exist in the backup (except for .lock). The article is really good and self explanatory.

        Show
        Shai Erera added a comment - I'm just playing the devil's advocate here. I'm still not sure that this method is needed. A MergeScheduler is given to IndexWriter, and therefore your code should be able to sync() on it, whether it's ConcurrentMergeScheduler or a different one which does background merges, since you control its creation. If it's something that needs to improve on CMS, let's improve it (like documenting sync()). My primary use case for this is to stabilize an index prior to performing a backup operation Still .. if you call any of the methods I've mentioned, you can change your code to call the doWait variants, except for commit(). If commit() is what you're after, wouldn't it be better to add a commit(doWait) so that you can call it w/ doWait=true? BTW, Mike McCandless has written a great article on Hot Backups w/ Lucene, which you can find here: http://www.manning.com/free/green_HotBackupsLucene.html (need to register to Manning first). If you're not familiar w/ it, I suggest you read it since it shows a nice and elegant ways to perform backups, w/o the need to sync your writer. I'm planning to try it out. One thing you can do, after calling snapshot() is to compare the files returned from snapshot() to the ones in the backup and: (1) delete the ones in backup that do not appear in the returned list of files, and copy over the files that do not exist in the backup (except for .lock). The article is really good and self explanatory.
        Hide
        Tim Smith added a comment -

        MergeScheduler does not provide a sync() method in its abstract class. Therefore, an implementation could be used that would not derive from or provide a sync() method

        If it did, this would be a more than acceptable solution instead of putting the method on the IndexWriter. (i potentially want to allow configuration to specify what MergeScheduler to use, and if sync() isn't specified on the base class, i can't reliably use it (will have to do instanceof checks, and a new concurrent based scheduler not derived from ConcurrentMergeScheduler breaks this)

        Also, the SerialMergeScheduler may be in use and be running due to another thread calling addDocument triggering a merge, but i want my second thread to wait for that merge to complete (this would require implementing sync() on SerialMergeScheduler to allow other threads to sync on the merge as well)

        optimize() is not a viable option to use to wait for merges as this could take a long time (hours)
        close() requires reopening the IndexWriter (which i am trying to avoid in the first place)
        expungeDeletes(true) is doing more work that is actually required (this could be a viable solution, however its a bit odd to "waitForMerges()" by calling a method called expungeDeletes())

        I will try getting access to the article you suggest, however i still expect i will want this method at the end of the day (and will end up using expungeDeletes(true) if it never gets added (with copious comments indicating that expungeDeletes() is actually being used to wait for background merges)

        Another use for waitForMerges() is as a more lightweight version of optimize().
        this will allow blocking adding more documents until background merges are complete periodically, preventing from throttling the IO on the indexer from being hit too hard. Could call IndexWriter.maybeMerge() followed by IndexWriter.waitForMerges()

        Show
        Tim Smith added a comment - MergeScheduler does not provide a sync() method in its abstract class. Therefore, an implementation could be used that would not derive from or provide a sync() method If it did, this would be a more than acceptable solution instead of putting the method on the IndexWriter. (i potentially want to allow configuration to specify what MergeScheduler to use, and if sync() isn't specified on the base class, i can't reliably use it (will have to do instanceof checks, and a new concurrent based scheduler not derived from ConcurrentMergeScheduler breaks this) Also, the SerialMergeScheduler may be in use and be running due to another thread calling addDocument triggering a merge, but i want my second thread to wait for that merge to complete (this would require implementing sync() on SerialMergeScheduler to allow other threads to sync on the merge as well) optimize() is not a viable option to use to wait for merges as this could take a long time (hours) close() requires reopening the IndexWriter (which i am trying to avoid in the first place) expungeDeletes(true) is doing more work that is actually required (this could be a viable solution, however its a bit odd to "waitForMerges()" by calling a method called expungeDeletes()) I will try getting access to the article you suggest, however i still expect i will want this method at the end of the day (and will end up using expungeDeletes(true) if it never gets added (with copious comments indicating that expungeDeletes() is actually being used to wait for background merges) Another use for waitForMerges() is as a more lightweight version of optimize(). this will allow blocking adding more documents until background merges are complete periodically, preventing from throttling the IO on the indexer from being hit too hard. Could call IndexWriter.maybeMerge() followed by IndexWriter.waitForMerges()
        Hide
        Shai Erera added a comment -

        MergeScheduler does not provide a sync() method in its abstract class. Therefore, an implementation could be used that would not derive from or provide a sync() method

        If it did, this would be a more than acceptable solution instead of putting the method on the IndexWriter. (i potentially want to allow configuration to specify what MergeScheduler to use, and if sync() isn't specified on the base class, i can't reliably use it (will have to do instanceof checks, and a new concurrent based scheduler not derived from ConcurrentMergeScheduler breaks this)

        We can add a sync() method to MergeScheduler or you do it on your side. There are only two MS today: CMS and SMS. If you're going to write your own CMS variant, then you can have it extend CMS, or define your own interface with a sync() method, which will delegate that call to whatever MS it wraps. Again, the way I see it, you know which MS you instantiate, and therefore you should be able to declare its type and use whatever methods it has. If you want something configurable, I suggest we move sync() to MS (w/ a default impl of doing nothing, for back-compat).

        What do you think?

        ... another thread calling addDocument triggering a merge

        This will only happen if you use autoCommit=true, right? You can use autoCommit=true, and when you call commit(), use the one that waits for merges (the one that does not exist yet ), and sync all your addDoc threads with that current commit thread (make sure you use a ReadWriteLock, or otherwise all your addDocs are going to sync on each other).

        this will allow blocking adding more documents until background merges are complete periodically

        What's "periodically"? If you're using autoCommit=false, those merges cannot happen suddenly. There are only 4 methods that can trigger them (I think?) and you can sync on calling them w/ the rest of the addDoc threads.

        Don't get me wrong - I just play the devil's advocate here. Recently, we've had this thread http://www.nabble.com/ReadOnly-IndexReaders-td23787255.html - a naive question on adding an ever more naive method to IndexReader - isReadOnly. If you read it, you'll see that it wasn't accepted at the end, because we've understood it will just try to cover a design flaw of your system.

        Therefore I'm just trying to verify whether this method is absolutely necessary, and there aren't any "design decisions" we should make in our apps to better handle that case.

        Having said that, I'm not a committer and if one decides it's important enough, one would take this issue and commit it. I just enjoy the discussion for now (hope you too ).

        Show
        Shai Erera added a comment - MergeScheduler does not provide a sync() method in its abstract class. Therefore, an implementation could be used that would not derive from or provide a sync() method If it did, this would be a more than acceptable solution instead of putting the method on the IndexWriter. (i potentially want to allow configuration to specify what MergeScheduler to use, and if sync() isn't specified on the base class, i can't reliably use it (will have to do instanceof checks, and a new concurrent based scheduler not derived from ConcurrentMergeScheduler breaks this) We can add a sync() method to MergeScheduler or you do it on your side. There are only two MS today: CMS and SMS. If you're going to write your own CMS variant, then you can have it extend CMS, or define your own interface with a sync() method, which will delegate that call to whatever MS it wraps. Again, the way I see it, you know which MS you instantiate, and therefore you should be able to declare its type and use whatever methods it has. If you want something configurable, I suggest we move sync() to MS (w/ a default impl of doing nothing, for back-compat). What do you think? ... another thread calling addDocument triggering a merge This will only happen if you use autoCommit=true, right? You can use autoCommit=true, and when you call commit(), use the one that waits for merges (the one that does not exist yet ), and sync all your addDoc threads with that current commit thread (make sure you use a ReadWriteLock, or otherwise all your addDocs are going to sync on each other). this will allow blocking adding more documents until background merges are complete periodically What's "periodically"? If you're using autoCommit=false, those merges cannot happen suddenly. There are only 4 methods that can trigger them (I think?) and you can sync on calling them w/ the rest of the addDoc threads. Don't get me wrong - I just play the devil's advocate here. Recently, we've had this thread http://www.nabble.com/ReadOnly-IndexReaders-td23787255.html - a naive question on adding an ever more naive method to IndexReader - isReadOnly. If you read it, you'll see that it wasn't accepted at the end, because we've understood it will just try to cover a design flaw of your system. Therefore I'm just trying to verify whether this method is absolutely necessary, and there aren't any "design decisions" we should make in our apps to better handle that case. Having said that, I'm not a committer and if one decides it's important enough, one would take this issue and commit it. I just enjoy the discussion for now (hope you too ).
        Hide
        Tim Smith added a comment -

        NOTE: I'm always using autoCommit=false (autoCommit=true is deprecated anyway)

        however, i could potentially have 2 threads feeding the index (in my custom code)
        one thread may call addDocument() (or maybeMerge() to be more to the point)
        this thread could result in the SerialMergeScheduler to start merging (addDocument() won't return until this merge completes)
        I then want thread 2 to call waitForMerges(), at which point it will wait till the first thread will have finished its merges (at which point addDocument will have returned)

        Obviously this is a contrived example as i personally will be locking the updates such that no addDocument() call could be in process when i want to call waitForMerges(), however this situation points out that even the SerialMergeScheduler should have an actual implementation for a sync() method, which would block until the thread actually doing the merge has completed. (as i may be calling sync() from a different thread other than the one the IndexWriter called merge() on) SerialMergeScheduler should therefore have a lock that will be held while merging, and a sync() method should be added that will just acquire and release the lock. Making both the sync() and merge() methods on the SerialMergeScheduler would achieve this (and the sync would just be a synchronized noop)

        It seems more natural to me to put this "sync" on the IndexWriter itself, especially as this will be completely agnostic to the merge scheduler used.

        for the "periodic" waiting for merge thread completion, this would be driven by messages from client code to request a "soft optimize" perhaps, which would just wait for background merges to complete. This could then result in turning over a new IndexReader for more efficient searches than using the old IndexReader (which may be more segmented). This message asking for a "soft optimize" may be sent on some scheduled basis in order to achieve better search performance (without the cost of an explicit optimize)

        Discussion is all well and good, and i definitely appreciate all comments.
        Even if this doesn't end up going in, you've pointed out another solution (using expungeDeletes()) which will achieve the same solution for me at least.

        Show
        Tim Smith added a comment - NOTE: I'm always using autoCommit=false (autoCommit=true is deprecated anyway) however, i could potentially have 2 threads feeding the index (in my custom code) one thread may call addDocument() (or maybeMerge() to be more to the point) this thread could result in the SerialMergeScheduler to start merging (addDocument() won't return until this merge completes) I then want thread 2 to call waitForMerges(), at which point it will wait till the first thread will have finished its merges (at which point addDocument will have returned) Obviously this is a contrived example as i personally will be locking the updates such that no addDocument() call could be in process when i want to call waitForMerges(), however this situation points out that even the SerialMergeScheduler should have an actual implementation for a sync() method, which would block until the thread actually doing the merge has completed. (as i may be calling sync() from a different thread other than the one the IndexWriter called merge() on) SerialMergeScheduler should therefore have a lock that will be held while merging, and a sync() method should be added that will just acquire and release the lock. Making both the sync() and merge() methods on the SerialMergeScheduler would achieve this (and the sync would just be a synchronized noop) It seems more natural to me to put this "sync" on the IndexWriter itself, especially as this will be completely agnostic to the merge scheduler used. for the "periodic" waiting for merge thread completion, this would be driven by messages from client code to request a "soft optimize" perhaps, which would just wait for background merges to complete. This could then result in turning over a new IndexReader for more efficient searches than using the old IndexReader (which may be more segmented). This message asking for a "soft optimize" may be sent on some scheduled basis in order to achieve better search performance (without the cost of an explicit optimize) Discussion is all well and good, and i definitely appreciate all comments. Even if this doesn't end up going in, you've pointed out another solution (using expungeDeletes()) which will achieve the same solution for me at least.
        Hide
        Michael McCandless added a comment -

        There are also "merges" that take place outside of the merge scheduler, eg addIndexesNoOptimize copies over the tail segments by running its own merges. Also, with LUCENE-1313, there may be merges that run entirely in RAM, not under MergeScheduler.

        I think it makes sense to expose waitForMerges in IW (instead of duplicating the code in every merge scheduler). We may be able to then deprecate CMS.sync?

        Show
        Michael McCandless added a comment - There are also "merges" that take place outside of the merge scheduler, eg addIndexesNoOptimize copies over the tail segments by running its own merges. Also, with LUCENE-1313 , there may be merges that run entirely in RAM, not under MergeScheduler. I think it makes sense to expose waitForMerges in IW (instead of duplicating the code in every merge scheduler). We may be able to then deprecate CMS.sync?
        Hide
        Shai Erera added a comment -

        one thread may call addDocument() (or maybeMerge() to be more to the point)
        this thread could result in the SerialMergeScheduler to start merging (addDocument() won't return until this merge completes)

        Again, if autoCommit=false, how can this happen? I thought that if autoCommit=false, no commit happens and therefore no segment merging?

        I think it makes sense to expose waitForMerges in IW (instead of duplicating the code in every merge scheduler). We may be able to then deprecate CMS.sync?

        I guess I'm still not convinced what simplicity would that bring. For Tim's use case, just two threads, using SMS, that might work. But for the general use case, or one which uses multiple indexing threads, one of which may call commit() at some point, another daemon may run optimize(), I dunno, this would still require syncing all threads around that waitForMerges call, if the intent is to prevent document additions while merges occur. Therefore this method is not expected to make my life any easier, except that if I need to report "no merges are running" or make a decision based on "no merges are running" I should have one thread call this waitForMerges()

        Of course, still playing the devil's advocate, you could call waitForMerges() which will return immediately b/c there are no merges to do or that are running, and soon as that happens, a context switch also happens, and an innocent addDocument will trigger a 50-segments merge, at which point whatever you thought to do b/c there are no merges, will hit the exact scenario you were trying to avoid all that time .

        I'll admit though that it's late here (1 AM), and perhaps I'm not seeing this clearly. And .. I still don't understand how if autocommit=false, and addDocument/deleteDocument can trigger a merge.

        Show
        Shai Erera added a comment - one thread may call addDocument() (or maybeMerge() to be more to the point) this thread could result in the SerialMergeScheduler to start merging (addDocument() won't return until this merge completes) Again, if autoCommit=false, how can this happen? I thought that if autoCommit=false, no commit happens and therefore no segment merging? I think it makes sense to expose waitForMerges in IW (instead of duplicating the code in every merge scheduler). We may be able to then deprecate CMS.sync? I guess I'm still not convinced what simplicity would that bring. For Tim's use case, just two threads, using SMS, that might work. But for the general use case, or one which uses multiple indexing threads, one of which may call commit() at some point, another daemon may run optimize(), I dunno, this would still require syncing all threads around that waitForMerges call, if the intent is to prevent document additions while merges occur. Therefore this method is not expected to make my life any easier, except that if I need to report "no merges are running" or make a decision based on "no merges are running" I should have one thread call this waitForMerges() Of course, still playing the devil's advocate, you could call waitForMerges() which will return immediately b/c there are no merges to do or that are running, and soon as that happens, a context switch also happens, and an innocent addDocument will trigger a 50-segments merge, at which point whatever you thought to do b/c there are no merges, will hit the exact scenario you were trying to avoid all that time . I'll admit though that it's late here (1 AM), and perhaps I'm not seeing this clearly. And .. I still don't understand how if autocommit=false, and addDocument/deleteDocument can trigger a merge.
        Hide
        Uwe Schindler added a comment -

        I still don't understand how if autocommit=false, and addDocument/deleteDocument can trigger a merge.

        Why not? I always use autocommit=false. After I added a number of documents to the index some of the existing not-yet committed segments get merged. It will never merge the current existing and committed segments, but the segments created during indexing can be merged. After commit() or close() the already existing and the new segments are maybe additionally merged and a new segments file written.

        Show
        Uwe Schindler added a comment - I still don't understand how if autocommit=false, and addDocument/deleteDocument can trigger a merge. Why not? I always use autocommit=false. After I added a number of documents to the index some of the existing not-yet committed segments get merged. It will never merge the current existing and committed segments, but the segments created during indexing can be merged. After commit() or close() the already existing and the new segments are maybe additionally merged and a new segments file written.
        Hide
        Jason Rutherglen added a comment -

        Seems like a useful feature.

        Show
        Jason Rutherglen added a comment - Seems like a useful feature.
        Hide
        Shai Erera added a comment -

        Right ... forgot the merges to the uncommitted segments. If we do this, can we also add a commit() version which waits for merges, to complete the set of operations that allow you to wait for merges (just for convenience, so that you don't call commit() followed by a waitForMerges()), or .. deprecate all the variants of the other methods that wait for merges, and say that the default behavior will not wait, and if you want this, call waitForMergest() afterwards?

        Show
        Shai Erera added a comment - Right ... forgot the merges to the uncommitted segments. If we do this, can we also add a commit() version which waits for merges, to complete the set of operations that allow you to wait for merges (just for convenience, so that you don't call commit() followed by a waitForMerges()), or .. deprecate all the variants of the other methods that wait for merges, and say that the default behavior will not wait, and if you want this, call waitForMergest() afterwards?
        Hide
        Michael McCandless added a comment -

        After I added a number of documents to the index some of the existing not-yet committed segments get merged. It will never merge the current existing and committed segments, but the segments created during indexing can be merged. After commit() or close() the already existing and the new segments are maybe additionally merged and a new segments file written.

        Actually, existing segments do get merged (assuming the merge policy selects them). It's just that they cannot be deleted because logically there are two snapshots while an IndexWriter is running with autoCommit false: the index as of when it was opened, and the current in-memory SegmentInfos. (And if you are using a deletion policy different from the default KeepOnlyLastCommit, even more snapshots will be kept).

        Show
        Michael McCandless added a comment - After I added a number of documents to the index some of the existing not-yet committed segments get merged. It will never merge the current existing and committed segments, but the segments created during indexing can be merged. After commit() or close() the already existing and the new segments are maybe additionally merged and a new segments file written. Actually, existing segments do get merged (assuming the merge policy selects them). It's just that they cannot be deleted because logically there are two snapshots while an IndexWriter is running with autoCommit false: the index as of when it was opened, and the current in-memory SegmentInfos. (And if you are using a deletion policy different from the default KeepOnlyLastCommit, even more snapshots will be kept).
        Hide
        Michael McCandless added a comment -

        or .. deprecate all the variants of the other methods that wait for merges, and say that the default behavior will not wait, and if you want this, call waitForMergest() afterwards?

        +1

        Though we have a migration challenge (if we want to swap the default). Or we add a matchVersion.

        Show
        Michael McCandless added a comment - or .. deprecate all the variants of the other methods that wait for merges, and say that the default behavior will not wait, and if you want this, call waitForMergest() afterwards? +1 Though we have a migration challenge (if we want to swap the default). Or we add a matchVersion.
        Hide
        Shai Erera added a comment -

        Though we have a migration challenge (if we want to swap the default). Or we add a matchVersion.

        If we change the default to not wait, then I don't think matchVersion will matter. Since I'll need to update my code anyway, I don't think I'd care whether I call waitForMerges(), or matchVersion, even though the latter is called only once and the change is required in just one place. matchVersion, IMO, is required in more complicated cases, like that ACRONYM issue, which my code really has no control over.

        Show
        Shai Erera added a comment - Though we have a migration challenge (if we want to swap the default). Or we add a matchVersion. If we change the default to not wait, then I don't think matchVersion will matter. Since I'll need to update my code anyway, I don't think I'd care whether I call waitForMerges(), or matchVersion, even though the latter is called only once and the change is required in just one place. matchVersion, IMO, is required in more complicated cases, like that ACRONYM issue, which my code really has no control over.
        Hide
        Michael McCandless added a comment -

        If we change the default to not wait, then I don't think matchVersion will matter. Since I'll need to update my code anyway, I don't think I'd care whether I call waitForMerges(), or matchVersion, even though the latter is called only once and the change is required in just one place. matchVersion, IMO, is required in more complicated cases, like that ACRONYM issue, which my code really has no control over.

        But we can't suddenly not wait, if you call optimize(), right? (That's too severe a change in runtime behavior) Ie, we'd have to rename it to something else, and deprecate the old one.

        Actually... I'm no longer so sure we should make this change to the default. I think the current default (optimize(), addIndexes*(), expungeDeletes(), etc., all wait until they're "done") is actually a good default?

        Show
        Michael McCandless added a comment - If we change the default to not wait, then I don't think matchVersion will matter. Since I'll need to update my code anyway, I don't think I'd care whether I call waitForMerges(), or matchVersion, even though the latter is called only once and the change is required in just one place. matchVersion, IMO, is required in more complicated cases, like that ACRONYM issue, which my code really has no control over. But we can't suddenly not wait, if you call optimize(), right? (That's too severe a change in runtime behavior) Ie, we'd have to rename it to something else, and deprecate the old one. Actually... I'm no longer so sure we should make this change to the default. I think the current default (optimize(), addIndexes*(), expungeDeletes(), etc., all wait until they're "done") is actually a good default?
        Hide
        Shai Erera added a comment -

        I'm not against the default, just thought that waitForMerges() will be even more useful if all of these didn't have the doWait variant, and it will reduce the number of methods on IW. We can leave them around though, I guess it doesn't hurt.

        Show
        Shai Erera added a comment - I'm not against the default, just thought that waitForMerges() will be even more useful if all of these didn't have the doWait variant, and it will reduce the number of methods on IW. We can leave them around though, I guess it doesn't hurt.
        Hide
        Tim Smith added a comment -

        I also noticed that at least on expungeDeletes() these methods may only wait for merges that the method itself initiated

        the waitForMerges() method would wait for all merges regardless of who initiated the merge.

        so:

        try {
          expungeDeletes(false);
        } finally {
          waitForMerges();
        }
        

        would not be the same as

        expungeDeletes(true);
        

        as the later would only wait for the merges actually caused by the expungeDeletes() method

        This would indicate that the waitForMerges=true/false variants of different methods should remain

        Show
        Tim Smith added a comment - I also noticed that at least on expungeDeletes() these methods may only wait for merges that the method itself initiated the waitForMerges() method would wait for all merges regardless of who initiated the merge. so: try { expungeDeletes( false ); } finally { waitForMerges(); } would not be the same as expungeDeletes( true ); as the later would only wait for the merges actually caused by the expungeDeletes() method This would indicate that the waitForMerges=true/false variants of different methods should remain
        Hide
        Michael McCandless added a comment -

        This would indicate that the waitForMerges=true/false variants of different methods should remain

        Ahh good point; and I think the default should be to wait. OK I plan to commit the current patch in a day or two. Thanks Tim!

        Show
        Michael McCandless added a comment - This would indicate that the waitForMerges=true/false variants of different methods should remain Ahh good point; and I think the default should be to wait. OK I plan to commit the current patch in a day or two. Thanks Tim!
        Hide
        Michael McCandless added a comment -

        Thanks Tim!

        Show
        Michael McCandless added a comment - Thanks Tim!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Tim Smith
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development