Lucene - Core
LUCENE-1705

Add deleteAllDocuments() method to IndexWriter

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter

      This method should have the same performance and characteristics as:

      • currentWriter.close()
      • currentWriter = new IndexWriter(..., create=true,...)

      This would greatly optimize a delete all documents case. Using deleteDocuments(new MatchAllDocsQuery()) could be expensive given a large existing index.

      IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far as index visibility goes (a newly opened IndexReader would see the empty index).

      I see this was previously asked for in LUCENE-932; however, it would be nice to finally see this added so that the IndexWriter would not need to be closed to perform the "clear", since keeping a single IndexWriter open is the general recommendation for working with an IndexWriter now.

      deleteAllDocuments() method should:

      • abort any background merges (they are pointless once a deleteAll has been received)
      • write new segments file referencing no segments

      This method would remove one of the final reasons I would ever need to close an IndexWriter and reopen a new one.
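The commit-visibility contract described above can be sketched with a tiny self-contained model (plain Java, no Lucene dependency; ToyWriter and its method names are hypothetical, not the proposed API): deleteAll() just drops the pending segment list, a constant amount of work regardless of index size, and a newly opened reader sees the empty index only once commit() publishes it.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of the requested semantics (not Lucene code; all names
// are hypothetical). deleteAll() drops the pending segment list, and a
// newly opened reader only sees the result after commit() publishes it.
class ToyWriter {
    private List<String> committed = new ArrayList<>(); // what readers see
    private List<String> pending = new ArrayList<>();   // uncommitted state

    void addSegment(String name) { pending.add(name); }

    void deleteAll() { pending.clear(); }               // no per-document work

    void commit() { committed = new ArrayList<>(pending); }

    // A "new IndexReader" only ever sees the last commit point.
    List<String> openReader() { return new ArrayList<>(committed); }
}
```

Until commit() runs, openReader() still returns the old segments, matching the "same semantics as a commit()" requirement in the description.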

      1. DeleteAllFlushDocCountFix.patch
        2 kB
        Tim Smith
      2. IndexWriterDeleteAll.patch
        5 kB
        Tim Smith
      3. LUCENE-1705.patch
        6 kB
        Michael McCandless

        Issue Links

          Activity

          Shai Erera added a comment -

          My search app has such a scenario, and currently we just delete all the documents given a certain criterion (something similar to the MatchAllDocsQuery above). But I actually think that's the wrong approach. If you want to delete all the documents from the index, you'd better create a new one. The main reason is that if your index has, say, 10M documents, a deleteAll() will keep those 10M in the index, and when you re-index, the index size will be doubled. Worse still, the deleted documents may belong to segments that will not be merged/optimized right away (depending on your mergeFactor setting), and will therefore stick around for a long time (until you call optimize() or expungeDeletes()).

          But, creating a new IndexWriter right away, while overriding the current one is not so smart, because your users will be left w/ no search results until the index has accumulated enough documents. Therefore I think the solution for such an approach should be:

          1. Call writer.rollback() - abort all current operations, cancel everything until the last commit.
          2. Create a new IndexWriter in a new directory and re-index everything.
          3. In the meantime, all your search operations go against the current index, which you know is not going to change until the other one is re-built, and therefore you can also optimize things, by opening an IndexReader and stop any accounting your code may do - just leave it open.
          4. When re-indexing is complete, sync all your code and:
            • Define your workDir to be the new index dir. That way new searches can begin right away on the new index.
            • Safely delete the old index dir (probably need to do something here to ensure no readers are open against this dir, etc.).

          That's a high-level description and I realize it may have some holes here and there, but you get the point.

          If we were to create a deleteAll() method, I'd expect it to work that way. I.e., the solution you proposed above (write a new segments file referencing no segments) would prevent all searches until something new is actually re-indexed, right?

          I have to admit though, that I don't have an idea yet on how it can be done inside Lucene, such that new readers will see the old segments, while when I finish re-indexing and call commit, the previous segments will just be deleted.

          A wild shot (and then I'll go to sleep on it): how about if you re-index everything, not committing during that time at all? Readers that are open against the current directory will see all the documents EXCEPT the new ones you're adding (same for new readers that you may open). When you're done re-indexing, you'll call a commitNewOnly, which will create an empty segments file and then call commit. That way, assuming you're using KeepOnlyLastCommitDeletionPolicy, after the existing readers close, any new reader that is opened will see the new segments only, and the next time you commit, the old segments will be deleted.

          That will move the deleteAll() method to the application side, since it knows when it can safely delete all the current segments. If you don't have such a requirement (keeping an index for searches until re-indexing is complete), then I think you can safely close() the index and re-create it?
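The swap in steps 3-4 of the plan above might look like this sketch (plain Java, not Lucene code; the layout is assumed: a root directory holding per-generation index dirs plus a current.marker file naming the live one, and IndexSwap/publish/currentIndexDir are hypothetical names):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the directory-swap step: re-index into a fresh directory,
// then atomically repoint a marker file so searches switch to the new
// index before the old directory is deleted.
class IndexSwap {
    // Atomically record which index directory is live.
    static void publish(Path root, String liveDirName) {
        try {
            Path tmp = root.resolve("current.tmp");
            Files.writeString(tmp, liveDirName);
            Files.move(tmp, root.resolve("current.marker"),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Searchers read the marker to find the live index directory.
    static String currentIndexDir(Path root) {
        try {
            return Files.readString(root.resolve("current.marker"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The atomic rename is what guarantees a searcher never observes a half-written marker; deleting the old directory still has to wait until no readers hold it open, as noted above.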

          Tim Smith added a comment -

          My use case is like so:

          • IndexReader opened against old index (the last committed index)
          • user requests "clearing the index" (for whatever reason, up to them to decide)
            • this will close the current IndexWriter and reopen with create=true (ideally this would just call writer.deleteAll())
            • IndexWriter.init() writes out new segments.gen file referencing no segments with create=true (old segments are not yet deleted as IndexReader is still open for them)
            • NOTE: new IndexReader not yet opened because they haven't said "commit" yet
            • (previous commit will still be the "live" index seen by searches - obviously this means disk is not reclaimed yet)
          • user may then feed new documents, or not, (up to them)
          • user will then eventually say commit
          • new IndexReader will be opened, which should only contain documents added since the "deleteAll()" operation

          Ideally, I want to perform this delete-all operation very efficiently, without needing to reopen an IndexWriter.
          The rollback() call would possibly slightly optimize what I am currently doing, but even better still would be a highly efficient deleteAll() method, as all the semantics of this operation would be handled nicely (without jumping through hoops to clear the index).

          Creating a new IndexWriter in the same directory with create=true works like a champ (existing IndexReaders still see the previous commit);
          just want to avoid the IndexWriter.close() call.

          Michael McCandless added a comment -

          This should be simple to implement, by taking the logic that's now in IndexWriter's init method (under the if (create) block) and pulling it out into a separate method.

          From IndexWriter's standpoint, this is just another commit, on equal footing with all prior commits, even though it up and deleted all segments. So, e.g., a deletion policy could choose to keep those past commits around, IndexReaders can open those past commits, etc.

          One good functional gain from offering this method is that it could be done within a single IndexWriter transaction. I.e., you could deleteAll, make further changes, etc., but then change your mind and roll back the entire transaction. Whereas, if you close & re-open the writer with create=true, you've necessarily committed the change to the index.

          So I don't think we should implicitly do a commit() from within deleteAll(); I think the user should call commit() on their own.
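The transactional point above, that deleteAll() is an ordinary uncommitted change which rollback() can undo, unlike close()/reopen with create=true, can be illustrated with a toy model (plain Java, not IndexWriter internals; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the transaction semantics (not Lucene code).
// deleteAll() only edits the pending segment list, so rollback() can
// restore the state of the last commit.
class TxnWriter {
    private List<String> committed = new ArrayList<>(); // last commit point
    private List<String> pending = new ArrayList<>();   // uncommitted state

    void add(String segment) { pending.add(segment); }

    void deleteAll() { pending.clear(); }               // just a pending change

    void commit() { committed = new ArrayList<>(pending); }

    // Undo everything since the last commit, including a deleteAll().
    void rollback() { pending = new ArrayList<>(committed); }

    List<String> view() { return new ArrayList<>(pending); }
}
```

A close()/reopen with create=true has no analogue of rollback() here: the empty segments file is already the new commit point the moment the writer is re-created.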

          Tim Smith added a comment -

          I agree, it would be best if the deleteAll() could be rolled back and was just like any other add/update/delete operation on the index (provided it doesn't cause performance of the operation to degrade to a delete by query with a *:*)

          That would then greatly improve the current ways that a deleteAll() must be done (either sending a *:* to the delete-by-query method, or closing the IndexWriter and reopening).

          Michael McCandless added a comment -

          provided it doesn't cause performance of the operation to degrade to a delete by query with a *:*

          The delete actually would not be with a *:* query; instead, we simply remove all referenced segments from the segments_N file. So, it would be exceptionally fast.

          Do you wanna cons up a patch? It should be a refactoring of what's now done inside IndexWriter's init method...

          Tim Smith added a comment -

          Do you wanna cons up a patch? It should be a refactoring of what's now done inside IndexWriter's init method...

          Had a feeling this was coming.

          I'll give it a shot, (unless someone else volunteers)

          Shai Erera added a comment -

          I'll give it a shot, (unless someone else volunteers)

          I don't mind taking a deep dive here, if you're willing to wait a couple of days.

          Tim Smith added a comment -

          It seems like this method could be very simple (obviously I'm not an expert on IndexWriter code (yet)).

          Seems like this is pretty close:

          public synchronized void deleteAll() throws IOException {
            rollbackTransaction(); // might need to unroll this and not do everything here
            segmentInfos.clear();
            checkpoint(); // should we do this here? or just let the inevitable commit do this?
          }
          

          Alternatively, I suppose we could just copy rollbackTransaction() to a deleteAll() method with the one change of not restoring the rollbackSegmentInfos (just leave it cleared).

          This may result in doing less work (and be more self-contained).

          Tim Smith added a comment -

          Looking closer, rollbackTransaction() isn't the right thing to use here (but it looks pretty close to the right thing).

          Will work up a patch adapting rollbackTransaction() to a deleteAll() method.

          Might also need to borrow some from rollbackInternal() (not sure I see rollbackTransaction rolling back any documents buffered in memory).

          Tim Smith added a comment -

          Here's a patch adding a deleteAll() method (based on internalRollback()).

          Patch includes tests.

          The deleteAll() method can be rolled back (if the IndexWriter supports rollback).

          This seems to work quite nicely.

          Please comment if there are any edge cases not being handled.

          Michael McCandless added a comment -

          Patch looks good Tim!

          I added another test case (for the NRT reader), removed the unneeded notifyAll & assert, and tweaked javadocs. I think it's ready to go in... I'll wait a day or two.

          Michael McCandless added a comment -

          Thanks Tim!

          Tim Smith added a comment -

          Looks like I found an issue with this.

          The deleteAll() method isn't resetting the nextDocID on the DocumentsWriter (or some similar behaviour).

          So, the following sequence will result in an error:

          • deleteAll()
          • updateDocument("5", doc)
          • commit()

          This results in a delete for doc "5" getting buffered, but with a very high "maxDocId".
          At the same time, the doc is added; however, the following will then occur on commit:

          • flush segments to disk
          • doc "5" is now in a segment on disk
          • run deletes
          • doc "5" is now blacklisted from segment

          Will work on fixing this and post a new patch (along with updated test case)

          (I was worried I was missing an edge case.)
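The failure sequence above can be reproduced in a toy model (plain Java, not DocumentsWriter internals; all names are hypothetical). An update buffers a delete-by-term bounded by the doc counter at buffering time, but a flushed segment numbers its docs from 0; if deleteAll() leaves the counter at its old high value, the buffered delete wrongly covers the freshly re-added document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the deleteAll()+updateDocument() bug (not Lucene code).
class ToyDocWriter {
    int docCount = 0;                                   // in-memory doc counter
    final List<String> pendingDocs = new ArrayList<>(); // buffered docs, by term
    final Map<String, Integer> pendingDeletes = new HashMap<>(); // term -> id bound

    void addDocument(String term) {
        pendingDocs.add(term);
        docCount++;
    }

    void updateDocument(String term) {
        pendingDeletes.put(term, docCount); // delete older docs with this term
        addDocument(term);
    }

    void deleteAll(boolean resetDocCount) {
        pendingDocs.clear();
        pendingDeletes.clear();
        if (resetDocCount) {
            docCount = 0; // the fix: without this, a stale bound eats new docs
        }
    }

    // Flush: docs get segment ids 0..n-1; a buffered delete removes docs
    // with a matching term whose id is below the recorded bound.
    List<String> flush() {
        List<String> segment = new ArrayList<>();
        for (int id = 0; id < pendingDocs.size(); id++) {
            String term = pendingDocs.get(id);
            Integer bound = pendingDeletes.get(term);
            if (bound == null || id >= bound) {
                segment.add(term);
            }
        }
        pendingDocs.clear();
        pendingDeletes.clear();
        return segment;
    }
}
```

With resetDocCount=false the updated document is silently dropped at flush, exactly the symptom reported (added doc indexed, updated doc not); resetting the counter, as the fix does, keeps it.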

          Tim Smith added a comment -

          Here's a patch to TestIndexWriterDelete that shows the problem.

          After the deleteAll(), a document is added and a document is updated:
          the added document gets indexed, the updated document does not.

          Tim Smith added a comment -

          Here's a patch that fixes the deleteAll() + updateDocument() issue.

          Just needed to set the FlushDocCount to 0 after aborting the outstanding documents.

          Michael McCandless added a comment -

          Thanks Tim! I just committed this.


            People

            • Assignee:
              Michael McCandless
            • Reporter:
              Tim Smith
            • Votes:
              0
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:

                Development