Issue Details (XML | Word | Printable)

Key: LUCENE-1194
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Add deleteByQuery to IndexWriter

Created: 26/Feb/08 03:54 PM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1194.patch 2008-02-26 03:55 PM Michael McCandless 56 kB
Text File Licensed for inclusion in ASF works LUCENE-1194.take2.patch 2008-03-01 05:29 PM Michael McCandless 73 kB

Lucene Fields: New
Resolution Date: 06/Mar/08 11:02 AM


 Description  « Hide
This has been discussed several times recently:

http://markmail.org/message/awlt4lmk3533epbe
http://www.gossamer-threads.com/lists/lucene/java-user/57384#57384

If we add deleteByQuery to IndexWriter then this is a big step towards
allowing IndexReader to be readonly.

I took the approach suggested in that first thread: I buffer delete
queries just like we now buffer delete terms, holding the max docID
that the delete should apply to.

Then, I also decoupled flushing deletes (mapping term or query -->
actual docIDs that need deleting) from flushing added documents, and
now I flush deletes only when a merge is started, or on commit() or
close(). SegmentMerger now exports the docID map it used when
merging, and I use that to renumber the max docIDs of all pending
deletes.

Finally, I turned off tracking of memory usage of pending deletes
since they now live beyond each flush. Deletes are now only
explicitly flushed if you set maxBufferedDeleteTerms to something
other than DISABLE_AUTO_FLUSH. Otherwise they are flushed at the
start of every merge.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Michael McCandless added a comment - 26/Feb/08 03:55 PM
Attached patch. All tests pass.

Michael McCandless made changes - 26/Feb/08 03:55 PM
Field Original Value New Value
Attachment LUCENE-1194.patch [ 12376495 ]
Ning Li added a comment - 26/Feb/08 05:45 PM
Great to see deleteByQuery being added to IndexWriter!

> Then, I also decoupled flushing deletes (mapping term or query -->
> actual docIDs that need deleting) from flushing added documents, and
> now I flush deletes only when a merge is started, or on commit() or
> close().

When autoCommit is true, we have to flush deletes with added documents
for update atomicity, don't we? UpdateByQuery can be added, if there is a
need.

> SegmentMerger now exports the docID map it used when merging,
> and I use that to renumber the max docIDs of all pending deletes.

Because of renumbering, we don't have to flush deletes at the start of
every merge, right? But it is a good time to flush deletes.


Michael McCandless added a comment - 26/Feb/08 08:05 PM

When autoCommit is true, we have to flush deletes with added documents
for update atomicity, don't we? UpdateByQuery can be added, if there is a
need.

Good question Ning!

As of LUCENE-1044, when autoCommit=true, IndexWriter only commits on
committing a merge, not with every flush.

Hmmm ... but, there is actually the reverse problem now with my patch:
an auto commit can actually commit deletes before the corresponding
added docs are committed (from updateDocument calls). This is
because, when we commit we only sync & commit the merged segments (not
the flushed segments). Though, autoCommit=true is deprecated; once we
remove that (in 3.0) this problem goes away. I'll have to ponder how
to fix that for now up until 3.0...it's tricky. Maybe before 3.0
we'll just have to flush all deletes whenever we flush a new
segment....

Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit
is hardwired to false then you can deleteDocuments(Query) and then
addDocument(...) and it will be atomic.

Because of renumbering, we don't have to flush deletes at the start of
every merge, right?

Exactly. EG, we could carry the deletes indefinitely and then flush
them only on close (assuming autoCommit=false), but...

But it is a good time to flush deletes.

Right. I think you want to flush all deletes that apply to the
segments being merged so merging will compact them. Right now I apply
all buffered deletes to all segments when any merge is started. It
may be possible to only apply them to those segments about to be
merged, but, I think it gets rather complex to track which buffered
deletes then need to apply to which segments.


Ning Li added a comment - 27/Feb/08 04:22 PM
> As of LUCENE-1044, when autoCommit=true, IndexWriter only commits on
> committing a merge, not with every flush.

I see. Interesting.

> Hmmm ... but, there is actually the reverse problem now with my patch:
> an auto commit can actually commit deletes before the corresponding
> added docs are committed (from updateDocument calls). This is
> because, when we commit we only sync & commit the merged segments (not
> the flushed segments).

Yep.

> Though, autoCommit=true is deprecated; once we
> remove that (in 3.0) this problem goes away. I'll have to ponder how
> to fix that for now up until 3.0...it's tricky. Maybe before 3.0
> we'll just have to flush all deletes whenever we flush a new
> segment....

I think flushing deletes when we flush a new segment is fine before 3.0.
In 3.0, is the plan to default autoCommit to false or to disable autoCommit
entirely? The latter, right?

> Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit
> is hardwired to false then you can deleteDocuments(Query) and then
> addDocument(...) and it will be atomic.

Agree. When autoCommit is disabled, we don't need any update method.


Michael McCandless added a comment - 27/Feb/08 10:00 PM

In 3.0, is the plan to default autoCommit to false or to disable autoCommit
entirely? The latter, right?

Yes, the latter.


Michael McCandless added a comment - 01/Mar/08 05:29 PM
OK I attached a new rev of the patch, fixing the issue Ning spotted
(thanks!). Now if autoCommit=true we apply deletes with every flush.

Other changes in this rev:

  • I modified the TestAtomicUpdate unit test to fail (because of this
    issue), and with the new patch it now passes.
  • I added the unLock/isLocked methods into IndexWriter and deprecated
    the ones in IndexReader.
  • I fixed an issue with commit() whereby it could return without
    committing, if another thread (merge thread) was in the process of
    committing at the same time.
  • Using the same testPoint call from LUCENE-1198, I added more test
    points so that I could tweak thread scheduling by randomly calling
    Thread.yield(). I changed TestStressIndexing2 & TestAtomicUpdate
    to do this.

I plan to commit in a few days.


Michael McCandless made changes - 01/Mar/08 05:29 PM
Attachment LUCENE-1194.take2.patch [ 12376906 ]
Michael McCandless made changes - 06/Mar/08 11:02 AM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Repository Revision Date User Message
ASF #634219 Thu Mar 06 11:02:04 UTC 2008 mikemccand LUCENE-1194: add IndexWriter.deleteDocuments(Query)
Files Changed
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/index/TestStressIndexing2.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/DocumentsWriter.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/TermVectorsReader.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/index/TestAddIndexesNoOptimize.java
ADD /lucene/java/trunk/src/java/org/apache/lucene/index/MergeDocIDRemapper.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/IndexWriter.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/IndexReader.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/index/TestIndexWriter.java
MODIFY /lucene/java/trunk/CHANGES.txt
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/index/TestIndexWriterDelete.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/index/TestAtomicUpdate.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/SegmentMerger.java

Michael McCandless made changes - 11/Oct/08 12:49 PM
Status Resolved [ 5 ] Closed [ 6 ]