Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This has been discussed several times recently:

      http://markmail.org/message/awlt4lmk3533epbe
      http://www.gossamer-threads.com/lists/lucene/java-user/57384#57384

      If we add deleteByQuery to IndexWriter then this is a big step towards
      allowing IndexReader to be readonly.

      I took the approach suggested in that first thread: I buffer delete
      queries just like we now buffer delete terms, holding the max docID
      that the delete should apply to.

      Then, I also decoupled flushing deletes (mapping term or query -->
      actual docIDs that need deleting) from flushing added documents, and
      now I flush deletes only when a merge is started, or on commit() or
      close(). SegmentMerger now exports the docID map it used when
      merging, and I use that to renumber the max docIDs of all pending
      deletes.

      Finally, I turned off tracking of memory usage of pending deletes
      since they now live beyond each flush. Deletes are now only
      explicitly flushed if you set maxBufferedDeleteTerms to something
      other than DISABLE_AUTO_FLUSH. Otherwise they are flushed at the
      start of every merge.

      1. LUCENE-1194.patch
        56 kB
        Michael McCandless
      2. LUCENE-1194.take2.patch
        73 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Attached patch. All tests pass.

        Show
        mikemccand Michael McCandless added a comment - Attached patch. All tests pass.
        Hide
        ningli Ning Li added a comment -

        Great to see deleteByQuery being added to IndexWriter!

        > Then, I also decoupled flushing deletes (mapping term or query -->
        > actual docIDs that need deleting) from flushing added documents, and
        > now I flush deletes only when a merge is started, or on commit() or
        > close().

        When autoCommit is true, we have to flush deletes with added documents
        for update atomicity, don't we? UpdateByQuery can be added, if there is a
        need.

        > SegmentMerger now exports the docID map it used when merging,
        > and I use that to renumber the max docIDs of all pending deletes.

        Because of renumbering, we don't have to flush deletes at the start of
        every merge, right? But it is a good time to flush deletes.

        Show
        ningli Ning Li added a comment - Great to see deleteByQuery being added to IndexWriter! > Then, I also decoupled flushing deletes (mapping term or query --> > actual docIDs that need deleting) from flushing added documents, and > now I flush deletes only when a merge is started, or on commit() or > close(). When autoCommit is true, we have to flush deletes with added documents for update atomicity, don't we? UpdateByQuery can be added, if there is a need. > SegmentMerger now exports the docID map it used when merging, > and I use that to renumber the max docIDs of all pending deletes. Because of renumbering, we don't have to flush deletes at the start of every merge, right? But it is a good time to flush deletes.
        Hide
        mikemccand Michael McCandless added a comment -

        When autoCommit is true, we have to flush deletes with added documents
        for update atomicity, don't we? UpdateByQuery can be added, if there is a
        need.

        Good question Ning!

        As of LUCENE-1044, when autoCommit=true, IndexWriter only commits on
        committing a merge, not with every flush.

        Hmmm ... but, there is actually the reverse problem now with my patch:
        an auto commit can actually commit deletes before the corresponding
        added docs are committed (from updateDocument calls). This is
        because, when we commit we only sync & commit the merged segments (not
        the flushed segments). Though, autoCommit=true is deprecated; once we
        remove that (in 3.0) this problem goes away. I'll have to ponder how
        to fix that for now up until 3.0...it's tricky. Maybe before 3.0
        we'll just have to flush all deletes whenever we flush a new
        segment....

        Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit
        is hardwired to false then you can deleteDocuments(Query) and then
        addDocument(...) and it will be atomic.

        Because of renumbering, we don't have to flush deletes at the start of
        every merge, right?

        Exactly. EG, we could carry the deletes indefinitely and then flush
        them only on close (assuming autoCommit=false), but...

        But it is a good time to flush deletes.

        Right. I think you want to flush all deletes that apply to the
        segments being merged so merging will compact them. Right now I apply
        all buffered deletes to all segments when any merge is started. It
        may be possible to only apply them to those segments about to be
        merged, but, I think it gets rather complex to track which buffered
        deletes then need to apply to which segments.

        Show
        mikemccand Michael McCandless added a comment - When autoCommit is true, we have to flush deletes with added documents for update atomicity, don't we? UpdateByQuery can be added, if there is a need. Good question Ning! As of LUCENE-1044 , when autoCommit=true, IndexWriter only commits on committing a merge, not with every flush. Hmmm ... but, there is actually the reverse problem now with my patch: an auto commit can actually commit deletes before the corresponding added docs are committed (from updateDocument calls). This is because, when we commit we only sync & commit the merged segments (not the flushed segments). Though, autoCommit=true is deprecated; once we remove that (in 3.0) this problem goes away. I'll have to ponder how to fix that for now up until 3.0...it's tricky. Maybe before 3.0 we'll just have to flush all deletes whenever we flush a new segment.... Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit is hardwired to false then you can deleteDocuments(Query) and then addDocument(...) and it will be atomic. Because of renumbering, we don't have to flush deletes at the start of every merge, right? Exactly. EG, we could carry the deletes indefinitely and then flush them only on close (assuming autoCommit=false), but... But it is a good time to flush deletes. Right. I think you want to flush all deletes that apply to the segments being merged so merging will compact them. Right now I apply all buffered deletes to all segments when any merge is started. It may be possible to only apply them to those segments about to be merged, but, I think it gets rather complex to track which buffered deletes then need to apply to which segments.
        Hide
        ningli Ning Li added a comment -

        > As of LUCENE-1044, when autoCommit=true, IndexWriter only commits on
        > committing a merge, not with every flush.

        I see. Interesting.

        > Hmmm ... but, there is actually the reverse problem now with my patch:
        > an auto commit can actually commit deletes before the corresponding
        > added docs are committed (from updateDocument calls). This is
        > because, when we commit we only sync & commit the merged segments (not
        > the flushed segments).

        Yep.

        > Though, autoCommit=true is deprecated; once we
        > remove that (in 3.0) this problem goes away. I'll have to ponder how
        > to fix that for now up until 3.0...it's tricky. Maybe before 3.0
        > we'll just have to flush all deletes whenever we flush a new
        > segment....

        I think flushing deletes when we flush a new segment is fine before 3.0.
        In 3.0, is the plan to default autoCommit to false or to disable autoCommit
        entirely? The latter, right?

        > Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit
        > is hardwired to false then you can deleteDocuments(Query) and then
        > addDocument(...) and it will be atomic.

        Agree. When autoCommit is disabled, we don't need any update method.

        Show
        ningli Ning Li added a comment - > As of LUCENE-1044 , when autoCommit=true, IndexWriter only commits on > committing a merge, not with every flush. I see. Interesting. > Hmmm ... but, there is actually the reverse problem now with my patch: > an auto commit can actually commit deletes before the corresponding > added docs are committed (from updateDocument calls). This is > because, when we commit we only sync & commit the merged segments (not > the flushed segments). Yep. > Though, autoCommit=true is deprecated; once we > remove that (in 3.0) this problem goes away. I'll have to ponder how > to fix that for now up until 3.0...it's tricky. Maybe before 3.0 > we'll just have to flush all deletes whenever we flush a new > segment.... I think flushing deletes when we flush a new segment is fine before 3.0. In 3.0, is the plan to default autoCommit to false or to disable autoCommit entirely? The latter, right? > Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit > is hardwired to false then you can deleteDocuments(Query) and then > addDocument(...) and it will be atomic. Agree. When autoCommit is disabled, we don't need any update method.
        Hide
        mikemccand Michael McCandless added a comment -

        In 3.0, is the plan to default autoCommit to false or to disable autoCommit
        entirely? The latter, right?

        Yes, the latter.

        Show
        mikemccand Michael McCandless added a comment - In 3.0, is the plan to default autoCommit to false or to disable autoCommit entirely? The latter, right? Yes, the latter.
        Hide
        mikemccand Michael McCandless added a comment -

        OK I attached a new rev of the patch, fixing the issue Ning spotted
        (thanks!). Now if autoCommit=true we apply deletes with every flush.

        Other changes in this rev:

        • I modified the TestAtomicUpdate unit test to fail (because of this
          issue), and with the new patch it now passes.
        • I added the unLock/isLocked methods into IndexWriter and deprecated
          the ones in IndexReader.
        • I fixed an issue with commit() whereby it could return without
          committing, if another thread (merge thread) was in the process of
          committing at the same time.
        • Using the same testPoint call from LUCENE-1198, I added more test
          points so that I could tweak thread scheduling by randomly calling
          Thread.yield(). I changed TestStressIndexing2 & TestAtomicUpdate
          to do this.

        I plan to commit in a few days.

        Show
        mikemccand Michael McCandless added a comment - OK I attached a new rev of the patch, fixing the issue Ning spotted (thanks!). Now if autoCommit=true we apply deletes with every flush. Other changes in this rev: I modified the TestAtomicUpdate unit test to fail (because of this issue), and with the new patch it now passes. I added the unLock/isLocked methods into IndexWriter and deprecated the ones in IndexReader. I fixed an issue with commit() whereby it could return without committing, if another thread (merge thread) was in the process of committing at the same time. Using the same testPoint call from LUCENE-1198 , I added more test points so that I could tweak thread scheduling by randomly calling Thread.yield(). I changed TestStressIndexing2 & TestAtomicUpdate to do this. I plan to commit in a few days.

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development