Lucene - Core
  1. Lucene - Core
  2. LUCENE-2897

apply delete-by-Term and docID immediately to newly flushed segments

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from LUCENE-2324.

      When we flush deletes today, we keep them as buffered Term/Query/docIDs that need to be deleted. But, for a newly flushed segment (ie fresh out of the DWPT), this is silly, because during flush we visit all terms and we know their docIDs. So it's more efficient to apply the deletes (for this one segment) at that time.

      We still must buffer deletes for all prior segments, but these deletes don't need to map to a docIDUpto anymore; ie we just need a Set.

      This issue should wait until LUCENE-1076 is in since that issue cuts over buffered deletes to a transactional stream.

      1. LUCENE-2897.patch
        36 kB
        Michael McCandless
      2. LUCENE-2897.patch
        12 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial patch.

        The approach is nice and simple All tests pass, but I put a bunch of nocommits in there, and I stopped short of the stuff I'll have to redo after committing ooo merges.

        Show
        Michael McCandless added a comment - Initial patch. The approach is nice and simple All tests pass, but I put a bunch of nocommits in there, and I stopped short of the stuff I'll have to redo after committing ooo merges.
        Hide
        Michael McCandless added a comment -

        I had to read this a few times, yes it's very elegant as we're skipping the postings that otherwise would be deleted immediately after flush, and we're reusing the terms map already in DWPT.

        Well... I think we can't [easily] skip writing the postings, because could result in non-deterministic behavior (I put a comment on this in the patch).

        If we did the flush w/ 2 passes (first pass to mark all del docs and 2nd to flush) then we could skip writing postings of docs that were deleted. But I suspect that's too much cost on flush.

        With a single pass, we'd end up writing some postings for the doc, but not all, depending on the order in which its terms arrived vs its deleted terms.

        I mean, in practice, an app is gonna delete against ID field (typically) so if we "knew" that down deep here in Luceneland we could do the first pass only against that one field...

        Also, merge is still going to have to apply del docs, since eg stored fields have written the deleted docs.

        Show
        Michael McCandless added a comment - I had to read this a few times, yes it's very elegant as we're skipping the postings that otherwise would be deleted immediately after flush, and we're reusing the terms map already in DWPT. Well... I think we can't [easily] skip writing the postings, because could result in non-deterministic behavior (I put a comment on this in the patch). If we did the flush w/ 2 passes (first pass to mark all del docs and 2nd to flush) then we could skip writing postings of docs that were deleted. But I suspect that's too much cost on flush. With a single pass, we'd end up writing some postings for the doc, but not all, depending on the order in which its terms arrived vs its deleted terms. I mean, in practice, an app is gonna delete against ID field (typically) so if we "knew" that down deep here in Luceneland we could do the first pass only against that one field... Also, merge is still going to have to apply del docs, since eg stored fields have written the deleted docs.
        Hide
        Jason Rutherglen added a comment -

        Well... I think we can't [easily] skip writing the postings, because could result in non-deterministic behavior (I put a comment on this in the patch).

        Instead we're building the deleted docs BV as we're flushing.

        Show
        Jason Rutherglen added a comment - Well... I think we can't [easily] skip writing the postings, because could result in non-deterministic behavior (I put a comment on this in the patch). Instead we're building the deleted docs BV as we're flushing.
        Hide
        Michael Busch added a comment -

        Yeah this is nice. I was thinking we'd switch to live deletes with RT, because then we can also handle delete-by-query like this.

        So the deleted queries we still have to buffer per DWPT, but this solves the updateDocument() problem.

        Show
        Michael Busch added a comment - Yeah this is nice. I was thinking we'd switch to live deletes with RT, because then we can also handle delete-by-query like this. So the deleted queries we still have to buffer per DWPT, but this solves the updateDocument() problem.
        Hide
        Michael McCandless added a comment -

        Instead we're building the deleted docs BV as we're flushing.

        Right, that's what the patch does. Seems to work great

        I was thinking we'd switch to live deletes with RT, because then we can also handle delete-by-query like this.

        Right, though for RT we need to apply the delete immediately (vs this patch which applies on flush).

        And delete-by-Query is still buffered...

        Show
        Michael McCandless added a comment - Instead we're building the deleted docs BV as we're flushing. Right, that's what the patch does. Seems to work great I was thinking we'd switch to live deletes with RT, because then we can also handle delete-by-query like this. Right, though for RT we need to apply the delete immediately (vs this patch which applies on flush). And delete-by-Query is still buffered...
        Hide
        Michael McCandless added a comment -

        Patch – I think it's ready to commit!

        Show
        Michael McCandless added a comment - Patch – I think it's ready to commit!
        Hide
        Michael McCandless added a comment -

        I think we can do this for 3.1.

        Show
        Michael McCandless added a comment - I think we can do this for 3.1.
        Hide
        Michael McCandless added a comment -

        I changed my mind! Pushing this to 3.2 now.

        Show
        Michael McCandless added a comment - I changed my mind! Pushing this to 3.2 now.
        Hide
        Robert Muir added a comment -

        Bulk closing for 3.2

        Show
        Robert Muir added a comment - Bulk closing for 3.2

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development