Lucene - Core
  1. Lucene - Core
  2. LUCENE-4203

Add IndexWriter.tryDeleteDocument, to delete by document id when possible

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-BETA, Trunk
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from LUCENE-4069.

      In that use case, where the app needs to first lookup a document, then
      call updateDocument, it's wasteful today because the relatively costly
      lookup (by a primary key field, eg "id") is done twice.

      But, since you already resolved the PK to docID on the first lookup,
      it would be nice to then delete by that docID and then you can call
      addDocument instead.

      So I worked out a rough start at this, by adding
      IndexWriter.tryDeleteDocument. It'd be a very expert API: it takes a
      SegmentInfo (referencing the segment that contains the docID), and as
      long as that segment hasn't yet been merged away, it will mark the
      document for deletion and return true (success). If it has been
      merged away it returns false and the app must then delete-by-term. It
      only works if the writer is in NRT mode (ie you've opened an NRT
      reader).

      In LUCENE-4069 using tryDeleteDocument gave a ~20% net speedup.

      I think tryDeleteDocument would also be useful when Solr "updates" a
      document by loading all stored fields, changing them, and calling
      updateDocument.

      1. LUCENE-4203.patch
        10 kB
        Michael McCandless
      2. LUCENE-4203.patch
        3 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial rough patch ... not ready to commit and needs a good random test.

        Show
        Michael McCandless added a comment - Initial rough patch ... not ready to commit and needs a good random test.
        Hide
        Robert Muir added a comment -

        Whats the advantage of passing the SIPC here? Wouldnt it be cleaner to just take SegmentReader and steal it from there?

        Show
        Robert Muir added a comment - Whats the advantage of passing the SIPC here? Wouldnt it be cleaner to just take SegmentReader and steal it from there?
        Hide
        Michael McCandless added a comment -

        Whats the advantage of passing the SIPC here? Wouldnt it be cleaner to just take SegmentReader and steal it from there?

        Oh, no advantage ... I agree passing SR would be better (then we can leave SR.getSI() as package private). It shouldn't be any hardship either because you should use an NRT reader from IW anyway to even find the docID to delete.

        Show
        Michael McCandless added a comment - Whats the advantage of passing the SIPC here? Wouldnt it be cleaner to just take SegmentReader and steal it from there? Oh, no advantage ... I agree passing SR would be better (then we can leave SR.getSI() as package private). It shouldn't be any hardship either because you should use an NRT reader from IW anyway to even find the docID to delete.
        Hide
        Robert Muir added a comment -

        It shouldn't be any hardship either because you should use an NRT reader from IW anyway to even find the docID to delete.

        Should we consider splitting out NRTSegmentReader and just make it type-safe?

        SegmentReader is already simplified, as long as its an AtomicReader nothing in Lucene should care, and
        we could factor out the NRT-only ctor etc in SR today...

        Just an idea, not sure about it myself

        Show
        Robert Muir added a comment - It shouldn't be any hardship either because you should use an NRT reader from IW anyway to even find the docID to delete. Should we consider splitting out NRTSegmentReader and just make it type-safe? SegmentReader is already simplified, as long as its an AtomicReader nothing in Lucene should care, and we could factor out the NRT-only ctor etc in SR today... Just an idea, not sure about it myself
        Hide
        Michael McCandless added a comment -

        Should we consider splitting out NRTSegmentReader and just make it type-safe?

        Hmm interesting...

        I think we shouldn't: the only reason why IW must be in NRT mode (pooling readers) is because of a limitation of IW: it cannot buffer pending deleted docIDs against already-flushed segments unless it has a pooled reader for that segment.

        We could (should?) someday fix this limitation, so I don't think the API sig should lock it down today.

        I also don't want users to think there may be salient differences b/w an NRT reader and a "normal" reader. The only difference is how the reader got the list of segments it should open ... otherwise an NRT reader and non-NRT reader are identical now.

        Show
        Michael McCandless added a comment - Should we consider splitting out NRTSegmentReader and just make it type-safe? Hmm interesting... I think we shouldn't: the only reason why IW must be in NRT mode (pooling readers) is because of a limitation of IW: it cannot buffer pending deleted docIDs against already-flushed segments unless it has a pooled reader for that segment. We could (should?) someday fix this limitation, so I don't think the API sig should lock it down today. I also don't want users to think there may be salient differences b/w an NRT reader and a "normal" reader. The only difference is how the reader got the list of segments it should open ... otherwise an NRT reader and non-NRT reader are identical now.
        Hide
        Uwe Schindler added a comment -

        Hi Mike,

        the user does not see any impl class (we should actually hide SegmentReader again), user only sees AtomicReader. NRT should subclass (the internal) SegmentReader and after that the code can simply check the incoming Reader if it is NRTReader. If not it say "I cannot delete documents by id with non-NRT IndexWriter". Nothing more, plain simple. Please do not make anything of these crazy internal classes public!

        Uwe

        Show
        Uwe Schindler added a comment - Hi Mike, the user does not see any impl class (we should actually hide SegmentReader again), user only sees AtomicReader. NRT should subclass (the internal) SegmentReader and after that the code can simply check the incoming Reader if it is NRTReader. If not it say "I cannot delete documents by id with non-NRT IndexWriter". Nothing more, plain simple. Please do not make anything of these crazy internal classes public! Uwe
        Hide
        Michael McCandless added a comment -

        OK I'll use AtomicReader, though it really must be a SegmentReader (we can cast it) since we need the SegmentInfoPerCommit instance from it.

        Hmm I suppose it could also be a CompositeReader (nrt DirectoryReader), and then the method would have to resolve to the AtomicReader (SegmentReader) / sub-docID.

        Show
        Michael McCandless added a comment - OK I'll use AtomicReader, though it really must be a SegmentReader (we can cast it) since we need the SegmentInfoPerCommit instance from it. Hmm I suppose it could also be a CompositeReader (nrt DirectoryReader), and then the method would have to resolve to the AtomicReader (SegmentReader) / sub-docID.
        Hide
        Michael McCandless added a comment -

        New patch, allowing any IndexReader, adding javadocs, and updating TestRollingUpdates to use tryDeleteDocument. I think it's ready!

        Show
        Michael McCandless added a comment - New patch, allowing any IndexReader, adding javadocs, and updating TestRollingUpdates to use tryDeleteDocument. I think it's ready!

          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development