Lucene - Core
  1. Lucene - Core
  2. LUCENE-1721

IndexWriter to allow deletion by doc ids

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It would be great if IndexWriter would allow for deletion by doc ids as well. It makes sense for cases where a "query" has been executed beforehand, and later, that query needs to be applied in order to delete the matched documents.

      More information here: http://www.nabble.com/Delete-by-docId-in-IndexWriter-td24239930.html

        Activity

        Shay Banon created issue -
        Hide
        Michael McCandless added a comment -

        This is a frequently requested feature, and I agree it'd be useful, but the problem is docID is in general not usable in the context of a writer since docIDs shift when segments that have deletions are committed.

        Show
        Michael McCandless added a comment - This is a frequently requested feature, and I agree it'd be useful, but the problem is docID is in general not usable in the context of a writer since docIDs shift when segments that have deletions are committed.
        Hide
        Tim Smith added a comment -

        how about a delete method on the IndexWriter that takes a segment number and a document id

        it would also be required to add methods to the IndexReader to get the segment number and local document id for a docid, but this should then work just fine

        Show
        Tim Smith added a comment - how about a delete method on the IndexWriter that takes a segment number and a document id it would also be required to add methods to the IndexReader to get the segment number and local document id for a docid, but this should then work just fine
        Hide
        Tim Smith added a comment -

        i suppose even that approach would cause problems if segments merge between getting the segment number/local doc pair and actuallly asking for the delete

        Show
        Tim Smith added a comment - i suppose even that approach would cause problems if segments merge between getting the segment number/local doc pair and actuallly asking for the delete
        Hide
        Michael McCandless added a comment -

        Right, a merge can commit at any time.

        Since a writer can change docIDs at any time, it's never safe for something outside to hold onto docIDs; this is what's good about delete by Term and by Query: they are "late binding", ie, only resolve to docID at the exact moment deletion is done.

        Show
        Michael McCandless added a comment - Right, a merge can commit at any time. Since a writer can change docIDs at any time, it's never safe for something outside to hold onto docIDs; this is what's good about delete by Term and by Query: they are "late binding", ie, only resolve to docID at the exact moment deletion is done.
        Hide
        Tim Smith added a comment -

        One thing that would be nice to see is a boolean return value from "deleteByTerm" that would indicate if any documents will actually be deleted (or even more, the count of docs that would be deleted)

        This would allow an application to perform more optimal commits/optimizes. If the index only receives deletes that do not match the index, the application can be aware of this and prevent performing a commit (thereby removing the need for application specific cache loading/analytics/etc performed after a "noop" commit). This is extremely useful in the event where TONS of deletes will be received, but very few updates with frequent commit intervals

        obviously, this is rather impractical as this would require opening a SegmentReader for each and every segment for each delete given the current implementation of handling deletes (causing this special delete to be extremely slow)

        Show
        Tim Smith added a comment - One thing that would be nice to see is a boolean return value from "deleteByTerm" that would indicate if any documents will actually be deleted (or even more, the count of docs that would be deleted) This would allow an application to perform more optimal commits/optimizes. If the index only receives deletes that do not match the index, the application can be aware of this and prevent performing a commit (thereby removing the need for application specific cache loading/analytics/etc performed after a "noop" commit). This is extremely useful in the event where TONS of deletes will be received, but very few updates with frequent commit intervals obviously, this is rather impractical as this would require opening a SegmentReader for each and every segment for each delete given the current implementation of handling deletes (causing this special delete to be extremely slow)
        Hide
        Yonik Seeley added a comment -

        obviously, this is rather impractical as this would require opening a SegmentReader for each and every segment for each delete given the current implementation of handling deletes (causing this special delete to be extremely slow)

        Right - seems better to be handled at the application level. Solr used to do it's own delete handling before switching to using Lucene's delete-by-term - something like that would give you more control.

        If the index only receives deletes that do not match the index, the application can be aware of this and prevent performing a commit (thereby removing the need for application specific cache loading/analytics/etc performed after a "noop" commit).

        Sounds like you could perhaps use reopen() or the NRT (near realtime) patches - get a new reader, see what segments changed (if any).

        Show
        Yonik Seeley added a comment - obviously, this is rather impractical as this would require opening a SegmentReader for each and every segment for each delete given the current implementation of handling deletes (causing this special delete to be extremely slow) Right - seems better to be handled at the application level. Solr used to do it's own delete handling before switching to using Lucene's delete-by-term - something like that would give you more control. If the index only receives deletes that do not match the index, the application can be aware of this and prevent performing a commit (thereby removing the need for application specific cache loading/analytics/etc performed after a "noop" commit). Sounds like you could perhaps use reopen() or the NRT (near realtime) patches - get a new reader, see what segments changed (if any).
        Hide
        Tim Smith added a comment -

        Sounds like you could perhaps use reopen() or the NRT (near realtime) patches - get a new reader, see what segments changed (if any).

        Yeah, i'm aware of all these (and will take advantage of them once 2.9 is out), but some custom caches may not work on a per segment basis (may be much slower when evaluated this way, or may not be able to be constructed in the first place without full index as a context)

        was just throwing out that use case, as it is one i have run into and is one closely related to wanting to delete documents by internal id.

        Show
        Tim Smith added a comment - Sounds like you could perhaps use reopen() or the NRT (near realtime) patches - get a new reader, see what segments changed (if any). Yeah, i'm aware of all these (and will take advantage of them once 2.9 is out), but some custom caches may not work on a per segment basis (may be much slower when evaluated this way, or may not be able to be constructed in the first place without full index as a context) was just throwing out that use case, as it is one i have run into and is one closely related to wanting to delete documents by internal id.
        Hide
        Jason Rutherglen added a comment -

        I'm still of the somewhat naive opinion delete by doc id is
        possible simply using IR.deleteDocument on a IW.getReader. The
        reason is we can maintain a doc id mapping/genealogy as segments
        are merged. We'd simply keep a chain of child readers in a
        reader, a delete would cascade through it.

        I'm curious as to what other use cases there are for this, maybe
        editing field caches?

        Show
        Jason Rutherglen added a comment - I'm still of the somewhat naive opinion delete by doc id is possible simply using IR.deleteDocument on a IW.getReader. The reason is we can maintain a doc id mapping/genealogy as segments are merged. We'd simply keep a chain of child readers in a reader, a delete would cascade through it. I'm curious as to what other use cases there are for this, maybe editing field caches?
        Hide
        Yonik Seeley added a comment -

        but some custom caches may not work on a per segment basis

        Just because you get a new IndexReader doesn't mean you have to use it, if not enough (or nothing) changed.

        Show
        Yonik Seeley added a comment - but some custom caches may not work on a per segment basis Just because you get a new IndexReader doesn't mean you have to use it, if not enough (or nothing) changed.
        Hide
        Tim Smith added a comment -

        Absolutely nothing would have to have actually changed in order to reuse an old IndexReader.
        Is there a simple way to identify if this is true? Or at least to identity if the commit/optimize resulted in absolutely no change? reliably?

        Show
        Tim Smith added a comment - Absolutely nothing would have to have actually changed in order to reuse an old IndexReader. Is there a simple way to identify if this is true? Or at least to identity if the commit/optimize resulted in absolutely no change? reliably?
        Hide
        Yonik Seeley added a comment -

        Absolutely nothing would have to have actually changed in order to reuse an old IndexReader.

        Right - see the javadoc for reopen() then... it currently says that if nothing has changed, you get the same IndexReader instance back.

        Show
        Yonik Seeley added a comment - Absolutely nothing would have to have actually changed in order to reuse an old IndexReader. Right - see the javadoc for reopen() then... it currently says that if nothing has changed, you get the same IndexReader instance back.
        Hide
        Tim Smith added a comment -

        That looks like its pretty close, and is definitely better than assuming the index changed, but still not "exactly" what would be desired (at least by me).
        looks IndexReader.reopen() will give a new IndexReader instance even if there are no "actual" changes to the index (deletes/new segments) in the event that some background merges were performed. This can result in a new IndexReader being opened (via reopen()) even if no content changed. (while this will be a more optimal search index, it will invalidate caches that will then need to be reloaded)

        a return status from commit would be more desirable
        such a status could list the number new documents, deleted documents, and new segments/merged segments, and so on
        That way, an application could choose what criteria would facilitate opening a new index
        (this is going even more off topic though)

        Show
        Tim Smith added a comment - That looks like its pretty close, and is definitely better than assuming the index changed, but still not "exactly" what would be desired (at least by me). looks IndexReader.reopen() will give a new IndexReader instance even if there are no "actual" changes to the index (deletes/new segments) in the event that some background merges were performed. This can result in a new IndexReader being opened (via reopen()) even if no content changed. (while this will be a more optimal search index, it will invalidate caches that will then need to be reloaded) a return status from commit would be more desirable such a status could list the number new documents, deleted documents, and new segments/merged segments, and so on That way, an application could choose what criteria would facilitate opening a new index (this is going even more off topic though)
        Hide
        Tim Smith added a comment -

        reopen() actually doesn't do the trick

        the following situation will always result in reopen() returning a different IndexReader instance (at least in 2.4):

        • commit()
        • open IndexReader
        • delete(docid not in index)
        • commit() - creates new segments_N file (even though no files actually changed)
        • reopen() IndexReader - returns new instance, because of new segments_N file
        Show
        Tim Smith added a comment - reopen() actually doesn't do the trick the following situation will always result in reopen() returning a different IndexReader instance (at least in 2.4): commit() open IndexReader delete(docid not in index) commit() - creates new segments_N file (even though no files actually changed) reopen() IndexReader - returns new instance, because of new segments_N file
        Hide
        Yonik Seeley added a comment -

        This is presumably a MultiReader that is returned... can you check to see if it contains the same segment reader instances (it should)?

        Show
        Yonik Seeley added a comment - This is presumably a MultiReader that is returned... can you check to see if it contains the same segment reader instances (it should)?
        Hide
        Tim Smith added a comment -

        I'm working in lucene 2.4, so i don't have access to the "child" IndexReaders
        however i'm 99.99999% positive this is the case (or some similar event) (especially as my index is optimized at this point)

        my next though was to inspect the IndexCommit for each and validate that they contain the same exact files (with same exact timestamps)
        this is just more work (need to save of the IndexCommit at open time to make sure the timestamps will be captured at open time, then compare that with the new one (and there's no convenience method on IndexCommit for this type of equality check)

        Show
        Tim Smith added a comment - I'm working in lucene 2.4, so i don't have access to the "child" IndexReaders however i'm 99.99999% positive this is the case (or some similar event) (especially as my index is optimized at this point) my next though was to inspect the IndexCommit for each and validate that they contain the same exact files (with same exact timestamps) this is just more work (need to save of the IndexCommit at open time to make sure the timestamps will be captured at open time, then compare that with the new one (and there's no convenience method on IndexCommit for this type of equality check)
        Hide
        Tim Smith added a comment -

        Looks like its actually a SegmentReader

        old index: org.apache.lucene.index.SegmentReader@f38cf0
        new index: org.apache.lucene.index.SegmentReader@7bfc04

        i guess that in 2.4, if you have an optimized index, and you call reopen(), you will always get a new SegmentReader instance (so the == check isn't really valid)
        i guess this is a non-issue in 2.9 (opened index is never a SegmentReader, even in the optimized case)

        Show
        Tim Smith added a comment - Looks like its actually a SegmentReader old index: org.apache.lucene.index.SegmentReader@f38cf0 new index: org.apache.lucene.index.SegmentReader@7bfc04 i guess that in 2.4, if you have an optimized index, and you call reopen(), you will always get a new SegmentReader instance (so the == check isn't really valid) i guess this is a non-issue in 2.9 (opened index is never a SegmentReader, even in the optimized case)
        Hide
        Tim Smith added a comment -

        Actually, it looks like i am seeing 2 separate segments at reopen() time (_3, then _4), however at this time my index should have been optimized
        will track this down (probably just a problem with my test)

        Show
        Tim Smith added a comment - Actually, it looks like i am seeing 2 separate segments at reopen() time (_3, then _4), however at this time my index should have been optimized will track this down (probably just a problem with my test)
        Hide
        Tim Smith added a comment -

        found my problem, i wasn't doing a commit() after the optimize() call, so an index opened after optimize() was not the optimized index
        reopen() works like a champ now
        the javadoc for optimize() isn't fully clear on these semantics

        Show
        Tim Smith added a comment - found my problem, i wasn't doing a commit() after the optimize() call, so an index opened after optimize() was not the optimized index reopen() works like a champ now the javadoc for optimize() isn't fully clear on these semantics
        Hide
        Tim Smith added a comment -

        IndexReader.reopen() solves my use case for wanting this functionality

        I suppose if anyone else has a use case for this functionality, they should add it

        Show
        Tim Smith added a comment - IndexReader.reopen() solves my use case for wanting this functionality I suppose if anyone else has a use case for this functionality, they should add it
        Mark Thomas made changes -
        Field Original Value New Value
        Workflow jira [ 12467060 ] Default workflow, editable Closed status [ 12563238 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563238 ] jira [ 12584157 ]
        Hide
        Erick Erickson added a comment -

        SPRING_CLEANING_2013 JIRAS looks to be resolved.

        Show
        Erick Erickson added a comment - SPRING_CLEANING_2013 JIRAS looks to be resolved.
        Erick Erickson made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Hide
        Michael McCandless added a comment -

        IndexWriter now has .tryDeleteDocument (LUCENE-4203), which deletes by IndexReader + document ID when possible (ie, as long as a merge has not merged away that reader).

        Show
        Michael McCandless added a comment - IndexWriter now has .tryDeleteDocument ( LUCENE-4203 ), which deletes by IndexReader + document ID when possible (ie, as long as a merge has not merged away that reader).

          People

          • Assignee:
            Unassigned
            Reporter:
            Shay Banon
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development