Lucene - Core / LUCENE-3828

Impossible to delete doc by docId, undeleteAll or setNorm(docId..)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It appears that there is a major regression in the trunk API. It's no longer possible to:
      1. delete document by internal id (even though you can iterate and retrieve docs by internal ids)
      2. undelete all deleted (but not yet reclaimed) documents
      3. set norm value on a specific document (by internal id)

      The lack of #1 means that you have to use delete by term or by query, which in turn means that now we require that documents have a unique primary key (otherwise you won't be able to delete a particular document that shares terms with other docs). IMHO this item is critical and should be fixed.

      The lack of #2 might not be critical, but it still comes in handy in some situations.

      The lack of #3 means that you have to update the whole doc if you just want to correct one field, which might be OK for the time being, since it's a special case of not having updateable fields in general. But it's quite inconvenient if all you want to do is adjust the weight of a doc without reindexing, something that is possible with 3.x.
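
      To illustrate the concern in #1, here is a minimal sketch using a toy in-memory model rather than the Lucene API. The class DeleteByTermSketch and its methods are hypothetical names introduced for illustration only. It shows that deleting by a shared term removes every document containing that term, so targeting a single document requires a term that only that document has.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (NOT Lucene): each document is just a set of terms. */
final class DeleteByTermSketch {
    private final List<Set<String>> docs = new ArrayList<>();

    /** Adds a document made of the given terms. */
    void add(String... terms) {
        docs.add(new HashSet<>(Arrays.asList(terms)));
    }

    /** Removes every document containing the term; returns how many were removed. */
    int deleteByTerm(String term) {
        int before = docs.size();
        docs.removeIf(d -> d.contains(term));
        return before - docs.size();
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        DeleteByTermSketch index = new DeleteByTermSketch();
        index.add("id:1", "title:lucene");
        index.add("id:2", "title:lucene");
        index.add("id:3", "title:solr");

        // A unique key term hits exactly one document.
        System.out.println(index.deleteByTerm("id:1"));
        // A shared term hits every remaining document that contains it.
        System.out.println(index.deleteByTerm("title:solr"));
        System.out.println(index.size());
    }
}
```

      Without the id:N terms, the only way to remove the first document in this model would be deleteByTerm("title:lucene"), which would also remove the second.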

        Activity

        Uwe Schindler added a comment -

        Hi Andrzej,

        • I agree with point (1), because it can be handy to delete documents by internal docId. I am not sure why this would not be possible through IndexWriter, but the problem here is background merging (or merging in general): with TieredMergePolicy, docIds can change suddenly even while an IndexWriter is open. The only way to get stable docIds would be some mode that freezes IndexWriter's merging: get an NRT reader, delete documents by integer ID on IndexWriter, then unfreeze and commit. IndexReader should of course stay read-only.
        • In my opinion, undeleteAll is no longer really needed in trunk, because in Lucene 4.0 you can undelete all docs by simply ignoring liveDocs when executing queries (e.g. with a FilterAtomicReader that returns null for getLiveDocs()). If we want to re-add something like that, it should be on IndexWriter; I think undeleting from IndexWriter should be easily possible. IndexReader should of course stay read-only.
        • The last point is already explained in the related issues: in 4.0, norms are just DocValues, so once we get updateable DocValues we could handle that (of course not via IndexReader). In any case, you can change scoring by changing the similarity, which is much more flexible in trunk; you can even use a custom docvalues field as the norm, containing a float instead of the byte. Changing norm values on disk is not really the way to go anymore. And finally, again: IndexReader should of course stay read-only.
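
        The docId instability described in the first point can be sketched with a toy model. This is not Lucene code: MergeRenumberSketch, merge and lookup are hypothetical names, and the model assumes only that a merge drops deleted documents and concatenates the survivors. Under that assumption, a global docId captured before a merge can resolve to a different document afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (NOT Lucene): a segment is a list of docs in local docId order; null marks a deleted slot. */
final class MergeRenumberSketch {

    /** Merges segments into one: deleted slots are dropped, so later documents shift down. */
    static List<String> merge(List<List<String>> segments) {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc != null) {
                    merged.add(doc);
                }
            }
        }
        return merged;
    }

    /** Resolves a global docId as docBase + local docId, where docBase is the total size of earlier segments. */
    static String lookup(List<List<String>> segments, int globalDocId) {
        int docBase = 0;
        for (List<String> segment : segments) {
            if (globalDocId < docBase + segment.size()) {
                return segment.get(globalDocId - docBase);
            }
            docBase += segment.size();
        }
        throw new IndexOutOfBoundsException("no such docId: " + globalDocId);
    }

    public static void main(String[] args) {
        List<List<String>> before = new ArrayList<>();
        before.add(Arrays.asList("A", null, "C")); // "B" was deleted but not yet reclaimed
        before.add(Arrays.asList("D", "E"));

        List<List<String>> after = new ArrayList<>();
        after.add(merge(before));

        System.out.println(lookup(before, 3)); // document at global docId 3 before the merge
        System.out.println(lookup(after, 3));  // document at the same docId after the merge
    }
}
```

        In this model the two lookups return different documents, which is why deleting by a previously obtained integer ID is only safe if no merge can run in between.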
        Yonik Seeley added a comment -

        "I agree with point (1), because it can be handy to delete documents by internal docId. I am not sure why this would not be possible through IndexWriter, but the problem here is background merging (or merging in general): with TieredMergePolicy, docIds can change suddenly even while an IndexWriter is open. The only way to get stable docIds would be some mode that freezes IndexWriter's merging: get an NRT reader, delete documents by integer ID on IndexWriter, then unfreeze and commit. IndexReader should of course stay read-only."

        Seems like the best way to deleteByDocId in the IndexWriter is to somehow express it as a custom Query (rather than trying to freeze IndexWriter).

        Uwe Schindler added a comment -

        "Seems like the best way to deleteByDocId in the IndexWriter is to somehow express it as a custom Query (rather than trying to freeze IndexWriter)."

        The problem is that IndexWriter executes query deletes per segment (unfortunately with AtomicReaderContext.docBase==0). I wanted to fix that already, but that's not easy with IW.
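
        A toy sketch of why this matters for a delete-by-docId query, again not using the Lucene API: DocBaseSketch and matches are hypothetical names. The sketch assumes a query that matches when docBase + local docId equals the target global docId, and compares what it hits when each segment reports its true docBase versus a docBase of 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Lucene): evaluates a match-by-global-docId query segment by segment. */
final class DocBaseSketch {

    /**
     * Returns "segment:localId" for every slot where docBase + localId equals the target.
     * When reportTrueDocBase is false, every segment is evaluated as if its docBase were 0.
     */
    static List<String> matches(int[] segmentSizes, int targetGlobalId, boolean reportTrueDocBase) {
        List<String> hits = new ArrayList<>();
        int trueDocBase = 0;
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            int docBase = reportTrueDocBase ? trueDocBase : 0;
            for (int local = 0; local < segmentSizes[seg]; local++) {
                if (docBase + local == targetGlobalId) {
                    hits.add(seg + ":" + local);
                }
            }
            trueDocBase += segmentSizes[seg];
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 2}; // two segments holding 3 and 2 documents

        System.out.println(matches(sizes, 3, true));  // true docBase: one hit, in the second segment
        System.out.println(matches(sizes, 3, false)); // docBase 0: the target is never reached
        System.out.println(matches(sizes, 1, false)); // docBase 0: local id 1 matches in both segments
    }
}
```

        With docBase fixed at 0, the same query either misses its target or hits one document per segment, so a global docId cannot be expressed reliably this way.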

        Yonik Seeley added a comment -

        "The problem is IndexWriter executes query deletes per segment (unfortunately with AtomicReaderContext.docBase==0)."

        Ahhhh, so there is no way to get a true top-level reader for anything that needs to be cross-segment (like joins, grouping, etc.). Bummer.


          People

          • Assignee: Unassigned
          • Reporter: Andrzej Bialecki
          • Votes: 0
          • Watchers: 0
