Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-992

IndexWriter.updateDocument is no longer atomic

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Spinoff from LUCENE-847.

      Ning caught that as of LUCENE-843, we lost the atomicity of the delete
      + add in IndexWriter.updateDocument.

      Ning suggested a simple fix: move the buffered deletes into
      DocumentsWriter and let it do the delete + add atomically. This has a
      nice side effect of also consolidating the "time to flush" logic in
      DocumentsWriter.

      1. LUCENE-992.patch
        26 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        > - addDocument(Document doc, Analyzer analyzer, Term delTerm): is it
        > better to name it updateDocument?

        Good, I'll make that change.

        > - I didn't check all the variable accesses in DocumentsWriter, but
        > it seems abort() should lock for some of the variables it
        > accesses. Or make abort() a synchronized method.

        OK I will make abort synchronized.

        > - Observation: Large documents will block small documents from being
        > flushed if addDocument of large documents is called before that of
        > small ones. This is not the case before LUCENE-843.

        Right, when multiple documents are in flight at once (because multiple
        threads are adding documents), the documents must be flushed in order
        of docID. Each one grabs a unique (sequential) docID at the start
        (synchronized), does the indexing un-synchronized, then flushes
        (synchronized) but only if it is that documents "turn" to flush (ie it
        is the next docID to be written). So if a large doc grabs docID
        first, then a small doc comes through, it's possible for small docs to
        finish indexing before large doc does in which case small docs are
        buffered, waiting for large doc to flush first.

        > I also slightly changed the exception semantics in IndexWriter:
        > previously if a disk full (or other) exception was hit when flushing
        > the buffered docs, the buffered deletes were retained but the
        > partially flushed buffered docs (if any) were discarded.

        > - Observation: Before LUCENE-843, both buffered docs and buffered
        > deletes were retained when such an exception occurs. Now both
        > buffered docs and buffered deletes would be discared if an exception
        > is hit.

        Right, altough if the exception is hit after the commit point (eg,
        while building the compound file) then the buffered docs & deletes
        are added to the index.

        I plan to commit this in a day or two.

        Show
        mikemccand Michael McCandless added a comment - > - addDocument(Document doc, Analyzer analyzer, Term delTerm): is it > better to name it updateDocument? Good, I'll make that change. > - I didn't check all the variable accesses in DocumentsWriter, but > it seems abort() should lock for some of the variables it > accesses. Or make abort() a synchronized method. OK I will make abort synchronized. > - Observation: Large documents will block small documents from being > flushed if addDocument of large documents is called before that of > small ones. This is not the case before LUCENE-843 . Right, when multiple documents are in flight at once (because multiple threads are adding documents), the documents must be flushed in order of docID. Each one grabs a unique (sequential) docID at the start (synchronized), does the indexing un-synchronized, then flushes (synchronized) but only if it is that documents "turn" to flush (ie it is the next docID to be written). So if a large doc grabs docID first, then a small doc comes through, it's possible for small docs to finish indexing before large doc does in which case small docs are buffered, waiting for large doc to flush first. > I also slightly changed the exception semantics in IndexWriter: > previously if a disk full (or other) exception was hit when flushing > the buffered docs, the buffered deletes were retained but the > partially flushed buffered docs (if any) were discarded. > - Observation: Before LUCENE-843 , both buffered docs and buffered > deletes were retained when such an exception occurs. Now both > buffered docs and buffered deletes would be discared if an exception > is hit. Right, altough if the exception is hit after the commit point (eg, while building the compound file) then the buffered docs & deletes are added to the index. I plan to commit this in a day or two.
        Hide
        ningli Ning Li added a comment -

        The patch looks good! A few comments and/or observations:

        • addDocument(Document doc, Analyzer analyzer, Term delTerm): is it better to name it updateDocument?
        • I didn't check all the variable accesses in DocumentsWriter, but it seems abort() should lock for some of the variables it accesses. Or make abort() a synchronized method.
        • Observation: Large documents will block small documents from being flushed if addDocument of large documents is called before that of small ones. This is not the case before LUCENE-843.

        > I also slightly changed the exception semantics in IndexWriter:
        > previously if a disk full (or other) exception was hit when flushing
        > the buffered docs, the buffered deletes were retained but the
        > partially flushed buffered docs (if any) were discarded.

        • Observation: Before LUCENE-843, both buffered docs and buffered deletes were retained when such an exception occurs. Now both buffered docs and buffered deletes would be discared if an exception is hit.
        Show
        ningli Ning Li added a comment - The patch looks good! A few comments and/or observations: addDocument(Document doc, Analyzer analyzer, Term delTerm): is it better to name it updateDocument? I didn't check all the variable accesses in DocumentsWriter, but it seems abort() should lock for some of the variables it accesses. Or make abort() a synchronized method. Observation: Large documents will block small documents from being flushed if addDocument of large documents is called before that of small ones. This is not the case before LUCENE-843 . > I also slightly changed the exception semantics in IndexWriter: > previously if a disk full (or other) exception was hit when flushing > the buffered docs, the buffered deletes were retained but the > partially flushed buffered docs (if any) were discarded. Observation: Before LUCENE-843 , both buffered docs and buffered deletes were retained when such an exception occurs. Now both buffered docs and buffered deletes would be discared if an exception is hit.
        Hide
        mikemccand Michael McCandless added a comment -

        Attached patch.

        I added a unit test that runs 2 indexing threads (calling
        updateDocument) and 2 reader threads and asserts in the reader threads
        that the number of documents never changes.

        I also slightly changed the exception semantics in IndexWriter:
        previously if a disk full (or other) exception was hit when flushing
        the buffered docs, the buffered deletes were retained but the
        partially flushed buffered docs (if any) were discarded. I think this
        was actually a bug because the buffered deletes must also be discarded
        since they refer to document numbers that are no longer valid. So I
        changed it to also clear buffered deletes on exception, and had to
        change one unit test (TestIndexWriterDelete) to match this.

        Show
        mikemccand Michael McCandless added a comment - Attached patch. I added a unit test that runs 2 indexing threads (calling updateDocument) and 2 reader threads and asserts in the reader threads that the number of documents never changes. I also slightly changed the exception semantics in IndexWriter: previously if a disk full (or other) exception was hit when flushing the buffered docs, the buffered deletes were retained but the partially flushed buffered docs (if any) were discarded. I think this was actually a bug because the buffered deletes must also be discarded since they refer to document numbers that are no longer valid. So I changed it to also clear buffered deletes on exception, and had to change one unit test (TestIndexWriterDelete) to match this.

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development