
[LUCENE-1516] Integrate IndexReader with IndexWriter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      The current problem is that an IndexReader and an IndexWriter
      cannot be open at the same time and both perform updates, as each
      requires the write lock to the index. While methods such as
      IW.deleteDocuments enable deleting from the IW, methods such as
      IR.deleteDocument(int doc) and norms updating are not available
      from the IW. This limits the ability to update the index
      dynamically or in realtime: one must close the IW, open an IR,
      delete or update norms, flush, then open the IW again, a process
      which can be detrimental to realtime updates.

      This patch will expose an IndexWriter.getReader method that
      returns the currently flushed state of the index as a class that
      extends IndexReader. The new IR implementation will differ from
      existing implementations such as MultiSegmentReader in that
      flushing will synchronize updates with the IW, in part by sharing
      the write lock. All methods of IR will be usable, including
      reopen and clone.
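
      A minimal usage sketch of the proposed API, assuming the
      getReader/reopen semantics described in this issue (exact
      signatures were still under discussion at the time; dir, analyzer,
      and doc stand in for your own Directory, Analyzer, and Document):

        IndexWriter writer = new IndexWriter(dir, analyzer,
            IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addDocument(doc);

        // Coupled, read-only reader that sees un-committed changes.
        IndexReader reader = writer.getReader();

        // Later: reopen to materialize changes made since getReader().
        // getReader() always incRefs; reopen only incRefs when it
        // returns a new reader, so close the old one on change.
        IndexReader newReader = reader.reopen();
        if (newReader != reader) {
          reader.close();
          reader = newReader;
        }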

      Attachments

      1. LUCENE-1516.patch
        70 kB
        Jason Rutherglen
      2. LUCENE-1516.patch
        18 kB
        Jason Rutherglen
      3. LUCENE-1516.patch
        22 kB
        Jason Rutherglen
      4. LUCENE-1516.patch
        24 kB
        Jason Rutherglen
      5. LUCENE-1516.patch
        25 kB
        Jason Rutherglen
      6. LUCENE-1516.patch
        25 kB
        Jason Rutherglen
      7. LUCENE-1516.patch
        43 kB
        Jason Rutherglen
      8. LUCENE-1516.patch
        37 kB
        Jason Rutherglen
      9. LUCENE-1516.patch
        47 kB
        Jason Rutherglen
      10. LUCENE-1516.patch
        53 kB
        Jason Rutherglen
      11. LUCENE-1516.patch
        63 kB
        Jason Rutherglen
      12. LUCENE-1516.patch
        64 kB
        Jason Rutherglen
      13. LUCENE-1516.patch
        65 kB
        Jason Rutherglen
      14. LUCENE-1516.patch
        72 kB
        Jason Rutherglen
      15. LUCENE-1516.patch
        81 kB
        Jason Rutherglen
      16. LUCENE-1516.patch
        76 kB
        Jason Rutherglen
      17. LUCENE-1516.patch
        77 kB
        Jason Rutherglen
      18. LUCENE-1516.patch
        81 kB
        Jason Rutherglen
      19. LUCENE-1516.patch
        81 kB
        Jason Rutherglen
      20. LUCENE-1516.patch
        81 kB
        Jason Rutherglen
      21. LUCENE-1516.patch
        83 kB
        Jason Rutherglen
      22. LUCENE-1516.patch
        96 kB
        Jason Rutherglen
      23. LUCENE-1516.patch
        96 kB
        Jason Rutherglen
      24. LUCENE-1516.patch
        76 kB
        Michael McCandless
      25. ssd.png
        39 kB
        Michael McCandless
      26. magnetic.png
        42 kB
        Michael McCandless
      27. LUCENE-1516.patch
        95 kB
        Michael McCandless
      28. ssd2.png
        42 kB
        Michael McCandless
      29. LUCENE-1516.patch
        90 kB
        Michael McCandless
      30. LUCENE-1516.patch
        95 kB
        Michael McCandless
      31. LUCENE-1516.patch
        97 kB
        Michael McCandless

        Activity

        Jason Rutherglen added a comment -

        It looks like DirectoryIndexReader needs to have an IndexWriter
        mode, as unfortunately subclassing won't work. In this context,
        overriding a method implies the IW mode is being used.

        I assume we'll share the segmentInfos object from IW rather than
        share a clone with the IR?

        • DirectoryIndexReader.doCommit needs to be overridden
        • IndexReader uses the IndexDeletionPolicy from IndexWriter
        • DirectoryIndexReader.acquireWriteLock is overridden and synchronizes
          on the write lock of IndexWriter, other details need to be worked out
        • Methods requiring synchronization in IW: optimize, expungeDeletes,
          prepareCommit, flush, addIndexes, and other methods that modify
          segmentInfos. Do we synchronize on IW or on writeLock?
        • addDocument, updateDocument, deleteDocument do not seem to require
          synchronization
        • IW.getReader returns a cloned reader; IW keeps its own reference
          (sketched below). This is to allow IW to perform deletes using the
          internal reader rather than open a new reader, and in general not
          affect IW's reader while still sharing resources.
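
        A rough sketch of that internal/external split, with hypothetical
        names (the real patch evolved considerably from this):

          // Hypothetical sketch: IW keeps an internal reader for
          // applying deletes; getReader() hands out a clone that shares
          // segment resources but is not mutated by the writer.
          public class IndexWriterSketch {
            private IndexReader internalReader; // reopened on each flush

            public synchronized IndexReader getReader() throws IOException {
              return (IndexReader) internalReader.clone();
            }
          }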
        Grant Ingersoll added a comment -

        It still seems to me that we should think about what objects need to be shared and how they can be produced/instantiated appropriately instead of adding a Reader onto the Writer, which, IMO, pollutes the Writer API. I know it complicates things, but I think it will be less confusing to users of the API.

        Jason Rutherglen added a comment -

        Patch includes the latest patch from LUCENE-1314 and is unfortunately
        not readable. Is there a way to create a patch minus another patch?

        The IW internal reader is always the result of the latest external
        IR.reopen. In this way deletes made in the IW and from the external
        IR remain in sync. The latest IR always has the write lock anyway,
        per IR.clone semantics.

        DocumentsWriter.applyDeletes reuses the reader in IW rather than
        calling SegmentReader.get/.deleteDocument.

        In IW.doFlush the internal reader is reopened with the new
        segmentInfos.

        Included a very basic test case.

        Jason Rutherglen added a comment -

        This is an extremely early patch and has a bunch of System.outs
        for debugging.

        • TestIndexWriterReader.testMerge tries to use IW.addIndexes and
          fails because the lookup of the SegmentReader from the IW internal
          IR returns the incorrect reader (I spent a lot of time trying to
          figure out why and could not).
        • TestIndexWriterReader.testReader passes
        Jason Rutherglen added a comment -

        Some terminology for this patch: an internal reader is the
        IndexWriter's reader; an external reader is given via the
        IW.reopenReader method.

        If DirectoryIndexReader has a writer, no lock is acquired on
        updates. IR.clone normally passes the writeLock to the new reader;
        however, the external IR and the IW internal IR both need to hold
        the write lock. For this reason the user must be careful when
        flushing to ensure the proper instance of the IR's deletes is
        merged with the writer.

        The external IR.flush does not flush the deletes to disk; instead
        it merges them with the IW's internal IR, which is in RAM. IW.flush
        causes deletes and new segments to be flushed to the directory.

        The test cases TestIndexWriterReader.testIndexWriterDeletes and
        testIndexWriterReopenSegment fail when the IW is opened again after
        commit and close. The index files are being deleted during
        IW.commit; I traced this to IW.finishCommit -> deleter.checkpoint ->
        deleter.deletePendingFiles.

        Jason Rutherglen added a comment -

        Added TestIndexWriterReader.testIndexWriterDeletes,
        testIndexWriterDeletesOptimize, testIndexWriterReopenSegment, and
        testIndexWriterReopenSegmentOptimize. The optimize variants fail
        where the non-optimize ones work: the optimize methods delete _0.fdt
        and _0.fdx, and so fail when the writer is created again because it
        cannot find those files. It could be a segmentInfos merging problem
        or something else.

        Jason Rutherglen added a comment -

        The previously mentioned issue in the test case is fixed. Perhaps
        IW.reopenInternalReader() should be called in IW.checkpoint?

        Michael McCandless added a comment -

        Looks good, Jason. This is a big change, and I expect to go through
        a number of iterations before settling... plus we still need to
        figure out how the API is exposed. Comments:

        • All this logic needs to be conditional (this also depends on what
          API we actually settle on to expose this...): right now you always
          open a reader whenever IW is created.
        • We should assume we do not need to support autoCommit=true in this
          patch (since this will land after 3.0). This simplifies things.
        • IW.reopenInternalReader only does a clone not a reopen; how does
          it cover the newly flushed segment?
        • After a merge commits you don't seem to reopen the reader? This
          is actually tricky to do right, for realtime search: we somehow
          need to allow for warming of the newly created (merged) segment,
          in such a way that we do not block the flushing of further
          segments and reopen of readers against those new segments. I
          think what may be best is to subclass IW, and override a newly
          added "postMerge" method that's invoked on the new segment before
          the merge is committed into the SegmentInfos (see the sketch
          after this list). This is cleaner than allowing the change into
          the SegmentInfos and then having to make a custom deletion policy
          & track history of each segment.
        • It seems like reader.reopen() (where reader was obtained with
          IW.getReader()) doesn't do the right thing? (ie it's looking for
          the most recent segments_N in the Directory, but it should be
          looking for it @ IW.segmentInfos).
        • I think we should decouple "materializing deletes down to docIDs"
          from "flushing deletes to disk". IW does both as the same
          operation now (because it doesn't want to hold SR open for a long
          time), but once we have persistent open SegmentReaders we should
          separate these. It's not necessary for IW to write new .del files
          when it materializes deletes.
        • Instead of having to merge readers, I think we should have a
          single source to obtain an SR from. This way, when IW needs to
          materialize deletes, it will grab the same instance of SR for a
          given segment that the currently open MSR is using. Also, when
          merging kicks off, it'll grab the SR from the same source (this
          way deletes in RAM will be correctly merged away). Also, I think
          we should not use MSR for doing deletions (and still go segment by
          segment): it's quite a bit slower since every invocation must do
          the binary search again.
        • Likewise, you have to fix the commitMergedDeletes to decouple
          computing the new BitVector from writing the .del file to disk.
          That method should only create a new BitVector, for the newly
          merged segment. It must be synchronized to prevent any new
          deletions against the segments that were just merged. In fact,
          this is a real danger: after a merge finishes, if one continues to
          use an older reader to do deletions you get into trouble.
        • I still don't really like having both the IR and IW able to do
          deletions, with slightly different semantics. As it stands now,
          since you can't predict when IW materializes deletes, your reader
          will suddenly see a bunch of deletes appear. I think it's better
          if no deletes appear, ever, until you reopen your reader. Maybe
          we simply prevent deletion through the IR?
        • We need some serious unit tests here!
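
        A hedged sketch of the "postMerge" warming idea above; the hook
        name and signature are hypothetical, taken only from this comment
        (the role was eventually filled in the released 2.9 API by
        IndexWriter.setMergedSegmentWarmer):

          // Hypothetical subclass; assumes the patch adds a postMerge
          // hook that IW invokes on the newly merged segment before the
          // merge is committed into the SegmentInfos.
          class WarmingIndexWriter extends IndexWriter {
            WarmingIndexWriter(Directory dir, Analyzer analyzer)
                throws IOException {
              super(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
            }

            protected void postMerge(IndexReader merged) throws IOException {
              // Warm the new segment (e.g. run representative queries)
              // so a subsequent reopen doesn't block searches.
            }
          }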
        Jason Rutherglen added a comment -

        Mike, good points...

        since you can't predict when IW materializes deletes, your reader
        will suddenly see a bunch of deletes appear.

        The reader would need to be reopened to see the deletes. Isn't that
        expected behavior?

        Instead of having to merge readers, I think we need a single
        source to obtain an SR from

        I like this, however how would IR.clone work? I like having the
        internal reader separate from the external reader. The main reason
        to expose IR from IW is to allow delete by doc id and norms updates
        (eventually column-stride fields updates). I don't see how we can
        grab a reader during a merge and block realtime deletes occurring
        on the external reader. However, it is difficult to reconcile
        deletes to an external SR that's been merged away.

        It seems like we're getting closer to using a unique long UID for
        each doc that is carried over between merges. I was going to
        implement this above LUCENE-1516 however we may want to make UIDs a
        part of LUCENE-1516 to implement the behavior we're discussing.

        If the updates to the SR are queued, then it seems like the only
        way to achieve this is a doc UID. This way merges can happen in the
        background, and the IR has a mechanism for mapping its queue to the
        newly merged segments when flushed. Hopefully we aren't wreaking
        havoc with the IndexReader API?

        The scenario I think we're missing is if there are multiple cloned
        SRs out there. With the IW check-out-an-SR model, how do we allow
        cloning? A clone's updates will be placed into a central original
        SR queue? The queue is drained automatically on a merge or
        IW.flush? What happens when we want the IR deletes to be searchable
        without flushing to disk? Do a reopen/clone?

        number of iterations before settling

        Agreed, if it were simple it wouldn't be fun. ☺

        It's not necessary for IW to write new .del files when it
        materializes deletes.

        Good point, DocumentsWriter.applyDeletes shouldn't be flushing to
        disk and this sounds like a test case to add to TestIndexWriterReader.

        IW.reopenInternalReader only does a clone not a reopen; how does
        it cover the newly flushed segment?

        The segmentInfos is obtained from the writer. In the test case
        testIndexWriterReopenSegment it looks like using clone reopens the
        new segments.

        I think it's better if no deletes appear, ever, until you reopen
        your reader. Maybe we simply prevent deletion through the IR?

        Preventing deletion through the IR would seem to defeat the purpose
        of the patch unless there's some alternative mechanism for deleting
        by doc id?

        commitMergedDeletes to decouple computing the new BitVector from
        writing the .del file to disk.

        A hidden method I never noticed. I'll keep it in mind.

        It seems like reader.reopen() (where reader was obtained with
        IW.getReader()) doesn't do the right thing? (ie it's looking for the
        most recent segments_N in the Directory, but it should be looking for
        it @ IW.segmentInfos).

        Using the reopen method implementation for a Reader with IW does not
        seem necessary. It seems like it could call clone underneath?

        Michael McCandless added a comment -

        > since you can't predict when IW materializes deletes, your reader
        > will suddenly see a bunch of deletes appear.

        The reader would need to be reopened to see the deletes. Isn't that
        expected behavior?

        Ahh right, so long as we keep internal (private) clone, materializing
        the deletes won't affect the external reader.

        > Instead of having to merge readers, I think we need a single
        > source to obtain an SR from

        I like this however how would IR.clone work?

        It should work fine? The single source would only be used internally
        by IW (for merging, for materializing deletes, for the internal
        reader).

        I like having the internal reader separate from the external reader.

        I think we should keep that separation.

        The main reason to
        expose IR from IW is to allow delete by doc id and norms updates
        (eventually column stride fields updates). I don't see how we can
        grab a reader during a merge, and block realtime deletes occurring on
        the external reader. However it is difficult to rectify deletes to an
        external SR that's been merged away.

        It seems like we're getting closer to using a unique long UID for
        each doc that is carried over between merges. I was going to
        implement this above LUCENE-1516 however we may want to make UIDs a
        part of LUCENE-1516 to implement the behavior we're discussing.

        If the updates to SR are queued, then it seems like the only way to
        achieve this is a doc UID. This way merges can happen in the
        background, the IR has a mechanism for mapping it's queue to the
        newly merged segments when flushed. Hopefully we aren't wreaking
        havoc with the IndexReader API?

        But... do we need delete by docID once we have realtime search? I
        think the last compelling reason to keep IR's delete by docID was
        immediacy, but realtime search can give us that, from IW, even when
        deleting by Term or Query?

        (Your app can always add that long UID if it doesn't already have
        something usable).

        docIDs are free to change inside IW. I don't see how we can hand
        out a reader, allow deletes by docID against it, and merge those
        deletes back in at a later time, unless we track the genealogy of
        the segments?

        The scenario I think we're missing is if there's multiple cloned SRs
        out there. With the IW checkout an SR model how do we allow cloning?
        A clone's updates will be placed into a central original SR queue?
        The queue is drained automatically on a merge or IW.flush? What
        happens when we want the IR deletes to be searchable without flushing
        to disk? Do a reopen/clone?

        This is why I think all changes must be done through IW if you've
        opened a reader from it. In fact, with the addition of realtime
        search to Lucene, if we also add updating norms/column-stride fields
        to IW, can't we move away from allowing any changes via IR? (Ie
        deprecate deleteDocuments/setNorms/etc.)

        > It's not necessary for IW to write new .del files when it
        > materializes deletes.

        Good point, DocumentsWriter.applyDeletes shouldn't be flushing to
        disk and this sounds like a test case to add to TestIndexWriterReader.

        Well, if IW has no persistent reader to hold the deletes, it must keep
        doing what it does now (flush immediately to disk)?

        > IW.reopenInternalReader only does a clone not a reopen; how
        > does it cover the newly flushed segment?

        The segmentinfos is obtained from the Writer. In the test case
        testIndexWriterReopenSegment it looks like using clone reopens the
        new segments.

        Wait, where is this test? Maybe you need to svn add it?

        And, clone should not be reopening segments...?

        > I think it's better if no deletes appear, ever, until you reopen
        > your reader. Maybe we simply prevent deletion through the IR?

        Preventing deletion through the IR would seem to defeat the purpose
        of the patch unless there's some alternative mechanism for deleting
        by doc id?

        See above.

        > commitMergedDeletes to decouple computing the new BitVector from
        > writing the .del file to disk.

        A hidden method I never noticed. I'll keep it in mind.

        It's actually very important. This is how IW allows deletes to
        materialize to docIDs, while a merge is running – any newly
        materialized deletes against the just-merged segments are coalesced
        and carried over to the newly created segment. Any further deletes
        must be done against the docIDs in the new segment (which is why I
        don't see how we can allow deletes by docID to happen against a
        checked out reader).

        > It seems like reader.reopen() (where reader was obtained with
        > IW.getReader()) doesn't do the right thing? (ie it's looking for the
        > most recent segments_N in the Directory, but it should be looking for
        > it @ IW.segmentInfos).

        Using the reopen method implementation for a Reader with IW does not
        seem necessary. It seems like it could call clone underneath?

        Well, clone should be very different from reopen. It seems like
        calling reader.reopen() (on reader obtained from writer) should
        basically do the same thing as calling writer.getReader(). Ie they
        are nearly synonyms? (Except for small difference in ref counting –
        I think writer.getReader() should always incRef, but reopen only
        incRefs if it returns a new reader).

        Jason Rutherglen added a comment -

        Added the test case to the patch.

        Jason Rutherglen added a comment -

        The path forward seems to be exposing a cloned readonly reader
        from IW.getReader. This would be easier than jumping through hoops
        to track segment genealogy (at least for now ☺)

        can't we move away from allowing any changes via IR? (Ie
        deprecate deleteDocuments/setNorms/etc.)

        This would simplify things; however, as a thought experiment, how
        would setNorms work if it were a part of IndexWriter?

        And, clone should not be reopening segments...?

        DirectoryIndexReader.clone(boolean openReadonly) calls
        doReopen(SegmentInfos infos, boolean doClone, boolean openReadOnly),
        which is an abstract method that in SegmentReader and
        MultiSegmentReader reopens the segments? The segmentInfos for a
        ReaderIW is obtained from the IW, which is how it knows about the
        new segments. Perhaps not desired behavior?

        do we need delete by docID once we have realtime search? I
        think the last compelling reason to keep IR's delete by docID was
        immediacy, but realtime search can give us that, from IW, even when
        deleting by Term or Query?

        Good point! I think we may want to support it, but for now it
        shouldn't be necessary. I'm thinking of the case where someone is
        using the field cache (or some variant), performs some sort of
        query on it, and then needs to delete based on doc id. What do they
        do? Would we expose a callback mechanism where a
        deleteFrom(IndexReader ir) method is exposed and deletes occur at
        the time of the IW's choosing?

        It seems like calling reader.reopen() (on reader obtained
        from writer) should basically do the same thing as calling
        writer.getReader(). Ie they are nearly synonyms? (Except for small
        difference in ref counting - I think writer.getReader() should always
        incRef, but reopen only incRefs if it returns a new reader).

        Perhaps ReaderIW.reopen will call IW.getReader underneath instead of
        using IR's usual mechanism.

        Grant Ingersoll added a comment -

        This latest patch doesn't seem to apply.

        Grant Ingersoll added a comment -

        Is there ever a need for the normal IR construction anymore? Or do
        we always just ask for it from the IW (or wherever we choose to
        expose this, as I still don't think it belongs on the IW API-wise,
        but that isn't a big deal right now) every time? I suppose if I
        know I'm not going to be changing my index, I can still just get a
        read-only IR, right?

        API wise, I think we could do something like (with obvious other variations):

        IndexAccessor {

          IndexWriter getWriter(Directory dir);

          // returns a read-only reader
          IndexReader getReader(Directory dir);

          // returns the external IR described above
          IndexReader getReader(IndexWriter writer);
        }

        Then, everyone has a single point of entry for both writers and readers and all of this stuff can just be done through package private methods on the IW and it allows us to change things if we decide otherwise and it means that the IW is not coupled with the IR publicly.
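
        For illustration, a trivial pass-through implementation of that
        idea might look like this (hypothetical class name; assumes the
        IW.getReader method proposed in this issue):

          class SimpleIndexAccessor {
            IndexWriter getWriter(Directory dir) throws IOException {
              return new IndexWriter(dir, new StandardAnalyzer(),
                  IndexWriter.MaxFieldLength.UNLIMITED);
            }

            // returns a read-only reader
            IndexReader getReader(Directory dir) throws IOException {
              return IndexReader.open(dir, true);
            }

            // returns the external IR described above
            IndexReader getReader(IndexWriter writer) throws IOException {
              return writer.getReader();
            }
          }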

        Michael McCandless added a comment -

        Jason, I think you need to "svn up". Or, tell us which revision you're on and we can downgrade to that revision before applying the patch. (We need "svn patch"!).

        Michael McCandless added a comment -

        The path forward seems to be exposing a cloned readonly reader
        from IW.getReader.

        +1

        > can't we move away from allowing any changes via IR? (Ie
        > deprecate deleteDocuments/setNorms/etc.)

        This would simplify things however as a thought experiment how would
        the setNorms work if it were a part of IndexWriter?

        I think it'd look like this?

        IndexWriter.setNorm(Term term, String field, byte norm)
        

        Ie the Term IDs the doc(s) you want to set the norm for.
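
        For example, a hypothetical call (this setNorm overload never
        existed; Similarity.encodeNorm is the existing norm encoding):

          // Set the norm of field "body" for whichever doc(s) the
          // "id" term identifies.
          writer.setNorm(new Term("id", "42"), "body",
              Similarity.encodeNorm(1.5f));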

        > And, clone should not be reopening segments...?

        DirectoryIndexReader.clone(boolean openReadonly) calls
        doReopen(SegmentInfos infos, boolean doClone, boolean openReadOnly)
        which is an abstract method that in SegmentReader and
        MultiSegmentReader reopens the segments? The segment infos for a
        ReaderIW is obtained from IW, which is how it knows about the new
        segments. Perhaps not desired behavior?

        OK, I think it does not reopen existing segments. Meaning, if a
        segment is in common w/ old and new, it truly clones it (does not
        reopen norms nor del). But if there is a new segment that did not
        exist in old, it opens a whole new segment reader? I'll commit an
        assert that this doesn't happen – if the caller passes in
        "doClone=true" then the caller should not have passed in a
        segmentInfos with changes? Else the reader is on thin ice (a
        mismatch between what's in RAM and what SegmentInfo says).

        > do we need delete by docID once we have realtime search? I
        > think the last compelling reason to keep IR's delete by docID was
        > immediacy, but realtime search can give us that, from IW, even when
        > deleting by Term or Query?

        Good point! I think we may want to support it, but for now it
        shouldn't be necessary. I'm thinking of the case where someone is
        using the field cache (or some variant), performs some sort of query
        on it and then needs to delete based on doc id. What do they do?
        Would we expose a callback mechanism where a deleteFrom(IndexReader
        ir) method is exposed and deletes occur at the time of the IW's
        choosing?

        Wouldn't delete-by-Query cover this? Ie one could always make a
        Filter implementing the "look @ field cache, do some logic, provide
        docIDs to delete", wrap as Query, then delete-by-Query?

        > It seems like calling reader.reopen() (on reader obtained
        > from writer) should basically do the same thing as calling
        > writer.getReader(). Ie they are nearly synonyms? (Except for small
        > difference in ref counting - I think writer.getReader() should always
        > incRef, but reopen only incRefs if it returns a new reader).

        Perhaps ReaderIW.reopen will call IW.getReader underneath instead of
        using IR's usual mechanism.

        Right, that's what I'm thinking. Once you've obtained reader coupled
        to a writer, you can then simply reopen it whenever you want to see
        (materialize) changes done by the writer.

        We still need a solution for the "warm the just merged
        segment"... else we will not be realtime, especially when big merge
        finishes. It seems like after merge finishes, it should immediately
        1) open a SegmentReader on the new segment, 2) invoke the method you
        passed in (or you subclassed – not sure which), 3) carry over deletes
        that materialized during the merge, 4) commit the merge (replace old
        segments w/ new one).

        Michael McCandless added a comment -

        I suppose if I know I'm not going to be changing my index, I can still just get a read-only IR, right?

        Right, I think we still want to allow opening a standalone (uncoupled)
        reader.

        Then, everyone has a single point of entry for both writers and readers and all of this stuff can just be done through package private methods on the IW and it allows us to change things if we decide otherwise and it means that the IW is not coupled with the IR publicly.

        I'm torn... the IndexAccessor would need to expose many variants to
        carry over all the options we now have (String or File or Directory,
        IndexCommit or not, IndexDeletionPolicy or not, create or not). It
        will end up exposing a number of new methods... and, would it try to
        be "smart" (like IndexModifier, and the LuceneIndexAccessor class in
        LUCENE-390), keeping track of references to the readers it's handed
        out, etc.? Or is it simply a pass-through to the underlying
        open/ctors we have today?

        The alternative (as of right now, unless we are missing something
        further with these changes) is adding one method to IndexWriter,
        getReader, that returns a readOnly IndexReader, "coupled" to the
        writer you got it from in that it's able to search un-committed
        changes and if you reopen it, writer will materialize all changes and
        make them visible to the reopened reader.

        I guess so far I don't really see why this small (one method) API
        change merits a switch to a whole new accessor API for creating
        readers & writers on an index? Maybe there is a
        straw-that-breaks-the-camel's-back argument that I'm missing...

        Grant Ingersoll added a comment -

        Good points, Mike, but maybe we don't need all those variants?
        String, File and Directory are all easily enough collapsed down to
        just Directory.

        new IndexWriter(new Directory(indexFile));
        

        Additionally, there are no more variants than there already are on the IW and IR, right?

        As for pass-through or not, I think it would just pass-through, at least initially, but it certainly leaves open the possibility for reference counting in the future if someone wants to implement that.

        As someone who teaches people these APIs on a regular basis, I feel pretty confident in saying that adding an IR to the IW as a public API is going to confuse a good chunk of people just as the delete stuff on the IR currently does now. You wouldn't ask FileWriter for a FileReader, would you? I don't see why it would be good to ask a IW for an IR, API-wise (I get why we are doing this, it makes sense).

        Likewise, isn't it just as logical to ask for an IW from an IR? If I have an IR already and I want it to be aware of the writes I want to do, why wouldn't we then add IR.getIW()? And then we can have total circular dependency.

        Michael McCandless added a comment -

        maybe we don't need all those variants? String, File and Directory are all easily enough collapsed down to just Directory.

        new IndexWriter(new Directory(indexFile));
        

        (You'd presumably need to close that Directory). But, yeah, we may
        be able to drop some of them, although I do think they are
        convenient for new users of Lucene. And forcing users to switch to
        a totally new yet pass-through API on upgrade, but not giving them
        a one-to-one carryover, is not very nice.

        Additionally, there are no more variants than there already are on the IW and IR, right?

        Right, I'm just saying IndexAccessor will have many methods. And then
        you're asking every app to make this switch, on upgrade. It's a lot of
        API swapping/noise vs. a single added expert method to IW.

        As for pass-through or not, I think it would just pass-through, at
        least initially, but it certainly leaves open the possibility for
        reference counting in the future if someone wants to implement that.

        If we think it'll be more than just a pass-through, we should try to
        hash out, somewhat, what it will & won't do up front (changing it
        later is a big change)? And we should start from LUCENE-390.

        As someone who teaches people these APIs on a regular basis, I feel
        pretty confident in saying that adding an IR to the IW as a public API
        is going to confuse a good chunk of people just as the delete stuff on
        the IR currently does now.

        But this will be an expert/advanced API, a single added method to IW.
        I wouldn't expect users to be confused: on upgrade I think most users
        will not even notice its existence!

        You wouldn't ask FileWriter for a FileReader, would you?

        I'm not sure that's the right comparison – Lucene's IW does far more
        than a FileWriter. And the fact that Lucene allows "point in time"
        searching (which is very useful and rather unique) is a very big
        difference vs FileReader/Writer.

        Likewise, isn't it just as logical to ask for an IW from an IR?

        I don't think so: the functionality is not symmetric, because Lucene
        allows only one writer open at a time, but many readers (eg on
        different commits). Since a writer is the one making changes, it
        makes sense that you'd ask it, right now, to give you a reader
        reflecting all changes up to that point. And call it again later to
        get a reader seeing changes after that, etc.

        Michael McCandless added a comment -

        Or, here's an idea: can we do both? Put IndexAccessor as an optional
        "convenience" layer that simplifies the ctors and expert methods of IW
        & IR, but leave public direct access to the ctors & expert methods?
        This way on upgrade nobody is forced to migrate to an entirely new yet
        merely pass-through API?

        Or another idea is to decouple these two discussions: go ahead and add
        the single expert method to IW, but, as a separate discussion/JIRA, work
        out how we can simplify overall access/management of IR/IW instances?

        Jason Rutherglen added a comment -

        Ah yes, a patch from the old directory that needs deleting. Here's the correct one. Sorry about that.

        Grant Ingersoll added a comment -

        Right, I'm just saying IndexAccessor will have many methods. And then
        you're asking every app to make this switch, on upgrade. It's a lot of
        API swapping/noise vs. a single added expert method to IW.

        Sure, but that is already the case w/ IW/IR anyway.

        I agree about the short term noise, but in the long run it seems cleaner.

        But this will be an expert/advanced API, a single added method to IW.
        I wouldn't expect users to be confused: on upgrade I think most users
        will not even notice its existence!

        Hmm, I don't agree, but I guess it depends on the performance hit. If given a choice between the semantics of a reader that sees changes as they are made versus having to do the whole reopen thing, I'm betting most users will say "duh, I want to see my changes right away" and choose the IR that is synced w/ the IW, b/c that is what people think is the logical thing to happen and it is how DBs work, which many devs are used to. As an app developer, if I don't have to think about IR lifecycle management, why would I want to, as long as it performs? What this patch is offering, AFAICT, is the removal of IR lifecycle management from the user.

        In other words, my guess is that over time, as the performance proves out, it will be the default choice, not "expert". Now, if you're telling me this is going to be significantly slower even when updates are rare, then maybe I would stick to the current lifecycle, but if there isn't much difference, I'll take the one that pushes the lifecycle complexity down into Lucene instead of in my app.

        Yonik Seeley added a comment -

        a reader that sees changes as they are made versus having to do the whole reopen thing

        It's hard keeping up with the current proposal in big issues/threads, but I don't think anyone is proposing a reader that automatically sees changes... i.e. the view of an IndexReader instance will still be fixed.

        Michael McCandless added a comment -

        It's hard keeping up with the current proposal in big issues/threads, but I don't think anyone is proposing a reader that automatically sees changes... i.e. the view of an IndexReader instance will still be fixed.

        That's right. The current proposal is to add one method to IW:

        IndexReader getReader()
        

        that returns a point-in-time view of the index plus all changes
        buffered in IW up until that point. Then you can reopen that reader
        (or call getReader() again, which does the same thing) to quickly get
        a new point-in-time reader.

        I think the point-in-time semantics is important to keep.

        Also, you can't easily emulate point-in-time if we implemented the
        "live" approach, but you can easily do vice versa (assuming we can
        keep reopen() time fast enough).

        EG the IndexAccessor convenience layer could do automatic reopening so
        that when you ask it for the reader it always reopens it; this would
        emulate "live updates" and hide the lifecycle management.

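        As a sketch of the proposed semantics (assuming dir, analyzer, and doc already exist, and that getReader() behaves as described above; imports omitted):

        IndexWriter writer = new IndexWriter(dir, analyzer,
            IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addDocument(doc);

        // Point-in-time view: sees all changes buffered in the writer so far.
        IndexReader r1 = writer.getReader();

        writer.deleteDocuments(new Term("id", "42"));

        // r1 still reflects the old point in time; reopen (or call
        // getReader() again) to get a new point-in-time reader.
        IndexReader r2 = r1.reopen();
        if (r2 != r1) {
          r1.close();
        }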
        Michael McCandless added a comment -

        I guess it depends on the performance hit.

        It's challenging to implement truly live updates w/ decent
        performance: I think we'd need to build a reader impl that can
        search the DocumentsWriter buffer.

        Whereas the approach (patch) here is actually quite simple (all the
        hard work was already done – IndexReader.reopen,
        collection/sorting/filtering by segment, etc.).

        In other words, my guess is that over time, as the performance proves out, it will be the default choice, not "expert".

        I agree: realtime search will likely be a popular feature once we
        finish it, release it, and it proves stable, performant, etc. Eventually
        (maybe soon) it should be made the default.

        I think IndexAccessor makes a lot of sense, but it's a big change and
        I'd rather not couple it to this issue. There are many questions to
        be hashed out (under a new issue): is it a simple pass-through? Or
        does it manage the lifecycle of the readers for you? Does it warm new
        readers? Should it emulate "live" update semantics? Should getReader
        get it from the writer if there is one (ie, making realtime search the
        "default")? Etc.

        Grant Ingersoll added a comment -

        OK, I agree. Let's just mark it as expert/subject to revision and then we're good.

        We can revisit IndexAccessor separately.

        Jason Rutherglen added a comment - - edited

        There are concurrency issues to work out.

        • IW.getReader returns a cloned read only reader
        • Removed IW.reopenReader
        • All test methods pass except testAddIndexesAndDoDeletesThreads, which
          currently merges indexes concurrently (and fails). In the future the
          method will test merging, deleting, and searching concurrently.
        • Concurrent merges fail when "ant test-core" is run
        • DocumentsWriter.applyDeletes deletes again at the SegmentReader level
        Jason Rutherglen added a comment -
        • Added segmentReaders, segmentReaderClone to MergePolicy.OneMerge
        • mergeFinish decRefs the OneMerge.segmentReaders
        • _mergeInit clones segmentreaders
        • commitMergedDeletes uses OneMerge.segmentReaders instead of loading
          bitvectors from the directory
        • testAddIndexesAndDoDeletesThreads fails with 20 fewer documents (80 vs. 100)
          (an index that was supposed to be added isn't showing up)
        • Getting exceptions from org.apache.lucene.TestSnapshotDeletionPolicy, such as:
        Caused by: java.io.IOException: read past EOF
            [junit] 	at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
            [junit] 	at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
            [junit] 	at org.apache.lucene.store.IndexOutput.copyBytes(IndexOutput.java:172)
            [junit] 	at org.apache.lucene.index.TermVectorsWriter.addRawDocuments(TermVectorsWriter.java:185)
            [junit] 	at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:447)
            [junit] 	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:145)
            [junit] 	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4823)
            [junit] 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4408)
            [junit] 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:218)
            [junit] 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:274)
        
        Michael McCandless added a comment -

        I think your latest patch is missing the new unit test source.

        Michael McCandless added a comment -

        Woops, sorry – I downloaded the wrong patch.

        Michael McCandless added a comment -

        I think we can simplify the approach here:

        • I don't think IW should track an "internal reader" at all.
          Instead, it tracks a pool of held-open and re-used SegmentReader
          instances. Maybe make a new SegmentReaderPool private class. It
          exposes get() to load a SegmentReader, returning the pooled one if
          present, else falling back to the normal SegmentReader.get (see
          the sketch after this list).
        • Then, the many places where we currently pull a segment reader (to
          apply deletes, to do a merge, to carry over new deletes after
          merge finishes) as well as the new getReader() method, should use
          this pool to get the reader.
        • We would never mergeSegmentInfos, never ask the "internal reader"
          to commit, etc.

        Other things:

        • Since IndexReader will always be readonly, you can simplify the
          new DirectoryIndexReader.open method, eg we don't need to copy
          over the writeLock nor the DeletionPolicy, and we don't need the
          readOnly param, and closeDirectory is always false. In fact we
          could simply create a ReadOnly{Multi,}SegmentReader directly.

        • I think IndexReader.clone() should not grab the writer's
          segmentInfos? Ie it should truly be a clone(), not a reopen.
          Actually, when the reader is created we should make a private full
          copy of the SegmentInfos.
        • We lost some "private" modifiers, unnecessarily; eg
          DirectoryIndexReader.writeLock.
        • I don't understand why we need merge.segmentReadersClone? If we
          only use it to detect new deletions for carrying over deletes after
          merge finishes, we should instead just grab the delCount when the
          merge kicked off?
        • I think the only reason for reader to hold a reference to writer
          is so that on reopen, the reader realizes it was created from
          IW.getReader and simply forwards the request to IW. Otherwise
          writer should not be used in reader.
        • Once we no longer store/maintain/merge an "internal" reader,
          IW.getReader simply becomes a call to IndexReader.open, with two
          differences: 1) we pass in a SegmentInfos instance (instead of
          looking for last segments_N in dir), and 2) we pass in our own
          SegmentReaderPool that should be used to open the SR's.
        • You need to add mainDir.close() in
          testAddIndexesAndDoDeletesThreads.
        Jason Rutherglen added a comment -

        > make a new SegmentReaderPool private class

        I'd prefer the SegmentReaderPool model over the patch's existing one,
        as it is simpler and closer to how the underlying system actually
        works, meaning it works directly with the segments in a systematized way.

        > We would never mergeSegmentInfos, never ask the "internal reader"
        to commit

        Good, merging the segmentInfos is confusing and tricky to debug

        > Since IndexReader will always be readonly, you can simplify the new
        DirectoryIndexReader.open method

        +1

        > why we need merge.segmentReadersClone?

        I was modeling after the segmentInfosClone. If it's not necessary
        I'll happily remove it.

        > I think the only reason for reader to hold a reference to writer is
        so that on reopen, the reader realizes it was created from
        IW.getReader and simply forwards the request to IW

        +1

        > Wouldn't delete-by-Query cover this? Ie one could always make a
        Filter implementing the "look @ field cache, do some logic, provide
        docIDs to delete", wrap as Query, then delete-by-Query? (from a
        previous comment)

        Yes that will work.

        How are these sample SegmentReaderPool method signatures?

        private class SegmentReaderPool {
          public SegmentReader get(SegmentInfo si) { /* ... */ }

          public void decRef(SegmentReader sr) { /* ... */ }
        }
        
        Jason Rutherglen added a comment -

        In SegmentInfo, its hashCode only takes into
        account the name and directory; why not the delete generation?

        Michael McCandless added a comment -

        How are these sample SegmentReaderPool method signatures?

        Looks good for starters. decRef would simply pass through to the SR's
        normal decRef, but would then also decRef its copy if the ref count has
        dropped to 1?

        In SegmentInfo, its hashCode only takes into
        account the name and directory; why not the delete generation?

        Two SegmentInfo instances are considered equal if the dir is == and
        the name is .equals, so hashCode must match that (be the same value
        when equals returns true).

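        Building on the pool sketch earlier in this thread, that decRef behavior might look like the following (hypothetical; that getRefCount() reports the reader's current reference count is an assumption):

        // Pass through to the SR's normal decRef; once only the pool's own
        // reference remains, drop the reader from the pool and release that
        // reference too, letting the SR close.
        public synchronized void decRef(SegmentReader sr) throws IOException {
          sr.decRef();
          if (sr.getRefCount() == 1) {
            readerMap.values().remove(sr);
            sr.decRef();
          }
        }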
        Jason Rutherglen added a comment -
        • Added SegmentReaderPool that maintains a map of SegmentInfoKey ->
          SegmentReader. SegmentInfoKey returns a hashcode of a SegmentInfo
          that also includes the deletes and norms generations.
          SegmentReaderPool also maintains a map of SegmentReaders that
          represent "working copies" (cloned readers) for realtime deletes
          occurring. When IW.getReader is called the "working copy" readers
          replace the existing readers keyed with the same segmentinfo in the
          SegmentReaderPool. This way IW.deleteDocument applies to the
          in-memory cloned "working copies" but does not alter SegmentReaders
          already returned from SegmentReaderPool.
        • The test methods in TestIndexWriterReader pass
        • I'm hesitant to automatically flush on IW.getReader as IW.flush has
          a number of parameters. However for updateDocument to work properly
          we'll probably need to do this?
        • DocumentsWriter.applyDeletes should no longer need to apply deletes
          as they are already applied in realtime in IW.deleteDocument? It
          should only need to call SR.commitChanges?
        • "ant test-core" fails in ConcurrentMergeScheduler with the same
          exception as above in TestSnapshotDeletionPolicy (and some others). I
          suspect in the case of TestSnapshotDeletionPolicy the errors could be
          due to the SegmentReader holding files open while they are being
          copied. Perhaps this would explain the EOFExceptions?
        Jason Rutherglen added a comment -
        • Fixed the basic concurrency merge issue (TestSnapshotDeletionPolicy passes)
        • Many unit tests fail due to other reasons
        Jason Rutherglen added a comment -
        • TestAddIndexesNoOptimize test passes
        • There's an incRef issue still
        Michael McCandless added a comment -

        Jason, could you rebase the patch on current trunk? (Or post the revision you're on?)

        Jason Rutherglen added a comment -
        • Rebased to trunk
        • Still an incRef issue.
          TestIndexWriterReader.testDeleteFromIndexWriter fails.
          TestConcurrentMergeScheduler fails, as well as others
        • Removed the working copy stuff from SegmentReaderPool as the
          deletes need to be applied to the internal readers; the reader
          returned by IW.getReader is a clone of the internal SRs.
        Jason Rutherglen added a comment -

        We'll need a method to check the SegmentReaderPool for readers that
        may be removed. It seems it would check IW.segmentInfos, pending
        merges, and current merges for SegmentInfos that are still in use;
        all others in the pool would then be expunged. I'm not sure if this
        method would need to run in a background thread.

        Michael McCandless added a comment -

        Looks good! A few notes:

        • The SegmentReaderPool class should be package private.
        • You're still passing readOnly & closeDirectory into the new
          DirectoryIndexReader ctor; I think you can pass in only the
          writer, and then it can get all it needs from that (pool, dir,
          segmentInfos)?
        • Hmm... I think we should change commitMergedDeletes to use the
          already computed doc ID maps, instead of loading the old deletes.
          This way you don't need the old SR.
        • I think you don't need SegmentInfoKey? Can't you key off of just
          the SegmentInfo? I think for each segment name we only ever need
          one SR instance in the pool (dir is always the same)? (The
          readers we give out will make clones of these).
        • Why do you applyDeletesImmediately? I think we should
          buffer/flush like we normally do? That code shouldn't need to
          change.
        • I think getReader does need to flush, else added docs won't be
          seen in the reopened reader. You should call flush(true, true,
          true) – because you want to triggerMerges, flushDocStores,
          and flushDeletes.
        • Are segmentInfosList/printSegmentInfosList temporary debugging?
          Can you stick "nocommits" on them?

        For SegmentReaderPool:

        • I think IW should always use the pool API internally to load &
          decRef the SRs it uses.
        • If getReader has never been called, then the pool should not hold
          refs, ie IW should behave exactly as it does today (get SR, use
          it, commit it, close it). Ie, the pool should detect on decRef
          that the RC is now 1, and release the SR.
        • If getReader has been called, then the pool should hold onto an SR
          as long as the writer's segmentInfos still references that
          segment. This means, after a merge commits we should prune the
          pool. Ie, we decRef and remove the SR from the pool, but we don't
          close it since a reader may still be using it. Maybe add a
          prune() method? (A rough sketch follows this list.)
        • Somehow, we should not let the SR commit itself on close – that's
          up to IW to decide. If getReader has not been called then IW must
          commit deletes as soon as it applies them (like today). Else, it's
          only on closing the pool that deletes are committed. EG pending
          deletes on SRs that have been merged can be freely discarded,
          since those deletes "made it" into the newly merged segment.
        • So I don't think we need a BG thread to do the culling.
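        A rough sketch of such a prune() method, under the assumptions above (readerMap is the hypothetical pool map from the earlier sketches; the java.util.Iterator import is omitted):

        // After a merge commits, drop pooled readers whose segments are no
        // longer referenced by the writer's SegmentInfos; decRef but don't
        // close them, since an external reader may still be using them.
        synchronized void prune(SegmentInfos infos) throws IOException {
          Iterator<Map.Entry<SegmentInfo, SegmentReader>> it =
              readerMap.entrySet().iterator();
          while (it.hasNext()) {
            Map.Entry<SegmentInfo, SegmentReader> entry = it.next();
            if (!infos.contains(entry.getKey())) {
              it.remove();
              entry.getValue().decRef();
            }
          }
        }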
        Jason Rutherglen added a comment -

        I think we should change commitMergedDeletes to use the
        already computed doc ID maps, instead of loading the old deletes.

        Where do I get the already computed doc ID maps?

        I think getReader does need to flush, else added docs won't be
        seen in the reopened reader. You should call flush(true, true,
        true) - because you want to triggerMerges, flushDocStores,
        and flushDeletes.

        I added a flushDeletesToDir flag that defaults to true except for IW.getReader.

        Jason Rutherglen added a comment -

        Found the doc ID map in SegmentMerger in IW.commitMerge

        Jason Rutherglen added a comment -
        • commitMergedDeletes uses the docIdMap from commitMerge SegmentMerger to map new deletes to the newly merged segment
        • Removed applyDeletesImmediately
        • Haven't solved the decRef issue which appears in TestIndexWriterReader.testDeleteFromIndexWriter. It doesn't show up in the debugger.
        [junit] java.lang.AssertionError
            [junit] 	at org.apache.lucene.index.SegmentReader$Ref.incRef(SegmentReader.java:103)
            [junit] 	at org.apache.lucene.index.SegmentReader.incRef(SegmentReader.java:341)
            [junit] 	at org.apache.lucene.index.IndexWriter$SegmentReaderPool.get(IndexWriter.java:647)
            [junit] 	at org.apache.lucene.index.IndexWriter$SegmentReaderPool.getReadOnlyClone(IndexWriter.java:615)
            [junit] 	at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:86)
            [junit] 	at org.apache.lucene.index.ReadOnlyMultiSegmentReader.<init>(ReadOnlyMultiSegmentReader.java:35)
            [junit] 	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:396)
            [junit] 	at org.apache.lucene.index.TestIndexWriterReader.testDeleteFromIndexWriter(TestIndexWriterReader.java:159)
            [junit] 	at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
        
        
        Jason Rutherglen added a comment -
        • The patch is not committable as the debugging and comments need cleaning
        • The clone issue was hacked around so more tests pass
        • TestAtomicUpdate fails
        Jeremy Volkman added a comment -

        I noticed the comments about IW.getReader() flushing the current state of the writer, launching merges, etc. Does this mean that the point of interaction between index writes and reads will still be the Directory? (i.e. the disk)

        Michael McCandless added a comment -

        Does this mean that the point of interaction between index writes and reads will still be the Directory? (i.e. the disk)

        For added docs, yes. But deletions will carry straight through in memory.

        Jason Rutherglen added a comment -
        • Most tests pass. Because SegmentReaders are held by the writer,
          files remain open whereas before this patch they did not. This causes
          some tests to fail.
        • I'm working on a way to remove unreferenced segments from the pool
        Jason Rutherglen added a comment -
        • In IW.checkpoint, SegmentReaderPool.closeUnusedSegmentReaders is
          called, which gathers a set of all currently active segment names and
          closes the segment readers not in the set. When the IW is closed an
          assertion checks to ensure the IW.segmentInfos names are the same as
          those in the pool.
        • A bunch of private variables were made package-protected and need to
          be changed back
        Jason Rutherglen added a comment -
        • In TestConcurrentMergeScheduler.testNoWaitClose I'm seeing a couple
          of exceptions. I'm not quite sure what to make of them. The
          testNoWaitClose method calls IW.close(false) a number of times on an
          IW that uses ConcurrentMergeScheduler.
        [junit] Caused by: java.lang.AssertionError
            [junit] 	at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:553)
            [junit] 	at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:470)
            [junit] 	at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:481)
            [junit] 	at org.apache.lucene.index.IndexWriter.decrefMergeSegments(IndexWriter.java:4485)
            [junit] 	at org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:4473)
            [junit] 	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4910)
            [junit] 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4539)
            [junit] 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:218)
            [junit] 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:274)
        
        [junit] java.lang.AssertionError
            [junit] 	at org.apache.lucene.index.IndexWriter.finishMerges(IndexWriter.java:3350)
            [junit] 	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2090)
            [junit] 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2045)
            [junit] 	at org.apache.lucene.index.TestConcurrentMergeScheduler.testNoWaitClose(TestConcurrentMergeScheduler.java:211)
            [junit] 	at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
        
        Jason Rutherglen added a comment -

        The IndexFileDeleter$RefCount.DecRef exception is related to an incorrect ref count of .del files when TestConcurrentMergeScheduler is run.

        Jason Rutherglen added a comment -

        Because the patch does not flush deletes to disk like the existing
        code does, the SegmentInfo delGen etc. isn't updated at the same
        points as expected. Tracing the code shows that the .del files that
        reach 0 refs and are then decRefed (throwing the assertion error)
        were never incRefed by IndexFileDeleter. This is because the deletes
        are carried over in the IW's internal readers. Eventually the .del
        files are written but aren't incRefed as they usually would be. Is
        there a good solution to this? Do we need an interim SegmentInfo
        variable that indicates future flushed deletions?

        Michael McCandless added a comment -

        I added a flushDeletesToDir flag that defaults to true except for IW.getReader.

        I think that should not be necessary. On releasing a reader, if it
        has pending changes and we want to drop it from the pool, we should
        move its deletes to the directory? (A rough sketch follows at the
        end of this comment.)

        SegmentMerger should clear the hasChanges on the reader after it's done
        merging, so we don't bother saving delete files for segments that were
        merged away.

        In TestConcurrentMergeScheduler.testNoWaitClose I'm seeing a couple of exceptions.

        I think the first is the root cause (and causes the 2nd one).

        Which test case hits that? The test cases that intentionally throw
        exceptions at interesting times are especially fun to debug

        Because the patch does not flush deletes to disk like the existing
        code does, the SegmentInfo delGen etc isn't updated at the same
        points as expected.

        IFD should be fine with this – there must be something else at play,
        causing us to over-decRef.

        Other notes:

        • You still have SegmentInfoKey (see above)
        • The new DirectoryIndexReader.open should just take IndexWriter
          (see above)?
        • Rather than the getSegmentsInUse "polling" approach, I think you
          can incrementally add & remove readers from the pool? EG after a
          merge commits, drop the readers for the just-merged segments and
          add a reader for the newly merged segment (and using it to hold
          the deletes carried over in commitMergedDeletes). I think for
          just-flushed segments you can wait for a getReader() call to
          happen again, to init their readers, unless something else does
          first (flushing deletes, merging)?
        • You commented out the last part of commitMergedDeletes, that
          actually saves the deletes. You need to instead get the reader
          for the merged segment from the pool and hand it the new deletes.
        • We lost the cutover to using the already computed docMap in
          commitMergedDeletes? We should put that back – it saves having
          to read in the prior deletions from disk.
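        Extending the earlier decRef sketch along these lines (hypothetical only; hasChanges and commitChanges() are the SR members referred to elsewhere in this thread, and getRefCount() is assumed as before):

        // Release a reference; once only the pool's reference remains, write
        // any pending deletes to the directory before dropping the reader.
        public synchronized void release(SegmentReader sr) throws IOException {
          sr.decRef();
          if (sr.getRefCount() == 1) {
            if (sr.hasChanges) {
              sr.commitChanges();  // move pending .del changes to disk
            }
            readerMap.values().remove(sr);
            sr.decRef();
          }
        }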
        Jason Rutherglen added a comment -
        • TestConcurrentMergeScheduler is passing
        • A handful of other tests don't pass, looking into them next
        • DirectoryIndexReader.open with a writer isn't necessary so the
          related method is removed
        • I'm not sure we need to write out the deletes of unused segment
          readers because they are no longer used and so should not be
          required. I realized the unused readers need to also check with
          the deletion policy?
        • The TestConcurrentMergeScheduler.testNoWaitClose exception is
          caused by the way .del files are created when SR.commitChanges is
          called, altering the SR's SegmentInfo without IFD incRefing the .del
          file. The solution: manually incRef the .del file after the pool
          commits on the SRs. Before, this process was handled transparently
          by IW.checkpoint. The other issue was that the SegmentInfo files
          variable was created before the delGen change, so the .del file was
          not being returned by SI.files(). With the .del file being returned,
          the IFD-related exception went away.
        • SegmentInfoKey is removed, SegmentInfo is used as the key for
          SegmentReaderPool.get

        > You commented out the last part of commitMergedDeletes, that
        actually saves the deletes. You need to instead get the reader for
        the merged segment from the pool and hand it the new deletes.

        I wasn't sure what you meant by this; in the patch the deletes are
        copied into the merged reader. Do you mean instead that the merged
        reader should not be opened and the deletes file needs to be written
        to?

        > We lost the cutover to using the already computed docMap in
        commitMergedDeletes?

        SegmentMerger wasn't always returning the docMap, so I stopped using
        it.

        Jason Rutherglen added a comment -

        I was a little overzealous in altering the flushDeletesToDisk flag in IW.flush.

        Michael McCandless added a comment -

        Thanks Jason, good progress...

        I'm not sure we need to write out the deletes of unused segment
        readers because they are no longer used and so should not be
        required.

        EG I think DocumentsWriter.applyDeletes should get the reader from the
        pool, do the deletes, then call pool.release(reader), and that release
        would write the changes to disk if we're not pooling readers. Then we
        don't need a new "flushDeletesToDir" flag propagated around.

        On an explicit commit(), we should also sweep the pool and write
        changes to disk for any SR that has pending changes.
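
        A minimal sketch of that flow (a fragment; readerPool and its
        get/release signatures are illustrative, not the patch's actual code):

        // Hypothetical sketch: applyDeletes checks a reader out of the pool,
        // buffers the deletes on it, and releases it. release() is what
        // writes the new .del file when readers aren't being pooled, so no
        // separate flushDeletesToDir flag has to be threaded through.
        void applyDeletes(SegmentInfo info, Term[] deleteTerms) throws IOException {
          SegmentReader reader = readerPool.get(info);  // incRef'd for the caller
          try {
            for (int i = 0; i < deleteTerms.length; i++) {
              reader.deleteDocuments(deleteTerms[i]);   // in-memory only, for now
            }
          } finally {
            readerPool.release(reader);                 // persists changes if not pooling
          }
        }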

        > You commented out the last part of commitMergedDeletes, that
        actually saves the deletes. You need to instead get the reader for
        the merged segment from the pool and hand it the new deletes.

        I wasn't sure what you meant by this; in the patch the deletes are
        copied into the merged reader. Do you mean instead that the merged
        reader should not be opened and the deletes file needs to be written
        to?

        Woops – sorry: you are doing it correctly (applying to the
        mergedReader). I missed that.

        SegmentMerger wasn't always returning the docMap, so I stopped using it.

        It should always return it, unless there were no deletes on that
        segment when the merge started, in which case the docMap is null. But
        you can handle that case (just check if there are now any deletes, and
        carry them over).

        (Then we shouldn't need merge.segmentReadersClone, again).

        The solution is to manually incRef the .del file after the pool
        commits on the SRs.

        That's spooky – why exactly is it needed? We should only do an extra
        incRef if we can explain exactly which decRef it will correspond to.
        Shouldn't IW.checkpoint still work properly for the .del files?

        Also, SegmentInfo.files should be cleared whenever delGen is advanced
        – can you give more details on what path doesn't properly clear the
        files?

        The fact that an SR can carry deletes in memory without writing a new
        .del file should not impact IFD. Whenever we do advance delGen and
        write new .del files, we must call checkpoint().
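
        Roughly, the invariant is (sr, deleter and segmentInfos here are
        illustrative stand-ins for IW's fields):

        // Hypothetical sketch: committing an SR's in-memory deletes advances
        // delGen and writes a new .del file; checkpoint() must follow so that
        // IFD incRefs the new files and decRefs the old ones.
        if (sr.hasChanges) {
          sr.commitChanges();                       // advances delGen, writes .del
          deleter.checkpoint(segmentInfos, false);  // keeps refcounts consistent
        }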

        So I think we need to really explain the root cause here...

        Others:

        • You overrode decRef in DirectoryIndexReader, to not write changes
          whenever a writer is present, but I think that a better approach
          is to leave decRef as it is and then from writer, clear hasChanges
          when you want to discard them (because they were merged away).
        • You inserted applyDeletes at the top of commitMergedDeletes – why
          was that needed?
        • There's a lot of noise in the patch – whitespace changes,
          commented out debug code, etc. Can you remove some of it? It
          makes it harder to separate signal from noise...
        Jason Rutherglen added a comment - edited

        Here's a summary of which tests are failing, grouped by the
        guessed cause:

        • TestDeletionPolicy, TestIndexFileDeleter, and TestIndexReader fail (not
          all methods, but they seem to be related, i.e. files are being left
          open that should not be)
        • In TestIndexWriter most test methods pass; however, a few such as
          testCommitOnCloseDiskUsage fail.
          TestIndexWriterDelete.testUpdatesOnDiskFull fails for reasons
          presumably similar to TestIndexWriter.testCommitOnCloseDiskUsage
        • TestTransactions and TestStressIndexing2 fail. At first glance I'm
          not sure why
        Jason Rutherglen added a comment -

        On IW.close -> SegmentReaderPool.flushChanged ->
        SegmentReader.commitChanges was being called on SRs that had no
        changes. TestDeletionPolicy and TestIndexFileDeleter now pass.

        Jason Rutherglen added a comment -

        TestIndexReader.testDocsOutOfOrderJIRA140 fails because IW.close
        isn't called before dir.close. Is this a bug in the unit test?

        Jason Rutherglen added a comment -

        TestIndexWriterDelete.testOperationsOnDiskFull fails with
        MockRAMDirectory.close (still open files) because IW.close isn't
        called.

        TestIndexWriter.testImmediateDiskFullWithThreads fails because
        IW.close fails on the disk full exception. Should IW.closeInternal ->
        SegmentReaderPool.close be placed in the finally clause?

        TestTransactions fails because the purposely injected random failure
        occurs during IW.close (SRP.close specifically)

        The TestStressIndexing2 failure could be tricky

        Michael McCandless added a comment -

        TestIndexReader.testDocsOutOfOrderJIRA140 fails because IW.close
        isn't called before dir.close. Is this a bug in the unit test?

        Yes, this looks like a bug in the test. This also means when we eventually commit this, we'll have to first fix that bug on the back-compat tests branch.

        TestIndexWriterDelete.testOperationsOnDiskFull fails with
        MockRAMDirectory.close (still open files) because IW.close isn't
        called.

        I don't understand: that test looks like it does call IW.close for all IW's opened? (It's a little tricky, because modifier.close gets called the 2nd time the for(int x = 0...) loop runs).

        TestIndexWriter.testImmediateDiskFullWithThreads fails because
        IW.close fails on the disk full exception. Should IW.closeInternal ->
        SegmentReaderPool.close be placed in the finally clause?

        Does the 2nd call to close (close(false)) also hit an exception? Perhaps modify the test so that if that 2nd close hits an exception, it calls abort?

        It's good that you're down to mainly the exceptions-based test failures... though I think you should focus more on the bigger structural changes to the approach (eg switching back to docMap for merging deletes; adding SegmentReaderPool.release, which should write changes to the dir, and calling it from all places that do a .get(); and the other comments above) before trying to get all tests to pass.

        Jason Rutherglen added a comment -

        On an explicit commit(), we should also sweep the pool and
        write changes to disk for any SR that has pending changes.

        I created SegmentReaderPool.commitAll which commits changes for all
        SRs in the pool in IW.startCommit.

        applyDeletes at the top of commitMergedDeletes

        Removed

        switching back to docMap for merging deletes

        Where should I get the numDeletedDocs from? (Used for docUpto +=
        docCount - previousReader.numDeletedDocs()) Should the entire
        docIdMap be scanned? Is there an expense in cloning the segment
        readers besides the extra bitvectors?

        • SegmentReaderPool.release is implemented instead of using
          reader.decRef. I think you're saying put this patch's decRef logic in
          the SRP.release method?
        • Added IW.close to TestIndexReader.testDocsOutOfOrderJIRA140
        • TestTransactions may sometimes be failing legitimately during the
          prepareCommit; the exception can't be reliably reproduced, but perhaps
          another test case can be written that does
        • TestIndexWriterDelete.testErrorAfterApplyDeletes fails due to
          IW.commit not throwing an expected exception
        Michael McCandless added a comment -

        Jason can you resync to trunk?

        Michael McCandless added a comment -

        Where should I get the numDeletedDocs from?

        SegmentMerger.getDelCounts()

        Should the entire docIdMap be scanned?

        Yes, when the before/after delCount is different.

        Is there an expense in cloning the segment readers besides the extra bitvectors?

        Copy-on-write of the bitvectors (time and space) is the biggest cost I
        think. Since docMap already has everything we need, I think we should
        use it... we could even separate out this change and do it first, if
        you want.
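
        A rough sketch of the docMap-based carry-over (getDocMap and the
        surrounding names are illustrative, not the actual patch):

        // Hypothetical sketch: carry deletes made while the merge was running
        // over to the merged segment, using the docMap SegmentMerger computed.
        int docUpto = 0;
        for (int i = 0; i < sourceSegments.length; i++) {
          SegmentReader previous = merge.readersClone[i];         // state at merge start
          SegmentReader current = readerPool.get(sourceSegments[i]);
          int[] docMap = merger.getDocMap(i);                     // null if no deletes at start
          int docCount = previous.maxDoc();
          if (current.numDeletedDocs() != previous.numDeletedDocs()) {
            for (int doc = 0; doc < docCount; doc++) {
              if (current.isDeleted(doc) && !previous.isDeleted(doc)) {
                // docMap already compresses out docs deleted before the merge
                mergedReader.doDelete(docMap == null ? docUpto + doc
                                                     : docUpto + docMap[doc]);
              }
            }
          }
          docUpto += docCount - previous.numDeletedDocs();
        }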

        I think you're saying put this patch's decRef logic in
        the SRP.release method?

        Actually the release method should always commit changes, if there are
        any, when an SR is removed from the pool.

        Then, something higher up (merging, once successful) should clear
        changes when it's safe, so that release doesn't save anything.

        The "decRef that never commits whenever writer is present" in
        DirectoryIndexReader is too low level, I think.
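
        In other words, something like (names illustrative; hasChanges /
        commitChanges as used elsewhere in this thread):

        // Hypothetical sketch: release always persists pending changes when
        // the SR is leaving the pool; a successful merge clears hasChanges
        // first, so the merged-away segment's deletes are simply dropped.
        synchronized void release(SegmentReader sr) throws IOException {
          boolean stillPooled = poolReaders && readerMap.containsKey(sr.getSegmentInfo());
          if (!stillPooled && sr.hasChanges) {
            sr.commitChanges();    // writes the new .del / norms now
            sr.hasChanges = false;
          }
          sr.decRef();
        }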

        Jason Rutherglen added a comment -
        • The patch is updated to trunk; in most tests, "there are still open
          files" failures now occur (maybe it's not related to the latest revision)
        • mergeMiddle isn't synchronized, so if we use a pooled reader
          (instead of a frozen clone) couldn't deletes be applied to the SR as
          merging is happening?
        Michael McCandless added a comment -

        mergeMiddle isn't synchronized, so if we use a pooled reader
        (instead of a frozen clone) couldn't deletes be applied to the SR as
        merging is happening?

        Hmm – good point. You're right, we need to clone the reader up front, so deletions don't change during the merging process.

        In which case let's just stick with the current approach in the patch?
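
        For reference, a sketch of the frozen-clone approach being kept
        (clone() is the reader-cloning support from LUCENE-1314; the
        readersClone field is illustrative):

        // Hypothetical sketch: snapshot deletes before merging by cloning the
        // pooled reader; the clone copy-on-writes its del BitVector, so
        // deletes arriving during the merge don't change what the merger sees.
        SegmentReader live = readerPool.get(mergeInfo);
        SegmentReader frozen = (SegmentReader) live.clone();  // COW deletes/norms
        merge.readersClone[i] = frozen;  // SegmentMerger consumes the frozen view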

        Michael McCandless added a comment -

        Jason, can I have the conch here? (Ie, take your patch and make some changes / iterate.) Are all of your changes included in the last patch?

        Jason Rutherglen added a comment -

        Hi Mike, yes please! I have not had time to work on it. All my changes are in the last patch. I can clean it up, but it would not be until later today.

        Michael McCandless added a comment -

        OK, don't worry about cleaning things up – I'll start from what's here now. Thanks!

        Michael McCandless added a comment -

        New patch attached. All tests pass. Back-compat tests pass with no
        changes.

        I made a number of changes, added asserts, cleaned things up, added
        javadocs, CHANGES entry, etc.

        I try to consistently refer to this as "near real-time search".
        Calling it realtime is overstating it, especially when you compare it
        to what "realtime operating systems" do.

        I also added IndexWriter.setMergedSegmentWarmer(..) and an abstract
        base class so the caller can have a merged segment warmed before it's
        committed to the index. This is important for reducing turnaround
        time after merges are completed; eg it means you can go ahead adding
        new docs/deletes, and opening new readers, while the merged segment is
        being warmed.
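
        For example, a minimal sketch of wiring in a warmer (the warm body is
        illustrative; writer is an open IndexWriter):

        // Hypothetical warming: run a representative search against just the
        // newly merged segment, so norms/FieldCache are loaded before the
        // merge is committed and NRT readers can see the segment.
        writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
          public void warm(IndexReader reader) throws IOException {
            IndexSearcher searcher = new IndexSearcher(reader);
            try {
              searcher.search(new TermQuery(new Term("body", "1")), 10);
            } finally {
              searcher.close();
            }
          }
        });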

        Net/net I think the approach here is sound, and it's a good step
        forward. I think we can get this in for 2.9.

        But, I added a few nocommits that we still need to figure out.

        Otherwise this is close to committable, though I'd like to let it age
        for a while so people can look/test.

        We also still need to pursue in-memory transactional data structures
        for representing deletes/norms (LUCENE-1526 is already open for
        this, and it should not block this issue).

        Next up I plan to run some basic tests to see what the "typical" delay
        is for opening a near-realtime reader...

        Michael McCandless added a comment -

        OK I ran a basic initial test of the latency when opening a near
        real-time reader.

        Using contrib/benchmark, I index wikipedia docs like normal, but then
        I added a NearRealTimeReaderTask, which runs a BG thread that gets a
        new reader from the writer once every N seconds (I used 3).

        Then it does a simple search for term "1" in the body, and sorts by
        the docdate field.

        I measured milliseconds to reopen and to run the search, and plot
        those as a function of index size.
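
        The BG thread's loop is roughly (a sketch; the real task lives in
        contrib/benchmark as NearRealTimeReaderTask, and stop is a flag set
        when the test ends):

        // Hypothetical sketch of the measurement loop: reopen via the writer,
        // then run the fixed sorted search, timing each step separately.
        void searchLoop(IndexWriter writer) throws Exception {
          while (!stop) {
            long t0 = System.currentTimeMillis();
            IndexReader r = writer.getReader();   // the NRT reader from this issue
            long reopenMsec = System.currentTimeMillis() - t0;
            IndexSearcher s = new IndexSearcher(r);
            t0 = System.currentTimeMillis();
            s.search(new TermQuery(new Term("body", "1")), null, 10,
                     new Sort(new SortField("docdate", SortField.STRING)));
            long searchMsec = System.currentTimeMillis() - t0;
            System.out.println("reopen=" + reopenMsec + " msec search=" + searchMsec + " msec");
            s.close();
            r.close();
            Thread.sleep(3000);                   // N = 3 seconds in this test
          }
        }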

        I ran two tests. The first (attached as ssd.png) stores the index on
        an Intel X25M solid-state disk; the second (attached as magnetic.png)
        on a WD Velociraptor.

        Notes:

        • The reopen time is ~700 msec in both cases, and doesn't change
          much as index grows (which is nice).
        • It is quite noisy, likely due to merges committing.
        • I logged (but did not graph) the flush time vs actual reopen time,
          and it's the flush time that dominates. This is good because with
          a slower indexing rate, this flush time would go way down. My
          guess is flushing is CPU bound, not IO bound.
        • The search time ramps up linearly (expected), and also shows
          spikes due to merging. There's one massive spike at the end of
          the SSD one that's odd (did not correspond to reopening after a
          merge, though perhaps during a merge).
        • This is a somewhat overly stressful test because I'm indexing docs
          at full speed, whereas I'd expect that for the typical near realtime
          search app, the docs would usually be trickling in more slowly and
          a reopen would happen after just a few docs.
        • SSD and magnetic look pretty darn similar, though magnetic shows
          more noise and maybe is more affected by merges.
        • I'm doing no deletions in this test, but the typical near
          real-time app presumably would.
        Michael McCandless added a comment -

        Disregard the search time in the above results... we have a sneaky bug
        (LUCENE-1579) that is causing FieldCache to not be re-used for shared
        segments in a reopened reader. This makes the search time after
        reopen far worse than it should be.

        Michael McCandless added a comment -

        New patch: Fixed a few small issues... and made some changes to
        contrib/benchmark to help in running more realistic near real-time
        tests:

        • Fixed LineDocMaker to properly set "docid" primary key field.
        • Added UpdateDocTask that calls IndexWriter.updateDocument,
          randomly picking a docid.
        • Added NearRealTimeReader task, that creates a BG thread that every N
          seconds opens a reader, runs a static search and prints results.

        This patch also contains patch from LUCENE-1579.

        So, using this you can 1) create a large index, 2) create an alg that
        does doc updates at a fixed rate and then tests the near real-time
        reader performance.

        Michael McCandless added a comment -

        OK using the last patch, I ran another near real-time test, using this
        alg:

        
        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
        merge.policy=org.apache.lucene.index.LogDocMergePolicy
        docs.file=/Volumes/External/lucene/wiki.txt
        doc.stored = false
        doc.term.vector = false
        doc.add.log.step=10
        max.field.length=2147483647
        directory=FSDirectory
        autocommit=false
        compound=false
        merge.factor = 10
        ram.flush.mb = 128
        doc.maker.forever = false
        doc.random.id.limit = 3204040
        work.dir=/lucene/work

        { "BuildIndex"
          - OpenIndex
          - NearRealtimeReader(1)
           { "UpdateDocs" UpdateDoc > : 100000 : 50/sec
          - CloseIndex
        }

        RepSumByPrefRound BuildIndex

        It opens a full (3.2M docs, previously built) wikipedia index, then
        randomly selects a doc and updates it (deletes old, adds new) at the
        rate of 50 docs/sec. Then, once per second I open a new reader, do
        the same search (term "1", sorted by date).

        I attached another graph (ssd2.png) with the results, showing reopen &
        search time as a function of how many updates have been done; rough
        comments:

        • Search time is pretty constant ~35 msec, except occasional
          glitches where it goes as high as ~340 msec. Net/net very
          reasonable I think.
        • Search time is remarkably non-noisy, except for occasional
          spikes.
        • Reopen time is also fast (~ 40 msec) but is more noisy.
        • It's not clear the merges are really impacting things that much.
          It could simply be that I didn't run test for long enough for a
          big merge to run. Also, this index has no stored fields nor term
          vectors, so if we added those, merges would get slower.
        • This is a better test than last one, since it's doing some deletes
        • Since I open the writer with autoCommit false, and near real-time
          carries all pending deletes in RAM, no *.del file ever gets
          written to the index.
        Jason Rutherglen added a comment -

        Mike, nice work!

        I will hopefully test this week.

        Michael McCandless added a comment -

        Nice work to you too – I just picked things up where you left off! I'm still working on the nocommits, but I think we can commit this soon.

        Michael McCandless added a comment -

        New patch: sync'd to trunk, cleaned up the nocommits (some fixing, some turned into TODOs).

        I'd like to add some more tests, but otherwise I think this is ready to commit.

        Michael McCandless added a comment -

        Attached patch: added more test cases, and fixed the near real-time reader to NOT try to open segments in a foreign Directory (which happens when addIndexes* is running).

        Michael McCandless added a comment -

        Added another test case to TestIndexWriterReader, stress testing adding/deleting docs while constantly opening a near real-time reader.

        Jason Rutherglen added a comment -

        In ReaderPool.get(SegmentInfo info, boolean doOpenStores, int readBufferSize), the readBufferSize needs to be passed into SegmentReader.get

        Michael McCandless added a comment -

        Good catch! I'll fix.

        Michael McCandless added a comment -

        I think NRT search is finally ready to go in... I'll wait another day or two.

        Michael McCandless added a comment -

        I just committed this. Thanks Jason!

        Jason Rutherglen added a comment -

        I noticed NearRealtimeReaderTask is in trunk; can you include the .alg? How does one create a line-based wiki file?

        Shai Erera added a comment -

        Look at extractWikipedia.alg, which uses the EnwikiDocMaker, the enwiki.xml file, and the WriteLineDoc task. BTW, with LUCENE-1591 you can run this alg with bzip.compression=true and feed it the .bz2 file rather than the xml.

        Michael McCandless added a comment -

        This is the alg I use:

        { "BuildIndex"
          - OpenIndex
          - NearRealtimeReader(1)
           { "UpdateDocs" UpdateDoc > : 100000 : 50/sec
          - CloseIndex
        }
        

        Ie, using a BG thread, the NRT reader reopens once per second, runs a fixed search, and prints timing info.

        Meanwhile, a single thread replaces the docs at the rate of 50/sec up to 100K docs.

        Jason Rutherglen added a comment -

        We need an IndexWriter.getMergedSegmentWarmer method?

        Michael McCandless added a comment -

        We need an IndexWriter.getMergedSegmentWarmer method?

        Yes, I just committed; thanks!

        Jason Rutherglen added a comment -

        In LogMergePolicy.findMergesToExpungeDeletes I think we need to
        change how deletes are checked for. I added this change into
        LUCENE-1313; however, as we're about to release 2.9, we probably
        need to implement it here.

        Currently we check the info for deletes; however, with this
        patch I think we need to check the SegmentReader, which could
        have deletes that don't show up in the info.

        Michael McCandless added a comment -

        Currently we check the info for deletes; however, with this
        patch I think we need to check the SegmentReader, which could
        have deletes that don't show up in the info.

        Good catch! Can you open a new issue & attach patch? Though: how would you do this? Right now MergePolicy never receives a SegmentReader, and makes all its decisions based on the SegmentInfo. Each SegmentReader tracks its own pendingDelCount... maybe we add a private pendingDelCount to SegmentInfo, and change SegmentReader to use that instead? That'd be a single source, and then the merge policy could retrieve it...
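
        A tiny sketch of the single-source idea (getPendingDelCount is
        hypothetical; hasDeletions is the existing SegmentInfo check):

        // Hypothetical: the merge policy asks SegmentInfo, which would now
        // also know about in-memory deletes buffered by the pooled reader.
        boolean segmentHasDeletes(SegmentInfo info) throws IOException {
          return info.hasDeletions()             // deletes already flushed to a .del file
              || info.getPendingDelCount() > 0;  // hypothetical: buffered, unflushed deletes
        }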

        Jason Rutherglen added a comment -

        I think we will want to do something like what field cache does
        with CreationPlaceholder for IndexWriter.readerPool. Otherwise
        we have the (I think somewhat problematic) issue of all other
        readerPool.get* methods waiting for an SR to warm.

        public synchronized SegmentReader get(SegmentInfo info, boolean doOpenStores, int readBufferSize) throws IOException {
          if (poolReaders) {
            // Pooled readers outlive this call, so force the default buffer size
            readBufferSize = BufferedIndexInput.BUFFER_SIZE;
          }

          SegmentReader sr = (SegmentReader) readerMap.get(info);
          if (sr == null) {
            // TODO: we may want to avoid doing this while
            // synchronized
            // Returns a ref, which we xfer to readerMap:
            sr = SegmentReader.get(info, readBufferSize, doOpenStores);
            readerMap.put(info, sr);
          } else if (doOpenStores) {
            sr.openDocStores();
          }

          // Return a ref to our caller
          sr.incRef();
          return sr;
        }
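        Something along these lines (a sketch of the FieldCache-style
        placeholder idea; the Slot type and surrounding names are
        illustrative):

        // Hypothetical sketch: the pool-wide lock is held only long enough to
        // install a slot; the expensive open/warm runs under the slot's own
        // lock, so get() calls for other segments are not blocked by warming.
        private final Map readerMap = new HashMap();  // SegmentInfo -> Slot

        private static final class Slot {
          SegmentReader reader;  // filled in once opened (and warmed)
        }

        SegmentReader get(SegmentInfo info, boolean doOpenStores, int readBufferSize) throws IOException {
          Slot slot;
          synchronized (this) {
            slot = (Slot) readerMap.get(info);
            if (slot == null) {
              slot = new Slot();
              readerMap.put(info, slot);
            }
          }
          synchronized (slot) {  // blocks only callers wanting this same segment
            if (slot.reader == null) {
              slot.reader = SegmentReader.get(info, readBufferSize, doOpenStores);
              // warm here if desired, still without holding the pool-wide lock
            } else if (doOpenStores) {
              slot.reader.openDocStores();
            }
            slot.reader.incRef();  // return a ref to our caller
            return slot.reader;
          }
        }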

          People

          • Assignee: Michael McCandless
          • Reporter: Jason Rutherglen
          • Votes: 0
          • Watchers: 1

          Time Tracking

          • Original Estimate: 672h
          • Remaining Estimate: 672h
          • Time Spent: Not Specified