Lucene - Core
  1. Lucene - Core
  2. LUCENE-2814

stop writing shared doc stores across segments

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think.

      It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment.

      This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged.

      However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is "congruent". Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be.

      Furthermore, the optimization adds a lot of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores.

      So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy.

      1. LUCENE-2814.patch
        119 kB
        Earwin Burrfoot
      2. LUCENE-2814.patch
        119 kB
        Earwin Burrfoot
      3. LUCENE-2814.patch
        99 kB
        Earwin Burrfoot
      4. LUCENE-2814.patch
        75 kB
        Michael McCandless
      5. LUCENE-2814.patch
        76 kB
        Earwin Burrfoot

        Issue Links

          Activity

          Hide
          Earwin Burrfoot added a comment -

          I'll take this. I think.

          Show
          Earwin Burrfoot added a comment - I'll take this. I think.
          Hide
          Shai Erera added a comment -

          This is great Mike. Does this mean we will be able to merge non consecutive segments?

          Show
          Shai Erera added a comment - This is great Mike. Does this mean we will be able to merge non consecutive segments?
          Hide
          Michael McCandless added a comment -

          Does this mean we will be able to merge non consecutive segments?

          Well... it's rather decoupled from this issue. Meaning, we were always free to allow non-contiguous merges, and segment would simply have forced the merge of the doc stores in that case.

          But, yes, because of the same reasons above, such a merge policy is much lower perf hit than it used to be. And so now it is again compelling to explore this.... we just have to go and do LUCENE-1076. I think it should not be hard; we just have to relax the merge methods in IW that check that the segments are contiguous.

          This added freedom for the merge policy would be compelling, ie, it can then do a much better job cherry picking similarly sized segments for merging. I think it'd also simplify merge policies like BalancedSegMP, eg (I think?) no longer requiring a viterbi search to pick the best merges.

          Show
          Michael McCandless added a comment - Does this mean we will be able to merge non consecutive segments? Well... it's rather decoupled from this issue. Meaning, we were always free to allow non-contiguous merges, and segment would simply have forced the merge of the doc stores in that case. But, yes, because of the same reasons above, such a merge policy is much lower perf hit than it used to be. And so now it is again compelling to explore this.... we just have to go and do LUCENE-1076 . I think it should not be hard; we just have to relax the merge methods in IW that check that the segments are contiguous. This added freedom for the merge policy would be compelling, ie, it can then do a much better job cherry picking similarly sized segments for merging. I think it'd also simplify merge policies like BalancedSegMP, eg (I think?) no longer requiring a viterbi search to pick the best merges.
          Hide
          Michael McCandless added a comment -

          Thanks Earwin

          We should verify the back-compat test has at least one index w/ shared doc stores.

          DocumentsWriter should get simpler, ie its closeDocStores is now private, and, it should no longer need to call writer.checkpoint() since it will no longer alter the SIS (and no longer build a cfx file) on closing the doc stores.

          It would be nice to remove the closeDocStore methods from the full indexing chain, since now each component should do that itself, privately, on flush, but maybe we should do that as a 2nd step?

          Show
          Michael McCandless added a comment - Thanks Earwin We should verify the back-compat test has at least one index w/ shared doc stores. DocumentsWriter should get simpler, ie its closeDocStores is now private, and, it should no longer need to call writer.checkpoint() since it will no longer alter the SIS (and no longer build a cfx file) on closing the doc stores. It would be nice to remove the closeDocStore methods from the full indexing chain, since now each component should do that itself, privately, on flush, but maybe we should do that as a 2nd step?
          Hide
          Earwin Burrfoot added a comment -

          We should verify the back-compat test has at least one index w/ shared doc stores.

          I believe I've seen some back-compat failures with my quick'n'dirty patch that removed both reading and writing shared docstores.
          So it should be ok.

          Show
          Earwin Burrfoot added a comment - We should verify the back-compat test has at least one index w/ shared doc stores. I believe I've seen some back-compat failures with my quick'n'dirty patch that removed both reading and writing shared docstores. So it should be ok.
          Hide
          Earwin Burrfoot added a comment -

          First iteration.

          Passes all tests except TestNRTThreads. Something to do with numDocsInStore and numDocsInRam merged together?
          Lots of non-critical nocommits (just markers for places I'd like to recheck).
          DW.docStoreEnabled and *.closeDocStore() have to go, before committing

          Show
          Earwin Burrfoot added a comment - First iteration. Passes all tests except TestNRTThreads. Something to do with numDocsInStore and numDocsInRam merged together? Lots of non-critical nocommits (just markers for places I'd like to recheck). DW.docStoreEnabled and *.closeDocStore() have to go, before committing
          Hide
          Michael McCandless added a comment -

          Same as Earwin's patch just sync'd w/ current trunk.

          Show
          Michael McCandless added a comment - Same as Earwin's patch just sync'd w/ current trunk.
          Hide
          Michael McCandless added a comment -

          OK I dug here... the reason why TestNRTThreads fails is because you moved the numDocsInRAM++ out of DW.getThreadState into WaitQueue.writeDocument.

          When we buffer del terms in DW.deleteTerm/Terms/Query/Queries, we grab the current numDocsInRAM as the "docID upto", to record when it comes time to apply the delete which docID we must stop at.

          But with your change, this value is now an undercount, since numDocsInRAM is now acting like numDocsInStore.

          One way to fix this would be to change the delete methods to use nextDocID instead of numDocsInRAM?

          But I think I'd prefer to put back numDocsInRAM++ in getThreadState...

          Show
          Michael McCandless added a comment - OK I dug here... the reason why TestNRTThreads fails is because you moved the numDocsInRAM++ out of DW.getThreadState into WaitQueue.writeDocument. When we buffer del terms in DW.deleteTerm/Terms/Query/Queries, we grab the current numDocsInRAM as the "docID upto", to record when it comes time to apply the delete which docID we must stop at. But with your change, this value is now an undercount, since numDocsInRAM is now acting like numDocsInStore. One way to fix this would be to change the delete methods to use nextDocID instead of numDocsInRAM? But I think I'd prefer to put back numDocsInRAM++ in getThreadState...
          Hide
          Michael Busch added a comment -

          The shared doc stores are actually already completely removed in the realtime branch (part of LUCENE-2324).

          Does someone want to help with the merge, then we can land the realtime branch (which is pretty much only DWPT and removing doc stores) in trunk sometime soon?

          Show
          Michael Busch added a comment - The shared doc stores are actually already completely removed in the realtime branch (part of LUCENE-2324 ). Does someone want to help with the merge, then we can land the realtime branch (which is pretty much only DWPT and removing doc stores) in trunk sometime soon?
          Hide
          Earwin Burrfoot added a comment -

          Ugh.
          On first glance @ realtime branch, my patch is not a strict subset. But most stuff, including things I'm struggling with now, is definetly in.
          So, I guess, I'll stop wasting effort.

          What is the state for realtime branch? Is it stable? How soon is "sometime soon"?

          Show
          Earwin Burrfoot added a comment - Ugh. On first glance @ realtime branch, my patch is not a strict subset. But most stuff, including things I'm struggling with now, is definetly in. So, I guess, I'll stop wasting effort. What is the state for realtime branch? Is it stable? How soon is "sometime soon"?
          Hide
          Michael Busch added a comment -

          Well I need to merge with the recent changes in trunk (especially LUCENE-2680).
          The merge is pretty hard, but I'm planning to spend most of my weekend on it.

          If I can get most tests to pass again (most were passing before the merge), then I think the only outstanding thing is LUCENE-2573 before we could land it in trunk.

          Show
          Michael Busch added a comment - Well I need to merge with the recent changes in trunk (especially LUCENE-2680 ). The merge is pretty hard, but I'm planning to spend most of my weekend on it. If I can get most tests to pass again (most were passing before the merge), then I think the only outstanding thing is LUCENE-2573 before we could land it in trunk.
          Hide
          Michael McCandless added a comment -

          I think taking things one step at a time would be good here?

          Ie remove doc stores from trunk, let that bake on trunk for a while, then merge to RT? So that what then remains on RT is DWPT / tiered flushing? Else RT is a monster change?

          Show
          Michael McCandless added a comment - I think taking things one step at a time would be good here? Ie remove doc stores from trunk, let that bake on trunk for a while, then merge to RT? So that what then remains on RT is DWPT / tiered flushing? Else RT is a monster change?
          Hide
          Michael Busch added a comment -

          I think taking things one step at a time would be good here?

          Probably still a smaller change than flex indexing

          But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far. In this particular case it's just a bit sad to redo all this work now, because I think I got the removal of doc stores right in RT and all related tests to pass.

          Show
          Michael Busch added a comment - I think taking things one step at a time would be good here? Probably still a smaller change than flex indexing But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far. In this particular case it's just a bit sad to redo all this work now, because I think I got the removal of doc stores right in RT and all related tests to pass.
          Hide
          Earwin Burrfoot added a comment -

          So, what's the plan?

          Show
          Earwin Burrfoot added a comment - So, what's the plan?
          Hide
          Michael Busch added a comment -

          So, what's the plan?

          I can't really work on this much before Saturday. But during the weekend I can work on the RT merge and maybe try to pull out the docstore removal changes and create a separate patch. Have to see how hard that is. If it's not too difficult I'll post a separate patch, otherwise I'll commit the merge to RT and maybe convince you guys to help a bit with getting the RT branch ready for landing in trunk?

          Show
          Michael Busch added a comment - So, what's the plan? I can't really work on this much before Saturday. But during the weekend I can work on the RT merge and maybe try to pull out the docstore removal changes and create a separate patch. Have to see how hard that is. If it's not too difficult I'll post a separate patch, otherwise I'll commit the merge to RT and maybe convince you guys to help a bit with getting the RT branch ready for landing in trunk?
          Hide
          Earwin Burrfoot added a comment -

          Instead of you pulling out docstore removal, I can finish that patch. But then merging's gonna be even greater bitch. Probably. But maybe not.
          Do you do IRC? It can be faster to discuss in realtime, and you could also tell what help you need with the branch.

          Show
          Earwin Burrfoot added a comment - Instead of you pulling out docstore removal, I can finish that patch. But then merging's gonna be even greater bitch. Probably. But maybe not. Do you do IRC? It can be faster to discuss in realtime, and you could also tell what help you need with the branch.
          Hide
          Earwin Burrfoot added a comment -

          Patch updated to trunk, no nocommits, no *.closeDocStore(), tests pass.

          SegmentWriteState vs DocumentsWriter bother me.
          We track flushed files in both, we inconsistently get current segment from both of them.

          Show
          Earwin Burrfoot added a comment - Patch updated to trunk, no nocommits, no *.closeDocStore(), tests pass. SegmentWriteState vs DocumentsWriter bother me. We track flushed files in both, we inconsistently get current segment from both of them.
          Hide
          Michael McCandless added a comment -

          Probably still a smaller change than flex indexing

          Yes, true!

          But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far.

          Well, not a mistake... this was unavoidable given that trunk was so far from what DWPT needs. But with per-seg deletes (LUCENE-2680) done, and no more doc stores (this issue), I think we've got DWPT down to about as bite sized as it can be (it's still gonna be big!).

          I can help merge too! I think coordinating on IRC #lucene is a good idea? It seems like LUCENE-2573 needs to be incorporated into IW's new FlushControl class (which is already coordinating flush-due-to-deletes and flush-due-to-added-docs-of-one-DWPT).

          Show
          Michael McCandless added a comment - Probably still a smaller change than flex indexing Yes, true! But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far. Well, not a mistake... this was unavoidable given that trunk was so far from what DWPT needs. But with per-seg deletes ( LUCENE-2680 ) done, and no more doc stores (this issue), I think we've got DWPT down to about as bite sized as it can be (it's still gonna be big!). I can help merge too! I think coordinating on IRC #lucene is a good idea? It seems like LUCENE-2573 needs to be incorporated into IW's new FlushControl class (which is already coordinating flush-due-to-deletes and flush-due-to-added-docs-of-one-DWPT).
          Hide
          Jason Rutherglen added a comment -

          I think we've got DWPT down to about as bite sized as it can be (it's still gonna be big!)

          Indeed!

          I think coordinating on IRC #lucene is a good idea?

          It'd be nice if there were a log of IRC #lucene, otherwise I prefer Jira.

          It seems like LUCENE-2573 needs to be incorporated into IW's new FlushControl class

          Right, into the DWPT branch.

          Show
          Jason Rutherglen added a comment - I think we've got DWPT down to about as bite sized as it can be (it's still gonna be big!) Indeed! I think coordinating on IRC #lucene is a good idea? It'd be nice if there were a log of IRC #lucene, otherwise I prefer Jira. It seems like LUCENE-2573 needs to be incorporated into IW's new FlushControl class Right, into the DWPT branch.
          Hide
          Steve Rowe added a comment -

          I think coordinating on IRC #lucene is a good idea?

          It'd be nice if there were a log of IRC #lucene, otherwise I prefer Jira.

          #lucene-dev is logged.

          Show
          Steve Rowe added a comment - I think coordinating on IRC #lucene is a good idea? It'd be nice if there were a log of IRC #lucene, otherwise I prefer Jira. #lucene-dev is logged.
          Hide
          Jason Rutherglen added a comment -

          Steven, is that on a wiki page?

          The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on

          Show
          Jason Rutherglen added a comment - Steven, is that on a wiki page? The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on
          Hide
          Steve Rowe added a comment - - edited

          Steven, is that on a wiki page?

          I don't know, I never put it anywhere, just discussed on dev@l.a.o. Feel free to do so if you like.

          The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on

          Yeah, it's very rarely used.

          Several Lucene people who use #lucene are strongly against logging, so I set up #lucene-dev as a place to host on-the-record Lucene conversations. I mentioned it because this is what you want.

          Show
          Steve Rowe added a comment - - edited Steven, is that on a wiki page? I don't know, I never put it anywhere, just discussed on dev@l.a.o. Feel free to do so if you like. The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on Yeah, it's very rarely used. Several Lucene people who use #lucene are strongly against logging, so I set up #lucene-dev as a place to host on-the-record Lucene conversations. I mentioned it because this is what you want.
          Hide
          Earwin Burrfoot added a comment -

          New patch. Now with even more lines removed!

          DocStore-related index chain components used to track open/closed files through DocumentsWriter.
          Closed files list was unused, and is silently gone.
          Open files list was used to:

          • prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter.
            • no shared docstores, no need + IFD no longer requires a reference to DW
          • delete already opened docstore files, when aborting.
            • index chain now handles this on its own + has cleaner error handling code.
          Show
          Earwin Burrfoot added a comment - New patch. Now with even more lines removed! DocStore-related index chain components used to track open/closed files through DocumentsWriter. Closed files list was unused, and is silently gone. Open files list was used to: prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter. no shared docstores, no need + IFD no longer requires a reference to DW delete already opened docstore files, when aborting. index chain now handles this on its own + has cleaner error handling code.
          Hide
          Michael McCandless added a comment -

          Patch looks great! Nice work Earwin. I think it's ready to commit.

          Except, can you resync to trunk? I hit failures applying one hunk to
          DW.java.

          Also, on the nocommit on exc in DW.addDocument, yes I think that
          (IFD.deleteNewFiles, not checkpoint) is still needed because DW can
          orphan the store files on abort?

          Or: we could fix DW.abort to directly call Dir.deleteFile (instead of
          relying on IFD.deleteNewFiles). Ie, w/ no shared doc stores, these
          files should never have been registered w/ IFD so they can be
          privately managed by DW.

          But, if we end up leaving the delete up above, we should put the
          docWriter null check back so silly apps that close IW while still
          indexing don't get NPEs.

          I'm not looking forward to the 3.x back port!!

          Show
          Michael McCandless added a comment - Patch looks great! Nice work Earwin. I think it's ready to commit. Except, can you resync to trunk? I hit failures applying one hunk to DW.java. Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort? Or: we could fix DW.abort to directly call Dir.deleteFile (instead of relying on IFD.deleteNewFiles). Ie, w/ no shared doc stores, these files should never have been registered w/ IFD so they can be privately managed by DW. But, if we end up leaving the delete up above, we should put the docWriter null check back so silly apps that close IW while still indexing don't get NPEs. I'm not looking forward to the 3.x back port!!
          Hide
          Earwin Burrfoot added a comment -

          Synced to trunk.

          Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort?

          Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone.
          Turns out checkpoint() is also no longer needed.

          I have no other lingering cleanup urges, this is ready to be committed. I think.

          Show
          Earwin Burrfoot added a comment - Synced to trunk. Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort? Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone. Turns out checkpoint() is also no longer needed. I have no other lingering cleanup urges, this is ready to be committed. I think.
          Hide
          Michael McCandless added a comment -

          OK I committed to trunk. I'll let this bake for a while on trunk before backporting to 3.x...

          Thanks Earwin!

          Show
          Michael McCandless added a comment - OK I committed to trunk. I'll let this bake for a while on trunk before backporting to 3.x... Thanks Earwin!
          Hide
          Jason Rutherglen added a comment -

          backporting to 3.x...

          Out of curiosity, why are we backporting to 3.x or are we planning on also backporting the DWPT branch?

          Show
          Jason Rutherglen added a comment - backporting to 3.x... Out of curiosity, why are we backporting to 3.x or are we planning on also backporting the DWPT branch?
          Hide
          Michael McCandless added a comment -

          Out of curiosity, why are we backporting to 3.x or are we planning on also backporting the DWPT branch?

          Well, 3.x is eligible for any new features that don't break back
          compat. It's a feature branch not a bug-fix branch.

          I'd like to see removing shared doc stores backported in particular
          because it's a good simplification of the code, ie, it can prevent
          sneaky bugs.

          And I think we should backport DWPT? But first things first, we gotta
          land it on trunk And let it bake some... then backport.

          Show
          Michael McCandless added a comment - Out of curiosity, why are we backporting to 3.x or are we planning on also backporting the DWPT branch? Well, 3.x is eligible for any new features that don't break back compat. It's a feature branch not a bug-fix branch. I'd like to see removing shared doc stores backported in particular because it's a good simplification of the code, ie, it can prevent sneaky bugs. And I think we should backport DWPT? But first things first, we gotta land it on trunk And let it bake some... then backport.
          Hide
          Jason Rutherglen added a comment -

          And I think we should backport DWPT?

          Really? It's a fairly large restructuring.

          Also, I realized that LUCENE-2312 should be implementable in the single DW case, which perhaps could be backported as well.

          Show
          Jason Rutherglen added a comment - And I think we should backport DWPT? Really? It's a fairly large restructuring. Also, I realized that LUCENE-2312 should be implementable in the single DW case, which perhaps could be backported as well.
          Hide
          Michael McCandless added a comment -

          Really? It's a fairly large restructuring.

          It is, but it's all under-the-hood.

          Also, I realized that LUCENE-2312 should be implementable in the single DW case, which perhaps could be backported as well.

          Right!

          Show
          Michael McCandless added a comment - Really? It's a fairly large restructuring. It is, but it's all under-the-hood. Also, I realized that LUCENE-2312 should be implementable in the single DW case, which perhaps could be backported as well. Right!
          Hide
          Michael Busch added a comment -

          OK I committed to trunk. I'll let this bake for a while on trunk before backporting to 3.x...
          Thanks Earwin!

          Man, you guys really ruined my Sunday with this commit

          I got so many merge conflicts, that I decided to merge first only up to rev 1050655 (the rev before this commit) and up to HEAD in a second merge.

          I'm down to 64 compile errors (from 800), hopefully I can finish the merge tomorrow. Just wanted you to know that I'm making progress here with the DWPTs.

          Show
          Michael Busch added a comment - OK I committed to trunk. I'll let this bake for a while on trunk before backporting to 3.x... Thanks Earwin! Man, you guys really ruined my Sunday with this commit I got so many merge conflicts, that I decided to merge first only up to rev 1050655 (the rev before this commit) and up to HEAD in a second merge. I'm down to 64 compile errors (from 800), hopefully I can finish the merge tomorrow. Just wanted you to know that I'm making progress here with the DWPTs.
          Hide
          Michael McCandless added a comment -

          Man, you guys really ruined my Sunday with this commit

          Sorry

          Is there any way I can help?

          Show
          Michael McCandless added a comment - Man, you guys really ruined my Sunday with this commit Sorry Is there any way I can help?
          Hide
          Michael Busch added a comment -

          Sorry

          No worries - I'm being overly dramatic

          Is there any way I can help?

          Let me try to get it to compile, and then I'll commit. I'm sure a bunch of tests will fail, help would be great then. Also a general review of my changes to IndexWriter/DocumentsWriter/DWPT would be great. I should be able to commit my merge by end of today.

          Show
          Michael Busch added a comment - Sorry No worries - I'm being overly dramatic Is there any way I can help? Let me try to get it to compile, and then I'll commit. I'm sure a bunch of tests will fail, help would be great then. Also a general review of my changes to IndexWriter/DocumentsWriter/DWPT would be great. I should be able to commit my merge by end of today.
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development