Lucene - Core
LUCENE-2680

Improve how IndexWriter flushes deletes against existing segments

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      IndexWriter buffers up all deletes (by Term and Query) and only
      applies them if 1) commit or NRT getReader() is called, or 2) a merge
      is about to kickoff.

      We do this because, for a large index, it's very costly to open a
      SegmentReader for every segment in the index. So we defer as long as
      we can. We do it just before merge so that the merge can eliminate
      the deleted docs.

      But most merges are small, yet in a big index we apply deletes to all
      of the segments, which is very wasteful.

      Instead, we should only apply the buffered deletes to the segments
      that are about to be merged, and keep the buffer around for the
      remaining segments.

      I think it's not so hard to do; we'd have to have generations of
      pending deletions, because the newly merged segment doesn't need the
      same buffered deletions applied again. So every time a merge kicks
      off, we pinch off the current set of buffered deletions, open a new
      set (the next generation), and record which segment was created as of
      which generation.
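
      As a rough illustration of the generation idea (a hypothetical sketch,
      not the actual IndexWriter internals; the class and field names here are
      made up):

        // Hypothetical sketch: buffered deletes are stamped with a generation,
        // and each segment records the generation it was created under, so a
        // newly merged segment never has older deletions re-applied to it.
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        class DeletesStream {
          static class Gen {
            final long gen;
            final Map<String, Integer> termToDocIDUpto = new HashMap<>();
            Gen(long gen) { this.gen = gen; }
          }

          private long nextGen = 1;
          private Gen current = new Gen(0);
          private final List<Gen> pending = new ArrayList<>();

          synchronized void bufferDelete(String term, int docIDUpto) {
            current.termToDocIDUpto.put(term, docIDUpto);
          }

          // Called when a merge kicks off: pinch off the current batch and
          // start the next generation; the caller records the returned gen
          // on the segment produced by that merge.
          synchronized long pinch() {
            pending.add(current);
            current = new Gen(nextGen);
            return nextGen++;
          }

          // A segment created at generation g only needs deletions buffered
          // at generation g or later; earlier ones were already folded in.
          synchronized List<Gen> deletesFor(long segmentGen) {
            List<Gen> out = new ArrayList<>();
            for (Gen g : pending) {
              if (g.gen >= segmentGen) {
                out.add(g);
              }
            }
            return out;
          }
        }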

      This should be a very sizable gain for large indices that mix deletes
      with additions, though less so in flex, since opening the terms index is
      much faster.

      1. LUCENE-2680.patch
        141 kB
        Michael McCandless
      2. LUCENE-2680.patch
        124 kB
        Michael McCandless
      3. LUCENE-2680.patch
        124 kB
        Michael McCandless
      4. LUCENE-2680.patch
        57 kB
        Jason Rutherglen
      5. LUCENE-2680.patch
        56 kB
        Jason Rutherglen
      6. LUCENE-2680.patch
        19 kB
        Jason Rutherglen
      7. LUCENE-2680.patch
        19 kB
        Jason Rutherglen
      8. LUCENE-2680.patch
        9 kB
        Jason Rutherglen
      9. LUCENE-2680.patch
        42 kB
        Jason Rutherglen
      10. LUCENE-2680.patch
        44 kB
        Jason Rutherglen
      11. LUCENE-2680.patch
        43 kB
        Jason Rutherglen
      12. LUCENE-2680.patch
        41 kB
        Jason Rutherglen
      13. LUCENE-2680.patch
        42 kB
        Jason Rutherglen
      14. LUCENE-2680.patch
        45 kB
        Jason Rutherglen
      15. LUCENE-2680.patch
        37 kB
        Jason Rutherglen
      16. LUCENE-2680.patch
        30 kB
        Jason Rutherglen
      17. LUCENE-2680.patch
        33 kB
        Jason Rutherglen

          Activity

          Michael McCandless added a comment -

          Hmm... I think there's another silliness going on inside IW: when applying deletes, we one-by-one open the SR, apply deletes, close it.

          But then immediately thereafter we open the N segments to be merged.

          We should somehow not do this double open, eg, use the pool temporarily, so that the reader is opened to apply deletes, and then kept open in order to do the merging. Using the pool should be fine because the merge forcefully evicts the sub readers from the pool after completion.
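
          Conceptually, the reuse could look like this (a hypothetical pool
          sketch, not the real IndexWriter.ReaderPool API):

            // Hypothetical sketch: open each segment's reader once via a pool,
            // apply buffered deletes on it, and let the merge reuse the same
            // pooled instance instead of re-opening the segment.
            import java.util.HashMap;
            import java.util.Map;
            import java.util.function.Function;

            class ReaderPoolSketch<R> {
              private final Map<String, R> pooled = new HashMap<>();
              private final Function<String, R> opener;

              ReaderPoolSketch(Function<String, R> opener) {
                this.opener = opener;
              }

              // Single open per segment; later callers (the merge) get the
              // same instance the deletes were applied to.
              synchronized R get(String segment) {
                return pooled.computeIfAbsent(segment, opener);
              }

              // The merge evicts its sub readers once it completes.
              synchronized void drop(String segment) {
                pooled.remove(segment);
              }
            }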

          Jason Rutherglen added a comment -

          Maybe we should implement this as pending deletes per segment rather than using a generational system, because with LUCENE-2655 we may need to maintain the per-query/term docid-upto per segment. The downside is the extra memory consumed by the hash map; if we use BytesRefHash this'll be reduced, or would it, since we'd be writing the term bytes to a unique byte pool per segment? Hmm... maybe there's a more efficient way.
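
          A minimal sketch of that per-segment alternative (hypothetical names;
          the real code would key on SegmentInfo and hold BufferedDeletes):

            // Hypothetical sketch: one pending-deletes entry per segment,
            // each holding its own term -> docid-upto map.
            import java.util.Collections;
            import java.util.HashMap;
            import java.util.Map;

            class PerSegmentDeletes {
              // segment name -> (delete term -> docid-upto for that segment)
              private final Map<String, Map<String, Integer>> bySegment =
                  new HashMap<>();

              synchronized void buffer(String segment, String term, int docIDUpto) {
                bySegment.computeIfAbsent(segment, k -> new HashMap<>())
                         .put(term, docIDUpto);
              }

              // Pulled when the segment is about to be merged; removing the
              // entry releases the RAM it was holding.
              synchronized Map<String, Integer> pull(String segment) {
                Map<String, Integer> m = bySegment.remove(segment);
                return m == null ? Collections.<String, Integer>emptyMap() : m;
              }
            }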

          Michael McCandless added a comment -

          Tracking per-segment would be easier but I worry about indices that have large numbers of segments... eg w/ a large mergeFactor and frequent flushing you can get very many segments.

          So if we track per-segment, suddenly the RAM required (as well as CPU cost of copying these deletions to the N segments) is multiplied by the number of segments.

          Jason Rutherglen added a comment -

          The general approach is to reuse BufferedDeletes but place them into a map keyed by segment info, for those segments generated after lastSegmentIndex, as per what has been discussed here https://issues.apache.org/jira/browse/LUCENE-2655?focusedCommentId=12922894&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12922894 and below.

          • lastSegmentIndex is added to IW
          • DW segmentDeletes is a map of segment info -> buffered deletes. In the apply deletes method buffered deletes are pulled for a given segment info if they exist, otherwise they're taken from deletesFlushedLastSeg.
          • I'm not entirely sure what pushDeletes should do now, probably the same thing as currently, only the name should change slightly in that it's pushing deletes only for the RAM buffer docs.
          • There need to be tests to ensure the docid-upto logic is working correctly
          • I'm not sure what to do with DW hasDeletes (its usage is commented out)
          • Does there need to be separate deletes for the ram buffer vis-à-vis the (0 - lastSegmentIndex) deletes?
          • The memory accounting'll now get interesting as we'll need to track the RAM usage of terms/queries across multiple maps.
          • In commitMerge, DW verifySegmentDeletes removes the unused info -> deletes
          • testDeletes deletes a doc in segment 1, then merges segments 1 and 2. We then test to ensure the deletes were in fact applied only to segments 1 and 2.
          • testInitLastSegmentIndex ensures that on IW init, the lastSegmentIndex is in fact set
          Jason Rutherglen added a comment -

          The most recent one "wins", and we should do only one delete (per segment) for that term.

          How should we define this recency and why does it matter? Should it be per term/query or for the entire BD?

          I think there's an issue with keeping lastSegmentIndex in DW: while it's easy to maintain, Mike had mentioned keeping the lastSegmentIndex per BufferedDeletes object. Coalescing the BDs after a successful merge should be easier than maintaining a separate BD for them. We'll see.

          I'll put together another patch with these changes.

          Jason Rutherglen added a comment -

          Here's a new patch with properly working last segment index.

          The trunk version of apply deletes has become applyDeletesAll and is functionally unchanged.

          There's a new method, DW applyDeletesToSegments, called by _mergeInit for segments that are about to be merged. The deleted terms and queries for these segments are kept in hash sets because docid-uptos are not needed.

          Like the last patch, DW maintains the last segment index. There's no need to maintain the last-segindex per BD; I think it's only useful per DW, and in trunk we only have one DW in use at a time.

          On successful merge, the last segment index is set to the segment index just before the start segment of the merge. The merged segments' deletes are coalesced into the deletes of the segment at startIndex-1.
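
          Roughly, that coalescing step might look like this (a hypothetical
          sketch; segmentDeletes and the name-based keys are assumptions, not
          the patch's exact code):

            // Hypothetical sketch: after a successful merge, fold the merged
            // segments' pending deletes into the deletes of the segment just
            // before the merge (startIndex - 1) and drop the old entries.
            import java.util.HashMap;
            import java.util.HashSet;
            import java.util.List;
            import java.util.Map;
            import java.util.Set;

            class SegmentDeleteCoalescer {
              final Map<String, Set<String>> segmentDeletes = new HashMap<>();

              synchronized void coalesceAfterMerge(List<String> mergedSegments,
                                                   String segmentBeforeMerge) {
                Set<String> target = segmentBeforeMerge == null
                    ? null
                    : segmentDeletes.computeIfAbsent(segmentBeforeMerge,
                                                     k -> new HashSet<>());
                for (String seg : mergedSegments) {
                  Set<String> deletes = segmentDeletes.remove(seg);
                  if (deletes != null && target != null) {
                    // still pending for the segments older than the merge
                    target.addAll(deletes);
                  }
                  // if the merge started at the first segment there is nothing
                  // older left to apply these to, so they can simply be dropped
                }
              }
            }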

          Jason Rutherglen added a comment -

          I'm redoing things a bit to take into account the concurrency of merges. For example, if a merge fails, we must not have already removed those segments' deletes, so they can still be applied. Also, probably the trickiest part is that lastSegmentIndex could have changed since a merge started, which means we need to be careful about how and which deletes we coalesce.

          Jason Rutherglen added a comment -

          Another use case that can be wacky is if commit is called and a merge is finishing before or after. In that case all (point-in-time) deletes will have been applied by commit, but do we want to clear all per-segment deletes at the end of commit? That would blank out deletes being applied by the merge, most of which should be cleared out; however, if new deletes arrived during the commit (is this possible?), then we want those to be attached to segments and not lost. I guess we want a DW-sync'd clearing of deletes in the applyDeletesAll method. ADA will apply those deletes, and any incoming ones will queue up and be shuffled around.

          Jason Rutherglen added a comment -

          In my head at least I think the concurrency issues are worked
          out in this patch. We're not taking into account recency of
          deletes as I'm not sure it matters. DW applyDeletesToSegments
          takes care of the coalescing of segment deletes as this is a
          synced DW method called by a synced IW method, meaning
          nothing'll be changing anywhere, so we're good with the possible
          concurrency issues. I'm still a little worried about concurrent
          incoming deleted terms/queries, however those can't be added
          until after a successful ADTS call due to the DW sync.

          Guess it's time for the complete tests to run.

          Jason Rutherglen added a comment -

          There's an issue in that we're redundantly applying deletes in the applyDeletesAll case because the deletes may have already been applied to a segment when a merge happened, ie, by applyDeletesToSegments. In the ADA case we need to use applyDeletesToSegments up to the segment point when the buffered deletes can be used.

          Jason Rutherglen added a comment -

          This brings up another issue: we're blindly iterating over docs in a segment reader to delete, even when we can know ahead of time (from the max doc of the reader) that the reader's docs are going to exceed the term/query's docid-upto. In applyDeletes we're opening a term docs iterator, though I think we break at the first doc and move on if the docid-upto is exceeded. That term docs iterator opening can be skipped.
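
          The shortcut could be as simple as a bounds check before touching the
          terms dictionary (a hedged sketch; docIDUpto here is the global bound
          recorded for the buffered delete, docBase the segment's first global
          doc id):

            // Hypothetical sketch: skip opening the term docs iterator when
            // the buffered delete cannot apply to any doc in this reader.
            final class DeleteBoundsCheck {
              private DeleteBoundsCheck() {}

              // Docs in this segment occupy global ids [docBase, docBase + maxDoc).
              // The delete only applies to global ids below docIDUpto, so if even
              // the segment's first doc is past the bound, nothing can match.
              static boolean mayApply(int docIDUpto, int docBase) {
                return docBase < docIDUpto;
              }
            }

            // Usage idea inside applyDeletes, per buffered term:
            //   if (!DeleteBoundsCheck.mayApply(entry.docIDUpto, docBase)) continue;
            //   ... only now open the term docs iterator and delete matching docs,
            //       still breaking once docBase + doc >= entry.docIDUpto ...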

          Jason Rutherglen added a comment -

          Here's a nice little checkpoint with more tests passing.

          • A last known segment is recorded, which is the last segment seen when
            adding a delete term/query per segment. This is for an applyDeletesAll
            check to ensure a given query/term has not already been applied to a
            segment. If a term/query exists in the per-segment deletes and is in
            deletesFlushed, we delete, unless we're beyond the last known segment, at
            which point we simply delete (adhering of course to the docid-upto).
          • In the interest of accuracy I nixed lastSegmentIndex in favor of
            lastSegmentInfo which is easier for debugging and implementation when
            segments are shuffled around and/or removed/added. There's not too much of
            a penalty in terms of performance.
          • org.apache.lucene.index tests pass
          • I need to address applying deletes only on readers within the
            docid-upto per term/query; perhaps that's best left to a different Jira
            issue.
          • Still not committable as it needs cleaning up, complete unit tests, who
            knows what else.
          Jason Rutherglen added a comment -

          All tests pass except org.apache.lucene.index.TestIndexWriterMergePolicy testMaxBufferedDocsChange. Odd. I'm looking into this.

          [junit] junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; 
          segs=_65:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c3->_32 _64:c7->_62
          

          Also, in IW deleteDocument we're calling a new method, getSegmentInfos, which is sync'ed on IW. Maybe we should use an atomic reference to a read-only segment infos instead?
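
          The atomic-reference idea might look roughly like this (hypothetical
          sketch, not the actual IW/SegmentInfos code):

            // Hypothetical sketch: publish an immutable snapshot of the segment
            // list so deleteDocument can read it without syncing on IndexWriter.
            import java.util.ArrayList;
            import java.util.Collections;
            import java.util.List;
            import java.util.concurrent.atomic.AtomicReference;

            class SegmentInfosSnapshot {
              private final AtomicReference<List<String>> current =
                  new AtomicReference<>(Collections.<String>emptyList());

              // Called under the writer's lock whenever the segment list changes.
              void publish(List<String> segments) {
                current.set(Collections.unmodifiableList(new ArrayList<>(segments)));
              }

              // Lock-free read for callers like deleteDocument.
              List<String> read() {
                return current.get();
              }
            }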

          Jason Rutherglen added a comment -

          Sorry, spoke too soon: I made a small change to not redundantly delete in apply deletes all, and TestStressIndexing2 is breaking. I think we need to "push" segment infos changes to DW as they happen. I'm guessing that segment infos are being shuffled around and so the infos passed into DW in IW deleteDoc methods may be out of date by the time deletes are attached to segments. Hopefully there aren't any lurking deadlock issues with this.

          Jason Rutherglen added a comment -

          Pushing the segment infos seems to have cleared up some of the failing tests; however, intermittently (1/4 of the time) there's the one below.

          I'm going to re-add lastSegmentInfo/Index, and assert that, if we're not using it, the deletes obtained from the segmentinfo -> deletes map are the same.

          [junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
              [junit] Testcase: testRandom(org.apache.lucene.index.TestStressIndexing2):	FAILED
              [junit] expected:<12> but was:<11>
              [junit] junit.framework.AssertionFailedError: expected:<12> but was:<11>
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:278)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:271)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.testRandom(TestStressIndexing2.java:89)
          
          Jason Rutherglen added a comment -

          I wasn't coalescing the merged segments' deletes; with that implemented, TestStressIndexing2 ran successfully 49 of 50 times. Below is the error:

          [junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
              [junit] Testcase: testMultiConfig(org.apache.lucene.index.TestStressIndexing2):	FAILED
              [junit] expected:<5> but was:<4>
              [junit] junit.framework.AssertionFailedError: expected:<5> but was:<4>
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:278)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:271)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:115)
          
          Jason Rutherglen added a comment -

          Putting a sync-on-DW block around the bulk of the segment alterations in IW commitMerge seems to have quelled the TestStressIndexing2 test failures. Nice.

          Jason Rutherglen added a comment -

          Here's a checkpoint patch before I re-add lastSegmentInfo/Index. All tests pass except for what's below. I'm guessing segments with all docs deleted are being dropped before the test expects.

          [junit] Testcase: testCommitThreadSafety(org.apache.lucene.index.TestIndexWriter):	FAILED
              [junit] 
              [junit] junit.framework.AssertionFailedError: 
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
              [junit] 	at org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety(TestIndexWriter.java:4699)
              [junit] 
              [junit] 
              [junit] Testcase: testCommitThreadSafety(org.apache.lucene.index.TestIndexWriter):	FAILED
              [junit] Some threads threw uncaught exceptions!
              [junit] junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
              [junit] 	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:437)
              [junit] 
              [junit] 
              [junit] Tests run: 116, Failures: 2, Errors: 0, Time elapsed: 159.577 sec
              [junit] 
              [junit] ------------- Standard Output ---------------
              [junit] NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testCommitThreadSafety -Dtests.seed=1826133140332330367:8102643307925777745
              [junit] NOTE: test params are: codec=MockFixedIntBlock(blockSize=564), locale=es_CR, timezone=Asia/Urumqi
              [junit] ------------- ---------------- ---------------
              [junit] ------------- Standard Error -----------------
              [junit] The following exceptions were thrown by threads:
              [junit] *** Thread: Thread-1106 ***
              [junit] java.lang.RuntimeException: java.lang.AssertionError: term=f:0_8; r=DirectoryReader(_0:c1  _1:c1  _2:c1  _3:c1  _4:c1  _5:c1  _6:c1  _7:c2  _8:c4 ) expected:<1> but was:<0>
              [junit] 	at org.apache.lucene.index.TestIndexWriter$9.run(TestIndexWriter.java:4690)
              [junit] Caused by: java.lang.AssertionError: term=f:0_8; r=DirectoryReader(_0:c1  _1:c1  _2:c1  _3:c1  _4:c1  _5:c1  _6:c1  _7:c2  _8:c4 ) expected:<1> but was:<0>
              [junit] 	at org.junit.Assert.fail(Assert.java:91)
              [junit] 	at org.junit.Assert.failNotEquals(Assert.java:645)
              [junit] 	at org.junit.Assert.assertEquals(Assert.java:126)
              [junit] 	at org.junit.Assert.assertEquals(Assert.java:470)
              [junit] 	at org.apache.lucene.index.TestIndexWriter$9.run(TestIndexWriter.java:4684)
              [junit] NOTE: all tests run in this JVM:
              [junit] [TestMockAnalyzer, TestByteSlices, TestFilterIndexReader, TestIndexFileDeleter, TestIndexReaderClone, TestIndexReaderReopen, TestIndexWriter]
          
          Jason Rutherglen added a comment -

          I placed (for now) the segment deletes directly into the segment info object. There are applied terms/queries sets which are checked against when apply deletes all is called. All tests pass except for TestTransactions and TestPersistentSnapshotDeletionPolicy, and only because of an assertion check I added: that the last segment info is in fact in the newly pushed segment infos. I think in both cases segment infos is being altered in IW in a place where the segment infos isn't being pushed yet. I wanted to checkpoint this though, as it's working fairly well at this point, including the last segment info/index, which can be turned on or off via a static variable.

          Jason Rutherglen added a comment -

          Everything passes except for tests that involve IW rollback. We need to be able to roll back the last segment info/index in DW; however, I'm not sure how we want to do that quite yet.

          Jason Rutherglen added a comment -

          In DW abort (called by IW rollbackInternal) we should be able to simply clear all per-segment pending deletes. However, I'm not sure we can do that: if we have applied deletes for a merge and then we roll back, we can't undo those deletes, thereby breaking our current rollback model?

          Jason Rutherglen added a comment -

          Here's an uncleaned-up cut with all tests passing. I nulled out
          the lastSegmentInfo on abort, which fixes my own assertion
          that was causing the rollback tests to fail. I don't know yet
          whether this is cheating just to get the tests to pass.

          Jason Rutherglen added a comment -

          I'm running test-core multiple times and am seeing some lurking test
          failures (thanks to the randomized tests that have been recently added).
          I'm guessing they're related to the syncs on IW and DW not being in "sync"
          some of the time.

          I will clean up the patch so that others may properly review it and
          hopefully we can figure out what's going on.

          Jason Rutherglen added a comment -

          Here's a cleaned up patch, please take a look. I ran 'ant test-core' 5 times with no failures, however running the below several times does eventually produce a failure.

          ant test-core -Dtestcase=TestThreadedOptimize -Dtestmethod=testThreadedOptimize -Dtests.seed=1547315783637080859:5267275843141383546

          ant test-core -Dtestcase=TestIndexWriterMergePolicy -Dtestmethod=testMaxBufferedDocsChange -Dtests.seed=7382971652679988823:-6672235304390823521

          Jason Rutherglen added a comment -

          The problem could be that IW deleteDocument is not synced on IW;
          when I tried adding the sync, there was deadlock, perhaps from DW
          waitReady. We could be adding pending deletes to segments that
          are not quite current because we're not adding them in an IW
          sync block.

          Jason Rutherglen added a comment -

          Ok, TestThreadedOptimize works when the DW sync'ed pushSegmentInfos method
          isn't called anymore (no extra per-segment deleting is going on), and stops
          working when pushSegmentInfos is turned back on. Something about the sync
          on DW is causing a problem. Hmm... We need another way to pass segment
          infos around consistently.

          Jason Rutherglen added a comment -

          I think I've taken LUCENE-2680 as far as I can, though I'll
          probably add some more assertions in there for good measure,
          such as whether or not a delete has in fact been applied etc. It
          seems to be working though again I should add more assertions to
          that effect. I think there's a niggling sync issue in there as
          TestThreadedOptimize only fails when I try to run it 100s of
          times. I think the sync on DW is causing a wait notify to be
          missed or skipped or something like that, as occasionally the
          isOptimized call fails as well. This is likely related to the
          appearance of deletes not being applied to segment(s) as
          evidenced by the difference in the actual doc count and the
          expected doc count.

          Below is the most common assertion failure. Maybe I should
          upload my patch that includes a method that iterates 200 times
          on testThreadedOptimize?

          [junit] ------------- ---------------- ---------------
              [junit] Testsuite: org.apache.lucene.index.TestThreadedOptimize
              [junit] Testcase: testThreadedOptimize(org.apache.lucene.index.TestThreadedOptimize):	FAILED
              [junit] expected:<248> but was:<266>
              [junit] junit.framework.AssertionFailedError: expected:<248> but was:<266>
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
              [junit] 	at org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:119)
              [junit] 	at org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:141)
              [junit] 
              [junit] 
              [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.748 sec
          
          Jason Rutherglen added a comment -

          I think I've isolated this test failure to recording the applied deletes.
          Because we're using last segment index/info, I was adding deletes that may
          or may not have been applied to a particular segment to the last segment
          info. I'm not sure what to do in this case: if we record the applied
          terms per segment but keep the pending terms in last segment info, we're
          effectively not gaining anything from using last segment info, because
          we're then recording all of the terms per-segment anyway. In fact, this
          is how I isolated the issue: I simply removed the usage of
          last segment info and instead went back to maintaining pending deletes
          per-segment. I'll give it some thought.

          In conclusion, when deletes are recorded per-segment with no last segment
          info, the test passes after 200 times.

          Jason Rutherglen added a comment -

          Alright, we needed to clone the per-segment pending deletes in
          _mergeInit prior to the merge, like cloning the SRs. Other terms were
          arriving after the deletes were applied for a merge, and then the
          coalescing of applied deletes was incorrect. I believe that this was the
          remaining lingering issue. The previous failures seem to have gone away;
          I ran the test 400 times. I'll upload a new patch shortly.

          Jason Rutherglen added a comment -

          I'm still seeing the error no matter what I do. Sometimes the index is not
          optimized, and sometimes there are too many docs. It requires thousands of
          iterations to provoke either test error. Perhaps it's simply related to
          merges that are scheduled but that IW close isn't waiting on properly.

          Michael McCandless added a comment -

          TestThreadedOptimize is a known intermittent failure – I'm trying to track it down!! (LUCENE-2618)

          Jason Rutherglen added a comment -

          Ah, nice, I should have looked for previous intermittent failures via Jira.

          Jason Rutherglen added a comment -

          Now that the intermittent failures have been successfully dealt with, ie,
          LUCENE-2618, LUCENE-2576, and LUCENE-2118, I'll merge this patch to trunk,
          then it's probably time for benchmarking. That'll probably include
          something like indexing, then updating many documents and comparing the
          index time vs. trunk?

          Jason Rutherglen added a comment -

          Straight indexing and deleting will probably not show much of an
          improvement from this patch. In trunk, apply deletes (all) is called on
          all segments prior to a merge, so we need a synthetic way to measure the
          improvement. One way is to monitor the merge time of small segments (of an
          index with many deletes, and many existing large segments) with this patch
          vs. trunk. This'll show that this patch in that case is faster (because
          we're only applying deletes to the smaller segments).

          I think I'll add a merge start time variable to OneMerge that'll be set in
          mergeInit. The var could also be useful for the info stream debug log. The
          benchmark will simply print out the merge times (which'll be manufactured
          synthetically).
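
          For the timing side, a minimal sketch of that bookkeeping (field and
          method names here are assumptions, not the OneMerge API):

            // Hypothetical sketch: record when a merge is initialized and report
            // the elapsed wall-clock time when it finishes, for the benchmark
            // and/or the infoStream debug log.
            class MergeTiming {
              private volatile long startNanos;

              void onMergeInit() {              // would be called from mergeInit
                startNanos = System.nanoTime();
              }

              long elapsedMillis() {            // logged when the merge completes
                return (System.nanoTime() - startNanos) / 1_000_000;
              }
            }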

          Michael McCandless added a comment -

          Why do we still have deletesFlushed? And why do we still need to
          remap docIDs on merge? I thought with this new approach the docIDUpto
          for each buffered delete Term/Query would be a local docID to that
          segment?

          On flush the deletesInRAM should be carried directly over to the
          segmentDeletes, and there shouldn't be a deletesFlushed?

          A few other small things:

          • You can use SegmentInfos.clone to copy the segment infos? (it
            makes a deep copy)
          • SegmentDeletes.clearAll() need not iterate through the
            terms/queries to subtract the RAM used? Ie just multiply by
            .size() instead and make one call to deduct RAM used?
          • The SegmentDeletes use less than BYTES_PER_DEL_TERM because it's a
            simple HashSet not a HashMap? Ie we are over-counting RAM used
            now? (Same for by query)
          • Can we store segment's deletes elsewhere? The SegmentInfo should
            be a lightweight class... eg it's used by DirectoryReader to read
            the index, and if it's read only DirectoryReader there's no need
            for it to allocate the SegmentDeletes? These data structures
            should only be held by IndexWriter/DocumentsWriter.
          • Do we really need to track appliedTerms/appliedQueries? Ie is
            this just an optimization so that if the caller deletes by the
            Term/Query again we know to skip it? Seems unnecessary if that's
            all...
          Michael McCandless added a comment -

          Also: why are we tracking the last segment info/index? Ie, this should only be necessary on cutover to DWPT right? Because effectively today we have only a single "DWPT"?

          Jason Rutherglen added a comment -

          Why do we still have deletesFlushed? And why do we still need to
          remap docIDs on merge? I thought with this new approach the docIDUpto for
          each buffered delete Term/Query would be a local docID to that
          segment?

          Deletes flushed can be removed if we store the docid-upto per segment.
          Then we'll go back to having a hash map of deletes.

          The SegmentDeletes use less than BYTES_PER_DEL_TERM because it's a
          simple HashSet not a HashMap? Ie we are over-counting RAM used now? (Same
          for by query)

          Intuitively, yes, however here's the constructor of hash set:

           public HashSet() { map = new HashMap<E,Object>(); } 

          why are we tracking the last segment info/index?

          I thought last segment was supposed to be used to mark the last segment of
          a commit/flush. This way we save on the hash(set,map) space for the
          segments up to the last segment when the commit occurred.

          Can we store segment's deletes elsewhere?

          We can; however, I had to minimize places in the code that were potentially
          causing errors (trying to reduce the problem set, which helped locate the
          intermittent exceptions), and syncing segment infos with the per-segment
          deletes was one of those places. That, and I thought it'd be worth
          a try to simplify (at the expense of breaking the unstated intention of the
          SI class).

          Do we really need to track appliedTerms/appliedQueries? Ie is this
          just an optimization so that if the caller deletes by the Term/Query again
          we know to skip it?

          Yes to the 2nd question. Why would we want to try deleting multiple times?
          The cost is the terms dictionary lookup which you're saying is in the
          noise? I think potentially cracking open a query again could be costly in
          cases where the query is indeed expensive.

          not iterate through the terms/queries to subtract the RAM
          used?

          Well, the RAM usage tracking can't be completely defined until we finish
          how we're storing the terms/queries.

          Michael McCandless added a comment -

          Deletes flushed can be removed if we store the docid-upto per segment.
          Then we'll go back to having a hash map of deletes.

          I think we should do this?

          Ie, each flushed segment stores the map of del Term/Query to
          docid-upto, where that docid-upto is private to the segment (no
          remapping on merges needed).

          When it's time to apply deletes to about-to-be-merged segments, we
          must apply all "future" segments deletions unconditionally to each
          segment, and then conditionally (respecting the local docid-upto)
          apply that segment's deletions.

          Intuitively, yes, however here's the constructor of hash set:

          public HashSet() { map = new HashMap<E,Object>(); }
          

          Ugh I forgot about that. Is that still true? That's awful.

          why are we tracking the last segment info/index?

          I thought last segment was supposed to be used to mark the last segment of
          a commit/flush. This way we save on the hash(set,map) space on the
          segments up to the last segment when the commit occurred.

          Hmm... I think lastSegment was needed only for the multiple DWPT
          case, to record the last segment already flushed in the index as of
          when that DWPT was created. This is so we know "going back" when we
          can start unconditionally applying the buffered delete term.

          With the single DWPT we effectively have today isn't last segment
          always going to be what we just flushed? (Or null if we haven't yet
          done a flush in the current session).

          Do we really need to track appliedTerms/appliedQueries? Ie is this just an optimization so that if the caller deletes by the Term/Query again we know to skip it?

          Yes to the 2nd question. Why would we want to try deleting multiple times?
          The cost is the terms dictionary lookup which you're saying is in the
          noise? I think potentially cracking open a query again could be costly in
          cases where the query is indeed expensive.

          I'm saying this is unlikely to be a worthwhile way to spend RAM.

          EG most apps wouldn't delete by same term again, like they'd
          "typically" go and process a big batch of docs, deleting by an id
          field and adding the new version of the doc, where a given id is seen
          only once in this session, and then IW is committed/closed?
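
          A minimal sketch of this scheme, with hypothetical names (not Lucene's
          actual classes): each flushed segment keeps its own term -> docid-upto map
          using segment-local docIDs, deletes buffered by "future" segments apply
          unconditionally, and the segment's own deletes respect their docid-upto.

          import java.util.HashMap;
          import java.util.Map;

          class PerSegmentDeletesSketch {

            // One segment's buffered deletes: term -> local docid-upto (the delete
            // only applies to docs with local docID < upto).
            static class Deletes {
              final Map<String,Integer> terms = new HashMap<String,Integer>();
            }

            // Stand-in for a pooled SegmentReader that deletes matching docs below a limit.
            interface SegmentView {
              void deleteDocsMatching(String term, int docIdLimitExclusive);
            }

            // Apply buffered deletes to a segment that is about to be merged.
            static void applyDeletes(SegmentView reader, Deletes own, Iterable<Deletes> futureSegments) {
              for (Deletes future : futureSegments) {
                for (String term : future.terms.keySet()) {
                  reader.deleteDocsMatching(term, Integer.MAX_VALUE);  // unconditional
                }
              }
              for (Map.Entry<String,Integer> e : own.terms.entrySet()) {
                reader.deleteDocsMatching(e.getKey(), e.getValue());   // respects local docid-upto
              }
            }
          }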

          Jason Rutherglen added a comment -

          DWPT deletes has perhaps confused this issue a little bit.

          Tracking per-segment would be easier but I worry about indices that
          have large numbers of segments... eg w/ a large mergeFactor and frequent
          flushing you can get very many segments.

          I think we may be backtracking here, as I had earlier proposed we simply
          store each term/query in a map per segment, however I think that was nixed
          in favor of last segment + deletes per segment afterwards. We're not
          worried about the cost of storing pending deletes in a map per segment
          anymore?

          With the single DWPT we effectively have today isn't last segment
          always going to be what we just flushed? (Or null if we haven't yet done a
          flush in the current session).

          Pretty much.

          EG most apps wouldn't delete by same term again, like they'd
          "typically" go and process a big batch of docs, deleting by an id field
          and adding the new version of the doc, where a given id is seen only once
          in this session, and then IW is committed/closed?

          In an extreme RT app that uses Lucene like a database, it could in fact
          update a doc many times, then we'd start accumulating and deleting the
          same ID over and over again. However in the straight batch indexing model
          outlined, that is unlikely to happen.

          When it's time to apply deletes to about-to-be-merged segments, we
          must apply all "future" segments deletions unconditionally to each
          segment, and then conditionally (respecting the local docid-upto) apply
          that segment's deletions.

          I'll use this as the go-ahead design then.

          Is that still true?

          That's from Java 1.6.

          Jason Rutherglen added a comment -

          Additionally we need to decide how accounting'll work for
          maxBufferedDeleteTerms. We won't have a centralized place to keep track of
          the number of terms, and the unique term count in aggregate over many
          segments could be a little too time-consuming to calculate in a method like
          doApplyDeletes. An alternative is to maintain a global unique term count,
          such that when a term is added, every other per-segment deletes is checked
          for that term, and if it's not already been tallied, we increment the number
          of buffered terms.

          Michael McCandless added a comment -

          I think we may be backtracking here, as I had earlier proposed we simply
          store each term/query in a map per segment, however I think that was nixed
          in favor of last segment + deletes per segment afterwards. We're not
          worried about the cost of storing pending deletes in a map per segment
          anymore?

          OK sorry now I remember.

          Hmm but, my objection then was to carrying all deletes backward to all
          segments?

          Whereas now I think what we can do is only record the deletions that
          were added when that segment was a RAM buffer, in its pending deletes
          map? This should be fine, since we aren't storing a single deletion
          in multiple places (well, until DWPTs anyway). It's just that on
          applying deletes to a segment because it's about to be merged we have
          to do a merge sort of the buffered deletes of all "future" segments.

          BTW it could also be possible to not necessarily apply deletes when a
          segment is merged; eg if there are few enough deletes it may not be
          worthwhile. But we can leave that to another issue.

          Additionally we need to decide how accounting'll work for
          maxBufferedDeleteTerms. We won't have a centralized place to keep track of
          the number of terms, and the unique term count in aggregate over many
          segments could be a little too time-consuming to calculate in a method like
          doApplyDeletes. An alternative is to maintain a global unique term count,
          such that when a term is added, every other per-segment deletes is checked
          for that term, and if it's not already been tallied, we increment the number
          of buffered terms.

          Maybe we should change the definition to be total number of pending
          delete term/queries? (Ie, not dedup'd across segments). This seems
          reasonable since w/ this new approach the RAM consumed is in
          proportion to that total number and not to dedup'd count?
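
          A rough sketch of that "total, not dedup'd" accounting (class and method
          names here are invented, not IndexWriter's real API): one counter is bumped
          for every buffered delete Term/Query regardless of segment, and a flush is
          triggered once it crosses maxBufferedDeleteTerms.

          import java.util.concurrent.atomic.AtomicInteger;

          class DeleteCountSketch {
            private final AtomicInteger numPendingDeletes = new AtomicInteger();
            private final int maxBufferedDeleteTerms;

            DeleteCountSketch(int maxBufferedDeleteTerms) {
              this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
            }

            // called whenever a delete Term/Query is buffered against any segment
            boolean addAndCheckFlush(int count) {
              return maxBufferedDeleteTerms > 0
                  && numPendingDeletes.addAndGet(count) >= maxBufferedDeleteTerms;
            }

            // called after a segment's pending deletes have been applied and cleared
            void clear(int count) {
              numPendingDeletes.addAndGet(-count);
            }
          }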

          Jason Rutherglen added a comment -

          Maybe we should change the definition to be total number of pending
          delete term/queries?

          Let's go with this, as even though we could record the total unique term
          count, the approach outlined is more conservative.

          I think what we can do is only record the deletions that were added
          when that segment was a RAM buffer, in its pending deletes map

          Ok, sounds like a design that'll work well.

          Jason Rutherglen added a comment -

          flushDeletes == true means that all deletes are applied; when it's false, we move the pending deletes into the newly flushed segment as-is, with no docId-upto remapping.

          Jason Rutherglen added a comment -

          We can "upgrade" to an int[] from an ArrayList<Integer> for the aborted docs.

          Jason Rutherglen added a comment -

          I'm seeing the following error, which is probably triggered by the new per-segment deletes code, but could also be related to the recent CFS format changes?

          MockDirectoryWrapper: cannot close: there are still open files: {_0.cfs=1, _1.cfs=1}
              [junit] java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_0.cfs=1, _1.cfs=1}
              [junit] 	at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:395)
              [junit] 	at org.apache.lucene.index.TestIndexReader.testReopenChangeReadonly(TestIndexReader.java:1717)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:921)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:859)
              [junit] Caused by: java.lang.RuntimeException: unclosed IndexInput
              [junit] 	at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:350)
              [junit] 	at org.apache.lucene.store.Directory.openInput(Directory.java:138)
              [junit] 	at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:67)
              [junit] 	at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:121)
              [junit] 	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:527)
              [junit] 	at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:628)
              [junit] 	at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:603)
              [junit] 	at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1081)
              [junit] 	at org.apache.lucene.index.IndexWriter.applyDeletesAll(IndexWriter.java:4300)
              [junit] 	at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3440)
              [junit] 	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3276)
              [junit] 	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3266)
              [junit] 	at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3131)
              [junit] 	at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3206)
          
          Jason Rutherglen added a comment -

          In trying to implement the per-segment deletes I encountered this error
          from TestStressIndexing2. So I started over with a new checkout of trunk,
          and started slowly adding in the per-segment code, running
          TestStressIndexing2 as each part was added. The attached patch breaks,
          though the deletes are still using the old code. There's clearly some kind
          of synchronization issue, though nothing esoteric has been added, yikes.

          [junit] Testcase: testMultiConfigMany(org.apache.lucene.index.TestStressIndexing2):	FAILED
              [junit] expected:<20> but was:<19>
              [junit] junit.framework.AssertionFailedError: expected:<20> but was:<19>
          
          Jason Rutherglen added a comment -

          Running TestStressIndexing2 500 times on trunk causes this error which is probably intermittent:

          [junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
              [junit] Testcase: testMultiConfigMany(org.apache.lucene.index.TestStressIndexing2):	Caused an ERROR
              [junit] Array index out of range: 0
              [junit] java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 0
              [junit] 	at java.util.Vector.get(Vector.java:721)
              [junit] 	at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1049)
              [junit] 	at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4291)
              [junit] 	at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3444)
              [junit] 	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3279)
              [junit] 	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3269)
              [junit] 	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1760)
              [junit] 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1723)
              [junit] 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1687)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.indexRandom(TestStressIndexing2.java:233)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:123)
              [junit] 	at org.apache.lucene.index.TestStressIndexing2.testMultiConfigMany(TestStressIndexing2.java:97)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
          
          Jason Rutherglen added a comment -

          The above isn't on trunk, I misread the screen.

          Jason Rutherglen added a comment -

          I've isolated the mismatch in num docs between the CMS vs. SMS generated
          indexes to applying the deletes to the merging segments (whereas currently
          we were/are not applying deletes to merging segments and
          TestStressIndexing2 passes). Assuming the deletes are being applied
          correctly to the merging segments, perhaps the logic of gathering up
          forward segment deletes is incorrect somehow in the concurrent merge case.
          When deletes were held in a map per segment, this test was passing.

          Jason Rutherglen added a comment -

          A test to see whether the problem is the per-segment forward-deletes logic is to iterate over the deletesFlushed map, using the docid-upto to stay within the boundaries of the segment(s) being merged.

          Michael McCandless added a comment -

          What a nice small patch

          I think the getDeletesSegmentsForward shouldn't be folding in the
          deletesInRAM? Ie, that newly flushed info will have carried over the
          previous deletes in RAM?

          I think pushDeletes/pushSegmentDeletes should be merged, and we should
          nuke DocumentsWriter.deletesFlushed? Ie, we should push directly from
          deletesInRAM to the new SegmentInfo? EG you are now pushing all
          deletesFlushed into the new SegmentInfo when actually you should only
          push the deletes for that one segment.

          We shouldn't do the remap deletes anymore. We can remove
          DocumentsWriter.get/set/updateFlushedDocCount too.

          Hmm... so what are we supposed to do if someone opens IW, does a bunch
          of deletes, then commits? Ie flushDocs is false, so there's no new
          SegmentInfo. I think in this case we can stick the deletes against
          the last segment in the index, with the docidUpto set to the maxDoc()
          of that segment?
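
          A sketch of that last case (names and the map-based representation are
          hypothetical): when a commit carries only deletes and no new segment,
          attach them to the last existing segment with docid-upto set to its
          maxDoc(), so they apply to every doc already in that segment.

          import java.util.HashMap;
          import java.util.Map;

          class PushToLastSegmentSketch {
            static void push(Map<String,Integer> deletesInRAM,                  // term -> docid-upto
                             Map<String,Map<String,Integer>> perSegmentDeletes, // segment name -> deletes
                             String lastSegmentName,
                             int lastSegmentMaxDoc) {
              Map<String,Integer> dest = perSegmentDeletes.get(lastSegmentName);
              if (dest == null) {
                dest = new HashMap<String,Integer>();
                perSegmentDeletes.put(lastSegmentName, dest);
              }
              for (String term : deletesInRAM.keySet()) {
                dest.put(term, lastSegmentMaxDoc);  // applies to all docs in the last segment
              }
              deletesInRAM.clear();
            }
          }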

          Jason Rutherglen added a comment -
          • I added pushDeletesLastSegment to doc writer
          • Deletes flushed is gone, only deletesInRAM exists
          • In the apply merge deletes case, won't we want to add deletesInRAM in
            the getForwardDeletes method?
          • The TestStressIndexing2 test still fails so there is still something
            incorrect.
          • Though for the failing unit test it does not matter, we need to figure
            out a solution for the pending doc id deletions, eg, they can't simply be
            transferred around; they probably need to be applied as soon as possible.
            Otherwise they require remapping.
          Michael McCandless added a comment -

          In the apply merge deletes case, won't we want to add deletesInRAM in the getForwardDeletes method?

          No, we can't add those deletes until the current buffered segment is successfully flushed.

          Eg, say the segment hits a disk full on flush, and DocsWriter aborts (discards all buffered docs/deletions from that segment). If we included these deletesInRAM when applying deletes then suddenly the app will see that some deletes were applied yet the added documents were not. So on disk full during flush, calls to .updateDocument may wind up deleting the old doc but not adding the new one.

          So we need to keep them segregated for proper error case semantics.

          Though for the failing unit test it does not matter, we need to figure
          out a solution for the pending doc id deletions, eg, they can't simply be
          transferred around; they probably need to be applied as soon as possible.
          Otherwise they require remapping.

          Hmm why must we remap? Can't we carry these buffered deleteByDocIDs along with the segment? The docIDs would be the segment's docIDs (ie no base added) so no shifting is needed?

          These deleted docIDs would only apply to the current segment, ie would not be included in getForwardDeletes?
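
          A sketch of those error-case semantics (all names invented): the in-RAM
          deletes are pushed onto the newly flushed segment only after the flush
          succeeds; on an aborted flush the buffered docs and their deletes are
          discarded together.

          import java.io.IOException;
          import java.util.HashMap;
          import java.util.Map;

          class FlushSemanticsSketch {
            interface RamSegment {
              String flushToDisk() throws IOException;  // returns the new segment's name
              void abort();                             // discard the buffered docs
            }

            private Map<String,Integer> deletesInRAM = new HashMap<String,Integer>();
            private final Map<String,Map<String,Integer>> perSegmentDeletes =
                new HashMap<String,Map<String,Integer>>();

            void flush(RamSegment ram) throws IOException {
              boolean success = false;
              try {
                String segmentName = ram.flushToDisk();
                perSegmentDeletes.put(segmentName, deletesInRAM);  // push only on success
                success = true;
              } finally {
                if (!success) {
                  ram.abort();                                     // deletes die with the docs
                }
                deletesInRAM = new HashMap<String,Integer>();      // fresh buffer either way
              }
            }
          }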

          Jason Rutherglen added a comment -

          I made the changes listed above, ie, docids and deletesInRAM aren't included in getForwardDeletes. However TestStressIndexing2 still fails, and the numbers are still off significantly which probably indicates it's not a synchronization issue.

          Michael McCandless added a comment -

          So nice to see remapDeletes deleted!

          • Don't forget to remove DocumentsWriter.get/set/updateFlushedDocCount too.
          • Can you move the deletes out of SegmentInfo? We can just use a
            Map<SegmentInfo,BufferedDeletes>? But remember to delete segments
            from the map once we commit the merge...
          • I think DocsWriter shouldn't hold onto the SegmentInfos; we should
            pass it in to only those methods that need it. That SegmentInfos
            is protected under IW's monitor so it makes me nervous if it's
            also a member on DW.
          • Hmm we're no longer accounting for RAM usage of per-segment
            deletes? I think we need an AtomicInt, which we incr w/ RAM used
            on pushing deletes into a segment, and decr on clearing?
          • The change to the message(...) in DW.applyDeletes is wrong (ie
            switching to deletesInRAM); I think we should just remove the
            details, ie so it says "applying deletes on N segments"? But then
            add a more detailed message per-segment with the aggregated
            (forward) deletes details?
          • I think we should move this delete handling out of DW as much as
            possible... that's really IW's role (DW is "about" flushing the
            next segment, not tracking details associated with all other
            segments in the index)
          • Instead of adding pushDeletesLastSegment, can we just have IW call
            pushDeletes(lastSegmentInfo)?
          • Calling .getForwardDeletes inside the for loop iterating over the
            infos is actually O(N^2) cost, and it could matter for
            delete-intensive many-segment indices. Can you change this,
            instead, to walk the infos backwards, incrementally building up
            the forward deletes to apply to each segment by adding in that
            info's deletions? (See the sketch after this list.)
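
          A sketch of that backwards walk, using an illustrative map-based
          representation rather than the real classes: walk the infos newest to
          oldest, apply the coalesced "forward" deletes unconditionally and the
          segment's own deletes with their docid-upto, then fold the segment's own
          deletes into the coalesced set for the older segments.

          import java.util.Collections;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          class BackwardsWalkSketch {
            interface DeleteApplier {
              void apply(String segment,
                         Map<String,Integer> unconditional,      // from younger segments
                         Map<String,Integer> ownWithDocIdUpto);  // this segment's own deletes
            }

            static void applyAll(List<String> segmentsOldestFirst,
                                 Map<String,Map<String,Integer>> perSegmentDeletes,
                                 DeleteApplier applier) {
              Map<String,Integer> coalesced = new HashMap<String,Integer>();
              for (int i = segmentsOldestFirst.size() - 1; i >= 0; i--) {
                String segment = segmentsOldestFirst.get(i);
                Map<String,Integer> own = perSegmentDeletes.get(segment);
                if (own == null) {
                  own = Collections.emptyMap();
                }
                applier.apply(segment, coalesced, own);  // forward deletes built incrementally
                coalesced.putAll(own);                   // now "forward" for all older segments
              }
            }
          }
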
          Jason Rutherglen added a comment -

          I guess you think the sync on doc writer is the cause of the
          TestStressIndexing2 unit test failure?

          I think we should move this delete handling out of DW

          I agree, I originally took this approach however unit tests were failing
          when segment infos was passed directly into the apply deletes method(s).
          This'll be the 2nd time however apparently the 3rd time's the charm.

          I'll make the changes and cross my fingers.

          Jason Rutherglen added a comment -

          I started on taking the approach of moving deletes to a SegmentDeletes class
          that's a member of IW. Removing DW's addDeleteTerm is/was fairly trivial.

          In moving deletes out of DW, how should we handle the bufferDeleteTerms sync on
          DW and the containing waitReady? The purpose of BDT is to check if RAM
          consumption has reached its peak, and if so, balance out the RAM usage and/or
          flush pending deletes that are consuming RAM. This is probably why deletes are
          intertwined with DW. We could change DW's BDT method though I'm loathe to
          change the wait logic of DW for fear of causing a ripple effect of inexplicable
          unit test failures elsewhere.

          Michael McCandless added a comment -

          I guess you think the sync on doc writer is the cause of the
          TestStressIndexing2 unit test failure?

          I'm not sure what's causing the failure, but, I think getting the net approach roughly right is the first goal, and then we see what's failing.

          I think we should move this delete handling out of DW

          I agree, I originally took this approach however unit tests were failing
          when segment infos was passed directly into the apply deletes method(s).
          This'll be the 2nd time however apparently the 3rd time's the charm.

          Not only moving the SegmentInfos out of DW as a member, but also move all the applyDeletes logic out. Ie it should be IW that pulls readers from the pool, walks the merged del term/queries/per-seg docIDs and actually does the deletion.

          In moving deletes out of DW, how should we handle the bufferDeleteTerms sync on DW and the containing waitReady?

          I think all the bufferDeleteX would move into IW, and timeToFlushDeletes. The RAM accounting can be done fully inside IW.

          The waitReady(null) is there so that DW.pauseAllThreads also pauses any threads doing deletions. But, in moving these methods to IW, we'd make them sync on IW (they are now sync'd on DW), which takes care of pausing these threads.

          Jason Rutherglen added a comment -

          The waitReady(null) is there so that DW.pauseAllThreads also pauses any threads doing deletions

          waitReady is used in getThreadState as well as bufferDeleteX, we may need to redundantly add it to SegmentDeletes? Maybe not. We'll be sync'ing on IW when adding deletions, that seems like it'll be OK.

          in moving these methods to IW, we'd make them sync on IW (they are now sync'd on DW), which takes care of pausing these threads

          Because we're sync'ing on IW we don't need to pause the indexing threads? Ok this is because doFlushX is sync'd on IW.

          The RAM accounting can be done fully inside IW.

          Well, inside of SegmentDeletes.

          Jason Rutherglen added a comment -

          This patch separates out most of the deletes storage and processing into a
          SegmentDeletes class. The segments are walked backwards to coalesce the pending
          deletions. I think/hope this logic is correct.

          updateDocument is a bit tricky, as we cannot sync on IW to ensure correctness
          of del term addition, nor can we really sync inside of DW without probably
          causing deadlock. When I simply delete the doc after adding it in IW,
          TestStressIndexing2 fails miserably with 0 num docs.

          Jason Rutherglen added a comment -

          Now we're returning the DocumentsWriterThreadState to IW to record the exact doc id as the del term limit. TestStressIndexing2 still fails, but by a much smaller margin (eg off by 1), and it actually passes some of the time.

          Jason Rutherglen added a comment -

          Here's a random guess: because with this patch we're sometimes applying
          deletes multiple times, whereas before we applied all of them and cleared
          them out at once, there's a mismatch in terms of over/under-applying
          deletes. Oddly, when deletes are applied in _mergeInit on all segments vs.
          only on the segments being merged, the former has a much higher success
          rate. This is strange because all deletes will have been applied by the
          time commit/getReader is called anyway.

          Michael McCandless added a comment -

          OK I started from the last patch and iterated quite a bit.

          The per-segment deletes are now working! All tests pass. It turned
          out to be a little more hairy than we thought because you also must
          account for deletes that are pushed backwards onto a segment being
          merged (eg due to a flush, or a merge just ahead) while that merge is
          still running.

          I swapped the two classes – SegmentDeletes tracks deletes for one
          segment, while BufferedDeletes tracks deletes for all segments.

          I also wound up doing a fair amount of cleanup/refactoring on how
          DW/IW interact, I think a good step towards DWPT. EG IW's flush is
          now much smaller (could almost become unsync'd) since I moved
          flushing doc stores, building CFS, etc. down into DW.

          I added a new FlushControl class to manage (external to DW) triggering
          of flushing due to RAM, add doc count, buffered del count. This way
          DWPT can share the single FlushControl instance in IW.
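
          This is not the actual FlushControl class from the patch, just a hedged
          sketch of the idea: a single synchronized policy object, shared by
          IndexWriter (and eventually by each DWPT), that decides when a flush is
          needed based on RAM use, buffered doc count, and buffered delete count.

          class FlushControlSketch {
            private final long ramBufferBytes;
            private final int maxBufferedDocs;
            private final int maxBufferedDeleteTerms;

            private long ramUsed;
            private int docCount;
            private int deleteCount;

            FlushControlSketch(long ramBufferBytes, int maxBufferedDocs, int maxBufferedDeleteTerms) {
              this.ramBufferBytes = ramBufferBytes;
              this.maxBufferedDocs = maxBufferedDocs;
              this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
            }

            synchronized boolean addDoc(long docBytes) {
              ramUsed += docBytes;
              docCount++;
              return flushNeeded();
            }

            synchronized boolean addDelete(long deleteBytes) {
              ramUsed += deleteBytes;
              deleteCount++;
              return flushNeeded();
            }

            synchronized void clearAfterFlush() {
              ramUsed = 0;
              docCount = 0;
              deleteCount = 0;
            }

            private boolean flushNeeded() {
              return (ramBufferBytes > 0 && ramUsed >= ramBufferBytes)
                  || (maxBufferedDocs > 0 && docCount >= maxBufferedDocs)
                  || (maxBufferedDeleteTerms > 0 && deleteCount >= maxBufferedDeleteTerms);
            }
          }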

          Jason Rutherglen added a comment -

          When applying the patch there are errors in IndexWriter.

          Michael McCandless added a comment -

          Woops, sorry about that Jason – I wasn't fully updated. Try this one?

          Jason Rutherglen added a comment -

          We're close, I think SegmentDeletes is missing?

          Michael McCandless added a comment -

          Ugh, OK new patch. 3rd time's a charm?

          Jason Rutherglen added a comment -

          The patch applied.

          Ok, a likely cause of the TestStressIndexing2 failures was that when we're
          flushing deletes to the last segment (because a segment isn't being flushed),
          we needed to move deletes also to the newly merged segment?

          In the patch we've gone away from sync'ing on IW when deleting, which was a
          challenge because we needed the sync on DW to properly wait on flushing threads
          etc.

          Jason Rutherglen added a comment -

          All tests pass.

          Michael McCandless added a comment -

          Ok, a likely cause of the TestStressIndexing2 failures was that when we're
          flushing deletes to the last segment (because a segment isn't being flushed),
          we needed to move deletes also to the newly merged segment?

          Right, and also the case where a merge just-ahead of you kicks off and dumps its merged deletes onto you.

          Michael McCandless added a comment -

          OK I committed this to trunk.

          Since it's a biggish change I'll hold off on back-porting to 3.x for now... let's let hudson chew on it some first.

          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Roman Alekseenkov added a comment -

          Hey, is it something that was ported to 3.x, or not really?

          Robert Muir added a comment -

          Hi, this was backported as of Lucene 3.1.

          Roman Alekseenkov added a comment -

          thank you, Robert

          I was asking because we are having issues with 3.4.0, where applyDeletes() takes a large amount of time on commit for a 150GB index, and this stops all indexing threads. It looks like applyDeletes() is re-scanning the entire index, even though that's unnecessary, as we are only adding documents to the index, not deleting them

          if this optimization was backported, then I will probably have to find a solution for my problem elsewhere...

          Robert Muir added a comment -

          even though it's unnecessary as we are only adding documents to the index but not deleting them

          Hi Roman, I saw your post.

          I think by default, when you add a document with unique id X, Solr issues
          a delete-by-term for X.

          But I'm pretty sure it has an option (sorry, I don't know what it is) where
          you can tell it that you are sure the documents you are adding are new, and
          then it won't do this.


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0
              Watchers:
              2