Lucene - Core
LUCENE-5646

stored fields bulk merging doesn't quite work right

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      From doing some profiling of merging:

      CompressingStoredFieldsWriter has 3 codepaths (as I see it; see the sketch below):
      1. optimized bulk copy (no deletions in chunk). In this case the compressed data is copied over as-is.
      2. semi-optimized copy: in this case it's optimized for an existing storedfieldswriter, and it decompresses and recompresses doc-at-a-time around any deleted docs in the chunk.
      3. ordinary merging
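      For illustration only, here is a simplified, self-contained sketch of this dispatch. All type and helper names (BulkMergeSketch, Chunk, copyCompressedChunk, ...) are hypothetical stand-ins rather than the actual CompressingStoredFieldsWriter code, and the MAX_DOCUMENTS_PER_CHUNK value is assumed:

      // Hypothetical sketch of the three merge codepaths; not the real Lucene code.
      class BulkMergeSketch {

        static final int MAX_DOCUMENTS_PER_CHUNK = 128; // assumed per-chunk doc cap, for illustration

        // Illustrative stand-in for the reader's per-chunk state (it.chunkDocs etc. below).
        static class Chunk {
          int chunkDocs;        // number of docs in the chunk
          int[] startOffsets;   // start offset of each doc within the decompressed chunk
          int[] lengths;        // length of each doc within the decompressed chunk
          boolean hasDeletions; // whether any doc in the chunk is deleted
        }

        void mergeChunk(Chunk it, boolean sameFormat, boolean onChunkBoundary, int chunkSize) {
          boolean chunkSmallEnough = it.chunkDocs <= MAX_DOCUMENTS_PER_CHUNK;
          boolean chunkLargeEnough =
              it.startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize;
          boolean noDeletions = !it.hasDeletions;

          if (sameFormat && onChunkBoundary && chunkSmallEnough && chunkLargeEnough && noDeletions) {
            copyCompressedChunk(it); // #1: copy the compressed bytes verbatim
          } else if (sameFormat) {
            copyDocAtATime(it);      // #2: decompress and recompress doc-at-a-time around deletions
          } else {
            genericMerge(it);        // #3: ordinary merging via the default path
          }
        }

        void copyCompressedChunk(Chunk it) { /* bulk copy of the raw compressed chunk */ }
        void copyDocAtATime(Chunk it)      { /* re-add live docs one at a time */ }
        void genericMerge(Chunk it)        { /* fall back to the generic merge routine */ }
      }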

      In my dataset, I only see #2 happening, never #1. The logic for determining whether we can do #1 seems to be:

      onChunkBoundary && chunkSmallEnough && chunkLargeEnough && noDeletions
      

      I think the logic for "chunkLargeEnough" is out of sync with the MAX_DOCUMENTS_PER_CHUNK limit? E.g., instead of:

      startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize // chunk is large enough
      

      it should be something like:

      (it.chunkDocs >= MAX_DOCUMENTS_PER_CHUNK || startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize) // chunk is large enough
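      In other words, a chunk that has already hit the per-chunk document cap would also count as "large enough", not only one that has reached the byte-size threshold. A hypothetical helper spelling this out (the names and the constant value are assumed, not the actual code):

      // Hypothetical helper illustrating the proposed condition; not the real Lucene code.
      final class ChunkLargeEnoughSketch {
        static final int MAX_DOCUMENTS_PER_CHUNK = 128; // assumed per-chunk doc cap

        static boolean chunkLargeEnough(int chunkDocs, int[] startOffsets, int[] lengths, int chunkSize) {
          boolean hitDocCountCap = chunkDocs >= MAX_DOCUMENTS_PER_CHUNK;                              // proposed extra clause
          boolean hitByteSizeCap = startOffsets[chunkDocs - 1] + lengths[chunkDocs - 1] >= chunkSize; // existing clause
          return hitDocCountCap || hitByteSizeCap;
        }
      }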
      

      But this only works "at first" and then falls out of sync in my tests. Once this happens, it never reverts back to the #1 algorithm and sticks with #2. So it's still not quite right.

      Maybe Adrien Grand knows off the top of his head...

        Activity

        Robert Muir added a comment -

        Perhaps the reason I fall "out of sync" is that the first segment ended on a non-chunk boundary (I have no deletions).

        So when it moves to the next segment, it falls out of sync and never "recovers". I'm not sure what we can do here, but it seems that unless you have very large docs, you aren't gonna get a "pure bulk copy" even with my fix, because the chances of everything aligning are quite rare.

        Maybe there is a way we could (temporarily, for that merge) force flush() at the end of segment transitions to avoid this, so that the optimization would continue, if we could then recombine them in the next merge to eventually recover?

        Robert Muir added a comment -

        Another option would be to drop case #1 entirely and just focus on optimizing case #2 to be fast.

        Forget the performance perspective: it's quite scary how the stars have to align just perfectly for case #1 to happen. What if we have bugs that aren't being found because it's just so rare? Unfortunately, builds.apache.org just went down as I tried to inspect the Clover report to see how many times tests even hit the #1 case, so I don't yet know.

        Robert Muir added a comment -

        I investigated Clover; case #1 is hit a few times, but only by all the wrong tests:

        org.apache.solr.search.TestRealTimeGet.testStressGetRealtime
        org.apache.solr.spelling.IndexBasedSpellCheckerTest.testAlternateLocation 
        org.apache.solr.morphlines.cell.SolrCellMorphlineTest.testSolrCellDocumentTypes2
        

        Honestly, this is a little concerning. I think we need to do something here.

        Robert Muir added a comment -

        Patch disabling the "one in a billion" optimization. IMO it's too rare that you ever hit this, and it hardly sees any test coverage.

        In order for bulk merge to really bulk-copy compressed data, it would have to be more sophisticated, I think: e.g. allowing/tracking "padding" for final chunks in segments and, at some point, determining that it should GC the padding by forcing decompression/recompression. Honestly, I'm not sure that kind of stuff belongs in bulk merge.

        NOTE: I did not do any similar inspection of term vectors yet. But IIRC that one has less fancy stuff in bulk merge.

        Robert Muir added a comment -

        I have not thought about what we should do here 4.8.1-wise.

        I cannot yet prove that this optimization corrupts data, but I feel we should either remove it, or fix it and test the crap out of it.

        If there is an objection to removing it, I will try to think about a test that really hammers it (it would possibly be quite slow). I have the feeling something bad might happen...

        Adrien Grand added a comment -

        Indeed, code path #1 almost only happens on the first segment because it is unlikely that segments end on a chunk boundary. I'm +1 to removing it.

        NOTE: I did not do any similar inspection of term vectors yet. But IIRC that one has less fancy stuff in bulk merge.

        Actually, they are worse: term vectors only have #1 and #3. So I think we should just use the default merge routine (which is what has been happening in practice in most cases anyway).

        Robert Muir added a comment -

        Thanks Adrien. I'll take care of this one first and move on to vectors; I'll open an issue.

        ASF subversion and git services added a comment -

        Commit 1592731 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1592731 ]

        LUCENE-5646: Remove rare/undertested bulk merge algorithm in CompressingStoredFieldsWriter

        ASF subversion and git services added a comment -

        Commit 1592734 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1592734 ]

        LUCENE-5646: Remove rare/undertested bulk merge algorithm in CompressingStoredFieldsWriter

        Robert Muir added a comment -

        Thanks Adrien!


          People

          • Assignee: Unassigned
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 2
