Lucene - Core / LUCENE-5646

stored fields bulk merging doesn't quite work right

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      From some profiling of merging:

      CompressingStoredFieldsWriter has 3 codepaths (as I see it):
      1. Optimized bulk copy (no deletions in chunk): in this case the compressed data is copied over.
      2. Semi-optimized copy: in this case it's optimized for an existing stored fields writer, and it decompresses and recompresses doc-at-a-time around any deleted docs in the chunk.
      3. Ordinary merging.

      In my dataset, I only see #2 happening, never #1. The logic for determining whether we can do #1 seems to be:

      onChunkBoundary && chunkSmallEnough && chunkLargeEnough && noDeletions
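
      The three codepaths and the bulk-copy condition above can be sketched as a single decision. This is only an illustration, not the actual CompressingStoredFieldsWriter code: the class, the enum, and the `sameFormat` flag (standing in for "the source segment was written by a compatible stored fields writer") are hypothetical names.

```java
class MergePathSketch {
    /** Illustrative names for the three codepaths listed in the description. */
    enum MergePath { BULK_COPY, RECOMPRESS_AROUND_DELETES, ORDINARY_MERGE }

    /**
     * Picks a codepath for one chunk. The four chunk flags mirror the
     * condition quoted in the issue; sameFormat is an assumed extra flag.
     */
    static MergePath choosePath(boolean sameFormat,
                                boolean onChunkBoundary,
                                boolean chunkSmallEnough,
                                boolean chunkLargeEnough,
                                boolean noDeletions) {
        if (!sameFormat) {
            return MergePath.ORDINARY_MERGE;            // #3
        }
        if (onChunkBoundary && chunkSmallEnough && chunkLargeEnough && noDeletions) {
            return MergePath.BULK_COPY;                 // #1: copy compressed bytes
        }
        return MergePath.RECOMPRESS_AROUND_DELETES;     // #2: doc-at-a-time
    }
}
```

      In this sketch a single false flag, such as chunkLargeEnough, is enough to push every chunk onto #2 even when there are no deletions at all.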
      

      I think the logic for "chunkLargeEnough" is out of sync with the MAX_DOCUMENTS_PER_CHUNK limit? E.g. instead of:

      startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize // chunk is large enough
      

      it should be something like:

      (it.chunkDocs >= MAX_DOCUMENTS_PER_CHUNK || startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize) // chunk is large enough
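
      To make the mismatch concrete, here is a small standalone sketch comparing the two checks on a chunk that was flushed because it hit the document limit rather than the byte limit. The values 128 for MAX_DOCUMENTS_PER_CHUNK and 16384 for chunkSize are assumptions chosen for illustration, and the helper methods are hypothetical; only the two boolean expressions come from the issue.

```java
import java.util.Arrays;

class ChunkLargeEnoughSketch {
    // Assumed value for illustration; the issue only names the constant.
    static final int MAX_DOCUMENTS_PER_CHUNK = 128;

    /** Size-only check, as quoted in the issue. */
    static boolean largeEnoughOriginal(int chunkDocs, int[] startOffsets,
                                       int[] lengths, int chunkSize) {
        return startOffsets[chunkDocs - 1] + lengths[chunkDocs - 1] >= chunkSize;
    }

    /** Proposed check: also accept chunks that were flushed on doc count. */
    static boolean largeEnoughProposed(int chunkDocs, int[] startOffsets,
                                       int[] lengths, int chunkSize) {
        return chunkDocs >= MAX_DOCUMENTS_PER_CHUNK
            || startOffsets[chunkDocs - 1] + lengths[chunkDocs - 1] >= chunkSize;
    }

    /** Hypothetical helper: start offsets for docs of equal length. */
    static int[] offsets(int docs, int docLength) {
        int[] result = new int[docs];
        for (int i = 0; i < docs; i++) {
            result[i] = i * docLength;
        }
        return result;
    }

    /** Hypothetical helper: per-doc lengths for docs of equal length. */
    static int[] lengths(int docs, int docLength) {
        int[] result = new int[docs];
        Arrays.fill(result, docLength);
        return result;
    }

    public static void main(String[] args) {
        int docs = 128, docLength = 100, chunkSize = 16384; // 12800 bytes in total
        int[] off = offsets(docs, docLength);
        int[] len = lengths(docs, docLength);
        System.out.println(largeEnoughOriginal(docs, off, len, chunkSize)); // false
        System.out.println(largeEnoughProposed(docs, off, len, chunkSize)); // true
    }
}
```

      With many small documents, every chunk is cut at the document limit well before it reaches chunkSize bytes, so under these assumptions the size-only check never passes and bulk copy is never selected.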
      

      But this only works "at first", then falls out of sync in my tests. Once this happens, it never reverts to the #1 algorithm and sticks with #2. So it's still not quite right.

      Maybe Adrien Grand knows off the top of his head...

      Attachments

        1. LUCENE-5646.patch
          5 kB
          Robert Muir

          People

            Assignee: Unassigned
            Reporter: Robert Muir