Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today bulk merge is really only usable by stored fields/term vectors, relies on hardcoded logic in SegmentMerger that is specific to certain impls, etc.

      It would be better if this was generalized to terms/postings/norms/docvalues as well.

      Bulk merge is boring; the real idea is to allow codecs to do more. For example, with this patch they could do streaming checksum validation, or prevent the loading of "latent" norms, or do other things we cannot do today.
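
As a rough illustration of the "streaming checksum validation" idea mentioned above, the sketch below verifies a CRC32 while bytes are being copied, instead of in a separate pass. It uses only java.util.zip from the JDK. The class and method names (StreamingChecksumSketch, copyAndVerify, crcOf) are hypothetical and are not taken from the patch or from Lucene's codec API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical illustration only; none of these names come from the patch.
final class StreamingChecksumSketch {

  /**
   * Copies every byte from in to out and checks the CRC32 of the bytes
   * that were read against expectedCrc once the stream is exhausted.
   * Returns the number of bytes copied.
   */
  static long copyAndVerify(InputStream in, OutputStream out, long expectedCrc)
      throws IOException {
    // CheckedInputStream updates the checksum as a side effect of reading.
    CheckedInputStream checked = new CheckedInputStream(in, new CRC32());
    byte[] buffer = new byte[8192];
    long copied = 0;
    int n;
    while ((n = checked.read(buffer)) != -1) {
      out.write(buffer, 0, n);
      copied += n;
    }
    long actual = checked.getChecksum().getValue();
    if (actual != expectedCrc) {
      throw new IOException(
          "checksum mismatch: expected=" + expectedCrc + " actual=" + actual);
    }
    return copied;
  }

  /** Computes the CRC32 of a byte array (stands in for a stored footer value). */
  static long crcOf(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  public static void main(String[] args) throws IOException {
    byte[] source = "stored fields bytes".getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream merged = new ByteArrayOutputStream();
    long copied = copyAndVerify(new ByteArrayInputStream(source), merged, crcOf(source));
    System.out.println("copied " + copied + " bytes with a valid checksum");
  }
}
```

The point of the design is that validation costs no extra read of the data: the merge already streams every byte, so the checksum can be accumulated along the way.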

      1. LUCENE-5894.patch
        177 kB
        Robert Muir

        Activity

        rcmuir Robert Muir added a comment -

        Stab at a patch:

        • moved all bulk merge stuff out of segmentmerger to codec private
        • moved all merge logic out of segment merger into codec apis (so they can completely override)
        • added missing getXXXReader to SegmentReader; it's not needed for postings, as we already have direct access with fields(), but it is needed for norms/docvalues
        • separated norms processing from docvalues
        • refactored dv update handling: when dv updates are in place you get a producer that delegates to the correct ones (this is a nice separation out of SR)
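
To make the "codecs can completely override merge" point concrete, here is a minimal sketch of the pattern: a writer owns a default document-by-document merge, and a subclass overrides it to take a bulk path when the incoming segment uses a matching format. All names here (CodecMergeSketch, SegmentDocs, DocWriter, BulkCopyingWriter, the "compressing" format tag) are hypothetical stand-ins and do not reflect the actual classes or signatures in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; these are not Lucene classes.
final class CodecMergeSketch {

  /** Stand-in for one segment's documents plus the format that wrote them. */
  static final class SegmentDocs {
    final String format;
    final List<String> docs;

    SegmentDocs(String format, List<String> docs) {
      this.format = format;
      this.docs = docs;
    }
  }

  /** Writer that owns the merge logic; the default goes document by document. */
  static class DocWriter {
    final List<String> output = new ArrayList<>();

    void addDocument(String doc) {
      output.add(doc);
    }

    /** Default merge: re-add every document. Returns the number merged. */
    int merge(List<SegmentDocs> segments) {
      int count = 0;
      for (SegmentDocs segment : segments) {
        for (String doc : segment.docs) {
          addDocument(doc);
          count++;
        }
      }
      return count;
    }
  }

  /** Codec-private override: bulk-copies segments written in the same format. */
  static final class BulkCopyingWriter extends DocWriter {
    static final String FORMAT = "compressing";
    int bulkCopies = 0;

    @Override
    int merge(List<SegmentDocs> segments) {
      int count = 0;
      for (SegmentDocs segment : segments) {
        if (FORMAT.equals(segment.format)) {
          // Matching format: copy the whole segment in one step.
          output.addAll(segment.docs);
          bulkCopies++;
          count += segment.docs.size();
        } else {
          // Foreign format: fall back to the per-document path.
          for (String doc : segment.docs) {
            addDocument(doc);
            count++;
          }
        }
      }
      return count;
    }
  }

  public static void main(String[] args) {
    BulkCopyingWriter writer = new BulkCopyingWriter();
    int merged = writer.merge(Arrays.asList(
        new SegmentDocs("compressing", Arrays.asList("doc0", "doc1")),
        new SegmentDocs("legacy", Arrays.asList("doc2"))));
    System.out.println(merged + " docs merged, " + writer.bulkCopies + " bulk copies");
  }
}
```

With this shape the caller (the SegmentMerger role) only invokes merge and needs no knowledge of which formats can be bulk-copied.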
        mikemccand Michael McCandless added a comment -

        +1, I think this patch is nice; it's great to have merging fully under
        control of the codec. There are lots of nice improvements here:

        • SegmentMerger is much simpler
        • Merging responsibility moves to XXXConsumer, and bulk-merge optos
          (and new MatchingReaders class) are now entirely codec private
          (CompressingStoredFields/TVFormat)
        • Moved old writers (Lucene40StoredFields/TVsWriter) to
          test-framework so compressing (current default) is the only writer
          now.
        • We now need a NormsConsumer/Producer (can't reuse DVConsumer) since the
          source for norms must be "known" in the default merge impl.
        • Factored out SegmentDocValuesProducer to hold all per-field DVPs,
          updates.
        • Also separated out the classes in IW that buffer up norms in RAM
          until flush from the DV classes, letting you remove
          trackDocsWithField boolean...

        I think this is a good cleanup!
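
The per-field delegation described for doc values updates can be sketched roughly as follows: one producer holds a base source plus per-field overrides, and each lookup is routed to whichever source is current for that field. This is only an assumption about the general shape; NumericProducer, PerFieldProducer, and addUpdate are hypothetical names and are not the API of SegmentDocValuesProducer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual SegmentDocValuesProducer.
final class DelegatingProducerSketch {

  /** Stand-in for a source of numeric doc values. */
  interface NumericProducer {
    long get(String field, int docId);
  }

  /** Routes each field to its updated producer if one exists, else to the base. */
  static final class PerFieldProducer implements NumericProducer {
    private final NumericProducer base;
    private final Map<String, NumericProducer> updated = new HashMap<>();

    PerFieldProducer(NumericProducer base) {
      this.base = base;
    }

    /** Registers a newer producer for a single field. */
    void addUpdate(String field, NumericProducer producer) {
      updated.put(field, producer);
    }

    @Override
    public long get(String field, int docId) {
      // Pick the most recent producer for this field, falling back to the base.
      NumericProducer producer = updated.getOrDefault(field, base);
      return producer.get(field, docId);
    }
  }

  public static void main(String[] args) {
    PerFieldProducer producer = new PerFieldProducer((field, docId) -> 1L);
    producer.addUpdate("price", (field, docId) -> 42L);
    System.out.println("price=" + producer.get("price", 0)
        + " rank=" + producer.get("rank", 0));
  }
}
```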

        rjernst Ryan Ernst added a comment -

        +1, LGTM

        jira-bot ASF subversion and git services added a comment -

        Commit 1619392 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1619392 ]

        LUCENE-5894: refactor bulk merge logic

        rcmuir Robert Muir added a comment -

        This change is good to backport, but I would prefer it not go into 4.10 at the last minute.

        Ryan Ernst, would you be OK with creating the release branch soon?

        jira-bot ASF subversion and git services added a comment -

        Commit 1619477 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1619477 ]

        LUCENE-5894: refactor bulk merge logic

        anshumg Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee: Unassigned
          • Reporter: rcmuir Robert Muir
          • Votes: 0
          • Watchers: 4
