Lucene - Core / LUCENE-1737

Always use bulk-copy when merging stored fields and term vectors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene has nice optimizations in place during merging of stored fields
      (LUCENE-1043) and term vectors (LUCENE-1120) whereby the bytes are
      bulk copied to the new segment. This is much faster than decoding &
      rewriting one document at a time.

      However, the optimization is rather brittle: it relies on the mapping
      of field name to number being the same ("congruent") across the
      segments being merged.

      Unfortunately, the field mapping will be congruent only if the app
      adds the same fields in precisely the same order to each document.
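To illustrate why field order matters, here is a minimal sketch, assuming each segment numbers its fields in first-seen order. `SegmentFieldNumbers` and its methods are hypothetical names for this illustration, not Lucene's actual classes, and the real congruence check may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one segment's field name -> number mapping.
// This is NOT Lucene's FieldInfos class; it only models the idea.
class SegmentFieldNumbers {
    // The index in this list is the field number.
    private final List<String> namesByNumber = new ArrayList<>();

    // Assign the next free number the first time a field name is seen.
    int add(String fieldName) {
        int number = namesByNumber.indexOf(fieldName);
        if (number == -1) {
            number = namesByNumber.size();
            namesByNumber.add(fieldName);
        }
        return number;
    }

    // Two mappings are congruent when every number maps to the same name.
    boolean isCongruentWith(SegmentFieldNumbers other) {
        return namesByNumber.equals(other.namesByNumber);
    }
}
```

Under this model, a segment that saw "title" then "body" is not congruent with one that saw "body" then "title", even though both contain the same fields.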

      I think we should fix IndexWriter to assign the same field number for
      a given field that has been assigned in the past. I.e., when writing a
      new segment, we pre-seed the field numbers based on past segments.
      All other aspects of FieldInfo would remain fully dynamic.
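The pre-seeding idea could look roughly like the sketch below: a single registry shared by the writer, consulted whenever a new segment needs a number for a field. `GlobalFieldNumbers` and `addOrGet` are assumed names for illustration only; the actual patch may structure this differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical writer-wide registry of field numbers (illustrative only).
class GlobalFieldNumbers {
    private final Map<String, Integer> numberByName = new HashMap<>();
    private int nextNumber = 0;

    // Return the number already assigned to this field name,
    // or assign the next free number if the name is new.
    synchronized int addOrGet(String fieldName) {
        Integer existing = numberByName.get(fieldName);
        if (existing != null) {
            return existing;
        }
        int number = nextNumber++;
        numberByName.put(fieldName, number);
        return number;
    }
}
```

With such a registry, a later segment that adds "body" before "title" still gets the numbers those fields received earlier, so the order in which the app adds fields no longer affects the mapping.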

      1. LUCENE-1737.patch
        16 kB
        Michael McCandless
      2. LUCENE-1737.patch
        4 kB
        Michael McCandless

          Activity

          Michael McCandless added a comment -

          Clearing 2.9 fix version.

          Michael McCandless added a comment -

          This turned out to be very simple: a tiny patch!

          Michael McCandless added a comment -

          I realized we should fix a few more cases here to use bulk-copy more often. First, on opening a pre-4.0 index, we should sweep all segments to union the FieldInfos so newly written segments are congruent with all past segments as much as possible. Second, when merging we should start from the current FieldInfos.
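The sweep over past segments might be sketched as follows. This is an assumption-laden illustration: each segment is modeled as a plain name -> number map, `FieldNumberSeeder.union` is a hypothetical helper, and the conflict rule (first assignment wins, otherwise take the next free number) is one plausible reading of "as much as possible", not the rule confirmed by the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that seeds a global mapping from existing segments.
class FieldNumberSeeder {
    // Each past segment is modeled as a map of field name -> field number.
    static Map<String, Integer> union(List<Map<String, Integer>> pastSegments) {
        Map<String, Integer> numberByName = new HashMap<>();
        Map<Integer, String> nameByNumber = new HashMap<>();
        for (Map<String, Integer> segment : pastSegments) {
            for (Map.Entry<String, Integer> entry : segment.entrySet()) {
                String name = entry.getKey();
                if (numberByName.containsKey(name)) {
                    continue; // the first assignment seen for a name wins
                }
                // Prefer the segment's own number; if another field already
                // holds it, move up to the next free number instead.
                int number = entry.getValue();
                while (nameByNumber.containsKey(number)) {
                    number++;
                }
                numberByName.put(name, number);
                nameByNumber.put(number, name);
            }
        }
        return numberByName;
    }
}
```

Segments whose numbering agrees with the seeded mapping stay eligible for bulk copy; a segment that disagrees (for example one that used number 0 for a different field) cannot be made congruent after the fact.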

          Even with this, addIndexes(Directory[]) simply copies in new segments, so if the field name->number assignment in those incoming indices doesn't match the current index, those segments can't be bulk copied when they are later merged.

          Michael McCandless added a comment -

          The fixes above can only be done once we always merge doc stores on merging segments, which will be done in LUCENE-2814.

          Michael McCandless added a comment -

          Patch.

          It has one nocommit which we can remove once LUCENE-2814 is in.

          Grant Ingersoll added a comment -

          Bulk close for 3.1


            People

            • Assignee:
              Michael McCandless
            • Reporter:
              Michael McCandless
            • Votes:
              0
            • Watchers:
              3
