Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      See LUCENE-5646 for the motivation.

      Long term it might be nice to add algorithm #2 to term vectors, if it's possible and not too complex.

      But for now, I think we should avoid such rare optimizations.

        Activity

        Robert Muir added a comment -

        We can easily apply the same algorithm as LUCENE-6183, so the optimization will be effective rather than a rare one. I think we should completely replace the existing logic with that, falling back to the default algorithm otherwise. I can investigate this tomorrow.
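
        As a rough self-contained sketch of that decision (assuming the LUCENE-6183 approach of tracking partially filled "dirty" chunks; every name below is illustrative, not the actual trunk identifier):

        class BulkMergePolicy {
            // Chunk statistics the reader would expose for its segment.
            final long numChunks;       // total compressed chunks
            final long numDirtyChunks;  // chunks flushed before they were full

            BulkMergePolicy(long numChunks, long numDirtyChunks) {
                this.numChunks = numChunks;
                this.numDirtyChunks = numDirtyChunks;
            }

            /**
             * The "safety switch": bulk-copy compressed chunks only when the
             * segment has no deletions and few dirty chunks, since dirty chunks
             * copied verbatim keep their poor compression forever. Otherwise
             * fall back to the default doc-by-doc merge.
             */
            boolean canBulkMerge(boolean hasDeletions) {
                if (hasDeletions) {
                    return false;  // deletions force doc-by-doc merging
                }
                return numDirtyChunks * 100 <= numChunks;  // at most ~1% dirty
            }
        }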

        We should think about implementing getMergeInstance() like LUCENE-6115, so that the fallback default algorithm is reasonable for all other cases.
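
        A minimal sketch of what getMergeInstance() would buy, using toy stand-in types rather than the real Lucene classes: the merge instance caches the last decompressed chunk, so the default doc-by-doc merge decompresses each chunk once instead of once per visited document.

        class ToyVectorsReader {
            static final int CHUNK_SIZE = 128;
            private final String[][] chunks;  // chunks[c][i] holds doc c*CHUNK_SIZE+i
            private int cachedChunk = -1;     // last chunk we "decompressed"
            private String[] cachedDocs;

            ToyVectorsReader(String[][] chunks) {
                this.chunks = chunks;
            }

            /** The merger asks for a private instance whose chunk cache is
             *  effective for the sequential access pattern of merging. */
            ToyVectorsReader getMergeInstance() {
                return new ToyVectorsReader(chunks);
            }

            String document(int docID) {
                int chunk = docID / CHUNK_SIZE;
                if (chunk != cachedChunk) {  // "decompress" only on chunk change
                    cachedDocs = chunks[chunk].clone();
                    cachedChunk = chunk;
                }
                return cachedDocs[docID % CHUNK_SIZE];
            }
        }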

        Robert Muir added a comment -

        Initial patch.

        If you look at the current logic, it tries to optimize this, but doesn't have any safety switch. It also doesn't always work: it sometimes falls back to the default algorithm (even when there are no deletions). This is because of this conditional logic in trunk, comments mine:

        // we currently never bulk merge the last chunk.
        // if this one is not "full" and leaves pending docs, we never resync.
        if (docBase + chunkDocs < maxDoc) ...
        

        When there are deletions, it will always fall back and never resync, so handling that case currently doesn't achieve anything except complexity.

        This falling out of sync impacts docs with smaller vectors more, because the default algorithm is slow (no getMergeInstance). For big documents (size > chunkSize), you also dodge the resync problem above.
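
        To make the resync problem concrete, here is a toy model (hypothetical names, not the trunk code) of why a single partial chunk permanently disables the bulk-copy branch:

        class ResyncDemo {
            static final int CHUNK_SIZE = 128;

            public static void main(String[] args) {
                int pendingDocs = 0;  // docs buffered by the writer, not yet flushed
                // One under-filled chunk in the middle of otherwise full chunks:
                int[] chunkSizes = {128, 128, 100, 128, 128};
                for (int docs : chunkSizes) {
                    if (pendingDocs == 0 && docs == CHUNK_SIZE) {
                        System.out.println("bulk copy of full chunk");
                    } else {
                        System.out.println("doc-by-doc merge of " + docs + " docs");
                        pendingDocs = (pendingDocs + docs) % CHUNK_SIZE;
                    }
                }
                // After the 100-doc chunk, pendingDocs stays at 100, so every
                // following full chunk also takes the slow path: we never resync.
            }
        }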

        I replaced it with the same algorithm from stored fields.

        I turned on vectors (5 fields), indexing and storing small documents (10 fields):

        trunk:
        timeIndexing=601139
        timeMerging=49046
        
        SM 0 [2015-01-17 03:26:30.208; main]: 3765 msec to merge vectors [9490360 docs]
        SM 0 [2015-01-17 03:26:04.928; main]: 5508 msec to merge vectors [7300730 docs]
        SM 0 [2015-01-17 03:25:43.179; main]: 1261 msec to merge vectors [189430 docs]
        
        patch:
        timeIndexing=422731
        timeMerging=43832
        
        SM 0 [2015-01-17 03:37:15.480; main]: 2183 msec to merge vectors [9490360 docs]
        SM 0 [2015-01-17 03:36:50.698; main]: 1492 msec to merge vectors [7300730 docs]
        SM 0 [2015-01-17 03:36:32.620; main]: 27 msec to merge vectors [189430 docs]
        

        You can see the forceMerge time is not much better, because 2/3 of the collection is in the first segment, but overall indexing time is improved because the change impacts other merges (see the 189430-doc one).

        Anyway, I think it's better overall, at least for the simplicity and the additional safety mechanism.

        Adrien Grand added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1653330 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1653330 ]

        LUCENE-5647: fix term vectors bulk merge (same algorithm now as stored fields)

        ASF subversion and git services added a comment -

        Commit 1653359 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1653359 ]

        LUCENE-5647: fix term vectors bulk merge (same algorithm now as stored fields)

        Timothy Potter added a comment -

        Bulk close after 5.1 release


          People

          • Assignee:
            Unassigned
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            4