Lucene - Core / LUCENE-6183

Avoid re-compression on stored fields merge

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We removed this optimization before; it didn't really work right because it required things to be "aligned".

      But I think we can do it more simply and safely. This recompression is a big CPU hog in merge, and it limits our options compression-wise (especially ones like LZ4-HC that are only slower at write time).

      1. LUCENE-6183.patch
        19 kB
        Robert Muir
      2. LUCENE-6183.patch
        12 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Here's a first stab. I think some of the code can be simplified further, and we should take a pass thru to see if there are any cheap checks we should make.

        Robert Muir added a comment -

        I ran a benchmark indexing log data (just stored fields only, no actual "indexing").
        Stored fields merging in this case is 5x faster with BEST_SPEED and 10x faster with BEST_COMPRESSION. Any space differences are trivial.

        I will run it also with the deflate-6 in the patch, but I think it will be fine.

        iwc.setMergeScheduler(new SerialMergeScheduler());
        iwc.setMaxBufferedDocs(10001);
        iwc.setMergePolicy(new LogDocMergePolicy());

        BEST_SPEED (lz4)
        Trunk:
        timeIndexing=578014
        timeForceMerging=183421
        SM 0 [2015-01-15 04:05:30.380; main]: 114732 msec to merge stored fields [6881288 docs]
        -rw-rw-r--  1 rmuir rmuir 4690955837 Jan 15 04:05 _7j0.fdt
        -rw-rw-r--  1 rmuir rmuir    2559414 Jan 15 04:05 _7j0.fdx
        
        Patch:
        timeIndexing=389148
        timeForceMerging=37476
        SM 0 [2015-01-15 03:49:20.538; main]: 21690 msec to merge stored fields [6881288 docs]
        -rw-rw-r--  1 rmuir rmuir 4691200952 Jan 15 03:49 _6xq.fdt
        -rw-rw-r--  1 rmuir rmuir    2557794 Jan 15 03:49 _6xq.fdx
        
        BEST_COMPRESSION (deflate-3)
        
        Trunk:
        timeIndexing=586511
        timeForceMerging=204363
        SM 0 [2015-01-15 03:33:11.906; main]: 130097 msec to merge stored fields [6881288 docs]
        -rw-rw-r--  1 rmuir rmuir 2673871545 Jan 15 03:33 _5r6.fdt
        -rw-rw-r--  1 rmuir rmuir     731953 Jan 15 03:33 _5r6.fdx
        
        Patch:
        timeIndexing=364453
        timeForceMerging=19519
        SM 0 [2015-01-15 03:41:05.477; main]: 11641 msec to merge stored fields [6881288 docs]
        -rw-rw-r--  1 rmuir rmuir 2674305752 Jan 15 03:41 _6cg.fdt
        -rw-rw-r--  1 rmuir rmuir     735374 Jan 15 03:41 _6cg.fdx
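
The 5x and 10x figures can be checked against the "msec to merge stored fields" lines and the .fdt sizes above. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and only the numbers come from the runs. It works out to roughly 5.3x for BEST_SPEED and 11.2x for BEST_COMPRESSION, with the .fdt files growing by well under 0.1% in both cases.

```java
final class MergeSpeedup {
    /** Ratio of trunk time to patch time; values above 1 mean the patch is faster. */
    static double speedup(long trunkMillis, long patchMillis) {
        return (double) trunkMillis / patchMillis;
    }

    /** Relative size change in percent; positive means the patched file is larger. */
    static double sizeChangePercent(long trunkBytes, long patchBytes) {
        return 100.0 * (patchBytes - trunkBytes) / trunkBytes;
    }

    public static void main(String[] args) {
        // Stored fields merge times (msec) and .fdt sizes (bytes) copied from the runs above.
        System.out.println("BEST_SPEED speedup:       " + speedup(114732, 21690));
        System.out.println("BEST_SPEED .fdt change %: " + sizeChangePercent(4690955837L, 4691200952L));
        System.out.println("BEST_COMPRESSION speedup:       " + speedup(130097, 11641));
        System.out.println("BEST_COMPRESSION .fdt change %: " + sizeChangePercent(2673871545L, 2674305752L));
    }
}
```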
        
        Robert Muir added a comment -

        Here is the deflate-6 proposed here. I think for BEST_COMPRESSION it's now a good tradeoff.

        timeIndexing=401208
        timeForceMerging=17187
        SM 0 [2015-01-15 04:17:53.489; main]: 10939 msec to merge stored fields [6881288 docs]
        -rw-rw-r--  1 rmuir rmuir 2322463087 Jan 15 04:17 _84a.fdt
        -rw-rw-r--  1 rmuir rmuir     733578 Jan 15 04:17 _84a.fdx
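
For readers who want to try the level tradeoff on their own data, here is a small sketch that compresses the same bytes at levels 3 and 6. It uses only the JDK's `java.util.zip.Deflater` and `Inflater`, not Lucene's own compression code, and the class name, helper names, and sample log lines are all made up for illustration. How much level 6 gains over level 3 depends on the data, so the sketch prints sizes rather than assuming a result.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DeflateLevelSketch {
    /** Compresses raw bytes at the given deflate level (0-9). */
    static byte[] deflate(byte[] raw, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Decompresses bytes produced by deflate(); rejects corrupt or truncated input. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalArgumentException("truncated input");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt input", e);
        } finally {
            inflater.end();
        }
    }

    /** Synthetic, repetitive log-like data (invented for this sketch). */
    static byte[] sampleLogData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("2015-01-15 04:05:30 INFO request id=").append(i)
              .append(" status=200 path=/search\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleLogData();
        for (int level : new int[] {3, 6}) {
            byte[] packed = deflate(raw, level);
            System.out.println("level " + level + ": " + raw.length + " -> " + packed.length + " bytes");
        }
    }
}
```

In the runs above, the .fdt file went from 2674305752 bytes (deflate-3, patch) to 2322463087 bytes (deflate-6), about 13% smaller.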
        
        Michael McCandless added a comment -

        +1

        tooDirty is conservative (errs towards re-compressing) which I think is good...
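
A check of this kind might look like the sketch below. It assumes a "dirty" chunk is one that was written out before it was full, and that a segment with too many of them gets re-compressed on merge instead of having its chunks copied as-is. The thresholds, the class name, and the plain `long` parameters are hypothetical, picked only to show the shape of a conservative check; they are not taken from the patch.

```java
final class DirtyChunkSketch {
    // Hypothetical limits, chosen for illustration only.
    static final long MAX_DIRTY_CHUNKS = 1024;
    static final long MAX_DIRTY_PERCENT = 1;

    /**
     * Returns true when a segment has accumulated enough dirty chunks that
     * re-compressing it during merge is preferred over copying chunks verbatim.
     * Either an absolute cap or a percentage cap can trigger re-compression,
     * so the check errs towards re-compressing.
     */
    static boolean tooDirty(long numChunks, long numDirtyChunks) {
        return numDirtyChunks > MAX_DIRTY_CHUNKS
            || numDirtyChunks * 100 > numChunks * MAX_DIRTY_PERCENT;
    }

    public static void main(String[] args) {
        System.out.println(tooDirty(1_000, 5));          // 0.5% dirty: copy chunks as-is
        System.out.println(tooDirty(1_000, 11));         // 1.1% dirty: re-compress
        System.out.println(tooDirty(1_000_000, 2_000));  // over the absolute cap: re-compress
    }
}
```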

        Adrien Grand added a comment -

        The numbers are great!
        +1 to the patch and to moving high compression to level=6.
        It's interesting that it makes forceMerge faster with high compression now.

        Robert Muir added a comment -

        Updated patch: I think it's ready.

        I added a test for dirty chunk gc, added checks and asserts and code comments, fixed fileformat docs, and added an escape hatch.

        Adrien Grand added a comment -

        +1

        Michael McCandless added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1652269 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1652269 ]

        LUCENE-6183: Avoid re-compression on stored fields merge

        ASF subversion and git services added a comment -

        Commit 1652275 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1652275 ]

        LUCENE-6183: Avoid re-compression on stored fields merge

        ASF subversion and git services added a comment -

        Commit 1652342 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1652342 ]

        LUCENE-6183: be prepared for future packedints version changes

        ASF subversion and git services added a comment -

        Commit 1652343 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1652343 ]

        LUCENE-6183: be prepared for future packedints version changes

        Timothy Potter added a comment -

        Bulk close after 5.1 release


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir