  1. Lucene - Core
  2. LUCENE-8962

Can we merge small segments during refresh, for faster searching?

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0, 8.6, 8.7
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

    Description

      Two improvements were added: 8.6 has merge-on-commit (by Froh et al.), and 8.7 has merge-on-refresh (by Simon). See MergePolicy.findFullFlushMerges.

      The original description follows:


      With near-real-time search we ask IndexWriter to write all in-memory segments to disk and open an IndexReader to search them, and this is typically a quick operation.

      However, when you use many threads for concurrent indexing, IndexWriter will write many small segments during refresh, and this then adds search-time cost, as searching must visit all of these tiny segments.

      The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve IndexWriter's refresh to optionally kick off the merge policy to merge segments below some threshold before opening the near-real-time reader?  It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...
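
To make the "merge segments below some threshold" idea concrete, here is a rough sketch of just the selection step. It does not use Lucene's classes: `SmallSegmentSelector`, `Segment`, and the doc-count threshold are hypothetical stand-ins, and a real merge policy would more likely look at segment size in bytes and at what merges are already running.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene's API: choose which segments from a
// refresh-time snapshot are small enough to merge before opening the reader.
final class SmallSegmentSelector {

    /** Hypothetical stand-in for a flushed segment: a name and a document count. */
    static final class Segment {
        final String name;
        final int docCount;

        Segment(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    /**
     * Returns the segments in the snapshot whose doc count is strictly below
     * the threshold. Returns an empty list when fewer than two segments
     * qualify, since merging a single segment would not reduce the count.
     */
    static List<Segment> selectForMerge(List<Segment> snapshot, int maxDocsExclusive) {
        List<Segment> small = new ArrayList<>();
        for (Segment s : snapshot) {
            if (s.docCount < maxDocsExclusive) {
                small.add(s);
            }
        }
        return small.size() >= 2 ? small : new ArrayList<>();
    }
}
```

The selection only ever sees the snapshot taken at refresh time, so segments flushed afterwards never become merge candidates for this refresh.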

      One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...
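
The tricky bookkeeping here (include newly merged segments, exclude newly flushed ones) can be sketched as a pure function over segment names. This is an illustration only, under the assumption that segments can be identified by name; `PointInTimeView` and `compose` are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Lucene's API: build the list of segments a
// near-real-time reader should see after a merge-on-refresh completes.
final class PointInTimeView {

    /**
     * Walks the snapshot taken when refresh started, and swaps the segments
     * that were merged away for the single merged segment (placed where the
     * first merged source was). Segments flushed after the snapshot are
     * excluded simply because they never appear in refreshSnapshot.
     */
    static List<String> compose(List<String> refreshSnapshot,
                                Set<String> mergedSources,
                                String mergedName) {
        List<String> view = new ArrayList<>();
        boolean mergedAdded = false;
        for (String segment : refreshSnapshot) {
            if (mergedSources.contains(segment)) {
                if (!mergedAdded) {
                    view.add(mergedName);
                    mergedAdded = true;
                }
            } else {
                view.add(segment);
            }
        }
        return view;
    }
}
```

For example, with a snapshot of `_0, _1, _2, _3`, a merge of `_1` and `_2` into `_5`, and a segment `_4` flushed while the merge ran, `compose` yields `_0, _5, _3`: the merged segment is visible and `_4` is not.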

      I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!

      Attachments

        1. test.diff
          3 kB
          Simon Willnauer
        2. LUCENE-8962_demo.png
          24 kB
          Michael Froh
        3. image-2021-03-22-10-36-32-201.png
          4 kB
          Michael McCandless
        4. failure_log.txt
          36.16 MB
          Simon Willnauer
        5. failed-tests.patch
          8 kB
          Nhat Nguyen

            People

              Assignee: Unassigned
              Reporter: Michael McCandless (mikemccand)
              Votes: 0
              Watchers: 17

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 31h