Lucene - Core / LUCENE-8962

Can we merge small segments during refresh, for faster searching?


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: main (9.0), 8.6, 8.7
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Two improvements were added: 8.6 has merge-on-commit (by Froh et al.), and 8.7 has merge-on-refresh (by Simon). See MergePolicy.findFullFlushMerges.

      The original description follows:


      With near-real-time search we ask IndexWriter to write all in-memory segments to disk and open an IndexReader to search them, and this is typically a quick operation.

      However, when you use many threads for concurrent indexing, IndexWriter will write many small segments during refresh, and this then adds search-time cost because searching must visit all of these tiny segments.

      The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve IndexWriter's refresh to optionally kick off the merge policy to merge segments below some threshold before opening the near-real-time reader?  It'd be a bit tricky because while we are waiting for merges, indexing may continue and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...

      One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...
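
      To make the point-in-time requirement concrete, here is a minimal toy model in plain Java. It is not Lucene code: the class RefreshMergeSketch, the Segment type, and the doc-count threshold are all hypothetical names invented for illustration. It only sketches the idea that the reader view is built from the snapshot taken at refresh, with the small segments in that snapshot replaced by one merged segment, while segments flushed afterwards are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every name here is hypothetical and none of it is Lucene API.
final class RefreshMergeSketch {

    /** A segment reduced to a name and a document count. */
    static final class Segment {
        final String name;
        final int docCount;

        Segment(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    /** Returns the segments in the snapshot with fewer than maxDocs documents. */
    static List<Segment> selectSmall(List<Segment> snapshot, int maxDocs) {
        List<Segment> small = new ArrayList<>();
        for (Segment s : snapshot) {
            if (s.docCount < maxDocs) {
                small.add(s);
            }
        }
        return small;
    }

    /**
     * Builds the segment list a reader would see: the snapshot with its small
     * segments replaced by a single merged segment. Only the snapshot is
     * consulted, so segments flushed after the snapshot never appear.
     */
    static List<Segment> readerView(List<Segment> snapshot, int maxDocs) {
        if (selectSmall(snapshot, maxDocs).size() < 2) {
            return new ArrayList<>(snapshot); // nothing worth merging
        }
        List<Segment> view = new ArrayList<>();
        int mergedDocs = 0;
        for (Segment s : snapshot) {
            if (s.docCount < maxDocs) {
                mergedDocs += s.docCount; // folded into the merged segment
            } else {
                view.add(s); // large segments are kept as they are
            }
        }
        view.add(new Segment("merged", mergedDocs));
        return view;
    }

    public static void main(String[] args) {
        List<Segment> live = new ArrayList<>();
        live.add(new Segment("_0", 100000));
        live.add(new Segment("_1", 40));
        live.add(new Segment("_2", 25));
        live.add(new Segment("_3", 60));

        // Point-in-time copy taken when refresh starts.
        List<Segment> snapshot = new ArrayList<>(live);

        // Concurrent indexing flushes another segment while the merge runs.
        live.add(new Segment("_4", 10));

        for (Segment s : readerView(snapshot, 1000)) {
            System.out.println(s.name + " docs=" + s.docCount);
        }
    }
}
```

      In this sketch the reader ends up with two segments, _0 and one merged segment of 125 documents, and _4 is not visible until a later refresh. A real implementation would also have to deal with deletes, merge failures, and how long refresh is willing to wait, none of which is modeled here.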

      I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!

        Attachments

        1. image-2021-03-22-10-36-32-201.png
          4 kB
          Michael McCandless
        2. test.diff
          3 kB
          Simon Willnauer
        3. failure_log.txt
          36.16 MB
          Simon Willnauer
        4. failed-tests.patch
          8 kB
          Nhat Nguyen
        5. LUCENE-8962_demo.png
          24 kB
          Michael Froh

              People

              • Assignee: Unassigned
              • Reporter: Michael McCandless (mikemccand)
              • Votes: 0
              • Watchers: 17

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 31h