Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8962

Can we merge small segments during refresh, for faster searching?

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      With near-real-time search we ask IndexWriter to write all in-memory segments to disk and open an IndexReader to search them, and this is typically a quick operation.

      However, when you use many threads for concurrent indexing, IndexWriter will accumulate write many small segments during refresh and this then adds search-time cost as searching must visit all of these tiny segments.

      The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve {{IndexWriter'}}s refresh to optionally kick off merge policy to merge segments below some threshold before opening the near-real-time reader?  It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...

      One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...

      I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!

        Attachments

        1. LUCENE-8962_demo.png
          24 kB
          Michael Froh

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                mikemccand Michael McCandless
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 5h 10m
                  5h 10m