Lucene - Core
  1. Lucene - Core
  2. LUCENE-6849

Add IndexWriter API to write segment(s) without refreshing them

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today, the only way to have IndexWriter free up some heap is to invoke refresh or flush or close it, but these are all quite costly, and do much more than simply "move bytes to disk".

      I think we should add a simple API, e.g. "move the biggest in-memory segment to disk" to 1) give more granularity (there could be multiple in-memory segments), and 2) only move bytes to disk (not refresh, not fsync, etc.).

      This way apps that want to be more careful on how heap is used can have more control.

      1. LUCENE-6849.patch
        5 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Trivial patch, tests pass.

        I think the simplest approach here is to just make the existing flush
        method public?

        This will write all in-memory segments to disk, but not fsync nor open
        new (NRT) reader on them.

        We can later explore finer granularity if we really need to flush only
        the biggest segment or something.

        Show
        Michael McCandless added a comment - Trivial patch, tests pass. I think the simplest approach here is to just make the existing flush method public? This will write all in-memory segments to disk, but not fsync nor open new (NRT) reader on them. We can later explore finer granularity if we really need to flush only the biggest segment or something.
        Hide
        Shai Erera added a comment - - edited

        LGTM. And +1 on making both flush public API. It's an expert API and I believe users who intend to call flush() can also understand the implications of calling flush(true, true). Later we can consider consolidating and enhance this with a flush(FlushOptions) method, where FlushOptions lets you specify whether you want to merge, applyDeletes, segment size flush threshold etc.

        A few comments:

        • If you make the second flush() public
          • I think we should document in flush() when you should use the second one?
          • We should add a testFlushNoCommitButMergeAndApplyDeletes?
          • Add the second flush() variant to RandomIndexWriter?
        • In RandomIndexWriter.maybeFlushOrCommit, should we also sometimes randomly apply deletes and trigger merges?
        Show
        Shai Erera added a comment - - edited LGTM. And +1 on making both flush public API. It's an expert API and I believe users who intend to call flush() can also understand the implications of calling flush(true, true) . Later we can consider consolidating and enhance this with a flush(FlushOptions) method, where FlushOptions lets you specify whether you want to merge, applyDeletes, segment size flush threshold etc. A few comments: If you make the second flush() public I think we should document in flush() when you should use the second one? We should add a testFlushNoCommitButMergeAndApplyDeletes? Add the second flush() variant to RandomIndexWriter? In RandomIndexWriter.maybeFlushOrCommit , should we also sometimes randomly apply deletes and trigger merges?
        Hide
        Robert Muir added a comment -

        I disagree. I don't think we should add anything but a no-arg flush() here.

        If its going to be more engineered than that, then this is the wrong direction: lets just not expose it at all.

        Show
        Robert Muir added a comment - I disagree. I don't think we should add anything but a no-arg flush() here. If its going to be more engineered than that, then this is the wrong direction: lets just not expose it at all.
        Hide
        Adrien Grand added a comment -

        +1 on the patch and agreed that we should only expose a no-argument flush method, it's actually nice that the patch decreased the visibility of flush(boolean, boolean).

        Show
        Adrien Grand added a comment - +1 on the patch and agreed that we should only expose a no-argument flush method, it's actually nice that the patch decreased the visibility of flush(boolean, boolean).
        Hide
        ASF subversion and git services added a comment -

        Commit 1713646 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1713646 ]

        LUCENE-6849: expose IndexWriter.flush to move in-memory segments to disk

        Show
        ASF subversion and git services added a comment - Commit 1713646 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1713646 ] LUCENE-6849 : expose IndexWriter.flush to move in-memory segments to disk
        Hide
        ASF subversion and git services added a comment -

        Commit 1713649 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1713649 ]

        LUCENE-6849: expose IndexWriter.flush to move in-memory segments to disk

        Show
        ASF subversion and git services added a comment - Commit 1713649 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1713649 ] LUCENE-6849 : expose IndexWriter.flush to move in-memory segments to disk
        Hide
        Simon Willnauer added a comment -

        nice change - simple but powerful, seems like we got that part of IW reasonably clean

        Show
        Simon Willnauer added a comment - nice change - simple but powerful, seems like we got that part of IW reasonably clean

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development