Lucene - Core
  1. Lucene - Core
  2. LUCENE-2402

Add an explicit method to invoke IndexDeletionPolicy

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Today, if one uses an IDP which holds onto segments, such as SnapshotDeletionPolicy, or any other IDP in the tests, those segments are left in the index even if the IDP no longer references them, until IW.commit() is called (and actually does something). I'd like to add a specific method to IW which will invoke the IDP's logic and get rid of the unused segments w/o forcing the user to call IW.commit(). There are a couple of reasons for that:

      • Segments take up sometimes valuable HD space, and the application may wish to reclaim that space immediately. In some scenarios, the index is updated once in several hours (or even days), and waiting until then may not be acceptable.
      • I think it's a cleaner solution than waiting for the next commit() to happen. One can still wait for it if one wants, but otherwise it will give you the ability to immediately get rid of those segments.
      • TestSnapshotDeletionPolicy includes this code, which only strengthens (IMO) the need for such method:
        // Add one more document to force writer to commit a
        // final segment, so deletion policy has a chance to
        // delete again:
        Document doc = new Document();
        doc.add(new Field("content", "aaa", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        

      If IW had an explicit method, that code would not need to exist there at all ...

      Here comes the fun part - naming the baby:

      • invokeDeletionPolicy – describes exactly what is going to happen. However, if the user did not set IDP at all (relying on default, which I think many do), users won't understand what is it.
      • deleteUnusedSegments - more user-friendly, assuming users understand what 'segments' are.

      BTW, IW already has deleteUnusedFiles() which only tries to delete unreferenced files that failed to delete before (such as on Windows, due to e.g. open readers). Perhaps instead of inventing a new name, we can change IW.deleteUnusedFiles to call IndexFileDeleter.checkpoint (instead of deletePendingFiles) which deletes those files + calls IDP.onCommit().

      1. LUCENE-2402.patch
        5 kB
        Shai Erera
      2. LUCENE-2402.patch
        4 kB
        Shai Erera

        Activity

        Grant Ingersoll made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563486 ] jira [ 12584252 ]
        Mark Thomas made changes -
        Workflow jira [ 12508729 ] Default workflow, editable Closed status [ 12563486 ]
        Shai Erera made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Shai Erera added a comment -

        Committed revision 941417.

        Show
        Shai Erera added a comment - Committed revision 941417.
        Shai Erera made changes -
        Fix Version/s 3.1 [ 12314822 ]
        Shai Erera made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Hide
        Shai Erera added a comment -

        Backport to 3.1

        Show
        Shai Erera added a comment - Backport to 3.1
        Shai Erera made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Shai Erera added a comment -

        Committed revision 936605.

        Show
        Shai Erera added a comment - Committed revision 936605.
        Hide
        Shai Erera added a comment -

        ok I'll remove them before commit. Will commit this later - giving chance for more people to review.

        Show
        Shai Erera added a comment - ok I'll remove them before commit. Will commit this later - giving chance for more people to review.
        Hide
        Michael McCandless added a comment -

        Patch looks good Shai!

        But you can remove the " ..." in the message output – I had put into my code thinking there may be details we put instead of that "...", but, I think there are no further details.

        Show
        Michael McCandless added a comment - Patch looks good Shai! But you can remove the " ..." in the message output – I had put into my code thinking there may be details we put instead of that "...", but, I think there are no further details.
        Shai Erera made changes -
        Attachment LUCENE-2402.patch [ 12442442 ]
        Hide
        Shai Erera added a comment -

        Adds revisitPolicy to IFD (package-private) and also calls it from IW.deleteUnusedFiles. All tests pass

        Show
        Shai Erera added a comment - Adds revisitPolicy to IFD (package-private) and also calls it from IW.deleteUnusedFiles. All tests pass
        Hide
        Michael McCandless added a comment -

        +1, I think that's a good approach.

        Show
        Michael McCandless added a comment - +1, I think that's a good approach.
        Hide
        Shai Erera added a comment -

        Ok I understand. About the name, revisitPolicy is not exactly accurate (I think?) because it also deletes the pending files (and not just revisit the policy). Unless IW.deleteUnusedFiles will invoke both deletePendingFiles and revisitPolicy ... the latter will just do

        if (commits.size() > 0) {
          policy.onCommit(commits);
          deleteCommits();
        }
        

        What do you think?

        Show
        Shai Erera added a comment - Ok I understand. About the name, revisitPolicy is not exactly accurate (I think?) because it also deletes the pending files (and not just revisit the policy). Unless IW.deleteUnusedFiles will invoke both deletePendingFiles and revisitPolicy ... the latter will just do if (commits.size() > 0) { policy.onCommit(commits); deleteCommits(); } What do you think?
        Hide
        Michael McCandless added a comment -

        So maybe reuse deletePendingFiles?

        Hmm I think this is too much? EG we call that on every checkpoint call... so this'd mean the IDP gets 2 onCommit calls per commit (one "fake" one and one "real")?

        I think it's better if the "fake" onCommit only arrives when IW.deleteUnusedFiles is invoked?

        Show
        Michael McCandless added a comment - So maybe reuse deletePendingFiles? Hmm I think this is too much? EG we call that on every checkpoint call... so this'd mean the IDP gets 2 onCommit calls per commit (one "fake" one and one "real")? I think it's better if the "fake" onCommit only arrives when IW.deleteUnusedFiles is invoked?
        Hide
        Shai Erera added a comment -

        I think we need to open up a new package private method in IFD, eg "revisitPolicy" or some such

        So maybe reuse deletePendingFiles? I.e. this method does not accept anything, and so seems that revisitPolicy won't (just pass on to IDP its 'commits' member). IFD is anyway an internal API ... I'll give it a try. deletePendingFiles would just do what it does, then in the end call policy.onCommit(commits) and deleteCommits() ... what do you think? And then IW does not need to change.

        Show
        Shai Erera added a comment - I think we need to open up a new package private method in IFD, eg "revisitPolicy" or some such So maybe reuse deletePendingFiles? I.e. this method does not accept anything, and so seems that revisitPolicy won't (just pass on to IDP its 'commits' member). IFD is anyway an internal API ... I'll give it a try. deletePendingFiles would just do what it does, then in the end call policy.onCommit(commits) and deleteCommits() ... what do you think? And then IW does not need to change.
        Hide
        Michael McCandless added a comment -

        The failure happens in CommitPoint's ctor in the assert statement which verifies the SegmentInos does not have external Directory.

        Urgh.... indeed we must protect against one thread doing addIndexes
        and another thread calling deleteUnusedFiles.

        The way things work today (and I agree we should fix this) is
        addIndexes* immediately modify the in memory segments to include
        foreign (external Dir) segments, then proceed to target these foreign
        segments by merging them away.

        I've tried to sync on commitLock (which seems good anyway), but the test kept failing.

        This isn't strictly necessary, I think? The two ops (ongoing commit
        – takes time since fync can be so slow – and deleting unused files)
        are orthogonal. They both invoke IDP/IFD, but this is still protected
        (sync'd on IW)...

        Only when passing rollbackSI to checkpoint does the test pass.

        In fact this is the right track, I think... rollbackSI is a clone of
        the last committed segments, whereas the "live" segments contains all
        uncommitted stuff that's happened since. We really should not be
        treating these pending changes as if they were a commit point... so
        using rollbackSI makes sense.

        But, the problem is, IFD.checkpoint will hold a new commit point when
        you pass isCommit=true, which is no good. I think we need to open up
        a new package private method in IFD, eg "revisitPolicy" or some such, which
        just does:

        if (infoStream != null) {
          message("now visit...");
        }
        deletePendingFiles();
        policy.onCommit(commits);
        deleteCommits();
        

        Ie, most of what IFD.checkpoint does when isCommit=true, minus the
        incRef (which has already been done, in the past, for this segments)
        and the commits.add of a new commit point.

        Invoking IDP.onCommit still isn't quite right (no new commit was done)
        but I think it's OK for now? (Adding some kind of "visit" method
        feels like overkill...).

        BTW, the test fails on DirReader.doClose, where it checks if writer != null and then calls deleteUnusedFiles. So I guess it's a NRT problem only.

        Hmm the problem should be wider than just NRT. Any time one thread
        calls deleteUnusedFiles while another is doing addIndexes*, this bug
        should be hit-able.

        In general, that that addIndexesNoOptimize messes w/ SI seems dangerous to me, because that's undocumented and unprotected

        I agree.

        I'd love to [eventually] change addIndexes*, so that it does all its
        work "privately" and only in the end atomically "checks in" the
        (not-foreign) segments it produced. It gets tricky, though, since
        "normal" segment merging, and flushing, is still ongoing, and we'd not
        want to do redundant merging work.

        This also messes up NRT, ie, if you open an NRT reader during an
        addIndexes*, you can see some segments already added and some now –
        ie NRT violates the advertised atomicity of addIndexes* (the javadocs
        note this).

        I think we really need to factor IW apart:

        1. Indexer (add/update/delete), also flushes new segments
        1. Keeper of the segments file (exposes API to make atomic changes to
          segments file, does commits, interacts w/ IDP/IFD)
        1. Merger (normal merging, optimize, expungeDeletes, addIndexes)
        1. Reader pool
        Show
        Michael McCandless added a comment - The failure happens in CommitPoint's ctor in the assert statement which verifies the SegmentInos does not have external Directory. Urgh.... indeed we must protect against one thread doing addIndexes and another thread calling deleteUnusedFiles. The way things work today (and I agree we should fix this) is addIndexes* immediately modify the in memory segments to include foreign (external Dir) segments, then proceed to target these foreign segments by merging them away. I've tried to sync on commitLock (which seems good anyway), but the test kept failing. This isn't strictly necessary, I think? The two ops (ongoing commit – takes time since fync can be so slow – and deleting unused files) are orthogonal. They both invoke IDP/IFD, but this is still protected (sync'd on IW)... Only when passing rollbackSI to checkpoint does the test pass. In fact this is the right track, I think... rollbackSI is a clone of the last committed segments, whereas the "live" segments contains all uncommitted stuff that's happened since. We really should not be treating these pending changes as if they were a commit point... so using rollbackSI makes sense. But, the problem is, IFD.checkpoint will hold a new commit point when you pass isCommit=true, which is no good. I think we need to open up a new package private method in IFD, eg "revisitPolicy" or some such, which just does: if (infoStream != null ) { message( "now visit..." ); } deletePendingFiles(); policy.onCommit(commits); deleteCommits(); Ie, most of what IFD.checkpoint does when isCommit=true, minus the incRef (which has already been done, in the past, for this segments) and the commits.add of a new commit point. Invoking IDP.onCommit still isn't quite right (no new commit was done) but I think it's OK for now? (Adding some kind of "visit" method feels like overkill...). BTW, the test fails on DirReader.doClose, where it checks if writer != null and then calls deleteUnusedFiles. So I guess it's a NRT problem only. Hmm the problem should be wider than just NRT. Any time one thread calls deleteUnusedFiles while another is doing addIndexes*, this bug should be hit-able. In general, that that addIndexesNoOptimize messes w/ SI seems dangerous to me, because that's undocumented and unprotected I agree. I'd love to [eventually] change addIndexes*, so that it does all its work "privately" and only in the end atomically "checks in" the (not-foreign) segments it produced. It gets tricky, though, since "normal" segment merging, and flushing, is still ongoing, and we'd not want to do redundant merging work. This also messes up NRT, ie, if you open an NRT reader during an addIndexes*, you can see some segments already added and some now – ie NRT violates the advertised atomicity of addIndexes* (the javadocs note this). I think we really need to factor IW apart: Indexer (add/update/delete), also flushes new segments Keeper of the segments file (exposes API to make atomic changes to segments file, does commits, interacts w/ IDP/IFD) Merger (normal merging, optimize, expungeDeletes, addIndexes) Reader pool
        Shai Erera made changes -
        Field Original Value New Value
        Attachment LUCENE-2402.patch [ 12442400 ]
        Hide
        Shai Erera added a comment -

        Patch changes deleteUnusedFiles to call IFD.checkpoint and also adds a testDeleteUnusedFiles2 to TestIndexWriter.

        Currently, TestIndexWriterReader.testDuringAddIndexes fails, if deleteUnusedFiles is coded like this:

        public synchronized void deleteUnusedFiles() throws IOException {
          deleter.checkpoint(segmentInfos, true);
        }
        

        The failure happens in CommitPoint's ctor in the assert statement which verifies the SegmentInos does not have external Directory. When I debug-traced the test, it passed and so I concluded it's a concurrency issue (and indeed testDuringAddIndexes spawns several threads.

        addIndexesNoOptimize does change SegmentInfos as it adds indexes, however at the end it 'fixes' their Directory reference. I wondered how is regular commit() works when addIndexesNoOptimize is called, but couldn't find any synchronization block where one blocks the other. Eventually, I've changed deleteUnusedFiles to this:

        public synchronized void deleteUnusedFiles() throws IOException {
          synchronized (commitLock) {
            deleter.checkpoint(rollbackSegmentInfos, true);
            // deleter.checkpoint((SegmentInfos) segmentInfos.clone(), true);
          }
        }
        

        I've tried to sync on commitLock (which seems good anyway), but the test kept failing. Even cloning SI did not work because it might have changed just before the clone. Only when passing rollbackSI to checkpoint does the test pass. But I'm not sure if that's the right solution, as when I debug-traced it and put a break point just before the call to checkpoint, SI included one segment w/ a different name than rollbackSI ...

        BTW, the test fails on DirReader.doClose, where it checks if writer != null and then calls deleteUnusedFiles. So I guess it's a NRT problem only.

        In general, that that addIndexesNoOptimize messes w/ SI seems dangerous to me, because that's undocumented and unprotected - e.g. if someone extends IW and adds some logic which requires reading SI ... I'm not sure how to solve it, but that seems unrelated to that issue (probably much more complicated to solve).

        Show
        Shai Erera added a comment - Patch changes deleteUnusedFiles to call IFD.checkpoint and also adds a testDeleteUnusedFiles2 to TestIndexWriter. Currently, TestIndexWriterReader.testDuringAddIndexes fails, if deleteUnusedFiles is coded like this: public synchronized void deleteUnusedFiles() throws IOException { deleter.checkpoint(segmentInfos, true ); } The failure happens in CommitPoint's ctor in the assert statement which verifies the SegmentInos does not have external Directory. When I debug-traced the test, it passed and so I concluded it's a concurrency issue (and indeed testDuringAddIndexes spawns several threads. addIndexesNoOptimize does change SegmentInfos as it adds indexes, however at the end it 'fixes' their Directory reference. I wondered how is regular commit() works when addIndexesNoOptimize is called, but couldn't find any synchronization block where one blocks the other. Eventually, I've changed deleteUnusedFiles to this: public synchronized void deleteUnusedFiles() throws IOException { synchronized (commitLock) { deleter.checkpoint(rollbackSegmentInfos, true ); // deleter.checkpoint((SegmentInfos) segmentInfos.clone(), true ); } } I've tried to sync on commitLock (which seems good anyway), but the test kept failing. Even cloning SI did not work because it might have changed just before the clone. Only when passing rollbackSI to checkpoint does the test pass. But I'm not sure if that's the right solution, as when I debug-traced it and put a break point just before the call to checkpoint, SI included one segment w/ a different name than rollbackSI ... BTW, the test fails on DirReader.doClose, where it checks if writer != null and then calls deleteUnusedFiles. So I guess it's a NRT problem only. In general, that that addIndexesNoOptimize messes w/ SI seems dangerous to me, because that's undocumented and unprotected - e.g. if someone extends IW and adds some logic which requires reading SI ... I'm not sure how to solve it, but that seems unrelated to that issue (probably much more complicated to solve).
        Hide
        Michael McCandless added a comment -

        Lets reuse IW.deleteUnusedFiles() ?

        +1

        Show
        Michael McCandless added a comment - Lets reuse IW.deleteUnusedFiles() ? +1
        Hide
        Earwin Burrfoot added a comment -

        Lets reuse IW.deleteUnusedFiles() ?
        No need to multiply confusion )

        Show
        Earwin Burrfoot added a comment - Lets reuse IW.deleteUnusedFiles() ? No need to multiply confusion )
        Shai Erera created issue -

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development