Lucene - Core
  1. Lucene - Core
  2. LUCENE-4575

Allow IndexWriter to commit, even just commitData

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Spinoff from here http://lucene.472066.n3.nabble.com/commit-with-only-commitData-td4022155.html.

      In some cases, it is valuable to be able to commit changes to the index, even if the changes are just commitData. Such data is sometimes used by applications to register in the index some global application information/state.

      The proposal is:

      • Add a setCommitData() API and separate it from commit() and prepareCommit() (simplify their API)
      • When that API is called, flip on the dirty/changes bit, so that this gets committed even if no other changes were made to the index.

      I will work on a patch a post.

      1. LUCENE-4575.patch
        31 kB
        Shai Erera
      2. LUCENE-4575.patch
        27 kB
        Shai Erera
      3. LUCENE-4575.patch
        6 kB
        Shai Erera
      4. LUCENE-4575-testcase.patch
        2 kB
        Michael McCandless

        Activity

        Hide
        Shai Erera added a comment -

        Patch adds setCommitData to IndexWriter and increase changeCount as well as sets that commitData on segmentInfos. It also adds a test to verify the behavior.

        Regarding back-compat - I prefer to nuke commit(data) and prepcommit(data), in exchange for this API, for both trunk and 4.x.

        This patch however supports the old commit/prepcommit(data) API, but I think that it will be simpler if we just nuke these API. The migration to the new API is a no-brainer, just call setCommitData before your commit().

        I don't intend to commit it yet, depending on how we decide to handle back-compat. If we decide to keep the back-compat support, I want to move the commit(data) and prepCommit(data) impls to their respective no-data versions, and then have these API deprecated and call setCommitData() followed by the respective no-data version.

        Show
        Shai Erera added a comment - Patch adds setCommitData to IndexWriter and increase changeCount as well as sets that commitData on segmentInfos. It also adds a test to verify the behavior. Regarding back-compat - I prefer to nuke commit(data) and prepcommit(data), in exchange for this API, for both trunk and 4.x. This patch however supports the old commit/prepcommit(data) API, but I think that it will be simpler if we just nuke these API. The migration to the new API is a no-brainer, just call setCommitData before your commit(). I don't intend to commit it yet, depending on how we decide to handle back-compat. If we decide to keep the back-compat support, I want to move the commit(data) and prepCommit(data) impls to their respective no-data versions, and then have these API deprecated and call setCommitData() followed by the respective no-data version.
        Hide
        Michael McCandless added a comment -

        +1 to do a hard break; this is expert.

        Show
        Michael McCandless added a comment - +1 to do a hard break; this is expert.
        Hide
        Shai Erera added a comment -

        Thanks. I forgot to mention two things about the changes in the patch, which I wasn't sure about:

        1. I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections?
        2. I set pass the copied map directly to segmentInfos, rather than saving it in a member in IW. Do you see any issues with it? (I'm thinking about rollback, even though we have another copy of the segmentInfos for rollback purposes ...)
        Show
        Shai Erera added a comment - Thanks. I forgot to mention two things about the changes in the patch, which I wasn't sure about: I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections? I set pass the copied map directly to segmentInfos, rather than saving it in a member in IW. Do you see any issues with it? (I'm thinking about rollback, even though we have another copy of the segmentInfos for rollback purposes ...)
        Hide
        Yonik Seeley added a comment -

        I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections?

        Do any users care about order (i.e. they pass in a LinkedHashMap)? If would be trivial to preserve if it added value for some.

        Show
        Yonik Seeley added a comment - I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections? Do any users care about order (i.e. they pass in a LinkedHashMap)? If would be trivial to preserve if it added value for some.
        Hide
        Shai Erera added a comment -

        We use commitData extensively but we don't care about the order. We store key/value pairs.

        I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap). Therefore, if we want to preserve the type of the Map, we'd need to change DataInput/Output code. I'm not sure it's worth the hassle, but let's discuss that anyway on a separate issue? It's not really related to how the map is set.

        Show
        Shai Erera added a comment - We use commitData extensively but we don't care about the order. We store key/value pairs. I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap). Therefore, if we want to preserve the type of the Map, we'd need to change DataInput/Output code. I'm not sure it's worth the hassle, but let's discuss that anyway on a separate issue? It's not really related to how the map is set.
        Hide
        Uwe Schindler added a comment -

        The API returns Map<String,String>, so we make no garanties about order.

        Show
        Uwe Schindler added a comment - The API returns Map<String,String>, so we make no garanties about order.
        Hide
        Yonik Seeley added a comment -

        I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap).

        If a user cared about order, then they would pass a LinkedHashMap. Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap.

        it's not really related to how the map is set.

        It is... if you make a copy of the map and we want to preserve order, it's new LinkedHashMap instead of HashMap.

        It's a minor enough point I don't think it does deserve it's own issue. I don't personally care about preserving order - but I did think it was worth at least bringing up.

        Show
        Yonik Seeley added a comment - I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap). If a user cared about order, then they would pass a LinkedHashMap. Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap. it's not really related to how the map is set. It is... if you make a copy of the map and we want to preserve order, it's new LinkedHashMap instead of HashMap. It's a minor enough point I don't think it does deserve it's own issue. I don't personally care about preserving order - but I did think it was worth at least bringing up.
        Hide
        Shai Erera added a comment -

        Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap.

        So you propose that the code will always initialize LHM in DataInput, that way preserving order whether required or not? Yes, I guess that we can do that. But I wonder if we should? We didn't so far, and nobody complained. And since it's an internal change, we can always make that change in the future if somebody asks?

        Show
        Shai Erera added a comment - Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap. So you propose that the code will always initialize LHM in DataInput, that way preserving order whether required or not? Yes, I guess that we can do that. But I wonder if we should? We didn't so far, and nobody complained. And since it's an internal change, we can always make that change in the future if somebody asks?
        Hide
        Shai Erera added a comment -

        Patch removes commit/prepCommit(commitData) versions and modifies code that used it (mostly tests code) to call setCommit beforehand.

        I also added a CHANGES entry under back-compat section.

        All tests pass. I'll wait for some review before I commit it.

        Show
        Shai Erera added a comment - Patch removes commit/prepCommit(commitData) versions and modifies code that used it (mostly tests code) to call setCommit beforehand. I also added a CHANGES entry under back-compat section. All tests pass. I'll wait for some review before I commit it.
        Hide
        Michael McCandless added a comment -

        I thought we were going to rename ensureOpen's confusing boolean param?

        IW.setCommitData should be sync'd I think, eg to ensure visibility
        across threads of the changes to sis.userData?

        Hmm ... I think there's a thread hazard here, during commit; I think
        if pendingCommit is not null you should also call
        pendingCommit.setUserData? Else, a commit can finish and "undo" the
        user's change to the commit data (see finishCommit, where it calls
        .setUserData). Maybe we need a thread safety test
        here ...

        Show
        Michael McCandless added a comment - I thought we were going to rename ensureOpen's confusing boolean param? IW.setCommitData should be sync'd I think, eg to ensure visibility across threads of the changes to sis.userData? Hmm ... I think there's a thread hazard here, during commit; I think if pendingCommit is not null you should also call pendingCommit.setUserData? Else, a commit can finish and "undo" the user's change to the commit data (see finishCommit, where it calls .setUserData). Maybe we need a thread safety test here ...
        Hide
        Michael McCandless added a comment -

        Actually I think we should just remove that .setUserData inside finishCommit?

        Also, should we add an IW.getCommitData?

        Show
        Michael McCandless added a comment - Actually I think we should just remove that .setUserData inside finishCommit? Also, should we add an IW.getCommitData?
        Hide
        Shai Erera added a comment -

        I thought we were going to rename ensureOpen's confusing boolean param?

        Right, but for some reason I thought that you're going to do that . I'll do it in the next patch.

        IW.setCommitData should be sync'd I think, eg to ensure visibility across threads of the changes to sis.userData?

        Ok

        Hmm ... I think there's a thread hazard here, during commit

        I think you're right. Not sure how practical, because I believe that usually the commit thread will also be the one that calls setCommitData, but it is possible.
        I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either:

        1. Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member.
        2. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null.

        Option #2 means that there's no API break, no synchronization is needed on setCommitData and practically everything remains the same. We can still remove the redundant .setCommitData in finishCommit regadless.

        should we add an IW.getCommitData?

        I think that that'd be great ! Today the only way to do it is if you refresh a reader (expensive). I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?

        Show
        Shai Erera added a comment - I thought we were going to rename ensureOpen's confusing boolean param? Right, but for some reason I thought that you're going to do that . I'll do it in the next patch. IW.setCommitData should be sync'd I think, eg to ensure visibility across threads of the changes to sis.userData? Ok Hmm ... I think there's a thread hazard here, during commit I think you're right. Not sure how practical, because I believe that usually the commit thread will also be the one that calls setCommitData, but it is possible. I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either: Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null. Option #2 means that there's no API break, no synchronization is needed on setCommitData and practically everything remains the same. We can still remove the redundant .setCommitData in finishCommit regadless. should we add an IW.getCommitData? I think that that'd be great ! Today the only way to do it is if you refresh a reader (expensive). I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?
        Hide
        Michael McCandless added a comment -

        I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either:

        1. Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member.
        2. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null.

        Hmm, I'd rather not store the member inside IW and inside SIS; just seems safer to have a single clear place where this is tracked.

        Also, I like the new API so I'd rather not do #2?

        I think just removing that line in finishCommit should fix the bug ... but first we need a test exposing it.

        I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?

        Yes.

        Show
        Michael McCandless added a comment - I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either: Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null. Hmm, I'd rather not store the member inside IW and inside SIS; just seems safer to have a single clear place where this is tracked. Also, I like the new API so I'd rather not do #2? I think just removing that line in finishCommit should fix the bug ... but first we need a test exposing it. I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos? Yes.
        Hide
        Shai Erera added a comment -

        I'll make the changes, and also it seems like you were suggesting that earlier – allow setCommitData to affect the pendingCommit too. I think that's valuable because you can e.g. call prerCommit() -> setCommitData() -> commit() – the setCD() in the middle lets you create a commitData that will pertain to the state of the index after the commit.

        I'll make all the changes and post a new patch, probably tomorrow.

        Show
        Shai Erera added a comment - I'll make the changes, and also it seems like you were suggesting that earlier – allow setCommitData to affect the pendingCommit too. I think that's valuable because you can e.g. call prerCommit() -> setCommitData() -> commit() – the setCD() in the middle lets you create a commitData that will pertain to the state of the index after the commit. I'll make all the changes and post a new patch, probably tomorrow.
        Hide
        Michael McCandless added a comment -

        I'll make a test exposing the bug ...

        Show
        Michael McCandless added a comment - I'll make a test exposing the bug ...
        Hide
        Michael McCandless added a comment -

        Simple test showing that commit data is lost ... I didn't need to use threads; just call .setCommitData after prepareCommit and before commit.

        Show
        Michael McCandless added a comment - Simple test showing that commit data is lost ... I didn't need to use threads; just call .setCommitData after prepareCommit and before commit.
        Hide
        Shai Erera added a comment -

        The test isn't exactly accurate, because it tests a scenario that is currently not supported. I.e., after calling prepareCommit(), nothing that you do on IW will be committed. Rather, to expose the bug it should be modified as follows:

        iw.setCommitData(data1);
        iw.prepareCommit();
        iw.setCommitData(data2); // that will be ignored by follow-on commit
        iw.commit();
        checkCommitData(); // will see data1
        iw.commit(); // that 'should' commit data2
        checkCommitData(); // that will see data1 again, because of the copy that happens in finishCommit()
        

        I'll modify the test like so and include it in my next patch.

        Show
        Shai Erera added a comment - The test isn't exactly accurate, because it tests a scenario that is currently not supported. I.e., after calling prepareCommit(), nothing that you do on IW will be committed. Rather, to expose the bug it should be modified as follows: iw.setCommitData(data1); iw.prepareCommit(); iw.setCommitData(data2); // that will be ignored by follow-on commit iw.commit(); checkCommitData(); // will see data1 iw.commit(); // that 'should' commit data2 checkCommitData(); // that will see data1 again, because of the copy that happens in finishCommit() I'll modify the test like so and include it in my next patch.
        Hide
        Shai Erera added a comment -

        Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit(). Following commit() merely calls infos.finishCommit() which writes the checksum and closes the output.

        Can we modify segmentInfos.write() to not write the commitData, but move it to finishCommit()? Not sure that I like this approach, because it means that finishCommit() will do slightly more work, which increases the chance of getting an IOException during commit() after prepareCommit() successfully returned, but on the other hand it's the gains might be worth it? Being able to write commitData after you know all your document additions/deletions/updates are 'safe' might prove valuable. And finishCommit() already does I/O, writing checksum ...

        What do you think?

        Show
        Shai Erera added a comment - Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit(). Following commit() merely calls infos.finishCommit() which writes the checksum and closes the output. Can we modify segmentInfos.write() to not write the commitData, but move it to finishCommit()? Not sure that I like this approach, because it means that finishCommit() will do slightly more work, which increases the chance of getting an IOException during commit() after prepareCommit() successfully returned, but on the other hand it's the gains might be worth it? Being able to write commitData after you know all your document additions/deletions/updates are 'safe' might prove valuable. And finishCommit() already does I/O, writing checksum ... What do you think?
        Hide
        Michael McCandless added a comment -

        Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit().

        Oh yeah ... I forgot about that

        Hmm ... I don't think we should move writing the commit data to finishCommit? Is it really so hard for the app to provide the commit data before calling prepareCommit?

        Show
        Michael McCandless added a comment - Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit(). Oh yeah ... I forgot about that Hmm ... I don't think we should move writing the commit data to finishCommit? Is it really so hard for the app to provide the commit data before calling prepareCommit?
        Hide
        Shai Erera added a comment -

        I don't think that we should add more work to finishCommit() either. Being able to setCommitData after prep() is just a bonus. It didn't work before, and it will continue to not work now. And I can't think of a good usecase for why an app would not be able to set commitData prior to prep(). If it comes up, we can discuss a solution again. At least we know that moving commitData write to finishCommit will solve it.

        I'll make sure the test exposes the bug you reported in IW.finishCommit().

        Show
        Shai Erera added a comment - I don't think that we should add more work to finishCommit() either. Being able to setCommitData after prep() is just a bonus. It didn't work before, and it will continue to not work now. And I can't think of a good usecase for why an app would not be able to set commitData prior to prep(). If it comes up, we can discuss a solution again. At least we know that moving commitData write to finishCommit will solve it. I'll make sure the test exposes the bug you reported in IW.finishCommit().
        Hide
        Shai Erera added a comment -

        Patch addresses the bug that Mike reported and adds a test for it. Also adds IW.getCommitData().

        Show
        Shai Erera added a comment - Patch addresses the bug that Mike reported and adds a test for it. Also adds IW.getCommitData().
        Hide
        Michael McCandless added a comment -

        +1, looks great. Thanks Shai!

        Show
        Michael McCandless added a comment - +1, looks great. Thanks Shai!
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1416361

        LUCENE-4575: add IndexWriter.setCommitData

        Show
        Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1416361 LUCENE-4575 : add IndexWriter.setCommitData
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x. Thanks Mike !

        Show
        Shai Erera added a comment - Committed to trunk and 4x. Thanks Mike !
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1416367

        LUCENE-4575: add IndexWriter.setCommitData

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1416367 LUCENE-4575 : add IndexWriter.setCommitData
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1416367

        LUCENE-4575: add IndexWriter.setCommitData

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1416367 LUCENE-4575 : add IndexWriter.setCommitData

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development