Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 4.1, 6.0
    • core/index
    • None
    • New, Patch Available

    Description

      Spinoff from here http://lucene.472066.n3.nabble.com/commit-with-only-commitData-td4022155.html.

      In some cases, it is valuable to be able to commit changes to the index, even if the changes are just commitData. Such data is sometimes used by applications to register in the index some global application information/state.

      The proposal is:

      • Add a setCommitData() API and separate it from commit() and prepareCommit() (simplify their API)
      • When that API is called, flip on the dirty/changes bit, so that this gets committed even if no other changes were made to the index.

      I will work on a patch a post.

      Attachments

        1. LUCENE-4575.patch
          6 kB
          Shai Erera
        2. LUCENE-4575.patch
          27 kB
          Shai Erera
        3. LUCENE-4575-testcase.patch
          2 kB
          Michael McCandless
        4. LUCENE-4575.patch
          31 kB
          Shai Erera

        Activity

          shaie Shai Erera added a comment -

          Patch adds setCommitData to IndexWriter and increase changeCount as well as sets that commitData on segmentInfos. It also adds a test to verify the behavior.

          Regarding back-compat - I prefer to nuke commit(data) and prepcommit(data), in exchange for this API, for both trunk and 4.x.

          This patch however supports the old commit/prepcommit(data) API, but I think that it will be simpler if we just nuke these API. The migration to the new API is a no-brainer, just call setCommitData before your commit().

          I don't intend to commit it yet, depending on how we decide to handle back-compat. If we decide to keep the back-compat support, I want to move the commit(data) and prepCommit(data) impls to their respective no-data versions, and then have these API deprecated and call setCommitData() followed by the respective no-data version.

          shaie Shai Erera added a comment - Patch adds setCommitData to IndexWriter and increase changeCount as well as sets that commitData on segmentInfos. It also adds a test to verify the behavior. Regarding back-compat - I prefer to nuke commit(data) and prepcommit(data), in exchange for this API, for both trunk and 4.x. This patch however supports the old commit/prepcommit(data) API, but I think that it will be simpler if we just nuke these API. The migration to the new API is a no-brainer, just call setCommitData before your commit(). I don't intend to commit it yet, depending on how we decide to handle back-compat. If we decide to keep the back-compat support, I want to move the commit(data) and prepCommit(data) impls to their respective no-data versions, and then have these API deprecated and call setCommitData() followed by the respective no-data version.

          +1 to do a hard break; this is expert.

          mikemccand Michael McCandless added a comment - +1 to do a hard break; this is expert.
          shaie Shai Erera added a comment -

          Thanks. I forgot to mention two things about the changes in the patch, which I wasn't sure about:

          1. I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections?
          2. I set pass the copied map directly to segmentInfos, rather than saving it in a member in IW. Do you see any issues with it? (I'm thinking about rollback, even though we have another copy of the segmentInfos for rollback purposes ...)
          shaie Shai Erera added a comment - Thanks. I forgot to mention two things about the changes in the patch, which I wasn't sure about: I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections? I set pass the copied map directly to segmentInfos, rather than saving it in a member in IW. Do you see any issues with it? (I'm thinking about rollback, even though we have another copy of the segmentInfos for rollback purposes ...)
          yseeley@gmail.com Yonik Seeley added a comment -

          I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections?

          Do any users care about order (i.e. they pass in a LinkedHashMap)? If would be trivial to preserve if it added value for some.

          yseeley@gmail.com Yonik Seeley added a comment - I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections? Do any users care about order (i.e. they pass in a LinkedHashMap)? If would be trivial to preserve if it added value for some.
          shaie Shai Erera added a comment -

          We use commitData extensively but we don't care about the order. We store key/value pairs.

          I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap). Therefore, if we want to preserve the type of the Map, we'd need to change DataInput/Output code. I'm not sure it's worth the hassle, but let's discuss that anyway on a separate issue? It's not really related to how the map is set.

          shaie Shai Erera added a comment - We use commitData extensively but we don't care about the order. We store key/value pairs. I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap). Therefore, if we want to preserve the type of the Map, we'd need to change DataInput/Output code. I'm not sure it's worth the hassle, but let's discuss that anyway on a separate issue? It's not really related to how the map is set.
          uschindler Uwe Schindler added a comment -

          The API returns Map<String,String>, so we make no garanties about order.

          uschindler Uwe Schindler added a comment - The API returns Map<String,String>, so we make no garanties about order.
          yseeley@gmail.com Yonik Seeley added a comment -

          I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap).

          If a user cared about order, then they would pass a LinkedHashMap. Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap.

          it's not really related to how the map is set.

          It is... if you make a copy of the map and we want to preserve order, it's new LinkedHashMap instead of HashMap.

          It's a minor enough point I don't think it does deserve it's own issue. I don't personally care about preserving order - but I did think it was worth at least bringing up.

          yseeley@gmail.com Yonik Seeley added a comment - I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap). If a user cared about order, then they would pass a LinkedHashMap. Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap. it's not really related to how the map is set. It is... if you make a copy of the map and we want to preserve order, it's new LinkedHashMap instead of HashMap. It's a minor enough point I don't think it does deserve it's own issue. I don't personally care about preserving order - but I did think it was worth at least bringing up.
          shaie Shai Erera added a comment -

          Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap.

          So you propose that the code will always initialize LHM in DataInput, that way preserving order whether required or not? Yes, I guess that we can do that. But I wonder if we should? We didn't so far, and nobody complained. And since it's an internal change, we can always make that change in the future if somebody asks?

          shaie Shai Erera added a comment - Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap. So you propose that the code will always initialize LHM in DataInput, that way preserving order whether required or not? Yes, I guess that we can do that. But I wonder if we should? We didn't so far, and nobody complained. And since it's an internal change, we can always make that change in the future if somebody asks?
          shaie Shai Erera added a comment -

          Patch removes commit/prepCommit(commitData) versions and modifies code that used it (mostly tests code) to call setCommit beforehand.

          I also added a CHANGES entry under back-compat section.

          All tests pass. I'll wait for some review before I commit it.

          shaie Shai Erera added a comment - Patch removes commit/prepCommit(commitData) versions and modifies code that used it (mostly tests code) to call setCommit beforehand. I also added a CHANGES entry under back-compat section. All tests pass. I'll wait for some review before I commit it.

          I thought we were going to rename ensureOpen's confusing boolean param?

          IW.setCommitData should be sync'd I think, eg to ensure visibility
          across threads of the changes to sis.userData?

          Hmm ... I think there's a thread hazard here, during commit; I think
          if pendingCommit is not null you should also call
          pendingCommit.setUserData? Else, a commit can finish and "undo" the
          user's change to the commit data (see finishCommit, where it calls
          .setUserData). Maybe we need a thread safety test
          here ...

          mikemccand Michael McCandless added a comment - I thought we were going to rename ensureOpen's confusing boolean param? IW.setCommitData should be sync'd I think, eg to ensure visibility across threads of the changes to sis.userData? Hmm ... I think there's a thread hazard here, during commit; I think if pendingCommit is not null you should also call pendingCommit.setUserData? Else, a commit can finish and "undo" the user's change to the commit data (see finishCommit, where it calls .setUserData). Maybe we need a thread safety test here ...

          Actually I think we should just remove that .setUserData inside finishCommit?

          Also, should we add an IW.getCommitData?

          mikemccand Michael McCandless added a comment - Actually I think we should just remove that .setUserData inside finishCommit? Also, should we add an IW.getCommitData?
          shaie Shai Erera added a comment -

          I thought we were going to rename ensureOpen's confusing boolean param?

          Right, but for some reason I thought that you're going to do that . I'll do it in the next patch.

          IW.setCommitData should be sync'd I think, eg to ensure visibility across threads of the changes to sis.userData?

          Ok

          Hmm ... I think there's a thread hazard here, during commit

          I think you're right. Not sure how practical, because I believe that usually the commit thread will also be the one that calls setCommitData, but it is possible.
          I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either:

          1. Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member.
          2. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null.

          Option #2 means that there's no API break, no synchronization is needed on setCommitData and practically everything remains the same. We can still remove the redundant .setCommitData in finishCommit regadless.

          should we add an IW.getCommitData?

          I think that that'd be great ! Today the only way to do it is if you refresh a reader (expensive). I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?

          shaie Shai Erera added a comment - I thought we were going to rename ensureOpen's confusing boolean param? Right, but for some reason I thought that you're going to do that . I'll do it in the next patch. IW.setCommitData should be sync'd I think, eg to ensure visibility across threads of the changes to sis.userData? Ok Hmm ... I think there's a thread hazard here, during commit I think you're right. Not sure how practical, because I believe that usually the commit thread will also be the one that calls setCommitData, but it is possible. I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either: Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null. Option #2 means that there's no API break, no synchronization is needed on setCommitData and practically everything remains the same. We can still remove the redundant .setCommitData in finishCommit regadless. should we add an IW.getCommitData? I think that that'd be great ! Today the only way to do it is if you refresh a reader (expensive). I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?

          I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either:

          1. Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member.
          2. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null.

          Hmm, I'd rather not store the member inside IW and inside SIS; just seems safer to have a single clear place where this is tracked.

          Also, I like the new API so I'd rather not do #2?

          I think just removing that line in finishCommit should fix the bug ... but first we need a test exposing it.

          I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?

          Yes.

          mikemccand Michael McCandless added a comment - I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either: Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member. Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null. Hmm, I'd rather not store the member inside IW and inside SIS; just seems safer to have a single clear place where this is tracked. Also, I like the new API so I'd rather not do #2? I think just removing that line in finishCommit should fix the bug ... but first we need a test exposing it. I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos? Yes.
          shaie Shai Erera added a comment -

          I'll make the changes, and also it seems like you were suggesting that earlier – allow setCommitData to affect the pendingCommit too. I think that's valuable because you can e.g. call prerCommit() -> setCommitData() -> commit() – the setCD() in the middle lets you create a commitData that will pertain to the state of the index after the commit.

          I'll make all the changes and post a new patch, probably tomorrow.

          shaie Shai Erera added a comment - I'll make the changes, and also it seems like you were suggesting that earlier – allow setCommitData to affect the pendingCommit too. I think that's valuable because you can e.g. call prerCommit() -> setCommitData() -> commit() – the setCD() in the middle lets you create a commitData that will pertain to the state of the index after the commit. I'll make all the changes and post a new patch, probably tomorrow.

          I'll make a test exposing the bug ...

          mikemccand Michael McCandless added a comment - I'll make a test exposing the bug ...

          Simple test showing that commit data is lost ... I didn't need to use threads; just call .setCommitData after prepareCommit and before commit.

          mikemccand Michael McCandless added a comment - Simple test showing that commit data is lost ... I didn't need to use threads; just call .setCommitData after prepareCommit and before commit.
          shaie Shai Erera added a comment -

          The test isn't exactly accurate, because it tests a scenario that is currently not supported. I.e., after calling prepareCommit(), nothing that you do on IW will be committed. Rather, to expose the bug it should be modified as follows:

          iw.setCommitData(data1);
          iw.prepareCommit();
          iw.setCommitData(data2); // that will be ignored by follow-on commit
          iw.commit();
          checkCommitData(); // will see data1
          iw.commit(); // that 'should' commit data2
          checkCommitData(); // that will see data1 again, because of the copy that happens in finishCommit()
          

          I'll modify the test like so and include it in my next patch.

          shaie Shai Erera added a comment - The test isn't exactly accurate, because it tests a scenario that is currently not supported. I.e., after calling prepareCommit(), nothing that you do on IW will be committed. Rather, to expose the bug it should be modified as follows: iw.setCommitData(data1); iw.prepareCommit(); iw.setCommitData(data2); // that will be ignored by follow-on commit iw.commit(); checkCommitData(); // will see data1 iw.commit(); // that 'should' commit data2 checkCommitData(); // that will see data1 again, because of the copy that happens in finishCommit() I'll modify the test like so and include it in my next patch.
          shaie Shai Erera added a comment -

          Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit(). Following commit() merely calls infos.finishCommit() which writes the checksum and closes the output.

          Can we modify segmentInfos.write() to not write the commitData, but move it to finishCommit()? Not sure that I like this approach, because it means that finishCommit() will do slightly more work, which increases the chance of getting an IOException during commit() after prepareCommit() successfully returned, but on the other hand it's the gains might be worth it? Being able to write commitData after you know all your document additions/deletions/updates are 'safe' might prove valuable. And finishCommit() already does I/O, writing checksum ...

          What do you think?

          shaie Shai Erera added a comment - Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit(). Following commit() merely calls infos.finishCommit() which writes the checksum and closes the output. Can we modify segmentInfos.write() to not write the commitData, but move it to finishCommit()? Not sure that I like this approach, because it means that finishCommit() will do slightly more work, which increases the chance of getting an IOException during commit() after prepareCommit() successfully returned, but on the other hand it's the gains might be worth it? Being able to write commitData after you know all your document additions/deletions/updates are 'safe' might prove valuable. And finishCommit() already does I/O, writing checksum ... What do you think?

          Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit().

          Oh yeah ... I forgot about that

          Hmm ... I don't think we should move writing the commit data to finishCommit? Is it really so hard for the app to provide the commit data before calling prepareCommit?

          mikemccand Michael McCandless added a comment - Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit(). Oh yeah ... I forgot about that Hmm ... I don't think we should move writing the commit data to finishCommit? Is it really so hard for the app to provide the commit data before calling prepareCommit?
          shaie Shai Erera added a comment -

          I don't think that we should add more work to finishCommit() either. Being able to setCommitData after prep() is just a bonus. It didn't work before, and it will continue to not work now. And I can't think of a good usecase for why an app would not be able to set commitData prior to prep(). If it comes up, we can discuss a solution again. At least we know that moving commitData write to finishCommit will solve it.

          I'll make sure the test exposes the bug you reported in IW.finishCommit().

          shaie Shai Erera added a comment - I don't think that we should add more work to finishCommit() either. Being able to setCommitData after prep() is just a bonus. It didn't work before, and it will continue to not work now. And I can't think of a good usecase for why an app would not be able to set commitData prior to prep(). If it comes up, we can discuss a solution again. At least we know that moving commitData write to finishCommit will solve it. I'll make sure the test exposes the bug you reported in IW.finishCommit().
          shaie Shai Erera added a comment -

          Patch addresses the bug that Mike reported and adds a test for it. Also adds IW.getCommitData().

          shaie Shai Erera added a comment - Patch addresses the bug that Mike reported and adds a test for it. Also adds IW.getCommitData().

          +1, looks great. Thanks Shai!

          mikemccand Michael McCandless added a comment - +1, looks great. Thanks Shai!
          commit-tag-bot Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1416361

          LUCENE-4575: add IndexWriter.setCommitData

          commit-tag-bot Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1416361 LUCENE-4575 : add IndexWriter.setCommitData
          shaie Shai Erera added a comment -

          Committed to trunk and 4x. Thanks Mike !

          shaie Shai Erera added a comment - Committed to trunk and 4x. Thanks Mike !
          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1416367

          LUCENE-4575: add IndexWriter.setCommitData

          commit-tag-bot Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1416367 LUCENE-4575 : add IndexWriter.setCommitData
          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1416367

          LUCENE-4575: add IndexWriter.setCommitData

          commit-tag-bot Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1416367 LUCENE-4575 : add IndexWriter.setCommitData
          tomoko Tomoko Uchida added a comment -

          This issue was moved to GitHub issue: #5641.

          tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #5641 .

          People

            shaie Shai Erera
            shaie Shai Erera
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: