Details
-
Improvement
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
None
-
None
-
New, Patch Available
Description
Spinoff from here http://lucene.472066.n3.nabble.com/commit-with-only-commitData-td4022155.html.
In some cases, it is valuable to be able to commit changes to the index, even if the changes are just commitData. Such data is sometimes used by applications to register in the index some global application information/state.
The proposal is:
- Add a setCommitData() API and separate it from commit() and prepareCommit() (simplify their API)
- When that API is called, flip on the dirty/changes bit, so that this gets committed even if no other changes were made to the index.
I will work on a patch a post.
Attachments
Attachments
- LUCENE-4575.patch
- 6 kB
- Shai Erera
- LUCENE-4575.patch
- 27 kB
- Shai Erera
- LUCENE-4575-testcase.patch
- 2 kB
- Michael McCandless
- LUCENE-4575.patch
- 31 kB
- Shai Erera
Activity
Thanks. I forgot to mention two things about the changes in the patch, which I wasn't sure about:
- I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections?
- I set pass the copied map directly to segmentInfos, rather than saving it in a member in IW. Do you see any issues with it? (I'm thinking about rollback, even though we have another copy of the segmentInfos for rollback purposes ...)
I currently copy the commitData map on setCommitData. It seems safe to do it, and I don't think commitData are huge. Any objections?
Do any users care about order (i.e. they pass in a LinkedHashMap)? If would be trivial to preserve if it added value for some.
We use commitData extensively but we don't care about the order. We store key/value pairs.
I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap). Therefore, if we want to preserve the type of the Map, we'd need to change DataInput/Output code. I'm not sure it's worth the hassle, but let's discuss that anyway on a separate issue? It's not really related to how the map is set.
The API returns Map<String,String>, so we make no garanties about order.
I don't think though that it's trivial to support. Currently the user can pass any Map, but IndexReader returns in practice a HashMap (DataInput.readStringStringMap initializes a HashMap).
If a user cared about order, then they would pass a LinkedHashMap. Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap.
it's not really related to how the map is set.
It is... if you make a copy of the map and we want to preserve order, it's new LinkedHashMap instead of HashMap.
It's a minor enough point I don't think it does deserve it's own issue. I don't personally care about preserving order - but I did think it was worth at least bringing up.
Then the only thing that would need to change is DataInput.readStringStringMap: s/HashMap/LinkedHashMap.
So you propose that the code will always initialize LHM in DataInput, that way preserving order whether required or not? Yes, I guess that we can do that. But I wonder if we should? We didn't so far, and nobody complained. And since it's an internal change, we can always make that change in the future if somebody asks?
Patch removes commit/prepCommit(commitData) versions and modifies code that used it (mostly tests code) to call setCommit beforehand.
I also added a CHANGES entry under back-compat section.
All tests pass. I'll wait for some review before I commit it.
I thought we were going to rename ensureOpen's confusing boolean param?
IW.setCommitData should be sync'd I think, eg to ensure visibility
across threads of the changes to sis.userData?
Hmm ... I think there's a thread hazard here, during commit; I think
if pendingCommit is not null you should also call
pendingCommit.setUserData? Else, a commit can finish and "undo" the
user's change to the commit data (see finishCommit, where it calls
.setUserData). Maybe we need a thread safety test
here ...
Actually I think we should just remove that .setUserData inside finishCommit?
Also, should we add an IW.getCommitData?
I thought we were going to rename ensureOpen's confusing boolean param?
Right, but for some reason I thought that you're going to do that . I'll do it in the next patch.
IW.setCommitData should be sync'd I think, eg to ensure visibility across threads of the changes to sis.userData?
Ok
Hmm ... I think there's a thread hazard here, during commit
I think you're right. Not sure how practical, because I believe that usually the commit thread will also be the one that calls setCommitData, but it is possible.
I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either:
- Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member.
- Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null.
Option #2 means that there's no API break, no synchronization is needed on setCommitData and practically everything remains the same. We can still remove the redundant .setCommitData in finishCommit regadless.
should we add an IW.getCommitData?
I think that that'd be great ! Today the only way to do it is if you refresh a reader (expensive). I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?
I agree that calling that setCommitData in finishCommit is redundant, but perhaps we can solve it more elegantly by either:
- Not storing the setCommitData in infos, but rather in a private IW member. Then in startCommit set it on the cloned infos. It's essentially how it's done today, only now the commit data will be copied from a member.
- Stick w/ current API commit(commitData) and prepareCommit(commitData), and just make sure that commit goes through even if changeCount == previousChangeCount, but commitUserData != null.
Hmm, I'd rather not store the member inside IW and inside SIS; just seems safer to have a single clear place where this is tracked.
Also, I like the new API so I'd rather not do #2?
I think just removing that line in finishCommit should fix the bug ... but first we need a test exposing it.
I think that the code in finishCommit ensures that we can always pull the commitData from segmentInfos?
Yes.
I'll make the changes, and also it seems like you were suggesting that earlier – allow setCommitData to affect the pendingCommit too. I think that's valuable because you can e.g. call prerCommit() -> setCommitData() -> commit() – the setCD() in the middle lets you create a commitData that will pertain to the state of the index after the commit.
I'll make all the changes and post a new patch, probably tomorrow.
Simple test showing that commit data is lost ... I didn't need to use threads; just call .setCommitData after prepareCommit and before commit.
The test isn't exactly accurate, because it tests a scenario that is currently not supported. I.e., after calling prepareCommit(), nothing that you do on IW will be committed. Rather, to expose the bug it should be modified as follows:
iw.setCommitData(data1); iw.prepareCommit(); iw.setCommitData(data2); // that will be ignored by follow-on commit iw.commit(); checkCommitData(); // will see data1 iw.commit(); // that 'should' commit data2 checkCommitData(); // that will see data1 again, because of the copy that happens in finishCommit()
I'll modify the test like so and include it in my next patch.
Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit(). Following commit() merely calls infos.finishCommit() which writes the checksum and closes the output.
Can we modify segmentInfos.write() to not write the commitData, but move it to finishCommit()? Not sure that I like this approach, because it means that finishCommit() will do slightly more work, which increases the chance of getting an IOException during commit() after prepareCommit() successfully returned, but on the other hand it's the gains might be worth it? Being able to write commitData after you know all your document additions/deletions/updates are 'safe' might prove valuable. And finishCommit() already does I/O, writing checksum ...
What do you think?
Hmmm ... setting the commitData on pendingCommit cannot work, b/c the commitData is written to segnOutput on prepareCommit().
Oh yeah ... I forgot about that
Hmm ... I don't think we should move writing the commit data to finishCommit? Is it really so hard for the app to provide the commit data before calling prepareCommit?
I don't think that we should add more work to finishCommit() either. Being able to setCommitData after prep() is just a bonus. It didn't work before, and it will continue to not work now. And I can't think of a good usecase for why an app would not be able to set commitData prior to prep(). If it comes up, we can discuss a solution again. At least we know that moving commitData write to finishCommit will solve it.
I'll make sure the test exposes the bug you reported in IW.finishCommit().
Patch addresses the bug that Mike reported and adds a test for it. Also adds IW.getCommitData().
[trunk commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1416361
LUCENE-4575: add IndexWriter.setCommitData
[branch_4x commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1416367
LUCENE-4575: add IndexWriter.setCommitData
[branch_4x commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1416367
LUCENE-4575: add IndexWriter.setCommitData
Patch adds setCommitData to IndexWriter and increase changeCount as well as sets that commitData on segmentInfos. It also adds a test to verify the behavior.
Regarding back-compat - I prefer to nuke commit(data) and prepcommit(data), in exchange for this API, for both trunk and 4.x.
This patch however supports the old commit/prepcommit(data) API, but I think that it will be simpler if we just nuke these API. The migration to the new API is a no-brainer, just call setCommitData before your commit().
I don't intend to commit it yet, depending on how we decide to handle back-compat. If we decide to keep the back-compat support, I want to move the commit(data) and prepCommit(data) impls to their respective no-data versions, and then have these API deprecated and call setCommitData() followed by the respective no-data version.