|
[
Permlink
| « Hide
]
Otis Gospodnetic added a comment - 11/Dec/06 10:47 PM
In a direct email to me, Robert said: "All of the files can be prepended with the ASL."
Otis Gospodnetic made changes - 11/Dec/06 10:47 PM
A generic version probably needs to implement reference counting on the Segments or IndexReader in order to know when they can be safely closed.
Michael Busch made changes - 18/Jul/07 11:12 PM
i somehow missed seeing this issues before ... i don't really understand the details, but a few comments that come to mind...
1) this approach seems to assume that when reopening a MyMultiReader, the sub readers will all be MySegmentReaders .. assuming we generalize this to MultiReader/SegmentTeader, this wouldn't work in the case were people are using a MultiReader containing other MultiReaders ... not to mention the possibility of people who have written their own IndexReader implementations. 2) there is no more commit lock correct? ... is this approach something that can still be valid using the current backoff/retry mechanism involved with opening segments?
Hoss Man made changes - 19/Jul/07 12:56 AM
> i somehow missed seeing this issues before ...
actually, me too... first I came across this thread: http://www.gossamer-threads.com/lists/lucene/java-dev/31592?search_string=refresh;#31592 in which Doug suggested adding a static method IndexReader.open(IndexReader) I started implementing this, using Dougs and Roberts ideas and then realized > in generally we should probably try to approach reopening a reader as a Yeah we could do that. However, this might not be so easy to implement. Also the recursive walk would have to take place within the FindSegmentsFile I decided therefore to only allow IndexReaders to be refreshed if they were First version of my patch:
This first version is for review and not ready to All tests pass.
Michael Busch made changes - 20/Jul/07 07:13 AM
> Yeah we could do that. However, this might not be so easy to implement. ...this being the curse that is MultiReader – it can serve two very differenet purposes. You seem to have already solved the multisegments in a single directory approach, the MultiReader over many subreader part actually seems much easier to me (just call your open method on all of the subreaders) the only tricky part is detecting which behavior should be used when. This could be driven by a simple boolean property of MultiReader indicating whether it owns it's directory and we need to look for new segments or not – in which case we just need to refresh the subreaders. (My personal preference would be to change MultiReader so "this.directory" is null if it was open over several other subReaders, right now it's just assigned to the first one arbitrarily, but there may be other consequences of changing that) Incidentally: I don't think it's crucial that this be done as a recursive method. the same approach i describe could be added to static utility like what you've got, I just think that if it's possible to do it recursively we should so that if someone does write their own MultiReader or SegmentReader subclass they can still benefit from any core reopening logic as long as theey do their part to "reopen" their extensions. an extremely hackish refactoring of the previous patch that demonstrates the method working recursively and dealing with MultiReaders constructed over multiple subReaders.
a few notes: 1) no where near enough tests of the subReader situation
Hoss Man made changes - 21/Jul/07 01:46 AM
Michael Busch made changes - 27/Jul/07 12:57 AM
Now, after
patch here, which was now easier because MultiReader is now separated into two classes. Notes:
I think the general contract of reopen() should be to always return a new IndexReader
Michael Busch made changes - 01/Aug/07 09:41 PM
I ran some quick performance tests with this patch:
1) The test opens an IndexReader, deletes one document by random docid, closes the Reader. I run these two steps 5000 times in a loop. First run: Index size: 4.5M, optimized
Second run: Index size: 3.3M, 24 segments (14x 230.000, 10x 10.000)
> I think closing the old reader is fine. What do others think? Is keeping the old
> reader after a reopen() a useful usecase? In a multi-threaded environment, one wants to open a new reader, but needs to wait until all requests finish before closing the old reader. Seems like reference counting is the only way to handle that case. (note: i haven't looked at the latest patch in detail, just commenting on the comments) One key problem i see with automatically closing things in reopen is MultiReader: it's perfectly legal to do something like this psuedocode... IndexReader a, b, c = ... one possibility would be for the semantics of reopen to close old readers only if it completely owns them; ie: MultiReader should never close anything in reopen, MultiSegmentReader should close all of the subreaders since it opened them in the first place ... things get into a grey area with SegementReader though. In general i think the safest thing to do is for reopen to never close. Yonik's comment showcases one of the most compelling reasons why it can be important for clients to be able to keep using an old IndexReader instance, and it's easy enough for clients that want the old one closed to do something like... IndexReader r = ... (one question that did jump out at me while greping the patch for the where old readers were being closed: why is the meat of reopen still in "IndexReader" with a "if (reader instanceof SegmentReader)" style logic in it? can't the different reopen mechanisms be refactored down into SegmentReader and MultiSegmentReader respectively? shouldn't the default impl of IndexReader throw an UnsuppportedOperationException?) > IndexReader a, b, c = ...
> MultiReader ab = new MultiReader({a, b}) > MultiReader bc = new MultiReader({b, c}) > ...b changes on disk... > ab.reopen(); // this shouldn't affect bc; > > one possibility would be for the semantics of reopen to close old readers only > if it completely owns them; ie: MultiReader should never close anything in > reopen, MultiSegmentReader should close all of the subreaders since it opened > them in the first place So if 'b' in your example is a MultiSegmentReader, then the reopen() call > IndexReader r = ... ... is actually easy enough as you pointed out, so that the extra complexity is not > In general i think the safest thing to do is for reopen to never close. So yes, I agree. > why is the meat of reopen still in "IndexReader" with a "if (reader instanceof I'm not sure if the code would become cleaner if we did that. Sometimes a > I'm not sure if the code would become cleaner if we did that. Sometimes a SegmentReader would then have to i don't hink there would be anything wrong with SegmentReader.reopen returning a MultiSegmentReader in some cases (or vice versa) but it definitely seems wrong to me for a parent class to be explicitly casing "this" to one of two know subclasses ... making reopen abstract in the base class (or throw UnsupportOp if for API back compatibility) seems like the only safe way to ensure any future IndexReader subclasses work properly.
Michael Busch made changes - 22/Aug/07 11:21 PM
We should first refactor segmentInfos into IndexReader's subclasses.
I'm attaching a new version of the patch that has a lot of changes compared to the last patch:
to IndexReader which returns false by default. IndexReader.clone() checks if the actual implementation supports clone() (i. e. the above method returns true). If not, it throws an UnsupportedOperationException, if yes, it returns super.clone(). I was not sure about whether to throw an (unchecked) UnsupportedOperationException or a CloneNotSupportedException in this case. I decided to throw UnsupportedOperationException even though it's not really following the clone() guidelines, because it avoids the necessity to catch the CloneNotSupportedException every time clone() is called (in the reopen() methods of all IndexReader implementations). As an example for how the clone() method is used let me describe how MultiReader.reopen() works: it tries to reopen every of its subreaders. If at least one subreader could be reopened successfully, then a new MultiReader instance is created and the reopened subreaders are added to it. Every of the old MultiReader's subreaders, that was not reopened (because of no index changes) is now cloned() and added to the new MultiReader.
As the javadoc says it has two benefits: 1) it speeds up reopening and reduces ressources, and 2) it allows to reopen readers, that use non-cloneable subreaders. Please let me know what you think about these changes, especially about the clone() implementation.
Michael Busch made changes - 02/Oct/07 06:24 AM
I ran new performance tests with the latest patch similar to the tests I explained in an earlier comment on this issue.
I'm using again a 4.5M index. In each round I delete one document and (re)open the IndexReader thereafter. Here are the numbers for 5000 rounds:
Now in each round I delete on document and also set the norm for one random document. Numbers for 1000 rounds:
I think these numbers look pretty good. We get a quite decent speedup even if the old readers are not closed. I would like to commit this in a couple of days to get ready for Lucene 2.3. It would be nice if someone could review the latest patch! Hoss? Others? I think this looks pretty good Michael!
Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers. > Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers.
Yeah, the cloning part was kind of tedious. Read-only readers would indeed make our life much easier here. I'm wondering how many people are using the IndexReader to alter the norms anyway? Well, thanks for reviewing the patch, Yonik! Nice to see all the good work on this. We are still on a 1.9 derivative.
Hopefully we'll be able to move to stock 2.X release in the future. I haven't looked at the patch yet (i really really plan to this weekend, baring catastrophe) but i'm confused as to why you went the cloning route (along with the complexity of the api changes to indicate when it is/isn't supported) ... based on the comments, it seems to boil down to...
> As an example for how the clone() method is used let me describe how MultiReader.reopen() that seems like circular logic: the clone method is used so that the sub readers can be cloned why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it? And if (for reasons that aren't clear to me) it is important for MultiReader to use a clone of it's subreaders when their reopen method returns "this" then shouldn't clients do the same thing? ... why not make reopen always return this.clone() if the index hasn't changed (which now that i think about it, would also help by punting on the isCloneSupported issue – each class would already know if it was clonable. maybe this will make more sense once i read the patch ... i just wanted to through it out there in case someone had a chance to reply before i get a chance. > why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?
Let's say you have a MultiReader with two subreaders: IndexReader ir1 = IndexReader.open(index1); IndexReader ir2 = IndexReader.open(index2); IndexReader mr = new MultiReader(new IndexReader[] {ir1, ir2}); Now index1 changes and you reopen the MultiReader and keep the old one open: IndexReader mr2 = mr.reopen(); ir1 would now be reopened and let's assume we wouldn't clone ir2. If you use mr2 now to e.g. delete a doc and that doc happens to be in index2, then mr1 would also see the changes because both MultiReaders share the same subreader ir2 and are thus not independent from each other. > why not make reopen always return this.clone() clone() might be an expensive operation. We only need to clone if at least one of the subreaders has changed. > > Too bad so much needs to be cloned in the case that
> > closeOldReader==false... maybe someday in the future we can have > > read-only readers. > > Yeah, the cloning part was kind of tedious. Read-only readers would > indeed make our life much easier here. I'm wondering how many people > are using the IndexReader to alter the norms anyway? I think the closeOldReader=false case is actually quite important. Because in production, I think you'd have to use that, so that your Since fully warming could take a long time (minutes?) you need that Can we take a copy-on-write approach? EG, this is how OS's handle the This would mean that "read-only" uses of the cloned reader never Actually if we went back to the sharing (not cloning) approach, could In Michael's example above, on calling mr2.deleteDoc, you would hit an I think this would let us have our cake and eat it too: re-opening Would this work? A couple other questions/points:
This might be a good place to add an assert: if you are not
Okay, read the patch. I'm on board with the need for Clonable now ... it's all about isolation. if "r.reopen(false) == r" then the client is responsible for recognizing that it's got the same instance and can make the judgement call about reusing the instance or cloning depending on it's needs ... internally in things like MultiReader we have to assume we need a clone for isolation.
Questions and comments... 1. CompoundFileReader, FieldsReader, IndexReader, and BitVector all have clone methods where they silently catch and ignore CloneNotSupportedExceptions from super.clone()... if we don't expect the exception to ever happen, we should just let the exception propogate up the stack (there is no down side to declaring that clone() throws CloneNotSupportedException). If we think the exception might happen, but it's not a big deal if it does and we can ignore it, then we should put a comment in the catch block to that effect. i'm not clear which are the cases here. 2. i don't remember if the old patches had this issue as well, but having "return new FilterIndexReader(reopened);" in FilterIndexReader doesn't really help anyone using FilterIndexReader – they're going to want an instance of their own subclass. to meet the contract of FilterIndexReader, we should implement all "abstract" methods and delegate to the inner reader - so in theory do don't need a new instance of FIlterIndexReader, we could just return in.reopen(closeold) ... my gut says it would be better to leave the method out entirely so that the default impl which throws UnSupOpEx is used — that way subclasses who want to use reopen must define their own (instead of getting confused when their filtering logic stops working after they reopen for the first time) 3. instead of having an isClonable() method, why don't we just remove the "implements Clonable" declaration from IndexReader and put it on the concrete subclasses – then use "instanceof Cloneable" to determine if something is clonable? ... for that matter, the only place isCloneSupported is really used is in IndexReader.clone where an exception is thrown if Clone is not supported ... subclasses which know they are Clonable don't need this from the base class, subclasses which don't implement Clonable but are used in combination with core classes whose reopen method requires it could be skiped by the caller (since they don't implement Clonable) .. 4. javadocs frequently refer to "reopened successfully" and "refreshed successfully" when what it really means is "reopen() returned a newer/fresher reader" ... this may confuse people who are use to thinking of "successfully" a "without error" 5. random thought: are their any synchronization issues we're missing here? I'm wondering about the case where once thread calls reopen while another thread is updating norms or deleting docs. is there any chance for inconsistent state? a rough variation on Michael's latest patch that makes the changes along two of the lines i was suggesting before reagrding FilterIndexReader and ising "instanceof Cloneable" instead of "isCloneSupported()" two important things to note:
1) i didn't change any of the documentation, i was trying to change as little aspossibel so the two patches could be compared side-by-side 2) this patch is currently broken. TestIndexReaderReopen gets an exception i can't understand ... but i have to stop now, and i wanted to post what i had in case people had comments. ...now that i've done this exercise, i'm not convinced that it's a better way to go, but it does seem like isCloneSupported isn't neccessary, that's the whole point of the Cloneable interface.
Hoss Man made changes - 07/Oct/07 07:57 PM
> I'm wondering about the case where once thread calls reopen while another thread is updating norms or deleting docs.
Hmmm there is cause for concern (and I should have had my mt-safe hat on
Thanks all for the reviews and comments!
There seem to be some issues/open questions concerning the cloning. > Actually if we went back to the sharing (not cloning) approach, could Interesting, yes that should work in case we have two readers (the > > Actually if we went back to the sharing (not cloning) approach,
> > could we insert a check for any writing operation against the > > re-opened reader that throws an exception if the original reader > > is not yet closed? > > Interesting, yes that should work in case we have two readers (the > original one and the re-opened one). But what happens if the user > calls reopen twice to get two re-opened instances back? Then there > would be three instances, and without cloning the two re-opened ones > would also share the same resources. Is this a desirable use case or > would it be okay to restrict reopen() so that it can only create one > new instance? Hmmm good point. Actually, we could allow more then one re-open call if we take the Then, any reader should refuse to do a writing operation if its This way if you have a reader X and you did reopen to get Y and did BTW this would also allow for very efficient "snapshots" during > This way if you have a reader X and you did reopen to get Y and did
> reopen again to get Z then the shared sub-readers between X, Y and Z > would not allow any write operations until 2 of the three had been > closed. I think that would work? Yes I think it would work. However, this approach has two downside IMO:
Of course the cloning approach has disadvantages too. For custom So I'm really not sure which way the better one is. I think I'm slightly > We have to be very careful about cross-referencing multiple readers. One correction: there should be no cross-referencing, right? The only I think the downside of this proposed "shared readers" (non-cloning) I think the big upside of "shared readers" is reopening becomes quite in deciding "to clone or not clone" for the purposes of implementing reopen, it may make sense to step back and ask a two broader questions... 1) what was the motivation for approaching reopen using clones in the first place the answer to the first question, as i understand it, relates to the issue of indexreaders not being "read only" object ... operations can be taken which modify the readers state, and those operations can be flushed to disk when the reader is closed. so readerB = readerA.reopen(closeOld=false) and then readerA.delete(...) is called, there is ambiguity as to what should happen in readerB. so the clone route seems pretty straight forward if and only if we have an unambiguous concept of cloning a reader (because then the clone approach to reopen becomes unambiguous as well). alternately, since reopen is a new method, we can resolve the ambiguity anyway we want, as long as it's documented ... in theory we should pick a solution that seems to serve the most benefit ... perhaps that solution is to document reopen with "if reopen(closeOld=false) returns a new reader it will share state with the current reader, attempting to do the following operations on this new reader will result in undefined behavior" the answer the the second is only important if we want to go the cloning route ... but it is pretty important in a larger sense then just reopening ... f we start to say that any of the IndexReader classes are Clonable we have to be very clear about what that means in all cases where someone might clone it in addition to reopen ... in particular, i worry about what it means to clone a reader which has already had "write" operations performed on it (something the clone based version of reopen will never do because a write operation indicates the reader must be current), but some client code might as soon as we add the Clonable interface to a class. > in deciding "to clone or not clone" for the purposes of implementing
> reopen, it may make sense to step back and ask a two broader questions... I agree! > 1) what was the motivation for approaching reopen using clones in the Good summarization! You are right, I started the clone approach because > "if reopen(closeOld=false) returns a new reader it will share state with This would mean, that we simply warn the user that performing write Hmm, so what are our long-term plans for indexreaders? If our goal is to If everybody likes this approach I'll go ahead and submit a new patch.
I think that's a fine direction. Note however that IndexWriter implements delete by calling IndexReader.delete(). That method could be made package-private, so that users cannot call it, but then this makes it impossible for someone to subclass IndexReader from a different package. So perhaps delete() needs to move to a subclass of IndexReader? That gets messy...
Actually all of the lock/commit logic moved from IndexReader to DirectoryIndexReader already, and the delete logic is in SegmentReader, which subclasses DirectoryIndexReader. So we could remove the deleteDocument() API from IndexReader but leave it in DirectoryIndexReader. Then it would still be possible to extend IndexReader from a different package just as today, and IndexWriter could use DirectoryIndexReader for performing deletes. These changes should be trivial. Got it. IndexWriter only works with SegmentReaders anyway, not with an arbitrary IndexReader implementation: IndexReader is extensible, but IndexWriter is not. I'd (momentarily) forgotten that. Nevermind.
Having a read-only IndexReader would (should?) mean being able to remove "synchronized" from some things like isDeleted()... a nice performance win for multi-processor systems for things that didn't have access to the deleted-docs bitvec.
> If our goal is to make them read-only (we can delete via IndexWriter already) But you can only delete-by-term.
Or we could take the approach you suggested in http://www.gossamer-threads.com/lists/lucene/java-dev/52017 That would mean to add a callback after flush to get a current IndexReader to get the docids and to use the IndexWriter then to perform deleteDocument(docId) or setNorm(). These methods could also take an IndexReader as argument, e. g. deleteDocument(IndexReader reader, int docId), which would throw an IOException if the passed in reader is stale (i. e. docids have changed since the reader was opened). Just as IndexReader does it today. Does this make sense or am I missing something?
Michael Busch made changes - 17/Oct/07 06:32 PM
I just opened
As for reopen() I'd like to go with Hoss' suggestion for now and add warning comments to reopen() saying that using an re-opened IndexReader with closeOldReader==false for write operations will result in an undefined behavior. Any objections?
How about just defining the behavior such that any pending changes are flushed. That would make it more useful because you could then reopen readers you used for deletes. An alternative would be a method to explicitly flush changes on a reader, giving one the ability to then reopen it, but I like the former better since it avoids adding another API call.
Hmm, I'm not sure I understand. A reader which is being used for deletes or setting norms is always current (it owns the write lock), so there should never be the need to re-open such a reader. However, if you re-open an existing reader which was not used for deletes before and use the new instance (b) to perform deletes, it will result in a undefined behavior for the old reader (a): IndexReader a = ..... .... IndexReader b = a.reopen(); b.deleteDocument(...);
I was thinking about the "add some docs" then "delete some docs" scenario:
Ah, thanks for that clarification. I guess that should remain undefined. Hmm one other thing: how should IndexReader.close() work? If we re-open a reader (a is the old reader, b is the new one), and then someone calls a.close(), then a should not close any resources that it shares with b.
One way to make this work would be reference counting. Since we want to avoid that, we could simply restrict reopen() from being called twice for the same reader. Then there would never be more than 2 readers sharing the same ressources. The old reader a would remember that reopen() was called already and would not close the shared ressources on close(). However, the new reader b would close all ressources, meaning that reader a would not work anymore after b.close() was called. I think reference counting would solve this issue quite nicely. How come we want to avoid reference counting? It seems like the right solution here.
The implementation seems simple. When a reader is opened, it starts with RC 1. When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC). When a MultiReader needs to use the reader, it calls incref. And when that MultiReader is done with it, it calls decref. Whenever the RC hits 0 it's safe to free all resources. Wouldn't that work? > When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC)
But if a reader is shared, how do you tell two real closes from an erronous double-close?
You're right, for a MultiReader and ParallelReader this would work and wouldn't be hard to implement. It is quite different when it comes to SegmentReaders. SegmentReader.reopen() checks SegmentInfos if there is a segment with the same name but newer normGen or delGen. If there is one then a new SegmentReader instance is created that reuses resources like e. g. TermInfosReader and loads the new generation of the del or norm file. Now you can end up having a bunch of SegmentReaders that share the same resources but don't know about each other. The reference counting would have to happen somewhere else, e. g. in the TermInfosReader. Of course this is doable, but it's a special case and more complicated compared to the "restrict reopen() to only be called once"-approach. > But if a reader is shared, how do you tell two real closes from an erronous double-close?
> Perhaps have a top level close() that is only invoked once via isClosed, and a projected doClose() that a multi-reader would use that actually does the decref? Yes I think that's the right approach. > It is quite different when it comes to SegmentReaders. SegmentReader.reopen() checks SegmentInfos if there is a segment with the same name but newer normGen or delGen. If there is one then a new SegmentReader instance is created that reuses resources like e. g. TermInfosReader and loads the new generation of the del or norm file. For SegmentReader, I think on reopen(), everything would be shared We wouldn't need to push reference counting into all the readers that Hi Mike,
I'm not sure if I fully understand your comment. Consider the following (quite constructed) example: IndexReader reader1 = IndexReader.open(index1); // optimized index, reader1 is a SegmentReader IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)}); ... // modify index2 IndexReader multiReader2 = multiReader1.reopen(); // only index2 changed, so multiReader2 uses reader1 and has to increment the refcount of reader1 ... // modify index1 IndexReader reader2 = reader1.reopen(); // reader2 is a new instance of SegmentReader that shares resources with reader1 ... // modify index1 IndexReader reader3 = reader2.reopen(); // reader3 is a new instance of SegmentReader that shares resources with reader1 and reader2 Now the user closes the readers in this order:
reader1 should be marked as closed after 2., right? Because multiReader1.close() and multiReader2.close() have to decrement reader1's refcount. But the underlying files have to remain open until after 4., because reader2 and reader3 use reader1's resources. So don't we need 2 refcount values in reader1? One that tells us when the reader itself can be marked as closed, and one that tells when the resources can be closed? Then multiReader1 and multiReader2 would decrement the first refCount, whereas reader2 and reader3 both have to "know" reader1, so that they can decrement the second refcount. I hope I'm just completely confused now and someone tells me that the whole thing is much simpler It's not nearly this complex (we don't need two ref counts). If we
follow the simple rule that "every time reader X wants to use reader Y, it increfs it" and "whenver reader X is done using reader Y, it decrefs it", all should work correctly. Also we should think of "close()" as the way that the external user Let's walk through your example... I'm assuming in your example you meant for reader2 and reader3 to also Also in your example let's insert missing "reader1.close()" as the When reader1 is created it has RC 1. When multiReader1 is created, reader1 now has RC 2. When multiReader2 is created, reader1 now has RC 3. When reader2 is created (by reader1.reopen()), it incref's reader1 When reader3 was created (by reader2.reopen()), it incref's reader2 Now, we close. After reader1.close() is called, reader1 sets isOpen=false (to prevent With multiReader1.close(), multiReader1 is not at RC 0, and so it With multiReader2.close(), likewise it is now at RC 0 and so it With reader2.close(), it decrefs its own RC, however that brings its Finally, with reader3.close(), it is now at RC 0 and so it decrefs the
Yes that's what I meant. Sorry, I didn't make that clear.
Doesn't what you describe change the semantics of MultiReader.close()? If you do: IndexReader reader1 = IndexReader.open(index1); IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)}); multiReader1.close(); then today multiReader1.close() also closes reader1. That's why I consciously omitted reader1.close(). Consequently, if you do IndexReader reader1 = IndexReader.open(index1); IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)}); IndexReader multiReader2 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index3)}); multiReader1.close(); then multiReader2 is not usable anymore, because multiReader1.close() closes reader1. But that can be explicitly avoided by the user because it's known that multiReader1 and multiReader2 share the same reader. Now, with the reopen() code: IndexReader reader1 = IndexReader.open(index1); // optimized index, reader1 is a SegmentReader IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)}); ... // modify index2 IndexReader multiReader2 = multiReader1.reopen(); // only index2 changed, so multiReader2 uses reader1 and has to increment the refcount of reader1 The user gets a new reader instance from multiReader.reopen(), but can't tell which of the subreaders has been reopened and which are shared. That's why multiReader1.close() should not close reader1 in this case and we need refcounting in order to make this work. So do you suggest that a MultiReader should increment the refcounts when it is opened as well (in the constructor)? I believe we can implement it like this, but as I said it changes the semantics of MultiReader.close() (and ParallelReader.close() is, I believe, the same). A user would then have to close subreaders manually.
Ahh, I missed that MultiReader is allowed to close all readers that It's somewhat "aggressive" of MultiReader/ParallelReader to do that? I think we are forced to keep this semantics, for backwards Anyway, it's simple to preserve this semantics with reference With that change, plus the change below, your example works fine.
This is why I don't like the semantics we have today – I don't think
Both of these cases are easy to fix with reference counting: we just
Actually I retract this: it's no longer necessary as long as we change I think this is actually a nice unexpected side-effect of using
Two things:
Hmm, right. MultiReader/ParallelReader must keep track of whether it
I think this is OK?
This was essentially the reason why I suggested to use two refcount values: So I'll go ahead and implement the refcount approach unless anybody objects. Oh and Mike, thanks for bearing with me What about a new constructor for MultiReader/ParallelReader that implements more sensible semantics (increment refcount on readers passed to it, and decrement on close())?
Yeah, when reference counting is implemented then such a constructor should be easy to add.
Thank you for bearing with me!
+1 Ok here is the next one
This patch implements the refCounting as discussed with Mike and Yonik Other changes/improvements/comments:
I also made the changes suggested by Hoss (thanks!):
All unit tests pass.
Michael Busch made changes - 29/Oct/07 08:53 PM
Patch looks great! I'm still working through it but found a few small
issues... It might be good to put a "assert refCount > 0" at various places like I'm seeing a failure in contrib/memory testcase: [junit] *********** FILE=./NOTICE.txt
[junit] Fatal error at query=Apache, file=./NOTICE.txt, anal=org.apache.lucene.analysis.SimpleAnalyzer@341960
[junit] ------------- ---------------- ---------------
[junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
[junit] this IndexReader is closed
[junit] org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
[junit] at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:158)
[junit] at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:632)
[junit] at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
[junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:118)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:97)
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.query(MemoryIndexTest.java:412)
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:313)
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:234)
I think it's because MemoryIndexReader (private class in OK I think this patch is very close! I finished reviewing it –
here's some more feedback:
Thanks Mike! Very good review and feedback!
Agreed. I added those asserts to incRef() and decRef(). I didn't add it
Oups, I fixed this already. I changed the (deprecated) ctr
Yes of course! Changed it.
OK I see. I made this change as well. I also made the change that
Good point. Ok I changed this and also the test cases that check the refCount
Michael Busch made changes - 31/Oct/07 07:02 AM
Looks great! All tests pass for me.
How about if SegmentReader.close() always calls Norm.decRef(), Also make sure that if a SegmentReader is decRef'd to 0 and close was One more comment: I think you can partially share Norm instances? Eg Otherwise no more comments!
Hmm I was thinking about this before (that's actually why I put that
Good idea! Will make the change.
Ugh, you're right. The challenge is sometimes a reference to SR means OK, I like your original proposal: SR overrides incRef() and incrs its Likewise, SR overrides decRef() to decr its RC and RC for each Norm. This way when MR calls SR.incRef/decRef then all Norms and the SR's RC OK, I think it's finally working now!
SegmentReader now overwrites incRef() and increments the readers RC, I also added the method doCloseUnsharedResources() to IndexReader, I also made the change you suggested, Mike, to only reload the I added some new tests to verify the norms ref counting. All unit
Michael Busch made changes - 02/Nov/07 09:53 AM
I just did a quick partial review of SegmentReader for thread safety only and ran across some potential issues
There's probably other stuff, but I stopped looking. Since we are sharing things now, every method that was synchronized is now potentially unsafe. Synchronizing on the object being shared is probably a much better strategy now. This is complex enough that in addition to review, I think we need a good multi-threaded test - 100 or 1000 threads over a ram directory, all changing, querying, retrieving docs, reopening, closing, etc. It also looks like Norm.incRef is used in an unsafe manner (unsynchronized, or synchronized on the reader), and also Norm.decRef() is called inside a synchronized(norms) block, but an individual Norm may be shared across multiple Hashtables, right?
I don't think that norms even needs to be a synchronized Hashtable... it could be changed to a HashMap since it's contents never change, right? OK, reviewed the latest patch:
+1 We should fix all the synchronization issues you've found, create this
Or reverse it... write the test first so we have confidence that it can at least uncover some of these issues.
Excellent, I agree!
OK, let's scratch my "ready to commit" comment A question about thread-safety here. I agree that we must On the other hand: We're saying that performing write So I think the multi-threaded testcase should not Sorry, I hadn't kept up with this issue wrt what was going to be legal (and we should definitely only test what will be legal in the MT test). So that removes the deletedDocs issue I guess.
To find concurrency issues with an unit test is hard to do, because your potential problems lie in the time domain and not in the code domain.
From my experience following things can have impact on the results of such a test:
And be prepared that one time your tests runs through without a problem and on the next run it breaks... Just my € 0.02 Changes in this patch:
Still outstanding:
Michael Busch made changes - 10/Nov/07 06:47 AM
Changes:
The thread-safety test still sometimes fails. The weird
Michael Busch made changes - 12/Nov/07 10:28 AM
I think the cause of the intermittant failure in the test is a missing
try/finally in doReopen to properly close/decRef everything on exception. Because of lockless commits, a commit could be in-process while you > I think the cause of the intermittant failure in the test is a missing
> try/finally in doReopen to properly close/decRef everything on > exception. Awesome! Thanks so much for pointing me there, Mike! I was getting a I should have read the comment in SegmentReader#initialize more } finally { // With lock-less commits, it's entirely possible (and // fine) to hit a FileNotFound exception above. In // this case, we want to explicitly close any subset // of things that were opened so that we don't have to // wait for a GC to do so. if (!success) { doClose(); } } While debugging, it's easy to miss such an exception, because So it seems that this was indeed the cause for the failing test case. OK, all tests pass now, including the thread-safety test.
I ran it several times on different JVMs. Changes:
Michael Busch made changes - 13/Nov/07 01:26 AM
No problem, I lost some hairs tracking that one down too!! OK, latest patch looks good! I love the new threaded unit test. Only two smallish comments:
So how about a public IndexReader.flush() method so that one could also reopen readers that were used for changes?
Usecase: reader.deleteDocument()
Yes, will do!
Hmm, what if then in clone.close() an exception is thrown from Hmm but actually we could change the order in close() so that
Since our goal is it to make IndexReader read-only in the future
+1
I think also if we do decide to do this we should open a new issue for it? > Since our goal is it to make IndexReader read-only in the future
> ( flush() would make reopen() useful in more cases, and > I think also if we do decide to do this we should open a new issue for it? Yes, that's fine.
+1 I'll open a new issue. Changes:
Michael Busch made changes - 14/Nov/07 08:59 PM
Patch looks good. Only thing I found was this leftover
System.out.println, in SegmentReader.java: System.out.println("refCount " + getRefCount()); Thanks for the review, Mike! I'll remove the println.
Ok, I think this patch has been reviewed a bunch of times and Changes:
I'm going to commit this soon!
Michael Busch made changes - 17/Nov/07 08:44 PM
Michael Busch made changes - 17/Nov/07 09:38 PM
Michael Busch made changes - 25/Jan/08 03:23 AM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||