Issue Details (XML | Word | Printable)

Key: LUCENE-743
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael Busch
Reporter: Otis Gospodnetic
Votes: 3
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

IndexReader.reopen()

Created: 11/Dec/06 10:46 PM   Updated: 25/Jan/08 03:23 AM
Return to search
Component/s: Index
Affects Version/s: None
Fix Version/s: 2.3

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File Licensed for inclusion in ASF works IndexReaderUtils.java 2006-12-11 10:47 PM Otis Gospodnetic 5 kB
Text File Licensed for inclusion in ASF works lucene-743-take10.patch 2007-11-17 08:44 PM Michael Busch 80 kB
Text File Licensed for inclusion in ASF works lucene-743-take2.patch 2007-10-02 06:24 AM Michael Busch 52 kB
Text File Licensed for inclusion in ASF works lucene-743-take3.patch 2007-10-29 08:53 PM Michael Busch 54 kB
Text File Licensed for inclusion in ASF works lucene-743-take4.patch 2007-10-31 07:02 AM Michael Busch 59 kB
Text File Licensed for inclusion in ASF works lucene-743-take5.patch 2007-11-02 09:53 AM Michael Busch 62 kB
Text File Licensed for inclusion in ASF works lucene-743-take6.patch 2007-11-10 06:47 AM Michael Busch 76 kB
Text File Licensed for inclusion in ASF works lucene-743-take7.patch 2007-11-12 10:28 AM Michael Busch 76 kB
Text File Licensed for inclusion in ASF works lucene-743-take8.patch 2007-11-13 01:26 AM Michael Busch 80 kB
Text File Licensed for inclusion in ASF works lucene-743-take9.patch 2007-11-14 08:59 PM Michael Busch 79 kB
Text File Licensed for inclusion in ASF works lucene-743.patch 2007-08-01 09:41 PM Michael Busch 28 kB
Text File Licensed for inclusion in ASF works lucene-743.patch 2007-07-21 01:46 AM Hoss Man 24 kB
Text File Licensed for inclusion in ASF works lucene-743.patch 2007-07-20 07:13 AM Michael Busch 21 kB
Java Source File Licensed for inclusion in ASF works MyMultiReader.java 2006-12-11 10:47 PM Otis Gospodnetic 0.6 kB
Java Source File Licensed for inclusion in ASF works MySegmentReader.java 2006-12-11 10:47 PM Otis Gospodnetic 0.7 kB
Text File Licensed for inclusion in ASF works varient-no-isCloneSupported.BROKEN.patch 2007-10-07 07:57 PM Hoss Man 53 kB
Issue Links:
Dependants
 
Reference

Lucene Fields: Patch Available
Resolution Date: 17/Nov/07 09:38 PM


 Description  « Hide
This is Robert Engels' implementation of IndexReader.reopen() functionality, as a set of 3 new classes (this was easier for him to implement, but should probably be folded into the core, if this looks good).

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Otis Gospodnetic added a comment - 11/Dec/06 10:47 PM
In a direct email to me, Robert said: "All of the files can be prepended with the ASL."

robert engels added a comment - 11/Dec/06 10:57 PM
A generic version probably needs to implement reference counting on the Segments or IndexReader in order to know when they can be safely closed.

Hoss Man added a comment - 19/Jul/07 12:39 AM
i somehow missed seeing this issues before ... i don't really understand the details, but a few comments that come to mind...

1) this approach seems to assume that when reopening a MyMultiReader, the sub readers will all be MySegmentReaders .. assuming we generalize this to MultiReader/SegmentTeader, this wouldn't work in the case were people are using a MultiReader containing other MultiReaders ... not to mention the possibility of people who have written their own IndexReader implementations.
in generally we should probably try to approach reopening a reader as a recursive operation if possible where each type of reader is responsible for checking to see if it's underlying data has changed, if not return itself, if so return a new reader in it's place (much like rewrite works for Queries)

2) there is no more commit lock correct? ... is this approach something that can still be valid using the current backoff/retry mechanism involved with opening segments?


Michael Busch added a comment - 20/Jul/07 07:10 AM
> i somehow missed seeing this issues before ...

actually, me too... first I came across this thread:

http://www.gossamer-threads.com/lists/lucene/java-dev/31592?search_string=refresh;#31592

in which Doug suggested adding a static method IndexReader.open(IndexReader)
which would either return the passed in IndexReader instance in case is is
up to date or return a new, refreshed instance.

I started implementing this, using Dougs and Roberts ideas and then realized
that there was already this open issue. But the code here is quite outdated.

> in generally we should probably try to approach reopening a reader as a
> recursive operation

Yeah we could do that. However, this might not be so easy to implement.
For example, if a user creates a MultiReader instance and adds whatever
subreaders, we would have to recursively refresh the underlying readers.
But if the MultiReader was created automatically by IndexReader.open() just
calling refresh on the subreaders is not enough. New SegmentReaders have to
be opened for new segments.

Also the recursive walk would have to take place within the FindSegmentsFile
logic.

I decided therefore to only allow IndexReaders to be refreshed if they were
created by one of the IndexReader.open() methods. I'm going to submit a first
version of my patch soon. Do you think this is too limiting?


Michael Busch added a comment - 20/Jul/07 07:13 AM
First version of my patch:
  • Adds the static method IndexReader.open(IndexReader)
    that returns a new instance of IndexReader in case
    the reader could be updated. If it was up to date
    then the passed-in instance is returned. Only
    IndexReader instances that were created by one of
    the IndexReader.open() methods can be refreshed.
  • SegmentsReader.reopen(SegmentInfos) looks in the
    SegmentInfos for a segment with the same name. If
    one could be found then either the deleted docs or
    the norms were updated, otherwise the segment name
    would have changed. reopen() clones the
    SegmentReader and either loads the deleted docs,
    the norms or both. Then the clone is returned and
    the original SegmentReader is marked as closed.
  • If the index has only one segment, then
    IndexReader.open(IndexReader) checks if the passed
    in IndexReader can be refreshed. This is only
    possible if it is no MultiReader and if the segment
    name has not changed. Otherwise a new SegmentReader
    instance is returned and the old reader is closed.
  • If the index has multiple segments, then
    IndexReader.open(IndexReader) creates a new
    MultiReader and tries to reuse the passed-in
    reader (in case it's a SegmentReader) or its
    subreaders (in case it's a MultiReader). For new
    segments new SegmentReaders are created. Old
    readers that couldn't be reused are closed.
  • Adds the new testcase TestIndexReaderReopen. It
    includes the method
    assertIndexEquals(IndexReader, IndexReader) that
    checks whether boths IndexReaders have the same
    content. The testcase creates an index and
    performes different modifications on that index.
    One IndexReader is refreshed after each index
    modification and compared to a fresh IndexReader
    which is opened after each modification.

This first version is for review and not ready to
commit. I want to add more extensive tests and
probably clean up the code.

All tests pass.


Hoss Man added a comment - 21/Jul/07 12:26 AM

> Yeah we could do that. However, this might not be so easy to implement.
> For example, if a user creates a MultiReader instance and adds whatever
> subreaders, we would have to recursively refresh the underlying readers.
> But if the MultiReader was created automatically by IndexReader.open() just
> calling refresh on the subreaders is not enough. New SegmentReaders have to
> be opened for new segments.

...this being the curse that is MultiReader – it can serve two very differenet purposes.

You seem to have already solved the multisegments in a single directory approach, the MultiReader over many subreader part actually seems much easier to me (just call your open method on all of the subreaders) the only tricky part is detecting which behavior should be used when. This could be driven by a simple boolean property of MultiReader indicating whether it owns it's directory and we need to look for new segments or not – in which case we just need to refresh the subreaders. (My personal preference would be to change MultiReader so "this.directory" is null if it was open over several other subReaders, right now it's just assigned to the first one arbitrarily, but there may be other consequences of changing that)

Incidentally: I don't think it's crucial that this be done as a recursive method. the same approach i describe could be added to static utility like what you've got, I just think that if it's possible to do it recursively we should so that if someone does write their own MultiReader or SegmentReader subclass they can still benefit from any core reopening logic as long as theey do their part to "reopen" their extensions.


Hoss Man added a comment - 21/Jul/07 01:46 AM
an extremely hackish refactoring of the previous patch that demonstrates the method working recursively and dealing with MultiReaders constructed over multiple subReaders.

a few notes:

1) no where near enough tests of the subReader situation
2) the refactoring is very very ugly and brutish ... most of the code is still in IndexReader just because it needs so many things that are private – things that really seems like they should be pushed down into SegmentReader (or made protected)
3) test triggered an apparent NPE in MultiReader.isOptimized() when there are subReaders, i hacked arround this in the test, see usages of assertIndexEqualsZZZ vs assertIndexEquals
4) the FilterIndexReader situation is ... interesting. in theory FilterIndexReader should really be abstract (since if you aren't subclassing it anyway, why do you want it?)


Michael Busch added a comment - 01/Aug/07 09:41 PM
Now, after LUCENE-781, LUCENE-970 and LUCENE-832 are committed, I updated the latest
patch here, which was now easier because MultiReader is now separated into two classes.

Notes:

  • As Hoss suggested I added the reopen() method to IndexReader non-static.
  • MultiReader and ParallelReader now overwrite reopen() to reopen the subreaders
    recursively.
  • FilteredReader also overwrites reopen(). It checks if the underlying reader has
    changed, and in that case returns a new instance of FilteredReader.

I think the general contract of reopen() should be to always return a new IndexReader
instance if it was successfully refreshed and return the same instance otherwise,
because IndexReaders are used as keys in caches.
A remaining question here is if the old reader(s) should be closed then or not.
This patch closes the old readers for now, if we want to change that we probably have
to add some reference counting mechanism, as Robert suggested already. Then I would
also have to change the SegmentReader.reopen() implementation to clone resources like
the dictionary, norms and delete bits.
I think closing the old reader is fine. What do others think? Is keeping the old
reader after a reopen() a useful usecase?


Michael Busch added a comment - 01/Aug/07 09:46 PM
I ran some quick performance tests with this patch:

1) The test opens an IndexReader, deletes one document by random docid, closes the Reader.
So this reader doesn't have to open the dictionary or the norms.
2) Another reader is opened (or alternatively reopened) and one TermQuery is executed, so
this reader has to read the norms and the dictionary.

I run these two steps 5000 times in a loop.

First run: Index size: 4.5M, optimized

  • 1) + TermQuery: 103 sec
  • 1) + 2) (open): 806 sec, so open() takes 703 sec
  • 1) + 2) (reopen): 118 sec, so reopen() takes 15 sec ==> Speedup: 46.9 X

Second run: Index size: 3.3M, 24 segments (14x 230.000, 10x 10.000)

  • 1) + TermQuery: 235 sec
  • 1) + 2) (open): 1162 sec, so open() takes 927 sec
  • 1) + 2) (reopen): 321 sec, so reopen() takes 86 sec ==> Speedup: 10.8X

Yonik Seeley added a comment - 06/Aug/07 10:07 PM
> I think closing the old reader is fine. What do others think? Is keeping the old
> reader after a reopen() a useful usecase?

In a multi-threaded environment, one wants to open a new reader, but needs to wait until all requests finish before closing the old reader. Seems like reference counting is the only way to handle that case.


Hoss Man added a comment - 13/Aug/07 06:15 AM

(note: i haven't looked at the latest patch in detail, just commenting on the comments)

One key problem i see with automatically closing things in reopen is MultiReader: it's perfectly legal to do something like this psuedocode...

IndexReader a, b, c = ...
MultiReader ab = new MultiReader({a, b})
MultiReader bc = new MultiReader({b, c})
...b changes on disk...
ab.reopen(); // this shouldn't affect bc;

one possibility would be for the semantics of reopen to close old readers only if it completely owns them; ie: MultiReader should never close anything in reopen, MultiSegmentReader should close all of the subreaders since it opened them in the first place ... things get into a grey area with SegementReader though.

In general i think the safest thing to do is for reopen to never close. Yonik's comment showcases one of the most compelling reasons why it can be important for clients to be able to keep using an old IndexReader instance, and it's easy enough for clients that want the old one closed to do something like...

IndexReader r = ...
...
IndexReader tmp = r.reopen();
if (tmp != r) r.close();
r = tmp;
...

(one question that did jump out at me while greping the patch for the where old readers were being closed: why is the meat of reopen still in "IndexReader" with a "if (reader instanceof SegmentReader)" style logic in it? can't the different reopen mechanisms be refactored down into SegmentReader and MultiSegmentReader respectively? shouldn't the default impl of IndexReader throw an UnsuppportedOperationException?)


Michael Busch added a comment - 17/Aug/07 07:10 AM
> IndexReader a, b, c = ...
> MultiReader ab = new MultiReader({a, b})
> MultiReader bc = new MultiReader({b, c})
> ...b changes on disk...
> ab.reopen(); // this shouldn't affect bc;
>
> one possibility would be for the semantics of reopen to close old readers only
> if it completely owns them; ie: MultiReader should never close anything in
> reopen, MultiSegmentReader should close all of the subreaders since it opened
> them in the first place

So if 'b' in your example is a MultiSegmentReader, then the reopen() call
triggered from MultiReader.reopen() would close old readers, because it owns them,
thus 'bc' wouldn't work anymore. So it depends on the caller of
MultiSegmentReader.reopen() whether or not to close the subreaders. I think this
is kind of messy. Well instead of reopen() we could add
reopen(boolean closeOldReaders), but then...

> IndexReader r = ...
> ...
> IndexReader tmp = r.reopen();
> if (tmp != r) r.close();
> r = tmp;
> ...

... is actually easy enough as you pointed out, so that the extra complexity is not
really worth it IMO.

> In general i think the safest thing to do is for reopen to never close.

So yes, I agree.

> why is the meat of reopen still in "IndexReader" with a "if (reader instanceof
> SegmentReader)" style logic in it? can't the different reopen mechanisms be
> refactored down into SegmentReader and MultiSegmentReader respectively?

I'm not sure if the code would become cleaner if we did that. Sometimes a
SegmentReader would then have to return a MultiSegmentReader instance and vice
versa. So we'd probably have to duplicate some of the code in these two classes.


Hoss Man added a comment - 20/Aug/07 08:40 PM

> I'm not sure if the code would become cleaner if we did that. Sometimes a SegmentReader would then have to
> return a MultiSegmentReader instance and vice versa. So we'd probably have to duplicate some of the code in
> these two classes.

i don't hink there would be anything wrong with SegmentReader.reopen returning a MultiSegmentReader in some cases (or vice versa) but it definitely seems wrong to me for a parent class to be explicitly casing "this" to one of two know subclasses ... making reopen abstract in the base class (or throw UnsupportOp if for API back compatibility) seems like the only safe way to ensure any future IndexReader subclasses work properly.


Michael Busch added a comment - 22/Aug/07 11:21 PM
We should first refactor segmentInfos into IndexReader's subclasses.

Nat added a comment - 04/Sep/07 11:42 AM
Please also consider making an option where the reopen can be automated (i.e. when the index is updated) instead of having to call it explicitly. Thread safety should be taken into account as well.

Michael Busch added a comment - 02/Oct/07 06:24 AM
I'm attaching a new version of the patch that has a lot of changes compared to the last patch:
  • I factored most of the reopen logic into the subclasses of IndexReader. Now that we're having the DirectoryIndexReader layer this was possible in a more elegant way.
  • IndexReader.reopen() now does not close the old readers by default. This was somewhat tricky, because now the IndexReaders must be cloneable. I changed IndexReader to implement the Cloneable interface and implemented clone() for all Lucene built-in IndexReaders. However, there are probably custom IndexReader implementations out there that do not implement clone() and reopen() should throw an exception when an attempt is made to reopen such a reader. But I don't want to throw an exception in IndexReader.clone() itself, because then it would not be possible anymore for subclasses to recursively call the native Object.clone() via super.clone(). To solve this conflict I added the method
    /**
       * Returns true, if this IndexReader instance supports the clone() operation.
       */
      protected boolean isCloneSupported();

to IndexReader which returns false by default. IndexReader.clone() checks if the actual implementation supports clone() (i. e. the above method returns true). If not, it throws an UnsupportedOperationException, if yes, it returns super.clone().

I was not sure about whether to throw an (unchecked) UnsupportedOperationException or a CloneNotSupportedException in this case. I decided to throw UnsupportedOperationException even though it's not really following the clone() guidelines, because it avoids the necessity to catch the CloneNotSupportedException every time clone() is called (in the reopen() methods of all IndexReader implementations).

As an example for how the clone() method is used let me describe how MultiReader.reopen() works: it tries to reopen every of its subreaders. If at least one subreader could be reopened successfully, then a new MultiReader instance is created and the reopened subreaders are added to it. Every of the old MultiReader's subreaders, that was not reopened (because of no index changes) is now cloned() and added to the new MultiReader.

  • I also added the new method
    /**
       * In addition to {@link #reopen()} this methods offers the ability to close
       * the old IndexReader instance. This speeds up the reopening process for
       * certain IndexReader implementations and reduces memory consumption, because
       * resources of the old instance can be reused for the reopened IndexReader
       * as it avoids the need of copying the resources.
       * <p>
       * The reopen performance especially benefits if IndexReader instances returned 
       * by one of the <code>open()</code> methods are reopened with 
       * <code>closeOldReader==true</code>.
       * <p>
       * Certain IndexReader implementations ({@link MultiReader}, {@link ParallelReader})
       * require that the subreaders support the clone() operation (see {@link #isCloneSupported()}
       * in order to perform reopen with <code>closeOldReader==false</code>.  
       */
      public synchronized IndexReader reopen(boolean closeOldReader);

As the javadoc says it has two benefits: 1) it speeds up reopening and reduces ressources, and 2) it allows to reopen readers, that use non-cloneable subreaders.

Please let me know what you think about these changes, especially about the clone() implementation.


Michael Busch added a comment - 05/Oct/07 07:41 AM
I ran new performance tests with the latest patch similar to the tests I explained in an earlier comment on this issue.

I'm using again a 4.5M index. In each round I delete one document and (re)open the IndexReader thereafter. Here are the numbers for 5000 rounds:

  Time Speedup
open 703s  
reopen(closeOldReader==false) 62s 11x
reopen(closeOldReader==true) 16s 44x

Now in each round I delete on document and also set the norm for one random document. Numbers for 1000 rounds:

  Time Speedup
open 166s  
reopen(closeOldReader==false) 33s 5.0x
reopen(closeOldReader==true) 29s 5.7x

I think these numbers look pretty good. We get a quite decent speedup even if the old readers are not closed.

I would like to commit this in a couple of days to get ready for Lucene 2.3. It would be nice if someone could review the latest patch! Hoss? Others?


Yonik Seeley added a comment - 06/Oct/07 12:58 AM
I think this looks pretty good Michael!
Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers.

Michael Busch added a comment - 06/Oct/07 03:43 AM
> Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers.

Yeah, the cloning part was kind of tedious. Read-only readers would indeed make our life much easier here. I'm wondering how many people are using the IndexReader to alter the norms anyway?

Well, thanks for reviewing the patch, Yonik!


robert engels added a comment - 06/Oct/07 04:24 AM
Nice to see all the good work on this. We are still on a 1.9 derivative.

Hopefully we'll be able to move to stock 2.X release in the future.


Hoss Man added a comment - 06/Oct/07 05:37 AM
I haven't looked at the patch yet (i really really plan to this weekend, baring catastrophe) but i'm confused as to why you went the cloning route (along with the complexity of the api changes to indicate when it is/isn't supported) ... based on the comments, it seems to boil down to...

> As an example for how the clone() method is used let me describe how MultiReader.reopen()
> works: it tries to reopen every of its subreaders. If at least one subreader could be reopened
> successfully, then a new MultiReader instance is created and the reopened subreaders are
> added to it. Every of the old MultiReader's subreaders, that was not reopened (because of no
> index changes) is now cloned() and added to the new MultiReader.

that seems like circular logic: the clone method is used so that the sub readers can be cloned

why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?

And if (for reasons that aren't clear to me) it is important for MultiReader to use a clone of it's subreaders when their reopen method returns "this" then shouldn't clients do the same thing? ... why not make reopen always return this.clone() if the index hasn't changed (which now that i think about it, would also help by punting on the isCloneSupported issue – each class would already know if it was clonable.

maybe this will make more sense once i read the patch ... i just wanted to through it out there in case someone had a chance to reply before i get a chance.


Michael Busch added a comment - 06/Oct/07 06:56 AM
> why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?

Let's say you have a MultiReader with two subreaders:

IndexReader ir1 = IndexReader.open(index1);
IndexReader ir2 = IndexReader.open(index2);
IndexReader mr = new MultiReader(new IndexReader[] {ir1, ir2});

Now index1 changes and you reopen the MultiReader and keep the old one open:

IndexReader mr2 = mr.reopen();

ir1 would now be reopened and let's assume we wouldn't clone ir2. If you use mr2 now to e.g. delete a doc and that doc happens to be in index2, then mr1 would also see the changes because both MultiReaders share the same subreader ir2 and are thus not independent from each other.

> why not make reopen always return this.clone()

clone() might be an expensive operation. We only need to clone if at least one of the subreaders has changed.


Michael McCandless added a comment - 06/Oct/07 01:26 PM
> > Too bad so much needs to be cloned in the case that
> > closeOldReader==false... maybe someday in the future we can have
> > read-only readers.
>
> Yeah, the cloning part was kind of tedious. Read-only readers would
> indeed make our life much easier here. I'm wondering how many people
> are using the IndexReader to alter the norms anyway?

I think the closeOldReader=false case is actually quite important.

Because in production, I think you'd have to use that, so that your
old reader stays alive and is used to service incoming queries, up
until the point where the re-opened reader is "fully warmed".

Since fully warming could take a long time (minutes?) you need that
old reader to stay open.

Can we take a copy-on-write approach? EG, this is how OS's handle the
virtual memory pages when forking a process. This would mean whenever
a clone has been made of a SegmentReader, they cross-reference one
another. Then whenever either needs to alter deleted docs, one of them
makes a copy then. Likewise for the norms.

This would mean that "read-only" uses of the cloned reader never
pay the cost of copying the deleted docs bit vector nor norms.


Michael McCandless added a comment - 06/Oct/07 02:16 PM

Actually if we went back to the sharing (not cloning) approach, could
we insert a check for any writing operation against the re-opened
reader that throws an exception if the original reader is not yet
closed?

In Michael's example above, on calling mr2.deleteDoc, you would hit an
exception because mr2 would check and see that it's "original" reader
mr is not yet closed. But once you've closed mr, then the call
succeeds.

I think this would let us have our cake and eat it too: re-opening
would be very low cost for unchanged readers (no clones created), and,
you can still do deletes, etc, after you have closed your prior
reader. You control when your prior reader gets closed, to allow for
warming time and for queries to finish on the old reader.

Would this work?


Michael McCandless added a comment - 06/Oct/07 02:29 PM

A couple other questions/points:

  • Just to verify: if you have a DirectoryIndexReader that is holding
    the write lock (because it has made changes to deletes/norms) then
    calling reopen on this reader should always return 'this', right?
    Because it has the write lock, it must be (better be!) current.

This might be a good place to add an assert: if you are not
current, assert that you don't have the write lock, and if you
have the write lock, assert that you are current.

  • I think you should add "ensureOpen()" at the top of
    MultiReader.reopen(...)?
  • If an Exception is hit during reopen, what is the resulting state
    of your original reader? I think, ideally, it is unaffected by
    the attempt to re-open? EG if you're on the hairy edge of file
    descriptor limits, and the attempt to re-open failed because you
    hit the limit, ideally your original reader is unaffected (even if
    you specified closeOldReader=true).

Hoss Man added a comment - 07/Oct/07 06:22 PM
Okay, read the patch. I'm on board with the need for Clonable now ... it's all about isolation. if "r.reopen(false) == r" then the client is responsible for recognizing that it's got the same instance and can make the judgement call about reusing the instance or cloning depending on it's needs ... internally in things like MultiReader we have to assume we need a clone for isolation.

Questions and comments...

1. CompoundFileReader, FieldsReader, IndexReader, and BitVector all have clone methods where they silently catch and ignore CloneNotSupportedExceptions from super.clone()... if we don't expect the exception to ever happen, we should just let the exception propogate up the stack (there is no down side to declaring that clone() throws CloneNotSupportedException). If we think the exception might happen, but it's not a big deal if it does and we can ignore it, then we should put a comment in the catch block to that effect. i'm not clear which are the cases here.

2. i don't remember if the old patches had this issue as well, but having "return new FilterIndexReader(reopened);" in FilterIndexReader doesn't really help anyone using FilterIndexReader – they're going to want an instance of their own subclass. to meet the contract of FilterIndexReader, we should implement all "abstract" methods and delegate to the inner reader - so in theory do don't need a new instance of FIlterIndexReader, we could just return in.reopen(closeold) ... my gut says it would be better to leave the method out entirely so that the default impl which throws UnSupOpEx is used — that way subclasses who want to use reopen must define their own (instead of getting confused when their filtering logic stops working after they reopen for the first time)

3. instead of having an isClonable() method, why don't we just remove the "implements Clonable" declaration from IndexReader and put it on the concrete subclasses – then use "instanceof Cloneable" to determine if something is clonable? ... for that matter, the only place isCloneSupported is really used is in IndexReader.clone where an exception is thrown if Clone is not supported ... subclasses which know they are Clonable don't need this from the base class, subclasses which don't implement Clonable but are used in combination with core classes whose reopen method requires it could be skiped by the caller (since they don't implement Clonable) ..

4. javadocs frequently refer to "reopened successfully" and "refreshed successfully" when what it really means is "reopen() returned a newer/fresher reader" ... this may confuse people who are use to thinking of "successfully" a "without error"

5. random thought: are their any synchronization issues we're missing here? I'm wondering about the case where once thread calls reopen while another thread is updating norms or deleting docs. is there any chance for inconsistent state?


Hoss Man added a comment - 07/Oct/07 07:57 PM
a rough variation on Michael's latest patch that makes the changes along two of the lines i was suggesting before reagrding FilterIndexReader and ising "instanceof Cloneable" instead of "isCloneSupported()" two important things to note:

1) i didn't change any of the documentation, i was trying to change as little aspossibel so the two patches could be compared side-by-side

2) this patch is currently broken. TestIndexReaderReopen gets an exception i can't understand ... but i have to stop now, and i wanted to post what i had in case people had comments.

...now that i've done this exercise, i'm not convinced that it's a better way to go, but it does seem like isCloneSupported isn't neccessary, that's the whole point of the Cloneable interface.


Yonik Seeley added a comment - 07/Oct/07 08:09 PM
> I'm wondering about the case where once thread calls reopen while another thread is updating norms or deleting docs.

Hmmm there is cause for concern (and I should have had my mt-safe hat on
Reopen is synchronized on the reader, and so are norms access and docs, but from a quick look:

  • reopen() may be synchronized, but clone() called on sub-readers isn't in a synchronized context that the sub-reader cares about. For example, MultiReader.reopen has the lock on the multireader, but calles subreader.clone() which iterates over the norms in a non thread-safe way.
  • IndexInput objects that are in use should never be cloned... (like what is done in FieldsReader.clone())

Michael Busch added a comment - 08/Oct/07 05:56 PM
Thanks all for the reviews and comments!

There seem to be some issues/open questions concerning the cloning.
Before I comment on them I think it would make sense to decide whether
we want to stick with the cloning approach or not. Mike suggests this
approach:

> Actually if we went back to the sharing (not cloning) approach, could
> we insert a check for any writing operation against the re-opened
> reader that throws an exception if the original reader is not yet
> closed?

Interesting, yes that should work in case we have two readers (the
original one and the re-opened one). But what happens if the user calls
reopen twice to get two re-opened instances back? Then there would be
three instances, and without cloning the two re-opened ones would also
share the same resources. Is this a desirable use case or would it be
okay to restrict reopen() so that it can only create one new instance?


Michael McCandless added a comment - 08/Oct/07 06:45 PM
> > Actually if we went back to the sharing (not cloning) approach,
> > could we insert a check for any writing operation against the
> > re-opened reader that throws an exception if the original reader
> > is not yet closed?
>
> Interesting, yes that should work in case we have two readers (the
> original one and the re-opened one). But what happens if the user
> calls reopen twice to get two re-opened instances back? Then there
> would be three instances, and without cloning the two re-opened ones
> would also share the same resources. Is this a desirable use case or
> would it be okay to restrict reopen() so that it can only create one
> new instance?

Hmmm good point.

Actually, we could allow more then one re-open call if we take the
following approach: every time a cloned Reader "borrows" a reference
to a sub-reader, it increments a counter (call it the "referrer
count"). When the Reader is closed, it decrements the count (by 1)
for each of the sub-readers.

Then, any reader should refuse to do a writing operation if its
"referrer" count is greater than 1, because it's being shared across
more than one referrer.

This way if you have a reader X and you did reopen to get Y and did
reopen again to get Z then the shared sub-readers between X, Y and Z
would not allow any write operations until 2 of the three had been
closed. I think that would work?

BTW this would also allow for very efficient "snapshots" during
searching: keeping multiple readers "alive", each searching a
different point-in-time commit of the index, because they would all
share the underlying segment readers that they have in common. Vs
cloning which would have to make many copies of each segment reader.


Michael Busch added a comment - 08/Oct/07 09:42 PM
> This way if you have a reader X and you did reopen to get Y and did
> reopen again to get Z then the shared sub-readers between X, Y and Z
> would not allow any write operations until 2 of the three had been
> closed. I think that would work?

Yes I think it would work. However, this approach has two downside IMO:

  • reopen() becomes more complicated and restricted for the user. With
    the cloning approach the user doesn't have to care about when index
    modifications are not allowed. IndexReader instances returned by open()
    or reopen() can be used exactly the same without any restrictions.
  • We have to be very careful about cross-referencing multiple readers.
    If for some reason any references between two or more readers are not
    cleared after one was closed, then that reader might not become GC'd.
    I'm not saying it's not possible or even very hard, we just have to
    make sure those things can't ever happen.

Of course the cloning approach has disadvantages too. For custom
readers clone() has to be implemented in order to make reopen() work
correctly. Also reopen() is more expensive in case of
closeOldReader=false. Well we could certainly consider the lazy clone
approach that you suggested, Mike, but we have to be careful about the
cross-referencing issue again.

So I'm really not sure which way the better one is. I think I'm slightly
in favor for the cloning approach, so that reopen() returns instances
that are completely independant from each other, which seems cleaner IMO.
What do others think?


Michael McCandless added a comment - 08/Oct/07 10:15 PM

> We have to be very careful about cross-referencing multiple readers.
> If for some reason any references between two or more readers are
> not cleared after one was closed, then that reader might not become
> GC'd. I'm not saying it's not possible or even very hard, we just
> have to make sure those things can't ever happen.

One correction: there should be no cross-referencing, right? The only
thing that's happening is incrementing & decrementing the "referrer
count" for a reader? (But, you're right: the "copy on write" approach
to cloning would have cross-referencing).

I think the downside of this proposed "shared readers" (non-cloning)
approach is that you can't delete/setNorm on the clone until you close
the original. But I think that's fairly minor? Also if you really
need a full deep copy of your reader you could just open a new one
(ie, not use "reopen")?

I think the big upside of "shared readers" is reopening becomes quite
a bit more resource efficient: the cost of re-opening a reader would
be in proportion to what has actually changed in the index. EG, if
your index has added a few tiny segments since you last opened then
the reopen would be extremely fas but with cloning you are forced
to make a full deep copy of all other [unchanged] segments.


Hoss Man added a comment - 11/Oct/07 06:43 PM

in deciding "to clone or not clone" for the purposes of implementing reopen, it may make sense to step back and ask a two broader questions...

1) what was the motivation for approaching reopen using clones in the first place
2) what does it inherently mean to clone an indexreader.

the answer to the first question, as i understand it, relates to the issue of indexreaders not being "read only" object ... operations can be taken which modify the readers state, and those operations can be flushed to disk when the reader is closed. so readerB = readerA.reopen(closeOld=false) and then readerA.delete(...) is called, there is ambiguity as to what should happen in readerB. so the clone route seems pretty straight forward if and only if we have an unambiguous concept of cloning a reader (because then the clone approach to reopen becomes unambiguous as well). alternately, since reopen is a new method, we can resolve the ambiguity anyway we want, as long as it's documented ... in theory we should pick a solution that seems to serve the most benefit ... perhaps that solution is to document reopen with "if reopen(closeOld=false) returns a new reader it will share state with the current reader, attempting to do the following operations on this new reader will result in undefined behavior"

the answer the the second is only important if we want to go the cloning route ... but it is pretty important in a larger sense then just reopening ... f we start to say that any of the IndexReader classes are Clonable we have to be very clear about what that means in all cases where someone might clone it in addition to reopen ... in particular, i worry about what it means to clone a reader which has already had "write" operations performed on it (something the clone based version of reopen will never do because a write operation indicates the reader must be current), but some client code might as soon as we add the Clonable interface to a class.


Michael Busch added a comment - 16/Oct/07 08:34 PM
> in deciding "to clone or not clone" for the purposes of implementing
> reopen, it may make sense to step back and ask a two broader questions...

I agree!

> 1) what was the motivation for approaching reopen using clones in the
> first place

Good summarization! You are right, I started the clone approach because
IndexReaders are not "read only" objects.

> "if reopen(closeOld=false) returns a new reader it will share state with
> the current reader, attempting to do the following operations on this
> new reader will result in undefined behavior"

This would mean, that we simply warn the user that performing write
operations with the re-opened indexreader will result in undefined behavior,
whereas with Mike's approach we would prevent such an undefined behavior by
using reference counting.

Hmm, so what are our long-term plans for indexreaders? If our goal is to
make them read-only (we can delete via IndexWriter already), then I think
adding those warning comments to reopen(), as you suggest Hoss, would be
sufficient.

If everybody likes this approach I'll go ahead and submit a new patch.


Doug Cutting added a comment - 16/Oct/07 08:51 PM

Hmm, so what are our long-term plans for indexreaders? If our goal is to
make them read-only (we can delete via IndexWriter already), then I think
adding those warning comments to reopen(), as you suggest Hoss, would be
sufficient.

I think that's a fine direction. Note however that IndexWriter implements delete by calling IndexReader.delete(). That method could be made package-private, so that users cannot call it, but then this makes it impossible for someone to subclass IndexReader from a different package. So perhaps delete() needs to move to a subclass of IndexReader? That gets messy...


Michael Busch added a comment - 16/Oct/07 09:24 PM

I think that's a fine direction. Note however that IndexWriter implements delete by calling IndexReader.delete(). That method could be made package-private, so that users cannot call it, but then this makes it impossible for someone to subclass IndexReader from a different package. So perhaps delete() needs to move to a subclass of IndexReader? That gets messy...

Actually all of the lock/commit logic moved from IndexReader to DirectoryIndexReader already, and the delete logic is in SegmentReader, which subclasses DirectoryIndexReader. So we could remove the deleteDocument() API from IndexReader but leave it in DirectoryIndexReader. Then it would still be possible to extend IndexReader from a different package just as today, and IndexWriter could use DirectoryIndexReader for performing deletes. These changes should be trivial.


Doug Cutting added a comment - 16/Oct/07 09:33 PM
Got it. IndexWriter only works with SegmentReaders anyway, not with an arbitrary IndexReader implementation: IndexReader is extensible, but IndexWriter is not. I'd (momentarily) forgotten that. Nevermind.

Yonik Seeley added a comment - 16/Oct/07 09:40 PM
Having a read-only IndexReader would (should?) mean being able to remove "synchronized" from some things like isDeleted()... a nice performance win for multi-processor systems for things that didn't have access to the deleted-docs bitvec.

> If our goal is to make them read-only (we can delete via IndexWriter already)

But you can only delete-by-term.
It's more powerful to be able to delete by docid, however I manage to come up with it.
So I think deleteDocument(int id) should either be moved to a subclass. same with setNorms?


Michael Busch added a comment - 16/Oct/07 10:29 PM

So I think deleteDocument(int id) should either be moved to a subclass. same with setNorms?

Or we could take the approach you suggested in http://www.gossamer-threads.com/lists/lucene/java-dev/52017.

That would mean to add a callback after flush to get a current IndexReader to get the docids and to use the IndexWriter then to perform deleteDocument(docId) or setNorm(). These methods could also take an IndexReader as argument, e. g. deleteDocument(IndexReader reader, int docId), which would throw an IOException if the passed in reader is stale (i. e. docids have changed since the reader was opened). Just as IndexReader does it today. Does this make sense or am I missing something?


Michael Busch added a comment - 17/Oct/07 06:35 PM
I just opened LUCENE-1030 and would like to move discussions related to "read-only" IndexReaders to that issue.

As for reopen() I'd like to go with Hoss' suggestion for now and add warning comments to reopen() saying that using an re-opened IndexReader with closeOldReader==false for write operations will result in an undefined behavior. Any objections?


Yonik Seeley added a comment - 17/Oct/07 09:02 PM

As for reopen() I'd like to go with Hoss' suggestion for now and add warning comments to reopen() saying that using an re-opened IndexReader with closeOldReader==false for write operations will result in an undefined behavior.

How about just defining the behavior such that any pending changes are flushed. That would make it more useful because you could then reopen readers you used for deletes. An alternative would be a method to explicitly flush changes on a reader, giving one the ability to then reopen it, but I like the former better since it avoids adding another API call.


Michael Busch added a comment - 17/Oct/07 09:23 PM

How about just defining the behavior such that any pending changes are flushed. That would make it more useful because you could then reopen readers you used for deletes.

Hmm, I'm not sure I understand. A reader which is being used for deletes or setting norms is always current (it owns the write lock), so there should never be the need to re-open such a reader.

However, if you re-open an existing reader which was not used for deletes before and use the new instance (b) to perform deletes, it will result in a undefined behavior for the old reader (a):

IndexReader a = .....
....
IndexReader b = a.reopen();
b.deleteDocument(...);

Yonik Seeley added a comment - 18/Oct/07 04:55 PM

A reader which is being used for deletes or setting norms is always current (it owns the write lock), so there should never be the need to re-open such a reader.

I was thinking about the "add some docs" then "delete some docs" scenario:
One currently needs to close() the deleting reader to open an IndexWriter. If IndexReader.commit() was public, then one could simply flush changes, and then call reopen() after the IndexWriter was done adding new documents. But perhaps longer term, all deletions should be done via the IndexWriter anyway.

if you re-open an existing reader which was not used for deletes before and use the new instance (b) to perform deletes

Ah, thanks for that clarification. I guess that should remain undefined.


Michael Busch added a comment - 18/Oct/07 07:49 PM
Hmm one other thing: how should IndexReader.close() work? If we re-open a reader (a is the old reader, b is the new one), and then someone calls a.close(), then a should not close any resources that it shares with b.

One way to make this work would be reference counting. Since we want to avoid that, we could simply restrict reopen() from being called twice for the same reader. Then there would never be more than 2 readers sharing the same ressources. The old reader a would remember that reopen() was called already and would not close the shared ressources on close(). However, the new reader b would close all ressources, meaning that reader a would not work anymore after b.close() was called.
Thoughts?


Michael McCandless added a comment - 18/Oct/07 08:03 PM
I think reference counting would solve this issue quite nicely. How come we want to avoid reference counting? It seems like the right solution here.

The implementation seems simple. When a reader is opened, it starts with RC 1. When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC). When a MultiReader needs to use the reader, it calls incref. And when that MultiReader is done with it, it calls decref. Whenever the RC hits 0 it's safe to free all resources.

Wouldn't that work?


Yonik Seeley added a comment - 18/Oct/07 08:35 PM
> When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC)

But if a reader is shared, how do you tell two real closes from an erronous double-close?
Perhaps have a top level close() that is only invoked once via isClosed, and a projected doClose() that a multi-reader would use that actually does the decref?


Michael Busch added a comment - 18/Oct/07 08:43 PM

The implementation seems simple. When a reader is opened, it starts with RC 1. When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC). When a MultiReader needs to use the reader, it calls incref. And when that MultiReader is done with it, it calls decref. Whenever the RC hits 0 it's safe to free all resources.

Wouldn't that work?

You're right, for a MultiReader and ParallelReader this would work and wouldn't be hard to implement.

It is quite different when it comes to SegmentReaders. SegmentReader.reopen() checks SegmentInfos if there is a segment with the same name but newer normGen or delGen. If there is one then a new SegmentReader instance is created that reuses resources like e. g. TermInfosReader and loads the new generation of the del or norm file.

Now you can end up having a bunch of SegmentReaders that share the same resources but don't know about each other. The reference counting would have to happen somewhere else, e. g. in the TermInfosReader. Of course this is doable, but it's a special case and more complicated compared to the "restrict reopen() to only be called once"-approach.


Michael McCandless added a comment - 18/Oct/07 09:53 PM
> But if a reader is shared, how do you tell two real closes from an erronous double-close?
> Perhaps have a top level close() that is only invoked once via isClosed, and a projected doClose() that a multi-reader would use that actually does the decref?

Yes I think that's the right approach.

> It is quite different when it comes to SegmentReaders. SegmentReader.reopen() checks SegmentInfos if there is a segment with the same name but newer normGen or delGen. If there is one then a new SegmentReader instance is created that reuses resources like e. g. TermInfosReader and loads the new generation of the del or norm file.
>
> Now you can end up having a bunch of SegmentReaders that share the same resources but don't know about each other. The reference counting would have to happen somewhere else, e. g. in the TermInfosReader. Of course this is doable, but it's a special case and more complicated compared to the "restrict reopen() to only be called once"-approach.

For SegmentReader, I think on reopen(), everything would be shared
except norms and/or deletedDocs right? In which case couldn't all
cascaded reopens hold onto the original SegmentReader & call doClose()
on it when they are closed? (Ie it is the "owner" of all the shared
parts of a SegmentReader). Then, deletedDocs needs no protection (it
has no close()), and for Norms we could push the reference counting
down into it as well?

We wouldn't need to push reference counting into all the readers that
a SegmentReader holds because those are always shared when a
SegmentReader is reopened?


Michael Busch added a comment - 19/Oct/07 09:49 PM - edited
Hi Mike,

I'm not sure if I fully understand your comment. Consider the following (quite constructed) example:

IndexReader reader1 = IndexReader.open(index1);  // optimized index, reader1 is a SegmentReader
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
... // modify index2
IndexReader multiReader2 = multiReader1.reopen();  
// only index2 changed, so multiReader2 uses reader1 and has to increment the refcount of reader1
... // modify index1
IndexReader reader2 = reader1.reopen();
// reader2 is a new instance of SegmentReader that shares resources with reader1
... // modify index1
IndexReader reader3 = reader2.reopen();
// reader3 is a new instance of SegmentReader that shares resources with reader1 and reader2

Now the user closes the readers in this order:

  1. multiReader1.close();
  2. multiReader2.close();
  3. reader2.close();
  4. reader3.close();

reader1 should be marked as closed after 2., right? Because multiReader1.close() and multiReader2.close() have to decrement reader1's refcount. But the underlying files have to remain open until after 4., because reader2 and reader3 use reader1's resources.

So don't we need 2 refcount values in reader1? One that tells us when the reader itself can be marked as closed, and one that tells when the resources can be closed? Then multiReader1 and multiReader2 would decrement the first refCount, whereas reader2 and reader3 both have to "know" reader1, so that they can decrement the second refcount.

I hope I'm just completely confused now and someone tells me that the whole thing is much simpler


Michael McCandless added a comment - 20/Oct/07 09:09 AM
It's not nearly this complex (we don't need two ref counts). If we
follow the simple rule that "every time reader X wants to use reader
Y, it increfs it" and "whenver reader X is done using reader Y, it
decrefs it", all should work correctly.

Also we should think of "close()" as the way that the external user
does the decref of their reader. We just special-case this call, by
setting isOpen=false, to make sure we don't double decref on a double
close call.

Let's walk through your example...

I'm assuming in your example you meant for reader2 and reader3 to also
be SegmentReaders? Ie, the changes that are happening to the
single-segment index1 are just changes to norms and/or deletes. If
not, the example is less interesting because reader1 will be closed
sooner

Also in your example let's insert missing "reader1.close()" as the
very first close? (Else it will never be closed because it's RC never
hits 0).

When reader1 is created it has RC 1.

When multiReader1 is created, reader1 now has RC 2.

When multiReader2 is created, reader1 now has RC 3.

When reader2 is created (by reader1.reopen()), it incref's reader1
because it's sharing the sub-readers in reader1. So reader1 now has
RC 4.

When reader3 was created (by reader2.reopen()), it incref's reader2
because it's sharing the sub-readers reader2 contains. So reader1 is
still at RC 4 and reader2 is now at RC 2.

Now, we close.

After reader1.close() is called, reader1 sets isOpen=false (to prevent
double close by the user) and RC drops to 3.

With multiReader1.close(), multiReader1 is not at RC 0, and so it
decrefs all readers it was using, and so reader1 RC is now 2.

With multiReader2.close(), likewise it is now at RC 0 and so it
decrefs all readers it was using, and so reader1 RC is now 1.

With reader2.close(), it decrefs its own RC, however that brings its
RC to 1 (reader3 is still referring to it) and so it does not decref
the reader1 that it's referring to.

Finally, with reader3.close(), it is now at RC 0 and so it decrefs the
reader2 it refers to. This brings reader2's RC to 0, and so reader2
decrefs the reader1 that it's referring to. Which brings reader1's RC
to 0, and so reader1 finally closes all its internal sub-readers.


Michael Busch added a comment - 20/Oct/07 09:46 AM

I'm assuming in your example you meant for reader2 and reader3 to also
be SegmentReaders?

Yes that's what I meant. Sorry, I didn't make that clear.

Also in your example let's insert missing "reader1.close()" as the
very first close? (Else it will never be closed because it's RC never
hits 0).

Doesn't what you describe change the semantics of MultiReader.close()?

If you do:

IndexReader reader1 = IndexReader.open(index1);  
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
multiReader1.close();

then today multiReader1.close() also closes reader1. That's why I consciously omitted reader1.close().

Consequently, if you do

IndexReader reader1 = IndexReader.open(index1);  
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
IndexReader multiReader2 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index3)});
multiReader1.close();

then multiReader2 is not usable anymore, because multiReader1.close() closes reader1. But that can be explicitly avoided by the user because it's known that multiReader1 and multiReader2 share the same reader.

Now, with the reopen() code:

IndexReader reader1 = IndexReader.open(index1);  // optimized index, reader1 is a SegmentReader
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
... // modify index2
IndexReader multiReader2 = multiReader1.reopen();  
// only index2 changed, so multiReader2 uses reader1 and has to increment the refcount of reader1

The user gets a new reader instance from multiReader.reopen(), but can't tell which of the subreaders has been reopened and which are shared. That's why multiReader1.close() should not close reader1 in this case and we need refcounting in order to make this work.

So do you suggest that a MultiReader should increment the refcounts when it is opened as well (in the constructor)? I believe we can implement it like this, but as I said it changes the semantics of MultiReader.close() (and ParallelReader.close() is, I believe, the same). A user would then have to close subreaders manually.


Michael McCandless added a comment - 20/Oct/07 10:45 AM

If you do:

IndexReader reader1 = IndexReader.open(index1);  
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
multiReader1.close();

then today multiReader1.close() also closes reader1. That's why I consciously omitted reader1.close().

Ahh, I missed that MultiReader is allowed to close all readers that
were passed into it, when it is closed. OK, let's leave
reader1.close() out of the example.

It's somewhat "aggressive" of MultiReader/ParallelReader to do that?
If you go and use those same sub-readers in other MultiReaders then
they closing of the first MultiReader will then break the other ones?

I think we are forced to keep this semantics, for backwards
compatibility. But I don't really think MultiReader/ParallelReader
should actually be this aggressive. Maybe in the future we can add
ctors for MultiReader/ParallelReader that accept a "doClose" boolean
to turn this off.

Anyway, it's simple to preserve this semantics with reference
counting. It just means that IndexReader / MultiReader do not incref
the readers they receive, and, when they are done with those readers,
they must call their close(), not decref. Ie they "borrow the
reference" that was passed in, rather than incref'ing their own
reference, to the child readers.

With that change, plus the change below, your example works fine.

Consequently, if you do

IndexReader reader1 = IndexReader.open(index1);  
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
IndexReader multiReader2 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index3)});
multiReader1.close();

then multiReader2 is not usable anymore, because multiReader1.close() closes reader1. But that can be explicitly avoided by the user because it's known that multiReader1 and multiReader2 share the same reader.

This is why I don't like the semantics we have today – I don't think
it's right that the multiReader1.close() breaks multiReader2.

Now, with the reopen() code:

IndexReader reader1 = IndexReader.open(index1);  // optimized index, reader1 is a SegmentReader
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
... // modify index2
IndexReader multiReader2 = multiReader1.reopen();  
// only index2 changed, so multiReader2 uses reader1 and has to increment the refcount of reader1

The user gets a new reader instance from multiReader.reopen(), but can't tell which of the subreaders has been reopened and which are shared. That's why multiReader1.close() should not close reader1 in this case and we need refcounting in order to make this work.

Both of these cases are easy to fix with reference counting: we just
have to change ensureOpen() to assert that RC > 0 instead of
closed==false. Ie, a reader may still be used as long as its RC is
still non-zero.


Michael McCandless added a comment - 20/Oct/07 11:05 AM

I think we are forced to keep this semantics, for backwards
compatibility. But I don't really think MultiReader/ParallelReader
should actually be this aggressive. Maybe in the future we can add
ctors for MultiReader/ParallelReader that accept a "doClose" boolean
to turn this off.

Actually I retract this: it's no longer necessary as long as we change
ensureOpen to assert that RC > 0 instead of closed==false.

I think this is actually a nice unexpected side-effect of using
reference counting: it resolves this overly aggressive behavior of
MultiReader/ParallelReader.


Michael Busch added a comment - 22/Oct/07 07:36 PM

With that change, plus the change below, your example works fine.

Two things:

  • MultiReader/ParallelReader must not incref the subreaders on open()
    as you said. But on reopen() it must incref the subreaders that
    haven't changed and thus are shared with the old MultiReader/
    ParallelReader. This further means, that the re-opened MultiReader/
    ParallelReader must remember which of the subreaders to decref on
    close(), right?
  • If we change ensureOpen() like you suggest, then the user might
    still be able to use reader1 (in my example), even after
    reader1.close() was explicitly called. Probably not a big deal?

Michael McCandless added a comment - 22/Oct/07 08:36 PM
  • MultiReader/ParallelReader must not incref the subreaders on open()
    as you said. But on reopen() it must incref the subreaders that
    haven't changed and thus are shared with the old MultiReader/
    ParallelReader. This further means, that the re-opened MultiReader/
    ParallelReader must remember which of the subreaders to decref on
    close(), right?

Hmm, right. MultiReader/ParallelReader must keep track of whether it
should call decref() or close() on each of its child readers, when it
itself is closed.

  • If we change ensureOpen() like you suggest, then the user might
    still be able to use reader1 (in my example), even after
    reader1.close() was explicitly called. Probably not a big deal?

I think this is OK?


Michael Busch added a comment - 22/Oct/07 09:35 PM

I think this is OK?

This was essentially the reason why I suggested to use two refcount values:
one to control when to close a reader, and one to control when to close
it's (shared) resources in case of SegmentReader. That approach would not
alter the behaviour of IndexReader.close().
But I agree that your approach is simpler and I also think it is okay to
change ensureOpen() and accept the slight API change.

So I'll go ahead and implement the refcount approach unless anybody objects.

Oh and Mike, thanks for bearing with me


Yonik Seeley added a comment - 22/Oct/07 09:54 PM
What about a new constructor for MultiReader/ParallelReader that implements more sensible semantics (increment refcount on readers passed to it, and decrement on close())?

Michael Busch added a comment - 22/Oct/07 11:06 PM

What about a new constructor for MultiReader/ParallelReader that implements more sensible semantics (increment refcount on readers passed to it, and decrement on close())?

Yeah, when reference counting is implemented then such a constructor should be easy to add.


Michael McCandless added a comment - 22/Oct/07 11:31 PM

Oh and Mike, thanks for bearing with me

Thank you for bearing with me!

What about a new constructor for MultiReader/ParallelReader that implements more sensible semantics (increment refcount on readers passed to it, and decrement on close())?

+1


Michael Busch added a comment - 29/Oct/07 08:53 PM
Ok here is the next one ...

This patch implements the refCounting as discussed with Mike and Yonik
above.

Other changes/improvements/comments:

  • ensureOpen() is now also called in MultiReader.reopen() and
    ParallelReader.reopen(). (thanks, Mike)
  • in case an exception occurs during reopen() it is taken care of
    closing or decreasing the refCount of already created readers.
    Also old readers should not be affected in case an exception occurs.
  • I improved how norms are re-opened in a MultiSegmentReader (MSR).
    It now checks which parts of the normsCache haven't changed and
    copies those to the new normsCache. Because I'm imagining Yonik with
    his thread-safety hat on now , another comment about this: In case
    a MSR is refreshed, then the synchronized MSR.reopen() method has the
    lock on the old MSR. This method creates the new MSR and the values
    from the old cache are copied to the new cache in the constructor, so
    while the lock on the old MSR is still being held.
  • added new constructors to MultiReader and ParallelReader that
    increase the refCount on the subReaders and thus prevent closing the
    subReaders on close(). (thanks, Yonik)

I also made the changes suggested by Hoss (thanks!):

  • changed the "successfully reopened" comments int the javadocs
  • added comments to the javadocs saying that write operations on the
    re-opened reader will result in undefined behavior unless the old
    reader is closed
  • FilterIndexReader.reopen() not implemented, i. e. will throw an
    UnsupportedOperationException.

All unit tests pass.


Michael McCandless added a comment - 30/Oct/07 08:55 AM
Patch looks great! I'm still working through it but found a few small
issues...

It might be good to put a "assert refCount > 0" at various places like
decRef(), incRef(), ensureOpen()? That would require changing the
constructors to init refCount=1 rather than incRef() it to 1.

I'm seeing a failure in contrib/memory testcase:

[junit] *********** FILE=./NOTICE.txt
    [junit] Fatal error at query=Apache, file=./NOTICE.txt, anal=org.apache.lucene.analysis.SimpleAnalyzer@341960
    [junit] ------------- ---------------- ---------------
    [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest):	Caused an ERROR
    [junit] this IndexReader is closed
    [junit] org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    [junit] 	at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:158)
    [junit] 	at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:632)
    [junit] 	at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
    [junit] 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
    [junit] 	at org.apache.lucene.search.Searcher.search(Searcher.java:118)
    [junit] 	at org.apache.lucene.search.Searcher.search(Searcher.java:97)
    [junit] 	at org.apache.lucene.index.memory.MemoryIndexTest.query(MemoryIndexTest.java:412)
    [junit] 	at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:313)
    [junit] 	at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:234)

I think it's because MemoryIndexReader (private class in
MemoryIndex.java) calls super(null) =
IndexReader.IndexReader(Directory) in its constructor, which does not
initialize the refCount to 1? If I insert incRef() into
IndexReader.IndexReader(Directory) constructor, the test passes, but
who else is using that constructor (ie will this double-incref in
those cases?).


Michael McCandless added a comment - 30/Oct/07 11:41 AM
OK I think this patch is very close! I finished reviewing it –
here's some more feedback:
  • In multiple places you catch an IOException and undo the attempted
    re-open, but shouldn't this be a try/finally instead so you also
    clean up on hitting any unchecked exceptions?
  • I think you need an explicit refCount for the Norm class in
    SegmentReader.
    .
    Say I've done a chain of 10 re-opens for SegmentReader and each
    time only the segment's norms has changed. I've closed all but
    the last SegmentReader. At this point all 10 SegmentReaders are
    still alive (RC > 0) and holding open all file handles for their
    copies of the norms. So this will leak file handles/RAM with each
    reopen?
    .
    To fix this, I think you just need to add refCount into Norm class
    & set refCount to 1 in the constructor. Then, each each
    SegmentReader calls Norm.decRef(), not Norm.close(), when it's
    done. When refCount hits 0 then the Norm closes itself. Finally,
    during re-open you should share a Norm instance (rather than open
    a new one) if it had not changed from the previous SegmentReader.
    .
    For singleNormStream, I think each reopened SegmentReader should
    always re-open this descriptor and then we can forcefully close
    this stream when the SegmentReader is closed (what you are doing
    now). Ie the SegmentReader fully owns singleNormStream.
  • If you have a long series of reopens, then, all SegmentReaders in
    the chain will remain alive. So this is a [small] memory leak
    with time. I think if you changed referencedSegmentReader to
    always be the starting SegmentReader then this chain is broken
    and after 10 reopens only the original SegmentReader and the most
    recent one will remain alive (assuming I closed all SegmentReaders
    but the most recent one).

Michael Busch added a comment - 31/Oct/07 07:02 AM

Patch looks great! I'm still working through it but found a few small
issues...

Thanks Mike! Very good review and feedback!

It might be good to put a "assert refCount > 0" at various places like...

Agreed. I added those asserts to incRef() and decRef(). I didn't add it
to ensureOpen(), because it throws an AlreadyClosedException anyway, and
some testcases check if this exception is thrown.

I'm seeing a failure in contrib/memory testcase:

Oups, I fixed this already. I changed the (deprecated) ctr
IndexReader.IndexReader(Directory) to call this() which sets the refCount
to 1. The test passes then. I made this fix yesterday, I think I just
forgot to update the patch file before I submitted it, sorry about this.

  • In multiple places you catch an IOException and undo the attempted
    re-open, but shouldn't this be a try/finally instead so you also
    clean up on hitting any unchecked exceptions?

Yes of course! Changed it.

  • I think you need an explicit refCount for the Norm class in
    SegmentReader.

OK I see. I made this change as well. I also made the change that
there is no chain, but one starting SegmentReader which all re-opened
ones reference (see below). Now this starting SegmentReader won't close
its norms until all other readers that reference it are closed (RC=0),
because only then doClose() is called, which calls closeNorms().
Do you see an easy way how to improve this?
Hmm, probably I have to definalize IndexReader.incRef() and decRef()
and overwrite them in SegmentReader. Then SegmentReader.incRef() would
also incRef the norms, SegmentReader.decref() would decref the norms,
and somehow a clone that shares references the reader but not the norms
(because they changed) would only incref the reader itself, but not
the norms... Or do you see an easier way?

  • If you have a long series of reopens, then, all SegmentReaders in
    the chain will remain alive. So this is a [small] memory leak
    with time. I think if you changed referencedSegmentReader to
    always be the starting SegmentReader then this chain is broken

Good point. Ok I changed this and also the test cases that check the refCount
values.


Michael McCandless added a comment - 31/Oct/07 06:04 PM
Looks great! All tests pass for me.

OK I see. I made this change as well. I also made the change that
there is no chain, but one starting SegmentReader which all re-opened
ones reference (see below). Now this starting SegmentReader won't close
its norms until all other readers that reference it are closed (RC=0),
because only then doClose() is called, which calls closeNorms().
Do you see an easy way how to improve this?

How about if SegmentReader.close() always calls Norm.decRef(),
immediately, for each Norm is has open? EG you could implement
doCloseUnsharedResources in SegmentReader and do it there). This way,
if the SegmentReader has been closed but it shares resources (and not
the Norms) with reopened SegmentReaders then its Norms would all
decRef to 0 & be closed.

Also make sure that if a SegmentReader is decRef'd to 0 and close was
never called, that also in this case you remember to call Norm.decRef
for all open norms.

One more comment: I think you can partially share Norm instances? Eg
if I have 2 fields that have norms, but only one of them changed since
I opened this SegmentReader, then the reopened SegmentReader could
share the Norm instance of the field that didn't change with the old
SegmentReader? But right now you're re-loading all the Norms.

Otherwise no more comments!


Michael Busch added a comment - 31/Oct/07 10:56 PM

How about if SegmentReader.close() always calls Norm.decRef(),
immediately, for each Norm is has open? EG you could implement
doCloseUnsharedResources in SegmentReader and do it there). This way,

Hmm I was thinking about this before (that's actually why I put that
method in there). But I don't think this is gonna work. For example,
let's say we use a MultiReader that has two SegmentReader SR1 and SR2.
Now only SR2 changed, we reopen the MR which increases the refCount on
SR1, because it shares that SR. Now we close the old MultiReader, which
calls close() on SR1. If now SegmentReader.close() calls Norm.decRef(),
then it will close the norms even though they are still used by the new
MultiReader.


Michael Busch added a comment - 31/Oct/07 10:56 PM

One more comment: I think you can partially share Norm instances? Eg

Good idea! Will make the change.


Michael McCandless added a comment - 01/Nov/07 01:12 PM

Hmm I was thinking about this before (that's actually why I put that
method in there). But I don't think this is gonna work. For example,
let's say we use a MultiReader that has two SegmentReader SR1 and SR2.
Now only SR2 changed, we reopen the MR which increases the refCount on
SR1, because it shares that SR. Now we close the old MultiReader, which
calls close() on SR1. If now SegmentReader.close() calls Norm.decRef(),
then it will close the norms even though they are still used by the new
MultiReader.

Ugh, you're right. The challenge is sometimes a reference to SR means
"I will use norms" (this is when MultiReader incRefs) but other times
it means "I will not use norms" (this is when SR incRefs due to
reopen).

OK, I like your original proposal: SR overrides incRef() and incrs its
RC as well as the RC for each Norm it's using. Then, in SR's
reopenSegment, you carefully incRef the "original" SR without
incRef'ing its Norms (except for those Norms you will keep).

Likewise, SR overrides decRef() to decr its RC and RC for each Norm.
But, when a reopened SR1.doClose() is called, you must carefully
decRef the RD of the original SR but not decRef each of its Norms
(except for those you had actually shared).

This way when MR calls SR.incRef/decRef then all Norms and the SR's RC
are incr'd/decr'd. But when SR1 shares resources with an original SR
it only incr's/decr's the refCount of the SR.


Michael Busch added a comment - 02/Nov/07 09:53 AM
OK, I think it's finally working now!

SegmentReader now overwrites incRef() and increments the readers RC,
as well as the RCs of all norms. I further added the private method
incRefReaderNotNorms() to SegmentReader, which is called in
reopenSegment(), because it takes care of incrementing the RCs of
all shared norms.

I also added the method doCloseUnsharedResources() to IndexReader,
which is a NOOP by default. It is called when a reader is closed,
even if its RC > 0. SegmentReader overwrites this method and
closes (=decRef) the norms in it. The SegmentReader then remembers
that it closed the norms already and won't close them again in
doClose(), which is called when its RC finally drops to 0.

I also made the change you suggested, Mike, to only reload the
field norms that actually changed. SegmentReader.openNorms() now
checks if there is already a norm for a field in the HashTable,
and only loads it if it's not there. reopenSegment() puts all
norms in the new SegmentReader that haven't changed.

I added some new tests to verify the norms ref counting. All unit
tests pass now. So I think this is ready to commit, but I'd feel
more comfortable if you could review it again before I commit it.


Yonik Seeley added a comment - 02/Nov/07 02:09 PM
I just did a quick partial review of SegmentReader for thread safety only and ran across some potential issues
  • It looks like fieldsReader is shared between clones(), and that isn't thread-safe (synchronization is done at the SegmentReader level, and now there is more than one)
  • maybe the same issue with deletedDocs? mutual exclusion is no longer enforced.
  • it looks like the norms Hashtable could be shared... looping over the individual norms and calling incRef doesn't seem safe for a number of reasons (for example, you might miss some just being added)
  • reading new norms isn't safe...
    synchronized norms(String field, byte[] bytes, int offset) uses the "norm' IndexInput that is shared. synchronization on a single reader no longer guarantees mutual exclusion.

There's probably other stuff, but I stopped looking. Since we are sharing things now, every method that was synchronized is now potentially unsafe. Synchronizing on the object being shared is probably a much better strategy now.

This is complex enough that in addition to review, I think we need a good multi-threaded test - 100 or 1000 threads over a ram directory, all changing, querying, retrieving docs, reopening, closing, etc.


Yonik Seeley added a comment - 02/Nov/07 02:37 PM
It also looks like Norm.incRef is used in an unsafe manner (unsynchronized, or synchronized on the reader), and also Norm.decRef() is called inside a synchronized(norms) block, but an individual Norm may be shared across multiple Hashtables, right?

I don't think that norms even needs to be a synchronized Hashtable... it could be changed to a HashMap since it's contents never change, right?


Michael McCandless added a comment - 02/Nov/07 02:43 PM

OK, reviewed the latest patch:

  • In this code:
    // singleNormFile means multiple norms share this file
    if (fileName.endsWith("." + IndexFileNames.NORMS_EXTENSION)) {
      clone.singleNormStream = d.openInput(fileName, readBufferSize);            
    }

    I think the comment should be removed (it doens't apply) and also
    won't this incorrectly open the singleNormStream more than once if
    more than one field does not have separate norms? I think you should
    init that to null and then only reopen it, once, if it's still null?

  • In MultiSegmentReader, the logic that copies over unchanged norms
    from the old norms cache can be simplified. I think you can just
    look up the old Norm instance & the new Norm instance and if they
    are == then you can copy bytes over? This would also let you remove
    "sharedNorms" entirely which is good because it's not a just a
    boolean thing anymore since some Norm instances are shared and some
    aren't.
  • I think you also need to override decRef (and add
    decRefReaderNotNorms) to SegmentReader? Because now there is a
    mismatch: incRef incr's the Norm RC's, but, decRef does not. So I
    think norms are not getting closed? I think we should modify the
    "assertReaderClosed()" in the unit test to verify (when appropriate)
    that also the RC of all Norm instances is also 0 (ie
    assertTrue(SR.normsClosed())). Then, make sure SR calls
    referencedSegmentReader.decRefReaderNotNorms instead of decRef. I
    think you then don't need to track "closedNorms" boolean, at all.
    You simply always decRef the norms whenever SR.decRef is called.
    The doCloseUnsharedResources is still needed to close the
    singleNormStream.

Michael McCandless added a comment - 02/Nov/07 02:47 PM

This is complex enough that in addition to review, I think we need a
good multi-threaded test - 100 or 1000 threads over a ram directory,
all changing, querying, retrieving docs, reopening, closing, etc.

+1

We should fix all the synchronization issues you've found, create this
unit test, and then iterate from there.


Yonik Seeley added a comment - 02/Nov/07 02:57 PM

We should fix all the synchronization issues you've found, create this
unit test, and then iterate from there.

Or reverse it... write the test first so we have confidence that it can at least uncover some of these issues.
The test should do as little synchronization as possible of it's own so it doesn't mask a lack of synchronization in the core.
It should be possible to uncover the unsynchronized concurrent use of IndexInput at least, and hopefully some of the refcounting issues too.


Michael McCandless added a comment - 02/Nov/07 03:08 PM

Or reverse it... write the test first so we have confidence that it can at least uncover some of these issues.
The test should do as little synchronization as possible of it's own so it doesn't mask a lack of synchronization in the core.
It should be possible to uncover the unsynchronized concurrent use of IndexInput at least, and hopefully some of the refcounting issues too.

Excellent, I agree!


Michael Busch added a comment - 02/Nov/07 04:32 PM

I just did a quick partial review of SegmentReader for thread safety only and ran across some potential issues

OK, let's scratch my "ready to commit" comment

A question about thread-safety here. I agree that we must
fix all possible problems concerning two or more
IndexReaders in read-mode, like the FieldsReader issue.

On the other hand: We're saying that performing write
operations on a re-opened reader results in undefined
behavior. Some of the issues you mentioned, Yonik, should
only apply in case one of the shared readers is used to
perform index modifications, right? Then the question is:
how much sense does it make to make reopen() thread-safe
in the write case then?

So I think the multi-threaded testcase should not
perform index modifications using readers involved in a
reopen()?


Yonik Seeley added a comment - 02/Nov/07 04:58 PM
Sorry, I hadn't kept up with this issue wrt what was going to be legal (and we should definitely only test what will be legal in the MT test). So that removes the deletedDocs issue I guess.

Thomas Peuss added a comment - 05/Nov/07 07:34 AM
To find concurrency issues with an unit test is hard to do, because your potential problems lie in the time domain and not in the code domain.

From my experience following things can have impact on the results of such a test:

  • Running on SP or SMP machines. SMP machines (the more cores the better) reveal concurrency issues much earlier.
  • The Java implementation you are using. IBM's and Sun's thread implementations behave slightly different for example.
  • The OS you are running. This may seem odd in the first run but remember that modern Java implementations rely heavily on the threading implementations of the OS.
  • The processor platform you are running. NUMA vs. UMA (which is AMD vs. intel). The timing of threads can differ because of this.

And be prepared that one time your tests runs through without a problem and on the next run it breaks...

Just my € 0.02


Michael Busch added a comment - 10/Nov/07 06:47 AM
Changes in this patch:
  • Fixed ParallelReader and MultiReader so that they don't incRef the subreaders anymore in case reopen() is a NOOP (i. e. reopen() doesn't return a new instance)
  • In the new ctr in MultiSegmentReader it was possible to hit a NullPointerException during filling the norms cache, because I didn't check for null after retrieving the old reader from the HashMap. I fixed this.
  • SegmentReader now also overwrites decRef() and implements decRefReaderNotNorms().
  • As Mike suggested I removed "boolean sharedNorms" from SegmentReader. Now in MultiSegmentReader I compare the norm instances from the old and the new subReaders and copy the bytes to the new cache in case they are ==.
  • In SegmentReader I changed norms to be a HashMap instead of HashTable.
  • Norm.decRef() and Norm.incRef() are synchronized now.
  • SegmentReader#norms(String field, byte[] bytes, int offset) now synchronizes on the norm object that is to be read.
  • SegmentReader#reopen() now opens a new FieldsReader because it is not thread-safe.
  • SegmentReader.Norm has a new boolean variable "useSingleNormStream". SegmentReader#norms(String field, byte[] bytes, int offset) checks if it is true. If yes, then the readers' singleNormStream is used, otherwise norm.in. This is necessary so that a reopened SegmentReader always uses its own singleNormStream and to avoid synchronization on the singleNormStream.
  • I added a bunch of code to TestIndexReaderReopen to test the thread-safety of reopen(). It starts 150 threads: some modify the index (some delete docs, some add docs and some set norms), some reopen readers and check if the re-opened ones deliver the same results as fresh ones.
  • assertReaderClosed now checks if the norms are closed and also checks recursively if all subReaders are closed.

Still outstanding:

  • On the IBM JVM all tests pass. On Sun, the thread-safety test sometimes fails. When it fails, then in assertReaderClosed, because the refCounts of either the norms or some subReaders aren't 0 at the end of the test. At this point I'm not sure why and I'm still debugging. I just wanted to submit the patch to give others the chance to review the patch or possibly (hopefully) find the problem before me.

Michael Busch added a comment - 12/Nov/07 10:28 AM
Changes:
  • Updated patch to current trunk (I just realized that the
    latest didn't apply cleanly anymore)
  • MultiSegmentReader now decRefs the subReaders correctly
    in case an exception is thrown during reopen()
  • Small changes in TestIndexReaderReopen.java

The thread-safety test still sometimes fails. The weird
thing is that the test verifies that the re-opened
readers always return correct results. The only problem
is that the refCount value is not always 0 at the end
of the test. I'm starting to think that the testcase
itself has a problem? Maybe someone else can take a look

  • it's probably something really obvious but I'm already
    starting to feel dizzy while pondering about
    thread-safety.

Michael McCandless added a comment - 12/Nov/07 09:34 PM
I think the cause of the intermittant failure in the test is a missing
try/finally in doReopen to properly close/decRef everything on
exception.

Because of lockless commits, a commit could be in-process while you
are re-opening, in which case you could hit an IOexception and you
must therefore decRef those norms you had incRef'd (and, close eg the
newly opened FieldsReader).


Michael Busch added a comment - 13/Nov/07 12:14 AM
> I think the cause of the intermittant failure in the test is a missing
> try/finally in doReopen to properly close/decRef everything on
> exception.

Awesome! Thanks so much for pointing me there, Mike! I was getting a
little suicidal here already ...

I should have read the comment in SegmentReader#initialize more
carefully:

} finally {

      // With lock-less commits, it's entirely possible (and
      // fine) to hit a FileNotFound exception above.  In
      // this case, we want to explicitly close any subset
      // of things that were opened so that we don't have to
      // wait for a GC to do so.
      if (!success) {
        doClose();
      }
    }

While debugging, it's easy to miss such an exception, because
SegmentInfos.FindSegmentsFile#run() ignores it. But it's good that it
logs such an exception, I just have to remember to print out the
infoStream next time.

So it seems that this was indeed the cause for the failing test case.
I made the change and so far the tests didn't fail anymore (ran it
about 10 times so far). I'll run it another few times on a different
JVM and submit an updated patch in a short while if it doesn't fail
again.


Michael Busch added a comment - 13/Nov/07 01:26 AM
OK, all tests pass now, including the thread-safety test.
I ran it several times on different JVMs.

Changes:

  • As Mike suggested I added a try ... finally block to
    SegmentReader#reopenSegment() which cleans up after an
    exception was hit.
  • Added some additional comments.
  • Minor improvements to TestIndexReaderReopen

Michael McCandless added a comment - 13/Nov/07 07:18 PM

Awesome! Thanks so much for pointing me there, Mike! I was getting a
little suicidal here already ...

No problem, I lost some hairs tracking that one down too!!

OK, latest patch looks good! I love the new threaded unit test.

Only two smallish comments:

  • You should also close fieldsReader when referencedSegmentReader !=
    null, right? (in SegmentReader.doClose)
  • In the new try/finally in reopenSegment: if you first setup
    referencedSegmentReader, then can't that finally clause just be
    clone.decRef() instead of duplicating code for decRef'ing norms,
    closeNorms(), etc.?

Yonik Seeley added a comment - 13/Nov/07 07:54 PM
So how about a public IndexReader.flush() method so that one could also reopen readers that were used for changes?

Usecase:

reader.deleteDocument()
reader.flush()
writer = new IndexWriter()
writer.addDocument()
writer.close()
reader.reopen()
reader.deleteDocument()


Michael Busch added a comment - 14/Nov/07 06:02 AM
  • You should also close fieldsReader when referencedSegmentReader !=
    null, right? (in SegmentReader.doClose)

Yes, will do!

  • In the new try/finally in reopenSegment: if you first setup
    referencedSegmentReader, then can't that finally clause just be
    clone.decRef() instead of duplicating code for decRef'ing norms,
    closeNorms(), etc.?

Hmm, what if then in clone.close() an exception is thrown from
FieldsReader.close() or singleNormStream.close(). In that case it
would not decRef the referenced reader.

Hmm but actually we could change the order in close() so that
referencedSegmentReader.decRefReaderNotNorms() is done first even
if the following close() operations don't succeed?


Michael Busch added a comment - 14/Nov/07 06:05 AM

So how about a public IndexReader.flush() method

Since our goal is it to make IndexReader read-only in the future
(LUCENE-1030), do you really think we need to add this?


Michael McCandless added a comment - 14/Nov/07 10:10 AM

Hmm but actually we could change the order in close() so that
referencedSegmentReader.decRefReaderNotNorms() is done first even
if the following close() operations don't succeed?

+1


Michael McCandless added a comment - 14/Nov/07 10:59 AM

So how about a public IndexReader.flush() method

I think also if we do decide to do this we should open a new issue for it?


Yonik Seeley added a comment - 14/Nov/07 01:50 PM
> Since our goal is it to make IndexReader read-only in the future
> (LUCENE-1030), do you really think we need to add this?

flush() would make reopen() useful in more cases, and LUCENE-1030 is further off (not Lucene 2.3, right?)
Anyway, flush() would be considered a write operation like setNorm() & deleteDocument() and could be deprecated along with them in the future if that's how we decide to go.

> I think also if we do decide to do this we should open a new issue for it?

Yes, that's fine.


Michael Busch added a comment - 14/Nov/07 03:57 PM

I think also if we do decide to do this we should open a new issue for it?

+1

I'll open a new issue.


Michael Busch added a comment - 14/Nov/07 08:59 PM
Changes:
  • Close FieldsReader in SegmentReader#doClose() even if
    referencedReader!=null
  • Call clone.decRef() in the finally clause of
    SegmentReader#reopenSegment()
  • decRef referencedReader before closing other resources
    in SegmentReader#doClose()
  • Removed IndexReader#doCloseUnsharedResources().

Michael McCandless added a comment - 15/Nov/07 02:22 PM
Patch looks good. Only thing I found was this leftover
System.out.println, in SegmentReader.java:
System.out.println("refCount " + getRefCount());

Michael Busch added a comment - 15/Nov/07 03:02 PM
Thanks for the review, Mike! I'll remove the println.

Ok, I think this patch has been reviewed a bunch of times and
should be ready to commit now. I'll wait another day and commit
it then if nobody objects.


Michael Busch added a comment - 17/Nov/07 08:44 PM
Changes:
  • Updated to current trunk.
  • Removed println in SegmentReader.

I'm going to commit this soon!


Michael Busch added a comment - 17/Nov/07 09:38 PM
Committed! Phew!!!