Issue Details (XML | Word | Printable)

Key: LUCENE-986
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael Busch
Reporter: Michael Busch
Votes: 1
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Refactor segmentInfos from IndexReader into its subclasses

Created: 22/Aug/07 08:20 AM   Updated: 25/Jan/08 03:24 AM
Return to search
Component/s: Index
Affects Version/s: None
Fix Version/s: 2.3

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works lucene-986.patch 2007-09-18 08:24 PM Michael Busch 48 kB
Text File Licensed for inclusion in ASF works lucene-986.patch 2007-09-18 08:18 AM Hoss Man 42 kB
Text File Licensed for inclusion in ASF works lucene-986.patch 2007-09-13 06:54 AM Michael Busch 40 kB
Issue Links:
Dependants
 

Lucene Fields: New
Resolution Date: 20/Sep/07 07:30 AM


 Description  « Hide
References to segmentInfos in IndexReader cause different kinds of problems
for subclasses of IndexReader, like e. g. MultiReader.

Only subclasses of IndexReader that own the index directory, namely
SegmentReader and MultiSegmentReader, should have a SegmentInfos object
and be able to access it.

Further information:
http://www.gossamer-threads.com/lists/lucene/java-dev/51808
http://www.gossamer-threads.com/lists/lucene/java-user/52460

A part of the refactoring work was already done in LUCENE-781



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Michael Busch made changes - 22/Aug/07 11:21 PM
Field Original Value New Value
Link This issue blocks LUCENE-743 [ LUCENE-743 ]
Hoss Man added a comment - 23/Aug/07 04:46 AM
one aspect of this that should be considered: It may not make sense for MultiReader to extend MultiSegmentReader ... as Michael says, only subclasses that own the index directory should have segmentInfos, and a MultiReader (as defined on the trunk now) can never own it's own directory.

I haven't worked through all the implications, but perhaps the most logical refactoring would be...

  • IndexReader
    ...as abstract as possible given that we can't actually make methods abstract
  • DirectoryIndexReader extends IndexReader
    ...new class, encapsulated all the segmentInfos and locking logic currently in
    IndexReader (can definitely be made abstract if helpful)
  • SegmentReader extends DirectoryIndexReader
  • MultiSegmentReader extends DirectoryIndexReader
  • ParallelIndexReader extends IndexReader
  • FilterIndexReader extends IndexReader
  • MultiReader extends IndexReader
    ...(side note that i really haven't thought through completley: should
    MultiReader extend FilterIndexReader?)

there would likely be some utlity functionality that could be reused between MultiReader and MultiSegmentReader ... possible as static methods in IndexReader (or a new util class)


Michael Busch added a comment - 23/Aug/07 06:19 AM
This is good stuff, Hoss. I like the DirectoryIndexReader idea.
Sounds like a very clean and logical separation.

I'll work through the code to understand all consequences.


Michael Busch added a comment - 30/Aug/07 05:35 AM
What do you think about this alternative approach:
  • SegmentReader, MultiSegmentReader and MultiReader all extend IndexReader
  • IndexReader.open() always returns a MultiSegmentReader instance, even
    if the index has only one segment. We can make some optimizations, so that
    e. g. MultiSegmentReader.termDocs() returns a SegmentTermDocs in case
    there's only one underlying SegmentReader.
  • All write lock and transaction logic goes into MultiSegmentReader, as
    it is the only reader that has to obtain a write lock.

The advantage here is that only one class (MultiSegmentReader) is
responsible for acquiring the write lock. It would also make
IndexReader.open() simpler, because it can then return a
MultiSegmentReader instance in all cases. The logic of opening the
different segments could move from IndexReader.open() to the constructor
of MultiSegmentReader. Also LUCENE-743 (IndexReader.reopen()) would
become simpler.
The disadvantage of this approach is apparently that always using
a MultiSegmentReader would add some overhead when reading optimized
indexes, but actually I don't think that this overhead would really be
significant.

If we go the DirectoryIndexReader way (where SegmentReader and
MultiSegmentReader both extend DirectoryIndexReader), then we still need
an ownDirectory variable that tells SegmentReader if it has to acquire a
write lock or not, depending on whether it is a subreader of
MultiSegmentReader or not. And of course we add another layer to the
IndexReader hierachy.

So I'm not sure which approach is the better one. I'm hoping to get some
opionions here! Hoss? Others?


Hoss Man added a comment - 30/Aug/07 06:13 AM
i rarely remember a week later what i was thinking when i wrote something, but i suspect that when i suggested the DirectoryIndexReader i was assuming it would have everything directory/lock related thta currently exists in the IndexReader base class (including the directoryOwner boolean) ... in cases where there is a single Segment in a directory, there will be SegmentReader with directoryOwner==true ... in the multi segment cases, the MultiSegmentReader will have directoryOwner==true, and it's sub SegmentReaders will all have directoryOwner==false. ...

...the key point of DirectoryIndexReader being that any subclass can own a directory (and automaticly inherits all the code for dealing with locks properly when it needs/wants to) but doesn't always have to own the directory. meanwhile MultiReader (and ParallelIndexReader and FilteredIndexReader) make no attempt at owning a directory, and inherit no code for doing so (or for dealing with the locking of such non existent directories)

I don't really know enough about the performance characteristics of SegmentReader vs a MultiSegmentReader of one segment to have a sense of how possible/practical it would be to eliminate the need for SegmentReader and replace it completely with MultiSegmentReader ...

one hitch might be that SegmentReader.get is public, and in order to keep supporting it, SegmentReader still needs to have/inherit the same segment info and directory owning/locking code that we want to move out of IndexReader (so just putting it MultiSegmentReader won't fly unless we kill that public method)


Michael Busch added a comment - 13/Sep/07 06:54 AM
Here is the patch with the DirectoryIndexReader approach.

It moves SegmentInfos and Directory from IndexReader into
DirectoryIndexReader, as well as the commit and rollback logic.

MultiSegmentReader and SegmentReader extend DirectoryIndexReader.
MultiReader extends IndexReader now and uses some methods from
MultiSegmentReader that I made static.

I added the method acquireWriteLock() to IndexReader that does
nothing by default. Subclasses must implement it if they require
a write lock (so does DirectoryIndexReader now).

IndexReader is very abstract now and almost all logic moved into
the subclasses. The methods isOptimized(), isCurrent(), and
getVersion() all throw an UnsupportedOperationException (UOE). I
further deprecated the IndexReader constructor that takes a
Directory as argument. As soon as we remove this ctr the method
IndexReader.getDirectory() should throw an UOE as well. I added
a TODO comment as a reminder.

All unit tests pass with this patch.


Michael Busch made changes - 13/Sep/07 06:54 AM
Attachment lucene-986.patch [ 12365691 ]
Michael Busch added a comment - 18/Sep/07 02:23 AM
> one hitch might be that SegmentReader.get is public, and in order to keep
> supporting it, SegmentReader still needs to have/inherit the same segment info
> and directory owning/locking code that we want to move out of IndexReader (so
> just putting it MultiSegmentReader won't fly unless we kill that public method).

OK, I implemented the DirectoryIndexReader approach. Also because I'm not sure
about the performance characteristics of a MultiSegmentReader acting as a
SegmentReader.

I'd like to commit this rather soon. A review of the patch would be highly
appreciated.


Hoss Man added a comment - 18/Sep/07 06:20 AM
Michael: I've been meaning to look at this, but haven't had the time ... your recent update has goaded me

just to clarify: the patch you added on September 12th is your latest patch right? ... it's not clear from you comment on the 17th if you intended to attach an update and something went wrong.

I ask because i'm haivng trouble applying the patch from the 12th ... i must be tired because i can't understand why, it doesn't look like the files have changed since you posted the patch, so i'm not sure what it's complaining about ... visually everything seems to match up...

hossman@coaster:~/lucene/lucene$ svn update
At revision 576718.
hossman@coaster:~/lucene/lucene$ svn status
? lucene-986.patch
hossman@coaster:~/lucene/lucene$ patch --dry-run -p0 -i lucene-986.patch patching file src/java/org/apache/lucene/index/DirectoryIndexReader.java
patching file src/java/org/apache/lucene/index/FilterIndexReader.java
patching file src/java/org/apache/lucene/index/IndexReader.java
patching file src/java/org/apache/lucene/index/MultiReader.java
Hunk #1 FAILED at 19.
1 out of 2 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiReader.java.rej
patching file src/java/org/apache/lucene/index/MultiSegmentReader.java
Hunk #1 FAILED at 30.
Hunk #2 FAILED at 39.
Hunk #3 FAILED at 156.
Hunk #4 FAILED at 171.
Hunk #5 FAILED at 256.
Hunk #6 FAILED at 278.
6 out of 6 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiSegmentReader.java.rej
patching file src/java/org/apache/lucene/index/ParallelReader.java
patching file src/java/org/apache/lucene/index/SegmentReader.java
Hunk #1 succeeded at 32 with fuzz 2.
patching file src/test/org/apache/lucene/index/TestMultiReader.java
hossman@coaster:~/lucene/lucene$


Hoss Man added a comment - 18/Sep/07 08:18 AM
I got the patch to apply cleanly (see mailing list for details) On the whole it looks really good, i'm attaching an updated version with some minor improvements (mainly javadocs), but first a few questions...
  • just to clarify: IndexReader(Directory) is only around for
    backwards compatibility in subclasses right? And the only reason
    for the "private Directory" is to keep supporting the directory()
    method for any subclasses relying on it?
  • IndexReader() says it can be removed once the other constructor is
    removed ... why? is that just pointing out that we can rely on the
    default constructor?
  • since one of the goals is that IndexReaders which don't own their
    directory shouldn't utilize segmentInfos, would it make sense to
    eliminate directoryOwner from DirectoryIndexReader and instead test
    whether (null == segmentInfos) ?
  • the way TestMultiReader is setup with the "mode" is a bit confusing
    ... perhaps instead we should make a subclass where only openReader
    is overwritten (it will inherit the individual test methods) ?

here's the list of tweaks I made...

  • added a note in the IndexReader class javadocs about
    methods that throw UnSupOp but aren't abstract.
  • added javadocs to the IndexReader(Directory) constructor based on my
    understanding
  • added back javadocs to IndexReader methods that now throw UnSupOp to
    make it clear what they do to callers looking at the API, but made
    it clear tthe default impls throw UnSupOp
  • made IndexReader.directory() throw UnSupOp if directory is null,
    enhanced it's javadocs.
  • /* NOOP */ in IndexReader.acquireWriteLock so it's clear to code analysis tools that it's not a mistake.
  • small additions to javadocs for DirectoryIndexReader class, but
    these should probably be elaborated on (depending on what you think
    of my segmentInfos==null idea above)
  • made DirectoryIndexReader(...) call init(...) to eliminate a small
    about of duplication.

Hoss Man made changes - 18/Sep/07 08:18 AM
Attachment lucene-986.patch [ 12366073 ]
Michael Busch added a comment - 18/Sep/07 07:34 PM
> I got the patch to apply cleanly (see mailing list for details)

Thanks, Hoss! I'm using TortoiseSVN, I have to check how to set those
default properties for new files correctly.

> * just to clarify: IndexReader(Directory) is only around for
> backwards compatibility in subclasses right? And the only reason
> for the "private Directory" is to keep supporting the directory()
> method for any subclasses relying on it?

Yes, correct.

> * IndexReader() says it can be removed once the other constructor is
> removed ... why? is that just pointing out that we can rely on the
> default constructor?

Yes, just a suggested simplification. Keeping the constructor wouldn't hurt
though.

> * since one of the goals is that IndexReaders which don't own their
> directory shouldn't utilize segmentInfos, would it make sense to
> eliminate directoryOwner from DirectoryIndexReader and instead test
> whether (null == segmentInfos) ?

Sounds good, will do...

> * the way TestMultiReader is setup with the "mode" is a bit confusing
> ... perhaps instead we should make a subclass where only openReader
> is overwritten (it will inherit the individual test methods) ?

Yes, that's cleaner, will make that change as well.

> here's the list of tweaks I made...

The improvements look good to me.
Thanks for the review, Hoss! I'll attach a new patch shortly.


Michael Busch added a comment - 18/Sep/07 08:24 PM
In addition to Hoss' changes this patch:
  • Removes the boolean directoryOwner variable from DirectoryIndexReader
    and checks for segmentInfos != null instead. I also added a comment
    to DirectoryIndexReader about this.
  • TestMultiReader now extends the new unit test TestMultiSegmentReader
    and overwrites the method openReader().

I'm planning to commit this in a day or so if nobody objects.


Michael Busch made changes - 18/Sep/07 08:24 PM
Attachment lucene-986.patch [ 12366121 ]
Repository Revision Date User Message
ASF #577596 Thu Sep 20 07:27:07 UTC 2007 buschmi LUCENE-986: Refactored SegmentInfos from IndexReader into the new subclass DirectoryIndexReader.
Files Changed
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/MultiReader.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/index/TestMultiReader.java
ADD /lucene/java/trunk/src/test/org/apache/lucene/index/TestMultiSegmentReader.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/SegmentReader.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/MultiSegmentReader.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/IndexReader.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/ParallelReader.java
MODIFY /lucene/java/trunk/CHANGES.txt
ADD /lucene/java/trunk/src/java/org/apache/lucene/index/DirectoryIndexReader.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/FilterIndexReader.java

Michael Busch added a comment - 20/Sep/07 07:30 AM
Committed Rev. 577596

Michael Busch made changes - 20/Sep/07 07:30 AM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Michael Busch made changes - 25/Jan/08 03:24 AM
Status Resolved [ 5 ] Closed [ 6 ]