Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-986

Refactor segmentInfos from IndexReader into its subclasses

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      References to segmentInfos in IndexReader cause different kinds of problems
      for subclasses of IndexReader, like e. g. MultiReader.

      Only subclasses of IndexReader that own the index directory, namely
      SegmentReader and MultiSegmentReader, should have a SegmentInfos object
      and be able to access it.

      Further information:
      http://www.gossamer-threads.com/lists/lucene/java-dev/51808
      http://www.gossamer-threads.com/lists/lucene/java-user/52460

      A part of the refactoring work was already done in LUCENE-781

      1. lucene-986.patch
        48 kB
        Michael Busch
      2. lucene-986.patch
        42 kB
        Hoss Man
      3. lucene-986.patch
        40 kB
        Michael Busch

        Issue Links

          Activity

          Hide
          hossman Hoss Man added a comment -

          one aspect of this that should be considered: It may not make sense for MultiReader to extend MultiSegmentReader ... as Michael says, only subclasses that own the index directory should have segmentInfos, and a MultiReader (as defined on the trunk now) can never own it's own directory.

          I haven't worked through all the implications, but perhaps the most logical refactoring would be...

          • IndexReader
            ...as abstract as possible given that we can't actually make methods abstract
          • DirectoryIndexReader extends IndexReader
            ...new class, encapsulated all the segmentInfos and locking logic currently in
            IndexReader (can definitely be made abstract if helpful)
          • SegmentReader extends DirectoryIndexReader
          • MultiSegmentReader extends DirectoryIndexReader
          • ParallelIndexReader extends IndexReader
          • FilterIndexReader extends IndexReader
          • MultiReader extends IndexReader
            ...(side note that i really haven't thought through completley: should
            MultiReader extend FilterIndexReader?)

          there would likely be some utlity functionality that could be reused between MultiReader and MultiSegmentReader ... possible as static methods in IndexReader (or a new util class)

          Show
          hossman Hoss Man added a comment - one aspect of this that should be considered: It may not make sense for MultiReader to extend MultiSegmentReader ... as Michael says, only subclasses that own the index directory should have segmentInfos, and a MultiReader (as defined on the trunk now) can never own it's own directory. I haven't worked through all the implications, but perhaps the most logical refactoring would be... IndexReader ...as abstract as possible given that we can't actually make methods abstract DirectoryIndexReader extends IndexReader ...new class, encapsulated all the segmentInfos and locking logic currently in IndexReader (can definitely be made abstract if helpful) SegmentReader extends DirectoryIndexReader MultiSegmentReader extends DirectoryIndexReader ParallelIndexReader extends IndexReader FilterIndexReader extends IndexReader MultiReader extends IndexReader ...(side note that i really haven't thought through completley: should MultiReader extend FilterIndexReader?) there would likely be some utlity functionality that could be reused between MultiReader and MultiSegmentReader ... possible as static methods in IndexReader (or a new util class)
          Hide
          michaelbusch Michael Busch added a comment -

          This is good stuff, Hoss. I like the DirectoryIndexReader idea.
          Sounds like a very clean and logical separation.

          I'll work through the code to understand all consequences.

          Show
          michaelbusch Michael Busch added a comment - This is good stuff, Hoss. I like the DirectoryIndexReader idea. Sounds like a very clean and logical separation. I'll work through the code to understand all consequences.
          Hide
          michaelbusch Michael Busch added a comment -

          What do you think about this alternative approach:

          • SegmentReader, MultiSegmentReader and MultiReader all extend IndexReader
          • IndexReader.open() always returns a MultiSegmentReader instance, even
            if the index has only one segment. We can make some optimizations, so that
            e. g. MultiSegmentReader.termDocs() returns a SegmentTermDocs in case
            there's only one underlying SegmentReader.
          • All write lock and transaction logic goes into MultiSegmentReader, as
            it is the only reader that has to obtain a write lock.

          The advantage here is that only one class (MultiSegmentReader) is
          responsible for acquiring the write lock. It would also make
          IndexReader.open() simpler, because it can then return a
          MultiSegmentReader instance in all cases. The logic of opening the
          different segments could move from IndexReader.open() to the constructor
          of MultiSegmentReader. Also LUCENE-743 (IndexReader.reopen()) would
          become simpler.
          The disadvantage of this approach is apparently that always using
          a MultiSegmentReader would add some overhead when reading optimized
          indexes, but actually I don't think that this overhead would really be
          significant.

          If we go the DirectoryIndexReader way (where SegmentReader and
          MultiSegmentReader both extend DirectoryIndexReader), then we still need
          an ownDirectory variable that tells SegmentReader if it has to acquire a
          write lock or not, depending on whether it is a subreader of
          MultiSegmentReader or not. And of course we add another layer to the
          IndexReader hierachy.

          So I'm not sure which approach is the better one. I'm hoping to get some
          opionions here! Hoss? Others?

          Show
          michaelbusch Michael Busch added a comment - What do you think about this alternative approach: SegmentReader, MultiSegmentReader and MultiReader all extend IndexReader IndexReader.open() always returns a MultiSegmentReader instance, even if the index has only one segment. We can make some optimizations, so that e. g. MultiSegmentReader.termDocs() returns a SegmentTermDocs in case there's only one underlying SegmentReader. All write lock and transaction logic goes into MultiSegmentReader, as it is the only reader that has to obtain a write lock. The advantage here is that only one class (MultiSegmentReader) is responsible for acquiring the write lock. It would also make IndexReader.open() simpler, because it can then return a MultiSegmentReader instance in all cases. The logic of opening the different segments could move from IndexReader.open() to the constructor of MultiSegmentReader. Also LUCENE-743 (IndexReader.reopen()) would become simpler. The disadvantage of this approach is apparently that always using a MultiSegmentReader would add some overhead when reading optimized indexes, but actually I don't think that this overhead would really be significant. If we go the DirectoryIndexReader way (where SegmentReader and MultiSegmentReader both extend DirectoryIndexReader), then we still need an ownDirectory variable that tells SegmentReader if it has to acquire a write lock or not, depending on whether it is a subreader of MultiSegmentReader or not. And of course we add another layer to the IndexReader hierachy. So I'm not sure which approach is the better one. I'm hoping to get some opionions here! Hoss? Others?
          Hide
          hossman Hoss Man added a comment -

          i rarely remember a week later what i was thinking when i wrote something, but i suspect that when i suggested the DirectoryIndexReader i was assuming it would have everything directory/lock related thta currently exists in the IndexReader base class (including the directoryOwner boolean) ... in cases where there is a single Segment in a directory, there will be SegmentReader with directoryOwner==true ... in the multi segment cases, the MultiSegmentReader will have directoryOwner==true, and it's sub SegmentReaders will all have directoryOwner==false. ...

          ...the key point of DirectoryIndexReader being that any subclass can own a directory (and automaticly inherits all the code for dealing with locks properly when it needs/wants to) but doesn't always have to own the directory. meanwhile MultiReader (and ParallelIndexReader and FilteredIndexReader) make no attempt at owning a directory, and inherit no code for doing so (or for dealing with the locking of such non existent directories)

          I don't really know enough about the performance characteristics of SegmentReader vs a MultiSegmentReader of one segment to have a sense of how possible/practical it would be to eliminate the need for SegmentReader and replace it completely with MultiSegmentReader ...

          one hitch might be that SegmentReader.get is public, and in order to keep supporting it, SegmentReader still needs to have/inherit the same segment info and directory owning/locking code that we want to move out of IndexReader (so just putting it MultiSegmentReader won't fly unless we kill that public method)

          Show
          hossman Hoss Man added a comment - i rarely remember a week later what i was thinking when i wrote something, but i suspect that when i suggested the DirectoryIndexReader i was assuming it would have everything directory/lock related thta currently exists in the IndexReader base class (including the directoryOwner boolean) ... in cases where there is a single Segment in a directory, there will be SegmentReader with directoryOwner==true ... in the multi segment cases, the MultiSegmentReader will have directoryOwner==true, and it's sub SegmentReaders will all have directoryOwner==false. ... ...the key point of DirectoryIndexReader being that any subclass can own a directory (and automaticly inherits all the code for dealing with locks properly when it needs/wants to) but doesn't always have to own the directory. meanwhile MultiReader (and ParallelIndexReader and FilteredIndexReader) make no attempt at owning a directory, and inherit no code for doing so (or for dealing with the locking of such non existent directories) I don't really know enough about the performance characteristics of SegmentReader vs a MultiSegmentReader of one segment to have a sense of how possible/practical it would be to eliminate the need for SegmentReader and replace it completely with MultiSegmentReader ... one hitch might be that SegmentReader.get is public, and in order to keep supporting it, SegmentReader still needs to have/inherit the same segment info and directory owning/locking code that we want to move out of IndexReader (so just putting it MultiSegmentReader won't fly unless we kill that public method)
          Hide
          michaelbusch Michael Busch added a comment -

          Here is the patch with the DirectoryIndexReader approach.

          It moves SegmentInfos and Directory from IndexReader into
          DirectoryIndexReader, as well as the commit and rollback logic.

          MultiSegmentReader and SegmentReader extend DirectoryIndexReader.
          MultiReader extends IndexReader now and uses some methods from
          MultiSegmentReader that I made static.

          I added the method acquireWriteLock() to IndexReader that does
          nothing by default. Subclasses must implement it if they require
          a write lock (so does DirectoryIndexReader now).

          IndexReader is very abstract now and almost all logic moved into
          the subclasses. The methods isOptimized(), isCurrent(), and
          getVersion() all throw an UnsupportedOperationException (UOE). I
          further deprecated the IndexReader constructor that takes a
          Directory as argument. As soon as we remove this ctr the method
          IndexReader.getDirectory() should throw an UOE as well. I added
          a TODO comment as a reminder.

          All unit tests pass with this patch.

          Show
          michaelbusch Michael Busch added a comment - Here is the patch with the DirectoryIndexReader approach. It moves SegmentInfos and Directory from IndexReader into DirectoryIndexReader, as well as the commit and rollback logic. MultiSegmentReader and SegmentReader extend DirectoryIndexReader. MultiReader extends IndexReader now and uses some methods from MultiSegmentReader that I made static. I added the method acquireWriteLock() to IndexReader that does nothing by default. Subclasses must implement it if they require a write lock (so does DirectoryIndexReader now). IndexReader is very abstract now and almost all logic moved into the subclasses. The methods isOptimized(), isCurrent(), and getVersion() all throw an UnsupportedOperationException (UOE). I further deprecated the IndexReader constructor that takes a Directory as argument. As soon as we remove this ctr the method IndexReader.getDirectory() should throw an UOE as well. I added a TODO comment as a reminder. All unit tests pass with this patch.
          Hide
          michaelbusch Michael Busch added a comment -

          > one hitch might be that SegmentReader.get is public, and in order to keep
          > supporting it, SegmentReader still needs to have/inherit the same segment info
          > and directory owning/locking code that we want to move out of IndexReader (so
          > just putting it MultiSegmentReader won't fly unless we kill that public method).

          OK, I implemented the DirectoryIndexReader approach. Also because I'm not sure
          about the performance characteristics of a MultiSegmentReader acting as a
          SegmentReader.

          I'd like to commit this rather soon. A review of the patch would be highly
          appreciated.

          Show
          michaelbusch Michael Busch added a comment - > one hitch might be that SegmentReader.get is public, and in order to keep > supporting it, SegmentReader still needs to have/inherit the same segment info > and directory owning/locking code that we want to move out of IndexReader (so > just putting it MultiSegmentReader won't fly unless we kill that public method). OK, I implemented the DirectoryIndexReader approach. Also because I'm not sure about the performance characteristics of a MultiSegmentReader acting as a SegmentReader. I'd like to commit this rather soon. A review of the patch would be highly appreciated.
          Hide
          hossman Hoss Man added a comment -

          Michael: I've been meaning to look at this, but haven't had the time ... your recent update has goaded me

          just to clarify: the patch you added on September 12th is your latest patch right? ... it's not clear from you comment on the 17th if you intended to attach an update and something went wrong.

          I ask because i'm haivng trouble applying the patch from the 12th ... i must be tired because i can't understand why, it doesn't look like the files have changed since you posted the patch, so i'm not sure what it's complaining about ... visually everything seems to match up...

          hossman@coaster:~/lucene/lucene$ svn update
          At revision 576718.
          hossman@coaster:~/lucene/lucene$ svn status
          ? lucene-986.patch
          hossman@coaster:~/lucene/lucene$ patch --dry-run -p0 -i lucene-986.patch patching file src/java/org/apache/lucene/index/DirectoryIndexReader.java
          patching file src/java/org/apache/lucene/index/FilterIndexReader.java
          patching file src/java/org/apache/lucene/index/IndexReader.java
          patching file src/java/org/apache/lucene/index/MultiReader.java
          Hunk #1 FAILED at 19.
          1 out of 2 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiReader.java.rej
          patching file src/java/org/apache/lucene/index/MultiSegmentReader.java
          Hunk #1 FAILED at 30.
          Hunk #2 FAILED at 39.
          Hunk #3 FAILED at 156.
          Hunk #4 FAILED at 171.
          Hunk #5 FAILED at 256.
          Hunk #6 FAILED at 278.
          6 out of 6 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiSegmentReader.java.rej
          patching file src/java/org/apache/lucene/index/ParallelReader.java
          patching file src/java/org/apache/lucene/index/SegmentReader.java
          Hunk #1 succeeded at 32 with fuzz 2.
          patching file src/test/org/apache/lucene/index/TestMultiReader.java
          hossman@coaster:~/lucene/lucene$

          Show
          hossman Hoss Man added a comment - Michael: I've been meaning to look at this, but haven't had the time ... your recent update has goaded me just to clarify: the patch you added on September 12th is your latest patch right? ... it's not clear from you comment on the 17th if you intended to attach an update and something went wrong. I ask because i'm haivng trouble applying the patch from the 12th ... i must be tired because i can't understand why, it doesn't look like the files have changed since you posted the patch, so i'm not sure what it's complaining about ... visually everything seems to match up... hossman@coaster:~/lucene/lucene$ svn update At revision 576718. hossman@coaster:~/lucene/lucene$ svn status ? lucene-986.patch hossman@coaster:~/lucene/lucene$ patch --dry-run -p0 -i lucene-986.patch patching file src/java/org/apache/lucene/index/DirectoryIndexReader.java patching file src/java/org/apache/lucene/index/FilterIndexReader.java patching file src/java/org/apache/lucene/index/IndexReader.java patching file src/java/org/apache/lucene/index/MultiReader.java Hunk #1 FAILED at 19. 1 out of 2 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiReader.java.rej patching file src/java/org/apache/lucene/index/MultiSegmentReader.java Hunk #1 FAILED at 30. Hunk #2 FAILED at 39. Hunk #3 FAILED at 156. Hunk #4 FAILED at 171. Hunk #5 FAILED at 256. Hunk #6 FAILED at 278. 6 out of 6 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiSegmentReader.java.rej patching file src/java/org/apache/lucene/index/ParallelReader.java patching file src/java/org/apache/lucene/index/SegmentReader.java Hunk #1 succeeded at 32 with fuzz 2. patching file src/test/org/apache/lucene/index/TestMultiReader.java hossman@coaster:~/lucene/lucene$
          Hide
          hossman Hoss Man added a comment -

          I got the patch to apply cleanly (see mailing list for details) On the whole it looks really good, i'm attaching an updated version with some minor improvements (mainly javadocs), but first a few questions...

          • just to clarify: IndexReader(Directory) is only around for
            backwards compatibility in subclasses right? And the only reason
            for the "private Directory" is to keep supporting the directory()
            method for any subclasses relying on it?
          • IndexReader() says it can be removed once the other constructor is
            removed ... why? is that just pointing out that we can rely on the
            default constructor?
          • since one of the goals is that IndexReaders which don't own their
            directory shouldn't utilize segmentInfos, would it make sense to
            eliminate directoryOwner from DirectoryIndexReader and instead test
            whether (null == segmentInfos) ?
          • the way TestMultiReader is setup with the "mode" is a bit confusing
            ... perhaps instead we should make a subclass where only openReader
            is overwritten (it will inherit the individual test methods) ?

          here's the list of tweaks I made...

          • added a note in the IndexReader class javadocs about
            methods that throw UnSupOp but aren't abstract.
          • added javadocs to the IndexReader(Directory) constructor based on my
            understanding
          • added back javadocs to IndexReader methods that now throw UnSupOp to
            make it clear what they do to callers looking at the API, but made
            it clear tthe default impls throw UnSupOp
          • made IndexReader.directory() throw UnSupOp if directory is null,
            enhanced it's javadocs.
          • /* NOOP */ in IndexReader.acquireWriteLock so it's clear to code analysis tools that it's not a mistake.
          • small additions to javadocs for DirectoryIndexReader class, but
            these should probably be elaborated on (depending on what you think
            of my segmentInfos==null idea above)
          • made DirectoryIndexReader(...) call init(...) to eliminate a small
            about of duplication.
          Show
          hossman Hoss Man added a comment - I got the patch to apply cleanly (see mailing list for details) On the whole it looks really good, i'm attaching an updated version with some minor improvements (mainly javadocs), but first a few questions... just to clarify: IndexReader(Directory) is only around for backwards compatibility in subclasses right? And the only reason for the "private Directory" is to keep supporting the directory() method for any subclasses relying on it? IndexReader() says it can be removed once the other constructor is removed ... why? is that just pointing out that we can rely on the default constructor? since one of the goals is that IndexReaders which don't own their directory shouldn't utilize segmentInfos, would it make sense to eliminate directoryOwner from DirectoryIndexReader and instead test whether (null == segmentInfos) ? the way TestMultiReader is setup with the "mode" is a bit confusing ... perhaps instead we should make a subclass where only openReader is overwritten (it will inherit the individual test methods) ? here's the list of tweaks I made... added a note in the IndexReader class javadocs about methods that throw UnSupOp but aren't abstract. added javadocs to the IndexReader(Directory) constructor based on my understanding added back javadocs to IndexReader methods that now throw UnSupOp to make it clear what they do to callers looking at the API, but made it clear tthe default impls throw UnSupOp made IndexReader.directory() throw UnSupOp if directory is null, enhanced it's javadocs. /* NOOP */ in IndexReader.acquireWriteLock so it's clear to code analysis tools that it's not a mistake. small additions to javadocs for DirectoryIndexReader class, but these should probably be elaborated on (depending on what you think of my segmentInfos==null idea above) made DirectoryIndexReader(...) call init(...) to eliminate a small about of duplication.
          Hide
          michaelbusch Michael Busch added a comment -

          > I got the patch to apply cleanly (see mailing list for details)

          Thanks, Hoss! I'm using TortoiseSVN, I have to check how to set those
          default properties for new files correctly.

          > * just to clarify: IndexReader(Directory) is only around for
          > backwards compatibility in subclasses right? And the only reason
          > for the "private Directory" is to keep supporting the directory()
          > method for any subclasses relying on it?

          Yes, correct.

          > * IndexReader() says it can be removed once the other constructor is
          > removed ... why? is that just pointing out that we can rely on the
          > default constructor?

          Yes, just a suggested simplification. Keeping the constructor wouldn't hurt
          though.

          > * since one of the goals is that IndexReaders which don't own their
          > directory shouldn't utilize segmentInfos, would it make sense to
          > eliminate directoryOwner from DirectoryIndexReader and instead test
          > whether (null == segmentInfos) ?

          Sounds good, will do...

          > * the way TestMultiReader is setup with the "mode" is a bit confusing
          > ... perhaps instead we should make a subclass where only openReader
          > is overwritten (it will inherit the individual test methods) ?

          Yes, that's cleaner, will make that change as well.

          > here's the list of tweaks I made...

          The improvements look good to me.
          Thanks for the review, Hoss! I'll attach a new patch shortly.

          Show
          michaelbusch Michael Busch added a comment - > I got the patch to apply cleanly (see mailing list for details) Thanks, Hoss! I'm using TortoiseSVN, I have to check how to set those default properties for new files correctly. > * just to clarify: IndexReader(Directory) is only around for > backwards compatibility in subclasses right? And the only reason > for the "private Directory" is to keep supporting the directory() > method for any subclasses relying on it? Yes, correct. > * IndexReader() says it can be removed once the other constructor is > removed ... why? is that just pointing out that we can rely on the > default constructor? Yes, just a suggested simplification. Keeping the constructor wouldn't hurt though. > * since one of the goals is that IndexReaders which don't own their > directory shouldn't utilize segmentInfos, would it make sense to > eliminate directoryOwner from DirectoryIndexReader and instead test > whether (null == segmentInfos) ? Sounds good, will do... > * the way TestMultiReader is setup with the "mode" is a bit confusing > ... perhaps instead we should make a subclass where only openReader > is overwritten (it will inherit the individual test methods) ? Yes, that's cleaner, will make that change as well. > here's the list of tweaks I made... The improvements look good to me. Thanks for the review, Hoss! I'll attach a new patch shortly.
          Hide
          michaelbusch Michael Busch added a comment -

          In addition to Hoss' changes this patch:

          • Removes the boolean directoryOwner variable from DirectoryIndexReader
            and checks for segmentInfos != null instead. I also added a comment
            to DirectoryIndexReader about this.
          • TestMultiReader now extends the new unit test TestMultiSegmentReader
            and overwrites the method openReader().

          I'm planning to commit this in a day or so if nobody objects.

          Show
          michaelbusch Michael Busch added a comment - In addition to Hoss' changes this patch: Removes the boolean directoryOwner variable from DirectoryIndexReader and checks for segmentInfos != null instead. I also added a comment to DirectoryIndexReader about this. TestMultiReader now extends the new unit test TestMultiSegmentReader and overwrites the method openReader(). I'm planning to commit this in a day or so if nobody objects.
          Hide
          michaelbusch Michael Busch added a comment -

          Committed Rev. 577596

          Show
          michaelbusch Michael Busch added a comment - Committed Rev. 577596

            People

            • Assignee:
              michaelbusch Michael Busch
              Reporter:
              michaelbusch Michael Busch
            • Votes:
              1 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development