Lucene - Core
  1. Lucene - Core
  2. LUCENE-3866

Make CompositeReader.getSequentialSubReaders() and the corresponding IndexReaderContext methods return unmodifiable List<R extends IndexReader>

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since Lucene 2.9 we have the method getSequentialSubReader() returning IndexReader[]. Based on hardly-to-debug errors in user's code, Robert and me realized that returning an array from a public API is an anti-pattern. If the array is intended to be modifiable (like byte[] in terms,...), it is fine to use arrays in public APIs, but not, if the array must be protected from modification. As IndexReaders are 100% unmodifiable in trunk code (no deletions,...), the only possibility to corrumpt the reader is by modifying the array returned by getSequentialSubReaders(). We should prevent this.

      The same theoretically applies to FieldCache, too - but the party that is afraid of performance problems is too big to fight against that

      For getSequentialSubReaders there is no performance problem at all. The binary search of reader-ids inside BaseCompositeReader can still be done with the internal protected array, but public APIs should expose only a unmodifiable List. The same applies to leaves() and children() in IndexReaderContext. This change to list would also allow to make CompositeReader and CompositeReaderContext Iterable<R extends IndexReader>, so some loops would look nice.

      1. LUCENE-3866-step1.patch
        83 kB
        Uwe Schindler
      2. LUCENE-3866-step2.patch
        141 kB
        Uwe Schindler
      3. LUCENE-3866-step2.patch
        140 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          In revision 1300016 I already committed some JavaDocs warnings about this. But the issue should be solved and discussed before Lucene 4.0 is released.

          Show
          Uwe Schindler added a comment - In revision 1300016 I already committed some JavaDocs warnings about this. But the issue should be solved and discussed before Lucene 4.0 is released.
          Hide
          Robert Muir added a comment -

          Thanks for opening!

          The binary search of reader-ids inside BaseCompositeReader can still be done with the internal protected array,

          That's unaffected here right? Because starts[] is a separate problem. Why isnt this private? I made it private,
          only MultiPassIndexSplitter.FakeDeleteIndexReader is affected. In my opinion we should fix this too, the fix would
          be to just have a protected method in BaseComposite: int getDocStart(int readerIndex). Between this and readerIndex(),
          subclasses then have all they need.

          Show
          Robert Muir added a comment - Thanks for opening! The binary search of reader-ids inside BaseCompositeReader can still be done with the internal protected array, That's unaffected here right? Because starts[] is a separate problem. Why isnt this private? I made it private, only MultiPassIndexSplitter.FakeDeleteIndexReader is affected. In my opinion we should fix this too, the fix would be to just have a protected method in BaseComposite: int getDocStart(int readerIndex). Between this and readerIndex(), subclasses then have all they need.
          Hide
          Uwe Schindler added a comment -

          Yeah, its both cases: The protected inner reader[] is of course a problem. But in general we should protect anybody outside (except subclasses) to do harm here. So it should be a simple List<R extends IndexReader>.

          Show
          Uwe Schindler added a comment - Yeah, its both cases: The protected inner reader[] is of course a problem. But in general we should protect anybody outside (except subclasses) to do harm here. So it should be a simple List<R extends IndexReader>.
          Hide
          Yonik Seeley added a comment -

          I think we should allow users to access in the most low-level efficient manner, just as lucene can.
          Remember, this is expert level stuff! Seems like at most we would need a javadoc comment saying "don't modify this".

          Show
          Yonik Seeley added a comment - I think we should allow users to access in the most low-level efficient manner, just as lucene can. Remember, this is expert level stuff! Seems like at most we would need a javadoc comment saying "don't modify this".
          Hide
          Uwe Schindler added a comment - - edited

          Here the first step of this refactoring (affects only few core classes):

          • getSequentialSubReaders() returns a List<? extends IndexReader> (unmodifiable)
          • Nuked ReaderUtil.Gather completely (this one is ineffective and slow because of useless recursion). We can use ReaderContext.leaves() to get all atomic sub readers including their docStarts
          • Improved MultiFields and MultiDocValues by usage of ReaderContext.leaves(). Code much easier to read!
          • Moved ReaderUtil.Slice to separate class (it's somehow unrelated), but has nothing to do with ReaderUtil anymore -> Should move to index package + hidden

          There is still one (slow) nocommit, because step 2 will refactor the ReaderContexts to use Collections, too.

          Yonik:

          I think we should allow users to access in the most low-level efficient manner, just as lucene can. Remember, this is expert level stuff! Seems like at most we would need a javadoc comment saying "don't modify this".

          Actually it's more efficient than before. And the collection views are still backed by arrays and are never used in inner loops (it's just when IndexSearcher initializes or executes searches on all segments). Without any real benchmark showing a slowness (please do this after step 2 -> nocommit) I don't think we should risk corrumpt readers.

          Show
          Uwe Schindler added a comment - - edited Here the first step of this refactoring (affects only few core classes): getSequentialSubReaders() returns a List<? extends IndexReader> (unmodifiable) Nuked ReaderUtil.Gather completely (this one is ineffective and slow because of useless recursion). We can use ReaderContext.leaves() to get all atomic sub readers including their docStarts Improved MultiFields and MultiDocValues by usage of ReaderContext.leaves(). Code much easier to read! Moved ReaderUtil.Slice to separate class (it's somehow unrelated), but has nothing to do with ReaderUtil anymore -> Should move to index package + hidden There is still one (slow) nocommit, because step 2 will refactor the ReaderContexts to use Collections, too. Yonik: I think we should allow users to access in the most low-level efficient manner, just as lucene can. Remember, this is expert level stuff! Seems like at most we would need a javadoc comment saying "don't modify this". Actually it's more efficient than before. And the collection views are still backed by arrays and are never used in inner loops (it's just when IndexSearcher initializes or executes searches on all segments). Without any real benchmark showing a slowness (please do this after step 2 -> nocommit) I don't think we should risk corrumpt readers.
          Hide
          Uwe Schindler added a comment -

          This is the final patch, now also moving IndexReaderContext to collections and make them unmodifiable.

          For safety, IRC.leaves() now throws UnsupportedOperationException, if the context is not top-level (returned null before), this helps to find bugs.

          All tests pass, JavaDocs are updated.

          Robert suggested to me yesterday, to maybe make getSequentialSubReaders() protected, as it is no longer used by user code (only tests). All code should use the topReaderContext returned by the IndexReader to get all atomic IRC.leaves(), or IRC.children() to get the sub-contexts/readers.

          For easy use it might be a good idea to add some "easy-use methods" in ReaderUtil to get a view on all AtomicReader leaves (without docBase, so returns List<AtomicReader>). Some code not needing the whole info would get simplier. This is stuff for a new issue.

          In my opinion we should also move and hide ReaderSlice and BitSlice to index package, those classes are solely privately used from there.

          I think this is ready to commit to trunk and for backport to 4.x. I will not wait too long (max 24hrs), as the patch may get outdated very quickly. Maybe Mike can do some perf benchmarks with beast to show that it does not affect performance (some parts like creating Multi* should be much more effective now).

          Show
          Uwe Schindler added a comment - This is the final patch, now also moving IndexReaderContext to collections and make them unmodifiable. For safety, IRC.leaves() now throws UnsupportedOperationException, if the context is not top-level (returned null before), this helps to find bugs. All tests pass, JavaDocs are updated. Robert suggested to me yesterday, to maybe make getSequentialSubReaders() protected, as it is no longer used by user code (only tests). All code should use the topReaderContext returned by the IndexReader to get all atomic IRC.leaves(), or IRC.children() to get the sub-contexts/readers. For easy use it might be a good idea to add some "easy-use methods" in ReaderUtil to get a view on all AtomicReader leaves (without docBase, so returns List<AtomicReader>). Some code not needing the whole info would get simplier. This is stuff for a new issue. In my opinion we should also move and hide ReaderSlice and BitSlice to index package, those classes are solely privately used from there. I think this is ready to commit to trunk and for backport to 4.x. I will not wait too long (max 24hrs), as the patch may get outdated very quickly. Maybe Mike can do some perf benchmarks with beast to show that it does not affect performance (some parts like creating Multi* should be much more effective now).
          Hide
          Uwe Schindler added a comment -

          Samll update with JavaDocs (remove the "don't modify returned array" @ getSequentialSubReaders)

          Show
          Uwe Schindler added a comment - Samll update with JavaDocs (remove the "don't modify returned array" @ getSequentialSubReaders)
          Hide
          Uwe Schindler added a comment -

          I will later also upgrade MIGRATE.txt markdown.

          Show
          Uwe Schindler added a comment - I will later also upgrade MIGRATE.txt markdown.
          Hide
          Michael McCandless added a comment -

          I ran quick perf test:

                          Task    QPS base StdDev base   QPS patchStdDev patch      Pct diff
                       Respell       92.15        1.40       90.30        2.70   -6% -    2%
                      PKLookup      130.68        4.16      128.24        2.28   -6% -    3%
                        Fuzzy2       41.79        0.40       41.14        1.11   -5% -    2%
                        Fuzzy1      108.97        2.40      107.33        2.87   -6% -    3%
                   AndHighHigh       16.19        0.48       15.97        0.31   -6% -    3%
                        Phrase       12.90        0.32       12.74        0.36   -6% -    4%
                    AndHighMed       64.18        1.81       63.46        1.83   -6% -    4%
                  SloppyPhrase        8.37        0.29        8.29        0.11   -5% -    3%
                      SpanNear        5.51        0.12        5.46        0.18   -6% -    4%
                   TermGroup1M       36.19        0.60       35.89        0.74   -4% -    2%
                  TermBGroup1M       70.64        0.49       70.74        0.57   -1% -    1%
                       Prefix3       61.07        3.58       61.25        1.35   -7% -    8%
                      Wildcard       40.84        2.20       41.00        0.98   -7% -    8%
                          Term      147.32        3.85      149.65        4.87   -4% -    7%
                TermBGroup1M1P       50.32        1.58       51.29        0.88   -2% -    7%
                        IntNRQ        9.96        1.40       10.18        0.56  -15% -   25%
                    OrHighHigh       10.31        0.74       10.56        0.57   -9% -   16%
                     OrHighMed       12.95        1.01       13.26        0.79  -10% -   17%
          

          Basically no real change ... good!

          Show
          Michael McCandless added a comment - I ran quick perf test: Task QPS base StdDev base QPS patchStdDev patch Pct diff Respell 92.15 1.40 90.30 2.70 -6% - 2% PKLookup 130.68 4.16 128.24 2.28 -6% - 3% Fuzzy2 41.79 0.40 41.14 1.11 -5% - 2% Fuzzy1 108.97 2.40 107.33 2.87 -6% - 3% AndHighHigh 16.19 0.48 15.97 0.31 -6% - 3% Phrase 12.90 0.32 12.74 0.36 -6% - 4% AndHighMed 64.18 1.81 63.46 1.83 -6% - 4% SloppyPhrase 8.37 0.29 8.29 0.11 -5% - 3% SpanNear 5.51 0.12 5.46 0.18 -6% - 4% TermGroup1M 36.19 0.60 35.89 0.74 -4% - 2% TermBGroup1M 70.64 0.49 70.74 0.57 -1% - 1% Prefix3 61.07 3.58 61.25 1.35 -7% - 8% Wildcard 40.84 2.20 41.00 0.98 -7% - 8% Term 147.32 3.85 149.65 4.87 -4% - 7% TermBGroup1M1P 50.32 1.58 51.29 0.88 -2% - 7% IntNRQ 9.96 1.40 10.18 0.56 -15% - 25% OrHighHigh 10.31 0.74 10.56 0.57 -9% - 16% OrHighMed 12.95 1.01 13.26 0.79 -10% - 17% Basically no real change ... good!
          Hide
          Uwe Schindler added a comment -

          Thanks, I was not expecting a change - otherwise all my believes would have been devastated...

          Show
          Uwe Schindler added a comment - Thanks, I was not expecting a change - otherwise all my believes would have been devastated...
          Hide
          Michael McCandless added a comment -

          +1, patch looks good! Thanks Uwe.

          Show
          Michael McCandless added a comment - +1, patch looks good! Thanks Uwe.
          Hide
          Uwe Schindler added a comment -

          I checked CHANGES.txt and MIGRATE.txt, there are no explicit mentions of the datatypes. No need to change anything. But MIGRATE.txt is missing the documentation about how to get atomic leaves from a reader. We should do this in the followup issue, when we simplify some names.

          I will commit this patch tomorrow morning to trunk and 4.x.

          Show
          Uwe Schindler added a comment - I checked CHANGES.txt and MIGRATE.txt, there are no explicit mentions of the datatypes. No need to change anything. But MIGRATE.txt is missing the documentation about how to get atomic leaves from a reader. We should do this in the followup issue, when we simplify some names. I will commit this patch tomorrow morning to trunk and 4.x.
          Hide
          Uwe Schindler added a comment -

          Committed trunk revision: 1351590
          Committed 4.x revision: 1351591

          Show
          Uwe Schindler added a comment - Committed trunk revision: 1351590 Committed 4.x revision: 1351591

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development