Lucene - Core
  LUCENE-3606

Make IndexReader really read-only in Lucene 4.0

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As we are changing the API completely in Lucene 4.0, we are also free to remove read-write access and commits from IndexReader. This code is hairy and buggy (as investigated by Robert and Mike today), e.g. when you work at the SegmentReader level but forget to flush in the DirectoryReader, so it's better to make IndexReaders truly read-only.

      Currently with IndexReader you can do things like:

      • delete/undelete documents -> can be done with IndexWriter, too (using deleteByQuery)
      • change norms -> this is a bad idea in general, and once we remove norms entirely and replace them by DocValues it is obsolete anyway. Changing DocValues should also be done using IndexWriter in trunk (once it is ready)
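For illustration, deleting through IndexWriter (the replacement for IndexReader-based deletes) can be sketched roughly as below. This is a minimal sketch, not code from the issue; exact constructor signatures vary between Lucene versions, and the field/term values are made up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class DeleteViaWriter {
    // Sketch: all deletions go through IndexWriter, never IndexReader.
    static void deleteExamples(Directory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_40,
                                  new StandardAnalyzer(Version.LUCENE_40)));
        // Delete all documents whose "id" field matches a term:
        writer.deleteDocuments(new Term("id", "doc-42"));
        // Delete by arbitrary query ("delete by query"):
        writer.deleteDocuments(new TermQuery(new Term("category", "obsolete")));
        writer.commit();  // deletes are buffered and resolved on flush/commit
        writer.close();
    }
}
```

Note that, unlike IndexReader.deleteDocument(int), the writer buffers deletes and resolves them in bulk, so it cannot report how many documents were affected.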
      1. LUCENE-3606-hideMethodAgain.patch
        3 kB
        Uwe Schindler
      2. LUCENE-3606-deprecations3x.patch
        26 kB
        Uwe Schindler
      3. LUCENE-3606-deprecations3x.patch
        28 kB
        Uwe Schindler
      4. LUCENE-3606-deprecations3x.patch
        31 kB
        Uwe Schindler
      5. LUCENE-3606.patch
        541 kB
        Uwe Schindler


          Activity

          Uwe Schindler added a comment - edited

          Merged changes to trunk revision: 1212545

          This issue is now finished. Thanks to Robert for the great help during fixing tests, the funny discussions, and for porting norms to the codec!

          Uwe Schindler added a comment -

          Committed 3.x revision: 1212539

          Uwe Schindler added a comment -

          Final patch including changes.txt. Will commit this now.

          Uwe Schindler added a comment -

          Updated patch, some javadoc changes (deprecated method replacements) and one more missing method.

          Uwe Schindler added a comment -

          Here is a patch that deprecates the removed methods in 3.x:

          • delete*, undeleteAll
          • IR.open(..., readOnly, ...)
          • adds the missing non-readOnly IR.open()
          • for now I also deprecated setNorm(). If I did not deprecate it, it would be inconsistent, as setNorm only works with readOnly=false, but you can never call that without a deprecation warning. I think we should leave setNorm deprecated.

          I did not fix the tests to stop using the deprecated readOnly=true IR.open()!

          I found a serious API glitch in 3.x with openIfChanged:

          • the base class defines doOpenIfChanged(boolean readOnly), but MultiReader and ParallelReader "override" this method with the signature doOpenIfChanged(boolean doClone) and a missing @Override annotation. This makes consumers calling IR.openIfChanged(boolean readOnly) do the wrong thing; instead they should get an UnsupportedOperationException like for the other unimplemented doOpenIfChanged methods in MR and PR.
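The accidental-override glitch can be reproduced in plain Java. The class names below are simplified stand-ins for the Lucene classes (and visibility is widened to public for the standalone demo): because parameter names are irrelevant to a method's signature, a method that was meant to do something different still overrides the base method, and without @Override nobody notices the semantic mismatch.

```java
class BaseReader {
    // Contract: the parameter means "readOnly".
    public String doOpenIfChanged(boolean readOnly) {
        return "base: readOnly=" + readOnly;
    }
}

class MultiReaderLike extends BaseReader {
    // Intended as an unrelated method taking "doClone", but since the
    // erased signature matches, this silently overrides the base method:
    // callers passing readOnly now get clone semantics instead of the
    // expected UnsupportedOperationException.
    public String doOpenIfChanged(boolean doClone) {
        return "sub: doClone=" + doClone;
    }
}
```

Adding @Override would not flag this (it genuinely is an override); the real fix, as noted above, is to make the divergent subclass methods throw UnsupportedOperationException.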
          Uwe Schindler added a comment -

          I will post a patch later adding the needed deprecations to 3.x! This issue is kept open until the deprecations are committed.

          Uwe Schindler added a comment -

          Patch that hides the accidentally visible IndexReader.open method taking Directory and IndexCommit again.

          Committed in trunk revision: 1212294

          Uwe Schindler added a comment -

          Committed to trunk revision: 1212292

          Simon Willnauer added a comment -

          +1 I love that Norms are under codec now!

          Michael McCandless added a comment -

          +1! Nice work guys... I love all the minuses

          Uwe Schindler added a comment -

          Thanks to Robert for fixing the last tests.

          This is the patch of the branch merged back to trunk. I will try to commit this ASAP, as it might get outdated very fast.

          Please review the changes!

          After committing I will keep the issue open and add deprecations to Lucene 3.x on most methods removed in trunk.

          Robert Muir added a comment -

          Tests are all fixed... I think this one is close.

          Uwe Schindler added a comment - edited

          I tried to fix the remaining tests today, but this seems impossible without IndexReader.deleteDocument(int docId). Some of the tests commented out with nocommits are so old that it's impossible to understand what they are really testing (especially TestAddIndexes and TestIndexWriterMerging). I would simply delete them, because all this stuff is heavily random-tested elsewhere (those "old tests" have no randomization at all).

          The remaining nocommits are:

          ./src/java/org/apache/lucene/index/codecs/lucene40/Lucene40NormsReader.java:      // nocommit: change to a real check? see LUCENE-3619
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: move the whole modification stuff to IW
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // end nocommit
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: is this needed anymore by IndexWriter?
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
          ./src/test/org/apache/lucene/index/TestAddIndexes.java:  /* nocommit: reactivate these tests
          ./src/test/org/apache/lucene/index/TestDeletionPolicy.java:  /* nocommit: fix this test, I don't understand it!
          ./src/test/org/apache/lucene/index/TestIndexWriterMerging.java:  /* nocommit: Fix tests to use an id and delete by term
          ./src/test/org/apache/lucene/index/TestParallelReaderEmptyIndex.java:  /* nocommit: Fix tests to use an id and delete by term
          ./src/test/org/apache/lucene/index/TestSizeBoundedForceMerge.java:  /* nocommit: Fix tests to use an id and delete by term
          ./src/test/org/apache/lucene/index/TestSizeBoundedForceMerge.java:  /* nocommit: Fix tests to use an id and delete by term
          

          The parts in SegmentReader should be made TODOs, and a new issue should be opened to remove RW from SegmentReader (Mike?). The tests should be deleted as described above. Otherwise the branch seems finalized, so I would like to merge back to trunk ASAP.

          Uwe Schindler added a comment -

          Robert and I committed more fixes to the remaining tests to the branch.
          I also added the test of LUCENE-3620 to this branch and fixed FilterIndexReader.

          Uwe Schindler added a comment -

          Clean up IR.open(...) methods to no longer accept readOnly and IndexDeletionPolicy: Core, Contrib, Solr (Modules need fixing, benchmark was broken before, too)

          New branch revision: 1210305

          Uwe Schindler added a comment -

          Committed removal of IR.commit(...).

          Still todo:

          • remove readOnly from DirectoryReader / SegmentReader internals (SR.delete is only called synchronized from inside IW)
          • remove deletionPolicy from DirectoryReader
          • DirectoryReader is now a very simple class; I will change it to extend MultiReader and only implement open() and doOpenIfChanged differently (checking SegmentInfos)
          Simon Willnauer added a comment -

          Thinking about this: a clean way to do it would be for Similarity to get a new method:

          I have been talking to Robert about moving norms to IDV earlier, and the biggest issue in our discussion was when Sims have more than one value per document, since we'd need to add custom (nested) fields or similar things, which would be yet another mess further down the road. I think it's totally valid to restrict to a single DV; if you need more than one, you simply use a byte variant and implement your custom decoding/encoding in your Sim (that seems more performant too, IMO).

          Once we have moved over to this new API, i.e. no custom norm code anymore, we can actually make use of IDV directly: norms would be just another CFS file, and each field's norms would just be a "virtual" file in the IDV CFS. All loading and writing could be done by the codec's IDV.

          Yet, I think once we have this we should go even further, remove omitNorms from Field entirely, and let the similarity decide whether a field has norms. This would remove a lot of hairiness in the code, too. I'd really like to see that!

          Uwe Schindler added a comment -

          Committed removal of IndexReader.deleteDocument*() and undeleteAll() to branch in revision: 1210153:

          • Removal of those methods from the abstract IR and all its implementations, except SegmentReader
          • SegmentReader only contains the doDelete(int) implementation (as used by BufferedDeletes & Co. from IW)
          • Rewrote lots of tests to use IW.deleteDocuments(Term); some tests were commented out with a nocommit added (some of them I don't even understand!)
          • Contrib's MultiPassIndexSplitter.FakeDeletesIndexReader has a package-private delete/undelete operating on the FixedBitSet
          Uwe Schindler added a comment -

          TestIndexReaderReopen should maybe be modified to change the index using IndexWriter instead of IndexReader, to better test the reopen functionality. I removed the whole modifyIndex method from this test, so it now does not go deep enough. We should maybe revert the commit on this file and change modifyIndex to use IndexWriter.

          Done.

          Uwe Schindler added a comment - edited

          As a first step, I removed setNorm/doSetNorm from all IndexReaders, deleted the NormModifier from contrib/misc, and modified tests.

          I am not sure if TestBackwardsCompatibility really checks that .sXX files are read correctly; we may need to add tests, as we no longer check the writing of norms in the standard tests. When reading old indexes with modified norms, we must ensure that the modified norms are read.

          When changing tests I already removed tests that opened IndexReader in RW mode (like TestBackwards), even if more than just norms were modified. But as deletions/commit will also be removed later, this is easier than fixing the tests.

          TestIndexReaderReopen should maybe be modified to change the index using IndexWriter instead of IndexReader, to better test the reopen functionality. I removed the whole modifyIndex method from this test, so it now does not go deep enough. We should maybe revert the commit on this file and change modifyIndex to use IndexWriter.

          Uwe Schindler added a comment -

          I created a branch, because this job is horrible: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3606

          Robert Muir added a comment -

          finally, "holy grail" where similarities can declare the normalization factor(s) they need, using byte/float/int whatever, and its all unified with the docvalues api. IndexReader.norms() maybe goes away here, and maybe NormsFormat too.

          Thinking about this: a clean way to do it would be for Similarity to get a new method:

          ValueType getValueType();
          

          and we would change:

          byte computeNorm(FieldInvertState state);
          

          to:

          void computeNorm(FieldInvertState state, PerDocFieldValues norm);
          

          Sims that want to encode multiple index-time scoring factors separately
          could just use BYTES_FIXED_STRAIGHT. This should be only for some rare
          sims anyway, because a Sim can pull named 'application' specific scoring
          factors from IR.perDocValues() today already.

          It's not too crazy either, since Sims are already doing their own encoding;
          e.g. the default Sim would just use FIXED_INTS_8.

          People that don't want to mess with bytes or smallfloat could use things
          like FLOAT_32 if they want and need this.

          We would just change FieldInfo.omitNorms to instead be FieldInfo.normValueType,
          which is the value type of the norm (null if it's omitted, just like docValueType).

          The Preflex FieldInfosReader would just set FIXED_INTS_8 or null, based on
          whether the FieldInfos had omitNorms or not; it doesn't support
          any other types...

          Finally then, Sims would own their scoring factors, and we could
          even remove omitNorms from Field/FieldType etc. (just use the correct
          scoring algorithm for the field; if you don't want norms, use a Sim
          that doesn't need them for scoring).

          This would remove the awkward/messy situation where every similarity
          implementation we have has to 'downgrade' itself to handle things like
          if the user decided to omit parts of their formula!

          Robert Muir added a comment -

          Hi, I will help too. I think norms itself is a pretty big project to clean up, and it's a good one to do first.

          We don't have to do it this way, but here is my idea of a way we could do it in committable steps:

          1. remove the setNorm first from IR, and fix all tests.
          2. rename NormsWriter to NormsConsumer; rote-refactor the norms I/O code into the codec as NormsFormat (yes, with just one default that reads the whole byte[])
          3. remove the IndexFileNames constant and let the default implementation handle files(), including the .sNNN hairiness
          4. create SimpleText implementation

          Then even more cleanups:

          1. split Default implementation to Preflex (with all hairiness like .sNNN) and Lucene40 (clean implementation)
          2. clean up 'behind the scenes' api, e.g. NormsFormat presents docvalues API (hardcoded at fixed bytes), SegmentReader does getArray(). IndexReader still returns just byte[]
          3. finally, "holy grail" where similarities can declare the normalization factor(s) they need, using byte/float/int whatever, and its all unified with the docvalues api. IndexReader.norms() maybe goes away here, and maybe NormsFormat too.
          Uwe Schindler added a comment -

          OK, I will work on this as soon as I can (next weekend). I will be glad to remove the copy-on-write setNorm stuff in the Lucene40 codec and make the Lucene3x codec completely read-only (only reading the newest norm file). I hope Robert will help me.

          Michael McCandless added a comment -

          OK I agree! It looks like in 4.0 it's possible to achieve these same capabilities. So let's nuke away

          Uwe Schindler added a comment - edited

          Mike,
          I agree that this might affect some apps, but as I said in the introduction: I want to do this only for 4.0 aka trunk, so we are free to break some strange internal, never-documented behaviour. By changing from atomic to per-segment search in 2.9 we already broke lots of incorrectly implemented stuff in Lucene, and these IndexReader deletion "features" are far less often used than Filters tied to the top-level IndexSearcher instead of the IR passed to getDocIdSet().

          Delete by docID is dangerous and should be prevented. But you can still do this: create a custom Filter, do the deletion work in getDocIdSet(IndexReader), and pass it as a Query to IndexWriter using ConstantScoreQuery. Using the IndexReader passed to getDocIdSet() you can do any funny things on it, just to produce valid docIDs. IW will take care of deleting them after getDocIdSet() returns.

          As Robert said, with FilterIndexReader you can also delete documents without delay (same as in contrib/misc: PKIndexSplitter, MultiPassIndexSplitter) by just overlaying another Bits. To make those hidden documents disappear on disk too, use the approach described before.
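The filter-based delete-by-docID pattern described above might look like this, sketched against the 3.x-era Filter API. This is an illustrative sketch, not code from the issue; the shouldDelete predicate is a hypothetical stand-in for whatever application logic selects the docIDs:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.FixedBitSet;

class DeleteByDocIdFilter extends Filter {
    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        // The reader passed in is per-segment, so docIDs set here are
        // always valid for that segment at deletion time.
        FixedBitSet bits = new FixedBitSet(reader.maxDoc());
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (shouldDelete(reader, doc)) {  // hypothetical predicate
                bits.set(doc);
            }
        }
        return bits;  // FixedBitSet implements DocIdSet
    }

    private boolean shouldDelete(IndexReader reader, int doc) {
        return false;  // placeholder: real logic would inspect the reader
    }
}

// Usage: wrap in ConstantScoreQuery and hand it to the writer;
// IW takes care of the actual deletion:
//   writer.deleteDocuments(new ConstantScoreQuery(new DeleteByDocIdFilter()));
```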

          And of course, let's nuke my favourite hate-candidate: setNorm()

          Robert Muir added a comment -

          Well, just like live changing of scoring factors with setNorm, the first two of these
          things could be done by the user with Bits/liveDocs/filters, right?

          Michael McCandless added a comment -

          I appreciate the purity of a truly read-only IndexReader, but there
          are things deleting from IR can do that IW cannot:

          • Real-time deletes: you can delete from IR and have it take
            immediate effect on a searcher using that IR. With IW you have to
            do an NRT reopen instead.
          • Delete by docID... users seem to use/want this (eg ask from time
            to time if we can do this in IW, which we can't since merging can
            shift docIDs at any time).
          • IR tells you how many documents were affected by a delete-by-Term.
            IW cannot, since it buffers and only resolves all deletes in bulk
            later.

          Now... just because IR can do these things that IW cannot.... doesn't
          mean they are compelling / important. I.e., maybe no app out there
          relies on these special things and so we can lose this functionality?
          I'm not sure...

          On removing IR.setNorm, I agree we should just remove it: in trunk now
          you can make a custom sim that boost norms "live", so we won't lose
          any functionality. The setNorm code (and the implications – all the
          per-field normGen, reference counting, scary SR.reopen, etc.) is truly
          hairy; would be great to nuke it all.

          Robert Muir added a comment -

          +1

          Yonik Seeley added a comment -

          +1 for read-only readers!

          Simon Willnauer added a comment -

          +1 - this would be awesome


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
             • Votes: 0
               Watchers: 1
