Lucene - Core
  1. Lucene - Core
  2. LUCENE-2133

[PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.9.1, 3.0
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Hi all,

      up to the current version Lucene contains a conceptual flaw, that is the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open

      The FieldCache is flawed because it is incorrect to assume that:
      1. one IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data.
      2. the cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information.
      3. all IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl.

      Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner-classes that could be moved to package level.

      There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: There is a central registry for assigning Caches to IndexReader instances.

      I now propose the following:
      1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances ("IndexCache"). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache)
      2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations.
      3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary.
      4. Obsolete FieldCacheSanityChecker because no more "insanities" are expected (or at least, they do not impact the overall performance)
      5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField)

      I have provided an patch which takes care of all these issues. It passes all JUnit tests.

      The patch is quite large, admittedly, but the change required several modifications and some more to preserve backwards-compatibility. Backwards-compatibility is preserved by moving some of the updated functionality in the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping old code in org.apache.lucene.search.

      In detail and besides the above mentioned improvements, the following is provided:
      1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache.
      2. A housekeeping improvement to CloseableThreadLocal. Now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances.
      3. Analyzer.close now may throw an IOException (this already is covered by java.io.Closeable).
      4. A change to Collector: allow IndexCache instead of IndexReader being passed to setNextReader()
      5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the "switch" statements and the possibility to throw IllegalArgumentExceptions because of unsupported type values.

      The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields:

      • FieldCacheRangeFilter (=> IndexFieldCacheRangeFilter)
      • FieldCacheTermsFilter (=> IndexFieldCacheTermsFilter)
      • FieldCache (=> IndexFieldCache)
      • FieldCacheImpl (=> IndexFieldCacheImpl)
      • all classes in FieldCacheImpl (=> several package-level classes)
      • all subclasses of FieldComparator (=> several package-level classes)

      Final notes:

      • The patch would be simpler if no backwards compatibility was necessary. The Lucene community has to decide which classes/methods can immediately be removed, which ones later, which not at all. Whenever new classes depend on the old ones, an appropriate notice exists in the javadocs.
      • The patch introduces a new, deprecated class IndexFieldCacheSanityChecker.java which is just there for testing purposes, to show that no sanity checks are necessary any longer. This class may be removed at any time.
      • I expect that the patch does not impact performance. On the contrary, as the patch removes a few unnecessary checks we might even see a slight speedup. No benchmarking has been done so far, though.
      • I have tried to preserve the existing functionality wherever possible and to focus on the class/method structure only. We certainly may improve the caches' behavior, but this out of scope for this patch.
      • The refactoring finally makes the high duplication of code visible: For all supported atomic types (byte, double, float, int, long, short) three classes each are required: *Cache, *Comparator and *Parser. I think that further simplification might be possible (maybe using Java generics?), but I guess the current patch is large enough for now.

      Cheers,
      Christian

      1. LUCENE-2133.patch
        223 kB
        Christian Kohlschütter
      2. LUCENE-2133-complete.patch
        310 kB
        Christian Kohlschütter
      3. LUCENE-2133.patch
        223 kB
        Christian Kohlschütter
      4. LUCENE-2133.patch
        223 kB
        Christian Kohlschütter

        Issue Links

          Activity

          Hide
          Christian Kohlschütter added a comment -

          The main patch (for src/java)

          Show
          Christian Kohlschütter added a comment - The main patch (for src/java)
          Hide
          Christian Kohlschütter added a comment -

          Patch for JUnit tests (src/test) to enable them to use the new API
          (may be applied after approval of main patch)

          Show
          Christian Kohlschütter added a comment - Patch for JUnit tests (src/test) to enable them to use the new API (may be applied after approval of main patch)
          Hide
          Christian Kohlschütter added a comment -

          Patch for demo code (src/demo), may be applied after approval of main patch

          Show
          Christian Kohlschütter added a comment - Patch for demo code (src/demo), may be applied after approval of main patch
          Hide
          Christian Kohlschütter added a comment -

          Patch for contrib code (src/contrib), may be applied after approval of main patch

          Show
          Christian Kohlschütter added a comment - Patch for contrib code (src/contrib), may be applied after approval of main patch
          Hide
          Uwe Schindler added a comment -

          Thanks for the patch and the explanation!

          Some notes, without a deep insight:

          • The problem with your current patch is the same as with the other propsals: backwards compatibility is not yet adressed, but is the most hairy problem with all other approaches.
          • The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on.
          • The solution for the above point is the same like with all other changes: IndexReader needs a method to get a cache instance not the other way round. By this the collector.setNextReader change is obsolete.
          • Also the patch file changes files, that are not related or removes import statements, that are clearly marked as "// javadocs only".
          • For version 3.1: The "affects version", setting should be 2.9/3.0, as 3.1 is not yet released. An fix version cannot be yet given, thats correct.

          There may be more problems, but I will review the patch more detailed!

          Uwe

          Show
          Uwe Schindler added a comment - Thanks for the patch and the explanation! Some notes, without a deep insight: The problem with your current patch is the same as with the other propsals: backwards compatibility is not yet adressed, but is the most hairy problem with all other approaches. The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on. The solution for the above point is the same like with all other changes: IndexReader needs a method to get a cache instance not the other way round. By this the collector.setNextReader change is obsolete. Also the patch file changes files, that are not related or removes import statements, that are clearly marked as "// javadocs only". For version 3.1: The "affects version", setting should be 2.9/3.0, as 3.1 is not yet released. An fix version cannot be yet given, thats correct. There may be more problems, but I will review the patch more detailed! Uwe
          Hide
          Christian Kohlschütter added a comment - - edited

          Hi Uwe,

          thanks for reviewing. Just a quick response to your comment:

          The problem with your current patch is the same as with the other propsals: backwards compatibility is not yet adressed, but is the most hairy problem with all other approaches.

          As I wrote, the patch does provide backwards compatibility. Please correct me if I am wrong
          I think actually over-stressed the backwards-compatibility issue because many classes were marked as "Expert; can be changed in incompatible ways in the next release" (e.g. SortField/FieldComparator, Collector)

          The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on.

          You can access the IndexReader from IndexCache and vice versa. The patch actually contains a few changes for contrib where this is actually utilized.

          Passing IndexCache instead of IndexReader to setNextReader gives a slight gain in performance for most cases (i.e. whenever only the IndexCache is accessed), since there is no need to traverse the IndexReader-to-IndexCache method chain (IndexReader#getIndexCache() and the synchronized getOrCreateIndexCache()) for each call to setNextReader.

          In all the other cases, newCache.getIndexReader() gives full access to the IndexReader. There even is a default method which calls the (now-to-be-deprecated) newIndexReader(IndexReader,int) method, so there are actually zero changes necessary for existing code.

          Also the patch file changes files, that are not related or removes import statements, that are clearly marked as "// javadocs only".

          The corresponding references in the javadocs comments have been removed (e.g. FieldCache has been replaced to IndexFieldCache), so it made sense to remove the imports as well.

          For version 3.1: The "affects version", setting should be 2.9/3.0, as 3.1 is not yet released. An fix version cannot be yet given, thats correct.

          Indeed.

          Best regards,
          Christian

          Show
          Christian Kohlschütter added a comment - - edited Hi Uwe, thanks for reviewing. Just a quick response to your comment: The problem with your current patch is the same as with the other propsals: backwards compatibility is not yet adressed, but is the most hairy problem with all other approaches. As I wrote, the patch does provide backwards compatibility. Please correct me if I am wrong I think actually over-stressed the backwards-compatibility issue because many classes were marked as "Expert; can be changed in incompatible ways in the next release" (e.g. SortField/FieldComparator, Collector) The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on. You can access the IndexReader from IndexCache and vice versa. The patch actually contains a few changes for contrib where this is actually utilized. Passing IndexCache instead of IndexReader to setNextReader gives a slight gain in performance for most cases (i.e. whenever only the IndexCache is accessed), since there is no need to traverse the IndexReader-to-IndexCache method chain (IndexReader#getIndexCache() and the synchronized getOrCreateIndexCache()) for each call to setNextReader. In all the other cases, newCache.getIndexReader() gives full access to the IndexReader. There even is a default method which calls the (now-to-be-deprecated) newIndexReader(IndexReader,int) method, so there are actually zero changes necessary for existing code. Also the patch file changes files, that are not related or removes import statements, that are clearly marked as "// javadocs only". The corresponding references in the javadocs comments have been removed (e.g. FieldCache has been replaced to IndexFieldCache), so it made sense to remove the imports as well. For version 3.1: The "affects version", setting should be 2.9/3.0, as 3.1 is not yet released. An fix version cannot be yet given, thats correct. Indeed. Best regards, Christian
          Hide
          Uwe Schindler added a comment - - edited

          The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on.

          This is one of the backwards breaks. As noted, the Collector abstract methods cannot be changed and should not, as the IndexReader is the important part used for collecting the results. The cache is only used by sorting but not in general. Use of IndexCache here would be a design flaw, because it combines unrelated things.

          And setNextReader is only called seldom (only once for each segment). It would have an impact if you would need to call synchronized methods inside collect().

          Show
          Uwe Schindler added a comment - - edited The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on. This is one of the backwards breaks. As noted, the Collector abstract methods cannot be changed and should not, as the IndexReader is the important part used for collecting the results. The cache is only used by sorting but not in general. Use of IndexCache here would be a design flaw, because it combines unrelated things. And setNextReader is only called seldom (only once for each segment). It would have an impact if you would need to call synchronized methods inside collect().
          Hide
          Christian Kohlschütter added a comment -

          Hi Uwe,

          your arguments seem reasonable to me. I have added a new patch (src/java only), leaving all test and contrib classes intact (but still passes all tests).
          The new patch now keeps setNextReader as it is.

          Cheers,
          Christian

          Show
          Christian Kohlschütter added a comment - Hi Uwe, your arguments seem reasonable to me. I have added a new patch (src/java only), leaving all test and contrib classes intact (but still passes all tests). The new patch now keeps setNextReader as it is. Cheers, Christian
          Hide
          Christian Kohlschütter added a comment -

          Improved patch (src/java only), now without modification to newIndexReader.

          Show
          Christian Kohlschütter added a comment - Improved patch (src/java only), now without modification to newIndexReader.
          Hide
          Christian Kohlschütter added a comment -

          Added code to automatically evict entries from IndexFieldCache upon close of IndexCache/IndexReader
          (see also LUCENE-2135)

          Show
          Christian Kohlschütter added a comment - Added code to automatically evict entries from IndexFieldCache upon close of IndexCache/IndexReader (see also LUCENE-2135 )
          Hide
          Michael McCandless added a comment -

          Christian, could you roll up all patches into a single attachment? (Actually it looks like the separate demo, contrib, test, core patches were removed).

          Show
          Michael McCandless added a comment - Christian, could you roll up all patches into a single attachment? (Actually it looks like the separate demo, contrib, test, core patches were removed).
          Hide
          Christian Kohlschütter added a comment -

          Michael, LUCENE-2133.patch now contains all the patches for src/java. I have removed the test, contrib and demo files because they actually do not belong to the patch (since LUCENE-2133.patch really provides complete backwards compatibility now). It would not make much sense to claim backwards compatibility and at the same time modify a bunch of test classes, would it?

          Nevertheless, I am now preparing an updated LUCENE-2133-complete.patch with all these additional patches included. In the meantime you may simply apply LUCENE-2133.patch to your local trunk copy and see if it works for you.

          Show
          Christian Kohlschütter added a comment - Michael, LUCENE-2133 .patch now contains all the patches for src/java. I have removed the test, contrib and demo files because they actually do not belong to the patch (since LUCENE-2133 .patch really provides complete backwards compatibility now). It would not make much sense to claim backwards compatibility and at the same time modify a bunch of test classes, would it? Nevertheless, I am now preparing an updated LUCENE-2133 -complete.patch with all these additional patches included. In the meantime you may simply apply LUCENE-2133 .patch to your local trunk copy and see if it works for you.
          Hide
          Michael McCandless added a comment -

          I have removed the test, contrib and demo files because they actually do not belong to the patch (since LUCENE-2133.patch really provides complete backwards compatibility now).

          Ahh OK

          Show
          Michael McCandless added a comment - I have removed the test, contrib and demo files because they actually do not belong to the patch (since LUCENE-2133 .patch really provides complete backwards compatibility now). Ahh OK
          Hide
          Christian Kohlschütter added a comment -

          The same patch as LUCENE-2133.patch plus the changes in src/test and contrib to fully utilize the new IndexFieldCache and SortField code.

          Show
          Christian Kohlschütter added a comment - The same patch as LUCENE-2133 .patch plus the changes in src/test and contrib to fully utilize the new IndexFieldCache and SortField code.
          Hide
          Mark Miller added a comment -

          There are a bunch or unrelated changes (imports/names/exception thrown) that should be pulled from this patch.

          Show
          Mark Miller added a comment - There are a bunch or unrelated changes (imports/names/exception thrown) that should be pulled from this patch.
          Hide
          Mark Miller added a comment -

          Hmm ... nevermind. The exception is related and most of the imports are correct - brain spin.

          Didn't see that

          import org.apache.lucene.search.SortField; // for javadocs

          wasn't being used anymore anyway.

          import org.apache.lucene.search.fields.IndexFieldCache in NumericQuery should get a //javadoc so someone doesn't accidently remove it.

          And I guess the t to threadLocal change doesn't hurt with the amount your changing that anyway. Its a better name.

          This looks pretty nice overall.

          Show
          Mark Miller added a comment - Hmm ... nevermind. The exception is related and most of the imports are correct - brain spin. Didn't see that import org.apache.lucene.search.SortField; // for javadocs wasn't being used anymore anyway. import org.apache.lucene.search.fields.IndexFieldCache in NumericQuery should get a //javadoc so someone doesn't accidently remove it. And I guess the t to threadLocal change doesn't hurt with the amount your changing that anyway. Its a better name. This looks pretty nice overall.
          Hide
          Christian Kohlschütter added a comment -

          Mark, could you please name the changes you would like to be excluded?
          I thought I had only included necessary changes.

          Some dependent changes are not so obvious, but necessary.

          For example, Analyzer.close now needs to throw an IOException because of CloseableThreadLocal now closing an IOException because of doClose() throwing an IOException because it may close references from an IndexCache that might throw an IOException. Luckily this is covered by java.io.Closeable (which declared #close() to throw an IOException), and Analyzer implements that interface already.

          Show
          Christian Kohlschütter added a comment - Mark, could you please name the changes you would like to be excluded? I thought I had only included necessary changes. Some dependent changes are not so obvious, but necessary. For example, Analyzer.close now needs to throw an IOException because of CloseableThreadLocal now closing an IOException because of doClose() throwing an IOException because it may close references from an IndexCache that might throw an IOException. Luckily this is covered by java.io.Closeable (which declared #close() to throw an IOException), and Analyzer implements that interface already.
          Hide
          Christian Kohlschütter added a comment -

          Hmm ... nevermind. The exception is related and most of the imports are correct - brain spin.

          OK, no problem.

          This looks pretty nice overall.

          Thanks

          Show
          Christian Kohlschütter added a comment - Hmm ... nevermind. The exception is related and most of the imports are correct - brain spin. OK, no problem. This looks pretty nice overall. Thanks
          Hide
          Mark Miller added a comment -

          A couple more quick notes:

          I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, its not likely to affect many and it was experimental - so its a tough call. Its a lot of classes in there

          I'm also not sure if fields is the right package name? And do the Filters belong in that package?

          Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though

          Rather then changing the tests to the new classes, we should prob copy them and make new ones - then remove them when the deprecations are removed.

          Also, you should pull the author tag(s) - all credit is through JIRA and Changes. (I only see it like once, so I bet thats eclipse?)

          I havn't done a thorough review it all, but this is pretty great stuff to appear so complete and out of nowhere

          Show
          Mark Miller added a comment - A couple more quick notes: I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, its not likely to affect many and it was experimental - so its a tough call. Its a lot of classes in there I'm also not sure if fields is the right package name? And do the Filters belong in that package? Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though Rather then changing the tests to the new classes, we should prob copy them and make new ones - then remove them when the deprecations are removed. Also, you should pull the author tag(s) - all credit is through JIRA and Changes. (I only see it like once, so I bet thats eclipse?) I havn't done a thorough review it all, but this is pretty great stuff to appear so complete and out of nowhere
          Hide
          Mark Miller added a comment - - edited

          It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter).

          Show
          Mark Miller added a comment - - edited It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter).
          Hide
          Mark Miller added a comment -

          I dont quite understand why we would have no more insanity? What happens when you get the cache from a multireader and then from each segment reader within it? You double up right? Insanity?

          Show
          Mark Miller added a comment - I dont quite understand why we would have no more insanity? What happens when you get the cache from a multireader and then from each segment reader within it? You double up right? Insanity?
          Hide
          Christian Kohlschütter added a comment -

          I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, its not likely to affect many and it was experimental - so its a tough call. Its a lot of classes in there

          I think it makes sense to move the stuff into another package because o.a.l.search is already filled with a lot of classes. Since there are no package-level-protection dependencies to o.a.l.search, I think it does not hurt either.

          I'm also not sure if fields is the right package name? And do the Filters belong in that package?

          I couldn't really come up with a better name. It's all about the fields somehow (caches, comparators and value parsers).

          Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though

          Yes. I thought it's better to make it as clear as possibe.

          Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though

          Rather then changing the tests to the new classes, we should prob copy them and make new ones - then remove them when the deprecations are removed.

          Yes, good idea. This way we can test the old and the new behaviour at once. Maybe it is enough to add new test methods to the same class?

          Also, you should pull the author tag(s) - all credit is through JIRA and Changes. (I only see it like once, so I bet thats eclipse?)

          Sorry, Eclipse defaults. I thought I had removed all of them.

          I haven't done a thorough review it all, but this is pretty great stuff to appear so complete and out of nowhere

          Yes, it also just came over me the last weekend

          I will make a new patch tomorrow (CET time) with the new test cases incorporated.

          Show
          Christian Kohlschütter added a comment - I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, its not likely to affect many and it was experimental - so its a tough call. Its a lot of classes in there I think it makes sense to move the stuff into another package because o.a.l.search is already filled with a lot of classes. Since there are no package-level-protection dependencies to o.a.l.search, I think it does not hurt either. I'm also not sure if fields is the right package name? And do the Filters belong in that package? I couldn't really come up with a better name. It's all about the fields somehow (caches, comparators and value parsers). Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though Yes. I thought it's better to make it as clear as possibe. Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though Rather then changing the tests to the new classes, we should prob copy them and make new ones - then remove them when the deprecations are removed. Yes, good idea. This way we can test the old and the new behaviour at once. Maybe it is enough to add new test methods to the same class? Also, you should pull the author tag(s) - all credit is through JIRA and Changes. (I only see it like once, so I bet thats eclipse?) Sorry, Eclipse defaults. I thought I had removed all of them. I haven't done a thorough review it all, but this is pretty great stuff to appear so complete and out of nowhere Yes, it also just came over me the last weekend I will make a new patch tomorrow (CET time) with the new test cases incorporated.
          Hide
          Mark Miller added a comment -

          I think it does not hurt either.

          I didn't notice that you actually just deprecated the originals - I guess thats not a complete rug pull ...

          By the way, I don't think you need to deprecate something in a new class ( IndexFieldCacheImpl):

          
            /**
             * @deprecated Use {@link #clear()} instead.
             */
            public void purgeAllCaches() {
              init();
            }
          
          Show
          Mark Miller added a comment - I think it does not hurt either. I didn't notice that you actually just deprecated the originals - I guess thats not a complete rug pull ... By the way, I don't think you need to deprecate something in a new class ( IndexFieldCacheImpl): /** * @deprecated Use {@link #clear()} instead. */ public void purgeAllCaches() { init(); }
          Hide
          Christian Kohlschütter added a comment -

          It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter).

          Good catch. Since o.a.l.search.field.StringIndex extends o.a.l.search.StringIndex it's just a matter of declaration. The imports can indeed be removed (they were only required for javadocs, which should now also refer to the new classes instead). I have made the changes locally and will put them into the next update for this patch.

          I dont quite understand why we would have no more insanity? What happens when you get the cache from a multireader and then from each segment reader within it? You double up right? Insanity?

          The MultiReader has a different cache than the internal SegmentReaders, there is no interconnection between the two. While change the testcases, my patch initially even triggered a JUnit failure because of this – o.a.l.util.TestFieldCacheSanityChecker expected a cache error which now will not happen anymore.

          Show
          Christian Kohlschütter added a comment - It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter). Good catch. Since o.a.l.search.field.StringIndex extends o.a.l.search.StringIndex it's just a matter of declaration. The imports can indeed be removed (they were only required for javadocs, which should now also refer to the new classes instead). I have made the changes locally and will put them into the next update for this patch. I dont quite understand why we would have no more insanity? What happens when you get the cache from a multireader and then from each segment reader within it? You double up right? Insanity? The MultiReader has a different cache than the internal SegmentReaders, there is no interconnection between the two. While change the testcases, my patch initially even triggered a JUnit failure because of this – o.a.l.util.TestFieldCacheSanityChecker expected a cache error which now will not happen anymore.
          Hide
          Mark Miller added a comment - - edited

          And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they really intend to.

          edit

          This type of change actually even exaggerates that problem (though if we want to improve things here, its something we will have to deal with).

          Now you might have a mixture of old api/new api caches as well if you don't properly upgrade everything at once.

          Show
          Mark Miller added a comment - - edited And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they really intend to. edit This type of change actually even exaggerates that problem (though if we want to improve things here, its something we will have to deal with). Now you might have a mixture of old api/new api caches as well if you don't properly upgrade everything at once.
          Hide
          Christian Kohlschütter added a comment -

          I didn't notice that you actually just deprecated the originals - I guess thats not a complete rug pull ...

          Writing the new code was one day, then making it backwards compatible with all these deprecations another one :-b
          I actually wouldn't care so much about backwards compatibility in the most cases, but I think it is better now to allow a smooth transition to the new code.

          By the way, I don't think you need to deprecate something in a new class ( IndexFieldCacheImpl):

          Indeed. This was a leftover from the early changes to src/test code. It's removed now (locally).

          Show
          Christian Kohlschütter added a comment - I didn't notice that you actually just deprecated the originals - I guess thats not a complete rug pull ... Writing the new code was one day, then making it backwards compatible with all these deprecations another one :-b I actually wouldn't care so much about backwards compatibility in the most cases, but I think it is better now to allow a smooth transition to the new code. By the way, I don't think you need to deprecate something in a new class ( IndexFieldCacheImpl): Indeed. This was a leftover from the early changes to src/test code. It's removed now (locally).
          Hide
          Uwe Schindler added a comment -

          After removing the very problematic change to the Collector class (which is very central in Lucene and should not be changed again after 2.9), thatI told you to do in the morning, the other changes are no longer too intrusive. I am quite happy with your new patch and I now also like it very much. It is as a good candiate for replacing the very, very ugly FieldCache impl we currently have.

          I am not really sure, if the package name is good and I would like to also add Earwins changes and not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component (like all the other flex parts). For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching, but not realy basic functionality (lots of users have other sorting algos). Also not sure if all classes in search that contain "FieldCache" should be renamed. The FieldCacheTermsFilter and RangeFilter only use the cache internally, they should simply be changed to use the new API and maybe only get additional ctors for the other parser instance classes. So some stuff still needs to be changed.

          If it fits good together with the new flexible indexing branch, I see no problems with appling it soon. So its all good work. It is a pity, tht we heard not much from you in the past, the patch suddenly appeared in JIRA and almost nobody know you. I only found one introduction from you long time ago on java-dev. Are you still working at L3S? If yes, send nice greetings also to Jan Brase!

          This patch and the flex branch together and the many deprecations would make a 3.9 soon and 4.0 without all ugly stuff soon would be nice. I again would do the cleaning heavy commiting police man!

          So, good work. Thanks!

          Show
          Uwe Schindler added a comment - After removing the very problematic change to the Collector class (which is very central in Lucene and should not be changed again after 2.9), thatI told you to do in the morning, the other changes are no longer too intrusive. I am quite happy with your new patch and I now also like it very much. It is as a good candiate for replacing the very, very ugly FieldCache impl we currently have. I am not really sure, if the package name is good and I would like to also add Earwins changes and not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component (like all the other flex parts). For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching, but not realy basic functionality (lots of users have other sorting algos). Also not sure if all classes in search that contain "FieldCache" should be renamed. The FieldCacheTermsFilter and RangeFilter only use the cache internally, they should simply be changed to use the new API and maybe only get additional ctors for the other parser instance classes. So some stuff still needs to be changed. If it fits good together with the new flexible indexing branch, I see no problems with appling it soon. So its all good work. It is a pity, tht we heard not much from you in the past, the patch suddenly appeared in JIRA and almost nobody know you. I only found one introduction from you long time ago on java-dev. Are you still working at L3S? If yes, send nice greetings also to Jan Brase! This patch and the flex branch together and the many deprecations would make a 3.9 soon and 4.0 without all ugly stuff soon would be nice. I again would do the cleaning heavy commiting police man! So, good work. Thanks!
          Hide
          Christian Kohlschütter added a comment -

          And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they really intend to.

          The new IndexFieldCacheSanityChecker can be removed out-of-the-box (as documented in the class itself). It is just kind of a "showcase" class for this issue.
          The section I commented out is non-functional with the new change since there is no more ReaderKey in IndexFieldCache.CacheEntry.

          Show
          Christian Kohlschütter added a comment - And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they really intend to. The new IndexFieldCacheSanityChecker can be removed out-of-the-box (as documented in the class itself). It is just kind of a "showcase" class for this issue. The section I commented out is non-functional with the new change since there is no more ReaderKey in IndexFieldCache.CacheEntry.
          Hide
          Mark Miller added a comment -

          But we still want its functionality - we still want to check for "doubling" up insanity.

          Show
          Mark Miller added a comment - But we still want its functionality - we still want to check for "doubling" up insanity.
          Hide
          Mark Miller added a comment -

          not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component

          At a minimum, you should be able to set the cache for the reader.

          For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching

          The way he has it, this is not just for the fieldache, but also the fieldsreader and vectorreader - if we go down that road, we should consider norms as well.

          I see no problems with appling it soon

          I still think it might be a little early. This has a lot of consequences.

          Show
          Mark Miller added a comment - not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component At a minimum, you should be able to set the cache for the reader. For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching The way he has it, this is not just for the fieldache, but also the fieldsreader and vectorreader - if we go down that road, we should consider norms as well. I see no problems with appling it soon I still think it might be a little early. This has a lot of consequences.
          Hide
          Christian Kohlschütter added a comment -

          Uwe, thanks a lot for your assessment!

          I will definitely look at the flex branch and see what needs to be changed.

          Yes, I still work at L3S, working hard for my PhD (planned to finish mid 2010). This is probably the main reason for not being so active in the Lucene community in the past. However, as I use Lucene for research and commercial systems over the last years, I always try to contribute back patches whenever possible.

          Jan Brase doesn't work at L3S anymore, but I will happily forward the greetings.

          Cheers,
          Christian

          Show
          Christian Kohlschütter added a comment - Uwe, thanks a lot for your assessment! I will definitely look at the flex branch and see what needs to be changed. Yes, I still work at L3S, working hard for my PhD (planned to finish mid 2010). This is probably the main reason for not being so active in the Lucene community in the past. However, as I use Lucene for research and commercial systems over the last years, I always try to contribute back patches whenever possible. Jan Brase doesn't work at L3S anymore, but I will happily forward the greetings. Cheers, Christian
          Hide
          Uwe Schindler added a comment -

          I still think it might be a little early. This has a lot of consequences.

          I mean we should not wait for years like with LUCENE-831 No heavy committing please g.

          Show
          Uwe Schindler added a comment - I still think it might be a little early. This has a lot of consequences. I mean we should not wait for years like with LUCENE-831 No heavy committing please g .
          Hide
          Christian Kohlschütter added a comment -

          New changes from the discussions incorporated into the src/java patch.
          The complete patch (including src/test and contrib) follows tomorrow.

          Show
          Christian Kohlschütter added a comment - New changes from the discussions incorporated into the src/java patch. The complete patch (including src/test and contrib) follows tomorrow.
          Hide
          Christian Kohlschütter added a comment - - edited

          But we still want its functionality - we still want to check for "doubling" up insanity.

          Mark,
          the latest patch re-inserts the commented fragment.

          The "readerkey" in the original FieldCacheSanityChecker refers to the FieldCacheKey in IndexReader, which is now obsoleted by IndexReader#getIndexCache.getIndexReader(). I now use this as the reader key, even though I am still not quite sure if it is really necessary. Checking for insanity sounds so paranoid to me. If we know that there is a conceptional bug in the code, we should fix it, not circumvent and hotfix it using an insanity checker. But this is another story.

          Show
          Christian Kohlschütter added a comment - - edited But we still want its functionality - we still want to check for "doubling" up insanity. Mark, the latest patch re-inserts the commented fragment. The "readerkey" in the original FieldCacheSanityChecker refers to the FieldCacheKey in IndexReader, which is now obsoleted by IndexReader#getIndexCache.getIndexReader(). I now use this as the reader key, even though I am still not quite sure if it is really necessary. Checking for insanity sounds so paranoid to me. If we know that there is a conceptional bug in the code, we should fix it, not circumvent and hotfix it using an insanity checker. But this is another story.
          Hide
          Michael McCandless added a comment -

          This patch is a good step forward – it associates the cache directly
          with IndexReader, where it belongs; it cleanly decouples cache from
          reader (vs the hack we have today with IndexReader.getFieldCacheKey),
          so that eg cloned readers can share the same cache; it also preserves
          back compat, which is quite a stunning accomplishment

          But... there are many more things I don't like about FieldCache, that
          I'm not sure the patch addresses:

          • Uninversion to derive eg an int[] is horribly slow, compared to
            say loading the pre-encoded binary ints from disk, created during
            indexing. Ie, I think, if we are going to overhaul FieldCache
            API, we should somehow make LUCENE-1231 feasible.
          • There's no pluggability to customize where the int[] comes from
            for a given field – most you can do is provide your own IntParser
            that the uninverter uses. EG the fact that the patch had to
            "move" FieldCacheRange/TermsFilter down, is strange – these
            filters (and in general any future "cache consumers") should live
            in oal.search, but simply pull from a different int[] source,
            somehow.
          • Error checking of single-value-per-field (for StringIndex) is
            dangerous, today – it's intermittent, and, it's an unchecked
            exception. We should probably just remove the exception, or,
            maybe make it checked. Actually I'll go open a new issue for
            this. We should simply fix this.
          • Single-value-per-field limitation (though, that's a nice to have,
            future improvement)
          • Even accepting the single-value-per-field limitaiton, we should
            allow multiple values per field during uninversion, w/
            customizable logic about which value is kept as the single one
            (there is an issue open for this I think). This really should be
            some sort of added extensibility to whatever class drives
            uninversion...
          • The terror of accidentally asking for the array at the top-level
            of Multi/DirReader. I think this shouldn't even be allowed, at
            least not easily, ie Dir/MultiReader.getIndexCache should throw
            UOE. If we really wanted to, we could provide sugar methods in
            maybe ReaderUtil to "glom N int[]'s into a new int[]". But it
            should be named something scary Then we wouldn't need any
            insanity checking.
          • No control over caching policy (cannot evict things)
          • If we make field cache flexible enough, we could maybe fold norms
            & deleted docs into it (would be a separate future issue to
            actually do so...).

          Some other questions about the patch:

          • Consumers of the cache API (the sort comparator,
            FieldCacheTerms/RangeFilter, and any other future users of "the
            field cache") shouldn't have to move down into fields sub-package?
          • It's a little strange that the term vectors & fields reader also
            got pulled into the cache?
          Show
          Michael McCandless added a comment - This patch is a good step forward – it associates the cache directly with IndexReader, where it belongs; it cleanly decouples cache from reader (vs the hack we have today with IndexReader.getFieldCacheKey), so that eg cloned readers can share the same cache; it also preserves back compat, which is quite a stunning accomplishment But... there are many more things I don't like about FieldCache, that I'm not sure the patch addresses: Uninversion to derive eg an int[] is horribly slow, compared to say loading the pre-encoded binary ints from disk, created during indexing. Ie, I think, if we are going to overhaul FieldCache API, we should somehow make LUCENE-1231 feasible. There's no pluggability to customize where the int[] comes from for a given field – most you can do is provide your own IntParser that the uninverter uses. EG the fact that the patch had to "move" FieldCacheRange/TermsFilter down, is strange – these filters (and in general any future "cache consumers") should live in oal.search, but simply pull from a different int[] source, somehow. Error checking of single-value-per-field (for StringIndex) is dangerous, today – it's intermittent, and, it's an unchecked exception. We should probably just remove the exception, or, maybe make it checked. Actually I'll go open a new issue for this. We should simply fix this. Single-value-per-field limitation (though, that's a nice to have, future improvement) Even accepting the single-value-per-field limitaiton, we should allow multiple values per field during uninversion, w/ customizable logic about which value is kept as the single one (there is an issue open for this I think). This really should be some sort of added extensibility to whatever class drives uninversion... The terror of accidentally asking for the array at the top-level of Multi/DirReader. I think this shouldn't even be allowed, at least not easily, ie Dir/MultiReader.getIndexCache should throw UOE. If we really wanted to, we could provide sugar methods in maybe ReaderUtil to "glom N int[]'s into a new int[]". But it should be named something scary Then we wouldn't need any insanity checking. No control over caching policy (cannot evict things) If we make field cache flexible enough, we could maybe fold norms & deleted docs into it (would be a separate future issue to actually do so...). Some other questions about the patch: Consumers of the cache API (the sort comparator, FieldCacheTerms/RangeFilter, and any other future users of "the field cache") shouldn't have to move down into fields sub-package? It's a little strange that the term vectors & fields reader also got pulled into the cache?
          Hide
          Mark Miller added a comment -

          My impression is that this is not much different than LUCENE-831, and we are stuck on the same issues (831 started down the path of working with these goals, but its severely out of date now).

          It's a little strange that the term vectors & fields reader also got pulled into the cache?

          Couldn't figure this out either ... if anything, you'd think norms might goto the cache, but the vector and fields reader ... ?

          Show
          Mark Miller added a comment - My impression is that this is not much different than LUCENE-831 , and we are stuck on the same issues (831 started down the path of working with these goals, but its severely out of date now). It's a little strange that the term vectors & fields reader also got pulled into the cache? Couldn't figure this out either ... if anything, you'd think norms might goto the cache, but the vector and fields reader ... ?
          Hide
          Christian Kohlschütter added a comment -

          This patch is a good step forward - it associates the cache directly

          with IndexReader, where it belongs; it cleanly decouples cache from
          reader (vs the hack we have today with IndexReader.getFieldCacheKey),
          so that eg cloned readers can share the same cache; it also preserves
          back compat, which is quite a stunning accomplishment

          Yes, backwards compatibility was actually the driving factor for this patch.
          I actually have not addressed major changes in functionality. This would definitely require rework that breaks backwards compatibility.
          I just wanted to get rid of the static FieldCache, which caused me a lot of OOME headaches.

          As I wrote above, I see this patch as a good starting point for further development. Imagine I had submitted a real, complete rework of the FieldCache like in LUCENE-831 – it would be stuck as an open issue forever just like its predecessor.

          There is nothing wrong with the current patch (that's why I suggest comitting it to trunk soon-ish) – it does not break anything (it could even go into 3.1 I guess).
          Starting from a common codebase we can then later on extend it and address all the points you mentioned in Michael's most recent post:

          But... there are many more things I don't like about FieldCache, that I'm not sure the patch addresses:

          (snip)

          None of these (very important) issues have been addressed by the patch. Intentionally (as described).

          Some other questions about the patch:

          * Consumers of the cache API (the sort comparator,

          FieldCacheTerms/RangeFilter, and any other future users of "the

          field cache") shouldn't have to move down into fields sub-package?

          As you wish. I don't have problems with keep it in o.a.l.search. I was just a little scared about the plethora of classes in that package. Since we do not need to utilize package-level protection, it was actually possible to "encapsulate" that field-related functionality in another namespace. I just personally prefer to use many nested packages, I do not have problems of moving classes back to its parent package.

          * It's a little strange that the term vectors & fields reader also

          got pulled into the cache?

          They are not in the FieldCache. Caching fields is only one functionality provided by IndexCache. I took the opportunity to demonstrate caching something other than fields. You now save a few bytes per SegmentReader instance because of that change. Maybe it would make sense to move SegmentReader.CoreReaders to SegmentReaderIndexCache completely. However, if I had included that as well into this issue, it would again be too large to be discussed in time for the next release.

          If you have strong objections against moving these SegmentReader-specific things to the SegmentReaderIndexCache now, I can remove that part from the patch. However, I should probably then file another issue. Unfortunately, this will then depend on the outcome of this LUCENE-2133.

          To summarize, I suggest:

          1. Complete the additions to src/test (i.e. do not remove/modify the existing test cases but rather add new ones, as discussed previously)
          2. Commit the patch to svn, target release for Lucene 3.1.
          3. File another issue and discuss functional improvements for the IndexFieldCache part. Use Michael's wishlist as a starting point. Agree on breaking backwards compatibility, target for Lucene 4.0, 3.5 or whatever fancy version number. Incorporate things from LUCENE-831, LUCENE-1231 and also LUCENE-2135.
          4. File a new issue for the improvements in SegmentReader (move things that are shared between the original reader and its clones to a common SegmentReaderIndexCache, such as CoreReader and ThreadLocals), keep backwards compatibility and target for Lucene 3.1.

          What do you think?

          Show
          Christian Kohlschütter added a comment - This patch is a good step forward - it associates the cache directly with IndexReader, where it belongs; it cleanly decouples cache from reader (vs the hack we have today with IndexReader.getFieldCacheKey), so that eg cloned readers can share the same cache; it also preserves back compat, which is quite a stunning accomplishment Yes, backwards compatibility was actually the driving factor for this patch. I actually have not addressed major changes in functionality. This would definitely require rework that breaks backwards compatibility. I just wanted to get rid of the static FieldCache, which caused me a lot of OOME headaches. As I wrote above, I see this patch as a good starting point for further development. Imagine I had submitted a real, complete rework of the FieldCache like in LUCENE-831 – it would be stuck as an open issue forever just like its predecessor. There is nothing wrong with the current patch (that's why I suggest comitting it to trunk soon-ish) – it does not break anything (it could even go into 3.1 I guess). Starting from a common codebase we can then later on extend it and address all the points you mentioned in Michael's most recent post: But... there are many more things I don't like about FieldCache, that I'm not sure the patch addresses: (snip) None of these (very important) issues have been addressed by the patch. Intentionally (as described). Some other questions about the patch: * Consumers of the cache API (the sort comparator, FieldCacheTerms/RangeFilter, and any other future users of "the field cache") shouldn't have to move down into fields sub-package? As you wish. I don't have problems with keep it in o.a.l.search. I was just a little scared about the plethora of classes in that package. Since we do not need to utilize package-level protection, it was actually possible to "encapsulate" that field-related functionality in another namespace. I just personally prefer to use many nested packages, I do not have problems of moving classes back to its parent package. * It's a little strange that the term vectors & fields reader also got pulled into the cache? They are not in the FieldCache. Caching fields is only one functionality provided by IndexCache. I took the opportunity to demonstrate caching something other than fields. You now save a few bytes per SegmentReader instance because of that change. Maybe it would make sense to move SegmentReader.CoreReaders to SegmentReaderIndexCache completely. However, if I had included that as well into this issue, it would again be too large to be discussed in time for the next release. If you have strong objections against moving these SegmentReader-specific things to the SegmentReaderIndexCache now , I can remove that part from the patch. However, I should probably then file another issue. Unfortunately, this will then depend on the outcome of this LUCENE-2133 . To summarize, I suggest: 1. Complete the additions to src/test (i.e. do not remove/modify the existing test cases but rather add new ones, as discussed previously) 2. Commit the patch to svn, target release for Lucene 3.1. 3. File another issue and discuss functional improvements for the IndexFieldCache part. Use Michael's wishlist as a starting point. Agree on breaking backwards compatibility, target for Lucene 4.0, 3.5 or whatever fancy version number. Incorporate things from LUCENE-831 , LUCENE-1231 and also LUCENE-2135 . 4. File a new issue for the improvements in SegmentReader (move things that are shared between the original reader and its clones to a common SegmentReaderIndexCache, such as CoreReader and ThreadLocals), keep backwards compatibility and target for Lucene 3.1. What do you think?
          Hide
          Mark Miller added a comment -

          I don't know that back compat is really a concern if we are just leaving the old API intact as part of that, with its own caching mechanism?

          Just deprecate the old API, and make a new one. This is a big pain, because you have to be sure you don't straddle the two apis on upgrading, but thats the boat we will be in anyway.

          Which means a new impl should provide enough benefits to make that large pain worth enduring. 831 was not committed for the same reason - it didn't bring enough to table to be worth it after we got to a per segment cache in another way. Since I don't see that this provides anything over 831, I don't see how its not in the same boat.

          I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. Its anybodies guess - we should prob just do what makes sense and commit it when its ready.

          Show
          Mark Miller added a comment - I don't know that back compat is really a concern if we are just leaving the old API intact as part of that, with its own caching mechanism? Just deprecate the old API, and make a new one. This is a big pain, because you have to be sure you don't straddle the two apis on upgrading, but thats the boat we will be in anyway. Which means a new impl should provide enough benefits to make that large pain worth enduring. 831 was not committed for the same reason - it didn't bring enough to table to be worth it after we got to a per segment cache in another way. Since I don't see that this provides anything over 831, I don't see how its not in the same boat. I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. Its anybodies guess - we should prob just do what makes sense and commit it when its ready.
          Hide
          Christian Kohlschütter added a comment -

          I don't know that back compat is really a concern if we are just leaving the old API intact as part of that, with its own caching mechanism?

          I don't think that this is an option. FieldCache is fundamentally flawed and already creates a lot of trouble that has only been somehow fixed recently (FieldCacheKey, Insanity checks).

          FieldCache, the static "registry" of caches sort of violates fundamental OOP concepts – I mean, almost all methods require IndexReader as the first parameter. Since we allow IndexReader composition and decoration this apparently was not the right way to go because it exactly caused the problems that have later been addressed by the aformentioned FieldCacheKey and insanity checks.

          Just deprecate the old API, and make a new one. This is a big pain, because you have to be sure you don't straddle the two apis on upgrading, but thats the boat we will be in anyway.

          That is always an option, but as you see Uwe's initial reaction I think it is not so easy to convince everybody. Which is fine, because we now have a solution that is somehow intermediate between the two extremes (keeping as it is vs. complete rework) and still is completely bw-compatible.
          With a "real" complete overhaul of FieldCache, I imagine a lot of additional work to be then done for other projects such as SOLR, Nutch etc., which I would rather see not to happen abruptly.

          The IndexCache abstraction allows us to separate the two issues 1: "make caches non-static" and 2: "make the fieldcache snappier" very easily. We start with issue 1 here in LUCENE-2133, and then develop a new FancyFieldCache in LUCENE-xyz (which can be used in parallel to the then-old IndexFieldCache).

          In parallel we can integrate Michael's, Earwin's and Yonik's suggestions from LUCENE-2135 without starting the discussion from the beginning.

          I am not suggesting to take the current solution as a "temporary workaround", but as a base for future work. Anything else would make no sense indeed.

          Since I don't see that this provides anything over 831, I don't see how its not in the same boat.

          LUCENE-831 still requires a static FieldCache, the root of all evil It addresses orthogonal problems (ValueSource, Tries etc.).
          Finally, to cite yourself from LUCENE-831: "It probably makes sense to start from one of Hoss's original patches or even from scratch."

          I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. Its anybodies guess - we should prob just do what makes sense and commit it when its ready.

          The more complex the patches are, the longer it will take to integrate them into a new version.
          The more such patches you have, the longer it will take to get to a new release.

          Let's make it simple, submit what we have and build upon that.

          Show
          Christian Kohlschütter added a comment - I don't know that back compat is really a concern if we are just leaving the old API intact as part of that, with its own caching mechanism? I don't think that this is an option. FieldCache is fundamentally flawed and already creates a lot of trouble that has only been somehow fixed recently (FieldCacheKey, Insanity checks). FieldCache, the static "registry" of caches sort of violates fundamental OOP concepts – I mean, almost all methods require IndexReader as the first parameter. Since we allow IndexReader composition and decoration this apparently was not the right way to go because it exactly caused the problems that have later been addressed by the aformentioned FieldCacheKey and insanity checks. Just deprecate the old API, and make a new one. This is a big pain, because you have to be sure you don't straddle the two apis on upgrading, but thats the boat we will be in anyway. That is always an option, but as you see Uwe's initial reaction I think it is not so easy to convince everybody. Which is fine, because we now have a solution that is somehow intermediate between the two extremes (keeping as it is vs. complete rework) and still is completely bw-compatible. With a "real" complete overhaul of FieldCache, I imagine a lot of additional work to be then done for other projects such as SOLR, Nutch etc., which I would rather see not to happen abruptly. The IndexCache abstraction allows us to separate the two issues 1: "make caches non-static" and 2: "make the fieldcache snappier" very easily. We start with issue 1 here in LUCENE-2133 , and then develop a new FancyFieldCache in LUCENE-xyz (which can be used in parallel to the then-old IndexFieldCache). In parallel we can integrate Michael's, Earwin's and Yonik's suggestions from LUCENE-2135 without starting the discussion from the beginning. I am not suggesting to take the current solution as a "temporary workaround", but as a base for future work. Anything else would make no sense indeed. Since I don't see that this provides anything over 831, I don't see how its not in the same boat. LUCENE-831 still requires a static FieldCache, the root of all evil It addresses orthogonal problems (ValueSource, Tries etc.). Finally, to cite yourself from LUCENE-831 : "It probably makes sense to start from one of Hoss's original patches or even from scratch." I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. Its anybodies guess - we should prob just do what makes sense and commit it when its ready. The more complex the patches are, the longer it will take to integrate them into a new version. The more such patches you have, the longer it will take to get to a new release. Let's make it simple, submit what we have and build upon that.
          Hide
          Mark Miller added a comment -

          LUCENE-831 still requires a static FieldCache, the root of all evil

          It doesn't require one though? It supports a cache per segment reader just like this. Except its called a ValueSource.

          The CacheByReaderValueSource is just there to handle a back compat issue - its something that we would want to get around and use the reader valuesource for instead - but that patch still had a long way to go.

          Overall, from what I can see, the approach was about the same.

          It probably makes sense to start from one of Hoss's original patches or even from scratch

          That was said before a lot more work was done. The API was actually starting to shape up nicely.

          The more complex the patches are, the longer it will take to integrate them into a new version.

          Of course - and this is a complex issue with a lot of upgrade pain. Like with 831, it not really worth the pain to users without more benefits.

          The more such patches you have, the longer it will take to get to a new release.

          Thats not really true. 3.1 does't need this patch - there would be no reason to hold it for it. Patches go in when they are ready.

          Let's make it simple, submit what we have and build upon that.

          I dont think thats simple The patch can be iterated on outside of trunk as easy as in.

          Show
          Mark Miller added a comment - LUCENE-831 still requires a static FieldCache, the root of all evil It doesn't require one though? It supports a cache per segment reader just like this. Except its called a ValueSource. The CacheByReaderValueSource is just there to handle a back compat issue - its something that we would want to get around and use the reader valuesource for instead - but that patch still had a long way to go. Overall, from what I can see, the approach was about the same. It probably makes sense to start from one of Hoss's original patches or even from scratch That was said before a lot more work was done. The API was actually starting to shape up nicely. The more complex the patches are, the longer it will take to integrate them into a new version. Of course - and this is a complex issue with a lot of upgrade pain. Like with 831, it not really worth the pain to users without more benefits. The more such patches you have, the longer it will take to get to a new release. Thats not really true. 3.1 does't need this patch - there would be no reason to hold it for it. Patches go in when they are ready. Let's make it simple, submit what we have and build upon that. I dont think thats simple The patch can be iterated on outside of trunk as easy as in.
          Hide
          Christian Kohlschütter added a comment -

          bq. LUCENE-831 still requires a static FieldCache, the root of all evil

          It doesn't require one though? It supports a cache per segment reader just like this. Except its called a ValueSource.

          With "requires" I mean it's still there, not marked as deprecated and still used. Plus a lot of "ifs" like

          {{{
          if(parserUninverter != null)

          { currentReaderValues = uninversionValueSource.getLongs(reader, field, parserUninverter); }

          else if(valueSource != null)

          { currentReaderValues = valueSource.getLongs(reader, field); }

          else

          { currentReaderValues = reader.getValueSource().getLongs(reader, field); }

          }}}

          That is, it adds a lot of duplicated code / different possible implementations for the same thing.

          I am not saying LUCENE-831 was a bad idea. And apparently, apart from the different wording, we see a few similarities with LUCENE-2133. We just need to move on at some point.

          What is still different in my proposal is the additional abstraction layer of "IndexCache". Was this already somehow planned with "ValueSourceFactory"? That class exists in LUCENE-831 but was never used.

          As we see from LUCENE-2135 Index-specific caches are much more than FieldCache/ValueSource implementations. They should store arbitrary data, allow cache inspection, eviction of entries and so on.

          bq. Let's make it simple, submit what we have and build upon that.

          I dont think thats simple The patch can be iterated on outside of trunk as easy as in.

          It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code it into SVN, not necessarily trunk, admittedly, a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option.

          Show
          Christian Kohlschütter added a comment - bq. LUCENE-831 still requires a static FieldCache, the root of all evil It doesn't require one though? It supports a cache per segment reader just like this. Except its called a ValueSource. With "requires" I mean it's still there, not marked as deprecated and still used. Plus a lot of "ifs" like {{{ if(parserUninverter != null) { currentReaderValues = uninversionValueSource.getLongs(reader, field, parserUninverter); } else if(valueSource != null) { currentReaderValues = valueSource.getLongs(reader, field); } else { currentReaderValues = reader.getValueSource().getLongs(reader, field); } }}} That is, it adds a lot of duplicated code / different possible implementations for the same thing. I am not saying LUCENE-831 was a bad idea. And apparently, apart from the different wording, we see a few similarities with LUCENE-2133 . We just need to move on at some point. What is still different in my proposal is the additional abstraction layer of "IndexCache". Was this already somehow planned with "ValueSourceFactory"? That class exists in LUCENE-831 but was never used. As we see from LUCENE-2135 Index-specific caches are much more than FieldCache/ValueSource implementations. They should store arbitrary data, allow cache inspection, eviction of entries and so on. bq. Let's make it simple, submit what we have and build upon that. I dont think thats simple The patch can be iterated on outside of trunk as easy as in. It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code it into SVN, not necessarily trunk, admittedly, a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option.
          Hide
          Mark Miller added a comment - - edited

          That is, it adds a lot of duplicated code / different possible implementations for the same thing.

          Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in experimentation state - issues that need more though had hacked in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.

          They should store arbitrary data, allow cache inspection, eviction of entries and so on.

          Thats extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.

          It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code it into SVN, not necessarily trunk, admittedly, a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option.

          A branch is certainly a possibility, but with only one person working on it, I think its overkill. With some additional interest, a branch can make sense - otherwise its not worth the merging headaches. You also have to have a committer(s) thats willing to take on the merging.

          At one point, 831 was much more like this patch. Discussion along what Mike brought up above started transforming it to something else. We essentially decided that unless that much was brought to the table, the disrupting change just wasn't worth it for a different cache API.

          I'm def a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.

          Show
          Mark Miller added a comment - - edited That is, it adds a lot of duplicated code / different possible implementations for the same thing. Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in experimentation state - issues that need more though had hacked in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product. They should store arbitrary data, allow cache inspection, eviction of entries and so on. Thats extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache. It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code it into SVN, not necessarily trunk, admittedly, a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option. A branch is certainly a possibility, but with only one person working on it, I think its overkill. With some additional interest, a branch can make sense - otherwise its not worth the merging headaches. You also have to have a committer(s) thats willing to take on the merging. At one point, 831 was much more like this patch. Discussion along what Mike brought up above started transforming it to something else. We essentially decided that unless that much was brought to the table, the disrupting change just wasn't worth it for a different cache API. I'm def a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.
          Hide
          Christian Kohlschütter added a comment -

          Mark,

          that's great to hear overall.
          What I want to avoid are different competing implementations. Since there are indeed several people working on things that are closely related, esp. LUCENE-2135) a branch would make sense (if all agree, of course).

          Can we summarize all the open points for a "worthy" cache overhaul, close LUCENE-831 in favor of LUCENE-2133 and continue working here?
          I would like to hear the opionions of the other people involved: Earwin, Hoss, Mike, Uwe, Yonik to name a few.

          Best,
          Christian

          Show
          Christian Kohlschütter added a comment - Mark, that's great to hear overall. What I want to avoid are different competing implementations. Since there are indeed several people working on things that are closely related, esp. LUCENE-2135 ) a branch would make sense (if all agree, of course). Can we summarize all the open points for a "worthy" cache overhaul, close LUCENE-831 in favor of LUCENE-2133 and continue working here? I would like to hear the opionions of the other people involved: Earwin, Hoss, Mike, Uwe, Yonik to name a few. Best, Christian
          Hide
          Michael McCandless added a comment -

          I was just a little scared about the plethora of classes in that package.

          Let's separately worry about that? I think consumers of the cache API
          shouldn't have to move.

          Caching fields is only one functionality provided by IndexCache. I took the opportunity to demonstrate caching something other than fields. You now save a few bytes per SegmentReader instance because of that change.

          OK but let's also do this separately (Earwin I think is working on a
          patch for componentizing SR, which'd be great). FieldCache, "just"
          one of IndexReaders components, is hard enough to fix; let's just
          focus on it

          Also, as hackish as it is, the getFieldCacheKey() works well. The
          need for insanity checking is annoying, so let's just disallow getting
          cached @ Dir/MultiReader level (and offer sugar APIs if really
          needed).

          And as annoying as the static FieldCache is, LUCENE-2135 plugs yet
          another hold in the dike, ie, now entries are purged on close, and you
          can also explicitly purge a given IndexReader.

          I don't think we should wholesale move the cache APIs just for the
          sake of moving them. That's alot of API noise; we need some real
          improvements to justify making users switch.

          Since there are indeed several people working on things that are closely related, esp. LUCENE-2135) a branch would make sense

          Can we summarize all the open points for a "worthy" cache overhaul, close LUCENE-831 in favor of LUCENE-2133 and continue working here?

          I don't think we need a branch just yet. (LUCENE-2135 is a very
          standalone change from this issue). And, before closing other issues,
          committing this one as a start, etc, we really first need to reach
          some consensus on what even to do, here.

          I really want to see us first fix FieldCache's deep problems, then
          concern ourselves with where the resulting API will go / what it looks
          like. Ie, we're putting the horse before the cart, here, so far...

          EG, what [minimal] changes could we make to allow someone to plugin
          their own values source? If we make this:

          class FieldValues {
            public FIELD_TYPE getType();
            public int[] getInts() {};
           // and all the other types...
          }
          
          class FieldValuesSource implements Closeable {
            public FieldValues getValues(FieldInfo field);
            public void close();
          }
          

          We could then make an UninverterFieldValuesSource, subclassing
          PerFieldValueSource, that takes IndexReader to its ctor, lets your
          register parsers by field. And it'd allow customization beyond simply
          changing the parser, eg to control behavior of multi-valued fields,
          stopping early (NRQ does this), etc.

          We'd have CachedFieldValueSource that'd wrap any FieldValuesSource and
          cache, allowing you to subclass it if you want to do eviction, etc.
          It'd by default implement the same simplistic policy we have today –
          retain everything, until close at which point you discard everything.

          All consumers of field cache (sorting by field,
          FieldCacheRange/TermsFilter, function queries, etc.) should all allow
          me to pass in my own FieldValuesSource. IndexReader would let me
          customize, but would default to the cached uninversion source.

          I could then in theory [external to Lucene] make a FieldValueSource
          that slurped stuff from disk, and I could create LUCENE-1231 (= CSF)
          outside of Lucene. [And, when we build CSF inside of Lucene it'd also
          just be another source].

          Something along these lines maybe....?

          Show
          Michael McCandless added a comment - I was just a little scared about the plethora of classes in that package. Let's separately worry about that? I think consumers of the cache API shouldn't have to move. Caching fields is only one functionality provided by IndexCache. I took the opportunity to demonstrate caching something other than fields. You now save a few bytes per SegmentReader instance because of that change. OK but let's also do this separately (Earwin I think is working on a patch for componentizing SR, which'd be great). FieldCache, "just" one of IndexReaders components, is hard enough to fix; let's just focus on it Also, as hackish as it is, the getFieldCacheKey() works well. The need for insanity checking is annoying, so let's just disallow getting cached @ Dir/MultiReader level (and offer sugar APIs if really needed). And as annoying as the static FieldCache is, LUCENE-2135 plugs yet another hold in the dike, ie, now entries are purged on close, and you can also explicitly purge a given IndexReader. I don't think we should wholesale move the cache APIs just for the sake of moving them. That's alot of API noise; we need some real improvements to justify making users switch. Since there are indeed several people working on things that are closely related, esp. LUCENE-2135 ) a branch would make sense Can we summarize all the open points for a "worthy" cache overhaul, close LUCENE-831 in favor of LUCENE-2133 and continue working here? I don't think we need a branch just yet. ( LUCENE-2135 is a very standalone change from this issue). And, before closing other issues, committing this one as a start, etc, we really first need to reach some consensus on what even to do, here. I really want to see us first fix FieldCache's deep problems, then concern ourselves with where the resulting API will go / what it looks like. Ie, we're putting the horse before the cart, here, so far... EG, what [minimal] changes could we make to allow someone to plugin their own values source? If we make this: class FieldValues { public FIELD_TYPE getType(); public int [] getInts() {}; // and all the other types... } class FieldValuesSource implements Closeable { public FieldValues getValues(FieldInfo field); public void close(); } We could then make an UninverterFieldValuesSource, subclassing PerFieldValueSource, that takes IndexReader to its ctor, lets your register parsers by field. And it'd allow customization beyond simply changing the parser, eg to control behavior of multi-valued fields, stopping early (NRQ does this), etc. We'd have CachedFieldValueSource that'd wrap any FieldValuesSource and cache, allowing you to subclass it if you want to do eviction, etc. It'd by default implement the same simplistic policy we have today – retain everything, until close at which point you discard everything. All consumers of field cache (sorting by field, FieldCacheRange/TermsFilter, function queries, etc.) should all allow me to pass in my own FieldValuesSource. IndexReader would let me customize, but would default to the cached uninversion source. I could then in theory [external to Lucene] make a FieldValueSource that slurped stuff from disk, and I could create LUCENE-1231 (= CSF) outside of Lucene. [And, when we build CSF inside of Lucene it'd also just be another source]. Something along these lines maybe....?
          Hide
          Earwin Burrfoot added a comment -

          I would like to hear the opionions of the other people involved

          .... Earwin I think is working on a patch for componentizing SR ..... FieldCache, "just" one of IndexReaders components ......

          Mike answered for me
          My wish is to keep IR's as basic as possible, while plugging all extras (that includes sorting/whatever caches) on as-needed basis.
          Right now I have a basic impl that works for me for half a year, but in practice it ended up a bit ugly and it doesn't handle IW-produced readers (never needed/liked this feature). So I hope to fix these two deficiencies on a weekend.

          Show
          Earwin Burrfoot added a comment - I would like to hear the opionions of the other people involved .... Earwin I think is working on a patch for componentizing SR ..... FieldCache, "just" one of IndexReaders components ...... Mike answered for me My wish is to keep IR's as basic as possible, while plugging all extras (that includes sorting/whatever caches) on as-needed basis. Right now I have a basic impl that works for me for half a year, but in practice it ended up a bit ugly and it doesn't handle IW-produced readers (never needed/liked this feature). So I hope to fix these two deficiencies on a weekend.
          Hide
          Mark Miller added a comment -

          Something along these lines maybe....?

          And we are back to 831

          Show
          Mark Miller added a comment - Something along these lines maybe....? And we are back to 831
          Hide
          Michael McCandless added a comment -

          And we are back to 831

          Yes But maybe the fresh perspective will get us through it!

          Show
          Michael McCandless added a comment - And we are back to 831 Yes But maybe the fresh perspective will get us through it!

            People

            • Assignee:
              Unassigned
              Reporter:
              Christian Kohlschütter
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development