Lucene - Core
  1. Lucene - Core
  2. LUCENE-2665

Rework FieldCache to be more flexible/general

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None

      Description

      The existing FieldCache implementation is very rigid and does not allow much flexibility. In trying to implement simple features, it points to much larger structural problems.

      This patch aims to take a fresh approach to how we work with the FieldCache.

      1. LUCENE-2665-FieldCacheOverhaul.patch
        68 kB
        Ryan McKinley
      2. LUCENE-2665-FieldCacheOverhaul.patch
        67 kB
        Ryan McKinley
      3. LUCENE-2665-FieldCacheOverhaul.patch
        30 kB
        Ryan McKinley

        Issue Links

          Activity

          Hide
          Ryan McKinley added a comment -

          This is a quick sketch of a more general FieldCache – it only implements the ByteValues case, and is implemented in a different package.

          The core API looks like this:

            public <T> T get(IndexReader reader, String field, EntryCreator<T> creator) throws IOException
          

          and the EntryCreator looks like this:

          public abstract class EntryCreator<T> implements Serializable
          {
            public abstract T create( IndexReader reader, String field ) throws IOException;
            public abstract T validate( T entry, IndexReader reader, String field ) throws IOException;
          
            /**
             * Indicate if a cached cached value should be checked before usage.  
             * This is useful if an application wants to support subsequent calls
             * to the same cached object that may alter the cached object.  If
             * an application wants to avoid this (synchronized) check, it should 
             * return 'false'
             * 
             * @return 'true' if the Cache should call 'validate' before returning a cached object
             */
            public boolean shouldValidate() {
              return true;
            }
          
            /**
             * @return A key to identify valid cache entries for subsequent requests
             */
            public Integer getCacheKey( IndexReader reader, String field )
            {
              return new Integer(
                  EntryCreator.class.hashCode() ^ field.hashCode()
              );
            }
          }
          

          For a real cleanup, I think it makes sense to move the Parser stuff to somewhere that deals with numerics – I don't get why that is tied to the FieldCache

          Just as sketch to get us thinking....

          Show
          Ryan McKinley added a comment - This is a quick sketch of a more general FieldCache – it only implements the ByteValues case, and is implemented in a different package. The core API looks like this: public <T> T get(IndexReader reader, String field, EntryCreator<T> creator) throws IOException and the EntryCreator looks like this: public abstract class EntryCreator<T> implements Serializable { public abstract T create( IndexReader reader, String field ) throws IOException; public abstract T validate( T entry, IndexReader reader, String field ) throws IOException; /** * Indicate if a cached cached value should be checked before usage. * This is useful if an application wants to support subsequent calls * to the same cached object that may alter the cached object. If * an application wants to avoid this ( synchronized ) check, it should * return ' false ' * * @ return ' true ' if the Cache should call 'validate' before returning a cached object */ public boolean shouldValidate() { return true ; } /** * @ return A key to identify valid cache entries for subsequent requests */ public Integer getCacheKey( IndexReader reader, String field ) { return new Integer ( EntryCreator.class.hashCode() ^ field.hashCode() ); } } For a real cleanup, I think it makes sense to move the Parser stuff to somewhere that deals with numerics – I don't get why that is tied to the FieldCache Just as sketch to get us thinking....
          Hide
          Ryan McKinley added a comment -

          While we are thinking about it... perhaps the best option is to add a Cache to the IndexReader itself.

          This would be nice since it would would drop the WeakHashMap<IndexReader> that is used all over. I just like hard references better!

          The one (ok maybe there are more) hitch I can think of is that some Cache instances don't care about deleted docs (FieldCache) and others do. Perhaps the EntryCreator knows and the Cache coud behave differently... or am i getting ahead of myself!

          Show
          Ryan McKinley added a comment - While we are thinking about it... perhaps the best option is to add a Cache to the IndexReader itself. This would be nice since it would would drop the WeakHashMap<IndexReader> that is used all over. I just like hard references better! The one (ok maybe there are more) hitch I can think of is that some Cache instances don't care about deleted docs (FieldCache) and others do. Perhaps the EntryCreator knows and the Cache coud behave differently... or am i getting ahead of myself!
          Hide
          Michael McCandless added a comment -

          perhaps the best option is to add a Cache to the IndexReader itself.

          +1

          But reopened readers need to be able to share the same Cache... maybe we have a Cache doReopen() method, on the Cache? This way a given Cache impl could do something custom like evict certain entries that are volatile wrt reopen? (Though, I think we should try hard to keep del docs dependence out of cache entries... ie it's still the caller's job to fold in deleted docs).

          Show
          Michael McCandless added a comment - perhaps the best option is to add a Cache to the IndexReader itself. +1 But reopened readers need to be able to share the same Cache... maybe we have a Cache doReopen() method, on the Cache? This way a given Cache impl could do something custom like evict certain entries that are volatile wrt reopen? (Though, I think we should try hard to keep del docs dependence out of cache entries... ie it's still the caller's job to fold in deleted docs).
          Hide
          Michael McCandless added a comment -

          Would the validate method do things like compute the validDocs bit set if the current caller needs it but the entry didn't compute it the first time?

          Show
          Michael McCandless added a comment - Would the validate method do things like compute the validDocs bit set if the current caller needs it but the entry didn't compute it the first time?
          Hide
          Michael McCandless added a comment -

          Here are some of the problems w/ FC that I'd love to see fixed here:

          • The source of values should be fully flexible/pluggable – FC does
            uninversion, CSF pulls the array image from the index, an app can
            plugin its own source. All three of these "sources" should be
            consumed via the same API (eg ByteValues w/ getValue(int docID)).
          • Value lookup needs to be a method call, w/ optional
            getBackingArray(), for (manual today, automatic tomorrow) code
            spec.
          • Uninversion is dangerous – eg if you accidentally have multiple
            values per field, they "silently" overwrite one another.
          • When you do legitimately have multiple values (eg numeric fields),
            the Parser interface is also too inflexible – eg the exception to
            stop visiting terms, the inabilty to specify which (users have
            requested "first only" and "last only") of multiple values should
            be kept, etc.
          • Cache should be stored/accessible via the reader, not in separate
            external WeakHashMap. The eviction policy should be fully
            visible/controllable by the app (or maybe app optionally hands us
            a cache impl/factory). There should be no static FC.DEFAULT that
            we have today.
          • Insanity shouldn't be allowed/possible – it's just too dangerous
            today that we allow this. We should at least make it really hard
            to do, by accident (eg, like you must use SlowMultiReader to prove
            your insanity). EG caching values @ the MultiReader level. Or,
            LUCENE-2527 (fasterButMoreRAM true/false causing a double entry).
          • The entries are too strongly tied to field names. I may want
            virtual entries, not backed by a "real" field. EG, say I want to
            do a "blended" sort, say mixing in recency with elevance... I
            should be able name this "RelevanceAndRecency" (say), which is not
            a real field. I back this w/ my own FloatValues impl, which
            under-the-hood somehow combines the two "sources" and presents a
            FloatValues interface. Then I should be able to pass a SortField
            somehow referencing my dynamic/virtual field.
          • Cannot support multiple values per doc (this is a future
            nice-to-have-but-don't-preclude sort of thing)

          With these fixes, flex scoring (LUCENE-2392), the per-doc stats
          (unique term count, total term count, boost, etc.) should all become
          pluggable value sources.

          Show
          Michael McCandless added a comment - Here are some of the problems w/ FC that I'd love to see fixed here: The source of values should be fully flexible/pluggable – FC does uninversion, CSF pulls the array image from the index, an app can plugin its own source. All three of these "sources" should be consumed via the same API (eg ByteValues w/ getValue(int docID)). Value lookup needs to be a method call, w/ optional getBackingArray(), for (manual today, automatic tomorrow) code spec. Uninversion is dangerous – eg if you accidentally have multiple values per field, they "silently" overwrite one another. When you do legitimately have multiple values (eg numeric fields), the Parser interface is also too inflexible – eg the exception to stop visiting terms, the inabilty to specify which (users have requested "first only" and "last only") of multiple values should be kept, etc. Cache should be stored/accessible via the reader, not in separate external WeakHashMap. The eviction policy should be fully visible/controllable by the app (or maybe app optionally hands us a cache impl/factory). There should be no static FC.DEFAULT that we have today. Insanity shouldn't be allowed/possible – it's just too dangerous today that we allow this. We should at least make it really hard to do, by accident (eg, like you must use SlowMultiReader to prove your insanity). EG caching values @ the MultiReader level. Or, LUCENE-2527 (fasterButMoreRAM true/false causing a double entry). The entries are too strongly tied to field names. I may want virtual entries, not backed by a "real" field. EG, say I want to do a "blended" sort, say mixing in recency with elevance... I should be able name this "RelevanceAndRecency" (say), which is not a real field. I back this w/ my own FloatValues impl, which under-the-hood somehow combines the two "sources" and presents a FloatValues interface. Then I should be able to pass a SortField somehow referencing my dynamic/virtual field. Cannot support multiple values per doc (this is a future nice-to-have-but-don't-preclude sort of thing) With these fixes, flex scoring ( LUCENE-2392 ), the per-doc stats (unique term count, total term count, boost, etc.) should all become pluggable value sources.
          Hide
          Ryan McKinley added a comment -

          Would the validate method do things like compute the validDocs bit set if the current caller needs it but the entry didn't compute it the first time?

          Yes, since the options you ask for in an EntryCreator are not necessarily part of the key (and therefor the cached entry), I added the validate method to make sure the cached value has everything it asked for (like validDocs/values) This is optional since the application may not want to take the time to do this check (I think it needs to be synchronized)

          maybe we have a Cache doReopen() method, on the Cache?

          Ya, something like that...

          ---------

          I'll take a crack at this... hopefully we can break this into a few issues, so that it stays tractable and has a chance of getting committed soon!

          Show
          Ryan McKinley added a comment - Would the validate method do things like compute the validDocs bit set if the current caller needs it but the entry didn't compute it the first time? Yes, since the options you ask for in an EntryCreator are not necessarily part of the key (and therefor the cached entry), I added the validate method to make sure the cached value has everything it asked for (like validDocs/values) This is optional since the application may not want to take the time to do this check (I think it needs to be synchronized) maybe we have a Cache doReopen() method, on the Cache? Ya, something like that... --------- I'll take a crack at this... hopefully we can break this into a few issues, so that it stays tractable and has a chance of getting committed soon!
          Hide
          Yonik Seeley added a comment -

          Here are some of the problems w/ FC that I'd love to see fixed here:

          here?
          This issue could be a foundation for many of those things... but man... I really just want the Bits before another year goes by!
          Eviction policies, etc, are out of scope here IMO.

          Show
          Yonik Seeley added a comment - Here are some of the problems w/ FC that I'd love to see fixed here: here ? This issue could be a foundation for many of those things... but man... I really just want the Bits before another year goes by! Eviction policies, etc, are out of scope here IMO.
          Hide
          Ryan McKinley added a comment -

          really just want the Bits before another year goes by!

          +1

          Eviction policies, etc, are out of scope here IMO.

          oh, I sure hope so!

          I'm reading Mikes list as stuff we should keep in mind as we start a new direction. The basic design should not preclude these enhancements, but i sure hope they are not required before laying the basic groundwork (especially if this only goes to /trunk for now)

          Show
          Ryan McKinley added a comment - really just want the Bits before another year goes by! +1 Eviction policies, etc, are out of scope here IMO. oh, I sure hope so! I'm reading Mikes list as stuff we should keep in mind as we start a new direction. The basic design should not preclude these enhancements, but i sure hope they are not required before laying the basic groundwork (especially if this only goes to /trunk for now)
          Hide
          Michael McCandless added a comment -

          I'm reading Mikes list as stuff we should keep in mind as we start a new direction.

          Exactly – we certainly don't have to get this done on the first baby
          step, but if we're gonna make big changes to FC now, we should have
          this future target/issues-to-fix in mind.

          In fact I'm fine w/ the original baby step of sticking the Bits config
          onto Parser, as long as it doesn't use static config nor risk added
          insanity...

          Show
          Michael McCandless added a comment - I'm reading Mikes list as stuff we should keep in mind as we start a new direction. Exactly – we certainly don't have to get this done on the first baby step, but if we're gonna make big changes to FC now, we should have this future target/issues-to-fix in mind. In fact I'm fine w/ the original baby step of sticking the Bits config onto Parser, as long as it doesn't use static config nor risk added insanity...
          Hide
          Ryan McKinley added a comment -

          This patch takes the work from LUCENE-2649 and applies it to a more general ReaderCache.

          The BIG changes with this is that it does not have a global notion of 'insane' cache behavior.

          To allow EntryCreators to check for insanity, I added a callback where you could check other things:

           public boolean checkCacheAnomaliesAfterPut(ReaderCache cache, EntryKey key, T newValue )
          

          I think the one general thing we should check for is the same Key used in any ReaderCache within the same Directory...

          The other thing that was removed is the CacheEntry stuff.

          Show
          Ryan McKinley added a comment - This patch takes the work from LUCENE-2649 and applies it to a more general ReaderCache. The BIG changes with this is that it does not have a global notion of 'insane' cache behavior. To allow EntryCreators to check for insanity, I added a callback where you could check other things: public boolean checkCacheAnomaliesAfterPut(ReaderCache cache, EntryKey key, T newValue ) I think the one general thing we should check for is the same Key used in any ReaderCache within the same Directory... The other thing that was removed is the CacheEntry stuff.
          Hide
          Michael McCandless added a comment -

          Apologies for 'CTR' rather then 'RTC' – we can always revert if I jumped the gun!

          Better to ask forgiveness than permission

          In fact I'm +1 on switching Lucene's trunk to CTR model instead, now
          that we have 3.x as the stable branch. We have enough "policemen"
          around here that I think this'd work well.

          The changes look great Ryan – nice work! Some smallish feedback:

          • I see some windows eol's snuck in... can you change the
            svn:eol-style of all the new sources to "native"?
          • Some classes are missing copyright header (at least EntryKey,
            SimpleEntryKey)
          • Shouldn't we only incr .numDocs if the bit wasn't already set?
            (To be robust if docs have more than one value). Ie we can use
            OpenBits.getAndSet. Maybe then add and assert that numDoc <=
            maxDoc in the end...
          • Then, we can pass null for the delDocs to the enums, and then we
            don't need a 2nd pass to detect matchAllDocs (just test if
            .numDocs == maxDoc())?

          I think we should hold off on backport to 3.x until we stabilize
          LUCENE-2665?

          It looks like you've also fixed LUCENE-2527 with this? Ie the
          fasterButMoreRAM=true|false now cache to the same key? It's just that
          perhaps we should "upgrade" the entry, if it was first created w/
          false and then the current call passes true?

          Show
          Michael McCandless added a comment - Apologies for 'CTR' rather then 'RTC' – we can always revert if I jumped the gun! Better to ask forgiveness than permission In fact I'm +1 on switching Lucene's trunk to CTR model instead, now that we have 3.x as the stable branch. We have enough "policemen" around here that I think this'd work well. The changes look great Ryan – nice work! Some smallish feedback: I see some windows eol's snuck in... can you change the svn:eol-style of all the new sources to "native"? Some classes are missing copyright header (at least EntryKey, SimpleEntryKey) Shouldn't we only incr .numDocs if the bit wasn't already set? (To be robust if docs have more than one value). Ie we can use OpenBits.getAndSet. Maybe then add and assert that numDoc <= maxDoc in the end... Then, we can pass null for the delDocs to the enums, and then we don't need a 2nd pass to detect matchAllDocs (just test if .numDocs == maxDoc())? I think we should hold off on backport to 3.x until we stabilize LUCENE-2665 ? It looks like you've also fixed LUCENE-2527 with this? Ie the fasterButMoreRAM=true|false now cache to the same key? It's just that perhaps we should "upgrade" the entry, if it was first created w/ false and then the current call passes true?
          Hide
          Michael McCandless added a comment -

          Duh sorry above comment is supposed to be on LUCENE-2649 – I'll copy copy/paste there.

          Show
          Michael McCandless added a comment - Duh sorry above comment is supposed to be on LUCENE-2649 – I'll copy copy/paste there.
          Hide
          Yonik Seeley added a comment -

          In fact I'm +1 on switching Lucene's trunk to CTR model instead

          We have always been CTR - but how fast we commit w/o feedback from others is a function of our confidence level (i.e. how much we think people would agree/disagree with the change if they did review it, taking in other things like the complexity / invasiveness of the change).

          I'm not sure we should change the way we currently do things - trunk is developing plenty fast!

          Show
          Yonik Seeley added a comment - In fact I'm +1 on switching Lucene's trunk to CTR model instead We have always been CTR - but how fast we commit w/o feedback from others is a function of our confidence level (i.e. how much we think people would agree/disagree with the change if they did review it, taking in other things like the complexity / invasiveness of the change). I'm not sure we should change the way we currently do things - trunk is developing plenty fast!
          Hide
          Michael McCandless added a comment -

          In fact I'm +1 on switching Lucene's trunk to CTR model instead

          We have always been CTR - but how fast we commit w/o feedback from others is a function of our confidence level (i.e. how much we think people would agree/disagree with the change if they did review it, taking in other things like the complexity / invasiveness of the change).

          I'm not sure we should change the way we currently do things - trunk is developing plenty fast!

          Ahh you're right – RTC would require full vote on every commit: http://www.apache.org/foundation/glossary.html#ReviewThenCommit

          So we are in fact CTR!

          Show
          Michael McCandless added a comment - In fact I'm +1 on switching Lucene's trunk to CTR model instead We have always been CTR - but how fast we commit w/o feedback from others is a function of our confidence level (i.e. how much we think people would agree/disagree with the change if they did review it, taking in other things like the complexity / invasiveness of the change). I'm not sure we should change the way we currently do things - trunk is developing plenty fast! Ahh you're right – RTC would require full vote on every commit: http://www.apache.org/foundation/glossary.html#ReviewThenCommit So we are in fact CTR!
          Hide
          Michael McCandless added a comment -

          Awesome – this patch moves the cache into the reader! Finally

          Should we make the cache member of ReaderCache somehow accessible?
          This way apps can purge individual entries if they want?

          Maybe name this ReaderValuesCache? Like we cache other stuff (norms,
          deleted docs) in readers...

          I think composite readers (Multi/DirReader) should throw UOE if you
          try to get their cache; eg this is what flex APIs do as well.
          Instead, the app should wrap the composite reader with a
          SlowMultiReaderWrapper (or their own IndexReader impl that hides the
          sequential sub readers)?

          This way, creating insanity is still "possible", but, it won't be done
          by accident, ie you really know what you are getting into. If we do
          that then I think we don't need insanity checking anymore (because you
          made a clear choice to load vlaues @ the composite reader level).

          Show
          Michael McCandless added a comment - Awesome – this patch moves the cache into the reader! Finally Should we make the cache member of ReaderCache somehow accessible? This way apps can purge individual entries if they want? Maybe name this ReaderValuesCache? Like we cache other stuff (norms, deleted docs) in readers... I think composite readers (Multi/DirReader) should throw UOE if you try to get their cache; eg this is what flex APIs do as well. Instead, the app should wrap the composite reader with a SlowMultiReaderWrapper (or their own IndexReader impl that hides the sequential sub readers)? This way, creating insanity is still "possible", but, it won't be done by accident, ie you really know what you are getting into. If we do that then I think we don't need insanity checking anymore (because you made a clear choice to load vlaues @ the composite reader level).
          Hide
          Ryan McKinley added a comment -

          Here is an updated patch that moves the Cache to the Reader – this has no real changes, it just applies cleanly with trunk.

          I'm leaving on vacation, and likely won't be able to pick this up for some time, so if anyone gets the itch have at it!

          Show
          Ryan McKinley added a comment - Here is an updated patch that moves the Cache to the Reader – this has no real changes, it just applies cleanly with trunk. I'm leaving on vacation, and likely won't be able to pick this up for some time, so if anyone gets the itch have at it!
          Hide
          Ryan McKinley added a comment -

          this aimed to do the same thing as: LUCENE-2665

          Show
          Ryan McKinley added a comment - this aimed to do the same thing as: LUCENE-2665

            People

            • Assignee:
              Unassigned
              Reporter:
              Ryan McKinley
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development