Lucene - Core / LUCENE-1052

Add a "termInfosIndexDivisor" to IndexReader

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The termIndexInterval, set at indexing time, lets you trade off how
      much RAM a reader uses to load the indexed terms against the cost of
      seeking to the specific term you want to load.

      But the downside is you must set it at indexing time.

      This issue adds an indexDivisor to TermInfosReader so that, on opening
      a reader, you can further sub-sample the termIndexInterval to use
      less RAM. E.g., a setting of 2 means only every 2 * termIndexInterval'th
      term is loaded into RAM.
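The divisor arithmetic can be sketched as follows. This is an illustrative calculation only, not Lucene's actual TermInfosReader code, and the method names are invented for the example:

```java
// Sketch (not Lucene's actual code): how an index divisor sub-samples the
// in-memory term index. The .tii file holds every termIndexInterval'th
// term; the divisor keeps only every divisor'th of those in RAM.
public class IndexDivisorMath {

    /** Number of index terms loaded into RAM for a segment. */
    static int loadedIndexTerms(int numTerms, int termIndexInterval, int divisor) {
        int indexTerms = numTerms / termIndexInterval; // entries in the .tii file
        return indexTerms / divisor;                   // entries actually kept in RAM
    }

    /** Effective spacing between in-RAM index terms after sub-sampling. */
    static int effectiveInterval(int termIndexInterval, int divisor) {
        return termIndexInterval * divisor;
    }
}
```

So with the default interval of 128 and a divisor of 2, a 1.28M-term segment keeps 5,000 index terms in RAM instead of 10,000, at the cost of a longer linear scan after each binary-search seek.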

      This is particularly useful if your index has a great many terms
      (e.g., you accidentally indexed binary terms).

      Spinoff from this thread:

      http://www.gossamer-threads.com/lists/lucene/java-dev/54371

      1. LUCENE-1052.patch
        21 kB
        Michael McCandless
      2. LUCENE-1052.patch
        14 kB
        Michael McCandless
      3. termInfosConfigurer.patch
        16 kB
        Chuck Williams

        Activity

        mikemccand Michael McCandless added a comment -

        Attached patch derived from Doug's & Chuck's patches from the thread. All tests pass.

        mikemccand Michael McCandless added a comment -

        I just committed this. Thanks Chuck & Doug!

        manawiz Chuck Williams added a comment -

        Michael, thanks for creating an excellent production version of this idea and committing it!

        I'd like to take it one step further to eliminate the need to call IndexReader.setTermInfosIndexDivisor up front. The idea is to instead specify a maximum number of index terms to cache in memory. This could then allow TermInfosReader to set indexDivisor automatically to the smallest value that yields a cache size less than the maximum.

        This seems a simple and extremely useful extension. Unfortunately, I'm still on an older Lucene, but will post my update. If you like this idea, you may want to just add the feature directly to your implementation in the trunk.
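Chuck's suggestion amounts to choosing the smallest divisor that keeps the cache under the cap. A minimal sketch of that selection, with hypothetical method and parameter names (this is not Lucene API):

```java
// Hypothetical sketch of auto-setting the index divisor from a maximum
// cache size, as suggested in the comment above. Names are illustrative.
public class AutoDivisor {

    /** Smallest divisor d such that indexTermsOnDisk / d <= maxIndexTerms. */
    static int smallestDivisor(int indexTermsOnDisk, int maxIndexTerms) {
        if (maxIndexTerms <= 0) {
            throw new IllegalArgumentException("maxIndexTerms must be > 0");
        }
        // ceiling division, but never less than 1
        int divisor = (indexTermsOnDisk + maxIndexTerms - 1) / maxIndexTerms;
        return Math.max(1, divisor);
    }
}
```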

        mikemccand Michael McCandless added a comment -

        I'd like to take it one step further to eliminate the need to call IndexReader.setTermInfosIndexDivisor up front. The idea is to instead specify a maximum number of index terms to cache in memory. This could then allow TermInfosReader to set indexDivisor automatically to the smallest value that yields a cache size less than the maximum.

        This seems a simple and extremely useful extension. Unfortunately, I'm still on an older Lucene, but will post my update. If you like this idea, you may want to just add the feature directly to your implementation in the trunk.

        Good idea! This allows you to simply outright cap the memory usage,
        rather than having memory usage be a fraction of the number of terms
        and thus grow as your term count grows.

        So you would propose replacing IndexReader.setTermInfosIndexDivisor
        with IndexReader.setTermInfosIndexMaxCount or some such? I.e., you
        would still need to call this on creating your reader...

        manawiz Chuck Williams added a comment -

        I believe this needs to be a formula, as a reasonable bound on the number of terms is in general a function of the number of documents in the segment and the nature of the index (e.g., types of fields). A common thing to do would be to enforce that RAM usage for cached terms grows no faster than logarithmically in the number of documents. The specific formula that is appropriate will depend on the index, i.e. on the application. It might be of the form: c*ln(numdocs+k), where c and k are constants dependent on the index.
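As a concrete instance of the logarithmic bound just described, a small sketch (c and k are the index-dependent constants from the comment; the specific values in any usage are arbitrary examples, not recommendations):

```java
// Sketch of the c*ln(numDocs + k) cache bound mentioned above.
// c and k are index-dependent tuning constants.
public class LogBound {

    /** Maximum number of cached terms allowed for a segment of numDocs docs. */
    static int maxCachedTerms(int numDocs, double c, double k) {
        return (int) (c * Math.log(numDocs + k));
    }
}
```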

        One consequence of this approach, or any approach along these lines, is that the indexDivisor will vary across the segments, both in a single index and across indexes. It seems to me from the code that this should work fine.

        This leaves the issue of how to best specify an arbitrary formula. This requires a method to compute the max cached terms allowed for a segment based on the number of docs in the segment, the number of terms in the segment's index, and possibly other factors. The most direct way to do this is to introduce an interface, e.g. TermInfosConfigurer, to define the method signature, and to add setTermInfosConfigurer as an alternative to setTermInfosIndexDivisor. It would need to be in all the same places.

        A more general approach would be to introduce an IndexConfigurer class which over time could hold additional methods like this. It could even replace the current setters on IndexReader (as well as IndexWriter, etc.) with a more general mechanism that would allow dynamic parameters used to configure any classes in the index structure. Each constructor would be passed the IndexConfigurer and call getters or other methods on it to obtain its config. The methods could provide constant values or dynamic formulas.

        I'm going to implement the straightforward solution at the moment in our older version of Lucene, then will sync up to whatever you guys decide is best for the trunk later.

        manawiz Chuck Williams added a comment -

        termInfosConfigurer.patch extends the termInfosIndexDivisor mechanism to allow dynamic management of this parameter. A new interface, TermInfosConfigurer, allows specification of a method, getMaxTermsCached(), that bounds the size of the in-memory term infos as a function of the segment name, segment numDocs, and total segment terms. This bound is then used to automatically set termInfosIndexDivisor whenever a TermInfosReader reads the term index. This mechanism provides a simple way to ensure that the total amount of memory consumed by the term cache is bounded by, say, O(log(numDocs)).

        All Lucene core tests pass. I'm using another version of this same patch in Lucene 2.1+ in an application that has indexes with binary term pollution, using the TermInfosConfigurer to dynamically bound the term cache in the polluted segments.

        Tried to test contrib, but it appears gdata-server needs external libraries I don't have to compile.

        Michael, this patch applies cleanly to today's Lucene trunk. I'd appreciate if you could verify one thing. Lucene 2.3 has the incremental reopen mechanism (can't wait to get that!), new since Lucene 2.1. It appears that reopen of a segment reuses the same TermInfosReader and thus does not need to configure a new one. I've implemented that part of the patch with this assumption.
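Based on the description above, the interface presumably has a shape along these lines. The exact signature is an assumption reconstructed from the comment (segment name, segment numDocs, total segment terms), not the actual patch contents:

```java
// Assumed shape of the TermInfosConfigurer interface described in the
// patch; parameter names and types are reconstructed, not verbatim.
public interface TermInfosConfigurer {

    /**
     * Return the maximum number of index terms to cache in RAM for the
     * given segment; TermInfosReader would derive its indexDivisor from
     * this bound when loading the term index.
     */
    long getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentTotalTerms);
}
```

A caller could supply, say, a square-root bound per segment, and each segment would get its own divisor.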

        mikemccand Michael McCandless added a comment -

        Thanks Chuck for such a wonderfully thorough patch & unit tests, and
        for adding the methods to ParallelReader, too (I had missed it the
        first time around)! The patch looks good.

        Should we use an abstract base class instead of interface for
        TermInfosConfigurer so we can add additional methods in the future
        without breaking back compatibility?

        Also I think we should mark this API as advanced, somewhat
        experimental and subject to change?

        cutting Doug Cutting added a comment -

        I think we should be cautious about adding a new public interface or abstract class to support just this feature. If we want to add a generic configuration API for Lucene, then I'd prefer something fully general, like what I proposed on the mailing list, not something specific to configuring TermInfosReader. Otherwise we'll keep adding new configuration interfaces and adding more parameters to IndexReader constructors each time we wish to make some obscure feature configurable.

        http://www.gossamer-threads.com/lists/lucene/java-dev/54421#54421

        In the model proposed there, adding a new configuration parameter involves just adding a new static method to the public class that implements a new configurable feature.

        manawiz Chuck Williams added a comment -

        I agree a general configuration system would be much better. Doug, we use a similar method to what you described in our application.

        TermInfosConfigurer is slightly different though since the desired config is a method that implements a formula, rather than just a value. This could still be done more generally by allowing methods as well as properties or setters on a higher level configuration object.

        I didn't want to take on the broader issue just for this feature.

        Michael I agree with both of your points.

        I'd be happy to clean up this patch if you guys provide some guidance for what would make it acceptable to commit.

        mikemccand Michael McCandless added a comment -

        Maybe, instead, we should simply make it "easy" to subclass
        TermInfosReader whenever a SegmentReader wants to instantiate it?

        I.e., the formula is such an advanced use case that it seems appropriate
        to subclass instead of trying to break it out into a special
        interface/abstract class?

        Of course, we need to know this class at SegmentReader construction
        time, so I think to specify it we should in fact take Doug's suggested
        approach using generic properties.

        The challenge with Lucene (and Hadoop) is how can you reach deep down
        into a complex IndexReader.open static method call to change various
        details of the embedded *Readers while they are being constructed,
        and, after they are constructed... I agree it is messy now that we
        must propagate the setTermInfosIndexInterval method up the *Reader
        hierarchy when not all Readers would even use a TermInfosReader.

        So ... maybe we 1) implement generic Lucene properties w/ static
        classes/methods to set/get these properties, then 2) remove
        set/getTermInfosIndexInterval from *Reader and make a generic property
        for it instead, and 3) add another property that allows you to specify
        the Class (or String name) of your TermInfosReader subclass
        (and make it non-final)?

        cutting Doug Cutting added a comment -

        What class would we put TermInfosReader-specific setters & getters on, since that class is not public? Do we make TermInfosReader public or leave it package-private? My intuition is to leave it package-private for now, in order to retain freedom to re-structure w/o breaking applications, and because making it public would drag a lot of other stuff into the public. We could consider making SegmentReader public, so that there's a public class that corresponds to the concrete index implementation, but that'd also drag more stuff public (like DirectoryIndexReader).

        I'm also not yet convinced that it is critical to support arbitrary formulae for this feature. Sure, it would be nice, but it has costs, like increasing public APIs that must be supported. Folks have done fine without this feature for many years. Adding a simple integer divisor is a sufficient initial step here.

        So, even if we add a configuration system, I think the setter methods could still end up on IndexReader. The difference is primarily whether the methods are:

        public void setTermIndexInterval(int interval);
        public void setTermIndexDivisor(int divisor);

        or

        public static void setTermIndexInterval(LuceneProps props, int interval);
        public static void setTermIndexDivisor(LuceneProps props, int divisor);

        With the latter just a façade that uses package-private stuff. I think the latter style will be handy as we start adding parameters to, e.g., Query classes. In those cases we'll probably want façades too, since a Query setter will probably really tweak something for a private Scorer class. In the case of indexes, however, we don't have a public, concrete class.

        Another option is to make a public class whose purpose is just to hold such parameters, something like SegmentIndexParameters. That'd be my first choice and was the direction I pointed in my initial proposal, but with considerably less explanation.
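A sketch of what such a façade over a generic properties object might look like, assuming the simplest possible key/value store. All implementation details here are invented for illustration; only the class names (LuceneProps, SegmentIndexParameters) come from the discussion:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the façade style: public static setters/getters that store
// values in a generic properties object later read by package-private
// index code. Every detail below is an assumption, not Lucene API.
public class SegmentIndexParameters {

    /** Minimal stand-in for the proposed LuceneProps holder. */
    public static class LuceneProps {
        private final Map<String, Object> values = new HashMap<>();
        void set(String key, Object value) { values.put(key, value); }
        Object get(String key, Object defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    private static final String DIVISOR = "termIndexDivisor";

    public static void setTermIndexDivisor(LuceneProps props, int divisor) {
        props.set(DIVISOR, divisor);
    }

    public static int getTermIndexDivisor(LuceneProps props) {
        return (int) props.get(DIVISOR, 1); // divisor defaults to 1
    }
}
```

The package-private TermInfosReader would then read the property through the same façade, keeping the internal class out of the public API.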

        manawiz Chuck Williams added a comment -

        I can report that in our application having a formula is critical. We have no control over the content our users index, nor in fact do they. These are arbitrary documents. We find a surprising number of them contain embedded encoded binary data. When those are indexed, Lucene's memory consumption skyrockets, either bringing the whole app down with an OOM or slowing performance to a crawl due to excessive GCs reclaiming a tiny remaining working memory space.

        Our users won't accept a solution like "wait until the problem occurs, then increment your termIndexDivisor." They expect our app to manage this automatically.

        I agree that making TermInfosReader, SegmentReader, etc. public classes is not a great solution. The current patch does not do that. It simply adds a configurable class that can be used to provide formula parameters as opposed to just value parameters. At least for us, this special case is sufficiently important to outweigh any considerations of the complexity of an additional class.

        A single configuration class could be used at the IndexReader level that provides for both static and dynamically-varying properties through getters, some of which take parameters.

        Here is another possible solution. My current thought is that the bound should always be a multiple of sqrt(numDocs). E.g., see Heap's Law here: http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html

        I'm currently using this formula in my TermInfosConfigurer:

        int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

        This has Heap's law as foundation. I provide TERM_BOUNDING_MULTIPLIER as the config parameter, with 0 meaning don't do this. I also provide a TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with a manually specified constant amount.

        If that approach would be acceptable to lucene in general, then we just need two static parameters. However, I don't have enough experience with how well this formula works in our user base yet to know whether or not we'll tune it further.
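The formula above can be packaged as a small helper to see concrete numbers. The multiplier value used below is an arbitrary example (the comment does not state one), while the interval of 128 matches Lucene's default termIndexInterval:

```java
// Worked sketch of the Heap's-law-style bound from the comment above.
// TERM_BOUNDING_MULTIPLIER here is an example value, not a recommendation.
public class HeapsLawBound {

    static final double TERM_BOUNDING_MULTIPLIER = 1000.0; // assumed example value
    static final int TERM_INDEX_INTERVAL = 128;            // Lucene's default interval

    /** Max index terms to cache for a segment, growing as sqrt(numDocs). */
    static int maxIndexTermsCached(int segmentNumDocs) {
        return (int) (1 + TERM_BOUNDING_MULTIPLIER
                          * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);
    }
}
```

Because the bound grows as the square root of numDocs, a segment polluted with binary terms gets a large effective divisor while clean small segments keep divisor 1.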

        cutting Doug Cutting added a comment -

        > We find a surprising number of them contain embedded encoded binary data.

        It sounds like a detector for this would be very useful. It would, e.g., substantially speed updates of such indexes, and not slow searches of them like a divisor does. At Excite we evolved effective heuristics for wordness to keep our dictionaries from exploding. Perhaps you should look into that? Also, it sounds like you might increase your default term index interval, since it sounds like you have big indexes with noisy data.

        > Our users won't accept a solution like, wait until the problem occurs and then increment your termIndexDivisor. They expect our app to manage this automatically.

        You could look at the size of the .tii files before you open an index, and, if they're too large, set the divisor automatically as you see fit.
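Doug's heuristic might be sketched like this. The threshold parameter and the directory scan are assumptions for illustration; nothing here is actual Lucene code:

```java
import java.io.File;

// Sketch of the suggestion above: sum the .tii file sizes before opening
// an index and derive a divisor from them. The byte threshold and the
// file-name scan are illustrative assumptions.
public class TiiSizeHeuristic {

    /** Smallest divisor so that totalTiiBytes / divisor <= maxTiiBytes. */
    static int divisorFor(long totalTiiBytes, long maxTiiBytes) {
        if (totalTiiBytes <= maxTiiBytes) return 1;
        return (int) ((totalTiiBytes + maxTiiBytes - 1) / maxTiiBytes); // ceiling division
    }

    /** Total size of the term-index (.tii) files in an index directory. */
    static long totalTiiBytes(File indexDir) {
        long total = 0;
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.getName().endsWith(".tii")) total += f.length();
            }
        }
        return total;
    }
}
```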

        > int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

        This sounds like a fine approach.

        mikemccand Michael McCandless added a comment -

        What class would we put TermInfosReader-specific setters & getters on, since that class is not public? Do we make TermInfosReader public or leave it package-private? My intuition is to leave it package-private for now, in order to retain freedom to re-structure w/o breaking applications, and because making it public would drag a lot of other stuff into the public. We could consider making SegmentReader public, so that there's a public class that corresponds to the concrete index implementation, but that'd also drag more stuff public (like DirectoryIndexReader).

        Agreed: package private. People who do advanced things should be fine
        with that.

        Another option is to make a public class whose purpose is just to only such parameters, something like SegmentIndexParameters. That'd be my first choice and was the direction I pointed in my initial proposal, but with considerably less explanation.

        So I took a closer look at making generic properties by coding up
        Doug's approach (attached patch).

        I replaced *#setTermInfosIndexDivisor with a separate
        SegmentIndexProperties class that has static methods to set/get
        termIndexDivisor, and added/threaded down ctors that allow you to pass
        a LuceneProperties when opening an IndexReader.

        I came up with a number of questions along the way:

        • Who should know/store the default value for a given property?
          TermIndexDivisor defaults to 1.
          Is this stored in that static facade class (a)? Or, passed in as
          defaultValue arg by TermInfosReader when it looks up the property
          (b)? Or, do we make a base DefaultLuceneProperties that has the
          default set for all properties (c)?
          (b) is nice because I feel like the default should live in the
          class that uses it, but then that's bad because the outside world
          can't see the default value.
        • Every property must clearly define when it will be looked at. So
          for termIndexDivisor in the javadoc we would say "it's used only
          when the termInfos index is loaded (once)". This means changing
          that property after termInfos index is loaded has no effect.
        • We should presumably create a default LuceneProperties to save
          checking for props != null everywhere when user didn't make their
          own props. This favors option (c) in the first bullet above.
        • Presumably once you've created a class, passing in your props
          instance, you cannot later install a new props instance. The
          LuceneProperties class is "write once".
        • We would need guidelines for when something should be an arg to
          the ctor versus a setter/getter on the class. I think there are
          shades of gray here.
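To make the "write once" and defaulting questions above concrete, here is a minimal sketch of what such a LuceneProperties class could look like. This is a hypothetical illustration, not the committed API: the class name, property names, and freeze-on-first-read behavior are all assumptions combining option (c) (defaults live in the properties object) with the write-once rule.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not actual Lucene API: a write-once properties
// holder where defaults for all known properties live on the instance
// itself (option c), so callers can always see them.
public class LuceneProperties {
    private final Map<String, Integer> props = new HashMap<String, Integer>();
    private boolean frozen = false;

    public LuceneProperties() {
        // Default for the term index divisor is 1, i.e. load every
        // termIndexInterval'th term into RAM.
        props.put("termIndexDivisor", 1);
    }

    public synchronized void set(String name, int value) {
        if (frozen) {
            throw new IllegalStateException("properties are write-once after first use");
        }
        props.put(name, value);
    }

    // A reader calls this at the moment the property is consumed (e.g.
    // when the termInfos index is loaded); later sets are rejected, which
    // documents that changing the value afterwards would have no effect.
    public synchronized int get(String name) {
        frozen = true;
        Integer v = props.get(name);
        if (v == null) {
            throw new IllegalArgumentException("unknown property: " + name);
        }
        return v;
    }

    public static void main(String[] args) {
        LuceneProperties p = new LuceneProperties();
        p.set("termIndexDivisor", 2);
        System.out.println("termIndexDivisor=" + p.get("termIndexDivisor"));
        try {
            p.set("termIndexDivisor", 4); // too late: already consumed
        } catch (IllegalStateException e) {
            System.out.println("rejected late set");
        }
    }
}
```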

        After this, I suddenly realized if we indeed make termIndexDivisor a
        generic property, it's actually hard for Chuck to then do his formula
        by looking at the size of the .tii file: when the index has multiple
        segments, you would presumably need to set different indexDivisors for
        each segment, but the properties class only lets you set one global
        value.

        You could carefully set the property, then somehow get ahold of just
        that one SegmentReader and have it load the term index, then move onto
        the next one, etc, but that's quite messy.

        Note that this limitation is also the case with the top-level
        setTermInfosIndexDivisor as it now stands in trunk – it's not easy to
        set different index divisors per segment.

        It almost feels like we should have "hooks" that are invoked at
        certain times, like when we are about to load the term infos index,
        that give the application a chance to change something...

        manawiz Chuck Williams added a comment -

        > It almost feels like we should have "hooks" that are invoked at
        > certain times, like when we are about to load the term infos index,
        > that give the application a chance to change something...

        I agree with the need for some kind of hook. This is what TermInfosConfigurer is. It calls a method whenever a SegmentReader reads an index, to obtain the parameters (termIndexDivisor) that should be used to configure the TermInfosReader.

        Why not make the setters/getters on SegmentIndexProperties regular non-static methods, and allow hook methods as well? E.g., setTermIndexDivisor(), getTermIndexDivisor(), getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentNumTerms). Non-static methods make the defaulting straightforward and allow for subclassing to override hook methods.
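A rough sketch of this hook style might look like the following. This is an assumption-laden illustration, not the TermInfosConfigurer that was actually patched: the class name, hook signature, and the per-segment policy in the subclass are hypothetical, chosen only to show how non-static getters plus an overridable hook would let an application choose a different divisor for each segment.

```java
// Hypothetical sketch (not the committed Lucene API): instance-based
// properties with a per-segment hook that a subclass can override.
public class SegmentIndexProperties {
    private int termIndexDivisor = 1; // default lives on the instance

    public void setTermIndexDivisor(int divisor) {
        this.termIndexDivisor = divisor;
    }

    public int getTermIndexDivisor() {
        return termIndexDivisor;
    }

    // Hook invoked when a SegmentReader is about to load its term index.
    // The default just returns the global setting; a subclass can compute
    // a per-segment value from the segment's statistics.
    public int getTermIndexDivisor(String segmentName, int segmentNumDocs, long segmentNumTerms) {
        return getTermIndexDivisor();
    }

    public static void main(String[] args) {
        // Example subclass: grow the divisor for segments whose term
        // dictionary has exploded (e.g. accidentally indexed binary terms).
        SegmentIndexProperties props = new SegmentIndexProperties() {
            @Override
            public int getTermIndexDivisor(String name, int numDocs, long numTerms) {
                return numTerms > 10000000L ? 4 : getTermIndexDivisor();
            }
        };
        System.out.println(props.getTermIndexDivisor("_0", 1000, 50000L));    // normal segment
        System.out.println(props.getTermIndexDivisor("_1", 1000, 20000000L)); // bloated segment
    }
}
```

Because the hook receives the segment name and statistics, the per-segment limitation Michael describes above goes away without any global state.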

        > It sounds like a detector for this would be very useful. It would, e.g., substantially
        > speed updates of such indexes, and not slow searches of them like a divisor does.
        > At Excite we evolved effective heuristics for wordness to keep our dictionaries from exploding.

        Yes, we are pursuing that approach as well, but we have some stringent requirements in our market. E.g., we cannot filter any valid content, because searches must be guaranteed to find all matching results. As a result, we cannot impose any maximum length for documents.

        Any type of binary content recognizer would either need to be 100% accurate, which is impossible, or require human intervention to validate filtering. For a human-intervention approach to be viable the false positive rate must be tiny; to be effective the false negative rate must be tiny. Although invalid content is pretty easy for people to recognize, I've found so far that high-accuracy recognition rules are surprisingly subtle.

        Do you by chance know of any quality work in this area?

        > > int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

        > This sounds like a fine approach.

        It seems to be working ok, but there is one issue. Heaps' Law is based on the total number of tokens in the content, not the total number of documents. I.e., longer documents will generate more distinct terms than shorter documents. For large segments the use of numDocs works ok due to statistical averaging, but for smaller segments there are errors. I may loosen the bound somewhat on smaller segments in order to allow for their larger standard deviation.

        If Lucene indexes tracked totalTokens (with duplicates, i.e. not numDistinctTokens) that would be perfect, but they don't. I don't know whether or not there would be other good uses for totalTokens but mention its relevance here in case there are.
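To make the bound formula quoted above concrete, here is a small worked example. The value of TERM_BOUNDING_MULTIPLIER is an assumed tuning constant for illustration only (the thread does not give Chuck's actual value); TERM_INDEX_INTERVAL uses Lucene's common default of 128.

```java
// Worked example of the bound formula quoted above:
//   bound = (int)(1 + TERM_BOUNDING_MULTIPLIER * sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL)
// Per Heaps' Law, distinct terms grow sublinearly (roughly like a power of
// collection size), so this caps how many indexed terms a segment loads
// into RAM as a function of its document count.
public class TermBound {
    static final double TERM_BOUNDING_MULTIPLIER = 1000.0; // assumed, for illustration
    static final int TERM_INDEX_INTERVAL = 128;            // common Lucene default

    static int bound(int segmentNumDocs) {
        return (int) (1 + TERM_BOUNDING_MULTIPLIER * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);
    }

    public static void main(String[] args) {
        // sqrt(10000) = 100  -> 1 + 1000*100/128  = 782 (truncated)
        System.out.println(bound(9999));
        // sqrt(1000000) = 1000 -> 1 + 1000*1000/128 = 7813 (truncated)
        System.out.println(bound(999999));
    }
}
```

Note how the bound grows only with the square root of the document count, which is exactly why very small segments (where per-document variance dominates) may need the looser treatment Chuck describes.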

        mikemccand Michael McCandless added a comment -

        I think for 2.3 we should go with the approach as currently committed, and take this ongoing debate into 2.4? I'll mark this as 2.4 target.

        mikemccand Michael McCandless added a comment -

        Changing fix version from 2.4 to Unknown.

        markrmiller@gmail.com Mark Miller added a comment -

        Should any further improvements be in a new issue? It confuses tracking in changes.txt to rework issues after placing them there, doesn't it? The issues could then be linked.

        markrmiller@gmail.com Mark Miller added a comment -

        This issue was resolved - let's open a new one if we want to do more.


          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless