Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This is a companion to FieldCacheRangeFilter, except that it operates on a set of terms rather than a range. It works best when the set is comparatively large or the terms are comparatively common.
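
      For orientation, here is a minimal usage sketch. It is not one of the attachments below; it assumes the Lucene 2.9 API (the FieldCacheTermsFilter(String field, String... terms) constructor and IndexSearcher.search(Query, Filter, int)) and uses made-up field and term values:

      import java.io.IOException;
      import org.apache.lucene.search.FieldCacheTermsFilter;
      import org.apache.lucene.search.Filter;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.MatchAllDocsQuery;
      import org.apache.lucene.search.TopDocs;

      public class FieldCacheTermsFilterUsage {
        // Restrict a query to documents whose single-valued "country" field holds
        // one of the allowed values; the field's StringIndex is built lazily on
        // first use and reused for later searches on the same reader.
        public static TopDocs countriesOnly(IndexSearcher searcher) throws IOException {
          Filter filter = new FieldCacheTermsFilter("country", "US", "CA", "MX");
          return searcher.search(new MatchAllDocsQuery(), filter, 10);
        }
      }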

      Attachments

      1. FieldCacheTermsFilter.java
        2 kB
        Tim Sturge
      2. FieldCacheTermsFilter.java
        2 kB
        Tim Sturge
      3. LUCENE-1487.patch
        7 kB
        Shalin Shekhar Mangar

        Activity

        Tim Sturge added a comment -

        FieldCacheTermsFilter using OpenBitSet.fastGet()

        Otis Gospodnetic added a comment -

        Would it be possible to reformat to use Lucene code style and add a bit of javadoc/unit test? Eclipse and IDEA styles are at the bottom of http://wiki.apache.org/lucene-java/HowToContribute

        Tim Sturge added a comment -

        No problem at all. Should I assume this means the idea is generally considered sound, and the only question is getting something with a sufficient level of tests, docs, and finish?

        I was expecting to get comments about the implementation first; last time what ended up going in was very different (in good ways) from my initial submission.

        Mark Miller added a comment -

        Hold out, Tim; you're likely to get further comments before it goes in. I think Otis was just suggesting we start with those changes. Once your code is in the right format, you're more likely to get a committer to spend some time with it. Sometimes we just reformat and add the tests ourselves, depending on a host of factors, but in general you're more likely to get good comments faster if that work has already been done.

        It's a fair question to ask whether the idea is sound, but just posting the work doesn't necessarily imply that you are looking for that advice before putting more work into what you have done. And many times questions do go unanswered; they are missed, or people don't have the time at the moment. So it's best to supply all of this stuff, unless you are prepared for a wait if there is no current interest in going over the patch.

        Tim Sturge added a comment -

        I'm running a bit behind this week and I'm out most of next week, so it may be a while before I get to this.

        One thing I hope will be helpful in the interim is to repost here the java-dev exchange that led to me posting this here; I suspect that many people who watch JIRA don't necessarily read java-dev as well, and I hope the postings are informative.

        Here's the exchange:

        On 12/10/08 1:13 PM, "Tim Sturge" <tsturge@hi5.com> wrote:

        > Yes (mostly). It turns those terms into an OpenBitSet on the term array.
        > Then it does a fastGet() in the next() and skipTo() loops to see if the term
        > for that document is in the set.
        >
        > The issue is that fastGet() is not as fast as the two inequalities in FCRF.
        > I didn't directly benchmark FCTF against FCRF because I had a different
        > application in mind for FCTF (location boxes). However it wasn't as
        > efficient in that case as directly realizing the bit sets. This was mostly
        > because in the application I had in mind there were a lot (>100K) of terms
        > with relatively low frequency and queries that needed only a few hundred
        > terms in the set.
        >
        > I tried a sorted list of terms and Arrays.binarySearch() but that is way
        > slower as is Set<Integer> (no surprise there). I was thinking about a custom
        > hash table implementation but I'm not hopeful; it increases cycle cost and
        > means
        >
        > So it is efficient but for a more limited set of cases than FCRF. My gut
        > feeling is that FCRF is a better solution for "most" range filters, whereas
        > FCTF is a better solution for "some" term set filters (versus creating
        > TermsFilter objects on the fly each time). It all depends on how common the
        > terms are and how large the sets of terms are. Lots of terms (or a few very
        > common terms) it wins. A few less common terms it loses.
        >
        > I'll open a JIRA issue for it.
        >
        > Tim
        >
        > On 12/10/08 12:45 PM, "Michael McCandless" <lucene@mikemccandless.com>
        > wrote:
        >
        >>
        >> It'd be great to get this into Lucene.
        >>
        >> Does FieldCacheTermsFilter let you specify a set of arbitrary terms to
        >> filter for, like TermsFilter in contrib/queries? And it's space/time
        >> efficient once FieldCache is populated?
        >>
        >> Mike
        >>
        >> Tim Sturge wrote:
        >>
        >>> Mike, Mike,
        >>>
        >>> I have an implementation of FieldCacheTermsFilter (which uses field
        >>> cache to
        >>> filter for a predefined set of terms) around if either of you are
        >>> interested. It is faster than materializing the filter roughly when
        >>> the
        >>> filter matches more than 1% of the documents.
        >>>
        >>> So it's not better for a large set of small filters (which you can
        >>> materialize on the spot) but it is better for a small set (but more
        >>> than 32) of
        >>> large filters.
        >>>
        >>> Let me know if you're interested and I'll send it in.
        >>>
        >>> Tim
        >>>
        >>> On 12/10/08 3:34 AM, "Michael McCandless"
        >>> <lucene@mikemccandless.com> wrote:
        >>>
        >>>>
        >>>> In your approach, roughly how many filters do you have cached? It
        >>>> seems like it could be quite a few (one for each color, one for each
        >>>> type, etc)?
        >>>>
        >>>> You might be able to modify the new (on Lucene trunk)
        >>>> FieldCacheRangeFilter to achieve this same filtering without actually
        >>>> having to materialize the full bitset for each.
        >>>>
        >>>> Mike
        >>>>
        >>>> Michael Stoppelman wrote:
        >>>>
        >>>>> Yeah looks similar to what we've implemented for ourselves
        >>>>> (although I
        >>>>> haven't looked at the implementation). We've got quite a custom
        >>>>> version of
        >>>>> lucene at this point. Using Solr at this point really isn't a viable
        >>>>> option,
        >>>>> but thanks for pointing this out.
        >>>>>
        >>>>> M
        >>>>>
        >>>>> On Tue, Dec 9, 2008 at 1:47 AM, Michael McCandless <
        >>>>> lucene@mikemccandless.com> wrote:
        >>>>>
        >>>>>>
        >>>>>> This use case sounds a lot like faceted navigation, which Solr
        >>>>>> provides.
        >>>>>>
        >>>>>> Mike
        >>>>>>
        >>>>>>
        >>>>>> Michael Stoppelman wrote:
        >>>>>>
        >>>>>> Hi all,
        >>>>>>>
        >>>>>>> I'm working on upgrading to Lucene 2.4.0 from 2.3.2 and was trying
        >>>>>>> to
        >>>>>>> integrate the new DocIdSet changes since
        >>>>>>> o.a.l.search.Filter#bits() method
        >>>>>>> is now deprecated. For our app we actually heavily rely on bits
        >>>>>>> from the
        >>>>>>> Filter to do post-query filtering (I explain why below).
        >>>>>>>
        >>>>>>> For example, if someone searches for product: "ipod" and then
        >>>>>>> filters a
        >>>>>>> type: "nano" (e.g. mini/nano/regular) AND color: "red" (e.g.
        >>>>>>> red/yellow/blue). In our current model the results are gathered in
        >>>>>>> the
        >>>>>>> following way:
        >>>>>>>
        >>>>>>> 1) "ipod" w/o attributes is run and the results are stored in a
        >>>>>>> hitcollector
        >>>>>>> 2) "ipod" results are now filtered for color="red" AND type="mini"
        >>>>>>> using
        >>>>>>> the
        >>>>>>> lucene Filters
        >>>>>>> 3) The filtered results are returned to the user.
        >>>>>>>
        >>>>>>> The reason that the attributes are filtered post-query is so that
        >>>>>>> we can
        >>>>>>> return the other types and colors the user can filter by in the
        >>>>>>> future.
        >>>>>>> Meaning the UI would be able to show "blue", "green", "pink",
        >>>>>>> etc... if we
        >>>>>>> pre-filtered results by color and type beforehand we wouldn't
        >>>>>>> know what
        >>>>>>> the
        >>>>>>> other filter options would be there for a broader result set.
        >>>>>>>
        >>>>>>> Does anyone else have this use case? I'd imagine other folks are
        >>>>>>> probably
        >>>>>>> doing similar things to accomplish this.
        >>>>>>>
        >>>>>>> M

        Tim Sturge added a comment -

        Mark, Otis, looking back over the bug history I totally see where you are coming from; I do look like I've just dumped this here without explanation, which wasn't my intention.

        Honestly, I don't really know how useful this is; I think there's a set of cases where it works very well, but I'm unsure how comparatively large that set is. You can think of it as adding a level of indirection (from documents to terms) to filtering.

        The alternative (at least as far as I can see) is to do a union by term of sorted docid lists (which is fundamentally what a DisjunctionQuery does, I think). There may well be other options.

        Michael McCandless added a comment -

        I think this is a useful filter impl, and a nice companion to FCRF.
        I'd like to see it committed; formatting & test case are good next
        steps.

        TermsFilter (in contrib/queries) does the same thing, but creates a
        bitset by docID up front by walking the TermDocs for each term. An OR
        query, wrapped in QueryWrapperFilter, is another way.

        This impl uses FieldCache to create a bitset by term number and then
        does a scan by docID, so it has different performance tradeoffs: for
        "enum" fields (far more docs than unique terms – like country, state,
        etc.) it's fast to create this filter, and then applying the filter is
        O(maxDocs) with a small constant factor.

        I think for many apps it means you do not have to cache the filter
        because creating & using it "on the fly" is plenty fast.
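
        To make the term-number bitset concrete, here is a rough sketch of the approach described above. It is not the committed code; it assumes the 2.x FieldCache.StringIndex (whose order array maps docID to term ordinal and whose sorted lookup array holds the term values, with slot 0 reserved for documents that have no value) plus OpenBitSet.fastSet/fastGet:

        import java.io.IOException;
        import java.util.Arrays;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.FieldCache;
        import org.apache.lucene.util.OpenBitSet;

        // Hypothetical helper, not the committed class: it keeps one bit per
        // term ordinal rather than one bit per document.
        class TermOrdinalMatcher {
          private final int[] order;        // docID -> term ordinal (from FieldCache)
          private final OpenBitSet allowed; // ordinals of the allowed terms

          TermOrdinalMatcher(IndexReader reader, String field, String[] terms) throws IOException {
            FieldCache.StringIndex fcsi = FieldCache.DEFAULT.getStringIndex(reader, field);
            order = fcsi.order;
            allowed = new OpenBitSet(fcsi.lookup.length);
            for (String t : terms) {
              // lookup is sorted, so a binary search finds the ordinal; slot 0 is
              // reserved for documents with no value, so the search starts at 1.
              int ord = Arrays.binarySearch(fcsi.lookup, 1, fcsi.lookup.length, t);
              if (ord >= 0) {
                allowed.fastSet(ord);
              }
            }
          }

          // The per-document test used inside next()/skipTo():
          // one array read plus one OpenBitSet.fastGet().
          boolean matches(int doc) {
            return allowed.fastGet(order[doc]);
          }
        }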

        Yonik Seeley added a comment -

        I think the name should be different since it only works with single-valued fields, unlike other TermFilters and TermQueries.

        Tim Sturge added a comment -

        Reformatted version. I'm happy to change the name if that's the consensus but I can't think of any better alternatives right now.

        Michael McCandless added a comment -

        Yonik, do you have any suggestions for a new name? (I agree a new name would be better but can't think of one offhand.)

        Yonik Seeley added a comment -

        FieldCacheStringFilter?
        FieldCacheValueFilter?
        FieldCacheMatchFilter?

        Not sure if any of those are better though. Perhaps it's enough that "FieldCache" is in the name to indicate that it only works on single-valued indexed fields that are able to be cached by the FieldCache.

        Michael McCandless added a comment -

        > Perhaps it's enough that "FieldCache" is in the name to indicate that it only works on single-valued indexed fields that are able to be cached by the FieldCache.

        This'd be my vote (keep the name FieldCacheTermsFilter).

        Tim, the new patch looks great! Could you add some javadocs describing the tradeoffs with this filter, and maybe a unit test? Thanks.

        Michael McCandless added a comment -

        Tim, are you still looking into this? Or if you don't have the itch/time, does anyone else want to add javadocs & unit test for FieldCacheTermsFilter to move this forwards?

        Shalin Shekhar Mangar added a comment -

        Attached a patch on trunk

        1. Adds Javadocs per the comments here and my understanding
        2. TestFieldCacheTermsFilter is a simple unit test

        Michael McCandless added a comment -

        Fabulous, thanks Shalin! I changed UN_TOKENIZED --> NOT_ANALYZED in the javadoc,
        and switched to MockRAMDirectory in the test. I'll commit shortly.

        Michael McCandless added a comment -

        Committed revision 738622. Thanks Tim & Shalin!

        Mark Miller added a comment -

        So the advantage appears to be that you can cache the field values and so calculate the filter faster for arbitrary terms, rather than having to calculate and cache a bitset for each set of terms as you would with TermsFilter, right? I think it should be easier to extract that info from the javadoc, and it should be clearer about exactly what the tradeoffs are and when I should choose which.

        • The FieldCacheTermsFilter is faster than building a TermsFilter each time.

        While I did figure it out eventually (if I figured it out right), I'm thinking it could be clearer. It could just be me though. I'm often a bit hazy.

        Michael McCandless added a comment -

        I agree: the wording can be improved. I'll take a stab at it.

        Michael McCandless added a comment -

        How about this:

        /**
         * A {@link Filter} that only accepts documents whose single
         * term value in the specified field is contained in the
         * provided set of allowed terms.
         * 
         * <p/>
         * 
         * This is the same functionality as TermsFilter (from
         * contrib/queries), except this filter requires that the
         * field contains only a single term for all documents.
         * Because of drastically different implementations, they
         * also have different performance characteristics, as
         * described below.
         * 
         * <p/>
         * 
         * The first invocation of this filter on a given field will
         * be slower, since a {@link FieldCache.StringIndex} must be
         * created.  Subsequent invocations using the same field
         * will re-use this cache.  However, as with all
         * functionality based on {@link FieldCache}, persistent RAM
         * is consumed to hold the cache, and is not freed until the
         * {@link IndexReader} is closed.  In contrast, TermsFilter
         * has no persistent RAM consumption.
         * 
         * 
         * <p/>
         * 
         * With each search, this filter translates the specified
         * set of Terms into a private {@link OpenBitSet} keyed by
         * term number per unique {@link IndexReader} (normally one
         * reader per segment).  Then, during matching, the term
         * number for each docID is retrieved from the cache and
         * then checked for inclusion using the {@link OpenBitSet}.
         * Since all testing is done using RAM resident data
         * structures, performance should be very fast, most likely
         * fast enough to not require further caching of the
         * DocIdSet for each possible combination of terms.
         * However, because docIDs are simply scanned linearly, an
         * index with a great many small documents may find this
         * linear scan too costly.
         * 
         * <p/>
         * 
         * In contrast, TermsFilter builds up an {@link OpenBitSet},
         * keyed by docID, every time it's created, by enumerating
         * through all matching docs using {@link TermDocs} to seek
         * and scan through each term's docID list.  While there is
         * no linear scan of all docIDs, besides the allocation of
         * the underlying array in the {@link OpenBitSet}, this
         * approach requires a number of "disk seeks" in proportion
         * to the number of terms, which can be exceptionally costly
         * when there are cache misses in the OS's IO cache.
         * 
         * <p/>
         * 
         * Generally, this filter will be slower on the first
         * invocation for a given field, but subsequent invocations,
         * even if you change the allowed set of Terms, should be
         * faster than TermsFilter, especially as the number of
         * Terms being matched increases.  If you are matching only
         * a very small number of terms, and those terms in turn
         * match a very small number of documents, TermsFilter may
         * perform faster.
         *
         * <p/>
         *
         * Which filter is best is very application dependent.
         */
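
         To put the comparison in code, a hedged construction-only sketch (it assumes the contrib/queries TermsFilter with its no-arg constructor and addTerm(Term), the FieldCacheTermsFilter(String, String...) constructor, and made-up field and term values):

         import org.apache.lucene.index.Term;
         import org.apache.lucene.search.FieldCacheTermsFilter;
         import org.apache.lucene.search.TermsFilter; // contrib/queries

         public class FilterChoice {
           // TermsFilter: walks TermDocs for each term when the filter is applied;
           // no FieldCache RAM, but every new term set pays the seek/scan cost again.
           static TermsFilter byDocId() {
             TermsFilter f = new TermsFilter();
             f.addTerm(new Term("state", "CA"));
             f.addTerm(new Term("state", "OR"));
             return f;
           }

           // FieldCacheTermsFilter: tests cached term ordinals at search time; the
           // first use per field pays the FieldCache build, later term sets are cheap.
           static FieldCacheTermsFilter byTermOrdinal() {
             return new FieldCacheTermsFilter("state", "CA", "OR");
           }
         }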
        
        Mark Miller added a comment -

        +1

        Shalin Shekhar Mangar added a comment -

        +1

        This is much more clear. Thanks Michael.

        Uwe Schindler added a comment -

        Sorry, I reopened the wrong issue, the correct class is FieldCacheRangeFilter.

        Closing again.


          People

          • Assignee:
            Unassigned
            Reporter:
            Tim Sturge
          • Votes:
            0
            Watchers:
            0
