Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      These classes implement inexpensive range filtering over a field containing a single term. They do this by building an integer array of term numbers (storing the term->number mapping in a TreeMap) and then implementing a fast, integer-comparison-based DocIdSetIterator.

      This code is currently being used to do age range filtering, but it could also be used for other date filtering, or in any application that needs multiple filters based on the same single-term field. I have an untested implementation of single-term filtering, and have considered but not yet implemented term-set filtering (useful for location-based searches) as well.

      The code here is fairly rough; it works, but it lacks javadocs, toString() and hashCode() methods, etc. I'm posting it here to find out whether there is wider interest in this feature; I don't mind cleaning it up, but I'd hate to go to the effort if it isn't going to make it into Lucene.
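
      To make the idea concrete, here is a minimal, hypothetical sketch of the approach described above (it is not the attached patch, and it is not tied to a particular Lucene DocIdSetIterator API version): each document's term is mapped to an ordinal in an int[], and a range filter then reduces to comparing that ordinal against two precomputed bounds.

      // Hypothetical sketch, not the attached patch: docOrds[doc] holds the ordinal of
      // the single term indexed for that document, and a range filter reduces to an
      // integer comparison against two precomputed ordinal bounds.
      final class OrdRangeIterator {
        private final int[] docOrds;            // docID -> term ordinal
        private final int lowerOrd, upperOrd;   // inclusive ordinal bounds of the range
        private int doc = -1;

        OrdRangeIterator(int[] docOrds, int lowerOrd, int upperOrd) {
          this.docOrds = docOrds;
          this.lowerOrd = lowerOrd;
          this.upperOrd = upperOrd;
        }

        /** Advances to the next matching document; returns false when exhausted. */
        boolean next() {
          while (++doc < docOrds.length) {
            int ord = docOrds[doc];
            if (ord >= lowerOrd && ord <= upperOrd) {
              return true;
            }
          }
          return false;
        }

        int doc() { return doc; }
      }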

      1. TestFieldCacheRangeFilter.patch
        17 kB
        Tim Sturge
      2. TermMultiFilter.java
        2 kB
        Tim Sturge
      3. RangeMultiFilter.java
        2 kB
        Tim Sturge
      4. RangeMultiFilter.java
        3 kB
        Tim Sturge
      5. PerfTest.java
        4 kB
        Uwe Schindler
      6. LUCENE-1461c.patch
        18 kB
        Tim Sturge
      7. LUCENE-1461b.patch
        25 kB
        Tim Sturge
      8. LUCENE-1461a.patch
        4 kB
        Paul Elschot
      9. LUCENE-1461.patch
        26 kB
        Michael McCandless
      10. LUCENE-1461.patch
        21 kB
        Uwe Schindler
      11. LUCENE-1461.patch
        37 kB
        Uwe Schindler
      12. LUCENE-1461.patch
        38 kB
        Uwe Schindler
      13. LUCENE-1461.patch
        52 kB
        Uwe Schindler
      14. LUCENE-1461.patch
        56 kB
        Uwe Schindler
      15. LUCENE-1461.patch
        57 kB
        Uwe Schindler
      16. LUCENE-1461.patch
        57 kB
        Uwe Schindler
      17. FieldCacheRangeFilter.patch
        1 kB
        Tim Sturge
      18. DisjointMultiFilter.java
        3 kB
        Tim Sturge

        Issue Links

          Activity

          Tim Sturge added a comment -

          Base code which builds the integer array.

          Tim Sturge added a comment -

          Constructs a virtual RangeFilter on top of an already existing DisjointMultiFilter. Note that the RangeFilter costs almost nothing once the DisjointMultiFilter already exists.

          Tim Sturge added a comment -

          Here's some benchmark data to demonstrate the utility. Results on a 45M document index:

          Firstly without an age constraint as a baseline:

          Query "+name:tim"
          startup: 0
          Hits: 15089
          first query: 1004
          100 queries: 132 (1.32 msec per query)

          Now with a cached filter. This is ideal from a speed standpoint, but as with most range-based queries there are too many possible start/end combinations to cache all the filters.

          Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on cached RangeFilter)
          startup: 3
          Hits: 11156
          first query: 1830
          100 queries: 287 (2.87 msec per query)

          Now with an uncached filter. This is awful.

          Query "+name:tim age:[18 TO 35]" (uncached ConstantScoreRangeQuery)
          startup: 3
          Hits: 11156
          first query: 1665
          100 queries: 51862 (yes, 518 msec per query, 200x slower)

          A RangeQuery is slightly better but still bad (and has a different result set).

          Query "+name:tim age:[18 TO 35]" (uncached RangeQuery)
          startup: 0
          Hits: 10147
          first query: 1517
          100 queries: 27157 (271 msec is 100x slower than the filter)

          Now with the prebuilt column stride filter:

          Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on prebuilt column stride filter)
          startup: 2811
          Hits: 11156
          first query: 1395
          100 queries: 441 (back down to 4.41msec per query)

          This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query.
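
          For reference, here is a rough sketch of how three of the variants above (cached filter, uncached constant-score range, plain RangeQuery) would typically be constructed against the Lucene 2.x-era API. This is illustrative only -- it is not the attached PerfTest code, and the field and term values simply mirror the benchmark:

          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.*;

          // Illustrative only -- not the attached PerfTest code.
          class RangeVariants {
            static Query cachedFilter() {          // ConstantScoreQuery on a cached RangeFilter
              BooleanQuery q = new BooleanQuery();
              q.add(new TermQuery(new Term("name", "tim")), BooleanClause.Occur.MUST);
              Filter age = new CachingWrapperFilter(new RangeFilter("age", "18", "35", true, true));
              q.add(new ConstantScoreQuery(age), BooleanClause.Occur.MUST);
              return q;
            }
            static Query uncachedConstantScore() { // uncached ConstantScoreRangeQuery
              BooleanQuery q = new BooleanQuery();
              q.add(new TermQuery(new Term("name", "tim")), BooleanClause.Occur.MUST);
              q.add(new ConstantScoreRangeQuery("age", "18", "35", true, true), BooleanClause.Occur.MUST);
              return q;
            }
            static Query plainRangeQuery() {       // uncached RangeQuery (expands to a term disjunction)
              BooleanQuery q = new BooleanQuery();
              q.add(new TermQuery(new Term("name", "tim")), BooleanClause.Occur.MUST);
              q.add(new RangeQuery(new Term("age", "18"), new Term("age", "35"), true), BooleanClause.Occur.MUST);
              return q;
            }
          }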

          Paul Elschot added a comment - - edited

          This is a nice tradeoff: reduced space (the equivalent of 32 bit-set range filters) for increased time (less than 2x slower), for any number of range filters on a single-term field.

          Tim Sturge added a comment -

          Thanks Paul.

          This solved a nasty performance itch with the system we are building at hi5.

          I'm looking into whether using a byte[] or short[] makes sense when there are fewer total terms (it will certainly save space; I don't know about performance).

          The other thing I wonder is whether you can use this for a set-based query (for example, a set of location grid-blocks). What I need there is a very fast Java integer set (hopefully much faster than java.util.HashSet<Integer>).

          Paul Elschot added a comment -

          For fields that have no more distinct values than fit into a short (at best 2^16 = 65536), using a short[] would make sense, I think. As the number of distinct field values can simply be counted in this context, it would make sense to simply replace the int[] with a short[] in that case. But it would only help to reduce space, and only by a factor of two.

          For a set-based query, the problem boils down to doing integer set membership in the iterator. For small sets, binary search should be fine. For larger ones an OpenBitSet would be preferable, but in this context that would only be feasible when the number of different terms is a lot smaller than the number of documents in the index.

          For location grid-blocks one needs to deal with more than one dimension. In such cases my first thought is to use indexed hierarchical prefixes in each dimension, because this allows skipTo() to be used on the documents for the intersection between the dimensions. (But there may be better ways; it's been a long time since I looked at the literature on this.)
          Do you need to index separate lower bounds and upper bounds on the data? That would complicate things.
          Without indexed bounds (i.e. point data only) in each dimension, it could make sense to use this multi range filter.

          Tim Sturge added a comment -

          For small subsets of a large set (in my case around 1,000 out of 1 million) I suspect a simple open hash may perform better than a binary search.

          For location blocks (point data) my plan is just to number the grid with N^2 numbers and create a set based on a circle around the desired place. Ideally this solution doesn't degrade with circle size, so it isn't necessary to do hierarchical prefixes, but I don't have benchmarks to support or refute that assumption. Agreed, bounded locations make this much trickier.

          Tim Sturge added a comment -

          I tried a short[] array and it is about 20% faster than the int[] array (I'm assuming this is a memory bandwidth issue).

          I also tried replacing the catch of the ArrayIndexOutOfBoundsException with an explicit bounds check in the loop, and found that the exception-handling approach is about 3% faster.

          Finally, I implemented TermMultiFilter as well which has about the same performance characteristics.

          Tim Sturge added a comment -

          Added TermMultiFilter.java

          Paul Elschot added a comment -

          I tried a short[] array and it is about 20% faster than the int[] array (I'm assuming this is a memory bandwidth issue.)

          20% is more than I expected. Have a look at LUCENE-1410 for optimal bit packing in a frame of reference. There are also some performance numbers there for different numbers of frame bits. (A short[] is equivalent to 16 frame bits.)
          This 20% means that it could well be worthwhile to always use such a frame for the docContents here.

          I would not expect TermMultiFilter to have an advantage over a TermFilter, since it does a linear search even for skipTo(). The only advantage it has is that it does the linear search from memory, whereas TermFilter does its skipping using the skip info in the index.

          Would anyone else have an idea where this could be added, in core or contrib, and what (new) package name could be used?

          Michael McCandless added a comment -

          It seems like the core class here (DisjointMultiFilter) is doing the same thing as FieldCache's StringIndex? I.e., it builds a data structure that maps String <-> ord and docID -> ord. So maybe we can merge DisjointMultiFilter into the FieldCache API.

          And then RangeMultiFilter is a great addition for quickly "spawning" numerous new RangeFilters, having pulled & stored the StringIndex from the FieldCache? So I think it should live in core org.apache.lucene.search.*? I'd prefer a different name (RangeMultiFilter implies it can filter over multiple ranges) but can't think of one. Or maybe we absorb it into RangeFilter, as a different "rewrite" method like "useFieldCache=true|false"?
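
          For readers unfamiliar with it, this is roughly what the FieldCache StringIndex referred to above provides (a small fragment; it assumes an open IndexReader named reader, a document id docID, and an example field name "age"):

          // Sketch of the FieldCache.StringIndex structure: parallel ord mappings.
          FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(reader, "age");

          int ord = si.order[docID];      // docID -> ord
          String value = si.lookup[ord];  // ord -> term value (lookup[0] is the "no value" slot)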

          Tim Sturge added a comment -

          Paul,

          Wow, I didn't realize people spent so much time on integer packing. I think there are lots of opportunities here, particularly if this ends up in the index (so the potential I/O cost becomes a factor as well as memory bandwidth).

          I agree that TermMultiFilter is not that useful; I mostly have it because I'm looking at TermsMultiFilter for location matching and wanted some benchmarks versus regular filters so I could compare set implementations.

          Mike,

          I hadn't looked at FieldCache before, but StringIndex does seem to be the same thing as DisjointMultiFilter (modulo using a String[] instead of a TreeMap). I'll port RangeMultiFilter to run on top of FieldCache and check that it is identical and performs similarly (which seems like a fairly sure bet once I figure out the FieldCache API).

          FieldCacheRangeFilter? (yeah, I know...)

          Paul Elschot added a comment -

          I didn't realize people spent so much time on integer packing.

          Well, it appears that the memory-CPU bus really is getting to be a bottleneck, and you're not the first one to discover that; see the papers on which LUCENE-1410 is based.
          Nevertheless I was surprised by a 20% performance increase when moving from int[] to short[].

          I'll port RangeMultiFilter to run on top of FieldCache.

          That means that bit packing could be confined to the FieldCache later on, which is good.
          At the moment I'm factoring out the exceptions in the 1410 code. The FieldCache may need to wait for that because it will probably not be using exceptions.
          Just think of the extreme case of a field that has only two indexed values, it would be effectively cached as a bit set.

          Tim Sturge added a comment -

          This is a version of RangeMultiFilter built on top of FieldCache. This is much cleaner; it automatically handles changing the IndexReader and no longer requires the user to manually build a separate DisjointMultiFilter.

          Performance is the same:

          Cached Range Filter:
          startup: 2
          Hits: 167390
          first query: 2009
          100 queries: 4733

          RangeMultiFilter + FieldCache
          startup: 0
          Hits: 167390
          first query: 5405
          100 queries: 8091

          ConstantScoreRangeQuery
          startup: 3
          Hits: 167390
          first query: 2012
          100 queries: 56620

          Boolean Query for Range
          startup: 0
          Hits: 121151
          first query: 3518
          100 queries: 118690

          Tim Sturge added a comment -

          Paul, Mike,

          FieldCache.StringIndex doesn't behave in the way I expect. In particular, the first element of the lookup[] array is null (which causes the binarySearch to NPE when you select a range wider than the one that actually exists.)

          Is this a bug in FieldCache? I would expect it to only contain terms actually in the index and I'm sincerely hoping that null is not a valid term.

          Tim Sturge added a comment -

          Looking at FieldCache and FieldDocSortedHitQueue, I am very tempted to change the null sentinel (which already causes lots of grief in the comparator) to an empty string.

          I'm not sure that Lucene is prepared to distinguish a field that is completely missing from an empty field (or a field that is analyzed away), and I think the null exception handling is a significant pain.

          Paul Elschot added a comment -

          Here's a patch for the latest RangeMultiFilter. I've changed the package to o.a.l.search, changed the layout (where's the tool to do that automatically?), and I've added the Apache License header, assuming that's OK given the earlier license grant.

          Paul Elschot added a comment -

          Tim,

          If there is code that depends on some particular null/empty-string behaviour of FieldCache, there should be a test for that, so just try to patch FieldCache as you need it, and then see whether all the tests still pass.
          This way of working is a bit pushy, but it gets things going, and it's still no more than a patch.

          Could you add some test code for RangeMultiFilter?

          Michael McCandless added a comment -

          In fact, could you add a test that actually indexes an empty-string token (you'll have to make your own "degenerate" TokenStream to do this, I think), to ensure that switching to an empty string as the sentinel doesn't break anything?

          Michael McCandless added a comment -

          Should we absorb RangeMultiFilter into RangeFilter, and add a "setMethod" to RangeFilter?

          Someday, I think RangeFilter and RangeQuery should be implemented using hierarchical ranges (there was a reference to a page in the wiki, specifically about date range searching, recently), which would be another method.

          Paul Elschot added a comment -

          Someday, I think RangeFilter and RangeQuery should be implemented using hierarchical ranges (there was a reference to a page in the wiki, specifically about date range searching, recently), which would be another method.

          I think you're referring to the hierarchical prefixes here:
          http://wiki.apache.org/jakarta-lucene/DateRangeQueries
          These hierarchical ranges require an analyzer to output all tokens "to the root" and a disjunction filter on the terms corresponding to the levels in the range. At each level a RangeMultiFilter could be used, but that would require a lot of memory. TermFilters would be better in that case I think.

          Uwe Schindler added a comment -

          I understood this in the same way. This is why I reported my developments going in this direction, perhaps for contrib, to the java-dev mailing list: http://www.gossamer-threads.com/lists/lucene/java-dev/67807

          Paul Elschot added a comment -

          Uwe,

          As it is already under APL 2.0, TrieRangeQuery and its utilities would make a nice addition to Lucene as a contrib package.

          Tim Sturge added a comment -

          Paul,

          Thanks for the updates. I'll see about toString() and hashCode() methods. Are we settled on RangeMultiFilter as the least confusing name?

          Mike, Paul,

          I'll play with the Lucene test infrastructure. Right now all the tests have been in my application, but I can make a clean build to try them out.

          Mike,

          I have a slight bias against folding RangeMultiFilter into RangeFilter due to the slight difference in semantics. RangeMultiFilter only works on single-term fields (which should probably be mentioned in the javadocs), whereas RangeFilter works on multiple-term fields as well.

          While I expect more than 95% of RangeFilter use cases are met by RangeMultiFilter (I suspect they are primarily dates, and otherwise prices and other numeric ranges), I bet there are some people who really do a text range search between "aardvark" and "antelope". Those people will unexpectedly break if they set "useFieldCache=true" or setMethod(). I would rather we add a comment to the RangeFilter javadocs to the effect of:

          "If you have a single term field (for example a date, or a price) that is repeatedly used in a RangeFilter with many different ranges, you should consider using RangeMultiFilter as a faster alternative to building a RangeFilter on this field for each query. You need to ensure that this field is untokenized, or that it always tokenizes to a single term."

          Michael McCandless added a comment -

          OK that makes sense – let's leave it as a separate class, and can you add that difference to the javadocs?

          Paul Elschot added a comment -

          Are we settled on RangeMultiFilter as the least confusing name?

          Names are important, although not as important as javadocs.
          I'm happy to leave the choice of name to someone with an English/... mother tongue.

          Tim Sturge added a comment -

          Progress report:

          Having written some javadocs, I think FieldCacheRangeFilter is a better name; without DisjointMultiFilter the "multi" in RangeMultiFilter is confusing (after all, it indexes a field containing a single term). So at the risk of repainting the bikeshed I will go with FieldCacheRangeFilter.

          The null versus "" distinction is completely confusing to me. I see this in ConstantScoreRangeQuery:

          // Map to RangeFilter semantics which are slightly different...
          RangeFilter rangeFilt = new RangeFilter(fieldName,
              lowerVal != null ? lowerVal : "", upperVal,
              lowerVal == "" ? false : includeLower,
              upperVal == null ? false : includeUpper,
              collator);

          which makes no sense to me at all.

          I'm also not sure it makes sense to allow indexing an empty field and to distinguish that case from there being nothing there. Please let me know if there is a use case. The lowest-impedance solution may be to write a version of binarySearch() that allows the first element to be null, and use that instead of Arrays.binarySearch().
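
          A minimal sketch of what such a null-tolerant lookup could look like (hypothetical code; the method the patch eventually adds is FieldCache.StringIndex.binarySearchLookup(), whose exact shape may differ):

          // Hypothetical sketch: binary search over StringIndex.lookup[], where
          // lookup[0] is the null sentinel for documents without a value.
          static int binarySearchLookup(String[] lookup, String key) {
            if (key == null) {
              return 0;                             // the null sentinel sorts before everything
            }
            int low = 1, high = lookup.length - 1;  // skip the null slot at index 0
            while (low <= high) {
              int mid = (low + high) >>> 1;
              int cmp = lookup[mid].compareTo(key);
              if (cmp < 0) {
                low = mid + 1;
              } else if (cmp > 0) {
                high = mid - 1;
              } else {
                return mid;                         // key found
              }
            }
            return -(low + 1);                      // not found; same convention as Arrays.binarySearch
          }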

          Tim Sturge added a comment -

          Here's the first cleanup.

          Changes:

          • RangeMultiFilter is now FieldCacheRangeFilter
          • FieldCache.StringIndex gains a binarySearchLookup() method that handles null
          • toString(), hashCode() and equals() methods added

          This hasn't been tested very well, but I wanted to post something before I left for Thanksgiving.

          Earwin Burrfoot added a comment -

          Somewhat off topic, but nonetheless, my two techniques for superfast range queries/filters:
          1. cache [from, null]+[null, to] filters instead of [from, to] and intersect them
          -> can tremendously improve cache hits for certain setups

          2. when indexing a field that will be used for a range filter, additionally index lower-resolution versions of it, then use a union of range filters over the different-resolution fields (see the sketch after this list), i.e.:
          a. we have several million documents with a date field spanning a few years with, say, minute precision (we'd like to sort on it afterward)
          b. we index additional fields with dates rounded down to something like years, months, days, hours (the best combination depends on the width of the queries you're most likely to perform; let's say it's day+hour for queries rarely spanning more than a month)
          c. we have a query like [2008-05-05 18:00 .. 2008-06-01 10:53], it is converted to -> hour:[05-05 18 .. 05-06 00) or day:[05-06 .. 06-01) or hour:[06-01 00 .. 06-01 10) or minute:[06-01 10:00 .. 06-01 10:53]
          -> massive win for ranges over fields having lots of high-selectivity terms, with timestamps being a good example, also salaries, coordinates, whatever
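
          To illustrate technique 2, here is a hedged sketch of how the example range in (c) could be expressed as a disjunction of per-resolution sub-ranges. The field names ("minute", "hour", "day") and date encodings are illustrative assumptions taken from the example above; this is not Earwin's actual (filter-based) code:

          import org.apache.lucene.search.BooleanClause;
          import org.apache.lucene.search.BooleanQuery;
          import org.apache.lucene.search.ConstantScoreRangeQuery;
          import org.apache.lucene.search.Query;

          // Union of sub-ranges over different-resolution fields, mirroring example (c).
          class MultiResolutionRange {
            static Query build() {
              BooleanQuery range = new BooleanQuery();
              range.add(new ConstantScoreRangeQuery("hour", "2008-05-05 18", "2008-05-06 00", true, false),
                        BooleanClause.Occur.SHOULD);
              range.add(new ConstantScoreRangeQuery("day", "2008-05-06", "2008-06-01", true, false),
                        BooleanClause.Occur.SHOULD);
              range.add(new ConstantScoreRangeQuery("hour", "2008-06-01 00", "2008-06-01 10", true, false),
                        BooleanClause.Occur.SHOULD);
              range.add(new ConstantScoreRangeQuery("minute", "2008-06-01 10:00", "2008-06-01 10:53", true, true),
                        BooleanClause.Occur.SHOULD);
              return range;
            }
          }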

          Michael McCandless added a comment -

          my two techniques for superfast range queries/filters

          I like those approaches – I think 2 is similar to the above and similar to Uwe's approach (described on java-dev). One nice property of these "factor the range into a set of OR/AND clauses" approaches is that RangeQuery no longer relies on the sort order of the terms, which means tricks like padding numeric terms are no longer needed, I think?

          This sudden burst of innovation around RangeQuery is very exciting!

          Uwe Schindler added a comment -

          The RangeQuery still relies on the sort order of terms (this is how it works). For storing terms with lower precision you have two possibilities:

          a) use another field name for each precision
          b) prefix the terms with a precision marker. The prefix is important for the sort order, so that all terms of one precision are in one "bunch" and not distributed between higher-precision terms.

          The first version of my TrieRangeQuery was invented before RangeFilter first appeared in Lucene. That version did exactly what was proposed here: combining multiple range queries with OR.

          For my latest implementation, based on filters, I did not use a BooleanQuery with OR'ed ranges because of resource usage: each RangeFilter needs an OpenBitSet instance, and all of them must be OR'ed during query execution. Using only one OpenBitSet for all range parts is more effective, I think. I am currently working on including my extension in the contrib-query package. I refactored the code a little bit, so the TrieRangeFilter is now separated from the query (and because of that could be used with e.g. filter caching). I think I will open an issue this afternoon.

          Michael McCandless added a comment -

          This patch looks good! I think it's ready to commit? I plan to commit in a day or two.

          I made some small changes – added CHANGES.txt entry, fixed whitespace, removed one unnecessary import.

          Thanks Tim!

          Michael McCandless added a comment -

          The RangeQuery still relies on the sort order of terms (this is how it works)

          Ahh OK. Allowing each field to provide its own Comparator may still be helpful then (Lucene doesn't allow this today since fields are always sorted in java char order) to avoid padding and other binary conversion tricks.

          Earwin Burrfoot added a comment - - edited

          RangeQuery no longer relies on the sort order of the terms, which means tricks like padding numeric terms are no longer needed, I think?

          I do rely on sort order for speed and simplicity, though I never used padding for numeric/date terms. All dates/numbers/something-else-special are converted to strings using a base-2^15 encoding (to keep the high bit = 0, as 0xFFFF is used somewhere within Lucene's intestines as an EOS marker, darn it!), plus an adjustment to preserve sort order for negative numbers in the face of unsigned java chars. This transformation is insanely fast, and produces well-compressed results (I have FAT read->mem / write->mem+disk indexes).
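
          A minimal sketch of such an order-preserving encoding, assuming 15 bits per char and a sign-bit flip to keep negative numbers ordered correctly (hypothetical code, not Earwin's actual implementation):

          // Hypothetical sketch: order-preserving int -> String encoding using 15-bit
          // characters, so that unsigned char-by-char comparison of the encoded strings
          // matches the numeric order of the original ints.
          public final class SortableIntEncoder {

            /** Encodes an int into a fixed-length String whose lexicographic order equals numeric order. */
            public static String encode(int value) {
              // Flip the sign bit so negative numbers sort before positive ones.
              long v = (value ^ 0x80000000L) & 0xFFFFFFFFL;
              char[] buf = new char[3];                // 3 * 15 bits >= 32 bits
              for (int i = 2; i >= 0; i--) {
                buf[i] = (char) (v & 0x7FFF);          // low 15 bits; high bit stays 0
                v >>>= 15;
              }
              return new String(buf);
            }

            /** Inverse of encode(). */
            public static int decode(String s) {
              long v = 0;
              for (int i = 0; i < 3; i++) {
                v = (v << 15) | s.charAt(i);
              }
              return (int) (v ^ 0x80000000L);          // undo the sign-bit flip
            }
          }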

          b) prefix the terms with a precision marker. The prefix is important for the sort order, so that all terms of one precision are in one "bunch" and not distributed between higher precsion terms.

          And you can no longer use this field for sorting, as it has more than one term for each document.

          For my last implementation, based on filters I did not use a BooleanQuery with OR'ed ranges because of resource usage

          Using filters here too

          Allowing each field to provide its own Comparator may still be helpful then

          But you still store strings in the index. So essentially you'll convert your value from T to String, store it, retrieve it, convert back to T in such a custom comparator, and finally compare. Why should I need that second conversion and custom comparators, if I can have an order-preserving bijective T<->String relation?

          Michael McCandless added a comment -

          But you still store strings in the index. So essentially you'll convert your value from T to String, store it, retrieve it, convert back to T in such a custom comparator, and finally compare. Why should I need that second conversion and custom comparators, if I can have order-preserving bijective T<->String relation?

          True, since you'll need to xform anyway for non-textual fields. Or maybe eventually we can simply allow T to be the key in the terms dict (so long as T.compareTo(T) exists), which KS/Lucy apparently does.

          Michael McCandless added a comment -

          See also LUCENE-1470, which is another way to achieve faster RangeFilter/Query, by pre-aggregating ranges during indexing and then factoring queries at search time to use the aggregates when possible.

          Michael McCandless added a comment -

          Committed revision 721663.

          Thanks Tim!

          Tim Sturge added a comment -

          Mike,

          Thanks for committing. I have a slightly more tested version which uncovered a couple of bugs:

          • using null for the upper limit didn't work
          • bad combinations of ranges weren't rejected with IllegalArgumentException

          This also includes the correct version of the test suite (I accidentally included the pre edit version before)

          Tim Sturge added a comment -

          New patch. Only changes are:

          • initialize() in FieldCacheRangeFilter
          • correct version of TestFieldCacheRangeFilter.
          Michael McCandless added a comment -

          No problem! Could you redo the patch relative to what's now committed (on trunk)? This way I can more easily see the new changes.

          "svn diff" (after checking out the trunk) is the simplest way to generate a patch.

          Tim Sturge added a comment -

          Patch from trunk fixing upper bound in FieldCacheRangeFilter

          Tim Sturge added a comment -

          Patch from incorrect to correct test suite

          Michael McCandless added a comment -

          Committed revision 722203.

          Thanks Tim!

          Otis Gospodnetic added a comment -

          Is this related to LUCENE-855? The same? Aha, I see Paul asked the reverse question in LUCENE-855 already... Tim?

          Tim Sturge added a comment -

          That's amazing. LUCENE-855 (the FieldCacheRangeFilter part) is pretty much identical in purpose and design, down to the name. The major implementation difference is that it overloaded BitSet, which was necessary prior to the addition of DocIdSetIterator. Thus my implementation looks significantly cleaner even though it is basically functionally identical.

          I think this shows that any decent idea will be repeatedly reinvented until it is widely enough known. I personally would have saved some time both in conceptualization and implementation had I been aware of this.

          I would very much like to credit Matt in CHANGES.txt for this as well; it seems like an accident of fate that I'm not using his implementation today.

          Uwe Schindler added a comment - - edited

          I reopened the wrong issue.

          The class to handle is FieldCacheRangeFilter! Here is why I am reopening:

          This filter is really cool: it iterates over the FieldCache StringIndex, and it can be even faster for ranges that are int/float/double/... - so why not retrofit it to our new naming convention and extend it:

          • FieldCacheRangeFilter.newTermRange()
          • FieldCacheRangeFilter.newByteRange()
          • FieldCacheRangeFilter.newShortRange()
          • FieldCacheRangeFilter.newIntRange()
          • ...

          Because of that, it could also be used on "old" int/long date fields, if a good parser is given (a parser that does SimpleDateFormat -> long -> FieldCache -> direct comparison on the raw numbers). I would try to extend this to all types, and it can be faster than TrieRange if the field is already in the FieldCache!
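
          For illustration, here is roughly how the proposed numeric factory API could be used (a sketch, assuming a method like FieldCacheRangeFilter.newIntRange exists as described, and an IndexSearcher named searcher; the final method names in the committed code may differ slightly from the list above):

          // Sketch: range filter over an int field via the proposed factory method.
          Filter ageFilter = FieldCacheRangeFilter.newIntRange("age",
              Integer.valueOf(18), Integer.valueOf(35), true, true);

          // Use it as a filter on another query, or wrap it for constant-score scoring:
          TopDocs top = searcher.search(new TermQuery(new Term("name", "tim")), ageFilter, 10);
          Query constantScore = new ConstantScoreQuery(ageFilter);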

          Uwe Schindler added a comment -

          Here is a first version of the retrofitted FieldCacheRangeFilter. It supports StringIndex as before for normal string ranges and, as an example, byte ranges on FieldCache.getBytes(). Optionally a parser can be given, which is passed to getBytes(). The internal code structure was changed to make it easier to implement the iterator for each data type.

          The test currently only checks StringIndex (it is the original test); other data types beyond byte are not implemented yet (it's mostly copy'n'paste...)

          Uwe Schindler added a comment -

          Patch that implements all FieldCache data types. The Double/Float includesXxxx code is a little bit of a hack; I will think about it again.

          What is currently not consistent across the range query types (RangeQuery, NumericRangeQuery, FieldCacheRangeFilter) is whether it is allowed to have one bound null and mark it as included. In my opinion, this should always be allowed and should deliver the same results whether or not the null bound is marked inclusive.

          There is a first test together with TrieRangeQuery (for ints) in TestNumericRangeQuery32.java - which passes.

          An important restriction for this filter with numeric field caches is: there must be exactly one value per document. If the value is missing, 0 is assumed. For Strings a missing value is allowed, but not for numbers (because arrays like int[] cannot contain null values).

          Uwe Schindler added a comment -

          Updated patch. It now changes the includeUpper/Lower behaviour to be consistent with other range query types (RangeQuery, NumericRangeQuery). It also has updates to hashCode and equals (missing parser).

          After I wrote some additional tests (possibly inside TestNumericRangeQuery32/64), I think it is ready to commit.

          I found no better solution for the large duplicated code parts; removing the duplication is not possible because of the different array data types.

          Michael McCandless added a comment -

          Patch looks good, Uwe! The only issue I found was you're using includeUpper where you should be using includeLower.

          Uwe Schindler added a comment -

          Where? These are typical copy'n'paste errors and missing tests...

          Uwe Schindler added a comment -

          Oh yes, found it - it was copy'n'paste.

          Uwe Schindler added a comment -

          Mike: there is one thing I wanted to ask, as I am not so familiar with the whole internal query handling:

          The DocIdSet for the native numeric types may return document ids that are deleted (because for native types there is no way to find out whether a term was stored at all; null fields or deleted docs simply have 0 in the field cache array). A range covering 0 therefore always returns the doc ids of deleted docs or of docs without the numeric field (this can be noted in the javadocs: "for numeric ranges every document must have exactly one numeric term per field"). Is this a problem, or will the false hits from the iterator be filtered out because the doc is deleted?

          When used as a real filter, the query being filtered will not return deleted docs, but a ConstantScoreQuery on this filter would return deleted docs?

          Michael McCandless added a comment -

          but a ConstantScoreQuery on this filter would return deleted docs?

          You are right! I guess one workaround is to AND it with a MatchAllDocsQuery, if you are using ConstantScoreQuery w/o already ANDing it with a query that takes deletions into account.
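
          A minimal sketch of that workaround (the helper name and the rangeFilter variable are illustrative, not from the patch):

          import java.io.IOException;
          import org.apache.lucene.search.*;

          // Hedged sketch: let the MatchAllDocsQuery scorer apply deletions instead of
          // relying on the filter to exclude deleted docs.
          static TopDocs searchFilterOnly(IndexSearcher searcher, Filter rangeFilter) throws IOException {
            BooleanQuery q = new BooleanQuery();
            q.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);             // skips deleted docs
            q.add(new ConstantScoreQuery(rangeFilter), BooleanClause.Occur.MUST); // the constant-score filter
            return searcher.search(q, 10);
          }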

          So there are two issues:

          • This filter returns deleted docs
          • This filter "pretends" deleted docs had a value of zero

          This is then only a problem if nothing else in the query applies deletions. Other FieldCache-driven filters have challenges here, too; e.g. we just fixed LUCENE-1571, where local lucene tripped up on deleted docs (because it uses the Strings FieldCache and hit nulls for deleted docs).

          I'm not sure how we should fix this... (and I think we should open a new issue to do so). I don't want to force this filter to always take deletions into account (since for many queries the filter is "and'd" on, deletions are already factored in). More generally, we need to think about what's the "right" top-down way to ask a scorer to take deletions and filters into account. Eg, LUCENE-1536 is looking at sizable performance improvements for the "relatively dense and supports random access" type of filters.

          Uwe Schindler added a comment -

          Hey Mike, same time...

          I did some research and also found out that a filter's DocIdSet should not list deleted documents.

          Because of that, I changed the non-StringIndex cases (StringIndex never matches deleted docs, because they are mapped to order 0) to use IndexReader.termDocs(null) to list the doc ids. This is no real problem, as it is just an iterator plus a check against the deleted-docs bit set; the additional cost is low (tested with a 10 million doc index).

          I also created a superclass for all the iterators working on numbers, so the termDocs handling is done in one place. The type-specific iterators only override a matchDoc() method. The StringIndex iterator stays separate, because it is optimized and has no deleted-docs problem, as described before.
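
          A rough sketch of that idea (illustrative only, not the actual patch code; the class name is made up):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.index.TermDocs;
          import org.apache.lucene.search.DocIdSetIterator;

          // Walks IndexReader.termDocs(null) - which skips deleted docs - and keeps only
          // the docs whose cached value matches the range (checked by matchDoc()).
          abstract class TermDocsRangeIterator extends DocIdSetIterator {
            private final TermDocs termDocs;
            private int doc = -1;

            TermDocsRangeIterator(IndexReader reader) throws IOException {
              this.termDocs = reader.termDocs(null); // enumerates all non-deleted docs
            }

            /** Subclasses compare the FieldCache array entry for this doc against the bounds. */
            abstract boolean matchDoc(int doc);

            public int docID() { return doc; }

            public int nextDoc() throws IOException {
              do {
                if (!termDocs.next()) return doc = NO_MORE_DOCS;
                doc = termDocs.doc();
              } while (!matchDoc(doc));
              return doc;
            }

            public int advance(int target) throws IOException {
              if (!termDocs.skipTo(target)) return doc = NO_MORE_DOCS;
              doc = termDocs.doc();
              while (!matchDoc(doc)) {
                if (!termDocs.next()) return doc = NO_MORE_DOCS;
                doc = termDocs.doc();
              }
              return doc;
            }
          }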

          This patch also contains tests for all (except byte) types.

          I will commit in a day or two.

          (Another solution for the future would be an additional bit set for numeric values, kept in the FieldCache next to the native type array, that records whether the document had a term at all. This would also cover deleted docs.)
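
          A hedged sketch of that idea (the helper name and the field name are made up; this is not part of the patch):

          import java.io.IOException;
          import org.apache.lucene.index.*;
          import org.apache.lucene.util.OpenBitSet;

          // Builds a bit set, next to the numeric cache array, marking the docs that
          // actually had at least one term in the given field.
          static OpenBitSet docsWithValue(IndexReader reader, String field) throws IOException {
            OpenBitSet hasValue = new OpenBitSet(reader.maxDoc());
            TermDocs termDocs = reader.termDocs();
            TermEnum termEnum = reader.terms(new Term(field, ""));
            try {
              do {
                Term term = termEnum.term();
                if (term == null || !term.field().equals(field)) break; // left the field
                termDocs.seek(termEnum);
                while (termDocs.next()) {
                  hasValue.set(termDocs.doc()); // this doc has a value for the field
                }
              } while (termEnum.next());
            } finally {
              termDocs.close();
              termEnum.close();
            }
            return hasValue;
          }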

          Michael McCandless added a comment -

          OK, patch looks good Uwe!

          Having this filter just always take deletions into account is the safe solution; presumably the added performance cost is OK since this filter is so fast to begin with.

          Longer term I think we need a cleaner way to ask a Scorer to "carry out" deletions & filtering, and have it more optimally delegate that request to its sub-scorers as needed.

          One corner case issue: if I eg make a newShortRange w/ lowerVal == Short.MAX_VALUE and includeLower=false, which should match no docs, I think in this case you overflow short in computing inclusiveLowerPoint and thus match possibly many docs incorrectly? (Same for byte/int/long).

          Uwe Schindler added a comment -

          I did some performance tests and compared this filter with TrieRange (precisionStep 8) on a 5 million doc index with homogeneously distributed int values from Integer.MIN_VALUE to Integer.MAX_VALUE, running 200 queries with random bounds in the same range. Platform was Win32 with 1.5 GB RAM on my Thinkpad T60 Core Duo (not Core 2 Duo!), Java 1.5:

          loading field cache
          time: 11826.602264 ms
          Warming searcher...
          avg number of terms: 414.365
          TRIE: best time=4.51482 ms; worst time=1560.544985 ms; avg=470.56886981499997 ms; sum=323328111
          FIELDCACHE: best time=314.611773 ms; worst time=878.438461 ms; avg=511.93189495499996 ms; sum=323328111

          This test shows that, with a well-warmed searcher and the whole index in the OS cache, the two are the same speed. A constant-score conventional range query is far behind (about 10 to 1000 times slower, depending on how far apart the random range bounds are).

          The same test with the old patch (using no TermDocs) and a completely separate loop per type (no matchDoc() method call); the FieldCache filter clearly beats the trie filter here:

          loading field cache
          time: 12134.143027 ms
          Warming searcher...
          avg number of terms: 403.785
          TRIE: best time=3.890159 ms; worst time=1266.979462 ms; avg=453.553236545ms; sum=308154314
          FIELDCACHE: best time=84.019897 ms; worst time=434.558023 ms; avg=235.91554798500002 ms; sum=308154314

          Both test runs show that the queries work correctly (the sum is identical, which shows that both returned exactly the same hits).

          In all cases I would still prefer TrieRange (hihi), especially because of the long warm-up time for the field cache. And TrieRange gets even better with lower precisionSteps - but not by much (in constant-score mode the bit sets are the bigger problem).

          Uwe Schindler added a comment -

          Attached my performance test program for reference.

          Uwe Schindler added a comment -

          One corner case issue: if I eg make a newShortRange w/ lowerVal == Short.MAX_VALUE and includeLower=false, which should match no docs, I think in this case you overflow short in computing inclusiveLowerPoint and thus match possibly many docs incorrectly? (Same for byte/int/long).

          NumericRangeQuery has this problem, too (see the TermEnum impl there). I could change both queries to simply return the empty iterator (as when upper < lower).
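
          A hedged sketch of such a guard for the short case (the helper name is made up; the committed code may look different):

          // Returns true if the range cannot match anything, so the filter can hand back
          // the empty DocIdSet instead of computing (and overflowing) the inclusive bounds.
          static boolean isEmptyShortRange(Short lowerVal, Short upperVal,
                                           boolean includeLower, boolean includeUpper) {
            // these two checks avoid the overflow: MAX_VALUE + 1 / MIN_VALUE - 1 would wrap
            if (lowerVal != null && !includeLower && lowerVal.shortValue() == Short.MAX_VALUE) return true;
            if (upperVal != null && !includeUpper && upperVal.shortValue() == Short.MIN_VALUE) return true;
            short lo = (lowerVal == null) ? Short.MIN_VALUE
                : (includeLower ? lowerVal.shortValue() : (short) (lowerVal.shortValue() + 1));
            short hi = (upperVal == null) ? Short.MAX_VALUE
                : (includeUpper ? upperVal.shortValue() : (short) (upperVal.shortValue() - 1));
            return lo > hi;
          }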

          Michael McCandless added a comment -

          NumericRangeQuery has this problem, too (see the TermEnum impl there). I could change both queries to simply return the empty iterator (as when upper < lower).

          Right, and I see you've already fixed it!

          From your performance runs, looking at the average times, forcing this
          filter to take deletions into account made it ~2X slower. That's
          quite costly.

          (Though, you really should seed the Random() so the two tests run
          precisely the same set of queries against precisely the same index).

          I would imagine that for most usage of this filter, taking deletes
          into account is not necessary, because it's being used as a filter
          with a query whose scorer won't return deleted docs. Then we've taken
          this perf hit for nothing...

          Somehow, we really need better control, when creating scorers, on just
          when we need and don't need deletions / filters to be "AND'd" in.

          Also, this filter isn't good when not many docs pass the filter, since
          it's an O(N) scan through the index. Trie should do much better in
          those cases.

          I wonder, if we could make a hybrid approach that eg loads the trie
          fields into a fast in-memory postings format (simple int arrays), just
          how much faster it'd be. Ie, if you want to spend memory, spending it
          on trie's postings would presumably net the best performance. Once we
          have flexible indexing we could presumably "swap in" an in-RAM
          postings impl and then run trie against that.

          Uwe Schindler added a comment -

          Here is an updated patch with the corner cases fixed (incl. tests) for this filter (TrieRange is already fixed in trunk).

          Uwe Schindler added a comment -

          From your performance runs, looking at the average times, forcing this
          filter to take deletions into account made it ~2X slower. That's
          quite costly.

          This is not only because of the deleted docs. It's also because the while loops no longer have hard-coded bounds, so there is an additional method call to check whether something is a hit. The linear scan also makes this costly. But without that shared method, the code is unmaintainable (with all the problems like copy/paste errors and so on); it is almost the same code six times over.

          I would imagine that for most usage of this filter, taking deletes
          into account is not necessary, because it's being used as a filter
          with a query whose scorer won't return deleted docs. Then we've taken
          this perf hit for nothing...

          I could put in a switch that uses another iterator if 0 is not inside the range (because then deleted docs can never be hits) or if the IndexReader has no deletions. But I think this optimization is something for later.

          In my opinion, in most cases TrieRange is better (and with payloads, too). So I'll keep this filter as it is for the beginning.

          Michael McCandless added a comment -

          This is not only because of the deleted docs. It's also because the while loops no longer have hard-coded bounds, so there is an additional method call to check whether something is a hit. The linear scan also makes this costly. But without that shared method, the code is unmaintainable (with all the problems like copy/paste errors and so on); it is almost the same code six times over.

          The tradeoff of code ugliness vs. performance is always an "interesting" one, but here we lost 2X performance. Maybe we leave this to future source code specialization.

          In my opinion, in most cases TrieRange is better (and with payloads, too).

          Since, with the 2X slowdown and the linear-scan impl, this filter is not faster than trie... should we even bother? It's going to confuse users having two ways to do numeric filtering, and trie seems to be best across the board anyway. Are there any cases where this filter is better than trie?

          So I'll keep this filter as it is for the beginning.

          OK, you mean leave this filter as only doing text (term) ranges? OK, that sounds good.

          Uwe Schindler added a comment -

          I wanted to mention that the latest patch (and the ones before it) is still buggy. While changing the upper/lower bound handling I missed the case that StringIndex.binarySearch can return negative values if the exact key is not found (see the javadocs of Arrays.binarySearch). The problem is that the test case did not catch this.
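
          For reference, a hedged sketch of how a negative binarySearch result has to be mapped (the helper is illustrative; "lookup" stands in for the sorted StringIndex.lookup array):

          // A negative return value encodes -(insertionPoint) - 1, exactly as in
          // java.util.Arrays.binarySearch, so absent terms need their own handling.
          static int lowerOrd(String[] lookup, String lowerTerm, boolean includeLower) {
            int pos = java.util.Arrays.binarySearch(lookup, lowerTerm);
            if (pos >= 0) {
              return includeLower ? pos : pos + 1; // exact term found
            }
            return -pos - 1;                       // absent: first ord whose term is > lowerTerm
          }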

          I will fix that and do some further tests later, but I am away now.

          Uwe Schindler added a comment -

          Attached is a new patch that has two DocIdSetIterator implementations, one with TermDocs and one without. The TermDocs one is chosen for numeric types only if the reader contains deletions and 0 is inside the range. In all other cases (also for StringIndex) the simple counter-based DocIdSetIterator is used.

          For more code reuse, all range implementations now use the same abstract DocIdSet implementation and only override matchDoc(). My tests showed that the use of this method does not affect performance (the method is inlined); the original StringIndex impl is as fast as the new one with matchDoc().
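
          A hedged sketch of the counter-based iterator (illustrative names, not the committed code); the type-specific subclasses would supply matchDoc():

          import org.apache.lucene.search.DocIdSetIterator;

          // Plain scan over doc ids 0..maxDoc-1, used when deleted docs cannot produce
          // false hits (no deletions, or 0 lies outside the range).
          abstract class CountingRangeIterator extends DocIdSetIterator {
            private final int maxDoc;
            private int doc = -1;

            CountingRangeIterator(int maxDoc) { this.maxDoc = maxDoc; }

            /** Checks the FieldCache array entry for this doc against the range bounds. */
            abstract boolean matchDoc(int doc);

            public int docID() { return doc; }

            public int nextDoc() {
              do {
                doc++;
                if (doc >= maxDoc) return doc = NO_MORE_DOCS;
              } while (!matchDoc(doc));
              return doc;
            }

            public int advance(int target) {
              doc = target - 1;
              return nextDoc();
            }
          }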

          This patch also restores the original handling of the return value of binarySearch (which can be negative).

          Here again the comparison:

          Version with TermDocs:
          loading field cache
          time: 6767.23131 ms
          Warming searcher...
          avg number of terms: 378.75
          TRIE: best time=5.232229 ms; worst time=553.791334 ms; avg=250.4418579 ms; sum=31996909
          FIELDCACHE: best time=212.763912 ms; worst time=357.100414 ms; avg=279.75582110000005 ms; sum=31996909

          Version without (because index in testcase has no deletions):
          loading field cache
          time: 6463.311678 ms
          Warming searcher...
          avg number of terms: 378.75
          TRIE: best time=4.539963 ms; worst time=581.657446 ms; avg=246.58688465 ms; sum=31996909
          FIELDCACHE: best time=64.747614 ms; worst time=211.557335 ms; avg=139.16517340000001 ms; sum=31996909

          (My T60 was not on battery; because of this the measurement with TermDocs and the FieldCache loading was faster than before.) But both tests, before and after the optimization, were done with the same settings; the random seed was identical (0L).

          Uwe Schindler added a comment -

          It seems that the latest patch has no problems anymore. Without deletions, or if 0 is not inside the range, it seems to be faster than TrieRange, with the drawback of a long first-time search (cache loading). But if you e.g. also search on this field or use the cache for something else anyway, that may not be a problem.

          The biggest advantage of this is that you do not need to index the values in a special way; you can simply use your old Number.toString()-formatted fields and do range queries on them. For term/string ranges it works better than RangeFilter, but the memory usage of the StringIndex is much higher (if you have lots of distinct terms).

          Maybe for the future there would also be the possibility to implement a TrieRangeQuery for strings (the precisionStep would then be the number of chars per precision level, so e.g. precStep=2 would index for "lucene" the tokens "lu", "luce", "lucene"). The same applies here as with TrieRange: a TokenStream that does this would be good.
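
          A hedged sketch of the token generation this would need (purely illustrative; no such TokenStream exists yet):

          import java.util.ArrayList;
          import java.util.List;

          // precStep = number of chars per precision level; lower precisions are prefixes
          static List<String> prefixTokens(String term, int precStep) {
            List<String> tokens = new ArrayList<String>();
            for (int len = precStep; len < term.length(); len += precStep) {
              tokens.add(term.substring(0, len));
            }
            tokens.add(term); // always index the full-precision term as well
            return tokens;    // "lucene", precStep=2 -> [lu, luce, lucene]
          }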

          In my opinion, this class is a good approach for range queries if you have enough RAM and warm your searchers correctly, but do not want to change your index structure to use the new TrieRange. This class is not good for indexes where a range hits only a few documents, as the cost of the linear scan (for all data types) then outweighs the benefit.

          I think I will commit this later, if nobody objects. If you think that only StringIndex and not numeric values should be handled by this class (i.e. throw away the new code), I tend to rename this class before the release according to LUCENE-1713.

          Uwe Schindler added a comment -

          I forgot: here are the latest optimizations and corrections for the corner cases of StringIndex.

          Michael McCandless added a comment -

          OK let's keep the new code, and keep the current name; it's great that it can now be faster than trie in some cases. I'd love to eventually see an in-memory trie structure; seems like this'd be the fastest of all

          I only briefly skimmed the patch (I'm on "vacation" until Jul 6) and it looks good!

          Uwe Schindler added a comment -

          I only briefly skimmed the patch (I'm on "vacation" until Jul 6) and it looks good!

          Happy holidays! (Mine are later.) I will commit shortly.

          Uwe Schindler added a comment -

          Added CHANGES.txt & committed revision 789682.


            People

            • Assignee: Uwe Schindler
            • Reporter: Tim Sturge
            • Votes: 1
            • Watchers: 1
