Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 5.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The facet module today requires the app to compute the hierarchy
      at index time, e.g. a timestamp field might use a year/month/day
      hierarchy.

      While this gives great performance, since it minimizes the
      search-time computation, sometimes it's useful or even necessary
      to do things entirely at search time, like Solr does.

      E.g. I'm playing with a prototype Lucene search for Jira issues
      and I'd like to add a drill down+sideways for "Updated in past day,
      2 days, week, month", etc. But because time is constantly advancing,
      doing this at index time is not easy ...
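
      To make the goal concrete, below is a minimal self-contained sketch of
      the idea (the LongRange record is a hypothetical stand-in, not an
      existing facet module class): the ranges are computed at query time,
      relative to "now", so they can overlap and shift on every request,
      which an index-time year/month/day hierarchy cannot express.

          // Hedged sketch only; hypothetical types, not the facet module API.
          class DynamicRangeSketch {

            // Minimal stand-in for whatever range class the patch introduces.
            record LongRange(String label, long min, long max) {
              boolean accept(long v) { return v >= min && v <= max; }
            }

            // Ranges built per request, relative to "now"; they deliberately overlap.
            static LongRange[] jiraUpdatedRanges(long nowMillis) {
              long day = 24L * 60 * 60 * 1000;
              return new LongRange[] {
                new LongRange("Past day",    nowMillis - day,      nowMillis),
                new LongRange("Past 2 days", nowMillis - 2 * day,  nowMillis),
                new LongRange("Past week",   nowMillis - 7 * day,  nowMillis),
                new LongRange("Past month",  nowMillis - 30 * day, nowMillis),
              };
            }

            public static void main(String[] args) {
              for (LongRange r : jiraUpdatedRanges(System.currentTimeMillis())) {
                System.out.println(r.label() + ": [" + r.min() + ", " + r.max() + "]");
              }
            }
          }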

      Attachments

      1. LUCENE-4965.patch (9 kB) - Michael McCandless
      2. LUCENE-4965.patch (9 kB) - Michael McCandless
      3. LUCENE-4965.patch (15 kB) - Michael McCandless
      4. LUCENE-4965.patch (15 kB) - Michael McCandless
      5. LUCENE-4965.patch (23 kB) - Michael McCandless
      6. LUCENE-4965.patch (24 kB) - Michael McCandless
      7. LUCENE-4965.patch (25 kB) - Michael McCandless
      8. LUCENE-4965.patch (49 kB) - Michael McCandless

        Activity

        Michael McCandless created issue -
        Michael McCandless added a comment -

        Patch; just a start, but the test passes. It uses a NumericDV field
        and just does the obvious approach (load the value for each hit and
        find the ranges that value falls within). Once LUCENE-4964 is in,
        I'll test the integration with DrillDown/SidewaysQuery...
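
        In other words: for each hit, read its long value and increment every
        range whose bounds contain it. A self-contained sketch of that loop
        (hypothetical Range type, with a plain long[] standing in for the
        NumericDV / FacetsCollector plumbing):

            // Hedged sketch of the per-hit counting loop described above; not
            // the patch's actual accumulator code.
            class ObviousApproachSketch {

              static class Range {
                final String label;
                final long min, max;
                int count;
                Range(String label, long min, long max) {
                  this.label = label; this.min = min; this.max = max;
                }
                boolean accept(long v) { return v >= min && v <= max; }
              }

              // values[i] is the NumericDV value already loaded for the i-th hit.
              static void count(long[] values, Range[] ranges) {
                for (long v : values) {
                  for (Range r : ranges) {
                    if (r.accept(v)) {
                      r.count++;   // overlapping ranges are fine: no early break
                    }
                  }
                }
              }

              public static void main(String[] args) {
                long[] hitValues = { 3, 7, 12, 25 };   // pretend these came from a NumericDV field
                Range[] ranges = { new Range("0-9", 0, 9), new Range("5-19", 5, 19) };
                count(hitValues, ranges);
                for (Range r : ranges) {
                  System.out.println(r.label + "=" + r.count);
                }
              }
            }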

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12580848 ]
        Shai Erera added a comment -

        Looks great! Few comments:

        • I think the Accumulator should not take FSP, just the ndv field, for various reasons:
          • It will eliminate the need for the checks about the validity of FSP
          • When I read the test, it seemed awkward to me that you need to create a FSP and CountingFR, since 'field' is not a facet at all...
        • If you change FacetsAccumulator to not initialize a FacetArrays, you can safely pass null for taxonomy and FacetArrays
        • I think the only reason you override getAggregator is because FacetsCollector.create() calls it to determine if doc scores are needed?
          • Maybe we can have FacetsAccumulator.requiresDocScore() – the default will call getAggregator().requireDocScore() and you will just return false? Then FC will use the accumulator to determine that.
        • Regarding the "nocommit int/float/double too", I think this class should be called DynamicLongRangeAccumulator since it can only handle longs. To handle int/float/double, you will need to override accumulate() entirely to pull a different DV and find the range.
          • If you make this class abstract, you can have a utility method which converts the ranges to FacetResultNodes.
        • if (r.count > 0) – I think that's wrong? We should return all requested ranges, some may be with count=0. Just like we return in normal faceted search a FacetResult per FacetRequest.
        • Instead of addLongRange, can the ctor take an IndexReader and LongRange...? The ranges need to be final (there's no sense adding a new range after accumulate), and also, I think that the start/endInclusive logic may not be that simple when you come to handle a FloatRange.
        Michael McCandless added a comment -

        New patch incorporating Shai's feedback ...

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12580885 ]
        Michael McCandless added a comment -

        Woops, trying again ...

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12580886 ]
        Shai Erera added a comment -

        Thanks Mike. Looks good! Few more comments:

        • The nocommit on default aggregator – I think you can override and throw Unsupported?
        • +1 to move to o.a.l.facet.dynamic, or oal.facet.range.
        • DynamicRange:
          • thanks for introducing accept(), but I don't think the way it's implemented can support other impls, such as float/double.
          • Perhaps instead DynamicRange should be abstract, with impls for LongRange, IntRange, FloatRange and DoubleRange, which can then implement accept properly (see the sketch after this comment).
            • It won't have min/max, but can keep label.
          • Also, that will allow you to add type-safety to DynamicRangeFacetRequest. I.e. now someone can pass int and float ranges in the same request, which makes no sense. But if you make it 'typed', one would need to specify the type at construction, and cannot make mistakes.
            • DRFR would then be defined with <T extends DynamicRange> and its ctor would take T....
        • I think it's ok if the request is called DynamicRange, without "numeric" in the name, because someone can write a DynamicRange impl not on numeric fields. He'll then need to also write an accumulator though, but as for the request, it's still about DynamicRange implementations ... in short, I'm +0 on whether it stays like that or is renamed.
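
        For illustration, one way the proposed hierarchy could be shaped
        (names and signatures here are assumptions, not what was ultimately
        committed):

            // Hedged sketch of the proposed typed range hierarchy; illustrative only.
            abstract class DynamicRange {
              final String label;
              DynamicRange(String label) { this.label = label; }
            }

            // Each concrete impl handles one native type, so accept() can take a
            // primitive. IntRange / FloatRange would look analogous.
            class LongRange extends DynamicRange {
              final long min, max;
              LongRange(String label, long min, long max) {
                super(label); this.min = min; this.max = max;
              }
              boolean accept(long value) { return value >= min && value <= max; }
            }

            class DoubleRange extends DynamicRange {
              final double min, max;
              DoubleRange(String label, double min, double max) {
                super(label); this.min = min; this.max = max;
              }
              boolean accept(double value) { return value >= min && value <= max; }
            }

            // The type parameter makes the intended element type explicit, e.g.
            // DynamicRangeFacetRequest<LongRange>, though Java's inference can
            // still widen to a common supertype if someone mixes range types.
            class DynamicRangeFacetRequest<T extends DynamicRange> {
              final String field;
              final T[] ranges;

              @SafeVarargs
              DynamicRangeFacetRequest(String field, T... ranges) {
                this.field = field;
                this.ranges = ranges;
              }
            }

        A request would then be built as, e.g., new DynamicRangeFacetRequest<>("updated",
        new LongRange("Past day", nowMinusDay, now), new LongRange("Past week", nowMinusWeek, now)).
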
        Michael McCandless added a comment -

        Patch, just moving things under oal.facet.range.

        I'm ... not sure how to do the generics to support float/double/int/long without just making separate Accumulators (e.g. Int/Float/Double/LongRangeAccumulator), since I need to use native types for each case.

        Also, I think ideally (maybe not on the first commit), we should use an interval tree to find overlapping ranges for each value ... so somehow we'd need to build this from all of the provided ranges up front.

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12580889 ]
        Michael McCandless added a comment -

        New patch, breaking out separate .newLong/Float/DoubleRange.

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12580896 ]
        Shai Erera added a comment -

        Comments:

        • Range is not really extendable (private ctor) – is that intentional? I guess there's not much point extending it ...
        • Any reason not to take primitive values and let the user define e.g. Long.MAX_VALUE as an unlimited upper bound? I'm not a big fan of auto-boxing, and passing null is as good to me as passing MAX_VAL. The user has to make a decision anyway, so might as well be explicit about it.
        • accumulate() iterates over matchingDocs in the outer loop and over ranges in the inner loop. I remember while writing FacetsAccumulator that luceneutil was happier with the other way (matchingDocs inner). Maybe test?
        • Shouldn't if (ranges.ranges.accept(v)) break if there's a match?
          • While at it, maybe RangeSet should sort the ranges by their minimum value? Not sure if asymptotically this will matter...
        • I don't think it matters much, but maybe allocate new ArrayList<FacetResultNode>() with ranges.ranges.length?

        I think we're almost ready!

        Michael McCandless added a comment -

        Range is not really extendable (private ctor) – is that intentional? I guess there's not much point extending it ...

        I suppose if an app had some custom encoding (into the long value in NumericDVField) then it would need a custom impl? I'll make it protected...

        Any reason why not take primitive values and let the user define e.g Long.MAX_VALUE as an unlimited upper bound?

        I was trying to mimic NumericRangeQuery, but I agree: let's just take primitives.

        Accumulate() iterates at the outer loop on matchingDocs, and inner loop on ranges. I remember while writing FacetsAccumulator that luceneutil was happier with the other way (matchingDocs inner). Maybe test?

        Ahhh right, I forgot about that. I'll just switch it so matchingDocs is the inner loop ... I'm not quite set up to perf test this yet.

        Shouldn't if (ranges.ranges.accept(v)) break if there's a match

        I think not? I.e. overlapping ranges should be allowed? For the Jira search I'd like to have "Past day", "Past 2 days", etc.

        While at it, maybe RangeSet should sort the ranges by their minimum value? Not sure if asymptotically this will matter...

        I think we need to use an interval tree here (I put a TODO) ... it's a little tricky to do that since each Range has its own private min/max ... but I think we can do this in a follow-on issue.

        Michael McCandless added a comment -

        Patch w/ last round of fixes...

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12580913 ]
        Shai Erera added a comment -

        Ie overlapping ranges should be allowed?

        Ahh, you're right.

        it's a little tricky to do that since each Range has its own private min/max

        Maybe Range can implement Comparable, and you build the tree using it, like we build PQ? In practice though, I wonder how much more efficient a tree would be vs simply sorting and iterating until a range is bigger than the value? We're probably talking about very few ranges, no?

        BTW, I think Range would need to implement Comparable as well as compareTo(long) since a value may not fall into the first (sorted order) range (e.g. range.accept() returns false), but you'll need to continue looking until range.compareTo(long) > 0 (i.e. range.minValue > value).

        Also, maybe instead of FRN we should return a RangeFRN which contains the range? That way, someone can extract the min/max values of the range without parsing the label by casting to the range added? Hmm, but then you'll need to make the range impls public, but maybe that's not bad? Instead of Range.newLongRange, someone will just do new LongRange()?

        I see you decided not to go with generics? In that case, maybe document that you are expected to pass the same Range type? Although, why not make RangeFacetRequest generic and prevent this pitfall?
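
        Sketched out, the Comparable idea above might look roughly like this
        (hypothetical names; the committed code keeps the simple linear accept()
        scan and leaves sorting / interval-tree work to a follow-on issue):

            // Hedged sketch of sorting ranges by minimum value and stopping the
            // scan once range.min exceeds the doc value; illustrative only.
            class ComparableRange implements Comparable<ComparableRange> {
              final String label;
              final long min, max;
              int count;

              ComparableRange(String label, long min, long max) {
                this.label = label; this.min = min; this.max = max;
              }

              boolean accept(long v) { return v >= min && v <= max; }

              // Orders ranges by lower bound, e.g. for Arrays.sort before counting.
              @Override public int compareTo(ComparableRange other) {
                return Long.compare(min, other.min);
              }

              // Compares this range's lower bound to a doc value; this is the
              // compareTo(long) needed to know when to stop scanning.
              int compareTo(long value) { return Long.compare(min, value); }

              // Counts one doc value against ranges pre-sorted by minimum.
              static void countSorted(long value, ComparableRange[] sortedRanges) {
                for (ComparableRange r : sortedRanges) {
                  if (r.compareTo(value) > 0) break; // r.min > value: no later range can match
                  if (r.accept(value)) r.count++;    // overlapping ranges are still all counted
                }
              }
            }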

        Michael McCandless added a comment -

        New patch, adds generics to make sure the Range instances are the same class, and switches to simple ctors instead of .newXXXRange.

        I also added RangeFRN ... I only use it for the "sub nodes" and not the root node.

        I don't think we should try to do the sorting here ... I think that's too simplistic ... we should do an interval tree (but on a new issue).

        E.g. the simplistic sorting approach can easily fail to speed things up if, say, most of my values are > max(range.max) (which will be the case with the Jira search) and you had sorted "smallest to biggest".

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12580958 ]
        Shai Erera added a comment -

        I agree we should optimize separately (and preferably after you have luceneutil set up so we can measure how important it really is).

        Thanks for adding generics.

        +1 to commit!

        Michael McCandless added a comment -

        New patch, breaking out Long/Float/DoubleRange as standalone sources, adding Test/RangeFacetsExample, and adding randomized tests in TestRangeAccumulator ...
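
        For readers following along, the general shape of such a randomized test
        is sketched below (illustrative only, not the actual TestRangeAccumulator
        code): index random longs, pick random bounds, compute the expected count
        per range by brute force, and compare against what the accumulator reports.

            // Hedged sketch of the randomized-test approach; the brute-force
            // expected count deliberately avoids the accept() logic under test.
            import java.util.Random;

            class RandomRangeCheckSketch {
              public static void main(String[] args) {
                Random random = new Random();
                int numDocs = 100 + random.nextInt(900);
                long[] values = new long[numDocs];
                for (int i = 0; i < numDocs; i++) {
                  values[i] = random.nextLong() % 1000;  // keep values small so ranges get hits
                }

                int numRanges = 1 + random.nextInt(10);
                for (int r = 0; r < numRanges; r++) {
                  long a = random.nextLong() % 1000;
                  long b = random.nextLong() % 1000;
                  long min = Math.min(a, b), max = Math.max(a, b);

                  // Brute-force expected count, independent of any accept() helper:
                  int expected = 0;
                  for (long v : values) {
                    if (v >= min && v <= max) expected++;
                  }

                  // In the real test this would be asserted against the count the
                  // accumulator reports for the same range over the same index.
                  System.out.println("[" + min + ", " + max + "] expected=" + expected);
                }
              }
            }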

        Michael McCandless made changes -
        Attachment LUCENE-4965.patch [ 12581043 ]
        Shai Erera added a comment -

        Awesome! And thanks for adding the examples!

        In TestRangeAccumulator:

        • You don't use LongRange.accept (for expected counts) b/c that's what you test, right? Maybe just drop a one-liner?
        • Also, does the test really need to run atLeast(100) iters? I would think one iteration is enough, if we let Jenkins periodic builds do the iters for us (plus, someone can run with -Dtests.iters or -Dtests.dups)
        • These apply to all 4 random test methods.

        In RangeAccumulator:

        • I think you can remove "nocommit add to example"?
        • There are some leftover commented out lines (members and ctor)

        In the example:

        • new Date().getTime() can be just System.currentTimeMillis()?
        • Also, do we really need to index 1000 documents to demo? I ask because it's harder to debug 1000 docs

        I'm +1 to commit, no need to upload a new patch from my side.

        Michael McCandless added a comment -

        You don't use LongRange.accept (for expected counts) b/c that's what you test right? maybe just drop a one liner?

        Right, I'll add a comment.

        Also, does the test really need to run atLeast(100) iters? I would think one iteration is enough, if we let Jenkins periodic builds do the iters for us (plus, someone can run with -Dtests.iters or -Dtests.dups)

        Well, it's an efficiency question I guess (how many Jenkins iters will
        it take to find a bug) ... I have beasted it and it's passing for me.
        I'll drop it to 10 ... the test is plenty fast.

        I think you can remove "nocommit add to example"?

        There are some leftover commented out lines (members and ctor)

        Woops, fixed!

        In the example:
        new Date().getTime() can be just System.currentTimeMillis()?
        Also, do we really need to index 1000 documents to demo? I ask because it's harder to debug 1000 docs

        Good, I'll switch to currentTimeMillis, and drop it to 100 docs.

        Thanks Shai.

        Shai Erera added a comment -

        Great, thanks. +1!

        Commit Tag Bot added a comment -

        [branch_4x commit] mikemccand
        http://svn.apache.org/viewvc?view=revision&revision=1477560

        LUCENE-4965: add dynamic numeric range faceting

        Commit Tag Bot added a comment -

        [trunk commit] mikemccand
        http://svn.apache.org/viewvc?view=revision&revision=1477562

        LUCENE-4965: add dynamic numeric range faceting

        Commit Tag Bot added a comment -

        [trunk commit] mikemccand
        http://svn.apache.org/viewvc?view=revision&revision=1477563

        LUCENE-4965: add dynamic numeric range faceting

        Commit Tag Bot added a comment -

        [branch_4x commit] mikemccand
        http://svn.apache.org/viewvc?view=revision&revision=1477564

        LUCENE-4965: add dynamic numeric range faceting

        Michael McCandless added a comment -

        Thanks Shai!

        Michael McCandless made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues

        Steve Rowe made changes -
        Status: Resolved [ 5 ] → Closed [ 6 ]
        ASF subversion and git services added a comment -

        Commit 1533782 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533782 ]

        LUCENE-4965: don't expose this as a large hashmap with thousands of arrays

        Pradeep added a comment -

        Is this code easily portable to Solr 3.6?

        Michael McCandless added a comment -

        This facet method isn't available via Solr yet, unfortunately. Though in theory porting it to Solr would not be that hard, since it does not rely on the taxonomy index.

        That said, Solr does have its own range faceting ( http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range ), but it effectively requires multiple passes through the result set (I think?), so for something like distance range faceting (< 1 mile, < 2 miles, < 5 miles, ...), where the cost to compute the value for each doc is highish, the multiple passes through the result set are costly.
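
        To make that cost argument concrete, here is a small illustrative sketch
        (neither Solr's nor this module's actual code): with a single pass the
        expensive per-document value, e.g. a distance, is computed once and
        tested against every range, while a pass-per-range approach recomputes
        it for each range.

            // Hedged illustration of single-pass vs. per-range-pass counting when
            // the per-document value (e.g. a distance) is expensive to compute.
            class PassCountSketch {

              interface DocValueFn { double value(int docId); }  // e.g. a haversine distance

              // Single pass: compute the value once per doc, test all ranges.
              static int[] singlePass(int[] hits, double[] maxes, DocValueFn fn) {
                int[] counts = new int[maxes.length];
                for (int doc : hits) {
                  double v = fn.value(doc);            // expensive work done once per doc
                  for (int r = 0; r < maxes.length; r++) {
                    if (v < maxes[r]) counts[r]++;     // "< 1 mile", "< 2 miles", ...
                  }
                }
                return counts;
              }

              // One pass per range: the expensive value is recomputed ranges.length times.
              static int[] passPerRange(int[] hits, double[] maxes, DocValueFn fn) {
                int[] counts = new int[maxes.length];
                for (int r = 0; r < maxes.length; r++) {
                  for (int doc : hits) {
                    if (fn.value(doc) < maxes[r]) counts[r]++;
                  }
                }
                return counts;
              }
            }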


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            3
