Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 5.3
    • Component/s: modules/spatial
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The newly added DateRangePrefixTree (DRPT) encodes terms in a fashion amenable to faceting by meaningful time buckets. The motivation for this feature is to efficiently populate a calendar bar chart or heat-map. It's not hard if you have date instances like many do but it's challenging for date ranges.

      Internally this is going to iterate over the terms using seek/next with TermsEnum as appropriate. It should be quite efficient; it won't need any special caches. I should be able to re-use SPT traversal code in AbstractVisitingPrefixTreeFilter. If this goes especially well; the underlying implementation will be re-usable for geospatial heat-map faceting.

      1. LUCENE-5735__PrefixTreeFacetCounter.patch
        27 kB
        David Smiley
      2. LUCENE-5735.patch
        26 kB
        David Smiley
      3. LUCENE-5735.patch
        51 kB
        David Smiley

        Issue Links

          Activity

          Hide
          dsmiley David Smiley added a comment -

          Here's the patch, with randomized testing verification of results. The faceting code doesn't make any date assumptions, and so it should work once a non-date NumberRangePrefixTree subclass comes into existence.

          I added a method to NumberRangePrefixTreeStrategy with this signature:

          public Facets calcFacets(IndexReaderContext context, final Bits acceptDocs, Shape facetRange, int level)
          

          acceptDocs is a filter of documents to count (null to count all docs). facetRange is a range that will limit the values counted to that provided range (e.g. a start date and end date span). Get one of those easily via toRangeShape(start,end) method. 'level' is the bottom-most prefix tree level to be counted. For example, the level corresponding to a 'day'. There are multiple ways of determining what the level is, namely by lookup given a Calendar field or taking it from an existing Calendar instance.

          The response structure, 'Facets', looks like this:

          public static class Facets {
              //TODO consider a variable-level structure -- more general purpose.
          
              public Facets(int detailLevel) {
                this.detailLevel = detailLevel;
              }
          
              /** The bottom-most detail-level counted, as requested. */
              public final int detailLevel;
          
              /**
               * The count of documents with ranges that completely spanned the parents of the detail level. In more technical
               * terms, this is the count of leaf cells 2 up and higher from the bottom. Usually you only care about counts at
               * detailLevel, and so you will add this number to all other counts below, including to omitted/implied children
               * counts of 0. If there are no indexed ranges (just instances, i.e. fully specified dates) then this value will
               * always be 0.
               */
              public int topLeaves;
          
              /** Holds all the {@link FacetParentVal} instances in order of the key. This is sparse; there won't be an
               * instance if it's count and children are all 0. The keys are {@link LevelledValue} shapes, which can be
               * converted back to the original Object (i.e. a Calendar) via {@link #toObject(LevelledValue)}. */
              public final SortedMap<LevelledValue,FacetParentVal> parents = new TreeMap<>();
          
              /** Holds a block of detailLevel counts aggregated to their parent level. */
              public static class FacetParentVal {
          
                /** The count of ranges that span all of the childCount.  In more technical terms, this is the number of leaf
                 * cells found at this parent.  Treat this like {@link Facets#topLeaves}. */
                public int parentLeaves;// (parent leaf count) to be added to all descendants (children)
          
                /** The length of {@link #childCounts}. If childCounts is not null then this is childCounts.length, otherwise it
                 * says how long it would have been if it weren't null. */
                public int childCountsLen;
          
                /** The detail level counts. It will be null if there are none, and thus they are assumed 0. Most apps, when
                 * presenting the information, will add {@link #topLeaves} and {@link #parentLeaves} to each count. */
                public int[] childCounts;//null if there are none.
                //assert childCountsLen == childCounts.length
              }
            }
          

          I've got a toString() on it with concise output that I found nice to look at during debugging.

          The patch has some small changes to related classes involved that are mostly little refactorings and/or making things visible that the facets code needs that were previously private. I'd like to make more refactoring/renaming happen around there to make this number/date spatial API a little more friendly. I'm sure it'll be a bit awkward to newcomers.

          Show
          dsmiley David Smiley added a comment - Here's the patch, with randomized testing verification of results. The faceting code doesn't make any date assumptions, and so it should work once a non-date NumberRangePrefixTree subclass comes into existence. I added a method to NumberRangePrefixTreeStrategy with this signature: public Facets calcFacets(IndexReaderContext context, final Bits acceptDocs, Shape facetRange, int level) acceptDocs is a filter of documents to count (null to count all docs). facetRange is a range that will limit the values counted to that provided range (e.g. a start date and end date span). Get one of those easily via toRangeShape(start,end) method. 'level' is the bottom-most prefix tree level to be counted. For example, the level corresponding to a 'day'. There are multiple ways of determining what the level is, namely by lookup given a Calendar field or taking it from an existing Calendar instance. The response structure, 'Facets', looks like this: public static class Facets { //TODO consider a variable-level structure -- more general purpose. public Facets( int detailLevel) { this .detailLevel = detailLevel; } /** The bottom-most detail-level counted, as requested. */ public final int detailLevel; /** * The count of documents with ranges that completely spanned the parents of the detail level. In more technical * terms, this is the count of leaf cells 2 up and higher from the bottom. Usually you only care about counts at * detailLevel, and so you will add this number to all other counts below, including to omitted/implied children * counts of 0. If there are no indexed ranges (just instances, i.e. fully specified dates) then this value will * always be 0. */ public int topLeaves; /** Holds all the {@link FacetParentVal} instances in order of the key. This is sparse; there won't be an * instance if it's count and children are all 0. The keys are {@link LevelledValue} shapes, which can be * converted back to the original Object (i.e. a Calendar) via {@link #toObject(LevelledValue)}. */ public final SortedMap<LevelledValue,FacetParentVal> parents = new TreeMap<>(); /** Holds a block of detailLevel counts aggregated to their parent level. */ public static class FacetParentVal { /** The count of ranges that span all of the childCount. In more technical terms, this is the number of leaf * cells found at this parent. Treat this like {@link Facets#topLeaves}. */ public int parentLeaves; // (parent leaf count) to be added to all descendants (children) /** The length of {@link #childCounts}. If childCounts is not null then this is childCounts.length, otherwise it * says how long it would have been if it weren't null . */ public int childCountsLen; /** The detail level counts. It will be null if there are none, and thus they are assumed 0. Most apps, when * presenting the information, will add {@link #topLeaves} and {@link #parentLeaves} to each count. */ public int [] childCounts; // null if there are none. // assert childCountsLen == childCounts.length } } I've got a toString() on it with concise output that I found nice to look at during debugging. The patch has some small changes to related classes involved that are mostly little refactorings and/or making things visible that the facets code needs that were previously private. I'd like to make more refactoring/renaming happen around there to make this number/date spatial API a little more friendly. I'm sure it'll be a bit awkward to newcomers.
          Hide
          shaie Shai Erera added a comment -

          Would it make sense to use the facet/ module Facets and FacetResult? I haven't looked at the patch, but if we can, it will be good since those facets could be used together with other facets (e.g. range faceting on numeric fields).

          Show
          shaie Shai Erera added a comment - Would it make sense to use the facet/ module Facets and FacetResult? I haven't looked at the patch, but if we can, it will be good since those facets could be used together with other facets (e.g. range faceting on numeric fields).
          Hide
          dsmiley David Smiley added a comment -

          Hello Shai Erera. I did take a quick look even before your suggestion now and there isn't much congruency between both. For one thing, the facet module yields a LabelAndValue for each count yet for the faceting I did, I deliberately avoided computing the label String for each value and instead return a int[] for a higher level bucket. So if you facet by day, you'll get a month "parent" with a label and an int array of size 31 (or sometimes 30 or 28). Also, I don't know what to make of the "dim" and "path" stuff. So it doesn't seem like a good fit, and might erroneously suggest to users some sort of integration with the facets module that doesn't actually exist.

          Show
          dsmiley David Smiley added a comment - Hello Shai Erera . I did take a quick look even before your suggestion now and there isn't much congruency between both. For one thing, the facet module yields a LabelAndValue for each count yet for the faceting I did, I deliberately avoided computing the label String for each value and instead return a int[] for a higher level bucket. So if you facet by day, you'll get a month "parent" with a label and an int array of size 31 (or sometimes 30 or 28). Also, I don't know what to make of the "dim" and "path" stuff. So it doesn't seem like a good fit, and might erroneously suggest to users some sort of integration with the facets module that doesn't actually exist.
          Hide
          shaie Shai Erera added a comment -

          OK in that case it sounds like it doesn't make much sense then. No need to force anything.

          Show
          shaie Shai Erera added a comment - OK in that case it sounds like it doesn't make much sense then. No need to force anything.
          Hide
          dsmiley David Smiley added a comment -

          Updated patch to sync with refactors recently committed, and a minor test bug. I think I may commit this to trunk soonish to get Jenkins beating on it... though Ryan McKinley I'd appreciate a look.

          Show
          dsmiley David Smiley added a comment - Updated patch to sync with refactors recently committed, and a minor test bug. I think I may commit this to trunk soonish to get Jenkins beating on it... though Ryan McKinley I'd appreciate a look.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1647727 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1647727 ]

          LUCENE-5735: number/date range faceting on NumberRangePrefixTreeStrategy

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1647727 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1647727 ] LUCENE-5735 : number/date range faceting on NumberRangePrefixTreeStrategy
          Hide
          dsmiley David Smiley added a comment -

          This patch improves on the trunk committed code:

          • Refactors out a PrefixTreeStrategy agnostic "PrefixTreeFacetCounter" with a callback/visitor for each count. NumberRangePrefixFacets is no more, and the NumberRangePrefixTreeStrategy provides a callback to populate the same data structure it did before. It could be slower because of a separation that's here now that wasn't before but I doubt it'd be noticeable.
          • Instead of filtering docs to be faceted with a Bits, you now provide a Filter. I think this is much better API wise than before.

          This patch depends on two recent small spatial refactorings, LUCENE-6181 and LUCENE-6182. I'll commit these issues in a bit and let Jenkins beat on it.

          Show
          dsmiley David Smiley added a comment - This patch improves on the trunk committed code: Refactors out a PrefixTreeStrategy agnostic "PrefixTreeFacetCounter" with a callback/visitor for each count. NumberRangePrefixFacets is no more, and the NumberRangePrefixTreeStrategy provides a callback to populate the same data structure it did before. It could be slower because of a separation that's here now that wasn't before but I doubt it'd be noticeable. Instead of filtering docs to be faceted with a Bits, you now provide a Filter. I think this is much better API wise than before. This patch depends on two recent small spatial refactorings, LUCENE-6181 and LUCENE-6182 . I'll commit these issues in a bit and let Jenkins beat on it.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1652212 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1652212 ]

          LUCENE-5735: spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1652212 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1652212 ] LUCENE-5735 : spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1652254 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1652254 ]

          LUCENE-5735: move PrefixTreeFacetCounter up a package

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1652254 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1652254 ] LUCENE-5735 : move PrefixTreeFacetCounter up a package
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1652270 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1652270 ]

          LUCENE-5735: don't run test 10,000 times. if this is needed please change back but add Nightly

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1652270 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1652270 ] LUCENE-5735 : don't run test 10,000 times. if this is needed please change back but add Nightly
          Hide
          dsmiley David Smiley added a comment -

          The PrefixTreeFacetCounter utility is good; if it doesn't get committed to 5x as part of this issue first, it will for the heatmap one.

          There's a bug in NumberRangePrefixTreeStrategy.calcFacets in which all cells above the parent are counted as topLeaves, when really that can only be done if the leaf cell contains the facet range. I have a fix in-progress in which I detect this and if the cell doesn't contain the facet range then I walk the sub-cells and increment the counters on the parent facet cells. There's a rare-ish bug I need to debug still. But thus far there are a few changes pending in my local check-out:

          • Make TreeCellIterator public (lucene.internal, still) and allow the 'cell' to be a cell other than the top world cell. Probably add a reset() constructor-like method to re-use an instance.
          • NRCell has an optimization when getting subCells that seems to work fine in the normal code-paths thus far but the updated faceting code in-progress has shown the optimization to be faulty, so I just removed it as I don't think it was worth trying to make it work.
          • NRCell sometimes can't get subCells if it was initialized from a short length shape/bytes; it should instead always initialize it's array to maxLevels. Again; this apparently never happen in normal code paths but in some toy test code I triggered it.
          • Refactor the two main date range tests to share a random calendar utility (RandomCalHelper).
          Show
          dsmiley David Smiley added a comment - The PrefixTreeFacetCounter utility is good; if it doesn't get committed to 5x as part of this issue first, it will for the heatmap one. There's a bug in NumberRangePrefixTreeStrategy.calcFacets in which all cells above the parent are counted as topLeaves, when really that can only be done if the leaf cell contains the facet range. I have a fix in-progress in which I detect this and if the cell doesn't contain the facet range then I walk the sub-cells and increment the counters on the parent facet cells. There's a rare-ish bug I need to debug still. But thus far there are a few changes pending in my local check-out: Make TreeCellIterator public (lucene.internal, still) and allow the 'cell' to be a cell other than the top world cell. Probably add a reset() constructor-like method to re-use an instance. NRCell has an optimization when getting subCells that seems to work fine in the normal code-paths thus far but the updated faceting code in-progress has shown the optimization to be faulty, so I just removed it as I don't think it was worth trying to make it work. NRCell sometimes can't get subCells if it was initialized from a short length shape/bytes; it should instead always initialize it's array to maxLevels. Again; this apparently never happen in normal code paths but in some toy test code I triggered it. Refactor the two main date range tests to share a random calendar utility (RandomCalHelper).
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1658300 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1658300 ]

          LUCENE-5735: spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1658300 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1658300 ] LUCENE-5735 : spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)
          Hide
          dsmiley David Smiley added a comment -

          After the 6x branch is cut, I'm going to delete this from the 6x branch but leave it in master. Even though the current test passes; it's too in-progress. I'll likely get some more time on this to finish it yet but not ATM.

          Show
          dsmiley David Smiley added a comment - After the 6x branch is cut, I'm going to delete this from the 6x branch but leave it in master. Even though the current test passes; it's too in-progress. I'll likely get some more time on this to finish it yet but not ATM.
          Hide
          mikemccand Michael McCandless added a comment -

          David Smiley you can remove this now from both 6.0.x and 6.x branches?

          Show
          mikemccand Michael McCandless added a comment - David Smiley you can remove this now from both 6.0.x and 6.x branches?
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 0b15fd86364dbafe08abc2939d54749e69bca23f in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0b15fd8 ]

          LUCENE-5735: remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now

          Show
          jira-bot ASF subversion and git services added a comment - Commit 0b15fd86364dbafe08abc2939d54749e69bca23f in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0b15fd8 ] LUCENE-5735 : remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 239eb2400754ac5d31ee11f44e38567623ec8557 in lucene-solr's branch refs/heads/branch_6_0 from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=239eb24 ]

          LUCENE-5735: remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now
          (cherry picked from commit 0b15fd8)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 239eb2400754ac5d31ee11f44e38567623ec8557 in lucene-solr's branch refs/heads/branch_6_0 from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=239eb24 ] LUCENE-5735 : remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now (cherry picked from commit 0b15fd8)

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              dsmiley David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Development