Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1
    • Component/s: modules/spatial
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene spatial's PrefixTree (grid) based strategies index data in a way highly amenable to faceting on grids cells to compute a so-called heatmap. The underlying code in this patch uses the PrefixTreeFacetCounter utility class which was recently refactored out of faceting for NumberRangePrefixTree LUCENE-5735. At a low level, the terms (== grid cells) are navigated per-segment, forward only with TermsEnum.seek, so it's pretty quick and furthermore requires no extra caches & no docvalues. Ideally you should use QuadPrefixTree (or Flex once it comes out) to maximize the number grid levels which in turn maximizes the fidelity of choices when you ask for a grid covering a region. Conveniently, the provided capability returns the data in a 2-D grid of counts, so the caller needn't know a thing about how the data is encoded in the prefix tree. Well almost... at this point they need to provide a grid level, but I'll soon provide a means of deriving the grid level based on a min/max cell count.

      I recommend QuadPrefixTree with geo=false so that you can provide a square world-bounds (360x360 degrees), which means square grid cells which are more desirable to display than rectangular cells.

      1. LUCENE-6191__Spatial_heatmap.patch
        29 kB
        David Smiley
      2. LUCENE-6191__Spatial_heatmap.patch
        28 kB
        David Smiley
      3. LUCENE-6191__Spatial_heatmap.patch
        25 kB
        David Smiley

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          This new feature works for not just points but any shape in the grid (e.g. polygons, lines, whatever). Remember that all non-point shapes are approximated to the grid at indexing time. So if you ask for a detailed heatmap on the edge of an indexed shape, the detail of that shape isn't going to magically be more detailed than whatever it was determined to be at indexing time.

          Dateline spanning heatmaps are supported.

          In the patch, the code is in HeatmapFacetCounter, a static utility class with a "Heatmap" inner class to hold the response data comprised of the column count, row count, an int[] of counts, and the rectangular region of where this heatmap is. And of course there's a test, one that uses randomized testing extensively.

          I'm tempted to add a method to PrefixTreeStrategy calcFacets2D or calcHeatmap or something like that which simply calls through to this class.

          Show
          David Smiley added a comment - This new feature works for not just points but any shape in the grid (e.g. polygons, lines, whatever). Remember that all non-point shapes are approximated to the grid at indexing time. So if you ask for a detailed heatmap on the edge of an indexed shape, the detail of that shape isn't going to magically be more detailed than whatever it was determined to be at indexing time. Dateline spanning heatmaps are supported. In the patch, the code is in HeatmapFacetCounter, a static utility class with a "Heatmap" inner class to hold the response data comprised of the column count, row count, an int[] of counts, and the rectangular region of where this heatmap is. And of course there's a test, one that uses randomized testing extensively. I'm tempted to add a method to PrefixTreeStrategy calcFacets2D or calcHeatmap or something like that which simply calls through to this class.
          Hide
          David Smiley added a comment -

          BTW I took a peek at ElasticSearch's geohash aggregations feature to see how that similar feature worked. It's quite different. AFAICT, it's only for point data and works off of DocValues, and at least presently it always exposes the counts as faceting on geohashes. (i.e. frequency ordered geohash terms with counts). The algorithmic complexity is based on the number of documents matching your search O(docs), whereas this patch is O(log(terms)) with a constant factor of how many grid cells you request.

          Show
          David Smiley added a comment - BTW I took a peek at ElasticSearch's geohash aggregations feature to see how that similar feature worked. It's quite different. AFAICT, it's only for point data and works off of DocValues, and at least presently it always exposes the counts as faceting on geohashes. (i.e. frequency ordered geohash terms with counts). The algorithmic complexity is based on the number of documents matching your search O(docs), whereas this patch is O(log(terms)) with a constant factor of how many grid cells you request.
          Hide
          Nicholas Knize added a comment -

          Lookin good David Smiley I like this approach and the first-cut implementation. For completeness it might be worthwhile to add some systematic unit tests with complex shapes (i.e., dateline and/or pole crossing polys w/ multiple holes) just to validate/verify expected behavior along coordinate system boundary conditions and to assess the possibility for false positives/negatives (which could be detrimental to some spatial analysis applications).

          Re: the former, I have patches for correcting polygonal ambiguity that will eventually make their way into either Spatial4J or Lucene-Spatial. The timing will just depend on our ES feature release schedule.

          Show
          Nicholas Knize added a comment - Lookin good David Smiley I like this approach and the first-cut implementation. For completeness it might be worthwhile to add some systematic unit tests with complex shapes (i.e., dateline and/or pole crossing polys w/ multiple holes) just to validate/verify expected behavior along coordinate system boundary conditions and to assess the possibility for false positives/negatives (which could be detrimental to some spatial analysis applications). Re: the former, I have patches for correcting polygonal ambiguity that will eventually make their way into either Spatial4J or Lucene-Spatial. The timing will just depend on our ES feature release schedule.
          Hide
          David Smiley added a comment -

          Thanks for the review Nick; I don't get much peer-review in the spatial module so I appreciate it when I get it greatly.

          I do tend to drink the randomized-testing cool-aid damn hard and it wouldn't hurt to add some basic static sanity tests into the mix too, like at some coordinate system boundaries. But regarding exotic shapes, IMO this isn't the right place to test them. That would go to some test or another in org.apache.lucene.spatial.prefix.tree perhaps, if at all. As long as the Shape implementation is tested to obey it's API contract, the SPT (SpatialPrefixTree, e.g. quad) & RPT (RecursivePrefixTree w/ collaborators) don't care. The 3 primary layers (Spatial4j, SPT, and RPT) are highly decoupled and tested independently. So an exotic shape is "just" another shape to SPT & the RPT so long as it implements relate() & bounding box properly. Arguably SPT isn't directly tested well right now, but RPT is tested heavily and beats on SPT hard.

          Show
          David Smiley added a comment - Thanks for the review Nick; I don't get much peer-review in the spatial module so I appreciate it when I get it greatly. I do tend to drink the randomized-testing cool-aid damn hard and it wouldn't hurt to add some basic static sanity tests into the mix too, like at some coordinate system boundaries. But regarding exotic shapes, IMO this isn't the right place to test them. That would go to some test or another in org.apache.lucene.spatial.prefix.tree perhaps, if at all . As long as the Shape implementation is tested to obey it's API contract, the SPT (SpatialPrefixTree, e.g. quad) & RPT (RecursivePrefixTree w/ collaborators) don't care. The 3 primary layers (Spatial4j, SPT, and RPT) are highly decoupled and tested independently. So an exotic shape is "just" another shape to SPT & the RPT so long as it implements relate() & bounding box properly. Arguably SPT isn't directly tested well right now, but RPT is tested heavily and beats on SPT hard.
          Hide
          Nicholas Knize added a comment - - edited

          That seems like an acceptable compromise. I'm a big fan of test coverage. In this case that we have test coverage for validating whether the grid cells returned are the expected cells contained within the shape (testing with rectangles here is easy for a quick proof of concept). Unexpected failures in the grid logic would result in incorrect facet counts, no? Might also be a worthwhile endeavor for assessing/documenting the accuracy of the spatial faceting results at the various grid levels (though if I understand your approach your level estimation will sort of help mitigate the resolution issue?)

          Show
          Nicholas Knize added a comment - - edited That seems like an acceptable compromise. I'm a big fan of test coverage. In this case that we have test coverage for validating whether the grid cells returned are the expected cells contained within the shape (testing with rectangles here is easy for a quick proof of concept). Unexpected failures in the grid logic would result in incorrect facet counts, no? Might also be a worthwhile endeavor for assessing/documenting the accuracy of the spatial faceting results at the various grid levels (though if I understand your approach your level estimation will sort of help mitigate the resolution issue?)
          Hide
          David Smiley added a comment -

          FYI Lucene coverage reports are online here: https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/spatial/pkg-summary.html

          Unexpected failures in the grid logic would result in incorrect facet counts, no?

          Absolutely; the randomized tests unearthed all sorts of things that were fixed before I posted the patch. It's a philosophy of going about testing that I can't speak more highly of, particularly for spatial – it is so ripe for it.

          Might also be a worthwhile endeavor for assessing/documenting the accuracy of the spatial faceting results at the various grid levels

          Unless I'm misunderstanding you, what you say is one in the same with documenting the resolution of grid cells of the SPT you choose (irrespective of faceting or filter/search or whatever you want to do with the cells). I noticed ES's docs have a nice chart for geohashes. SPT instances have methods to programmatically lookup the resolution.

          Show
          David Smiley added a comment - FYI Lucene coverage reports are online here: https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/spatial/pkg-summary.html Unexpected failures in the grid logic would result in incorrect facet counts, no? Absolutely; the randomized tests unearthed all sorts of things that were fixed before I posted the patch. It's a philosophy of going about testing that I can't speak more highly of, particularly for spatial – it is so ripe for it. Might also be a worthwhile endeavor for assessing/documenting the accuracy of the spatial faceting results at the various grid levels Unless I'm misunderstanding you, what you say is one in the same with documenting the resolution of grid cells of the SPT you choose (irrespective of faceting or filter/search or whatever you want to do with the cells). I noticed ES's docs have a nice chart for geohashes. SPT instances have methods to programmatically lookup the resolution.
          Hide
          Nicholas Knize added a comment - - edited

          Unless I'm misunderstanding you, what you say is one in the same with documenting the resolution of grid cells of the SPT you choose

          Spatial resolution is part of the equation. I was referring to the percent error of the facet results. With overlapping polys, for example. At level n the boundaries of polys A:F intersect a grid cell but are contained within polys G:J. At level n the facet count returns 4 (for G:J) but at level n+1 returns 10 (for A:J) - a 60% error at level n. It might be nice to include percent error as a part of the results. If the heatmap is converted to a thematic intensity map those grid cells will underestimate ground truth. By including percent error (or even the residuals) RMS could be computed and provided at the varying zoom levels.

          "98% of statistics is made up"

          Show
          Nicholas Knize added a comment - - edited Unless I'm misunderstanding you, what you say is one in the same with documenting the resolution of grid cells of the SPT you choose Spatial resolution is part of the equation. I was referring to the percent error of the facet results. With overlapping polys, for example. At level n the boundaries of polys A:F intersect a grid cell but are contained within polys G:J . At level n the facet count returns 4 (for G:J ) but at level n+1 returns 10 (for A:J ) - a 60% error at level n . It might be nice to include percent error as a part of the results. If the heatmap is converted to a thematic intensity map those grid cells will underestimate ground truth. By including percent error (or even the residuals) RMS could be computed and provided at the varying zoom levels. "98% of statistics is made up"
          Hide
          David Smiley added a comment -

          I believe the % error of the facet results is the same as the resolution (grid cell size) chosen + impacted by the resolution settings chosen when shapes were indexed. Unfortunately, there isn't one error figure since at index time each shape is independently turned into cells of a maximum resolution based on the size of the shape (which can vary). So USSR is going to be approximated more than a small country next to it would be. So if your heatmap is at a higher resolution than the indexed shapes in the region in question, then you may observe some blocky effects (e.g. all cells on one side of the heatmap could be +1 due to matching an adjacent country). A "fix" could be to always index to a maximum or fixed resolution... but that's not workable when the shape is large – it doesn't scale. I have an article on this subject: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/

          Show
          David Smiley added a comment - I believe the % error of the facet results is the same as the resolution (grid cell size) chosen + impacted by the resolution settings chosen when shapes were indexed. Unfortunately, there isn't one error figure since at index time each shape is independently turned into cells of a maximum resolution based on the size of the shape (which can vary). So USSR is going to be approximated more than a small country next to it would be. So if your heatmap is at a higher resolution than the indexed shapes in the region in question, then you may observe some blocky effects (e.g. all cells on one side of the heatmap could be +1 due to matching an adjacent country). A "fix" could be to always index to a maximum or fixed resolution... but that's not workable when the shape is large – it doesn't scale. I have an article on this subject: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
          Hide
          David Smiley added a comment -

          I have some performance numbers taken while working on SOLR-7005. I took a geonames data set of 8,552,952 docs and I indexed the latitude & longitude into a quad prefixTree with maximum resolution of a meter and with geo=false and -180 to 180, -90 to 90 world bounds of standard geodetic degree boundaries. That's a screw-up on my part; I forgot to use 360x360 to get square grid boxes instead of rectangular ones. But that's not pertinent. The index size is 2.6GB which is kind of large. Increasing the maximum resolution to above a meter will decrease the index size a lot. This reminds me of how beneficial the forthcoming "flex" prefixTree will be, but I digress. This data is all points.

          Base stats:

          • Machine: my SSD based recent MacBook Pro, Java 8
          • Lucene/Solr: trunk as of last night
          • Docs: 8,552,952
          • Segments: 1
          • Disk index size: 2.6GB
          • QuadTree:
            • precision: 26 (better than a meter)

          512x512 heatmap, (note: this is a whopping 262,144 cells): 248ms (PNG to be attached to SOLR-7005 soon).
          Now filtered with an additional query down to 165 docs: 105ms (I figure this fast number is due to a particular optimization in the prefix tree facet counter for highly discriminating filters).

          64x64 heatmap (4,096 cells): 105ms
          Filtered to 165 docs: 21ms

          I took one measurement when the index was un-optimized at 38 segments, including 10K deleted docs (512x512 query all): 1800ms roughly. I should try this again after I re-index with the square grid cells I want.

          Show
          David Smiley added a comment - I have some performance numbers taken while working on SOLR-7005 . I took a geonames data set of 8,552,952 docs and I indexed the latitude & longitude into a quad prefixTree with maximum resolution of a meter and with geo=false and -180 to 180, -90 to 90 world bounds of standard geodetic degree boundaries. That's a screw-up on my part; I forgot to use 360x360 to get square grid boxes instead of rectangular ones. But that's not pertinent. The index size is 2.6GB which is kind of large. Increasing the maximum resolution to above a meter will decrease the index size a lot. This reminds me of how beneficial the forthcoming "flex" prefixTree will be, but I digress. This data is all points. Base stats: Machine: my SSD based recent MacBook Pro, Java 8 Lucene/Solr: trunk as of last night Docs: 8,552,952 Segments: 1 Disk index size: 2.6GB QuadTree: precision: 26 (better than a meter) 512x512 heatmap, ( note: this is a whopping 262,144 cells ): 248ms (PNG to be attached to SOLR-7005 soon). Now filtered with an additional query down to 165 docs: 105ms (I figure this fast number is due to a particular optimization in the prefix tree facet counter for highly discriminating filters). 64x64 heatmap (4,096 cells): 105ms Filtered to 165 docs: 21ms I took one measurement when the index was un-optimized at 38 segments, including 10K deleted docs (512x512 query all): 1800ms roughly. I should try this again after I re-index with the square grid cells I want.
          Hide
          David Smiley added a comment -

          delta stats:

          • Segments: 16 (no deleted)
          • docs: 8,526,175 (slightly less than before, not sure why).
          • QuadTree
            precision: 22 (better than 10m)
            bounds: -180 to 180, -180 to 180 (360x360 square)
          • Disk index size: 2.35GB
          • heatmap input range: -180 to 180, -89.999 to 89.999 (slightly inset so heatmap doesn't include a row just > 90 and just < 90)

          512x256 (131,072 cells) heatmap : 882ms
          64x32 (2048 cells) heatmap: 120ms

          Show
          David Smiley added a comment - delta stats: Segments: 16 (no deleted) docs: 8,526,175 (slightly less than before, not sure why). QuadTree precision: 22 (better than 10m) bounds: -180 to 180, -180 to 180 (360x360 square) Disk index size: 2.35GB heatmap input range: -180 to 180, -89.999 to 89.999 (slightly inset so heatmap doesn't include a row just > 90 and just < 90) 512x256 (131,072 cells) heatmap : 882ms 64x32 (2048 cells) heatmap: 120ms
          Hide
          Nicholas Knize added a comment -

          The error factor would certainly be a function of spatial resolution but not the same since you're dealing with expected vs. observed counts (an RMS over the result set vs. %spatialErr). It'd be worth exploring for a later enhancement (maybe an option to include descriptive stats as part of the faceting operation for spatial analysis use-cases) but not critical for the initial capability.

          I'm just not a fan of creating analysis results without communicating some kind of accuracy. It leads to data misrepresentation.

          I do like what you have going on here. I'll experiment with it when I get some time and see if I can't help get some low-overhead accuracy results.

          Show
          Nicholas Knize added a comment - The error factor would certainly be a function of spatial resolution but not the same since you're dealing with expected vs. observed counts (an RMS over the result set vs. %spatialErr). It'd be worth exploring for a later enhancement (maybe an option to include descriptive stats as part of the faceting operation for spatial analysis use-cases) but not critical for the initial capability. I'm just not a fan of creating analysis results without communicating some kind of accuracy. It leads to data misrepresentation. I do like what you have going on here. I'll experiment with it when I get some time and see if I can't help get some low-overhead accuracy results.
          Hide
          David Smiley added a comment -

          This is an updated patch.

          • Check for null input range; swap in world-bounds.
          • Add some static tests that don't use randomization (in addition to extensive randomized one). They aren't much... but it's something.

          And that's about it. I plan to commit this this weekend to trunk & 5x (which means copying PrefixTreeFacetCounter from trunk along with this patch).

          One thing i'm 50/50 on is the ordering of the heatmap counts (int[] counts). The layout is column 1, column 2, ... etc.. Alternatively, perhaps the layout should be row 1, row 2, row 3, .... It's arbitrary of course, and so I'm inclined to let it be since it really doesn't matter. The the by-row layout would feel more closely aligned with viewing one's screen and match the orientation of some image/screen APIs that draw top->down.

          Show
          David Smiley added a comment - This is an updated patch. Check for null input range; swap in world-bounds. Add some static tests that don't use randomization (in addition to extensive randomized one). They aren't much... but it's something. And that's about it. I plan to commit this this weekend to trunk & 5x (which means copying PrefixTreeFacetCounter from trunk along with this patch). One thing i'm 50/50 on is the ordering of the heatmap counts (int[] counts). The layout is column 1, column 2, ... etc.. Alternatively, perhaps the layout should be row 1, row 2, row 3, .... It's arbitrary of course, and so I'm inclined to let it be since it really doesn't matter. The the by-row layout would feel more closely aligned with viewing one's screen and match the orientation of some image/screen APIs that draw top->down.
          Hide
          David Smiley added a comment -

          This updated patch changes the inputRange argument from being a Rectangle to a Shape, and adds a test with a circle.
          docs:

          inputShape - the shape to gather grid squares for; typically a Rectangle. The actual heatmap area will usually be larger since the cells on the edge that overlap are returned. We always return a rectangle of integers even if the inputShape isn't a rectangle – the non-intersecting cells will all be 0. If null is given, the entire world is assumed.

          Show
          David Smiley added a comment - This updated patch changes the inputRange argument from being a Rectangle to a Shape, and adds a test with a circle. docs: inputShape - the shape to gather grid squares for; typically a Rectangle. The actual heatmap area will usually be larger since the cells on the edge that overlap are returned. We always return a rectangle of integers even if the inputShape isn't a rectangle – the non-intersecting cells will all be 0. If null is given, the entire world is assumed.
          Hide
          ASF subversion and git services added a comment -

          Commit 1658298 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1658298 ]

          LUCENE-6191: Spatial 2D heatmap for PrefixTreeStrategy

          Show
          ASF subversion and git services added a comment - Commit 1658298 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1658298 ] LUCENE-6191 : Spatial 2D heatmap for PrefixTreeStrategy
          Hide
          ASF subversion and git services added a comment -

          Commit 1658302 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1658302 ]

          LUCENE-6191: Spatial 2D heatmap for PrefixTreeStrategy

          Show
          ASF subversion and git services added a comment - Commit 1658302 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1658302 ] LUCENE-6191 : Spatial 2D heatmap for PrefixTreeStrategy
          Hide
          David Smiley added a comment -

          One little change in the committed version is a convenience method on PrefixTreeStrategy, which also gives the capability a bit more visibility.

          Show
          David Smiley added a comment - One little change in the committed version is a convenience method on PrefixTreeStrategy, which also gives the capability a bit more visibility.
          Hide
          ASF subversion and git services added a comment -

          Commit 1659041 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1659041 ]

          LUCENE-6191: fix test bug when given 0-area input

          Show
          ASF subversion and git services added a comment - Commit 1659041 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1659041 ] LUCENE-6191 : fix test bug when given 0-area input
          Hide
          ASF subversion and git services added a comment -

          Commit 1659042 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1659042 ]

          LUCENE-6191: fix test bug when given 0-area input

          Show
          ASF subversion and git services added a comment - Commit 1659042 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1659042 ] LUCENE-6191 : fix test bug when given 0-area input
          Hide
          David Smiley added a comment -

          FYI I'm adding an advisory to the javadocs to PrefixTreeFacetCounter that double-counting can occur in certain avoidable situations:

           * <em>NOTE:</em> If for a given document and a given field using 
           * {@link org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy}
           * multiple values are indexed (i.e. multi-valued) and at least one of them is a non-point, then there is a possibility
           * of double-counting the document in the facet results.  Since each shape is independently turned into grid cells at
           * a resolution chosen by the shape's size, it's possible they will be indexed at different resolutions.  This means
           * the document could be present in BOTH the postings for a cell in both its prefix and leaf variants.  To avoid this,
           * use a single valued field with a {@link com.spatial4j.core.shape.ShapeCollection} (or WKT equivalent).  Or
           * calculate a suitable level/distErr to index both and call
           * {@link org.apache.lucene.spatial.prefix.PrefixTreeStrategy#createIndexableFields(com.spatial4j.core.shape.Shape, int)}
           * with the same value for all shapes for a given document/field.
          
          Show
          David Smiley added a comment - FYI I'm adding an advisory to the javadocs to PrefixTreeFacetCounter that double-counting can occur in certain avoidable situations: * <em>NOTE:</em> If for a given document and a given field using * {@link org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy} * multiple values are indexed (i.e. multi-valued) and at least one of them is a non-point, then there is a possibility * of double -counting the document in the facet results. Since each shape is independently turned into grid cells at * a resolution chosen by the shape's size, it's possible they will be indexed at different resolutions. This means * the document could be present in BOTH the postings for a cell in both its prefix and leaf variants. To avoid this , * use a single valued field with a {@link com.spatial4j.core.shape.ShapeCollection} (or WKT equivalent). Or * calculate a suitable level/distErr to index both and call * {@link org.apache.lucene.spatial.prefix.PrefixTreeStrategy#createIndexableFields(com.spatial4j.core.shape.Shape, int )} * with the same value for all shapes for a given document/field.
          Hide
          Timothy Potter added a comment -

          Bulk close after 5.1 release

          Show
          Timothy Potter added a comment - Bulk close after 5.1 release
          Hide
          David Smiley added a comment -

          Note: this code was donated to the Apache Software Foundation by the Harvard Center for Geographic Analysis as part of the Harvard Hypermap (HHypermap) and WorldMap projects.

          Show
          David Smiley added a comment - Note: this code was donated to the Apache Software Foundation by the Harvard Center for Geographic Analysis as part of the Harvard Hypermap (HHypermap) and WorldMap projects.

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development