Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7401

BKDWriter should ensure all dimensions are indexed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The current heuristic is to use the dimension that has the largest span, so if dimensions have a different distribution of values, we could theoretically (but maybe in practice too?) end up with one dimension that is not indexed at all and queries that are mostly selective on this dimension would need to scan lots of blocks.

      1. LUCENE-7401.patch
        10 kB
        Adrien Grand

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Are you trying to address the adersarial case of indexing e.g. a narrow sliver of points?

        Another fun one is if all indexed points are equidistant from an origin. I've wondered whether cells should be "shrink wrapped" during indexing to handle this one...

        There are quite a few papers that explore different splitting techniques to have better behavior with challenging cases.

        Show
        mikemccand Michael McCandless added a comment - Are you trying to address the adersarial case of indexing e.g. a narrow sliver of points? Another fun one is if all indexed points are equidistant from an origin. I've wondered whether cells should be "shrink wrapped" during indexing to handle this one... There are quite a few papers that explore different splitting techniques to have better behavior with challenging cases.
        Hide
        jpountz Adrien Grand added a comment -

        Are you trying to address the adersarial case of indexing e.g. a narrow sliver of points?

        Yes indeed. It might feel like a corner case when dimensions represent similar data like lontitudes and latitudes, but what happens eg. if you want to index all towns in the world alongside their population as a 3rd dimension. Given that there are very large areas that only have small towns, it could happen that the population dimension does not get indexed at all in these areas?

        Another fun one is if all indexed points are equidistant from an origin. I've wondered whether cells should be "shrink wrapped" during indexing to handle this one...

        Hmm this got me curious, why is it an adversarial case if all points are equidistant from an origin?

        There are quite a few papers that explore different splitting techniques to have better behavior with challenging cases.

        Good point, I should take a look!

        Show
        jpountz Adrien Grand added a comment - Are you trying to address the adersarial case of indexing e.g. a narrow sliver of points? Yes indeed. It might feel like a corner case when dimensions represent similar data like lontitudes and latitudes, but what happens eg. if you want to index all towns in the world alongside their population as a 3rd dimension. Given that there are very large areas that only have small towns, it could happen that the population dimension does not get indexed at all in these areas? Another fun one is if all indexed points are equidistant from an origin. I've wondered whether cells should be "shrink wrapped" during indexing to handle this one... Hmm this got me curious, why is it an adversarial case if all points are equidistant from an origin? There are quite a few papers that explore different splitting techniques to have better behavior with challenging cases. Good point, I should take a look!
        Hide
        mikemccand Michael McCandless added a comment -

        what happens eg. if you want to index all towns in the world alongside their population as a 3rd dimension. Given that there are very large areas that only have small towns, it could happen that the population dimension does not get indexed at all in these areas?

        That's a good example! In that case, with our current splitting, running a range filter for "small population" will be costly. Though, without other filters (by lat/lon) it will likely be costly anyway since town population is probably Zipf's law like? I.e., most areas will still have many more small population towns than big ones.

        Hmm this got me curious, why is it an adversarial case if all points are equidistant from an origin?

        Oh it results in long slivery KD cells, which means queries have to visit too many points.

        Show
        mikemccand Michael McCandless added a comment - what happens eg. if you want to index all towns in the world alongside their population as a 3rd dimension. Given that there are very large areas that only have small towns, it could happen that the population dimension does not get indexed at all in these areas? That's a good example! In that case, with our current splitting, running a range filter for "small population" will be costly. Though, without other filters (by lat/lon) it will likely be costly anyway since town population is probably Zipf's law like? I.e., most areas will still have many more small population towns than big ones. Hmm this got me curious, why is it an adversarial case if all points are equidistant from an origin? Oh it results in long slivery KD cells, which means queries have to visit too many points.
        Hide
        jpountz Adrien Grand added a comment -

        That's a good example! In that case, with our current splitting, running a range filter for "small population" will be costly. Though, without other filters (by lat/lon) it will likely be costly anyway since town population is probably Zipf's law like? I.e., most areas will still have many more small population towns than big ones.

        Right, a filter for "small population" is costly, but I don't think we can do much about it since it is due to the fact it would match lots of documents. The problem that would like to address here is the opposite: filtering for "large population", for instance: "Find all cities in America that have more than 10K inhabitants". This would be a selective filter, so hopefully something that we can execute efficiently. However with the current way the splitting works, the population dimension might never be indexed (because BKD would always decide to index the lat or lon instead, which have larger ranges) and such a query, which is very selective on the population dimension, would likely have to visit all documents that match "America" to find matches, which is disappointing.

        Here is a patch that implements what I had in mind when opening this ticket. It ensures that every dimension gets split no less than 2x less than the dimension that had the most splits. So back to our 3 dimensions example with lat, lon and population, let's assume we have 1M docs, it would mean we need to split 10 times. In such a scenario, we would likely split 4 times on lat, 4 times on lon and 2 times on the population dimension.

        I think it is a better trade-off since it has a better worst case for selective queries. For instance the fact that today the geo dimensions would get 10 splits means that a selective geo query would have to visit 1/1024th of the index, but a selective query on population would have to perform a linear scan. With this patch a selective geo query would have to visit 1/256th of the index (8 splits), which is slower, however a selective query on the population dimension would only have to visit 1/4th of the index (2 splits).

        Show
        jpountz Adrien Grand added a comment - That's a good example! In that case, with our current splitting, running a range filter for "small population" will be costly. Though, without other filters (by lat/lon) it will likely be costly anyway since town population is probably Zipf's law like? I.e., most areas will still have many more small population towns than big ones. Right, a filter for "small population" is costly, but I don't think we can do much about it since it is due to the fact it would match lots of documents. The problem that would like to address here is the opposite: filtering for "large population", for instance: "Find all cities in America that have more than 10K inhabitants". This would be a selective filter, so hopefully something that we can execute efficiently. However with the current way the splitting works, the population dimension might never be indexed (because BKD would always decide to index the lat or lon instead, which have larger ranges) and such a query, which is very selective on the population dimension, would likely have to visit all documents that match "America" to find matches, which is disappointing. Here is a patch that implements what I had in mind when opening this ticket. It ensures that every dimension gets split no less than 2x less than the dimension that had the most splits. So back to our 3 dimensions example with lat, lon and population, let's assume we have 1M docs, it would mean we need to split 10 times. In such a scenario, we would likely split 4 times on lat, 4 times on lon and 2 times on the population dimension. I think it is a better trade-off since it has a better worst case for selective queries. For instance the fact that today the geo dimensions would get 10 splits means that a selective geo query would have to visit 1/1024th of the index, but a selective query on population would have to perform a linear scan. With this patch a selective geo query would have to visit 1/256th of the index (8 splits), which is slower, however a selective query on the population dimension would only have to visit 1/4th of the index (2 splits).
        Hide
        mikemccand Michael McCandless added a comment -

        +1, I like how simple the implementation is, just tracking how many times each dim was split "above" us.

        Show
        mikemccand Michael McCandless added a comment - +1, I like how simple the implementation is, just tracking how many times each dim was split "above" us.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 0c1cab71920a540807555501f7198ca402e16740 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0c1cab7 ]

        LUCENE-7401: Make sure BKD trees index all dimensions.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 0c1cab71920a540807555501f7198ca402e16740 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0c1cab7 ] LUCENE-7401 : Make sure BKD trees index all dimensions.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit ba47f530d1165d4518569422472bc9e4f1c04b26 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ba47f53 ]

        LUCENE-7401: Make sure BKD trees index all dimensions.

        Show
        jira-bot ASF subversion and git services added a comment - Commit ba47f530d1165d4518569422472bc9e4f1c04b26 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ba47f53 ] LUCENE-7401 : Make sure BKD trees index all dimensions.
        Hide
        jpountz Adrien Grand added a comment -

        Thanks Mike for having a look.

        Show
        jpountz Adrien Grand added a comment - Thanks Mike for having a look.

          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development