SOLR-10548: hyper-log-log based numBuckets for faceting

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.6
    • Component/s: Facet Module
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      numBuckets currently uses an estimate (the same one as the unique function described at http://yonik.com/solr-count-distinct/). We should either change the implementation or introduce a way to optionally select a hyper-log-log based approach, which gives a better estimate for fields with high cardinality.
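For readers who want to compare the estimates discussed here, the following is a minimal sketch of a JSON Facet request that asks for numBuckets on a terms facet alongside the unique() and hll() aggregations. The field name `user_id` and the facet labels are hypothetical, and the request is only built and printed, not sent to a Solr instance.

```python
import json

# Hypothetical field name; substitute a high-cardinality field from your schema.
FIELD = "user_id"


def build_facet_request(field, limit=10):
    """Build a JSON request body with a terms facet and two distinct-count aggregations."""
    return {
        "query": "*:*",
        "limit": 0,  # no documents needed, only facet results
        "facet": {
            # Terms facet that also reports the number of buckets.
            "by_field": {
                "type": "terms",
                "field": field,
                "limit": limit,
                "numBuckets": True,
            },
            # Distinct-count aggregations for comparison (labels are arbitrary).
            "distinct_unique": f"unique({field})",
            "distinct_hll": f"hll({field})",
        },
    }


request_body = json.dumps(build_facet_request(FIELD), indent=2)
print(request_body)
```

Posting this body to a collection's query endpoint and comparing `numBuckets` with the two aggregation values is one way to see how the estimates differ on a given index.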

          Activity

          Otis Gospodnetic added a comment:

          A new paper published in January introduced a new cardinality estimation algorithm called LogLog-Beta/β:

          https://arxiv.org/abs/1612.02284

          "The new algorithm uses only one formula and needs no additional bias
          corrections for the entire range of cardinalities, therefore, it is more
          efficient and simpler to implement. Our simulations show that the accuracy
          provided by the new algorithm is as good as or better than the accuracy
          provided by either of HyperLogLog or HyperLogLog++."
          Some comments about its accuracy (graphs included) can be found in this PR: https://github.com/antirez/redis/pull/3677
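To illustrate what the "bias corrections" mentioned in the quote refer to, here is a minimal sketch of the classic HyperLogLog estimator with its small-range (linear counting) correction. It is not LogLog-Beta and not Solr's hll implementation; the class name, the SHA-1 based hash, and the default precision are assumptions chosen for illustration only.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Illustrative classic HyperLogLog; names and hash choice are hypothetical."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Standard alpha constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        idx = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)
        return raw


sketch = HyperLogLogSketch()
for i in range(100_000):
    sketch.add(f"item-{i}")
print(round(sketch.estimate()))
```

The branch in `estimate` is the kind of range-dependent correction that the quoted abstract says LogLog-Beta avoids by using a single formula.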

          Yonik Seeley added a comment (edited):

          Thanks for the pointer, looks interesting! I've linked another issue for the implementation since we already have hyper-log-log implemented (hll) and I have a patch in progress to just use that for now.

          ASF subversion and git services added a comment:

          Commit 71ce0d31a6a907bf1566fc51324d5f26e4205c21 in lucene-solr's branch refs/heads/master from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=71ce0d3 ]

          SOLR-10548: SOLR-10552: numBuckets should use hll and ignore mincount>1 filtering

          ASF subversion and git services added a comment:

          Commit 1f67ddda7699e1889d600f3f155dd910d71e864f in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1f67ddd ]

          SOLR-10548: SOLR-10552: numBuckets should use hll and ignore mincount>1 filtering


            People

            • Assignee: Yonik Seeley
            • Reporter: Yonik Seeley
            • Votes: 0
            • Watchers: 3
