Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels: None

      Description

      The stats component currently supports "calcDistinct", but it's terribly inefficient – especially in distrib mode.

      We should add support for using HyperLogLog to compute an approximate count of distinct values (using localparams, via SOLR-6349, to control the precision of the approximation).
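
      For illustration, the kind of request this would enable (syntax as it ended up in the final patch; the field name is illustrative):

      /select?q=*:*&stats=true&stats.field={!cardinality=true}fieldname_l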

      Attachments

      1. SOLR-6968_nosparse.patch (3 kB, Hoss Man)
      2. SOLR-6968.patch (97 kB, Hoss Man)
      3. SOLR-6968.patch (85 kB, Hoss Man)
      4. SOLR-6968.patch (57 kB, Hoss Man)
      5. SOLR-6968.patch (41 kB, Hoss Man)
      6. SOLR-6968.patch (38 kB, Hoss Man)
      7. SOLR-6968.patch (23 kB, Hoss Man)


          Activity

          Hoss Man added a comment -

          Really simple straw-man implementation using java-hll...

          https://github.com/aggregateknowledge/java-hll

          The bulk of the current patch is in test refactoring because all the special case conditionals in StatsComponentTest.testIndividualStatLocalParams were driving me insane.

          Currently only cardinality of numeric fields is supported (and even then, only long fields really work "correctly"). Current syntax is...

          /select?q=*:*&stats=true&stats.field={!cardinality=true}fieldname_l
          

          ...but I'm thinking that should change ... there are at least two types of knobs we should support; I'm just not sure which is more important, or if either should be mandatory:

          • An indication of whether or not the input is already hashed
            • Reading up more on HLL, I'm realizing how important it is that the values be hashed (into longs).
            • We should certainly support on-the-fly hashing, but for people who plan to compute cardinalities a lot, particularly over large sets or strings, we should also have both:
              • an easy way for them to compute those long hashes at index time (a simple UpdateProcessor)
              • a stats localparam to indicate that the field they are computing cardinality over is already hashed
          • precision / size tuning
            • similar to how we have an optional "tdigestCompression" param, we could have an "hllOptions" param for overriding the "log2m" and "regwidth" options
            • or we could require that the value of the "cardinality" param be a value indicating how much the user cares about accuracy vs. RAM (i.e. a float between 0 and 1 indicating min RAM vs. max accuracy) and compute log2m+regwidth from those ("false" or negative values could disable it completely, while "true" could be shorthand for some default)
              • this would have the benefit of being something we could continue to support even if a better cardinality algorithm comes along in the future

          My next steps are to focus on more concrete tests & then refactoring to work with other field types, and think about the knobs/configuration as I go.
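
          To make the hashing point concrete, here's a minimal standalone sketch using java-hll with Guava's Murmur3 doing the hashing (java-hll doesn't hash for you) – the log2m/regwidth values and input data are illustrative, not what the patch does:

          import com.google.common.hash.HashFunction;
          import com.google.common.hash.Hashing;
          import net.agkn.hll.HLL;

          public class HllHashingSketch {
            public static void main(String[] args) {
              // 2^13 registers (log2m=13), 6 bits per register (regwidth=6)
              HLL hll = new HLL(13, 6);
              HashFunction murmur = Hashing.murmur3_128();
              for (long value = 0; value < 1000000; value++) {
                // HLL assumes uniformly distributed input, so each raw value
                // is hashed into a long before being added
                hll.addRaw(murmur.hashLong(value).asLong());
              }
              System.out.println("estimated cardinality: " + hll.cardinality());
            }
          }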

          Hoss Man added a comment -

          Updated patch with basic support for all field types (that are already supported by the stats component) and some more tests. Also adds a SolrJ response object getter method.

          Hoss Man added a comment -

          Updated patch with more tests.

          My current TODO list...

           - 6 < regwidth makes no sense? 
             - even at (min) log2m==4, isn't regwidth==6 big enough for all possible (hashed) long values?
          
           - prehashed support
             - need to sanity/error check that the field is a long
             - add an update processor to make this easy to do at index time
           - tuning knobs
             - memory vs accuracy (log2m)
               - idea: (least ram) 0 < accuracy < 1 (most accurate)
                 - scale 
             - max cardinality estimatable (regwidth)
               - perhaps hardcode regwidth==6 ? expert only option to adjust?
               - pick regwidth based on field type? (int/enum have fewer in general)
               - pick regwidth based on index stats? max out based on total terms in field?
                 - or for single valued fields: max out based on numDocs
                 - HLL must use same hash seed, but does it support union when log2m and regwidth are diff?
           - convenience equivalence with countDistinct in SolrJ response obj?
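
          Re: the union question above – a fragment sketching the merge step, assuming (as I believe java-hll requires for union) identical log2m/regwidth and the same hash seed on both sides (values illustrative):

           HLL shardA = new HLL(13, 6);
           HLL shardB = new HLL(13, 6);
           // ... each shard adds its own hashed long values via addRaw() ...
           shardA.union(shardB);               // folds B's registers into A
           long merged = shardA.cardinality(); // estimate over both shards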
          
          Hoss Man added a comment -

          Updated patch now includes an "HllOptions" class w/ tests for parsing various knobs for tuning...

          • cardinality=true and cardinality=false are still supported for basic defaults
          • can also specify heuristic-based cardinality=N, where N is a number between 0.0 and 1.0 inclusive indicating how much accuracy you care about
            • 0 == minimum accuracy, conserve as much RAM as possible
            • 1.0 == maximum accuracy, spend as much RAM as possible
            • cardinality=true is roughly the same as cardinality=0.33
          • additional advanced local params for overriding the heuristic based on knowledge of HLL:
            • hllLog2m=N (raw int passed to the HLL API)
            • hllRegwidth=N (raw int passed to the HLL API)
            • the "hll" param prefix was chosen based on implementation details, similar to how percentiles supports tdigestCompression
              • if/when we change the implementation details of how we compute cardinality, these can be ignored and new tuning options can be introduced
          • hllPreHashed=BOOL
            • only works with Long-based fields (by design)
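
          For example (param names as above; field names illustrative):

          /select?q=*:*&stats=true&stats.field={!cardinality=0.9}fieldname_s
          /select?q=*:*&stats=true&stats.field={!cardinality=true hllLog2m=13 hllRegwidth=6}fieldname_l
          /select?q=*:*&stats=true&stats.field={!cardinality=true hllPreHashed=true}fieldname_hash_l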
          Hoss Man added a comment -

          more updates...

          • Added heuristics for regwidth
            • slightly reduced the default for stats over value sets with 32-bit max cardinality (5 vs. 6)
            • scaled the default further based on the cardinality=N float option
              • less aggressive scaling on the low end than we have with log2m: cardinality=0.0 gives "regwidth default - 1"
              • we don't want the heuristic to ever generate single-bit registers (regwidth==1); instead we lean on the log2m heuristic to be where most of the RAM vs. accuracy tuning happens when a float option is used
          • license files and misc precommit cleanup

          ...I'm working on more involved randomized & distributed tests next.
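
          Purely to illustrate the shape of the heuristic described above – these helper names and the exact scaling are hypothetical, not the patch's actual code:

          // hypothetical: linearly scale log2m across its usable range
          static int pickLog2m(double accuracy, int minLog2m, int maxLog2m) {
            return minLog2m + (int) Math.round(accuracy * (maxLog2m - minLog2m));
          }

          // hypothetical: only back off regwidth at the very bottom, so the
          // heuristic never produces single-bit registers
          static int pickRegwidth(double accuracy, int defaultRegwidth) {
            return (accuracy == 0.0) ? defaultRegwidth - 1 : defaultRegwidth;
          }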

          Hoss Man added a comment -

          Updated patch now includes a TestDistributedStatsComponentCardinality test, which uses randomized docs & tuning options with asserts about relative error and consistency with prehashed values.

          I think this is pretty much good to go – we can open a distinct issue for adding an UpdateProcessor to support index-time murmur3 hashing.

          ASF subversion and git services added a comment -

          Commit 1678245 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1678245 ]

          SOLR-6968: New 'cardinality' option for stats.field, uses HyperLogLog to efficiently estimate the cardinality of a field w/bounded RAM

          ASF subversion and git services added a comment -

          Commit 1678254 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1678254 ]

          SOLR-6968: New 'cardinality' option for stats.field, uses HyperLogLog to efficiently estimate the cardinality of a field w/bounded RAM (merge r1678245)

          Hoss Man added a comment -

          Doing some perf testing, I found that the SPARSE representation of HLL can cause some heinous response times for large sets – a minor part of the issue seems to be the slower insertion rate compared to the FULL representation (which is documented), but a much bigger factor is that merging multiple (large) SPARSE HLLs is almost 10x slower than merging FULL HLLs of the same size.

          It might be worth adding tuning options and/or heuristics to control if/when the SPARSE representation should be used (in cases where folks have smaller sets and care more about memory than speed), but for now I'm just going to disable it.
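
          If we do later want a knob for this, java-hll already exposes one – a fragment based on my reading of its longer constructor (the expthresh=-1 "auto" value is an assumption):

          import net.agkn.hll.HLL;
          import net.agkn.hll.HLLType;

          // expthresh=-1 asks the library to pick the EXPLICIT promotion
          // threshold automatically; sparseon=false skips the SPARSE
          // representation entirely, so sets promote straight to FULL
          HLL noSparse = new HLL(13, 6, -1 /* expthresh */, false /* sparseon */, HLLType.EMPTY);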

          Hoss Man added a comment -

          Patch that worked well in my perf tests to disable SPARSE (and optimize some RAM usage when merging EMPTY) ... will commit once "ant precommit" and the tests finish.

          ASF subversion and git services added a comment -

          Commit 1679241 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1679241 ]

          SOLR-6968: perf tweak: eliminate use of SPARSE storage option since it has some pathologically bad behavior for some set sizes (particularly when merging shard responses)

          ASF subversion and git services added a comment -

          Commit 1679250 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1679250 ]

          SOLR-6968: perf tweak: eliminate use of SPARSE storage option since it has some pathologically bad behavior for some set sizes (particularly when merging shard responses) (merge r1679241)

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee: Hoss Man
            • Reporter: Hoss Man
            • Votes: 0
            • Watchers: 8
