SOLR-7461

StatsComponent, calcdistinct, ability to disable distinctValues while keeping countDistinct

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:

      Description

      When using calcdistinct with large amounts of data, the distinctValues field can be extremely large. In cases where only countDistinct is required, it would be helpful if the server did not return distinctValues in the response.

      I'm no expert, but here are some ideas for how the syntax could look.

      # Both countDistinct and distinctValues are returned, along with all other stats
      stats.calcdistinct=true&stats.field=myfield
      
      # Only countDistinct and distinctValues are returned
      stats.calcdistinct=true&stats.field={!countDistinct=true distinctValues=true}myfield
      
      # Only countDistinct is returned
      stats.calcdistinct=true&stats.field={!countDistinct=true}myfield
      
      # Only distinctValues is returned
      stats.calcdistinct=true&stats.field={!distinctValues=true}myfield
      

          Activity

          Hoss Man added a comment -

          As noted in SOLR-6349...

          i think the best approach would be to leave "calcDistinct" alone as it is now but deprecate/discourage it and move towards adding an entirely new stats option for computing an approximated count using hyperloglog (i opened a new issue for this: SOLR-6968)

          ...the problem is that the "exact" count returned by calcDistinct today requires that all distinctValues be aggregated (from all shards in a distrib setup) and dumped into a giant Set in memory. returning the distinctValues may seem cumbersome to clients, but not returning them would just mask how painful this feature is on the server side, and the biggest problems with it (notably server OOMs) wouldn't go away, they'd just be harder to understand.

          so i'm generally opposed to adding more flags to hide what is, in my opinion, a broken "feature" and instead aim to move on and implement a better version of it (hopefully within the next week or so)
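          To make the memory cost described above concrete, here is a minimal Python sketch (not Solr's actual Java code) of why an exact distinct count in a distributed setup needs every shard's distinct values. The shard contents and the `merge_distinct` helper are made up for illustration.

```python
from typing import Hashable, Iterable


def merge_distinct(shard_values: Iterable[Iterable[Hashable]]) -> set:
    """Union the per-shard distinct values into one in-memory set."""
    merged = set()
    for values in shard_values:
        merged.update(values)
    return merged


# Hypothetical per-shard distinct values for a single field.
shard_a = {"red", "green", "blue"}
shard_b = {"blue", "yellow"}
shard_c = {"green", "black", "white"}

# The exact count is only known once every value sits in one set.
all_values = merge_distinct([shard_a, shard_b, shard_c])
count_distinct = len(all_values)

# Adding up per-shard counts double counts values seen on several shards,
# which is why the shards cannot simply report a number.
naive_sum = len(shard_a) + len(shard_b) + len(shard_c)

print(count_distinct, naive_sum)
```

          The merged set grows with the field's cardinality regardless of whether the values are later sent to the client, so hiding distinctValues from the response would not by itself reduce server memory use.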

          James Andres added a comment -

          Thanks for replying so quickly Hoss Man.

          What you say makes total sense. Calcdistinct is not very friendly to memory or CPU on our Solr cluster, and we are aware of why!

          Unfortunately, for me, being able to get exact (or very nearly exact) distinct counts is important for the application I'm building (it's an analytics / stats tool so correct numbers are a must have feature). The trouble is I rarely need to know what the unique values actually are.

          The specific itch I'm hoping to scratch is to avoid the 10MB to 30MB of distinctValues coming over the wire to the app servers which my Python app needs to parse then immediately discard. Without distinctValues my JSON responses drop to a few kilobytes in size.

          Regarding the hyperloglog approach:

          • This is awesome. I can definitely use this
          • Can you speculate on how the precision will be controlled? Will it be in the query or in solrconfig.xml or both?
          • Can I crank the precision way up such that when dealing with hundreds of millions of unique items it will be nearly exact counts (eg: off by < 100)?
          Hoss Man added a comment -

          James: sorry, i overlooked your comments until now...

          Regarding the hyperloglog approach:

          please raise HLL related discussions in the comments of SOLR-6968 - been making lots of progress there, would love your opinions & help testing, but i don't want to have the conversation fractured in two diff places for other people watching that issue.


          Unfortunately, for me, being able to get exact (or very nearly exact) distinct counts is important for the application I'm building (it's an analytics / stats tool so correct numbers are a must have feature)

          I understand ... the reason i popped back up in this issue today was to comment that:

          1) as I work more with HLL and really think about the internals, i realized that for people who care a lot about absolutely accurate precision, especially in low cardinality situations, i think you are correct - fixing this issue (countdistinct w/o distinctValues) would be a big win - even once HLL support is added. The simple reason being that HLL is fundamentally based on counting unique hashed values, so even if someone tries to tune it to be as accurate as possible, there's still the risk that two distinct values in a user's set will have the same hash value and prevent 100% accurate cardinality results.
          2) as i worked on refactoring the tests in SOLR-6968 it occurred to me that what you're requesting here would actually be fairly easy to implement and support in a backcompat way – in particular because of the changes already made in SOLR-6349 (distinguishing per-shard dependent stats) and the fact that the current param is "calcdistinct" but the stats currently returned are "countDistinct" and "distinctValues"
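          The hash collision risk from point 1 can be shown with a toy Python sketch. This is not HyperLogLog and not Solr code: it only counts unique hash values, and the `tiny_hash` helper truncates the hash to 8 bits (an assumption chosen purely to force collisions). Real implementations use far wider hashes, so collisions are much rarer, but the principle is the same.

```python
import hashlib


def tiny_hash(value: str, bits: int) -> int:
    """Hash a string and keep only the lowest `bits` bits (toy hash space)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


# 1000 distinct made-up values.
values = [f"user-{i}" for i in range(1000)]

# Exact count: compare the values themselves.
exact = len(set(values))

# Hash-based count: distinct values that share a hash collapse into one.
hashed = len({tiny_hash(v, bits=8) for v in values})

print(exact, hashed)
```

          With only 256 possible hash values, the hash-based count can never reach 1000, whereas a count over the values themselves is always exact.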

          In a nutshell, what i think we should do is...

          • replace calcdistinct(true) value in the Stat enum with distinctvals(true) and countdistinct(false,distinctvals) (so countdistinct indicates that in a distributed setup it requires distinctvals to be returned by each shard)
          • remove all the calcDistinct logic from StatsValuesFactory and instead treat countdistinct and distinctvals as independently requested stats that have a dependent distributed computation relationship (just like "mean" vs "sum / count")
          • update the existing special syntactic sugar parsing for the top level "calcdistinct" param to also include syntactic sugar parsing for the calcdistinct local param - in both cases they become sugar for countdistinct=true distinctvals=true

          the end result being that users who want the countDistinct response info, w/o the list of all distinctVals would just specify countdistinct=true as a localparam (no other new top level params needed)
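          The proposal above can be sketched as a small Python model. It is not the Java Stat enum or StatsValuesFactory; the dictionaries and function names are hypothetical and only mirror the relationships described: countdistinct needs distinctvals from each shard, and calcdistinct is sugar for both.

```python
# Stats that every shard must compute for a given requested stat.
SHARD_DEPENDENCIES = {
    "distinctvals": {"distinctvals"},
    "countdistinct": {"distinctvals"},
}

# "calcdistinct" is treated as sugar for both individual stats.
SUGAR = {"calcdistinct": {"countdistinct", "distinctvals"}}


def expand(requested):
    """Expand sugar names into the stats reported to the client."""
    result = set()
    for name in requested:
        result |= SUGAR.get(name, {name})
    return result


def shard_stats(requested):
    """Stats each shard must return so the coordinator can answer."""
    needed = set()
    for name in expand(requested):
        needed |= SHARD_DEPENDENCIES[name]
    return needed


def final_response(requested, merged_values):
    """Build the client-facing stats from the merged shard values."""
    wanted = expand(requested)
    out = {}
    if "countdistinct" in wanted:
        out["countDistinct"] = len(merged_values)
    if "distinctvals" in wanted:
        out["distinctValues"] = sorted(merged_values)
    return out


print(shard_stats(["countdistinct"]))
print(final_response(["countdistinct"], {"a", "b"}))
```

          In this model a countdistinct-only request still pulls distinctvals from every shard, but the final response carries just the count.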

          How does that sound?

          these changes should already be fairly trivial to make if you want to take a stab at a patch - but there may be some patch collision in the code with the work i'm doing in SOLR-6968. There will definitely be patch collisions in the existing test cases, but once SOLR-6968 is committed the test changes should actually be much easier ... and i'm happy to try and tackle this as soon as i'm done with that one (if nothing else, it will help me do a better apples to apples benchmark of HLL vs countDistinct w/o the added distinctVals in the final result)

          Hoss Man added a comment -

          here's a patch including what i was describing before.

          James: would be great if you could take this for a spin and let me know what you think.

          ASF subversion and git services added a comment -

          Commit 1678422 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1678422 ]

          SOLR-7461: stats.field now supports individual local params for 'countDistinct' and 'distinctValues', 'calcdistinct' is still supported as an alias for both options
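          As a rough illustration of the committed behaviour, a Python client could build the request parameters as below. The local param names come from the commit message; the `stats_params` helper, the `myfield` field name, and the match-all query are assumptions for the example.

```python
from urllib.parse import urlencode, parse_qs


def stats_params(field, count_distinct=False, distinct_values=False):
    """Build Solr stats request params with per-field local params."""
    local = []
    if count_distinct:
        local.append("countDistinct=true")
    if distinct_values:
        local.append("distinctValues=true")
    # Only emit the {!...} prefix when at least one local param is set.
    prefix = "{!" + " ".join(local) + "}" if local else ""
    return {
        "q": "*:*",
        "rows": "0",
        "wt": "json",
        "stats": "true",
        "stats.field": prefix + field,
    }


# Ask for the distinct count only, without the list of values.
params = stats_params("myfield", count_distinct=True)
query_string = urlencode(params)

print(params["stats.field"])
print(query_string)
```

          The encoded query string can be appended to a collection's select handler URL. Requesting both stats this way should be equivalent to the older calcdistinct option, which the commit keeps as an alias.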

          ASF subversion and git services added a comment -

          Commit 1678449 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1678449 ]

          SOLR-7461: stats.field now supports individual local params for 'countDistinct' and 'distinctValues', 'calcdistinct' is still supported as an alias for both options (merge r1678422)

          Hoss Man added a comment -

          NOTE: final commit differed slightly from the last patch i posted: I cleaned up some exception logging in the distrib test, and simplified how/where statsToCalculate was getting updated.

          (neither of which is directly related to this issue, but i noticed them while reviewing the patch again)

          James Andres added a comment -

          Amazing Hoss Man! Thank you! I'll take it for a spin this week.

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee:
              Hoss Man
              Reporter:
              James Andres
            • Votes:
              0
              Watchers:
              3