Solr / SOLR-680

StatsComponent - get min, max, sum, qt, avg of number fields

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: search
    • Labels:
      None

      Description

      StatsComponent - it returns min,max,sum,qt,avg of specified number fields:

      request parameters:
      &stats=on&stats.field=price

      <stats>
        <stats_fields>
          <lst name="price">
            <double name="min">10</double>
            <double name="max">30</double>
            <double name="avg">20</double>
            <double name="sum">60</double>
            <double name="qt">3</double>
          </lst>
        </stats_fields>
      </stats>
      

      WRT "stats", the component can output sum and avg, but not sd (standard deviation) or var (variance).

      USE CASE:
      StatsComponent can be used to get the "market price" of a DocSet, e.g. on a rental housing site or a package tour site.

      1. SOLR-680.patch
        15 kB
        Harish Agarwal
      2. SOLR-680.patch
        34 kB
        Ryan McKinley
      3. SOLR-680.patch
        28 kB
        Ryan McKinley
      4. SOLR-680.patch
        26 kB
        Ryan McKinley
      5. SOLR-680.patch
        22 kB
        Ryan McKinley
      6. SOLR-680.patch
        16 kB
        Ryan McKinley
      7. SOLR-680.patch
        12 kB
        Koji Sekiguchi
      8. SOLR-680-remove-bad-median-calculation.patch
        6 kB
        Ryan McKinley

          Activity

          Koji Sekiguchi added a comment -

          First draft; needs more tests.

          Lars Kotthoff added a comment -

          Looks good. Some initial comments:

          • Do we need to specify stats=on explicitly when it's a separate request handler?
          • "qt" should probably be renamed to something like "samples" or "quantity" as there's already a "qt" (query type) parameter.
          • What's supposed to happen when this is called on a non-numerical field? Error message in response XML or exception?
          Ryan McKinley added a comment -

          Koji – this looks great!

          I just updated the patch so it:

          • handles null values better (skips them, but counts them)
          • adds "missing"=number of null values for the field to the stats
          • throws an error if you try stats on a tokenized or multivalue field
          • registers the StatsComponent by default
          • optionally calculates median value and standard deviation (requires a second pass through the field cache)
            NOTE: this will break in a distributed context. I'm not sure there is a fix for that. We could return a weighted average? Perhaps a better result would be to return the raw values for each shard?
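
As a rough illustration of the null handling described in the list above (not the patch's actual code), a single-pass accumulator could skip null values while still counting them as "missing". The class name `FieldStats` and its members are hypothetical, and returning 0.0 for the mean of an empty set is an arbitrary choice made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical single-pass accumulator; names are illustrative, not from the patch. */
final class FieldStats {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    long count = 0;    // documents that have a value
    long missing = 0;  // documents with no value for the field

    /** Adds one document's value; null means the document has no value. */
    void accumulate(Double value) {
        if (value == null) {
            missing++;   // skipped for min/max/sum, but still counted
            return;
        }
        min = Math.min(min, value);
        max = Math.max(max, value);
        sum += value;
        count++;
    }

    /** Mean of the non-null values; 0.0 when there are none (sketch convention). */
    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        List<Double> prices = Arrays.asList(10.0, null, 30.0, 20.0, null);
        FieldStats stats = new FieldStats();
        for (Double p : prices) {
            stats.accumulate(p);
        }
        System.out.println("min=" + stats.min + " max=" + stats.max
                + " sum=" + stats.sum + " count=" + stats.count
                + " missing=" + stats.missing + " mean=" + stats.mean());
    }
}
```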

          I'll upload this now, and start working on solrj integration with tests...

          Ryan McKinley added a comment - - edited

          updated patch:

          with the sample data:
          http://localhost:8983/solr/select?q=*:*&stats=true&stats.field=price&stats.field=popularity&stats.stddev=true&rows=0
          returns

          <lst name="stats">
            <lst name="stats_fields">
              <lst name="price">
                <double name="min">0.0</double>
                <double name="max">2199.0</double>
                <double name="sum">5251.2699999999995</double>
                <long name="count">15</long>
                <long name="missing">11</long>
                <double name="mean">350.08466666666664</double>
                <double name="median">399.0</double>
                <double name="stddev">547.7375579061129</double>
              </lst>
              <lst name="popularity">
                <double name="min">0.0</double>
                <double name="max">10.0</double>
                <double name="sum">90.0</double>
                <long name="count">26</long>
                <long name="missing">0</long>
                <double name="mean">3.4615384615384617</double>
                <double name="median">7.0</double>
                <double name="stddev">3.557873176275616</double>
              </lst>
            </lst>
          </lst>
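
A quick consistency check on the sample output above, as a sketch: the mean should equal sum / count, and count plus missing for price (15 + 11) should match the 26 documents that have a popularity value. The class name is hypothetical; the numbers are copied from the response.

```java
/** Hypothetical helper that re-derives values from the sample response. */
final class StatsOutputCheck {
    /** Mean as the response appears to define it: sum of values over count of values. */
    static double mean(double sum, long count) {
        return sum / count;
    }

    public static void main(String[] args) {
        System.out.println("price mean      = " + mean(5251.2699999999995, 15));
        System.out.println("popularity mean = " + mean(90.0, 26));
        // count + missing for price should equal the number of matched documents
        System.out.println("price docs      = " + (15 + 11));
    }
}
```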
          

          Changes:

          • changed "qt" to "count"
          • changed "avg" to "mean" (so it sits nicely with median)
          • added stats support to solrj
          • added test to example jetty/embedded runners

          now it just needs a little documentation, then I think it is ready to go...

          Ryan McKinley added a comment -

          sorry, patch was missing a file

          Sean Timm added a comment -

          Ryan--

          If you want to get the standard deviation without a second pass, I think you can do it by additionally keeping the running sum of squares of the values. Then:

              /**
               * Returns the sample standard deviation of all previously counted
               * values, computed from the running count, sum, and sumOfSquares
               * (assumed here to be double fields of the enclosing class).
               */
              public double standardDeviation()
              {
                  if( count <= 1.0D ) return 0.0D;
                  return Math.sqrt( ( ( count * sumOfSquares ) - ( sum * sum ) )
                                    / ( count * ( count - 1.0D ) ) );
              }
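
To show the method above in context, here is a minimal self-contained sketch of a one-pass accumulator that maintains the count, sum, and sum of squares it relies on. The class name `RunningStats` and the `add` method are assumptions made for illustration; only the formula comes from the comment.

```java
/** Sketch of a one-pass accumulator; the class name is hypothetical. */
final class RunningStats {
    private long count = 0;
    private double sum = 0.0;
    private double sumOfSquares = 0.0;

    /** Folds one value into the running totals. */
    void add(double value) {
        count++;
        sum += value;
        sumOfSquares += value * value;
    }

    /** Sample standard deviation (n - 1 denominator), using the formula above. */
    double standardDeviation() {
        if (count <= 1) {
            return 0.0;
        }
        double n = count;
        return Math.sqrt((n * sumOfSquares - sum * sum) / (n * (n - 1.0)));
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double v : new double[] {10.0, 20.0, 30.0}) {
            stats.add(v);
        }
        System.out.println(stats.standardDeviation());
    }
}
```

One caveat with this formula: when the values are large relative to their spread, subtracting two nearly equal quantities can lose precision, so a production version may want to guard against a slightly negative numerator.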
          
          Ryan McKinley added a comment -

          good catch – that would even work in distributed mode!

          However, the median still requires a second pass (unless you assume there are no null values).
          If we do a second pass, we could also calculate Q1 and Q3 (the first and third quartiles) – JFreeChart has a nice program to graph that

          Ryan McKinley added a comment -

          updated:

          • calculate stddev in first pass – and works distributed (thanks Sean!)
          • throws a full error when asking for a bad field – this seems better than catching it and adding it to the response.
          • changed param "stddev" to "twopass" – now it is a flag to calculate things that require a 2nd pass through the data. Currently only "median"
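
The reason a first-pass stddev can work in distributed mode is that count, sum, and sum of squares are all additive, so per-shard totals can simply be summed before the final formula is applied. The sketch below illustrates that idea only; `ShardStats` and `merge` are hypothetical names, not the patch's API.

```java
/** Hypothetical per-shard accumulator whose totals can be merged by addition. */
final class ShardStats {
    private long count = 0;
    private double sum = 0.0;
    private double sumOfSquares = 0.0;

    void add(double value) {
        count++;
        sum += value;
        sumOfSquares += value * value;
    }

    /** Combines another shard's totals into this one; all three are additive. */
    void merge(ShardStats other) {
        count += other.count;
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
    }

    /** Sample standard deviation over everything accumulated or merged so far. */
    double standardDeviation() {
        if (count <= 1) {
            return 0.0;
        }
        double n = count;
        return Math.sqrt((n * sumOfSquares - sum * sum) / (n * (n - 1.0)));
    }

    public static void main(String[] args) {
        ShardStats shardA = new ShardStats();
        shardA.add(10.0);
        shardA.add(20.0);

        ShardStats shardB = new ShardStats();
        shardB.add(30.0);

        // Merging the shards gives the same result as one pass over 10, 20, 30.
        shardA.merge(shardB);
        System.out.println(shardA.standardDeviation());
    }
}
```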

          I'd like to commit this soon...

          Koji Sekiguchi added a comment - - edited

          Lars, Ryan and Sean – thank you for your comments and contribution on this!
          And thanks again Ryan for the Wiki document http://wiki.apache.org/solr/StatsComponent

          Just after I opened this ticket, I was wondering whether I could implement arbitrary functions other than sum(), avg(), ..., as Yonik mentioned in this thread: http://www.nabble.com/Sum-of-one-field-td18815666.html#a18854371, but I soon ran out of time to think about it and have been away from this issue. I'd like to look at your updated patch when I am available, hopefully soon.

          Ryan McKinley added a comment -

          Updating patch to handle faceting statistics. (with tests)

          • I have not tested this in distributed environment, but it should work
          Ryan McKinley added a comment -

          I will go ahead and commit this now... we can sort out the default search components issue in SOLR-817.

          Ryan McKinley added a comment -

          the "median" calculation is incorrect. As is, it assumes the DocSet has documents in order.

          Also, the "median" is the only "twopass" operation and inherently could not work (easily) with distributed search.
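
To make the ordering problem concrete: a median is only meaningful over values sorted by value, while documents in a set come back in document order. The removed code is not shown in this issue, so the sketch below assumes the bug amounts to taking the middle element without sorting. All names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical illustration of why a median needs value-sorted input. */
final class MedianSketch {
    /** Median of the values; sorts a copy first so input order does not matter. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** The assumed mistake: the middle element in whatever order the values arrive. */
    static double middleElement(double[] values) {
        return values[values.length / 2];
    }

    public static void main(String[] args) {
        // Values in document order, not value order.
        double[] prices = {30.0, 10.0, 50.0, 20.0, 40.0};
        System.out.println("sorted median  = " + median(prices));
        System.out.println("middle element = " + middleElement(prices));
    }
}
```

Unlike count, sum, and sum of squares, medians are not additive: the median of per-shard medians is not, in general, the median of the combined values, which is why this statistic is hard to distribute.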

          Since 'median' is only marginally useful, I think we should take it out.

          Ryan McKinley added a comment -

          I removed the median calculation.

          Down the line it might make sense to add it back – but rather than leave in an unreleased broken feature, it seems best to remove it.

          Harish Agarwal added a comment -

          This patch extends StatsComponent to multivalued fields by adding a method to the UnInvertedField which populates a StatsValues instance as it iterates across attribute values.

          David Smiley added a comment -

          Harish, given that this issue is closed, it seems your patch has fallen off the radar. I recommend convincing the committers to re-open this issue OR you should create a new issue.

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


            People

            • Assignee: Ryan McKinley
            • Reporter: Koji Sekiguchi
            • Votes: 1
            • Watchers: 3