  1. Solr
  2. SOLR-14006 Differences in StatsComponent and JSON facet aggregations
  3. SOLR-11725

json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula


Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0
    • Component/s: Facet Module
    • Labels: None

    Description

      While working on some equivalence tests/demonstrations for facet.pivot+stats.field vs json.facet I noticed that the stddev calculations done between the two code paths can be measurably different, and realized this is due to them using very different code...

      • json.facet=foo:stddev(foo)
        • StddevAgg.java
        • Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))
      • stats.field={!stddev=true}foo
        • StatsValuesFactory.java
        • Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)))

      Since I'm not really a math guy, I consulted a bunch of smart math/stat nerds I know online to help me sanity check whether these equations (somehow) reduce to each other (in which case the discrepancies I was seeing in my results might have just been due to the order of intermediate operations and floating point rounding differences).

      They confirmed that the two bits of code are not equivalent to each other, and explained that the code JSON Faceting uses is equivalent to the "Uncorrected sample stddev" formula, while StatsComponent's code is equivalent to the "Corrected sample stddev" formula...

      https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
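
      For reference, the two quoted expressions can be written over the same accumulators. This is only a notational sketch: the symbols N (count), S_1 (sum) and S_2 (sum of squares) are introduced here for illustration and do not appear in either source file.

```latex
% Uncorrected sample stddev (the expression quoted for StddevAgg.java)
\[
  s_N = \sqrt{\frac{S_2}{N} - \left(\frac{S_1}{N}\right)^2}
      = \sqrt{\frac{N S_2 - S_1^2}{N^2}}
\]
% Corrected sample stddev (the expression quoted for StatsValuesFactory.java)
\[
  s = \sqrt{\frac{N S_2 - S_1^2}{N (N - 1)}}
\]
% The two differ only by the factor N / (N - 1) under the square root
\[
  s^2 = \frac{N}{N - 1}\, s_N^2
\]
```

      The factor N / (N - 1) approaches 1 as N grows, and is undefined for N = 1.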

      When I told them that stuff like this is why no one likes mathematicians and pressed them to explain which one was the "most canonical" (or "most generally applicable" or "best") definition of stddev, I was told that:

      1. This is something statisticians frequently disagree on
      2. Practically speaking, the two calculations don't tend to differ significantly when count is "very large"
      3. "Corrected sample stddev" is more appropriate when comparing two distributions
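
      Point 2 can be checked numerically. The following is a minimal, standalone sketch, not code from Solr: the class name `StddevComparison`, the helper `accumulate`, and the sample values are all hypothetical, and only the two `Math.sqrt(...)` expressions are taken from the snippets quoted above.

```java
class StddevComparison {

    // "Uncorrected sample stddev": the expression quoted for StddevAgg.java.
    static double uncorrected(double sum, double sumSq, long count) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    // "Corrected sample stddev": the expression quoted for StatsValuesFactory.java.
    static double corrected(double sum, double sumOfSquares, long count) {
        return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
    }

    // Hypothetical helper: returns {sum, sumSq, count} for the given values,
    // repeated `repeats` times to simulate a larger set with the same spread.
    static double[] accumulate(double[] values, int repeats) {
        double sum = 0;
        double sumSq = 0;
        long count = 0;
        for (int r = 0; r < repeats; r++) {
            for (double v : values) {
                sum += v;
                sumSq += v * v;
                count++;
            }
        }
        return new double[] {sum, sumSq, count};
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9}; // illustrative data only
        for (int repeats : new int[] {1, 1000}) {
            double[] acc = accumulate(values, repeats);
            long count = (long) acc[2];
            System.out.printf("count=%d uncorrected=%.6f corrected=%.6f%n",
                count,
                uncorrected(acc[0], acc[1], count),
                corrected(acc[0], acc[1], count));
        }
    }
}
```

      With these sample values the two results are visibly apart for 8 values (about 2.0 vs 2.138) and nearly identical for 8000 values, which is the situation the small facet-bucket argument below is concerned with.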

      Given that:

      • the primary usage of computing the stddev of a field/function against a Solr result set (or against a sub-set of results defined by a facet constraint) is probably to compare that distribution to a different Solr result set (or to compare N sub-sets of results defined by N facet constraints)
      • the size of the sets of documents (values) can be relatively small when computing stats over facet constraint sub-sets

      ...it seems like StddevAgg.java should be updated to use the "Corrected sample stddev" equation.

      Attachments

        1. SOLR-11725.patch
          17 kB
          Munendra S N
        2. SOLR-11725.patch
          13 kB
          Munendra S N
        3. SOLR-11725.patch
          1 kB
          Jason Gerlowski

          People

            Assignee: munendrasn Munendra S N
            Reporter: hossman Chris M. Hostetter