Hive
  1. Hive
  2. HIVE-165

Add standard statistical functions

    Details

    • Type: Wish Wish
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None

      Description

      The last step in the unholy triumvirate of statistical built-ins is the variance. We already have the n (count) and the mean (avg). I currently have a job or two that filters all of the data into a single reducer which just computes mean/n/variance and writes it to a table...so my guess is that this would be a pretty big speed increase. Not a huge deal though, as computing the variance myself is trivial.

      (Average, variance, and n can be co-computed in one pass, so if you're doing var() you can basically have avg() and count() for free.)

        Issue Links

          Activity

          Hide
          Adam Kramer added a comment -

          VAR and SD were added months ago.

          Show
          Adam Kramer added a comment - VAR and SD were added months ago.
          Hide
          Jeff Hammerbacher added a comment -

          Adding to "Query Processor" component.

          Show
          Jeff Hammerbacher added a comment - Adding to "Query Processor" component.
          Hide
          Zheng Shao added a comment -

          Thanks for taking care of this David. UDAFAverage.java can be used as an example.

          Show
          Zheng Shao added a comment - Thanks for taking care of this David. UDAFAverage.java can be used as an example.
          Hide
          Adam Kramer added a comment -

          I agree, and have been annoyed by the inconsistency between POP and SAMP versions of var().

          I have used a similar workaround, but since it (currently) takes two mapreduce steps, it's faster to write my own single-reducer script. If I was using huge data sets, the above would be faster, though...but I do worry a bit that for really huge data sets, SUM(x*x) might overflow.

          Show
          Adam Kramer added a comment - I agree, and have been annoyed by the inconsistency between POP and SAMP versions of var(). I have used a similar workaround, but since it (currently) takes two mapreduce steps, it's faster to write my own single-reducer script. If I was using huge data sets, the above would be faster, though...but I do worry a bit that for really huge data sets, SUM(x*x) might overflow.
          Hide
          David Phillips added a comment - - edited

          Rather than add a non-standard VAR function, we should add these ANSI SQL 2003 functions:

          COVAR_POP
          COVAR_SAMP
          STDDEV_POP
          STDDEV_SAMP
          VAR_POP
          VAR_SAMP

          I recommend against adding the non-standard STDDEV and VARIANCE functions as their behavior differs between MySQL (population) and Oracle/PostgreSQL (sample).

          For the time being, you can use this workaround:

           
          VAR_POP = (SUM(x * x) - (SUM(x) * AVG(x))) / COUNT(x)
          VAR_SAMP = (SUM(x * x) - (SUM(x) * AVG(x))) / (COUNT(x) - 1)
          
          Show
          David Phillips added a comment - - edited Rather than add a non-standard VAR function, we should add these ANSI SQL 2003 functions: COVAR_POP COVAR_SAMP STDDEV_POP STDDEV_SAMP VAR_POP VAR_SAMP I recommend against adding the non-standard STDDEV and VARIANCE functions as their behavior differs between MySQL (population) and Oracle/PostgreSQL (sample). For the time being, you can use this workaround: VAR_POP = (SUM(x * x) - (SUM(x) * AVG(x))) / COUNT(x) VAR_SAMP = (SUM(x * x) - (SUM(x) * AVG(x))) / (COUNT(x) - 1)

            People

            • Assignee:
              David Phillips
              Reporter:
              Adam Kramer
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development