Hive
  1. Hive
  2. HIVE-165

Add standard statistical functions

    Details

    • Type: Wish Wish
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None

      Description

      The last step in the unholy triumvirate of statistical built-ins is the variance. We already have the n (count) and the mean (avg). I currently have a job or two that filters all of the data into a single reducer which just computes mean/n/variance and writes it to a table...so my guess is that this would be a pretty big speed increase. Not a huge deal though, as computing the variance myself is trivial.

      (Average, variance, and n can be co-computed in one pass, so if you're doing var() you can basically have avg() and count() for free.)

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          604d 22h 6m 1 Adam Kramer 08/Aug/10 09:50
          Resolved Resolved Reopened Reopened
          1d 8h 32m 1 Carl Steinbach 09/Aug/10 18:23
          Reopened Reopened Resolved Resolved
          12s 1 Carl Steinbach 09/Aug/10 18:23
          Gavin made changes -
          Link This issue depends upon HIVE-194 [ HIVE-194 ]
          Gavin made changes -
          Link This issue depends on HIVE-194 [ HIVE-194 ]
          Carl Steinbach made changes -
          Resolution Duplicate [ 3 ]
          Status Reopened [ 4 ] Resolved [ 5 ]
          Carl Steinbach made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Carl Steinbach made changes -
          Link This issue duplicates HIVE-607 [ HIVE-607 ]
          Adam Kramer made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Adam Kramer added a comment -

          VAR and SD were added months ago.

          Show
          Adam Kramer added a comment - VAR and SD were added months ago.
          David Phillips made changes -
          Link This issue depends on HIVE-194 [ HIVE-194 ]
          David Phillips made changes -
          Summary var(col) built-in to go with avg(col) and count(col) Add standard statistical functions
          Jeff Hammerbacher made changes -
          Component/s Query Processor [ 12312586 ]
          Hide
          Jeff Hammerbacher added a comment -

          Adding to "Query Processor" component.

          Show
          Jeff Hammerbacher added a comment - Adding to "Query Processor" component.
          Hide
          Zheng Shao added a comment -

          Thanks for taking care of this David. UDAFAverage.java can be used as an example.

          Show
          Zheng Shao added a comment - Thanks for taking care of this David. UDAFAverage.java can be used as an example.
          Hide
          Adam Kramer added a comment -

          I agree, and have been annoyed by the inconsistency between POP and SAMP versions of var().

          I have used a similar workaround, but since it (currently) takes two mapreduce steps, it's faster to write my own single-reducer script. If I was using huge data sets, the above would be faster, though...but I do worry a bit that for really huge data sets, SUM(x*x) might overflow.

          Show
          Adam Kramer added a comment - I agree, and have been annoyed by the inconsistency between POP and SAMP versions of var(). I have used a similar workaround, but since it (currently) takes two mapreduce steps, it's faster to write my own single-reducer script. If I was using huge data sets, the above would be faster, though...but I do worry a bit that for really huge data sets, SUM(x*x) might overflow.
          David Phillips made changes -
          Assignee David Phillips [ electrum ]
          Hide
          David Phillips added a comment - - edited

          Rather than add a non-standard VAR function, we should add these ANSI SQL 2003 functions:

          COVAR_POP
          COVAR_SAMP
          STDDEV_POP
          STDDEV_SAMP
          VAR_POP
          VAR_SAMP

          I recommend against adding the non-standard STDDEV and VARIANCE functions as their behavior differs between MySQL (population) and Oracle/PostgreSQL (sample).

          For the time being, you can use this workaround:

           
          VAR_POP = (SUM(x * x) - (SUM(x) * AVG(x))) / COUNT(x)
          VAR_SAMP = (SUM(x * x) - (SUM(x) * AVG(x))) / (COUNT(x) - 1)
          
          Show
          David Phillips added a comment - - edited Rather than add a non-standard VAR function, we should add these ANSI SQL 2003 functions: COVAR_POP COVAR_SAMP STDDEV_POP STDDEV_SAMP VAR_POP VAR_SAMP I recommend against adding the non-standard STDDEV and VARIANCE functions as their behavior differs between MySQL (population) and Oracle/PostgreSQL (sample). For the time being, you can use this workaround: VAR_POP = (SUM(x * x) - (SUM(x) * AVG(x))) / COUNT(x) VAR_SAMP = (SUM(x * x) - (SUM(x) * AVG(x))) / (COUNT(x) - 1)
          Adam Kramer made changes -
          Field Original Value New Value
          Description The last step in the unholy triumvirate of statistical built-ins is the variance...we already have the n (count) and the mean (avg). I currently have one reduce step that just computes mean/n/variance and writes it to a table, so my guess is that this would be a pretty big speed increase. Not a huge deal though, as computing the variance myself is trivial. (Average, variance, and n can be co-computed in one pass) The last step in the unholy triumvirate of statistical built-ins is the variance. We already have the n (count) and the mean (avg). I currently have a job or two that filters all of the data into a single reducer which just computes mean/n/variance and writes it to a table...so my guess is that this would be a pretty big speed increase. Not a huge deal though, as computing the variance myself is trivial.

          (Average, variance, and n can be co-computed in one pass, so if you're doing var() you can basically have avg() and count() for free.)
          Adam Kramer created issue -

            People

            • Assignee:
              David Phillips
              Reporter:
              Adam Kramer
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development