Spark / SPARK-10384

Univariate statistics as UDAFs


Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: ML, SQL
    • Labels: None

    Description

      It would be nice to define univariate statistics as UDAFs. This JIRA discusses the general implementation and tracks the progress of the subtasks. Univariate statistics include:

      continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
      categorical: number of categories, mode
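
As a rough illustration of what the statistics listed above compute, here is a plain-Python sketch using only the standard library. It is not the Spark API. The helper names `describe_continuous` and `describe_categorical` are hypothetical, and the conventions are assumptions: variance and stddev use the sample (n - 1) form, while skewness and kurtosis use population moments, with kurtosis reported as excess kurtosis. The input is assumed to contain at least two distinct values.

```python
import statistics
from collections import Counter


def describe_continuous(xs):
    """Univariate statistics for a list of numbers (hypothetical helper)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    # Population central moments, used here for skewness and kurtosis.
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return {
        "min": min(xs),
        "max": max(xs),
        "range": max(xs) - min(xs),
        "variance": statistics.variance(xs),  # sample variance (n - 1)
        "stddev": statistics.stdev(xs),       # sample standard deviation
        "median": statistics.median(xs),
        "quartiles": statistics.quantiles(xs, n=4),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,       # excess kurtosis (assumed convention)
    }


def describe_categorical(values):
    """Number of distinct categories and the most frequent one (hypothetical helper)."""
    counts = Counter(values)
    mode, _ = counts.most_common(1)[0]
    return {"num_categories": len(counts), "mode": mode}
```

Calling `describe_continuous([1.0, 2.0, 3.0, 4.0, 5.0])` or `describe_categorical(["a", "b", "a", "c"])` returns a dictionary keyed by statistic name. Median and general quantiles need the sorted data, or an approximation of it, which makes them harder to express as a streaming aggregate than the moment-based statistics.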

      If we define them as UDAFs, they would be flexible to use with DataFrames, e.g.,

      df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
      

      Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if Spark SQL could optimize the sequence of aggregates to avoid duplicate computation.
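
To make the dependency between statistics concrete, the sketch below keeps one shared aggregation buffer (count, mean, and sum of squared deviations) from which mean, variance, and stddev are all derived, so nothing is computed twice. It is plain Python rather than Spark's UDAF interface, and the class name `Moments` and its methods are hypothetical. The per-value step is Welford's online update, and the merge step is the pairwise combination formula attributed to Chan et al. Whether Spark's own aggregates use this exact scheme is not stated in this issue.

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Shared buffer: count, running mean, and sum of squared deviations (M2)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x):
        # Welford's online update for one new value.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two partial buffers, e.g. from two partitions.
        if other.count == 0:
            return
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total

    def variance(self):
        # Sample variance; NaN when fewer than two values were seen.
        return self.m2 / (self.count - 1) if self.count > 1 else math.nan

    def stddev(self):
        return math.sqrt(self.variance())


# Two partial aggregates, as if computed on separate partitions, then merged.
left, right = Moments(), Moments()
for x in [1.0, 2.0, 3.0]:
    left.update(x)
for x in [4.0, 5.0]:
    right.update(x)
left.merge(right)
```

After the merge, `left.count`, `left.mean`, `left.variance()`, and `left.stddev()` all read from the same three fields. The update/merge/evaluate split mirrors the general shape of an aggregate function that can run in parallel over partitions.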

      Univariate statistics for continuous variables:

      Univariate statistics for categorical variables:

            People

              Assignee: Xiangrui Meng (mengxr)
              Reporter: Xiangrui Meng (mengxr)