Description
It would be nice to define univariate statistics as UDAFs. This JIRA discusses the general implementation and tracks the progress of the subtasks. Univariate statistics include:
continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode
If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL could optimize the sequence of aggregates to avoid duplicate computation.
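The dependency between statistics can be illustrated with a small sketch in plain Python. It is not Spark's implementation; the class `MomentBuffer` and its `update`/`merge`/`evaluate` methods are hypothetical names that only mirror the general shape of an aggregation buffer. One buffer holding count, mean, and the sum of squared deviations is enough to derive mean, variance, and stddev in a single pass, so none of them needs to rescan the data. Returning `None` when count is 0 or 1 is an assumption made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class MomentBuffer:
    """Hypothetical aggregation buffer shared by mean, variance and stddev."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's online update: one pass, no second scan for the mean.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "MomentBuffer") -> None:
        # Combine two partial buffers (e.g. from two partitions).
        if other.count == 0:
            return
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total

    def evaluate(self) -> dict:
        # All statistics are derived from the same three fields.
        n = self.count
        var_pop = self.m2 / n if n > 0 else None       # assumption: None when undefined
        var_samp = self.m2 / (n - 1) if n > 1 else None
        return {
            "count": n,
            "mean": self.mean if n > 0 else None,
            "var_pop": var_pop,
            "var_samp": var_samp,
            "stddev_pop": math.sqrt(var_pop) if var_pop is not None else None,
            "stddev_samp": math.sqrt(var_samp) if var_samp is not None else None,
        }


def aggregate(values) -> MomentBuffer:
    buf = MomentBuffer()
    for v in values:
        buf.update(v)
    return buf


# Two "partitions" aggregated independently, then merged.
left = aggregate([1.0, 2.0, 3.0, 4.0])
right = aggregate([5.0, 6.0, 7.0, 8.0])
left.merge(right)
stats = left.evaluate()
```

Because every statistic reads from the same buffer, requesting several of them for one column costs a single update per row in this sketch.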
Univariate statistics for continuous variables:
- min
- max
- range (SPARK-10861) - won't add
- mean
- sample variance (SPARK-9296)
- population variance (SPARK-9296)
- sample standard deviation (SPARK-6458)
- population standard deviation (SPARK-6458)
- skewness (SPARK-10641)
- kurtosis (SPARK-10641)
- approximate median (SPARK-6761) -> 1.7.0
- approximate quantiles (SPARK-6761) -> 1.7.0
Univariate statistics for categorical variables:
- mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) -> 1.7.0
- number of categories (This is COUNT DISTINCT in SQL.)
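For the categorical statistics, a frequency table is one possible aggregation buffer. The sketch below is plain Python rather than Spark code, and the function names `update`, `merge`, and `evaluate` are hypothetical. It assumes that nulls (`None`) are skipped and that ties for the mode are broken by taking the smallest value; neither choice comes from this JIRA.

```python
from collections import Counter


def update(buffer: Counter, value) -> None:
    # Assumption: null values do not count as a category.
    if value is not None:
        buffer[value] += 1


def merge(buffer: Counter, other: Counter) -> None:
    # Adding frequency tables combines partial results from two partitions.
    buffer.update(other)


def evaluate(buffer: Counter) -> dict:
    if not buffer:
        return {"mode": None, "num_categories": 0}
    top = max(buffer.values())
    # Assumption: break ties deterministically by picking the smallest value.
    mode = min(v for v, c in buffer.items() if c == top)
    return {"mode": mode, "num_categories": len(buffer)}


part1, part2 = Counter(), Counter()
for v in ["a", "b", "a", None]:
    update(part1, v)
for v in ["c", "a", "b"]:
    update(part2, v)
merge(part1, part2)
result = evaluate(part1)
```

The number of categories here is the size of the frequency table, which matches COUNT DISTINCT over non-null values. The buffer grows with the number of distinct values, so this exact approach may not suit high-cardinality columns.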
Issue Links
- relates to: SPARK-6761 Approximate quantile (Resolved)
1. Univariate statistics as UDAFs: single-pass continuous stats | Closed | Seth Hendrickson
2. Univariate statistics as UDAFs: multi-pass continuous stats | Closed | Unassigned
3. Univariate statistics as UDAFs: categorical stats | Resolved | Unassigned
4. Univariate Statistics: Adding range support as UDAF | Closed | Jihong Ma
5. Univariate Statistics: Adding median & quantile support as UDAF | Resolved | Unassigned
6. Benchmark declarative/codegen vs. imperative code for univariate statistics | Resolved | Jihong Ma
7. skewness and kurtosis support | Resolved | Seth Hendrickson
8. Updating Stddev support with Imperative Aggregate | Resolved | Jihong Ma
9. Handle edge cases when count = 0 or 1 for Stats function | Resolved | Jihong Ma