[SPARK-10385] Bivariate statistics in DataFrames - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: ML, SQL
Labels:
- bulk-closed

Description

Similar to ~~SPARK-10384~~, it would be nice to have bivariate statistics support in DataFrames (defined as UDAFs). This JIRA discusses the general implementation and tracks subtasks. Bivariate statistics include:

- continuous: covariance (~~SPARK-9297~~), Pearson's correlation (~~SPARK-9298~~), and Spearman's correlation (~~SPARK-10645~~)
- categorical: ??

If we define them as UDAFs, they would be flexible to use with DataFrames, e.g.,

df.groupBy("key").agg(corr("x", "y"))
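
To illustrate why these statistics fit the UDAF model, here is a minimal plain-Python sketch of a mergeable aggregation buffer for Pearson's correlation. It is not Spark's implementation and uses no Spark API; the names `CorrBuffer` and `grouped_corr` are hypothetical, and the "partitions" are ordinary lists of `(key, x, y)` tuples. The point is only that correlation can be computed from a small fixed-size state with `update`, `merge`, and a final evaluation step, which is the shape an aggregate function needs.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CorrBuffer:
    """Hypothetical fixed-size aggregation state for one group."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0   # sum of squared deviations of x
    m2_y: float = 0.0   # sum of squared deviations of y
    c_xy: float = 0.0   # sum of co-deviations of x and y

    def update(self, x, y):
        """Fold one (x, y) row into the buffer (Welford-style update)."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def merge(self, other):
        """Combine a partial buffer computed on another partition."""
        if other.n == 0:
            return
        n = self.n + other.n
        dx = other.mean_x - self.mean_x
        dy = other.mean_y - self.mean_y
        w = self.n * other.n / n
        self.m2_x += other.m2_x + dx * dx * w
        self.m2_y += other.m2_y + dy * dy * w
        self.c_xy += other.c_xy + dx * dy * w
        self.mean_x += dx * other.n / n
        self.mean_y += dy * other.n / n
        self.n = n

    def covar_samp(self):
        """Sample covariance, or None with fewer than two rows."""
        return self.c_xy / (self.n - 1) if self.n > 1 else None

    def corr(self):
        """Pearson's r, or None when either variable has zero variance."""
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else None


def grouped_corr(partitions):
    """Per-key correlation over an iterable of partitions of (key, x, y) rows."""
    merged = defaultdict(CorrBuffer)
    for rows in partitions:
        partial = defaultdict(CorrBuffer)
        for key, x, y in rows:           # partial aggregation within a partition
            partial[key].update(x, y)
        for key, buf in partial.items():  # final aggregation across partitions
            merged[key].merge(buf)
    return {key: buf.corr() for key, buf in merged.items()}
```

Spearman's correlation does not fit this pattern as directly, because it needs ranks over the whole group before a correlation of this kind can be applied.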

Issue Links

relates to

- SPARK-9297 covar_pop and covar_samp aggregate functions (Resolved)
- SPARK-9298 corr aggregate functions (Resolved)

Sub-Tasks

1.	Bivariate Statistics: Spearman's Correlation in DataFrames	Resolved	Unassigned
2.	Bivariate Statistics: Pearson's Chi-Squared goodness of fit test	Resolved	Unassigned
3.	Bivariate Statistics: Chi-Squared independence test	Resolved	Unassigned

People

Assignee: Burak Yavuz

Reporter: Xiangrui Meng

Votes: 0

Watchers: 7

Dates

Created: 01/Sep/15 06:51

Updated: 21/May/19 04:35

Resolved: 21/May/19 04:35