Description
We should support computing correlations between columns in DataFrames with a simple API.
This could be a DataFrame feature:
myDataFrame.corr("col1", "col2")
// or
myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
Or it could be an MLlib feature:
Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
// or
Statistics.corr(myDataFrame, "col1", "col2")
(The first Statistics.corr option is more flexible, but it could cause trouble if a user passes in two DataFrame columns that cannot be zipped together.)
Note: R follows the latter setup. I'm OK with either.
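To make the "zippable" concern concrete, here is a minimal sketch in plain Scala of what a Pearson correlation over two columns has to do. It is not Spark code and not a proposed implementation: CorrSketch and pearson are hypothetical names, and the columns are modeled as in-memory Seq[Double] values rather than DataFrame columns. The point is only that the computation pairs values row by row, so both inputs must have the same number of rows in the same order.

```scala
// Hypothetical helper, not part of Spark: Pearson correlation of two
// equally long, row-aligned numeric columns held in memory.
object CorrSketch {
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    // The columns must pair up row by row; this is the "zippable" requirement.
    require(xs.length == ys.length, "columns must have the same number of rows")
    require(xs.nonEmpty, "columns must not be empty")

    val n = xs.length.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n

    // Accumulate the co-deviation and squared deviations over the zipped rows.
    val (cov, varX, varY) = xs.zip(ys).foldLeft((0.0, 0.0, 0.0)) {
      case ((c, vx, vy), (x, y)) =>
        val dx = x - meanX
        val dy = y - meanY
        (c + dx * dy, vx + dx * dx, vy + dy * dy)
    }

    // Result is NaN if either column is constant (zero variance).
    cov / math.sqrt(varX * varY)
  }

  def main(args: Array[String]): Unit = {
    val col1 = Seq(1.0, 2.0, 3.0, 4.0)
    val col2 = Seq(2.0, 4.0, 6.0, 8.0) // exactly 2 * col1
    println(pearson(col1, col2)) // prints 1.0
  }
}
```

An API that takes a single DataFrame plus two column names gets this row alignment for free, whereas an API that takes two independent columns would have to check or assume it.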
Issue Links
- duplicates SPARK-7239 Statistic functions for DataFrames (Resolved)