Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: ML, SQL
    • Labels:
    • Target Version/s:
    • Sprint: Spark 1.5 doc/QA sprint

      Description

      We should support computing correlations between columns in DataFrames with a simple API.

      This could be a DataFrame feature:

      myDataFrame.corr("col1", "col2")
      // or
      myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
      
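      To make the optional third argument concrete, here is a minimal sketch of what a "pearson" correlation between two numeric columns computes. It uses plain Scala collections in place of DataFrame columns; the object name `CorrSketch` and the helper signature are assumptions made for illustration, not a proposed implementation.

```scala
object CorrSketch {
  /** Pearson correlation of two equally long numeric columns.
    * `method` mirrors the optional third argument in the proposal above;
    * only "pearson" is handled in this sketch.
    */
  def corr(xs: Seq[Double], ys: Seq[Double], method: String = "pearson"): Double = {
    require(method == "pearson", s"unsupported method: $method")
    require(xs.nonEmpty && xs.length == ys.length,
      "columns must be non-empty and equally long")

    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n

    // Sum of products of deviations (unnormalized covariance).
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations (unnormalized variances).
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum

    // The result is NaN when either column is constant (zero variance).
    cov / math.sqrt(varX * varY)
  }
}
```

      Under these assumptions, `CorrSketch.corr(Seq(1.0, 2.0, 3.0, 4.0), Seq(2.0, 4.0, 6.0, 8.0))` is close to 1.0, since the second column is an exact linear function of the first.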

      Or it could be an MLlib feature:

      Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
      // or
      Statistics.corr(myDataFrame, "col1", "col2")
      

      (The first Statistics.corr option is more flexible, but it could cause trouble if a user passes in two DataFrame columns that cannot be zipped together.)
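
      The zipping concern can be illustrated without Spark: pairing two columns row by row only makes sense when both have the same number of rows in the same order. The sketch below uses plain Scala collections as stand-ins for DataFrame columns; `Column` and `zipColumns` are hypothetical names introduced here, not part of any Spark API.

```scala
object ZipSketch {
  // Hypothetical stand-in for a named DataFrame column.
  final case class Column(name: String, values: Vector[Double])

  // Pair two columns row by row. Columns with different row counts are
  // rejected explicitly, because Vector.zip would otherwise silently
  // truncate to the shorter column.
  def zipColumns(a: Column, b: Column): Either[String, Vector[(Double, Double)]] =
    if (a.values.length != b.values.length)
      Left(s"cannot zip ${a.name} (${a.values.length} rows) with ${b.name} (${b.values.length} rows)")
    else
      Right(a.values.zip(b.values))
}
```

      A signature that takes one DataFrame plus two column names sidesteps this check, since both columns then come from the same rows by construction.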

      Note: R follows the latter setup. I'm OK with either.

              People

              • Assignee: Burak Yavuz (brkyvz)
              • Reporter: Joseph K. Bradley (josephkb)