Details

    • Sub-task
    • Status: Closed
    • Minor
    • Resolution: Duplicate
    • None
    • 1.4.0
    • ML, SQL
    • Spark 1.5 doc/QA sprint

    Description

      We should support computing correlations between columns in DataFrames with a simple API.

      This could be a DataFrame feature:

      myDataFrame.corr("col1", "col2")
      // or
      myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
      

      Or it could be an MLlib feature:

      Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
      // or
      Statistics.corr(myDataFrame, "col1", "col2")
      

      (The first Statistics.corr option is more flexible, but it could cause trouble if a user passes in two DataFrame columns that cannot be zipped together.)

      Note: R follows the latter setup. I'm OK with either.
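
As a rough illustration of what a "pearson" correlation call would compute, and of why the two columns have to be zippable, here is a minimal sketch in plain Scala. It is not Spark code: `CorrSketch` and `pearson` are hypothetical names, and ordinary `Seq[Double]` values stand in for DataFrame columns.

```scala
// Hypothetical stand-in for the proposed corr(col1, col2, "pearson") call.
// Plain Scala sequences are used in place of DataFrame columns.
object CorrSketch {
  /** Pearson correlation of two numeric columns with the same number of rows. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    // Columns of different lengths cannot be paired row by row ("unzippable").
    require(xs.length == ys.length, "columns must have the same number of rows")
    require(xs.nonEmpty, "columns must not be empty")

    val n = xs.length.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n

    // Sum of products of deviations over the paired rows.
    val coMoment = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations for each column.
    val sqDevX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val sqDevY = ys.map(y => (y - meanY) * (y - meanY)).sum

    // Yields NaN when either column is constant (zero variance).
    coMoment / math.sqrt(sqDevX * sqDevY)
  }
}
```

The explicit length check mirrors the concern raised above: an API that accepts two independent columns has to decide what to do when they cannot be paired row by row, whereas an API that takes one DataFrame plus two column names avoids the problem by construction.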

            People

              brkyvz Burak Yavuz
              josephkb Joseph K. Bradley