Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels:
      None

      Description

      This ticket tracks porting the functionality of spark.mllib.Statistics.corr() over to spark.ml.

      Here is a design doc:
      https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#

          Activity

          timhunter Timothy Hunter added a comment -

          Unless someone has started to work on this task, I will take it.

          timhunter Timothy Hunter added a comment -

          Looking more closely at the code, it makes sense to start by replacing MultivariateStatisticalSummary, which is the basis of PearsonCorrelation and the final step of the Spearman correlation. Also, looking at these algorithms, it is not going to be practical to write them as UDAFs (unlike the original design), so the interface will need to take a Dataset[Vector] instead of a column.

          timhunter Timothy Hunter added a comment -

          After working on it, I realized that Column operations are not a good fit for the kind of operations requested here. Correlations require chaining a UDAF with a UDF and then with another UDAF, which cannot be expressed inside Catalyst as dataframe.select(corr("features")). I am going to go with a simpler interface instead (see design doc above).
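
The aggregate, then row-wise map, then aggregate chain described in this comment can be illustrated with a small plain-Python sketch. It is not Spark code and not the implementation from the pull request. The function names (column_means, center, pearson_matrix) are hypothetical, and it assumes a small in-memory list of equal-length rows with no constant column.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def column_means(rows: List[Vector]) -> List[float]:
    """Stage 1 (aggregate): reduce all rows to one mean per column."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(dim)]


def center(row: Vector, means: Vector) -> List[float]:
    """Stage 2 (row-wise map): depends on the result of stage 1."""
    return [x - m for x, m in zip(row, means)]


def pearson_matrix(rows: List[Vector]) -> List[List[float]]:
    """Chain the three stages and return the Pearson correlation matrix.

    Assumes every column has non-zero variance; a constant column
    would cause a division by zero in the final normalisation.
    """
    means = column_means(rows)
    centered = [center(row, means) for row in rows]
    dim = len(means)
    # Stage 3 (aggregate): sums of cross products over the centered rows.
    cross = [
        [sum(r[i] * r[j] for r in centered) for j in range(dim)]
        for i in range(dim)
    ]
    return [
        [cross[i][j] / math.sqrt(cross[i][i] * cross[j][j]) for j in range(dim)]
        for i in range(dim)
    ]
```

Stage 2 cannot start until stage 1 has seen every row, and stage 3 consumes the output of stage 2. That is why the computation does not reduce to a single aggregate expression over a column, and why a function that takes the whole dataset is the more natural shape.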

          apachespark Apache Spark added a comment -

          User 'thunterdb' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17108

          josephkb Joseph K. Bradley added a comment -

          Issue resolved by pull request 17108
          https://github.com/apache/spark/pull/17108


            People

            • Assignee:
              timhunter Timothy Hunter
              Reporter:
              timhunter Timothy Hunter
              Shepherd:
              Joseph K. Bradley
            • Votes:
              0
              Watchers:
              3
