Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels:
      None

      Description

      This ticket tracks porting the functionality of spark.mllib.Statistics.chiSqTest over to spark.ml.

      Here is a design doc:
      https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#

          Activity

          wm624 Miao Wang added a comment -

          I think there is a related PR that has been open for quite a while. Let me find it.

          wm624 Miao Wang added a comment -

          https://github.com/apache/spark/pull/13440
          Timothy Hunter, this one is related. The author is asking for a review. I @-mentioned you in the PR too.

          timhunter Timothy Hunter added a comment -

          After working on it, I realized that Column operations do not fit this kind of operation very well. Hypothesis testing requires chaining a UDAF with a UDF and then with another UDAF, which cannot be expressed inside Catalyst by doing dataframe.select(test("features")). I am going to go with a simpler interface (see the design doc above).
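
          For illustration, the aggregate / per-feature function / aggregate chain can be sketched in plain Python, without Spark. This is only a sketch under assumptions: the function name chi_square_per_feature and the (label, features) row layout are hypothetical and are not the spark.ml API, and it computes only Pearson's statistic and the degrees of freedom, not the p-value.

```python
from collections import Counter


def chi_square_per_feature(rows):
    """Hypothetical helper: rows is an iterable of (label, features) pairs
    with categorical feature values.

    Returns a list of (statistic, degrees_of_freedom), one per feature.
    """
    rows = list(rows)
    if not rows:
        return []
    num_features = len(rows[0][1])

    # Stage 1 (aggregation): count co-occurrences of (feature value, label)
    # separately for every feature index.
    counts = [Counter() for _ in range(num_features)]
    for label, features in rows:
        for i, value in enumerate(features):
            counts[i][(value, label)] += 1

    # Stage 2 (per-feature function): turn one contingency table into
    # Pearson's chi-squared statistic and its degrees of freedom.
    def pearson(table):
        total = sum(table.values())
        row_totals, col_totals = Counter(), Counter()
        for (value, label), n in table.items():
            row_totals[value] += n
            col_totals[label] += n
        statistic = 0.0
        for value, row_n in row_totals.items():
            for label, col_n in col_totals.items():
                expected = row_n * col_n / total
                observed = table.get((value, label), 0)
                statistic += (observed - expected) ** 2 / expected
        dof = (len(row_totals) - 1) * (len(col_totals) - 1)
        return statistic, dof

    # Stage 3 (aggregation): gather the per-feature results into one list.
    return [pearson(table) for table in counts]
```

          Because the middle stage needs the complete contingency table for each feature before it can run, the three stages cannot be collapsed into a single row-wise column expression.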

          josephkb Joseph K. Bradley added a comment -

          That PR for trees looks pretty different. This task is just for wrapping chiSqTest with a DataFrame API.
          I'm going to take this one.

          apachespark Apache Spark added a comment -

          User 'jkbradley' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17110

          josephkb Joseph K. Bradley added a comment -

          Issue resolved by pull request 17110
          https://github.com/apache/spark/pull/17110


            People

            • Assignee:
              josephkb Joseph K. Bradley
            • Reporter:
              timhunter Timothy Hunter
            • Votes:
              0
            • Watchers:
              4
