  1. Spark
  2. SPARK-4591

Algorithm/model parity for spark.ml (Scala)

    Details

    • Type: Umbrella
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

      Description

      This is an umbrella JIRA for porting spark.mllib implementations to the DataFrame-based API defined under spark.ml. The goal is to reach parity on critical features in time for the next release.

      Instructions for 3 subtask types

      Review tasks: detailed review of a subpackage to identify feature gaps between spark.mllib and spark.ml.

      • Should be listed as a subtask of this umbrella.
      • Review subtasks cover major algorithm groups. To pick up a review subtask, please:
        • Comment that you are working on it.
        • Compare the public APIs of spark.ml vs. spark.mllib.
        • Comment on all missing items within spark.ml: algorithms, models, methods, features, etc.
        • Check for existing JIRAs covering those items. If there is no existing JIRA, create one, and link it to your comment.
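
      The comparison step above can be sketched as a simple set difference. This is only an illustration, assuming the public member names of each package have already been collected by hand (for example from the API docs) into plain dictionaries. The function name find_api_gaps and every class and member name below are hypothetical placeholders, not the real Spark APIs.

```python
def find_api_gaps(mllib_api, ml_api):
    """Return, per class, the members listed for spark.mllib but not for spark.ml.

    Both arguments map a class name to an iterable of public member names.
    A class that is absent from ml_api is reported with all of its members.
    """
    gaps = {}
    for cls, members in mllib_api.items():
        missing = sorted(set(members) - set(ml_api.get(cls, ())))
        if missing:
            gaps[cls] = missing
    return gaps


# Hypothetical, hand-collected member lists (placeholders only).
mllib_api = {
    "ExampleModel": ["predict", "save", "exportModel"],
    "ExampleSummarizer": ["summary"],
}
ml_api = {
    "ExampleModel": ["predict", "save"],
}

gaps = find_api_gaps(mllib_api, ml_api)
for cls, missing in sorted(gaps.items()):
    # Each printed line is a candidate item for the review comment.
    print(f"{cls}: missing in spark.ml -> {', '.join(missing)}")
```

      Each reported entry would still need a manual check, since a feature may exist in spark.ml under a different name or as a Param rather than a method.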

      Critical tasks: higher-priority missing features that are required for this umbrella JIRA.

      • Should be linked to this umbrella using a "requires" link.

      Other tasks: lower-priority missing features that can be completed after the critical tasks.

      • Should be linked to this umbrella using a "contains" link.

      Excluded items

      This does not include:

      • Python: We can compare Scala vs. Python in spark.ml itself.
      • Moving linalg to spark.ml: SPARK-13944
      • Streaming ML: Requires stabilizing some internal APIs of structured streaming first

      TODO list

      Critical issues

      Lower priority issues

      • Missing methods within algorithms (see Issue Links below)
      • evaluation submodule
      • stat submodule (should probably be covered in DataFrames)
      • Developer-facing submodules:
        • optimization (including SPARK-17136)
        • random, rdd
        • util

      To be prioritized

          Issue Links

              People

              • Assignee: Unassigned
              • Reporter: Xiangrui Meng (mengxr)
              • Votes: 4
              • Watchers: 19
