Details
- Type: Umbrella
- Status: Resolved
- Priority: Critical
- Resolution: Done
Description
This is an umbrella JIRA for porting spark.mllib implementations to use the DataFrame-based API defined under spark.ml. We want to achieve critical feature parity for the next release.
Instructions for 3 subtask types
Review tasks: detailed review of a subpackage to identify feature gaps between spark.mllib and spark.ml.
- Should be listed as a subtask of this umbrella.
- Review subtasks cover major algorithm groups. To pick up a review subtask, please:
  - Comment that you are working on it.
  - Compare the public APIs of spark.ml vs. spark.mllib.
  - Comment on all missing items within spark.ml: algorithms, models, methods, features, etc.
  - Check for existing JIRAs covering those items. If there is no existing JIRA, create one, and link it to your comment.
Critical tasks: higher priority missing features which are required for this umbrella JIRA.
- Should be linked as "requires" links.
Other tasks: lower priority missing features which can be completed after the critical tasks.
- Should be linked as "contains" links.
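The API comparison step in the review tasks above amounts to taking a set difference over public member names. The snippet below is only a minimal sketch of that idea: `OldStyleModel`, `NewStyleModel`, `public_members`, and `missing_members` are hypothetical stand-ins invented for illustration and do not mirror any real spark.mllib or spark.ml class. An actual review of the Scala APIs would more likely rely on the generated API docs or JVM reflection.

```python
# Hypothetical stand-in for an older RDD-based model class (not a real Spark class).
class OldStyleModel:
    def predict(self, features): ...
    def save(self, path): ...
    def to_pmml(self): ...
    def _internal_helper(self): ...  # private: ignored by the comparison


# Hypothetical stand-in for a newer DataFrame-based model class (not a real Spark class).
class NewStyleModel:
    def transform(self, dataset): ...
    def save(self, path): ...


def public_members(cls):
    """Return the names of all members of cls that do not start with an underscore."""
    return {name for name in dir(cls) if not name.startswith("_")}


def missing_members(old_cls, new_cls):
    """Return the sorted public member names present on old_cls but absent from new_cls."""
    return sorted(public_members(old_cls) - public_members(new_cls))


gaps = missing_members(OldStyleModel, NewStyleModel)
print(gaps)  # ['predict', 'to_pmml']
```

Each name in the resulting list is a candidate item for a review comment. A name-only comparison cannot tell whether a feature was renamed or intentionally replaced, so every gap still needs a manual check before a JIRA is filed.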
Excluded items
This does not include:
- Python: We can compare Scala vs. Python in spark.ml itself.
- Moving linalg to spark.ml: SPARK-13944
- Streaming ML: Requires stabilizing some internal APIs of structured streaming first
TODO list
Critical issues
- SPARK-14501: Frequent Pattern Mining
- SPARK-14709: linear SVM
- SPARK-15784: Power Iteration Clustering (PIC)
Lower priority issues
- Missing methods within algorithms (see Issue Links below)
- evaluation submodule
- stat submodule (should probably be covered in DataFrames)
- Developer-facing submodules:
  - optimization (including SPARK-17136)
  - random, rdd
  - util
To be prioritized
- single-instance prediction: SPARK-10413
- pmml: SPARK-11171
Issue Links
- contains
  - SPARK-13025 Allow user to specify the initial model when training LogisticRegression (Resolved)
  - SPARK-14712 spark.ml LogisticRegressionModel.toString should summarize model (Resolved)
- depends upon
  - SPARK-3702 Standardize MLlib classes for learners, models (Closed)
- requires
  - SPARK-14709 spark.ml API for linear SVM (Resolved)
  - SPARK-15784 Add Power Iteration Clustering to spark.ml (Resolved)
  - SPARK-14501 spark.ml parity for fpm - frequent items (Resolved)