Details
Sub-task
Status: Resolved
Major
Resolution: Fixed
1.5.2
None
Description
The documentation is confusing about what exactly MLlib is: is it the code package, or is it the whole ML effort on Spark? How does it differ from spark.ml? Is MLlib going to be deprecated?
We should do the following:
- Refer to mllib, the code package, as spark.mllib across all the documentation. An alternative name is the "RDD API of MLlib".
- Refer to MLlib, the project that encompasses spark.ml and spark.mllib, as MLlib (this should be the default meaning).
- Replace references to the "Pipeline API" with spark.ml or the "DataFrame API of MLlib". I would de-emphasize that this API is for building pipelines. Some users are led to believe from the documentation that spark.ml can only be used for building pipelines and that using a single algorithm can only be done with spark.mllib.
Most relevant places:
- mllib-guide.md
- mllib-linear-methods.md
- mllib-dimensionality-reduction.md
- mllib-pmml-model-export.md
- mllib-statistics.md
In these files, most references to MLlib are meant to refer to spark.mllib instead.
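To find the references that need review, a small audit script may help. The following is only a sketch: it assumes the files live under a `docs/` directory, and the names `find_bare_mllib` and `audit` are hypothetical helpers, not part of Spark. It only lists candidate lines and does not rewrite anything, since only most (not all) occurrences of MLlib should become spark.mllib.

```python
import re
from pathlib import Path

# Files named in this issue. Their location under docs/ is an assumption.
FILES = [
    "mllib-guide.md",
    "mllib-linear-methods.md",
    "mllib-dimensionality-reduction.md",
    "mllib-pmml-model-export.md",
    "mllib-statistics.md",
]

# Case-sensitive match for the bare project name ("MLlib" or the "MLLib" typo).
# The lowercase package name "spark.mllib" and file names such as
# "mllib-guide.md" are deliberately not matched.
BARE_MLLIB = re.compile(r"\bML[lL]ib\b")


def find_bare_mllib(text):
    """Return (line_number, line) pairs for lines that mention MLlib by its bare name."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if BARE_MLLIB.search(line)
    ]


def audit(docs_dir="docs"):
    """Print every candidate line so a person can decide whether it means spark.mllib."""
    for name in FILES:
        path = Path(docs_dir) / name
        if not path.is_file():
            continue  # skip files that are missing in this checkout
        for number, line in find_bare_mllib(path.read_text(encoding="utf-8")):
            print(f"{name}:{number}: {line.strip()}")


if __name__ == "__main__":
    audit()
```

Each reported line still needs a manual decision: keep MLlib where the whole project is meant, and change it to spark.mllib where only the RDD-based package is meant.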