SPARK-12212 (sub-task of SPARK-8517: Improve the organization and style of MLlib's user guide)

Clarify the distinction between spark.mllib and spark.ml

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.2
    • Fix Version/s: 1.6.0, 2.0.0
    • Component/s: Documentation
    • Labels: None

    Description

      The MLlib documentation is unclear about what exactly MLlib is: is it the package, or is it the whole ML effort on Spark, and how does it differ from spark.ml? Is MLlib going to be deprecated?

      We should do the following:

      • Refer to mllib, the code package, as spark.mllib across all the documentation. An alternative name is "RDD API of MLlib".
      • Refer to MLlib, the project that encompasses spark.ml + spark.mllib, as MLlib (it should be the default).
      • Replace references to "Pipeline API" with spark.ml or the "DataFrame API of MLlib". I would de-emphasize that this API is for building pipelines. Some users are led to believe from the documentation that spark.ml can only be used for building pipelines and that using a single algorithm can only be done with spark.mllib.

      Most relevant places:

      • mllib-guide.md
      • mllib-linear-methods.md
      • mllib-dimensionality-reduction.md
      • mllib-pmml-model-export.md
      • mllib-statistics.md
        In these files, most references to MLlib are meant to refer to spark.mllib instead.
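
      One possible way to find the references that need review in these files is a small audit script. The sketch below is not part of the Spark build. The two patterns and the `audit` helper are assumptions made for illustration, and every hit still needs a human decision, since some uses of "MLlib" correctly refer to the whole project.

```python
import re

# Phrases this issue asks to replace or disambiguate.
# The exact patterns are this sketch's own choice.
PATTERNS = {
    "Pipeline API": re.compile(r"\bPipelines? API\b", re.IGNORECASE),
    "bare MLlib": re.compile(r"\bMLlib\b"),  # case-sensitive, so spark.mllib is not flagged
}


def audit(text):
    """Return (line_number, label, stripped_line) for each line that matches a pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits


# Hypothetical sample text standing in for one of the guide files.
sample = (
    "MLlib supports PMML export.\n"
    "The Pipeline API lives in spark.ml.\n"
    "Use spark.mllib for RDDs.\n"
)

for hit in audit(sample):
    print(hit)
```

      To run it on a real file, read the file's contents (for example `open("mllib-guide.md").read()`) and pass the text to `audit`.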

          People

            Assignee: Timothy Hunter (timhunter)
            Reporter: Timothy Hunter (timhunter)
            Shepherd: Joseph K. Bradley
            Votes: 0
            Watchers: 4
