Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: Documentation, MLlib

    Description

      When we write examples in the code, we put the data-generation code alongside the example itself. We typically have either:

      val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
      ...
      

      or something more esoteric, such as:

      val data = Array(
        (0, 0.1),
        (1, 0.8),
        (2, 0.2)
      )
      val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")
      
      val data = Array(
        Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
        Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
        Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
      )
      val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
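
      Note that nothing in the first (libsvm) snippet tells the reader what columns the loaded DataFrame has; one has to run something like the following to find out (the printed schema is shown approximately, as an assumption about the libsvm data source):

      // for the DataFrame loaded from data/mllib/sample_libsvm_data.txt
      data.printSchema()
      // root
      //  |-- label: double (nullable = false)
      //  |-- features: vector (nullable = false)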
      

      I suggest we follow the example of sklearn and standardize all generation of example data inside a few methods, for example in org.apache.spark.ml.examples.ExampleData. One reason is that just reading the code is sometimes not enough to figure out what the data is supposed to be: when using libsvm_data, for example, it is unclear what the dataframe columns are. That is something we should document in one place.
      It would also let us explain, in a single place, Scala idiosyncrasies such as the use of Tuple1.apply.
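
      As a rough sketch of what this could look like (the ExampleData name is taken from the suggestion above; the method names and doc comments are purely illustrative, not an existing Spark API):

      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.sql.{DataFrame, SQLContext}

      // Hypothetical helper that gathers all example-data generation in one
      // place. Object and method names are illustrative only.
      object ExampleData {

        /** Loads data/mllib/sample_libsvm_data.txt. The resulting DataFrame
         *  has a "label" column (Double) and a "features" column (Vector). */
        def libsvmData(sqlContext: SQLContext): DataFrame =
          sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

        /** A tiny two-column dataset: "label" (Int) and "feature" (Double). */
        def labeledDoubles(sqlContext: SQLContext): DataFrame = {
          val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
          sqlContext.createDataFrame(data).toDF("label", "feature")
        }

        /** A small one-column dataset of feature vectors. Each Vector is
         *  wrapped in Tuple1 because createDataFrame expects Products. */
        def featureVectors(sqlContext: SQLContext): DataFrame = {
          val data = Array(
            Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
            Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
            Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
          )
          sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
        }
      }

      An example would then start with something like ExampleData.featureVectors(sqlContext) instead of inlining the construction, and the schema documentation would live next to the data generation.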

    People

      • Assignee: Unassigned
      • Reporter: Timothy Hunter (timhunter)
      • Votes: 1
      • Watchers: 3
