Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 1.5.2
- Fix Version/s: None
Description
When we write code examples, we put the generation of the data along with the example itself. We typically have either:
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") ...
or some more esoteric stuff such as:
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
I suggest we follow the example of sklearn and standardize all the generation of example data inside a few methods, for example in org.apache.spark.ml.examples.ExampleData (a rough sketch follows below). One reason is that just reading the code is sometimes not enough to figure out what the data is supposed to be. For example, when using the libsvm data, it is unclear what the DataFrame columns are. This is something we should document somewhere.
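A sketch of what such a helper could look like (the object name, method names, and package are hypothetical here, just to illustrate the idea; the data-loading calls themselves are the ones from the examples above):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper that centralizes example-data generation, so each
// example does not repeat it and the schema is documented in one place.
object ExampleData {

  // The resulting DataFrame has two columns: "label" (Double) and "features" (Vector).
  def sampleLibsvmData(sqlContext: SQLContext): DataFrame =
    sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

  // A tiny hand-written dataset with a single "features" (Vector) column.
  def sampleVectorData(sqlContext: SQLContext): DataFrame = {
    val data = Array(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )
    // Wrap each Vector in a Tuple1 so createDataFrame sees one-column rows.
    sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
  }
}

This way, the columns of each example DataFrame are documented right next to the method that produces it.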
It would also help to explain in one place all the Scala idiosyncrasies, such as the use of Tuple1.apply (see the snippet below).
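For reference, the reason for Tuple1.apply is that createDataFrame expects a sequence of Products (tuples or case classes), so a bare Array[Vector] cannot be converted directly; wrapping each element in a 1-tuple turns it into a single-column row. A minimal sketch, assuming an existing sqlContext:

import org.apache.spark.mllib.linalg.Vectors

val vectors = Array(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

// Array[Vector] is not a Seq of Products, so Tuple1.apply wraps each Vector
// into a one-element tuple, i.e. a one-column row named "features".
val df = sqlContext.createDataFrame(vectors.map(Tuple1.apply)).toDF("features")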
Issue Links
- relates to: SPARK-10383 Sync example code between API doc and user guide (Resolved)